
Media (91)
-
Corona Radiata
26 September 2011
Updated: September 2011
Language: English
Type: Audio
-
Lights in the Sky
26 September 2011
Updated: September 2011
Language: English
Type: Audio
-
Head Down
26 September 2011
Updated: September 2011
Language: English
Type: Audio
-
Echoplex
26 September 2011
Updated: September 2011
Language: English
Type: Audio
-
Discipline
26 September 2011
Updated: September 2011
Language: English
Type: Audio
-
Letting You
26 September 2011
Updated: September 2011
Language: English
Type: Audio
Other articles (76)
-
Websites made with MediaSPIP
2 May 2011. This page lists some websites based on MediaSPIP.
-
Possibility of farm deployment
12 April 2011. MediaSPIP can be installed as a farm, with a single "core" hosted on a dedicated server and used by a multitude of different sites.
This makes it possible, for example: to share setup costs between several projects / individuals; to quickly deploy a multitude of unique sites; and to avoid having to put all creations into a digital catch-all, as is the case with the large general-public platforms scattered across the (...)
-
Adding user-specific information and other author-related behaviour changes
12 April 2011. The simplest way to add information to authors is to install the Inscription3 plugin. It also allows modifying certain user-related behaviours (refer to its documentation for more information).
It is also possible to add fields to authors by installing the "champs extras 2" and "Interface pour champs extras" plugins.
On other sites (4875)
-
Upcoming conferences / workshops
10 September 2010, by silvia. Lots is happening in open source multimedia land in the next few months. Check out these cool upcoming conferences / workshops / miniconfs… September 29th and 30th, New York: Open Subtitles Design Summit. October 1st and 2nd, New York: Open Video Conference. October 3rd and 4th, New York: Foundations (...)
-
Anatomy of an optimization: H.264 deblocking
As mentioned in the previous post, H.264 has an adaptive deblocking filter. But what exactly does that mean — and more importantly, what does it mean for performance? And how can we make it as fast as possible? In this post I'll try to answer these questions, particularly in relation to my recent deblocking optimizations in x264.
H.264's deblocking filter has two steps: strength calculation and the actual filter. The first step calculates the parameters for the second step. The filter runs on all the edges in each macroblock. That's 4 vertical edges of length 16 pixels and 4 horizontal edges of length 16 pixels. The vertical edges are filtered first, from left to right, then the horizontal edges, from top to bottom (order matters!). The leftmost edge is the one between the current macroblock and the left macroblock, while the topmost edge is the one between the current macroblock and the top macroblock.
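To make that traversal order concrete, here is a minimal C sketch of walking one 16x16 luma macroblock's edges in the order just described; filter_vertical_edge() and filter_horizontal_edge() are hypothetical stand-ins invented for this example, not x264's or any decoder's actual API.

#include <stdio.h>

/* Hypothetical stand-ins for the real edge filters (not any codec's API). */
static void filter_vertical_edge(int x)   { printf("vertical edge at x=%d\n", x); }
static void filter_horizontal_edge(int y) { printf("horizontal edge at y=%d\n", y); }

/* Walk one 16x16 luma macroblock: 4 vertical edges left to right, then
 * 4 horizontal edges top to bottom, matching the order described above. */
static void deblock_macroblock_edges(void)
{
    int e;
    for (e = 0; e < 16; e += 4)   /* e = 0 is the edge shared with the left MB */
        filter_vertical_edge(e);
    for (e = 0; e < 16; e += 4)   /* e = 0 is the edge shared with the top MB  */
        filter_horizontal_edge(e);
}

int main(void)
{
    deblock_macroblock_edges();
    return 0;
}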
Here’s the formula for the strength calculation in progressive mode. The highest strength that applies is always selected.
If we're on the edge between an intra macroblock and any other macroblock: Strength 4
If we're on an internal edge of an intra macroblock: Strength 3
If either side of a 4-pixel-long edge has residual data: Strength 2
If the motion vectors on opposite sides of a 4-pixel-long edge are at least a pixel apart (in either the x or y direction) or the reference frames aren't the same: Strength 1
Otherwise: Strength 0 (no deblocking)
These values are then thrown into a lookup table depending on the quantizer: higher quantizers have stronger deblocking. Then the actual filter is run with the appropriate parameters. Note that Strength 4 is actually a special deblocking mode that performs a much stronger filter and affects more pixels.
One can see somewhat intuitively why these strengths are chosen. The deblocker exists to get rid of sharp edges caused by the block-based nature of H.264, and so the strength depends on what exists that might cause such sharp edges. The strength calculation is a way to use existing data from the video stream to make better decisions during the deblocking process, improving compression and quality.
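As a rough illustration of that decision cascade, here is a hedged C sketch of computing the strength for a single 4-pixel edge segment in progressive mode; the edge_info structure and its fields are simplified placeholders invented for this example, not x264's actual data layout.

#include <stdlib.h>

/* Simplified, illustrative view of one 4-pixel edge segment; the fields
 * here are placeholders, not how a real encoder or decoder stores this. */
struct edge_info {
    int mb_edge;            /* does the segment lie on a macroblock boundary? */
    int intra[2];           /* is either side intra-coded?                    */
    int residual[2];        /* does either side have nonzero coefficients?    */
    int mv_x[2], mv_y[2];   /* motion vectors, in quarter-pel units           */
    int ref[2];             /* reference frame indices                        */
};

/* Highest applicable strength wins, as in the list above. */
static int boundary_strength(const struct edge_info *e)
{
    if (e->intra[0] || e->intra[1])
        return e->mb_edge ? 4 : 3;               /* intra: 4 on MB edge, 3 inside  */
    if (e->residual[0] || e->residual[1])
        return 2;                                /* residual on either side        */
    if (abs(e->mv_x[0] - e->mv_x[1]) >= 4 ||     /* >= 1 pixel = 4 quarter-pels    */
        abs(e->mv_y[0] - e->mv_y[1]) >= 4 ||
        e->ref[0] != e->ref[1])
        return 1;                                /* motion/reference discontinuity */
    return 0;                                    /* no deblocking                  */
}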
Both the strength calculation and the actual filter (not described here) are very complex if naively implemented. The latter can be SIMD'd without too much difficulty; no H.264 decoder can achieve reasonable performance without it. But what about optimizing the strength calculation? A quick analysis shows that this can be beneficial as well.
Since we have to check both horizontal and vertical edges, we have to check up to 32 pairs of coefficient counts (for residual), 16 pairs of reference frame indices, and 128 motion vector values (counting x and y as separate values). This is a lot of calculation; a naive implementation can take 500-1000 clock cycles on a modern CPU. Of course, there are a lot of shortcuts we can take. Here are some examples:
- If the macroblock uses the 8×8 transform, we only need to check 2 edges in each direction instead of 4, because we don’t deblock inside of the 8×8 blocks.
- If the macroblock is a P-skip, we only have to check the first edge in each direction, since there’s guaranteed to be no motion vector differences, reference frame differences, or residual inside of the macroblock.
- If the macroblock has no residual at all, we can skip that check.
- If we know the partition type of the macroblock, we can do motion vector checks only along the edges of the partitions.
- If the effective quantizer is so low that no deblocking would be performed no matter what, don’t bother calculating the strength.
But even all of this doesn't save us from ourselves. We still have to iterate over a ton of edges, checking each one. Stuff like the partition-checking logic greatly complicates the code and adds overhead even as it reduces the number of checks. And in many cases decoupling the checks to add such logic will make it slower: if the checks are coupled, we can avoid doing a motion vector check if there's residual, since Strength 2 overrides Strength 1.
But wait. What if we could do this in SIMD, just like the actual loopfilter itself? Sure, it seems more of a problem for C code than assembly, but there aren't any obvious things in the way. Many years ago, Loren Merritt (pengvado) wrote the first SIMD implementation that I know of (for ffmpeg's decoder); it is quite fast, so I decided to work on porting the idea to x264 to see if we could eke out a bit more speed here as well.
Before I go over what I had to do to make this change, let me first describe how deblocking is implemented in x264. Since the filter is a loopfilter, it acts “in loop” and must be done in both the encoder and decoder — hence why x264 has it too, not just decoders. At the end of encoding one row of macroblocks, x264 goes back and deblocks the row, then performs half-pixel interpolation for use in encoding the next frame.
We do it per-row for reasons of cache coherency : deblocking accesses a lot of pixels and a lot of code that wouldn’t otherwise be used, so it’s more efficient to do it in a single pass as opposed to deblocking each macroblock immediately after encoding. Then half-pixel interpolation can immediately re-use the resulting data.
Now to the change. First, I modified deblocking to implement a subset of the macroblock_cache_load function: spend an extra bit of effort loading the necessary data into a data structure which is much simpler to address — as an assembly implementation would need (x264_macroblock_cache_load_deblock). Then I massively cleaned up deblocking to move all of the core strength-calculation logic into a single, small function that could be converted to assembly (deblock_strength_c). Finally, I wrote the assembly functions and worked with Loren to optimize them. Here's the result.
And the timings for the resulting assembly function on my Core i7, in cycles:
deblock_strength_c: 309
deblock_strength_mmx: 79
deblock_strength_sse2: 37
deblock_strength_ssse3: 33
Now that is a seriously nice improvement. 33 cycles on average to perform that many comparisons; that's absurdly low, especially considering the SIMD takes no branchy shortcuts: it always checks every single edge! I walked over to my performance chart and happily crossed off a box.
But I had a hunch that I could do better. Remember, as mentioned earlier, we're reloading all that data back into our data structures in order to address it. This isn't that slow, but it takes enough time to significantly cut down on the gain of the assembly code. And worse, less than a row ago, all this data was in the correct place to be used (when we had just finished encoding the macroblock)! But if we did the deblocking right after encoding each macroblock, the cache issues would make it too slow to be worth it (yes, I tested this). So I went back to other things, a bit annoyed that I couldn't get the full benefit of the changes.
Then, yesterday, I was talking with Pascal, a former Xvid dev and current video hacker over at Google, about various possible x264 optimizations. He had seen my deblocking changes and we discussed that a bit as well. Then two lines hit me like a pile of bricks:
<_skal_> tried computing the strength at least?
<_skal_> while it's fresh
Why hadn't I thought of that? Do the strength calculation immediately after encoding each macroblock, save the result, and then go pick it up later for the main deblocking filter. Then we can use the data right there and then for strength calculation, but we don't have to do the whole deblock process until later.
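Sketched in C, the reshuffled flow might look something like the following; the function names, buffer layout, and row width are invented for illustration and are not x264's actual internals.

#include <string.h>

#define MB_PER_ROW 120                 /* example frame width in macroblocks        */

/* One strength byte per 4-pixel segment: 2 directions x 4 edges x 4 segments. */
static unsigned char bs_buffer[MB_PER_ROW][2][4][4];

/* Hypothetical stand-ins for the real strength and filter routines. */
static void compute_deblock_strengths(int mb_x, unsigned char bs[2][4][4])
{
    (void)mb_x;
    memset(bs, 1, 2 * 4 * 4);          /* placeholder: pretend every segment is bS=1 */
}

static void run_deblock_filter(int mb_x, unsigned char bs[2][4][4])
{
    (void)mb_x; (void)bs;              /* the actual pixel filtering would go here   */
}

static void encode_macroblock(int mb_x)
{
    /* ... mode decision, motion estimation, residual coding ... */

    /* New step: compute the strengths now, while the MVs, reference indices
     * and coefficient flags are still sitting in the encoder's local
     * structures (and caches), and stash them for later. */
    compute_deblock_strengths(mb_x, bs_buffer[mb_x]);
}

static void deblock_row(void)
{
    /* Per-row pass: the strengths are already available, so only the pixel
     * filtering (and the subsequent half-pel interpolation) remains. */
    int mb_x;
    for (mb_x = 0; mb_x < MB_PER_ROW; mb_x++)
        run_deblock_filter(mb_x, bs_buffer[mb_x]);
}

int main(void)
{
    int mb_x;
    for (mb_x = 0; mb_x < MB_PER_ROW; mb_x++)
        encode_macroblock(mb_x);       /* encode a row of macroblocks...  */
    deblock_row();                     /* ...then deblock it in one pass  */
    return 0;
}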
I went and implemented it and, after working my way through a horde of bugs, eventually got a working implementation. A big catch was that of slices: deblocking normally acts between slices even though normal encoding does not, so I had to perform extra munging to get that to work. By midday today I was able to cross yet another box off on the performance chart. And now it's committed.
Sometimes chatting for 10 minutes with another developer is enough to spot the idea that your brain somehow managed to miss for nearly a straight week.
NB: the performance chart is based on a specific test clip at a specific set of settings (super fast settings) relevant to the company I work at, so it isn't accurate or complete for, say, default settings.
Update: Here's a higher-resolution version of the current chart, as requested in the comments.
-
CD-R Read Speed Experiments
21 May 2011, by Multimedia Mike — Science Projects, Sega Dreamcast
I want to know how fast I can really read data from a CD-R. Pursuant to my previous musings on this subject, I was informed that it is inadequate to profile reading just any file from a CD-R, since data might be read faster or slower depending on whether the data is closer to the inside or the outside of the disc.
Conclusion / Executive Summary
It is 100% true that reading data from the outside of a CD-R is faster than reading data from the inside. Read on if you care to know the details of how I arrived at this conclusion, and to find out just how much speed advantage there is to reading from the outside rather than the inside.
Science Project Outline
- Create some sample CD-Rs with various properties
- Get a variety of optical drives
- Write a custom program that profiles the read speed
Creating The Test Media
It's my understanding that not all CD-Rs are created equal. Fortunately, I have 3 spindles of media handy: some plain-looking Memorex discs, some rather flamboyant Maxell discs, and those 80mm TDK discs.
My approach for burning is to create a single file to be burned into a standard ISO-9660 filesystem. The size of the file will be the advertised length of the CD-R minus 1 megabyte for overhead, so 699 MB for the 120mm discs and 209 MB for the 80mm disc. The file will contain a repeating sequence of 0..0xFF bytes.
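As a point of reference, a test-file generator along these lines could be as simple as the sketch below; the output filename and the 699 MB size are just the figures mentioned above (use 209 MB for the 80mm disc), and this is an illustration rather than the exact tool used.

#include <stdio.h>

/* Write a repeating 0x00..0xFF pattern until the target size is reached.
 * 699 MB suits a 120mm disc; swap in 209 MB for the 80mm disc. */
#define FILE_SIZE (699L * 1024 * 1024)

int main(void)
{
    unsigned char pattern[256];
    long written;
    int i;
    FILE *out;

    out = fopen("bigfile", "wb");      /* illustrative output name */
    if (!out)
        return 1;

    for (i = 0; i < 256; i++)
        pattern[i] = (unsigned char)i;

    for (written = 0; written < FILE_SIZE; written += sizeof(pattern))
        fwrite(pattern, 1, sizeof(pattern), out);

    fclose(out);
    return 0;
}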
Profiling
I don't want to leave this to the vagaries of any filesystem handling layer, so I will conduct this experiment at the sector level. Profiling program outline:
- Read the CD-ROM TOC and get the number of sectors that comprise the data track
- Profile reading the first 20 MB of sectors
- Profile reading 20 MB of sectors in the middle of the track
- Profile reading the last 20 MB of sectors
Unfortunately, I couldn't figure out raw sector reading on modern Linux incarnations (which is annoying, since I remember it being pretty straightforward years ago). So I left it to the filesystem after all. New algorithm:
- Open the single, large file on the CD-R and query the file length
- Profile reading the first 20 MB of data, 512 kbytes at a time
- Profile reading 20 MB of sectors in the middle of the track (starting from filesize / 2 - 10 MB), 512 kbytes at a time
- Profile reading the last 20 MB of sectors (starting from filesize - 20MB), 512 kbytes at a time
Empirical Data
I tested the program in Linux using an LG Slim external multi-drive (seen at the top of the pile in this post) and one of my Sega Dreamcast units. I gathered the median value of 3 runs for each area (inner, middle, and outer). I also flushed the buffer cache in between Linux runs (as root: 'sync; echo 3 > /proc/sys/vm/drop_caches').
LG Slim external multi-drive (reading from inner, middle, and outer areas in kbytes/sec):
- TDK-80mm: 721, 897, 1048
- Memorex-120mm: 1601, 2805, 3623
- Maxell-120mm: 1660, 2806, 3624
So, taking 1X as the standard 150 kbytes/sec CD data rate, the 120mm discs range from about 10.5X all the way up to a full 24X on this drive. For whatever reason, the 80mm disc fares a bit worse — even at the inner track — with a range of 4.8X - 7X.
Sega Dreamcast (reading from inner, middle, and outer areas in kbytes/sec):
- TDK-80mm: 502, 632, 749
- Memorex-120mm: 499, 889, 1143
- Maxell-120mm: 500, 890, 1156
It’s interesting that the 80mm disc performed comparably to the 120mm discs in the Dreamcast, in contrast to the LG Slim drive. Also, the results are consistent with my previous profiling experiments, which largely only touched the inner area. The read speeds range from 3.3X - 7.7X. The middle of a 120mm disc reads at about 6X.
Implications
A few thoughts regarding these results:
- Since the very definition of 1X is the minimum speed necessary to stream data from an audio CD, then presumably original 1X CD-ROM drives would have needed to be capable of reading 1X from the inner area. I wonder what the max read speed at the outer edges was? It's unlikely I would be able to get a 1X drive working easily in this day and age, since the earliest CD-ROM drives required custom controllers.
- I think 24X is the max rated read speed for CD-Rs, at least for this drive. This implies that the marketing literature only cites the best possible numbers. I guess this is no surprise, similar to how monitors and TVs have always been measured by their diagonal dimension.
- Given this data, how do you engineer an ISO-9660 filesystem image so that the timing-sensitive multimedia files live on the outermost track? In the Dreamcast case, if you can guarantee your FMV files will live somewhere between the middle and the end of the disc, you should be able to count on a bitrate of at least 900 kbytes/sec.
Source Code
Here is the program I wrote for profiling. Note that the filename is hardcoded (#define FILENAME). Compiling for Linux is a simple 'gcc -Wall profile-cdr.c -o profile-cdr'. Compiling for Dreamcast is performed in the standard KallistiOS manner (people skilled in the art already know what they need to know); the only variation is to compile with the '-D_arch_dreamcast' flag, which the default KOS environment adds anyway.
C:
#ifdef _arch_dreamcast
  #include <kos.h>

  /* map I/O functions to their KOS equivalents */
  #define open   fs_open
  #define lseek  fs_seek
  #define read   fs_read
  #define close  fs_close

  #define FILENAME "/cd/bigfile"
#else
  #include <stdio.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <sys/time.h>
  #include <fcntl.h>
  #include <unistd.h>

  #define FILENAME "/media/Full disc/bigfile"
#endif

/* Get a current absolute millisecond count; it doesn't have to be in
 * reference to anything special. */
unsigned int get_current_milliseconds()
{
#ifdef _arch_dreamcast
    return timer_ms_gettime64();
#else
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000 + tv.tv_usec / 1000;
#endif
}

#define READ_SIZE (20 * 1024 * 1024)
#define READ_BUFFER_SIZE (512 * 1024)

int main()
{
    int i, j;
    int fd;
    char read_buffer[READ_BUFFER_SIZE];
    off_t filesize;
    unsigned int start_time, end_time;

    fd = open(FILENAME, O_RDONLY);
    if (fd == -1)
    {
        return 1;
    }
    filesize = lseek(fd, 0, SEEK_END);

    for (i = 0; i < 3; i++)
    {
        if (i == 0)
        {
            /* inner area: start of the file */
            lseek(fd, 0, SEEK_SET);
        }
        else if (i == 1)
        {
            /* middle area: 20 MB centered on the file's midpoint */
            lseek(fd, (filesize / 2) - (READ_SIZE / 2), SEEK_SET);
        }
        else
        {
            /* outer area: last 20 MB of the file */
            lseek(fd, filesize - READ_SIZE, SEEK_SET);
        }
        /* read 20 MB; 40 chunks of 1/2 MB */
        start_time = get_current_milliseconds();
        for (j = 0; j < (READ_SIZE / READ_BUFFER_SIZE); j++)
            if (read(fd, read_buffer, READ_BUFFER_SIZE) != READ_BUFFER_SIZE)
            {
                break;
            }
        end_time = get_current_milliseconds();
        /* report elapsed time and approximate throughput (kbytes/sec) */
        printf("%d - %d = %d ms => %d kbytes/sec\n",
               end_time, start_time, end_time - start_time,
               READ_SIZE / (end_time - start_time));
    }

    close(fd);

    return 0;
}