[XviD-devel] XVID profiling

Christoph Lampert chl at math.uni-bonn.de
Sat Mar 1 19:17:13 CET 2003


On Sat, 1 Mar 2003, Michael Militzer wrote:
> 
> forget it. It's sad but I fear that it's not possible to make these two
> functions significantly faster. yv12_to_yv12 already makes heavy use
> of prefetch instructions

I can see exactly 2 prefetch instructions in yv12_to_yv12's COPY_PLANE.

        prefetchnta [esi + 64]  ; non temporal prefetch 
        prefetchnta [esi + 64+32]       ; non temporal prefetch 

1) They rely on a fixed cache line size of 32 bytes. On Athlon/P4 a
   line is 64 bytes, so the second prefetchnta usually lands in the
   line the first one already requested and won't do a thing.

2) They fetch memory which will be accessed in the next iteration,
   which is only about 20 instructions later (and those will most
   likely be executed partly in parallel).

A cache miss costs 150 cycles or more, so a prefetch issued only about
20 instructions ahead cannot hide much of it. I can't believe this is
optimal.
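
To make points 1) and 2) concrete, here is a minimal C sketch (not XviD
code) of a copy loop where the cache line size and the prefetch distance
are parameters instead of being hard-coded. The names prefetch_distance
and copy_with_prefetch are made up for illustration, and the
cycles-per-line figure in the note below is an assumption.
__builtin_prefetch is the GCC/Clang builtin; with locality 0 it is
typically emitted as prefetchnta on x86.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes to prefetch ahead so that a miss costing `miss_cycles` is hidden,
 * assuming copying one cache line takes `cycles_per_line` cycles.
 * The result is rounded up to a whole number of cache lines. */
static size_t prefetch_distance(size_t miss_cycles, size_t cycles_per_line,
                                size_t line_size)
{
    assert(cycles_per_line > 0);
    size_t lines = (miss_cycles + cycles_per_line - 1) / cycles_per_line;
    return lines * line_size;
}

/* Copy n bytes, issuing one non-temporal prefetch per cache line,
 * `dist` bytes ahead of the current read position. */
static void copy_with_prefetch(uint8_t *dst, const uint8_t *src, size_t n,
                               size_t line_size, size_t dist)
{
    assert(line_size > 0);
    size_t i = 0;
    for (; i + line_size <= n; i += line_size) {
        if (i + dist < n)  /* keep the prefetch address inside the buffer */
            __builtin_prefetch(src + i + dist, 0, 0);
        memcpy(dst + i, src + i, line_size);
    }
    memcpy(dst + i, src + i, n - i);  /* tail shorter than one line */
}
```

With a 150-cycle miss and, say, 20 cycles per 64-byte line (an assumed
figure, to be measured), prefetch_distance(150, 20, 64) gives 512 bytes,
i.e. 8 lines ahead instead of 1.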

Isn't yv12_to_yv12 a simple memcpy()? There are many different memcpy
versions around, and I found a benchmark:
http://sourcefrog.net/projects/memcpyspeed/speed.c

memcpy 1024kB -- aligned blocks
      libc memcpy                                        3.060000 s
      MMX memcpy using MOVQ                              2.720000 s
      arjanv's MOVQ (with prefetch)                      2.710000 s
      arjanv's MOVNTQ (with prefetch removed)            2.000000 s
      arjanv's MOVNTQ (with prefetch)                    1.940000 s
      arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.670000 s

The XviD version seems to be almost the same as "arjanv's MOVNTQ", and
indeed prefetching alone doesn't change much there (2.00 s vs. 1.94 s).
But the interleaved version takes about 14% less time (1.67 s vs.
1.94 s).
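
For reference, the relative gains in the table can be worked out with a
few lines of C. The timings are copied from the benchmark output above;
the helper names percent_faster and print_gains are mine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Percentage by which time t is shorter than the baseline time. */
static double percent_faster(double baseline, double t)
{
    assert(baseline > 0.0);
    return (baseline - t) / baseline * 100.0;
}

/* Print each variant's gain relative to libc memcpy (first row). */
static void print_gains(void)
{
    static const struct { const char *name; double secs; } runs[] = {
        { "libc memcpy",                                       3.06 },
        { "MMX memcpy using MOVQ",                             2.72 },
        { "arjanv's MOVQ (with prefetch)",                     2.71 },
        { "arjanv's MOVNTQ (with prefetch removed)",           2.00 },
        { "arjanv's MOVNTQ (with prefetch)",                   1.94 },
        { "arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA", 1.67 },
    };
    for (size_t i = 0; i < sizeof runs / sizeof runs[0]; i++)
        printf("%-52s %5.1f%%\n", runs[i].name,
               percent_faster(runs[0].secs, runs[i].secs));
}
```

Against libc memcpy the interleaved version comes out roughly 45%
faster.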

Of course prefetch won't help in all cases, but only 3 prefetch
instructions in the whole XviD core? I do think there is something to
optimize there.

> and I don't expect much from prefetch for transfer_8to16sub. I know it
> seems to be popular to plaster mmx code with a lot of prefetch
> instructions (you mentioned ffmpeg, but libmpeg2's interpolation code
> does the same iirc). However it is not faster, trust me. To me it
> seems that you really, really have to know exactly where to put these
> prefetch instructions to get at least a little gain. If not even
> Skal has come up with something yet, it seems not to be that easy.

Yes, it may need different versions/macros for different CPUs.
But it's a great playground, maybe somebody is interested...
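
As a starting point for anyone who wants to play with this, per-CPU
tuning could look roughly like the C sketch below. Everything in it is
hypothetical: the enum, the struct, and especially the distance value,
which is a placeholder to be replaced by measurements. Only the line
sizes (32 bytes as assumed by the current code, 64 on Athlon/P4) come
from the discussion above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU classes; real detection (e.g. via cpuid) is not shown. */
enum cpu_family { CPU_GENERIC, CPU_ATHLON, CPU_P4 };

struct prefetch_params {
    size_t line_size;  /* cache line size in bytes */
    size_t distance;   /* how far ahead to prefetch, in bytes */
};

/* Pick copy-loop parameters for a CPU class. */
static struct prefetch_params prefetch_params_for(enum cpu_family cpu)
{
    struct prefetch_params p;
    switch (cpu) {
    case CPU_ATHLON:
    case CPU_P4:
        p.line_size = 64;
        break;
    default:
        p.line_size = 32;  /* what the current code assumes */
        break;
    }
    p.distance = 8 * p.line_size;  /* placeholder, not a measured value */
    assert(p.distance % p.line_size == 0);
    return p;
}
```

A copy routine would then take these two values as arguments (or as
macro parameters in the asm) instead of the fixed +64 and +64+32.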

gruel 



