[XviD-devel] XVID profiling
Christoph Lampert
chl at math.uni-bonn.de
Sat Mar 1 19:17:13 CET 2003
On Sat, 1 Mar 2003, Michael Militzer wrote:
>
> forget it. It's sad but I fear that it's not possible to make these two
> functions significantly faster. yv12_to_yv12 makes already heavy use
> of prefetch instructions
I can see exactly 2 prefetch instructions in yv12_to_yv12's COPY_PLANE:
    prefetchnta [esi + 64]      ; non temporal prefetch
    prefetchnta [esi + 64+32]   ; non temporal prefetch
1) They rely on a fixed cache line size of 32 bytes (on Athlon/P4 it's
64 bytes, so the second prefetchnta won't do a thing).
2) They fetch memory which will be accessed in the next iteration,
which is only 20 instructions later (and those will most likely be
executed more or less in parallel).
A cache miss costs 150 cycles or more, so a prefetch issued that late
cannot hide it. I can't believe this is optimal.
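To make points 1) and 2) concrete, here is a minimal C sketch of a plane copy that issues one prefetch per cache line and requests data several lines ahead instead of one iteration ahead. This is not the XVID code: the name copy_plane_prefetch and the constants CACHELINE_SIZE and PREFETCH_LINES are made up for illustration, the values are untuned guesses, and __builtin_prefetch is a GCC/Clang extension (with locality 0 it is expected to map to a non-temporal prefetch on x86, but check the generated code).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tuning knobs, not taken from the XVID source. */
#define CACHELINE_SIZE 64   /* bytes per line; 64 on Athlon/P4 per the text */
#define PREFETCH_LINES 8    /* how many lines ahead to prefetch (a guess)   */

/* Copy a width x height byte plane row by row, prefetching the source. */
void copy_plane_prefetch(uint8_t *dst, size_t dst_stride,
                         const uint8_t *src, size_t src_stride,
                         size_t width, size_t height)
{
    assert(dst_stride >= width && src_stride >= width);

    for (size_t y = 0; y < height; y++) {
        const uint8_t *s = src + y * src_stride;
        uint8_t *d = dst + y * dst_stride;
        size_t x = 0;

        for (; x + CACHELINE_SIZE <= width; x += CACHELINE_SIZE) {
            /* One hint per cache line, PREFETCH_LINES lines ahead.
             * The address is built through uintptr_t so that no
             * out-of-bounds pointer arithmetic happens; the prefetch
             * itself is only a hint and does not fault. */
            uintptr_t ahead = (uintptr_t)(s + x)
                            + (uintptr_t)PREFETCH_LINES * CACHELINE_SIZE;
            __builtin_prefetch((const void *)ahead, 0 /* read */,
                               0 /* no temporal locality */);

            memcpy(d + x, s + x, CACHELINE_SIZE);
        }

        /* Copy whatever is left of the row (less than one cache line). */
        memcpy(d + x, s + x, width - x);
    }
}
```

With both values as macros, a per-CPU build only has to change two numbers. Whether 8 lines ahead is a good distance has to be measured on each CPU.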
Isn't yv12_to_yv12 a simple memcpy()? There are so many different
versions of that around. I found a benchmark:
http://sourcefrog.net/projects/memcpyspeed/speed.c
memcpy 1024kB -- aligned blocks
  libc memcpy                                        3.060000 s
  MMX memcpy using MOVQ                              2.720000 s
  arjanv's MOVQ (with prefetch)                      2.710000 s
  arjanv's MOVNTQ (with prefetch removed)            2.000000 s
  arjanv's MOVNTQ (with prefetch)                    1.940000 s
  arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA  1.670000 s
The XVID version seems to be almost the same as "arjanv's MOVNTQ", and
indeed prefetching doesn't seem to change much there (2.00 s vs. 1.94 s).
But the interleaved version takes about 14% less time (1.67 s vs. 1.94 s).
Of course prefetch won't help in all cases, but only 3 prefetch
instructions in the whole XVID core? I do think there is something to
optimize there.
> and I don't expect much from prefetch for transfer_8to16sub. I know it
> seems to be popular to plaster mmx code with a lot of prefetch
> instructions (you mentioned ffmpeg, but libmpeg2's interpolation code
> does the same iirc). However it is not faster, trust me. To me it
> seems that you really really have to know where exactly to put these
> prefetch instructions to have at least little gain, however if not
> even Skal has come up with something yet, it seems not that easy.
Yes, it may need different versions/macros for different CPUs.
But it's a great playground, maybe somebody is interested...
gruel