[XviD-devel] Profilign XVID, Part II

Mon Mar 3 21:10:18 CET 2003

Hi!

I have a few comments / experiences that might be helpful.

Really great to see some actual numbers in this thread - I'm amazed
each time by how much can actually be gained (we also recently got an
iSSE-optimized blitter for our project).

There are probably two reasons why you are not seeing big improvements
from the prefetch instructions. First of all, when you access data
linearly, or semi-linearly, the hardware prefetcher detects this and
issues prefetches automatically, so the manual prefetches are ignored
because the fetches are already running. AMD's way of avoiding
triggering the hardware prefetcher (when it is not wanted, as in massive
byte-copying) is to read data backwards - this may in some cases
actually be faster than reading forward.
Second of all, your proposed prefetch distance (64 bytes) is far too
small for the prefetch to be of any use. To get any gain from
prefetching, use a distance of at least 256 bytes.

If I recall correctly, the AMD examples actually force-prefetch up to
192k of data before beginning to actually move the data - not 8k, as
you mention. I could be wrong though, but it's about half the size of
the data cache, so it might be true.
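
The "prefetch a whole block first, then move it" idea can be sketched
in plain C as below. This is only an illustration under assumptions, not
AMD's reference code: the real example is hand-written assembly and
stores with MOVNTQ, while this sketch uses memcpy for the copy phase.
The names block_prefetch_copy, BLOCK_SIZE and CACHE_LINE are
hypothetical, and the block size is the thing to experiment with.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64          /* Athlon cache line size */
#define BLOCK_SIZE (8 * 1024)  /* assumed block size; tune by measuring */

/* Copy n bytes block by block: first pull a block into cache, then copy it. */
static void block_prefetch_copy(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;
    volatile uint8_t sink;  /* keeps the touching loads from being optimized away */

    assert((d != NULL && s != NULL) || n == 0);

    while (n > 0) {
        size_t block = n < BLOCK_SIZE ? n : BLOCK_SIZE;
        size_t off = block;

        /* Phase 1: read one byte per cache line, walking backwards so the
         * access pattern does not look like a forward linear stream. */
        while (off > 0) {
            sink = s[off - 1];
            off = off > CACHE_LINE ? off - CACHE_LINE : 0;
        }
        sink = s[0];  /* cover the first line if the block start is unaligned */

        /* Phase 2: the block should now be cached, so copy it. */
        memcpy(d, s, block);

        d += block;
        s += block;
        n -= block;
    }
    (void)sink;
}
```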

Anyway, you definitely seem to be on the right track here!

Klaus Post
AviSynth project.

-----Original Message-----
From: Christoph Lampert [mailto:chl at math.uni-bonn.de] 
Sent: 2 March 2003 14:22
To: xvid-devel at xvid.org
Subject: Re: [XviD-devel] Profilign XVID, Part II 

On Sat, 1 Mar 2003, Michael Militzer wrote:
> again: forget it. Memory transfers are unfortunately even more dominant for
> decoding than for encoding. And you have just profiled decoding with yv12
> output. Just try the same for rgb output: rgb conversion needs more time than
> the whole decoding process...

I never use RGB output, and thanks to graphics card overlays, I think
nobody else should, either :)

Still, I just wanted to post a quick result. I'm no ASM guru, and it took
me quite a while to debug the few instructions, but finally I ported AMD's
example for fast memcpy on Athlon, which prefetches complete 8K blocks
of memory instead of just a few bytes in every iteration.
Maybe we can't use it in XVID, since we have to skip padding areas, but
after all it was just a test for prefetching:

Athlon XP 1.4GHz (hardware prefetch, 64Byte cacheline, DDR-PC2100) 

glibc memcpy()                                        3.250s   146 MB/s
with MOVQ                                             3.080s   154 MB/s
AMD reference (fistful of cache for Athlon)           0.690s   689 MB/s
arjanv's MOVNTQ (without prefetch, for Athlon)        0.830s   573 MB/s
arjanv's MOVNTQ (with prefetch, for Athlon)           0.830s   573 MB/s
arjanv's interleaved MOVQ/MOVNTQ without prefetchNTA  1.110s   428 MB/s
arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA     0.840s   566 MB/s

Btw., according to AMD, 1976 MB/s is possible with an XP 1800+.

PentiumIII 700MHz (32Byte cacheline, SDR-PC100)

glibc memcpy()                                        5.960s    79 MB/s
with MOVQ                                             8.280s    57 MB/s
AMD reference (fistful of cache)                      1.290s   368 MB/s
arjanv's MOVNTQ (without prefetch)                    2.320s   205 MB/s
arjanv's MOVNTQ (with prefetch)                       2.250s   211 MB/s
arjanv's interleaved MOVQ/MOVNTQ without prefetchNTA  2.970s   160 MB/s
arjanv's interleaved MOVQ/MOVNTQ with prefetchNTA     2.290s   207 MB/s
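
For comparing rows, the speedup over glibc memcpy() follows directly
from the elapsed times. A small sketch in C, with the times copied from
the two tables; the helper names speedup and print_speedups are made up
for illustration:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: how many times faster t_s is than the baseline time. */
static double speedup(double baseline_s, double t_s)
{
    assert(baseline_s > 0.0 && t_s > 0.0);
    return baseline_s / t_s;
}

/* Print the AMD reference copy relative to glibc memcpy() on both machines. */
static void print_speedups(void)
{
    printf("Athlon XP: %.1fx\n", speedup(3.250, 0.690));
    printf("PIII:      %.1fx\n", speedup(5.960, 1.290));
}
```

That works out to roughly 4.7x on the Athlon XP and 4.6x on the
PentiumIII.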

gruel
