[XviD-devel] [PATCH] calc_cbp_sse2 optimization

Sun Apr 18 17:19:20 CEST 2004

Mat Hostetter (mat at curl.com) wrote:
> This change (against 1.0.0-rc4) speeds up calc_cbp_sse2 from 131
> cycles to 112 cycles on the Pentium 4 (for the in-cache case).
> 
> This change uses pcmpgtb/pmovmskb to extract a zero/nonzero mask,
> rather than the longer sequence used previously, and eliminates all
> conditional branches.  I also changed a movdqu to movdqa; if there's a
> reason the load might be unaligned (some bogus platform that can't
> align static arrays mod 16?) please let me know.

The blocks passed to calc_cbp should be aligned, as they're allocated on
the stack and aligned by DECLARE_ALIGNED_ARRAY (or matrix, I never
remember its name).
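
For readers without the patch at hand, here is a rough sketch in C
intrinsics of the compare + pmovmskb idea, next to a plain reference.
It is not the patch itself: the function names are made up, the bit
layout (bit 5-i set when block i has a nonzero AC coefficient, DC
ignored) is my assumption about what calc_cbp computes, and it uses
pcmpeqb against zero rather than the pcmpgtb sequence the patch
describes. The aligned loads (movdqa) require a 16-byte aligned array,
and the code is x86/SSE2 only.

```c
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics, x86 only */

/* Plain reference. ASSUMPTION: bit (5 - i) of the result is set when
 * block i contains a nonzero AC coefficient; the DC term is ignored. */
uint32_t cbp_plain_sketch(const int16_t coeff[6 * 64])
{
    uint32_t cbp = 0;
    for (int i = 0; i < 6; i++) {
        for (int j = 1; j < 64; j++) {      /* j = 0 is DC, skipped */
            if (coeff[i * 64 + j] != 0) {
                cbp |= 1u << (5 - i);
                break;
            }
        }
    }
    return cbp;
}

/* SSE2 sketch (hypothetical name). coeff MUST be 16-byte aligned,
 * because _mm_load_si128 is the aligned load (movdqa). */
uint32_t cbp_sse2_sketch(const int16_t coeff[6 * 64])
{
    const __m128i zero = _mm_setzero_si128();
    /* All ones except the lowest 16-bit lane, which clears the DC term. */
    const __m128i skip_dc = _mm_set_epi16(-1, -1, -1, -1, -1, -1, -1, 0);
    uint32_t cbp = 0;

    for (int i = 0; i < 6; i++) {
        const __m128i *p = (const __m128i *)(coeff + i * 64);

        /* OR the eight 128-bit rows of the block together. */
        __m128i acc = _mm_and_si128(_mm_load_si128(p), skip_dc);
        for (int k = 1; k < 8; k++)
            acc = _mm_or_si128(acc, _mm_load_si128(p + k));

        /* pcmpeqb + pmovmskb: one mask bit per byte that is zero,
         * so 0xFFFF means the whole block (minus DC) is zero. */
        int zero_mask = _mm_movemask_epi8(_mm_cmpeq_epi8(acc, zero));

        /* No data-dependent branch: shift in one bit per block. */
        cbp = (cbp << 1) | (uint32_t)(zero_mask != 0xFFFF);
    }
    return cbp;
}
```

Running both functions over the same set of coefficient arrays and
comparing the results is the same kind of check described in the
original post.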

> I am new to XviD so I don't know what your standard practice is for
> correctness validation and benchmarking.  So I wrote my own test and
> benchmark for this proc.  My test tries the 258048 coeff[] arrays I
> consider "interesting" with both calc_cbp_sse2 and calc_cbp_plain.
> I always get the same result so I am pretty confident this patch is
> correct.
> 
> I chose calc_cbp_sse2 at random just to get a feel for XviD's sources.
> If someone can point me at some more important code you'd like
> optimized, and tell me how you benchmark it, I may be able to
> contribute (I'm a professional compiler programmer).  I'm sure
> you've done lots of optimizations already but another pair of eyes
> never hurts.  :-)

Well, there is no "one practice" for all developers, as some are using
Windows, others GNU/Linux, etc.

With Windows:
 - with an AMD, you can use the AMD tools available in their developer
   section.
 - with an Intel... hmmm dunno, you can use xvid_bench, but its timing
   function isn't very precise because it only has ms precision (the
   time duration, not MS(tm)). Maybe you can try the higher-precision
   timers available in the Win32 API.
 - Purify(?).  I don't  know if  it can  simulate a  complete x86  cpu +
   caches. If that's the case then  you can get an idea if your function
   trashes the caches or not.
 - your own little C program.

With GNU/Linux:
 - OProfile (available on Linux 2.6 based systems); it works quite well
   on all ia32 implementations. It reads the CPU's performance counter
   registers, so it can retrieve a lot of information about the time
   spent in functions, cache misses, etc.
 - Valgrind to simulate CPU+cache.
 - gprof: not that good at telling you things about ASMed functions, as
   they don't have the GNU profile header and trailer code that saves
   timing information.
 - xvid_bench,  its  precision could  be  improved  just  by taking  the
   complete information returned by gettimeofday (1000x gain).
 - your own C programs
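
As an illustration of the gettimeofday point, a small timing helper
for "your own C program" could look like the sketch below. The names
now_usec, usec_per_call and dummy_work are made up for this example,
and it is not xvid_bench's actual timer code. It keeps both fields of
struct timeval instead of rounding to milliseconds.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Microseconds since the Epoch, using both tv_sec and tv_usec. */
int64_t now_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (int64_t)tv.tv_sec * 1000000 + (int64_t)tv.tv_usec;
}

/* Average wall-clock microseconds per call of fn over `iterations`
 * calls. Wall-clock time includes scheduler noise, so use many
 * iterations and repeat the measurement a few times. */
double usec_per_call(void (*fn)(void), int iterations)
{
    int64_t start = now_usec();
    for (int i = 0; i < iterations; i++)
        fn();
    int64_t stop = now_usec();
    return (double)(stop - start) / (double)iterations;
}

/* Stand-in workload; replace with a wrapper around the function
 * under test. The volatile sink keeps the loop from being removed. */
static volatile unsigned sink;

void dummy_work(void)
{
    for (unsigned i = 0; i < 1000; i++)
        sink += i;
}
```

Calling usec_per_call(dummy_work, 100000) gives an average per-call
time. For something as short as calc_cbp a cycle counter is still
finer grained, but this is already far better than ms resolution.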

Your patch will make it into HEAD. And btw, don't spend too much time on
calc_cbp, it's used once per block :-) Do a general profiling to find
hot functions (last time I did one, most time was spent on fdct when
using VHQ>=1, then some compensation functions).

-- 
Edouard Gomez