[XviD-devel] [PATCH] calc_cbp_sse2 optimization
Edouard Gomez
ed.gomez at free.fr
Sun Apr 18 17:19:20 CEST 2004
Mat Hostetter (mat at curl.com) wrote:
> This change (against 1.0.0-rc4) speeds up calc_cbp_sse2 from 131
> cycles to 112 cycles on the Pentium 4 (for the in-cache case).
>
> This change uses pcmpgtb/pmovmskb to extract a zero/nonzero mask,
> rather than the longer sequence used previously, and eliminates all
> conditional branches. I also changed a movdqu to movdqa; if there's a
> reason the load might be unaligned (some bogus platform that can't
> align static arrays mod 16?) please let me know.
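For anyone reading along, here is a rough sketch in C intrinsics of the general idea of getting a zero/nonzero flag per block from a compare plus pmovmskb. It is not the patch itself: it uses a byte compare for equality (pcmpeqb, `_mm_cmpeq_epi8`) instead of pcmpgtb, it uses unaligned loads so it runs on any buffer, and the names `cbp_ref` and `cbp_mask` are made up for the example. It also assumes the convention that the DC coefficient of each block is ignored and that block 0 ends up in bit 5.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference (assumed convention): block i sets bit (5 - i) when it
 * has any nonzero coefficient other than the DC term at index 0. */
static uint32_t cbp_ref(const int16_t coeff[6 * 64])
{
    uint32_t cbp = 0;
    for (int i = 0; i < 6; i++) {
        uint32_t nz = 0;
        for (int j = 1; j < 64; j++)
            nz |= (uint32_t)(coeff[i * 64 + j] != 0);
        cbp = (cbp << 1) | nz;
    }
    return cbp;
}

/* Hypothetical SSE2 variant: OR the eight 128-bit rows of a block together,
 * compare the result with zero and collect the byte mask with pmovmskb. */
static uint32_t cbp_mask(const int16_t coeff[6 * 64])
{
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    /* All ones except the lowest 16-bit lane, so the DC term is dropped. */
    const __m128i no_dc = _mm_set_epi16(-1, -1, -1, -1, -1, -1, -1, 0);
    uint32_t cbp = 0;
    for (int i = 0; i < 6; i++) {
        const __m128i *row = (const __m128i *)(coeff + i * 64);
        __m128i acc = _mm_and_si128(_mm_loadu_si128(row), no_dc);
        for (int r = 1; r < 8; r++)
            acc = _mm_or_si128(acc, _mm_loadu_si128(row + r));
        /* Zero bytes become 0xFF; the mask is 0xFFFF only for an all-zero block. */
        int zmask = _mm_movemask_epi8(_mm_cmpeq_epi8(acc, zero));
        cbp = (cbp << 1) | (uint32_t)(zmask != 0xFFFF);
    }
    return cbp;
#else
    return cbp_ref(coeff); /* fallback when SSE2 is not available */
#endif
}
```

Comparing `cbp_mask` against `cbp_ref` over a set of test arrays is the same kind of check as the one described further down in the quoted mail.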
The blocks passed to calc_cbp should be aligned, as they're allocated on
the stack and aligned by DECLARE_ALIGNED_ARRAY (or MATRIX, I never
remember its name).
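To illustrate why an aligned load is safe in that case, here is a minimal sketch using C11 `alignas` as a stand-in for the XviD macro (the macro's real definition is not reproduced here, and `is_aligned16` and `coeff_block_is_aligned` are hypothetical names):

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 when p sits on a 16-byte boundary. */
static int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Hypothetical caller: a stack array declared with 16-byte alignment,
 * which is the property an aligned load such as movdqa relies on. */
static int coeff_block_is_aligned(void)
{
    alignas(16) int16_t coeff[6 * 64] = {0};
    return is_aligned16(coeff);
}
```

A debug build could assert `is_aligned16(ptr)` before calling the SSE2 routine to catch a caller that passes an unaligned buffer.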
> I am new to XviD so I don't know what your standard practice is for
> correctness validation and benchmarking. So I wrote my own test and
> benchmark for this proc. My test tries the 258048 coeff[] arrays I
> consider "interesting" with both calc_cbp_sse2 and calc_cbp_plain.
> I always get the same result so I am pretty confident this patch is
> correct.
>
> I chose calc_cbp_sse2 at random just to get a feel for XviD's sources.
> If someone can point me at some more important code you'd like
> optimized, and tell me how you benchmark it, I may be able to
> contribute (I'm a professional compiler programmer). I'm sure
> you've done lots of optimizations already but another pair of eyes
> never hurts. :-)
Well, there is no "one practice" for all developers, as some are using
Windows, others GNU/Linux, etc.
With Windows:
- with an AMD, you can use the AMD tools available in their developer
section.
- with an Intel... hmmm dunno, you can use xvid_bench, but its timing
function isn't very precise because it's based on ms (the time unit,
not MS(tm)) precision. Maybe you can try the higher-precision timers
available in the Win32 API.
- Purify(?). I don't know if it can simulate a complete x86 cpu +
caches. If that's the case then you can get an idea if your function
trashes the caches or not.
- your own little C program.
With GNU/Linux:
- Oprofile (available on Linux 2.6 based systems), it works quite well
on all ia32 implementations. It reads the CPU's state registers, so it
can retrieve a lot of information about the time spent in functions,
cache misses, etc.
- Valgrind to simulate CPU+cache.
- gprof: not that good at telling you things about ASMed functions, as
they don't have the GNU profile header and trailer code to save timing
information.
- xvid_bench, its precision could be improved just by taking the
complete information returned by gettimeofday (1000x gain).
- your own C programs
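As a sketch of the gettimeofday point above: keeping the tv_usec field instead of rounding to milliseconds gives microsecond resolution. The helper names `timeval_to_usec` and `now_usec` are made up for this example and are not taken from xvid_bench.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* Hypothetical helper: fold both fields of a timeval into microseconds. */
static int64_t timeval_to_usec(struct timeval tv)
{
    return (int64_t)tv.tv_sec * 1000000 + (int64_t)tv.tv_usec;
}

/* Hypothetical helper: current wall-clock time in microseconds. */
static int64_t now_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return timeval_to_usec(tv);
}
```

Calling `now_usec()` before and after a loop that runs the function many times, then dividing the difference by the iteration count, gives a rough per-call time. It is still wall-clock time, so other processes and cache state will add noise.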
Your patch will make it into HEAD. And btw, don't spend too much time on
calc_cbp, it's used once per block :-) Do a general profiling to find
hot functions (last time I did one, most time was spent on fdct when
using VHQ>=1, then some compensation functions).
--
Edouard Gomez