[XviD-devel] XVID profiling

Mon Mar 3 21:41:20 CET 2003

I've been doing some work with your color conversion routines, and I
noticed some minor details.

YUY2 -> YV12
------------
You might be able to squeeze out a tiny bit of performance by doing an
integer SSE version.
The chroma could be averaged using pavgb instead of unpacking to words,
adding, and dividing by bitshifting. You should also add rounding before
the bitshift, since you are currently doing:

chroma = (upper line + lower line) >> 1 

and not:

chroma = (upper line + lower line + 1 ) >> 1

pavgb does have "proper" rounding.
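
To make the rounding difference concrete, here is a minimal scalar sketch in plain C. The function names avg_trunc and avg_round are made up for illustration and are not XviD routines; avg_round mirrors what pavgb computes for each unsigned byte.

```c
#include <stdint.h>

/* Truncating average: (a + b) >> 1, the behaviour described above.
 * Hypothetical helper name, not part of XviD. */
static uint8_t avg_trunc(uint8_t upper, uint8_t lower)
{
    return (uint8_t)((upper + lower) >> 1);
}

/* Rounded average: (a + b + 1) >> 1, the per-byte result of pavgb.
 * The operands are promoted to int, so the sum cannot overflow. */
static uint8_t avg_round(uint8_t upper, uint8_t lower)
{
    return (uint8_t)((upper + lower + 1) >> 1);
}
```

For chroma values 100 and 101 the truncating version returns 100 and the rounded one returns 101. Truncation always rounds down, so the error is biased in one direction rather than spread around zero.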

Furthermore, the routine might gain slightly from writing its output
with movntq (non-temporal stores).

YV12 -> YUY2
------------
The output in YUY2 mode would look nicer if you interpolated chroma with
the line above or below instead of simply copying it. This will of course
make the routine more complex, and probably slightly slower. It gives
more correct chroma placement, but will slightly blur chroma after
several conversions.
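
A rough scalar sketch of what such an interpolation could look like, in plain C. Everything here is an assumption for illustration: the helper name interp_chroma_line, the progressive (non-interlaced) frame layout, and the 3:1 weighting, which corresponds to chroma samples sited halfway between two luma lines. It is not the XviD routine.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute the chroma for output line y from one
 * YV12 chroma plane (chroma_h rows of chroma_w samples, `stride` bytes
 * per row). The nearest chroma row gets weight 3, the next row on the
 * other side gets weight 1. Rows are clamped at the top and bottom.
 * Assumes progressive video; interlaced material needs per-field rows.
 */
static void interp_chroma_line(const uint8_t *plane, size_t stride,
                               size_t chroma_w, size_t chroma_h,
                               size_t y, uint8_t *out)
{
    size_t nearest = y >> 1;
    size_t other;

    if (y & 1)  /* odd output line: blend with the chroma row below */
        other = (nearest + 1 < chroma_h) ? nearest + 1 : nearest;
    else        /* even output line: blend with the chroma row above */
        other = (nearest > 0) ? nearest - 1 : nearest;

    const uint8_t *n = plane + nearest * stride;
    const uint8_t *o = plane + other * stride;

    for (size_t x = 0; x < chroma_w; x++)
        out[x] = (uint8_t)((3 * n[x] + o[x] + 2) >> 2);  /* rounded */
}
```

A plain copy would use n[x] unchanged for both output lines. The blend costs one extra row read per output line, and converting back and forth repeatedly will soften chroma, as noted above.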

Many people still deliver YUY2 when encoding with XviD, and some also
use YUY2 for output (because of overlay), so these things might affect
a lot of people.

Regards, Klaus Post
AviSynth project

-----Original Message-----
From: Christoph Lampert [mailto:chl at math.uni-bonn.de] 
Sent: 1 March 2003 15:42
To: xvid-devel at xvid.org
Subject: [XviD-devel] XVID profiling

Hi,
I got some profiling results about XVID for those of you who are
interested in MMXing a little more. So far, I just checked encoding.
From the logfile you can see:

With MMX, it's always the SAD that is slowest, either sad16v_mmx because
of INTER4V-mode, or sad16bi_mmx because of b-frames interpolate/direct
mode. Only the CheckCandidate routines in motion estimation seem like
candidates for some speedup. They've indeed grown rather large.

GOAL 0)   Clean up "CheckCandidate"-mechanism  (but that may influence 
          ME structure, so it's not #1 on the list). 

With XMM, all SADs are faster than with mmx. CheckCandidate gets
relatively more influence, in particular in Bframe mode. Without
B-frames and Q-pel, mem transfer and interpolation become more
important.

GOAL 1)  Speed up Mem-Transfers, in particular transfer_8to16sub (_mmx)
         and yv12_to_yv12 (_xmm). Maybe those are candidates for
         prefetch.
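
As a rough illustration of the prefetch idea, here is a plain-C sketch of an 8x8 "subtract and transfer" block. The function name transfer_8to16sub_sketch and its exact semantics are assumptions: it stores cur - ref as 16-bit values and copies ref into cur. __builtin_prefetch is a GCC/Clang builtin and only a hint; whether it helps at all would have to be measured.

```c
#include <stdint.h>

/*
 * Hypothetical sketch, not the XviD routine: for an 8x8 block,
 * dct = cur - ref (as int16_t), then cur is overwritten with ref.
 * The next row of each buffer is prefetched while the current row
 * is being processed.
 */
static void transfer_8to16sub_sketch(int16_t *dct, uint8_t *cur,
                                     const uint8_t *ref, uint32_t stride)
{
    for (uint32_t j = 0; j < 8; j++) {
#if defined(__GNUC__)
        if (j + 1 < 8) {
            /* args: address, rw (0 = read, 1 = write), locality (0..3) */
            __builtin_prefetch(cur + (j + 1) * stride, 1, 0);
            __builtin_prefetch(ref + (j + 1) * stride, 0, 0);
        }
#endif
        for (uint32_t i = 0; i < 8; i++) {
            const uint8_t c = cur[j * stride + i];
            const uint8_t r = ref[j * stride + i];
            cur[j * stride + i] = r;
            dct[j * 8 + i] = (int16_t)((int16_t)c - (int16_t)r);
        }
    }
}
```

In a real SIMD version the prefetch distance (one row here) is a tuning parameter, and hardware prefetchers may already cover a simple strided access pattern like this one.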

For QPel, it becomes obvious that not everything is ASMed yet:
interpolate16x16_lowpass_h_c and interpolate16x16_lowpass_v_c
are obvious candidates for ASMing:

GOAL 2)  Create SIMDed versions of interpolate16x16_lowpass_h_c 
         and interpolate16x16_lowpass_v_c

Also, the interpolate-average functions take quite a lot of time and
seem to be mmx, not xmm.

GOAL 3) Create XMM versions of interpolate8x8_avg4_mmx
                               interpolate8x8_6tap_lowpass_v_mmx
                               interpolate8x8_avg2_mmx