[XviD-devel] XVID profiling

Sat Mar 1 18:06:47 CET 2003

Hi,

Quoting Christoph Lampert <chl at math.uni-bonn.de>:

> 
> Hi,
> I got some profiling results about XVID for those of you who are interested in
> MMXing a little more. So far, I just checked encoding. From the logfile
> you can see: 
> 
> With MMX, it's always the SAD that is slowest, either sad16v_mmx because 
> of INTER4V-mode, or sad16bi_mmx because of b-frames interpolate/direct
> mode. Only CheckCandidates-Routines in motion-estimation seem like
> candidates for some speedup. They've indeed grown rather large. 
> 
> GOAL 0)   Clean up "CheckCandidate"-mechanism  (but that may influence 
>           ME structure, so it's not #1 on the list). 
> 
> 
> With XMM, all SADs are faster than with mmx. CheckCandidate gets
> relatively more influence, in particular in Bframe mode. Without B-frames
> and Q-pel, mem transfer and interpolation become more important. 
> 
> 
> 
> GOAL 1)  Speed up Mem-Transfers, in particular transfer_8to16sub (_mmx)  
>          and yv12_to_yv12 (_xmm). Maybe those are candidates for prefetch. 
> 

Forget it. It's sad, but I fear that it's not possible to make these two 
functions significantly faster. yv12_to_yv12 already makes heavy use of 
prefetch instructions, and I don't expect much from prefetch for 
transfer_8to16sub. I know it seems to be popular to plaster mmx code with a lot 
of prefetch instructions (you mentioned ffmpeg, but libmpeg2's interpolation 
code does the same, iirc). However, it is not faster, trust me. To me it seems 
that you really have to know exactly where to put these prefetch instructions 
to get even a little gain, and if not even Skal has come up with something 
yet, it does not seem to be that easy.
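
To make it concrete, here is a minimal C sketch of what "adding prefetch" to 
such a transfer loop amounts to. It is not XviD's code: sub_8x8_prefetch and 
PREFETCH_READ are made-up names, and __builtin_prefetch is the GCC/Clang 
builtin standing in for the prefetch instructions used in the asm. The loop 
only computes cur - ref for one 8x8 block and hints that the next row of each 
source will be read soon.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable wrapper: GCC and Clang provide __builtin_prefetch; other
 * compilers get a no-op, since a prefetch is only ever a hint. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

/* Hypothetical helper (not XviD's code): store cur - ref for one 8x8 block
 * as 16-bit values in dct, prefetching the next row of both sources. */
static void sub_8x8_prefetch(int16_t *dct, const uint8_t *cur,
                             const uint8_t *ref, size_t stride)
{
    assert(dct && cur && ref && stride >= 8);
    for (size_t j = 0; j < 8; j++) {
        const uint8_t *c = cur + j * stride;
        const uint8_t *r = ref + j * stride;
        if (j + 1 < 8) {            /* stay inside the block */
            PREFETCH_READ(c + stride);
            PREFETCH_READ(r + stride);
        }
        for (size_t i = 0; i < 8; i++)
            dct[j * 8 + i] = (int16_t)(c[i] - r[i]);
    }
}
```

The result is the same with or without the hints, so whether they help can 
only be answered by timing both variants on real frames.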

> For QPel, it becomes obvious that not everything is ASMed yet: 
> interpolate16x16_lowpass_h_c and interpolate16x16_lowpass_v_c
> are obvious candidates for ASMing: 
> 
> 
> GOAL 2)  Create SIMDed versions of interpolate16x16_lowpass_h_c 
>          and interpolate16x16_lowpass_v_c
> 

Hm, qpel was fully asmed when I committed the stuff. So the current situation 
is maybe just a small mistake, or has been caused by a quick bugfix etc.? I 
don't have much free time right now, but I'll have a quick look tonight at why 
the asm optimizations are not used...

> also, interpolate-average functions take quite a lot of time and seem to
> be mmx, not xmm. 
> GOAL 3) Create XMM versions of interpolate8x8_avg4_mmx
>                                interpolate8x8_6tap_lowpass_v_mmx
>                                interpolate8x8_avg2_mmx

Well, you cannot write xmm optimizations for everything. Please remember that 
xmm just means extended mmx, so it's basically mmx plus some extra 
instructions. These are sometimes useful, but sometimes not. Unfortunately, 
there are no useful xmm instructions for interpolate8x8_6tap_lowpass_v, for 
example.

However, I have xmm versions of interpolate8x8_avg[2,4] somewhere here on my 
disk. Iirc, I had rounding problems with avg4_xmm, but avg2_xmm should work 
without modifications...
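
As an illustration of how a 4-way average can go wrong, here is a small C 
sketch. It assumes the xmm versions would be built on pavgb, which computes 
(a + b + 1) >> 1 per byte; that this is the actual cause of the avg4_xmm 
problem is only a guess. All function names below are made up, and the 
rounding-control parameter of the real routines is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one pavgb lane: average rounded up. */
static uint8_t avg2(uint8_t a, uint8_t b)
{
    return (uint8_t)((a + b + 1) >> 1);
}

/* Reference 4-way average, rounded once at the end. */
static uint8_t avg4_exact(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint8_t)((a + b + c + d + 2) >> 2);
}

/* Naive 4-way average built from three pavgb-style steps: rounds twice. */
static uint8_t avg4_nested(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return avg2(avg2(a, b), avg2(c, d));
}

/* Count inputs in [0, limit)^4 where the nested form differs from the
 * reference. A small limit keeps the exhaustive loop cheap. */
static unsigned count_mismatches(unsigned limit)
{
    unsigned n = 0;
    assert(limit <= 256);
    for (unsigned a = 0; a < limit; a++)
        for (unsigned b = 0; b < limit; b++)
            for (unsigned c = 0; c < limit; c++)
                for (unsigned d = 0; d < limit; d++)
                    if (avg4_nested((uint8_t)a, (uint8_t)b, (uint8_t)c, (uint8_t)d) !=
                        avg4_exact((uint8_t)a, (uint8_t)b, (uint8_t)c, (uint8_t)d))
                        n++;
    return n;
}
```

For the inputs (1, 0, 0, 0) the reference gives 0 while the nested form gives 
1, so a plain chain of rounded averages is not bit-exact for avg4, whereas a 
single rounded average matches avg2 directly.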

bye,
Michael