[XviD-devel] Question about bvop decoding

Tue Jul 20 17:22:46 CEST 2004

On Tue, 20 Jul 2004, Edouard Gomez wrote:
> Hmm i don't have such a profile at all...
> See:
> http://ed.gomez.free.fr/vrac/profile-gprof.txt
> http://ed.gomez.free.fr/vrac/profile-oprofile.txt

valgrind/cachegrind seems to produce results similar to yours:
decode_bf_interpolate_mbinter has 14% of instructions and
5.3% of total CPU cycles. With both, it's top of the list, followed by
decoder_bframes (7.3% of instructions) and decode_mbinter (6.8%).

The largest portion is due to the complicated calculation of

const uint8_t *const src = refn + (int)((y+(dy>>1))*stride+x+(dx>>1));

and the less complicated 

uint8_t *const dst = cur + (int)(y*stride+x);

switch (((dx&1)<<1)+(dy&1)) {

Those are in fact not in decoder.c, but inlined from
interpolate8x8_switch(), which is called 6 times per MB.
So I guess the high number of cycles is due to counting the inlined code.
Have you maybe checked how big interpolate_mbinter is in the ASM step?
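
To make the cost easier to see, here is the quoted arithmetic pulled
out into small standalone helpers. This is only a sketch: the helper
names are made up for illustration and are not XviD functions, and the
comments on the selector values are just my reading of the expression
(low bit of dx and dy as the half-pel part).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: offset of the reference block, using the
 * integer part (>>1) of a half-pel vector (dx, dy).
 * Note: >> on negative ints is implementation-defined in C; most
 * compilers use an arithmetic shift. */
static inline int src_offset(int x, int y, int dx, int dy, int stride)
{
    assert(stride > 0);
    return (y + (dy >> 1)) * stride + x + (dx >> 1);
}

/* Hypothetical helper: offset of the destination block. */
static inline int dst_offset(int x, int y, int stride)
{
    assert(stride > 0);
    return y * stride + x;
}

/* Selector used by the switch:
 * 0 = both even, 1 = dy odd, 2 = dx odd, 3 = both odd. */
static inline int halfpel_mode(int dx, int dy)
{
    return ((dx & 1) << 1) + (dy & 1);
}
```

So every call pays for two multiplies by stride plus the switch, even
when the inputs barely change between calls.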

Indeed, this way of calling interpolate8x8_switch isn't optimal,
I guess, since all addresses are recalculated each time, and in most
cases the vectors will not even be different.
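
A rough sketch of what I mean, assuming the four 8x8 luma blocks of a
macroblock share one vector: compute the base offset once and step by
fixed amounts for the other three blocks. luma_src_offsets() is a
hypothetical name, not something in the XviD tree, and chroma is left
out because it uses its own planes and vectors.

```c
#include <assert.h>
#include <stdint.h>

/* Per-block form, as in the quoted code: recomputed on every call. */
static inline int src_offset(int x, int y, int dx, int dy, int stride)
{
    return (y + (dy >> 1)) * stride + x + (dx >> 1);
}

/* Hypothetical per-macroblock form: one full address calculation,
 * then constant steps for the other three 8x8 luma blocks
 * (right, below, below-right). Valid only if all four blocks
 * use the same (dx, dy). */
static inline void luma_src_offsets(int x, int y, int dx, int dy,
                                    int stride, int out[4])
{
    assert(stride > 0);
    const int base = src_offset(x, y, dx, dy, stride);
    out[0] = base;                   /* block at (x,   y)   */
    out[1] = base + 8;               /* block at (x+8, y)   */
    out[2] = base + 8 * stride;      /* block at (x,   y+8) */
    out[3] = base + 8 * stride + 8;  /* block at (x+8, y+8) */
}
```

The half-pel selector could be hoisted the same way, since it depends
only on dx and dy.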

chl