[XviD-devel] Question about bvop decoding

Tue Jul 20 21:24:10 CEST 2004

Christoph Lampert (chl at math.uni-bonn.de) wrote:
> valgrind/cachegrind seems to produce results similar to yours,
> decode_bf_interpolate_mbinter has 14% of instructions, and 
> 5.3% of total CPU cycles. With both, it's top of the list, followed by 
> decoder_bframes (7.3% of instructions) and decode_mbinter(6.8%). 

Glad to see I'm not crazy, and that my box doesn't behave as if
it were part of the 4th dimension!

> The largest portion is due to the complicated calculation of 
> 
> const uint8_t *const src = refn + (int)((y+(dy>>1))*stride+x+(dx>>1));
> 
> and the less complicated 
> 
> uint8_t *const dst = cur + (int)(y*stride+x);
> 
> switch (((dx&1)<<1)+(dy&1)) {
>  
> Those are in fact not in decoder.c, but inlined from 
> interpolate8x8_switch(), which is called 6 times per MB. 
> So I guess that high number of cycles is due to counting inlined code. 
> Have you maybe checked how big interpolate_mbinter is in the ASM step?
> 
> Indeed, this way of calling interpolate8x8_switch isn't optimal, 
> I guess, since all addresses are recalculated each time, and in 
> most cases the vectors will not even be different. 

I'll have a look.
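
To make the point about recalculated addresses concrete, here is a
minimal sketch, not the actual xvidcore code. The helper names
(halfpel_mode, block_src, luma_block_src) are hypothetical, and it
assumes the four 8x8 luma blocks of a macroblock share one vector and
are numbered 0..3 in raster order. In that case the source address
can be computed once and the other three derived with two additions:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; not xvidcore functions. */

/* Same selector as the quoted switch: bit 1 is set when dx is odd
 * (horizontal half-pel), bit 0 when dy is odd (vertical half-pel). */
static int halfpel_mode(int dx, int dy)
{
    return ((dx & 1) << 1) + (dy & 1);
}

/* Same address expression as the quoted code: block at (x, y),
 * displaced by the full-pel part of the vector (dx, dy). */
static const uint8_t *block_src(const uint8_t *refn, int x, int y,
                                int dx, int dy, int stride)
{
    assert(stride > 0);
    return refn + (int)((y + (dy >> 1)) * stride + x + (dx >> 1));
}

/* Assumption: all four luma blocks use the same vector. Given the
 * address of block 0, block i (0..3, raster order) sits 8 pixels to
 * the right when i is odd and 8 rows down when i >= 2. */
static const uint8_t *luma_block_src(const uint8_t *src0, int i, int stride)
{
    assert(i >= 0 && i < 4);
    return src0 + (i & 1) * 8 + (i >> 1) * 8 * stride;
}
```

With a shared vector, halfpel_mode() also only needs to be evaluated
once per macroblock. Whether this actually saves cycles in the
decoder would have to be measured.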

Btw, I'd like to share with Linux users some discoveries I
made this week while profiling the decoder.

As I'm interested in kernel latency improvements, I tested the
Con Kolivas patchset, and I got quite strange results during an
mplayer+xvidcore benchmark. To make it short, the benchmark ran
20s to 30s faster (a 15% improvement) on this CK kernel than on
kernel.org kernels.

After some investigation (was gettimeofday broken because of
voluntary preemption, because of the staircase scheduler, a
different HZ value, etc.?), it seems the staircase scheduler
notices that a process (here mplayer) can monopolize the CPU
without disturbing the interactivity of other tasks (there is
almost no interactive load, as I'm just waiting for the result).

So for testing/benchmarking purposes, don't use CK kernels
unless you are sure you can reproduce the exact same load
during each test. Otherwise you'll get results with quite a big
variance (5% to 10% of total time).

I'd recommend this kernel if encoding time benefits from this
scheduling policy :-)

I'm still amazed the CK kernel could bring a 15% improvement for
free (of course that implies you do nothing else but decoding).

-- 
Edouard Gomez