[XviD-devel] Thought about idct_add
Edouard Gomez
ed.gomez at free.fr
Mon Aug 23 16:31:44 CEST 2004
Hello,
Just a few thoughts about idct+add merging:
- ffmpeg doesn't use this, they just wrap both operations in one function
call... and as we all know ffmpeg is faaaaast.
- I detailed memory transactions:
Today's solution:
- horizontal pass reads the dct coeffs once and writes them to
the buffer (64*2*2)
- vertical pass reads them and writes them back (2*2*64)
- transfer function reads the 16bit values (64*2 reads),
adds them to the destination (64 reads) and writes the result
to the destination (64 writes)
--- Total: 896bytes per 8x8 block
I won't count SIMD table accesses, as these vary among implementations
and should be the same for idct_add and idct (I hope so).
iDCT+add solution:
- horizontal pass does 64*2 reads and 64*2 writes
- vertical pass does 64*2 reads of the horizontal pass intermediate
results, plus 64*2 reads from dst because we must unpack its values in
order to add+clip with the idct values: 64 bytes read for the first
4 columns and the same again for the last 4 columns.
As with the reads, we must write dst twice (64*2)
--- Total: 540bytes per 8x8 block
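A quick tally of the transfers itemized above, as a minimal sketch in C. It assumes 16bit coefficients and 8bit pixels, and counts only the reads and writes that are listed; the function names are made up for illustration. Counted this way the sums come to 768 and 640 bytes, which differ from the totals quoted above, so those totals may include traffic that is not itemized in the list.

```c
#include <assert.h>

/* Sizes assumed for one 8x8 block: 64 coefficients of 2 bytes, 64 pixels of 1 byte. */
enum { COEFFS = 64, COEFF_BYTES = 2, PIXEL_BYTES = 1 };

/* Separate idct followed by a transfer (add) function. */
static int bytes_separate(void)
{
    int horizontal = COEFFS * COEFF_BYTES * 2;     /* read coeffs + write buffer */
    int vertical   = COEFFS * COEFF_BYTES * 2;     /* read buffer + write buffer */
    int transfer   = COEFFS * COEFF_BYTES          /* read 16bit idct results    */
                   + COEFFS * PIXEL_BYTES          /* read destination           */
                   + COEFFS * PIXEL_BYTES;         /* write destination          */
    return horizontal + vertical + transfer;
}

/* Merged idct+add: the vertical pass adds straight into the destination. */
static int bytes_merged(void)
{
    int horizontal = COEFFS * COEFF_BYTES * 2;     /* read coeffs + write buffer    */
    int vertical   = COEFFS * COEFF_BYTES          /* read intermediate results     */
                   + COEFFS * PIXEL_BYTES * 2      /* dst read once per 4-col group */
                   + COEFFS * PIXEL_BYTES * 2;     /* dst written once per group    */
    return horizontal + vertical;
}

/* Relative saving, in percent, of `after` compared to `before`. */
static double saving_percent(int before, int after)
{
    assert(before > 0);
    return 100.0 * (double)(before - after) / (double)before;
}
```

Calling saving_percent(bytes_separate(), bytes_merged()) gives the relative saving under these counting assumptions.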
If I'm not mistaken, this is a 45% bandwidth saving. But as the
bframe trick showed, a 50% bandwidth saving gives roughly a 10%
improvement at most. My C tests showed the speed was the same, but as I
said, C versions are rarely bandwidth limited, so it's not that surprising
that saving memory accesses doesn't make things faster in that case.
Moreover, as I see it, the fact that the idct does its last pass on groups
of 4 columns will really blow up the complexity, since each sub pass must
write only 4 bytes to the destination. I don't see any elegant
solution except the usual 0xffffffff00000000 masking of the destination
and then ORing the 4 resulting bytes into that masked destination. Or
I could use 32bit writes, saving the result of
clip((dst + clipped_idct), 0, 255) to a regular 32bit register and then
to dst... anyway, in every case I can foresee this will be very
tricky/complicated...
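The two write strategies mentioned above can be sketched in plain C, just to make the data movement concrete. This is only an illustration under assumptions: one 8 pixel destination row, 4 idct results per sub pass, and hypothetical helper names (clip_u8, add4_masked, add4_store32). A real MMX version would use packed unpack/add/pack instructions rather than a scalar loop.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Clamp an intermediate sum to the 0..255 pixel range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Variant 1: load the whole 8-byte row, mask out the 4 bytes being
 * replaced, OR in the 4 new pixels, store all 8 bytes back.
 * half = 0 updates columns 0..3, half = 1 updates columns 4..7. */
static void add4_masked(uint8_t *row, const int16_t *idct4, int half)
{
    uint8_t lanes[8] = {0};
    uint8_t keep[8]  = {0};
    uint64_t d, m, n;
    int base, i;

    assert(half == 0 || half == 1);
    base = half * 4;

    for (i = 0; i < 4; i++)
        lanes[base + i] = clip_u8(row[base + i] + idct4[i]);
    for (i = 0; i < 4; i++)
        keep[(base ^ 4) + i] = 0xff;   /* bytes of the other half are kept */

    memcpy(&d, row, 8);
    memcpy(&m, keep, 8);    /* 0xffffffff00000000 for half 0 on little-endian */
    memcpy(&n, lanes, 8);
    d = (d & m) | n;        /* masked destination ORed with the new pixels */
    memcpy(row, &d, 8);
}

/* Variant 2: pack the 4 clipped pixels into a 32bit value and store
 * only those 4 bytes. */
static void add4_store32(uint8_t *row, const int16_t *idct4, int half)
{
    uint8_t lanes[4];
    uint32_t packed;
    int i;

    assert(half == 0 || half == 1);
    row += half * 4;

    for (i = 0; i < 4; i++)
        lanes[i] = clip_u8(row[i] + idct4[i]);

    memcpy(&packed, lanes, 4);
    memcpy(row, &packed, 4);
}
```

Both variants leave the other 4 pixels of the row untouched. The masked one moves 8 bytes each way per sub pass, while the 32bit store moves only 4 on the write side.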
So, if an asm expert could give his opinion, I'd really appreciate it,
because the more I think about this optimisation, the more I think it
won't help much compared to the effort required to hack it.
PS: neither skal's codec nor ffmpeg offers such a trick, maybe
because it's not worth it.
--
Edouard Gomez