[XviD-devel] Thought about idct_add

Mon Aug 23 16:31:44 CEST 2004

Hello,

Just a few thoughts about idct+add merging:
 - ffmpeg doesn't use this, they just wrap both operations in one function
   call... and as we all know ffmpeg is faaaaast.
 - I detailed the memory transactions of both approaches:

Today's solution:
 - horizontal pass reads the dct coeffs once and writes them to
   the buffer (64*2*2)
 - vertical pass reads them and writes them back (2*2*64)
 - transfer function reads the 16bit values (64*2 reads), adds
   them to the destination (64 reads) and writes the result to
   the destination (64 writes)
 --- Total: 896bytes per 8x8 block
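
For reference, the transfer step described above can be sketched in plain C
as follows. This is only an illustration: the names clip_u8 and
add_block_16to8 are hypothetical and not taken from the XviD sources, and
the layout (8 contiguous 16bit values per source row, a byte stride for
dst) is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp a signed intermediate value to the 0..255 pixel range. */
static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/*
 * Hypothetical transfer function: add an 8x8 block of 16bit iDCT output
 * to an 8bit destination. Per block this reads 64*2 bytes from src,
 * reads 64 bytes from dst and writes 64 bytes back to dst.
 */
static void add_block_16to8(uint8_t *dst, const int16_t *src, size_t stride)
{
    assert(stride >= 8); /* rows of dst must not overlap */

    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            dst[x] = clip_u8(dst[x] + src[x]);
        dst += stride; /* next destination row */
        src += 8;      /* next row of residuals */
    }
}
```

Merging idct and add means this separate pass over the 16bit buffer
disappears, at the price of touching dst from inside the vertical pass.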

I won't count SIMD table accesses, as these can vary among implementations
and should be the same for idct_add and plain idct (I hope so).

iDCT+add solution:
 - horizontal pass does 64*2 reads and 64*2 writes
 - vertical pass does 64*2 reads of the horizontal pass intermediate
   results, plus 64*2 reads from dst because we must unpack its values
   in order to add+clip them with the idct values: 64 bytes read for
   the first 4 columns and the same again for the last 4 columns.
   As with the reads, dst must be written twice (64*2)
 --- Total: 540bytes per 8x8 block
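
The per-item counts in the two lists can be tallied mechanically. The
sketch below does so under one reading of the items (2 bytes per
coefficient, 1 byte per destination pixel); all names are made up for
illustration. Summing only the itemised terms this way gives 768 and 640
bytes, which does not match the totals quoted above, so the quoted totals
may include accesses that are not itemised.

```c
#include <assert.h>

enum {
    COEFFS      = 64, /* values in an 8x8 block */
    COEFF_BYTES = 2,  /* 16bit dct coefficients / intermediates */
    PIXEL_BYTES = 1   /* 8bit destination pixels */
};

/* Separate idct followed by a transfer function, as itemised. */
static int separate_bytes(void)
{
    int horizontal = COEFFS * COEFF_BYTES * 2; /* read + write */
    int vertical   = COEFFS * COEFF_BYTES * 2; /* read + write */
    int transfer   = COEFFS * COEFF_BYTES      /* read 16bit values */
                   + COEFFS * PIXEL_BYTES      /* read dst */
                   + COEFFS * PIXEL_BYTES;     /* write dst */
    return horizontal + vertical + transfer;
}

/* Merged idct+add, as itemised. */
static int merged_bytes(void)
{
    int horizontal = COEFFS * COEFF_BYTES * 2;   /* read + write */
    int vertical   = COEFFS * COEFF_BYTES        /* read intermediates */
                   + COEFFS * PIXEL_BYTES * 2    /* dst read twice */
                   + COEFFS * PIXEL_BYTES * 2;   /* dst written twice */
    return horizontal + vertical;
}

/* Relative saving in percent when going from `before` to `after`. */
static double saving_percent(int before, int after)
{
    assert(before > 0);
    return 100.0 * (before - after) / before;
}
```

Calling saving_percent(separate_bytes(), merged_bytes()) gives the saving
for this particular way of counting.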

If I'm not mistaken, this is a 45% bandwidth saving. But as the
bframe trick showed, a 50% bandwidth saving gives roughly a 10% speed
improvement at most. My C tests showed the speed was the same, but as I
said, C versions are rarely bandwidth limited, so it's not that surprising
that saving memory accesses doesn't make them any faster.

Moreover, as I see it, the fact that the idct does its last pass on groups
of 4 columns will really blow up the complexity, because each sub pass
must write only 4 bytes to the destination. I don't see any elegant
solution except the usual 0xffffffff00000000 masking of the destination
and then oring the 4 resulting bytes into that masked destination. Or
I could use 32bit writes, saving the result of
clip((dst + clipped_idct), 0, 255) to a regular 32bit register and then to
dst... Anyway, in all the cases I can think of this will be very
tricky/complicated...
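
To make the two options concrete, here is a scalar C sketch of both. It is
not MMX code and not taken from any codec: the function names are
hypothetical, and the mask is built from bytes so that it equals
0xffffffff00000000 for the first 4 columns on a little-endian machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint8_t clip_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/*
 * Option 1: 64bit read-modify-write of one destination row.
 * half = 0 updates columns 0..3, half = 1 updates columns 4..7.
 */
static void add4_masked(uint8_t *dst_row, const int16_t *idct4, int half)
{
    assert(half == 0 || half == 1);

    uint8_t keep_bytes[8], new_bytes[8] = {0};
    uint64_t row, keep, fresh;

    /* 0xff where the old pixels must survive, 0x00 where we overwrite. */
    memset(keep_bytes, 0xff, 8);
    memset(keep_bytes + 4 * half, 0x00, 4);

    for (int i = 0; i < 4; i++)
        new_bytes[4 * half + i] = clip_u8(dst_row[4 * half + i] + idct4[i]);

    memcpy(&row, dst_row, 8);
    memcpy(&keep, keep_bytes, 8);
    memcpy(&fresh, new_bytes, 8);

    row = (row & keep) | fresh; /* mask destination, then or in 4 bytes */
    memcpy(dst_row, &row, 8);
}

/* Option 2: pack the 4 clipped results and store them with a 32bit write. */
static void add4_store32(uint8_t *dst_row, const int16_t *idct4, int half)
{
    assert(half == 0 || half == 1);

    uint8_t packed[4];
    for (int i = 0; i < 4; i++)
        packed[i] = clip_u8(dst_row[4 * half + i] + idct4[i]);

    memcpy(dst_row + 4 * half, packed, 4); /* touches only 4 bytes of dst */
}
```

Both functions leave the other 4 pixels of the row untouched; the first
one reads and writes the full 8 bytes, the second only 4.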

So, if an asm expert could give his opinion, I'd really appreciate it,
because the more I think about this optimisation, the more I think it
won't help much compared to the effort required to hack it.

PS: neither skal codec nor ffmpeg propose such a trick, maybe that's
    because it's not worth it.

--
Edouard Gomez