[XviD-devel] Thought about idct_add

Thu Aug 26 13:30:55 CEST 2004

	howdy gents,


On Mon, 2004-08-23 at 16:31, Edouard Gomez wrote:
> Hello,
> 
> Just a few thoughts about idct+add merging:

	some thoughts too:

> 
> iDCT+add solution:
>  - horizontal pass does 64*2 readings and 64*2 writings
>  - vertical pass does 64*2 readings from horizontal pass intermediate
>    results, 64*2 readings from dst because we must unpack its values in
>    order to add+clip with idct values. So it's 64bytes readings for 4
>    first columns and the same again for last four columns.
>    As for the reading, we must write dst twice (64*2)
>  --- Total: 540bytes per 8x8 block

	In fact, a little more I/O than that, because of temporary
	spills (SCRATCH) during vertical pass. Currently, i see 2
	spills in xvid's code (b0 and b3), but note that it could
	be reduced to 1. 

> 
> If i'm not mistaken, this is a 45% bandwidth usage saving.

	Bandwidth arithmetic isn't that simple. You can't just
	count the number of I/Os. Their impact depends a lot
	on the state of the cache. For instance the intermediate
	DCT values are most probably in L1 (they're on the esp
	stack), whilst the source/dst pixels are most probably
	in L2 at best, unless you prefetched them somehow. This
	varies a lot of course... So: timing in a real situation
	is still the best way to profile cache behaviour.

>  But as the
> bframe trick showed, a 50% bandwidth saving gives roughly a 10% max
> improvement. My C tests showed the speed was the same, but as i said, C
> versions are rarely bandwidth limited, so it's not that surprising that
> saving memory accesses doesn't help getting faster in that case.
> 
> Moreover as i see it, the fact idct does the last pass on groups of
> 4 columns will really blow up the complexity required to write
> only 4bytes to destination at each sub pass. I don't see any elegant
> solution except the usual 0xffffffff00000000 masking of destination
> and then oring the 4 resulting bytes with that masked destination. 

	That's not very complicated: you can easily write 32 bits
	from an MMX register (movd instead of movq), or even use
	a CPU reg (eax) as intermediate (for 4-byte aligned writes).
	What's interesting for idct+put/add is to store the
	intermediate idct result for columns [0..3] back into the
	source array (16b), still unpacked. Once the last 4 columns
	[4..7] are also idct'd, you can fetch these back, packuswb
	them all, and send the whole 8 bytes to the pixel
	destination (better caching).
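
	To make the two store strategies concrete, here is a rough
	scalar C sketch (not xvid's MMX code). The names clamp_u8,
	add_half_row and add_full_block are made up for illustration,
	and the block is assumed to already hold fully idct'd,
	descaled 16-bit values. Variant A writes 4 bytes per half
	row (the movd/eax idea), variant B keeps everything unpacked
	and does one 8-byte store per row (the packuswb + movq idea).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar stand-in for one lane of packuswb: clamp to 0..255. */
static uint8_t clamp_u8(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Variant A (hypothetical helper): add 4 residuals to 4 pixels and
 * store them with a single 4-byte write, as movd or a 32-bit
 * register would do. */
static void add_half_row(uint8_t *dst, const int16_t *res)
{
    uint8_t out[4];
    for (int x = 0; x < 4; x++)
        out[x] = clamp_u8(dst[x] + res[x]);
    memcpy(dst, out, 4);
}

/* Variant B (hypothetical helper): the residuals for all 8 columns
 * are still unpacked 16-bit values in block[]; pack a whole row and
 * store it with a single 8-byte write. */
static void add_full_block(uint8_t *dst, int stride, const int16_t block[64])
{
    assert(stride >= 8);
    for (int y = 0; y < 8; y++) {
        uint8_t out[8];
        for (int x = 0; x < 8; x++)
            out[x] = clamp_u8(dst[y * stride + x] + block[y * 8 + x]);
        memcpy(dst + y * stride, out, 8);
    }
}
```

	Both variants give the same pixels; they only differ in how
	many stores hit the destination per row (two vs. one).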

> Or
> i could use 32bit writings, saving to a usual 32bit register the result
> of clip((dst + clipped_idct), 0, 255) and then to dst... anyway, in all
> cases i plan this will be very tricky/complicated...

	As seen above, that's not really tricky... Processing of
	the first 4 columns is the same, and processing of the last
	ones has an additional 'packuswb' before the final 'movq' ;)
 
	(note: you needn't clip_idct for ASM version: dynamic
	range is 16bits, fully used before final descaling, and 
	the saturation ops -paddsw- are doing it for 'free').
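
	A small scalar C model of why no separate clip step is
	needed, assuming the usual semantics of paddsw (signed
	16-bit saturating add) and packuswb (signed 16-bit to
	unsigned 8-bit with saturation). The function names are
	made up for the example; this is a sketch of one lane,
	not the real SIMD code.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of paddsw: signed 16-bit add, saturating instead of wrapping. */
static int16_t paddsw1(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) s = INT16_MAX;
    if (s < INT16_MIN) s = INT16_MIN;
    return (int16_t)s;
}

/* One lane of packuswb: signed 16-bit to unsigned 8-bit, saturating. */
static uint8_t packuswb1(int16_t v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Pixel + residual with no explicit clip: the two saturating steps
 * already bound the result to 0..255. */
static uint8_t add_clip(uint8_t pixel, int16_t residual)
{
    return packuswb1(paddsw1((int16_t)pixel, residual));
}

/* Extreme residuals cannot wrap around. */
static void check_saturation(void)
{
    assert(add_clip(250, INT16_MAX) == 255);
    assert(add_clip(10, INT16_MIN) == 0);
    assert(add_clip(100, 27) == 127);
}
```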

> 
> SO, if an asm expert could give his opinion, i'd really appreciate
> because the more i think about this optimisation, the more i think it
> won't help much compared to the required effort to hack it.
> 
> PS: neither skal codec nor ffmpeg propose such a trick, maybe that's
>     because it's not worth it.

	hmm... i did it for the ASM version, since memory I/O
	then becomes the real bottleneck compared to the actual
	arithmetic computations (which are fast once SIMD'd). So
	you must take every chance to process the data further
	before storing it. Conversely, it's less critical for the
	C version, where arithmetic outweighs the I/Os (it's even
	counter-productive to use 16b storage..)
	just my 0.02euros

	bye!
Skal