[XviD-devel] More MMX improvements => funky cbp!

Wed Aug 3 12:48:37 CEST 2005

	Hello Carlo!

On Fri, 2005-07-29 at 17:59, carlo.bramix wrote:
> Hello,
> thanks a lot for your replies.
> Unfortunately, I thought those routines were speed critical for the fact they were rewritten in ASM.
> I will try to do improvements on other parts of the codec.

	Actually, i've tried applying your patch,
	and i see some binary differences in
	the output (using xvid_encraw.c with the
	MMX cpu forced).

	Could you cross-check that your function is OK?
	If it is not, then xvid_bench.c should be
	enhanced to catch this false positive.
	But it might just be me who messed the test up.


	Anyway, do you feel like exercising a little,
	just for the sport of it? Yes?

	'coz i've had a look at your ASM code, and
	the final bit-by-bit computation of the CBP
	could be sped up a little, IMHO.
	Mind you, we're only talking about a few %
	speed-up on a few % of the cpu use here, but
	that's just for the challenge (summer is sooo boring ;)

	Here it goes:

	cbp computation (for the luma part) is in fact
	a scalar product:

	cbp_y = 1.a + 2.b + 4.c + 8.d,

	where a, b, c and d are boolean values
	deduced from or'ing all the 8x8 (luma) DCT coeffs
	(with the exception of the DC), and 'pcmpgtw'ing
	them to zero.
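
	For reference, the scalar product above can be sketched in
	plain C as below. This is only an illustration, not the
	codec's actual routine: the helper names, the flat layout
	(four luma blocks of 64 coeffs, one after the other) and
	the block-to-bit order (block 0 gives a, block 3 gives d)
	are all assumptions.

```c
#include <stdint.h>

/* Hypothetical helper: 1 if any AC coefficient (index 1..63) of one
 * 8x8 block is non-zero, 0 otherwise. Index 0 (the DC) is skipped. */
static unsigned block_has_ac(const int16_t *block)
{
    int16_t acc = 0;
    for (int i = 1; i < 64; i++)
        acc |= block[i];          /* the OR is non-zero iff some coeff is */
    return acc != 0;
}

/* Scalar reference for the luma part of the cbp.
 * Assumed layout: four luma blocks of 64 coeffs stored back to back.
 * Assumed bit order: block 0 -> a, block 1 -> b, block 2 -> c, block 3 -> d. */
static unsigned cbp_y_ref(const int16_t *luma)
{
    unsigned a = block_has_ac(luma + 0 * 64);
    unsigned b = block_has_ac(luma + 1 * 64);
    unsigned c = block_has_ac(luma + 2 * 64);
    unsigned d = block_has_ac(luma + 3 * 64);
    return 1 * a + 2 * b + 4 * c + 8 * d;
}
```

	A slow reference of this kind is what a faster version can
	be compared against, bit for bit.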

	Now, you can easily compute this scalar product
	with good ol' 32bits-mult as:

	cbp_y = ( 0xdcba * 0x1248 ) >> 24

	where 0xdcba is the 32bit integer resulting
	from packing (packssdw/wb) the four bools,
	one per byte, as

	0xdcba = (d<<24) | (c<<16) | (b<<8) | (a)

	and 0x1248 likewise has one digit per byte,
	i.e. it is the constant 0x01020408.

	This works because no column sum overflows
	its byte (the largest is 1+2+4+8 = 15), so nothing
	carries into the next column. Just write the actual
	mult (like in school) to see it:

          0x   d  c  b  a
       *  0x   1  2  4  8
  -----------------------
+             8d 8c 8b 8a
+          4d 4c 4b 4a
+       2d 2c 2b 2a
+    1d 1c 1b 1a
  -----------------------
=             ^^
	and look at the sum in the fourth column from
	the right: 8d + 4c + 2b + 1a.
	(yes, multiplication really is a convolution).

	Shifting this column to LSBits (with >>24), you get
	the cbp_y result with very few instructions.
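
	The whole trick can be checked in plain C with something
	like the sketch below. The function names are made up for
	the illustration. It assumes each flag is exactly 0 or 1
	in its byte; if the packed bytes come out of the SIMD
	compare as 0xFF masks instead, they would need reducing
	first, which the last helper does with an AND (again an
	assumption about how the flags arrive, not something
	taken from the patch).

```c
#include <stdint.h>

/* Pack four 0/1 flags, one per byte: (d<<24) | (c<<16) | (b<<8) | a. */
static uint32_t pack_flags(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return ((uint32_t)d << 24) | ((uint32_t)c << 16) |
           ((uint32_t)b << 8)  |  (uint32_t)a;
}

/* Multiply trick: "0x1248" with one digit per byte is 0x01020408.
 * The 32-bit product wraps, so columns above bit 31 are dropped,
 * and the top byte holds 1a + 2b + 4c + 8d. */
static unsigned cbp_y_mul(uint32_t packed)
{
    return (unsigned)((uint32_t)(packed * UINT32_C(0x01020408)) >> 24);
}

/* Variant for bytes that are 0x00 / 0xFF masks: keep one bit per byte. */
static unsigned cbp_y_from_masks(uint32_t masks)
{
    return cbp_y_mul(masks & UINT32_C(0x01010101));
}

/* Exhaustive check over the 16 flag combinations.
 * Returns the number of mismatches against the plain scalar product. */
static int cbp_y_mul_mismatches(void)
{
    int bad = 0;
    for (unsigned v = 0; v < 16; v++) {
        unsigned a = v & 1, b = (v >> 1) & 1, c = (v >> 2) & 1, d = (v >> 3) & 1;
        if (cbp_y_mul(pack_flags(a, b, c, d)) != 1 * a + 2 * b + 4 * c + 8 * d)
            bad++;
    }
    return bad;
}
```

	With only 16 possible inputs, the exhaustive loop is a
	cheap way to convince oneself that no carry ever leaks
	into the top byte.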

	haf phun,

-Skal

(this mult trick is used in the GMC code, btw)