[XviD-devel] More MMX improvements => funky cbp!

carlo.bramix carlo.bramix at libero.it
Thu Aug 4 23:06:40 CEST 2005


Hello,
after receiving your message, I tested that MMX function again.
I extracted some parts of the benchmark and wrote a small program for cross-testing, as you suggested.
As you can see from the attached source, I also added an extra loop that places a single non-zero coefficient at each of the 64 positions of every block, so all steps of the routine are tested.
All results are compared with the ones obtained from the C implementation.
Everything works fine here.
I don't know how the original bench test was designed, but if the second loop doesn't produce errors, then the routine should really be OK.

I understood your algorithm perfectly, but I had some doubts that it would give us more speed.
So, after an afternoon of work, I was able to code this alternative function.
It uses the method you suggested (slightly modified because of the limits of the MMX instruction set).
It is a completely different method from the one currently used, because we need all the terms available before doing that multiply.
MMX can only do 16x16-bit multiplies.
So I used two accumulated 16x16-bit multiplies (pmaddwd) and got the same thing.
Unfortunately, the final result can't be obtained as easily as you thought.
Since there are six terms to accumulate, we still need a shift-and-add, as in the previous method.
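
As a rough illustration of the two accumulated multiplies and the final shift-and-add, here is a scalar C model of the tail of the attached routine. It is only a sketch: madd_pair and cbp_from_flags are names invented here, the input is assumed to be six flags that are already 0 or 1, and a plain 32-bit add stands in for the byte-wise paddusb (no byte carries occur for 0/1 flags, so the low bytes agree).

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PMADDWD on one pair of 16-bit lanes: two signed
 * 16x16-bit products accumulated into a single 32-bit sum. */
static int32_t madd_pair(int16_t x0, int16_t m0, int16_t x1, int16_t m1)
{
    return (int32_t)x0 * m0 + (int32_t)x1 * m1;
}

/* Hypothetical helper: flags[i] must be 0 or 1 for block i (0..5). */
unsigned cbp_from_flags(const uint8_t flags[6])
{
    int i;
    for (i = 0; i < 6; i++)
        assert(flags[i] <= 1);

    /* Flags packed as bytes, two per 16-bit word (as after PACKUSWB). */
    int16_t w0 = (int16_t)(flags[0] | (flags[1] << 8));
    int16_t w1 = (int16_t)(flags[2] | (flags[3] << 8));
    int16_t w2 = (int16_t)(flags[4] | (flags[5] << 8));

    /* mult_mask read as little-endian words: 0x2010, 0x0804, 0x0201, 0. */
    uint32_t lo = (uint32_t)madd_pair(w0, 0x2010, w1, 0x0804);
    uint32_t hi = (uint32_t)madd_pair(w2, 0x0201, 0, 0);

    /* Shift-and-add: fold the two partial sums, then keep bits 8..13. */
    return ((lo + hi) >> 8) & 0x3F;
}
```

With only the first block flagged this gives 0x20, and with all six flagged it gives 0x3F, which matches the 1 << (5-i) layout of the C reference in the attached test program.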

To see how fast it could be, I wrote a very simple DOS program, compiled it with DJGPP, and read the elapsed time with RDTSC.
These are the tick counts:
Current CVS function:    15005-14767
Latest patched function: 13127
This new method:         11953
However, I must say that I was almost forced to remove the loop and load all the terms sequentially.
With the loop it runs even slower, at 14231 ticks.
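
A tick measurement of this kind could look roughly like the sketch below. It is not the DOS program used for the numbers above: read_ticks, best_ticks and sample_work are names made up for illustration, __rdtsc() from <x86intrin.h> assumes GCC or Clang on x86, other targets fall back to clock(), and taking the minimum over several runs is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>
#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>   /* __rdtsc() on GCC/Clang (assumed toolchain) */
#endif

/* Read a tick counter: the TSC on x86, clock() elsewhere. */
uint64_t read_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
    return (uint64_t)__rdtsc();
#else
    return (uint64_t)clock();
#endif
}

/* Placeholder workload standing in for the routine under test. */
volatile unsigned sample_calls = 0;

void sample_work(void)
{
    sample_calls++;
}

/* Call fn() `runs` times and return the smallest tick count seen. */
uint64_t best_ticks(void (*fn)(void), int runs)
{
    uint64_t best = UINT64_MAX;
    int r;

    assert(runs > 0);
    for (r = 0; r < runs; r++) {
        uint64_t t0 = read_ticks();
        fn();
        uint64_t elapsed = read_ticks() - t0;
        if (elapsed < best)
            best = elapsed;
    }
    return best;
}
```

Calling best_ticks(sample_work, 10) runs the placeholder ten times; in a real comparison the function pointer would wrap a call to the routine being measured.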

I hope this is helpful.

Sincerely,

Carlo Bramini.

PS: I hope I passed the test on my coding skills with this exercise ^_-

---------- Initial Header -----------

From    : xvid-devel-bounces at xvid.org
To      : xvid-devel at xvid.org
Cc      :
Date    : Wed, 03 Aug 2005 12:48:37 +0200
Subject : Re: [XviD-devel] More MMX improvements => funky cbp!

> 
> 	Hello Carlo!
> 
> On Fri, 2005-07-29 at 17:59, carlo.bramix wrote:
> > Hello,
> > thanks a lot for your replies.
> > Unfortunately, I thought those routines were speed critical for the fact they were rewritten in ASM.
> > I will try to do improvements on other parts of the codec.
> 
> 	Actually, i've tried applying your patch,
> 	and experienced some binary differences in
> 	the output (using xvid_encraw.c with forced
> 	use of MMX cpu).
> 
> 	Could you cross check your function is Ok?
> 	If not, then xvid_bench.c should be
> 	enhanced to remove this false-positive.
> 	But it might just be me that messed the test up.
> 
> 
> 	Anyway, do you feel like exercising a little,
> 	just for the sport of it? Yes?
> 
> 	'coz i've had a look at your ASM code, and
> 	the final bit-by-bit computation of the CBP
> 	could be sped up a little, IMHO.
> 	Mind you, we're just talking about a few %
> 	speed-up of a few % of cpu use here, but that's
> 	just for the challenge (summer is sooo boring;)
> 
> 	Here it goes:
> 
> 	cbp computation (for the luma part) is in fact
> 	a scalar product:
> 
> 	cbp_y = 1.a + 2.b + 4.c + 8.d,
> 
> 	where a, b, c and d are boolean values
> 	deduced from or'ing all the 8x8 (luma) DCT coeffs
> 	(with the exception of the DC), and 'pcmpgtw'ing
> 	them to zero.
> 
> 	Now, you can easily compute this scalar product
> 	with good ol' 32bits-mult as:
> 
> 	cbp_y = ( 0xdcba * 0x1248 ) >> 24
> 
> 	where 0xdcba is the 32bit integer resulting
> 	from packing (packssdw/wb) the four bools,
> 	one per byte, as
> 
> 	0xdcba = (d<<24) | (c<<16) | (b<<8) | (a)
> 
> 	(0x1248 is written the same way, one weight per
> 	byte, i.e. the constant 0x01020408).
> 
> 	This works because no overflow occurs in any
> 	individual term. Just write out the actual mult
> 	(like in school) to see it:
> 
>          0x  d  c  b  a
>        * 0x  1  2  4  8
> -----------------------
> +           8d 8c 8b 8a
> +        4d 4c 4b 4a
> +     2d 2c 2b 2a
> +  1d 1c 1b 1a
> -----------------------
> =  .........^^
> 
> 	and look at the sum in the fourth column.
> 	(yes, multiplication really is a convolution).
> 
> 	Shifting this column to LSBits (with >>24), you get
> 	the cbp_y result with very few instructions.
> 
> 	haf phun,
> 
> -Skal
> 
> (this mult trick is used in the GMC code, btw)
> 
> 
> 
> 
> _______________________________________________
> XviD-devel mailing list
> XviD-devel at xvid.org
> http://list.xvid.org/mailman/listinfo/xvid-devel
> 
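
The multiply trick quoted above can be checked with a few lines of C. This is just a sketch under the assumptions stated in the quote (a, b, c and d are 0 or 1, packed one per byte); cbp_y_mul is a made-up name, and the shorthand 0x1248 is taken to mean the byte-wise constant 0x01020408.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the XviD sources): weight four 0/1
 * flags with a single 32-bit multiply instead of bit-by-bit ORs. */
unsigned cbp_y_mul(unsigned a, unsigned b, unsigned c, unsigned d)
{
    assert(a <= 1 && b <= 1 && c <= 1 && d <= 1);

    /* "0xdcba" in the mail: d in the top byte, a in the bottom byte. */
    uint32_t packed = ((uint32_t)d << 24) | ((uint32_t)c << 16)
                    | ((uint32_t)b << 8)  |  (uint32_t)a;

    /* Weights 1, 2, 4, 8 from the top byte down to the bottom byte.
     * The multiply wraps modulo 2^32, which discards the columns
     * above the one we want. */
    uint32_t product = packed * UINT32_C(0x01020408);

    /* The top byte now holds 1*a + 2*b + 4*c + 8*d; the lower columns
     * sum to at most 14 each, so nothing carries into it. */
    return (unsigned)(product >> 24);
}
```

For instance, cbp_y_mul(1, 0, 0, 0) returns 1 and cbp_y_mul(1, 1, 1, 1) returns 15.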



-------------- next part --------------
;/****************************************************************************
; *
; *  XVID MPEG-4 VIDEO CODEC
; *  - MMX CBP computation -
; *
; *  Copyright (C) 2001-2003 Peter Ross <pross at xvid.org>
; *                2002-2003 Pascal Massimino <skal at planet-d.net>
; *
; *  This program is free software ; you can redistribute it and/or modify
; *  it under the terms of the GNU General Public License as published by
; *  the Free Software Foundation ; either version 2 of the License, or
; *  (at your option) any later version.
; *
; *  This program is distributed in the hope that it will be useful,
; *  but WITHOUT ANY WARRANTY ; without even the implied warranty of
; *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
; *  GNU General Public License for more details.
; *
; *  You should have received a copy of the GNU General Public License
; *  along with this program ; if not, write to the Free Software
; *  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307 USA
; *
; * $Id: cbp_mmx.asm,v 1.12 2004/08/29 10:02:38 edgomez Exp $
; *
; ***************************************************************************/

BITS 32

;=============================================================================
; Macros
;=============================================================================

%macro cglobal 1
	%ifdef PREFIX
		%ifdef MARK_FUNCS
			global _%1:function %1.endfunc-%1
			%define %1 _%1:function %1.endfunc-%1
		%else
			global _%1
			%define %1 _%1
		%endif
	%else
		%ifdef MARK_FUNCS
			global %1:function %1.endfunc-%1
		%else
			global %1
		%endif
	%endif
%endmacro

;=============================================================================
; Local data
;=============================================================================

%ifdef FORMAT_COFF
SECTION .rodata
%else
SECTION .rodata align=16
%endif

ALIGN 16

mult_mask:
    db 0x10,0x20,0x04,0x08,0x01,0x02,0x00,0x00
ignore_dc:
    dw 0, -1, -1, -1

;=============================================================================
; Code
;=============================================================================

SECTION .text

cglobal _calc_cbp_mmx

%macro      MAKE_LOAD         1
  por mm0, [eax-128*1+%1*8]
  por mm1, [eax+128*0+%1*8]
  por mm2, [eax+128*1+%1*8]
  por mm3, [eax+128*2+%1*8]
  por mm4, [eax+128*3+%1*8]
  por mm5, [eax+128*4+%1*8]
%endmacro

;-----------------------------------------------------------------------------
; uint32_t calc_cbp_mmx(const int16_t coeff[6][64]);
;-----------------------------------------------------------------------------

ALIGN 16
_calc_cbp_mmx:
  mov eax, [esp + 4]            ; coeff

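  movq mm7, [ignore_dc]         ; mask that clears the DC word
  psubd mm6, mm6                ; mm6 = 0, used only for comparing
  movq mm0, [eax+128*0]         ; first four coefficients of blocks 0..5
  movq mm1, [eax+128*1]
  movq mm2, [eax+128*2]
  movq mm3, [eax+128*3]
  movq mm4, [eax+128*4]
  movq mm5, [eax+128*5]
  add eax, 8+128                ; eax = coeff + 8 + 128, base for MAKE_LOAD
  pand mm0, mm7                 ; drop the DC coefficient of each block
  pand mm1, mm7
  pand mm2, mm7
  pand mm3, mm7
  pand mm4, mm7
  pand mm5, mm7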

  MAKE_LOAD 0
  MAKE_LOAD 1
  MAKE_LOAD 2
  MAKE_LOAD 3
  MAKE_LOAD 4
  MAKE_LOAD 5
  MAKE_LOAD 6
  MAKE_LOAD 7
  MAKE_LOAD 8
  MAKE_LOAD 9
  MAKE_LOAD 10
  MAKE_LOAD 11
  MAKE_LOAD 12
  MAKE_LOAD 13
  MAKE_LOAD 14

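  movq mm7, [mult_mask]         ; per-byte weights for the six flags
  packssdw mm0, mm1             ; mm0 = blocks 0,1 (two words each)
  packssdw mm2, mm3             ; mm2 = blocks 2,3
  packssdw mm4, mm5             ; mm4 = blocks 4,5
  packssdw mm0, mm2             ; mm0 = blocks 0..3, one word each
  packssdw mm4, mm6             ; mm4 = blocks 4,5, upper two words zero
  pcmpgtw mm0, mm6              ; signed compare with 0: 0xFFFF where word > 0,
  pcmpgtw mm4, mm6              ; 0 for zero or negative words
  psrlw mm0, 15                 ; 0xFFFF -> 1
  psrlw mm4, 15
  packuswb mm0, mm4             ; six 0/1 flags, one per byte
  pmaddwd mm0, mm7              ; two accumulated 16x16 multiplies per dword

  movq mm1, mm0
  psrlq mm1, 32
  paddusb mm0, mm1              ; add the high dword into the low dword

  movd eax, mm0
  shr eax, 8                    ; cbp sits in bits 8..13
  and eax, 0x3F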
  ret
.endfunc

-------------- next part --------------
#include <stdio.h>

#define int16_t signed short int

#define DECLARE_ALIGNED_MATRIX(name, x, y, tp, al) \
tp name[x*y]

int calc_cbp_mmx(const int16_t *coeff);

/* naive C */
int calc_cbp_plain(const int16_t codes[6 * 64])
{
    int i, j, cbp = 0;

    for (i = 0; i < 6; i++) {
        for (j=1; j<64;j++) {
            if (codes[64*i+j]) {
                cbp |= 1 << (5-i);
                break;
            }
        }
    }
    return cbp;
}

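/* Print the MMX result and flag any mismatch with the C reference.
   Returns 1 on mismatch, 0 otherwise. */
int test_cbp(const int16_t *coeff)
{
    int cbp = calc_cbp_mmx(coeff);
    int res = calc_cbp_plain(coeff);

    printf("calc_cbp#1 cbp=0x%02x %s\n", cbp, (cbp!=res)?"| ERROR": "");
    return cbp != res;
}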

int main()
{
    int tst;
    int i;
    DECLARE_ALIGNED_MATRIX(Src1, 6, 64, int16_t, 16);
    DECLARE_ALIGNED_MATRIX(Src2, 6, 64, int16_t, 16);
    DECLARE_ALIGNED_MATRIX(Src3, 6, 64, int16_t, 16);
    DECLARE_ALIGNED_MATRIX(Src4, 6, 64, int16_t, 16);
    DECLARE_ALIGNED_MATRIX(Src5, 6, 64, int16_t, 16);
    DECLARE_ALIGNED_MATRIX(Src6, 6, 64, int16_t, 16);

    printf( "\n =====  test cbp =====\n" );

    for(i=0; i<6*64; ++i) {
        Src1[i] = (i*i*3/8192)&(i/64)&1;  /* 'random' */
        Src2[i] = (i<3*64);               /* half-full */
        Src3[i] = ((i+32)>3*64);
        Src4[i] = (i==(3*64+2) || i==(5*64+9));
    }
    test_cbp(Src1);
    test_cbp(Src2);
    test_cbp(Src3);
    test_cbp(Src4);

    for (tst=0; tst<64; tst++) {
        for(i=0; i<6*64; ++i) {
                Src1[i] = (i==tst+64*0) ? 1 : 0;
                Src2[i] = (i==tst+64*1) ? 1 : 0;
                Src3[i] = (i==tst+64*2) ? 1 : 0;
                Src4[i] = (i==tst+64*3) ? 1 : 0;
                Src5[i] = (i==tst+64*4) ? 1 : 0;
                Src6[i] = (i==tst+64*5) ? 1 : 0;
        }
        test_cbp(Src1);
        test_cbp(Src2);
        test_cbp(Src3);
        test_cbp(Src4);
        test_cbp(Src5);
        test_cbp(Src6);
    }

    return 0;
}

