[XviD-devel] Inlined ASM code again

Wed Aug 20 16:45:20 CEST 2003

Hello,

I'd like to comment on this: your benchmark is very artificial, so it
doesn't say much. I suggest you write a sad16 replacement using gcc
intrinsics, patch XviD to use it, switch to a quality mode that only does
16x16 block search (<4), and compare encoding speed between your patch and
the standard XviD version.

If the version using intrinsics is really faster, I agree that we should
discuss how to make use of them.
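
To make the suggestion concrete, here is a minimal sketch of what such a
sad16 could look like. The names sad16_sketch, sad16_c and load8 are made up
for illustration and are not XviD functions. It assumes 8-bit samples with
rows "stride" bytes apart, and it falls back to plain C when the compiler
does not advertise SSE:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#if defined(__SSE__)
#include <xmmintrin.h>	/* _mm_sad_pu8 and the MMX intrinsics */
#endif

/* Plain C reference: sum of absolute differences over a 16x16 block. */
static uint32_t
sad16_c(const uint8_t *cur, const uint8_t *ref, int stride)
{
	uint32_t sad = 0;
	int x, y;

	for (y = 0; y < 16; y++) {
		for (x = 0; x < 16; x++)
			sad += (uint32_t)abs(cur[x] - ref[x]);
		cur += stride;
		ref += stride;
	}
	return sad;
}

#if defined(__SSE__)
/* Load 8 bytes without assuming alignment (hypothetical helper). */
static __m64
load8(const uint8_t *p)
{
	__m64 v;

	memcpy(&v, p, sizeof(v));
	return v;
}
#endif

/* Hypothetical intrinsics version; uses sad16_c when SSE is unavailable. */
static uint32_t
sad16_sketch(const uint8_t *cur, const uint8_t *ref, int stride)
{
#if defined(__SSE__)
	__m64 accum = _mm_setzero_si64();
	uint32_t sad;
	int y;

	for (y = 0; y < 16; y++) {
		const uint8_t *c = cur + y * stride;
		const uint8_t *r = ref + y * stride;

		/* Each row is two 8-byte halves */
		accum = _mm_add_pi32(accum, _mm_sad_pu8(load8(c), load8(r)));
		accum = _mm_add_pi32(accum, _mm_sad_pu8(load8(c + 8), load8(r + 8)));
	}
	sad = (uint32_t)_mm_cvtsi64_si32(accum);
	_mm_empty();	/* leave MMX state before any later FPU code */
	return sad;
#else
	return sad16_c(cur, ref, stride);
#endif
}
```

Comparing sad16_sketch against sad16_c on a few blocks is a cheap sanity
check before measuring encoding speed.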

bye,
Michael

Quoting Edouard Gomez <ed.gomez at free.fr>:

> Hello,
> 
> I know this has been discussed in the past but I think we should really
> consider inlining the assembly code instead of using function pointers
> (optimizations would be chosen at compile time).
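
The two approaches being compared can be sketched roughly as follows. This
is only an illustration: sad8_func, sad8_ptr, SAD8, SKETCH_INLINE_SAD and
sad16_from_sad8 are invented names, and a plain C sad8 stands in for the
optimized versions.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t (*sad8_func)(const uint8_t *cur, const uint8_t *ref, int stride);

/* Plain C 8x8 SAD, standing in for any optimized implementation. */
static inline uint32_t
sad8_c(const uint8_t *cur, const uint8_t *ref, int stride)
{
	uint32_t sad = 0;
	int x, y;

	for (y = 0; y < 8; y++) {
		for (x = 0; x < 8; x++)
			sad += (uint32_t)abs(cur[x] - ref[x]);
		cur += stride;
		ref += stride;
	}
	return sad;
}

/* (1) Runtime dispatch: the pointer is set once, e.g. after CPU detection.
 * Every call is indirect, so the compiler cannot inline the callee. */
sad8_func sad8_ptr = sad8_c;

/* (2) Compile-time selection: the implementation is fixed when building,
 * so the call is direct and the compiler is free to inline it. */
#ifdef SKETCH_INLINE_SAD
#define SAD8(cur, ref, stride) sad8_c((cur), (ref), (stride))
#else
#define SAD8(cur, ref, stride) sad8_ptr((cur), (ref), (stride))
#endif

/* A caller written against the macro works with either variant:
 * here, a 16x16 SAD built from its four 8x8 quadrants. */
static uint32_t
sad16_from_sad8(const uint8_t *cur, const uint8_t *ref, int stride)
{
	return SAD8(cur, ref, stride)
	     + SAD8(cur + 8, ref + 8, stride)
	     + SAD8(cur + 8 * stride, ref + 8 * stride, stride)
	     + SAD8(cur + 8 * stride + 8, ref + 8 * stride + 8, stride);
}
```

The trade-off is that variant (2) gives up choosing the implementation at
run time: one binary is tied to one instruction set.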
> 
> Here's a very silly implementation of the sad8 operator using gcc
> intrinsics for mmxext.
> 
> /* Row (a) starts (a) * stride bytes into each block */
> #define SADROW(a) \
> 	tmp = _mm_sad_pu8(*(const __m64 *)(cur + (a) * stride), \
> 	                  *(const __m64 *)(ref + (a) * stride)); \
> 	accum = _mm_add_pi16(tmp, accum)
> static int __inline
> sad8_xmmgcc(const char *cur,
> 			const char *ref,
> 			const int stride)
> {
> 	int i;
> 	__m64 accum;
> 	__m64 tmp;
> 
> 	/* Initialize the accumulator */
> 	accum =_mm_sad_pu8(*((__m64*)cur), *((__m64*)ref));
> 	SADROW(1);
> 	SADROW(2);
> 	SADROW(3);
> 	SADROW(4);
> 	SADROW(5);
> 	SADROW(6);
> 	SADROW(7);
> 
> 	return(_mm_cvtsi64_si32(accum));
> }
> #undef SADROW
> 
> If this code is not inlined (using -O2), it performs as well as the more
> optimized code available for my CPU (3dne), and if it's inlined (-O3) it
> becomes 4 times faster!
> 
> Using non-inlined code:
> [edy at leeloo:x86_asm] $ ./a.out
> Cycles per call: 40 (up to 42)
> 
> Using inlined code in a simple loop (see [1]):
> [edy at leeloo:x86_asm] $ ./a.out
> Cycles per call: 12
> 
> Using the 3dne version (not inlined, as the compiler knows nothing about
> it at compile time):
> [edy at leeloo:x86_asm] $ ./a.out
> Cycles per call: 40 (up to 42)
> 
> I tried with the cbp operator (during MC) and I obtained the same
> results as the plain MMX code with a rather silly implementation (gcc
> was responsible for producing the complete code).
> 
> Cycles per call: 112 (up to 120), the same result as the plain MMX
> version. The 3dne version is faster (90 cycles), but cbp is not used that
> often (once per MB, whereas SAD is called many more times per MB).
> 
> See the stupid code:
> #define por2(a, b) _mm_or_si64((a), (b))
> #define por4(a, b, c, d) por2(por2((a), (b)), por2((c), (d)))
> #define por16(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p) \
>     por4(por4((a), (b), (c), (d)), \
>          por4((e), (f), (g), (h)), \
>          por4((i), (j), (k), (l)), \
>          por4((m), (n), (o), (p)))
> 
> static int __inline
> calc_cbp_mmxgcc(const short *coeff)
> {
> 
>         register unsigned int res = 0;
>         register unsigned int i;
>         register __m64 *ptr =(__m64*)coeff;
>         register __m64 zero = _mm_setzero_si64();
> 
>         /* Process one block per loop */
>         for(i=(1<<5); i>=1; i>>=1, ptr += 16) {
>                 register __m64 mm0;
> 
>                 mm0 = por16(*(ptr + 0),
>                             *(ptr + 1),
>                             *(ptr + 2),
>                             *(ptr + 3),
>                             *(ptr + 4),
>                             *(ptr + 5),
>                             *(ptr + 6),
>                             *(ptr + 7),
>                             *(ptr + 8),
>                             *(ptr + 9),
>                             *(ptr + 10),
>                             *(ptr + 11),
>                             *(ptr + 12),
>                             *(ptr + 13),
>                             *(ptr + 14),
>                             *(ptr + 15));
> 
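>                 /* Test for zero with an equality compare; a signed
>                  * greater-than test would miss negative coefficients */
>                 mm0 = _mm_cmpeq_pi32(mm0, zero);
> 
>                 /* Combine both halves in the lower 32 bits: all ones
>                  * only if the whole block is zero */
>                 mm0 = _mm_and_si64(_mm_srli_si64(mm0, 32), mm0);
> 
>                 /* Set this block's bit if any coefficient is non zero */
>                 res |= (~_mm_cvtsi64_si32(mm0) & i);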
>         }
> 
>         return(res);
> }
> #undef por2
> #undef por4
> #undef por16
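
For checking an intrinsics version like the one quoted above, a plain C
reference is handy. This is only a sketch: calc_cbp_ref is an invented name,
and the bit layout (first block in bit 5, last block in bit 0, six blocks of
64 coefficients) is taken from the loop in the quoted code, not from the
XviD sources.

```c
#include <stdint.h>

/* Returns a 6-bit mask with one bit per 8x8 block: the bit is set when the
 * block has at least one non-zero coefficient, whatever its sign. */
static unsigned int
calc_cbp_ref(const int16_t *coeff)
{
	unsigned int res = 0;
	unsigned int bit;
	int k;

	/* Block 0 maps to bit 5, block 5 to bit 0 */
	for (bit = 1u << 5; bit >= 1; bit >>= 1, coeff += 64) {
		int any = 0;

		for (k = 0; k < 64; k++)
			any |= coeff[k];

		if (any != 0)
			res |= bit;
	}

	return res;
}
```

Feeding both versions blocks that contain only negative coefficients is a
useful test case, since sign handling is where SIMD zero tests tend to go
wrong.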
> 
> 
> Seeing these results, I think it's really worth digging a bit further in
> that direction and continuing to write gcc intrinsics versions of all the
> mmx code that could benefit from inlining (sad operators, perhaps DCT).
> I'm sure we could really speed up XviD.
> 
> Is anyone willing to help? Or just comment?
> 
> PS: The code depends highly on the gcc version: gcc 3.2.3 and gcc 3.4
> (experimental) do very well, while gcc 3.3.x does complete shit, adding a
> lot of read/write operations to the stack between each mmx opcode
> (slowing the code down by up to 3 times).
> 
> [1]
> I know this bench is somewhat artificial but I had to measure the
> performance in a simple way.
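> #include <stdio.h>
> #include <string.h>
> #include <xmmintrin.h>
> 
> #define NBTEST 1000000
> /* "=A" returns EDX:EAX as one 64-bit value on 32-bit x86 only */
> #define rdtscll(val) __asm__ __volatile__("rdtsc" : "=A" (val))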
> 
> int
> main()
> {
>         int i;
>         int sad = 0;
>         short coeff[6*64];
>         unsigned long long total;
> 
>         memset(coeff, 0, 6*64*sizeof(short));
> 
>         for(total=0, i=0; i<NBTEST; i++) {
>                 unsigned long long start, end;
>                 rdtscll(start);
>                 /* Stupid call */
>                 sad = sad8_xmmgcc((char*)&coeff[0], (char*)&coeff[1], 8);
>                 rdtscll(end);
>                 total += (end - start);
>         }
> 
>         printf("Cycles: %d\n", (int)(total/NBTEST));
> 
>         return sad;
> }
> 
> -- 
> Edouard Gomez