[XviD-devel] Inlined ASM code again
Christoph Lampert
chl at math.uni-bonn.de
Wed Aug 20 19:54:51 CEST 2003
Hi,
My 2 cents:
I absolutely believe the inlined 8x8 version is faster than the
one using function pointers, no question.
But I believe the bigger problem, which you couldn't measure in
your tests, is memory latency: the benchmark hammers the same
small buffer a million times, so everything stays in L1 cache.
If the routine stalls 200 cycles waiting to read the last byte of
an 8x8 or 16x16 block, it no longer matters whether the
calculations take 11 or 30 cycles.
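
For what it's worth, one way to make such a micro-benchmark
latency-sensitive would be to flush the buffers out of the cache
before every timed call. A rough sketch (assuming an SSE2-era CPU
with CLFLUSH, and guessing a 64-byte cache line, both of which
you'd need to check on your machine):

#include <emmintrin.h>

/* Evict every cache line covering a block, so that the next read
 * really pays the memory latency */
static void
flush_block(const void *p, int bytes)
{
        int i;
        for (i = 0; i < bytes; i += 64)
                _mm_clflush((const char *)p + i);
}

Calling flush_block(cur, 8*stride) (and the same for ref) right
before the rdtsc pair would charge the cache misses to the SAD
routine instead of hiding them.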
gruel
On Wed, 20 Aug 2003, Michael Militzer wrote:
> Hello,
>
> I'd like to comment on this: your benchmark is very artificial, so it
> doesn't say much. I'd suggest you create a sad16 replacement using
> gcc intrinsics, patch XviD to use your new sad16 version, switch to a
> quality mode that uses a 16x16 block search only (<4), and compare
> encoding speed between your patch and the standard XviD version.
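>
> A 16x16 version would be a straightforward extension of the sad8
> below: two psadbw per row, sixteen rows. Just a sketch, where the
> name sad16_xmmgcc is made up and the best_sad early-out argument of
> XviD's real sad16 is left out:
>
> #include <mmintrin.h>
> #include <xmmintrin.h>
>
> static __inline __m64
> sadrow16(const unsigned char *c, const unsigned char *r)
> {
>         /* Two PSADBW per 16-byte row, partial sums added together */
>         return _mm_add_pi16(
>                 _mm_sad_pu8(((const __m64*)c)[0], ((const __m64*)r)[0]),
>                 _mm_sad_pu8(((const __m64*)c)[1], ((const __m64*)r)[1]));
> }
>
> static int __inline
> sad16_xmmgcc(const unsigned char *cur,
>              const unsigned char *ref,
>              const int stride)
> {
>         int i;
>         __m64 accum = sadrow16(cur, ref);
>
>         /* Rows 1..15; the maximum SAD is 16*16*255 = 65280, so the
>          * 16-bit adds cannot overflow */
>         for (i = 1; i < 16; i++)
>                 accum = _mm_add_pi16(accum,
>                                      sadrow16(cur + i*stride,
>                                               ref + i*stride));
>
>         return _mm_cvtsi64_si32(accum);
> }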
>
> If the version using intrinsics is really faster, I agree that we should
> discuss how to make use of them.
>
> bye,
> Michael
>
>
> Quoting Edouard Gomez <ed.gomez at free.fr>:
>
> > Hello,
> >
> > I know this has been discussed in the past, but I think we should
> > really consider inlining the assembly code instead of using function
> > pointers (optimizations would then be chosen at compile time).
> >
> > Here's a very silly implementation of the sad8 operator using gcc
> > intrinsics for mmxext.
> >
> > #include <mmintrin.h>
> > #include <xmmintrin.h>
> >
> > /* Accumulate one 8-byte row; step by stride so the operator also
> >  * works for strides != 8 */
> > #define SADROW(a) \
> >         tmp = _mm_sad_pu8(*(const __m64*)(cur + (a)*stride), \
> >                           *(const __m64*)(ref + (a)*stride)); \
> >         accum = _mm_add_pi16(tmp, accum)
> >
> > static int __inline
> > sad8_xmmgcc(const char *cur,
> >             const char *ref,
> >             const int stride)
> > {
> >         __m64 accum;
> >         __m64 tmp;
> >
> >         /* Initialize the accumulator with row 0 */
> >         accum = _mm_sad_pu8(*(const __m64*)cur, *(const __m64*)ref);
> >
> >         SADROW(1);
> >         SADROW(2);
> >         SADROW(3);
> >         SADROW(4);
> >         SADROW(5);
> >         SADROW(6);
> >         SADROW(7);
> >
> >         return _mm_cvtsi64_si32(accum);
> > }
> > #undef SADROW
> >
> > If this code is not inlined (using -O2), it performs as well as the
> > most optimized code available for my CPU (3dne), and if it is
> > inlined (-O3) it becomes more than 3 times faster!
> >
> > Using non-inlined code:
> > [edy at leeloo:x86_asm] $ ./a.out
> > Cycles per call: 40 (up to 42)
> >
> > Using inlined code in a simple loop (see [1]):
> > [edy at leeloo:x86_asm] $ ./a.out
> > Cycles per call: 12
> >
> > Using the 3dne version (not inlined, as the compiler knows nothing
> > about it at compile time):
> > [edy at leeloo:x86_asm] $ ./a.out
> > Cycles per call: 40 (up to 42)
> >
> > I tried the same with the cbp operator (used during MC), and with a
> > rather silly implementation I obtained the same results as the plain
> > MMX code (gcc was responsible for turning the intrinsics into the
> > final code):
> >
> > Cycles per call: 112 (up to 120), the same result as the plain MMX
> > version. The 3dne version is faster (90 cycles), but cbp is not
> > called that often (once per MB, whereas SAD is called many more
> > times per MB), so the difference matters less.
> >
> > See the stupid code:
> > #define por2(a, b)       _mm_or_si64((a), (b))
> > #define por4(a, b, c, d) por2(por2((a), (b)), por2((c), (d)))
> > #define por16(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p) \
> >         por4(por4((a), (b), (c), (d)), \
> >              por4((e), (f), (g), (h)), \
> >              por4((i), (j), (k), (l)), \
> >              por4((m), (n), (o), (p)))
> >
> > static int __inline
> > calc_cbp_mmxgcc(const short *coeff)
> > {
> >         register unsigned int res = 0;
> >         register unsigned int i;
> >         register const __m64 *ptr = (const __m64*)coeff;
> >         register __m64 zero = _mm_setzero_si64();
> >
> >         /* Process one 64-coefficient block per loop iteration */
> >         for (i = (1<<5); i >= 1; i >>= 1, ptr += 16) {
> >                 register __m64 mm0;
> >
> >                 /* OR all 128 bytes of the block together */
> >                 mm0 = por16(*(ptr + 0),  *(ptr + 1),
> >                             *(ptr + 2),  *(ptr + 3),
> >                             *(ptr + 4),  *(ptr + 5),
> >                             *(ptr + 6),  *(ptr + 7),
> >                             *(ptr + 8),  *(ptr + 9),
> >                             *(ptr + 10), *(ptr + 11),
> >                             *(ptr + 12), *(ptr + 13),
> >                             *(ptr + 14), *(ptr + 15));
> >
> >                 /* We only want to know if it's != 0. A signed
> >                  * compare like _mm_cmpgt_pi32 would miss lanes with
> >                  * the sign bit set (negative coefficients), so test
> >                  * for equality with zero instead */
> >                 mm0 = _mm_cmpeq_pi32(mm0, zero);
> >
> >                 /* Both 32-bit halves must be zero for the whole
> >                  * block to be zero */
> >                 mm0 = _mm_and_si64(_mm_srli_si64(mm0, 32), mm0);
> >
> >                 /* Set this block's bit if any coefficient was
> >                  * non-zero (the mask is inverted: all-ones means
> >                  * the block was empty) */
> >                 res |= (~_mm_cvtsi64_si32(mm0)) & i;
> >         }
> >
> >         return res;
> > }
> > #undef por2
> > #undef por4
> > #undef por16
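> >
> > For checking correctness (especially with negative coefficients), a
> > plain C reference to diff the MMX output against could look like
> > this (the name calc_cbp_c is made up):
> >
> > static int
> > calc_cbp_c(const short *coeff)
> > {
> >         int res = 0, block, j;
> >
> >         /* Same bit order as above: the first block maps to bit 5 */
> >         for (block = 0; block < 6; block++)
> >                 for (j = 0; j < 64; j++)
> >                         if (coeff[block*64 + j]) {
> >                                 res |= 1 << (5 - block);
> >                                 break;
> >                         }
> >
> >         return res;
> > }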
> >
> >
> > Seeing these results, I think it's really worth digging further in
> > this direction and writing gcc intrinsics versions of all the MMX
> > code that could benefit from inlining (the SAD operators, perhaps
> > the DCT). I'm sure we could really speed up XviD.
> >
> > Is someone willing to help? Or just to comment?
> >
> > PS: The code depends highly on the gcc version: gcc 3.2.3 and gcc
> > 3.4 (experimental) do very well, while gcc 3.3.x does a terrible
> > job, adding lots of read/write operations to the stack between each
> > MMX opcode (slowing the code down by up to 3x).
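> >
> > A quick way to check what a given gcc does with the intrinsics is
> > to dump the generated assembly, e.g. "gcc -O3 -S sad8.c" with
> > whatever -march matches your CPU (sad8.c being a placeholder name
> > here), and look for stack traffic between the movq/psadbw
> > instructions.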
> >
> > [1]
> > I know this benchmark is somewhat artificial, but I had to measure
> > the performance in a simple way.
> > #include <stdio.h>
> > #include <string.h>
> >
> > #define NBTEST 1000000
> > #define rdtscll(val) __asm__ __volatile__("rdtsc" : "=A" (val))
> >
> > int
> > main()
> > {
> >         int i;
> >         int sad = 0;
> >         short coeff[6*64];
> >         unsigned long long total;
> >
> >         memset(coeff, 0, 6*64*sizeof(short));
> >
> >         for (total = 0, i = 0; i < NBTEST; i++) {
> >                 unsigned long long start, end;
> >                 rdtscll(start);
> >                 /* Stupid call */
> >                 sad = sad8_xmmgcc((char*)&coeff[0], (char*)&coeff[1], 8);
> >                 rdtscll(end);
> >                 total += (end - start);
> >         }
> >
> >         printf("Cycles per call: %d\n", (int)(total/NBTEST));
> >
> >         return sad;
> > }
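> >
> > Two general caveats with rdtsc timing, by the way: rdtsc is not a
> > serializing instruction, so out-of-order execution can blur the
> > measured window, and MMX code must be followed by emms
> > (_mm_empty()) before any later floating point use. A common
> > serialized variant of the macro (32-bit x86, cpuid used purely as
> > a fence) is:
> >
> > #define rdtscll_serialized(val) \
> >         __asm__ __volatile__("xorl %%eax, %%eax\n\t" \
> >                              "cpuid\n\t" \
> >                              "rdtsc" \
> >                              : "=A" (val) : : "ebx", "ecx")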
> >
> > --
> > Edouard Gomez
>
>
>
> _______________________________________________
> XviD-devel mailing list
> XviD-devel at xvid.org
> http://list.xvid.org/mailman/listinfo/xvid-devel
>