[XviD-devel] Inlined ASM code again

Edouard Gomez ed.gomez at free.fr
Wed Aug 20 14:59:22 CEST 2003


Hello,

I know this has been discussed in the past, but I think we should really
consider inlining the assembly code instead of going through function
pointers (the optimized implementation would be chosen at compile time).
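
To make the comparison concrete, here is a minimal sketch of the two
dispatch schemes (the names sad8_ptr, sad8_c and USE_MMXEXT_INTRINSICS are
only illustrative, not necessarily the exact ones in the tree):

/* Scheme 1 - runtime dispatch (roughly what we do now): a pointer is
 * filled in at init time, so the compiler can never inline the call. */
typedef int (sad8Func)(const char *cur, const char *ref, const int stride);
extern sad8Func *sad8_ptr;	/* set to sad8_mmx, sad8_xmm, ... at init */

/* Scheme 2 - compile-time dispatch (the proposal): the implementation is
 * picked by the preprocessor, so gcc sees the body and can inline it. */
#ifdef USE_MMXEXT_INTRINSICS	/* hypothetical configure switch */
#  define sad8 sad8_xmmgcc	/* the intrinsics version below */
#else
#  define sad8 sad8_c		/* plain C fallback */
#endif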

Here's a very naive implementation of the sad8 operator using gcc
intrinsics for mmxext.

#include <xmmintrin.h>	/* SSE/mmxext intrinsics (_mm_sad_pu8) */

#define SADROW(a) \
	tmp = _mm_sad_pu8(*(__m64 *)(cur + (a)*stride), \
	                  *(__m64 *)(ref + (a)*stride)); \
	accum = _mm_add_pi16(tmp, accum)
static int __inline
sad8_xmmgcc(const char *cur,
			const char *ref,
			const int stride)
{
	__m64 accum;
	__m64 tmp;

	/* Row 0 initializes the accumulator */
	accum = _mm_sad_pu8(*(__m64 *)cur, *(__m64 *)ref);
	SADROW(1);
	SADROW(2);
	SADROW(3);
	SADROW(4);
	SADROW(5);
	SADROW(6);
	SADROW(7);

	return(_mm_cvtsi64_si32(accum));
}
#undef SADROW
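
For reference, this is the plain C behaviour the intrinsics version is
meant to reproduce (a sketch: an 8x8 sum of absolute differences over
stride-spaced rows, pixels treated as unsigned bytes):

static int __inline
sad8_c(const unsigned char *cur,
       const unsigned char *ref,
       const int stride)
{
	int i, j;
	int sad = 0;

	for (i = 0; i < 8; i++) {
		for (j = 0; j < 8; j++) {
			int d = cur[j] - ref[j];
			sad += (d < 0) ? -d : d;
		}
		cur += stride;
		ref += stride;
	}

	return sad;
}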

If this code is not inlined (using -O2), it performs as well as the most
optimized code available for my CPU (3dne), and if it is inlined (-O3) it
becomes more than 3 times faster!

Using the non-inlined code:
[edy@leeloo:x86_asm] $ ./a.out
Cycles per call: 40 (up to 42)

Using inlined code in a simple loop (see [1]):
[edy@leeloo:x86_asm] $ ./a.out
Cycles per call: 12

Using the 3dne version (not inlined, as the compiler knows nothing about it
at compile time):
[edy@leeloo:x86_asm] $ ./a.out
Cycles per call: 40 (up to 42)
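
By the way, if we go this route we probably don't want to depend on -O3 to
get the inlining. GCC has an always_inline attribute that forces it even at
lower optimization levels; a minimal sketch of the syntax (abs_diff is just
a placeholder, the attribute would go on the real sad8_xmmgcc definition):

/* gcc extension: with this attribute the function is inlined even when
 * -O2's heuristics would not do it, so we would not depend on -O3 */
static __inline __attribute__((always_inline)) int
abs_diff(int a, int b)
{
	int d = a - b;
	return (d < 0) ? -d : d;
}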

I tried the same thing with the cbp operator (during MC) and, with a rather
naive implementation, I obtained the same results as the plain MMX code
(gcc was responsible for finding the complete code on its own).

Cycles per call: 112 (up to 120), the same result as the plain MMX version.
The 3dne version is faster (90 cycles), but cbp is only called once per MB,
whereas SAD is called many more times per MB, so the difference matters less.

See the stupid code:
#define por2(a, b) _mm_or_si64((a), (b))
#define por4(a, b, c, d) por2(por2((a), (b)), por2((c), (d)))
#define por16(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p) \
    por4(por4((a), (b), (c), (d)), \
         por4((e), (f), (g), (h)), \
         por4((i), (j), (k), (l)), \
         por4((m), (n), (o), (p)))

static int __inline
calc_cbp_mmxgcc(const short *coeff)
{

        register unsigned int res = 0;
        register unsigned int i;
        register __m64 *ptr =(__m64*)coeff;
        register __m64 zero = _mm_setzero_si64();

        /* Process one block per loop */
        for(i=(1<<5); i>=1; i>>=1, ptr += 16) {
                register __m64 mm0;

                mm0 = por16(*(ptr + 0),
                            *(ptr + 1),
                            *(ptr + 2),
                            *(ptr + 3),
                            *(ptr + 4),
                            *(ptr + 5),
                            *(ptr + 6),
                            *(ptr + 7),
                            *(ptr + 8),
                            *(ptr + 9),
                            *(ptr + 10),
                            *(ptr + 11),
                            *(ptr + 12),
                            *(ptr + 13),
                            *(ptr + 14),
                            *(ptr + 15));

                /* We only want to know whether the block is != 0.  pcmpgtd
                   is a signed compare, so instead of testing "mm0 > 0" we
                   test for equality with zero and invert the mask. */
                mm0 = _mm_cmpeq_pi32(mm0, zero);

                /* Combine both 32 bit lanes: the low lane ends up all ones
                   only if the whole block was zero */
                mm0 = _mm_and_si64(_mm_srli_si64(mm0, 32), mm0);

                /* Set the block's cbp bit unless the block was all zero */
                res |= ~_mm_cvtsi64_si32(mm0) & i;
        }

        return(res);
}
#undef por2
#undef por4
#undef por16
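
And for reference again, this is what the intrinsics version above computes,
in plain C (a sketch: 6 blocks of 64 coefficients, one cbp bit per block,
bit 5 for the first block down to bit 0 for the last, set when the block
contains a non-zero coefficient):

static int __inline
calc_cbp_c(const short *coeff)
{
        int block, k;
        int cbp = 0;

        for (block = 0; block < 6; block++) {
                for (k = 0; k < 64; k++) {
                        if (coeff[block * 64 + k] != 0) {
                                cbp |= 1 << (5 - block);
                                break;
                        }
                }
        }

        return cbp;
}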


Seeing these results, I think it's really worth digging a bit deeper in
this direction and writing gcc intrinsics versions of all the MMX code that
could benefit from inlining (the SAD operators, perhaps the DCT). I'm sure
we could really speed up XviD.

Is anyone willing to help? Or just to comment?

PS: The generated code depends heavily on the gcc version: gcc 3.2.3 and
gcc 3.4 (experimental) do very well, while gcc 3.3.x does a terrible job,
adding lots of reads/writes to the stack between each MMX opcode (slowing
the code down by up to 3x).

[1]
I know this bench is somewhat artificial, but I had to measure the
performance in a simple way.
#define NBTEST 1000000
#define rdtscll(val) __asm__ __volatile__("rdtsc" : "=A" (val))

#include <stdio.h>
#include <string.h>

int
main()
{
        int i;
        int sad = 0;
        short coeff[6*64];
        unsigned long long total;

        memset(coeff, 0, 6*64*sizeof(short));

        for(total=0, i=0; i<NBTEST; i++) {
                unsigned long long start, end;
                rdtscll(start);
                /* Stupid call */
                sad = sad8_xmmgcc((char*)&coeff[0], (char*)&coeff[1], 8);
                rdtscll(end);
                total += (end - start);
        }

        printf("Cycles: %d\n", (int)(total/NBTEST));

        return sad;
}

-- 
Edouard Gomez

