[XviD-devel] Inlined ASM code again
Edouard Gomez
ed.gomez at free.fr
Wed Aug 20 14:59:22 CEST 2003
Hello,
I know this has been discussed in the past, but I think we should really
consider inlining the assembly code instead of going through function
pointers (the optimizations would then be chosen at compile time).
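For those who don't have it in mind, the dispatch we do today looks
roughly like this; a minimal sketch, the names are hypothetical and not
the exact XviD declarations. The pointer is only resolved at init time,
so gcc sees an opaque indirect call and can never inline the operator
into its callers:

#include <stdlib.h> /* abs() */

/* Function-pointer dispatch (hypothetical names): the target is unknown
 * at compile time, so the call through it is a hard optimization
 * barrier for gcc. */
typedef int (*sad8_func)(const char *cur, const char *ref, int stride);

static int
sad8_c(const char *cur, const char *ref, int stride)
{
        int i, j, sad = 0;
        for (j = 0; j < 8; j++, cur += stride, ref += stride)
                for (i = 0; i < 8; i++)
                        sad += abs((unsigned char)cur[i] - (unsigned char)ref[i]);
        return sad;
}

/* Init code overwrites this with an asm version chosen from the CPU
 * flags detected at runtime. */
static sad8_func sad8 = sad8_c;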
Here's a very silly implementation of the sad8 operator using gcc
intrinsics for mmxext.
#include <xmmintrin.h> /* __m64 SSE intrinsics (_mm_sad_pu8, ...) */

#define SADROW(a) \
        tmp = _mm_sad_pu8(*(__m64 *)(cur + (a)*stride), \
                          *(__m64 *)(ref + (a)*stride)); \
        accum = _mm_add_pi16(tmp, accum)

static __inline int
sad8_xmmgcc(const char *cur,
            const char *ref,
            const int stride)
{
        __m64 accum;
        __m64 tmp;

        /* Initialize the accumulator with row 0 */
        accum = _mm_sad_pu8(*(__m64 *)cur, *(__m64 *)ref);

        /* Accumulate rows 1..7, stepping by stride bytes per row */
        SADROW(1);
        SADROW(2);
        SADROW(3);
        SADROW(4);
        SADROW(5);
        SADROW(6);
        SADROW(7);

        return _mm_cvtsi64_si32(accum);
}
#undef SADROW
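(For reference, _mm_sad_pu8 maps to the psadbw instruction: it sums the
absolute differences of eight unsigned byte pairs into a single 16-bit
result in the low word, which is why one _mm_add_pi16 per row is enough
to accumulate the whole 8x8 SAD without overflow.)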
If this code is not inlined (using -O2), it performs as well as the most
optimized code available for my CPU (3dne), and if it is inlined (-O3)
it becomes more than 3 times faster!
Using the non-inlined code:
[edy at leeloo:x86_asm] $ ./a.out
Cycles per call: 40 (up to 42)
Using the inlined code in a simple loop (see [1]):
[edy at leeloo:x86_asm] $ ./a.out
Cycles per call: 12
Using the 3dne version (not inlined, as the compiler knows nothing about
it at compile time):
[edy at leeloo:x86_asm] $ ./a.out
Cycles per call: 40 (up to 42)
I tried the same thing with the cbp operator (used during MC) and I
obtained the same results as the plain MMX code with a rather silly
implementation (gcc was responsible for generating the final code):
Cycles per call: 112 (up to 120), the same result as for the plain MMX
version. The 3dne version is faster (90 cycles), but cbp is not called
that often (once per MB, whereas SAD is called many more times per MB).
See the stupid code:
#include <mmintrin.h> /* plain MMX intrinsics */

#define por2(a, b) _mm_or_si64((a), (b))
#define por4(a, b, c, d) por2(por2((a), (b)), por2((c), (d)))
#define por16(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p) \
        por4(por4((a), (b), (c), (d)), \
             por4((e), (f), (g), (h)), \
             por4((i), (j), (k), (l)), \
             por4((m), (n), (o), (p)))

static __inline int
calc_cbp_mmxgcc(const short *coeff)
{
        register unsigned int res = 0;
        register unsigned int i;
        register const __m64 *ptr = (const __m64 *)coeff;
        register __m64 zero = _mm_setzero_si64();

        /* Process one 64-coefficient block (16 __m64) per loop */
        for (i = (1 << 5); i >= 1; i >>= 1, ptr += 16) {
                register __m64 mm0;
                mm0 = por16(*(ptr +  0), *(ptr +  1), *(ptr +  2), *(ptr +  3),
                            *(ptr +  4), *(ptr +  5), *(ptr +  6), *(ptr +  7),
                            *(ptr +  8), *(ptr +  9), *(ptr + 10), *(ptr + 11),
                            *(ptr + 12), *(ptr + 13), *(ptr + 14), *(ptr + 15));
                /* We only want to know if the block is != 0; cmpeq flags
                   the zero lanes (a signed cmpgt would miss negative lanes) */
                mm0 = _mm_cmpeq_pi32(mm0, zero);
                /* Fold: the lower 32 bits stay all ones only if both
                   lanes were zero */
                mm0 = _mm_and_si64(_mm_srli_si64(mm0, 32), mm0);
                /* Set bit i if anything in the block was non-zero */
                res |= ~_mm_cvtsi64_si32(mm0) & i;
        }
        return res;
}
#undef por2
#undef por4
#undef por16
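Note the bit ordering this produces: i starts at 1<<5, so the first
64-coefficient block of coeff ends up in bit 5 of the result and the
sixth block in bit 0 (one bit per block, MSB first).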
Seeing these results, I think it's really worth digging a bit more in
that direction and writing gcc-intrinsics versions of all the MMX code
that could benefit from inlining (the SAD operators, perhaps the DCT).
I'm sure we could really speed up XviD.
Is anyone willing to help? Or just to comment?
PS: The generated code depends highly on the gcc version: gcc 3.2.3 and
gcc 3.4 (experimental) do very well, while gcc 3.3.x does complete shit,
adding lots of read/write operations to the stack between each MMX
opcode (slowing the code down by up to 3x).
[1]
I know this benchmark is somewhat artificial, but I had to measure the
performance in a simple way.
#include <stdio.h>
#include <string.h>

#define NBTEST 1000000
#define rdtscll(val) __asm__ __volatile__("rdtsc" : "=A" (val))

int
main()
{
        int i;
        int sad = 0;
        short coeff[6*64];
        unsigned long long total;

        memset(coeff, 0, 6*64*sizeof(short));

        for (total = 0, i = 0; i < NBTEST; i++) {
                unsigned long long start, end;
                rdtscll(start);
                /* Stupid call */
                sad = sad8_xmmgcc((char *)&coeff[0], (char *)&coeff[1], 8);
                rdtscll(end);
                total += (end - start);
        }

        printf("Cycles per call: %d\n", (int)(total/NBTEST));

        return sad;
}
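A side note on the numbers: rdtsc is not a serializing instruction and
the loop measures back-to-back calls on hot caches, so treat the
per-call counts as rough relative figures rather than absolute ones.
Also, gcc needs -msse on top of -O2/-O3 (or a suitable -march) for the
xmmintrin.h intrinsics to compile.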
--
Edouard Gomez