[XviD-devel] Inlined ASM code again
Michael Militzer
michael at xvid.org
Wed Aug 20 16:45:20 CEST 2003
Hello,
I'd like to comment on this: your benchmark is very artificial, so it
doesn't say much. I'd suggest you create a sad16 replacement using gcc
intrinsics, patch XviD to use it, switch to a quality mode that does
16x16 block search only (<4), and compare encoding speed between your
patch and the standard XviD version.
If the version using intrinsics is really faster, I agree that we should
discuss how to make use of them.
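
A minimal sketch of what such a sad16 replacement could look like, assuming
a GCC-compatible compiler on x86 with SSE/MMXEXT enabled. The names
sad16_ref, sad16_intrin and load8 are hypothetical and are not XviD
functions; the plain C version is there only as a fallback and as a check
for the intrinsics path.

```c
#include <stdlib.h>
#include <string.h>

#if defined(__SSE__) && (defined(__i386__) || defined(__x86_64__))
#include <xmmintrin.h>   /* _mm_sad_pu8 (psadbw on 64-bit operands) */
#define SAD16_HAVE_INTRINSICS 1
#endif

/* Plain C reference: sum of absolute differences over a 16x16 block. */
static int sad16_ref(const unsigned char *cur, const unsigned char *ref,
                     int stride)
{
    int sum = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sum += abs(cur[y * stride + x] - ref[y * stride + x]);
    return sum;
}

#ifdef SAD16_HAVE_INTRINSICS
/* Load 8 bytes without assuming any alignment (hypothetical helper). */
static inline __m64 load8(const unsigned char *p)
{
    __m64 v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Intrinsics version: two 8-byte SADs per row, 16 rows, honoring stride. */
static inline int sad16_intrin(const unsigned char *cur,
                               const unsigned char *ref, int stride)
{
    __m64 accum = _mm_setzero_si64();
    for (int y = 0; y < 16; y++) {
        accum = _mm_add_pi32(accum, _mm_sad_pu8(load8(cur), load8(ref)));
        accum = _mm_add_pi32(accum,
                             _mm_sad_pu8(load8(cur + 8), load8(ref + 8)));
        cur += stride;
        ref += stride;
    }
    int sum = _mm_cvtsi64_si32(accum);
    _mm_empty();   /* leave MMX state before any floating-point code runs */
    return sum;
}
#else
/* No intrinsics available: fall back to the reference version. */
static inline int sad16_intrin(const unsigned char *cur,
                               const unsigned char *ref, int stride)
{
    return sad16_ref(cur, ref, stride);
}
#endif
```

Comparing sad16_intrin against sad16_ref on a few blocks before patching it
into the encoder would catch mistakes early; the encoding-speed comparison
itself still has to be done inside XviD.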
bye,
Michael
Quoting Edouard Gomez <ed.gomez at free.fr>:
> Hello,
>
> I know this has been discussed in the past but I think we should really
> consider inlining the assembly code instead of using function pointers
> (optimizations would be chosen at compile time).
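
The two approaches being contrasted can be sketched as follows. This is
only an illustration under assumed names: sad8_c, sad8_ptr, SAD8_IMPL and
the two min_sad_* helpers are hypothetical and do not come from the XviD
sources.

```c
#include <stdlib.h>

/* Plain C SAD over an 8x8 block; stands in for any optimized variant. */
static inline int sad8_c(const unsigned char *cur, const unsigned char *ref,
                         int stride)
{
    int sum = 0;
    for (int y = 0; y < 8; y++, cur += stride, ref += stride)
        for (int x = 0; x < 8; x++)
            sum += abs(cur[x] - ref[x]);
    return sum;
}

/* Run-time dispatch: the pointer is set once at start-up, so the compiler
 * generally has to emit an indirect call at every use. */
typedef int (*sad8_fn)(const unsigned char *, const unsigned char *, int);
static sad8_fn sad8_ptr = sad8_c;

/* Compile-time selection: the macro names one concrete function, which
 * the compiler is free to inline into the caller. */
#ifndef SAD8_IMPL
#define SAD8_IMPL sad8_c
#endif
#define sad8(cur, ref, stride) SAD8_IMPL((cur), (ref), (stride))

/* Smallest SAD over n horizontal candidates, through the pointer. */
static int min_sad_dispatch(const unsigned char *cur,
                            const unsigned char *ref, int stride, int n)
{
    int best = sad8_ptr(cur, ref, stride);
    for (int i = 1; i < n; i++) {
        int s = sad8_ptr(cur, ref + i, stride);
        if (s < best)
            best = s;
    }
    return best;
}

/* Same search, through the compile-time macro. */
static int min_sad_inline(const unsigned char *cur,
                          const unsigned char *ref, int stride, int n)
{
    int best = sad8(cur, ref, stride);
    for (int i = 1; i < n; i++) {
        int s = sad8(cur, ref + i, stride);
        if (s < best)
            best = s;
    }
    return best;
}
```

Both helpers return the same value; the difference is only in what the
compiler is allowed to do with the call inside the loop.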
>
> Here's a very silly implementation of the sad8 operator using gcc
> intrinsics for mmxext.
>
> #define SADROW(a) \
>     tmp = _mm_sad_pu8(*((__m64*)cur + (a)), *((__m64*)ref + (a))); \
>     accum = _mm_add_pi16(tmp, accum)
>
> /* Note: stride is unused; rows are read 8 bytes apart (stride == 8) */
> static int __inline
> sad8_xmmgcc(const char *cur,
>             const char *ref,
>             const int stride)
> {
>     int i;
>     __m64 accum;
>     __m64 tmp;
>
>     /* Initialize the accumulator */
>     accum = _mm_sad_pu8(*((__m64*)cur), *((__m64*)ref));
>     SADROW(1);
>     SADROW(2);
>     SADROW(3);
>     SADROW(4);
>     SADROW(5);
>     SADROW(6);
>     SADROW(7);
>
>     return(_mm_cvtsi64_si32(accum));
> }
> #undef SADROW
>
> If this code is not inlined (using -O2) it performs as well as the more
> optimized code available for my CPU (3dne) and if it's inlined (-O3) it
> becomes 4 times faster !
>
> Using no inlined code:
> [edy at leeloo:x86_asm] $ ./a.out
> Cycles per call: 40 (up to 42)
>
> Using inlined code in a simple loop (see [1]):
> [edy at leeloo:x86_asm] $ ./a.out
> Cycles per call: 12
>
> Using 3dne version (not inlined as the compiler knows nothing about it
> at compile time):
> [edy at leeloo:x86_asm] $ ./a.out
> Cycles per call: 40 (up to 42)
>
> I tried with the cbp operator (during MC) and I obtained the same
> results as the plain MMX code with a rather silly implementation (gcc
> was responsible for generating the complete code).
>
> Cycles per call: 112 (up to 120), same result as the plain MMX
> version. The 3dne version is faster (90 cycles), but cbp is not used
> that often (once per MB, whereas SAD is called many more times per MB).
>
> See the stupid code:
> #define por2(a, b) _mm_or_si64((a), (b))
> #define por4(a, b, c, d) por2(por2((a), (b)), por2((c), (d)))
> #define por16(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p) \
>     por4(por4((a), (b), (c), (d)), \
>          por4((e), (f), (g), (h)), \
>          por4((i), (j), (k), (l)), \
>          por4((m), (n), (o), (p)))
>
> static int __inline
> calc_cbp_mmxgcc(const short *coeff)
> {
>     register unsigned int res = 0;
>     register unsigned int i;
>     register __m64 *ptr = (__m64*)coeff;
>     register __m64 zero = _mm_setzero_si64();
>
>     /* Process one block per loop */
>     for (i = (1<<5); i >= 1; i >>= 1, ptr += 16) {
>         register __m64 mm0;
>
>         mm0 = por16(*(ptr + 0),
>                     *(ptr + 1),
>                     *(ptr + 2),
>                     *(ptr + 3),
>                     *(ptr + 4),
>                     *(ptr + 5),
>                     *(ptr + 6),
>                     *(ptr + 7),
>                     *(ptr + 8),
>                     *(ptr + 9),
>                     *(ptr + 10),
>                     *(ptr + 11),
>                     *(ptr + 12),
>                     *(ptr + 13),
>                     *(ptr + 14),
>                     *(ptr + 15));
>
>         /* We only want to know if it's != 0 */
>         mm0 = _mm_cmpgt_pi32(mm0, zero);
>
>         /* Pack the result in the lower 32bit part */
>         mm0 = por2(_mm_srli_si64(mm0, 32), mm0);
>
>         /* Apply the mask only if mm0(31..1) is non zero */
>         res |= (_mm_cvtsi64_si32(mm0)&i);
>     }
>
>     return(res);
> }
> #undef por2
> #undef por4
> #undef por16
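
For checking an intrinsics version like the one above, a plain C reference
is handy. The sketch below mirrors the quoted loop's layout: six blocks of
64 coefficients, with the first block mapped to bit 5 and the last to
bit 0. The name calc_cbp_ref is hypothetical and this is not XviD's own C
implementation. One caveat when comparing the two: _mm_cmpgt_pi32 is a
signed compare, so a 32-bit lane whose top bit is set counts as not
greater than zero, and a block whose ORed lanes are all negative or zero
would be reported as empty by the quoted code but not by this reference.

```c
/* Coded block pattern: one bit per 8x8 block, set when any of its 64
 * coefficients is non-zero. Block 0 maps to bit 5, block 5 to bit 0,
 * matching the order in which the quoted loop walks its mask. */
static unsigned int calc_cbp_ref(const short *coeff)
{
    unsigned int res = 0;

    for (int blk = 0; blk < 6; blk++) {
        unsigned int any = 0;

        /* OR all coefficients together; the result is zero only if
         * every coefficient in the block is zero. */
        for (int k = 0; k < 64; k++)
            any |= (unsigned short)coeff[blk * 64 + k];

        if (any)
            res |= 1u << (5 - blk);
    }
    return res;
}
```

Running both versions on blocks that contain negative coefficients is the
quickest way to see whether the signed compare matters for real data.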
>
>
> Seeing these results, I think it's really worth digging further in
> this direction and continuing to write gcc intrinsics versions of all
> the mmx code that could benefit from inlining (sad operators, perhaps
> DCT). I'm sure we could really speed up XviD.
>
> Is someone willing to help? Or just comment?
>
> PS: The code depends highly on the gcc version: gcc 3.2.3 and gcc 3.4
> (experimental) do very well, while gcc 3.3.x does a terrible job, adding
> a lot of read/write operations to the stack between each mmx opcode
> (slowing the code by up to 3x).
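
One way to act on that observation is to pick the code path from the
compiler version at build time. This is a sketch only: GCC_VERSION,
GCC_IS_3_3_X, USE_MMX_INTRINSICS and intrinsics_enabled are hypothetical
names, not XviD build flags, and the version range simply encodes the
3.3.x series mentioned in the PS.

```c
/* Encode the GCC version as one integer, e.g. 3.2.3 -> 30203.
 * Clang also defines __GNUC__, so it is excluded explicitly. */
#if defined(__GNUC__) && !defined(__clang__)
#  define GCC_VERSION \
       (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__)
#else
#  define GCC_VERSION 0
#endif

/* The 3.3.x series is the one reported to generate poor MMX code. */
#define GCC_IS_3_3_X (GCC_VERSION >= 30300 && GCC_VERSION < 30400)

/* Use the intrinsics path only with GCC, and not with 3.3.x. */
#if GCC_VERSION != 0 && !GCC_IS_3_3_X
#  define USE_MMX_INTRINSICS 1
#else
#  define USE_MMX_INTRINSICS 0
#endif

/* Small helper so the choice can be inspected at run time. */
static int intrinsics_enabled(void)
{
    return USE_MMX_INTRINSICS;
}
```

A benchmark on the actual target compiler remains the real test; a version
check like this only avoids the known bad case.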
>
> [1]
> I know this bench is somewhat artificial but I had to measure the
> performance in a simple way.
> #include <stdio.h>
> #include <string.h>
>
> #define NBTEST 1000000
> #define rdtscll(val) __asm__ __volatile__("rdtsc" : "=A" (val))
>
> int
> main()
> {
>     int i;
>     int sad = 0;
>     short coeff[6*64];
>     unsigned long long total;
>
>     memset(coeff, 0, 6*64*sizeof(short));
>
>     for (total = 0, i = 0; i < NBTEST; i++) {
>         unsigned long long start, end;
>         rdtscll(start);
>         /* Stupid call */
>         sad = sad8_xmmgcc((char*)&coeff[0], (char*)&coeff[1], 8);
>         rdtscll(end);
>         total += (end - start);
>     }
>
>     printf("Cycles: %d\n", (int)(total/NBTEST));
>
>     return sad;
> }
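
A slightly less artificial harness can be sketched as below. It times a
batch of calls between two counter reads, so the cost of reading the
counter is spread over the batch instead of being included in every
sample as in the loop above, and it keeps the minimum over several runs.
The names ticks, work and min_ticks_per_call are hypothetical, and work is
only a stand-in for the routine under test. The sketch uses __rdtsc() from
x86intrin.h rather than the "=A" asm constraint, since "=A" describes the
edx:eax pair only on 32-bit x86; on other targets it falls back to
clock(), which does not count cycles.

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <x86intrin.h>   /* __rdtsc() */
static uint64_t ticks(void) { return __rdtsc(); }
#else
/* Fallback: processor time, not cycles. */
static uint64_t ticks(void) { return (uint64_t)clock(); }
#endif

/* Stand-in for the routine under test; swap in the real one. Being
 * static inline, it can be inlined into the timing loop. */
static inline int work(const unsigned char *a, const unsigned char *b)
{
    int s = 0;
    for (int i = 0; i < 64; i++)
        s += abs(a[i] - b[i]);
    return s;
}

/* Time `batch` back-to-back calls, repeat `runs` times, and return the
 * smallest per-call tick count seen. The accumulated results are written
 * to *sink so the calls are not trivially dead code. */
static uint64_t min_ticks_per_call(const unsigned char *a,
                                   const unsigned char *b,
                                   int runs, int batch, int *sink)
{
    uint64_t best = UINT64_MAX;
    int acc = 0;

    for (int r = 0; r < runs; r++) {
        uint64_t start = ticks();
        for (int i = 0; i < batch; i++)
            acc += work(a, b);
        uint64_t elapsed = ticks() - start;
        if (elapsed < best)
            best = elapsed;
    }
    *sink = acc;
    return best / (uint64_t)batch;
}
```

Even so, a number obtained this way says little about behavior inside the
encoder, where cache state and surrounding code differ.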
>
> --
> Edouard Gomez