[XviD-devel] Inlined ASM code again
Christoph Lampert
chl at math.uni-bonn.de
Wed Aug 20 19:54:51 CEST 2003
Hi,
My 2 cents:
I absolutely believe the inlined 8x8 version is faster than the
one using function pointers, no question.
But I believe the bigger problem, which you couldn't measure in
your tests, is memory latency: the benchmark hammers the same
small buffer a million times, so everything stays in L1 cache.
If the routine stalls 200 cycles waiting to read the last byte of
an 8x8 or 16x16 block, it no longer matters whether the
calculations take 11 or 30 cycles.
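
For what it's worth, one way to make such a micro-benchmark
latency-sensitive would be to flush the buffers out of the cache
before every timed call. A rough sketch (assuming an SSE2-era CPU
with CLFLUSH, and guessing a 64-byte cache line, both of which
you'd need to check on your machine):

#include <emmintrin.h>

/* Evict every cache line covering a block, so that the next read
 * really pays the memory latency */
static void
flush_block(const void *p, int bytes)
{
        int i;
        for (i = 0; i < bytes; i += 64)
                _mm_clflush((const char *)p + i);
}

Calling flush_block(cur, 8*stride) (and the same for ref) right
before the rdtsc pair would charge the cache misses to the SAD
routine instead of hiding them.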
gruel
On Wed, 20 Aug 2003, Michael Militzer wrote:
> Hello,
>
> I'd like to comment on this: your benchmark is very artificial, so it
> doesn't say much. I'd suggest you create a sad16 replacement using
> gcc intrinsics, patch XviD to use your new sad16 version, switch to a
> quality mode that uses a 16x16 block search only (<4), and compare
> encoding speed between your patch and the standard XviD version.
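>
> A 16x16 version would be a straightforward extension of the sad8
> below: two psadbw per row, sixteen rows. Just a sketch, where the
> name sad16_xmmgcc is made up and the best_sad early-out argument of
> XviD's real sad16 is left out:
>
> #include <mmintrin.h>
> #include <xmmintrin.h>
>
> static __inline __m64
> sadrow16(const unsigned char *c, const unsigned char *r)
> {
>         /* Two PSADBW per 16-byte row, partial sums added together */
>         return _mm_add_pi16(
>                 _mm_sad_pu8(((const __m64*)c)[0], ((const __m64*)r)[0]),
>                 _mm_sad_pu8(((const __m64*)c)[1], ((const __m64*)r)[1]));
> }
>
> static int __inline
> sad16_xmmgcc(const unsigned char *cur,
>              const unsigned char *ref,
>              const int stride)
> {
>         int i;
>         __m64 accum = sadrow16(cur, ref);
>
>         /* Rows 1..15; the maximum SAD is 16*16*255 = 65280, so the
>          * 16-bit adds cannot overflow */
>         for (i = 1; i < 16; i++)
>                 accum = _mm_add_pi16(accum,
>                                      sadrow16(cur + i*stride,
>                                               ref + i*stride));
>
>         return _mm_cvtsi64_si32(accum);
> }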
>
> If the version using intrinsics is really faster, I agree that we should
> discuss how to make use of them.
>
> bye,
> Michael
>
>
> Quoting Edouard Gomez <ed.gomez at free.fr>:
>
> > Hello,
> >
> > I know this has been discussed in the past, but I think we should
> > really consider inlining the assembly code instead of using function
> > pointers (optimizations would then be chosen at compile time).
> >
> > Here's a very silly implementation of the sad8 operator using gcc
> > intrinsics for mmxext.
> >
> > #include <mmintrin.h>
> > #include <xmmintrin.h>
> >
> > /* Accumulate one 8-byte row; step by stride so the operator also
> >  * works for strides != 8 */
> > #define SADROW(a) \
> >         tmp = _mm_sad_pu8(*(const __m64*)(cur + (a)*stride), \
> >                           *(const __m64*)(ref + (a)*stride)); \
> >         accum = _mm_add_pi16(tmp, accum)
> >
> > static int __inline
> > sad8_xmmgcc(const char *cur,
> >             const char *ref,
> >             const int stride)
> > {
> >         __m64 accum;
> >         __m64 tmp;
> >
> >         /* Initialize the accumulator with row 0 */
> >         accum = _mm_sad_pu8(*(const __m64*)cur, *(const __m64*)ref);
> >
> >         SADROW(1);
> >         SADROW(2);
> >         SADROW(3);
> >         SADROW(4);
> >         SADROW(5);
> >         SADROW(6);
> >         SADROW(7);
> >
> >         return _mm_cvtsi64_si32(accum);
> > }
> > #undef SADROW
> >
> > If this code is not inlined (using -O2), it performs as well as the
> > most optimized code available for my CPU (3dne), and if it is
> > inlined (-O3) it becomes more than 3 times faster!
> >
> > Using non-inlined code:
> > [edy at leeloo:x86_asm] $ ./a.out
> > Cycles per call: 40 (up to 42)
> >
> > Using inlined code in a simple loop (see [1]):
> > [edy at leeloo:x86_asm] $ ./a.out
> > Cycles per call: 12
> >
> > Using the 3dne version (not inlined, as the compiler knows nothing
> > about it at compile time):
> > [edy at leeloo:x86_asm] $ ./a.out
> > Cycles per call: 40 (up to 42)
> >
> > I tried the same with the cbp operator (used during MC), and with a
> > rather silly implementation I obtained the same results as the plain
> > MMX code (gcc was responsible for turning the intrinsics into the
> > final code):
> >
> > Cycles per call: 112 (up to 120), the same result as the plain MMX
> > version. The 3dne version is faster (90 cycles), but cbp is not
> > called that often (once per MB, whereas SAD is called many more
> > times per MB), so the difference matters less.
> >
> > See the stupid code:
> > #define por2(a, b)       _mm_or_si64((a), (b))
> > #define por4(a, b, c, d) por2(por2((a), (b)), por2((c), (d)))
> > #define por16(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p) \
> >         por4(por4((a), (b), (c), (d)), \
> >              por4((e), (f), (g), (h)), \
> >              por4((i), (j), (k), (l)), \
> >              por4((m), (n), (o), (p)))
> >
> > static int __inline
> > calc_cbp_mmxgcc(const short *coeff)
> > {
> >         register unsigned int res = 0;
> >         register unsigned int i;
> >         register const __m64 *ptr = (const __m64*)coeff;
> >         register __m64 zero = _mm_setzero_si64();
> >
> >         /* Process one 64-coefficient block per loop iteration */
> >         for (i = (1<<5); i >= 1; i >>= 1, ptr += 16) {
> >                 register __m64 mm0;
> >
> >                 /* OR all 128 bytes of the block together */
> >                 mm0 = por16(*(ptr + 0),  *(ptr + 1),
> >                             *(ptr + 2),  *(ptr + 3),
> >                             *(ptr + 4),  *(ptr + 5),
> >                             *(ptr + 6),  *(ptr + 7),
> >                             *(ptr + 8),  *(ptr + 9),
> >                             *(ptr + 10), *(ptr + 11),
> >                             *(ptr + 12), *(ptr + 13),
> >                             *(ptr + 14), *(ptr + 15));
> >
> >                 /* We only want to know if it's != 0. A signed
> >                  * compare like _mm_cmpgt_pi32 would miss lanes with
> >                  * the sign bit set (negative coefficients), so test
> >                  * for equality with zero instead */
> >                 mm0 = _mm_cmpeq_pi32(mm0, zero);
> >
> >                 /* Both 32-bit halves must be zero for the whole
> >                  * block to be zero */
> >                 mm0 = _mm_and_si64(_mm_srli_si64(mm0, 32), mm0);
> >
> >                 /* Set this block's bit if any coefficient was
> >                  * non-zero (the mask is inverted: all-ones means
> >                  * the block was empty) */
> >                 res |= (~_mm_cvtsi64_si32(mm0)) & i;
> >         }
> >
> >         return res;
> > }
> > #undef por2
> > #undef por4
> > #undef por16
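> >
> > For checking correctness (especially with negative coefficients), a
> > plain C reference to diff the MMX output against could look like
> > this (the name calc_cbp_c is made up):
> >
> > static int
> > calc_cbp_c(const short *coeff)
> > {
> >         int res = 0, block, j;
> >
> >         /* Same bit order as above: the first block maps to bit 5 */
> >         for (block = 0; block < 6; block++)
> >                 for (j = 0; j < 64; j++)
> >                         if (coeff[block*64 + j]) {
> >                                 res |= 1 << (5 - block);
> >                                 break;
> >                         }
> >
> >         return res;
> > }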
> >
> >
> > Seeing these results, I think it's really worth digging further in
> > this direction and writing gcc intrinsics versions of all the MMX
> > code that could benefit from inlining (the SAD operators, perhaps
> > the DCT). I'm sure we could really speed up XviD.
> >
> > Is someone willing to help? Or just to comment?
> >
> > PS: The code depends highly on the gcc version: gcc 3.2.3 and gcc
> > 3.4 (experimental) do very well, while gcc 3.3.x does a terrible
> > job, adding lots of read/write operations to the stack between each
> > MMX opcode (slowing the code down by up to 3x).
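> >
> > A quick way to check what a given gcc does with the intrinsics is
> > to dump the generated assembly, e.g. "gcc -O3 -S sad8.c" with
> > whatever -march matches your CPU (sad8.c being a placeholder name
> > here), and look for stack traffic between the movq/psadbw
> > instructions.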
> >
> > [1]
> > I know this benchmark is somewhat artificial, but I had to measure
> > the performance in a simple way.
> > #include <stdio.h>
> > #include <string.h>
> >
> > #define NBTEST 1000000
> > #define rdtscll(val) __asm__ __volatile__("rdtsc" : "=A" (val))
> >
> > int
> > main()
> > {
> >         int i;
> >         int sad = 0;
> >         short coeff[6*64];
> >         unsigned long long total;
> >
> >         memset(coeff, 0, 6*64*sizeof(short));
> >
> >         for (total = 0, i = 0; i < NBTEST; i++) {
> >                 unsigned long long start, end;
> >                 rdtscll(start);
> >                 /* Stupid call */
> >                 sad = sad8_xmmgcc((char*)&coeff[0], (char*)&coeff[1], 8);
> >                 rdtscll(end);
> >                 total += (end - start);
> >         }
> >
> >         printf("Cycles per call: %d\n", (int)(total/NBTEST));
> >
> >         return sad;
> > }
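> >
> > Two general caveats with rdtsc timing, by the way: rdtsc is not a
> > serializing instruction, so out-of-order execution can blur the
> > measured window, and MMX code must be followed by emms
> > (_mm_empty()) before any later floating point use. A common
> > serialized variant of the macro (32-bit x86, cpuid used purely as
> > a fence) is:
> >
> > #define rdtscll_serialized(val) \
> >         __asm__ __volatile__("xorl %%eax, %%eax\n\t" \
> >                              "cpuid\n\t" \
> >                              "rdtsc" \
> >                              : "=A" (val) : : "ebx", "ecx")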
> >
> > --
> > Edouard Gomez
>
>
>
> _______________________________________________
> XviD-devel mailing list
> XviD-devel at xvid.org
> http://list.xvid.org/mailman/listinfo/xvid-devel
>