[XviD-devel] MMX/SSE/SSE2 implementation

James Hauxwell xvid-devel@xvid.org
Thu, 12 Dec 2002 10:24:31 -0000


Hi,

I have been doing some experiments lately with the Intel/Microsoft
compiler with intrinsics and have been very surprised with the results.

To touch base here, what are people's opinions on recoding the MMX and
similar routines using compiler intrinsics?

My investigations have discovered the following plus points.

1, easier to read/code and debug.
2, you don't have to worry about register allocation and scheduling, as
   the compiler does it for you.
3, you can rebuild for different CPU targets (P4, P3 or Athlon) and the
   compiler will decide how best to schedule the instructions to avoid
   stalls.
4, easier to test new optimizations.
5, don't need NASM to build, only your compiler.

Negative points are

1, the work required to do it.
2, GCC and PC compilers do not share the same intrinsic names.
3, probably others as you will write back and inform me :-)

As an example, here is a quick version of add_c which took about
10 minutes to write.

#include <mmintrin.h>

void add_c(unsigned char *restrict predictor,
		short *restrict error,
		int predictor_stride)
{
	int i;
	const __m64 zero = _mm_setzero_si64();

#pragma unroll(2)
	for (i = 0; i < 8; i++)
	{
		/* load 8 predictor pixels and the 8 error terms of this row */
		__m64 pred = *(__m64 *)predictor;
		__m64 err_low = ((__m64 *)error)[2 * i];
		__m64 err_high = ((__m64 *)error)[2 * i + 1];

		/* extract out 8 to 16 bit */
		__m64 x0_low = _mm_unpacklo_pi8(pred, zero);
		__m64 x0_high = _mm_unpackhi_pi8(pred, zero);

		/* add the error (signed, saturating) */
		x0_low = _mm_adds_pi16(x0_low, err_low);
		x0_high = _mm_adds_pi16(x0_high, err_high);

		/* saturate to 0..255, pack and write the row back */
		*(__m64 *)predictor = _mm_packs_pu16(x0_low, x0_high);
		predictor += predictor_stride;
	}

	/* leave MMX state so later floating-point code works */
	_mm_empty();
}
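
To check a routine like this, a plain C version of the same operation is
handy to compare against. The following is only a sketch: the names
add_block_ref, clamp_u8 and blocks_equal are made up for illustration and
are not XviD functions, and it assumes the error block is stored as 8
rows of 8 shorts with no padding.

```c
#include <string.h>

/* Clamp a signed intermediate to the 0..255 range of an 8-bit pixel. */
static unsigned char clamp_u8(int v)
{
	if (v < 0)
		return 0;
	if (v > 255)
		return 255;
	return (unsigned char)v;
}

/*
 * Hypothetical scalar reference: adds an 8x8 block of 16-bit error terms
 * to an 8x8 block of 8-bit predictor pixels, saturating each result.
 * Assumes error is laid out as 8 consecutive rows of 8 shorts.
 */
void add_block_ref(unsigned char *predictor,
		const short *error,
		int predictor_stride)
{
	int row, col;

	for (row = 0; row < 8; row++)
	{
		for (col = 0; col < 8; col++)
			predictor[col] = clamp_u8(predictor[col] + error[row * 8 + col]);
		predictor += predictor_stride;
	}
}

/* Returns 1 when two 8x8 blocks with the given stride hold the same pixels. */
int blocks_equal(const unsigned char *a, const unsigned char *b, int stride)
{
	int row;

	for (row = 0; row < 8; row++)
		if (memcmp(a + row * stride, b + row * stride, 8) != 0)
			return 0;
	return 1;
}
```

Running the reference and an optimized routine on copies of the same
input and comparing them with blocks_equal gives a quick regression
test whenever the unroll factor or instruction mix changes.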

You can see that it's very easy to change the unroll amount, whether you
use prefetch or not, or remove the restricted pointers and go back to
normal aliasing mode.
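
As a rough illustration of that kind of tuning, the switches can be
gathered into a few macros. This is a sketch in plain C rather than
intrinsics so it builds anywhere; ADD_USE_RESTRICT, ADD_ROWS_PER_ITER,
ADD_RESTRICT, sat_u8 and add_block_tunable are hypothetical names, and it
assumes a C99 compiler for the restrict keyword.

```c
/* Hypothetical build-time switches; override with e.g. -DADD_USE_RESTRICT=0. */
#ifndef ADD_USE_RESTRICT
#define ADD_USE_RESTRICT 1
#endif
#ifndef ADD_ROWS_PER_ITER
#define ADD_ROWS_PER_ITER 2	/* manual unroll factor, must divide 8 */
#endif

#if ADD_USE_RESTRICT
#define ADD_RESTRICT restrict	/* C99: promise the pointers do not alias */
#else
#define ADD_RESTRICT		/* fall back to normal aliasing rules */
#endif

/* Saturate a signed intermediate to an 8-bit pixel value. */
static unsigned char sat_u8(int v)
{
	return (unsigned char)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* Adds 8 rows of 8 error terms to the predictor, ADD_ROWS_PER_ITER rows at a time. */
void add_block_tunable(unsigned char *ADD_RESTRICT predictor,
		const short *ADD_RESTRICT error,
		int predictor_stride)
{
	int row, r, col;

	for (row = 0; row < 8; row += ADD_ROWS_PER_ITER)
	{
		/* constant trip count, so the compiler is free to flatten this loop */
		for (r = 0; r < ADD_ROWS_PER_ITER; r++)
		{
			for (col = 0; col < 8; col++)
				predictor[col] = sat_u8(predictor[col] + error[(row + r) * 8 + col]);
			predictor += predictor_stride;
		}
	}
}
```

Rebuilding with a different -D value is then enough to compare variants,
with no source edits between timing runs.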

What do people think?

Jim