[XviD-devel] MMX/SSE/SSE2 implementation

Christoph Lampert xvid-devel@xvid.org
Thu, 12 Dec 2002 12:23:41 +0100 (CET)


Hi, 

we switched from inline assembler to NASM a while ago (for x86 asm),
because NASM is available and compatible on almost any x86 platform,
whereas inline assembler isn't. This way, assembler code only has to be
written once, and doesn't have to be rewritten for every supported x86
compiler. Before that, the usual situation was that hackers using Windows
didn't write assembler for gcc and vice versa. :( 

There are some methods to overcome this problem (by special macros), but
at the moment we are rather happy with NASM (if only people learned to
install a recent version).
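For what it's worth, here is a hypothetical sketch of what such a macro
layer could look like (the EMMS() macro name and the fallback branch are my
own invention for illustration, not actual XviD code):

```c
/* Hypothetical sketch of the macro approach: hide the two inline-asm
 * dialects behind one macro, so a routine is written only once.
 * GCC expects AT&T-syntax asm statements, while MSVC uses __asm blocks. */
#if defined(__GNUC__)
#  define EMMS() __asm__ __volatile__("emms")
#elif defined(_MSC_VER)
#  define EMMS() __asm { emms }
#else
#  define EMMS() /* plain-C build: no MMX state to clear */
#endif
```

Every MMX routine could then end with EMMS() regardless of compiler; the
catch is that anything beyond single instructions (register names, operand
order, clobber lists) still differs between the dialects, which is why a
full macro layer gets ugly fast.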

Btw., are you sure that inline assembler is really optimized (reordered) by
the compiler? If I were an assembler programmer, I would _hate_ that
"feature"... 

gruel 


On Thu, 12 Dec 2002, James Hauxwell wrote:
> Hi,
> 
> I have been doing some experiments lately with the Intel/Microsoft
> compiler with intrinsics and have been very surprised with the
> results.
> 
> To touch base here, what are people opinions to recoding the mmx and
> such routines using compiler intrinsics?
> 
> My investigations have discovered the following plus points.
> 
> 1, easier to read/code and debug.
> 2, you don't have to worry about register allocation and scheduling, as
>    the compiler does it for you.
> 3, you can rebuild for different CPU targets, P4 or P3 or Athlon, and the
>    compiler will best decide how to schedule the instructions to avoid
>    stalls.
> 4, easier to test new optimizations.
> 5, don't need NASM to build, only your compiler.
> 
> Negative points are
> 
> 1, the work required to do it.
> 2, GCC and PC compilers do not share the same intrinsic names.
> 3, probably others as you will write back and inform me :-)
> 
> As an example, here is a quick version of add_c which took about
> 10 minutes to write.
> 
> #include <mmintrin.h>
> 
> void add_c(unsigned char *restrict predictor,
> 		short *restrict error,
> 		int predictor_stride)
> {
> 	int i;
> 
> #pragma unroll(2)
> 	for (i = 0; i < 8; i++)
> 	{
> 		__m64 zero = _mm_setzero_si64();
> 		__m64 x0 = *(__m64 *)predictor;
> 
> 		/* extract out 8 to 16 bit */
> 		__m64 x0_low = _mm_unpacklo_pi8(x0, zero);
> 		__m64 x0_high = _mm_unpackhi_pi8(x0, zero);
> 
> 		/* add the (signed) error */
> 		x0_low = _mm_add_pi16(x0_low, ((__m64 *)error)[2 * i]);
> 		x0_high = _mm_add_pi16(x0_high, ((__m64 *)error)[2 * i + 1]);
> 
> 		/* saturate and pack back to 8 bit */
> 		*(__m64 *)predictor = _mm_packs_pu16(x0_low, x0_high);
> 		predictor += predictor_stride;
> 	}
> 	_mm_empty();	/* leave MMX state clean for FPU code */
> }
> 
> You can see that it's very easy to change the unroll amount, whether you
> use prefetch or not, or remove the restricted pointers and go back to
> normal aliasing mode.
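
To make the prefetch point concrete, here is a sketch of that one-line
change (my addition, not part of Jim's mail: _mm_prefetch is an SSE
intrinsic from <xmmintrin.h>, and the add_prefetch name and NTA hint are
arbitrary choices, not benchmarked):

```c
#include <mmintrin.h>
#include <xmmintrin.h>	/* _mm_prefetch is an SSE intrinsic */

/* Same add routine as above, plus a software prefetch that pulls the
 * next predictor row toward the cache while the current row is used. */
void add_prefetch(unsigned char *restrict predictor,
                  short *restrict error,
                  int predictor_stride)
{
	int i;

	for (i = 0; i < 8; i++)
	{
		__m64 zero = _mm_setzero_si64();
		__m64 x0 = *(__m64 *)predictor;

		/* hint the next row into cache ahead of use */
		_mm_prefetch((const char *)(predictor + predictor_stride),
		             _MM_HINT_NTA);

		/* extract out 8 to 16 bit */
		__m64 x0_low = _mm_unpacklo_pi8(x0, zero);
		__m64 x0_high = _mm_unpackhi_pi8(x0, zero);

		/* add the (signed) error */
		x0_low = _mm_add_pi16(x0_low, ((__m64 *)error)[2 * i]);
		x0_high = _mm_add_pi16(x0_high, ((__m64 *)error)[2 * i + 1]);

		/* saturate and pack back to 8 bit */
		*(__m64 *)predictor = _mm_packs_pu16(x0_low, x0_high);
		predictor += predictor_stride;
	}
	_mm_empty();	/* leave MMX state clean for FPU code */
}
```

Removing the _mm_prefetch call again gives back the plain version, which
is exactly the kind of quick experiment that is painful in hand-written
asm.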
> 
> What do people think?
> 
> Jim
> 
> _______________________________________________
> XviD-devel mailing list
> XviD-devel@xvid.org
> http://list.xvid.org/mailman/listinfo/xvid-devel
>