[XviD-devel] MMX/SSE/SSE2 implementation

Thu, 12 Dec 2002 14:10:47 +0100 (CET)

On Thu, 12 Dec 2002, James Hauxwell wrote:
> 
> Yes, the compiler can reorder the code, and it is a good feature.  Most
> architectures have multiple pipes, so the idea is that you need to keep
> them busy for most of the time.  Out of order execution is common, and
> the compiler is normally better at taking advantage of this. 

In _theory_, yes, of course. But compilers don't know everything.
For example, sometimes a software prefetch has to go in one particular
place rather than another, and the compiler cannot work that out because
it can't see all the dependencies.
If I were an asm hacker, I would refuse to use a system where this
feature cannot be switched off. But I guess it can?
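
A minimal sketch of a hand-placed prefetch, to make the point concrete.
It assumes an SSE-capable x86 compiler that ships <xmmintrin.h>; the
function name sum_bytes_prefetch and the PREFETCH_AHEAD distance are
hypothetical and would need tuning per CPU:

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 (SSE) */

/* Hypothetical prefetch distance in bytes; an assumption, not a tuned value. */
#define PREFETCH_AHEAD 256

/* Sum n bytes, asking the CPU to fetch data PREFETCH_AHEAD bytes ahead
 * once per 64 bytes processed. The programmer, not the compiler,
 * decides where the prefetch goes. */
static uint32_t sum_bytes_prefetch(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        /* Only prefetch addresses that are still inside the buffer. */
        if ((i & 63) == 0 && i + PREFETCH_AHEAD < n)
            _mm_prefetch((const char *)(p + i + PREFETCH_AHEAD), _MM_HINT_T0);
        sum += p[i];
    }
    return sum;
}
```

The prefetch is only a hint, so the result is the same with or without
it; only the timing changes, and whether it helps has to be measured.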

Anyway, if I remember correctly, the most frequently used function in XVID
is the very simple sad16() i.e. sad16_c(), sad16_mmx(), sad16_sse() etc.

If sad16 got faster by using intrinsics (including reordering) instead
of nasm, this might be a good point in favour of switching (as
Edouard suggested a while ago). Can you check this? 
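
For such a comparison, something like the following could serve as a
starting point. It is only a sketch under assumptions: it uses SSE2
intrinsics from <emmintrin.h> rather than the MMX/SSE variants named
above, and the names sad16_ref and sad16_sse2_sketch and their
three-argument signature are hypothetical, not the actual XVID
prototypes:

```c
#include <stdint.h>
#include <stdlib.h>
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Plain C reference: sum of absolute differences over a 16x16 block. */
static uint32_t sad16_ref(const uint8_t *cur, const uint8_t *ref, int stride)
{
    uint32_t sad = 0;
    int x, y;

    for (y = 0; y < 16; y++) {
        for (x = 0; x < 16; x++)
            sad += (uint32_t)abs(cur[x] - ref[x]);
        cur += stride;
        ref += stride;
    }
    return sad;
}

/* Intrinsics version: one psadbw per 16-pixel row. */
static uint32_t sad16_sse2_sketch(const uint8_t *cur, const uint8_t *ref,
                                  int stride)
{
    __m128i acc = _mm_setzero_si128();
    int y;

    for (y = 0; y < 16; y++) {
        /* Unaligned loads, since block rows need not be 16-byte aligned. */
        __m128i c = _mm_loadu_si128((const __m128i *)cur);
        __m128i r = _mm_loadu_si128((const __m128i *)ref);

        /* Two partial sums, one per 64-bit half of the register. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(c, r));
        cur += stride;
        ref += stride;
    }
    /* Fold the upper partial sum onto the lower one. */
    acc = _mm_add_epi64(acc, _mm_srli_si128(acc, 8));
    return (uint32_t)_mm_cvtsi128_si32(acc);
}
```

Checking both functions against each other on the same blocks gives a
correctness test; the speed question still needs a benchmark against
the nasm version.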

> Other points taken, but I'm not talking about inline assembly, but
> rather intrinsics.  There is a difference.

Sorry, you are right. However, there's no file mmintrin.h or similar on
my Linux system, so people would have to install that instead of nasm.
I guess the file is compiler dependent, so we can't just include it all in
XVID.
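
Since the header comes with the compiler, one way to cope would be to
test for it at compile time and fall back to plain C otherwise. A rough
sketch, assuming the predefined macros __SSE2__ (gcc-style compilers)
and _M_X64 / _M_IX86_FP (Microsoft compiler); HAVE_SSE2_INTRINSICS and
add_sat_u8x16 are hypothetical names used only for illustration:

```c
#include <stdint.h>

/* Use the intrinsics header only when the compiler says it targets SSE2. */
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#  include <emmintrin.h>
#  define HAVE_SSE2_INTRINSICS 1
#else
#  define HAVE_SSE2_INTRINSICS 0
#endif

/* Saturating add of 16 unsigned bytes: dst[i] = min(255, a[i] + b[i]). */
static void add_sat_u8x16(uint8_t *dst, const uint8_t *a, const uint8_t *b)
{
#if HAVE_SSE2_INTRINSICS
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_adds_epu8(va, vb));
#else
    /* Portable fallback when no intrinsics header is available. */
    int i;
    for (i = 0; i < 16; i++) {
        unsigned s = (unsigned)a[i] + b[i];
        dst[i] = (uint8_t)(s > 255 ? 255 : s);
    }
#endif
}
```

Both branches compute the same result, so the same source builds with
or without the header.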

gruel

> -----Original Message-----
> From: xvid-devel-admin@xvid.org [mailto:xvid-devel-admin@xvid.org] On
> Behalf Of chl@math.uni-bonn.de
> Sent: 12 December 2002 11:24
> To: xvid-devel@xvid.org
> Subject: Re: [XviD-devel] MMX/SSE/SSE2 implementation
> 
> Hi, 
> 
> We switched from inline assembler to NASM a while ago (for x86 asm),
> because NASM is available and compatible for almost any x86 platform
> whereas inline assembler isn't. This way, assembler code only has to be
> written once, and doesn't have to be rewritten for every supported x86
> compiler. Before that, the usual behaviour was that hackers using
> Windows didn't write assembler for gcc and vice versa. :(
> 
> There are some methods to overcome this problem (by special macros), but
> at the moment, we are rather happy with nasm (if only people learned to
> install a recent version).
> 
> Btw. are you sure that inline assembler is really optimized (reordered)
> by the compiler? If I were an assembler programmer, I would _hate_ this
> "feature"...
> 
> gruel 
> 
> 
> On Thu, 12 Dec 2002, James Hauxwell wrote:
> > Hi,
> > 
> > I have been doing some experiments lately with the Intel/Microsoft
> > compiler with intrinsics and have been very surprised with the
> > results.
> > 
> > To touch base here, what are people's opinions on recoding the mmx and
> > such routines using compiler intrinsics?
> > 
> > My investigations have discovered the following plus points.
> > 
> > 1, easier to read/code and debug.
> > 2, you don't have to worry about register allocation and scheduling as
> >    the compiler does it for you.
> > 3, you can rebuild for different CPU targets, P4 or P3 or Athlon and
> >    the compiler will best decide how to schedule the instructions to
> >    avoid stalls.
> > 4, easier to test new optimizations.
> > 5, don't need NASM to build, only your compiler.
> > 
> > Negative points are
> > 
> > 1, the work required to do it.
> > 2, GCC and PC compilers do not share the same intrinsic names.
> > 3, probably others as you will write back and inform me :-)
> > 
> > As an example, here is a quick version of add_c which took about
> > 10 minutes to write.
> > 
> > #include <mmintrin.h>
> > 
> > void add_c(unsigned char *restrict predictor,
> > 		short *restrict error,
> > 		int predictor_stride)
> > {
> > 	const __m64 *err = (const __m64 *)error;
> > 	const __m64 zero = _mm_setzero_si64();
> > 	int i;
> > 
> > #pragma unroll(2)
> > 	for (i = 0; i < 8; i++)
> > 	{
> > 		/* load 8 predictor pixels */
> > 		__m64 pred = *(__m64 *)predictor;
> > 
> > 		/* extract out 8 to 16 bit */
> > 		__m64 x0_low = _mm_unpacklo_pi8(pred, zero);
> > 		__m64 x0_high = _mm_unpackhi_pi8(pred, zero);
> > 
> > 		/* add the error (8 shorts per row) */
> > 		x0_low = _mm_adds_pi16(x0_low, err[2 * i]);
> > 		x0_high = _mm_adds_pi16(x0_high, err[2 * i + 1]);
> > 
> > 		/* saturate and pack */
> > 		*(__m64 *)predictor = _mm_packs_pu16(x0_low, x0_high);
> > 		predictor += predictor_stride;
> > 	}
> > 	_mm_empty();	/* leave MMX state so x87 code works again */
> > }
> > 
> > You can see that it's very easy to change the unroll amount, whether
> > you use prefetch or not, or remove the restricted pointers and go
> > back to normal aliasing mode.
> > 
> > What do people think?
> > 
> > Jim
> > 
> > _______________________________________________
> > XviD-devel mailing list
> > XviD-devel@xvid.org
> > http://list.xvid.org/mailman/listinfo/xvid-devel
> > 