[XviD-devel] MMX/SSE/SSE2 implementation

Thu, 12 Dec 2002 13:27:35 -0000

I'll code up the sad() stuff tonight, and see what I get, as I am at
work ATM.

The example I gave was only an illustration and not meant for inclusion.
I just wanted to see what the .asm file that the compiler spat out
looked like.  mmintrin.h is only available on Windows platforms; gcc has
its own intrinsics naming convention, but that's the only difference.

For example, _mm_adds_pu16() translates to __builtin_ia32_paddusw() on
gcc.  It would be easy to set up a header full of #defines to do the
translation.
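
A rough sketch of what such a translation layer could look like.  The
names XV_ADDS_PU16 and adds_pu16_x4 are made up for illustration and are
not part of XviD.  It assumes the compiler either provides <mmintrin.h>
(detected here through the __MMX__ macro that gcc-style compilers
define) or falls back to plain C with the same saturating behaviour.

```c
#include <stdint.h>
#include <string.h>

#if defined(__MMX__)
#include <mmintrin.h>
/* One neutral name per operation; a port to a compiler that only has
 * the raw builtins would change the right-hand side of this #define
 * and nothing else. */
#define XV_ADDS_PU16(a, b) _mm_adds_pu16((a), (b))
#endif

/* Saturating add of four unsigned 16-bit lanes (what paddusw does). */
void adds_pu16_x4(const uint16_t a[4], const uint16_t b[4], uint16_t out[4])
{
#if defined(__MMX__)
    __m64 va, vb, vr;
    memcpy(&va, a, sizeof va);
    memcpy(&vb, b, sizeof vb);
    vr = XV_ADDS_PU16(va, vb);
    memcpy(out, &vr, sizeof vr);
    _mm_empty();                      /* leave MMX state */
#else
    /* Scalar fallback with identical results. */
    for (int i = 0; i < 4; i++) {
        uint32_t sum = (uint32_t)a[i] + (uint32_t)b[i];
        out[i] = (uint16_t)(sum > 0xFFFFu ? 0xFFFFu : sum);
    }
#endif
}
```

Callers only ever see adds_pu16_x4(), so the choice between intrinsic
and fallback stays in one place.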

Anyway, before I go that far, I'll see if I get any speed increase on
sad().

Jim

-----Original Message-----
From: xvid-devel-admin@xvid.org [mailto:xvid-devel-admin@xvid.org] On
Behalf Of chl@math.uni-bonn.de
Sent: 12 December 2002 13:11
To: xvid-devel@xvid.org
Subject: RE: [XviD-devel] MMX/SSE/SSE2 implementation

On Thu, 12 Dec 2002, James Hauxwell wrote:
> 
> Yes, the compiler can reorder the code, and it is a good feature.  Most
> architectures have multiple pipes, so the idea is that you need to keep
> them busy for most of the time.  Out-of-order execution is common, and
> the compiler is normally better at taking advantage of this.

In _theory_, yes, of course. But compilers don't know everything:
e.g. sometimes a software prefetch has to go at one particular place
rather than another, which a compiler cannot work out, because it
can't see all the dependencies.
If I were an asm hacker, I would refuse to use a system where this
feature cannot be switched off. But I guess it can?
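
To make the prefetch point concrete, here is a minimal sketch of a
hand-placed software prefetch in C.  PREFETCH_READ and sum_rows are
hypothetical names, and the sketch assumes a gcc-compatible compiler for
__builtin_prefetch (elsewhere the macro expands to nothing).  Whether
"one row ahead" is the right distance is exactly the kind of knowledge
the programmer has and the compiler does not.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) || defined(__clang__)
/* Read prefetch, keep in all cache levels. */
#define PREFETCH_READ(p) __builtin_prefetch((p), 0, 3)
#else
#define PREFETCH_READ(p) ((void)0)
#endif

/* Sum `rows` rows of `width` bytes, requesting the next row early. */
uint32_t sum_rows(const uint8_t *src, size_t stride, size_t width, size_t rows)
{
    uint32_t total = 0;

    for (size_t y = 0; y < rows; y++) {
        const uint8_t *row = src + y * stride;

        /* The programmer decides where the hint goes; here it is
         * issued before the current row is consumed. */
        if (y + 1 < rows)
            PREFETCH_READ(row + stride);

        for (size_t x = 0; x < width; x++)
            total += row[x];
    }
    return total;
}
```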

Anyway, if I remember correctly, the most frequently used function in
XVID is the very simple sad16(), i.e. sad16_c(), sad16_mmx(),
sad16_sse() etc.

If sad16 got faster by using intrinsics (including reordering) instead
of nasm, this might be a good point in favour of switching (as
Edouard suggested a while ago). Can you check this?
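
For reference, a sketch of what such a comparison could start from: a
plain C sum of absolute differences over a 16x16 block, and an
intrinsics variant built on SSE2's _mm_sad_epu8.  The names sad16_ref
and sad16_intrin and their simplified signatures are assumptions for
this example, not XviD's actual functions.  The intrinsics path is only
compiled when __SSE2__ is defined; otherwise it falls back to the C
version.

```c
#include <stdint.h>
#include <stdlib.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Plain C reference: SAD over a 16x16 block of 8-bit pixels. */
uint32_t sad16_ref(const uint8_t *cur, const uint8_t *ref, uint32_t stride)
{
    uint32_t sad = 0;

    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sad += (uint32_t)abs((int)cur[x] - (int)ref[x]);
        cur += stride;
        ref += stride;
    }
    return sad;
}

/* Intrinsics variant: one psadbw per 16-pixel row. */
uint32_t sad16_intrin(const uint8_t *cur, const uint8_t *ref, uint32_t stride)
{
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();

    for (int y = 0; y < 16; y++) {
        __m128i c = _mm_loadu_si128((const __m128i *)cur);
        __m128i r = _mm_loadu_si128((const __m128i *)ref);
        acc = _mm_add_epi32(acc, _mm_sad_epu8(c, r));
        cur += stride;
        ref += stride;
    }
    /* psadbw leaves one partial sum in each 64-bit half. */
    return (uint32_t)_mm_cvtsi128_si32(acc)
         + (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(acc, 8));
#else
    return sad16_ref(cur, ref, stride);
#endif
}
```

Comparing both functions on the same blocks gives a correctness check
before any timing is done.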

> Other points taken, but I'm not talking about inline assembly, but
> rather intrinsics.  There is a difference.

Sorry, you are right. However, there's no file mmintrin.h or similar on
my Linux system, so people would have to install that instead of nasm.
I guess the file is compiler dependent, so we can't just include it all
in XVID.

gruel

> -----Original Message-----
> From: xvid-devel-admin@xvid.org [mailto:xvid-devel-admin@xvid.org] On
> Behalf Of chl@math.uni-bonn.de
> Sent: 12 December 2002 11:24
> To: xvid-devel@xvid.org
> Subject: Re: [XviD-devel] MMX/SSE/SSE2 implementation
> 
> Hi, 
> 
> we switched from inline assembler to NASM a while ago (for x86 asm),
> because NASM is available and compatible for almost any x86 platform
> whereas inline assembler isn't. This way, assembler code only has to be
> written once, and doesn't have to be rewritten for every supported x86
> compiler. Before that, the usual behaviour was that hackers using
> Windows didn't write assembler for gcc and vice versa. :(
> 
> There are some methods to overcome this problem (by special macros), but
> at the moment, we are rather happy with nasm (if only people learned to
> install a recent version).
> 
> Btw. are you sure that inline assembler is really optimized (reordered)
> by the compiler? If I were an assembler programmer, I would _hate_ this
> "feature"...
> 
> gruel 
> 
> 
> On Thu, 12 Dec 2002, James Hauxwell wrote:
> > Hi,
> > 
> > I have been doing some experiments lately with the Intel/Microsoft
> > compiler with intrinsics and have been very surprised with the
> > results.
> > 
> > To touch base here, what are people's opinions on recoding the mmx and
> > such routines using compiler intrinsics?
> > 
> > My investigations have discovered the following plus points.
> > 
> > 1, easier to read/code and debug.
> > 2, you don't have to worry about register allocation and scheduling, as
> >    the compiler does it for you.
> > 3, you can rebuild for different CPU targets (P4, P3 or Athlon) and the
> >    compiler will decide how best to schedule the instructions to avoid
> >    stalls.
> > 4, easier to test new optimizations.
> > 5, don't need NASM to build, only your compiler.
> > 
> > Negative points are
> > 
> > 1, the work required to do it.
> > 2, GCC and PC compilers do not share the same intrinsic names.
> > 3, probably others as you will write back and inform me :-)
> > 
> > As an example, here is a quick version of add_c which took about
> > 10 minutes to write.
> > 
> > #include <mmintrin.h>
> > 
> > /* Add an 8x8 block of 16-bit errors (assumed contiguous, 64 shorts)
> >    to an 8x8 block of 8-bit predictor pixels, saturating to 0..255. */
> > void add_c(unsigned char *restrict predictor,
> > 		const short *restrict error,
> > 		int predictor_stride)
> > {
> > 	const __m64 zero = _mm_setzero_si64();
> > 	int i;
> > 
> > #pragma unroll(2)
> > 	for (i = 0; i < 8; i++)
> > 	{
> > 		/* load one row of 8 predictor pixels */
> > 		__m64 x0_low = *(__m64 *)predictor;
> > 		__m64 x0_high = x0_low;
> > 
> > 		/* extract out 8 to 16 bit */
> > 		x0_low = _mm_unpacklo_pi8(x0_low, zero);
> > 		x0_high = _mm_unpackhi_pi8(x0_high, zero);
> > 
> > 		/* add the error (signed, 4 shorts per __m64) */
> > 		x0_low = _mm_adds_pi16(x0_low, ((const __m64 *)error)[2 * i]);
> > 		x0_high = _mm_adds_pi16(x0_high, ((const __m64 *)error)[2 * i + 1]);
> > 
> > 		/* saturate and pack back to 8 bit */
> > 		*(__m64 *)predictor = _mm_packs_pu16(x0_low, x0_high);
> > 		predictor += predictor_stride;
> > 	}
> > 
> > 	/* leave MMX state so that later x87 code works */
> > 	_mm_empty();
> > }
> > 
> > You can see that it's very easy to change the unroll amount, whether you
> > use prefetch or not, or remove the restricted pointers and go back to
> > normal aliasing mode.
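
A plain C reference makes experiments like these easy to verify.  The
sketch below is one possible scalar version of the same operation; the
name add_ref and the assumption that the error block is 64 contiguous
shorts (8 per row) are mine, for illustration only.

```c
#include <stdint.h>

/* Clamp an int to the 0..255 range of an 8-bit pixel. */
static uint8_t clamp_u8(int v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}

/* Scalar reference: predictor[y][x] += error[y][x], saturated. */
void add_ref(uint8_t *predictor, const int16_t *error, int predictor_stride)
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            predictor[x] = clamp_u8((int)predictor[x] + (int)error[y * 8 + x]);
        predictor += predictor_stride;
    }
}
```

Running an optimized routine and add_ref() on copies of the same input
and comparing the two output blocks catches most mistakes quickly.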
> > 
> > What do people think?
> > 
> > Jim
> > 
> > _______________________________________________
> > XviD-devel mailing list
> > XviD-devel@xvid.org
> > http://list.xvid.org/mailman/listinfo/xvid-devel