[XviD-devel] Quality optimization

Fri Feb 28 17:59:30 CET 2003

Is it not faster to make them positive (by adding something)
and then use psubusb or psubusw (does the latter exist?)?
; mm0, mm1 each contain 4 packed shorts
paddw   mm0, something
paddw   mm1, something
movq    mm2, mm0
psubusw mm0, mm1   ; mm0 - mm1, clamped at 0
psubusw mm1, mm2   ; mm1 - old mm0, clamped at 0
paddw   mm0, mm1   ; <-- abs diff (one of the two terms is always 0)
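
For what it's worth, psubusw is part of the base MMX instruction set. Below is a minimal scalar C sketch of what the sequence above does to a single 16-bit lane. It assumes that "something" is a bias of 0x8000, which maps the signed range onto the unsigned range without changing the ordering. The function names are made up for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one 16-bit lane of psubusw: unsigned subtract, clamped at 0. */
uint16_t sub_sat_u16(uint16_t a, uint16_t b)
{
    return (uint16_t)(a > b ? a - b : 0);
}

/* |a - b| for signed 16-bit inputs, following the biased-subtract sequence.
 * ASSUMPTION: the bias ("something") is 0x8000; adding it modulo 2^16 maps
 * [-32768, 32767] onto [0, 65535] and keeps the ordering intact. */
uint16_t absdiff_biased(int16_t a, int16_t b)
{
    const uint16_t bias = 0x8000;
    uint16_t ua = (uint16_t)((uint16_t)a + bias);  /* paddw mm0, something */
    uint16_t ub = (uint16_t)((uint16_t)b + bias);  /* paddw mm1, something */
    uint16_t d0 = sub_sat_u16(ua, ub);             /* psubusw mm0, mm1 */
    uint16_t d1 = sub_sat_u16(ub, ua);             /* psubusw mm1, mm2 */
    return (uint16_t)(d0 + d1);                    /* paddw mm0, mm1 */
}

/* Small self-check, including the largest possible difference. */
void check_absdiff_biased(void)
{
    assert(absdiff_biased(5, -3) == 8);
    assert(absdiff_biased(-3, 5) == 8);
    assert(absdiff_biased(-32768, 32767) == 65535);
}
```

Since one of the two saturated differences is always zero, a por would presumably do the same job as the final paddw.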

or (similar to yours...)

movq    mm2, mm0
pcmpgtw mm2, mm1   ; mask: all ones where mm0 > mm1
pxor    mm0, mm2   ; minimum or (65535-maximum)
pxor    mm1, mm2   ; maximum or (65535-minimum)
psubw   mm1, mm0   ; <-- abs diff (maximum-minimum) or
                   ; ((65535-minimum)-(65535-maximum) --> maximum-minimum)
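
A rough scalar C model of one 16-bit lane of this second sequence may make the trick easier to check. The function names are invented for the illustration, and the result corresponds to what ends up in mm1 after the psubw.

```c
#include <assert.h>
#include <stdint.h>

/* |a - b| for signed 16-bit inputs, one lane of the pcmpgtw/pxor sequence. */
uint16_t absdiff_xor_mask(int16_t a, int16_t b)
{
    uint16_t ua = (uint16_t)a;
    uint16_t ub = (uint16_t)b;
    uint16_t mask = (a > b) ? 0xFFFFu : 0u;  /* pcmpgtw mm2, mm1 (signed compare) */
    ua = (uint16_t)(ua ^ mask);              /* pxor mm0, mm2: a, or 65535-a if a > b */
    ub = (uint16_t)(ub ^ mask);              /* pxor mm1, mm2: b, or 65535-b if a > b */
    return (uint16_t)(ub - ua);              /* psubw mm1, mm0 (wraps modulo 2^16) */
}

/* Small self-check covering both branches of the mask. */
void check_absdiff_xor_mask(void)
{
    assert(absdiff_xor_mask(5, -3) == 8);   /* a > b: both operands inverted */
    assert(absdiff_xor_mask(-3, 5) == 8);   /* a <= b: operands untouched */
}
```

Inverting both operands where a > b reverses their order, so the single subtraction b - a always runs in the non-negative direction.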

Just

----- Original Message -----
From: <trbarry at trbarry.com>
To: <xvid-devel at xvid.org>
Sent: Thursday, February 27, 2003 6:02 PM
Subject: RE: [XviD-devel] Quality optimization


> This post reminded me of a section I'd seen in the Intel Optimization
> reference for the abs diff of signed numbers. I've copied it here:
>
> ------ quote -----------
> The technique used here is to first sort the corresponding elements of
> the input operands into packed words of the maximum values, and packed
> words of the minimum values. Then the minimum values are subtracted from
> the maximum values to generate the required absolute difference. The key
> is a fast sorting technique that uses the fact that B = xor(A, xor(A,B))
> and A = xor(A,0). Thus in a packed data type, having some elements being
> xor(A,B) and some being 0, you could xor such an operand with A and
> receive in some places values of A and in some values of B. The
> following examples assume a packed-word data type, each element being a
> signed value.
>
> Example 4-17 Absolute Difference of Signed Numbers
>
> ;Input:
> ; MM0 signed source operand
> ; MM1 signed source operand
> ;Output:
> ; MM0 absolute difference of the unsigned operands
>       movq MM2, MM0   ; make a copy of source1 (A)
>       pcmpgtw MM0, MM1 ; create mask of
>                       ; source1>source2 (A>B)
>       movq MM4, MM2   ; make another copy of A
>       pxor MM2, MM1   ; create the intermediate value of
>                       ; the swap operation - xor(A,B)
>       pand MM2, MM0   ; create a mask of 0s and xor(A,B)
>                       ; elements. Where A>B there will
>                       ; be a value xor(A,B) and where
>                       ; A<=B there will be 0.
>       pxor MM4, MM2   ; minima-xor(A, swap mask)
>       pxor MM1, MM2   ; maxima-xor(B, swap mask)
>       psubw MM1, MM4  ; absolute difference =
>                       ; maxima-minima
> ---------- /quote ---------------------------
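
The xor-based sort in the quoted example can be modelled for a single 16-bit lane in scalar C, as sketched below. The function names are hypothetical. Reading the listing, the final psubw leaves the result in MM1, although the header comment names MM0.

```c
#include <assert.h>
#include <stdint.h>

/* One 16-bit lane of the quoted xor-swap sort, then maxima - minima. */
uint16_t absdiff_xor_swap(int16_t a, int16_t b)
{
    uint16_t ua = (uint16_t)a;
    uint16_t ub = (uint16_t)b;
    uint16_t mask = (a > b) ? 0xFFFFu : 0u;        /* pcmpgtw MM0, MM1 */
    uint16_t swap = (uint16_t)((ua ^ ub) & mask);  /* pxor MM2, MM1; pand MM2, MM0 */
    uint16_t minv = (uint16_t)(ua ^ swap);         /* pxor MM4, MM2: minima */
    uint16_t maxv = (uint16_t)(ub ^ swap);         /* pxor MM1, MM2: maxima */
    return (uint16_t)(maxv - minv);                /* psubw MM1, MM4 */
}

/* Small self-check: the operands are swapped only where a > b. */
void check_absdiff_xor_swap(void)
{
    assert(absdiff_xor_swap(5, -3) == 8);
    assert(absdiff_xor_swap(-3, 5) == 8);
}
```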
>
> Although thinking about it, for machines that support these instructions
> it seems you could just use:
>
>      movq   mm2, mm0  ; make a copy of source
>      pminsw mm0, mm1  ; signed word min
>      pmaxsw mm1, mm2  ; signed word max
>      psubw  mm1, mm0  ; big - small
>
> - Tom
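
A scalar C sketch of one lane of the pminsw/pmaxsw version follows, with made-up function names. On MMX registers these two instructions arrived with the SSE integer extensions, so plain-MMX CPUs would still need one of the other sequences.

```c
#include <assert.h>
#include <stdint.h>

/* |a - b| for signed 16-bit inputs via signed min/max, one lane. */
uint16_t absdiff_minmax(int16_t a, int16_t b)
{
    int16_t lo = (a < b) ? a : b;                    /* pminsw mm0, mm1 */
    int16_t hi = (a > b) ? a : b;                    /* pmaxsw mm1, mm2 */
    return (uint16_t)((uint16_t)hi - (uint16_t)lo);  /* psubw mm1, mm0 */
}

/* Small self-check, including a difference that does not fit a signed word. */
void check_absdiff_minmax(void)
{
    assert(absdiff_minmax(5, -3) == 8);
    assert(absdiff_minmax(-32768, 32767) == 65535);
}
```

The difference is only meaningful as an unsigned word once it exceeds 32767. For inputs limited to 11 or 12 bits, such as the Hadamard output discussed in this thread, that case should not come up.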
>
> | -----Original Message-----
> | From: xvid-devel-bounces at xvid.org
> | [mailto:xvid-devel-bounces at xvid.org]On
> | Behalf Of skal
> | Sent: Thursday, February 27, 2003 10:31 AM
> | To: xvid-devel at xvid.org
> | Subject: Re: [XviD-devel] Quality optimization
> |
> |
> |
> | Hi,
> |
> | On Wed, 2003-02-26 at 19:24, Christoph Lampert wrote:
> |
> | > IDCT is
> | >
> | > PLAINC -  1.395 usec   (<- slower than fDCT?)
> |
> | most of the time, yes, because iDCT needs final [-256,255] clipping...
> |
> | > MMX    -  0.219 usec
> | > MMXEXT -  0.199 usec
> | > SSE2   -  0.219 usec
> | > 3DNOW  -  1.247 usec
> | > 3DNOWE -  0.184 usec
> | >
> | > whereas Hadamard is
> | >
> | > PLAINC  - 0.549 usec
> | > MMXEXT  - 0.089 usec
> | >
> | > 0.089 is about the time that sad16() needs with MMXEXT, too,
> | > so a search routine based on hadamard+sad should not slow things
> | > down too much.
> |
> | That's not that easy :)
> | In fact, having the Hadamard transform done doesn't mean
> | you're done with the hard work. Taking the abs values with
> | MMX is painful. For SSE, we have the mighty 'psadbw' instruction,
> | but it works on 8-bit data, whereas the output of Hadamard
> | is 11 bits (yes, it's scaled by 8, not 64! :). So? Should we
> | descale 11->8 bits? Another idea would be to multiply the
> | Hadamard output by a pseudo-quant matrix that mimics the real
> | quantizers (and the missing cosines, maybe)... Dunno.
> | Anyway, here's a Hadamard_SAD for 8x8 or 16x16 byte input,
> | as a replacement for SAD. I'm not quite satisfied with it,
> | mainly because of the above (and not just because it's
> | 8 times slower than pure SAD :)
> |
> | bye!
> |
> | Skal
> |
> | > Btw. what we would need in the end is a SATD function (SAD of
> | > transformed), so we would have to do either
> | >
> | > SAD( Hadamard(Cur), Hadamard(Ref) )      (*)
> | >
> | > with the usual SAD routine, or
> | >
> | > sum( abs( Hadamard(Cur - Ref) ) )        (**)
> | >
> | > In theory these should be identical (Hadamard is linear), but maybe
> | > they are not...
> | >
> | > Anyway, would it be faster to combine these steps into a larger
> | > routine, or rather not? Again, I would believe it would, because for
> | > (**) the result of Hadamard doesn't have to be saved, only summed up,
> | > but of course I'm no expert...
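
To illustrate the linearity argument for (*) and (**), here is a small C sketch. It makes two simplifying assumptions: it uses a 1-D, unnormalized 8-point Hadamard transform instead of the 8x8 block transform, and all function names are invented for the example. With exact integer arithmetic the two sums agree. Any scaling or rounding applied inside a real transform could make them differ, and (*) would also need a SAD routine that accepts more than 8 bits per sample.

```c
#include <assert.h>
#include <stdlib.h>

#define N 8

/* In-place, unnormalized fast Walsh-Hadamard transform of N = 8 values. */
void hadamard8(int v[N])
{
    for (int step = 1; step < N; step *= 2) {
        for (int i = 0; i < N; i += 2 * step) {
            for (int j = i; j < i + step; j++) {
                int x = v[j], y = v[j + step];
                v[j] = x + y;          /* butterfly: sum */
                v[j + step] = x - y;   /* butterfly: difference */
            }
        }
    }
}

/* (*)  SAD( Hadamard(cur), Hadamard(ref) ) */
int satd_transform_both(const unsigned char cur[N], const unsigned char ref[N])
{
    int hc[N], hr[N], sum = 0;
    for (int i = 0; i < N; i++) { hc[i] = cur[i]; hr[i] = ref[i]; }
    hadamard8(hc);
    hadamard8(hr);
    for (int i = 0; i < N; i++) sum += abs(hc[i] - hr[i]);
    return sum;
}

/* (**) sum( abs( Hadamard(cur - ref) ) ): only one transform, nothing stored. */
int satd_transform_diff(const unsigned char cur[N], const unsigned char ref[N])
{
    int d[N], sum = 0;
    for (int i = 0; i < N; i++) d[i] = cur[i] - ref[i];
    hadamard8(d);
    for (int i = 0; i < N; i++) sum += abs(d[i]);
    return sum;
}

/* Self-check on arbitrary sample data: both formulations give the same sum. */
void check_satd_equivalence(void)
{
    const unsigned char cur[N] = {12, 200, 31, 7, 255, 0, 90, 64};
    const unsigned char ref[N] = {10, 180, 40, 7, 0, 255, 91, 60};
    assert(satd_transform_both(cur, ref) == satd_transform_diff(cur, ref));
}
```

Variant (**) needs one transform instead of two, which fits the suggestion that a combined routine could be cheaper.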
> |
> |
>
>
> _______________________________________________
> XviD-devel mailing list
> XviD-devel at xvid.org
> http://list.xvid.org/mailman/listinfo/xvid-devel