[XviD-devel] Quality optimization

Wed Feb 26 12:14:46 CET 2003

	Hi,

	I'm forwarding an answer to a question I received,
	since it may be of general interest:

On Tue, 2003-02-25 at 18:20, skal wrote:
> 	here's a C/MMX/SSE version of the Hadamard transform (16bits).

Q: > Would there be any improvement between SSE and SSE2?

        Well, most probably. The obvious improvement is in
        the vertical pass: instead of processing 4 + 4 columns
        one after the other, all 8 could be done in one pass, replacing:
 HADAMARD_VPASS eax
 HADAMARD_VPASS eax+8

        by a single HADAMARD_VPASS eax where all the 'mm?' registers
        are replaced by 'xmm?' in this macro (an xmm register holds
        eight 16-bit values), taking care of alignment.
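
        To make the idea concrete, here is a rough sketch in C with
        SSE2 intrinsics rather than the actual macro. The function
        names, the row-major 8x8 layout and the natural (unshuffled)
        output order are assumptions made for illustration, not the
        posted code. Unaligned loads are used to sidestep the
        alignment question; the aligned variants need 16-byte
        aligned data.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>          /* SSE2 intrinsics */
#define HAVE_SSE2_SKETCH 1
#endif

/* Scalar reference: 8-point Hadamard butterflies down each of the
 * 8 columns of a row-major 8x8 block of 16-bit values. */
static void hadamard_vpass_c(int16_t blk[64])
{
    for (int step = 1; step < 8; step <<= 1) {
        for (int r = 0; r < 8; r++) {
            if (r & step) continue;         /* r is the upper row of a pair */
            for (int c = 0; c < 8; c++) {
                int16_t a = blk[r * 8 + c];
                int16_t b = blk[(r + step) * 8 + c];
                blk[r * 8 + c]          = (int16_t)(a + b);
                blk[(r + step) * 8 + c] = (int16_t)(a - b);
            }
        }
    }
}

/* Same butterflies, but each row (8 x 16 bits) sits in one 128-bit
 * register, so all 8 columns are handled in a single pass. */
static void hadamard_vpass_sse2(int16_t blk[64])
{
#ifdef HAVE_SSE2_SKETCH
    __m128i row[8];
    for (int r = 0; r < 8; r++)
        row[r] = _mm_loadu_si128((const __m128i *)(blk + 8 * r));

    for (int step = 1; step < 8; step <<= 1) {
        for (int r = 0; r < 8; r++) {
            if (r & step) continue;
            __m128i a = row[r], b = row[r + step];
            row[r]        = _mm_add_epi16(a, b);
            row[r + step] = _mm_sub_epi16(a, b);
        }
    }

    for (int r = 0; r < 8; r++)
        _mm_storeu_si128((__m128i *)(blk + 8 * r), row[r]);
#else
    hadamard_vpass_c(blk);      /* no SSE2 available: scalar fallback */
#endif
}

/* Usage: both versions must agree on the same input. */
static void hadamard_vpass_selfcheck(void)
{
    int16_t a[64], b[64];
    for (int i = 0; i < 64; i++)
        a[i] = b[i] = (int16_t)(i % 13 - 6);
    hadamard_vpass_c(a);
    hadamard_vpass_sse2(b);
    for (int i = 0; i < 64; i++)
        assert(a[i] == b[i]);
}
```

        The MMX version would need two such passes over 4 columns
        each, since an mm register only holds four 16-bit values.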

        That said, such heavily SIMD'd functions rapidly hit
        the memory bandwidth bottleneck. In the Hadamard
        transform I posted, only HALF of the time (tick-wise) is
        spent doing the arithmetic. The rest is spent
        loading/storing data. For the F/Idct, data I/O
        is also a large share of the cost, considering how cheap
        the mults are.
        Note also that for these in-place funcs, prefetching is
        almost useless (haven't tested it, though).

> 	Without the 'pshufw' re-ordering, output columns are re-ordered
> 	according to: [03127465]. C-version spits the correct order...
> 	Note: Output is also scaled by 8.

        Oops! Output is scaled by 64, not 8.
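
        One way to see where a factor of 64 comes from, assuming
        plain unnormalized butterflies (sums and differences, no
        shifts) on an 8x8 block: each 1-D pass has a gain of 8, so
        a flat block ends up with a DC term 64 times the sample
        value, and applying the 2-D transform twice gives back the
        input times 64. The sketch below is a plain C reference
        with hypothetical names, in natural output order; it is
        not the posted C version.

```c
#include <assert.h>
#include <stdint.h>

/* In-place unnormalized 8-point Hadamard butterflies on
 * p[0], p[stride], ..., p[7*stride]. */
static void hadamard8_1d(int16_t *p, int stride)
{
    for (int step = 1; step < 8; step <<= 1) {
        for (int i = 0; i < 8; i++) {
            if (i & step) continue;
            int16_t a = p[i * stride];
            int16_t b = p[(i + step) * stride];
            p[i * stride]          = (int16_t)(a + b);
            p[(i + step) * stride] = (int16_t)(a - b);
        }
    }
}

/* 2-D transform: horizontal pass on each row, then vertical pass
 * on each column. */
static void hadamard8x8(int16_t blk[64])
{
    for (int r = 0; r < 8; r++) hadamard8_1d(blk + 8 * r, 1);
    for (int c = 0; c < 8; c++) hadamard8_1d(blk + c, 8);
}

/* Usage: a flat block of value 3 yields DC = 64 * 3, all else 0. */
static void hadamard8x8_scale_check(void)
{
    int16_t flat[64];
    for (int i = 0; i < 64; i++) flat[i] = 3;
    hadamard8x8(flat);
    assert(flat[0] == 64 * 3);
    for (int i = 1; i < 64; i++) assert(flat[i] == 0);
}
```

        With 16-bit storage this leaves little headroom, so inputs
        have to stay small enough that 64 times their magnitude
        still fits.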

        bye,
                Skal