[XviD-devel] Quality optimization

Wed Feb 26 15:56:57 CET 2003

	Hi 

On Tue, 2003-02-25 at 18:50, Christoph Lampert wrote:
> since you are still (or again) amongst us ;-) 
	bwehe

> I saw in your xvid_bench.c code that it checks e.g. SAD speed on 
> 16*16 arrays (so stride is 16) and only in perfect alignment. Isn't that
> untypical, because in "real life" stride should be something like 720,
> in any way much larger than cacheline, 

	Well, actually, xvid_bench.c is meant to:

	a) perform some unit tests and non-regression checks (CRC)
	b) provide a *lower* bound on how long a func takes to
	run, given perfect conditions.

	In real life, most of xvid_bench.c's funcs will run slower,
	for sure, and a test suite of sequences will prove it :)
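
	To make the stride question concrete, here is a minimal sketch
	in plain C (not the xvid_bench.c code; sad16_c and
	time_sad16_frame are hypothetical names). The same SAD routine
	can be run on a packed 16x16 block (stride 16, the best case)
	or walked over a frame-like buffer with a stride such as 720.
	The timing uses clock(), which is coarse, so it only gives a
	rough comparison:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Plain C SAD over one 16x16 block; cur and ref share the same stride. */
static uint32_t sad16_c(const uint8_t *cur, const uint8_t *ref, size_t stride)
{
    uint32_t sad = 0;
    assert(stride >= 16);
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sad += (uint32_t)abs((int)cur[x] - (int)ref[x]);
        cur += stride;
        ref += stride;
    }
    return sad;
}

/* Walk every whole 16x16 block of a width x height area, summing the SADs.
 * Returns the elapsed CPU time in seconds; the sum is stored in *total so
 * the work cannot be optimized away. */
static double time_sad16_frame(const uint8_t *cur, const uint8_t *ref,
                               int width, int height, size_t stride,
                               uint32_t *total)
{
    uint32_t acc = 0;
    clock_t t0 = clock();
    for (int y = 0; y + 16 <= height; y += 16)
        for (int x = 0; x + 16 <= width; x += 16) {
            size_t off = (size_t)y * stride + (size_t)x;
            acc += sad16_c(cur + off, ref + off, stride);
        }
    *total = acc;
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

	Calling sad16_c(cur, ref, 16) on a small packed buffer
	corresponds to the "perfect condition" case, while
	time_sad16_frame() with stride 720 on full frames is closer to
	what the encoder actually sees.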

> and also only "Reference" pointer
> would be aligned, not "Current"? Or doesn't this matter on x86? 

	Well, you of course get a penalty (a few ticks) in case of
	misalignment during memory access. It starts to matter
	with SSE2, where separate instructions are provided for
	known-aligned and unaligned reads/writes. Sure, a good
	assumption is that 'current' is aligned (to 16 for SSE2;
	no chroma here!) whereas 'ref' isn't.
	Here's, for instance, a simple 16x8 ref->cur SSE2 copy:

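  ; eax = ref (source, possibly unaligned), ecx = cur (destination, 16-byte aligned)
  ; edx = stride; ebx is assumed to hold 3*stride (set up by the caller)
  ; movdqu = unaligned 16-byte load, movdqa = aligned 16-byte store
  movdqu xmm0,  [eax]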
  movdqu xmm1,  [eax+edx]
  movdqu xmm2,  [eax+2*edx]
  movdqu xmm3,  [eax+ebx]
  lea eax, [eax+4*edx]
  movdqu xmm4,  [eax]
  movdqu xmm5,  [eax+edx]
  movdqu xmm6,  [eax+2*edx]
  movdqu xmm7,  [eax+ebx]
  movdqa [ecx],      xmm0 
  movdqa [ecx+edx],  xmm1
  movdqa [ecx+2*edx],xmm2
  movdqa [ecx+ebx],  xmm3  
  lea ecx, [ecx+4*edx]
  movdqa [ecx],      xmm4
  movdqa [ecx+edx],  xmm5
  movdqa [ecx+2*edx],xmm6
  movdqa [ecx+ebx],  xmm7
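
	For readers more comfortable with C, the same copy can be
	sketched with SSE2 intrinsics. This is an illustration, not
	xvid code: copy_16x8_sse2 is a hypothetical name, and it
	assumes 'cur' is 16-byte aligned and the stride is a multiple
	of 16, so every destination row stays aligned. A plain
	memcpy() fallback is used when SSE2 is not available at
	compile time:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>          /* SSE2 intrinsics */
#define COPY_SKETCH_HAVE_SSE2 1
#else
#include <string.h>
#endif

/* Copy a 16x8 block from ref (any alignment) to cur (16-byte aligned). */
static void copy_16x8_sse2(uint8_t *cur, const uint8_t *ref, size_t stride)
{
    /* Aligned stores fault on a misaligned address, so check up front. */
    assert(((uintptr_t)cur & 15) == 0 && (stride & 15) == 0);
    for (int y = 0; y < 8; y++) {
        const uint8_t *src = ref + (size_t)y * stride;
        uint8_t *dst = cur + (size_t)y * stride;
#ifdef COPY_SKETCH_HAVE_SSE2
        __m128i row = _mm_loadu_si128((const __m128i *)src); /* movdqu */
        _mm_store_si128((__m128i *)dst, row);                /* movdqa */
#else
        memcpy(dst, src, 16);   /* portable fallback, no SSE2 */
#endif
    }
}
```

	The assembly version unrolls all eight rows into xmm0-xmm7;
	the loop here leaves any unrolling to the compiler.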

> 
> Also, even though your code is so fast, I didn't find any
> "prefetch" instructions in ASM or C whereas ffmpeg's SAD routines are full
> of them. Didn't you test them, or didn't they yield a speedup? 

	I've played with them a little, without any definitive conclusion
	to offer. I thought the Pentium 4's hardware prefetch would be the
	cure, but it appears that this platform is sometimes slower
	(with my code, not xvid) than a PIII. Since I'm limited by
	memory bandwidth most of the time now, it might be time to really
	use prefetch. As a rule of thumb, I'd say, taking
	motion compensation as an example, that 'ref' should be
	prefetched, and 'cur' should be written with non-temporal moves
	(except for b-frames). From my small experience, prefetch can
	potentially be very powerful (cf. all the various memcpy() flavors),
	but it's also very, very easy to use it very, very badly, especially
	for data whose lifespan is not clear enough. So...
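
	As an illustration of that rule of thumb, here is a hedged
	sketch in C with SSE/SSE2 intrinsics: the next 'ref' row is
	prefetched and 'cur' is written with non-temporal stores. It
	has not been measured and is not xvid code. copy_16x8_nt is a
	hypothetical name, the one-row prefetch distance is an
	arbitrary choice that would need tuning, and 'cur' is assumed
	to be 16-byte aligned with a stride that is a multiple of 16:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>          /* SSE2; also pulls in _mm_prefetch/_mm_sfence */
#define NT_SKETCH_HAVE_SSE2 1
#else
#include <string.h>
#endif

/* 16x8 ref->cur copy: prefetch the next ref row, stream the cur rows. */
static void copy_16x8_nt(uint8_t *cur, const uint8_t *ref, size_t stride)
{
    assert(((uintptr_t)cur & 15) == 0 && (stride & 15) == 0);
#ifdef NT_SKETCH_HAVE_SSE2
    for (int y = 0; y < 8; y++) {
        const uint8_t *src = ref + (size_t)y * stride;
        if (y + 1 < 8)          /* hint only; stays inside the block */
            _mm_prefetch((const char *)(src + stride), _MM_HINT_T0);
        __m128i row = _mm_loadu_si128((const __m128i *)src);       /* movdqu */
        /* Non-temporal store (movntdq): tries not to pollute the cache. */
        _mm_stream_si128((__m128i *)(cur + (size_t)y * stride), row);
    }
    _mm_sfence();               /* order the streaming stores before later stores */
#else
    for (int y = 0; y < 8; y++) /* portable fallback, no prefetch or streaming */
        memcpy(cur + (size_t)y * stride, ref + (size_t)y * stride, 16);
#endif
}
```

	Streaming stores only pay off when 'cur' is not read again
	soon, which is why the b-frame case is the exception.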

	bye,
		Skal