[XviD-devel] GMC rc3 - TODO

Christoph Lampert xvid-devel@xvid.org
Sun, 12 Jan 2003 13:51:17 +0100 (CET)


Hi,

here is the promised mail about GMC improvement. Some results 
of the current CVS version are attached.
We have to optimize for 2 goals: quality and speed. 

As you might notice when profiling, the main new routines are

GlobalMotionEst()       in motion_est.c, encoding only

generate_GMCparameters  in motion_comp.c, encoding and decoding
generate_GMCimageMB     in motion_comp.c, encoding and decoding


GlobalMotionEst() is a little bit tricky, and there will surely be several
changes. It's more than a "hack" at the moment, but it's far from perfect,
too. This should first be optimized for quality. For this it would be
interesting to know which clips/frames get _larger_ at fixed quant with GMC
enabled and which get smaller. A perfect mode decision should only
activate GMC if one really benefits from it. 



The other two, generate_GMCparameters() and generate_GMCimageMB(), are
almost in their final form. At the moment they only support 2 warp
parameters instead of 3, but this is easy to add when we really want
to. Apart from this, they are good candidates for optimization:

generate_GMCparameters()   is only called once per frame. It does some
general calculations with the warppoints. It can surely be sped up by
ordinary peephole optimization, but ASM will not be needed. 


generate_GMCimageMB()  is the main warping part: the reference image
is transformed into the globally transformed image. This routine does the
warping for one 16x16 block and calculates the "average motion vector"
for the block. 
In decoding, it is called for every MB which is GMC-coded. In encoding, it
is called for every block (in a big loop by the routine
generate_GMCimage() ). In the maximal case, the routine
generate_GMCimageMB() does some calculations once for _every_ pixel in the
image, luma as well as chroma. So it has to be fast! Much faster than it
is now. 

generate_GMCimageMB() contains an ordinary loop 
	for (J=16*mj;J<16*(mj+1);J++)
	for (I=16*mi;I<16*(mi+1);I++)

so every pixel of the block is treated separately. Maybe this can be
slightly parallelized? 

Anyway, a first step would be to change these 

int F= i0s + ( ((-r*i0s+i1ss)*I + (r*j0s-j1ss)*J +(1<<(alpha+rho-1))) >>  (alpha+rho) );
int G= j0s + ( ((-r*j0s+j1ss)*I + (-r*i0s+i1ss)*J +(1<<(alpha+rho-1))) >> (alpha+rho) );

difficult and wasteful calculations to an incremental approach: only add
something at every step instead of multiplying. 
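
To make the incremental idea concrete, here is a rough sketch, not the
actual XviD code. The WarpParams struct, the function names and the output
arrays are made up for illustration; only the variable names inside the
formulas come from the lines above. The term inside the shift is linear in
I and J, so it can be kept in an accumulator that gets one addition per
pixel and one per row:

```c
#include <assert.h>

/* Hypothetical container for the per-frame constants used in the formulas
 * above. The field names follow the mail; the struct itself is assumed. */
typedef struct {
    int i0s, j0s;    /* first warp point */
    int i1ss, j1ss;  /* second warp point */
    int r;           /* scale factor */
    int shift;       /* alpha + rho */
} WarpParams;

/* Direct evaluation as quoted above: several multiplications per pixel. */
void warp_direct(const WarpParams *w, int I, int J, int *F, int *G)
{
    const int a = -w->r * w->i0s + w->i1ss;
    const int b =  w->r * w->j0s - w->j1ss;
    const int rnd = 1 << (w->shift - 1);
    *F = w->i0s + ((a * I + b * J + rnd) >> w->shift);
    *G = w->j0s + ((-b * I + a * J + rnd) >> w->shift);
}

/* Incremental evaluation for the 16x16 block (mi, mj).
 * outF/outG receive the 256 positions in raster order. */
void warp_block_incremental(const WarpParams *w, int mi, int mj,
                            int outF[256], int outG[256])
{
    const int a = -w->r * w->i0s + w->i1ss;
    const int b =  w->r * w->j0s - w->j1ss;
    const int rnd = 1 << (w->shift - 1);
    int rowF, rowG, i, j;

    assert(w->shift > 0);

    /* Accumulators at the top-left pixel of the block (multiply once). */
    rowF =  a * (16 * mi) + b * (16 * mj) + rnd;
    rowG = -b * (16 * mi) + a * (16 * mj) + rnd;

    for (j = 0; j < 16; j++) {
        int accF = rowF, accG = rowG;
        for (i = 0; i < 16; i++) {
            /* Same shift as the direct form, applied to the running sum. */
            outF[16 * j + i] = w->i0s + (accF >> w->shift);
            outG[16 * j + i] = w->j0s + (accG >> w->shift);
            accF += a;   /* I -> I+1 */
            accG -= b;
        }
        rowF += b;       /* J -> J+1 */
        rowG += a;
    }
}

/* Self-check: number of pixels where both variants disagree. */
int count_warp_mismatches(const WarpParams *w, int mi, int mj)
{
    int F[256], G[256], f, g, i, j, bad = 0;
    warp_block_incremental(w, mi, mj, F, G);
    for (j = 0; j < 16; j++)
        for (i = 0; i < 16; i++) {
            warp_direct(w, 16 * mi + i, 16 * mj + j, &f, &g);
            if (f != F[16 * j + i] || g != G[16 * j + i])
                bad++;
        }
    return bad;
}
```

Since the accumulator holds exactly the value that the direct formula
computes, both variants should give identical F and G as long as nothing
overflows. Calling count_warp_mismatches() with some test parameters is a
cheap way to check that before touching the real routine.
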

The values of F and G are taken as x- and y-positions in memory, and the
corresponding pixels are bilinearly interpolated: 

		Y00 = pRef->y[ G*stride + F ];				// Lumi values
		Y01 = pRef->y[ G*stride + F+1 ];
		Y10 = pRef->y[ G*stride + F+stride ];
		Y11 = pRef->y[ G*stride + F+stride+1 ];
		
		/* bilinear interpolation */
		Y00 = ((s-ri)*Y00 + ri*Y01);
		Y10 = ((s-ri)*Y10 + ri*Y11);
		Y00 = ((s-rj)*Y00 + rj*Y10 + s*s/2 - rounding ) >>(sigma+sigma); 

I really hope there is a better way of doing this. Prefetching might
help, too. Andreas had some good ideas, e.g. the interpolation can be sped
up if Y00, Y01, Y10 and Y11 are identical, or at least 2 of them.
Also note that very often (almost always)  G*stride+F  for the current
pixel is within +-1 positions of the previous pixel, etc. 
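
The "identical pixels" idea could look roughly like the sketch below. This
is only an illustration under some assumptions: s = 1<<sigma, ri and rj lie
in 0..s-1, rounding is 0 or 1, and all function names are made up. When all
four pixels are equal the weights cancel and the result is the pixel
itself; when each row holds two equal pixels, the horizontal blend
collapses to a shift:

```c
#include <assert.h>

/* Full bilinear interpolation, same arithmetic as the snippet above. */
int bilinear_full(int y00, int y01, int y10, int y11,
                  int ri, int rj, int sigma, int rounding)
{
    const int s = 1 << sigma;
    const int top = (s - ri) * y00 + ri * y01;
    const int bot = (s - ri) * y10 + ri * y11;
    return ((s - rj) * top + rj * bot + s * s / 2 - rounding)
           >> (sigma + sigma);
}

/* Variant with early exits for (partly) identical pixels. */
int bilinear_shortcut(int y00, int y01, int y10, int y11,
                      int ri, int rj, int sigma, int rounding)
{
    const int s = 1 << sigma;

    assert(sigma >= 1 && (rounding == 0 || rounding == 1));

    if (y00 == y01 && y10 == y11) {
        if (y00 == y10)
            return y00;  /* flat 2x2 patch: no arithmetic needed */
        /* Each row is flat, so only the vertical blend remains. */
        return ((((s - rj) * y00 + rj * y10) << sigma)
                + s * s / 2 - rounding) >> (sigma + sigma);
    }
    return bilinear_full(y00, y01, y10, y11, ri, rj, sigma, rounding);
}

/* Self-check: compare both variants on patches with flat rows. */
int count_shortcut_mismatches(int sigma)
{
    static const int vals[] = { 0, 1, 128, 255 };
    const int s = 1 << sigma;
    int rounding, ri, rj, a, b, bad = 0;

    for (rounding = 0; rounding <= 1; rounding++)
        for (ri = 0; ri < s; ri++)
            for (rj = 0; rj < s; rj++)
                for (a = 0; a < 4; a++)
                    for (b = 0; b < 4; b++) {
                        const int p = vals[a], q = vals[b];
                        if (bilinear_shortcut(p, p, q, q, ri, rj, sigma, rounding)
                            != bilinear_full(p, p, q, q, ri, rj, sigma, rounding))
                            bad++;
                    }
    return bad;
}
```

Whether the extra comparisons pay off depends on how often such flat
patches occur in real footage, so this would have to be measured. The case
where the two columns are equal (Y00 == Y10 and Y01 == Y11) can be handled
the same way and is left out here.
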


However, I have to admit that the compiler already does a great job
here! On an AthlonXP 1600+ (1.4GHz, hardware prefetch), compiled with
gcc 3.1, one call to generate_GMCimageMB() takes 21 us. Interpolation of
a whole CIF image takes 8.2 ms, and of a 720x576 image 33.5 ms. In the
end, encoding runs at half the speed it does without GMC... 

gruel