[XviD-devel] A SSIM Plugin for XviD

Thu Oct 26 18:54:29 CEST 2006

   Hi Johannes and all,

> Message of 13/10/06 19:37
> From: "Johannes Reinhardt" <Johannes.Reinhardt at uni-konstanz.de>

> >> The SSE2 implementation of consim is not faster than the mmx version
> >> with all CPUs (Pentium IV and Pentium M) I tested. Is there a chance
> >> to speed it up or should I disable SSE2? Or is SSE2 perhaps faster
> >> on other CPUs?

   Note: the mmx version, consim_mmx, uses 'pshufw', which is an SSE+
   instruction (so it is not pure MMX).

> >>     
> >
> >    I didn't had a deeper look at the ASM yet, because as said,
> >    there's one thing to decide first: are you sure you want a 
> >    square window for filtering? :)

> hmm, I think the gaussian window will be _very_ slow. I will try to 
> implement it as a extra mode, but I am not sure if it will be useful or 
> usable. Perhaps it could be interesting to see how coarse the 
> approximation used by most implementations is.

   Agreed. Note: I've committed a slightly faster version of lum_8x8_mmx.

   Anyway, I had a look at the C version of consim, and I'm not sure
   it couldn't be restructured into something faster (and *then*
   optimized in SSE ;). If I get you right, you're computing the
   deviations as <(a-<a>)(b-<b>)>, where < > is the average operator
   \sum_i{a_i} / N (and this is where it could also be the weighted
   average \sum_i{a_i w_i} / \sum_i{w_i}).

   Now, we have <(a-<a>)(b-<b>)> = <ab> - <a><b>, which is lighter
   (fewer subtractions).
   So the loop could be something like:

============
        /* Accumulate raw second-order sums over the 8x8 block. */
        int valo, valc, devo = 0, devc = 0, corr = 0;
        int i, j;
        for (i = 0; i < 8; i++) {
                for (j = 0; j < 8; j++) {
                        valo = ptro[j];
                        valc = ptrc[j];
                        devo += valo * valo;
                        devc += valc * valc;
                        corr += valo * valc;
                }
                ptro += stride;
                ptrc += stride;
        }
        /* Remove the mean terms: sum(ab) - N*<a><b>, with N = 64. */
        devo -= 64 * lumo * lumo;
        devc -= 64 * lumc * lumc;
        corr -= 64 * lumo * lumc;
        *pdevo = devo;
        *pdevc = devc;
        *pcorr = corr;
============

     But we have a precision problem around lumo/lumc, which are already
     descaled by 64. (Oh, and btw: using (meanc+32)>>6 instead of just
     meanc>>6 would round better. And btw2: at line 267 of
     plugin_ssim.c, 'fmeanc' and 'fmeano' are not the means per se, but
     the sums of the coeffs, without the /64, so I don't know if the
     formula is ok.)
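
     One possible way around the precision problem, as a rough sketch
     only: keep the raw sums and scale everything by 64, so that no mean
     gets rounded before the subtraction. The names consim_exact,
     mean_trunc and mean_round are hypothetical (they are not in
     plugin_ssim.c), and the outputs here are 64 times larger than in
     the loop above, so a caller would have to account for that scale.

```c
#include <stdint.h>

/* Hypothetical helpers, only to contrast the two descaling variants. */
static int mean_trunc(int sum) { return sum >> 6; }        /* truncates      */
static int mean_round(int sum) { return (sum + 32) >> 6; } /* rounds nearest */

/*
 * Hypothetical variant of the loop above (not the plugin's code).
 * It accumulates the raw sums of an 8x8 block and returns
 *   64 * sum((a - <a>)(b - <b>)) = 64 * sum(ab) - sum(a) * sum(b)
 * which is exact in integer arithmetic. With 8-bit input every term
 * stays below 2^31, so a 32-bit int is assumed to be enough.
 */
static void consim_exact(const uint8_t *ptro, const uint8_t *ptrc, int stride,
                         int *pdevo, int *pdevc, int *pcorr)
{
    int sumo = 0, sumc = 0, devo = 0, devc = 0, corr = 0;
    int i, j;

    for (i = 0; i < 8; i++) {
        for (j = 0; j < 8; j++) {
            int valo = ptro[j];
            int valc = ptrc[j];
            sumo += valo;
            sumc += valc;
            devo += valo * valo;
            devc += valc * valc;
            corr += valo * valc;
        }
        ptro += stride;
        ptrc += stride;
    }

    /* Subtract the mean terms without ever dividing by 64. */
    *pdevo = 64 * devo - sumo * sumo;
    *pdevc = 64 * devc - sumc * sumc;
    *pcorr = 64 * corr - sumo * sumc;
}
```

     The price is one extra multiply by 64 (a shift) per output and two
     more accumulators in the loop; whether that is worth it compared to
     simply using the rounded mean is something to measure.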

     Waiting for your updated c-version now :)

Skal