Re[2]: [XviD-devel] New Motion Estimation (from sysKin) committed to branch

Michael Militzer xvid-devel@xvid.org
Tue, 24 Sep 2002 17:06:50 +0200


Hi,

----- Original Message -----
From: "Radoslaw 'sysKin' Czyz" <syskin@bigpond.com>
To: <xvid-devel@xvid.org>
Sent: Tuesday, September 24, 2002 4:29 PM
Subject: Re[2]: [XviD-devel] New Motion Estimation (from sysKin) committed
to branch

> >> I'll re-do the changes to the cvs' encoder.c and post them, ok?
>
> > Sure, please do.
>
> OK I did. I attach all affected files, it should work (I'll never
> write 'it will work' again ;> ) when old files are just replaced with
> new ones.
>
> Now, allow me to write a small 'todo' list.
> What we need to make it work even better, is:
> - mmx-ed sad16v. sad16v is a base of entire P-frame search if inter4v
> is on, but we currently only have XMM-ed version. All other
> architectures will also use cpu-specific, but not optimal, code.
>
> It's defined as follows:
> uint32_t sad16v(const uint8_t * const cur,
>                 const uint8_t * const ref,
>                 const uint32_t stride,
>                 int32_t *sad)
> {
>         sad[0] = sad8(cur, ref, stride);
>         sad[1] = sad8(cur + 8, ref + 8, stride);
>         sad[2] = sad8(cur + 8*stride, ref + 8*stride, stride);
>         sad[3] = sad8(cur + 8*stride + 8, ref + 8*stride + 8, stride);
>
>         return sad[0]+sad[1]+sad[2]+sad[3];
> }

shouldn't be a problem.
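
For checking an MMX port against known-good output, a plain-C reference
would be handy. Below is a minimal sketch, assuming sad8 is the usual sum of
absolute differences over an 8x8 block. The names sad8_ref and sad16v_ref
are made up for this example and are not the functions in CVS. It fills the
four 8x8 SADs in a single pass over the 16x16 block:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical reference: SAD over one 8x8 block. */
uint32_t sad8_ref(const uint8_t *cur, const uint8_t *ref, uint32_t stride)
{
    uint32_t sum = 0;
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++)
            sum += (uint32_t)abs((int)cur[x] - (int)ref[x]);
        cur += stride;
        ref += stride;
    }
    return sum;
}

/* Hypothetical reference with the same contract as the quoted sad16v:
 * sad[0..3] = top-left, top-right, bottom-left, bottom-right 8x8 SADs,
 * return value = their sum. Walks the 16x16 block once. */
uint32_t sad16v_ref(const uint8_t *cur, const uint8_t *ref,
                    uint32_t stride, int32_t *sad)
{
    sad[0] = sad[1] = sad[2] = sad[3] = 0;
    for (int y = 0; y < 16; y++) {
        int32_t *pair = sad + 2 * (y / 8);   /* top pair or bottom pair */
        for (int x = 0; x < 16; x++)
            pair[x / 8] += abs((int)cur[x] - (int)ref[x]);
        cur += stride;
        ref += stride;
    }
    return (uint32_t)(sad[0] + sad[1] + sad[2] + sad[3]);
}
```

An optimized version could then be compared against sad16v_ref on random
blocks, checking both the return value and all four entries of sad[].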

> - hinted ME & B-frames. Well it shouldn't be difficult to fix, I have
> no idea why it doesn't work. I mean for P-frames only, of course. It
> speeds up 2nd pass without affecting quality (or it's increasing it a
> bit).
>
> - first pass P/B decisions are not saved for second pass. Again, it's
> a waste of speed, and again I have no idea how to do it.
>
> OK, I don't know what else ;>

hm, I have some additional ideas:

1) Do the INTRA/INTER decision as early as possible: I tried this yesterday
with an early INTRA check just before the halfpel (16) refine. Final filesize
even got slightly smaller (but I don't think one can generalize this
behaviour: we all know that the INTER_BIAS isn't always perfect and that
good results can be achieved with lots of values depending on the input
material...). Even though one saves the 4MV search + the refine16 +
4*refine8 for most INTRA blocks, speed was only slightly higher, and I don't
know why... :-(
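
A minimal sketch of such an early check, assuming the test compares the
block's deviation from its own mean against the full-pel SAD minus a bias.
The names dev16_sketch and early_intra_sketch and the bias value are
assumptions for illustration only, not the code in CVS:

```c
#include <stdint.h>
#include <stdlib.h>

#define INTER_BIAS_SKETCH 512u   /* assumed value; depends on the material */

/* Sum of absolute deviations from the block mean over 16x16 pixels. */
uint32_t dev16_sketch(const uint8_t *cur, uint32_t stride)
{
    uint32_t sum = 0, dev = 0;
    const uint8_t *p = cur;
    for (int y = 0; y < 16; y++, p += stride)
        for (int x = 0; x < 16; x++)
            sum += p[x];
    int mean = (int)(sum / 256);
    p = cur;
    for (int y = 0; y < 16; y++, p += stride)
        for (int x = 0; x < 16; x++)
            dev += (uint32_t)abs((int)p[x] - mean);
    return dev;
}

/* Returns 1 if the MB looks cheaper as INTRA, judged from the full-pel
 * SAD alone, so the 4MV search and all halfpel refinement can be skipped. */
int early_intra_sketch(const uint8_t *cur, uint32_t stride,
                       uint32_t sad16_fullpel)
{
    if (sad16_fullpel <= INTER_BIAS_SKETCH)
        return 0;                /* match is already good: stay INTER */
    return dev16_sketch(cur, stride) < sad16_fullpel - INTER_BIAS_SKETCH;
}
```

Since the full-pel SAD is an upper bound on what the refined SAD would be,
this check leans a bit more towards INTRA than the same check done after
refinement.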

2) Do the INTER/INTER4V decision before any halfpel refine: The idea is to
first do the normal search16, then do NO halfpel refine (instead maybe do the
early INTRA/INTER decision mentioned above) and start the 4MV search (search8).

Also do not refine the search8 results yet. OK, now we have the minimum SADs
from the search16 and the search8, and with this information we can perform
the INTER/INTER4V decision. Depending on which mode is chosen, we finally
refine the result (either a halfpel16_refine or 4xrefine8). I have no idea
whether this works out at all, or how much worse this quick decision is than
the correct way.
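
The decision step could be sketched roughly as follows. All names here are
hypothetical, and the bias that penalizes the four extra vectors of INTER4V
is an assumption, not a tuned value:

```c
#include <stdint.h>

typedef enum { MODE_INTER_SKETCH, MODE_INTER4V_SKETCH } mode_sketch;

/* Choose the mode from the UNREFINED (full-pel) SADs only.
 * sad16  : minimum SAD from search16
 * sad8   : minimum SADs of the four 8x8 blocks from search8
 * bias4v : assumed penalty for coding four vectors instead of one */
mode_sketch choose_inter_mode_sketch(uint32_t sad16, const uint32_t sad8[4],
                                     uint32_t bias4v)
{
    uint32_t sum8 = sad8[0] + sad8[1] + sad8[2] + sad8[3];
    return (sum8 + bias4v < sad16) ? MODE_INTER4V_SKETCH : MODE_INTER_SKETCH;
}

/* Halfpel refinement passes still to run once the mode is fixed:
 * one refine16 for INTER, four refine8 for INTER4V. Refining before the
 * decision would always cost all five. */
int refines_needed_sketch(mode_sketch mode)
{
    return mode == MODE_INTER4V_SKETCH ? 4 : 1;
}
```

For example, sad16 = 1000 with four sad8 values of 200 and a bias of 100
gives 900 < 1000, so INTER4V is picked and only the four refine8 passes run.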

I quickly tried to implement this idea in sysKin's ME yesterday, but
because a lot of pointers were used (SearchData etc.) I couldn't easily keep
track of which data gets modified, and obviously the search8 step
destroyed some information from my earlier search16 :-( Is it that much
slower to really store all results within SearchData instead of using
pointers to common variables? Maybe it's a good idea not to throw away any
information at all (best MV before refinement etc.): lots of computing power
has been invested to calculate those intermediate results, and you never know
whether they might be useful again in the future...
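
To make the "store everything by value" idea concrete, here is a minimal
sketch. The struct layout and all names are assumptions for illustration;
this is not the actual SearchData:

```c
#include <stdint.h>

typedef struct { int16_t x, y; } vec_sketch;

/* One search result: vector, its SAD, and whether it has been filled in. */
typedef struct {
    vec_sketch mv;
    uint32_t   sad;
    int        valid;
} result_sketch;

/* Every stage gets its own slot, so a later stage cannot overwrite an
 * earlier one by accident. */
typedef struct {
    result_sketch full16;     /* search16, before refinement */
    result_sketch half16;     /* after halfpel16 refine      */
    result_sketch full8[4];   /* search8, before refinement  */
    result_sketch half8[4];   /* after refine8               */
} search_results_sketch;

void store_result_sketch(result_sketch *slot, int16_t x, int16_t y,
                         uint32_t sad)
{
    slot->mv.x  = x;
    slot->mv.y  = y;
    slot->sad   = sad;
    slot->valid = 1;
}
```

The cost is a few extra stores per macroblock, and whether that is
measurable next to the SAD calls would have to be benchmarked.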

I don't know if this idea speeds up halfpel mode a lot, but because most MBs
are INTER rather than INTER4V, one could skip the halfpel8 refinement for
most MBs, which might not be too bad... And when quarterpel refinement comes
into play, the speed gain might be even more noticeable.

3) I'd like to have a small data structure where the SADs of the neighbouring
blocks (top, bottom, left, right) of the current best match are stored.
Now that I think about it, all 8 neighbours would be even better. This
could allow faster qpel refinement...
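
Such a structure could be as small as a 3x3 grid. The following is only a
sketch with hypothetical names; picking the cheapest neighbour as a hint for
where to refine first is my assumption about how it might be used:

```c
#include <stdint.h>

#define SAD_UNKNOWN_SKETCH UINT32_MAX   /* position not evaluated yet */

/* SADs around the current best match, indexed [dy + 1][dx + 1];
 * the centre entry [1][1] is the best match itself. */
typedef struct {
    uint32_t sad[3][3];
} sad_grid_sketch;

void grid_init_sketch(sad_grid_sketch *g)
{
    for (int j = 0; j < 3; j++)
        for (int i = 0; i < 3; i++)
            g->sad[j][i] = SAD_UNKNOWN_SKETCH;
}

/* Record the SAD at offset (dx, dy); offsets outside -1..1 are ignored. */
void grid_store_sketch(sad_grid_sketch *g, int dx, int dy, uint32_t sad)
{
    if (dx >= -1 && dx <= 1 && dy >= -1 && dy <= 1)
        g->sad[dy + 1][dx + 1] = sad;
}

/* Offset of the cheapest known neighbour; (0, 0) if none is known. */
void grid_best_neighbour_sketch(const sad_grid_sketch *g, int *dx, int *dy)
{
    uint32_t best = SAD_UNKNOWN_SKETCH;
    *dx = *dy = 0;
    for (int j = -1; j <= 1; j++)
        for (int i = -1; i <= 1; i++) {
            if (i == 0 && j == 0)
                continue;               /* skip the centre */
            if (g->sad[j + 1][i + 1] < best) {
                best = g->sad[j + 1][i + 1];
                *dx = i;
                *dy = j;
            }
        }
}
```

The grid would have to be reset or shifted whenever the best match moves,
since the stored SADs are relative to the centre position.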

bye,
Michael