[XviD-devel] mulithread rework

Con Kolivas kernel at kolivas.org
Wed Apr 23 01:26:05 CEST 2008


> Con Kolivas wrote:
> > It doesn't sound particularly scalable, but if the macroblocks above
> > are always ahead of the next thread and there's some decent locking, I
> > can see how it can be beneficial without altering the motion estimation
> > at all. One of the problems with very short-lived threads is that they
> > are spawned on the same CPU and may not move to another CPU in any
> > reasonable time. I'm not sure if you have any way of detecting how many
> > CPUs the machine has (like an auto mode for thread number), but if that
> > was there you could make exactly one thread per CPU and physically bind
> > them at the time of pthread_create.
> >
> > I'll see what I can do. Don't expect miracles or anything soon (unless
> > I get really enthusiastic) :-)

I've given it a reasonable run with proper locking and I'm afraid the
news isn't that great.

If you look back at the original code and its performance, you'll see
that more than one CPU is never fully utilised, and utilisation drops
further the more CPUs you have. Using sched_yield as locking and
looping back to the start of the loop means some CPU is burnt doing
basically nothing; but since the CPU is not fully utilised anyway,
that thread's progress further into the loop is not CPU-limited.

Converting it to strict locking performs at virtually the same rate of
frame encoding, but with much less CPU usage. In other words, no CPU
whatsoever is wasted burning in the yield() loopback. The reason we
don't gain any encoding speed is, as I said previously, that it isn't
CPU limitation preventing further speed gains; it's the fact that each
successive thread depends entirely on the progress of the previous
thread before it can make any forward progress. This is what I was
worried about earlier when I said it didn't sound particularly
scalable.

Does this change provide any useful advantage? Well, more efficient
CPU usage is always a worthwhile change, even if the overall
performance doesn't move, because it leaves the CPU available for
other tasks on the machine; i.e. the user will feel the slowdown from
xvid encoding less. However, there is a serious misconception about
CPU usage and performance in the community (just see all the doom9
threads): the common yardstick for performance is how much the CPUs
are utilised (you even hinted at it in your email ;-)). So in essence
ordinary users will think the performance has dropped, despite the
fact that FPS per unit of CPU has increased.

So are there any other potential gains? Actual encoded FPS improves
slightly if all the threads are created in advance rather than created
and destroyed for each frame being encoded: thread creation is not
exactly instantaneous, and pre-creating the threads also lets CPU
balancing move them around in advance rather than always starting them
on the same CPU as the parent process. Even this, however, does not
amount to much improvement (I can't give you a firm figure, but
eyeballing the FPS was not impressive).

Anyway, for the moment, I don't have any magic bullets for speeding up
the encoding. There simply isn't enough work for each thread to do
before it blocks waiting on the thread ahead of it. Unless there is
some other way of parallelising the workload such that each thread has
more work to do before blocking, I think you're pretty close to the
ceiling of this approach's scalability. I'll look to see if there's
anything else that can be done, but I'm not optimistic about the
chances.

It was fun playing though :-)

Regards
--
-ck

