[XviD-devel] multithread rework

Tue Apr 22 16:39:33 CEST 2008

On Tue, 22 Apr 2008 23:44:41 Radek Czyz wrote:
> Hello, I'm the author of this code (don't laugh ;P) so I should be able
> to help.
>
> First, this code doesn't actually divide picture into slices. Output is
> identical regardless of number of threads - this was my design goal and
> that's the reason why it works the way it does. I'm not particularly
> attached to this design goal though, it was more of "can it be done"
> experiment.
>
> In order to encode a macroblock, a thread needs a macroblock on
> top/right already encoded.
>
> So, two things happen:
>
> * each thread updates a count of "how many macroblocks are encoded in
> this row" under its "complete_count_self" memory location.
>
> * each thread encodes no more blocks than "complete_count_above", which
> is again a memory location and it's updated by whatever thread works on
> the macroblock row above it.
>
> Thread initialization code ensures that the threads for row N and row
> N+1 share the same pointer in order to communicate. The single memory
> location is updated by the thread for row N and read by the thread for
> row N+1. All synchronization happens with this set of memory locations,
> nothing more is needed.
>
> yield() simply means that current thread can't encode a block because
> the thread above it didn't manage to encode on time (or perhaps we're
> getting older value because of caches, in which case there's still no
> harm done).
>
> Hopefully this helps
> Radek
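
The scheme described in the quoted message can be sketched roughly as
follows. This is a minimal illustration, not the actual XviD code: the
row and column counts, the yield_* names and run_yield_wavefront() are
hypothetical, the "encode" step is a stand-in that only checks the
dependency, and C11 atomics are an assumption made here so the sketch
is well defined on its own.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define MB_ROWS 4  /* hypothetical: one thread per macroblock row */
#define MB_COLS 8  /* hypothetical row width in macroblocks */

/* Slot r + 1 is written by the thread for row r and read by row r + 1.
 * Slot 0 stands in for a finished "row above the top row". */
static atomic_int yield_count[MB_ROWS + 1];
static int yield_done[MB_ROWS][MB_COLS];

typedef struct {
    int row;
    int violations;                   /* blocks encoded before their neighbour */
    atomic_int *complete_count_above; /* updated by the thread one row up */
    atomic_int *complete_count_self;  /* updated by this thread only */
} yield_row_ctx;

static void *encode_row_yield(void *arg)
{
    yield_row_ctx *ctx = arg;
    for (int x = 0; x < MB_COLS; x++) {
        /* Block x needs block x + 1 of the row above, clamped at the row end. */
        int need = (x + 2 < MB_COLS) ? x + 2 : MB_COLS;
        while (atomic_load_explicit(ctx->complete_count_above,
                                    memory_order_acquire) < need)
            sched_yield(); /* row above is not far enough ahead yet */

        /* Stand-in for the real encode: check the dependency, mark the block. */
        if (ctx->row > 0 && !yield_done[ctx->row - 1][need - 1])
            ctx->violations++;
        yield_done[ctx->row][x] = 1;

        atomic_store_explicit(ctx->complete_count_self, x + 1,
                              memory_order_release);
    }
    return NULL;
}

/* Runs one "frame" and returns the number of dependency violations. */
int run_yield_wavefront(void)
{
    pthread_t tid[MB_ROWS];
    yield_row_ctx ctx[MB_ROWS];
    int violations = 0;

    atomic_store(&yield_count[0], MB_COLS);
    for (int r = 0; r < MB_ROWS; r++) {
        atomic_store(&yield_count[r + 1], 0);
        for (int x = 0; x < MB_COLS; x++)
            yield_done[r][x] = 0;
    }
    for (int r = 0; r < MB_ROWS; r++) {
        ctx[r] = (yield_row_ctx){ r, 0, &yield_count[r], &yield_count[r + 1] };
        int rc = pthread_create(&tid[r], NULL, encode_row_yield, &ctx[r]);
        assert(rc == 0);
        (void)rc;
    }
    for (int r = 0; r < MB_ROWS; r++) {
        pthread_join(tid[r], NULL);
        violations += ctx[r].violations;
    }
    return violations;
}
```

Calling run_yield_wavefront() from a test main() should return 0 if
every block was encoded only after its top/right neighbour. Each counter
has exactly one writer and one reader, which is why no lock is needed in
this variant.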

Thanks for your reply. I certainly would not laugh at code that works.

I'm not sure I entirely understand based on what you've said, but at least I
know what to look for now :-)

Yield isn't really harmless here, though it more or less works (try removing
it and see). It gets less and less efficient when the threads are on
different CPUs. If you assume only one thread is on each CPU, then when that
thread yields it just yields back to itself, so the call is effectively a
no-op and the thread simply spins.

To retain your code as is, if I'm not mistaken, the thread which has encoded
the top/right macroblock should signal the thread below that it now has
work to do, and that thread should really block until then instead of
wasting CPU cycles.

Now just to get this straight, if a thread updates the value of its
complete_count_self (which appears to happen at the end of the loop), then
this would be the time to signal the thread below that it has more work to
do?
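
A blocking variant of that idea might look something like the sketch
below, with one mutex and condition variable per row counter. This is
only an illustration under assumptions: row_progress, wait_for_count(),
publish_count() and run_blocking_wavefront() are hypothetical names, and
the encode step is again a stand-in that just checks the dependency.

```c
#include <assert.h>
#include <pthread.h>

#define MB_ROWS 4  /* hypothetical: one thread per macroblock row */
#define MB_COLS 8  /* hypothetical row width in macroblocks */

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  advanced;       /* signalled when complete_count grows */
    int             complete_count; /* macroblocks finished in this row */
} row_progress;

/* Slot 0 is a dummy, already complete "row above the top row". */
static row_progress blk_progress[MB_ROWS + 1];
static int blk_done[MB_ROWS][MB_COLS];

typedef struct {
    int row;
    int violations;
    row_progress *above; /* read here, written by the thread one row up */
    row_progress *self;  /* written by this thread only */
} blocking_row_ctx;

static void wait_for_count(row_progress *p, int need)
{
    pthread_mutex_lock(&p->lock);
    while (p->complete_count < need) /* loop guards against spurious wakeups */
        pthread_cond_wait(&p->advanced, &p->lock);
    pthread_mutex_unlock(&p->lock);
}

static void publish_count(row_progress *p, int count)
{
    pthread_mutex_lock(&p->lock);
    p->complete_count = count;
    pthread_cond_signal(&p->advanced); /* at most one waiter: the row below */
    pthread_mutex_unlock(&p->lock);
}

static void *encode_row_blocking(void *arg)
{
    blocking_row_ctx *ctx = arg;
    for (int x = 0; x < MB_COLS; x++) {
        int need = (x + 2 < MB_COLS) ? x + 2 : MB_COLS;
        wait_for_count(ctx->above, need); /* sleeps instead of yielding */

        /* Stand-in for the real encode: check the dependency, mark the block. */
        if (ctx->row > 0 && !blk_done[ctx->row - 1][need - 1])
            ctx->violations++;
        blk_done[ctx->row][x] = 1;

        publish_count(ctx->self, x + 1);  /* end of loop: wake the row below */
    }
    return NULL;
}

/* Runs one "frame" and returns the number of dependency violations. */
int run_blocking_wavefront(void)
{
    pthread_t tid[MB_ROWS];
    blocking_row_ctx ctx[MB_ROWS];
    int violations = 0;

    for (int r = 0; r <= MB_ROWS; r++) {
        pthread_mutex_init(&blk_progress[r].lock, NULL);
        pthread_cond_init(&blk_progress[r].advanced, NULL);
        blk_progress[r].complete_count = (r == 0) ? MB_COLS : 0;
    }
    for (int r = 0; r < MB_ROWS; r++)
        for (int x = 0; x < MB_COLS; x++)
            blk_done[r][x] = 0;
    for (int r = 0; r < MB_ROWS; r++) {
        ctx[r] = (blocking_row_ctx){ r, 0, &blk_progress[r], &blk_progress[r + 1] };
        int rc = pthread_create(&tid[r], NULL, encode_row_blocking, &ctx[r]);
        assert(rc == 0);
        (void)rc;
    }
    for (int r = 0; r < MB_ROWS; r++) {
        pthread_join(tid[r], NULL);
        violations += ctx[r].violations;
    }
    for (int r = 0; r <= MB_ROWS; r++) {
        pthread_cond_destroy(&blk_progress[r].advanced);
        pthread_mutex_destroy(&blk_progress[r].lock);
    }
    return violations;
}
```

Taking a lock and signalling once per macroblock has its own cost, so a
real implementation would probably want to measure whether signalling
every few blocks, or only when the row below is actually waiting, works
out better.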

It doesn't sound particularly scalable, but if the macroblocks above are
always ahead of the next thread and there's some decent locking I can
see how it can be beneficial without altering the motion estimation at all.
One of the problems with very short lived threads is they are spawned on the
same CPU and may not move to another CPU in any reasonable time. I'm not sure
if you have any way of detecting how many CPUs the machine has (like an auto
mode for thread number), but if that was there you could make exactly 1
thread per CPU and physically bind them at the time of pthread_create.
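
On Linux with glibc, one way to do that might look like the sketch
below. It is non-portable and makes assumptions: it uses
sched_getaffinity() to find the CPUs the process may run on (rather than
sysconf(_SC_NPROCESSORS_ONLN), which can report CPUs the process is not
allowed to use) and pthread_attr_setaffinity_np() to bind each thread
before it starts. MAX_WORKERS, worker_info and spawn_pinned_workers()
are hypothetical names for illustration.

```c
#define _GNU_SOURCE /* for CPU_SET and the *_np affinity calls (glibc) */
#include <assert.h>
#include <pthread.h>
#include <sched.h>

#define MAX_WORKERS 64 /* hypothetical cap to keep the sketch simple */

typedef struct {
    int cpu;    /* CPU this worker was bound to at creation */
    int pinned; /* set by the worker if its mask is exactly {cpu} */
} worker_info;

static void *pinned_worker(void *arg)
{
    worker_info *w = arg;
    cpu_set_t mask;

    CPU_ZERO(&mask);
    if (pthread_getaffinity_np(pthread_self(), sizeof(mask), &mask) == 0)
        w->pinned = CPU_ISSET(w->cpu, &mask) && CPU_COUNT(&mask) == 1;
    /* ... the real worker would encode its macroblock rows here ... */
    return NULL;
}

/* Starts one thread per usable CPU, each bound to its own CPU.
 * Stores the number of threads started in *launched and returns how many
 * of them saw the expected single-CPU mask, or -1 on error. */
int spawn_pinned_workers(int *launched)
{
    pthread_t tid[MAX_WORKERS];
    worker_info info[MAX_WORKERS];
    cpu_set_t allowed;
    int n = 0, ok = 0;

    assert(launched != NULL);
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE && n < MAX_WORKERS; cpu++) {
        pthread_attr_t attr;
        cpu_set_t one;

        if (!CPU_ISSET(cpu, &allowed))
            continue; /* this process may not run on that CPU */

        CPU_ZERO(&one);
        CPU_SET(cpu, &one);
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(one), &one);

        info[n].cpu = cpu;
        info[n].pinned = 0;
        if (pthread_create(&tid[n], &attr, pinned_worker, &info[n]) == 0)
            n++;
        pthread_attr_destroy(&attr);
    }
    for (int i = 0; i < n; i++) {
        pthread_join(tid[i], NULL);
        ok += info[i].pinned;
    }
    *launched = n;
    return ok;
}
```

The number of threads started doubles as the "auto" thread count. Other
platforms need their own calls for both the CPU count and the binding,
so this part would have to sit behind the existing portability layer.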

I'll see what I can do. Don't expect miracles or anything soon (unless I get 
really enthusiastic) :-)

-- 
-ck