[XviD-devel] multithread rework

Tue Apr 22 17:35:59 CEST 2008

Oh, and one more thing: to avoid a thread yield()ing to itself, you can 
set more threads than cores. It can be shown that a thread which 
yields then always does so because the third thread (the one which is 
not running) can do useful work.

BUT guess what, this slows down encoding as well (although it looks more 
impressive in CPU usage graphs, heh)....

R

Con Kolivas wrote:
> On Tue, 22 Apr 2008 23:44:41 Radek Czyz wrote:
>> Hello, I'm the author of this code (don't laugh ;P) so I should be able
>> to help.
>>
>> First, this code doesn't actually divide the picture into slices. Output is
>> identical regardless of the number of threads - this was my design goal and
>> that's the reason why it works the way it does. I'm not particularly
>> attached to this design goal though, it was more of a "can it be done"
>> experiment.
>>
>> In order to encode a macroblock, a thread needs the top/right
>> macroblock to be already encoded.
>>
>> So, two things happen:
>>
>> * each thread updates a count of "how many macroblocks are encoded in
>> this row" under its "complete_count_self" memory location.
>>
>> * each thread encodes no more blocks than "complete_count_above", which
>> is again a memory location and it's updated by whatever thread works on
>> the macroblock row above it.
>>
>> Thread initialization code ensures that the threads for row N and row N+1
>> share the same pointer in order to communicate. The single memory
>> location is updated by the thread for row N and read by the thread for
>> row N+1. All synchronization happens through this set of memory
>> locations, nothing more is needed.
>>
>> yield() simply means that the current thread can't encode a block because
>> the thread above it didn't manage to encode its block in time (or perhaps
>> we're getting an older value because of caches, in which case there's
>> still no harm done).
>>
>> Hopefully this helps
>> Radek
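
The counter scheme described in the message quoted above can be sketched
roughly as follows. This is only an illustration, not the actual XviD code:
MB_COLS, MB_ROWS, row_ctx, encode_row and run_rows are hypothetical names,
the one-block lead required for the top/right neighbour is an assumption,
and C11 atomics stand in for the plain memory locations.

```c
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>

#define MB_COLS 8   /* hypothetical: macroblocks per row */
#define MB_ROWS 4   /* hypothetical: rows, one thread per row */

/* Slot 0 is a sentinel for "the row above row 0"; slot r+1 belongs to row r. */
static atomic_int complete_count[MB_ROWS + 1];

typedef struct {
    atomic_int *complete_count_above; /* written by the thread for the row above */
    atomic_int *complete_count_self;  /* written by this thread, read by the row below */
} row_ctx;

static void *encode_row(void *arg)
{
    row_ctx *ctx = arg;
    for (int x = 0; x < MB_COLS; x++) {
        /* Assumption: block x needs x+2 blocks finished above (clamped at row end). */
        int need = (x + 2 < MB_COLS) ? x + 2 : MB_COLS;
        while (atomic_load_explicit(ctx->complete_count_above,
                                    memory_order_acquire) < need)
            sched_yield();            /* row above is not far enough ahead yet */
        /* ... encode macroblock x here (placeholder) ... */
        atomic_store_explicit(ctx->complete_count_self, x + 1,
                              memory_order_release);
    }
    return NULL;
}

/* Runs all rows; returns the last row's final count, or -1 on thread failure. */
int run_rows(void)
{
    pthread_t tid[MB_ROWS];
    row_ctx ctx[MB_ROWS];
    int created = 0;

    atomic_store(&complete_count[0], MB_COLS);  /* nothing above row 0 to wait for */
    for (int r = 0; r < MB_ROWS; r++)
        atomic_store(&complete_count[r + 1], 0);

    for (int r = 0; r < MB_ROWS; r++) {
        /* Row r and row r+1 share one counter: r writes it, r+1 reads it. */
        ctx[r].complete_count_above = &complete_count[r];
        ctx[r].complete_count_self  = &complete_count[r + 1];
        if (pthread_create(&tid[r], NULL, encode_row, &ctx[r]) != 0)
            break;
        created++;
    }
    for (int r = 0; r < created; r++)
        pthread_join(tid[r], NULL);

    return created == MB_ROWS ? atomic_load(&complete_count[MB_ROWS]) : -1;
}
```

Each counter has exactly one writer and one reader, which is why no lock is
needed in this sketch; a stale read only makes the reader yield once more.
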
> 
> Thanks for your reply. I certainly would not laugh at code that works.
> 
>  I'm not sure I entirely understand based on what you've said but at least I 
> know what to look for now :-) 
> 
> Yield isn't really harmless here, but sort of kind of works (try removing it 
> and see). Yield gets worse and worse for efficiency when the threads are on 
> different cpus. If you assume only one thread is on each cpu, then when that 
> thread yields it just yields back to itself so it's effectively a no-op.
> 
> To retain your code as is, if I'm not mistaken, the thread which has encoded 
> the top/right macroblock first should signal the next thread that it now has 
> work to do, and the other thread should really block until then instead of 
> wasting cpu cycles. 
> 
> Now just to get this straight, if a thread updates the value of its 
> complete_count_above (which appears to happen at the end of the loop), then 
> this would be the time to signal the thread below that it has more work to 
> do?
> 
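
A minimal sketch of the blocking alternative suggested above, assuming a
mutex and condition variable per shared counter. The names row_progress,
row_progress_publish, row_progress_wait and demo_blocking_wait are
hypothetical and do not come from the XviD source.

```c
#include <pthread.h>

/* One of these would sit between the thread for row N and the one for row N+1. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  advanced;        /* signalled whenever complete_count grows */
    int             complete_count;  /* macroblocks finished in the row above */
} row_progress;

void row_progress_init(row_progress *p)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->advanced, NULL);
    p->complete_count = 0;
}

/* Called by the row above after it finishes a macroblock. */
void row_progress_publish(row_progress *p, int count)
{
    pthread_mutex_lock(&p->lock);
    p->complete_count = count;
    pthread_cond_signal(&p->advanced);   /* only one thread ever waits here */
    pthread_mutex_unlock(&p->lock);
}

/* Called by the row below: sleeps instead of yielding until enough is done. */
void row_progress_wait(row_progress *p, int need)
{
    pthread_mutex_lock(&p->lock);
    while (p->complete_count < need)     /* loop guards against spurious wakeups */
        pthread_cond_wait(&p->advanced, &p->lock);
    pthread_mutex_unlock(&p->lock);
}

/* Stand-in for the row above: "encodes" five blocks and publishes each one. */
static void *row_above(void *arg)
{
    row_progress *p = arg;
    for (int done = 1; done <= 5; done++)
        row_progress_publish(p, done);
    return NULL;
}

/* Returns the count seen after waiting for all five blocks, or -1 on failure. */
int demo_blocking_wait(void)
{
    row_progress p;
    pthread_t above;

    row_progress_init(&p);
    if (pthread_create(&above, NULL, row_above, &p) != 0)
        return -1;
    row_progress_wait(&p, 5);
    pthread_join(above, NULL);

    int seen = p.complete_count;
    pthread_cond_destroy(&p.advanced);
    pthread_mutex_destroy(&p.lock);
    return seen;
}
```

Taking a mutex once per macroblock has its own cost, so whether this beats
the yield() loop would have to be measured.
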
> It doesn't sound particularly scalable, but if the macroblocks above are 
> always ahead of the next thread and there's some decent locking I can 
> see how it can be beneficial without altering the motion estimation at all. 
> One of the problems with very short lived threads is they are spawned on the 
> same cpu and may not move to another cpu in any reasonable time. I'm not sure 
> if you have any way of detecting how many cpus the machine has (like an auto 
> mode for thread number), but if that was there you could make exactly 1 
> thread per cpu and physically bind them at the time of pthread_create.
> 
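
A rough sketch of the "one thread per cpu, bound at pthread_create" idea,
assuming Linux with glibc: sysconf(_SC_NPROCESSORS_ONLN) and
pthread_attr_setaffinity_np() are not portable everywhere. MAX_WORKERS,
worker and spawn_one_thread_per_cpu are hypothetical names.

```c
#define _GNU_SOURCE             /* needed for cpu_set_t macros and the _np call */
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

#define MAX_WORKERS 64          /* hypothetical cap on the number of threads */

static void *worker(void *arg)
{
    (void)arg;                  /* placeholder for the per-row encoding work */
    return NULL;
}

/* Starts one thread per online CPU and returns how many were started. */
int spawn_one_thread_per_cpu(void)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);   /* "auto" thread count */
    if (ncpu < 1)
        ncpu = 1;
    if (ncpu > MAX_WORKERS)
        ncpu = MAX_WORKERS;

    pthread_t tid[MAX_WORKERS];
    int started = 0;

    for (long cpu = 0; cpu < ncpu; cpu++) {
        pthread_attr_t attr;
        cpu_set_t set;

        pthread_attr_init(&attr);
        CPU_ZERO(&set);
        CPU_SET((int)cpu, &set);
        /* Ask for the new thread to start on, and stay on, this one CPU. */
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

        int rc = pthread_create(&tid[started], &attr, worker, NULL);
        if (rc != 0)            /* CPU not allowed for us: fall back to unbound */
            rc = pthread_create(&tid[started], NULL, worker, NULL);
        pthread_attr_destroy(&attr);
        if (rc == 0)
            started++;
    }

    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return started;
}
```

Setting the affinity in the attribute means the thread starts out on its own
cpu rather than on the cpu of the thread that created it.
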
> I'll see what I can do. Don't expect miracles or anything soon (unless I get 
> really enthusiastic) :-)
>