[XviD-devel] multithread rework

Tue Apr 22 17:29:40 CEST 2008

Hi,

Yes, a thread updating its own complete_count_self (which is the same
memory location as another thread's complete_count_above) can be changed
into a proper signal that tells the thread below to continue.
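
A minimal sketch of what that change could look like, assuming a pthread
mutex and condition variable replace the bare shared counter. The names
row_progress, row_publish and row_wait_for are made up for illustration
and are not taken from the XviD source.

```c
#include <pthread.h>

/* Hypothetical progress record shared by the threads for row n and row n+1. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  advanced;        /* signalled whenever the count grows */
    int             complete_count;  /* macroblocks finished in row n */
} row_progress;

static void row_progress_init(row_progress *p)
{
    pthread_mutex_init(&p->lock, NULL);
    pthread_cond_init(&p->advanced, NULL);
    p->complete_count = 0;
}

/* Called by the thread for row n after it finishes a macroblock. */
static void row_publish(row_progress *p, int count)
{
    pthread_mutex_lock(&p->lock);
    p->complete_count = count;
    /* Only the thread for row n+1 ever waits here, so one signal is enough. */
    pthread_cond_signal(&p->advanced);
    pthread_mutex_unlock(&p->lock);
}

/* Called by the thread for row n+1. Sleeps (instead of yielding in a loop)
 * until at least `needed` macroblocks are done in the row above, then
 * returns the count it observed. */
static int row_wait_for(row_progress *p, int needed)
{
    int seen;
    pthread_mutex_lock(&p->lock);
    while (p->complete_count < needed)
        pthread_cond_wait(&p->advanced, &p->lock);
    seen = p->complete_count;
    pthread_mutex_unlock(&p->lock);
    return seen;
}
```

With this sketch, and reading the top/right dependency as "macroblock x+1
in the row above must be done", the thread for row n+1 would call
row_wait_for(p, x + 2), capped at the row width, before encoding
macroblock x.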

Just FYI, I did try to bind threads to cores (by setting the affinity
mask), but at least on my X2 4200+ that reduced speed...
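
For anyone who wants to repeat that experiment on Linux with glibc, a
rough sketch of detecting the CPU count and pinning a thread follows.
The helper names are hypothetical, the cap on macroblock rows is an
assumption of this sketch, and pthread_setaffinity_np is a non-portable
GNU extension, so the binding helper reports failure where it is not
available.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* glibc needs this for CPU_SET and pthread_setaffinity_np */
#endif
#include <pthread.h>
#include <sched.h>
#include <unistd.h>

/* Number of CPUs currently online, or 1 if it cannot be determined. */
static int detect_cpu_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? (int)n : 1;
}

/* One thread per CPU, but never more threads than macroblock rows
 * (assumption: extra threads would have no row to work on). */
static int pick_thread_count(int cpus, int mb_rows)
{
    if (cpus < 1) cpus = 1;
    if (mb_rows < 1) mb_rows = 1;
    return cpus < mb_rows ? cpus : mb_rows;
}

/* Pin `thread` to a single CPU. Returns 0 on success, an error number on
 * failure, or -1 if the affinity API is not available in this build. */
static int bind_thread_to_cpu(pthread_t thread, int cpu)
{
#ifdef CPU_SET
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(thread, sizeof(set), &set);
#else
    (void)thread;
    (void)cpu;
    return -1;
#endif
}
```

Whether binding helps is something to measure per machine; on the X2
mentioned above it did not.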

Have fun and good luck :)

Radek

Con Kolivas wrote:
> On Tue, 22 Apr 2008 23:44:41 Radek Czyz wrote:
>> Hello, I'm the author of this code (don't laugh ;P) so I should be able
>> to help.
>>
>> First, this code doesn't actually divide the picture into slices. The
>> output is identical regardless of the number of threads - this was my
>> design goal and that's the reason why it works the way it does. I'm not
>> particularly attached to this design goal though, it was more of a
>> "can it be done" experiment.
>>
>> In order to encode a macroblock, a thread needs a macroblock on
>> top/right already encoded.
>>
>> So, two things happen:
>>
>> * each thread updates a count of "how many macroblocks are encoded in
>> this row" under its "complete_count_self" memory location.
>>
>> * each thread encodes no more blocks than "complete_count_above", which
>> is again a memory location and it's updated by whatever thread works on
>> the macroblock row above it.
>>
>> The thread initialization code ensures that the threads for row n and
>> row n+1 share the same pointer in order to communicate. The single
>> memory location is updated by the thread for row n and read by the
>> thread for row n+1. All synchronization happens through this set of
>> memory locations, nothing more is needed.
>>
>> yield() simply means that the current thread can't encode a block
>> because the thread above it didn't manage to encode its block in time
>> (or perhaps we're reading a stale value because of caches, in which
>> case there's still no harm done).
>>
>> Hopefully this helps
>> Radek
> 
> Thanks for your reply. I certainly would not laugh at code that works.
> 
> I'm not sure I entirely understand based on what you've said but at least I
> know what to look for now :-)
> 
> Yield isn't really harmless here, but sort of kind of works (try removing it 
> and see). Yield gets worse and worse for efficiency when the threads are on 
> different cpus. If you assume only one thread is on each cpu, then when that 
> thread yields it just yields back to itself so it's effectively a no-op.
> 
> To retain your code as is, if I'm not mistaken, the thread which has encoded 
> the top/right macroblock first should signal the next thread that it now has 
> work to do, and the other thread should really block until then instead of 
> wasting cpu cycles. 
> 
> Now just to get this straight, if a thread updates the value of its 
> complete_count_above (which appears to happen at the end of the loop), then 
> this would be the time to signal the thread below that it has more work to 
> do?
> 
> It doesn't sound particularly scalable, but if the macroblocks above are
> always ahead of the next thread and there's some decent locking I can
> see how it can be beneficial without altering the motion estimation at all.
> One of the problems with very short lived threads is they are spawned on the 
> same cpu and may not move to another cpu in any reasonable time. I'm not sure 
> if you have any way of detecting how many cpus the machine has (like an auto 
> mode for thread number), but if that was there you could make exactly 1 
> thread per cpu and physically bind them at the time of pthread_create.
> 
> I'll see what I can do. Don't expect miracles or anything soon (unless I get 
> really enthusiastic) :-)
>