[XviD-devel] multithread rework
Radek Czyz
radoslaw at syskin.cjb.net
Tue Apr 22 17:29:40 CEST 2008
Hi,
Yes, a thread updating its own complete_count_self (which is the same
memory location as the complete_count_above of the thread below) can be
changed into a proper signal telling that thread below to continue.
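
For illustration only, here is a minimal sketch of what such a signal could look like with a pthread mutex and condition variable per row. It is not the actual XviD code: the names row_progress, row_ctx, wait_for_above, publish_progress, encode_row and run_rows, the NUM_ROWS and MB_PER_ROW constants, and the exact "one block ahead" dependency are all assumptions made up for this example.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_ROWS   4  /* hypothetical: one thread per macroblock row */
#define MB_PER_ROW 8  /* hypothetical row width in macroblocks */

/* Progress of one row: written by that row's thread, read by the row below. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  advanced;
    int             complete_count;
} row_progress;

typedef struct {
    row_progress *self;   /* plays the role of complete_count_self */
    row_progress *above;  /* plays the role of complete_count_above; NULL for row 0 */
    int           encoded;
} row_ctx;

/* Block (instead of yielding) until the row above has finished `needed` blocks. */
static void wait_for_above(row_progress *above, int needed)
{
    pthread_mutex_lock(&above->lock);
    while (above->complete_count < needed)
        pthread_cond_wait(&above->advanced, &above->lock);
    pthread_mutex_unlock(&above->lock);
}

/* Publish our new count and wake the thread below if it is waiting. */
static void publish_progress(row_progress *self, int count)
{
    pthread_mutex_lock(&self->lock);
    self->complete_count = count;
    pthread_cond_signal(&self->advanced);
    pthread_mutex_unlock(&self->lock);
}

static void *encode_row(void *arg)
{
    row_ctx *ctx = arg;
    for (int x = 0; x < MB_PER_ROW; x++) {
        if (ctx->above) {
            /* Assumed dependency: block x needs blocks 0..x+1 of the row above. */
            int needed = (x + 2 < MB_PER_ROW) ? x + 2 : MB_PER_ROW;
            wait_for_above(ctx->above, needed);
        }
        ctx->encoded++;                      /* stand-in for encoding a macroblock */
        publish_progress(ctx->self, x + 1);
    }
    return NULL;
}

/* Runs one thread per row; returns the total number of blocks "encoded". */
static int run_rows(void)
{
    row_progress progress[NUM_ROWS];
    row_ctx ctx[NUM_ROWS];
    pthread_t tid[NUM_ROWS];
    int total = 0;

    /* Set up every shared counter before any thread starts. */
    for (int i = 0; i < NUM_ROWS; i++) {
        pthread_mutex_init(&progress[i].lock, NULL);
        pthread_cond_init(&progress[i].advanced, NULL);
        progress[i].complete_count = 0;
        ctx[i].self = &progress[i];
        ctx[i].above = (i > 0) ? &progress[i - 1] : NULL;
        ctx[i].encoded = 0;
    }
    for (int i = 0; i < NUM_ROWS; i++) {
        int rc = pthread_create(&tid[i], NULL, encode_row, &ctx[i]);
        assert(rc == 0);
        (void)rc;
    }
    for (int i = 0; i < NUM_ROWS; i++) {
        pthread_join(tid[i], NULL);
        total += ctx[i].encoded;
    }
    for (int i = 0; i < NUM_ROWS; i++) {
        pthread_cond_destroy(&progress[i].advanced);
        pthread_mutex_destroy(&progress[i].lock);
    }
    return total;
}
```

Calling run_rows() should return NUM_ROWS * MB_PER_ROW. Taking a mutex for every macroblock may well cost more than the yield() loop it replaces, so this would need measuring before anyone relies on it.
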
Just FYI, I did try to bind threads to cores (by setting affinity mask),
but at least on my X2 4200+ that reduced speed...
Have fun and good luck :)
Radek
Con Kolivas wrote:
> On Tue, 22 Apr 2008 23:44:41 Radek Czyz wrote:
>> Hello, I'm the author of this code (don't laugh ;P) so I should be able
>> to help.
>>
>> First, this code doesn't actually divide the picture into slices. Output is
>> identical regardless of the number of threads - this was my design goal and
>> that's the reason why it works the way it does. I'm not particularly
>> attached to this design goal though, it was more of a "can it be done"
>> experiment.
>>
>> In order to encode a macroblock, a thread needs a macroblock on
>> top/right already encoded.
>>
>> So, two things happen:
>>
>> * each thread updates a count of "how many macroblocks are encoded in
>> this row" under its "complete_count_self" memory location.
>>
>> * each thread encodes no more blocks than "complete_count_above", which
>> is again a memory location and it's updated by whatever thread works on
>> the macroblock row above it.
>>
>> Thread initialization code ensures that the threads for rows N and N+1
>> share the same pointer in order to communicate. The single memory
>> location is updated by the thread for row N and read by the thread for
>> row N+1. All synchronization happens with this set of memory locations,
>> nothing more is needed.
>>
>> yield() simply means that the current thread can't encode a block because
>> the thread above it didn't manage to encode its block in time (or perhaps
>> we're getting an older value because of caches, in which case there's
>> still no harm done).
>>
>> Hopefully this helps
>> Radek
>
> Thanks for your reply. I certainly would not laugh at code that works.
>
> I'm not sure I entirely understand based on what you've said but at least I
> know what to look for now :-)
>
> Yield isn't really harmless here, but sort of kind of works (try removing it
> and see). Yield gets worse and worse for efficiency when the threads are on
> different cpus. If you assume only one thread is on each cpu, then when that
> thread yields it just yields back to itself so it's effectively a no-op.
>
> To retain your code as is, if I'm not mistaken, the thread which has encoded
> the top/right macroblock first should signal the next thread that it now has
> work to do, and the other thread should really block until then instead of
> wasting cpu cycles.
>
> Now just to get this straight, if a thread updates the value of its
> complete_count_above (which appears to happen at the end of the loop), then
> this would be the time to signal the thread below that it has more work to
> do?
>
> It doesn't sound particularly scalable, but if the macroblocks above are
> always ahead of the next thread and there's some decent locking I can
> see how it can be beneficial without altering at all the motion estimation.
> One of the problems with very short lived threads is they are spawned on the
> same cpu and may not move to another cpu in any reasonable time. I'm not sure
> if you have any way of detecting how many cpus the machine has (like an auto
> mode for thread number), but if that was there you could make exactly 1
> thread per cpu and physically bind them at the time of pthread_create.
>
> I'll see what I can do. Don't expect miracles or anything soon (unless I get
> really enthusiastic) :-)
>
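
As a rough illustration of the "auto mode for thread number" idea in the quoted message, the sketch below asks the OS how many CPUs are online. It assumes a system where sysconf() supports _SC_NPROCESSORS_ONLN (Linux and most Unix-likes do, but it is not guaranteed everywhere), and detect_thread_count is a hypothetical helper name, not something that exists in XviD.

```c
#include <unistd.h>

/* Hypothetical helper: suggest one encoder thread per online CPU.
 * Falls back to 1 if the query fails or is unsupported. */
static int detect_thread_count(void)
{
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return (n < 1) ? 1 : (int)n;
}
```

The result could be used as the default thread count when the user does not pick one. Whether binding each of those threads to a CPU actually helps is a separate question.
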