1 ENVIRONMENTS
Table 1 ENV A
CPU | Intel(R) Xeon(R) CPU E7- 4870 @ 2.40GHz 10-core |
Memory | 100G |
Linux Version | 3.11.0 |
Ubuntu Version | 12.04 |
2 RESULTS
*We could find that MENCODER with PARROT is faster than that without PARROT overfour threads.
3 EXPLANATIONS
We need to explain why MENCODER with PARROT is faster than that without PARROT overfour threads. [1] tells us that the Synchronization Context Switches with PARROT is much fewerthan that without PARROT. And the reason is that PARROT’s round-robin scheduling reducescontention and its synchronizations use a more efficient wait that combines spin- and block- waits.So we should do the following steps to find the detailed reasons.
(a) First, is Syn Context Switches the main factor that leads to the result?
(b) If it is, please see (c-e); otherwise jump to (f).
(c) What’ s the relationship between Syn Context Switches and threads contention(a.k.a. lockscontention)?
(d) What causes the contention?
(e) How does PARROT’ s round-robin scheduling reduce contention?
(f) We need to find other factors.
3.1 Preparations
We can’t read the source code of MENCODER, and don’t know where PARROT works. So I wantto find a simple model which is similar to MENCODER. It is located in xtern/apps/mplayer/, andcontains speedup-example.cpp(with PARROT) and speedup-example-mutex-only.cpp(withoutPARROT). We call it EXAMPLE, and I write the mk of speedup-example.cpp.
#!/bin/bash
cd $XTERN_ROOT/apps/mplayer
gcc speedup-example.cpp -o speedup-example -O2 -g \
-I$XTERN_ROOT/include -L$XTERN_ROOT/dync_hook -Wl,--rpath,$XTERN_ROOT/dync_hook
-lxtern-annot \
-lpthread
So we could run the model. Before this, however, we should prove that we could use EXAMPLEto explain MENCODER. We could regard MENCODER as the common producer-consumeridiom. The producer reads blocks from mp4, and the consumers convert into avi. When manyconsumers strive for the same lock, it will causes performance loss. This is the key point. If wewant to explore the reasons of the speedup of MENCODER with PARROT, we need to simplify itas the above. And EXAMPLE is a good choice. It lets many threads access the same criticalresource, and uses PARROT to get a speedup. Additionally, EXAMPLE and MENCODER bothenjoy big speedups. So I think if we can find the detailed reasons of the speedup of theEXAMPLE with PARROT, the obtained reasons also apply to MENCODER.
3.2 Is Syn Context Switches the main factor that leads to the result?
Now we could use EXAMPLE to explain MENCODER. We run EXAMPLE over four threads,ENV A. Results are as follows.
Table 2 Results of EXAMPLE over four threads, ENV A
WITHOUTPARROT | WITH PARROT | |
Elapsed Time | 7.85s | 4.36s |
Table 3 Results of EXAMPLE without PARROT over 1 thread, ENV A
WITHOUTPARROT | |
Elapsed Time | 1.38s |
Now we try to explain the results. First, let’s see the source code.
speedup-example-mutex-only.cpp
void*thread_func(void*arg){
...
for (int j = 0; j < M; j++) {
pthread_mutex_lock(&mutex);
for (long i = 0; i < loops; i++) // This is the key of speedup for parrot: the
mutex needs to be a little bit congested.
sum += i;
pthread_mutex_unlock(&mutex);
...
}
}
int main(int argc, char* argv[]){
...
for(unsigned i=0; i<N; ++i) {
ret = pthread_create(&th[i], NULL, thread_func, (void*)i);
assert(!ret && "pthread_create() failed!");
}
...
exit(0);
}
According Amdahl's law,
Threads=1, without PARROT
Threads=4, without PARROT
Threads=4, with PARROT
Fig.2. and Fig.1. show that with the increase of the number of threads, Synchronization ContextSwitches is larger and larger. Fig.2. and Fig.3. show that PARROT reduces the number of ContextSwitches. Namely, with the increase of Synchronization Context Switches, the Elapsed Timeincreases, and with the reduce of it, the Elapsed Time reduces. Above all, we could prove that SynContext Switches is the main factor that leads to the result that we try to explain.
3.3 What’s the relationship between Syn Context Switches and threadscontention?
We have known that Syn Context Switches is the main factor. And we also know that with theincrease of the number of threads, there is also increased threads contention.What’s therelationship between Syn Context Switches and threads contention? Contention not only causesserialization, but also brings the overhead of thread switches[2]. From VTune Fig.4., we could seethat threads spend more Wait Time on the Mutex lock, which caused by threads contention.
So if we could reduce the contention, or use a more efficient wait, we will get a high speedup. Sowe need to know how it is generated.
3.4 What causes the contention?
Let’s see the source code of EXAMPLE. Every thread executes the following codes:
pthread_mutex_lock(&mutex);
for (long i = 0; i < loops; i++) sum += i;
pthread_mutex_unlock(&mutex);
When many threads strive for the same lock, contention happens. And only one thread could getthe lock, while others should wait. So if we want to reduce contention, we should not let too manythreads wait on the same lock. We should take some scheduling policies rather than let threads runindefinitely.
3.5 How does PARROT’s round-robin scheduling reduce contention?
First, let’s see the Locks and Waits of EXAMPLE with PARROT and without PARROT fromVTune.
With PARROT
Without PARROT
Without PARROT, threads spend more Wait Time on the Mutex lock. While with PARROT,threads spend more wait time on Condition Variable rather than Mutex. It is because PARROTadds
to wake up all threads together. Let’s try to explain the benefit.Thread A finishes its access, and unlock the shared resources. Next, there will be many threadsstriving for the lock, including A itself. While PARROT lets A wait by (2), and lets B, C, D waitsubsequently. So the contention has been reduced. When all four threads lie in the wait queue, (3)will let the four threads run again. The following graph describes the number of threads that strivefor the lock as time goes on.
So PARROT could reduce contention by schedule threads in order.
4 CONCLUSION
PARROT tells thread A, B, C and D: When you finish a access, you should wait for others,because you are cooperators. So the enemy, Syn Context Switches, reduces gradually. Althoughthey need to wait, in fact the performance will be improved as a whole.
5 REFERENCES
[1] Heming Cui, Jiri Simsa, Yi-Hong Lin, Hao Li, Ben Blum, Xinan Xu, JunfengYang, Garth Gibson, and Randy Bryant. "Parrot: a Practical Runtime forDeterministic, Stable, and Reliable Threads". Proceedings of the 24th ACMSymposium on Operating Systems Principles (SOSP ’13)
[2] Weiming Zhou, Multi-core Computing and Programming, Wuhan: Huazhong University ofScience and Technology Press, 2009(in Chinese)