The Performance Analysis of MENCODER with PARROT

1 ENVIRONMENTS

Table 1 ENV A

CPU
Intel(R) Xeon(R) CPU E7- 4870 @
2.40GHz 10-core
Memory
100G
Linux Version
3.11.0
Ubuntu Version
12.04

2 RESULTS
*We could find that MENCODER with PARROT is faster than that without PARROT overfour threads.


3 EXPLANATIONS
We need to explain why MENCODER with PARROT is faster than that without PARROT overfour threads. [1] tells us that the Synchronization Context Switches with PARROT is much fewerthan that without PARROT. And the reason is that PARROT’s round-robin scheduling reducescontention and its synchronizations use a more efficient wait that combines spin- and block- waits.So we should do the following steps to find the detailed reasons.
(a) First, is Syn Context Switches the main factor that leads to the result?
(b) If it is, please see (c-e); otherwise jump to (f).
(c) What’ s the relationship between Syn Context Switches and threads contention(a.k.a. lockscontention)?
(d) What causes the contention?
(e) How does PARROT’ s round-robin scheduling reduce contention?
(f) We need to find other factors.


3.1 Preparations
We can’t read the source code of MENCODER, and don’t know where PARROT works. So I wantto find a simple model which is similar to MENCODER. It is located in xtern/apps/mplayer/, andcontains speedup-example.cpp(with PARROT) and speedup-example-mutex-only.cpp(withoutPARROT). We call it EXAMPLE, and I write the mk of speedup-example.cpp.

#!/bin/bash
cd $XTERN_ROOT/apps/mplayer
gcc speedup-example.cpp -o speedup-example -O2 -g \
-I$XTERN_ROOT/include -L$XTERN_ROOT/dync_hook -Wl,--rpath,$XTERN_ROOT/dync_hook
-lxtern-annot \
-lpthread


So we could run the model. Before this, however, we should prove that we could use EXAMPLEto explain MENCODER. We could regard MENCODER as the common producer-consumeridiom. The producer reads blocks from mp4, and the consumers convert into avi. When manyconsumers strive for the same lock, it will causes performance loss. This is the key point. If wewant to explore the reasons of the speedup of MENCODER with PARROT, we need to simplify itas the above. And EXAMPLE is a good choice. It lets many threads access the same criticalresource, and uses PARROT to get a speedup. Additionally, EXAMPLE and MENCODER bothenjoy big speedups. So I think if we can find the detailed reasons of the speedup of theEXAMPLE with PARROT, the obtained reasons also apply to MENCODER.


3.2 Is Syn Context Switches the main factor that leads to the result?
Now we could use EXAMPLE to explain MENCODER. We run EXAMPLE over four threads,ENV A. Results are as follows.

Table 2 Results of EXAMPLE over four threads, ENV A


WITHOUTPARROT
WITH PARROT
Elapsed Time
7.85s
4.36s
Very good. PARROT get a high speedup. Let’s run it over a single thread without PARROT.

Table 3 Results of EXAMPLE without PARROT over 1 thread, ENV A


WITHOUTPARROT
Elapsed Time
1.38s

Now we try to explain the results. First, let’s see the source code.


speedup-example-mutex-only.cpp

void*thread_func(void*arg){
...
for (int j = 0; j < M; j++) {
pthread_mutex_lock(&mutex);
for (long i = 0; i < loops; i++) // This is the key of speedup for parrot: the
mutex needs to be a little bit congested.
sum += i;
pthread_mutex_unlock(&mutex);
...
}
}
int main(int argc, char* argv[]){
...
for(unsigned i=0; i<N; ++i) {
ret = pthread_create(&th[i], NULL, thread_func, (void*)i);
assert(!ret && "pthread_create() failed!");
}
...
exit(0);
}

According Amdahl's law,
the Elapsed Time of execution without PARROT over four threads should be four times that over asingle thread. However it is more than four times. The execution with PARROT enjoys a goodspeedup, and the Elapsed Time is about four times that over a single thread. Now let’s use VTuneto find something.


Threads=1, without PARROT


Threads=4, without PARROT


Threads=4, with PARROT


Fig.2. and Fig.1. show that with the increase of the number of threads, Synchronization ContextSwitches is larger and larger. Fig.2. and Fig.3. show that PARROT reduces the number of ContextSwitches. Namely, with the increase of Synchronization Context Switches, the Elapsed Timeincreases, and with the reduce of it, the Elapsed Time reduces. Above all, we could prove that SynContext Switches is the main factor that leads to the result that we try to explain.


3.3 What’s the relationship between Syn Context Switches and threadscontention?
We have known that Syn Context Switches is the main factor. And we also know that with theincrease of the number of threads, there is also increased threads contention.What’s therelationship between Syn Context Switches and threads contention? Contention not only causesserialization, but also brings the overhead of thread switches[2]. From VTune Fig.4., we could seethat threads spend more Wait Time on the Mutex lock, which caused by threads contention.


So if we could reduce the contention, or use a more efficient wait, we will get a high speedup. Sowe need to know how it is generated.


3.4 What causes the contention?


Let’s see the source code of EXAMPLE. Every thread executes the following codes:

pthread_mutex_lock(&mutex);
for (long i = 0; i < loops; i++) sum += i;
pthread_mutex_unlock(&mutex);

When many threads strive for the same lock, contention happens. And only one thread could getthe lock, while others should wait. So if we want to reduce contention, we should not let too manythreads wait on the same lock. We should take some scheduling policies rather than let threads runindefinitely.


3.5 How does PARROT’s round-robin scheduling reduce contention?


First, let’s see the Locks and Waits of EXAMPLE with PARROT and without PARROT fromVTune.


With PARROT


Without PARROT


Without PARROT, threads spend more Wait Time on the Mutex lock. While with PARROT,threads spend more wait time on Condition Variable rather than Mutex. It is because PARROTadds


pthread_cond_wait(&cond,&mutex);............(2)


which adds threads to a wait queue, then uses


pthread_cond_broadcast(&cond); .............(3)


to wake up all threads together. Let’s try to explain the benefit.Thread A finishes its access, and unlock the shared resources. Next, there will be many threadsstriving for the lock, including A itself. While PARROT lets A wait by (2), and lets B, C, D waitsubsequently. So the contention has been reduced. When all four threads lie in the wait queue, (3)will let the four threads run again. The following graph describes the number of threads that strivefor the lock as time goes on.


So PARROT could reduce contention by schedule threads in order.


4 CONCLUSION

PARROT tells thread A, B, C and D: When you finish a access, you should wait for others,because you are cooperators. So the enemy, Syn Context Switches, reduces gradually. Althoughthey need to wait, in fact the performance will be improved as a whole.



5 REFERENCES
[1] Heming Cui, Jiri Simsa, Yi-Hong Lin, Hao Li, Ben Blum, Xinan Xu, JunfengYang, Garth Gibson, and Randy Bryant. "Parrot: a Practical Runtime forDeterministic, Stable, and Reliable Threads". Proceedings of the 24th ACMSymposium on Operating Systems Principles (SOSP ’13)
[2] Weiming Zhou, Multi-core Computing and Programming, Wuhan: Huazhong University ofScience and Technology Press, 2009(in Chinese)


评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值