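Background
There are roughly 1,000,000 tasks to process. Running them all on one thread is very slow, so the idea is to create several worker threads from the main thread and process the tasks in parallel. This article uses a repeated summation loop as the example.
Single thread:
const int max_sum_item = 1000000000;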
int main()
{
boost::posix_time::ptime start =boost::posix_time::microsec_clock::local_time();
uint64_t result = 0;
for (int i = 0; i < max_sum_item; i++)
result += i;
std::cout << "sum="<<result<<std::endl;
boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
boost::posix_time::time_duration timeTaken = end - start;
std::cout <<"cost time:"<< timeTaken.total_milliseconds() << std::endl;
}
The output is:
sum=499999999500000000
cost time:4061
What if the same work is handed to a newly created thread instead?
Code:
const int max_sum_item = 1000000000;
void do_sum(uint64_t *total)
{
*total = 0;
for (int i = 0; i < max_sum_item; i++)
*total += i;
}
int main()
{
boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
uint64_t result = 0;
boost::thread worker(do_sum, &result);
worker.join();
std::cout << "sum="<<result<<std::endl;
boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
boost::posix_time::time_duration timeTaken = end - start;
std::cout <<"cost time:"<< timeTaken.total_milliseconds() << std::endl;
}
Output:
sum=499999999500000000
cost time:4346
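As the numbers show, doing the computation in a newly created thread instead of main's thread takes slightly longer, since creating and destroying a thread has a cost. But 285 ms (4346 ms versus 4061 ms) seems too large to be thread overhead alone.
A small optimization of do_sum: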
void do_sum(uint64_t *total)
{
uint64_t localTotal = 0;
for (int i = 0; i < max_sum_item; i++)
localTotal += i;
*total = localTotal;
}
With the optimized do_sum, the timing is:
sum=499999999500000000
cost time:4068
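The reason is that the unoptimized do_sum updates the caller's variable through the pointer on every iteration (*total += i;). When the compiler does not keep that value in a register, each iteration reads and writes memory through the pointer, and that access costs more than the addition itself. The better approach is to accumulate into a local variable, localTotal, and write it through the pointer total only once at the end.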
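Multiple threads:
Note that C++11 lambda expressions require GCC/G++ 4.5 or later. G++ 4.4 does not support them and fails at compile time. See http://gcc.gnu.org/projects/cxx0x.html.
With a compiler that supports lambdas, the partial sums can be added up like this: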
std::for_each(part_sums.begin(), part_sums.end(), [&result] (uint64_t *subtotal) { result += *subtotal; });
The code is as follows:
std::vector<uint64_t *> part_sums;
const int threads_to_use = 2;
void do_partial_sum(uint64_t *final, int start_val, int sums_to_do)
{
uint64_t sub_result = 0;
for (int i = start_val; i < start_val + sums_to_do; i++)
sub_result += i;
*final = sub_result;
}
int main()
{
boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
part_sums.clear();
for (int i = 0; i < threads_to_use; i++)
{
part_sums.push_back(new uint64_t(0));
}
std::vector<boost::thread *> t;
int sums_per_thread = max_sum_item / threads_to_use;
for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
{
t.push_back(new boost::thread(do_partial_sum, part_sums[i], start_val, sums_per_thread));
}
for (int i = 0; i < threads_to_use; i++)
t[i]->join();
uint64_t result = 0;
// std::for_each(part_sums.begin(), part_sums.end(),myfunc);
// sum the elements of the vector
for(int i = 0; i < threads_to_use; i++)
{
uint64_t *temp = part_sums[i];
// std::cout<<*temp<<std::endl;
result += *temp; // dereference the stored pointer to get the value
}
// result = accumulate(part_sums1.begin() , part_sums1.end() ,0);
for (int i = 0; i < threads_to_use; i++)
{
delete t[i];
delete part_sums[i];
}
std::cout << "sum="<<result<<std::endl;
boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
boost::posix_time::time_duration timeTaken = end - start;
std::cout <<"cost time:"<< timeTaken.total_milliseconds() << std::endl;
}
With two threads, the output is:
sum=499999999500000000
cost time:1907
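The speedup is substantial: from 4061 ms down to 1907 ms, about 2.1 times faster.
Note that the summation over the vector above can also be written more compactly as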
for (std::vector<boost::uint64_t *>::iterator it = part_sums.begin(); it != part_sums.end(); ++it) result += **it;
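Distributing work across threads
Take const int max_sum_item = 1000000000; from above. With 7 threads, each thread would get about 142,857,142.86 items, which is not a whole number. Rounding down gives 142,857,142 items per thread, so the 7 threads cover 999,999,994 items in total. The 6 leftover items are assigned to the last thread.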
int sums_per_thread = max_sum_item / threads_to_use;
for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
{
// Lump extra bits onto last thread if work items is not equally divisible by number of threads
int sums_to_do = sums_per_thread;
if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
sums_to_do = max_sum_item - start_val; // tail case: more than one chunk but fewer than two chunks remain
t.push_back(new std::thread(do_partial_sum, part_sums[i], start_val, sums_to_do));
if (sums_to_do != sums_per_thread)
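break; // Stop after handing out the first non-standard chunk. The tail chunk is larger than a standard chunk, so without this break the loop would run once more with start_val = 999,999,994 and create an unnecessary, incorrect extra thread.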
}
Complete code (using 7 threads):
const int max_sum_item = 1000000000;
std::vector<uint64_t *> part_sums;
const int threads_to_use = 7;
void do_partial_sum(uint64_t *final, int start_val, int sums_to_do)
{
uint64_t sub_result = 0;
for (int i = start_val; i < start_val + sums_to_do; i++)
sub_result += i;
*final = sub_result;
}
int main()
{
boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
uint64_t result = 0;
part_sums.clear();
for (int i = 0; i < threads_to_use; i++)
{
part_sums.push_back(new uint64_t(0));
}
std::vector<boost::thread *> t;
int sums_per_thread = max_sum_item / threads_to_use;
for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
{
// Lump extra bits onto last thread if work items is not equally divisible by number of threads
int sums_to_do = sums_per_thread;
if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
sums_to_do = max_sum_item - start_val; // tail case: more than one chunk but fewer than two chunks remain
t.push_back(new boost::thread(do_partial_sum, part_sums[i], start_val, sums_to_do));
if (sums_to_do != sums_per_thread)
break;
}
for (int i = 0; i < threads_to_use; i++)
t[i]->join();
// sum the elements of the vector
for(int i = 0; i < threads_to_use; i++)
{
uint64_t *temp = part_sums[i];
// std::cout<<*temp<<std::endl;
result += *temp;
}
// result = accumulate(part_sums1.begin() , part_sums1.end() ,0);
for (int i = 0; i < threads_to_use; i++)
{
delete t[i];
delete part_sums[i];
// delete part_sums1[i];
}
std::cout << "sum="<<result<<std::endl;
boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
boost::posix_time::time_duration timeTaken = end - start;
std::cout <<"cost time:"<< timeTaken.total_milliseconds() << std::endl;
//************************ multi-thread test ************************//
return 0;
}
The output is:
sum=499999999500000000
cost time:546
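The speedup appears to be more than 7 times (4061 ms / 546 ms is about 7.4). Why? Is it because each thread handles fewer items and the processing time is not linear in the number of items? One likely factor is run-to-run variation in the single-thread baseline: in the thread-count sweep later in this article, the one-thread run takes 3874 ms and the seven-thread run 552 ms, a ratio of about 7.0. Additions and discussion are welcome.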
Recording the time taken by each thread
Main code:
std::vector<uint64_t *> part_sums;
boost::mutex coutmutex; // synchronization object that guards std::cout
const int threads_to_use = 7;
void do_partial_sum(uint64_t *final, int start_val, int sums_to_do)
{
coutmutex.lock();
std::cout << "Start: TID " << boost::this_thread::get_id() << " starting at " << start_val << ", workload of " << sums_to_do << " items" << std::endl;
coutmutex.unlock();
// You can simply output text to cout or a file stream, but as discussed in the first part of this series,
// stream operations in C++ are not atomic, so every use of std::cout must be wrapped in a mutex lock
// as provided by the lock() method of std::mutex (or boost::mutex).
boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
uint64_t sub_result = 0;
for (int i = start_val; i < start_val + sums_to_do; i++)
sub_result += i;
*final = sub_result;
boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
boost::posix_time::time_duration timeTaken = end - start;
coutmutex.lock();
std::cout << "End : TID " << boost::this_thread::get_id() << " with result " << sub_result << ", time taken "<< timeTaken.total_milliseconds() << std::endl;
coutmutex.unlock(); // without this unlock, the other threads would block forever waiting for the mutex
}
The main function is the same as in the previous example.
The output is:
Start: TID 7f7a85a6a700 starting at 142857142, workload of 142857142 items
Start: TID 7f7a84668700 starting at 428571426, workload of 142857142 items
Start: TID 7f7a83266700 starting at 714285710, workload of 142857142 items
Start: TID 7f7a82865700 starting at 857142852, workload of 142857148 items
Start: TID 7f7a8646b700 starting at 0, workload of 142857142 items
Start: TID 7f7a85069700 starting at 285714284, workload of 142857142 items
Start: TID 7f7a83c67700 starting at 571428568, workload of 142857142 items
End : TID 7f7a85a6a700 with result 30612244459183675, time taken 542
End : TID 7f7a82865700 with result 132653065561224474, time taken 543
End : TID 7f7a84668700 with result 71428570500000003, time taken 543
End : TID 7f7a8646b700 with result 10204081438775511, time taken 544
End : TID 7f7a83266700 with result 112244896540816331, time taken 544
End : TID 7f7a83c67700 with result 91836733520408167, time taken 582
End : TID 7f7a85069700 with result 51020407479591839, time taken 583
sum=499999999500000000
cost time:583
As noted above, stream output is not an atomic operation, so remember to lock around it.
The remaining parts are to be added later.
Complete code
C++11 version
#include <iostream> // for std::cout
#include <cstdint> // for uint64_t
#include <chrono> // for std::chrono::high_resolution_clock
#include <thread> // for std::thread
#include <vector> // for std::vector
#include <algorithm> // for std::for_each
#include <cassert> // for assert
#define TRACE
#ifdef TRACE
#include <mutex> // for std::mutex
std::mutex coutmutex;
#endif
std::vector<uint64_t *> part_sums;
const int max_sum_item = 1000000000;
const int threads_to_use = 7;
void do_partial_sum(uint64_t *final, int start_val, int sums_to_do)
{
#ifdef TRACE
coutmutex.lock();
std::cout << "Start: TID " << std::this_thread::get_id() << " starting at " << start_val << ", workload of " << sums_to_do << " items" << std::endl;
coutmutex.unlock();
auto start = std::chrono::high_resolution_clock::now();
#endif
uint64_t sub_result = 0;
for (int i = start_val; i < start_val + sums_to_do; i++)
sub_result += i;
*final = sub_result;
#ifdef TRACE
auto end = std::chrono::high_resolution_clock::now();
coutmutex.lock();
std::cout << "End : TID " << std::this_thread::get_id() << " with result " << sub_result << ", time taken "
<< (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den) << std::endl;
coutmutex.unlock();
#endif
}
int main()
{
part_sums.clear();
for (int i = 0; i < threads_to_use; i++)
part_sums.push_back(new uint64_t(0));
std::vector<std::thread *> t;
int sums_per_thread = max_sum_item / threads_to_use;
auto start = std::chrono::high_resolution_clock::now();
for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
{
// Lump extra bits onto last thread if work items is not equally divisible by number of threads
int sums_to_do = sums_per_thread;
if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
sums_to_do = max_sum_item - start_val;
t.push_back(new std::thread(do_partial_sum, part_sums[i], start_val, sums_to_do));
if (sums_to_do != sums_per_thread)
break;
}
for (int i = 0; i < threads_to_use; i++)
t[i]->join();
uint64_t result = 0;
std::for_each(part_sums.begin(), part_sums.end(), [&result] (uint64_t *subtotal) { result += *subtotal; });
auto end = std::chrono::high_resolution_clock::now();
for (int i = 0; i < threads_to_use; i++)
{
delete t[i];
delete part_sums[i];
}
assert(result == uint64_t(499999999500000000));
std::cout << "Result is correct" << std::endl;
std::cout << "Time taken: " << (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den) << std::endl;
}
Boost version
#include <iostream> // for std::cout
#include <boost/cstdint.hpp> // for boost::uint64_t
#include <boost/chrono.hpp> // for boost::chrono::high_resolution_clock
#include <boost/thread.hpp> // for boost::thread and boost::mutex
#include <vector> // for std::vector
#include <cassert> // for assert
#define TRACE
#ifdef TRACE
boost::mutex coutmutex;
#endif
std::vector<boost::uint64_t *> part_sums;
const int max_sum_item = 1000000000;
const int threads_to_use = 7;
void do_partial_sum(boost::uint64_t *final, int start_val, int sums_to_do)
{
#ifdef TRACE
coutmutex.lock();
std::cout << "Start: TID " << boost::this_thread::get_id() << " starting at " << start_val << ", workload of " << sums_to_do << " items" << std::endl;
coutmutex.unlock();
boost::chrono::high_resolution_clock::time_point start = boost::chrono::high_resolution_clock::now();
#endif
boost::uint64_t sub_result = 0;
for (int i = start_val; i < start_val + sums_to_do; i++)
sub_result += i;
*final = sub_result;
#ifdef TRACE
boost::chrono::high_resolution_clock::time_point end = boost::chrono::high_resolution_clock::now();
coutmutex.lock();
std::cout << "End : TID " << boost::this_thread::get_id() << " with result " << sub_result << ", time taken "
<< (end - start).count() * ((double) boost::chrono::high_resolution_clock::period::num / boost::chrono::high_resolution_clock::period::den) << std::endl;
coutmutex.unlock();
#endif
}
int main()
{
part_sums.clear();
for (int i = 0; i < threads_to_use; i++)
part_sums.push_back(new boost::uint64_t(0));
std::vector<boost::thread *> t;
int sums_per_thread = max_sum_item / threads_to_use;
boost::chrono::high_resolution_clock::time_point start = boost::chrono::high_resolution_clock::now();
for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
{
// Lump extra bits onto last thread if work items is not equally divisible by number of threads
int sums_to_do = sums_per_thread;
if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
sums_to_do = max_sum_item - start_val;
t.push_back(new boost::thread(do_partial_sum, part_sums[i], start_val, sums_to_do));
if (sums_to_do != sums_per_thread)
break;
}
for (int i = 0; i < threads_to_use; i++)
t[i]->join();
boost::uint64_t result = 0;
for (std::vector<boost::uint64_t *>::iterator it = part_sums.begin(); it != part_sums.end(); ++it)
result += **it;
boost::chrono::high_resolution_clock::time_point end = boost::chrono::high_resolution_clock::now();
for (int i = 0; i < threads_to_use; i++)
{
delete t[i];
delete part_sums[i];
}
assert(result == boost::uint64_t(499999999500000000));
std::cout << "Result is correct" << std::endl;
std::cout << "Time taken: " << (end - start).count() * ((double) boost::chrono::high_resolution_clock::period::num / boost::chrono::high_resolution_clock::period::den) << std::endl;
}
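Multi-core processors
Multiple processors give true parallelism, as opposed to time slicing done by the operating system's scheduler. How, then, do we decide how many threads to start on a multi-core machine?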
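std::thread::hardware_concurrency() (or boost::thread::hardware_concurrency())
reports the number of hardware threads available. Note that this is the number of logical cores the system can detect, not the number of physical cores. For example, a benchmark machine with a 4-core i7 and Hyper-Threading reports 8.
It is used as follows: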
for (int threads_to_use = 1; threads_to_use <= static_cast<int>(std::thread::hardware_concurrency()); threads_to_use++)
{
// original code
std::cout << "Time taken with " << threads_to_use << " core" << (threads_to_use != 1? "s":"") << ": " << (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den) << std::endl;
}
The Boost version uses boost::thread::hardware_concurrency().
To set the thread count dynamically, remove the const qualifier from threads_to_use so that it becomes a variable, and wrap the body of the original main function in a loop:
for (int threads_to_use = 1; threads_to_use <= static_cast<int>(boost::thread::hardware_concurrency()); threads_to_use++)
{
// original code
std::cout << "Time taken with " << threads_to_use << " core" << (threads_to_use != 1? "s":"") << ": " << timeTaken.total_milliseconds()<< std::endl;
}
In full:
int main()
{
for (int threads_to_use = 1; threads_to_use <= static_cast<int>(boost::thread::hardware_concurrency()); threads_to_use++)
{
// The original code goes here. The number of threads, threads_to_use, now varies with the loop. Note that the test machine used in this article has 24 logical cores.
std::cout << "Time taken with " << threads_to_use << " core" << (threads_to_use != 1? "s":"") << ": " << timeTaken.total_milliseconds()<< std::endl;
}
return 0;
}
The output is:
Time taken with 1 core: 3874
Time taken with 2 cores: 1927
Time taken with 3 cores: 1289
Time taken with 4 cores: 965
Time taken with 5 cores: 773
Time taken with 6 cores: 643
Time taken with 7 cores: 552
Time taken with 8 cores: 482
Time taken with 9 cores: 429
Time taken with 10 cores: 386
Time taken with 11 cores: 358
Time taken with 12 cores: 327
Time taken with 13 cores: 406
Time taken with 14 cores: 387
Time taken with 15 cores: 374
Time taken with 16 cores: 394
Time taken with 17 cores: 337
Time taken with 18 cores: 304
Time taken with 19 cores: 314
Time taken with 20 cores: 303
Time taken with 21 cores: 296
Time taken with 22 cores: 285
Time taken with 23 cores: 279
Time taken with 24 cores: 267
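Plotting these timings shows a rebound at around 12 to 13 threads, followed by fluctuation. Up to 12 threads the time falls almost in proportion to the thread count. Beyond that the gains are small and uneven, although 24 threads still gives the lowest time (267 ms). A thread count equal to half the available logical cores, which here is the number of physical cores, is therefore a reasonable choice on this machine.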
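Checking the CPU and core counts shows that the machine has 2 physical CPUs (cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l), 24 logical CPUs (cat /proc/cpuinfo | grep "processor" | wc -l), and 6 cores per CPU (cat /proc/cpuinfo | grep "cores" | uniq). There are 24 logical processors rather than 12 because Hyper-Threading is enabled.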
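Python code used to draw the chart:
import matplotlib.pyplot as plt
import numpy as np
# measured times in milliseconds for 1 to 24 threads
y = [3874,1927,1289,965,773,643,552,482,429,386,358,327,406,387,374,394,337,304,314,303,296,285,279,267]
x = np.arange(1,25)
x1 = x.tolist()
print(type(x1))
print(len(x1))
print(len(y))
print(y)
plt.plot(x1,y,'r--')
plt.axis([1, 24, 0, 4000])
plt.title('cost time of cores')
plt.xlabel('number of cores')
plt.ylabel('cost time/milliseconds')
plt.show()
Thread synchronization
Although multithreading can improve an application's performance, it also adds complexity. When threads execute several functions at the same time, access to shared resources must be synchronized accordingly. Once an application reaches a certain size, this involves a fair amount of work. This section introduces the classes that Boost.Thread provides for synchronizing threads.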