Multithreaded Programming Study Notes: Summing Massive Data

Background

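There are roughly one million tasks to process. Running them all on a single thread is very slow, so the idea is to have the main thread create several worker threads and process the tasks in parallel. This article uses summation over a large number of loop iterations as the example.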

Single thread

#include <iostream>
#include <stdint.h>
#include <boost/date_time/posix_time/posix_time.hpp>

const int max_sum_item = 1000000000;

int main()
{
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
    uint64_t result = 0;
    for (int i = 0; i < max_sum_item; i++)
        result += i;
    std::cout << "sum=" << result << std::endl;
    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration timeTaken = end - start;
    std::cout << "cost time:" << timeTaken.total_milliseconds() << std::endl;
    return 0;
}

The output is:
sum=499999999500000000
cost time:4061
What happens if the task is handled by a newly created thread instead?
Code:

const int max_sum_item = 1000000000;
void do_sum(uint64_t *total)
{
      *total = 0; 
      for (int i = 0; i < max_sum_item; i++)
                *total += i;
}
int main()
{
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
    uint64_t result = 0;
    boost::thread worker(do_sum, &result);
    worker.join();
    std::cout << "sum="<<result<<std::endl;

    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration timeTaken = end - start;
    std::cout <<"cost time:"<< timeTaken.total_milliseconds() << std::endl;
}

Output:
sum=499999999500000000
cost time:4346
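As you can see, when the computation runs on a newly created thread rather than on the main thread, the run takes slightly longer (4346 ms versus 4061 ms), since creating and destroying the thread adds overhead. But isn't that overhead rather large?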
Let's optimize the do_sum function as follows:

void do_sum(uint64_t *total)
{
  uint64_t localTotal = 0;
  for (int i = 0; i < max_sum_item; i++)
    localTotal += i;

  *total = localTotal;
}

With the optimized do_sum, the timing is:
sum=499999999500000000
cost time:4068
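The reason is that the unoptimized do_sum dereferences the pointer total on every iteration (*total += i;), and that memory access costs more than the addition itself. The better approach is to accumulate into a local variable localTotal inside the function, which the compiler can usually keep in a register, and to write the result through the pointer total only once at the end.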

Multiple threads

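Note that C++11 lambda expressions require GCC/G++ 4.5 or later. G++ 4.4 does not support them and compilation fails outright. See http://gcc.gnu.org/projects/cxx0x.html
With a compiler that supports them, the partial sums can be added up with a lambda: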

std::for_each(part_sums.begin(), part_sums.end(), [&result] (uint64_t *subtotal) { result += *subtotal; });

The code is as follows:

std::vector<uint64_t *> part_sums;
const int threads_to_use = 2;
void do_partial_sum(uint64_t *final, int start_val, int sums_to_do)
{
    uint64_t sub_result = 0;
    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;

    *final = sub_result;
}
int main()
{
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
    part_sums.clear();
    for (int i = 0; i < threads_to_use; i++)
    {
        part_sums.push_back(new uint64_t(0));
    }
    std::vector<boost::thread *> t;
    int sums_per_thread = max_sum_item / threads_to_use;
    for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
    {
        t.push_back(new boost::thread(do_partial_sum, part_sums[i], start_val, sums_per_thread));
    }
    for (int i = 0; i < threads_to_use; i++)
        t[i]->join();
    uint64_t result = 0;
    // std::for_each(part_sums.begin(), part_sums.end(),myfunc);
    // sum the elements of the vector
    for(int i = 0; i < threads_to_use; i++)
    {
        uint64_t *temp = part_sums[i];
        // std::cout<<*temp<<std::endl;
        result += *temp; // note: each element is a pointer, so dereference it to get the value
    }
    // result = accumulate(part_sums1.begin() , part_sums1.end() ,0);
    for (int i = 0; i < threads_to_use; i++)
    {
        delete t[i];
        delete part_sums[i];
    }
    std::cout << "sum="<<result<<std::endl;

    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration timeTaken = end - start;
    std::cout <<"cost time:"<< timeTaken.total_milliseconds() << std::endl;
}

With two threads, the output is:
sum=499999999500000000
cost time:1907
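The speedup is substantial (4068 ms down to 1907 ms).
Note that the summation over the vector above can also be written more compactly as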

for (std::vector<boost::uint64_t *>::iterator it = part_sums.begin(); it != part_sums.end(); ++it)  result += **it;

Distributing the work across threads

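Take const int max_sum_item = 1000000000 from above. With 7 threads, each thread would be responsible for roughly 142,857,142.86 items. We round that down to 142,857,142, so the 7 threads together cover 999,999,994 items. The remaining 6 items are assigned to the last thread.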

int sums_per_thread = max_sum_item / threads_to_use;

for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
{
    // Lump extra bits onto last thread if work items is not equally divisible by number of threads
    int sums_to_do = sums_per_thread;

    if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
        sums_to_do = max_sum_item - start_val; // tail handling: more than one share but less than two shares remain

    t.push_back(new std::thread(do_partial_sum, part_sums[i], start_val, sums_to_do));

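    if (sums_to_do != sums_per_thread)
        break; // Stop once the first non-standard workload has been handed out: the tail thread takes more than one standard share. Without this break the loop would run once more with start_val = 999,999,994 and create an unnecessary, incorrect extra thread.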
}

The complete code (with 7 threads):

const int max_sum_item = 1000000000;
std::vector<uint64_t *> part_sums;
const int threads_to_use = 7;
void do_partial_sum(uint64_t *final, int start_val, int sums_to_do)
{
    uint64_t sub_result = 0;

    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;

    *final = sub_result;
}
int main()
{
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
    uint64_t result = 0;
    part_sums.clear();
    for (int i = 0; i < threads_to_use; i++)
    {
        part_sums.push_back(new uint64_t(0));
    }
    std::vector<boost::thread *> t;
    int sums_per_thread = max_sum_item / threads_to_use;

    for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
    {
        // Lump extra bits onto last thread if work items is not equally divisible by number of threads
        int sums_to_do = sums_per_thread;

        if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
            sums_to_do = max_sum_item - start_val;//尾部处理,一倍间距之上,两倍间距以内

        t.push_back(new boost::thread(do_partial_sum, part_sums[i], start_val, sums_to_do));

        if (sums_to_do != sums_per_thread)
            break;
    }
    for (int i = 0; i < threads_to_use; i++)
        t[i]->join();
    // sum the elements of the vector
    for(int i = 0; i < threads_to_use; i++)
    {
        uint64_t *temp = part_sums[i];
        // std::cout<<*temp<<std::endl;
        result += *temp;
    }
    // result = accumulate(part_sums1.begin() , part_sums1.end() ,0);
    for (int i = 0; i < threads_to_use; i++)
    {
        delete t[i];
        delete part_sums[i];
        // delete part_sums1[i];
    }
    std::cout << "sum="<<result<<std::endl;

    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration timeTaken = end - start;
    std::cout <<"cost time:"<< timeTaken.total_milliseconds() << std::endl;
    //************************ multithreading test ************************//
    return 0;
}

The output is:

sum=499999999500000000
cost time:546

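The speedup turns out to be more than 7x: 4061 ms on a single thread versus 546 ms here, about 7.4x. Why would that be? Could it be that each thread now has fewer items and the processing time is not linear in the number of items? Part of the gap may simply be run-to-run variation in the baseline, since the single-thread time measured in the multi-core section below is 3874 ms, which puts the speedup at about 7.1x. Additions and discussion are welcome.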

Recording the time taken by each thread

Main code:

std::vector<uint64_t *> part_sums;
boost::mutex coutmutex; // synchronization object guarding std::cout
const int threads_to_use = 7;
void do_partial_sum(uint64_t *final, int start_val, int sums_to_do)
{
    coutmutex.lock();
    std::cout << "Start: TID " << boost::this_thread::get_id() << " starting at " << start_val << ", workload of " << sums_to_do << " items" << std::endl;
    coutmutex.unlock();
    // You can simply output text to cout or a file stream, but as discussed in the first part of this series,
    // stream operations in C++ are not atomic, so you must wrap their use in a synchronization object.
    // Notice that all uses of std::cout must be wrapped in mutex locks as provided by the lock() method of std::mutex (or boost::mutex).
    boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();

    uint64_t sub_result = 0;

    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;

    *final = sub_result;
    boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
    boost::posix_time::time_duration timeTaken = end - start;
    coutmutex.lock();
    std::cout << "End  : TID " << boost::this_thread::get_id() << " with result " << sub_result << ", time taken "<< timeTaken.total_milliseconds() << std::endl;
    //Notice that all uses of std::cout must be wrapped in mutex locks as provided by the lock() method of std::mutex (or boost::mutex).
    coutmutex.unlock(); // without this unlock, the other threads would wait forever
}

The main function is the same as in the previous example.

The output is:
Start: TID 7f7a85a6a700 starting at 142857142, workload of 142857142 items
Start: TID 7f7a84668700 starting at 428571426, workload of 142857142 items
Start: TID 7f7a83266700 starting at 714285710, workload of 142857142 items
Start: TID 7f7a82865700 starting at 857142852, workload of 142857148 items
Start: TID 7f7a8646b700 starting at 0, workload of 142857142 items
Start: TID 7f7a85069700 starting at 285714284, workload of 142857142 items
Start: TID 7f7a83c67700 starting at 571428568, workload of 142857142 items
End : TID 7f7a85a6a700 with result 30612244459183675, time taken 542
End : TID 7f7a82865700 with result 132653065561224474, time taken 543
End : TID 7f7a84668700 with result 71428570500000003, time taken 543
End : TID 7f7a8646b700 with result 10204081438775511, time taken 544
End : TID 7f7a83266700 with result 112244896540816331, time taken 544
End : TID 7f7a83c67700 with result 91836733520408167, time taken 582
End : TID 7f7a85069700 with result 51020407479591839, time taken 583
sum=499999999500000000
cost time:583
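As noted above, output operations are not atomic, so text written by different threads can interleave. Remember to take the lock around them.
Other parts remain to be added.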

Complete code

C++11 version

#include <iostream>       // for std::cout
#include <cstdint>        // for uint64_t
#include <chrono>     // for std::chrono::high_resolution_clock
#include <thread>     // for std::thread
#include <vector>     // for std::vector
#include <algorithm>  // for std::for_each
#include <cassert>        // for assert

#define TRACE

#ifdef TRACE
#include <mutex>      // for std::mutex

std::mutex coutmutex;
#endif

std::vector<uint64_t *> part_sums;
const int max_sum_item = 1000000000;
const int threads_to_use = 7;

void do_partial_sum(uint64_t *final, int start_val, int sums_to_do)
{
#ifdef TRACE
    coutmutex.lock();
    std::cout << "Start: TID " << std::this_thread::get_id() << " starting at " << start_val << ", workload of " << sums_to_do << " items" << std::endl;
    coutmutex.unlock();

    auto start = std::chrono::high_resolution_clock::now();
#endif

    uint64_t sub_result = 0;

    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;

    *final = sub_result;

#ifdef TRACE
    auto end = std::chrono::high_resolution_clock::now();

    coutmutex.lock();
    std::cout << "End  : TID " << std::this_thread::get_id() << " with result " << sub_result << ", time taken "
        << (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den) << std::endl;
    coutmutex.unlock();
#endif
}

int main()
{
  part_sums.clear();

  for (int i = 0; i < threads_to_use; i++)
    part_sums.push_back(new uint64_t(0));

  std::vector<std::thread *> t;

  int sums_per_thread = max_sum_item / threads_to_use;

  auto start = std::chrono::high_resolution_clock::now();

  for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
  {
    // Lump extra bits onto last thread if work items is not equally divisible by number of threads
    int sums_to_do = sums_per_thread;

    if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
        sums_to_do = max_sum_item - start_val;

    t.push_back(new std::thread(do_partial_sum, part_sums[i], start_val, sums_to_do));

    if (sums_to_do != sums_per_thread)
        break;
  }

  for (int i = 0; i < threads_to_use; i++)
    t[i]->join();

  uint64_t result = 0;

  std::for_each(part_sums.begin(), part_sums.end(), [&result] (uint64_t *subtotal) { result += *subtotal; });

  auto end = std::chrono::high_resolution_clock::now();

  for (int i = 0; i < threads_to_use; i++)
  {
    delete t[i];
    delete part_sums[i];
  }

  assert(result == uint64_t(499999999500000000));

  std::cout << "Result is correct" << std::endl;

  std::cout << "Time taken: " << (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den) << std::endl;
}

Boost version

#include <iostream>                   // for std::cout
#include <boost/cstdint.hpp>      // for boost::uint64_t
#include <boost/chrono.hpp>           // for boost::chrono::high_resolution_clock
#include <boost/thread.hpp>           // for boost::thread and boost::mutex
#include <vector>                 // for std::vector
#include <cassert>                    // for assert

#define TRACE

#ifdef TRACE

boost::mutex coutmutex;
#endif

std::vector<boost::uint64_t *> part_sums;
const int max_sum_item = 1000000000;
const int threads_to_use = 7;

void do_partial_sum(boost::uint64_t *final, int start_val, int sums_to_do)
{
#ifdef TRACE
    coutmutex.lock();
    std::cout << "Start: TID " << boost::this_thread::get_id() << " starting at " << start_val << ", workload of " << sums_to_do << " items" << std::endl;
    coutmutex.unlock();

    boost::chrono::high_resolution_clock::time_point start = boost::chrono::high_resolution_clock::now();
#endif

    boost::uint64_t sub_result = 0;

    for (int i = start_val; i < start_val + sums_to_do; i++)
        sub_result += i;

    *final = sub_result;

#ifdef TRACE
    boost::chrono::high_resolution_clock::time_point end = boost::chrono::high_resolution_clock::now();

    coutmutex.lock();
    std::cout << "End  : TID " << boost::this_thread::get_id() << " with result " << sub_result << ", time taken "
        << (end - start).count() * ((double) boost::chrono::high_resolution_clock::period::num / boost::chrono::high_resolution_clock::period::den) << std::endl;
    coutmutex.unlock();
#endif
}

int main()
{
  part_sums.clear();

  for (int i = 0; i < threads_to_use; i++)
    part_sums.push_back(new boost::uint64_t(0));

  std::vector<boost::thread *> t;

  int sums_per_thread = max_sum_item / threads_to_use;

  boost::chrono::high_resolution_clock::time_point start = boost::chrono::high_resolution_clock::now();

  for (int start_val = 0, i = 0; start_val < max_sum_item; start_val += sums_per_thread, i++)
  {
    // Lump extra bits onto last thread if work items is not equally divisible by number of threads
    int sums_to_do = sums_per_thread;

    if (start_val + sums_per_thread < max_sum_item && start_val + sums_per_thread * 2 > max_sum_item)
        sums_to_do = max_sum_item - start_val;

    t.push_back(new boost::thread(do_partial_sum, part_sums[i], start_val, sums_to_do));

    if (sums_to_do != sums_per_thread)
        break;
  }

  for (int i = 0; i < threads_to_use; i++)
    t[i]->join();

  boost::uint64_t result = 0;

  for (std::vector<boost::uint64_t *>::iterator it = part_sums.begin(); it != part_sums.end(); ++it)
      result += **it;

  boost::chrono::high_resolution_clock::time_point end = boost::chrono::high_resolution_clock::now();

  for (int i = 0; i < threads_to_use; i++)
  {
    delete t[i];
    delete part_sums[i];
  }

  assert(result == boost::uint64_t(499999999500000000));

  std::cout << "Result is correct" << std::endl;

  std::cout << "Time taken: " << (end - start).count() * ((double) boost::chrono::high_resolution_clock::period::num / boost::chrono::high_resolution_clock::period::den) << std::endl;
}

Multi-core processors

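Multiple processors give true parallelism, as opposed to time slicing through the operating system scheduler. So how do you decide how many threads to start on a multi-core machine?
std::thread::hardware_concurrency() (or boost::thread::hardware_concurrency())
reports how many threads the hardware can run concurrently. Note that this is the number of logical cores the system can detect, and that it is only a hint: the function may return 0 when the value cannot be determined. For example, a benchmark machine with a 4-core i7 and Hyper-Threading reports 8.
Usage: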

for (int threads_to_use = 1; threads_to_use <= static_cast<int>(std::thread::hardware_concurrency()); threads_to_use++)
{
  // original code

  std::cout << "Time taken with " << threads_to_use << " core" << (threads_to_use != 1? "s":"") << ": " << (end - start).count() * ((double) std::chrono::high_resolution_clock::period::num / std::chrono::high_resolution_clock::period::den) << std::endl;
}

The Boost version uses boost::thread::hardware_concurrency().
To set the thread count dynamically, remove the const qualifier from the original threads_to_use so that it can vary, and wrap the body of the original main function in one more loop:
for (int threads_to_use = 1; threads_to_use <= static_cast<int>(boost::thread::hardware_concurrency()); threads_to_use++)
{
    // original code
    std::cout << "Time taken with " << threads_to_use << " core" << (threads_to_use != 1? "s":"") << ": " << timeTaken.total_milliseconds() << std::endl;
}
In full:

int main()
{
    for (int threads_to_use = 1; threads_to_use <= static_cast<int>(boost::thread::hardware_concurrency()); threads_to_use++)
    {
        // The original code goes here. The number of threads, threads_to_use, now varies with the loop.
        // Note that the test machine used in this article has 24 logical cores.
        std::cout << "Time taken with " << threads_to_use << " core" << (threads_to_use != 1? "s":"") << ": " << timeTaken.total_milliseconds() << std::endl;
    }
    return 0;
}

The output is:
Time taken with 1 core: 3874
Time taken with 2 cores: 1927
Time taken with 3 cores: 1289
Time taken with 4 cores: 965
Time taken with 5 cores: 773
Time taken with 6 cores: 643
Time taken with 7 cores: 552
Time taken with 8 cores: 482
Time taken with 9 cores: 429
Time taken with 10 cores: 386
Time taken with 11 cores: 358
Time taken with 12 cores: 327
Time taken with 13 cores: 406
Time taken with 14 cores: 387
Time taken with 15 cores: 374
Time taken with 16 cores: 394
Time taken with 17 cores: 337
Time taken with 18 cores: 304
Time taken with 19 cores: 314
Time taken with 20 cores: 303
Time taken with 21 cores: 296
Time taken with 22 cores: 285
Time taken with 23 cores: 279
Time taken with 24 cores: 267
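As the plot below shows, the time drops steadily up to 12 threads, rebounds at 13, and fluctuates after that. Near-linear scaling therefore ends at about half the available logical cores, which is a reasonable choice for the thread count, although the lowest time in this run (267 ms) is at 24 threads, compared with 327 ms at 12.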

[Figure: cost time in milliseconds versus number of cores]
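Checking the number of CPUs and cores on the machine shows that it has 2 physical CPUs (cat /proc/cpuinfo | grep "physical id" | sort | uniq | wc -l), 24 logical CPUs (cat /proc/cpuinfo | grep "processor" | wc -l), and 6 cores per CPU (cat /proc/cpuinfo | grep "cores" | uniq). There are 24 logical processors because Hyper-Threading is supported: 2 CPUs x 6 cores x 2 hardware threads per core.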
[Figure: screenshot of the CPU information on the test machine]
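Python code used for the plot:

import matplotlib.pyplot as plt
import numpy as np
# measured times in milliseconds for 1 to 24 threads
y = [3874,1927,1289,965,773,643,552,482,429,386,358,327,406,387,374,394,337,304,314,303,296,285,279,267]
x = np.arange(1,25)
x1 = x.tolist()
print(type(x1))
print(len(x1))
print(len(y))
print(y)
plt.plot(x1,y,'r--')
plt.axis([1, 24, 0, 4000])
plt.title('cost time of cores')
plt.xlabel('number of cores')
plt.ylabel('cost time/milliseconds')
plt.show()

Thread synchronization

Although using multiple threads can improve an application's performance, it also adds complexity. When threads execute several functions at the same time, access to shared resources must be synchronized accordingly. Once an application reaches a certain size, this takes considerable work. This section introduces the classes Boost.Thread provides for synchronizing threads.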

Reference:
https://katyscode.wordpress.com/2013/08/15/c11-boost-multi-threading-the-parallel-aggregation-pattern/
