1. Overview
This article compares the performance of three synchronization mechanisms in a multi-threaded environment: ConcurrentQueue (a lock-free queue), std::atomic_flag (used as a spinlock), and std::mutex. The tests cover insertion (push) and removal (pop), measured separately for small and large data items.
Test goals:
- Compare the performance of std::mutex and std::atomic_flag. Reference: the difference between a mutex and a spin lock.
- Measure the performance of the concurrentqueue queue in a multi-threaded environment.
- Measure the performance of readerwriterqueue in a single-threaded environment.
- Verify the performance of a lock-free list built on the C++ STL with CAS atomic operations. Reference: lock-free list.
Conclusions:
- For multi-threaded operations on small data structures (such as std::list or std::deque), prefer std::atomic_flag over std::mutex for better performance (a minimal sketch of both locking patterns follows below).
- For larger data structures, std::mutex delivers better performance while still guaranteeing thread safety.
- New business code can use the lock-free queue ConcurrentQueue where appropriate; it performs especially well when handling large volumes of data.
Full test code: lock_test
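To make the comparison concrete, here is a minimal sketch of the two locking patterns the tests exercise on a shared std::list. The helper names spin_push and mutex_push are illustrative only and are not part of the test code; the full benchmark appears in section 6.

#include <atomic>
#include <list>
#include <mutex>

// std::atomic_flag used as a spinlock: test_and_set() loops until the flag
// was previously clear, and clear() releases it.
void spin_push(std::atomic_flag& flag, std::list<int>& lst, int value) {
    while (flag.test_and_set(std::memory_order_acquire)) {
        // busy-wait until another thread calls clear()
    }
    lst.push_back(value);
    flag.clear(std::memory_order_release);
}

// std::mutex via RAII: the thread blocks (sleeps) instead of spinning
// when the lock is contended.
void mutex_push(std::mutex& mtx, std::list<int>& lst, int value) {
    std::lock_guard<std::mutex> lock(mtx);
    lst.push_back(value);
}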
2. Performance Test Results
2.1 Test 1: 30 concurrent threads, 10,000 items of 2 KB each (push and pop)
Test results
Platform | Queue type | pushTime (ms) | popTime (ms) |
---|---|---|---|
Linux-arm1 | ConcurrentQueue | 595.476 | 328.856 |
Linux-arm1 | atomic_flag | 412.675 | 955.207 |
Linux-arm1 | std::mutex | 946.301 | 907.553 |
Linux-arm2 | ConcurrentQueue | 1584.1 | 333.36 |
Linux-arm2 | atomic_flag | 576.209 | 1479.5 |
Linux-arm2 | std::mutex | 1133.68 | 1107.63 |
Linux-arm3 | ConcurrentQueue | 1005.89 | 244.84 |
Linux-arm3 | atomic_flag | 355.606 | 402.343 |
Linux-arm3 | std::mutex | 597.448 | 739.805 |
Linux-x86 | ConcurrentQueue | 140.899 | 80.4264 |
Linux-x86 | atomic_flag | 136.703 | 136.91 |
Linux-x86 | std::mutex | 231.019 | 213.732 |
Windows | ConcurrentQueue | 200.119 | 142.239 |
Windows | atomic_flag | 602.542 | 482.394 |
Windows | std::mutex | 483.498 | 306.393 |
Conclusions:
- Linux: for push, atomic_flag beats ConcurrentQueue, and mutex is the slowest. For pop, ConcurrentQueue performs best.
- Windows: ConcurrentQueue performs best in every scenario.
2.2 Test 2: 30 concurrent threads, 10,000 items of 20 KB each (push and pop)
Test results
Platform | Queue type | pushTime (ms) | popTime (ms) |
---|---|---|---|
Linux-arm1 | ConcurrentQueue | 6936.41 | 732.457 |
Linux-arm1 | atomic_flag | 4256.77 | 7103.61 |
Linux-arm1 | std::mutex | 5165.12 | 4044.51 |
Linux-arm2 | ConcurrentQueue | 18411.2 | 942.713 |
Linux-arm2 | atomic_flag | 5750.07 | 11232.5 |
Linux-arm2 | std::mutex | 7236.02 | 6573.35 |
Linux-arm3 | ConcurrentQueue | 5399.57 | 247.965 |
Linux-arm3 | atomic_flag | 2253.27 | 1285.47 |
Linux-arm3 | std::mutex | 2625.55 | 1588.46 |
Linux-x86 | ConcurrentQueue | 2231.09 | 183.022 |
Linux-x86 | atomic_flag | 1117.93 | 715.601 |
Linux-x86 | std::mutex | 1288.95 | 805.378 |
Windows | ConcurrentQueue | 8047.59 | 5098.22 |
Windows | atomic_flag | 16736.2 | 26468.2 |
Windows | std::mutex | 21498.3 | 50173.6 |
Conclusions:
As the data item size grows, atomic_flag loses its lead and its performance drops, while ConcurrentQueue continues to perform well, especially for pop operations (and, on Windows, for push as well).
- Linux: for push, atomic_flag beats std::mutex, and ConcurrentQueue is the slowest. For pop, ConcurrentQueue performs best.
- Windows: ConcurrentQueue performs best across the board.
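For reference, below is a minimal sketch of the moodycamel::ConcurrentQueue calls the benchmark relies on: producers call enqueue() and consumers call try_dequeue(), with no external lock. The thread and item counts here are illustrative and are not the benchmark parameters.

#include <thread>
#include <vector>
#include "concurrentqueue.h" // moodycamel's header-only lock-free queue

int main() {
    moodycamel::ConcurrentQueue<int> queue;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) {
        workers.emplace_back([&queue, t]() {
            for (int i = 0; i < 1000; ++i) {
                queue.enqueue(t * 1000 + i);   // lock-free push
            }
            int value;
            while (queue.try_dequeue(value)) { // lock-free pop; returns false when (momentarily) empty
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return 0;
}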
3. Hardware Configuration
- Linux-x86
CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz (8 cores)
Memory: 16GB
OS: Linux 5.4.0-47-generic #51~18.04.1-Ubuntu SMP x86_64
- Linux-arm1
CPU: 96 cores, ARM architecture
Memory: 64GB
OS: Linux 4.15.0-71-generic #4 SMP aarch64
- Linux-arm2
CPU: 96 cores, ARM architecture
Memory: 256GB
OS: Linux 4.15.0-71-generic #2 SMP aarch64
- Linux-arm3
CPU: Unknown
Memory: 256GB
OS: Linux 4.15.0-45-generic #48~16.04.1-Ubuntu SMP x86_64
- Windows
CPU: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
Memory: 8GB (1867 MHz)
OS: Windows 10 x64
4. Additional Notes
The tests also tried a lock-free list built on the C++ STL with CAS atomic operations, but the results were disappointing, possibly because of optimizations in newer GCC versions. This approach is therefore not recommended.
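For context, a hand-rolled lock-free list of this kind typically pushes new nodes with a CAS retry loop along the following lines. This is a simplified, Treiber-stack-style sketch for illustration only; it is not the code that was benchmarked, and it omits pop and safe memory reclamation (ABA handling, hazard pointers), which is a large part of why such lists are hard to make both correct and fast.

#include <atomic>

template <typename T>
class LockFreeList {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> head_{nullptr};

public:
    void push_front(const T& value) {
        Node* node = new Node{value, head_.load(std::memory_order_relaxed)};
        // Retry until no other thread has changed head_ between our load and the CAS.
        // On failure, compare_exchange_weak reloads the current head into node->next.
        while (!head_.compare_exchange_weak(node->next, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }
};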
5. References
- mutex 和 spin lock 的区别 (the difference between a mutex and a spin lock)
- 无锁 list (lock-free list)
6. Full Code
#include <iostream>
#include <list>
#include <deque>
#include <vector>
#include <thread>
#include <atomic>
#include <mutex>
#include <chrono>
#include "concurrentqueue.h" // Include ConcurrentQueue library
// Small data item: 512 ints, roughly 2 KB, matching test 1
struct SmallItem {
    int data[512];
};

// Large data item: 5120 ints, roughly 20 KB, matching test 2
struct LargeItem {
    int data[5120];
};

// Test parameters: 30 concurrent threads processing 10,000 items in total
const int NUM_THREADS = 30;
const int NUM_ITEMS = 10000;
// Test function template: std::list protected by a std::atomic_flag spinlock
template<typename T>
void testPerformance(std::atomic_flag& lock, std::list<T>& container, std::vector<T>& data) {
    auto start = std::chrono::high_resolution_clock::now();
    std::vector<std::thread> threads;

    // Push phase: each thread inserts its own share of the data, so NUM_ITEMS
    // items are pushed in total across all threads.
    for (int i = 0; i < NUM_THREADS; ++i) {
        threads.push_back(std::thread([&, i]() {
            const int perThread = NUM_ITEMS / NUM_THREADS;
            for (int j = 0; j < perThread; ++j) {
                while (lock.test_and_set(std::memory_order_acquire)); // spin until the lock is free
                container.push_back(data[i * perThread + j]);
                lock.clear(std::memory_order_release); // release the spinlock
            }
        }));
    }
    for (auto& t : threads) {
        t.join();
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Push time: " << elapsed.count() << " seconds\n";

    // Pop phase: each thread removes its share of the items under the same spinlock.
    start = std::chrono::high_resolution_clock::now();
    threads.clear();
    for (int i = 0; i < NUM_THREADS; ++i) {
        threads.push_back(std::thread([&]() {
            for (int j = 0; j < NUM_ITEMS / NUM_THREADS; ++j) {
                while (lock.test_and_set(std::memory_order_acquire));
                if (!container.empty()) {
                    container.pop_front();
                }
                lock.clear(std::memory_order_release);
            }
        }));
    }
    for (auto& t : threads) {
        t.join();
    }
    end = std::chrono::high_resolution_clock::now();
    elapsed = end - start;
    std::cout << "Pop time: " << elapsed.count() << " seconds\n";
}
// Test function template: std::list protected by a std::mutex
template<typename T>
void testPerformance(std::mutex& mtx, std::list<T>& container, std::vector<T>& data) {
    auto start = std::chrono::high_resolution_clock::now();
    std::vector<std::thread> threads;

    // Push phase: each thread inserts its own share of the data under the mutex.
    for (int i = 0; i < NUM_THREADS; ++i) {
        threads.push_back(std::thread([&, i]() {
            const int perThread = NUM_ITEMS / NUM_THREADS;
            for (int j = 0; j < perThread; ++j) {
                std::lock_guard<std::mutex> lock(mtx);
                container.push_back(data[i * perThread + j]);
            }
        }));
    }
    for (auto& t : threads) {
        t.join();
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Push time: " << elapsed.count() << " seconds\n";

    // Pop phase: each thread removes its share of the items under the mutex.
    start = std::chrono::high_resolution_clock::now();
    threads.clear();
    for (int i = 0; i < NUM_THREADS; ++i) {
        threads.push_back(std::thread([&]() {
            for (int j = 0; j < NUM_ITEMS / NUM_THREADS; ++j) {
                std::lock_guard<std::mutex> lock(mtx);
                if (!container.empty()) {
                    container.pop_front();
                }
            }
        }));
    }
    for (auto& t : threads) {
        t.join();
    }
    end = std::chrono::high_resolution_clock::now();
    elapsed = end - start;
    std::cout << "Pop time: " << elapsed.count() << " seconds\n";
}
// Test function template: moodycamel::ConcurrentQueue (lock-free)
template<typename T>
void testPerformance(moodycamel::ConcurrentQueue<T>& queue, std::vector<T>& data) {
    auto start = std::chrono::high_resolution_clock::now();
    std::vector<std::thread> threads;

    // Push phase: each thread enqueues its own share of the data; no external lock is needed.
    for (int i = 0; i < NUM_THREADS; ++i) {
        threads.push_back(std::thread([&, i]() {
            const int perThread = NUM_ITEMS / NUM_THREADS;
            for (int j = 0; j < perThread; ++j) {
                queue.enqueue(data[i * perThread + j]);
            }
        }));
    }
    for (auto& t : threads) {
        t.join();
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Push time: " << elapsed.count() << " seconds\n";

    // Pop phase: each thread dequeues its share; try_dequeue returns false if the queue is momentarily empty.
    start = std::chrono::high_resolution_clock::now();
    threads.clear();
    for (int i = 0; i < NUM_THREADS; ++i) {
        threads.push_back(std::thread([&]() {
            T item;
            for (int j = 0; j < NUM_ITEMS / NUM_THREADS; ++j) {
                queue.try_dequeue(item);
            }
        }));
    }
    for (auto& t : threads) {
        t.join();
    }
    end = std::chrono::high_resolution_clock::now();
    elapsed = end - start;
    std::cout << "Pop time: " << elapsed.count() << " seconds\n";
}
int main() {
    std::vector<SmallItem> smallData(NUM_ITEMS);
    std::vector<LargeItem> largeData(NUM_ITEMS);

    std::cout << "Testing with std::atomic_flag and SmallItem\n";
    std::list<SmallItem> smallList;
    std::atomic_flag atomicFlag = ATOMIC_FLAG_INIT;
    testPerformance(atomicFlag, smallList, smallData);

    std::cout << "Testing with std::mutex and SmallItem\n";
    std::list<SmallItem> smallListMutex;
    std::mutex mtx;
    testPerformance(mtx, smallListMutex, smallData);

    std::cout << "Testing with ConcurrentQueue and SmallItem\n";
    moodycamel::ConcurrentQueue<SmallItem> smallQueue;
    testPerformance(smallQueue, smallData);

    std::cout << "Testing with std::atomic_flag and LargeItem\n";
    std::list<LargeItem> largeList;
    testPerformance(atomicFlag, largeList, largeData);

    std::cout << "Testing with std::mutex and LargeItem\n";
    std::list<LargeItem> largeListMutex;
    testPerformance(mtx, largeListMutex, largeData);

    std::cout << "Testing with ConcurrentQueue and LargeItem\n";
    moodycamel::ConcurrentQueue<LargeItem> largeQueue;
    testPerformance(largeQueue, largeData);

    return 0;
}
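The listing is a single translation unit. Assuming concurrentqueue.h from the moodycamel repository is available on the include path, a standard build such as g++ -std=c++11 -O2 -pthread should compile it; the exact command line is not part of the original article and is given here only as an assumption.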