1. Overview
This article compares the performance of three synchronization mechanisms in a multi-threaded environment: ConcurrentQueue (a lock-free queue), std::atomic_flag (used as a spinlock), and std::mutex. The tests cover insertion (push) and removal (pop), measured separately for small and large data items.
Test goals:
- Compare the performance of std::mutex and std::atomic_flag. Reference: the difference between a mutex and a spin lock.
- Measure the performance of the concurrentqueue queue in a multi-threaded environment.
- Measure the performance of readerwriterqueue in a single-threaded environment.
- Verify the performance of a lock-free list built on the C++ STL with CAS atomic operations. Reference: lock-free list.
Conclusions:
- For multi-threaded operations on small data structures (such as std::list or std::deque), prefer std::atomic_flag over std::mutex for better performance (a minimal sketch of both locking patterns follows below).
- For larger data structures, std::mutex delivers better performance while still guaranteeing thread safety.
- New business code can use the lock-free queue ConcurrentQueue where appropriate; it performs especially well when handling large volumes of data.
Full test code: lock_test
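To make the comparison concrete, here is a minimal sketch of the two locking patterns the tests exercise on a shared std::list. The helper names spin_push and mutex_push are illustrative only and are not part of the test code; the full benchmark appears in section 6.

#include <atomic>
#include <list>
#include <mutex>

// std::atomic_flag used as a spinlock: test_and_set() loops until the flag
// was previously clear, and clear() releases it.
void spin_push(std::atomic_flag& flag, std::list<int>& lst, int value) {
    while (flag.test_and_set(std::memory_order_acquire)) {
        // busy-wait until another thread calls clear()
    }
    lst.push_back(value);
    flag.clear(std::memory_order_release);
}

// std::mutex via RAII: the thread blocks (sleeps) instead of spinning
// when the lock is contended.
void mutex_push(std::mutex& mtx, std::list<int>& lst, int value) {
    std::lock_guard<std::mutex> lock(mtx);
    lst.push_back(value);
}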
2. Performance Test Results
2.1 Test 1: 30 concurrent threads, 10,000 items of 2 KB each (push and pop)
Test results
Platform | Queue type | pushTime (ms) | popTime (ms) |
---|---|---|---|
Linux-arm1 | ConcurrentQueue | 595.476 | 328.856 |
Linux-arm1 | atomic_flag | 412.675 | 955.207 |
Linux-arm1 | std::mutex | 946.301 | 907.553 |
Linux-arm2 | ConcurrentQueue | 1584.1 | 333.36 |
Linux-arm2 | atomic_flag | 576.209 | 1479.5 |
Linux-arm2 | std::mutex | 1133.68 | 1107.63 |
Linux-arm3 | ConcurrentQueue | 1005.89 | 244.84 |
Linux-arm3 | atomic_flag | 355.606 | 402.343 |
Linux-arm3 | std::mutex | 597.448 | 739.805 |
Linux-x86 | ConcurrentQueue | 140.899 | 80.4264 |
Linux-x86 | atomic_flag | 136.703 | 136.91 |
Linux-x86 | std::mutex | 231.019 | 213.732 |
Windows | ConcurrentQueue | 200.119 | 142.239 |
Windows | atomic_flag | 602.542 | 482.394 |
Windows | std::mutex | 483.498 | 306.393 |
Conclusions:
- Linux: for push, atomic_flag beats ConcurrentQueue, and mutex is the slowest. For pop, ConcurrentQueue performs best.
- Windows: ConcurrentQueue performs best in every scenario.
2.2 Test 2: 30 concurrent threads, 10,000 items of 20 KB each (push and pop)
Test results
Platform | Queue type | pushTime (ms) | popTime (ms) |
---|---|---|---|
Linux-arm1 | ConcurrentQueue | 6936.41 | 732.457 |
Linux-arm1 | atomic_flag | 4256.77 | 7103.61 |
Linux-arm1 | std::mutex | 5165.12 | 4044.51 |
Linux-arm2 | ConcurrentQueue | 18411.2 | 942.713 |
Linux-arm2 | atomic_flag | 5750.07 | 11232.5 |
Linux-arm2 | std::mutex | 7236.02 | 6573.35 |
Linux-arm3 | ConcurrentQueue | 5399.57 | 247.965 |
Linux-arm3 | atomic_flag | 2253.27 | 1285.47 |
Linux-arm3 | std::mutex | 2625.55 | 1588.46 |
Linux-x86 | ConcurrentQueue | 2231.09 | 183.022 |
Linux-x86 | atomic_flag | 1117.93 | 715.601 |
Linux-x86 | std::mutex | 1288.95 | 805.378 |
Windows | ConcurrentQueue | 8047.59 | 5098.22 |
Windows | atomic_flag | 16736.2 | 26468.2 |
Windows | std::mutex | 21498.3 | 50173.6 |
Conclusions:
As the data item size grows, atomic_flag loses its lead and its performance drops, while ConcurrentQueue continues to perform well, especially for pop operations (and, on Windows, for push as well).
- Linux: for push, atomic_flag beats std::mutex, and ConcurrentQueue is the slowest. For pop, ConcurrentQueue performs best.
- Windows: ConcurrentQueue performs best across the board.
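For reference, below is a minimal sketch of the moodycamel::ConcurrentQueue calls the benchmark relies on: producers call enqueue() and consumers call try_dequeue(), with no external lock. The thread and item counts here are illustrative and are not the benchmark parameters.

#include <thread>
#include <vector>
#include "concurrentqueue.h" // moodycamel's header-only lock-free queue

int main() {
    moodycamel::ConcurrentQueue<int> queue;
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t) {
        workers.emplace_back([&queue, t]() {
            for (int i = 0; i < 1000; ++i) {
                queue.enqueue(t * 1000 + i);   // lock-free push
            }
            int value;
            while (queue.try_dequeue(value)) { // lock-free pop; returns false when (momentarily) empty
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return 0;
}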
3. Hardware Configuration
- Linux-x86
CPU: Intel(R) Core(TM) i7-9700 CPU @ 3.00GHz (8 cores)
Memory: 16GB
OS: Linux 5.4.0-47-generic #51~18.04.1-Ubuntu SMP x86_64
- Linux-arm1
CPU: 96 cores, ARM architecture
Memory: 64GB
OS: Linux 4.15.0-71-generic #4 SMP aarch64
- Linux-arm2
CPU: 96 cores, ARM architecture
Memory: 256GB
OS: Linux 4.15.0-71-generic #2 SMP aarch64
- Linux-arm3
CPU: Unknown
Memory: 256GB
OS: Linux 4.15.0-45-generic #48~16.04.1-Ubuntu SMP x86_64
- Windows
CPU: Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz
Memory: 8GB (1867 MHz)
OS: Windows 10 x64
4. Additional Notes
The tests also tried a lock-free list built on the C++ STL with CAS atomic operations, but the results were disappointing, possibly because of optimizations in newer GCC versions. This approach is therefore not recommended.
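For context, a hand-rolled lock-free list of this kind typically pushes new nodes with a CAS retry loop along the following lines. This is a simplified, Treiber-stack-style sketch for illustration only; it is not the code that was benchmarked, and it omits pop and safe memory reclamation (ABA handling, hazard pointers), which is a large part of why such lists are hard to make both correct and fast.

#include <atomic>

template <typename T>
class LockFreeList {
    struct Node {
        T value;
        Node* next;
    };
    std::atomic<Node*> head_{nullptr};

public:
    void push_front(const T& value) {
        Node* node = new Node{value, head_.load(std::memory_order_relaxed)};
        // Retry until no other thread has changed head_ between our load and the CAS.
        // On failure, compare_exchange_weak reloads the current head into node->next.
        while (!head_.compare_exchange_weak(node->next, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }
};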
5. References
- mutex 和 spin lock 的区别 (the difference between a mutex and a spin lock)
- 无锁 list (lock-free list)
6. Full Code
#include <iostream>
#include <list>
#include <deque>
#include <vector>
#include <thread>
#include <atomic>
#include <mutex>
#include <chrono>
#include "concurrentqueue.h" // Include ConcurrentQueue library
// Small data item: 512 ints, roughly 2 KB, matching test 1
struct SmallItem {
    int data[512];
};

// Large data item: 5120 ints, roughly 20 KB, matching test 2
struct LargeItem {
    int data[5120];
};

// Test parameters: 30 concurrent threads processing 10,000 items in total
const int NUM_THREADS = 30;
const int NUM_ITEMS = 10000;
// Test function template: std::list protected by a std::atomic_flag spinlock
template<typename T>
void testPerformance(std::atomic_flag& lock, std::list<T>& container, std::vector<T>& data) {
    auto start = std::chrono::high_resolution_clock::now();
    std::vector<std::thread> threads;

    // Push phase: each thread inserts its own share of the data, so NUM_ITEMS
    // items are pushed in total across all threads.
    for (int i = 0; i < NUM_THREADS; ++i) {
        threads.push_back(std::thread([&, i]() {
            const int perThread = NUM_ITEMS / NUM_THREADS;
            for (int j = 0; j < perThread; ++j) {
                while (lock.test_and_set(std::memory_order_acquire)); // spin until the lock is free
                container.push_back(data[i * perThread + j]);
                lock.clear(std::memory_order_release); // release the spinlock
            }
        }));
    }
    for (auto& t : threads) {
        t.join();
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Push time: " << elapsed.count() << " seconds\n";

    // Pop phase: each thread removes its share of the items under the same spinlock.
    start = std::chrono::high_resolution_clock::now();
    threads.clear();
    for (int i = 0; i < NUM_THREADS; ++i) {
        threads.push_back(std::thread([&]() {
            for (int j = 0; j < NUM_ITEMS / NUM_THREADS; ++j) {
                while (lock.test_and_set(std::memory_order_acquire));
                if (!container.empty()) {
                    container.pop_front();
                }
                lock.clear(std::memory_order_release);
            }
        }));
    }
    for (auto& t : threads) {
        t.join();
    }
    end = std::chrono::high_resolution_clock::now();
    elapsed = end - start;
    std::cout << "Pop time: " << elapsed.count() << " seconds\n";
}
// Test function template: std::list protected by a std::mutex
template<typename T>
void testPerformance(std::mutex& mtx, std::list<T>& container, std::vector<T>& data) {
    auto start = std::chrono::high_resolution_clock::now();
    std::vector<std::thread> threads;

    // Push phase: each thread inserts its own share of the data under the mutex.
    for (int i = 0; i < NUM_THREADS; ++i) {
        threads.push_back(std::thread([&, i]() {
            const int perThread = NUM_ITEMS / NUM_THREADS;
            for (int j = 0; j < perThread; ++j) {
                std::lock_guard<std::mutex> lock(mtx);
                container.push_back(data[i * perThread + j]);
            }
        }));
    }
    for (auto& t : threads) {
        t.join();
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Push time: " << elapsed.count() << " seconds\n";

    // Pop phase: each thread removes its share of the items under the mutex.
    start = std::chrono::high_resolution_clock::now();
    threads.clear();
    for (int i = 0; i < NUM_THREADS; ++i) {
        threads.push_back(std::thread([&]() {
            for (int j = 0; j < NUM_ITEMS / NUM_THREADS; ++j) {
                std::lock_guard<std::mutex> lock(mtx);
                if (!container.empty()) {
                    container.pop_front();
                }
            }
        }));
    }
    for (auto& t : threads) {
        t.join();
    }
    end = std::chrono::high_resolution_clock::now();
    elapsed = end - start;
    std::cout << "Pop time: " << elapsed.count() << " seconds\n";
}
// Test function template: moodycamel::ConcurrentQueue (lock-free)
template<typename T>
void testPerformance(moodycamel::ConcurrentQueue<T>& queue, std::vector<T>& data) {
    auto start = std::chrono::high_resolution_clock::now();
    std::vector<std::thread> threads;

    // Push phase: each thread enqueues its own share of the data; no external lock is needed.
    for (int i = 0; i < NUM_THREADS; ++i) {
        threads.push_back(std::thread([&, i]() {
            const int perThread = NUM_ITEMS / NUM_THREADS;
            for (int j = 0; j < perThread; ++j) {
                queue.enqueue(data[i * perThread + j]);
            }
        }));
    }
    for (auto& t : threads) {
        t.join();
    }
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> elapsed = end - start;
    std::cout << "Push time: " << elapsed.count() << " seconds\n";

    // Pop phase: each thread dequeues its share; try_dequeue returns false if the queue is momentarily empty.
    start = std::chrono::high_resolution_clock::now();
    threads.clear();
    for (int i = 0; i < NUM_THREADS; ++i) {
        threads.push_back(std::thread([&]() {
            T item;
            for (int j = 0; j < NUM_ITEMS / NUM_THREADS; ++j) {
                queue.try_dequeue(item);
            }
        }));
    }
    for (auto& t : threads) {
        t.join();
    }
    end = std::chrono::high_resolution_clock::now();
    elapsed = end - start;
    std::cout << "Pop time: " << elapsed.count() << " seconds\n";
}
int main() {
    std::vector<SmallItem> smallData(NUM_ITEMS);
    std::vector<LargeItem> largeData(NUM_ITEMS);

    std::cout << "Testing with std::atomic_flag and SmallItem\n";
    std::list<SmallItem> smallList;
    std::atomic_flag atomicFlag = ATOMIC_FLAG_INIT;
    testPerformance(atomicFlag, smallList, smallData);

    std::cout << "Testing with std::mutex and SmallItem\n";
    std::list<SmallItem> smallListMutex;
    std::mutex mtx;
    testPerformance(mtx, smallListMutex, smallData);

    std::cout << "Testing with ConcurrentQueue and SmallItem\n";
    moodycamel::ConcurrentQueue<SmallItem> smallQueue;
    testPerformance(smallQueue, smallData);

    std::cout << "Testing with std::atomic_flag and LargeItem\n";
    std::list<LargeItem> largeList;
    testPerformance(atomicFlag, largeList, largeData);

    std::cout << "Testing with std::mutex and LargeItem\n";
    std::list<LargeItem> largeListMutex;
    testPerformance(mtx, largeListMutex, largeData);

    std::cout << "Testing with ConcurrentQueue and LargeItem\n";
    moodycamel::ConcurrentQueue<LargeItem> largeQueue;
    testPerformance(largeQueue, largeData);

    return 0;
}
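The listing is a single translation unit. Assuming concurrentqueue.h from the moodycamel repository is available on the include path, a standard build such as g++ -std=c++11 -O2 -pthread should compile it; the exact command line is not part of the original article and is given here only as an assumption.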