False sharing arises from two facts:
- Every cache read actually loads an entire cache line, not just the requested variable
- Multiple threads may therefore end up caching the same line

Because write-back to memory also happens at cache-line granularity, threads writing to different variables in the same line cannot truly share it: each write invalidates the other core's copy, and the line has to be re-synchronized through the coherence protocol.
There are two common fixes (a sketch of the first follows this list):
- Isolate the concurrently updated data on separate cache lines, e.g. with alignas or manual padding
- Use thread_local so that each thread works on its own private copy
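A minimal sketch of the alignas approach, assuming a 64-byte cache line; where the toolchain provides it, C++17's std::hardware_destructive_interference_size (from <new>) can replace the hard-coded 64:

```cpp
// Force each counter onto its own cache line. 64 is an assumed line
// size; verify it for the target CPU.
struct AlignedData {
    alignas(64) int x;
    alignas(64) int y;
};
// Each 64-byte-aligned field occupies a full line, so the struct spans two.
static_assert(sizeof(AlignedData) == 128, "x and y sit on separate lines");
```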
Perf
From the analysis above, false sharing manifests primarily as cache misses, so we count the corresponding event:

```bash
perf stat -e cache-misses ./benchmark
```
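It can also help to record cache-references alongside the misses to get a hit-rate baseline (which events are available depends on the CPU's PMU):

```bash
perf stat -e cache-misses,cache-references ./benchmark
```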
Example code

Original code
```cpp
#include <functional> // std::ref
#include <iostream>
#include <thread>

// x and y are adjacent, so they almost certainly sit on the same
// cache line: the two writer threads keep invalidating each other.
struct SharedData {
    int x;
    int y;
};

const int epoch = 1000000000;

void incrementX(SharedData &data) {
    for (int i = 0; i < epoch; ++i) {
        data.x++;
    }
}

void incrementY(SharedData &data) {
    for (int i = 0; i < epoch; ++i) {
        data.y++;
    }
}

int main() {
    SharedData data{0, 0};
    std::thread t1(incrementX, std::ref(data));
    std::thread t2(incrementY, std::ref(data));
    t1.join();
    t2.join();
    std::cout << "x: " << data.x << ", y: " << data.y << std::endl;
    return 0;
}
```
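All three variants build the same way; the measurements below were taken at -O0 -g (the source file name is an assumption):

```bash
g++ -O0 -g -pthread benchmark.cpp -o benchmark
```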
threadlocal

With thread_local, each thread increments its own private copy of the data, so the writers never touch a shared cache line; each worker copies its result back to a main-thread variable before finishing.
```cpp
#include <iostream>
#include <thread>

struct ThreadData {
    int x;
    int y;
};

// Every thread gets its own instance, so nothing is shared between writers.
thread_local ThreadData threadData = {0, 0};

const int epoch = 1000000000;

void incrementX() {
    for (int i = 0; i < epoch; ++i) {
        threadData.x++;
    }
}

void incrementY() {
    for (int i = 0; i < epoch; ++i) {
        threadData.y++;
    }
}

int main() {
    int totalX = 0, totalY = 0;
    std::thread t1([&totalX]() {
        incrementX();
        totalX = threadData.x; // copy thread 1's result to a main-thread variable
    });
    std::thread t2([&totalY]() {
        incrementY();
        totalY = threadData.y; // copy thread 2's result to a main-thread variable
    });
    t1.join();
    t2.join();
    std::cout << "x: " << totalX << ", y: " << totalY << std::endl;
    return 0;
}
```
padding

The 60-byte padding array places y exactly 64 bytes after x, so the two counters always land on different 64-byte cache lines.
```cpp
#include <functional> // std::ref
#include <iostream>
#include <thread>

struct SharedData {
    int x;
    char padding[60]; // pushes y a full 64 bytes past x
    int y;
};

const int epoch = 1000000000;

void incrementX(SharedData &data) {
    for (int i = 0; i < epoch; ++i) {
        data.x++;
    }
}

void incrementY(SharedData &data) {
    for (int i = 0; i < epoch; ++i) {
        data.y++;
    }
}

int main() {
    SharedData data{}; // zero-initializes x, padding, and y
    std::thread t1(incrementX, std::ref(data));
    std::thread t2(incrementY, std::ref(data));
    t1.join();
    t2.join();
    std::cout << "x: " << data.x << ", y: " << data.y << std::endl;
    return 0;
}
```
Results
| -O0 -g | benchmark | threadlocal | padding |
| --- | --- | --- | --- |
| cache-misses | 111727010 | 422532 | 208243 |
| time (s) | 2.53153364 | 1.517304314 | 0.63421379 |
The threadlocal variant shows roughly twice as many cache-misses as the padding variant; inspecting the CPU's prefetch counters with perf confirms that the difference comes from the hardware prefetcher.
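The exact prefetch event names vary by CPU vendor and generation, so they have to be discovered per machine, e.g.:

```bash
# List prefetch-related PMU events available on this machine.
perf list | grep -i prefetch

# Then count one of the discovered events (placeholder name):
# perf stat -e <prefetch-event> ./benchmark
```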
With -O3, all three versions perform almost identically, presumably because the optimizer collapses each counting loop into a single addition, eliminating the repeated stores that cause the false sharing in the first place.
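A sketch of the net effect, reusing SharedData and epoch from the listings above (an illustration of typical compiler behavior, not actual generated code):

```cpp
// What the -O0 loop does a billion times, the -O3 code does roughly
// once: the counter stays in a register and is stored back a single
// time, so the shared cache line is barely touched.
void incrementX(SharedData &data) {
    data.x += epoch; // net effect of the original loop under -O3
}
```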