Why Study Memory Management (WHY)
Watch out for out-of-bounds access, and avoid memory problems caused by type errors or arithmetic overflow.
For example, a signed char wraps from 127 to -128 after adding 1, which can lead to out-of-bounds access or memory corruption.
Avoid memory leaks
Understand how memory is allocated and released, and avoid the operations that cause leaks.
Segmentation faults
Understand how segfaults happen so you can fill the holes calmly.
Double free
Understand the consequences of freeing memory twice, and treat every free() with respect.
Memory fragmentation and performance
Understand the allocation mechanism to avoid fragmentation and the performance cost of allocation and release.
Analyzing performance bottlenecks
A single application statement may cost the kernel thousands of instructions; as the saying goes, "one word from the boss, and the staff run their legs off." A file read/write copies the data twice, while mmap needs only one copy.
- How do you handle CPU instruction reordering?
- How do you use the CPU cache efficiently?
- How do you take advantage of CPU branch prediction?
- How do you use memory ordering instead of locks to build lock-free queues and solve synchronization problems?
Types of memory mappings
Private anonymous mapping
Not shared between processes; after fork() the pages become copy-on-write.
mmap(NULL, 1024, PROT_READ|PROT_WRITE, MAP_ANON|MAP_PRIVATE, -1, 0)
Shared anonymous mapping
Shared between related processes with no copy-on-write after fork(); commonly used for parent-child IPC.
mmap(NULL, 1024, PROT_READ|PROT_WRITE, MAP_ANON|MAP_SHARED, -1, 0)
Private file mapping
Not shared between processes; the common case is loading dynamic libraries.
mmap(NULL, 1024, PROT_READ|PROT_WRITE, MAP_PRIVATE, fd, 0)
Shared file mapping
Shared between processes; used for file-backed shared memory and data transfer.
mmap(NULL, 1024, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0)
To inspect a process's mappings:
cat /proc/<pid>/maps
Process address space
Overview
On 32-bit Linux the 3G~4G region of the virtual address space belongs to the kernel and is identical in every process. From low to high addresses, the user portion contains the code segment, the data segments (data, bss), the heap, and the stack. Between the heap and the stack sits the region holding mmap memory mappings. The stack normally grows downward from high to low addresses, while the heap grows upward from low to high; the pointer at the top of the heap is called the brk. Within kernel space, part of the virtual address range is linearly mapped onto physical memory, typically 896MB on 32-bit systems; this is called low memory. The remaining normally-allocatable memory is called high memory (and there is also a DMA memory zone).
Memory allocation
As the address-space layout suggests, small user allocations extend the heap linearly: growing and shrinking that region are both done through the brk system call. Large allocations, by default those over 128K, are instead served by the mmap system call from the region between the heap and the stack.
How malloc works
Reference: linux - How does glibc malloc work? - Reverse Engineering Stack Exchange
How malloc organizes memory chunks
Chunk header
{
    chunk status: free or in use;
    pointer to the previous chunk;
    pointer to the next chunk;
    chunk size;
}
glibc malloc source: malloc.c source code [glibc/malloc/malloc.c] - Woboq Code Browser
Allocation is organized around bins that group chunks of similar sizes. ptmalloc uses four kinds of bins: fast bins, small bins, unsorted bins, and large bins. Neighboring bins are roughly 8 bytes apart in chunk size, and each bin is a singly or doubly linked list. Fast bins typically hold chunks no larger than 64B, small bins chunks up to 512B, and large bins chunks over 512B and under 128K (the exact thresholds vary by platform and word size). The unsorted bin has no fixed size; it is a temporary list holding freshly freed and merged chunks. Requests above roughly 128K are tracked separately as mmapped chunks and are returned directly to the system when freed.
malloc hands out virtual addresses; it manages the allocation of virtual memory. How, then, is physical memory managed? Physical memory allocation involves the slab/slub/slob allocators and the well-known buddy system.
Buddy System
The buddy system's smallest allocation unit is a page, and it allocates page blocks in power-of-two orders. Two page blocks are buddies when:
- They are the same size and physically contiguous.
- For an order-n block, the first page frame number is a multiple of 2^n.
- When two adjacent order-n blocks merge into an order-(n+1) block, the merged block's first page frame number is a multiple of 2^(n+1).
By default the largest block Linux allocates is order 10, configured in include/linux/mmzone.h.
The buddy system was introduced mainly to solve external fragmentation. Where there is external fragmentation there is also internal fragmentation: external fragmentation is free memory that cannot be put to use, while internal fragmentation is allocated memory that is only partially used, with the remainder wasted. The slab allocator was introduced to solve internal fragmentation.
Slab Allocator
The slab allocator caches memory in units of objects. If an object is 512 bytes, for example, one 4KB page holds 8 such objects, and the slab's job is to manage the allocation and release of those 8 objects. The memory slab manages lives in the linearly mapped region, where virtual and physical addresses have a linear relationship; on 32-bit Linux this region is typically 896MB. Common kernel allocation interfaces such as kmalloc and kmem_cache_alloc are backed by the slab allocator internally.
How virtual addresses map to physical addresses
1. Why not map everything one-to-one? The page tables would not fit in memory.
2. Why multi-level page tables? To save memory.
3. Why is there a TLB? It is a cache for translations.
4. How does the TLB distinguish the same virtual address across processes? By flushing, or with ASIDs.
5. How do different processes access the shared kernel address space efficiently? The nG/G (global) flags.
6. Why copy-on-write? To avoid copying large amounts of physical memory at fork() time.
7. Which is accessed first, the MMU or the cache? Either; it depends on whether the TLB sits before or after the cache, and each choice has trade-offs.
8. How can such a small cache cover such a large address space?
(set) = (memory address) / (line size) % (number of sets)
(critical stride) = (number of sets) * (line size) = (total cache size) / (number of ways)
To inspect the relationship between virtual and physical addresses on Linux: https://github.com/jethrogb/ptdump
git clone https://github.com/jethrogb/ptdump
cd ptdump/dump/linux/
make
sudo insmod ptdump.ko
ls /proc/page*
Run the test app, the test_virtual_address example:
- Run test_virtual_address, read father.log, and compare it with cat /proc/<pid>/maps: the freshly requested memory has no physical address yet.
#include <stdio.h>
#include <iostream>
#include <vector>
#include <string>
#include <assert.h>
#include <algorithm>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <chrono>
#include <sched.h>
#include <fstream>
struct Page {
uint64_t address;
uint64_t entry[512];
};
uint64_t get_phy_address(uint64_t entry)
{
static const uint64_t mask = (1LL<<63) | ((1<<12)-1);
return entry & ~mask;
}
bool writable(uint64_t entry)
{
return (entry & (1 << 1)) != 0;
}
bool executable(uint64_t entry)
{
return (entry & (1LL << 63)) == 0;
}
bool user_mode(uint64_t entry)
{
return (entry & (1LL << 2)) != 0;
}
void print_entry(FILE *fp, int level, uint64_t entry, uint64_t virtual_address)
{
fprintf(fp, "%d\t0x%-16llx\t0x%-16llx\t%d\t%d\t%d\n", level,
    (unsigned long long)get_phy_address(entry), (unsigned long long)virtual_address,
    writable(entry), executable(entry), user_mode(entry));
}
void dump(FILE *fp, const Page*& page, int level, uint64_t virtual_address)
{
const Page *curr_page=page++;
for (int i = 0; i < 512; i++) {
const uint64_t entry = curr_page->entry[i];
const uint64_t child_virtual_address = (virtual_address << 9) | i;
if (level > 0) {
if (entry & 1) {
if (!(entry &(1 << 7))) {
dump(fp, page, level-1, child_virtual_address);
}else {
print_entry(fp, level, entry, child_virtual_address << (level*9+12));
}
}
} else {
if (entry) {
print_entry(fp, level, entry, child_virtual_address << 12);
}
}
}
}
void dump_table(FILE *fp)
{
std::ifstream ifs("/proc/page_table_3", std::ios::binary);
if (!ifs) {
return ;
}
std::string content((std::istreambuf_iterator<char>(ifs)), std::istreambuf_iterator<char>());
const Page *page = (const Page *)&content[0];
const Page *end_page = (const Page *)(&content[0]+content.length());
dump(fp, page, 3, 0);
std::cout<< (const void *)end_page << "\t" << (const void *)page << std::endl;
std::flush(std::cout);
}
int main(int argc, char **argv)
{
int cpu_index =0;
if(argc >= 3) cpu_index=atoi(argv[2]);
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(cpu_index, &set);
sched_setaffinity(0, sizeof(set), &set);
const size_t asize = 1024*1024*8;
bool hugetable = false;
bool do_fork = false;
bool do_write = false;
if (argc >= 2) {
switch (argv[1][0])
{
case 'h': hugetable=true;do_write=true; break;
case 'f': do_fork=true;do_write=true; break;
case 'w': do_write=true; break;
default:break;
}
}
char *virtual_ptr = (char *)mmap(NULL, asize, PROT_READ|PROT_WRITE,
MAP_ANONYMOUS|MAP_PRIVATE|(hugetable?MAP_HUGETLB:0), -1, 0);
if (do_write) *virtual_ptr = '\0';
FILE *fp = NULL;
std::cout<< "father pid: " << getpid() << std::endl;
if (do_fork) {
pid_t pid = fork();
if (pid == 0) {
fp = fopen("child.log", "w");
std::cout<< "child pid: " << getpid() << std::endl;
}else {
fp = fopen("father.log", "w");
}
} else {
fp = fopen("father.log", "w");
}
fprintf(fp, "pid=%d map address: %p\n", getpid(), virtual_ptr);
dump_table(fp);
fclose(fp);
while(true) {
usleep(10000);
}
return 0;
}
- Run test_virtual_address w, read father.log, and compare it with cat /proc/<pid>/maps: after the write, the requested memory now has a physical address.
- Run test_virtual_address f, read father.log and child.log, and compare them with cat /proc/<pid>/maps: this shows how copy-on-write works. Pay attention to the difference in permissions.
Comparing read speed: plain file read vs mmap file mapping
static void test_file_read()
{
FILE *fp = NULL;
fp = fopen(FILE_NAME, "rb+");
char *buff = (char *)malloc(1024);
int len = 0;
size_t sum = 0;
size_t fsize;
fseek(fp,0,SEEK_END);
fsize=ftell(fp);
auto start = std::chrono::high_resolution_clock::now();
for (int n = 0; n < 100; n++) {
fseek(fp,0,SEEK_SET);
while((len=fread(buff, 1,1024, fp)) >0 ) {
sum +=len;
}
}
auto end = std::chrono::high_resolution_clock::now();
size_t use_time = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
std::cout << "test_file_read elapsed milliseconds: " << use_time << std::endl;
std::flush(std::cout);
printf("size=%ld\n", sum);
fclose(fp);
free(buff);
}
static void test_mmap_read()
{
FILE *fp = NULL;
fp = fopen(FILE_NAME, "rb+");
char *buff = (char *)malloc(1024);
int len;
size_t sum = 0;
int readsize;
size_t fsize;
size_t size;
fseek(fp,0,SEEK_END);
fsize=ftell(fp);
fseek(fp,0,SEEK_SET);
size = fsize;
fclose(fp);
int ffd = open(FILE_NAME, O_RDWR|O_CREAT,0666);
char *ptr=(char *)mmap(NULL, size, PROT_READ, MAP_SHARED, ffd, 0);
close(ffd);
char *readPtr = NULL;
auto start = std::chrono::high_resolution_clock::now();
for (int n = 0; n < 100; n++) {
readPtr = ptr;
size = fsize;
while(size){
readsize = size > 1024 ? 1024 : size;
memcpy(buff, readPtr, readsize);
size -= readsize;
readPtr += readsize;
sum += readsize;
}
}
auto end = std::chrono::high_resolution_clock::now();
size_t use_time = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
std::cout << "test_mmap_read elapsed milliseconds: " << use_time << std::endl;
std::flush(std::cout);
free(buff);
munmap(ptr, fsize);
printf("size=%ld\n", sum);
}
Run test_mmap f and test_mmap m:
~/test_perf$ ./show_perf.sh ./test_mmap f
run ./test_mmap f
test_file_read elapsed milliseconds: 2245
size=10485760000
Performance counter stats for './test_mmap f':
74,743,510 cache-references # 36.454 M/sec
35,690,848 cache-misses # 47.751 % of all cache refs
8,925,687,804 instructions
1,367,859,141 branches # 667.137 M/sec
3,955,388 branch-misses # 0.29% of all branches
117 faults # 0.057 K/sec
219 context-switches # 0.107 K/sec
2050.342526 task-clock (msec) # 0.909 CPUs utilized
0 migrations # 0.000 K/sec
2.256351930 seconds time elapsed
~/test_perf$ ./show_perf.sh ./test_mmap m
run ./test_mmap m
test_mmap_read elapsed milliseconds: 489
size=10485760000
Performance counter stats for './test_mmap m':
62,951,973 cache-references # 127.169 M/sec
18,601,851 cache-misses # 29.549 % of all cache refs
1,502,833,265 instructions
147,082,645 branches # 297.121 M/sec
24,354 branch-misses # 0.02% of all branches
1,718 faults # 0.003 M/sec
8 context-switches # 0.016 K/sec
495.025615 task-clock (msec) # 0.995 CPUs utilized
0 migrations # 0.000 K/sec
0.497508855 seconds time elapsed
Using cache hits to improve performance
cache-misses
- Iterating over small structs: test_cache_miss s
- Iterating over large structs: test_cache_miss b
- Iterating over an array: test_cache_miss a
- Iterating over a vector: test_cache_miss v
- Row-major traversal: test_cache_miss a
- Column-major traversal: test_cache_miss i
#include <stdio.h>
#include <iostream>
#include <vector>
#include <string>
//#pragma pack(64)
typedef struct {
int cnt;
std::string str;
char resv[24];
}algin_data_t __attribute__((aligned(64)));
//#pragma pack()
typedef struct {
int cnt;
std::string str;
}small_data_t;
typedef struct {
int cnt;
char str[128];
}big_data_t;
static void test_small_struct()
{
std::vector<small_data_t> dat(10000);
int num=0;
for (int i = 0; i < 10000; i++) {
for (auto &e:dat) {
e.cnt = num++;
}
}
}
static void test_algin_struct()
{
std::vector<algin_data_t> dat(10000);
int num=0;
for (int i = 0; i < 10000; i++) {
for (auto &e:dat) {
e.cnt = num++;
}
}
}
static void test_big_struct()
{
std::vector<big_data_t> dat(10000);
int num=0;
for (int i = 0; i < 10000; i++) {
for (auto &e:dat) {
e.cnt = num++;
}
}
}
static void test_vector_foreach()
{
std::vector<std::vector<int>> dat(10000, std::vector<int>(64, 0));
int num=0;
for (int cnt = 0; cnt < 100; cnt++) {
for (int i = 0; i < 10000; i++) {
for (int j = 0; j < 64; j++) {
dat[i][j] = num++;
}
}
}
}
static int dat_array[10000][64];
static void test_array_foreach()
{
int num=0;
for (int cnt = 0; cnt < 100; cnt++) {
for (int i = 0; i < 10000; i++) {
for (int j = 0; j < 64; j++) {
dat_array[i][j] = num++;
}
}
}
}
static void test_array_inv_foreach()
{
int num=0;
for (int cnt = 0; cnt < 100; cnt++) {
for (int i = 0; i < 64; i++) {
for (int j = 0; j < 10000; j++) {
dat_array[j][i] = num++;
}
}
}
}
int main(int argc, char **argv)
{
if(argc <= 1) {
printf("usage: test_cache_miss s|b|a|v|i|l\n");
return 0;
}
printf("small=%zu algin:%zu\n", sizeof(small_data_t), sizeof(algin_data_t));
switch (argv[1][0])
{
case 's': test_small_struct(); break;
case 'b': test_big_struct(); break;
case 'a': test_array_foreach(); break;
case 'v': test_vector_foreach(); break;
case 'i': test_array_inv_foreach(); break;
case 'l': test_algin_struct(); break;
default:break;
}
return 0;
}
~/test_perf$ ./show_perf.sh ./test_branch_miss s
run ./test_branch_miss s
Performance counter stats for './test_branch_miss s':
35,237 cache-references # 1.035 M/sec
15,563 cache-misses # 44.167 % of all cache refs
127,110,597 instructions
31,639,756 branches # 929.747 M/sec
67,240 branch-misses # 0.21% of all branches
111 faults # 0.003 M/sec
0 context-switches # 0.000 K/sec
34.030495 task-clock (msec) # 0.969 CPUs utilized
0 migrations # 0.000 K/sec
0.035105508 seconds time elapsed
~/test_perf$ ./show_perf.sh ./test_branch_miss b
run ./test_branch_miss b
Performance counter stats for './test_branch_miss b':
38,823 cache-references # 20.861 M/sec
12,309 cache-misses # 31.705 % of all cache refs
2,843,103 instructions
485,745 branches # 261.009 M/sec
15,449 branch-misses # 3.18% of all branches
102 faults # 0.055 M/sec
0 context-switches # 0.000 K/sec
1.861031 task-clock (msec) # 0.679 CPUs utilized
0 migrations # 0.000 K/sec
0.002742285 seconds time elapsed
~/test_perf$ ./show_perf.sh ./test_branch_miss a
run ./test_branch_miss a
Performance counter stats for './test_branch_miss a':
45,370 cache-references # 0.627 M/sec
15,784 cache-misses # 34.790 % of all cache refs
121,528,996 instructions
30,681,381 branches # 423.988 M/sec
4,005,835 branch-misses # 13.06% of all branches
111 faults # 0.002 M/sec
2 context-switches # 0.028 K/sec
72.363755 task-clock (msec) # 0.975 CPUs utilized
0 migrations # 0.000 K/sec
0.074219125 seconds time elapsed
~/test_perf$ ./show_perf.sh ./test_branch_miss v
run ./test_branch_miss v
Performance counter stats for './test_branch_miss v':
31,512 cache-references # 9.504 M/sec
12,034 cache-misses # 38.189 % of all cache refs
2,857,202 instructions
488,248 branches # 147.256 M/sec
14,946 branch-misses # 3.06% of all branches
103 faults # 0.031 M/sec
0 context-switches # 0.000 K/sec
3.315638 task-clock (msec) # 0.678 CPUs utilized
0 migrations # 0.000 K/sec
0.004890918 seconds time elapsed
~/test_perf$ ./show_perf.sh ./test_branch_miss i
run ./test_branch_miss i
Performance counter stats for './test_branch_miss i':
31,792 cache-references # 13.768 M/sec
13,951 cache-misses # 43.882 % of all cache refs
2,864,321 instructions
489,395 branches # 211.938 M/sec
14,898 branch-misses # 3.04% of all branches
103 faults # 0.045 M/sec
0 context-switches # 0.000 K/sec
2.309145 task-clock (msec) # 0.704 CPUs utilized
0 migrations # 0.000 K/sec
0.003280504 seconds time elapsed
#!/bin/sh
echo "run $* "
perf stat -B -e cache-references,cache-misses,instructions,branches,branch-misses,faults,context-switches,task-clock,migrations $*
- Hot data packed together: ./test_together
- Hot data spread apart: ./test_together
#include <stdio.h>
#include <iostream>
#include <vector>
#include <string>
#include <assert.h>
#include <algorithm>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <chrono>
#include <sched.h>
#include <fstream>
static const int array_size = 1024;
struct Dat_One{
int buff_one[array_size];
int buffer_two[array_size];
};
static Dat_One dat_one;
struct Dat_Two{
int buff_one;
int buffer_two;
};
static Dat_Two dat_two[array_size];
static int do_func(int dat)
{
static int sum = 0;
sum++;
return dat;
}
static void test1()
{
auto start = std::chrono::high_resolution_clock::now();
for (int n = 0; n < 10000; n++)
for (int j = 0; j < array_size; j++) {
dat_one.buffer_two[j] = do_func(dat_one.buff_one[j]);
}
auto end = std::chrono::high_resolution_clock::now();
size_t use_time = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
std::cout << "big struct : elapsed milliseconds: " << use_time << std::endl;
std::flush(std::cout);
}
static void test2()
{
auto start = std::chrono::high_resolution_clock::now();
for (int n = 0; n < 10000; n++)
for (int j = 0; j < array_size; j++) {
dat_two[j].buffer_two= do_func(dat_two[j].buff_one);
}
auto end = std::chrono::high_resolution_clock::now();
size_t use_time = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
std::cout << "small struct : elapsed milliseconds: " << use_time << std::endl;
std::flush(std::cout);
}
int main(int argc, char **argv)
{
int cpu_index =0;
if(argc >= 3) cpu_index=atoi(argv[2]);
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(cpu_index, &set);
sched_setaffinity(0, sizeof(set), &set);
test1();
test2();
return 0;
}
- Adding padding data to reduce cache-set conflicts: test_Tmat 2 && test_Tmat 3
#include <stdio.h>
#include <iostream>
#include <vector>
#include <string>
#include <assert.h>
#include <algorithm>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <chrono>
#include <sched.h>
#include <fstream>
//(critical stride) = (number of sets)*(line size) = (total cache size) / (number of ways).
//: (set) = (memory address) / (line size) % (number of sets)
//cat /sys/devices/system/cpu/cpu0/cache/index0/ways_of_associativity
//cat /sys/devices/system/cpu/cpu0/cache/index0/number_of_sets
//answer : https://stackoverflow.com/questions/11413855/why-is-transposing-a-matrix-of-512x512-much-slower-than-transposing-a-matrix-of
//sudo perf stat -e cache-misses ./test_Tmat 2
#define MATSIZE_512 (4096UL) // power-of-two row size: worst-case cache-set aliasing
#define MATSIZE_513 (4096UL+16) // slightly padded row size breaks the aliasing
static void transpose_mat512()
{
const size_t alloc_size = MATSIZE_512*MATSIZE_512;
std::cout << alloc_size << std::endl;
char *total_mem = (char *)malloc(alloc_size);
assert(total_mem != nullptr);
char (*mat_512)[MATSIZE_512] = (char (*)[MATSIZE_512])total_mem;
auto start = std::chrono::high_resolution_clock::now();
for (int n = 0; n < 10; n++)
for (int i = 1; i < MATSIZE_512; i++)
for (int j = 0; j < i; j++) {
int temp = mat_512[i][j];
mat_512[i][j] = mat_512[j][i];
mat_512[j][i] = temp;
}
auto end = std::chrono::high_resolution_clock::now();
size_t use_time = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
std::cout << "mat512x512 elapsed milliseconds: " << use_time << std::endl;
std::flush(std::cout);
free(total_mem);
}
static void transpose_mat513()
{
const size_t alloc_size = MATSIZE_513*MATSIZE_513;
char *total_mem = (char *)malloc(alloc_size);
assert(total_mem != nullptr);
char (*mat_513)[MATSIZE_513] = (char (*)[MATSIZE_513])total_mem;
auto start = std::chrono::high_resolution_clock::now();
for (int n = 0; n < 10; n++)
for (int i = 1; i < MATSIZE_513; i++)
for (int j = 0; j < i; j++) {
int temp = mat_513[i][j];
mat_513[i][j] = mat_513[j][i];
mat_513[j][i] = temp;
}
auto end = std::chrono::high_resolution_clock::now();
size_t use_time = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
std::cout << "mat513x513 elapsed milliseconds: " << use_time << std::endl;
std::flush(std::cout);
free( total_mem);
}
int main(int argc, char **argv)
{
int cpu_index =0;
if(argc >= 3) cpu_index=atoi(argv[2]);
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(cpu_index, &set);
sched_setaffinity(0, sizeof(set), &set);
if (argc < 2) { printf("usage: test_Tmat 2|3\n"); return 0; }
switch(argv[1][0]) {
case '2' : transpose_mat512();break;
case '3' : transpose_mat513();break;
}
return 0;
}
~/test_perf$ ./show_perf.sh ./test_Tmat 2
run ./test_Tmat 2
16777216
mat512x512 elapsed milliseconds: 2721
Performance counter stats for './test_Tmat 2':
99,403,455 cache-references # 36.646 M/sec
55,481,043 cache-misses # 55.814 % of all cache refs
4,075,121,270 instructions
252,150,019 branches # 92.958 M/sec
728,994 branch-misses # 0.29% of all branches
8,297 faults # 0.003 M/sec
20 context-switches # 0.007 K/sec
2712.515229 task-clock (msec) # 0.995 CPUs utilized
1 migrations # 0.000 K/sec
2.725778361 seconds time elapsed
~/test_perf$ ./show_perf.sh ./test_Tmat 3
run ./test_Tmat 3
mat513x513 elapsed milliseconds: 816
Performance counter stats for './test_Tmat 3':
4,054,160 cache-references # 4.947 M/sec
2,411,165 cache-misses # 59.474 % of all cache refs
4,499,294,985 instructions
172,637,350 branches # 210.638 M/sec
73,871 branch-misses # 0.04% of all branches
8,360 faults # 0.010 M/sec
5 context-switches # 0.006 K/sec
819.594449 task-clock (msec) # 0.996 CPUs utilized
1 migrations # 0.001 K/sec
0.822494459 seconds time elapsed
branch-misses
- Conditional writes while iterating a random array: test_branch_miss a
- The same loop after sorting the array first: test_branch_miss s
- Steering branch prediction with likely and unlikely: test_branch_miss n ; test_branch_miss l
#include <stdio.h>
#include <iostream>
#include <vector>
#include <string>
#include <algorithm>
#define likely(x) __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
static int dat_array[10000];
static void test_array_foreach()
{
int num=0;
for (int i = 0; i < 10000; i++) {
dat_array[i]=rand()%512;
}
for (int cnt = 0; cnt < 1000; cnt++) {
for (int i = 0; i < 10000; i++) {
if( dat_array[i] > 256) {
dat_array[i]=num++;
}
}
}
}
volatile int test_a;
static inline void test_no_use()
{
test_a = 1000;
test_a += 1;
test_a *= 10;
test_a /= 3;
test_a <<= 1;
}
static void test_array_sort_foreach()
{
int num=0;
for (int i = 0; i < 10000; i++) {
dat_array[i]=rand()%512;
}
std::sort(dat_array, dat_array+10000);
for (int cnt = 0; cnt < 1000; cnt++) {
for (int i = 0; i < 10000; i++) {
if( dat_array[i] > 256) {
dat_array[i]=num++;
}
}
}
}
static void test_nolikely_foreach()
{
int num=0;
for (int i = 0; i < 10000; i++) {
dat_array[i]=rand()%512;
}
for (int cnt = 0; cnt < 10000; cnt++) {
for (int i = 0; i < 10000; i++) {
if( dat_array[i] > 50) {
dat_array[i]=num;
num+=2;
test_no_use();
}else {
dat_array[i]=num;
num++;
}
}
}
}
static void test_likely_foreach()
{
int num=0;
for (int i = 0; i < 10000; i++) { // dat_array has 10000 elements
dat_array[i]=rand()%512;
}
for (int cnt = 0; cnt < 1000; cnt++) {
for (int i = 0; i < 10000; i++) {
if(likely( dat_array[i] > 50)) {
dat_array[i]=num;
num+=2;
test_no_use();
}else {
dat_array[i]=num;
num++;
}
}
}
}
int main(int argc, char **argv)
{
if(argc <= 1) {
printf("usage: test_branch_miss a|s|n|l\n");
return 0;
}
switch (argv[1][0])
{
case 's': test_array_sort_foreach(); break;
case 'a': test_array_foreach(); break;
case 'n': test_nolikely_foreach(); break;
case 'l': test_likely_foreach(); break;
default:break;
}
return 0;
}
~/test_perf$ ./show_perf.sh ./test_branch_miss a
run ./test_branch_miss a
Performance counter stats for './test_branch_miss a':
37,087 cache-references # 0.527 M/sec
13,650 cache-misses # 36.805 % of all cache refs
121,502,626 instructions
30,676,994 branches # 435.643 M/sec
4,005,535 branch-misses # 13.06% of all branches
110 faults # 0.002 M/sec
0 context-switches # 0.000 K/sec
70.417747 task-clock (msec) # 0.979 CPUs utilized
0 migrations # 0.000 K/sec
0.071961280 seconds time elapsed
~/test_perf$ ./show_perf.sh ./test_branch_miss s
run ./test_branch_miss s
Performance counter stats for './test_branch_miss s':
34,839 cache-references # 0.856 M/sec
15,265 cache-misses # 43.816 % of all cache refs
127,107,931 instructions
31,638,325 branches # 777.609 M/sec
67,152 branch-misses # 0.21% of all branches
112 faults # 0.003 M/sec
0 context-switches # 0.000 K/sec
40.686661 task-clock (msec) # 0.966 CPUs utilized
0 migrations # 0.000 K/sec
0.042110582 seconds time elapsed
~/test_perf$ ./show_perf.sh ./test_branch_miss n
run ./test_branch_miss n
Performance counter stats for './test_branch_miss n':
46,054 cache-references # 0.111 M/sec
20,105 cache-misses # 43.655 % of all cache refs
4,304,097,649 instructions
600,785,471 branches # 1452.078 M/sec
34,768 branch-misses # 0.01% of all branches
110 faults # 0.266 K/sec
7 context-switches # 0.017 K/sec
413.741881 task-clock (msec) # 0.994 CPUs utilized
0 migrations # 0.000 K/sec
0.416110095 seconds time elapsed
~/test_perf$ ./show_perf.sh ./test_branch_miss l
run ./test_branch_miss l
./test_branch_miss: Segmentation fault
Performance counter stats for './test_branch_miss l':
41,754 cache-references # 9.202 M/sec
15,556 cache-misses # 37.256 % of all cache refs
3,663,840 instructions
688,072 branches # 151.649 M/sec
16,502 branch-misses # 2.40% of all branches
109 faults # 0.024 M/sec
2 context-switches # 0.441 K/sec
4.537274 task-clock (msec) # 0.035 CPUs utilized
0 migrations # 0.000 K/sec
0.129464292 seconds time elapsed
Using out-of-order execution and memory/instruction reordering to improve performance
- CPU out-of-order execution (traversal with vs without data dependencies): test_reorder 4 ; test_reorder 5
- How volatile relates to ordering: mfence
- How atomic operations work
- Multithreading problems caused by CPU instruction reordering: test_reorder 1 ; test_reorder 2
- Lock-free programming problems caused by memory reordering
#include <stdio.h>
#include <iostream>
#include <vector>
#include <string>
#include <assert.h>
#include <algorithm>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <chrono>
#include <sched.h>
#include <thread>
#include <fstream>
#include <semaphore.h>
#include <atomic>
#define barrier() __asm__ __volatile__("": : :"memory")
#define mfence() __asm__ __volatile__("mfence": : :"memory")
//https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Consume_ordering
static sem_t sem[2];
static int X, Y;
static int r1, r2;
static void t1_run(void)
{
sem_wait(&sem[0]);
X=1;
//mfence();
r1=Y;
}
static void t2_run(void)
{
sem_wait(&sem[1]);
Y=1;
//mfence();
r2=X;
}
static void test_order_once()
{
static int num = 0;
static int round = 0;
round++;
X = 0;
Y = 0;
std::thread t1(t1_run);
std::thread t2(t2_run);
sem_post(&sem[0]);
sem_post(&sem[1]);
t1.join();
t2.join();
if ((r1 == 0) && (r2 == 0)) {
num++;
printf("reorder: %d, iterator: %d\n", num, round);
}
}
//------------------------------------------------------------
static std::atomic<int> aX;
static std::atomic<int> aY;
static int err_num = 0;
static void load_aY_to_aX()
{
sem_wait(&sem[0]);
r1 = aY.load(std::memory_order_relaxed);
aX.store(r1, std::memory_order_relaxed);
}
static void load_aX_then_store_aY()
{
sem_wait(&sem[1]);
r2 = aX.load(std::memory_order_relaxed);
aY.store(42, std::memory_order_relaxed);
}
static void test_order_relaxed_once()
{
static int num = 0;
static int round = 0;
round++;
aX = 0;
aY = 0;
std::thread t1(load_aY_to_aX);
std::thread t2(load_aX_then_store_aY);
sem_post(&sem[0]);
sem_post(&sem[1]);
t1.join();
t2.join();
if ((r1 == 42) && (r2 ==42)) {
printf("reorder: %d, iterator: %d\n", ++num, round);
}
}
//------------------------------------------------------------
static std::atomic<std::string*> ptr;
static int data;
static void producer()
{
std::string* p = new std::string("Hello");
data = 42;
ptr.store(p, std::memory_order_release);
}
static void consumer()
{
std::string* p2;
while (!(p2 = ptr.load(std::memory_order_acquire)));
assert(*p2 == "Hello"); // never fires
assert(data == 42); // never fires
}
static void relese_acquire_test()
{
std::thread t1(producer);
std::thread t2(consumer);
t1.join();
t2.join();
}
static void test_auto_add()
{
const int size = 1000;
float list[size] = {0}, sum = 0; int i;
auto start = std::chrono::high_resolution_clock::now();
for (int n = 0; n < 100000; n++)
for (i = 0; i < size; i++) sum += list[i];
auto end = std::chrono::high_resolution_clock::now();
size_t use_time = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
std::cout << "test_auto_add elapsed milliseconds: " << use_time << std::endl;
std::flush(std::cout);
}
static void test_dependency_add()
{
const int size = 1000;
float list[size] = {0}, sum1 = 0, sum2 = 0;
int i;
auto start = std::chrono::high_resolution_clock::now();
for (int n = 0; n < 100000; n++) {
for (i = 0; i < size; i+=2) {
sum1 += list[i];
sum2 += list[i+1];
}
sum1 += sum2;
}
auto end = std::chrono::high_resolution_clock::now();
size_t use_time = std::chrono::duration_cast<std::chrono::milliseconds>(end-start).count();
std::cout << "test_dependency_add elapsed milliseconds: " << use_time << std::endl;
std::flush(std::cout);
}
int main(int argc, char **argv)
{
sem_init(&sem[0], 0, 0);
sem_init(&sem[1], 0, 0);
if (argc < 2) { printf("usage: test_reorder 1|2|3|4|5\n"); return 0; }
switch(argv[1][0]) {
case '1' : while(true) test_order_once(); break;
case '2' : while(true) test_order_relaxed_once(); break;
case '3' : while(true) relese_acquire_test(); break;
case '4' : test_auto_add(); break;
case '5' : test_dependency_add(); break;
}
return 0;
}
~/test_perf$ ./show_perf.sh ./test_reorder 4
run ./test_reorder 4
test_auto_add elapsed milliseconds: 310
Performance counter stats for './test_reorder 4':
41,274 cache-references # 0.132 M/sec
21,390 cache-misses # 51.824 % of all cache refs
1,004,169,232 instructions
200,905,980 branches # 640.344 M/sec
118,085 branch-misses # 0.06% of all branches
113 faults # 0.360 K/sec
0 context-switches # 0.000 K/sec
313.746779 task-clock (msec) # 0.995 CPUs utilized
0 migrations # 0.000 K/sec
0.315352555 seconds time elapsed
~/test_perf$ ./show_perf.sh ./test_reorder 5
run ./test_reorder 5
test_auto_add elapsed milliseconds: 159
Performance counter stats for './test_reorder 5':
38,816 cache-references # 0.238 M/sec
18,949 cache-misses # 48.817 % of all cache refs
854,311,607 instructions
100,877,665 branches # 617.780 M/sec
117,673 branch-misses # 0.12% of all branches
113 faults # 0.692 K/sec
0 context-switches # 0.000 K/sec
163.290585 task-clock (msec) # 0.992 CPUs utilized
0 migrations # 0.000 K/sec
0.164629185 seconds time elapsed
As for why these experiments behave the way they do: explanations to follow later. Off to work for now ~~