Verifying the Impact of CPU Cache Lines

This post explores the impact of CPU cache lines on memory access and performance through two examples. Example 1 shows that although the second loop does only about 6% of the work of the first, the two run in roughly the same time on modern machines, because the running time is dominated by memory accesses rather than by the multiplications. Example 2 explains why the loop time drops sharply once the step exceeds 16 ints: the loop starts skipping cache lines. Understanding cache lines matters for certain kinds of program optimization.


http://igoro.com/archive/gallery-of-processor-cache-effects/


The article says:

Example 1: Memory accesses and performance

How much faster do you expect Loop 2 to run, compared to Loop 1?

int[] arr = new int[64 * 1024 * 1024];

// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;

// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;

The first loop multiplies every value in the array by 3, and the second loop multiplies only every 16-th. The second loop only does about 6% of the work of the first loop, but on modern machines, the two for-loops take about the same time: 80 and 78 ms respectively on my machine.

The reason why the loops take the same amount of time has to do with memory. The running time of these loops is dominated by the memory accesses to the array, not by the integer multiplications. And, as I’ll explain in Example 2, the hardware will perform the same main memory accesses for the two loops.

Example 2: Impact of cache lines

Let’s explore this example deeper. We will try other step values, not just 1 and 16:

for (int i = 0; i < arr.Length; i += K) arr[i] *= 3;

Here are the running times of this loop for different step values (K):

[Figure: running time of the loop vs. step value K — roughly flat for K = 1 to 16, then halving each time K doubles]

Notice that while step is in the range from 1 to 16, the running time of the for-loop hardly changes. But from 16 onwards, the running time is halved each time we double the step.

The reason behind this is that today’s CPUs do not access memory byte by byte. Instead, they fetch memory in chunks of (typically) 64 bytes, called cache lines. When you read a particular memory location, the entire cache line is fetched from the main memory into the cache. And, accessing other values from the same cache line is cheap!

Since 16 ints take up 64 bytes (one cache line), for-loops with a step between 1 and 16 have to touch the same number of cache lines: all of the cache lines in the array. But once the step is 32, we’ll only touch roughly every other cache line, and once it is 64, only every fourth.

Understanding cache lines can be important for certain types of program optimizations. For example, alignment of data may determine whether an operation touches one or two cache lines. As we saw in the example above, this can easily mean that in the misaligned case, the operation will be twice as slow.



I ran my own test. At first the results diverged badly from the article; it turned out that reusing the same memory across runs made the comparison unfair. After fixing that, the C test looks like this:


#include <cstdio>
#include <cstring>
#include <sys/time.h>

using namespace std;

#define MAX (64 * 1024 * 1024)

// returns elapsed time in ms for one pass over the array with the given step
inline int getLoopTimeMs(int iDataLen,int iJump)
{
    if(iJump<=0)
        iJump=1;
    long* pData = new long[iDataLen];
    memset(pData,0,iDataLen*sizeof(long));

    struct timeval start_tv,loop1_tv;
    gettimeofday(&start_tv, NULL);
    for (int i = 0; i < iDataLen; i+=iJump)
        pData[i] = 1;
    gettimeofday(&loop1_tv, NULL);
    int iMs = (loop1_tv.tv_sec - start_tv.tv_sec)*1000 + (loop1_tv.tv_usec - start_tv.tv_usec)/1000;
    delete[] pData;
    return iMs;
}


int main()
{


    // Loop 1
    int iMs = getLoopTimeMs(MAX,1);

    // Loop 2
    int iMs2 = getLoopTimeMs(MAX,16);

    printf("loop1,ms:%d,loop2,ms:%d,loop64:%d\n",iMs,iMs2,getLoopTimeMs(MAX,64));

    return 0;
}


Each run allocates fresh memory to keep the comparison fair, and uses a plain `long` assignment to minimize computation so that the memory-access cost dominates.

loop1 and loop2 come out close, consistent with the 64-byte cache line the article describes.

But the loop64 result is higher than I expected, and I don't fully understand why. (A likely factor: at large strides the hardware prefetcher becomes ineffective and TLB misses grow, so the time no longer shrinks in proportion to the number of cache lines touched.)

