Damn it, I don't dare publish this yet; if it ends up in search results, so be it.
Found a website that teaches cache fundamentals; the explanations are excellent and very vivid: http://202.116.24.124/computer/content/theory/web/Chap03/3.5.1.htm
Huh, turns out it was made by the University of Jinan. Impressive!
Also found an article on how to inspect CPU cache information: http://www.ademiller.com/blogs/tech/2009/11/how-big-is-your-processor-cache-find-out-with-coreinfo/
It introduces two killer tools: http://technet.microsoft.com/en-us/sysinternals/default.aspx and http://technet.microsoft.com/en-us/sysinternals/cc835722.aspx
Need to know how big the caches are on your processor? You could wade through the copious documentation on Intel or AMD’s web sites. It turns out this is no fun at all. I did this last week and still didn’t have the answer after skimming several white papers.
Turns out there’s a useful little Sysinternals tool for figuring this out; Coreinfo.
Here’s coreinfo at work (click to enlarge). Run from the command prompt it displays all sorts of information about the processor hardware:

As you can see from the image my laptop has 32 KB L1 data and instruction caches on each core and a shared 6 MB L2 cache, all with 64-byte cache lines.
Now let's look at mine:
So that would be a 64 KB L1 cache and a 2 MB L2 cache.
http://www.cse.ohio-state.edu/~panda/775/slides/Ch5_App_C_6.pdf
To check the CPU cache, you can also use the command cat /proc/cpuinfo
http://blog.youkuaiyun.com/zklth/article/details/6280046
cpu0
lyle@lyle_ubuntu:/sys/devices/system/cpu/cpu0/cache/index0$ cat level
1
lyle@lyle_ubuntu:/sys/devices/system/cpu/cpu0/cache/index0$ cat size
64K
lyle@lyle_ubuntu:/sys/devices/system/cpu/cpu0/cache/index0$ cat type
Data
lyle@lyle_ubuntu:/sys/devices/system/cpu/cpu0/cache/index1$ cat level
1
lyle@lyle_ubuntu:/sys/devices/system/cpu/cpu0/cache/index1$ cat type
Instruction
lyle@lyle_ubuntu:/sys/devices/system/cpu/cpu0/cache/index1$ cat size
64K
lyle@lyle_ubuntu:/sys/devices/system/cpu/cpu0/cache/index1$
lyle@lyle_ubuntu:/sys/devices/system/cpu/cpu0/cache/index2$ cat level
2
lyle@lyle_ubuntu:/sys/devices/system/cpu/cpu0/cache/index2$ cat size
1024K
lyle@lyle_ubuntu:/sys/devices/system/cpu/cpu0/cache/index2$ cat type
Unified
lyle@lyle_ubuntu:/sys/devices/system/cpu/cpu0/cache/index2$
lyle@lyle_ubuntu:/sys/devices/system/cpu/cpu0/cache/index0$ cat coherency_line_size
64
lyle@lyle_ubuntu:/sys/devices/system/cpu/cpu0/cache/index0$
Dual-core CPU; each core has two cache levels. The L1 cache is split into a 64 KB instruction cache and a 64 KB data cache; the L2 cache is shared and is 1 MB. The cache line size is 64 bytes, i.e. 64 bytes is the basic unit of transfer.
Test code
The measured values should be roughly in line with this table:
http://stackoverflow.com/questions/4087280/approximate-cost-to-access-various-caches-and-main-memory
Measuring elapsed time:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    long i = 10000000L;
    clock_t start, finish;
    double duration;

    /* measure how long an event takes */
    printf("Time to do %ld empty loops is ", i);
    start = clock();
    while (i--)
        ;
    finish = clock();
    duration = (double)(finish - start) / CLOCKS_PER_SEC;
    printf("%f seconds\n", duration);
    system("pause");    /* Windows-only; remove on Linux */
    return 0;
}
Operating Systems mid-term exam, second semester of 2012-2013
In this project you will attempt to measure the effect of memory cache on program execution speed. Cache is a layer of the memory hierarchy that stands between main memory and the CPU. Optimizing for best use of cache is increasingly important in optimizing for best performance. A memory reference that can be satisfied from cache is typically much faster than one that must go to main memory.
First, a few words about cache memory. Semiconductor memory comes in various speeds, and the faster stuff costs more. An ad in last month's Byte magazine advertises 32 KB of 12 ns memory for $12.75. For $11.50, the same ad offers 256 KB of 70 ns memory. If you built an 8 MB memory for your PC entirely with the fast stuff, you'd spend about $3000 just on the memory, whereas 8 MB of the 'slow' memory would cost only $368. For some applications, you'd want to spend the money. Supercomputers, which are designed to get the biggest bang without regard for the bucks, take this approach.
Caching is one way to mix fast and slow memory to get some of the advantages of each. Nowadays it is used on almost all computers, from PCs to mainframes. The scheme is as follows: The main memory is made up of cheaper, slower chips. We use the fast, expensive memory to store copies of some recently used values and their addresses in a cache. A cache controller is interposed between the CPU and the two memory systems. When the CPU does a store, a copy of the value and the address is saved in the fast memory. When the CPU asks for the contents of an address, the cache controller checks to see if it has that address stored in fast memory, and if so, it can return the value more quickly than if the value is only stored in the slow memory.
If 90% of requests can be satisfied by reference to the cache, a cache reference takes 12 ns, and the other 10% of requests take 80 ns to complete, the average time for a memory request will be 0.90 × 12 ns + 0.10 × 80 ns = 18.8 ns. Of course, since we now have to store the address as well as the value, we'll lose big if we have to have 7 MB (+28 MB of address) of expensive fast memory in order to reach a 90% hit rate.
Luckily it turns out that most programs exhibit good temporal locality; that is, they tend to refer to the same words of memory over and over, either in code loops or in data areas. So a PC cache of 128 KB is big enough to give a 90% hit rate for most programs. In order to avoid having to store another 512 KB of addresses, several different schemes are used; one frequent scheme is to organize adjacent bytes into cache blocks that can share the same address. For example, if the cache is organized into 128-byte cache blocks, each of the 128 bytes in a block will have the same high-order bits, and so we need store only 1/128 as many addresses. For our example, we could get away with storing only 1024 addresses.
Most current operating systems pay no attention to caching; the hardware tries to hide the details from the software, and the tradeoffs change rapidly. But if you were attempting to stretch the edge of your computer's abilities, you might want to take cache properties into account. Note that if you are running a program on our hypothetical PC that never hits cache, your program can run 70/12 times longer than a program which is carefully tuned so that it always hits cache.
The computer system you usually use probably has a cache system, although some PCs may not. In this homework, you'll try to discover what its parameters are. You'll attempt to answer the questions:
1. What is the difference in execution time between a reference to main memory and a reference to cache?
2. How big is a cache block?
3. How big is the cache?
4. How long does a reference to main memory take to complete?
5. How long does a reference that can be satisfied from cache take to complete?
The basic idea is to compare the execution time of two loops, one of which tries to force reading a new value from main memory at each iteration. Since we are talking only a few ns difference between the two read times, you'll have to execute a lot of reads in order to be able to measure the difference.
The test programs for this problem would be easier to program in assembly language. I don't expect you to do that. However, you should probably determine what machine language instructions occur in the loops you are timing, since the loops will obviously take different amounts of time to run if they include different instructions. On UNIX systems, one way to do this is with the -S switch to the C compiler. Some PC compilers will not provide a similar switch; in this case you may be able to use a debugger to examine the program.
Your final program(s) won't be very long, although you may write and discard several; they may not work. (This is a difficult exercise.) So give me a clear explanation of how you propose to perform the measurements, and what you think may have worked or gone wrong.
Many modern computer systems have several layers of cache; for example, most of the current crop of microprocessors have a significant amount of cache on the chip with the CPU, but also have an off-chip cache that is significantly faster than main memory. If you are still full of vim after dealing with the first few questions, you might attempt to consider whether there seem to be two or more levels of cache in your system.
My algorithm:
1. Measuring main-memory access time:
① Allocate a region larger than 1 MB (the physical L2 cache is 1 MB); I chose 64 MB.
② On each iteration, read one element through a pointer and assign it to a variable. The assignment makes sure a real memory access happens; without it, the read might not.
③ Advance the pointer by a stride larger than 1 MB, taking the position modulo the region size when it runs off the end.
④ Repeat ② and ③ until the target number of iterations is reached.
Timing: total time = time(② + ③) × loop_times; overhead time = time(③) × loop_times
Pure memory-access time = (total time - overhead time) / loop_times
2. Measuring cache access time:
① Allocate a small region.
② Walk through it once so the data gets loaded into the cache.
③ Access the elements of the small region in a loop.
④ Advance the access pointer by a stride.
⑤ Repeat ③ and ④ until the target number of iterations is reached.
⑥ Compute the time the same way as in the memory test.
Answers to the questions from the assignment:
1. What is the difference in execution time between a reference to main memory and a reference to cache?
They differ by one to two orders of magnitude; in other words, a cache access is far faster than going straight to memory.
2. How big is a cache block?
Linux exposes the hardware information. Method:
cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
So the cache block (line) size is 64 bytes.
3. How big is the cache?
On my machine each core has a 64 KB L1 data cache and a 64 KB L1 instruction cache, plus a 1 MB L2 cache; the sizes can be read from the same sysfs files with a small script.
4. How long does a reference to main memory take to complete?
Published figures say about 100 ns.
In this experiment it came out to 111 ns.
5. How long does a reference that can be satisfied from cache take to complete?
Published figures give about 0.5 ns for L1 and 7 ns for L2.
This experiment did not distinguish cache levels; the average was 3.4 ns.
Experimental data: (the results table did not survive conversion)
Analysis of the experimental data:
The measurements look reasonable: the average main-memory access takes about 111 ns, while a cache access takes 3.4 ns. If we discard the first group of data, which has a larger error, the cache access time comes out below 3 ns.
Error analysis:
The measured values differ somewhat from the theoretical ones; after all, the memory access time is on the slow side. That is probably down to the machine I used: both the memory and the cache are fairly low-end, so the numbers are not great. On the whole, the experiment was a success.
Code:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>

#define sample_number 5            /* number of sample sizes to test */
int sample[sample_number];
const int big_step = 1024 * 1024 * 3 + 1;  /* stride for the memory test */
const int small_step = 3;  /* stride for the cache test; without a stride the
                              compiler might just use register addressing */
const int arrayMax = 1 << 26;      /* 64 MB */

void display_menu()
{
    printf("accesses\ttotal time (ns)\t\toverhead (ns)\t\tnet access time (ns)\t\tper access (ns)\n");
}

void display_result(int n, double cost_time_total, double cost_time_additional)
{
    /* convert clock ticks to nanoseconds */
    double ns_t = cost_time_total * (1e9 / CLOCKS_PER_SEC);
    double ns_a = cost_time_additional * (1e9 / CLOCKS_PER_SEC);
    printf(" %d\t%f\t%f\t%f\t%f\n", n, ns_t, ns_a, ns_t - ns_a, (ns_t - ns_a) / n);
}
void test_memory(void)
{
    char *array_pointer = (char *)malloc(sizeof(char) * arrayMax);
    volatile char c;    /* volatile so the read is not optimized away */
    double cost_time_total;
    double cost_time_additional;
    int i, n;
    int step = 0;
    clock_t start, finish;
    for (i = 0; i < sample_number; i++)
    {
        /* total time for the given number of accesses */
        n = sample[i];
        start = clock();
        while (n--)
        {
            c = *(array_pointer + step);
            step = (step + big_step) % arrayMax;
        }
        finish = clock();
        cost_time_total = (double)(finish - start);
        /* time spent on everything except the memory access itself */
        n = sample[i];
        char *temp;
        clock_t start1, finish1;
        start1 = clock();
        while (n--)
        {
            temp = array_pointer + step;
            step = (step + big_step) % arrayMax;
        }
        finish1 = clock();
        cost_time_additional = (double)(finish1 - start1);
        (void)temp;
        /* print the result */
        display_result(sample[i], cost_time_total, cost_time_additional);
    }
    free(array_pointer);
}
void test_cache(void)
{
    int space = 20;  /* 20 bytes; the whole array ends up in the cache */
    char *array_pointer = (char *)malloc(sizeof(char) * space);
    volatile char c;    /* volatile so the read is not optimized away */
    double cost_time_total;
    double cost_time_additional;
    clock_t start, finish;
    int i, n;
    int step = 0;
    /* touch the data so it is loaded into the cache */
    for (int t = 0; t < 2; t++)
    {
        for (i = 0; i < space; i++)
        {
            array_pointer[i] = '\0';
        }
    }
    for (i = 0; i < sample_number; i++)
    {
        /* total time for the given number of accesses */
        n = sample[i];
        start = clock();
        while (n--)
        {
            c = *(array_pointer + step);
            step = (step + small_step) % space;
        }
        finish = clock();
        cost_time_total = (double)(finish - start);
        /* time spent on everything except the memory access itself */
        n = sample[i];
        char *temp;
        clock_t start1, finish1;
        start1 = clock();
        while (n--)
        {
            temp = array_pointer + step;
            step = (step + small_step) % space;
        }
        finish1 = clock();
        cost_time_additional = (double)(finish1 - start1);
        (void)temp;
        /* print the result */
        display_result(sample[i], cost_time_total, cost_time_additional);
    }
    free(array_pointer);
}
int main()
{
    //freopen("d:\\re.txt", "w", stdout);
    srand((int)time(0));
    sample[0] = 100000;
    sample[1] = 1000000;
    sample[2] = 10000000;
    sample[3] = 100000000;
    sample[4] = 1000000000;
    printf("Testing main memory:\n");
    display_menu();
    test_memory();
    printf("Testing the cache:\n");
    display_menu();
    test_cache();
    return 0;
}
}