Cache: a place for concealment and safekeeping


Reposted from http://duartes.org/gustavo/blog/post/intel-cpu-caches



This post shows briefly how CPU caches are organized in modern Intel processors. Cache discussions often lack concrete examples, obfuscating the simple concepts involved. Or maybe my pretty little head is slow. At any rate, here’s half the story on how a Core 2 L1 cache is accessed:

[Figure: Selecting an L1 cache set (row)]

The unit of data in the cache is the line, which is just a contiguous chunk of bytes in memory. This cache uses 64-byte lines. The lines are stored in cache banks or ways, and each way has a dedicated directory to store its housekeeping information. You can imagine each way and its directory as columns in a spreadsheet, in which case the rows are the sets. Then each cell in the way column contains a cache line, tracked by the corresponding cell in the directory. This particular cache has 64 sets and 8 ways, hence 512 cells to store cache lines, which adds up to 32KB of space.
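To make the arithmetic concrete, here is a minimal C sketch of this geometry (the constant names are mine, purely illustrative):

    #include <stdio.h>

    /* Core 2 L1 cache geometry as described above. */
    #define LINE_SIZE 64   /* bytes per cache line      */
    #define NUM_SETS  64   /* rows                      */
    #define NUM_WAYS   8   /* columns (banks) per row   */

    int main(void)
    {
        int cells = NUM_SETS * NUM_WAYS;   /* 512 cells to hold lines */
        int bytes = cells * LINE_SIZE;     /* 32768 bytes             */
        printf("%d cells, %d KB\n", cells, bytes / 1024);
        return 0;
    }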

In this cache’s view of the world, physical memory is divided into 4KB physical pages. Each page has 4KB / 64 bytes == 64 cache lines in it. When you look at a 4KB page, bytes 0 through 63 within that page are in the first cache line, bytes 64-127 in the second cache line, and so on. The pattern repeats for each page, so the 3rd line in page 0 is different from the 3rd line in page 1.
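A one-line helper makes the mapping explicit (my own function, assuming the 64-byte lines above):

    /* Which of the 64 lines in a 4KB page holds a given byte?
     * Bytes 0-63 -> line 0, bytes 64-127 -> line 1, and so on. */
    unsigned line_in_page(unsigned page_offset)   /* 0..4095 */
    {
        return page_offset / 64;                  /* 0..63 */
    }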

In a fully associative cache any line in memory can be stored in any of the cache cells. This makes storage flexible, but searching for a line requires checking every cell, which is expensive in hardware. Since the L1 and L2 caches operate under tight constraints of power consumption, physical space, and speed, a fully associative cache is not a good trade-off in most scenarios.

Instead, this cache is set associative, which means that a given line in memory can only be stored in one specific set (or row) shown above. So the first line of any physical page (bytes 0-63 within a page) must be stored in row 0, the second line in row 1, etc. Each row has 8 cells available to store the cache lines it is associated with, making the cache 8-way set associative. When looking at a memory address, bits 11-6 determine the line number within the 4KB page and therefore the set to be used. For example, physical address 0x800010a0 has 000010 in those bits, so it must be stored in set 2.
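In C, extracting the set index is a shift and a mask; a quick sketch (the function name is mine) that checks the example address:

    #include <assert.h>

    /* Set (row) index: bits 11-6 of the physical address,
     * i.e. 6 bits selecting one of the 64 sets. */
    unsigned set_index(unsigned long long paddr)
    {
        return (unsigned)((paddr >> 6) & 0x3F);
    }

    int main(void)
    {
        assert(set_index(0x800010a0ULL) == 2);   /* the example above */
        return 0;
    }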

But we still have the problem of finding which cell in the row holds the data, if any. That’s where the directory comes in. Each cached line is tagged by its corresponding directory cell; the tag is simply the number for the page where the line came from. The processor can address 64GB of physical RAM, so there are 64GB / 4KB == 2^24 of these pages and thus we need 24 bits for our tag. Our example physical address 0x800010a0 corresponds to page number 524,289. Here’s the second half of the story:
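The tag is just the address shifted past the 12 page-offset bits; a sketch under the same assumptions:

    /* Tag = physical page number = bits 35-12 of the address.
     * 64GB (2^36 bytes) / 4KB pages = 2^24 pages, hence 24 tag bits. */
    unsigned long long tag_of(unsigned long long paddr)
    {
        return paddr >> 12;
    }
    /* tag_of(0x800010a0) == 0x80001 == 524289, the page number above. */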

[Figure: Finding the cache line by matching tags]

Since we only need to look in one set of 8 ways, the tag matching is very fast; in fact, electrically all tags are compared simultaneously, which I tried to show with the arrows. If there’s a valid cache line with a matching tag, we have a cache hit. Otherwise, the request is forwarded to the L2 cache, and failing that to main system memory. Intel builds large L2 caches by playing with the size and quantity of the ways, but the design is the same. For example, you could turn this into a 64KB cache by adding 8 more ways. Then increase the number of sets to 4096 so that each way can store 256KB. These two modifications would deliver a 4MB L2 cache. In this scenario, you’d need 18 bits for the tags and 12 for the set index; the physical “page” size as seen by the cache (the chunk of memory a tag identifies) is equal to its way size.
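If you want to verify that arithmetic, here’s a small sketch deriving the totals and bit widths from the parameters (the variable names are mine):

    #include <stdio.h>

    int main(void)
    {
        unsigned line = 64, sets = 4096, ways = 16;   /* the 4MB L2 above */
        unsigned way_kb = sets * line / 1024;         /* 256 KB per way   */
        unsigned long total_mb = (unsigned long)way_kb * ways / 1024; /* 4 MB */
        /* 64GB = 2^36 addressable bytes; the line offset uses 6 bits and
         * the set index 12 bits (4096 sets), leaving 18 bits of tag.    */
        unsigned tag_bits = 36 - 12 - 6;
        printf("way: %u KB, total: %lu MB, tag: %u bits\n",
               way_kb, total_mb, tag_bits);
        return 0;
    }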

If a set fills up, then a cache line must be evicted before another one can be stored. To avoid this, performance-sensitive programs try to organize their data so that memory accesses are evenly spread among cache lines. For example, suppose a program has an array of 512-byte objects such that some objects are 4KB apart in memory. Fields in these objects fall into the same lines and compete for the same cache set. If the program frequently accesses a given field (e.g., the vtable by calling a virtual method), the set will likely fill up and the cache will start thrashing as lines are repeatedly evicted and later reloaded. Our example L1 cache can only hold the vtables for 8 of these objects due to set size. This is the cost of the set associativity trade-off: we can get cache misses due to set conflicts even when overall cache usage is not heavy. However, due to the relative speeds in a computer, most apps don’t need to worry about this anyway.
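Here is a hedged sketch of the kind of layout that paragraph warns about: a hot field repeated at 4KB strides, so every access lands in the same set (the struct and sizes are invented for illustration):

    /* Hypothetical objects padded so the hot fields sit 4KB apart:
     * every hot_field maps to the same set and competes for
     * its 8 ways in the example L1 cache. */
    struct obj {
        long hot_field;                 /* e.g. a vtable pointer */
        char pad[4096 - sizeof(long)];
    };

    long sum_hot(const struct obj *a, int n)
    {
        long s = 0;
        for (int i = 0; i < n; i++)
            s += a[i].hot_field;        /* n > 8 keeps evicting lines */
        return s;
    }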

A memory access usually starts with a linear (virtual) address, so the L1 cache relies on the paging unit to obtain the physical page address used for the cache tags. By contrast, the set index comes from the least significant bits of the linear address and is used without translation (bits 11-6 in our example). Hence the L1 cache is physically tagged but virtually indexed, helping the CPU to parallelize lookup operations. Because the L1 way is never bigger than an MMU page, a given physical memory location is guaranteed to be associated with the same set even with virtual indexing. L2 caches, on the other hand, must be physically tagged and physically indexed because their way size can be bigger than MMU pages. But then again, by the time a request gets to the L2 cache the physical address was already resolved by the L1 cache, so it works out nicely.
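The invariant that makes this safe is easy to state in code (a sketch, assuming 4KB pages):

    #include <assert.h>

    /* The set index (bits 11-6) lies inside the page offset (bits 11-0),
     * which translation never changes, so a virtual address and the
     * physical address it maps to always agree on the set. */
    void check_same_set(unsigned long long vaddr, unsigned long long paddr)
    {
        assert((vaddr & 0xFFFULL) == (paddr & 0xFFFULL));        /* same offset */
        assert(((vaddr >> 6) & 0x3F) == ((paddr >> 6) & 0x3F));  /* same set    */
    }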

Finally, a directory cell also stores the state of its corresponding cached line. A line in the L1 code cache is either Invalid or Shared (which means valid, really). In the L1 data cache and the L2 cache, a line can be in any of the 4 MESI states: Modified, Exclusive, Shared, or Invalid. Intel caches are inclusive: the contents of the L1 cache are duplicated in the L2 cache. These states will play a part in later posts about threading, locking, and that kind of stuff. Next time we’ll look at the front side bus and how memory access really works. This is going to be memory week.
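For reference, the four states as a plain enum (my own sketch, not Intel’s actual encoding):

    /* MESI states as described above. The L1 code cache only ever
     * uses INVALID and SHARED; the L1 data cache and the L2 use all four. */
    enum mesi_state {
        MESI_MODIFIED,   /* dirty; this cache owns the only copy */
        MESI_EXCLUSIVE,  /* clean; present in this cache only    */
        MESI_SHARED,     /* clean; other caches may hold it too  */
        MESI_INVALID     /* the cell holds no valid line         */
    };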

Update: Dave brought up direct-mapped caches in a comment below. They’re basically a special case of set-associative caches that have only one way. In the trade-off spectrum, they’re the opposite of fully associative caches: blazing fast access, lots of conflict misses.
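A direct-mapped lookup then needs no search across ways: the set index picks the single candidate cell and one tag compare decides hit or miss. A sketch under the same 64-set, 64-byte-line assumptions:

    /* Direct-mapped = set associative with exactly 1 way. */
    struct dm_cache {
        unsigned long long tag[64];
        int valid[64];
    };

    int dm_hit(const struct dm_cache *c, unsigned long long paddr)
    {
        unsigned set = (unsigned)((paddr >> 6) & 0x3F);   /* bits 11-6 */
        return c->valid[set] && c->tag[set] == (paddr >> 12);
    }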

