Ways to reclaim unused page-table pages

At the 2022 LSFMM Summit, David Hildenbrand explored the importance of page-table reclaim, pointing out the problems the current memory-management subsystem has with reclaiming unused page tables. Page tables are a core memory-management data structure, but at present they are not reclaimed well, especially empty page tables and zero-page mappings. An existing patch set can reclaim empty page tables, while detecting and reclaiming zero-page mappings is more involved. The discussion also covered how to keep malicious user-space processes from abusing page-table resources, with proposals to throttle allocation and to watch for suspicious behavior. Better reclaim policies would improve both system performance and memory utilization.

One of the memory-management subsystem's most important jobs is reclaiming unused (or little-used) memory so that it can be put to better use. When it comes to one of the core memory-management data structures — page tables — though, this subsystem often falls down on the job. At the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), David Hildenbrand led a session on the problems posed by the lack of page-table reclaim and explored options for improving the situation.

Page tables, of course, contain the mapping between virtual and physical addresses; every virtual address used by a process must be translated by the hardware using a page-table entry. These tables are arranged in a hierarchy, three to five levels deep, and can take up a fair amount of memory in their own right. On an x86 system, a 2TB mapping will require 4GB of bottom-level page tables, 8MB of the next-level (PMD) table, and smaller amounts for the higher-level tables. These tables are not swappable or movable, and sometimes get in the way of operations like hot-unplugging memory.
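
As a sanity check on those numbers, here is a quick back-of-the-envelope calculation; it is a minimal sketch assuming x86-64's 4KB base pages, 8-byte page-table entries, and 512 entries per table:

```c
#include <stdio.h>

/* Redo the arithmetic above: 2TB of mapped virtual address space,
 * one 8-byte PTE per 4KB page, one 8-byte PMD entry per 2MB. */
int main(void)
{
	unsigned long long mapping   = 2ULL << 40;            /* 2TB          */
	unsigned long long pte_bytes = (mapping >> 12) * 8;   /* PTE overhead */
	unsigned long long pmd_bytes = (mapping >> 21) * 8;   /* PMD overhead */

	printf("PTE tables: %llu GB\n", pte_bytes >> 30);     /* prints 4 */
	printf("PMD table:  %llu MB\n", pmd_bytes >> 20);     /* prints 8 */
	return 0;
}
```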

Page tables are reclaimed by the memory-management subsystem in some cases now. For example, removing a range of address space with munmap() will clean up the associated page tables as well. But there are situations that create empty page tables that are not reclaimed. Memory allocators tend to map large ranges, for example, and to only use parts of the range at any given time. When a given sub-range is no longer needed, the allocator will use madvise() to release the pages, but the associated page tables will not be freed. Large mapped files can also lead to empty page tables. These are all legitimate use cases, and they can be problematic enough; it is also easy to write a malicious program that fills memory with empty page tables and brings the system to a halt, which is rather less legitimate.
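
The allocator pattern is easy to reproduce. The following minimal demonstration (the 1GB range is an arbitrary choice for the demo) frees the physical pages with madvise() but leaves the page tables behind, visible as VmPTE in /proc/&lt;pid&gt;/status:

```c
#include <stdio.h>
#include <sys/mman.h>

#define RANGE (1ULL << 30)	/* 1GB of anonymous memory */

int main(void)
{
	char *p = mmap(NULL, RANGE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Touch every page so PTE pages get allocated for the range. */
	for (unsigned long long off = 0; off < RANGE; off += 4096)
		p[off] = 1;

	/* Give the pages back; the page tables mapping them stay put. */
	if (madvise(p, RANGE, MADV_DONTNEED))
		perror("madvise");

	getchar();	/* pause here and inspect VmPTE in /proc/<pid>/status */
	return 0;
}
```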

In all of these cases, many of the resulting page tables are empty and of no real use; they could be reclaimed without harm. One step in that direction, Hildenbrand said, is this patch set from Qi Zheng. It adds a reference count (pte_ref) to each page table page; when a given page's reference count drops to zero, that page can be reclaimed.
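
The patch set itself is linked above; as a rough user-space model of the idea (not the actual kernel API, and the names here are invented for illustration), the bookkeeping looks something like:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of the pte_ref idea: each page-table page counts its populated
 * entries; when the count drops to zero the page is reclaimable. */
struct pt_page {
	atomic_int pte_ref;		/* populated entries in this table */
	unsigned long entries[512];	/* the PTEs themselves */
};

/* Install a (non-zero) entry, bumping the count if the slot was empty. */
void pt_set(struct pt_page *p, int i, unsigned long pte)
{
	if (p->entries[i] == 0)
		atomic_fetch_add(&p->pte_ref, 1);
	p->entries[i] = pte;
}

/* Clear an entry; returns true when the page just became empty. */
bool pt_clear(struct pt_page *p, int i)
{
	if (p->entries[i] == 0)
		return false;
	p->entries[i] = 0;
	return atomic_fetch_sub(&p->pte_ref, 1) == 1;
}
```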

Another source of empty page-table pages is mappings to the shared zero page. This is a page maintained by the kernel, containing only zeroes, which is used to initialize anonymous-memory mappings. The mapping is copy-on-write, so user space will get a new page should it ever write to a page mapped in this way. It is easy for user space to create large numbers of page-table entries pointing to the zero page, once again filling memory with unreclaimable allocations. This can be done, for example, by simply reading one byte out of every 2MB of address space mapped as anonymous memory. These page-table pages could also be easily reclaimed, though, if the kernel were able to detect them; there is no difference (as far as user space can tell) between a page-table page filled with zero-page mappings and a missing page-table page. At worst, the page-table page would have to be recreated should an address within it be referenced again from user space.
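
That pattern is almost trivially easy to write. In this minimal demonstration (the 64GB figure is an arbitrary choice), each 2MB of address space touched costs the kernel one unreclaimable PTE page, about 128MB of page tables in total, while the process's resident set stays tiny:

```c
#include <stdio.h>
#include <sys/mman.h>

#define SIZE (64ULL << 30)	/* 64GB of virtual address space */

int main(void)
{
	volatile char *p = mmap(NULL, SIZE, PROT_READ,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* One read per 2MB faults in a zero-page mapping, allocating a
	 * fresh PTE page each time (512 entries x 4KB = 2MB). */
	unsigned long long sum = 0;
	for (unsigned long long off = 0; off < SIZE; off += 2ULL << 20)
		sum += p[off];

	printf("allocated ~%llu PTE pages (sum=%llu)\n",
	       SIZE / (2ULL << 20), sum);
	return 0;
}
```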

There are other tricks user space can use to create large numbers of page-table pages. Some of them, like so many good exploits, involve the userfaultfd() system call.

So what is to be done? Reclaiming empty page-table pages is "low-hanging fruit", and the patches to do so already exist. The kernel could also delete mappings to the zero page while scanning memory; that would eventually empty out page-table pages containing only zero-page mappings, and allow those to be reclaimed as well. Ordinary reclaim unmaps file-backed pages, which can create empty page-table pages, but that is not an explicit objective now. It might be possible to get reclaim to focus on pages that are close to each other in the hopes of emptying out page-table pages, which could then be reclaimed as well.
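
A scan like that might boil down to a predicate of roughly this shape; this is a sketch under assumed conventions (512 entries per PTE page, the mapped page-frame number in bits 12 and up, an empty entry encoded as zero), not real kernel code:

```c
#include <stdbool.h>

/* True when every entry in the PTE page is either empty or maps the
 * shared zero page, i.e. the whole table could be torn down. */
bool pte_page_reclaimable(const unsigned long *ptes, unsigned long zero_pfn)
{
	for (int i = 0; i < 512; i++) {
		unsigned long pte = ptes[i];

		if (pte != 0 && (pte >> 12) != zero_pfn)
			return false;	/* real data mapped; keep the table */
	}
	return true;	/* only empty or zero-page entries */
}
```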

There are some other questions to be answered as well, though. Some page-table pages may be actively used, even if they just map the zero page; reclaiming them could hurt performance if they just have to be quickly recreated again. There are also questions about whether it makes sense to reclaim higher-level page tables. Reclaiming empty PMD pages might be worthwhile, Hildenbrand said, but there is unlikely to be value in trying to go higher.

When should this reclaim happen? With regard to empty page-table pages, the answer is easy — as soon as they become empty. For zero-page mappings, the problem is a little harder. They can be found by scanning, but that scanning has its own cost. So perhaps this scanning should only happen when the kernel determines that there is suspicious zero-page activity happening. But defining "suspicious" could be tricky. Virtual machines tend to create a lot of zero-page mappings naturally, for example, so that is not suspicious on its own; it may be necessary to rely on more subjective heuristics. One such might be to detect when a process has a high ratio of page-table pages to its overall resident-set size.
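
Both quantities are already exported in /proc/&lt;pid&gt;/status, as VmPTE and VmRSS. A minimal sketch of such a check (the 5% threshold is an arbitrary value for illustration):

```c
#include <stdio.h>

/* Flag a process whose page-table footprint (VmPTE) is large relative
 * to its resident-set size (VmRSS); here we check ourselves. */
int main(void)
{
	FILE *f = fopen("/proc/self/status", "r");
	char line[256];
	long rss = 0, pte = 0;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		sscanf(line, "VmRSS: %ld", &rss);	/* both in kB */
		sscanf(line, "VmPTE: %ld", &pte);
	}
	fclose(f);

	if (rss > 0 && pte * 100 > rss * 5)
		printf("suspicious: VmPTE=%ldkB is >5%% of VmRSS=%ldkB\n",
		       pte, rss);
	else
		printf("VmPTE=%ldkB, VmRSS=%ldkB looks normal\n", pte, rss);
	return 0;
}
```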

Even when a removable page-table page has been identified, he said, removing it may not be trivial. There is a strong desire to avoid acquiring the mmap_lock, which could create contention issues; that limits when these removals can be done. Removing higher-level page-table pages is harder, since ordinary scanning can hold a reference to them for a long time. Waiting for the reference to go away could stall the reclaim process indefinitely.

With regard to what can be done to defeat a malicious user-space process, Hildenbrand said that the allocation of page-table pages must be throttled somehow. The best way might be to enforce reclaim of page-table pages against processes that are behaving in a suspicious way. One way to find such processes could be to just set a threshold on the amount of memory used for page-table pages, but that could perhaps be circumvented simply by spawning a lot of processes.

At this point, the session wound down. Yang Shi suggested at the end that there might be some help to be found in the multi-generational LRU work, if and when that is merged. It has to scan page tables anyway, so adding the detection of empty page-table pages might fit in relatively easily.

Original article: Ways to reclaim unused page-table pages [LWN.net]
