Ways to reclaim unused page-table pages

At the 2022 LSFMM Summit, David Hildenbrand explored the importance of page-table reclaim, pointing out the problems the current memory-management subsystem has with reclaiming unused page tables. Page tables are a core memory-management data structure, but at present they are not reclaimed well, especially empty page tables and zero-page mappings. An existing patch set can reclaim empty page tables, while detecting and reclaiming zero-page mappings is more involved. The discussion also covered how to keep malicious user-space processes from abusing page-table resources, with proposals to throttle allocation and to watch for suspicious behavior. Better reclaim policies would improve both system performance and memory utilization.

One of the memory-management subsystem's most important jobs is reclaiming unused (or little-used) memory so that it can be put to better use. When it comes to one of the core memory-management data structures — page tables — though, this subsystem often falls down on the job. At the 2022 Linux Storage, Filesystem, Memory-management and BPF Summit (LSFMM), David Hildenbrand led a session on the problems posed by the lack of page-table reclaim and explored options for improving the situation.

Page tables, of course, contain the mapping between virtual and physical addresses; every virtual address used by a process must be translated by the hardware using a page-table entry. These tables are arranged in a hierarchy, three to five levels deep, and can take up a fair amount of memory in their own right. On an x86 system, a 2TB mapping will require 4GB of bottom-level page tables, 8MB of the next-level (PMD) table, and smaller amounts for the higher-level tables. These tables are not swappable or movable, and sometimes get in the way of operations like hot-unplugging memory.
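
As a sanity check on those numbers, here is a quick back-of-the-envelope calculation; it is a minimal sketch assuming x86-64's 4KB base pages, 8-byte page-table entries, and 512 entries per table:

```c
#include <stdio.h>

/* Redo the arithmetic above: 2TB of mapped virtual address space,
 * one 8-byte PTE per 4KB page, one 8-byte PMD entry per 2MB. */
int main(void)
{
	unsigned long long mapping   = 2ULL << 40;            /* 2TB          */
	unsigned long long pte_bytes = (mapping >> 12) * 8;   /* PTE overhead */
	unsigned long long pmd_bytes = (mapping >> 21) * 8;   /* PMD overhead */

	printf("PTE tables: %llu GB\n", pte_bytes >> 30);     /* prints 4 */
	printf("PMD table:  %llu MB\n", pmd_bytes >> 20);     /* prints 8 */
	return 0;
}
```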

Page tables are reclaimed by the memory-management subsystem in some cases now. For example, removing a range of address space with munmap() will clean up the associated page tables as well. But there are situations that create empty page tables that are not reclaimed. Memory allocators tend to map large ranges, for example, and to only use parts of the range at any given time. When a given sub-range is no longer needed, the allocator will use madvise() to release the pages, but the associated page tables will not be freed. Large mapped files can also lead to empty page tables. These are all legitimate use cases, and they can be problematic enough; it is also easy to write a malicious program that fills memory with empty page tables and brings the system to a halt, which is rather less legitimate.
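
The allocator pattern is easy to reproduce. The following minimal demonstration (the 1GB range is an arbitrary choice for the demo) frees the physical pages with madvise() but leaves the page tables behind, visible as VmPTE in /proc/&lt;pid&gt;/status:

```c
#include <stdio.h>
#include <sys/mman.h>

#define RANGE (1ULL << 30)	/* 1GB of anonymous memory */

int main(void)
{
	char *p = mmap(NULL, RANGE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* Touch every page so PTE pages get allocated for the range. */
	for (unsigned long long off = 0; off < RANGE; off += 4096)
		p[off] = 1;

	/* Give the pages back; the page tables mapping them stay put. */
	if (madvise(p, RANGE, MADV_DONTNEED))
		perror("madvise");

	getchar();	/* pause here and inspect VmPTE in /proc/<pid>/status */
	return 0;
}
```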

In all of these cases, many of the resulting page tables are empty and of no real use; they could be reclaimed without harm. One step in that direction, Hildenbrand said, is this patch set from Qi Zheng. It adds a reference count (pte_ref) to each page table page; when a given page's reference count drops to zero, that page can be reclaimed.
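
The patch set itself is linked above; as a rough user-space model of the idea (not the actual kernel API, and the names here are invented for illustration), the bookkeeping looks something like:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model of the pte_ref idea: each page-table page counts its populated
 * entries; when the count drops to zero the page is reclaimable. */
struct pt_page {
	atomic_int pte_ref;		/* populated entries in this table */
	unsigned long entries[512];	/* the PTEs themselves */
};

/* Install a (non-zero) entry, bumping the count if the slot was empty. */
void pt_set(struct pt_page *p, int i, unsigned long pte)
{
	if (p->entries[i] == 0)
		atomic_fetch_add(&p->pte_ref, 1);
	p->entries[i] = pte;
}

/* Clear an entry; returns true when the page just became empty. */
bool pt_clear(struct pt_page *p, int i)
{
	if (p->entries[i] == 0)
		return false;
	p->entries[i] = 0;
	return atomic_fetch_sub(&p->pte_ref, 1) == 1;
}
```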

Another source of empty page-table pages is mappings to the shared zero page. This is a page maintained by the kernel, containing only zeroes, which is used to initialize anonymous-memory mappings. The mapping is copy-on-write, so user space will get a new page should it ever write to a page mapped in this way. It is easy for user space to create large numbers of page-table entries pointing to the zero page, once again filling memory with unreclaimable allocations. This can be done, for example, by simply reading one byte out of every 2MB of address space mapped as anonymous memory. These page-table pages could also be easily reclaimed, though, if the kernel were able to detect them; there is no difference (as far as user space can tell) between a page-table page filled with zero-page mappings and a missing page-table page. At worst, the page-table page would have to be recreated should an address within it be referenced again from user space.
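
That pattern is almost trivially easy to write. In this minimal demonstration (the 64GB figure is an arbitrary choice), each 2MB of address space touched costs the kernel one unreclaimable PTE page, about 128MB of page tables in total, while the process's resident set stays tiny:

```c
#include <stdio.h>
#include <sys/mman.h>

#define SIZE (64ULL << 30)	/* 64GB of virtual address space */

int main(void)
{
	volatile char *p = mmap(NULL, SIZE, PROT_READ,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;

	/* One read per 2MB faults in a zero-page mapping, allocating a
	 * fresh PTE page each time (512 entries x 4KB = 2MB). */
	unsigned long long sum = 0;
	for (unsigned long long off = 0; off < SIZE; off += 2ULL << 20)
		sum += p[off];

	printf("allocated ~%llu PTE pages (sum=%llu)\n",
	       SIZE / (2ULL << 20), sum);
	return 0;
}
```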

There are other tricks user space can use to create large numbers of page-table pages. Some of them, like so many good exploits, involve the userfaultfd() system call.

So what is to be done? Reclaiming empty page-table pages is "low-hanging fruit", and the patches to do so already exist. The kernel could also delete mappings to the zero page while scanning memory; that would eventually empty out page-table pages containing only zero-page mappings, and allow those to be reclaimed as well. Ordinary reclaim unmaps file-backed pages, which can create empty page-table pages, but that is not an explicit objective now. It might be possible to get reclaim to focus on pages that are close to each other in the hopes of emptying out page-table pages, which could then be reclaimed as well.
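
A scan like that might boil down to a predicate of roughly this shape; this is a sketch under assumed conventions (512 entries per PTE page, the mapped page-frame number in bits 12 and up, an empty entry encoded as zero), not real kernel code:

```c
#include <stdbool.h>

/* True when every entry in the PTE page is either empty or maps the
 * shared zero page, i.e. the whole table could be torn down. */
bool pte_page_reclaimable(const unsigned long *ptes, unsigned long zero_pfn)
{
	for (int i = 0; i < 512; i++) {
		unsigned long pte = ptes[i];

		if (pte != 0 && (pte >> 12) != zero_pfn)
			return false;	/* real data mapped; keep the table */
	}
	return true;	/* only empty or zero-page entries */
}
```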

There are some other questions to be answered as well, though. Some page-table pages may be actively used, even if they just map the zero page; reclaiming them could hurt performance if they just have to be quickly recreated again. There are also questions about whether it makes sense to reclaim higher-level page tables. Reclaiming empty PMD pages might be worthwhile, Hildenbrand said, but there is unlikely to be value in trying to go higher.

When should this reclaim happen? With regard to empty page-table pages, the answer is easy — as soon as they become empty. For zero-page mappings, the problem is a little harder. They can be found by scanning, but that scanning has its own cost. So perhaps this scanning should only happen when the kernel determines that there is suspicious zero-page activity happening. But defining "suspicious" could be tricky. Virtual machines tend to create a lot of zero-page mappings naturally, for example, so that is not suspicious on its own; it may be necessary to rely on more subjective heuristics. One such might be to detect when a process has a high ratio of page-table pages to its overall resident-set size.
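
Both quantities are already exported in /proc/&lt;pid&gt;/status, as VmPTE and VmRSS. A minimal sketch of such a check (the 5% threshold is an arbitrary value for illustration):

```c
#include <stdio.h>

/* Flag a process whose page-table footprint (VmPTE) is large relative
 * to its resident-set size (VmRSS); here we check ourselves. */
int main(void)
{
	FILE *f = fopen("/proc/self/status", "r");
	char line[256];
	long rss = 0, pte = 0;

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		sscanf(line, "VmRSS: %ld", &rss);	/* both in kB */
		sscanf(line, "VmPTE: %ld", &pte);
	}
	fclose(f);

	if (rss > 0 && pte * 100 > rss * 5)
		printf("suspicious: VmPTE=%ldkB is >5%% of VmRSS=%ldkB\n",
		       pte, rss);
	else
		printf("VmPTE=%ldkB, VmRSS=%ldkB looks normal\n", pte, rss);
	return 0;
}
```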

Even when a removable page-table page has been identified, he said, removing it may not be trivial. There is a strong desire to avoid acquiring the mmap_lock, which could create contention issues; that limits when these removals can be done. Removing higher-level page-table pages is harder, since ordinary scanning can hold a reference to them for a long time. Waiting for the reference to go away could stall the reclaim process indefinitely.

With regard to what can be done to defeat a malicious user-space process, Hildenbrand said that the allocation of page-table pages must be throttled somehow. The best way might be to enforce reclaim of page-table pages against processes that are behaving in a suspicious way. One way to find such processes could be to just set a threshold on the amount of memory used for page-table pages, but that could perhaps be circumvented simply by spawning a lot of processes.

At this point, the session wound down. Yang Shi suggested at the end that there might be some help to be found in the multi-generational LRU work, if and when that is merged. It has to scan page tables anyway, so adding the detection of empty page-table pages might fit in relatively easily.

Original article: Ways to reclaim unused page-table pages [LWN.net]
