The meaning of memory.move_charge_at_immigrate in the Linux memory cgroup

1. Kernel documentation

The example in the kernel documentation illustrates the following:

Suppose a process starts out in cgroup A and is later migrated into cgroup B. Whether the process's memory accounting is fully moved into B along with it is controlled by memory.move_charge_at_immigrate: if the destination cgroup B has 1 written to this field, then on migration the process's memory charges are removed from cgroup A (uncharge from A) and added to B (charge to B).
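As a concrete illustration, here is a minimal userspace sketch in C of that setup: it writes to the destination cgroup's memory.move_charge_at_immigrate and cgroup.procs files, which is exactly the write() path traced later in this article. The cgroup path /sys/fs/cgroup/memory/B and the PID 130 are placeholders chosen to match the traces; per the cgroup v1 memory documentation, the field is a bitmask (bit 0 covers anonymous pages, bit 1 covers mapped file pages), so writing 3 would enable both.

/* Sketch only: enable charge moving on cgroup B, then migrate a task into it.
 * Paths and the PID are placeholders; a cgroup v1 memory hierarchy is assumed
 * to be mounted at /sys/fs/cgroup/memory. */
#include <stdio.h>
#include <stdlib.h>

static void write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                exit(1);
        }
        fputs(val, f);
        fclose(f);
}

int main(void)
{
        /* Bit 0 = anonymous pages; writing "3" would also move mapped file pages. */
        write_str("/sys/fs/cgroup/memory/B/memory.move_charge_at_immigrate", "1");

        /* Moving the task triggers the uncharge-from-A / charge-to-B path
         * traced below (cgroup1_procs_write -> ... -> mem_cgroup_move_charge). */
        write_str("/sys/fs/cgroup/memory/B/cgroup.procs", "130\n");

        return 0;
}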

2. Source code flow
2.1 Call stack for uncharging A and charging B

As mentioned above, when a process is migrated between cgroups and the destination cgroup has move_charge_at_immigrate enabled, the process's memory usage must be subtracted from the source cgroup and added to the destination cgroup. This is implemented in the mem_cgroup_move_account function; the concrete call stack is shown below, followed by a simplified userspace sketch of the same accounting move:

(gdb) bt
#0  mem_cgroup_move_account (page=0xffffea0000086300, compound=false, from=0xffff888005fba000, to=0xffff888005fb9000) at mm/memcontrol.c:5592
#1  0xffffffff81288da3 in mem_cgroup_move_charge_pte_range (pmd=<optimized out>, addr=8949760, end=8962048, walk=<optimized out>) at mm/memcontrol.c:6072
#2  0xffffffff8123d3eb in walk_pmd_range (walk=<optimized out>, end=<optimized out>, addr=8962048, pud=<optimized out>) at mm/pagewalk.c:89
#3  walk_pud_range (walk=<optimized out>, end=<optimized out>, addr=8962048, p4d=<optimized out>) at mm/pagewalk.c:160
#4  walk_p4d_range (walk=<optimized out>, end=<optimized out>, addr=8962048, pgd=<optimized out>) at mm/pagewalk.c:193
#5  walk_pgd_range (walk=<optimized out>, end=<optimized out>, addr=8962048) at mm/pagewalk.c:229
#6  __walk_page_range (start=8949760, end=<optimized out>, walk=0xffff888005f73d48) at mm/pagewalk.c:331
#7  0xffffffff8123d9f5 in walk_page_range (mm=<optimized out>, start=8949760, end=18446612682170408960, ops=<optimized out>, private=<optimized out>) at mm/pagewalk.c:427
#8  0xffffffff812857b6 in mem_cgroup_move_charge () at mm/memcontrol.c:6145
#9  mem_cgroup_move_task () at mm/memcontrol.c:6155
#10 0xffffffff81182b1c in cgroup_procs_write_finish (task=<optimized out>, locked=<optimized out>) at kernel/cgroup/cgroup.c:2827
#11 0xffffffff8118856d in __cgroup1_procs_write (of=0xffff888005fa46c0, buf=<optimized out>, nbytes=<optimized out>, threadgroup=<optimized out>, off=<optimized out>) at kernel/cgroup/cgroup-v1.c:522
#12 0xffffffff811885ae in cgroup1_procs_write (of=<optimized out>, buf=<optimized out>, nbytes=<optimized out>, off=<optimized out>) at kernel/cgroup/cgroup-v1.c:532
#13 0xffffffff8117fa48 in cgroup_file_write (of=<optimized out>, buf=<optimized out>, nbytes=4, off=<optimized out>) at kernel/cgroup/cgroup.c:3697
#14 0xffffffff81337576 in kernfs_fop_write (file=<optimized out>, user_buf=<optimized out>, count=<optimized out>, ppos=0xffff888005fb9000) at fs/kernfs/file.c:315
#15 0xffffffff81297bfc in vfs_write (pos=<optimized out>, count=4, buf=<optimized out>, file=<optimized out>) at fs/read_write.c:584
#16 vfs_write (file=0xffff888005faff00, buf=0x2806fd0 "130\n", count=<optimized out>, pos=0xffff888005f73ef0) at fs/read_write.c:566
#17 0xffffffff81297eac in ksys_write (fd=<optimized out>, buf=0x2806fd0 "130\n", count=4) at fs/read_write.c:639
#18 0xffffffff81297f35 in __do_sys_write (count=<optimized out>, buf=<optimized out>, fd=<optimized out>) at fs/read_write.c:651
#19 __se_sys_write (count=<optimized out>, buf=<optimized out>, fd=<optimized out>) at fs/read_write.c:648
#20 __x64_sys_write (regs=<optimized out>) at fs/read_write.c:648
#21 0xffffffff81c71608 in do_syscall_64 (nr=<optimized out>, regs=0xffff888005f73f58) at arch/x86/entry/common.c:46
#22 0xffffffff81e0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:118
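To make the uncharge-A/charge-B step concrete, here is a small userspace toy model in C. It is not kernel code: the real mem_cgroup_move_account() operates on struct page and page counters under appropriate locks, while this sketch only mirrors the accounting idea; the struct names and the move_account() helper are invented for illustration.

/* Toy model of moving a page's charge from cgroup A to cgroup B.
 * Illustrative only; the real logic lives in mem_cgroup_move_account(). */
#include <stdio.h>

struct toy_memcg {
        const char *name;
        long charged_pages;        /* pages currently accounted to this cgroup */
};

struct toy_page {
        struct toy_memcg *memcg;   /* stands in for page->mem_cgroup */
};

/* Uncharge from the source, charge to the destination, re-point the page. */
static void move_account(struct toy_page *page, struct toy_memcg *from,
                         struct toy_memcg *to)
{
        from->charged_pages--;     /* "uncharge from A" */
        to->charged_pages++;       /* "charge to B"     */
        page->memcg = to;          /* the page now belongs to the destination */
}

int main(void)
{
        struct toy_memcg a = { "A", 1 }, b = { "B", 0 };
        struct toy_page page = { &a };

        move_account(&page, &a, &b);
        printf("A=%ld B=%ld page->memcg=%s\n",
               a.charged_pages, b.charged_pages, page.memcg->name);
        return 0;
}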
2.2 Where the cgroups field of the process's task_struct is updated to point to cgroup B

During the cgroup migration there are two pointer updates worth paying attention to. The first is page->mem_cgroup: if move_charge_at_immigrate is set, then since the memory accounting has been moved to the destination cgroup, page->mem_cgroup must also be re-pointed at the destination cgroup; that reassignment happens inside mem_cgroup_move_account() shown in the stack above. The second, which this section is about, is the cgroups field of the process's task_struct: it has to be switched to the css_set of the destination cgroup, and that is done in the cgroup_move_task function. Its call stack is as follows (a small userspace check of the result follows after the stack):
#0  cgroup_move_task (task=0xffff8880074f3400, to=0xffff888005f18600) at kernel/sched/psi.c:962
#1  0xffffffff81182e54 in css_set_move_task (task=0xffff8880074f3400, from_cset=0xffff888005fd7000, to_cset=0xffff888005f18600, use_mg_tasks=<optimized out>) at kernel/cgroup/cgroup.c:899
#2  0xffffffff811830b1 in cgroup_migrate_execute (mgctx=<optimized out>) at kernel/cgroup/cgroup.c:2426
#3  0xffffffff811833cb in cgroup_migrate (leader=0xffff8880074f3400, threadgroup=true, mgctx=0xffff888005f73d30) at kernel/cgroup/cgroup.c:2707
#4  0xffffffff81183519 in cgroup_attach_task (dst_cgrp=0xffff888005f9e800, leader=0xffff8880074f3400, threadgroup=true) at kernel/cgroup/cgroup.c:2740
#5  0xffffffff8118855e in __cgroup1_procs_write (of=0xffff888005fa43c0, buf=<optimized out>, nbytes=<optimized out>, threadgroup=<optimized out>, off=<optimized out>) at kernel/cgroup/cgroup-v1.c:519
#6  0xffffffff811885ae in cgroup1_procs_write (of=<optimized out>, buf=<optimized out>, nbytes=<optimized out>, off=<optimized out>) at kernel/cgroup/cgroup-v1.c:532
#7  0xffffffff8117fa48 in cgroup_file_write (of=<optimized out>, buf=<optimized out>, nbytes=4, off=<optimized out>) at kernel/cgroup/cgroup.c:3697
#8  0xffffffff81337576 in kernfs_fop_write (file=<optimized out>, user_buf=<optimized out>, count=<optimized out>, ppos=0xffff888005fd70a0) at fs/kernfs/file.c:315
#9  0xffffffff81297bfc in vfs_write (pos=<optimized out>, count=4, buf=<optimized out>, file=<optimized out>) at fs/read_write.c:584
#10 vfs_write (file=0xffff888005fc4100, buf=0x28070f0 "145\n", count=<optimized out>, pos=0xffff888005f73ef0) at fs/read_write.c:566
#11 0xffffffff81297eac in ksys_write (fd=<optimized out>, buf=0x28070f0 "145\n", count=4) at fs/read_write.c:639
#12 0xffffffff81297f35 in __do_sys_write (count=<optimized out>, buf=<optimized out>, fd=<optimized out>) at fs/read_write.c:651
#13 __se_sys_write (count=<optimized out>, buf=<optimized out>, fd=<optimized out>) at fs/read_write.c:648
#14 __x64_sys_write (regs=<optimized out>) at fs/read_write.c:648
#15 0xffffffff81c71608 in do_syscall_64 (nr=<optimized out>, regs=0xffff888005f73f58) at arch/x86/entry/common.c:46
#16 0xffffffff81e0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:118
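Once cgroup_move_task() has re-pointed task->cgroups at the new css_set, the change is visible from userspace through /proc/<pid>/cgroup. The small check below is only a convenience for verifying the result of the migration; the PID 130 matches the one written in the traces above and is otherwise arbitrary.

/* Print /proc/<pid>/cgroup to confirm which cgroups a task now belongs to.
 * After migrating PID 130 into B, its memory controller line should show B. */
#include <stdio.h>

int main(void)
{
        char path[64], line[256];
        int pid = 130;              /* placeholder PID, matches the trace above */
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/cgroup", pid);
        f = fopen(path, "r");
        if (!f) {
                perror(path);
                return 1;
        }
        while (fgets(line, sizeof(line), f))
                fputs(line, stdout);    /* e.g. "5:memory:/B" on a v1 hierarchy */
        fclose(f);
        return 0;
}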
