While troubleshooting a memory issue in a container recently, I read through a number of articles on Linux memory management; here is a brief write-up of my notes:
1. Linux memory parameters, mainly those controlling a cgroup's resource configuration:
A control group (cgroup) is a Linux kernel feature that limits, accounts for, and isolates the resource usage (CPU, memory, disk I/O, network, and so on) of a collection of processes.
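As a minimal illustration of the memory controller (a sketch assuming cgroup v1 and root privileges; the cgroup name "demo" and the 256 MB limit are arbitrary examples):

    # create a memory cgroup with a 256 MB limit (cgroup v1, run as root)
    mkdir /sys/fs/cgroup/memory/demo
    echo $((256*1024*1024)) > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
    # move the current shell into the cgroup; its children inherit the limit
    echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs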
path: /sys/fs/cgroup/memory/
- memory.limit_in_bytes
  the maximum amount of memory available to the cgroup
- memory.usage_in_bytes
  the memory currently used by all processes in the cgroup
- memory.memsw.limit_in_bytes
  the maximum total of memory plus swap available to the cgroup
- memory.memsw.usage_in_bytes
  the total memory plus swap currently used by all processes in the cgroup
- memory.oom_control
  whether the OOM killer is enabled for the cgroup:
  oom_kill_disable: enabled (0) or disabled (1).
  which OOM state the cgroup is in:
  under_oom: under OOM (1, processes are paused) or not under OOM (0).
- /proc/<pid>/oom_score
  view the process's OOM score
- /proc/<pid>/oom_score_adj
  adjust the weight of the OOM score
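For example, these files can be inspected for a Docker container's cgroup as follows (a sketch assuming cgroup v1; the container name "test" and target process are hypothetical examples):

    # resolve the full container ID (Docker names the cgroup after it)
    CID=$(docker ps --no-trunc -qf name=test)
    cd /sys/fs/cgroup/memory/docker/$CID
    cat memory.limit_in_bytes         # max memory for the cgroup, in bytes
    cat memory.usage_in_bytes         # current usage of all processes in the cgroup
    cat memory.memsw.limit_in_bytes   # max memory + swap
    cat memory.memsw.usage_in_bytes   # current memory + swap usage
    cat memory.oom_control            # prints the oom_kill_disable and under_oom flags
    # per-process OOM score and adjustment (oom_score_adj range: -1000 to 1000)
    PID=$(pgrep -f test | head -1)    # hypothetical target process
    cat /proc/$PID/oom_score
    echo -500 > /proc/$PID/oom_score_adj   # as root: make this process less likely to be killed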
2. Analyzing the system log after an OOM kill:
Below is the system log captured after an OOM kill occurred, located with sudo egrep -i -r 'killed process' /var/log:
Sep 23 08:21:12 ip-000-00-00-000 kernel: test invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
Sep 23 08:21:12 ip-000-00-00-000 kernel: test cpuset=14f5209827883f18be016f8015f7c69000741877199d5802c5326dee0cbb899f mems_allowed=0
Sep 23 08:21:12 ip-000-00-00-000 kernel: CPU: 3 PID: 14239 Comm: test Kdump: loaded Tainted: G
Sep 23 08:21:12 ip-000-00-00-000 kernel: Hardware name: Amazon EC2 c6i.4xlarge/, BIOS 1.0 10/16/2017
Sep 23 08:21:12 ip-000-00-00-000 kernel: Call Trace:
Sep 23 08:21:12 ip-000-00-00-000 kernel: [<ffffffffb95b1bec>] dump_stack+0x19/0x1f
Sep 23 08:21:12 ip-000-00-00-000 kernel: [<ffffffffb95acb4f>] dump_header+0x90/0x22d
Sep 23 08:21:12 ip-000-00-00-000 kernel: [<ffffffffb912235b>] ? cred_has_capability+0x6b/0x130
Sep 23 08:21:12 ip-000-00-00-000 kernel: [<ffffffffb8f2ed94>] ? __css_tryget+0x24/0x50
Sep 23 08:21:12 ip-000-00-00-000 kernel: [<ffffffffb904a4a8>] ? try_get_mem_cgroup_from_mm+0x28/0x70
Sep 23 08:21:12 ip-000-00-00-000 kernel: [<ffffffffb8fcd3f5>] oom_kill_process+0x2d5/0x4a0
Sep 23 08:21:12 ip-000-00-00-000 kernel: [<ffffffffb912244e>] ? selinux_capable+0x2e/0x40
Sep 23 08:21:12 ip-000-00-00-000 kernel: [<ffffffffb904e98c>] mem_cgroup_oom_synchronize+0x55c/0x590
Sep 23 08:21:12 ip-000-00-00-000 kernel: [<ffffffffb904dde0>] ? mem_cgroup_charge_common+0xc0/0xc0
Sep 23 08:21:12 ip-000-00-00-000 kernel: [<ffffffffb8fcdcf4>] pagefault_out_of_memory+0x14/0x90
Sep 23 08:21:12 ip-000-00-00-000 kernel: [<ffffffffb95aaf88>] mm_fault_error+0x6a/0x15b
Sep 23 08:21:12 ip-000-00-00-000 kernel: [<ffffffffb95bfa61>] __do_page_fault+0x4a1/0x510
Sep 23 08:21:12 ip-000-00-00-000 kernel: [<ffffffffb95bfbb6>] trace_do_page_fault+0x56/0x150
Sep 23 08:21:12 ip-000-00-00-000 kernel: [<ffffffffb95bf112>] do_async_page_fault+0x22/0x100
Sep 23 08:21:12 ip-000-00-00-000 kernel: [<ffffffffb95bb7e8>] async_page_fault+0x28/0x30
Sep 23 08:21:12 ip-000-00-00-000 kernel: Task in /docker/14f5209827883f18be016f8015f7c69000741877199d5802c5326dee0cbb899f killed as a result of limit of /docker/14f5209827883f18be016f8015f7c69000741877199d5802c5326dee0cbb899f
Sep 23 08:21:12 ip-000-00-00-000 kernel: memory: usage 4193568kB, limit 4194304kB, failcnt 9360087
Sep 23 08:21:12 ip-000-00-00-000 kernel: memory+swap: usage 4193948kB, limit 4194304kB, failcnt 547
Sep 23 08:21:12 ip-000-00-00-000 kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Sep 23 08:21:12 ip-000-00-00-000 kernel: Memory cgroup stats for /docker/14f5209827883f18be016f8015f7c69000741877199d5802c5326dee0cbb899f: cache:100KB rss:4194084KB rss_huge:989184KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:4194140KB inactive_file:0KB active_file:0KB unevictable:0KB
Sep 23 08:21:12 ip-000-00-00-000 kernel: [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Sep 23 08:21:12 ip-000-00-00-000 kernel: [13494] 0 13494 2922 282 11 0 0 bash
Sep 23 08:21:12 ip-000-00-00-000 kernel: [13558] 0 13558 2957 450 11 0 0 sh
Sep 23 08:21:12 ip-000-00-00-000 kernel: [13589] 0 13589 32884 1094 68 0 0 sudo
Sep 23 08:21:12 ip-000-00-00-000 kernel: [13590] 0 13590 2392 350 9 0 0 app1
Sep 23 08:21:12 ip-000-00-00-000 kernel: [13720] 0 13720 32884 1095 65 0 0 sudo
Sep 23 08:21:12 ip-000-00-00-000 kernel: [13721] 0 13721 2391 346 10 0 0 app2
Sep 23 08:21:12 ip-000-00-00-000 kernel: [ 9090] 0 9090 2957 484 12 0 0 bash
Sep 23 08:21:12 ip-000-00-00-000 kernel: [26182] 0 26182 3878119 518938 2607 0 0 app3
Sep 23 08:21:12 ip-000-00-00-000 kernel: [14211] 0 14211 4120327 521111 2439 0 0 test
Sep 23 08:21:12 ip-000-00-00-000 kernel: [16224] 0 16224 1091 90 8 0 0 sleep
Sep 23 08:21:12 ip-000-00-00-000 kernel: [16232] 0 16232 1091 89 7 0 0 sleep
Sep 23 08:21:12 ip-000-00-00-000 kernel: Memory cgroup out of memory: Kill process 15622 (test) score 484 or sacrifice child
Sep 23 08:21:12 ip-000-00-00-000 kernel: Killed process 14211 (test), UID 0, total-vm:16481308kB, anon-rss:2076604kB, file-rss:7840kB, shmem-rss:0kB
PID 14211 was killed. The log reports its rss as 521111 (in units of 4 kB pages): 521111 * 4 kB = 2084444 kB = 2076604 kB + 7840 kB = anon-rss + file-rss.
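The same arithmetic as a quick shell check (the values are taken from the log above; 4 kB is the page size on x86-64):

    getconf PAGESIZE              # 4096
    echo $((521111 * 4))          # rss in kB: 2084444
    echo $((2076604 + 7840))      # anon-rss + file-rss: 2084444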
Reference:
- 3.7. memory | Red Hat Product Documentation
- https://docs.oracle.com/en/operating-systems/oracle-linux/6/adminsg/ol_memory_cgroups.html
- memory - Understanding the Linux oom-killer's logs - Stack Overflow
- proc_pid_oom_score_adj(5) - Linux manual page
- linux - How does the OOM killer decide which process to kill first? - Unix & Linux Stack Exchange
- What Are Namespaces and cgroups, and How Do They Work? - Cirata Community