Linux page cache & buffer cache

License: CC 4.0 BY-SA

Original link: https://blog.youkuaiyun.com/fall221/article/details/46290563

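This article takes a close look at memory management in Linux, in particular the difference between buffers and cached and the role each plays. It also covers how to tune kernel parameters to optimize memory use, including cache control and dirty-page management policies.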

Reposted from https://blog.youkuaiyun.com/fall221/article/details/46290563

https://blog.youkuaiyun.com/damontive/article/details/80552566

1. Buffers and cached

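The output of the free command includes buffers and cached; the memory that is actually available is shown on the second line (-/+ buffers/cache).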

[root@localhost chenming] # free -m
             total       used       free     shared    buffers     cached
Mem:          1006        985         20          0         53        804
-/+ buffers/cache:        127        878
Swap:         1235          3       1232
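The "-/+ buffers/cache" line can be derived from the Mem: line. The following is a minimal Python sketch of that arithmetic using the numbers printed above; the function name is made up for illustration and is not part of free itself.

```python
def buffers_cache_line(used, free, buffers, cached):
    """Return (used, free) as they appear on the '-/+ buffers/cache' line.

    Memory held by buffers and cached is reclaimable, so it is subtracted
    from 'used' and added to 'free'.
    """
    really_used = used - buffers - cached
    really_free = free + buffers + cached
    return really_used, really_free


# Values in MB, taken from the Mem: line of the free -m output above.
used_adj, free_adj = buffers_cache_line(used=985, free=20, buffers=53, cached=804)
print(used_adj, free_adj)
```

The results land within 1 MB of the printed 127 and 878. The small difference is most likely because free -m converts each column to whole megabytes separately.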

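In the output above, the memory actually available is 878 MB. Because a large cp had just been run, cached had grown all the way to 804 MB. So what is cached, and what is buffers? In short, cached holds file contents that have been read or written, while buffers holds raw disk blocks that are not file data, mostly filesystem metadata such as superblocks, on-disk inode tables and directory blocks. (The in-memory dentry and inode objects themselves live in slab caches and are counted in neither column.) Metadata is obviously far smaller than file contents, so buffers is normally far smaller than cached. The cached part is called the page cache and the buffers part is called the buffer cache. For the difference between the two, see the following article: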

http://www.quora.com/Linux-Kernel/What-is-the-major-difference-between-the-buffer-cache-and-the-page-cache

The page cache caches pages of files to optimize file I/O. The buffer cache caches disk blocks to optimize block I/O.

The VM subsystem now drives I/O and it does so out of the page cache. If cached data has both a file and a block representation—as most data does—the buffer cache will simply point into the page cache; thus only one instance of the data is cached in memory. The page cache is what you picture when you think of a disk cache: It caches file data from a disk to make subsequent I/O faster.

The buffer cache remains, however, as the kernel still needs to perform block I/O in terms of blocks, not pages. As most blocks represent file data, most of the buffer cache is represented by the page cache. But a small amount of block data isn't file backed—metadata and raw block I/O for example—and thus is solely represented by the buffer cache.

For the difference between the buffers and cached columns reported by free, see the following article:

http://www.quora.com/What-is-the-difference-between-Buffers-and-Cached-columns-in-proc-meminfo-output

Short answer: Cached is the size of the page cache. Buffers is the size of in-memory block I/O buffers. Cached matters; Buffers is largely irrelevant.

Long answer: Cached is the size of the Linux page cache, minus the memory in the swap cache, which is represented by SwapCached (thus the total page cache size is Cached + SwapCached). Linux performs all file I/O through the page cache. Writes are implemented as simply marking as dirty the corresponding pages in the page cache; the flusher threads then periodically write back to disk any dirty pages. Reads are implemented by returning the data from the page cache; if the data is not yet in the cache, it is first populated. On a modern Linux system, Cached can easily be several gigabytes. It will shrink only in response to memory pressure. The system will purge the page cache along with swapping data out to disk to make available more memory as needed.

Buffers are in-memory block I/O buffers. They are relatively short-lived. Prior to Linux kernel version 2.4, Linux had separate page and buffer caches. Since 2.4, the page and buffer cache are unified and Buffers is raw disk blocks not represented in the page cache—i.e., not file data. The Buffers metric is thus of minimal importance. On most systems, Buffers is often only tens of megabytes.
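To make the relationship between these fields concrete, here is a small Python sketch that parses /proc/meminfo-style text and computes the total page cache size as Cached + SwapCached, as described in the quote. The sample values are invented for illustration; on a real Linux machine the text would come from reading /proc/meminfo.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value (kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])
    return fields


# Hypothetical sample, not taken from the machine shown earlier.
SAMPLE = """\
MemTotal:        1030144 kB
MemFree:           20480 kB
Buffers:           54272 kB
Cached:           823296 kB
SwapCached:          512 kB
"""

info = parse_meminfo(SAMPLE)
page_cache_kb = info["Cached"] + info["SwapCached"]
print("page cache:", page_cache_kb, "kB; buffers:", info["Buffers"], "kB")
```

On Linux, replacing SAMPLE with open("/proc/meminfo").read() should give the live numbers.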

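===> A note on SwapCached. My own understanding is that Cached corresponds roughly to the file-backed active + inactive pages.

The purpose of the swap cache is not to speed up disk I/O. It exists mainly to handle synchronization with processes while pages are being swapped in and out, that is, what the system does if a process touches a page while that page is being swapped out (its contents are being written to the swap area). With the swap cache in place, a page whose data has not yet been fully written to disk stays in the swap cache (the swap cache holds a reference to the page). Only once the data is fully on disk and no process has accessed the page frame does the swap cache release it and hand it back to the buddy system.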

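2. Controlling the cache

Linux places no upper limit on how much memory the cache may use, because physical memory is there to be used, not admired. All that matters is that the memory can be handed back when it is needed.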

[root@localhost chenming] # ls /proc/sys/vm
block_dump                 drop_caches                 lowmem_reserve_ratio     oom_kill_allocating_task  swappiness
compact_memory             extfrag_threshold           max_map_count            overcommit_memory         unmap_area_factor
dirty_background_bytes     extra_free_kbytes           min_free_kbytes          overcommit_ratio          vdso_enabled
dirty_background_ratio     highmem_is_dirtyable        mmap_min_addr            page-cluster              vfs_cache_pressure
dirty_bytes                hugepages_treat_as_movable  nr_hugepages             panic_on_oom              would_have_oomed
dirty_expire_centisecs     hugetlb_shm_group           nr_overcommit_hugepages  percpu_pagelist_fraction
dirty_ratio                laptop_mode                 nr_pdflush_threads       scan_unevictable_pages
dirty_writeback_centisecs  legacy_va_layout            oom_dump_tasks           stat_interval

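Several of the parameters above affect the cache (they were highlighted in red in the original post). They are explained one by one below.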

1. dirty_writeback_centisecs

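When a file is written, the data first goes into the kernel's cache, and a background thread (pdflush, or flush-n:m on newer kernels) asynchronously writes it back to disk. This thread wakes up every dirty_writeback_centisecs centiseconds (default 500, i.e. 5 s) and runs one pass. It checks whether any dirty pages have been dirty for too long, and those that have are written to disk. The age limit is dirty_expire_centisecs centiseconds (default 3000, i.e. 30 s). Even if no page has expired, dirty pages are still written out once there are too many of them, as described below.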
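The interaction of the two timers can be sketched in a few lines of Python. This is a toy model, not the kernel's logic: the function names are invented, time is counted in seconds from zero, and the exact comparison (at least as old as the limit) is an assumption.

```python
def expired_pages(dirty_since, now, expire_centisecs=3000):
    """Return the pages that have been dirty for at least expire_centisecs.

    dirty_since maps a page id to the time (in seconds) it became dirty.
    """
    limit = expire_centisecs / 100.0
    return sorted(page for page, t in dirty_since.items() if now - t >= limit)


def simulate_flusher(dirty_since, writeback_centisecs=500,
                     expire_centisecs=3000, horizon=60.0):
    """Wake up every writeback_centisecs and write out the expired pages.

    Returns a dict mapping each page to the wakeup time at which it was written.
    """
    interval = writeback_centisecs / 100.0
    pending = dict(dirty_since)
    written_at = {}
    now = 0.0
    while pending and now <= horizon:
        for page in expired_pages(pending, now, expire_centisecs):
            written_at[page] = now
            del pending[page]
        now += interval
    return written_at


# Three pages dirtied at t = 0 s, 12 s and 29 s.
result = simulate_flusher({"a": 0.0, "b": 12.0, "c": 29.0})
print(result)
```

With the defaults, a page is written at the first 5-second wakeup that comes at least 30 seconds after it was dirtied, so in this model a page can stay dirty for up to about 35 seconds.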

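2. dirty_background_bytes and dirty_background_ratio

Meaning: when dirty pages reach dirty_background_ratio percent of total available memory (free pages plus reclaimable pages), the kernel's pdflush threads start writing dirty pages back.

Raising the value lets more memory be used for buffering writes, which can improve I/O performance. For workloads that need steady, sustained writes, lower it instead.

Likewise, once the amount of dirty memory exceeds dirty_background_bytes, the pdflush threads start writing dirty pages back.

Note: dirty_background_bytes and dirty_background_ratio are counterparts, and only one of them can be specified. Writing to either file immediately makes it the basis for the dirty-page limit and resets the other parameter to 0.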
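The "only one of the two" rule can be modelled as follows. This Python class is purely illustrative (the class and method names are invented, and the starting ratio of 10 is just an example value): setting one knob zeroes the other, and the threshold is computed from whichever is non-zero.

```python
class DirtyBackgroundKnobs:
    """Toy model of the dirty_background_ratio / dirty_background_bytes pair."""

    def __init__(self, ratio=10):
        self.ratio = ratio   # percent of available memory
        self.bytes = 0       # 0 means "not in use"

    def set_ratio(self, ratio):
        self.ratio = ratio
        self.bytes = 0       # writing one knob clears the other

    def set_bytes(self, nbytes):
        self.bytes = nbytes
        self.ratio = 0

    def threshold(self, available_bytes):
        """Amount of dirty memory at which background writeback starts."""
        if self.bytes:
            return self.bytes
        return available_bytes * self.ratio // 100


GIB = 1 << 30
knobs = DirtyBackgroundKnobs(ratio=10)
by_ratio = knobs.threshold(GIB)      # 10% of 1 GiB
knobs.set_bytes(64 << 20)            # switch to an absolute 64 MiB limit
by_bytes = knobs.threshold(GIB)
print(by_ratio, by_bytes, knobs.ratio)
```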

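3. dirty_bytes and dirty_ratio

If dirty_background_bytes and dirty_background_ratio are not enough and the share of dirty pages keeps rising past dirty_ratio, the user process calling write is itself blocked until pdflush has caught up, and only then resumes. The default for dirty_ratio is 40 on older kernels; newer kernels default to 20. dirty_bytes is the same limit expressed in bytes.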
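Putting the background and foreground thresholds together, the behaviour described so far amounts to a three-way decision. The Python sketch below is a simplification with invented names; the real kernel throttles writers gradually rather than flipping between fixed states, and the boundary comparisons are an assumption.

```python
def writeback_state(dirty_pages, available_pages,
                    background_ratio=10, dirty_ratio=40):
    """Classify the system by its share of dirty pages (toy model)."""
    percent = 100.0 * dirty_pages / available_pages
    if percent >= dirty_ratio:
        return "writer blocked"          # the process calling write() waits
    if percent >= background_ratio:
        return "background writeback"    # flusher threads write asynchronously
    return "idle"


for dirty in (5, 25, 45):
    print(dirty, writeback_state(dirty, 100))
```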

4. drop_caches

Writing a value to /proc/sys/vm/drop_caches makes the kernel free the memory held by the page cache and by the dentry and inode caches. Do not use it in production.

Free the page cache only:

echo 1 > /proc/sys/vm/drop_caches

Free only the dentry and inode caches:

echo 2 > /proc/sys/vm/drop_caches

Free the page cache, dentries and inodes:

echo 3 > /proc/sys/vm/drop_caches

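The operation is non-destructive: dirty objects (dirty pages, for example) are never freed, which is why sync should be run first.

sync writes all dirty pages back, after which drop_caches can release them and yield more free pages.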
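The three values behave like a two-bit mask. The Python sketch below decodes a value and, in a separate helper, shows the sync-then-write sequence described above. Both function names are invented for illustration. drop_caches() needs root on a Linux system and is deliberately not called here.

```python
import os

DROP_PAGECACHE = 1   # bit 0: page cache
DROP_SLAB = 2        # bit 1: dentries and inodes


def describe_drop_caches(value):
    """Return what a given drop_caches value frees (values 1 to 3 only)."""
    if value not in (1, 2, 3):
        raise ValueError("this sketch only handles 1, 2 or 3")
    freed = []
    if value & DROP_PAGECACHE:
        freed.append("page cache")
    if value & DROP_SLAB:
        freed.append("dentries and inodes")
    return freed


def drop_caches(value, path="/proc/sys/vm/drop_caches"):
    """Flush dirty pages, then ask the kernel to drop clean caches (root only)."""
    describe_drop_caches(value)   # validate the value first
    os.sync()                     # write dirty pages back so they become droppable
    with open(path, "w") as f:
        f.write(str(value))


print(describe_drop_caches(3))
```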

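5. min_free_kbytes

This parameter forces the Linux VM to keep a minimum amount of memory free, in KB. The VM uses it to compute watermark[WMARK_MIN] for every lowmem zone, and each lowmem zone reserves a number of free pages based on it.

A small part of this reserve serves PF_MEMALLOC allocations. A process with the PF_MEMALLOC flag set must not have its allocations fail, so it is allowed to dip into the reserve. Not every process has the flag: kswapd and processes doing direct reclaim set it while reclaiming, because reclaim itself needs to allocate some memory. With PF_MEMALLOC they can draw on the reserved lowmem.

Setting the value below 1024 KB makes the system prone to crashes and to deadlock under high load. Setting it too high makes the system OOM frequently.

PS: in one experiment of my own, free memory stopped dropping once it reached about twice this value. That was a 32-bit system with 1 GB of RAM and 3 memory zones.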

This controls the amount of memory that is kept free for use by special reserves including “atomic” allocations (those which cannot wait for reclaim). This should not normally be lowered unless the system is being very carefully tuned for memory usage (normally useful for embedded rather than server applications). If “page allocation failure” messages and stack traces are frequently seen in logs, min_free_kbytes could be increased until the errors disappear. There is no need for concern, if these messages are very infrequent. The default value depends on the amount of RAM.
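As a rough illustration of how min_free_kbytes turns into per-zone watermarks, here is a Python sketch of the calculation as I understand it from kernels of the 2.6.3x era: the minimum is split across zones in proportion to their size, and the low and high watermarks sit 25% and 50% above it. The function name, the zone sizes and the 4 KB page size are assumptions, highmem zones (which the kernel treats differently) are ignored, and newer kernels add further terms.

```python
PAGE_KB = 4  # assumed page size in KB


def zone_watermarks(min_free_kbytes, zone_pages):
    """Split min_free_kbytes across zones and derive min/low/high watermarks.

    zone_pages maps a zone name to its size in pages (lowmem zones only).
    """
    pages_min = min_free_kbytes // PAGE_KB
    total_pages = sum(zone_pages.values())
    marks = {}
    for zone, pages in zone_pages.items():
        wmark_min = pages_min * pages // total_pages
        marks[zone] = {
            "min": wmark_min,
            "low": wmark_min + wmark_min // 4,
            "high": wmark_min + wmark_min // 2,
        }
    return marks


# Hypothetical 1 GiB machine: 16 MiB DMA zone plus the rest as Normal.
marks = zone_watermarks(4096, {"DMA": 4096, "Normal": 258048})
for zone, m in marks.items():
    print(zone, m)
```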

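6. vfs_cache_pressure

Controls how strongly the kernel tends to reclaim the memory used by the dentry and inode caches.

The default is 100, at which the kernel reclaims dentries and inodes at a fair rate relative to page cache and swap cache reclaim.

Lowering vfs_cache_pressure makes the kernel prefer to keep the dentry and inode caches. At 0 the kernel never reclaims them, even under memory pressure, which can easily lead to OOM. Above 100 the kernel prefers to reclaim dentries and inodes.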

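Overview

The page cache improves performance by keeping disk data in memory, which reduces disk I/O. It must also make sure that changes to cached data are eventually synchronized to disk; this is called page writeback. Each inode has one page cache object, and each page cache object contains multiple physical pages.

Caching disk data pays off for two reasons. First, disk access is several orders of magnitude slower than memory access (milliseconds versus nanoseconds). Second, data that has been accessed once is very likely to be accessed again.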

Page Cache

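The page cache consists of physical pages in memory whose contents correspond to blocks on disk. Its size is dynamic: it can grow, and it can shrink when memory runs short. The storage device being cached is called the backing store. As noted in the Block I/O article, a page usually holds several blocks, and those blocks are not necessarily contiguous.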

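Read cache

When the kernel starts a read (for example, when a process calls read()), it first checks whether the requested data is already in the page cache. If so, the data is read straight from memory with no disk access; this is a cache hit. If not (a cache miss), the data must be read from disk, and the kernel then places it in the cache so that later reads can hit. The cache can hold just part of a file; there is no need to cache the whole file.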
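The hit/miss behaviour can be sketched in Python with a dictionary standing in for the page cache and a bytes object standing in for the disk. The class name and the 4096-byte page size are assumptions made for this example only.

```python
PAGE_SIZE = 4096


class ReadCache:
    """Toy page cache for one file: page index -> page contents."""

    def __init__(self, backing_store):
        self.backing_store = backing_store   # bytes object standing in for the disk
        self.pages = {}
        self.hits = 0
        self.misses = 0

    def _get_page(self, index):
        if index in self.pages:
            self.hits += 1                   # cache hit: no "disk" access
        else:
            self.misses += 1                 # cache miss: read from the backing store
            start = index * PAGE_SIZE
            self.pages[index] = self.backing_store[start:start + PAGE_SIZE]
        return self.pages[index]

    def read(self, offset, length):
        """Read bytes through the cache, one page at a time."""
        end = min(offset + length, len(self.backing_store))
        out = bytearray()
        pos = offset
        while pos < end:
            page = self._get_page(pos // PAGE_SIZE)
            start = pos % PAGE_SIZE
            take = min(PAGE_SIZE - start, end - pos)
            out += page[start:start + take]
            pos += take
        return bytes(out)


disk = bytes(range(256)) * 64          # a 16 KiB "file", i.e. four pages
cache = ReadCache(disk)
first = cache.read(100, 10)            # miss: page 0 is loaded
second = cache.read(100, 10)           # hit: served from memory
print(cache.hits, cache.misses, sorted(cache.pages))
```

Only page 0 ends up cached after these two reads, which mirrors the point that a file does not have to be cached in full.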

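Write cache

When the kernel starts a write (for example, when a process calls write()), the data likewise goes straight into the cache, and the backing store is not updated immediately. The kernel marks each written page dirty and adds it to the dirty list. It periodically writes the pages on the dirty list back to disk, bringing the on-disk data back in line with the cached copy.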
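A minimal Python sketch of this write-back behaviour follows. The class and its dictionary-based "disk" are invented for illustration; the point is only that write() touches memory and the dirty list, while the backing store changes when flush() runs.

```python
class WriteBackCache:
    """Toy write-back cache: writes land in memory and reach disk on flush()."""

    def __init__(self, disk):
        self.disk = disk          # dict: page index -> bytes, stands in for the disk
        self.pages = {}           # cached pages
        self.dirty = []           # dirty list, oldest first

    def write(self, index, data):
        self.pages[index] = data
        if index not in self.dirty:
            self.dirty.append(index)   # mark the page dirty

    def flush(self):
        """Write every dirty page back and return how many were written."""
        written = 0
        while self.dirty:
            index = self.dirty.pop(0)
            self.disk[index] = self.pages[index]
            written += 1
        return written


disk = {0: b"old"}
cache = WriteBackCache(disk)
cache.write(0, b"new")
before_flush = disk[0]            # the disk still holds the old data
flushed = cache.flush()
after_flush = disk[0]
print(before_flush, flushed, after_flush)
```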

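Cache eviction

The other important job of the page cache is freeing pages to release memory. Eviction means choosing suitable pages to free, and a dirty page must be written back to disk before it can be freed. Ideally the kernel would evict the page whose next access lies furthest in the future, but that obviously cannot be known in advance. The following sections first introduce the LRU algorithm and then the two-list strategy derived from it, which is what Linux uses.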

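LRU algorithm

LRU (least recently used) evicts the page whose most recent access is the oldest, that is, the page that has gone unvisited the longest. The problem with plain LRU is that some files are accessed only once. Even though they will never be touched again, they have just been accessed, so they are not chosen for eviction, and pages that are in regular use get pushed out instead.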
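The weakness of plain LRU shows up in a small Python simulation. The class below is a generic textbook LRU built on OrderedDict, not kernel code, and the page names are made up: two hot pages are accessed repeatedly, then a one-off scan of three other pages evicts both of them.

```python
from collections import OrderedDict


class LRUCache:
    """Plain LRU: on a miss with a full cache, evict the least recently used key."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()    # oldest access first, newest last

    def access(self, key):
        """Touch a key and return True on a hit, False on a miss."""
        if key in self.items:
            self.items.move_to_end(key)
            return True
        if len(self.items) >= self.capacity:
            self.items.popitem(last=False)   # evict the least recently used
        self.items[key] = True
        return False


cache = LRUCache(capacity=3)
for key in ["h1", "h2", "h1", "h2"]:   # two hot pages, used repeatedly
    cache.access(key)
for key in ["s1", "s2", "s3"]:         # a one-off sequential scan
    cache.access(key)
hot_survived = cache.access("h1")      # the hot page has been evicted
print(hot_survived, list(cache.items))
```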

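Two-list strategy

The two-list strategy maintains two lists, the active list and the inactive list. Pages on the active list are considered hot and are not evicted; only pages on the inactive list can be evicted. A page being cached for the first time goes onto the inactive list, and a page already on the inactive list is moved to the active list when it is accessed again. Both lists are maintained in a pseudo-LRU manner: new pages are added at the tail and pages are removed from the head, like a queue. If the active list grows much larger than the inactive list, pages at the head of the active list are moved back to the inactive list to keep the two lists balanced.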
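Here is a Python sketch of the two-list idea under simplifying assumptions: the class name is invented, and "balance" is modelled by capping the active list at half the capacity, which is my own stand-in for the kernel's more involved heuristics. The same access pattern that defeats plain LRU (two hot pages followed by a one-off scan) leaves the hot pages in place.

```python
from collections import OrderedDict


class TwoListCache:
    """Toy active/inactive list cache; eviction only ever hits the inactive list."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.max_active = capacity // 2      # assumed balance rule
        self.active = OrderedDict()          # head = oldest, tail = newest
        self.inactive = OrderedDict()

    def access(self, key):
        """Touch a key and return True on a hit, False on a miss."""
        if key in self.active:
            self.active.move_to_end(key)
            return True
        if key in self.inactive:             # second access: promote to active
            del self.inactive[key]
            self.active[key] = True
            self._rebalance()
            return True
        if len(self.active) + len(self.inactive) >= self.capacity:
            # The active list never exceeds capacity // 2, so a full cache
            # always has at least one inactive page to evict.
            self.inactive.popitem(last=False)
        self.inactive[key] = True            # first access: start on inactive
        return False

    def _rebalance(self):
        # Demote from the head of the active list when it grows too large.
        while len(self.active) > self.max_active:
            key, _ = self.active.popitem(last=False)
            self.inactive[key] = True


cache = TwoListCache(capacity=4)
for key in ["h1", "h2", "h1", "h2"]:         # hot pages get promoted
    cache.access(key)
for key in ["s1", "s2", "s3", "s4"]:         # one-off scan churns the inactive list
    cache.access(key)
hot_survived = cache.access("h1")
print(hot_survived, list(cache.active), list(cache.inactive))
```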

Page cache implementation in Linux

The address_space structure

The kernel uses the address_space structure to represent a page cache. The name is poorly chosen; page_cache_entity might suit it better. Its definition is shown below.

struct address_space {
    struct inode            *host;              /* owning inode */
    struct radix_tree_root  page_tree;          /* radix tree of all pages */
    spinlock_t              tree_lock;          /* page_tree lock */
    unsigned int            i_mmap_writable;    /* count of VM_SHARED mappings */
    struct prio_tree_root   i_mmap;             /* list of all mappings */
    struct list_head        i_mmap_nonlinear;   /* list of VM_NONLINEAR mappings */
    spinlock_t              i_mmap_lock;        /* i_mmap lock */
    atomic_t                truncate_count;     /* covers race condition with truncate */
    unsigned long           nrpages;            /* total number of pages */
    pgoff_t                 writeback_index;    /* writeback start offset */
    struct address_space_operations *a_ops;     /* operations table */
    unsigned long           flags;              /* gfp_mask and error flags */
    struct backing_dev_info *backing_dev_info;  /* read-ahead information */
    spinlock_t              private_lock;       /* private lock */
    struct list_head        private_list;       /* private list */
    struct address_space    *assoc_mapping;     /* associated buffers */
};

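The host field points to the owning inode. host may be NULL, which means the address_space is not associated with a file but with the swap area. Swap is the Linux mechanism for moving anonymous memory (such as a process's heap and stack, which have no file as backing store) out to a swap area (a swap partition, for instance) in order to free physical memory. page_tree holds all the pages in this page cache, stored in a radix tree. i_mmap records all the virtual memory areas (VMAs) that map this page cache. nrpages is the number of pages currently in the address_space.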

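address_space operations

The a_ops field of address_space points to the operations table (struct address_space_operations). Every backing store has to implement this table; the ext3 filesystem, for example, implements it in fs/ext3/inode.c.

The kernel manages the page cache through the functions in this table, the two most important being readpage() and writepage().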

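The readpage() function

On a read, the kernel first calls find_get_page(mapping, index) to look for the requested data in the page cache, where mapping is the page cache object to search (the address_space) and index is the position of the data within the file, counted in pages. If the data is not in the page cache, the kernel allocates a new page, adds it to the page cache, calls readpage() to fill it with the requested data from disk, and returns the page to the caller.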
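The lookup-then-fill sequence can be modelled in Python. This is a loose analogy rather than a translation of kernel code: a dict stands in for the radix tree, a plain callback stands in for a_ops->readpage, and read_cached_page and fake_readpage are invented names. Only find_get_page borrows its name from the kernel function mentioned above.

```python
class AddressSpace:
    """Toy stand-in for struct address_space."""

    def __init__(self, host, readpage):
        self.host = host            # owning "inode" (just a file name here)
        self.page_tree = {}         # page index -> data; a dict replaces the radix tree
        self.readpage = readpage    # plays the role of a_ops->readpage

    @property
    def nrpages(self):
        return len(self.page_tree)


def find_get_page(mapping, index):
    """Look a page up in the cache; return None if it is not cached."""
    return mapping.page_tree.get(index)


def read_cached_page(mapping, index):
    """Return the page at index, filling the cache from 'disk' on a miss."""
    page = find_get_page(mapping, index)
    if page is None:
        page = mapping.readpage(mapping.host, index)   # fetch from backing store
        mapping.page_tree[index] = page                # add it to the page cache
    return page


disk_reads = []

def fake_readpage(host, index):
    """Hypothetical backing store: records each access and fabricates data."""
    disk_reads.append((host, index))
    return "data of %s page %d" % (host, index)


mapping = AddressSpace("file.txt", fake_readpage)
read_cached_page(mapping, 3)      # miss: goes to the backing store
read_cached_page(mapping, 3)      # hit: served from page_tree
print(disk_reads, mapping.nrpages)
```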

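The writepage() function

For file mappings (host points to an inode), every modification of a page is followed by a call to SetPageDirty(page), which marks the page dirty. (My understanding is that pages belonging to the swap mapping need no dirty tracking because losing them on power failure is not a concern: the contents of memory are lost on power failure anyway.) On the write path, the kernel first looks for the target page in the given address_space; if it is not there, a page is allocated and added to the page cache. The kernel then sets up a write request and copies the data from user space into kernel space, and finally the data is written to disk. (I do not yet fully understand the copy from user space to kernel space; I plan to study the Linux file read and write paths in detail and write a separate post on them.)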

Buffer Cache

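The Block I/O article introduced the buffer_head structure, which describes the mapping between an in-memory buffer and a disk block. Every buffer-to-block mapping has a buffer_head, and its b_assoc_map field points to an address_space. In early kernels (2.2 and before) the buffer cache and the page cache were separate, so one disk block could sit in both caches at once and waste memory. The two were unified in 2.4, which matches the Quora answer quoted earlier, and in 2.6 buffer_head was reduced to holding only the buffer-to-block mapping information rather than the block contents. As a result a disk block has only one copy in memory, which cuts the waste.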

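Flusher threads

The page cache delays writes to the backing store, but dirty pages must eventually be written back to disk.

The kernel writes dirty pages back to disk in the following three situations:

A user process calls the sync() or fsync() system call
Free memory falls below a certain threshold
Dirty data has stayed in memory longer than a certain threshold

The flusher design assigns one thread to each storage device (a disk drive, for example), so there are as many threads as devices. This avoids blocking and contention between devices and improves efficiency. When free memory falls below the threshold, the kernel calls wakeup_flusher_threads() to wake one or more flusher threads, which write data back to disk. To keep dirty data from lingering in memory too long (and so limit what is lost in a crash), the kernel also wakes a flusher thread periodically to write back data that has been dirty for too long.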
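The one-thread-per-device idea can be illustrated with ordinary Python threads. This is only an analogy with invented names and device labels: each "device" gets its own queue and its own worker, so a slow device would not hold up writeback to the others.

```python
import queue
import threading


def flusher(work, written):
    """Drain one device's queue of dirty pages until the None sentinel arrives."""
    while True:
        page = work.get()
        if page is None:
            return
        written.append(page)      # stands in for writing the page to the device


def flush_all(dirty_by_device):
    """Start one flusher thread per device and wait for all of them."""
    written = {device: [] for device in dirty_by_device}
    threads = []
    for device, pages in dirty_by_device.items():
        work = queue.Queue()
        for page in pages:
            work.put(page)
        work.put(None)            # sentinel: no more work for this device
        thread = threading.Thread(target=flusher,
                                  args=(work, written[device]),
                                  name="flush-" + device)
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    return written


result = flush_all({"sda": [1, 2, 3], "sdb": [7, 8]})
print(result)
```

Each device's pages are written in queue order by its own thread, while the order between devices is left to the scheduler.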