Page Frame Reclamation

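This article takes a close look at the page replacement policy and the page cache in Linux. It describes how old pages are selected and freed for new uses, and how different types of pages are cached and quickly located. It also explains how the page replacement policy is implemented, including the use of the active and inactive lists and how `refill_inactive()` adjusts the list sizes so that the working set stays resident. Finally, it covers the composition of the page cache and its API, and how the hash table and inode queues speed up page lookup.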


Reference: https://www.kernel.org/doc/gorman/html/understand/understand013.html


Chapter 10  Page Frame Reclamation

A running system will eventually use all available page frames for purposes like disk buffers, dentries, inode entries, process pages and so on. Linux needs to select old pages which can be freed and invalidated for new uses before physical memory is exhausted. This chapter will focus exclusively on how Linux implements its page replacement policy and how different types of pages are invalidated.

The methods Linux uses to select pages are rather empirical in nature and the theory behind the approach is based on multiple different ideas. It has been shown to work well in practice and adjustments are made based on user feedback and benchmarks. The basics of the page replacement policy are the first item of discussion in this chapter.

The second topic of discussion is the page cache. All data that is read from disk is stored in the page cache to reduce the amount of disk IO that must be performed. Strictly speaking, this is not directly related to page frame reclamation, but the LRU lists and page cache are closely related. The relevant section will focus on how pages are added to the page cache and quickly located.

This will bring us to the third topic, the LRU lists. With the exception of the slab allocator, all pages in use by the system are stored on LRU lists and linked together via page->lru so they can be easily scanned for replacement. The slab pages are not stored on the LRU lists as it is considerably more difficult to age a page based on the objects used by the slab. The section will focus on how pages move through the LRU lists before they are reclaimed.

From there, we'll cover how pages belonging to other caches, such as the dcache, and the slab allocator are reclaimed before talking about how process-mapped pages are removed. Process-mapped pages are not easily swappable as there is no way to map struct pages to PTEs except to search every page table, which is far too expensive. If the page cache has a large number of process-mapped pages in it, process page tables will be walked and pages swapped out by swap_out() until enough pages have been freed, but this will still have trouble with shared pages. If a page is shared, a swap entry is allocated, the PTE is filled with the information necessary to find the page in swap again and the reference count is decremented. Only when the count reaches zero will the page be freed. Pages like this are considered to be in the swap cache.

Finally, this chapter will cover the page replacement daemon kswapd, how it is implemented and what its responsibilities are.

10.1  Page Replacement Policy

During discussions the page replacement policy is frequently said to be a Least Recently Used (LRU)-based algorithm but this is not strictly speaking true as the lists are not strictly maintained in LRU order. The LRU in Linux consists of two lists called the active_list and inactive_list. The objective is for the active_list to contain the working set [Den70] of all processes and the inactive_list to contain reclaim candidates. As all reclaimable pages are contained in just two lists and pages belonging to any process may be reclaimed, rather than just those belonging to a faulting process, the replacement policy is a global one.

The lists resemble a simplified LRU 2Q [JS94] where two lists called Am and A1 are maintained. With LRU 2Q, pages when first allocated are placed on a FIFO queue called A1. If they are referenced while on that queue, they are placed in a normal LRU managed list called Am. This is roughly analogous to using lru_cache_add() to place pages on a queue called inactive_list (A1) and using mark_page_accessed() to get moved to the active_list (Am). The algorithm describes how the sizes of the two lists have to be tuned but Linux takes a simpler approach by using refill_inactive() to move pages from the bottom of active_list to inactive_list to keep active_list about two thirds the size of the total page cache. Figure 10.1 illustrates how the two lists are structured, how pages are added and how pages move between the lists with refill_inactive().
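The movement between the two lists can be sketched as a small state machine. This is a user-space illustration, not kernel code: the sk_ names and the state enum are hypothetical, and the rule that a page is promoted on its second access while inactive is an assumption about 2.4 behaviour rather than something the text states.

```c
#include <assert.h>

/* Hypothetical user-space stand-ins; none of these are kernel definitions. */
enum sk_lru_state { SK_LRU_NONE, SK_LRU_INACTIVE, SK_LRU_ACTIVE };

struct sk_page {
    enum sk_lru_state state; /* which list the page is on */
    int referenced;          /* models a per-page referenced flag */
};

/* New pages enter the inactive list, the A1 queue in 2Q terms. */
void sk_lru_cache_add(struct sk_page *page)
{
    assert(page->state == SK_LRU_NONE);
    page->state = SK_LRU_INACTIVE;
    page->referenced = 0;
}

/*
 * Assumed promotion rule: the first access only sets the referenced flag,
 * a later access while still inactive moves the page to the active list (Am).
 */
void sk_mark_page_accessed(struct sk_page *page)
{
    if (page->state == SK_LRU_INACTIVE && page->referenced) {
        page->state = SK_LRU_ACTIVE;
        page->referenced = 0;
        return;
    }
    page->referenced = 1;
}
```

A page that is touched only once stays a reclaim candidate, which is the property the A1 queue is meant to provide.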


Figure 10.1: Page Cache LRU Lists

The lists described for 2Q presume Am is an LRU list but the list in Linux more closely resembles a Clock algorithm [Car84] where the hand-spread is the size of the active list. When a page reaches the bottom of the list, its referenced flag is checked. If the flag is set, the page is moved back to the top of the list and the next page is checked. If it is cleared, the page is moved to the inactive_list.
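The scan described above can be sketched with a minimal doubly linked list. Everything here is a hypothetical user-space model: the sk_ helpers stand in for the kernel list primitives, and clearing the referenced flag when a page is given its second chance is an assumption consistent with a Clock sweep.

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly linked list with a sentinel head (hypothetical helper). */
struct sk_node { struct sk_node *prev, *next; };

struct sk_page {
    struct sk_node lru; /* first member, so a node pointer casts back to its page */
    int referenced;     /* models the referenced flag */
};

void sk_list_init(struct sk_node *head) { head->prev = head->next = head; }

int sk_list_empty(const struct sk_node *head) { return head->next == head; }

void sk_list_del(struct sk_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = n;
}

/* Insert at the top (head side) of the list. */
void sk_list_add_top(struct sk_node *n, struct sk_node *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/*
 * Examine up to nr_pages pages from the bottom of the active list.
 * Referenced pages get a second chance at the top of the active list,
 * unreferenced pages move to the inactive list.
 * Returns the number of pages moved to the inactive list.
 */
int sk_refill_inactive(struct sk_node *active, struct sk_node *inactive,
                       int nr_pages)
{
    int moved = 0;

    while (nr_pages-- > 0 && !sk_list_empty(active)) {
        struct sk_node *n = active->prev; /* bottom of the list */
        struct sk_page *page = (struct sk_page *)n;

        sk_list_del(n);
        if (page->referenced) {
            page->referenced = 0;
            sk_list_add_top(n, active);
        } else {
            sk_list_add_top(n, inactive);
            moved++;
        }
    }
    return moved;
}
```

Because a referenced page is only rotated and not reordered on every access, the list approximates recency without needing an update per reference.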

The Move-To-Front heuristic means that the lists behave in an LRU-like manner but there are too many differences between the Linux replacement policy and LRU to consider it a stack algorithm [MM87]. Even if we ignore the problem of analysing multi-programmed systems [CD80] and the fact that the memory size for each process is not fixed, the policy does not satisfy the inclusion property as the location of pages in the lists depends heavily upon the size of the lists as opposed to the time of last reference. Neither is the list priority ordered as that would require list updates with every reference. As a final nail in the stack algorithm coffin, the lists are almost ignored when paging out from processes as pageout decisions are related to their location in the virtual address space of the process rather than the location within the page lists.

In summary, the algorithm does exhibit LRU-like behaviour and it has been shown by benchmarks to perform well in practice. There are only two cases where the algorithm is likely to behave really badly. The first is if the candidates for reclamation are principally anonymous pages. In this case, Linux will keep examining a large number of pages before linearly scanning process page tables searching for pages to reclaim but this situation is fortunately rare.

The second situation is where there is a single process with many file-backed resident pages in the inactive_list that are being written to frequently. Processes and kswapd may go into a loop of constantly “laundering” these pages and placing them at the top of the inactive_list without freeing anything. In this case, few pages are moved from the active_list to inactive_list as the ratio between the two list sizes does not change significantly.

10.2  Page Cache

The page cache is a set of data structures which contain pages that are backed by regular files, block devices or swap. There are basically four types of pages that exist in the cache:

  • Pages that were faulted in as a result of reading a memory mapped file;
  • Blocks read from a block device or filesystem are packed into special pages called buffer pages. The number of blocks that may fit depends on the size of the block and the page size of the architecture;
  • Anonymous pages exist in a special aspect of the page cache called the swap cache when slots are allocated in the backing storage for page-out, discussed further in Chapter 11;
  • Pages belonging to shared memory regions are treated in a similar fashion to anonymous pages. The only difference is that shared pages are added to the swap cache and space reserved in backing storage immediately after the first write to the page.

The principal reason for the existence of this cache is to eliminate unnecessary disk reads. Pages read from disk are stored in a page hash table which is hashed on the struct address_space and the offset, and this table is always searched before the disk is accessed. An API is provided that is responsible for manipulating the page cache which is listed in Table 10.1.


void add_to_page_cache(struct page *page, struct address_space *mapping, unsigned long offset)
    Adds a page to the LRU with lru_cache_add() in addition to adding it to the inode queue and page hash tables.

void add_to_page_cache_unique(struct page *page, struct address_space *mapping, unsigned long offset, struct page **hash)
    This is similar to add_to_page_cache() except it checks that the page is not already in the page cache. This is required when the caller does not hold the pagecache_lock spinlock.

void remove_inode_page(struct page *page)
    This function removes a page from the inode and hash queues with remove_page_from_inode_queue() and remove_page_from_hash_queue(), effectively removing the page from the page cache.

struct page *page_cache_alloc(struct address_space *x)
    This is a wrapper around alloc_pages() which uses x->gfp_mask as the GFP mask.

void page_cache_get(struct page *page)
    Increases the reference count to a page already in the page cache.

int page_cache_read(struct file *file, unsigned long offset)
    This function adds a page corresponding to an offset within a file if it is not already there. If necessary, the page will be read from disk using an address_space_operations->readpage function.

void page_cache_release(struct page *page)
    An alias for __free_page(). The reference count is decremented and if it drops to 0, the page will be freed.

Table 10.1: Page Cache API

10.2.1  Page Cache Hash Table

There is a requirement that pages in the page cache be quickly located. To facilitate this, pages are inserted into a table page_hash_table and the fields page->next_hash and page->pprev_hash are used to handle collisions.
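A chained hash table keyed on the mapping and the offset can be sketched as follows. This is a user-space model: the sk_ types, the table size and the hash function are all hypothetical and are not taken from the kernel. Only the next_hash and pprev_hash field names come from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_HASH_BITS 4                     /* assumed, tiny on purpose */
#define SK_HASH_SIZE (1UL << SK_HASH_BITS)

struct sk_mapping { int unused; };         /* stands in for struct address_space */

struct sk_page {
    struct sk_mapping *mapping;
    unsigned long index;                   /* offset within the mapping */
    struct sk_page *next_hash;             /* next page in the same bucket */
    struct sk_page **pprev_hash;           /* address of the pointer that points at us */
};

struct sk_page *sk_hash_table[SK_HASH_SIZE];

/* Hypothetical hash: mixes the mapping address with the offset. */
unsigned long sk_hashfn(const struct sk_mapping *m, unsigned long index)
{
    uintptr_t v = (uintptr_t)m / sizeof(*m) + index;

    return (unsigned long)((v + (v >> SK_HASH_BITS)) & (SK_HASH_SIZE - 1));
}

/* Push the page onto the front of its bucket's collision chain. */
void sk_add_to_hash(struct sk_page *page)
{
    struct sk_page **bucket = &sk_hash_table[sk_hashfn(page->mapping, page->index)];

    page->next_hash = *bucket;
    if (*bucket != NULL)
        (*bucket)->pprev_hash = &page->next_hash;
    page->pprev_hash = bucket;
    *bucket = page;
}

/* Unlink without walking the chain, which is what pprev_hash makes possible. */
void sk_remove_from_hash(struct sk_page *page)
{
    assert(page->pprev_hash != NULL);
    if (page->next_hash != NULL)
        page->next_hash->pprev_hash = page->pprev_hash;
    *page->pprev_hash = page->next_hash;
    page->next_hash = NULL;
    page->pprev_hash = NULL;
}

/* Walk the chain until both the mapping and the offset match. */
struct sk_page *sk_find_page(struct sk_mapping *m, unsigned long index)
{
    struct sk_page *page = sk_hash_table[sk_hashfn(m, index)];

    while (page != NULL && !(page->mapping == m && page->index == index))
        page = page->next_hash;
    return page;
}
```

Storing a pointer to the previous link rather than to the previous page lets removal treat the bucket head and a chain member identically.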

The table is declared as follows in mm/filemap.c:

    atomic_t page_cache_size = ATOMIC_INIT(0);
    unsigned int page_hash_bits;
    struct page **page_hash_table;

The table is allocated during system initialisation by page_cache_init() which takes the number of physical pages in the system as a parameter. The desired size of the table (htable_size) is enough to hold pointers to every struct page in the system and is calculated by

htable_size = num_physpages * sizeof(struct page *)

To allocate a table, the system begins with an order allocation large enough to contain the entire table. It calculates this value by starting at 0 and incrementing it until 2^order > htable_size. This may be roughly expressed as the integer component of the following simple equation.

order = log2(num_physpages * 2 - 1)

An attempt is made to allocate this order of pages with __get_free_pages(). If the allocation fails, lower orders will be tried and if no allocation is satisfied, the system panics.
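The sizing loop and the fallback to lower orders can be sketched in user space as below. Several things are assumptions: the 4096-byte page size, the reading of the inequality as comparing the byte size of 2^order pages against htable_size, and the choice to retry all the way down to order 0. The allocator is a hypothetical stand-in built on malloc(), not __get_free_pages().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SK_PAGE_SIZE 4096UL /* assumed page size */

/* Smallest order whose 2^order pages can hold one pointer per physical page. */
unsigned long sk_htable_order(unsigned long num_physpages, size_t ptr_size)
{
    unsigned long htable_size = num_physpages * ptr_size;
    unsigned long order = 0;

    while ((SK_PAGE_SIZE << order) < htable_size)
        order++;
    return order;
}

typedef void *(*sk_alloc_fn)(unsigned long order);

/*
 * Try the requested order first and fall back to lower orders on failure.
 * On return *order holds the order that was actually allocated. A NULL
 * result means every order failed, the case in which the kernel panics.
 */
void *sk_alloc_table(unsigned long *order, sk_alloc_fn alloc)
{
    for (;;) {
        void *table = alloc(*order);

        if (table != NULL || *order == 0)
            return table;
        (*order)--;
    }
}

/* Demonstration allocator: pretends orders above sk_max_order are unavailable. */
unsigned long sk_max_order = 3;

void *sk_fake_alloc(unsigned long order)
{
    return order <= sk_max_order ? malloc(SK_PAGE_SIZE << order) : NULL;
}
```

With 32768 physical pages and 4-byte pointers the table needs 131072 bytes, which is exactly 2^5 pages of 4096 bytes, so the loop stops at order 5.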

The value of page_hash_bits is based on the size of the table for use with the hashing function _page_hashfn(). The value is calculated by successive divides by two but in real terms, this is equivalent to:

page_hash_bits =  log2( (PAGE_SIZE * 2^order) / (sizeof(struct page *)) )

This makes the table a power-of-two hash table which negates the need to use a modulus, which is a common choice for hashing functions.
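The successive divides and the resulting mask can be sketched as follows. The function names are hypothetical and the page size and pointer size are passed in as parameters because they depend on the architecture.

```c
#include <assert.h>

/* Count how many times the entry count can be halved: floor(log2(entries)). */
unsigned int sk_page_hash_bits(unsigned long page_size, unsigned long order,
                               unsigned long ptr_size)
{
    unsigned long entries = (page_size << order) / ptr_size;
    unsigned int bits = 0;

    while ((entries >>= 1) != 0)
        bits++;
    return bits;
}

/* With 2^bits buckets, masking the low bits replaces hash % table_size. */
unsigned long sk_bucket(unsigned long hash, unsigned int bits)
{
    return hash & ((1UL << bits) - 1);
}
```

For an order 5 allocation of 4096-byte pages holding 4-byte pointers, the table has 32768 entries, so 15 bits select a bucket and the mask gives the same result as a modulus by 32768.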

10.2.2  Inode Queue

The inode queue is part of the struct address_space introduced in Section 4.4.2. The struct contains three lists: clean_pages is a list of clean pages associated with the inode; dirty_pages is a list of pages which have been written to since the last sync to disk; and locked_pages is a list of those currently locked. These three lists in combination are considered to be the inode queue for a given mapping and the page->list field is used to link pages on it. Pages are added to the inode queue with add_page_to_inode_queue(), which places pages on the clean_pages list, and removed with remove_page_from_inode_queue().
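The three lists can be modelled with a small user-space sketch. The sk_ types and list helpers are hypothetical stand-ins for the kernel structures, and sk_mark_dirty() is an illustrative helper that the text does not name. It is included only to show a page moving between the lists of one mapping.

```c
#include <assert.h>
#include <stddef.h>

/* Circular doubly linked list with a sentinel head (hypothetical helper). */
struct sk_list { struct sk_list *prev, *next; };

struct sk_address_space {
    struct sk_list clean_pages;  /* pages in sync with disk */
    struct sk_list dirty_pages;  /* pages written to since the last sync */
    struct sk_list locked_pages; /* pages currently locked */
};

struct sk_page {
    struct sk_list list;              /* stands in for page->list */
    struct sk_address_space *mapping; /* NULL while not on an inode queue */
};

void sk_list_init(struct sk_list *head) { head->prev = head->next = head; }

void sk_list_add(struct sk_list *n, struct sk_list *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

void sk_list_del(struct sk_list *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = n;
}

void sk_mapping_init(struct sk_address_space *m)
{
    sk_list_init(&m->clean_pages);
    sk_list_init(&m->dirty_pages);
    sk_list_init(&m->locked_pages);
}

/* New pages always start on the clean list of their mapping. */
void sk_add_page_to_inode_queue(struct sk_address_space *m, struct sk_page *page)
{
    assert(page->mapping == NULL);
    page->mapping = m;
    sk_list_add(&page->list, &m->clean_pages);
}

/* Works whichever of the three lists the page is currently on. */
void sk_remove_page_from_inode_queue(struct sk_page *page)
{
    assert(page->mapping != NULL);
    sk_list_del(&page->list);
    page->mapping = NULL;
}

/* Illustrative only: move a page to the dirty list after a write. */
void sk_mark_dirty(struct sk_page *page)
{
    assert(page->mapping != NULL);
    sk_list_del(&page->list);
    sk_list_add(&page->list, &page->mapping->dirty_pages);
}
```

Because a page is on exactly one of the three lists at a time, a single link field in the page is enough.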

10.2.3  Adding Pages to the Page Cache

Pages read from a file or block device are generally added to the page cache to avoid further disk IO. Most filesystems use the high level function generic_file_read() as their file_operations->read(). The shared memory filesystem, which is covered in Chapter 12, is one noteworthy exception but, in general, filesystems perform their operations through the page cache. For the purposes of this section, we'll illustrate how generic_file_read() operates and how it adds pages to the page cache.

For normal IO, generic_file_read() begins with a few basic checks before calling do_generic_file_read(). This searches the page cache, by calling __find_page_nolock() with the pagecache_lock held, to see if the page already exists in it. If it does not, a new page is allocated with page_cache_alloc(), which is a simple wrapper around alloc_pages(), and added to the page cache with __add_to_page_cache(). Once a page frame is present in the page cache, generic_file_readahead() is called which uses page_cache_read() to read the page from disk. It reads the page using mapping->a_ops->readpage(), where mapping is the address_space managing the file. readpage() is the filesystem specific function used to read a page from disk.
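The lookup, allocate, insert and read sequence can be sketched in user space as below. This is a deliberately simplified model: the sk_ names are hypothetical, a small array replaces the page hash table, locking and readahead are left out, and the readpage callback only fills the page with a byte derived from the offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SK_PAGE_SIZE   16 /* tiny page for illustration */
#define SK_CACHE_SLOTS 8

struct sk_page {
    unsigned long index; /* offset of the page within the file */
    int uptodate;        /* set once the contents have been read */
    char data[SK_PAGE_SIZE];
};

struct sk_mapping {
    /* Stands in for mapping->a_ops->readpage(); returns 0 on success. */
    int (*readpage)(struct sk_mapping *m, struct sk_page *page);
    struct sk_page *cache[SK_CACHE_SLOTS]; /* linear stand-in for the hash table */
    int disk_reads;                        /* how often readpage was called */
};

struct sk_page *sk_find_page(struct sk_mapping *m, unsigned long index)
{
    size_t i;

    for (i = 0; i < SK_CACHE_SLOTS; i++)
        if (m->cache[i] != NULL && m->cache[i]->index == index)
            return m->cache[i];
    return NULL;
}

/* Returns 1 if the page was inserted, 0 if the cache is full. */
int sk_add_to_cache(struct sk_mapping *m, struct sk_page *page)
{
    size_t i;

    for (i = 0; i < SK_CACHE_SLOTS; i++) {
        if (m->cache[i] == NULL) {
            m->cache[i] = page;
            return 1;
        }
    }
    return 0;
}

/* Search the cache first and touch the "disk" only on a miss. */
struct sk_page *sk_read_cached_page(struct sk_mapping *m, unsigned long index)
{
    struct sk_page *page = sk_find_page(m, index);

    if (page == NULL) {
        page = calloc(1, sizeof(*page));
        if (page == NULL)
            return NULL;
        page->index = index;
        if (!sk_add_to_cache(m, page)) {
            free(page);
            return NULL;
        }
    }
    if (!page->uptodate) {
        if (m->readpage(m, page) != 0)
            return NULL;
        page->uptodate = 1;
    }
    return page;
}

/* Example backing store: fills the page with a letter chosen by the offset. */
int sk_fake_readpage(struct sk_mapping *m, struct sk_page *page)
{
    memset(page->data, 'A' + (int)(page->index % 26), SK_PAGE_SIZE);
    m->disk_reads++;
    return 0;
}
```

Reading the same offset twice returns the same page and leaves disk_reads unchanged, which is the saving the page cache exists to provide.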

