refer from https://www.kernel.org/doc/gorman/html/understand/understand013.html
Chapter 10 Page Frame Reclamation
A running system will eventually use all available page frames for purposes like disk buffers, dentries, inode entries, process pages and so on. Linux needs to select old pages which can be freed and invalidated for new uses before physical memory is exhausted. This chapter will focus exclusively on how Linux implements its page replacement policy and how different types of pages are invalidated.
The methods Linux uses to select pages are rather empirical in nature and the theory behind the approach is based on multiple different ideas. It has been shown to work well in practice and adjustments are made based on user feedback and benchmarks. The basics of the page replacement policy are the first item of discussion in this chapter.
The second topic of discussion is the page cache. All data that is read from disk is stored in the page cache to reduce the amount of disk IO that must be performed. Strictly speaking, this is not directly related to page frame reclamation, but the LRU lists and page cache are closely related. The relevant section will focus on how pages are added to the page cache and quickly located.
This will bring us to the third topic, the LRU lists. With the exception of the slab allocator, all pages in use by the system are stored on LRU lists and linked together via page→lru so they can be easily scanned for replacement. The slab pages are not stored on the LRU lists as it is considerably more difficult to age a page based on the objects used by the slab. The section will focus on how pages move through the LRU lists before they are reclaimed.
From there, we'll cover how pages belonging to other caches, such as the dcache, and the slab allocator are reclaimed before talking about how process-mapped pages are removed. Process-mapped pages are not easily swappable as there is no way to map struct pages to PTEs except to search every page table, which is far too expensive. If the page cache has a large number of process-mapped pages in it, process page tables will be walked and pages swapped out by swap_out() until enough pages have been freed, but this will still have trouble with shared pages. If a page is shared, a swap entry is allocated, the PTE is filled with the necessary information to find the page in swap again and the reference count is decremented. Only when the count reaches zero will the page be freed. Pages like this are considered to be in the swap cache.
Finally, this chapter will cover the page replacement daemon kswapd, how it is implemented and what its responsibilities are.
10.1 Page Replacement Policy
During discussions the page replacement policy is frequently said to be a Least Recently Used (LRU)-based algorithm but this is not strictly speaking true as the lists are not strictly maintained in LRU order. The LRU in Linux consists of two lists called the active_list and inactive_list. The objective is for the active_list to contain the working set [Den70] of all processes and the inactive_list to contain reclaim candidates. As all reclaimable pages are contained in just two lists and pages belonging to any process may be reclaimed, rather than just those belonging to a faulting process, the replacement policy is a global one.
The lists resemble a simplified LRU 2Q [JS94] where two lists called Am and A1 are maintained. With LRU 2Q, pages when first allocated are placed on a FIFO queue called A1. If they are referenced while on that queue, they are placed in a normal LRU managed list called Am. This is roughly analogous to using lru_cache_add() to place pages on a queue called inactive_list (A1) and using mark_page_accessed() to get moved to the active_list (Am). The algorithm describes how the sizes of the two lists have to be tuned but Linux takes a simpler approach by using refill_inactive() to move pages from the bottom of active_list to inactive_list to keep active_list about two thirds the size of the total page cache. Figure 10.1 illustrates how the two lists are structured, how pages are added and how pages move between the lists with refill_inactive().
Figure 10.1: Page Cache LRU Lists
The lists described for 2Q presume Am is an LRU list but the list in Linux more closely resembles a Clock algorithm [Car84] where the hand-spread is the size of the active list. When pages reach the bottom of the list, the referenced flag is checked. If it is set, the page is moved back to the top of the list and the next page is checked. If it is cleared, the page is moved to the inactive_list.
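The movement of pages between the two lists can be sketched in a few lines of C. This is a simplified model and not the kernel's code: the `sk_` names and the cut-down `struct sk_page` are hypothetical, locking is ignored, and the sketch assumes that an inactive page is promoted on its second touch while the first touch only sets the referenced flag.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: only what this sketch needs. */
struct sk_page {
    struct sk_page *prev, *next;  /* plays the role of page->lru */
    int referenced;               /* plays the role of the referenced flag */
    int active;                   /* 1 while the page is on the active list */
};

/* Circular list with a sentinel; head.next is the top, head.prev the bottom. */
struct sk_list {
    struct sk_page head;
    size_t len;
};

void sk_list_init(struct sk_list *l)
{
    l->head.prev = l->head.next = &l->head;
    l->len = 0;
}

void sk_list_add_top(struct sk_list *l, struct sk_page *p)
{
    p->next = l->head.next;
    p->prev = &l->head;
    l->head.next->prev = p;
    l->head.next = p;
    l->len++;
}

void sk_list_del(struct sk_list *l, struct sk_page *p)
{
    p->prev->next = p->next;
    p->next->prev = p->prev;
    l->len--;
}

/* Analogue of lru_cache_add(): new pages start on the inactive list. */
void sk_lru_cache_add(struct sk_list *inactive, struct sk_page *p)
{
    p->referenced = 0;
    p->active = 0;
    sk_list_add_top(inactive, p);
}

/* Analogue of mark_page_accessed(), assuming promotion on the second touch. */
void sk_mark_page_accessed(struct sk_list *active, struct sk_list *inactive,
                           struct sk_page *p)
{
    if (!p->active && p->referenced) {
        sk_list_del(inactive, p);
        p->active = 1;
        p->referenced = 0;
        sk_list_add_top(active, p);
    } else {
        p->referenced = 1;
    }
}

/*
 * Analogue of refill_inactive(): examine up to nr pages from the bottom of
 * the active list. A referenced page has its flag cleared and goes back to
 * the top; an unreferenced page moves to the inactive list.
 * Returns the number of pages moved to the inactive list.
 */
size_t sk_refill_inactive(struct sk_list *active, struct sk_list *inactive,
                          size_t nr)
{
    size_t moved = 0;

    assert(active != NULL && inactive != NULL);
    while (nr-- > 0 && active->len > 0) {
        struct sk_page *p = active->head.prev;  /* bottom of the list */

        sk_list_del(active, p);
        if (p->referenced) {
            p->referenced = 0;
            sk_list_add_top(active, p);
        } else {
            p->active = 0;
            sk_list_add_top(inactive, p);
            moved++;
        }
    }
    return moved;
}
```

To try it, initialise both lists with sk_list_init(), add pages with sk_lru_cache_add(), touch some of them with sk_mark_page_accessed() and then call sk_refill_inactive(). Pages touched since the last scan survive one more pass, which is the second chance behaviour of a Clock algorithm.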
The Move-To-Front heuristic means that the lists behave in an LRU-like manner but there are too many differences between the Linux replacement policy and LRU to consider it a stack algorithm [MM87]. Even if we ignore the problem of analysing multi-programmed systems [CD80] and the fact that the memory size for each process is not fixed, the policy does not satisfy the inclusion property as the location of pages in the lists depends heavily upon the size of the lists as opposed to the time of last reference. Neither is the list priority ordered as that would require list updates with every reference. As a final nail in the stack algorithm coffin, the lists are almost ignored when paging out from processes as pageout decisions are related to their location in the virtual address space of the process rather than the location within the page lists.
In summary, the algorithm does exhibit LRU-like behaviour and it has been shown by benchmarks to perform well in practice. There are only two cases where the algorithm is likely to behave really badly. The first is if the candidates for reclamation are principally anonymous pages. In this case, Linux will keep examining a large number of pages before linearly scanning process page tables searching for pages to reclaim, but this situation is fortunately rare.
The second situation is where there is a single process with many file-backed resident pages in the inactive_list that are being written to frequently. Processes and kswapd may go into a loop of constantly “laundering” these pages and placing them at the top of the inactive_list without freeing anything. In this case, few pages are moved from the active_list to inactive_list as the ratio between the two list sizes does not change significantly.
10.2 Page Cache
The page cache is a set of data structures which contain pages that are backed by regular files, block devices or swap. There are basically four types of pages that exist in the cache:
- Pages that were faulted in as a result of reading a memory mapped file;
- Blocks read from a block device or filesystem are packed into special pages called buffer pages. The number of blocks that may fit depends on the size of the block and the page size of the architecture;
- Anonymous pages exist in a special aspect of the page cache called the swap cache when slots are allocated in the backing storage for page-out, discussed further in Chapter 11;
- Pages belonging to shared memory regions are treated in a similar fashion to anonymous pages. The only difference is that shared pages are added to the swap cache and space reserved in backing storage immediately after the first write to the page.
The principal reason for the existence of this cache is to eliminate unnecessary disk reads. Pages read from disk are stored in a page hash table, which is hashed on the struct address_space and the offset, and which is always searched before the disk is accessed. An API responsible for manipulating the page cache is provided and is listed in Table 10.1.
Table 10.1: Page Cache API
10.2.1 Page Cache Hash Table
There is a requirement that pages in the page cache be quickly located. To facilitate this, pages are inserted into a table page_hash_table and the fields page→next_hash and page→pprev_hash are used to handle collisions.
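The way next_hash and pprev_hash chain colliding pages can be illustrated with a small sketch. The `sk_` functions and the cut-down structures below are hypothetical stand-ins, not the kernel's definitions. The point to notice is that pprev_hash holds the address of whichever pointer refers to the page (either the bucket slot or the previous page's next_hash), so a page can be unlinked without knowing which bucket it lives in.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct address_space. */
struct sk_mapping {
    int id;
};

/* Hypothetical stand-in for struct page with the two hash-chain fields. */
struct sk_page {
    struct sk_mapping *mapping;
    unsigned long index;          /* offset of the page within the mapping */
    struct sk_page *next_hash;    /* next page in the same bucket */
    struct sk_page **pprev_hash;  /* address of the pointer pointing at us */
};

/* Insert a page at the head of the chain rooted at *bucket. */
void sk_hash_add(struct sk_page **bucket, struct sk_page *page)
{
    struct sk_page *next;

    assert(bucket != NULL && page != NULL);
    next = *bucket;
    *bucket = page;
    page->next_hash = next;
    page->pprev_hash = bucket;
    if (next != NULL)
        next->pprev_hash = &page->next_hash;
}

/* Unlink a page from its chain; the bucket itself is not needed. */
void sk_hash_del(struct sk_page *page)
{
    struct sk_page *next = page->next_hash;
    struct sk_page **pprev = page->pprev_hash;

    if (next != NULL)
        next->pprev_hash = pprev;
    *pprev = next;
    page->next_hash = NULL;
    page->pprev_hash = NULL;
}

/* Walk one chain looking for the page that matches (mapping, index). */
struct sk_page *sk_hash_find(struct sk_page *head, struct sk_mapping *mapping,
                             unsigned long index)
{
    struct sk_page *p;

    for (p = head; p != NULL; p = p->next_hash)
        if (p->mapping == mapping && p->index == index)
            return p;
    return NULL;
}
```

A caller would pick a bucket from the table using the hash of the mapping and offset, then pass the address of that slot to sk_hash_add() or its contents to sk_hash_find().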
The table is declared as follows in mm/filemap.c:
    atomic_t page_cache_size = ATOMIC_INIT(0);
    unsigned int page_hash_bits;
    struct page **page_hash_table;
The table is allocated during system initialisation by page_cache_init() which takes the number of physical pages in the system as a parameter. The desired size of the table (htable_size) is enough to hold pointers to every struct page in the system and is calculated by
htable_size = num_physpages * sizeof(struct page *)
To allocate a table, the system begins with an order allocation large enough to contain the entire table. It calculates this value by starting at 0 and incrementing it until 2^order > htable_size. This may be roughly expressed as the integer component of the following simple equation.
order = log2(num_physpages * 2 - 1)
An attempt is made to allocate this order of pages with __get_free_pages(). If the allocation fails, lower orders will be tried and if no allocation is satisfied, the system panics.
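The search for an order and the fallback to smaller allocations might look roughly like the sketch below. It rests on several assumptions: an order is taken to mean 2^order contiguous pages, so the comparison is made in bytes against page_size << order; the pointer size is passed in explicitly so the arithmetic is reproducible; and sk_fake_alloc() is a hypothetical allocator that stands in for __get_free_pages() by failing above a configurable order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SK_DEMO_PAGE_SIZE 4096UL  /* assumed page size for the demo allocator */

/* Smallest order for which (page_size << order) bytes can hold the table. */
unsigned int sk_table_order(unsigned long num_physpages,
                            unsigned long page_size, unsigned long ptr_size)
{
    unsigned long htable_size = num_physpages * ptr_size;
    unsigned int order = 0;

    assert(page_size > 0);
    while ((page_size << order) < htable_size)
        order++;
    return order;
}

/* Hypothetical allocator hook: returns NULL when the request fails. */
typedef void *(*sk_alloc_fn)(unsigned int order);

/* Demo allocator: pretends that orders above sk_max_order cannot be met. */
unsigned int sk_max_order = 3;

void *sk_fake_alloc(unsigned int order)
{
    if (order > sk_max_order)
        return NULL;
    return malloc((size_t)SK_DEMO_PAGE_SIZE << order);
}

/*
 * Try the requested order first and fall back to lower orders on failure.
 * On return *order holds the last order attempted; a NULL result means no
 * attempt succeeded, which is where the kernel would panic.
 */
void *sk_alloc_table(sk_alloc_fn alloc, unsigned int *order)
{
    assert(alloc != NULL && order != NULL);
    for (;;) {
        void *table = alloc(*order);

        if (table != NULL || *order == 0)
            return table;
        (*order)--;
    }
}
```

For example, with 32768 physical pages, 4 KiB pages and 4-byte pointers the table needs 131072 bytes, which sk_table_order() reports as order 5. If the demo allocator only satisfies orders up to 3, sk_alloc_table() ends up with an order 3 table.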
The value of page_hash_bits is based on the size of the table for use with the hashing function _page_hashfn(). The value is calculated by successive divides by two but in real terms, this is equivalent to:
page_hash_bits = log2( (PAGE_SIZE * 2^order) / (sizeof(struct page *)) )
This makes the table a power-of-two hash table which negates the need to use a modulus which is a common choice for hashing functions.
10.2.2 Inode Queue
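A short sketch shows both the successive halving and why a power-of-two table avoids the modulus. The function names are hypothetical and the pointer size is passed as a parameter so the result does not depend on the machine running the example. The bucket function only demonstrates the masking step; it is not the kernel's _page_hashfn().

```c
#include <assert.h>

/*
 * Number of index bits for a table occupying (page_size << order) bytes,
 * found by halving the entry count until nothing is left.
 */
unsigned int sk_hash_bits(unsigned long page_size, unsigned int order,
                          unsigned long ptr_size)
{
    unsigned long entries;
    unsigned int bits = 0;

    assert(ptr_size > 0);
    entries = (page_size << order) / ptr_size;
    while ((entries >>= 1) != 0)
        bits++;
    return bits;
}

/* Reduce a hash value to a bucket index with a mask rather than '%'. */
unsigned long sk_bucket_index(unsigned long hash, unsigned int bits)
{
    return hash & ((1UL << bits) - 1);
}
```

With 4 KiB pages, an order 5 allocation and 4-byte pointers there are 32768 entries, so sk_hash_bits(4096, 5, 4) gives 15, and masking with 15 bits yields the same index as taking the hash modulo 32768.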
The inode queue is part of the struct address_space introduced in Section 4.4.2. The struct contains three lists: clean_pages is a list of clean pages associated with the inode; dirty_pages is a list of pages which have been written to since the last sync to disk; and locked_pages is a list of those currently locked. These three lists in combination are considered to be the inode queue for a given mapping and the page→list field is used to link pages on it. Pages are added to the inode queue with add_page_to_inode_queue(), which places pages on the clean_pages list, and removed with remove_page_from_inode_queue().
10.2.3 Adding Pages to the Page Cache
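As a rough illustration of the three lists, the sketch below models a mapping and the add and remove operations. The structures and `sk_` names are hypothetical simplifications, and the nrpages counter is an assumption added so the effect of each call can be observed.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list node, in the style of list_head. */
struct sk_list_head {
    struct sk_list_head *prev, *next;
};

/* Hypothetical stand-in for struct address_space. */
struct sk_mapping {
    struct sk_list_head clean_pages;
    struct sk_list_head dirty_pages;
    struct sk_list_head locked_pages;
    unsigned long nrpages;          /* assumed counter of queued pages */
};

/* Hypothetical stand-in for struct page. */
struct sk_page {
    struct sk_list_head list;       /* plays the role of page->list */
    struct sk_mapping *mapping;
};

void sk_mapping_init(struct sk_mapping *m)
{
    m->clean_pages.prev = m->clean_pages.next = &m->clean_pages;
    m->dirty_pages.prev = m->dirty_pages.next = &m->dirty_pages;
    m->locked_pages.prev = m->locked_pages.next = &m->locked_pages;
    m->nrpages = 0;
}

/* New pages always start on the clean_pages list of their mapping. */
void sk_add_page_to_inode_queue(struct sk_mapping *m, struct sk_page *p)
{
    struct sk_list_head *head = &m->clean_pages;

    m->nrpages++;
    p->list.next = head->next;
    p->list.prev = head;
    head->next->prev = &p->list;
    head->next = &p->list;
    p->mapping = m;
}

/* Unlink the page from whichever of the three lists it is currently on. */
void sk_remove_page_from_inode_queue(struct sk_page *p)
{
    assert(p->mapping != NULL);
    p->mapping->nrpages--;
    p->list.prev->next = p->list.next;
    p->list.next->prev = p->list.prev;
    p->list.prev = p->list.next = &p->list;
    p->mapping = NULL;
}
```

Because removal only touches the neighbours of page→list, the same function works whether the page is on the clean, dirty or locked list.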
Pages read from a file or block device are generally added to the page cache to avoid further disk IO. Most filesystems use the high level function generic_file_read() as their file_operations→read(). The shared memory filesystem, which is covered in Chapter 12, is one noteworthy exception but, in general, filesystems perform their operations through the page cache. For the purposes of this section, we'll illustrate how generic_file_read() operates and how it adds pages to the page cache.
For normal IO[1], generic_file_read() begins with a few basic checks before calling do_generic_file_read(). This searches the page cache, by calling __find_page_nolock() with the pagecache_lock held, to see if the page already exists in it. If it does not, a new page is allocated with page_cache_alloc(), which is a simple wrapper around alloc_pages(), and added to the page cache with __add_to_page_cache(). Once a page frame is present in the page cache, generic_file_readahead() is called which uses page_cache_read() to read the page from disk. It reads the page using mapping→a_ops→readpage(), where mapping is the address_space managing the file. readpage() is the filesystem specific function used to read a page on disk.
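The lookup, allocate, add and read sequence described above can be condensed into a toy model. Everything here is a hypothetical simplification: the cache is a small array indexed by page number rather than a hash table, the "disk" is a string in memory, the page size is 16 bytes, and locking and readahead are left out entirely. The comments mark which kernel function each step loosely corresponds to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define SK_PAGE_SIZE   16  /* tiny page size so results are easy to inspect */
#define SK_CACHE_SLOTS 8

struct sk_page {
    unsigned long index;
    int uptodate;                   /* set once data has been read in */
    char data[SK_PAGE_SIZE];
};

struct sk_mapping {
    const char *backing;            /* stands in for the file on disk */
    size_t backing_len;
    /* plays the role of mapping->a_ops->readpage() */
    int (*readpage)(struct sk_mapping *m, struct sk_page *p);
    struct sk_page *cache[SK_CACHE_SLOTS];  /* toy cache, indexed by page */
    unsigned long disk_reads;       /* counts simulated disk accesses */
};

/* Simulated filesystem read: copy one page of the backing data. */
int sk_readpage(struct sk_mapping *m, struct sk_page *p)
{
    size_t off = (size_t)p->index * SK_PAGE_SIZE;

    memset(p->data, 0, SK_PAGE_SIZE);
    if (off < m->backing_len) {
        size_t n = m->backing_len - off;

        if (n > SK_PAGE_SIZE)
            n = SK_PAGE_SIZE;
        memcpy(p->data, m->backing + off, n);
    }
    p->uptodate = 1;
    m->disk_reads++;
    return 0;
}

/* Return the cached page for index, reading it in on a miss. */
struct sk_page *sk_get_page(struct sk_mapping *m, unsigned long index)
{
    struct sk_page *p;

    assert(m != NULL && index < SK_CACHE_SLOTS);
    p = m->cache[index];            /* cf. __find_page_nolock() */
    if (p == NULL) {
        p = calloc(1, sizeof(*p));  /* cf. page_cache_alloc() */
        if (p == NULL)
            return NULL;
        p->index = index;
        m->cache[index] = p;        /* cf. __add_to_page_cache() */
    }
    if (!p->uptodate && m->readpage(m, p) != 0)
        return NULL;                /* read failed; page stays not uptodate */
    return p;
}
```

Calling sk_get_page() twice for the same index returns the same page and increments disk_reads only once, which is the saving the page cache exists to provide.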