Linux核心 (The Linux Kernel)


Authors: 毕昕 et al. (… Source: 纯C电子杂志 (Pure C e-magazine), January 2005 issue (No. 3 overall). Updated: 2005-3-2

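Original author: David A Rusling¹

Translators²: 毕昕³, 胡宁宁⁴, 仲盛⁵, 赵振平⁶, 周笑波⁷, 李群⁸, 陈怀临⁹

[Editor's note] These translators, who have since graduated from Nanjing University, left us a valuable resource explaining the Linux kernel, and we sincerely thank them for their hard work. Their translation was based on version 0.8-2 of the original; for this edition the editor proofread it against the author's version 0.8-3 and revised, extended and filled in the translation in a number of places.

Wu Jin of the IBM Technology Center, School of Computer Science and Technology, Harbin Institute of Technology, reviewed the entire translation, for which we are especially grateful.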

Chapter 3 Memory Management

The memory management subsystem is one of the most important parts of the operating system. Since the early days of computing, there has been a need for more memory than exists physically in a system. Strategies have been developed to overcome this limitation and the most successful of these is virtual memory. Virtual memory makes the system appear to have more memory than it actually has by sharing it between competing processes as they need it.

Virtual memory does more than just make your computer's memory go further. The memory management subsystem provides:

Large Address Spaces

The operating system makes the system appear as if it has a larger amount of memory than it actually has. The virtual memory can be many times larger than the physical memory in the system.

Protection

Each process in the system has its own virtual address space. These virtual address spaces are completely separate from each other and so a process running one application cannot affect another. Also, the hardware virtual memory mechanisms allow areas of memory to be protected against writing. This protects code and data from being overwritten by rogue applications.

Memory Mapping

Memory mapping is used to map image and data files into a process's address space. In memory mapping, the contents of a file are linked directly into the virtual address space of a process.

Fair Physical Memory Allocation

The memory management subsystem allows each running process in the system a fair share of the physical memory of the system.

Shared Virtual Memory

Although virtual memory allows processes to have separate (virtual) address spaces, there are times when you need processes to share memory. For example, there could be several processes in the system running the bash command shell. Rather than have several copies of bash, one in each process's virtual address space, it is better to have only one copy in physical memory and have all of the processes running bash share it. Dynamic libraries are another common example of executing code shared between several processes.

Shared memory can also be used as an Inter Process Communication (IPC) mechanism, with two or more processes exchanging information via memory common to all of them. Linux supports the Unix System V shared memory IPC.

3.1 An Abstract Model of Virtual Memory

Before considering the methods that Linux uses to support virtual memory it is useful to consider an abstract model that is not cluttered by too much detail.

As the processor executes a program it reads an instruction from memory and decodes it. In decoding the instruction it may need to fetch or store the contents of a location in memory. The processor then executes the instruction and moves onto the next instruction in the program. In this way the processor is always accessing memory either to fetch instructions or to fetch and store data.

第 3 章 内存管理

内存管理子系统是操作系统中最重要的组成部份之一。从早期计算机开始，系统的实际内存总是不能满足需求，为解决这一矛盾，人们想了许多办法，其中虚存是最成功的一个。虚存让各进程共享系统内存空间，这样系统就似乎有了更多的内存。

虚存不仅使计算机的内存看起来更多，内存管理子系统还提供以下功能：

扩大地址空间

操作系统使系统看起来有远远大于它实际所拥有的内存空间。虚存能比系统的物理内存大许多倍。

内存保护

系统中每个进程都有它自己的虚拟地址空间。这些虚拟地址空间之间彼此分开，以保证应用程序运行时互不影响。另外，硬件虚存机制可以对内存部分区域提供写保护，以防止代码和数据被其它恶意的应用程序所篡改。

内存映射

内存映射被用于将映像和数据文件映射到一个进程的虚拟地址空间中，也就是将文件内容直接地连接到虚地址中。

公平分配内存

内存管理子系统可以使每一个在系统中运行的进程公平的共享系统的物理内存。

虚存共享

尽管虚存允许各进程有各自的(虚拟)地址空间，但有时进程间需要共享内存。例如，若干进程同时运行Bash命令。并非在每个进程的虚地址空间中都有一个Bash的拷贝，在内存中仅有一个运行的Bash拷贝供各进程共享。又如，若干进程可以共享动态函数库。

共享内存也能作为一种进程间的通信机制(IPC)。两个或两个以上进程通过共享内存来交换数据这非常普遍。Linux 支持 Unix System V 的共享内存IPC机制。

3.1 一个抽象的虚存模型

在分析 Linux 实现虚存的方法前，让我们先来看一个没有由于过多细节而混乱的抽象模型。

当处理器执行一段程序时，它先从内存中读出一条指令并对它进行解码。解码时可能需要在内存中的某一地址存取数据。然后处理器执行这条指令并移向下一条。可见处理器总是不断地在内存中存取数据或指令。

Figure 3.1: Abstract model of Virtual to Physical address mapping

In a virtual memory system all of these addresses are virtual addresses and not physical addresses. These virtual addresses are converted into physical addresses by the processor based on information held in a set of tables maintained by the operating system.

To make this translation easier, virtual and physical memory are divided into handy sized chunks called pages. These pages are all the same size; they need not be, but if they were not, the system would be very hard to administer. Linux on Alpha AXP systems uses 8 Kbyte pages and on Intel x86 systems it uses 4 Kbyte pages. Each of these pages is given a unique number: the page frame number (PFN).

In this paged model, a virtual address is composed of two parts; an offset and a virtual page frame number. If the page size is 4 Kbytes, bits 11:0 of the virtual address contain the offset and bits 12 and above are the virtual page frame number. Each time the processor encounters a virtual address it must extract the offset and the virtual page frame number. The processor must translate the virtual page frame number into a physical one and then access the location at the correct offset into that physical page. To do this the processor uses page tables.

Figure 3.1 shows the virtual address spaces of two processes, process X and process Y, each with their own page tables. These page tables map each process's virtual pages into physical pages in memory. This shows that process X's virtual page frame number 0 is mapped into memory in physical page frame number 1 and that process Y's virtual page frame number 1 is mapped into physical page frame number 4. Each entry in the theoretical page table contains the following information:

• Valid flag. This indicates if this page table entry is valid,

• The physical page frame number that this entry is describing,

• Access control information. This describes how the page may be used. Can it be written to? Does it contain executable code?

The page table is accessed using the virtual page frame number as an offset. Virtual page frame 5 would be the 6th element of the table (0 is the first element).

To translate a virtual address into a physical one, the processor must first work out the virtual address's page frame number and the offset within that virtual page. By making the page size a power of 2 this can be easily done by masking and shifting. Looking again at Figure 3.1 and assuming a page size of 0x2000 bytes (which is decimal 8192) and an address of 0x2194 in process Y's virtual address space, the processor would translate that address into offset 0x194 into virtual page frame number 1.
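The masking and shifting described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the names `PAGE_SHIFT`, `OFFSET_MASK` and `split_virtual_address` are made up for this example.

```python
PAGE_SIZE = 0x2000                        # 8 Kbyte pages, as in the example
PAGE_SHIFT = PAGE_SIZE.bit_length() - 1   # 13 when the page size is 0x2000
OFFSET_MASK = PAGE_SIZE - 1               # 0x1fff: the low 13 bits

def split_virtual_address(vaddr):
    """Return (virtual page frame number, offset within that page)."""
    return vaddr >> PAGE_SHIFT, vaddr & OFFSET_MASK

vpfn, offset = split_virtual_address(0x2194)
print(vpfn, hex(offset))                  # 1 0x194
```

Because the page size is a power of 2, no division is needed: one shift yields the page frame number and one mask yields the offset.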

The processor uses the virtual page frame number as an index into the process's page table to retrieve its page table entry. If the page table entry at that offset is valid, the processor takes the physical page frame number from this entry. If the entry is invalid, the process has accessed a non-existent area of its virtual memory. In this case, the processor cannot resolve the address and must pass control to the operating system so that it can fix things up.

Just how the processor notifies the operating system that the current process has attempted to access a virtual address for which there is no valid translation is specific to the processor. However the processor delivers it, this is known as a page fault and the operating system is notified of the faulting virtual address and the reason for the page fault.

Assuming that this is a valid page table entry, the processor takes that physical page frame number and multiplies it by the page size to get the address of the base of the page in physical memory. Finally, the processor adds in the offset to the instruction or data that it needs.

Using the above example again, process Y's virtual page frame number 1 is mapped to physical page frame number 4 which starts at 0x8000 (4 * 0x2000). Adding in the 0x194 byte offset gives us a final physical address of 0x8194.
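The whole translation for this example can be written out as a minimal sketch. The dictionary-based page table and the `PageFault` exception are assumptions made for illustration; a real processor walks hardware-defined tables.

```python
PAGE_SIZE = 0x2000

class PageFault(Exception):
    """Raised when there is no valid page table entry for the address."""

# Process Y's page table from the example: VPFN -> (valid flag, PFN)
page_table_y = {1: (True, 4)}

def translate(page_table, vaddr):
    vpfn, offset = divmod(vaddr, PAGE_SIZE)
    valid, pfn = page_table.get(vpfn, (False, None))
    if not valid:
        raise PageFault(hex(vaddr))        # control passes to the operating system
    return pfn * PAGE_SIZE + offset        # base of the physical page plus offset

print(hex(translate(page_table_y, 0x2194)))   # 0x8194
```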

By mapping virtual to physical addresses this way, the virtual memory can be mapped into the system's physical pages in any order. For example, in Figure 3.1 process X's virtual page frame number 0 is mapped to physical page frame number 1 whereas virtual page frame number 7 is mapped to physical page frame number 0 even though it is higher in virtual memory than virtual page frame number 0. This demonstrates an interesting byproduct of virtual memory; the pages of virtual memory do not have to be present in physical memory in any particular order.

3.1.1 Demand Paging

As there is much less physical memory than virtual memory the operating system must be careful that it does not use the physical memory inefficiently. One way to save physical memory is to only load virtual pages that are currently being used by the executing program. For example, a database program may be run to query a database. In this case not all of the database needs to be loaded into memory, just those data records that are being examined. If the database query is a search query then it does not make sense to load the code from the database program that deals with adding new records. This technique of only loading virtual pages into memory as they are accessed is known as demand paging.

When a process attempts to access a virtual address that is not currently in memory the processor cannot find a page table entry for the virtual page referenced. For example, in Figure 3.1 there is no entry in process X's page table for virtual page frame number 2 and so if process X attempts to read from an address within virtual page frame number 2 the processor cannot translate the address into a physical one. At this point the processor notifies the operating system that a page fault has occurred.

If the faulting virtual address is invalid this means that the process has attempted to access a virtual address that it should not have. Maybe the application has gone wrong in some way, for example writing to random addresses in memory. In this case the operating system will terminate it, protecting the other processes in the system from this rogue process.

If the faulting virtual address was valid but the page that it refers to is not currently in memory, the operating system must bring the appropriate page into memory from the image on disk. Disk access takes a long time, relatively speaking, and so the process must wait quite a while until the page has been fetched. If there are other processes that could run then the operating system will select one of them to run. The fetched page is written into a free physical page frame and an entry for the virtual page frame number is added to the process's page table. The process is then restarted at the machine instruction where the memory fault occurred. This time the virtual memory access is made, the processor can make the virtual to physical address translation and so the process continues to run.

Linux uses demand paging to load executable images into a process's virtual memory. Whenever a command is executed, the file containing it is opened and its contents are mapped into the process's virtual memory. This is done by modifying the data structures describing this process's memory map and is known as memory mapping. However, only the first part of the image is actually brought into physical memory. The rest of the image is left on disk. As the image executes, it generates page faults and Linux uses the process's memory map in order to determine which parts of the image to bring into memory for execution.
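The fault-and-retry cycle described above can be modelled with a toy simulation. Everything here is hypothetical (the `Process` class, the `free_frames` pool, the `SegmentationFault` exception), and "loading from disk" is reduced to taking a frame from the free pool.

```python
PAGE_SIZE = 0x2000

class SegmentationFault(Exception):
    """The process touched a virtual address it has no right to access."""

class Process:
    def __init__(self, valid_vpfns):
        self.valid_vpfns = set(valid_vpfns)   # virtual pages the process may touch
        self.page_table = {}                  # VPFN -> PFN, filled in on demand
        self.faults = 0

free_frames = [3, 2, 1, 0]                    # hypothetical pool of free page frames

def access(proc, vaddr):
    vpfn, offset = divmod(vaddr, PAGE_SIZE)
    if vpfn not in proc.page_table:           # no valid entry: page fault
        proc.faults += 1
        if vpfn not in proc.valid_vpfns:      # invalid address: terminate the process
            raise SegmentationFault(hex(vaddr))
        proc.page_table[vpfn] = free_frames.pop()   # "fetch" the page into a free frame
    return proc.page_table[vpfn] * PAGE_SIZE + offset   # the retried access succeeds
```

The first access to a page takes the fault path; later accesses to the same page find the entry and translate directly.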

3.1.2 Swapping

If a process needs to bring a virtual page into physical memory and there are no free physical pages available, the operating system must make room for this page by discarding another page from physical memory.

If the page to be discarded from physical memory came from an image or data file and has not been written to then the page does not need to be saved. Instead it can be discarded and if the process needs that page again it can be brought back into memory from the image or data file.

However, if the page has been modified, the operating system must preserve the contents of that page so that it can be accessed at a later time. This type of page is known as a dirty page and when it is removed from memory it is saved in a special sort of file called the swap file. Accesses to the swap file are very long relative to the speed of the processor and physical memory and the operating system must juggle the need to write pages to disk with the need to retain them in memory to be used again.

If the algorithm used to decide which pages to discard or swap (the swap algorithm) is not efficient then a condition known as thrashing occurs. In this case, pages are constantly being written to disk and then being read back and the operating system is too busy to allow much real work to be performed. If, for example, physical page frame number 1 in Figure 3.1 is being regularly accessed then it is not a good candidate for swapping to hard disk. The set of pages that a process is currently using is called the working set. An efficient swap scheme would make sure that all processes have their working set in physical memory.

Linux uses a Least Recently Used (LRU) page aging technique to fairly choose pages which might be removed from the system. This scheme involves every page in the system having an age which changes as the page is accessed. The more that a page is accessed, the younger it is; the less that it is accessed the older and more stale it becomes. Old pages are good candidates for swapping.
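A simplified model of page aging might look like the following. It is not the kernel's algorithm, only a sketch under the assumption that an access resets a page's age to zero and that every page grows one step older on each periodic tick; the `PageAger` class is invented for this example.

```python
class PageAger:
    def __init__(self, pfns):
        self.age = {pfn: 0 for pfn in pfns}   # 0 means youngest

    def touch(self, pfn):
        self.age[pfn] = 0                     # an access makes the page young again

    def tick(self):
        for pfn in self.age:                  # pages grow older as time passes
            self.age[pfn] += 1

    def victim(self):
        """Return the oldest page, the best candidate for swapping."""
        return max(self.age, key=self.age.get)

ager = PageAger([0, 1, 2])
ager.tick(); ager.touch(1)
ager.tick(); ager.touch(1); ager.touch(0)
ager.tick()
print(ager.victim())                          # 2: it was never touched
```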

3.1.3 Shared Virtual Memory

Virtual memory makes it easy for several processes to share memory. All memory accesses are made via page tables and each process has its own separate page table. For two processes sharing a physical page of memory, its physical page frame number must appear in a page table entry in both of their page tables.

Figure 3.1 shows two processes that each share physical page frame number 4. For process X this is virtual page frame number 4 whereas for process Y this is virtual page frame number 6. This illustrates an interesting point about sharing pages: the shared physical page does not have to exist at the same place in virtual memory for any or all of the processes sharing it.
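As an illustration, two page tables that name the same physical page frame give both processes a view of the same bytes. The `physical_memory` dictionary and the `read`/`write` helpers are assumptions made for this sketch.

```python
PAGE_SIZE = 0x2000
physical_memory = {}                      # PFN -> page contents

page_table_x = {4: 4}                     # process X: VPFN 4 -> PFN 4
page_table_y = {6: 4}                     # process Y: VPFN 6 -> PFN 4

def write(page_table, vaddr, value):
    vpfn, offset = divmod(vaddr, PAGE_SIZE)
    page = physical_memory.setdefault(page_table[vpfn], bytearray(PAGE_SIZE))
    page[offset] = value

def read(page_table, vaddr):
    vpfn, offset = divmod(vaddr, PAGE_SIZE)
    return physical_memory[page_table[vpfn]][offset]

# X writes through its virtual page 4; Y sees the byte through its virtual page 6.
write(page_table_x, 4 * PAGE_SIZE + 0x10, 0x5A)
print(hex(read(page_table_y, 6 * PAGE_SIZE + 0x10)))   # 0x5a
```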

3.1.4 Physical and Virtual Addressing Modes

It does not make much sense for the operating system itself to run in virtual memory. This would be a nightmare situation where the operating system must maintain page tables for itself. Most multi-purpose processors support the notion of a physical address mode as well as a virtual address mode. Physical addressing mode requires no page tables and the processor does not attempt to perform any address translations in this mode. The Linux kernel is linked to run in physical address space.

The Alpha AXP processor does not have a special physical addressing mode. Instead, it divides up the memory space into several areas and designates two of them as physically mapped addresses. This kernel address space is known as KSEG address space and it encompasses all addresses upwards from 0xfffffc0000000000. In order to execute from code linked in KSEG (by definition, kernel code) or access data there, the code must be executing in kernel mode. The Linux kernel on Alpha is linked to execute from address 0xfffffc0000310000.

3.1.5 Access Control

The page table entries also contain access control information. As the processor is already using the page table entry to map a process's virtual address to a physical one, it can easily use the access control information to check that the process is not accessing memory in a way that it should not.

There are many reasons why you would want to restrict access to areas of memory. Some memory, such as that containing executable code, is naturally read only memory; the operating system should not allow a process to write data over its executable code. By contrast, pages containing data can be written to but attempts to execute that memory as instructions should fail. Most processors have at least two modes of execution: kernel and user. You would not want kernel code to be executed by a user or kernel data structures to be accessible except when the processor is running in kernel mode.
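The check can be pictured as one extra test during translation. The entry layout and the `ProtectionFault` exception below are hypothetical; real processors encode the permissions as bits in the page table entry.

```python
class ProtectionFault(Exception):
    """The access type is not permitted for this page."""

# Hypothetical entries: VPFN -> (PFN, set of permitted access types)
page_table = {
    0: (1, {"read", "execute"}),   # code page: read only, executable
    1: (4, {"read", "write"}),     # data page: writable, not executable
}

def check_access(vpfn, kind):
    pfn, allowed = page_table[vpfn]
    if kind not in allowed:
        raise ProtectionFault(f"{kind} not permitted on virtual page {vpfn}")
    return pfn
```

With this table, `check_access(1, "write")` returns frame 4, while `check_access(0, "write")` raises `ProtectionFault` because the code page is not writable.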

在虚存系统中，所有地址都是虚地址而非物理地址。处理器根据操作系统维护的一组表格而把这些虚地址翻译成相应的物理地址。

为使这翻译的过程更容易，虚存和物理内存被划分成许多适当大小的块，这些块称之为“页”。这些页的大小都是一样的，当然不是必须这样，不过如果不这样的话操作系统会非常难管理它们。在Alpha AXP上的Linux系统中，每页有8Kbyte，但在Intel x86系统中，每页有4Kbyte。每一页又被分配了一个各不相同的数字，叫页号（PFN）。

在页模型中，一个虚地址由两部份组成：偏移量和虚页号。如果页的大小是4Kbytes，那么虚地址的0至11位是偏移量，第12位以上是虚页号。每当处理器遇到虚地址时，它先取出偏移量和虚页号。然后，处理器把虚页号翻译成物理页号，再由偏移量得到正确的物理地址，最后存取数据。处理器需要使用页表来完成这整个过程。

图3.1显示了两个进程的虚存地址空间。进程X和进程Y分别有各自的页表。页表记录了各进程虚页和物理页之间的映射。如图：X进程的虚存的第0页被映射为物理内存的第1页，Y进程的虚存的第1页被映射为物理内存的第4页。理论上，页表中每条记录包含以下信息：

• 有效性标志。用以标识页表记录有效与否。

• 物理页号。

• 存取控制信息。描述这页应该怎样被使用。是否可写？是否包含可执行代码？

页表中使用虚页号作为偏移量。虚页5将是表中的第6条记录（0是第一条记录）。

把一个虚地址翻译成物理地址时，处理器必须先得出虚页号和偏移量。让页的大小总是2的幂，这便于进行mask和移位操作。图3.1中，假定页的大小是0x2000字节（它是十进制的8192），在进程Y的地址空间中有一虚地址0x2194。那么处理器将把这个地址翻译成偏移量为0x194，虚页号为1。

处理器使用虚页号作为检索进程页表记录的索引。如果对应那偏移量的页表记录是有效的，处理器就从中拿出物理页号。如果记录是无效的，表明进程想存取一个不在物理内存中的地址。在这种情况下，处理器不能翻译这个虚地址，而必须把控制权传给操作系统，让它处理。

当当前进程试图存取一个处理器无法翻译的虚地址时，处理器如何通知操作系统这是与特定的处理器相关的。不过，通常的做法是，处理器会引发一个“页错误”，并将产生页错误的虚地址和原因告诉给操作系统。

假设找到的是一有效的页表记录，处理器就取出物理页号并且乘以页的大小，得到内存中页的基地址。最后，处理器加上偏移量得到它需要指令或数据的地址。

再次使用上面的例子，进程Y的虚存的第1页被映射到物理内存的第4页，它从0x8000（4* 0x2000）开始。加上偏移量0x194字节，我们得到最后的物理地址就是0x8194。

由虚地址映射到物理地址时，虚存各页映射到系统内存中的顺序是任意的。例如，在图3.1中，进程X的虚存第0页被映射到内存第1页，而虚存第7页被映射到内存第0页，即使它的虚存页号比虚存页号0要大(这里直译不太好理解，其实主要的意思就是虽然后者的虚存页号比前者的虚存页号要大，但是映射到物理内存中的物理页号却比前者映射的物理页号要小，作者以此例来说明虚存各页映射到物理内存各页的顺序是任意的)。这说明了虚存的一个有趣现象，虚存各页在物理内存中不必有任何顺序。

3.1.1 按需装载页

由于物理内存比虚拟内存小很多，因此操作系统必须注重物理内存的使用效率。节省物理内存的一个方法是只装载被当前执行程序使用的虚页。例如，有一个用来查询数据库的程序，此时，并非所有数据库中的数据都需要被装载进内存，只需要那些正在被访问的数据。如果正运行一条数据库查询命令，那么就不必载入添加新的数据记录的代码。这种只有在访问时才将对应的虚页载入内存的技术，叫做按需装载页。

当进程试图存取一个不在内存中的虚地址时，处理器不可能在页表中找到这一虚页的记录。例如，在图3.1中，进程X的虚存第2页没有对应的页表记录，如果尝试对这页进行读操作，那么处理器不能把虚地址翻译成物理地址。处理器就会通知操作系统发生了一个页错误。

如果页错误对应的虚地址是无效的，这意味着进程试图存取它不应该访问的虚地址。这也许是因为应用程序出了某些错误，例如试图在内存中任意进行写操作。在这种情况下，操作系统将终止这个错误进程，以保护其它进程。

如果页错误对应的虚地址是有效的，只是这页目前不在物理内存中，操作系统必须将对应的页从磁盘载入内存。相对来说，磁盘存取会花很多时间，所以进程必须等待相当一会儿直到页被读入。这时候，如果有其它进程能运行，操作系统将选择其中之一来运行。从磁盘中被取出来的页将被读入内存一空页中，并在进程页表中加入一条记录。然后，进程从产生页错的那条机器指令处重新启动。这次处理器能将虚地址翻译成物理地址了，因此进程能继续运行下去。

Linux使用按需装载页将可执行的映像载入到进程的虚拟内存空间。一个命令被执行时，包含它的文件被打开，文件的内容被映射入进程的虚存。这一操作需修改描述这一进程内存映像的数据结构。这一过程称为内存映射。然而，只有映像的第一部份被实际载入物理内存，余下部份被留在磁盘上。当映像执行时，它将不断产生页错，Linux使用进程的内存映像表来决定哪块映像应该被载入内存。

3.1.2 页交换 (Swapping)

当进程要装载一虚页进物理内存时，如果得不到可用的物理内存中的空页, 操作系统必须从内存中淘汰别的页，为这页提供空间。

如果从内存中被丢弃的那页是从映像或数据文件中来的，并且这页没有被修改过，那这页就不需再被保存，可以直接丢掉。如果进程再需要那页，它可以重新被操作系统从映像或数据文件中读入内存。

但如果该页已被修改了，操作系统必须保存这页的内容以便它以后能再被访问。这类页叫作“脏页”，当它们被从内存中移出时，它们必须被保存在一种特殊的文件中，这种文件称为“交换文件”。相对于处理器和内存的速度，交换文件的存取时间是很长的，所以操作系统必须权衡是否需要把页写到磁盘上，还是保留在内存中以备后用。

如果用来决定那个页应当被淘汰或交换的交换算法的效率不高，那么“颠簸”现象就会发生。在这种情况下，页常常一会儿被写到磁盘上，一会儿又被读回来，操作系统忙于文件存取而不能执行真正的工作。例如，图3.1 中，如果内存第 1页不断被访问，那它就不应该被交换到硬盘上。进程当前正在使用的页的集合被叫作工作集。有效的交换算法将保证所有进程的工作集都在内存中。

Linux使用最近最少使用算法（LRU）来公平的从内存中选择被丢弃的页。在这个算法中，系统中的每个页都有一个年龄，这个年龄随着页被存取而变化。页被存取得越多便越年轻，被存取得越少就越老。老的页通常是被交换的好候选。

3.1.3 共享虚存

虚存使得若干进程更容易共享内存。进程所有的内存访问都要通过页表，并且各进程有各自独立的页表。当两个进程共享内存中的一页时，物理页号就会同时出现在每个进程的页表中。

图3.1中显示两进程共享物理第4页。对进程X而言，那是虚存的第4页，对进程Y而言，那是虚存第6页。这说明共享页中的一个有趣的现象：被共享的物理页对应的虚存页号可以各不相同。

3.1.4 物理和虚拟地址模式

把操作系统运行在虚存中是不明智之举，如果操作系统还要为自己保存页表，那将是一场恶梦。因此，很多通用处理器同时支持虚拟地址模式和物理地址模式。物理地址模式不需要页表，处理器不必做任何地址翻译。Linux内核被直接连在物理地址空间中运行。

Alpha AXP处理器没有物理地址模式。相反，它把内存划分成若干区域并且指定其中两块为物理地址区。这段内核地址空间叫作KSEG地址空间，包括所有0xfffffc0000000000以上的地址。在KSEG那里执行的（按定义，称之为核心代码）或在那里存取数据的代码肯定是在内核模式下执行。在Alpha上的Linux内核被连接到从0xfffffc0000310000处开始执行。

3.1.5 存取控制

页表记录中也包含了存取控制信息。处理器使用页表记录来把虚地址翻译成物理地址的同时，它也很容易地使用其中的存取控制信息来检查进程是否在正确地访问内存。

在很多种情况下，你想要为内存的一段区域设置存取限制。一段内存，例如包含可执行的代码，应为只读内存；操作系统应该不允许进程在它的可执行的代码上写数据。相反的，包含数据的页能被写，但是当指令试图执行那段内存时，应该失败。大多数处理器至少有两种执行模式：内核模式和用户模式。你应当不想由一个用户来执行内核代码，或者让内核数据结构被内核代码之外的代码所访问。

Figure 3.2: Alpha AXP Page Table Entry

The access control information is held in the PTE and is processor specific; figure 3.2 shows the PTE for Alpha AXP. The bit fields have the following meanings:

V

Valid, if set this PTE is valid,

FOE

“Fault on Execute”, Whenever an attempt to execute instructions in this page occurs, the processor reports a page fault and passes control to the operating system.

FOW

“Fault on Write” as above but page fault on an attempt to write to this page,

FOR

“Fault on Read”, as above but page fault on an attempt to read from this page,

ASM

Address Space Match. This is used when the operating system wishes to clear only some of the entries from the Translation Buffer,

KRE

Code running in kernel mode can read this page,

URE

Code running in user mode can read this page,

GH

Granularity hint, used when mapping an entire block with a single Translation Buffer entry rather than many,

KWE

Code running in kernel mode can write to this page,

UWE

Code running in user mode can write to this page,

page frame number

For PTEs with the V bit set, this field contains the physical Page Frame Number (PFN) for this PTE. For invalid PTEs, if this field is not zero, it contains information about where the page is in the swap file.

The following two bits are defined and used by Linux:

_PAGE_DIRTY

if set, the page needs to be written out to the swap file,

_PAGE_ACCESSED

Used by Linux to mark a page as having been accessed.
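To make the bit fields concrete, here is a small decoder. The bit positions are an assumption (the figure is not reproduced in this text); they follow the layout commonly given for Alpha AXP, with the flags in the low bits and the PFN in the upper 32 bits, and should be checked against the architecture manual. `make_pte` and `decode_pte` are names invented for this sketch.

```python
# Assumed bit positions; verify against the Alpha AXP documentation.
PTE_FLAGS = {
    "V":   0x0001, "FOR": 0x0002, "FOW": 0x0004, "FOE": 0x0008, "ASM": 0x0010,
    "KRE": 0x0100, "URE": 0x0200, "KWE": 0x1000, "UWE": 0x2000,
}
PFN_SHIFT = 32                            # assumed: PFN occupies the upper 32 bits

def make_pte(pfn, *names):
    """Build a PTE value from a PFN and a list of flag names."""
    pte = pfn << PFN_SHIFT
    for name in names:
        pte |= PTE_FLAGS[name]
    return pte

def decode_pte(pte):
    """Return (set of flag names that are set, PFN field)."""
    flags = {name for name, mask in PTE_FLAGS.items() if pte & mask}
    return flags, pte >> PFN_SHIFT
```

Under these assumptions, `decode_pte(make_pte(4, "V", "KRE", "URE"))` yields the three flags and PFN 4: a valid page readable from both kernel and user mode.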

3.2 Caches

If you were to implement a system using the above theoretical model then it would work, but not particularly efficiently. Both operating system and processor designers try hard to extract more performance from the system. Apart from making the processors, memory and so on faster the best approach is to maintain caches of useful information and data that make some operations faster. Linux uses a number of memory management related caches:

Buffer Cache

The buffer cache contains data buffers that are used by the block device drivers.

These buffers are of fixed sizes (for example 512 bytes) and contain blocks of information that have either been read from a block device or are being written to it. A block device is one that can only be accessed by reading and writing fixed sized blocks of data. All hard disks are block devices. (See fs/buffer.c)

The buffer cache is indexed via the device identifier and the desired block number and is used to quickly find a block of data. Block devices are only ever accessed via the buffer cache. If data can be found in the buffer cache then it does not need to be read from the physical block device, for example a hard disk, and access to it is much faster.

Page Cache

This is used to speed up access to images and data on disk.

It is used to cache the logical contents of a file a page at a time and is accessed via the file and offset within the file. As pages are read into memory from disk, they are cached in the page cache. (See mm/filemap.c)

Swap Cache

Only modified (or dirty) pages are saved in the swap file.

So long as these pages are not modified after they have been written to the swap file then the next time the page is swapped out there is no need to write it to the swap file as the page is already in the swap file. Instead the page can simply be discarded. In a heavily swapping system this saves many unnecessary and costly disk operations. (See swap.h; mm/swap_state.c; mm/swapfile.c)

Hardware Caches

One commonly implemented hardware cache is in the processor; a cache of Page Table Entries. In this case, the processor does not always read the page table directly but instead caches translations for pages as it needs them. These are the Translation Look-aside Buffers and contain cached copies of the page table entries from one or more processes in the system.

When the reference to the virtual address is made, the processor will attempt to find a matching TLB entry. If it finds one, it can directly translate the virtual address into a physical one and perform the correct operation on the data. If the processor cannot find a matching TLB entry then it must get the operating system to help. It does this by signalling the operating system that a TLB miss has occurred. A system specific mechanism is used to deliver that exception to the operating system code that can fix things up. The operating system generates a new TLB entry for the address mapping. When the exception has been cleared, the processor will make another attempt to translate the virtual address. This time it will work because there is now a valid entry in the TLB for that address.

The drawback of using caches, hardware or otherwise, is that in order to save effort Linux must use more time and space maintaining these caches and, if the caches become corrupted, the system will crash.

3.3 Linux Page Tables

存取控制信息被保存在PTE中，并且不同的处理器，PTE的格式是不同的；图3.2显示的是Alpha AXP的PTE。各位包含以下信息：

V

有效位。如果设置，表示这PTE是有效的。

FOE

执行错误。无论何时，试图在这页执行指令时，处理器将报告页错，并且把控制权传给操作系统。

FOW

写错误。同上面一样，不过页错发生在试图写这页时。

FOR

读错误。同上面一样，不过页错发生在试图读这页时。

ASM

地址空间匹配。当操作系统仅仅希望清除翻译缓冲区中若干记录时，这一位被使用。

KRE

在内核模式下运行的代码能读这页。

URE

在用户模式下运行的代码能读这页。

GH

粒度性。指在映射一整块虚存时，是用一个翻译缓冲记录还是多个。

KWE

在内核模式下运行的代码能写这页。

UWE

在用户模式下运行的代码能写这页。

页号

在V位被置位的PTE中(编者注：即有效的PTE中)，这域包含了对应的物理页号。对于无效的PTE，如果这域不是零，它包含了页在交换文件中什么位置的信息。

以下两位是Linux定义并使用的：

_PAGE_DIRTY

如果被设置，则页需要被写到交换文件中。

_PAGE_ACCESSED

Linux用它来标记这页是否曾经被访问。

3.2 缓存

如果你按照上面理论模型，可以实现一个工作的系统，但不会特别高效。操作系统和处理器的设计者都在努力提高系统性能。除提高处理器和内存的速度外，最好的途径是把有用的信息和数据保存在缓存中。Linux就使用了很多与内存管理有关的缓存：

缓冲区

缓冲区包含了块设备驱动程序所使用的数据缓冲区。

这些缓冲区有固定的大小（例如512个字节)，记录从一台块设备读或写的信息。一台块设备只能读写整块数据。所有的硬盘都是块设备。（参见fs/buffer.c）

缓冲区以设备标识符和需要的块号做为索引来迅速找到所需数据。块设备只能通过缓冲区进行存取操作。如果数据在缓冲区中，那么它就不需要再从块设备中被读取，例如硬盘，这样会使存取更快。

页缓存

它被用来加快磁盘上映像和数据的存取。

它被用来一次缓存文件的一页，存取操作通过文件名和偏移量来实现。当页从磁盘上被读进内存时，他们被缓存在页缓存中。（参见mm/filemap.c）

交换缓存

只有修改了的页，即“脏页”，被保存在交换文件中。

只要页在被写进交换文件以后，没有再被修改，下次这页被换出内存时，不需要再把它写入交换文件，因为它已经存在于交换文件中了，它只需要被简单的丢弃。对一个要进行许多页面交换的系统，这将节省许多不必要的并且开销极大的磁盘操作。（参见swap.h、mm/swap_state.c、mm/swapfile.c）

硬件缓存

一个常见的硬件缓存是处理器内部的页表记录的缓存。通常情况下，处理器并不总是直接读页表，而是用页表缓存保留用到的记录。这些缓存被叫做Translation Look-aside Buffer，保存了系统中一个或多个进程页表的拷贝。

当翻译地址时，处理器先试图找到一个匹配的TLB记录。如果它找到了一个，它能直接把虚地址翻译成物理地址，并且对数据进行存取操作。如果处理器不能找到一个匹配的TLB记录，那就必须借助操作系统。它发信号给操作系统，报告有一个TLB疏漏发生。系统特定的机制将把这异常信号送给操作系统中可以解决此问题的代码。操作系统为映射的地址产生一个新的TLB记录。当异常被解决后，处理器将尝试再次翻译那个虚地址。因为现在那个地址在TLB中有一个有效的记录，因此这次的地址翻译一定成功。

使用缓冲区、硬件缓存或其它缓存等的缺点是Linux必须花费更多的时间和空间来维护这些缓存，如果缓存发生错误，系统将崩溃。

3.3 Linux页表

Figure 3.4: The free_area data structure

For example, in Figure 3.4 if a block of 2 pages was requested, the first block of 4 pages (starting at page frame number 4) would be broken into two 2 page blocks. The first, starting at page frame number 4 would be returned to the caller as the allocated pages and the second block, starting at page frame number 6 would be queued as a free block of 2 pages onto element 1 of the free_area array.
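The splitting step in this example can be sketched as below. The list-of-lists representation of `free_area` and the starting state (one free page at PFN 0, one free 4-page block at PFN 4) are assumptions chosen to match the text, not the kernel's actual data structure.

```python
def alloc_pages(free_area, order):
    """Allocate a block of 2**order pages; free_area[k] lists the first PFN
    of each free block of 2**k pages. Returns the first PFN, or None."""
    for k in range(order, len(free_area)):
        if free_area[k]:
            pfn = free_area[k].pop(0)
            while k > order:               # split, queueing the upper half as free
                k -= 1
                free_area[k].append(pfn + (1 << k))
            return pfn
    return None

free_area = [[0], [], [4]]                 # assumed state: page 0 free, 4-page block at PFN 4
print(alloc_pages(free_area, 1), free_area)   # 4 [[0], [6], []]
```

The request for 2 pages finds nothing on element 1, takes the 4-page block from element 2, returns its lower half and queues the upper half (PFN 6) on element 1.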

3.4.2 Page Deallocation

Allocating blocks of pages tends to fragment memory with larger blocks of free pages being broken down into smaller ones. The page deallocation code recombines pages into larger blocks of free pages whenever it can. In fact the page block size is important as it allows for easy combination of blocks into larger blocks. (See free_pages() in mm/page_alloc.c)

Whenever a block of pages is freed, the adjacent or buddy block of the same size is checked to see if it is free. If it is, then it is combined with the newly freed block of pages to form a new free block of pages for the next size block of pages. Each time two blocks of pages are recombined into a bigger block of free pages the page deallocation code attempts to recombine that block into a yet larger one. In this way the blocks of free pages are as large as memory usage will allow.

For example, in Figure 3.4, if page frame number 1 were to be freed, then that would be combined with the already free page frame number 0 and queued onto element 1 of the free_area as a free block of size 2 pages.
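The recombination can be sketched in the same style. The sketch assumes that blocks are aligned to their own size, so that a block's buddy is found by flipping a single bit of its first page frame number; the list-based `free_area` is again a stand-in for the real structure.

```python
def free_pages(free_area, pfn, order):
    """Free a block of 2**order pages starting at pfn, merging with free buddies."""
    while order < len(free_area) - 1:
        buddy = pfn ^ (1 << order)         # the buddy differs in exactly one bit
        if buddy not in free_area[order]:
            break                          # buddy still in use: stop merging
        free_area[order].remove(buddy)
        pfn = min(pfn, buddy)              # the merged block starts at the lower PFN
        order += 1
    free_area[order].append(pfn)

free_area = [[0], [], [4]]                 # assumed state: page 0 free, 4-page block at PFN 4
free_pages(free_area, 1, 0)
print(free_area)                           # [[], [0], [4]]
```

Freeing page 1 finds its buddy, page 0, already free, so the two are merged into a 2-page block queued on element 1.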

3.5 Memory Mapping

When an image is executed, the contents of the executable image must be brought into the process's virtual address space. The same is also true of any shared libraries that the executable image has been linked to use. The executable file is not actually brought into physical memory; instead it is merely linked into the process's virtual memory. Then, as the parts of the program are referenced by the running application, the image is brought into memory from the executable image. This linking of an image into a process's virtual address space is known as memory mapping.

例如，在图3.4中，如果需要一个2页块，那么第一个空的4页块(从第4页起)将被分成两半。从第4页开始的2页块被返回给请求者；从第6页开始的2页块将被做为空闲块，在free_area队列的第一个元组中排队。(编者注：这里free_area队列的第一个元组其实是大小为二的所有空闲页块组成的一个链表)

3.4.2 页的回收

页分配时容易将大块连续的内存分成很多小块。页回收代码须尽可能将小块的空闲页重新组合成大块的空闲页。事实上，页块的大小很重要，因为它使得小块能够很容易地组合成大块。（参见mm/page_alloc.c中的free_pages()）

当一页块被释放时，系统会检查它的相邻块或者同样大小的buddy块(编者注：sorry，实在没能力把握buddy block的准确意思)，看它们是否是空闲的。如果是，它们将结合成一个大小为原来两倍的整块。每次当两块内存被拼成了更大的空闲块时，页回收代码将尝试把它们继续与其它空闲块进行组合，以得到更大的空间。这样，空闲页块的大小就能达到内存使用情况所允许的最大值。

例如，在图3.4中，如果第1页被释放，那它将与已经是空闲的第0页组合，并作为一个大小为2的空闲页块，被放到free_area的保存大小为两页的空闲块的队中。

3.5 内存映射

当一映像被执行时，它的内容必须被读入进程的虚存地址空间。它所链接并使用的一些共享库也必须被读入虚存。这个可执行文件并非被实际读入内存,相反它只是被连接入进程的虚存。然后，当程序的一部份被运行中的应用程序调用时，系统才将这部份映像读入内存。像这样将映像连接到进程的虚地址空间叫做内存映射(memory mapping)。

Figure 3.5: Areas of Virtual Memory

Every process's virtual memory is represented by an mm_struct data structure. This contains information about the image that it is currently executing (for example bash) and also has pointers to a number of vm_area_struct data structures. Each vm_area_struct data structure describes the start and end of the area of virtual memory, the process's access rights to that memory and a set of operations for that memory. These operations are a set of routines that Linux must use when manipulating this area of virtual memory. For example, one of the virtual memory operations performs the correct actions when the process has attempted to access this virtual memory but finds (via a page fault) that the memory is not actually in physical memory. This operation is the nopage operation. The nopage operation is used when Linux demand pages the pages of an executable image into memory.

When an executable image is mapped into a process's virtual address space a set of vm_area_struct data structures is generated. Each vm_area_struct data structure represents a part of the executable image: the executable code, initialized data (variables), uninitialized data and so on. Linux supports a number of standard virtual memory operations and as the vm_area_struct data structures are created, the correct set of virtual memory operations are associated with them.
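A rough model of these structures is shown below. The classes `VmArea` and `Mm` are only loosely patterned on vm_area_struct and mm_struct; the field names, addresses and the `"filemap"` placeholder are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class VmArea:                      # loosely modelled on vm_area_struct
    start: int                     # first address of the area
    end: int                       # first address past the area
    flags: frozenset               # access rights, e.g. {"read", "exec"}
    ops: dict = field(default_factory=dict)   # operations, e.g. {"nopage": ...}

@dataclass
class Mm:                          # loosely modelled on mm_struct
    areas: list = field(default_factory=list)

def map_image(mm):
    """Describe an executable's parts as areas; nothing is loaded yet."""
    mm.areas += [
        VmArea(0x0000, 0x4000, frozenset({"read", "exec"}), {"nopage": "filemap"}),
        VmArea(0x4000, 0x6000, frozenset({"read", "write"}), {"nopage": "filemap"}),
        VmArea(0x6000, 0x8000, frozenset({"read", "write"})),   # uninitialized data
    ]
```

Mapping the image only creates descriptions of the code, initialized data and uninitialized data areas; pages are brought in later, when they are first touched.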

3.6 Demand Paging

Once an executable image has been memory mapped into a process's virtual memory it can start to execute. As only the very start of the image is physically pulled into memory it will soon access an area of virtual memory that is not yet in physical memory. When a process accesses a virtual address that does not have a valid page table entry, the processor will report a page fault to Linux. (See handle_mm_fault() in mm/memory.c)

The page fault describes the virtual address where the page fault occurred and the type of memory access that caused it.

Linux must find the vm_area_struct that represents the area of memory that the page fault occurred in. As searching through the vm_area_struct data structures is critical to the efficient handling of page faults, these are linked together in an AVL (Adelson-Velskii and Landis) tree structure. If there is no vm_area_struct data structure for this faulting virtual address, this process has accessed an illegal virtual address. Linux will signal the process, sending a SIGSEGV signal, and if the process does not have a handler for that signal it will be terminated.

Linux next checks the type of page fault that occurred against the types of accesses allowed for this area of virtual memory. If the process is accessing the memory in an illegal way, say writing to an area that it is only allowed to read from, it is also signalled with a memory error.

Now that Linux has determined that the page fault is legal, it must deal with it.
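
The two legality checks can be sketched in Python as follows. This is a toy model under stated assumptions: the region table AREAS is made up, permissions are plain strings, and a binary search over a sorted list stands in for the AVL tree (both give a logarithmic lookup, but this is not the kernel's code).

```python
import bisect

# Hypothetical regions: (start, end_exclusive, permissions), sorted by start.
AREAS = [(0x1000, 0x4000, "r-x"), (0x4000, 0x6000, "rw-"), (0x9000, 0xA000, "r--")]
STARTS = [area[0] for area in AREAS]


def find_area(addr):
    """Return the region containing addr, or None if no region covers it."""
    i = bisect.bisect_right(STARTS, addr) - 1
    if i >= 0 and AREAS[i][0] <= addr < AREAS[i][1]:
        return AREAS[i]
    return None


def check_fault(addr, access):
    """Classify a fault; access is one of 'r', 'w' or 'x'."""
    area = find_area(addr)
    if area is None:
        return "SIGSEGV"        # illegal virtual address
    if access not in area[2]:
        return "memory error"   # region exists but forbids this access
    return "legal"              # go on to bring the page in
```

For example, check_fault(0x2000, "w") is rejected because the first region is read and execute only, while check_fault(0x7000, "r") is rejected because no region covers that address.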

Linux must differentiate between pages that are in the swap file and those that are part of an executable image on a disk somewhere. It does this by using the page table entry for this faulting virtual address. (See do_no_page() in mm/memory.c)

If the page's page table entry is invalid but not empty, the page fault is for a page currently being held in the swap file. For Alpha AXP page table entries, these are entries which do not have their valid bit set but which have a non-zero value in their PFN field. In this case the PFN field holds information about where in the swap file (and in which swap file) the page is being held. How pages in the swap file are handled is described later in this chapter.
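
A minimal sketch of this three-way test, in Python. The bit positions used here (valid bit in bit 0, PFN field starting at bit 32) are assumptions chosen for illustration; the text above only states that a valid bit and a PFN field exist.

```python
VALID_BIT = 0x1   # assumed position of the valid bit
PFN_SHIFT = 32    # assumed position of the PFN field


def classify_pte(pte):
    """Decide what a faulting page table entry means."""
    if pte & VALID_BIT:
        return "present"   # valid entry: the page is in physical memory
    if (pte >> PFN_SHIFT) != 0:
        return "swapped"   # invalid but not empty: PFN field locates the swap copy
    return "empty"         # nothing recorded yet for this virtual page
```

Only the "swapped" case leads to the swap file; an "empty" entry is handled by the nopage operation or by the default allocation described next.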

Not all vm_area_struct data structures have a set of virtual memory operations and even those that do may not have a nopage operation. This is because by default Linux will fix up the access by allocating a new physical page and creating a valid page table entry for it. If there is a nopage operation for this area of virtual memory, Linux will use it.

The generic Linux nopage operation is used for memory mapped executable images and it uses the page cache to bring the required image page into physical memory.

However the required page is brought into physical memory, the process's page tables are updated. It may be necessary for hardware specific actions to update those entries, particularly if the processor uses translation look-aside buffers. Now that the page fault has been handled it can be dismissed and the process is restarted at the instruction that made the faulting virtual memory access. (See filemap_nopage() in mm/filemap.c)

3.7 The Linux Page Cache

每个进程的虚存空间由一个mm_struct数据结构表示。其中包含了当前正在执行的映像（例如bash）的信息，以及很多指向vm_area_struct这一数据结构的指针。每个vm_area_struct数据结构描述了一段虚存区域的开始和结束，及进程对那段虚存的存取权限和允许的操作。这些操作是Linux在维护这段虚存必须使用的一套例程。例如，当进程试图存取虚存中某页，但发现这页并不在内存中时，应执行的正确操作是nopage操作(通过页错)。nopage操作在Linux需要将一些可执行映像的页载入内存中时使用。

当一段可执行映像被映射入进程的虚存时，会产生一组vm_area_struct数据结构。每个vm_area_struct数据结构可表示可执行映像的一部份：可执行代码、初始化数据(变量)，未初始化数据等等。Linux支持很多标准的虚存操作，当vm_area_struct数据结构产生时，系统会把正确的虚存操作集与他们相关联。

3.6 按需换页

当一部份可执行映像被映射入进程虚存后，它就可以开始执行了。可是这时只有映像的开始部份被实际读入内存，它将很快访问不在内存中的部份。当进程存取一个没有有效页表记录的虚地址时，那处理器将报告一个页错误给Linux系统。（参见 mm/memory.c中的handle_mm_fault()）

页错误描述页错发生的虚地址和引起页错的访存方式。

Linux必须先找到代表页错发生区域的vm_area_struct。由于搜索vm_area_struct数据结构对高效处理页错非常关键，所以所有vm_area_struct被连接成AVL（Adelson-Velskii and Landis）树结构(编者注：即平衡二叉树结构)。如果没有vm_area_struct代表这页错发生的虚地址，表示这进程企图访问一个非法的虚地址。Linux将发送SIGSEGV信号给进程，如果进程没有这个信号的处理程序，那么该进程将被终止。

Linux再检查引发这页错的存取操作是否是被允许的。如果进程在用一个非法的方式存取内存，例如，写一个只读区域，这也将引发一个内存错误信号。

如果Linux确定页错是合法的,那么Linux就必须处理它。

Linux 必须首先区别映像是在交换文件中还是在磁盘上。它是通过页错发生的虚地址所对应的页表记录来识别的。（参见mm/memory.c中的do_no_page()）

如果那页的页表记录是无效的，但非空，说明产生页错的那页当前存在于交换文件中。Alpha AXP页表记录中，这样的记录的有效位未置位，但是PFN域不为零。在这种情况下，PFN域含有的信息表示了这页被保存在哪个交换文件中的哪个位置。本章的后面部份将讲述怎样处理在交换文件中的页。

并非所有的vm_area_struct数据结构都有一组虚存操作，即使有，也不一定有nopage操作。这是因为，在缺损情况下，Linux将分配一页新内存，并为这页增加一项页表记录，Linux以此来修证这个访问错误。但如果这段虚存有nopage操作，Linux将使用它。

通常，Linux的nopage操作被用于把可执行映像通过页缓存读入物理内存。

当页被读入内存后，进程的页表将被更新。特别是如果处理器使用TLB缓冲区的话，它可能需要通过硬件操作来完成更新。页错被处理后，进程在发生虚拟内存访问错误的指令处(编者注：即产生页错的指令处)重新开始执行。(编者注：这里有个TLB，可能大家不太熟习，这里简单提一下：TLB是translation look aside buffer的缩写，它是CPU在内部做的一块专用存储器一样的东东（当然还有一些其它控制逻辑），它主要用来缓存一小部分页表记录，这样它就可以专门用于加快虚拟地址到物理地址的转换，因为有了它之后，CPU不需要每次都从外部存储器中读取页表记录了。故而，一些资料上常将其译为“快速地址转换表”。)（参见mm/filemap.c中的filemap_nopage()）

3.7 Linux页缓存

Figure 3.6: The Linux Page Cache

The role of the Linux page cache is to speed up access to files on disk. Memory mapped files are read a page at a time and these pages are stored in the page cache. Figure 3.6 shows that the page cache consists of the page_hash_table, a vector of pointers to mem_map_t data structures. (See include/linux/pagemap.h)

Each file in Linux is identified by a VFS inode data structure (described in the filesystem chapter) and each VFS inode is unique and fully describes one and only one file. The index into the page hash table is derived from the file's VFS inode and the offset into the file.

Whenever a page is read from a memory mapped file, for example when it needs to be brought back into memory during demand paging, the page is read through the page cache. If the page is present in the cache, a pointer to the mem_map_t data structure representing it is returned to the page fault handling code. Otherwise the page must be brought into memory from the file system that holds the image. Linux allocates a physical page and reads the page from the file on disk.

If it is possible, Linux will initiate a read of the next page in the file. This single page read ahead means that if the process is accessing the pages in the file serially, the next page will be waiting in memory for the process.
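
The lookup and the single page read ahead can be modelled in a few lines of Python. This is a sketch, not the kernel's implementation: a dictionary stands in for page_hash_table, the 4096 byte page size is assumed, read ahead is triggered only on a cache miss, and the end of the file is not modelled. The read_from_disk callback is hypothetical.

```python
class PageCache:
    """Toy page cache keyed by (inode number, page-aligned file offset)."""
    PAGE_SIZE = 4096  # assumed page size

    def __init__(self, read_from_disk):
        self.pages = {}                      # stands in for page_hash_table
        self.read_from_disk = read_from_disk
        self.disk_reads = 0

    def _load(self, inode, offset):
        self.pages[(inode, offset)] = self.read_from_disk(inode, offset)
        self.disk_reads += 1

    def get_page(self, inode, offset):
        offset -= offset % self.PAGE_SIZE    # round down to a page boundary
        if (inode, offset) not in self.pages:
            self._load(inode, offset)
            nxt = offset + self.PAGE_SIZE
            if (inode, nxt) not in self.pages:
                self._load(inode, nxt)       # single page read ahead
        return self.pages[(inode, offset)]


cache = PageCache(lambda inode, off: f"inode {inode} page at {off}")
```

A process reading a file serially through this model misses on the first page only; the following page is already waiting in the cache when it is asked for.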

Over time the page cache grows as images are read and executed. Pages will be removed from the cache as they are no longer needed, say as an image is no longer being used by any process. As Linux uses memory it can start to run low on physical pages. In this case Linux will reduce the size of the page cache.

3.8 Swapping Out and Discarding Pages

When physical memory becomes scarce the Linux memory management subsystem must attempt to free physical pages. This task falls to the kernel swap daemon (kswapd). (See kswapd() in mm/vmscan.c)

The kernel swap daemon is a special type of process, a kernel thread. Kernel threads are processes that have no virtual memory; instead they run in kernel mode in the physical address space. The kernel swap daemon is slightly misnamed in that it does more than merely swap pages out to the system's swap files. Its role is to make sure that there are enough free pages in the system to keep the memory management system operating efficiently.

The kernel swap daemon (kswapd) is started by the kernel init process at startup time and sits waiting for the kernel swap timer to periodically expire.

Every time the timer expires, the swap daemon looks to see if the number of free pages in the system is getting too low. It uses two variables, free_pages_high and free_pages_low to decide if it should free some pages. So long as the number of free pages in the system remains above free_pages_high, the kernel swap daemon does nothing; it sleeps again until its timer next expires. For the purposes of this check the kernel swap daemon takes into account the number of pages currently being written out to the swap file. It keeps a count of these in nr_async_pages; this is incremented each time a page is queued waiting to be written out to the swap file and decremented when the write to the swap device has completed. free_pages_low and free_pages_high are set at system startup time and are related to the number of physical pages in the system. If the number of free pages in the system has fallen below free_pages_high or worse still free_pages_low, the kernel swap daemon will try three ways to reduce the number of physical pages being used by the system:

• Reducing the size of the buffer and page caches,

• Swapping out System V shared memory pages,

• Swapping out and discarding pages.

If the number of free pages in the system has fallen below free_pages_low, the kernel swap daemon will try to free 6 pages before it next runs. Otherwise it will try to free 3 pages. Each of the above methods is tried in turn until enough pages have been freed. The kernel swap daemon remembers which method it was using the last time that it attempted to free physical pages. Each time it runs it will start trying to free pages using this last successful method.

After it has freed sufficient pages, the swap daemon sleeps again until its timer expires. If the reason that the kernel swap daemon freed pages was that the number of free pages in the system had fallen below free_pages_low, it only sleeps for half its usual time. Once the number of free pages is more than free_pages_low the kernel swap daemon goes back to sleeping longer between checks.
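
The decision made on each timer expiry can be summarised as a small Python function. It is a sketch under two assumptions that the text does not spell out: pages counted in nr_async_pages are simply added to the free count for the purposes of the check, and a count exactly equal to free_pages_high is treated as low enough to act on. The function name kswapd_tick and the usual_sleep parameter are invented for illustration.

```python
def kswapd_tick(nr_free, nr_async, free_pages_low, free_pages_high, usual_sleep):
    """Return (pages_to_free, next_sleep) for one expiry of the swap timer."""
    available = nr_free + nr_async       # assumed way of counting pending writes
    if available > free_pages_high:
        return 0, usual_sleep            # plenty of free pages: do nothing
    if available < free_pages_low:
        return 6, usual_sleep / 2        # badly short: free more, wake sooner
    return 3, usual_sleep                # between the two thresholds
```

With free_pages_low = 20 and free_pages_high = 40, a system with 30 free pages is asked to free 3 pages, and one with 10 free pages is asked to free 6 and is checked again after half the usual interval.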

3.8.1 Reducing the Size of the Page and Buffer Caches

The pages held in the page and buffer caches are good candidates for being freed into the free_area vector. The Page Cache, which contains pages of memory mapped files, may contain unnecessary pages that are filling up the system's memory. Likewise the Buffer Cache, which contains buffers read from or being written to physical devices, may also contain unneeded buffers. When the physical pages in the system start to run out, discarding pages from these caches is relatively easy as it requires no writing to physical devices (unlike swapping pages out of memory). Discarding these pages does not have too many harmful side effects other than making access to physical devices and memory mapped files slower. However, if the discarding of pages from these caches is done fairly, all processes will suffer equally. (See shrink_map() in mm/filemap.c)

Every time the kernel swap daemon tries to shrink these caches it examines a block of pages in the mem_map page vector to see if any can be discarded from physical memory. The block of pages examined is larger if the kernel swap daemon is intensively swapping; that is, if the number of free pages in the system has fallen dangerously low. The blocks of pages are examined in a cyclical manner; a different block of pages is examined each time an attempt is made to shrink the memory map. This is known as the clock algorithm as, rather like the minute hand of a clock, the whole mem_map page vector is examined a few pages at a time.
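
The cyclical scan can be illustrated with a short Python sketch. The class name ClockScanner and its fields are hypothetical, and the model only tracks which page indices would be examined, not what is done with them.

```python
class ClockScanner:
    """Toy clock hand sweeping over a vector of nr_pages pages."""

    def __init__(self, nr_pages):
        self.nr_pages = nr_pages
        self.hand = 0   # index where the previous attempt stopped

    def next_block(self, block_size):
        """Return the page indices to examine now and advance the hand."""
        block = [(self.hand + i) % self.nr_pages for i in range(block_size)]
        self.hand = (self.hand + block_size) % self.nr_pages
        return block


scanner = ClockScanner(nr_pages=10)
```

Successive calls to scanner.next_block(4) cover pages 0 to 3, then 4 to 7, then wrap around to 8, 9, 0 and 1, so every page is eventually visited and no part of the vector is favoured.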

Each page being examined is checked to see if it is cached in either the page cache or the buffer cache. You should note that shared pages are not considered for discarding at this time and that a page cannot be in both caches at the same time. If the page is not in either cache then the next page in the mem_map page vector is examined.

Pages are cached in the buffer cache (or rather the buffers within the pages are cached) to make buffer allocation and deallocation more efficient. The memory map shrinking code tries to free the buffers that are contained within the page being examined.

If all the buffers are freed, then the pages that contain them are also freed. If the examined page is in the Linux page cache, it is removed from the page cache and freed. (See free_buffer() in fs/buffer.c)

When enough pages have been freed on this attempt then the kernel swap daemon will wait until the next time it is periodically woken. As none of the freed pages were part of any process's virtual memory (they were cached pages), then no page tables need updating. If there were not enough cached pages discarded then the swap daemon will try to swap out some shared pages.

3.8.2 Swapping Out System V Shared Memory Pages

System V shared memory is an inter-process communication mechanism which allows two or more processes to share virtual memory in order to pass information amongst themselves. How processes share memory in this way is described in more detail in the IPC chapter. For now it is enough to say that each area of System V shared memory is described by a shmid_ds data structure. This contains a pointer to a list of vm_area_struct data structures, one for each process sharing this area of virtual memory. The vm_area_struct data structures describe where in each process's virtual memory this area of System V shared memory goes. Each vm_area_struct data structure for this System V shared memory is linked together using the vm_next_shared and vm_prev_shared pointers. Each shmid_ds data structure also contains a list of page table entries each of which describes the physical page that a shared virtual page maps to.

The kernel swap daemon also uses a clock algorithm when swapping out System V shared memory pages.

Each time it runs it remembers which page of which shared virtual memory area it last swapped out. It does this by keeping two indices, the first is an index into the set of shmid_ds data structures, the second into the list of page table entries for this area of System V shared memory. This makes sure that it fairly victimizes the areas of System V shared memory. (See shm_swap() in ipc/shm.c)

As the physical page frame number for a given virtual page of System V shared memory is contained in the page tables of all of the processes sharing this area of virtual memory, the kernel swap daemon must modify all of these page tables to show that the page is no longer in memory but is now held in the swap file. For each shared page it is swapping out, the kernel swap daemon finds the page table entry in each of the sharing processes' page tables (by following a pointer from each vm_area_struct data structure). If this process's page table entry for this page of System V shared memory is valid, it converts it into an invalid but swapped out page table entry and reduces this (shared) page's count of users by one. The format of a swapped out System V shared page table entry contains an index into the set of shmid_ds data structures and an index into the page table entries for this area of System V shared memory.

If the page's count is zero after the page tables of the sharing processes have all been modified, the shared page can be written out to the swap file. The page table entry in the list pointed at by the shmid_ds data structure for this area of System V shared memory is replaced by a swapped out page table entry. A swapped out page table entry is invalid but contains an index into the set of open swap files and the offset in that file where the swapped out page can be found. This information will be used when the page has to be brought back into physical memory.
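
The bookkeeping for one shared page can be sketched in Python as below. Page table entries and the shared page are represented by plain dictionaries, and the function name swap_out_shared is invented; the sketch shows only the counting rule, not the real entry formats.

```python
def swap_out_shared(shared_page, sharer_ptes):
    """Invalidate every sharer's entry; report whether the page can be written out.

    shared_page is a dict with a 'count' of users; sharer_ptes is a list of
    per-process entries, each a dict with a 'valid' flag.
    """
    for pte in sharer_ptes:
        if pte["valid"]:
            pte["valid"] = False        # now an invalid, swapped out entry
            pte["swapped"] = True
            shared_page["count"] -= 1   # one fewer user of the physical page
    return shared_page["count"] == 0    # only then may it go to the swap file
```

If some user of the page still holds a reference after all the sharers' entries have been converted, the count stays above zero and the page is not written out on this pass.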

3.8.3 Swapping Out and Discarding Pages

The swap daemon looks at each process in the system in turn to see if it is a good candidate for swapping.

Good candidates are processes that can be swapped (some cannot) and that have one or more pages which can be swapped or discarded from memory. Pages are swapped out of physical memory into the system's swap files only if the data in them cannot be retrieved another way. (See swap_out() in mm/vmscan.c)

A lot of the contents of an executable image come from the image's file and can easily be re-read from that file. For example, the executable instructions of an image will never be modified by the image and so will never be written to the swap file. These pages can simply be discarded; when they are again referenced by the process, they will be brought back into memory from the executable image.

Once the process to swap has been located, the swap daemon looks through all of its virtual memory regions looking for areas which are not shared or locked.

Linux does not swap out all of the swappable pages of the process that it has selected; instead it removes only a small number of pages.

Pages cannot be swapped or discarded if they are locked in memory. (See swap_out_vma() in mm/vmscan.c)

The Linux swap algorithm uses page aging. Each page has a counter (held in the mem_map_t data structure) that gives the kernel swap daemon some idea whether or not a page is worth swapping. Pages age when they are unused and rejuvenate on access; the swap daemon only swaps out old pages. The default action when a page is first allocated is to give it an initial age of 3. Each time it is touched, its age is increased by 3 to a maximum of 20. Every time the kernel swap daemon runs it ages pages, decrementing their age by 1. These default actions can be changed and for this reason they (and other swap related information) are stored in the swap_control data structure.
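
The aging rules translate directly into a small Python model. The numbers are the defaults quoted above; the class and field names (SwapControl, initial_age, age_advance and so on) are illustrative and are not taken from the kernel's swap_control definition.

```python
from dataclasses import dataclass


@dataclass
class SwapControl:
    """Tunable aging parameters, set to the defaults given in the text."""
    initial_age: int = 3
    age_advance: int = 3
    max_age: int = 20
    age_decline: int = 1


class Page:
    def __init__(self, ctl):
        self.ctl = ctl
        self.age = ctl.initial_age

    def touch(self):
        """The page was accessed: make it younger, up to the maximum."""
        self.age = min(self.age + self.ctl.age_advance, self.ctl.max_age)

    def age_once(self):
        """One pass of the swap daemon: make the page older, never below 0."""
        self.age = max(self.age - self.ctl.age_decline, 0)

    def is_old(self):
        return self.age == 0
```

A freshly allocated page that is never touched becomes old after three passes of the swap daemon, whereas a page at the maximum age survives twenty passes without being accessed.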

If the page is old (age = 0), the swap daemon will process it further. Dirty pages are pages which can be swapped out. Linux uses an architecture specific bit in the PTE to describe pages this way (see Figure 3.2). However, not all dirty pages are necessarily written to the swap file. Every virtual memory region of a process may have its own swap operation (pointed at by the vm_ops pointer in the vm_area_struct); if it does, that method is used. Otherwise, the swap daemon will allocate a page in the swap file and write the page out to that device.

The page's page table entry is replaced by one which is marked as invalid but which contains information about where the page is in the swap file. This is an offset into the swap file where the page is held and an indication of which swap file is being used. Whatever the swap method used, the original physical page is made free by putting it back into the free_area. Clean (or rather not dirty) pages can be discarded and put back into the free_area for re-use.

If enough of the swappable process's pages have been swapped out or discarded, the swap daemon will again sleep. The next time it wakes it will consider the next process in the system. In this way, the swap daemon nibbles away at each process's physical pages until the system is again in balance. This is much fairer than swapping out whole processes.

3.9 The Swap Cache

When swapping pages out to the swap files, Linux avoids writing pages if it does not have to. There are times when a page is both in a swap file and in physical memory. This happens when a page that was swapped out of memory was then brought back into memory when it was again accessed by a process. So long as the page in memory is not written to, the copy in the swap file remains valid.

Linux uses the swap cache to track these pages. The swap cache is a list of page table entries, one per physical page in the system. This is a page table entry for a swapped out page and describes which swap file the page is being held in together with its location in the swap file. If a swap cache entry is non-zero, it represents a page which is being held in a swap file that has not been modified. If the page is subsequently modified (by being written to), its entry is removed from the swap cache.

When Linux needs to swap a physical page out to a swap file it consults the swap cache and, if there is a valid entry for this page, it does not need to write the page out to the swap file. This is because the page in memory has not been modified since it was last read from the swap file.

The entries in the swap cache are page table entries for swapped out pages. They are marked as invalid but contain information which allows Linux to find the right swap file and the right page within that swap file.
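
How the swap cache saves a write can be shown with a toy Python model. Swap entries are represented by plain non-zero integers rather than real page table entries, and the method names and the write_to_swap callback are hypothetical.

```python
class SwapCache:
    """Toy swap cache: one slot per physical page, 0 meaning no valid copy."""

    def __init__(self, nr_pages):
        self.entries = [0] * nr_pages

    def add(self, page, swap_entry):
        """The page was just read back in from this location in swap."""
        self.entries[page] = swap_entry

    def page_written(self, page):
        """The page was modified, so the copy in the swap file is stale."""
        self.entries[page] = 0

    def swap_out(self, page, write_to_swap):
        """Return the swap entry for the page, writing it out only if needed."""
        entry = self.entries[page]
        if entry == 0:
            entry = write_to_swap(page)   # no usable copy: must write the page
        return entry
```

A page that was swapped in and only read keeps its non-zero entry, so swapping it out again costs no disk write; once it is written to, the entry is cleared and the next swap out writes the page.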

3.10 Swapping Pages In

The dirty pages saved in the swap files may be needed again, for example when an application writes to an area of virtual memory whose contents are held in a swapped out physical page. Accessing a page of virtual memory that is not held in physical memory causes a page fault to occur. The page fault is the processor signalling the operating system that it cannot translate a virtual address into a physical one. In this case this is because the page table entry describing this page of virtual memory was marked as invalid when the page was swapped out. The processor cannot handle the virtual to physical address translation and so hands control back to the operating system describing as it does so the virtual address that faulted and the reason for the fault. The format of this information and how the processor passes control to the operating system is processor specific.

The processor specific page fault handling code must locate the vm_area_struct data structure that describes the area of virtual memory that contains the faulting virtual address. It does this by searching the vm_area_struct data structures for this process until it finds the one containing the faulting virtual address. This is very time critical code and a process's vm_area_struct data structures are so arranged as to make this search take as little time as possible. (See do_page_fault() in arch/i386/mm/fault.c)

Having carried out the appropriate processor specific actions and found that the faulting virtual address is for a valid area of virtual memory, the page fault processing becomes generic and applicable to all processors that Linux runs on.

The generic page fault handling code looks for the page table entry for the faulting virtual address. If the page table entry it finds is for a swapped out page, Linux must swap the page back into physical memory. The format of the page table entry for a swapped out page is processor specific but all processors mark these pages as invalid and put the information necessary to locate the page within the swap file into the page table entry. Linux needs this information in order to bring the page back into physical memory. (See do_no_page() in mm/memory.c)

At this point, Linux knows the faulting virtual address and has a page table entry containing information about where this page has been swapped to. The vm_area_struct data structure may contain a pointer to a routine which will swap any page of the area of virtual memory that it describes back into physical memory. This is its swapin operation. If there is a swapin operation for this area of virtual memory then Linux will use it. This is, in fact, how swapped out System V shared memory pages are handled; they require special handling because the format of a swapped out System V shared page is a little different from that of an ordinary swapped out page. There may not be a swapin operation, in which case Linux will assume that this is an ordinary page that does not need to be specially handled.

It allocates a free physical page and reads the swapped out page back from the swap file. Information telling it where in the swap file the page is held (and in which swap file) is taken from the invalid page table entry. (See do_swap_page() in mm/memory.c; shm_swap_in() in ipc/shm.c; swap_in() in mm/page_alloc.c)

If the access that caused the page fault was not a write access then the page is left in the swap cache and its page table entry is not marked as writable. If the page is subsequently written to, another page fault will occur and, at that point, the page is marked as dirty and its entry is removed from the swap cache. If the page is not written to and it needs to be swapped out again, Linux can avoid the write of the page to its swap file because the page is already in the swap file.

If the access that caused the page to be brought in from the swap file was a write operation, this page is removed from the swap cache and its page table entry is marked as both dirty and writable.
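
The difference between a read fault and a write fault on a swapped out page can be sketched as follows. This is a toy model: page table entries are dictionaries, the swap cache is a dictionary from page number to swap entry, and the function names are invented. Making the entry writable on the later write fault is an assumption; the text only says the page is then marked dirty and dropped from the swap cache.

```python
def swap_in(swap_entry, page, swap_cache, write_access):
    """Bring a swapped out page back and return its new (toy) page table entry."""
    if write_access:
        swap_cache.pop(page, None)    # the swap copy is about to become stale
        return {"valid": True, "writable": True, "dirty": True}
    swap_cache[page] = swap_entry     # the swap copy stays usable
    return {"valid": True, "writable": False, "dirty": False}


def write_fault(page, pte, swap_cache):
    """A later write to a page that was brought in by a read access."""
    swap_cache.pop(page, None)        # remove its entry from the swap cache
    pte["writable"] = True            # assumed: the write is now permitted
    pte["dirty"] = True
    return pte
```

Leaving the entry read only after a read fault is what lets Linux notice the first write: it arrives as a second page fault, and only then is the swap cache entry given up.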

Linux页缓存的作用是加快从磁盘上存取文件的速度。内存映射文件以每次一页的方式读出，这些页将被放在页缓存中。图3.6显示页缓存由page_hash_table，以及一组指向mem_map_t的指针组成。（参见include/linux/pagemap.h）

Linux的每个文件由一VFS inode数据结构表示(请参看“文件系统”一章)，并且每个VFS inode是唯一的并且描述一个且仅一个文件。页表中的索引从文件的VFS inode及其在文件中的偏移量派生而来。

当从内存映像文件中读取一页时，例如，按需装载一页回内存时，读操作将通过页缓存进行。如果页在缓存中，一个指向它的mem_map_t指针将被返回给处理页错的代码。否则，这页必须从含有这份映象的文件系统中被读入内存。Linux需分配一页内存，并从磁盘文件中读取这页。(编者注：这两句话直译很坳口，其实它的意思就是如果这页不是在页缓存中，则系统必须把它从存放这页的磁盘文件中读取出来。)

如果可能，Linux将开始读文件中的下一页。向前多读一页意味着如果进程是连续地访问文件，那么下一页将等在内存中。

页缓存将随着映象的读取与执行而渐渐增长。当不再被需要，或说不再被任何进程使用时，这些页将从缓存中移出。Linux使用内存时，会尽量减少物理页的使用。在此种情况下，Linux将减少页缓存的大小。

3.8 页的交换和释放

当空内存变得很少时，Linux内存管理系统必须释放一些页。这任务由内核交换后台程序来完成（kswapd）。（参见mm/vmscan.c中的kswapd()）

内核交换后台程序是一种特殊的进程，是一个内核线程。内核线程是没有虚存的进程，它们在物理地址空间以内核模式运行。“内核交换后台程序”这个名称稍微有点不恰当，因为它不仅仅是把页交换到系统的交换文件中。它这个角色是保证系统有足够的空闲内存而使内存管理系统可以高效地工作。

内核交换后台程序被内核init进程在初始化时启动，并等待内核交换定时器周期性地到期时开始运行。

每次定时器到期，内核交换后台程序就会检查系统中的空页数是否变得太低。交换程序使用两个变量，free_pages_high和 free_pages_low来决定是否它应该释放一些页。只要系统的空页数大于 free_pages_high，内核交换后台程序不做任何事情；它继续休息直到定时器再次到期。在做这项检查时，交换程序计算了正在往交换文件中写的页数。它把这个值保存在nr_async_pages中，每次有一页等待写入交换文件时，此值加1，当操作结束后，此值减1。free_pages_low和free_pages_high在系统开始时被设置，并且与系统物理内存的页数有关。如果系统的空页数小于 free_pages_high 或甚至小于 free_pages_low,内核交换后台程序将尝试 3 种方法以减少系统使用的页数：

l 减少缓冲区和页缓存的大小

l 换出系统 V 的共享页

l 换出并释放一些页

如果系统的空页数小于 free_pages_low,内核交换后台程序在它下次运行以前，将尝试释放6页，否则它将尝试释放3页。上面的方法将依次被使用直到有足够的页被释放。内核交换后台程序将记住上一次它是用什么方法释放内存的，下一次将首先使用这个成功的方法。

在系统有足够的空页后，交换程序将休息直到它的定时器到期。如果上次空页数小于free_pages_low，它只休息一半时间。直到空页数多于 free_pages_low，内核交换后台程序才恢复休息的时间。

3.8.1 减少页缓存和缓冲区的大小

页缓存和缓冲区中的页是被释放到free_area数组里的最佳候选。页缓存保存着内存映像文件，很可能包括了许多占据着系统内存但又没用的页。同样，包含着从物理设备中读写的数据的缓冲区中，也很可能包含许多不需要的数据缓存。当系统的内存页快用完时，从这些缓存丢弃页是相对容易的，因为它们不需要写物理设备(不同于从内存交换页)。丢弃这些页除了使访问物理设备和内存映象文件的速度减慢一些以外，没有其它的副作用。并且，如果从缓存中对页的丢弃是公平的话，那么对各进程的影响是相同的。（参见mm/filemap.c中的shrink_map()）

每次内核交换后台程序尝试缩小这些缓存时，它先检查在mem_map页面数组中的页块，看是否有页可以从内存中释放。如果内核交换后台程序经常作交换操作，也就是系统空页数已经非常少了，它会先检查大一些的块。页块会被轮流检查；每次减少缓存时检查一组不同的页块。这被称作“时钟算法”，像钟的分针一样轮流检查mem_map页面数组中的页。

检查一页是看它是否在页缓存或缓冲区中。应该注意共享页在这时候不能被释放，并且一页不能同时在两个缓存中。如果页不在任何一个缓存中，那么就检查mem_map页面数组中的下一页。

页被缓存在缓冲区中(或页内的缓冲区被缓存)是为更有效地分配和回收缓存。缩减内存代码将尝试释放被检查页中的缓冲区。

如果所有的缓冲区都被释放了，那么对应它们的内存也就被释放了。如果被检查的页在Linux页缓存中，它将被从页缓存中移出并释放。（参见fs/buffer.c中的free_buffer()）

当足够的页被释放后，内核交换后台程序将等到下一个周期再运行。因为释放的页都不是任何进程的虚存部份(他们是被缓存的页),所以没有页表记录需要更新。如果没有释放足够的缓存页，那么交换程序将试着释放一些共享页。

3.8.2 交换出系统V的共享页

系统V共享内存是一个进程内通信机制，它允许两个或多个进程共享虚拟内存，以便于在它们之间传递信息。进程间如何通过这种方式共享内存，将会在IPC章中详细描述。现在，只要知道每一块系统V共享内存区域被一个shmid_ds数据结构描述就足够了。这个结构包含一个指向一组vm_area_struct数据结构的指针，每个进程都通过它共享这部分虚存。vm_area_struct数据结构描述了每个进程在各自虚存的哪里共享系统V的这个区域。每个vm_area_struct由vm_next_shared和vm_prev_shared指针相互连接起来。每个shmid_ds数据结构还包括一组页表记录，每个页表记录描述了这些共享页是对应内存中的哪些物理页。

内核交换后台程序也使用时钟算法来换出系统V的共享页。

每次它运行时，它记得上次换出的是哪个共享内存区域中的哪一页。它将其记录在两个索引中，第一个是shmid_ds数据结构的索引,第二个是这段系统V共享内存的页表记录的索引。这保证它公平地对待系统V的所有共享页。（参见ipc/shm.c中的shm_swap()）

由于给定的系统V共享内存的物理页号在每一个共享此内存区域的进程的页表中都有记录，内核交换后台程序必须修改所有这些页表，显示页已不在内存中了，而被保存在交换文件中。对于每个换出的共享页，内核交换后台程序查找这共享页在各个进程中的页表记录（顺着每个vm_area_struct的指针）。如果这系统V共享页对应的页表记录是有效的，交换程序将把它改成无效，换出页表项，再将对应这页(共享页)的用户计数器减1。被换出的系统V共享页的页表记录格式中含有一个shmid_ds数据结构的索引,以及一个这段系统V共享内存区域的索引。

如果各进程的页表修改过后，页的计数器变成0，那么这个共享页就可以被写入交换文件了。shmid_ds中指向系统V共享内存页的页表记录将被换出页表记录所替换。一个换出页表记录是无效的，但它包含一组打开的交换文件的索引，以及换出页面在交换文件中的偏移量。当这页面重新被载入物理内存时，这些信息会被使用到。

3.8.3 换出及释放的页

交换程序轮流检查系统中每一个进程，看它们是不是被用来交换的好的候选。

好的候选是那些能被(有的不能)换出的进程以及那些能从内存中换出或释放若干页的进程。只有包含的数据不能从其它地方得到的页，才会从物理内存中交换到系统的交换文件中。（参见mm/vmscan.c中的swap_out()）

可执行映像的许多内容是可以从映像文件中读出并且可以很容易的重新读出来的。例如,一段映像的可执行指令决不会被映象修改，所以决不会被写进交换文件。这些页可以简单的被丢弃；当他们再被进程调用时，他们将被从可执行映像中重新读入内存。

一旦确定了换出的进程，交换程序将检查它所有的虚拟内存区域，找出不是共享或被加锁的区域。

Linux并不换出它所选择进程的所有可交换页；相反它仅移出其中的一小部份。

如果页在内存中被锁住了，它们就不能被换出或释放。（参见mm/vmscan.c中的swap_out_vma()）

Linux交换算法使用页的年龄(aging)。每页有一个记数器（保存在mem_map_t数据结构中），告诉交换程序是否应将它移出。当页不使用时页会变老；当被访问时，会变年轻。交换程序仅仅移出衰老的页。缺省状态下，当一页被分配时，起始年龄是3，每次它被访问，它的年龄将增加3，最大值为20。每次内核交换后台程序运行时，它把所有页的年龄数减1。这些缺省操作都能被改变，它们（以及一些其它的与交换相关的信息）被存储在swap_control数据结构中。

如果页是旧的(age = 0)，交换程序就进一步处理它。脏页也可以被移出。Linux用PTE中的特定位来标示(见3.2图)。然而,并非所有的脏页必须被写进交换文件。进程的每个虚存区域都可以有它们自己的交换操作(由vm_area_struct中的vm_ops指针指出)，这个特定的操作将被调用。否则，交换程序将在交换文件上分配一页，并将那页写到磁盘上。

页对应的页表记录将被改为无效，但包含了它在交换文件中的信息，它将指出是它存在于哪个交换文件，并且偏移量是多少。无论采取什么交换方法，原来的物理页将被放回free_area。乾净的(或者not dirty)的页可以直接丢弃放并放回free_area以备后用。

如果有足够的页被换出或释放,交换程序就又开始休息。下一次它运行时，它将检查系统中的下一个进程。这样，交换程序一点一点地将每个进程都移出几页，直到系统达到一个平衡，这比移出一整个进程来的公平。

3.9 交换缓存

当将页移入交换文件中时，如果不是必要，Linux总是避免进行写页操作。有时一页既在交换文件中，又在内存中。这种情况是由于这页本来被移到了交换文件中，后又因为被调用，又被重新读入内存。只要在内存中的页没被写过, 在交换文件中的拷贝仍然是有效。

Linux使用交换缓存来记录这些页。交换缓存是一个页表记录链表，每条记录对应一页。每条页表记录描述被换出的页在哪个交换文件中及其在文件中的位置。如果一个交换缓存记录非零，表示在交换文件中的那页没被修改过，如果页被修改了(被写)，它的记录将被从交换缓存中移出。

当Linux需要移出一页内存到交换文件中时，它先查询交换缓存,如果这页有一个有效的记录，它就不需要把页写到交换文件中了。因为自从它上次从交换文件中读出后，在内存中没被修改过。

交换缓存中的记录是已被交换出的页的页表记录。它们被标为无效，但是告知了Linux页在哪个交换文件以及在交换文件的哪个位置。

3.10 移入页

保存在交换文件中的脏页可能会被再次调用。例如，一个应用程序要向已交换出物理页面的虚拟内存区上写入时。这样，存取不在内存中的虚页将引起页错。页错误是由处理器发信号给操作系统，告诉操作系统它不能把某个虚地址翻译成物理地址，由于描述这个虚页在被交换出时，页表记录已被标记为无效，处理器不能处理虚拟地址到物理地址的转换，于是，处理器把控制权交还给操作系统。同时告诉操作系统发生页错的虚拟地址及原因。消息的格式以及处理器怎样把控制权交给操作系统这是与处理器相关的。

与处理器相关的页错处理代码必须找到引起页错的虚地址对应的vm_area_struct数据结构。在这个过程中，系统检索该进程所有的vm_area_struct数据结构直到找到为止。这段代码对时间的要求很高，所以vm_area_struct应被合理组织起来，以缩短查找所需的时间。（参见arch/i386/mm/fault.c中的do_page_fault()）

系统执行完这些与处理器相关的操作并找到引起页错的虚地址所代表的有效内存区域后，处理页错的其它代码是通用并且与运行Linux的处理器无关的了。

页错处理代码寻找引发页错的虚地址对应的页表记录。如果页表记录指示这页在交换文件中,Linux就必须把这页读回物理内存。页表记录的格式因处理器的不同而各不相同，但所有的处理器都会标记此页无效，并且都保存着有关这页在交换文件中的有用的信息。Linux需要利用这些信息来把页重新载入内存。（参见mm/memory.c中的do_no_page()）

此时，Linux知道了引起页错的虚地址及其对应的页表记录，并且拥有一个包含此页被交换到哪个交换文件中的信息的页表记录。vm_area_struct数据结构可能包含一个指向一个例程的指针，这个例程能将此虚拟内存中的任何见交换到物理内存中去。这是swapin操作。如果此虚拟内存区域存在swapin操作，Linux就会调用它。实际上，这是怎样换出系统V共享内存页的操作，这个操作需要特殊处理，因为系统V的页的格式与一般的页不同。这里也可能没有swapin操作，在这种情况下，Linux将认为它是一普通的页，而不需要做任何特别的处理。

系统将在物理内存中分配一空页，并从交换文件中把交换同的页读回来。而关于页面在交换文件中的位置信息（以及在哪个交换文件中）是从无效的页表记录中取回的。（参见mm/memory.c中的do_swap_page()、ipc/shm.c中的shm_swap_in()、mm/page_alloc.c中的swap_in()）

如果引起页错的不是写操作，那么这页将被留在交换缓存中，它的页表记录不会被标为“可写”。如果后来这页被写了，那么会产生另一个页错，这时，页被标成“dirty”，并且它的页表记录被从交换缓冲中删去。如果这页没被修改过，而它又需要被换出，Linux将避免再把这页写到交换文件中，因为它已经在那儿了。

如果引起页面从交换文件中读出的操作是写操作，页将被从交换缓存中删除，它的页表记录将被标成“脏的(dirty)”和“可写(writable)”。

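1 Email: david.rusling@arm.com

2 As the authors' current circumstances could not all be confirmed, every author note below is taken from the notes the authors wrote at the time of translation.

3 Graduated from Purdue University (USA) with an MS degree. Now works at GTE (USA). Email: bixin@yahoo.com

4 Now a Ph.D. student in the Computer Science Department of Carnegie Mellon University (USA). Email: bixin@yahoo.com

5 Now a Ph.D. at Yale University (USA). Email: sheng.zhong@yale.edu

6 Now works at Lode Soft, Nanjing. Email: ping@lodesoft.com

7 Now an Assistant Professor at Wayne State University (USA). Email: zbo@cs.wayne.edu

8 Now a Ph.D. candidate at Dartmouth College (USA). Email: qun.li@dartmouth.edu

9 Software engineer in Silicon Valley (USA). Email: niuniu_888@hotmail.com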