2.1 Basic framework of memory management
2.2 The whole process of address mapping
2.3 Important data structures and functions
2.1
Splitting the address space into system (kernel) space and user space helps improve system security. Page swapping introduces timing uncertainty, so embedded systems generally do not use it. Without address mapping an MMU is not needed, but protection of the address spaces can only be enforced through an MMU.
The split between the two spaces differs from CPU to CPU; see include/asm-arm/memory.h:
#define TASK_SIZE UL(0xbf000000) //the maximum size of a user space task.
#define PAGE_OFFSET UL(0xc0000000)
include/asm-arm/arch-s3c2410/memory.h
#define PHYS_OFFSET UL(0x30000000)
To accommodate 64-bit CPUs, the kernel's mapping model has three levels: the page global directory (PGD), the page middle directory (PMD), and the page table (PT); all are arrays whose elements are page table entries (PTEs). A virtual address is split into four bit fields. On CPUs without a three-level MMU, the PMD level is skipped (folded away).
include/asm-arm/pgtable.h
#define PMD_SHIFT 21
#define PGDIR_SHIFT 21
#define PTRS_PER_PTE 512 /* 256*2: a 2MB block holds 512 pages of 4KB each */
#define PTRS_PER_PMD 1
#define PTRS_PER_PGD 2048 /* 2048 entries * 2MB per entry = 4GB */
#define PMD_SIZE (1UL << PMD_SHIFT)
#define PMD_MASK (~(PMD_SIZE-1))
#define PGDIR_SIZE (1UL << PGDIR_SHIFT)
#define PGDIR_MASK (~(PGDIR_SIZE-1))
#define PAGE_SHIFT 12
#define PAGE_SIZE (1UL << PAGE_SHIFT)
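A minimal user-space sketch of how a 32-bit virtual address splits into the index fields implied by the constants above; the DEMO_ macros mirror the kernel values and 0xc0123456 is just an example address.
#include <stdio.h>

#define DEMO_PGDIR_SHIFT 21
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE (1UL << DEMO_PAGE_SHIFT)

int main(void)
{
    unsigned long va = 0xc0123456UL;                         /* example virtual address */
    unsigned long pgd_idx = va >> DEMO_PGDIR_SHIFT;          /* index into the 2048-entry PGD */
    unsigned long pte_idx = (va >> DEMO_PAGE_SHIFT) & 0x1ff; /* index into the 512-entry PTE table */
    unsigned long offset = va & (DEMO_PAGE_SIZE - 1);        /* offset within the 4KB page */

    printf("pgd index=%lu, pte index=%lu, offset=0x%lx\n", pgd_idx, pte_idx, offset);
    return 0;
}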
2.2 The mapping process
The kernel loads the PGD base into register XX for the MMU. The MMU uses the directory field of the virtual address to index the PGD; the corresponding entry logically points to a PMD but physically points to a page table (PT).
The PMD exists only logically, as a single entry; its value is passed through unchanged and points to the PT.
The MMU uses the page-table field to index the PT; the corresponding PTE points to a physical page.
The MMU adds the final (offset) field to the physical page base to obtain the physical address.
For system-space virtual addresses, the translation to physical addresses goes through PAGE_OFFSET:
#define __pa(x) __virt_to_phys((unsigned long)(x))
#define __va(x) ((void *)__phys_to_virt((unsigned long)(x)))
#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
#define __virt_to_phys(x) ((x) - PAGE_OFFSET + PHYS_OFFSET)
#define __phys_to_virt(x) ((x) - PHYS_OFFSET + PAGE_OFFSET)
//Not all of the 1GB kernel space necessarily has a corresponding memory mapping; that depends on the amount of memory installed. I/O is also placed in the memory space and gets mapped (e.g. device registers, small on-chip memories, etc.).
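A minimal sketch of the linear translation these macros perform, using the S3C2410 values above (PAGE_OFFSET = 0xc0000000, PHYS_OFFSET = 0x30000000); the test address is arbitrary.
#include <assert.h>

#define DEMO_PAGE_OFFSET 0xc0000000UL
#define DEMO_PHYS_OFFSET 0x30000000UL

static unsigned long demo_virt_to_phys(unsigned long v) { return v - DEMO_PAGE_OFFSET + DEMO_PHYS_OFFSET; }
static unsigned long demo_phys_to_virt(unsigned long p) { return p - DEMO_PHYS_OFFSET + DEMO_PAGE_OFFSET; }

int main(void)
{
    /* 0xc0001000 in kernel space corresponds to 0x30001000 in SDRAM. */
    assert(demo_virt_to_phys(0xc0001000UL) == 0x30001000UL);
    assert(demo_phys_to_virt(0x30001000UL) == 0xc0001000UL);
    return 0;
}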
The ARM MMU supports four mapping sizes: 1MB sections (mapped at the first level; the other sizes use two levels), 64KB large pages, 4KB small pages, and the newer 1KB tiny pages.

A second-level coarse page table has 256 entries * 4 bytes = 1KB and is 1KB aligned; a fine page table has 1024 entries * 4 bytes = 4KB and is 4KB aligned. With 64KB large pages the coarse table still has 256 entries: each large page corresponds to 16 consecutive entries (64KB / 4KB = 16), and the same descriptor is repeated 16 times. The MMU loads and evicts TLB entries per page, so completing the mapping of a large page means filling in 16 entries at once, which is the difference from 4KB pages.
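To make the 16-entry replication concrete, here is an illustrative sketch; set_large_page() is a made-up helper, not a kernel function, and the bit values follow the PTE_* definitions quoted further below.
/* Fill the 16 consecutive coarse-table entries that describe one 64KB large page. */
static void set_large_page(unsigned long *coarse_table, unsigned int index,
                           unsigned long phys_64k_base, unsigned long ap_bits)
{
    unsigned long desc = (phys_64k_base & ~0xffffUL) | ap_bits |
                         (1UL << 3) /* cacheable */ |
                         (1UL << 2) /* bufferable */ |
                         (1UL << 0) /* PTE_TYPE_LARGE */;
    unsigned int first = index & ~15u;   /* 64KB / 4KB = 16 entries per large page */
    unsigned int i;

    for (i = 0; i < 16; i++)
        coarse_table[first + i] = desc;
}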
Entries of the first-level mapping table:

/*
* - section
*/
#define PMD_SECT_BUFFERABLE (1 << 2)
#define PMD_SECT_CACHEABLE (1 << 3)
#define PMD_SECT_XN (1 << 4) /* v6 */
#define PMD_SECT_AP_WRITE (1 << 10)
#define PMD_SECT_AP_READ (1 << 11)

/*
 * + Level 2 descriptor (PTE), hardware version
 * - common
 */
#define PTE_TYPE_MASK (3 << 0)
#define PTE_TYPE_FAULT (0 << 0)
#define PTE_TYPE_LARGE (1 << 0)
#define PTE_TYPE_SMALL (2 << 0)
#define PTE_BUFFERABLE (1 << 2)
#define PTE_CACHEABLE (1 << 3)
/*
 * - small page
 */
#define PTE_SMALL_AP_MASK (0xff << 4)
#define PTE_SMALL_AP_UNO_SRO (0x00 << 4) /* user: no access, supervisor: read-only (UNO SRO) */
#define PTE_SMALL_AP_UNO_SRW (0x55 << 4)
#define PTE_SMALL_AP_URO_SRW (0xaa << 4)
#define PTE_SMALL_AP_URW_SRW (0xff << 4)
/*
 * + Level 2 descriptor (PTE), Linux version
 * - common
 */
/*
* "Linux" PTE definitions.
*
* We keep two sets of PTEs - the hardware and the linux version.
* This allows greater flexibility in the way we map the Linux bits
* onto the hardware tables, and allows us to have YOUNG and DIRTY
* bits.
*
* The PTE table pointer refers to the hardware entries; the "Linux"
* entries are stored 1024 bytes below.
*/
#define L_PTE_PRESENT (1 << 0)
#define L_PTE_FILE (1 << 1) /* only when !PRESENT */
#define L_PTE_YOUNG (1 << 1) /* flag bit designed for swap (page aging) */
#define L_PTE_BUFFERABLE (1 << 2) /* matches PTE */
#define L_PTE_CACHEABLE (1 << 3) /* matches PTE */
#define L_PTE_USER (1 << 4)
#define L_PTE_WRITE (1 << 5)
#define L_PTE_EXEC (1 << 6)
#define L_PTE_DIRTY (1 << 7)
/*
* Hardware-wise, we have a two level page table structure, where the first
* level has 4096 entries, and the second level has 256 entries. Each entry
* is one 32-bit word. Most of the bits in the second level entry are used
* by hardware, and there aren't any "accessed" and "dirty" bits.
*
* Linux on the other hand has a three level page table structure, which can
* be wrapped to fit a two level page table structure easily - using the PGD
* and PTE only. However, Linux also expects one "PTE" table per page, and
* at least a "dirty" bit.
*
* Therefore, we tweak the implementation slightly - we tell Linux that we
* have 2048 entries in the first level, each of which is 8 bytes (iow, two
* hardware pointers to the second level.) The second level contains two
* hardware PTE tables arranged contiguously, followed by Linux versions
* which contain the state information Linux needs. We, therefore, end up
* with 512 entries in the "PTE" level.
*
* This leads to the page tables having the following layout:
*
*    pgd             pte
* |        |
* +--------+ +0
* |        |-----> +------------+ +0
* +- - - - + +4    |  h/w pt 0  |
* |        |-----> +------------+ +1024
* +--------+ +8    |  h/w pt 1  |
* |        |       +------------+ +2048
* +- - - - +       | Linux pt 0 |
* |        |       +------------+ +3072
* +--------+       | Linux pt 1 |
* |        |       +------------+ +4096
*
* See L_PTE_xxx below for definitions of bits in the "Linux pt", and
* PTE_xxx for definitions of bits appearing in the "h/w pt".
*
* PMD_xxx definitions refer to bits in the first level page table.
*
* The "dirty" bit is emulated by only granting hardware write permission
* iff the page is marked "writable" and "dirty" in the Linux PTE. This
* means that a write to a clean page will cause a permission fault, and
* the Linux MM layer will mark the page dirty via handle_pte_fault().
* For the hardware to notice the permission change, the TLB entry must
* be flushed, and ptep_set_access_flags() does that for us.
*
* The "accessed" or "young" bit is emulated by a similar method; we only
* allow accesses to the page if the "young" bit is set. Accesses to the
* page will cause a fault, and handle_pte_fault() will set the young bit
* for us as long as the page is marked present in the corresponding Linux
* PTE entry. Again, ptep_set_access_flags() will ensure that the TLB is
* up to date.
*
* However, when the "young" bit is cleared, we deny access to the page
* by clearing the hardware PTE. Currently Linux does not flush the TLB
* for us in this case, which means the TLB will retain the translation
* until either the TLB entry is evicted under pressure, or a context
* switch which changes the user space mapping occurs.
*/
The first access to a given address actually costs three memory accesses: two are table walks and one is the real access. To keep address translation from touching memory, the MMU has a TLB cache and automatically, selectively keeps a subset of the table entries in it. Right after address mapping is enabled the TLB is empty, so the CPU has to wait out the MMU's two extra memory access cycles. Once a translation has been walked, the MMU adds the entry to the TLB, so subsequent accesses to the same page save two memory cycles. Entries are loaded into the TLB individually; once the TLB is full, ARM uses a round-robin replacement algorithm (also called cyclic).


All regions of memory have an associated domain. A domain is the primary access control mechanism for a region
of memory and defines the conditions in which an access can proceed. The domain determines whether:
the access permissions are used to qualify the access
the access is unconditionally allowed to proceed
the access is unconditionally aborted.
In the latter two cases, the access permission attributes are ignored.
There are 16 domains, which are configured using the domain access control register.

The kernel uses only four domains:
/*
* Domain numbers
*
* DOMAIN_IO - domain 2 includes all IO only
* DOMAIN_USER - domain 1 includes all user memory only
* DOMAIN_KERNEL - domain 0 includes all kernel memory only
*
* The domain numbering depends on whether we support 36 physical
* address for I/O or not. Addresses above the 32 bit boundary can
* only be mapped using supersections and supersections can only
* be set for domain 0. We could just default to DOMAIN_IO as zero,
* but there may be systems with supersection support and no 36-bit
* addressing. In such cases, we want to map system memory with
* supersections to reduce TLB misses and footprint.
*
* 36-bit addressing and supersections are only available on
* CPUs based on ARMv6+ or the Intel XSC3 core.
 */
#define DOMAIN_KERNEL 0 /* system space */
#define DOMAIN_TABLE 0 /* system space */
#define DOMAIN_USER 1 /* user space */
#define DOMAIN_IO 2 /* system space (I/O) */
The access type of each domain is set separately by the kernel:
/*
 * Domain types (used to program the domain access control register)
*/
#define DOMAIN_NOACCESS 0
#define DOMAIN_CLIENT 1
#define DOMAIN_MANAGER 3
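For illustration, a domain access control register (DACR) value could be composed as below; demo_domain_field() is a hypothetical helper (the kernel has its own macros for this), and each of the 16 domains occupies two bits of the 32-bit register.
#define demo_domain_field(dom, type) ((unsigned long)(type) << (2 * (dom)))

/* Kernel memory as manager, user memory and I/O as client (illustrative policy). */
static unsigned long demo_dacr_value(void)
{
    return demo_domain_field(0 /* DOMAIN_KERNEL */, 3 /* DOMAIN_MANAGER */) |
           demo_domain_field(1 /* DOMAIN_USER   */, 1 /* DOMAIN_CLIENT  */) |
           demo_domain_field(2 /* DOMAIN_IO     */, 1 /* DOMAIN_CLIENT  */);
}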
Only on the i386 does translation first go through segment mapping, because the i386 MMU mandates segmentation before paging. Linux uses paging and effectively neutralizes the segmentation step, so a linear address A comes out intact as virtual address A (e.g. 0X08048568 --> FUNCXXX).
Each process keeps its PGD pointer in its mm_struct. Whenever a process is scheduled to run, the kernel programs register XX for it, and the MMU always fetches the PGD pointer from XX. What goes into XX is a physical address, so the PGD pointer must first be converted from virtual to physical:
include/asm-arm/mmu_context.h
/*
* This is the actual mm switch as far as the scheduler
* is concerned. No registers are touched. We avoid
* calling the CPU specific function when the mm hasn't
* actually changed.
*/
static inline void switch_mm(struct mm_struct *prev, struct mm_struct *next,
struct task_struct *tsk)
{
cpu_switch_mm(next->pgd, next);
}
include/asm-arm/proc-fns.h
#define cpu_switch_mm(pgd,mm) cpu_do_switch_mm(virt_to_phys(pgd),mm)
During page mapping the CPU would access memory three times: once for the PGD, once for the PT, and once for the real target. Once the entries have been loaded into the cache/TLB and hit there, no lookup in memory is needed; the hardware implementation keeps this fast.
2.3
The PTEs of the PGD, PMD and PT:
include/asm-arm/page.h
/*
* These are used to make use of C type-checking..
*/
typedef struct { unsigned long pte; } pte_t;
typedef struct { unsigned long pmd; } pmd_t;
typedef struct { unsigned long pgd[2]; } pgd_t;
#define pte_val(x) ((x).pte)
#define pmd_val(x) ((x).pmd)
#define pgd_val(x) ((x).pgd[0])
typedef struct { unsigned long pgprot; } pgprot_t; /* corresponds to the low bits of a PTE: page status and permissions */
#define pgprot_val(x) ((x).pgprot)
include/asm-arm/pgtable.h
/*
* "Linux" PTE definitions.
*
* We keep two sets of PTEs - the hardware and the linux version.
* This allows greater flexibility in the way we map the Linux bits
* onto the hardware tables, and allows us to have YOUNG and DIRTY
* bits.
*
* The PTE table pointer refers to the hardware entries; the "Linux"
* entries are stored 1024 bytes below.
*/
#define L_PTE_PRESENT (1 << 0)
#define L_PTE_FILE (1 << 1) /* only when !PRESENT */
#define L_PTE_YOUNG (1 << 1)
#define L_PTE_BUFFERABLE (1 << 2) /* matches PTE */
#define L_PTE_CACHEABLE (1 << 3) /* matches PTE */
#define L_PTE_USER (1 << 4)
#define L_PTE_WRITE (1 << 5)
#define L_PTE_EXEC (1 << 6)
#define L_PTE_DIRTY (1 << 7)
#define L_PTE_SHARED (1 << 10) /* shared(v6), coherent(xsc3) */
The address part and the status/permission part of a PTE together form a complete PTE:
/*
* Conversion functions: convert a page and protection to a page entry,
* and a page entry and page directory to the page they refer to.
*/
#define mk_pte(page,prot) pfn_pte(page_to_pfn(page),prot)
unsigned long page_to_pfn(struct page *page)
{
return __page_to_pfn(page);
}
#define __page_to_pfn(page) ((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)
#define ARCH_PFN_OFFSET PHYS_PFN_OFFSET
#define PHYS_PFN_OFFSET (PHYS_OFFSET >> PAGE_SHIFT)
#define PHYS_OFFSET UL(0x30000000)
#define PAGE_SHIFT 12
#define pfn_pte(pfn,prot) (__pte(((pfn) << PAGE_SHIFT) | pgprot_val(prot)))
#define __pte(x) ((pte_t) { (x) } )
mem_map points to an array of page structures; each page structure stands for one physical page, and the whole array covers all physical pages. To software, the high 20 bits of a PTE, after conversion, give a physical page frame number from which the corresponding page structure can be found; to the MMU, filling in the low bits with zeros gives the physical start address of the page. The address field of a PTE can thus be used to obtain the page structure of the corresponding physical page, as follows:
#define pte_page(pte) (pfn_to_page(pte_pfn(pte)))
#define pte_pfn(pte) (pte_val(pte) >> PAGE_SHIFT)
struct page *pfn_to_page(unsigned long pfn)
{
return __pfn_to_page(pfn);
}
#define __pfn_to_page(pfn) (mem_map + ((pfn) - ARCH_PFN_OFFSET))
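The arithmetic behind these conversions, with made-up example numbers and the PHYS_OFFSET/PAGE_SHIFT values quoted above:
#include <stdio.h>

int main(void)
{
    unsigned long phys_pfn_offset = 0x30000000UL >> 12;   /* PHYS_OFFSET >> PAGE_SHIFT = 0x30000 */
    unsigned long mem_map_index = 0x123;                  /* example index of a page in mem_map */

    unsigned long pfn = mem_map_index + phys_pfn_offset;  /* __page_to_pfn */
    unsigned long pte = (pfn << 12) | 0x55;               /* pfn_pte: address bits | protection bits */
    unsigned long idx = (pte >> 12) - phys_pfn_offset;    /* pte_pfn followed by __pfn_to_page */

    printf("pfn=0x%lx pte=0x%lx mem_map index=0x%lx\n", pfn, pte, idx);
    return 0;
}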
During mapping, the MMU first checks the present (P) flag; only if it is 1 does it complete the mapping, otherwise it raises a page fault, and in that case the rest of the PTE means nothing to the MMU. Software can also set and test the PTE contents, for example:
#define set_pte_at(mm,addr,ptep,pteval) do { \
set_pte_ext(ptep, pteval, (addr) >= TASK_SIZE ? 0 : PTE_EXT_NG); \
} while (0)
/*
* The following only work if pte_present() is true.
* Undefined behaviour if not..
*/
#define pte_present(pte) (pte_val(pte) & L_PTE_PRESENT)
#define pte_write(pte) (pte_val(pte) & L_PTE_WRITE)
#define pte_dirty(pte) (pte_val(pte) & L_PTE_DIRTY)
#define pte_young(pte) (pte_val(pte) & L_PTE_YOUNG)
To software, a PTE of 0 means no mapping has been established; a non-zero PTE with the present flag clear means a mapping was established but the physical page has been swapped out to the swap device.
Kernel code frequently needs to find the page structure of the physical page behind a given virtual address:
/*
* Conversion between a struct page and a physical address.
*
* Note: when converting an unknown physical address to a
* struct page, the resulting pointer must be validated
* using VALID_PAGE(). It must return an invalid struct page
* for any physical address not corresponding to a system
* RAM address.
*
* page_to_pfn(page) convert a struct page * to a PFN number
* pfn_to_page(pfn) convert a _valid_ PFN number to struct page *
* pfn_valid(pfn) indicates whether a PFN number is valid
*
* virt_to_page(k) convert a _valid_ virtual address to struct page *
* virt_addr_valid(k) indicates whether a virtual address is valid
*/
#define virt_to_page(kaddr) pfn_to_page(__pa(kaddr) >> PAGE_SHIFT)
The page structure that represents a physical page:
include/linux/mm_types.h
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
* moment. Note that we have no way to track which tasks are using
 * a page, though if it is a pagecache page, rmap structures can tell us
 * who is mapping it.
*/
struct page {
unsigned long flags; /* Atomic flags, some possibly
* updated asynchronously */
atomic_t _count; /* Usage count, see below. */
union {
atomic_t _mapcount; /* Count of ptes mapped in mms,
* to show when page is mapped
* & limit reverse map searches.
*/
unsigned int inuse; /* SLUB: Nr of objects */
};
union {
struct {
unsigned long private; /* Mapping-private opaque data:
* usually used for buffer_heads
* if PagePrivate set; used for
* swp_entry_t if PageSwapCache;
* indicates order in the buddy
* system if PG_buddy is set.
*/
struct address_space *mapping; /* If low bit clear, points to
* inode address_space, or NULL.
* If page mapped as anonymous
* memory, low bit is set, and
* it points to anon_vma object:
* see PAGE_MAPPING_ANON below.
*/
};
#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
spinlock_t ptl;
#endif
struct kmem_cache *slab; /* SLUB: Pointer to slab */
struct page *first_page; /* Compound tail pages */
};
union {
pgoff_t index; /* Our offset within mapping. */
void *freelist; /* SLUB: freelist req. slab lock */
};
struct list_head lru; /* Pageout list, eg. active_list
* protected by zone->lru_lock !
*/
/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
* highmem some memory is mapped into kernel virtual memory
* dynamically, so we need a place to store that address.
* Note that this field could be 16 bits on x86 ... ;)
*
* Architectures with slow multiplication can define
* WANT_PAGE_VIRTUAL in asm/page.h
*/
#if defined(WANT_PAGE_VIRTUAL)
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
unsigned long page_cgroup;
#endif
};
At system initialization, a page structure array, mem_map, is built according to the size of physical memory and serves as the warehouse of physical pages. The index of a physical page's entry in this array corresponds to its physical page number (offset by ARCH_PFN_OFFSET on ARM).
Because of hardware limitations, the Linux kernel cannot treat all pages alike: some pages sit at particular physical addresses and therefore cannot be used for certain tasks. Because of these constraints the kernel divides pages into zones. The physical pages in the warehouse are split into ZONE_DMA and ZONE_NORMAL (with ZONE_HIGHMEM used for physical memory beyond the directly mappable region, roughly above 896MB-1GB).
Kernel address space
The kernel address space is the 3GB-4GB part of the virtual address range; addresses generated and handled by kernel code lie in this interval. Suppose there are 4GB of physical memory: the first 0-896MB can be mapped directly into the kernel virtual address space, but what about the physical memory above 896MB? That is what high-memory management is for: that memory is not permanently mapped into the kernel address space, so the kernel cannot access it directly.
Two layouts are worth keeping in mind: the familiar virtual address space layout and the less commonly drawn physical memory layout. From the latter one can see that physical memory is allocated from low addresses upward, and the low 1GB of physical memory has special uses; for example, the mem_map mentioned above is placed there. Details follow in the part on physical memory management.
NUMA & Zone
Starting with 2.6, Linux supports the Non-Uniform Memory Access (NUMA) model: the time a CPU takes to access memory can differ depending on where that memory is.
The kernel also partitions memory into three zones: ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM. ZONE_DMA covers 0-16MB, ZONE_NORMAL covers 16-896MB, and ZONE_HIGHMEM covers everything above 896MB.
As the name suggests, ZONE_DMA is set aside for DMA and is generally used by hardware devices: some devices do not go through the memory-mapping mechanism but access memory directly, and they require the addresses not to be too high. DMA also requires the addresses to be physically contiguous once the data spans more than one page.
Pages usable for DMA are indispensable for I/O; if the pages in this zone are all given out, I/O becomes impossible.
The page frames in ZONE_NORMAL and ZONE_DMA can be accessed directly by the kernel (after the simple linear mapping, of course), whereas the page frames in ZONE_HIGHMEM cannot be reached through that same linear mapping; that is what is meant by high memory.
Of course, ZONE_HIGHMEM may not exist at all: a few years ago a PC typically had only 256MB or 512MB of RAM, so there was no high memory. And, as noted above, 64-bit machines have no high memory either.
The purpose of zoning is straightforward: with memory subdivided, different kinds of allocation requests can be served from different zones.
Each zone is described by a struct zone:
include/linux/mmzone.h
struct zone {
/* Fields commonly accessed by the page allocator */
unsigned long pages_min, pages_low, pages_high;
/*
* We don't know if the memory that we're going to allocate will be freeable
* or/and it will be released eventually, so to avoid totally wasting several
* GB of ram we must reserve some of the lower zone memory (otherwise we risk
* to run OOM on the lower zones despite there's tons of freeable ram
* on the higher zones). This array is recalculated at runtime if the
* sysctl_lowmem_reserve_ratio sysctl changes.
*/
unsigned long lowmem_reserve[MAX_NR_ZONES];
struct per_cpu_pageset pageset[NR_CPUS];
/*
* free areas of different sizes
*/
spinlock_t lock;
struct free_area free_area[MAX_ORDER];
#ifndef CONFIG_SPARSEMEM
/*
* Flags for a pageblock_nr_pages block. See pageblock-flags.h.
* In SPARSEMEM, this map is stored in struct mem_section
*/
unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */
ZONE_PADDING(_pad1_)
/* Fields commonly accessed by the page reclaim scanner */
spinlock_t lru_lock;
struct list_head active_list;
struct list_head inactive_list;
unsigned long nr_scan_active;
unsigned long nr_scan_inactive;
unsigned long pages_scanned; /* since last reclaim */
unsigned long flags; /* zone flags, see below */
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
/*
* prev_priority holds the scanning priority for this zone. It is
* defined as the scanning priority at which we achieved our reclaim
* target at the previous try_to_free_pages() or balance_pgdat()
* invokation.
*
* We use prev_priority as a measure of how much stress page reclaim is
* under - it drives the swappiness decision: whether to unmap mapped
* pages.
*
* Access to both this field is quite racy even on uniprocessor. But
* it is expected to average out OK.
*/
int prev_priority;
ZONE_PADDING(_pad2_)
/* Rarely used or read-mostly fields */
/*
* wait_table -- the array holding the hash table
* wait_table_hash_nr_entries -- the size of the hash table array
* wait_table_bits -- wait_table_size == (1 << wait_table_bits)
*
* The purpose of all these is to keep track of the people
* waiting for a page to become available and make them
* runnable again when possible. The trouble is that this
* consumes a lot of space, especially when so few things
* wait on pages at a given time. So instead of using
* per-page waitqueues, we use a waitqueue hash table.
*
* The bucket discipline is to sleep on the same queue when
* colliding and wake all in that wait queue when removing.
* When something wakes, it must check to be sure its page is
* truly available, a la thundering herd. The cost of a
* collision is great, but given the expected load of the
* table, they should be so rare as to be outweighed by the
* benefits from the saved space.
*
* __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
* primary users of these fields, and in mm/page_alloc.c
* free_area_init_core() performs the initialization of them.
*/
wait_queue_head_t * wait_table;
unsigned long wait_table_hash_nr_entries;
unsigned long wait_table_bits;
/*
* Discontig memory support fields.
*/
struct pglist_data *zone_pgdat;
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn;
/*
* zone_start_pfn, spanned_pages and present_pages are all
* protected by span_seqlock. It is a seqlock because it has
* to be read outside of zone->lock, and it is done in the main
* allocator path. But, it is written quite infrequently.
*
* The lock is declared along with zone->lock because it is
* frequently read in proximity to zone->lock. It's good to
* give them a chance of being in the same cacheline.
*/
unsigned long spanned_pages; /* total size, including holes */
unsigned long present_pages; /* amount of memory (excluding holes) */
/*
* rarely used fields:
*/
const char *name;
} ____cacheline_internodealigned_in_smp;
Memory is allocated in blocks, so each zone keeps an array of free-block lists; a block of order n contains 2^n pages, with n running from 0 to MAX_ORDER-1 (#define MAX_ORDER 11).
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
unsigned long nr_free;
};
Kernel mapping of high-memory page frames
So far we have seen that Linux divides physical memory into page frames, builds the NUMA concept on top, and partitions memory into zones, while providing virtual memory as an abstraction layer. The kernel address space is only 1GB and can map only about 1GB of physical memory; memory above 896MB is high memory and needs special handling.
The kernel must be able to reach all physical memory, because something as important as physical memory allocation is not done by user programs but goes through system calls into the kernel. So how exactly does the kernel access high memory?
Kernel mapping of low memory
Before looking at high memory, let us look closely at how "low memory" is accessed by the kernel. On a 32-bit machine, high memory is the physical memory above 896MB; low memory is the physical memory below 896MB, i.e. ZONE_DMA + ZONE_NORMAL.
As mentioned in the discussion of paging, a process's linear address (that is, virtual address) is split into three parts; the first part indexes the page global directory and has 10 bits, i.e. 1024 entries. The first 768 entries of the page global directory (covering 3GB of virtual address space, the user-mode portion) correspond to virtual addresses below 0xc0000000. The remaining entries are the same for every process: they equal the corresponding entries of the master kernel page global directory.
The master kernel page global directory is a kernel-side concept (I am not sure how best to explain it): as noted earlier, the kernel also uses virtual addresses and maintains a set of page tables for its own use, and those page tables hang off the master kernel page global directory.
The final mapping provided by the kernel page tables must translate virtual addresses starting at 0xc0000000 into physical addresses starting at 0. So the 896MB of virtual address space starting at 0xc0000000 is linearly mapped onto the first 896MB of physical memory, where linear mapping means: virtual address = 0xc0000000 + physical address.
896MB
You may well wonder: we said the top 1GB of the virtual address space belongs to the kernel, so why does high memory start at 896MB? With a simple linear mapping (adding or subtracting an offset), 1GB of virtual addresses could surely map 1GB of physical addresses?
The reason is that the top 128MB of virtual addresses is reserved for other mapping mechanisms (discussed below; one of them, as you might guess, is high-memory management): 1GB - 128MB = 896MB. That is why.
Three mapping mechanisms
Page frames above the 896MB boundary are not mapped into the fourth gigabyte of the kernel linear address space, so the kernel cannot access them directly. This means the page allocator functions that return the linear address of the allocated page frame do not work for high memory. High-memory page frames can only be allocated through alloc_pages/alloc_page; these functions do not return the virtual address of the first allocated page frame, because if that frame belongs to high memory such a virtual address simply does not exist. Instead they return the struct page * of the first allocated page frame.
Part of the last 128MB of the kernel virtual address space is dedicated to mapping high-memory page frames. This mapping is of course not fixed: that slice of virtual addresses can be reused over and over, so that all of high memory can be accessed at different times.
The kernel provides three mechanisms for mapping page frames in high memory: permanent kernel mapping, temporary kernel mapping, and noncontiguous memory allocation.
None of these techniques can make the whole of RAM addressable at the same time; after all, only 128MB of virtual addresses is left for mapping high memory.
The layout of the kernel virtual address space reflects the high-memory mapping mechanisms described here.
Permanent mapping
Permanent kernel mappings allow the kernel to establish long-lasting mappings of high-memory page frames into the kernel (virtual) address space. They use a dedicated page table within the master kernel page tables, whose address is stored in the pkmap_page_table variable. The page table has 512 or 1024 entries (depending on whether PAE is active), so the kernel can access at most 2MB or 4MB of high memory through it at a time.
kmap and kunmap establish and tear down permanent kernel mappings of high memory; in essence they associate, and dissociate, high-memory page frames with that limited pool of kernel virtual addresses, as sketched below.
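A hedged usage sketch of the permanent mapping interface (kernel-module style; demo_kmap_use() is illustrative and error handling is trimmed):
#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/string.h>

/* Map a (possibly high-memory) page long enough to zero it. */
static void demo_kmap_use(void)
{
    struct page *page = alloc_page(GFP_HIGHUSER);   /* may come from ZONE_HIGHMEM */
    void *vaddr;

    if (!page)
        return;
    vaddr = kmap(page);              /* establish a permanent kernel mapping; may sleep */
    memset(vaddr, 0, PAGE_SIZE);
    kunmap(page);                    /* drop the mapping */
    __free_page(page);
}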
Temporary mapping
Linux reserves a small number of page table entries in the kernel address space for temporary mappings of high memory. The kernel uses kmap_atomic and kunmap_atomic to establish and tear down temporary mappings.
Temporary mappings exist so that interrupt handlers and similar contexts have a way to reach high memory: a temporary mapping never blocks the current process, whereas a permanent mapping may block and therefore cannot be used in interrupt handlers. A sketch follows.
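A corresponding sketch for the temporary (atomic) mapping; in kernels of this era kmap_atomic() takes a km_type slot argument such as KM_USER0 (newer kernels drop it), and demo_copy_into_page() is illustrative.
#include <linux/highmem.h>
#include <linux/string.h>

/* Copy one page of data while holding a temporary mapping; the caller
 * must not sleep between kmap_atomic() and kunmap_atomic(). */
static void demo_copy_into_page(struct page *dst, const void *src)
{
    void *vaddr = kmap_atomic(dst, KM_USER0);   /* per-CPU slot, never sleeps */

    memcpy(vaddr, src, PAGE_SIZE);
    kunmap_atomic(vaddr, KM_USER0);
}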
Noncontiguous memory area management
Noncontiguous memory management means accessing physically noncontiguous page frames through contiguous kernel virtual addresses.
The kernel describes each noncontiguous physical memory area with a vm_struct structure; all vm_structs form a linked list whose first element is kept in the vmlist variable. Calling vmalloc() gives the kernel a noncontiguous memory area: its main job is to allocate a contiguous range of kernel virtual addresses and then allocate a set of noncontiguous page frames to map into those addresses.
A crucial further step is to update the page tables used by the kernel so that each page frame allocated to the noncontiguous area now corresponds to a virtual address. This is done through map_vm_area().
The kernel virtual address range reserved for noncontiguous memory areas starts at VMALLOC_START and ends at VMALLOC_END, as in the sketch below.
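A usage sketch of the noncontiguous allocation interface (the buffer size is arbitrary):
#include <linux/errno.h>
#include <linux/vmalloc.h>

/* The buffer is virtually contiguous within VMALLOC_START..VMALLOC_END,
 * but its page frames need not be physically contiguous. */
static int demo_vmalloc_use(void)
{
    char *buf = vmalloc(128 * 1024);

    if (!buf)
        return -ENOMEM;
    buf[0] = 1;          /* uses the page table entries set up via map_vm_area() */
    vfree(buf);
    return 0;
}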
Memory allocation and the buddy algorithm
By now we have a rough picture of how low memory and high memory are mapped into the kernel virtual address space. Many details, such as page table bits that encode the content type of an entry or whether it is in use, are skipped here, but that does not prevent us from understanding the outline of kernel memory management.
Memory allocation requests are handled by a kernel subsystem called the zoned page frame allocator. Its main components are described below.
The "zone allocator" in it accepts requests to allocate and free dynamic memory, where dynamic memory means all memory except what is permanently occupied (such as the memory holding kernel code and data). The allocator searches ZONE_HIGHMEM, ZONE_NORMAL and ZONE_DMA for a zone that can satisfy the request and then allocates within that zone; the usual search order is HIGHMEM > NORMAL > DMA.
Within each memory zone, the actual allocation is done by the module called the "buddy system". The kernel needs a robust and efficient strategy for allocating groups of contiguous page frames, one that copes with the famous memory management problem of external fragmentation. The algorithm Linux uses is the well-known buddy system algorithm.
The buddy algorithm can be described as follows:
The kernel groups all free page frames into 11 block lists, holding blocks of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 contiguous page frames respectively.
Suppose a block of 256 page frames (1MB) is requested. The buddy algorithm first checks the 256-frame list for a free block; if one exists it is allocated and returned. If not, the algorithm looks in the next larger list, the 512-frame list. If a block is found there, the kernel splits it in two: one half is given to the requester and the other half of 256 frames is inserted into the 256-frame list. This continues recursively; if even the 1024-frame list cannot satisfy the request, the algorithm gives up and signals an error.
The reverse of this process is the buddy algorithm's reclaim path: when page frames are freed, the algorithm inserts the block into the corresponding list and then checks whether that list contains an adjacent free block (its buddy); if so, the two buddies are merged and inserted into the next larger list.
That is the buddy algorithm in brief, and it is quite easy to understand; details may be added later. A usage sketch follows.
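A hedged sketch of requesting page frames from the buddy system; the order and GFP flags are illustrative.
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/string.h>

/* Ask the buddy system for 2^3 = 8 physically contiguous page frames,
 * then return them. */
static void demo_buddy_alloc(void)
{
    unsigned int order = 3;
    struct page *pages = alloc_pages(GFP_KERNEL, order);   /* NORMAL first, then DMA */

    if (!pages)
        return;
    /* For low-memory pages, page_address() yields the kernel virtual address. */
    memset(page_address(pages), 0, PAGE_SIZE << order);
    __free_pages(pages, order);
}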
Caches and the slab allocator
The buddy algorithm is suited to requests for large chunks of memory; for small requests it is not appropriate. For example, frequently allocating and freeing objects of a few tens or hundreds of bytes through the buddy algorithm would produce a lot of internal fragmentation.
This is where caches and the slab allocator come in.
The kernel introduces the notion of a cache: a cache can be created for each type of data. A cache contains slabs, each slab consists of one or more contiguous page frames, and objects live inside the slabs. Take the inode cache: when a new inode structure is needed, the kernel simply takes a valid inode object from a slab, and when it is done it sets a flag and puts the object back into the slab without releasing the memory. This saves a great deal of allocation and freeing time, and several objects can share one page frame, which reduces internal fragmentation. A sketch of the interface follows.
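A hedged sketch of a dedicated slab cache; struct demo_object and the cache name are made up, and the constructor argument of kmem_cache_create() differs slightly between kernel versions (NULL is passed here to stay version neutral).
#include <linux/errno.h>
#include <linux/slab.h>

struct demo_object {                /* made-up object type */
    int id;
    char name[32];
};

static struct kmem_cache *demo_cache;

static int demo_slab_use(void)
{
    struct demo_object *obj;

    demo_cache = kmem_cache_create("demo_object", sizeof(struct demo_object),
                                   0, SLAB_HWCACHE_ALIGN, NULL);
    if (!demo_cache)
        return -ENOMEM;

    obj = kmem_cache_alloc(demo_cache, GFP_KERNEL);   /* take one object from a slab */
    if (obj)
        kmem_cache_free(demo_cache, obj);             /* back onto the slab, not to the buddy system */

    kmem_cache_destroy(demo_cache);
    return 0;
}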
The slab allocation mechanism is also summarized in the "linux physical memory management" article, which has a good diagram. Again, the details are not discussed here; we only take a look at the well-known slab coloring mechanism.
A given hardware cache line can map several different blocks of RAM, and objects of the same size tend to sit at the same offset within their caches. Objects at the same offset in different slabs are therefore likely to end up in the same hardware cache line, so the cache hardware may waste memory cycles shuttling two objects back and forth between the same cache line and RAM while other cache lines sit underused. Slab coloring exists to reduce this unpleasant cache behavior as much as possible.
Slab coloring assigns different random numbers, called colors, to slabs, thereby influencing cache behavior. (The coloring part is taken directly from the source; I have not tried to understand it further.)
Reserved page frame pool and "memory pools"
These two are not the same concept, but both exist for special purposes. Only a very limited understanding of them is attempted here.
The kernel keeps a small "pool of reserved page frames" that can only be used to satisfy atomic memory allocation requests issued by interrupt handlers or code inside critical regions.
A memory pool is a reserve of dynamic memory that can be used only by a specific kernel component (the owner of the pool). The owner normally does not touch the reserve; only when dynamic memory becomes so scarce that all ordinary allocation requests would fail is it used, as a last resort. A sketch follows.
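A hedged sketch of a memory pool owned by one kernel component; mempool_create_kmalloc_pool() backs the pool with kmalloc, and the element size and count are arbitrary.
#include <linux/errno.h>
#include <linux/mempool.h>
#include <linux/slab.h>

/* Reserve 16 emergency elements of 256 bytes each. */
static mempool_t *demo_pool;

static int demo_mempool_use(void)
{
    void *elem;

    demo_pool = mempool_create_kmalloc_pool(16, 256);
    if (!demo_pool)
        return -ENOMEM;

    /* Normally satisfied by kmalloc(); the reserve is tapped only when
     * ordinary allocation fails. */
    elem = mempool_alloc(demo_pool, GFP_ATOMIC);
    if (elem)
        mempool_free(elem, demo_pool);

    mempool_destroy(demo_pool);
    return 0;
}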