为什么需要引导内存分配器?
伙伴系统等 Linux 内存分配器需要内核完成初始化并建立相关内核数据结构后才能够正常工作。因此在启动时需要引导内存分配器(boot memory allocator)来实现简单的内存分配和释放操作,直到构建起完整的页分配机制(page allocator)。
在 Linux 1.0 时代,最早期的内存初始化非常简单,仅用一个全局变量 `memory_start` 记录空闲内存的起始地址,有内存申请时把 `memory_start` 增加相应大小即可。
In the early days, Linux didn’t have an early memory allocator; in the 1.0 kernel, memory initialization was not as robust and versatile as it is today. Every subsystem initialization call, or simply any function called from start_kernel(), had access to the starting address of the single block of free memory via the global memory_start variable. If a function needed to allocate memory, it just increased memory_start by the desired amount.
—— A quick history of early-boot memory allocators
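下面用一小段示意代码还原这种最原始的"指针递增"式分配(仅为示意,并非 1.0 内核源码,`early_alloc` 这个函数名是为演示虚构的):

// 示意:Linux 1.0 时代的分配方式(early_alloc 为虚构名,非内核原文)
static unsigned long memory_start;  // 全局变量,指向空闲内存的起始地址

static void *early_alloc(unsigned long size)
{
    void *p = (void *)memory_start;
    memory_start += size;   // 分配即前移指针;没有对应的释放操作
    return p;
}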
随着 Linux 支持的硬件架构逐渐多样和复杂化,这种原始的方式无法满足需要。因此在内核 2.3.23 版本开始加入 bootmem 作为引导内存分配器。
Bootmem
Bootmem 使用 bitmap(位图)记录页面使用状况:一个比特位对应一个页框,置 1 表示该页框已被分配,置 0 表示空闲。
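以 128 MiB 内存、4 KiB 页为例:共 32768 个页框,位图只需 32768 / 8 = 4 KiB。内核用 `bootmap_bytes` 计算位图所需的字节数,其实现大致如下(取自相近版本的 mm/bootmem.c,细节以所读内核版本为准):

// mm/bootmem.c
static unsigned long __init bootmap_bytes(unsigned long pages)
{
    unsigned long bytes = DIV_ROUND_UP(pages, 8); // 每 8 个页框占 1 字节
    return ALIGN(bytes, sizeof(long));            // 向上对齐到 long 的整数倍
}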
bootmem 用 `bdata_list` 链表来管理内存,`bdata_list` 由 `bootmem_data` 节点构成,每个节点管理若干物理页框,节点结构如下:
// include/linux/bootmem.h
typedef struct bootmem_data {
    unsigned long node_min_pfn;
    unsigned long node_low_pfn;
    void *node_bootmem_map;
    unsigned long last_end_off;
    unsigned long hint_idx;
    struct list_head list;
} bootmem_data_t;
- `node_min_pfn`/`node_low_pfn`:该节点所管理内存的起始和终止页框号。
- `node_bootmem_map`:存放 bootmem 位图的地址(位图的按位操作见下方示意代码)。
- `last_end_off`:上一次分配的结束位置在页内的偏移量;为 0 表示上一次分配的最后一页被完全占用。
- `hint_idx`:上一次分配结束位置之后那个页框的索引,下一次分配可以从这里开始查找。
- `list`:把该节点链入 `bdata_list`。
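下面用一段简化的示意代码展示位图的基本操作:把一段页框标记为已分配(仅为示意,非内核原文;`reserve_range` 为虚构的函数名,真实内核中对应 mm/bootmem.c 里的 __reserve):

// 简化示意(非内核原文):把 [start_pfn, end_pfn) 对应的比特位全部置 1
#define BITS_PER_LONG (8 * sizeof(unsigned long))

static void reserve_range(bootmem_data_t *bdata,
                          unsigned long start_pfn, unsigned long end_pfn)
{
    unsigned long *map = bdata->node_bootmem_map;
    unsigned long pfn;

    for (pfn = start_pfn; pfn < end_pfn; pfn++) {
        // 页框号减去节点起始页框号,得到位图内的比特索引
        unsigned long idx = pfn - bdata->node_min_pfn;
        map[idx / BITS_PER_LONG] |= 1UL << (idx % BITS_PER_LONG);
    }
}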
启动时,bootmem 通过 `init_bootmem` 初始化 boot memory;`init_bootmem` 会调用 `init_bootmem_core`,把页框 0 ~ `pages` 登记为该节点管理的范围(注意是登记而非分配,此时位图全部置 1,空闲内存还要由 setup_arch 显式登记):
/**
 * init_bootmem - register boot memory
 * @start: pfn where the bitmap is to be placed
 * @pages: number of available physical pages
 *
 * Returns the number of bytes needed to hold the bitmap.
 */
unsigned long __init init_bootmem(unsigned long start, unsigned long pages)
{
    max_low_pfn = pages;
    min_low_pfn = start;
    return init_bootmem_core(NODE_DATA(0)->bdata, start, 0, pages);
}
/*
 * Called once to set up the allocator itself.
 */
static unsigned long __init init_bootmem_core(bootmem_data_t *bdata,
        unsigned long mapstart, unsigned long start, unsigned long end)
{
    unsigned long mapsize;

    mminit_validate_memmodel_limits(&start, &end);
    // 获得 bitmap 的虚拟地址
    bdata->node_bootmem_map = phys_to_virt(PFN_PHYS(mapstart));
    // 设置管理的页框范围
    bdata->node_min_pfn = start;
    bdata->node_low_pfn = end;
    // 把 bdata 节点链接到 bdata_list
    link_bootmem(bdata);

    /*
     * Initially all pages are reserved - setup_arch() has to
     * register free RAM areas explicitly.
     */
    // 计算 bitmap 大小
    mapsize = bootmap_bytes(end - start);
    // 初始化 bitmap
    memset(bdata->node_bootmem_map, 0xff, mapsize);

    bdebug("nid=%td start=%lx map=%lx end=%lx mapsize=%lx\n",
        bdata - bootmem_node_data, start, mapstart, end, mapsize);

    return mapsize;
}
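初始化完成后,体系结构代码再通过 `free_bootmem`/`reserve_bootmem` 登记空闲与保留内存,随后内核各处就可以用 `alloc_bootmem` 系列接口申请启动期内存。一个示意性的调用流程如下(`setup_memory_example` 以及各地址、大小均为虚构,仅演示接口用法):

// 示意:bootmem 的典型使用流程(函数名与数值均为虚构)
static void __init setup_memory_example(void)
{
    unsigned long map_pfn = 0x100;   // 假设位图放在该页框(虚构值)
    unsigned long max_pfn = 0x8000;  // 假设共 0x8000 个页框,即 128MiB
    void *buf;

    init_bootmem(map_pfn, max_pfn);              // 注册节点,位图全部置 1
    free_bootmem(0, 64 * 1024 * 1024);           // 显式登记 64MiB 空闲内存(对应位清 0)
    reserve_bootmem(0, 1024 * 1024, BOOTMEM_DEFAULT); // 再保留低端 1MiB(重新置 1)

    buf = alloc_bootmem(PAGE_SIZE);              // 从位图中找到空闲页并分配一页
}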
Memblock
bootmem 的主要缺陷在于位图需要预先静态分配并初始化:创建位图时必须先明确物理内存配置,以此决定位图要多大,还需要找到一块足够大的连续物理内存来存放位图;而且内存越大,位图本身的开销也越大。
到了 Linux 4.x 时代,memblock 逐渐取代 bootmem 成为新的引导内存分配器;自 4.20 版本起,bootmem 被彻底移除,所有体系结构都只使用 memblock。
memblock 用 region 动态数组来管理物理内存,支持动态添加 region。由于初始的 regions 数组是一个静态数组,足以容纳最早的内存注册与分配,所以 memblock 一经定义即可使用,不需要额外的初始化步骤。
The major drawback of bootmem is the bitmap initialization. To create this bitmap, it is necessary to know the physical memory configuration. What is the correct size of the bitmap? Which memory bank has enough contiguous physical memory to store the bitmap? And, of course, as memory sizes increase so does the bootmem bitmap. For a system with 32GB of RAM, the bitmap will require 1MB of that memory. Memblock, on the other hand, can be used immediately as it is based on static arrays large enough to accommodate, at least, the very first memory registrations and allocations.
—— A quick history of early-boot memory allocators
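"足够大的静态数组"落实到代码上,就是 mm/memblock.c 里对 `memblock` 变量的静态初始化,大致如下(节选,字段和宏以相近版本内核为准,结构体定义见下文):

// mm/memblock.c(节选,细节随内核版本略有不同)
static struct memblock_region memblock_memory_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;
static struct memblock_region memblock_reserved_init_regions[INIT_MEMBLOCK_REGIONS] __initdata_memblock;

struct memblock memblock __initdata_memblock = {
    .memory.regions   = memblock_memory_init_regions,
    .memory.cnt       = 1,    /* empty dummy entry */
    .memory.max       = INIT_MEMBLOCK_REGIONS,  // 默认 128 个
    .reserved.regions = memblock_reserved_init_regions,
    .reserved.cnt     = 1,
    .reserved.max     = INIT_MEMBLOCK_REGIONS,
    .bottom_up        = false,
    .current_limit    = MEMBLOCK_ALLOC_ANYWHERE,
};

注意 `cnt` 初始为 1 而 `regions[0].size` 为 0,这个"哑元"正对应后文 `memblock_add_range` 中对空数组的特判。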
memblock 的整体数据结构主要由 `memblock`、`memblock_type` 和 `memblock_region` 三级构成:`memblock` 按用途分成若干 `memblock_type`,每个 `memblock_type` 管理一个 `memblock_region` 数组。相关定义如下:
// include/linux/memblock.h
struct memblock {
    bool bottom_up; /* is bottom up direction? */ // true 代表自低地址向高地址分配
    phys_addr_t current_limit; // 可分配内存的上限
    struct memblock_type memory;   // 可用内存
    struct memblock_type reserved; // 预留内存(不可被分配)
#ifdef CONFIG_HAVE_MEMBLOCK_PHYS_MAP
    struct memblock_type physmem;
#endif
};

struct memblock_type {
    unsigned long cnt;      /* number of regions */
    unsigned long max;      /* size of the allocated array */
    phys_addr_t total_size; /* size of all regions */
    struct memblock_region *regions; // region 数组
    char *name;
};

struct memblock_region {
    phys_addr_t base;    // 区块起始地址
    phys_addr_t size;    // 区块大小
    unsigned long flags; // 区块属性标志
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
    int nid;             // NUMA 节点号
#endif
};

enum {
    MEMBLOCK_NONE    = 0x0, /* No special request */
    MEMBLOCK_HOTPLUG = 0x1, /* hotpluggable region */
    MEMBLOCK_MIRROR  = 0x2, /* mirrored region */
    MEMBLOCK_NOMAP   = 0x4, /* don't add to kernel direct mapping */
};
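这些 flags 一般不由调用者在添加 region 时直接传入(`memblock_add` 传入的就是 0),而是事后通过 `memblock_mark_*` 一类接口标记,例如(接口名以相近版本内核为准,地址为虚构值):

// 示意:把一段内存标记为可热插拔 / 不建立线性映射(地址为虚构值)
memblock_mark_hotplug(0x80000000, 0x10000000); // 置 MEMBLOCK_HOTPLUG
memblock_mark_nomap(0x90000000, 0x100000);     // 置 MEMBLOCK_NOMAP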
`memblock_add` 函数把目标内存区域添加到可用内存的集合中,主要通过 `memblock_add_range` 实现:将从 `base` 开始、大小为 `size` 的可用内存区域添加到 `memblock.memory` 这个 `memblock_type` 中。
// mm/memblock.c
int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size)
{
    phys_addr_t end = base + size - 1;

    memblock_dbg("memblock_add: [%pa-%pa] %pF\n",
             &base, &end, (void *)_RET_IP_);

    return memblock_add_range(&memblock.memory, base, size, MAX_NUMNODES, 0);
}
在 `memblock_add_range` 的流程中,如果目标内存区域与已添加的区域有重叠,就切割出不重叠的部分作为新 region 插入。具体源码实现中,`repeat:` 之后的代码会执行两遍:第一遍 `insert` 为 `false`,只统计要插入的 region 数量,并在必要时扩容 `regions` 数组;第二遍 `insert` 为 `true`,真正执行插入,最后合并相邻的 region。
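举个例子更直观(地址均为虚构值,仅演示切割与合并的效果):

/* 假设 memblock.memory 中已有 region [0x1000, 0x2000),
 * 此时添加一个与之重叠的区间 [0x0, 0x3000): */
memblock_add(0x0, 0x3000);
/* 第一遍(insert == false)统计出需要插入两个新 region:
 *   [0x0, 0x1000) 和 [0x2000, 0x3000)
 * 第二遍(insert == true)真正插入后,memblock_merge_regions()
 * 把三个相邻且属性相同的 region 合并成一个 [0x0, 0x3000)。 */

完整源码如下: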
// mm/memblock.c
/**
 * memblock_add_range - add new memblock region
 * @type: memblock type to add new region into
 * @base: base address of the new region
 * @size: size of the new region
 * @nid: nid of the new region
 * @flags: flags of the new region
 *
 * Add new memblock region [@base,@base+@size) into @type. The new region
 * is allowed to overlap with existing ones - overlaps don't affect already
 * existing regions. @type is guaranteed to be minimal (all neighbouring
 * compatible regions are merged) after the addition.
 *
 * RETURNS:
 * 0 on success, -errno on failure.
 */
int __init_memblock memblock_add_range(struct memblock_type *type,
                phys_addr_t base, phys_addr_t size,
                int nid, unsigned long flags)
{
    bool insert = false;
    phys_addr_t obase = base;
    phys_addr_t end = base + memblock_cap_size(base, &size);
    int idx, nr_new;
    struct memblock_region *rgn;

    if (!size)
        return 0;

    // regions 动态数组为空(只有一个 size 为 0 的哑元)
    /* special case for empty array */
    if (type->regions[0].size == 0) {
        WARN_ON(type->cnt != 1 || type->total_size);
        type->regions[0].base = base;
        type->regions[0].size = size;
        type->regions[0].flags = flags;
        memblock_set_region_node(&type->regions[0], nid);
        type->total_size = size;
        return 0;
    }
repeat:
    // 第一遍执行时 insert 为 false,只统计要插入的 region 数量并按需扩容 regions 数组;
    // 第二遍执行时 insert 为 true,真正执行插入操作。
    /*
     * The following is executed twice. Once with %false @insert and
     * then with %true. The first counts the number of regions needed
     * to accommodate the new area. The second actually inserts them.
     */
    base = obase;
    nr_new = 0;

    for_each_memblock_type(idx, type, rgn) {
        phys_addr_t rbase = rgn->base;
        phys_addr_t rend = rbase + rgn->size;

        if (rbase >= end) // 新区块 end 在当前遍历区块 base 之前,后面不会再有重叠
            break;
        if (rend <= base) // 新区块 base 在当前遍历区块 end 之后,跳过
            continue;
        /*
         * @rgn overlaps. If it separates the lower part of new
         * area, insert that portion.
         */
        // 新区块 base 在当前区块 base 之前,新区块 end 在当前区块 base 之后:
        // 需要把新区块 base ~ 当前区块 base 这一段作为新 region 插入
        if (rbase > base) {
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
            WARN_ON(nid != memblock_get_region_node(rgn));
#endif
            WARN_ON(flags != rgn->flags);
            nr_new++;
            // 第二遍时真正执行插入操作
            if (insert)
                memblock_insert_region(type, idx++, base,
                               rbase - base, nid,
                               flags);
        }
        /* area below @rend is dealt with, forget about it */
        base = min(rend, end);
    }

    /* insert the remaining portion */
    if (base < end) {
        nr_new++;
        if (insert)
            memblock_insert_region(type, idx, base, end - base,
                           nid, flags);
    }

    if (!nr_new)
        return 0;

    /*
     * If this was the first round, resize array and repeat for actual
     * insertions; otherwise, merge and return.
     */
    // 第一遍结束:按需扩容 regions 数组,然后跳回 repeat 做真正的插入
    if (!insert) {
        while (type->cnt + nr_new > type->max)
            if (memblock_double_array(type, obase, size) < 0)
                return -ENOMEM;
        insert = true;
        goto repeat;
    } else {
        memblock_merge_regions(type);
        return 0;
    }
}
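最后一步的 `memblock_merge_regions` 负责把相邻且兼容的 region 合并,保证数组最小化:只有两个相邻 region 物理上连续、NUMA 节点相同且 flags 相同时才会合并。其实现大致如下(取自相近版本的 mm/memblock.c,细节以所读内核版本为准):

// mm/memblock.c
static void __init_memblock memblock_merge_regions(struct memblock_type *type)
{
    int i = 0;

    /* cnt never goes below 1 */
    while (i < type->cnt - 1) {
        struct memblock_region *this = &type->regions[i];
        struct memblock_region *next = &type->regions[i + 1];

        // 物理不连续、NUMA 节点不同或 flags 不同的 region 不能合并
        if (this->base + this->size != next->base ||
            memblock_get_region_node(this) !=
            memblock_get_region_node(next) ||
            this->flags != next->flags) {
            BUG_ON(this->base + this->size > next->base);
            i++;
            continue;
        }

        this->size += next->size;
        /* move forward from next + 1, index of which is i + 2 */
        memmove(next, next + 1, (type->cnt - (i + 2)) * sizeof(*next));
        type->cnt--;
    }
}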
参考
linux/mm at master · torvalds/linux (github.com)
A quick history of early-boot memory allocators (lwn.net)
详解linux引导内存分配器bootmem简介 - 知乎 (zhihu.com)
memblock 内存分配器原理和代码分析 - 泰晓科技 (tinylab.org)
内存管理 | Bootmem机制和Memblock机制 - 知乎 (zhihu.com)
linux内核那些事之early boot memory-membloc - 知乎 (zhihu.com)
深入解析Linux内存管理:探索三大分配器 - 知乎 (zhihu.com)