KVM Virtualization Internals: Boot Process and How Each Component Is Virtualized


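This article covers the basic concepts and architecture of KVM virtualization, focusing on how KVM uses the Intel VT-x or AMD-V extensions to run virtual machines and how it relates to QEMU. It also walks through the VM boot process and memory management.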

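KVM Virtualization Internals: Overview

Tags (space-separated): KVM

Before we start

This series does not cover the basic installation and operation of KVM and QEMU; I assume readers have some hands-on KVM experience. I hope the series gives a clear, intuitive picture of what happens underneath KVM. I have not read the whole KVM source: part of the content comes from books and other material, part from practice, and part from my own understanding, so there are bound to be misunderstandings. Discussion and corrections are welcome. These articles represent only my personal views and practice, not those of my employer.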

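Introduction to KVM virtualization

KVM stands for Kernel-based Virtual Machine. It was started by Qumranet, which Red Hat acquired in 2008.
KVM builds on the hardware virtualization support provided by Intel VT-x or AMD-V: an ordinary Linux process runs guest instructions in the CPU's virtualized (guest) mode, with KVM acting as the virtual machine monitor and providing the virtual CPU. KVM itself does not emulate hardware devices; I/O and device emulation are delegated to QEMU.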

image_1apkng413147jg181jlq1j3kjhsm.png-16.4kB

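KVM has the following characteristics:

The guest runs on the host as an ordinary process
Each guest CPU (vCPU) is a thread of that process and is scheduled by the host kernel
The guest inherits some properties of the host kernel, such as huge pages
The guest's disk I/O and network I/O are affected by the host's settings
The guest connects to the outside world through a virtual bridge on the host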

Overall KVM architecture

image_1apknjokv7q91f5cif312h6jpu13.png-122.4kB

Each virtual machine (guest) appears on the host as a QEMU process, the emulation process.
After creating a VM, we can see it with the ordinary ps command.

➜  ~ virsh list --all
 Id    Name                           State
----------------------------------------------------
 1     kvm-01                         running

➜  ~ ps aux | grep qemu
libvirt+ 20308 15.1  7.5 5023928 595884 ?      Sl   17:29   0:10 /usr/bin/qemu-system-x86_64 -name kvm-01 -S -machine pc-i440fx-wily,accel=kvm,usb=off -m 2048 -realtime mlock=off -smp 2 qemu ....

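As you can see, the VM is an ordinary Linux process with its own pid. Here it has four threads. The number is not fixed, but there are at least three kinds (vCPU, I/O, signal): in this case two vCPU threads, one I/O thread and one signal-handling thread.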

➜  ~ pstree -p 20308
qemu-system-x86(20308)-+-{qemu-system-x86}(20353)
                       |-{qemu-system-x86}(20408)
                       |-{qemu-system-x86}(20409)
                       |-{qemu-system-x86}(20412)

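Virtual CPU

Guest instructions, user-level ones included, are executed directly by a host thread. The thread uses the ioctl interface exposed by KVM to set up the guest and run it in a special CPU mode, with no software translation of the instruction set. This greatly reduces virtualization overhead and is one of KVM's advantages over other approaches.

KVM exposes a virtual device, /dev/kvm, which is driven through ioctl (the out-of-band control interface for devices). The ioctls cover VM initialization, memory allocation, loading instructions and so on.

Virtual I/O devices

The guest exists as a process, and its kernel still contains all its drivers; only the hardware is emulated by QEMU (virtio, introduced later, is a special case). All hardware operations of the guest are taken over by QEMU, which deals with the real host hardware on its behalf.

Virtual memory

Guest memory is provided by the emulator on the host. To the emulator, the memory the guest accesses is part of its own virtual address space. The guest first translates a virtual address to a guest physical address, which is in effect an emulator virtual address; the host then translates that to a host physical address. The optimizations are described later; this is only an overview.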

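VM boot process

Step 1: obtain the KVM handle.
kvmfd = open("/dev/kvm", O_RDWR);
Step 2: create the VM and obtain its handle.
vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
Step 3: map memory for the VM (this ioctl is issued on the VM handle, not on the /dev/kvm handle), and initialize other things such as PCI and signal handling.
ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &mem);
Step 4: map the VM image into memory. This corresponds to the boot process of a physical machine, which loads the image into memory.
Step 5: create the vCPUs (also on the VM handle) and allocate memory for them.
vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, vcpuid);
vcpu->kvm_run_mmap_size = ioctl(kvm->dev_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
Step 6: create one thread per vCPU and run the VM.
ioctl(kvm->vcpus->vcpu_fd, KVM_RUN, 0);
Step 7: each thread enters a loop, catches the reason the VM exited, and handles it.
An exit here does not necessarily mean the VM has shut down. The VM also exits when it performs I/O, accesses a hardware device, takes certain page faults and so on. An exit can be understood as returning the CPU execution context to QEMU.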

open("/dev/kvm")
ioctl(KVM_CREATE_VM)
ioctl(KVM_CREATE_VCPU)
for (;;) {
     ioctl(KVM_RUN)
     switch (exit_reason) {
     case KVM_EXIT_IO:  /* ... */
     case KVM_EXIT_HLT: /* ... */
     }
}

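About KVM_CREATE_VM: the VM it creates has no CPU and no memory. The QEMU process has to mmap a block of memory and hand it to the VM descriptor, which is how the VM gets its memory.

KVM ioctl API documentation

A KVM API appetizer

Below is a simple KVM demo whose purpose is to load a piece of code and run it under KVM.
The code is 8086 assembly in AT&T syntax; .code16 marks it as 16-bit. It cannot be run directly, so we use the KVM API to treat it as the simplest possible operating system and run it.
The program writes the value of the al register to port 0x3f8. On x86, I/O ports are accessed with the IN/OUT instructions. The PC architecture has 65536 8-bit I/O ports, forming a 64K I/O address space numbered 0 to 0xFFFF. Two consecutive 8-bit ports can be used as one 16-bit port, and four as one 32-bit port. The I/O address space is separate from the CPU's physical address space: the I/O space is 64K, while a 32-bit CPU's physical address space is 4G.
al and bl are given their values later, during KVM initialization. The expected output is:
4\n (the \n is not printed literally; it produces a line break). The hlt instruction halts the guest, which the demo treats as the end of the VM.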

.globl _start
    .code16
_start:
    mov $0x3f8, %dx
    add %bl, %al
    add $'0', %al
    out %al, (%dx)
    mov $'\n', %al
    out %al, (%dx)
    hlt

Assemble and link it to get a binary file, bin.bin

as -32 bin.S -o bin.o
ld -m elf_i386 --oformat binary -N -e _start -Ttext 0x10000 -o bin.bin bin.o

Look at the binary

➜  demo1 hexdump -C bin.bin
00000000  ba f8 03 00 d8 04 30 ee  b0 0a ee f4              |......0.....|
0000000c
These bytes correspond to the code array below, so the bytes can be embedded directly instead of being loaded from a file
    const uint8_t code[] = {
        0xba, 0xf8, 0x03, /* mov $0x3f8, %dx */
        0x00, 0xd8,       /* add %bl, %al */
        0x04, '0',        /* add $'0', %al */
        0xee,             /* out %al, (%dx) */
        0xb0, '\n',       /* mov $'\n', %al */
        0xee,             /* out %al, (%dx) */
        0xf4,             /* hlt */
    };

#include <err.h>
#include <fcntl.h>
#include <linux/kvm.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(void)
{
    int kvm, vmfd, vcpufd, ret;
    const uint8_t code[] = {
        0xba, 0xf8, 0x03, /* mov $0x3f8, %dx */
        0x00, 0xd8,       /* add %bl, %al */
        0x04, '0',        /* add $'0', %al */
        0xee,             /* out %al, (%dx) */
        0xb0, '\n',       /* mov $'\n', %al */
        0xee,             /* out %al, (%dx) */
        0xf4,             /* hlt */
    };
    uint8_t *mem;
    struct kvm_sregs sregs;
    size_t mmap_size;
    struct kvm_run *run;
    
    // Obtain the kvm handle
    kvm = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    if (kvm == -1)
        err(1, "/dev/kvm");

    // Make sure the API version is the expected one
    ret = ioctl(kvm, KVM_GET_API_VERSION, NULL);
    if (ret == -1)
        err(1, "KVM_GET_API_VERSION");
    if (ret != 12)
        errx(1, "KVM_GET_API_VERSION %d, expected 12", ret);
    
    // Create a VM
    vmfd = ioctl(kvm, KVM_CREATE_VM, (unsigned long)0);
    if (vmfd == -1)
        err(1, "KVM_CREATE_VM");
    
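    // Allocate one page of guest memory and copy the code (the "image") into it
    mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)   // mmap reports failure with MAP_FAILED, not NULL
        err(1, "allocating guest memory");
    memcpy(mem, code, sizeof(code));

    // The page is mapped at guest physical address 0x1000, the second page frame.
    // Page 0 is left unmapped; in real mode the interrupt vector table lives at address 0.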
    struct kvm_userspace_memory_region region = {
        .slot = 0,
        .guest_phys_addr = 0x1000,
        .memory_size = 0x1000,
        .userspace_addr = (uint64_t)mem,
    };
    // Register the memory region with KVM
    ret = ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &region);
    if (ret == -1)
        err(1, "KVM_SET_USER_MEMORY_REGION");
    
    // Create the virtual CPU
    vcpufd = ioctl(vmfd, KVM_CREATE_VCPU, (unsigned long)0);
    if (vcpufd == -1)
        err(1, "KVM_CREATE_VCPU");

    // Query the size of the KVM run-time structure
    ret = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, NULL);
    if (ret == -1)
        err(1, "KVM_GET_VCPU_MMAP_SIZE");
    mmap_size = ret;
    if (mmap_size < sizeof(*run))
        errx(1, "KVM_GET_VCPU_MMAP_SIZE unexpectedly small");
    // Map the vCPU's shared kvm_run structure so we can read run-time information such as the exit reason
    run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, vcpufd, 0);
    if (run == MAP_FAILED)
        err(1, "mmap vcpu");

    // Read the special registers
    ret = ioctl(vcpufd, KVM_GET_SREGS, &sregs);
    if (ret == -1)
        err(1, "KVM_GET_SREGS");
    // Zero the code segment base and selector so that CS:IP = 0:0x1000 addresses the page where the code was loaded
    sregs.cs.base = 0;
    sregs.cs.selector = 0;
    // KVM_SET_SREGS writes the special registers back
    ret = ioctl(vcpufd, KVM_SET_SREGS, &sregs);
    if (ret == -1)
        err(1, "KVM_SET_SREGS");

    
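    // Set the entry point. The code was loaded at guest physical address 0x1000,
    // so with cs.base = 0 execution starts at rip = 0x1000.
    // For a real image, rip would point at instructions loaded from something like a boot sector.
    struct kvm_regs regs = {
        .rip = 0x1000,
        .rax = 2,    // initial value of ax is 2
        .rbx = 2,    // likewise for bx
        .rflags = 0x2,   // bit 1 of rflags is reserved and must be set on x86, otherwise an error occurs
    };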
    ret = ioctl(vcpufd, KVM_SET_REGS, &regs);
    if (ret == -1)
        err(1, "KVM_SET_REGS");

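    // Run the guest. qemu-kvm would dedicate one thread to each vCPU for this loop.
    while (1) {
        // Enter the guest; the ioctl returns when the guest exits
        ret = ioctl(vcpufd, KVM_RUN, NULL);
        if (ret == -1)
            err(1, "KVM_RUN");
        // Look at why the guest exited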
        switch (run->exit_reason) {
        case KVM_EXIT_HLT:
            puts("KVM_EXIT_HLT");
            return 0;
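        // The guest executed an out instruction. Port I/O is intercepted in VMX non-root
        // mode, so control returns to the host; the guest context is saved in the VMCS
        // (covered later in the CPU virtualization part).
        // KVM places the byte being written in the shared kvm_run mapping at
        // io.data_offset, so the host can read the guest's output directly.
        // This is also the basis for the PCI device virtualization described later.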
        case KVM_EXIT_IO:
            if (run->io.direction == KVM_EXIT_IO_OUT && run->io.size == 1 && run->io.port == 0x3f8 && run->io.count == 1)
                putchar(*(((char *)run) + run->io.data_offset));
            else
                errx(1, "unhandled KVM_EXIT_IO");
            break;
        case KVM_EXIT_FAIL_ENTRY:
            errx(1, "KVM_EXIT_FAIL_ENTRY: hardware_entry_failure_reason = 0x%llx",
                 (unsigned long long)run->fail_entry.hardware_entry_failure_reason);
        case KVM_EXIT_INTERNAL_ERROR:
            errx(1, "KVM_EXIT_INTERNAL_ERROR: suberror = 0x%x", run->internal.suberror);
        default:
            errx(1, "exit_reason = 0x%x", run->exit_reason);
        }
    }
}

Compile and run the demo

gcc -g demo.c -o demo
➜  demo1 ./demo
4
KVM_EXIT_HLT
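
To see why the demo prints 4, the guest's register arithmetic can be replayed on the host. The sketch below is not part of the original demo: guest_first_byte is a hypothetical helper that mirrors the two add instructions of the guest code, assuming al and bl hold the values passed through KVM_SET_REGS.

```c
#include <stdint.h>

/* Hypothetical helper (not in the original demo): mirrors
 * "add %bl, %al; add $'0', %al" and returns the byte that the first
 * "out %al, (%dx)" would write to port 0x3f8. */
static uint8_t guest_first_byte(uint8_t al, uint8_t bl)
{
    al = (uint8_t)(al + bl);   /* add %bl, %al   (8-bit add, wraps on overflow) */
    al = (uint8_t)(al + '0');  /* add $'0', %al  (turns a small digit into ASCII) */
    return al;
}
```

With rax = 2 and rbx = 2 as in the demo, the helper returns the character '4', which matches the first line of the output above.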

Another simple QEMU emulator demo

Xu from IBM has written an introduction to this; building on it, I will describe the qemu-kvm boot process in more detail.

.globl _start
    .code16
_start:
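    xorw %ax, %ax   # clear the ax register

loop1:
    out %ax, $0x10  # write ax to port 0x10; AT&T syntax puts operands in the opposite order from Intel syntax
    inc %ax         # increment ax
    jmp loop1       # loop forever

This program keeps writing a 16-bit value (the contents of ax) to port 0x10.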

Starting from the main function

int main(int argc, char **argv) {
    // Initialize the kvm structure
    struct kvm *kvm = kvm_init();

    if (kvm == NULL) {
        fprintf(stderr, "kvm init failed\n");
        return -1;
    }
    
    // Create the VM and allocate its memory
    if (kvm_create_vm(kvm, RAM_SIZE) < 0) {
        fprintf(stderr, "create vm failed\n");
        return -1;
    }
    
    // Load the image
    load_binary(kvm);

    // only support one vcpu now
    kvm->vcpu_number = 1;
    // Create the execution context (the vCPU)
    kvm->vcpus = kvm_init_vcpu(kvm, 0, kvm_cpu_thread);
    if (kvm->vcpus == NULL) {
        fprintf(stderr, "init vcpu failed\n");
        return -1;
    }
    
    // Start the VM
    kvm_run_vm(kvm);

    kvm_clean_vm(kvm);
    kvm_clean_vcpu(kvm->vcpus);
    kvm_clean(kvm);
    return 0;
}

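Step 1: kvm_init() initializes the kvm structure. First, here is how a simple kvm structure can be defined.

struct kvm {
   int dev_fd;              // handle of /dev/kvm
   int vm_fd;               // handle of the guest
   __u64 ram_size;          // guest memory size
   __u64 ram_start;         // start address of guest memory,
                            // i.e. the address the qemu emulator obtained from mmap
   
   int kvm_version;         
   struct kvm_userspace_memory_region mem; // memory slot descriptor, filled in by user space;
                                           // slots let guest physical memory be built from several regions

   struct vcpu *vcpus;      // array of vCPUs
   int vcpu_number;         // number of vCPUs
};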

Initializing the kvm structure.

struct kvm *kvm_init(void) {
    struct kvm *kvm = malloc(sizeof(struct kvm));
    if (kvm == NULL) {
        perror("malloc kvm");
        return NULL;
    }
    kvm->dev_fd = open(KVM_DEVICE, O_RDWR);  // open /dev/kvm to get the kvm handle

    if (kvm->dev_fd < 0) {
        perror("open kvm device fault: ");
        free(kvm);
        return NULL;
    }

    kvm->kvm_version = ioctl(kvm->dev_fd, KVM_GET_API_VERSION, 0);  // query the KVM API version

    return kvm;
}

Steps 2 and 3: create the VM, obtain its handle, and give it memory.

int kvm_create_vm(struct kvm *kvm, int ram_size) {
    int ret = 0;
    // Call KVM_CREATE_VM to get the vm handle
    kvm->vm_fd = ioctl(kvm->dev_fd, KVM_CREATE_VM, 0);

    if (kvm->vm_fd < 0) {
        perror("can not create vm");
        return -1;
    }

    // Allocate memory for the VM with the mmap system call
    kvm->ram_size = ram_size;
    kvm->ram_start =  (__u64)mmap(NULL, kvm->ram_size, 
                PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, 
                -1, 0);

    if ((void *)kvm->ram_start == MAP_FAILED) {
        perror("can not mmap ram");
        return -1;
    }
    
    // kvm->mem has to be filled in before it is passed to KVM_SET_USER_MEMORY_REGION
    // There is only one memory slot
    kvm->mem.slot = 0;
    // The struct came from malloc, so clear the flags explicitly
    kvm->mem.flags = 0;
    // Start of guest physical memory
    kvm->mem.guest_phys_addr = 0;
    // Size of the VM's memory
    kvm->mem.memory_size = kvm->ram_size;
    // User-space address of the VM's memory on the host; this is what binds the memory to the guest
    kvm->mem.userspace_addr = kvm->ram_start;
    
    // Call KVM_SET_USER_MEMORY_REGION to hand the memory to the VM
    ret = ioctl(kvm->vm_fd, KVM_SET_USER_MEMORY_REGION, &(kvm->mem));

    if (ret < 0) {
        perror("can not set user memory region");
        return ret;
    }
    return ret;
}

Next, load_binary loads the binary file into the VM's memory. In the first demo we put the bytes into memory directly; here we imitate image loading and read the binary file into memory.

void load_binary(struct kvm *kvm) {
    int fd = open(BINARY_FILE, O_RDONLY);  // open the binary file (the image)

    if (fd < 0) {
        fprintf(stderr, "can not open binary file\n");
        exit(1);
    }

    int ret = 0;
    char *p = (char *)kvm->ram_start;

    while(1) {
        ret = read(fd, p, 4096);           // copy the image into the VM's memory
        if (ret <= 0) {
            break;
        }
        printf("read size: %d", ret);
        p += ret;
    }
    close(fd);                             // the image is in memory; the file is no longer needed
}

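After the image is loaded, the vCPU has to be initialized so that the image can run

struct vcpu {
    int vcpu_id;                 // vCPU id
    int vcpu_fd;                 // vCPU handle
    pthread_t vcpu_thread;       // handle of the vCPU thread
    struct kvm_run *kvm_run;     // KVM run-time structure; can also be seen as the context
    int kvm_run_mmap_size;       // size of the run-time structure
    struct kvm_regs regs;        // vCPU general-purpose registers
    struct kvm_sregs sregs;      // vCPU special registers
    void *(*vcpu_thread_func)(void *);  // thread function
};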

struct vcpu *kvm_init_vcpu(struct kvm *kvm, int vcpu_id, void *(*fn)(void *)) {
    // Allocate the vcpu structure
    struct vcpu *vcpu = malloc(sizeof(struct vcpu));
    if (vcpu == NULL) {
        perror("malloc vcpu");
        return NULL;
    }
    // There is only one vCPU in this demo; the caller passes id 0
    vcpu->vcpu_id = vcpu_id;
    // Call KVM_CREATE_VCPU on kvm->vm_fd (returned by KVM_CREATE_VM) to get the vCPU handle
    vcpu->vcpu_fd = ioctl(kvm->vm_fd, KVM_CREATE_VCPU, vcpu->vcpu_id);

    if (vcpu->vcpu_fd < 0) {
        perror("can not create vcpu");
        return NULL;
    }
    
    // Query the size of the KVM run-time structure
    vcpu->kvm_run_mmap_size = ioctl(kvm->dev_fd, KVM_GET_VCPU_MMAP_SIZE, 0);

    if (vcpu->kvm_run_mmap_size < 0) {
        perror("can not get vcpu mmsize");
        return NULL;
    }

    printf("%d\n", vcpu->kvm_run_mmap_size);
    // Map the memory behind vcpu_fd as vcpu->kvm_run. This associates the two,
    // so that the exit information of the vCPU can be read when the VM exits
    vcpu->kvm_run = mmap(NULL, vcpu->kvm_run_mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, vcpu->vcpu_fd, 0);

    if (vcpu->kvm_run == MAP_FAILED) {
        perror("can not mmap kvm_run");
        return NULL;
    }
    
    // Set the thread function
    vcpu->vcpu_thread_func = fn;
    return vcpu;
}

Last step: with everything in place, start the VM.

void kvm_run_vm(struct kvm *kvm) {
    int i = 0;

    for (i = 0; i < kvm->vcpu_number; i++) {
        // Start a thread that runs vcpu_thread_func, passing the kvm structure as its argument
        if (pthread_create(&(kvm->vcpus[i].vcpu_thread), (const pthread_attr_t *)NULL, kvm->vcpus[i].vcpu_thread_func, kvm) != 0) {
            perror("can not create kvm thread");
            exit(1);
        }
    }

    pthread_join(kvm->vcpus->vcpu_thread, NULL);
}

Starting the VM simply means creating a thread and running the corresponding thread function.
The thread function was passed in when kvm_init_vcpu was called

void *kvm_cpu_thread(void *data) {
    // Fetch the argument
    struct kvm *kvm = (struct kvm *)data;
    int ret = 0;
    // Set up the vCPU registers
    kvm_reset_vcpu(kvm->vcpus);

    while (1) {
        printf("KVM start run\n");
        // Run the VM; it now has memory and a CPU, so it can execute
        ret = ioctl(kvm->vcpus->vcpu_fd, KVM_RUN, 0);
    
        if (ret < 0) {
            fprintf(stderr, "KVM_RUN failed\n");
            exit(1);
        }
        
        // kvm_init_vcpu mapped kvm_run onto the vCPU's shared memory,
        // so when the VM exits we can read exit_reason, the reason for the exit
        switch (kvm->vcpus->kvm_run->exit_reason) {
        case KVM_EXIT_UNKNOWN:
            printf("KVM_EXIT_UNKNOWN\n");
            break;
        case KVM_EXIT_DEBUG:
            printf("KVM_EXIT_DEBUG\n");
            break;
        // The guest performed port I/O; the CPU suspends the guest
        // and hands control to the emulator
        case KVM_EXIT_IO:
            printf("KVM_EXIT_IO\n");
            // out %ax writes 16 bits, so read the data as an unsigned short
            printf("out port: %d, data: %d\n", 
                kvm->vcpus->kvm_run->io.port,  
                *(unsigned short *)((char *)(kvm->vcpus->kvm_run) + kvm->vcpus->kvm_run->io.data_offset)
                );
            sleep(1);
            break;
        // The guest performed memory-mapped I/O
        case KVM_EXIT_MMIO:
            printf("KVM_EXIT_MMIO\n");
            break;
        case KVM_EXIT_INTR:
            printf("KVM_EXIT_INTR\n");
            break;
        case KVM_EXIT_SHUTDOWN:
            printf("KVM_EXIT_SHUTDOWN\n");
            goto exit_kvm;
        default:
            printf("KVM PANIC\n");
            goto exit_kvm;
        }
    }

exit_kvm:
    return NULL;
}

void kvm_reset_vcpu (struct vcpu *vcpu) {
    if (ioctl(vcpu->vcpu_fd, KVM_GET_SREGS, &(vcpu->sregs)) < 0) {
        perror("can not get sregs\n");
        exit(1);
    }
    // #define CODE_START 0x1000
    /* layout of sregs
        x86
        struct kvm_sregs {
            struct kvm_segment cs, ds, es, fs, gs, ss;
            struct kvm_segment tr, ldt;
            struct kvm_dtable gdt, idt;
            __u64 cr0, cr2, cr3, cr4, cr8;
            __u64 efer;
            __u64 apic_base;
            __u64 interrupt_bitmap[(KVM_NR_INTERRUPTS + 63) / 64];
        };
    */
    // cs is the code segment register; in real mode its base is selector * 16
    vcpu->sregs.cs.selector = CODE_START;
    vcpu->sregs.cs.base = CODE_START * 16;
    // ss is the stack segment register
    vcpu->sregs.ss.selector = CODE_START;
    vcpu->sregs.ss.base = CODE_START * 16;
    // ds is the data segment register
    vcpu->sregs.ds.selector = CODE_START;
    vcpu->sregs.ds.base = CODE_START *16;
    // es is the extra segment register
    vcpu->sregs.es.selector = CODE_START;
    vcpu->sregs.es.base = CODE_START * 16;
    // fs and gs are segment registers as well
    vcpu->sregs.fs.selector = CODE_START;
    vcpu->sregs.fs.base = CODE_START * 16;
    vcpu->sregs.gs.selector = CODE_START;
    vcpu->sregs.gs.base = CODE_START * 16;
    
    // Write the values above to the vCPU
    if (ioctl(vcpu->vcpu_fd, KVM_SET_SREGS, &vcpu->sregs) < 0) {
        perror("can not set sregs");
        exit(1);
    }
    
    // Flags register: bit 1 is reserved and must be set
    vcpu->regs.rflags = 0x0000000000000002ULL;
    // rip is the instruction pointer, here offset 0 within the code segment.
    // When the image was loaded, the binary was read to the start of the VM's memory,
    // so the VM runs the binary as soon as it starts
    vcpu->regs.rip = 0;
    // rsp is the top of the stack
    vcpu->regs.rsp = 0xffffffff;
    // rbp is the bottom of the stack
    vcpu->regs.rbp= 0;

    if (ioctl(vcpu->vcpu_fd, KVM_SET_REGS, &(vcpu->regs)) < 0) {
        perror("KVM SET REGS\n");
        exit(1);
    }
}
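
The selector and base values in kvm_reset_vcpu follow the real-mode addressing rule. As a small sketch (the helper names real_mode_base and real_mode_linear are hypothetical and do not appear in the demo), the rule can be written as:

```c
#include <stdint.h>

/* Real-mode rule: segment base = selector * 16. */
static uint32_t real_mode_base(uint16_t selector)
{
    return (uint32_t)selector << 4;
}

/* Linear address = segment base + 16-bit offset (for example cs:ip). */
static uint32_t real_mode_linear(uint16_t selector, uint16_t offset)
{
    return real_mode_base(selector) + offset;
}
```

With CODE_START = 0x1000 this gives a base of 0x10000, the same value the demo writes into cs.base as CODE_START * 16.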

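Running it shows that when the guest executes out %ax, $0x10 the guest exits; this is the special mechanism described in the CPU virtualization part.
The host reads the exit reason and then fetches the corresponding output. This is essentially what I/O virtualization does: read the data of the I/O operation from shared memory and produce the result.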

➜  kvmsample git:(master) ✗ ./kvmsample
read size: 712288
KVM start run
KVM_EXIT_IO
out port: 16, data: 0
KVM start run
KVM_EXIT_IO
out port: 16, data: 1
KVM start run
KVM_EXIT_IO
out port: 16, data: 2
KVM start run
KVM_EXIT_IO
out port: 16, data: 3
KVM start run
KVM_EXIT_IO
out port: 16, data: 4
...
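
The width of the data that the guest wrote depends on the instruction: out %al writes one byte, out %ax two, out %eax four. KVM reports this in the io.size field of struct kvm_run, next to io.data_offset. The sketch below shows one way the handler could decode the value by size. io_item_value is a hypothetical helper, and it assumes a little-endian host such as x86.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: decode one port-I/O item of `size` bytes (1, 2 or 4)
 * from the data area of a KVM_EXIT_IO exit. On a little-endian host, copying
 * the bytes into a zeroed 32-bit variable yields the numeric value. */
static uint32_t io_item_value(const void *data, uint8_t size)
{
    uint32_t value = 0;
    if (size == 1 || size == 2 || size == 4)
        memcpy(&value, data, size);
    return value;   /* 0 for unsupported sizes */
}
```

In the thread function above it could be called as io_item_value((char *)run + run->io.data_offset, run->io.size), where run is the mapped kvm_run structure.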

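Summary

The VM boot process can be summarized as:
open the kvm handle -> create the vm -> allocate memory -> load the image into memory -> start a thread that runs KVM_RUN. The demo shows that the VM's memory is mapped by the host with mmap and handed to the VM, and that a vCPU is a host thread. The thread sets the vCPU registers to point at the address where the guest program was loaded and then starts running guest instructions. When the guest performs an I/O operation, the CPU intercepts it and hands control back to the host.

The real qemu-kvm is of course much more complex, including setting up MMIO for many I/O devices and installing signal handlers.

The next article covers CPU virtualization.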

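The previous article gave a broad picture of how a VM comes to life. The demo also shows that running a VM no longer requires software to emulate the hardware instruction set, as one might once have imagined. Guest instructions run directly on the host's physical CPU. When a guest instruction involves I/O or certain special instructions, control is handed to the host (strictly speaking, to the VM monitor, abbreviated VMM below), which here is the demo process. On the host it is simply a user-level process.

A diagram explains it better.

vcpu-follow.png-102.9kB

After the VMM has initialized the vCPU and memory, it calls KVM through ioctl to create the VM and starts a thread to run it. During initialization the VMM sets the registers that tell KVM where the guest's entry point is (its "main function"), so once the thread calls into KVM, control of the physical CPU passes to the VM. The VM runs in VMX non-root mode, a special CPU execution mode provided by Intel VT-x (AMD-V has an equivalent guest mode). When the VM executes a special instruction, the CPU saves the VM's context into the VMCS (a region of memory; the CPU holds a pointer to the current one) and execution switches to the VMM. The VMM reads the reason for the exit and handles it. For an I/O request, the VMM can read the VM's memory directly and emulate the I/O operation, after which the VM is resumed with the VMRESUME instruction. From the VM's point of view, the I/O instruction was simply executed by the CPU.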

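Intel VT-x

Intel VT-x is a set of special CPU execution modes that Intel provides to support virtualization.

Structure of Intel VT-x

Intel VT-x extends the privilege levels of IA-32 processors. The CPU originally supports four levels, ring0 to ring3, but Linux uses only two of them: ring0 and ring3. When the CPU's registers indicate ring0, it is running kernel code; at ring3 it is running user-level code. On a system call, interrupt or exception the CPU switches from ring3 to ring0. Ring3 code may not operate hardware directly; all hardware operations go through APIs provided by the system.
Take an I/O operation as an example:

int nread = read(fd, buffer, 1024);

When this code runs (on 32-bit Linux), the system call number is looked up and stored in eax, the arguments are placed in ebx, ecx and edx, and a system-call interrupt is raised with int $0x80. The CPU then switches to ring0; the kernel reads the arguments from the registers, completes the I/O, and returns to ring3 when it is done.

movl    $3, %eax        # system call number of read on i386
movl    fd, %ebx
movl    $buffer, %ecx
movl    $1024, %edx
int     $0x80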
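
The same ring3 to ring0 transition can be triggered from C without the libc read() wrapper. The sketch below uses the syscall() function from glibc, which selects the entry mechanism for the platform (on x86-64 that is the syscall instruction rather than int $0x80). raw_read is a hypothetical name used only for this illustration.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* makes glibc declare syscall() */
#endif
#include <sys/syscall.h>     /* SYS_read */
#include <sys/types.h>       /* ssize_t */
#include <unistd.h>          /* syscall() */

/* Hypothetical helper: issue the read system call directly by number.
 * Returns the number of bytes read, or -1 with errno set on failure. */
static ssize_t raw_read(int fd, void *buf, size_t count)
{
    return (ssize_t)syscall(SYS_read, fd, buf, count);
}
```

Calling raw_read(fd, buffer, 1024) behaves like the read() call above: the arguments travel in registers, the kernel does the work in ring0, and the result comes back in the return value.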

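VT-x adds VMX operation on top of ring0 to ring3. VMX has two modes, root and non-root. VMX root mode is for the VMM (the VM monitor mentioned earlier); in the KVM world it is the mode in which the host side, including the qemu-kvm process, runs. VMX non-root mode runs the Guest. The Guest also has ring0 to ring3, but it is not aware that it is in VMX non-root mode.

image_1appl048b1jqh1joj5hkthj1kdr29.png-169.9kB

Intel's virtualization architecture has two main parts:

Virtual machine monitor
Guest (Guest VM)

Virtual-machine monitors (VMM)

On the host, the VMM is the entity that provides the virtual CPU, memory and a set of virtual hardware to the VM. In the KVM world this entity is a process, such as qemu-kvm. The VMM manages the VM's resources and has control over all of them, including switching the VM's CPU context.

Guest

The Guest was mentioned in the earlier demos. It may be an operating system (OS) or just a binary program. Whatever it is, to the VMM it is a sequence of instructions, and knowing the entry point (the value of rip) is enough to run it.
The Guest needs a virtual CPU to run. Guest code runs in VMX non-root mode, where it uses instructions, registers and caches as usual. When it reaches a special instruction (such as the out instruction in the demo), control of the CPU is handed to the VMM, which handles the instruction and performs the hardware operation.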

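Switching between VMM and Guest

image_1applkej22o6chnj5q1u6ffpq2m.png-18.5kB

Switching between Guest and VMM has two halves: VM entry and VM exit. Several things cause a VM exit, for example the Guest accessing hardware, executing VMCALL, executing a halt instruction, triggering certain page faults, or accessing registers of special devices. While in VMX non-root mode the Guest has no instruction or register that reports the mode, so the Guest cannot tell whether the CPU is in VMX mode. On a VM exit the CPU records the exit reason in the VM-exit information fields of the VMCS; on the KVM side it surfaces as vCPU->kvm_run->exit_reason. The VMM acts on exit_reason.

Life cycle of the VMM

As the figure above shows, the VMM starts with the VMXON instruction and ends with VMXOFF.
The Guest is started for the first time with VMLAUNCH; at that point everything is fresh, including the initial rip. After a Guest exit, re-entry uses VMRESUME, which loads the state referenced by the VMCS (described below) back into the Guest context so that the Guest continues.

VMCS (Virtual-Machine Control Structure)

As the name says, the VMCS is the virtual-machine control structure. As mentioned several times, on a Guest exit the current Guest context is saved to the VMCS, and on Guest entry the context in the VMCS is restored to the Guest. A VMCS is a region of memory referenced by a 64-bit physical address. There is one VMCS per vCPU, so there are as many VMCSs as vCPUs. Instructions that operate on it include VMREAD, VMWRITE and VMCLEAR.

Guest exit reason

Below are the exit reasons handled by KVM (the table is from the kernel's VMX code). Many events can make the Guest give up control; a few are annotated.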

static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
    [EXIT_REASON_EXCEPTION_NMI]           = handle_exception, 
    [EXIT_REASON_EXTERNAL_INTERRUPT]      = handle_external_interrupt, 
    [EXIT_REASON_TRIPLE_FAULT]            = handle_triple_fault,
    [EXIT_REASON_NMI_WINDOW]              = handle_nmi_window,
     // the Guest executed a port I/O instruction
    [EXIT_REASON_IO_INSTRUCTION]          = handle_io,
     // access to a control register (CR); the DR entry below covers the debug registers
    [EXIT_REASON_CR_ACCESS]               = handle_cr,
    [EXIT_REASON_DR_ACCESS]               = handle_dr, 
    [EXIT_REASON_CPUID]                   = handle_cpuid,
    // access to a model-specific register (MSR)
    [EXIT_REASON_MSR_READ]                = handle_rdmsr,
    [EXIT_REASON_MSR_WRITE]               = handle_wrmsr,
    [EXIT_REASON_PENDING_INTERRUPT]       = handle_interrupt_window,
    // the Guest executed HLT; the appetizer demo ends with this instruction
    [EXIT_REASON_HLT]                     = handle_halt,
    [EXIT_REASON_INVD]                    = handle_invd,
    [EXIT_REASON_INVLPG]                  = handle_invlpg,
    [EXIT_REASON_RDPMC]                   = handle_rdpmc,
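    // VMX instructions executed by the Guest itself; VMCALL is the hypercall instruction,
    // the others occur with nested virtualization (a VM running inside a VM)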
    [EXIT_REASON_VMCALL]                  = handle_vmcall, 
    [EXIT_REASON_VMCLEAR]                 = handle_vmclear,
    [EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
    [EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
    [EXIT_REASON_VMPTRST]                 = handle_vmptrst,
    [EXIT_REASON_VMREAD]                  = handle_vmread,
    [EXIT_REASON_VMRESUME]                = handle_vmresume,
    [EXIT_REASON_VMWRITE]                 = handle_vmwrite,
    [EXIT_REASON_VMOFF]                   = handle_vmoff,
    [EXIT_REASON_VMON]                    = handle_vmon,
    
    [EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
    // access to the APIC (Advanced Programmable Interrupt Controller)
    [EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
    [EXIT_REASON_APIC_WRITE]              = handle_apic_write,
    [EXIT_REASON_EOI_INDUCED]             = handle_apic_eoi_induced,
    [EXIT_REASON_WBINVD]                  = handle_wbinvd,
    [EXIT_REASON_XSETBV]                  = handle_xsetbv,
    // hardware task switch
    [EXIT_REASON_TASK_SWITCH]             = handle_task_switch,
    [EXIT_REASON_MCE_DURING_VMENTRY]      = handle_machine_check,
    // EPT is Intel's hardware-assisted memory virtualization
    [EXIT_REASON_EPT_VIOLATION]           = handle_ept_violation,
    [EXIT_REASON_EPT_MISCONFIG]           = handle_ept_misconfig,
    // the Guest executed the PAUSE instruction
    [EXIT_REASON_PAUSE_INSTRUCTION]       = handle_pause,
    [EXIT_REASON_MWAIT_INSTRUCTION]       = handle_invalid_op,
    [EXIT_REASON_MONITOR_INSTRUCTION]     = handle_invalid_op,
    [EXIT_REASON_INVEPT]                  = handle_invept,
};
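
The table above is a C array of function pointers filled in with designated initializers. A toy version of the same dispatch pattern is sketched below. All names here (toy_exit_reason, toy_handlers, toy_dispatch) are hypothetical, and the return values 1 and 0 are arbitrary toy values, not KVM's.

```c
#include <stddef.h>

/* Toy exit reasons; the gaps in the numbering leave some slots empty. */
enum toy_exit_reason { TOY_EXIT_IO = 0, TOY_EXIT_HLT = 1, TOY_EXIT_CPUID = 3, TOY_EXIT_MAX = 8 };

typedef int (*toy_handler)(int *state);

static int toy_handle_io(int *state)  { (*state)++; return 1; }  /* counts I/O exits */
static int toy_handle_hlt(int *state) { (void)state; return 0; }

/* Slots without an initializer are NULL, exactly as in a sparse handler table. */
static toy_handler const toy_handlers[TOY_EXIT_MAX] = {
    [TOY_EXIT_IO]  = toy_handle_io,
    [TOY_EXIT_HLT] = toy_handle_hlt,
};

/* Bounds-check the reason and make sure a handler exists before calling it. */
static int toy_dispatch(unsigned reason, int *state)
{
    if (reason < TOY_EXIT_MAX && toy_handlers[reason])
        return toy_handlers[reason](state);
    return -1;  /* unknown exit reason */
}
```

The two checks in toy_dispatch matter because the index comes from outside the table: a reason beyond the array or one with an empty slot must not be called.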

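Summary

KVM's CPU virtualization relies on Intel VT-x: the Guest runs in VMX non-root mode, and when it performs a special operation control returns to the VMM, which handles the operation and returns the result to the Guest.

CPU virtualization is arguably the core of KVM. Once VM exit and VM entry are understood, I/O virtualization and memory virtualization build on them. The next chapter covers memory virtualization.

Earlier articles described the three layers QEMU, KVM and Guest OS, the three modes they correspond to, and how the modes cooperate.

Now let's look at how the three layers cooperate from the code's point of view.

As described earlier, after the user issues the command to start a VM in the QEMU layer, QEMU enters the KVM kernel layer through ioctl, completes the initialization, and then runs the VM.

In the KVM kernel layer, the KVM_RUN ioctl ends up calling kvm_arch_vcpu_ioctl_run().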

case KVM_RUN:
    r = -EINVAL;
    if (arg)
        goto out;
    r = kvm_arch_vcpu_ioctl_run(vcpu, vcpu->run);
    trace_kvm_userspace_exit(vcpu->run->exit_reason, r);
    break;

It then calls, in turn, __vcpu_run(), vcpu_enter_guest(), kvm_x86_ops->run() and vmx_vcpu_run(). vmx_vcpu_run() contains a piece of inline assembly that executes ASM_VMX_VMLAUNCH or ASM_VMX_VMRESUME to enter guest mode.

asm(
    ......... /* part of the code omitted */
    /* Enter guest mode */
    "jne 1f \n\t"
    __ex(ASM_VMX_VMLAUNCH) "\n\t"
    "jmp 2f \n\t"
    "1: " __ex(ASM_VMX_VMRESUME) "\n\t"
    ........ /* part of the code omitted */
);

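Entering guest mode with an assembly instruction works because KVM uses hardware virtualization: Intel chips, for example, provide hardware support and a set of related instructions. For the details, consult the Intel manuals. Once in guest mode, the guest sometimes has to exit to the KVM kernel module so that an event can be handled. How does that work?

First we need an important data structure involved in exit handling, the VMCS. The VMCS is the virtual-machine control structure and has three parts: the revision identifier, the abort indicator, and the VMCS data area. The data area holds six kinds of information: the guest-state area, the host-state area, the VM-entry control fields, the VM-execution control fields, the VM-exit control fields and the VM-exit information fields. The host-state area holds the basic register state, and its CS:RIP points to the entry of KVM's exit-handling code; the VM-exit information fields record the exit reason and related data. In practice, when KVM initializes a vCPU it loads the handler entry address into the host CS:RIP fields of the VMCS, so when the guest exits, the CPU resumes in kernel mode at that address and runs the exit handler.

The main entry point for exit handling in KVM is vmx_handle_exit().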

static int vmx_handle_exit(struct kvm_vcpu *vcpu)
{
    struct vcpu_vmx *vmx = to_vmx(vcpu);
    u32 exit_reason = vmx->exit_reason;
    ........ /* some processing, omitted */
    if (exit_reason < kvm_vmx_max_exit_handlers
        && kvm_vmx_exit_handlers[exit_reason])
        return kvm_vmx_exit_handlers[exit_reason](vcpu);
    else {
        vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
        vcpu->run->hw.hardware_exit_reason = exit_reason;
    }
    return 0;
}

The function first reads exit_reason, does some necessary processing, and finally calls kvm_vmx_exit_handlers[exit_reason](vcpu). That structure is an array of function pointers holding the handler for each exit reason.

static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
[EXIT_REASON_EXCEPTION_NMI] = handle_exception,
[EXIT_REASON_EXTERNAL_INTERRUPT] = handle_external_interrupt,
[EXIT_REASON_TRIPLE_FAULT] = handle_triple_fault,
[EXIT_REASON_NMI_WINDOW] = handle_nmi_window,
[EXIT_REASON_IO_INSTRUCTION] = handle_io,
[EXIT_REASON_CR_ACCESS] = handle_cr,
[EXIT_REASON_DR_ACCESS] = handle_dr,
[EXIT_REASON_CPUID] = handle_cpuid,
[EXIT_REASON_MSR_READ] = handle_rdmsr,
[EXIT_REASON_MSR_WRITE] = handle_wrmsr,
[EXIT_REASON_PENDING_INTERRUPT] = handle_interrupt_window,
[EXIT_REASON_HLT] = handle_halt,
[EXIT_REASON_INVD] = handle_invd,
[EXIT_REASON_INVLPG] = handle_invlpg,
[EXIT_REASON_RDPMC] = handle_rdpmc,
[EXIT_REASON_VMCALL] = handle_vmcall,
[EXIT_REASON_VMCLEAR] = handle_vmclear,
[EXIT_REASON_VMLAUNCH] = handle_vmlaunch,
[EXIT_REASON_VMPTRLD] = handle_vmptrld,
[EXIT_REASON_VMPTRST] = handle_vmptrst,
[EXIT_REASON_VMREAD] = handle_vmread,
[EXIT_REASON_VMRESUME] = handle_vmresume,
[EXIT_REASON_VMWRITE] = handle_vmwrite,
[EXIT_REASON_VMOFF] = handle_vmoff,
[EXIT_REASON_VMON] = handle_vmon,
[EXIT_REASON_TPR_BELOW_THRESHOLD] = handle_tpr_below_threshold,
[EXIT_REASON_APIC_ACCESS] = handle_apic_access,
[EXIT_REASON_APIC_WRITE] = handle_apic_write,
[EXIT_REASON_EOI_INDUCED] = handle_apic_eoi_induced,
[EXIT_REASON_WBINVD] = handle_wbinvd,
[EXIT_REASON_XSETBV] = handle_xsetbv,
[EXIT_REASON_TASK_SWITCH] = handle_task_switch,
[EXIT_REASON_MCE_DURING_VMENTRY] = handle_machine_check,
[EXIT_REASON_EPT_VIOLATION] = handle_ept_violation,
[EXIT_REASON_EPT_MISCONFIG] = handle_ept_misconfig,
[EXIT_REASON_PAUSE_INSTRUCTION] = handle_pause,
[EXIT_REASON_MWAIT_INSTRUCTION] = handle_invalid_op,
[EXIT_REASON_MONITOR_INSTRUCTION] = handle_invalid_op,
};

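For example, handle_ept_violation is the handler for EPT violations, the page faults of the extended page tables.

Taking handle_ept_violation() as an example, it calls kvm_mmu_page_fault(), vcpu->arch.mmu.page_fault(), tdp_page_fault() and further functions to resolve the fault.

Note the return value of kvm_vmx_exit_handlers[exit_reason](vcpu). If handle_ept_violation() returns a value greater than 0, KVM switches straight back to guest mode. Sometimes, however, QEMU's help is needed. At the actual call site (r = kvm_x86_ops->handle_exit(vcpu);), a return value greater than 0 means KVM has finished handling the exit and can re-enter guest mode. A value less than or equal to 0 means QEMU has to help, so KVM records the exit reason in exit_reason of the run structure and returns to QEMU. This decision is made in __vcpu_run(), which is essentially a while loop.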

static int __vcpu_run(struct kvm_vcpu *vcpu)
{
    ...... /* part of the code omitted */
    r = 1;
    while (r > 0) {
        if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
            !vcpu->arch.apf.halted)
            r = vcpu_enter_guest(vcpu);
        else {
            ...... /* part of the code omitted */
        }
        if (r <= 0)
            break;
        ...... /* part of the code omitted */
    }
    srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
    vapic_exit(vcpu);
    return r;
}

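vcpu_enter_guest(), as described above, is the function in the KVM kernel module that switches into guest mode. It sits inside the while loop: if QEMU's help is not needed (r > 0) the loop continues and the guest is re-entered; if QEMU's help is needed the return value is r <= 0, the loop ends and r is returned to the caller.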
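
The return-value convention can be modelled without any KVM code. In the sketch below, toy_vcpu_run is a hypothetical stand-in for the loop in __vcpu_run(): each element of results plays the role of one vcpu_enter_guest() return value, and the function reports how many times the guest was entered before control went back to user space.

```c
#include <stddef.h>

/* Hypothetical model of the run loop: results[i] is what the i-th guest
 * entry returns. r > 0 means "handled in the kernel, enter the guest again";
 * r <= 0 means "stop and return to user space". */
static int toy_vcpu_run(const int *results, size_t n, size_t *entries)
{
    int r = 1;
    *entries = 0;
    while (r > 0 && *entries < n) {
        r = results[*entries];   /* one guest entry plus exit handling */
        (*entries)++;
    }
    return r;                    /* last value seen, passed up to the caller */
}
```

For the sequence {1, 1, 0, 1} the guest is entered three times and the loop stops at the 0, so the fourth entry never happens until user space issues KVM_RUN again.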

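r keeps propagating upward until it reaches kvm_vcpu_ioctl(), in the branch

case KVM_RUN:

trace_kvm_userspace_exit(vcpu->run->exit_reason, r);

This statement is a tracepoint that logs the exit reason on the way back to QEMU; the exit reason itself reaches QEMU through the shared run structure.

The QEMU layer then gets the return value of the ioctl and carries on, checking which exit KVM reported. I touched on this briefly in the previous article.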

int kvm_cpu_exec(CPUArchState *env)
{
    .......
    do {
        ......
        run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
        ......
        trace_kvm_run_exit(cpu->cpu_index, run->exit_reason);
        switch (run->exit_reason) {
        case KVM_EXIT_IO:
            ......
            break;
        case KVM_EXIT_MMIO:
            ......
            break;
        case KVM_EXIT_IRQ_WINDOW_OPEN:
            .......
            break;
        case KVM_EXIT_SHUTDOWN:
            ......
            break;
        case KVM_EXIT_UNKNOWN:
            ......
            break;
        case KVM_EXIT_INTERNAL_ERROR:
            ......
            break;
        default:
            ......
            break;
        }
    } while (ret == 0);
}

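trace_kvm_run_exit(cpu->cpu_index, run->exit_reason); traces the exit reason reported by the kernel. The switch that follows handles it, one case per exit reason, and you can add your own cases. Because this is also inside a loop, after one exit has been handled QEMU issues the ioctl again, the VM runs and the CPU switches back to guest mode, which closes the loop.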

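Introduction to memory virtualization

The previous chapter covered CPU virtualization; this one covers how KVM virtualizes memory. Memory is arguably the most important component after the CPU. The Guest ultimately uses host memory, so memory virtualization is about translating Guest addresses to host physical memory and doing so efficiently. KVM has gone through three generations of memory virtualization techniques, each making memory access considerably faster.

Traditional address translation

In protected mode, an ordinary application process uses its own virtual address space. On a 64-bit machine each process nominally sees an address range from 0 to 2^64; there is not that much physical memory, and the process would not be given it anyway. From the process's point of view it owns all memory; from the kernel's point of view only a small part has been allocated, and more is allocated when the process needs more memory.
The addresses used by application processes are virtual addresses, while the kernel manages physical memory. The kernel maintains, for each process, the mapping from virtual addresses to physical memory.
A logical address is first translated to a linear address, and the linear address is then translated to a physical address.

Logical address ==> linear address ==> physical address

Going from a logical address to a linear address is a simple base-plus-offset calculation.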
image_1aq931k0p42710me1il511r1kin9.png-59.9kB

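A full logical address is [segment selector : offset within the segment]. The descriptor is found in the GDT or LDT (located through the gdtr and ldtr registers): the upper 13 bits of the segment selector are the index into the descriptor table, the descriptor supplies the Base address, and Base + offset is the linear address.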
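
The bit layout of a protected-mode selector can be shown in a few lines of C. The helper names below are hypothetical; the layout itself (index in bits 15..3, table indicator in bit 2, requested privilege level in bits 1..0) is the standard x86 one.

```c
#include <stdint.h>

/* Upper 13 bits: index into the GDT or LDT. */
static unsigned selector_index(uint16_t sel)    { return sel >> 3; }
/* Bit 2 (table indicator): 0 = GDT, 1 = LDT. */
static unsigned selector_uses_ldt(uint16_t sel) { return (sel >> 2) & 1u; }
/* Bits 1..0: requested privilege level (ring). */
static unsigned selector_rpl(uint16_t sel)      { return sel & 3u; }

/* Linear address = Base from the descriptor + offset within the segment. */
static uint32_t logical_to_linear(uint32_t base, uint32_t offset)
{
    return base + offset;
}
```

For example, selector 0x2B decodes to index 5 in the GDT with privilege level 3, which is the shape of a typical user-mode selector.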

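Why is it done this way? Reportedly Intel kept it for compatibility.

There is not much to say about the logical-to-linear step in the context of virtualization: nothing is actually virtualized at this layer and it works as usual. What matters is the translation from linear to physical addresses.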

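Traditionally, linear-to-physical translation is done by the CPU's paging unit.
Paging translates a linear address into a physical address. A linear address is split into five fields: four table indexes and a page offset. The walk starts from the table whose address is in CR3 (each process has its own CR3 value, shared by its threads; on a process switch the value is loaded into the register, and this is the basis of memory isolation between processes). The first field indexes that table to find the next one, and after four such lookups the walk arrives at a 4K page (or a larger one, for example when hugepages are used). The CPU does all this by itself without involving the process. If the page is present, the physical address is returned directly. If it is not, a page fault is raised, the kernel handles it and installs the page in the page table, and after the handler returns the CPU retries the access.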
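
The five fields can be extracted with shifts and masks. The sketch below assumes the common x86-64 layout with 4K pages (four 9-bit indexes and a 12-bit offset in a 48-bit linear address); struct walk_indices and split_linear are hypothetical names for this illustration.

```c
#include <stdint.h>

/* The four table indexes and the page offset of a 48-bit linear address. */
struct walk_indices {
    unsigned pml4;    /* bits 47..39: index into the top-level table (from CR3) */
    unsigned pdpt;    /* bits 38..30 */
    unsigned pd;      /* bits 29..21 */
    unsigned pt;      /* bits 20..12 */
    unsigned offset;  /* bits 11..0: byte offset inside the 4K page */
};

static struct walk_indices split_linear(uint64_t la)
{
    struct walk_indices w;
    w.pml4   = (unsigned)((la >> 39) & 0x1ff);
    w.pdpt   = (unsigned)((la >> 30) & 0x1ff);
    w.pd     = (unsigned)((la >> 21) & 0x1ff);
    w.pt     = (unsigned)((la >> 12) & 0x1ff);
    w.offset = (unsigned)(la & 0xfff);
    return w;
}
```

Each 9-bit index selects one of 512 entries in a 4K table, which is why four levels plus a 12-bit offset cover 48 bits.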

image_1aq93m25p10v913qns5kpsfbjm.png-152.1kB

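Memory layout in KVM

Because the qemu-kvm process is an ordinary process on the host, the translation chain for the Guest looks like this.

  Guest virtual address (GVA)
          |
    Guest linear address 
          |
   Guest physical address (GPA)
          |             Guest
   ------------------
          |             HV
    HV virtual address (HVA)
          |
      HV linear address
          |
    HV physical address (HPA)

That is a lot of layers.
Don't worry: the Guest virtual-to-linear step and the HV virtual-to-linear step can be left out, which makes the picture clearer.

  Guest virtual address (GVA)
          |
   Guest physical address (GPA)
          |             Guest
  ------------------
          |             HV
    HV virtual address (HVA)
          |
    HV physical address (HPA)

As mentioned, KVM has kept improving this translation to make memory virtualization more efficient. We start with the original, software-only approach.

Software virtualization

The first level, GVA->GPA, works like traditional translation: the page tables are found through CR3 and walked to get the GPA. The GPA-to-HVA relationship is set up by qemu-kvm. The demo in the second chapter (the KVM boot process) showed how memory is given to KVM: HV memory is mapped with mmap and handed to the Guest.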

image_1aq96r87sq6v12lp1r8119n0cd21t.png-37.4kB

struct kvm_userspace_memory_region region = {
    .slot = 0,
    .guest_phys_addr = 0x1000,
    .memory_size = 0x1000,
    .userspace_addr = (uint64_t)mem,
};

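As you can see, the kvm_userspace_memory_region structure filled in by qemu-kvm describes the start and size of a range of guest physical addresses, and userspace_addr describes where that guest physical memory is mapped in the HV. With several slots, non-contiguous HV virtual address ranges can be mapped into a contiguous guest physical address space.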
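
The GPA to HVA lookup over several slots can be sketched as a linear search. struct toy_slot below is a hypothetical cut-down copy of the three fields used from kvm_userspace_memory_region, and toy_gpa_to_hva is an illustrative helper, not a KVM or QEMU function.

```c
#include <stddef.h>
#include <stdint.h>

/* Cut-down stand-in for struct kvm_userspace_memory_region. */
struct toy_slot {
    uint64_t guest_phys_addr;  /* start of the range in guest physical space */
    uint64_t memory_size;      /* length of the range in bytes */
    uint64_t userspace_addr;   /* where the range lives in the HV process (HVA) */
};

/* Translate a GPA to an HVA; returns 0 if no slot covers the address. */
static uint64_t toy_gpa_to_hva(const struct toy_slot *slots, size_t n, uint64_t gpa)
{
    for (size_t i = 0; i < n; i++) {
        const struct toy_slot *s = &slots[i];
        if (gpa >= s->guest_phys_addr && gpa - s->guest_phys_addr < s->memory_size)
            return s->userspace_addr + (gpa - s->guest_phys_addr);
    }
    return 0;
}
```

Two slots whose guest ranges are adjacent but whose userspace_addr values are far apart illustrate the point of the paragraph above: the Guest sees contiguous physical memory even though the backing HV addresses are not contiguous.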

image_1aq965a4v18e1mnj15pb3du13pu1g.png-41.7kB

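In the software approach, the GPA->HVA relationship is maintained by qemu-kvm, followed by another HVA->HPA translation. Procedurally this is inefficient, especially when a page fault occurs during GVA-to-GPA translation: an exception is raised and the Guest exits, the HV catches it and works out the physical address (allocating new memory for the Guest), and then re-enters the Guest. This can cause frequent Guest exits, and the translation path is long. So KVM uses a technique called shadow page tables.

Shadow page tables

Shadow page tables exist to reduce the cost of address translation by translating a GVA directly to an HPA. In software-virtualized translation, GVA to GPA is resolved through CR3, which holds the base address of the Guest's page table and is loaded into the MMU for translation.
With shadow page tables, when the Guest accesses CR3 (for example because of a process switch in the Guest), KVM intercepts the access (EXIT_REASON_CR_ACCESS in the CPU virtualization chapter). KVM loads a special CR3 that points to the shadow page table while letting the Guest believe it is using its own CR3. After that, memory accesses work as usual, and reaching physical memory takes only one level of translation through the shadow page table.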

image_1aq972tgu2tr15g216kh1u3s1s1q2a.png-47.2kB

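The shadow page tables are maintained on the host side by KVM. They are in effect a mapping from the Guest's page tables to host page tables; the hash of each level of the Guest page table corresponds to a directory in the shadow page tables. On the first GVA->HPA translation the shadow page table has not been built yet and the Guest takes a page fault. As in the traditional process, two translations (VA->PA) are performed and the shadow page table records GVA->GPA->HVA->HPA. This yields a direct GVA->HPA relationship, which is stored in the shadow page table.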

image_1aq97hvkm14lg14al112i1j1m1ren2n.png-19.4kB

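Shadow page tables shorten the GVA->HPA path, but the drawback is that one shadow page table has to be maintained for every process in the Guest, which costs a lot of memory. Building a shadow page table is also slow, and with many Guest processes the shadow page tables are loaded and evicted frequently. Caching helps, but it is still done in software and the efficiency is not ideal, so Intel and AMD added hardware virtualization support on top of this idea.

EPT hardware-accelerated virtualization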

image_1aq997mlo1rt01aka186abnob3134.png-65kB
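EPT (extended page table) can be viewed as a shadow page table implemented in hardware. The hardware gains an EPT pointer that refers to a table mapping GPA->HPA, while the addresses held in the Guest's CR3 and page tables are GPAs. When a Guest access hits a missing page, a page fault is raised. With software emulation or shadow page tables this would cause a VM exit so that the hypervisor could take over. With EPT the Guest does not exit; it handles the fault as an ordinary page fault. If that handling touches a GPA that has no EPT mapping yet, EXIT_REASON_EPT_VIOLATION is raised and the Guest exits. KVM catches the exit, allocates physical memory, and records the GPA->HPA mapping in the EPT. From then on the hardware walks the Guest page tables (located through CR3) together with the EPT to complete GVA->HPA translation. Later translations are done entirely in hardware, which greatly improves efficiency, and no separate set of page tables has to be maintained per process, which reduces memory overhead.
In my tests, memory bandwidth in the Guest versus the HV was 3756 MB/s versus 4340 MB/s, so Guest memory access is already close to host level.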

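Summary

KVM memory virtualization is the process of translating a virtual machine's virtual addresses into host physical memory. The Guest still uses the host's physical memory; the main optimization target is the cost of the translation.
KVM evolved from software emulation to shadow page tables to EPT, becoming more efficient at each step.

Introduction to IO virtualization

The previous articles covered the KVM startup process and the principles of CPU and memory virtualization. A complete von Neumann computer system necessarily has input, computation, and output. Traditional IO includes network device IO, block device IO, character device IO, and so on. In this series we mainly cover network device IO and block device IO. Their principles are very similar, but they diverge at the virtualization layer, which is why the two are discussed separately. This chapter covers network device IO virtualization; the next covers block device IO virtualization.

The traditional network IO path

"Traditional" here does not really mean old; it refers to the network device IO path in a non-virtualized environment. The Linux distributions we use every day, such as Debian or CentOS, all ship the standard Linux TCP/IP stack, and the bottom of the stack provides a driver abstraction layer that adapts to different NICs. Device virtualization is what matters most in virtualization, but it is much easier to understand once the whole network IO path is clear.

The standard TCP/IP structure

In user space we interact with the kernel through sockets: creating sockets, binding ports, sending and receiving data, and so on.
In the kernel, the TCP/IP stack wraps the socket data in a TCP or UDP segment, which carries the port numbers, and passes it to the IP layer, which adds the IP addresses. The data link layer then adds MAC addresses and related information, the frame is written to the NIC through the driver, and the NIC sends it out. The figure below gives a rough picture.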

image_1aqgfh90g3ip1gtmclm180313o4m.png-95.8kB

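In the Linux TCP/IP stack every packet is described by the kernel's sk_buff structure, as shown below. When a socket sends a packet, the data enters the kernel, and the kernel allocates an sk_buff from its sk_buff pool to carry it.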

image_1aqgg26d7gof1jaanlbtblvq13.png-114.6kB
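When the data reaches the link layer and the link-layer header has been added, the stack calls the transmit interface of the driver adaptation layer, dev_queue_xmit, which eventually calls the driver's transmit callback (ndo_start_xmit in net_device_ops).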
image_1aqgn9e3ak2g183g1s2j150713581g.png-268.7kB

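Both transmit and receive use DMA at the driver layer. When the driver loads, it maps memory for the NIC and programs the descriptor state into registers: the start of the memory region, its length, the remaining space, and so on. To transmit, the driver places the data in the mapped memory and writes a NIC register to tell the NIC that data is ready. The NIC then processes the data in that memory and, when done, raises an interrupt to tell the CPU that transmission has completed. The CPU's interrupt handler notifies the driver, which reports completion back up the stack. From the driver's point of view, transmission is synchronous in this flow. Receiving works almost the same way, so it is not detailed here. The DMA model matters a great deal for the IO virtualization discussed later.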

image_1aqger4b915nf19k11gjv1lc21atm9.png-46.6kB

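KVM network IO virtualization

Strictly speaking, KVM only provides the basic CPU and memory virtualization; the actual IO is implemented by qemu-kvm. Articles about KVM, this one included, tend to treat qemu-kvm and KVM as one system and do not separate them carefully. In practice, network IO virtualization is done entirely by qemu-kvm.

KVM fully virtualized IO

Recall the demo from the first chapter: our "image" executed an out instruction, which produced an IO operation. Because this is a sensitive device access that cannot be executed in VMX non-root mode, the VM exited and the emulator took over the IO operation.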

switch (kvm->vcpus->kvm_run->exit_reason) {
        case KVM_EXIT_UNKNOWN:
            printf("KVM_EXIT_UNKNOWN\n");
            break;
        // The Guest performed an IO operation: the CPU, running in guest mode,
        // suspends the Guest and hands control to the emulator
        case KVM_EXIT_IO:
            printf("KVM_EXIT_IO\n");
            printf("out port: %d, data: %d\n", 
                kvm->vcpus->kvm_run->io.port,  
                *(int *)((char *)(kvm->vcpus->kvm_run) + kvm->vcpus->kvm_run->io.data_offset)
                );
            break;
        ...

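The VM exits with reason KVM_EXIT_IO. The emulator learns that the exit was caused by a device IO operation, fetches the operation, and prints the data. This is a minimal emulation of virtual IO, with the emulator taking over the IO.

Fully virtualized IO in qemu-kvm works on the same principle: KVM traps the IO access and qemu-kvm takes it over. Because DMA mappings are used, qemu-kvm registers each device's mmio information at startup so that it can reach the device's mapped memory and control information.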

static int pci_e1000_init(PCIDevice *pci_dev)
{
    /* Abridged excerpt: d is the e1000 device state that belongs to pci_dev. */
    e1000_mmio_setup(d);
    // Set up the mmio space and the IO port space for the PCI device
    pci_register_bar(&d->dev, 0, PCI_BASE_ADDRESS_SPACE_MEMORY, &d->mmio);
    pci_register_bar(&d->dev, 1, PCI_BASE_ADDRESS_SPACE_IO, &d->io);
    d->nic = qemu_new_nic(&net_e1000_info, &d->conf, object_get_typename(OBJECT(d)), d->dev.qdev.id, d);
    add_boot_device_path(d->conf.bootindex, &pci_dev->qdev, "/ethernet-phy@0");
}

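For a PCI device, once a contiguous range of physical memory has been mapped between the device and the CPU, the CPU can access the device just as it accesses memory. IO devices are usually accessed in one of two ways: port IO, which is the in/out instructions used in our demo, and MMIO, the memory-mapped access used by PCI devices. KVM can trap both.

qemu-kvm performs the operation on the Guest's behalf and then runs the corresponding "callback": it raises an interrupt on the vCPU to signal that the IO has completed, and the Guest resumes. A vCPU interrupt works like a CPU interrupt: it fires once the corresponding registers are set.

In a fully virtualized environment all Guest IO is taken over by qemu-kvm. The NIC seen in the Guest is not a real NIC; it is backed by a tap device created on the physical host. The features supported by the tap device are simply added to the driver registration information in the Guest, which is why the Guest sees a network device.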

image_1aqgs30fm15lj1pse1t6lm051751t.png-37.9kB

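As shown above, qemu takes over the IO operations from the Guest. In a real deployment the data must of course be sent on rather than printed as in the demo. The Guest has already added the layer-2 header with the MAC addresses, so qemu does not need to unpack and parse the data; it only writes the frame to the tap device. The tap device hands it to the bridge, and the bridge sends it straight to the NIC. The bridge (strictly, the NIC attached to the bridge) runs in promiscuous mode, so it can receive and forward all traffic.

The following is quoted from this article:

When a TAP device is created, a corresponding char device appears in the Linux device directory, and a user program can open it and read and write it like an ordinary file. On write(), data enters the TAP device; to the Linux network layer it is as if the TAP device had received a packet and asked the kernel to accept it, just as a physical NIC receives a packet from outside, except that the data actually comes from a user program on the same machine. Linux then processes the data according to the network configuration, which lets a user program inject data into the kernel network layer. On read(), the program asks the kernel whether the TAP device has data waiting to be sent and, if so, takes it into the user program, which completes the TAP device's transmit path. A vivid analogy: the application using the TAP device is like another computer, the TAP device is a NIC of the local machine, and the two are connected to each other. The application talks to the local network core through read()/write().

Operations like these:

fd = open("/dev/tap", XXX);
write(fd, buf, 1024);
read(fd, buf, 1024);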
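
The three lines above are pseudocode. On current Linux systems a TAP device is normally obtained by opening /dev/net/tun and issuing the TUNSETIFF ioctl. The following is a hedged sketch of that sequence; the helper names tap_fill_request and tap_open are made up for this example, and opening the device usually requires root or CAP_NET_ADMIN.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

/* Fill the request that asks the kernel for a TAP (layer 2) interface. */
static void tap_fill_request(struct ifreq *ifr, const char *name)
{
    assert(ifr != NULL && name != NULL);
    memset(ifr, 0, sizeof(*ifr));
    /* IFF_TAP: Ethernet frames; IFF_NO_PI: no extra packet-info header. */
    ifr->ifr_flags = IFF_TAP | IFF_NO_PI;
    strncpy(ifr->ifr_name, name, IFNAMSIZ - 1);
}

/* Return a file descriptor bound to the TAP interface, or -1 on error. */
static int tap_open(const char *name)
{
    struct ifreq ifr;
    int fd = open("/dev/net/tun", O_RDWR);
    if (fd < 0)
        return -1;
    tap_fill_request(&ifr, name);
    if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
        close(fd);
        return -1;
    }
    return fd; /* read()/write() on fd now move whole Ethernet frames */
}
```

A caller would use it as `int fd = tap_open("tap0");` and then read and write frames on fd, which is the role qemu plays on behalf of the Guest NIC.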

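The bridge may be a Linux bridge or an OVS (Open vSwitch). In network virtualization scenarios the VLAN tagging provided by the bridge is often needed.

That is the fully virtualized network IO path in KVM. Its weaknesses are visible too: under heavy traffic it causes too many VM switches and too many data copies, and copies waste CPU cycles. So in the course of its development qemu-kvm implemented the virtio driver.

KVM Virtio driver

Virtio-based virtualization is also called paravirtualization: it requires a virtio driver in the Guest, which means the Guest knows it is running in a virtual environment.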
image_1aqgvbg7090kr1a1meq1tvilc62a.png-48.6kB

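Unlike full virtualization, Virtio adds two queues and the corresponding queue-ready descriptors to the Guest's driver layer to communicate with the Virtio backend in qemu-kvm, and uses file descriptors (eventfd) in place of the earlier interrupts.
The Virtio front-end and backend exchange data through the vring buffer. qemu uses its event loop to track the state of the buffer: when data is placed in the buffer, qemu-kvm sees the eventfd become ready, reads the data, and sends it to the tap device. When data arrives from the tap device, qemu writes it into the buffer and signals the eventfd, and the front-end, seeing the event, reads the data from the buffer.

Virtio thus modifies full virtualization to reduce the cost of Guest exits and entries, and uses eventfd for signaling in place of the hardware interrupt model, which improves network IO performance to some extent.
Looking at the whole path, though, virtio still copies memory too often. For example, qemu-kvm copies data out of the vring buffer and sends it to the tap device, which involves a user-to-kernel copy plus a series of system calls. The path can be improved further, and so a kernel-side virtio backend appeared, called vhost-net.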
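
The eventfd signaling mentioned above can be tried in isolation. The sketch below uses the Linux eventfd(2) counter semantics: each write adds to a 64-bit counter and a read returns the accumulated value and resets it. The function names notify and wait_for_work are chosen for this example and are not taken from qemu.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Producer side: add one to the counter, which makes the fd readable. */
static int notify(int efd)
{
    uint64_t one = 1;
    assert(efd >= 0);
    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/* Consumer side: block until notified, then return the number of
 * notifications that accumulated; the read resets the counter to zero. */
static uint64_t wait_for_work(int efd)
{
    uint64_t count = 0;
    assert(efd >= 0);
    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return 0;
    return count;
}
```

Because several notifications collapse into one readable event, the consumer can drain the buffer in a single pass, which is part of why this is cheaper than one emulated interrupt per packet.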

KVM Vhost-net

The figure below compares virtio and vhost-net; it comes from the Red Hat website.

image_1aqh0s600fq41tj21b1r16331elm2n.png-143.8kB

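vhost-net bypasses QEMU: the Guest front-end communicates directly with a backend in the kernel, which reduces data copies, in particular user-to-kernel copies. Performance improves considerably; in terms of throughput, vhost-net can essentially saturate the bandwidth of a physical machine.
vhost-net needs kernel support. It has been available since Red Hat Enterprise Linux 6.1 and is enabled by default.

Summary

KVM network device IO virtualization has evolved from full virtualization to virtio to vhost-net, and performance keeps getting closer to a real physical NIC. A gap remains for small-packet workloads, but it is no longer a system bottleneck. After years of development KVM performs very well, which is one important reason it leads other virtualization solutions.
With network IO virtualization covered, the next chapter turns to block device virtualization, which also relies on DMA.

Introduction to block device IO virtualization

The previous article covered network IO virtualization. Block device IO is another important virtualized resource, and its virtualization matters just as much. As with network IO, block device IO can be fully virtualized or virtualized with virtio (virtio-blk). Modern block devices all work through DMA, so full virtualization resembles the network case, and virtio-blk follows the same design as virtio-net; only the virtio backend differs.

Traditional block device architecture

The block device IO stack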

image_1aqo0ufta16pqlsm7nvuvc1ur71g.png-133.1kB

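As shown above, if we think of the block device IO path as a protocol stack like TCP/IP, we can go through it from the top.

Page cache layer. For non-direct IO, a write returns as soon as it reaches this layer, provided there is enough memory. In IO terms this is writeback mode. There are two ways to make the data persistent: call flush explicitly, which synchronously writes this file's cache (the unit is the file) to disk, or wait for the system to flush automatically.

VFS, the virtual file system layer, gives the layers above a uniform set of system calls: the familiar create, open, read, write, and close all end up talking to the VFS. Besides providing that uniform interface, the VFS organizes the file system structure and defines the data structures for files, for example resolving dentries and inodes to find a file's information and locating the struct file that describes an open file. A file is really a description of a collection of scattered data on disk. On Linux a file is represented by an inode: to a system administrator the inode is the unique identifier of a file, and inside the system it is a structure that stores most of the information about the file. These structures carry callback operations that individual file systems fill in: a lower-level file system implements the file_operations interfaces to do the actual reads and writes.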

struct file_operations {
    // read from a file
    ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
    // write to a file
    ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
    int (*readdir) (struct file *, void *, filldir_t);
    // open a file
    int (*open) (struct inode *, struct file *);
};

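Below that are the individual file systems, such as ext3 and ext4. The interfaces mentioned under VFS are actually implemented here. These file systems do not drive the device directly; they go through the generic block layer beneath them. Why add a generic block layer? A file system has to work on many kinds of devices, perhaps an SSD or a USB device, each with a different driver. The file system should not have to adapt to each one; it only needs to be compatible with the generic block layer's interface.

Beneath the file systems is the generic block layer. In design terms it is an interface layer: an abstraction that hides the differences between block devices and gives the file systems above a uniform interface.

Beneath the generic block layer is the IO scheduler layer. The following command shows the system's IO scheduling algorithms.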

➜  ~ cat /sys/block/sda/queue/scheduler
noop deadline [cfq]

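noop can be seen as a FIFO (first in, first out) queue that does only simple merging of IO, such as merging adjacent requests. It suits block devices that need no seeking, such as SSDs.
cfq, completely fair queuing. Fairness is enforced per process. The system allocates N queues to hold requests from different processes; a process's IO requests are hashed to a queue, and the hash is consistent, so requests from the same process always land in the same queue. The system then services the N queues round-robin by time slice to perform the actual disk reads and writes.
deadline builds on the Linux elevator algorithm and adds two queues for IO requests that are about to expire or have already expired. These two queues have higher priority than the others, which prevents IO starvation.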
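
In the sysfs output shown earlier, the scheduler currently in use is the one in square brackets. A small helper can pick it out of the line; active_scheduler is a hypothetical name for this sketch, and reading the line from /sys/block/.../queue/scheduler is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the bracketed entry of a line such as "noop deadline [cfq]" into out.
 * Returns 0 on success, -1 if there is no bracketed entry or out is too small. */
static int active_scheduler(const char *line, char *out, size_t outlen)
{
    assert(line != NULL && out != NULL);
    const char *lb = strchr(line, '[');
    const char *rb = lb ? strchr(lb, ']') : NULL;
    if (lb == NULL || rb == NULL)
        return -1;
    size_t len = (size_t)(rb - lb - 1);
    if (len + 1 > outlen)
        return -1;
    memcpy(out, lb + 1, len);
    out[len] = '\0';
    return 0;
}
```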

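The block device driver layer holds the real drivers for the different block devices. It sets up the device's memory mappings, handles its interrupts, and performs the reads and writes.

The block device is the real storage device: SAS, SATA, SSD, and so on. A block device may have its own cache, usually called the disk cache. It matters to the driver layer: in writeback mode, for example, the driver can return as soon as the data is in the disk cache, and the device is responsible for persistence and consistency.
Block devices with a disk cache often have battery backup, so that on power loss the cache contents are preserved for a while and written to disk at the next startup.

Block device IO flow

Application reads and writes go through the read and write system calls, which are served by the Linux VFS and hide the complexity of the underlying block devices. A write is either direct IO or non-direct (buffered) IO. A buffered write returns as soon as the data is in the page cache; the data reaches disk later through the system's flush, so if power is lost before the flush completes, some data may be lost. Direct IO bypasses the page cache, and the call returns only after the data has been handed to the disk.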
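
The difference between "written to the page cache" and "flushed" can be seen with plain POSIX calls. This is a minimal sketch of a buffered write followed by an explicit fsync; write_and_flush is a name made up for this example, and Direct IO (the O_DIRECT open flag on Linux) is deliberately left out because it adds alignment requirements.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Buffered write followed by an explicit flush.
 * Returns 0 on success, -1 on any error. */
static int write_and_flush(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    /* write() returns once the data is in the page cache */
    ssize_t n = write(fd, buf, len);
    /* fsync() asks the kernel to push this file's dirty pages to the device */
    int rc = (n == (ssize_t)len && fsync(fd) == 0) ? 0 : -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

Without the fsync call the function would still succeed, but the data could sit in the page cache until the system's periodic flush.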

The read and write paths are fairly complex; the write path is summarized below:
image_1aqo0lnfn6m8ei6lef1qg019p713.png-46.9kB

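The user calls the write system call to write a file, which enters the sys_write function;
Going through the VFS layer, vfs_write is called. For a buffered write, the data is written to the page cache and the call returns; what follows is the dirty-page writeback flow. For Direct I/O, the path enters block device direct IO (do_blockdev_direct_IO);
A bio request is built and submit_bio is called to send it to the specific block device. submit_bio forwards the bio through generic_make_request, which is a loop that interacts with the block device through the q->make_request_fn function each device registers;
The request reaches the underlying block device, whose request handler __make_request is called. This function calls blk_queue_bio, which merges the bio into a request; this is where the I/O scheduler does its work. If the areas that several bios read or write are contiguous, they are merged into one request; otherwise a new request is created and the bio is attached to it. Merging has a limit: if the merged request would exceed the threshold (set in /sys/block/xxx/queue/max_sectors_kb), it cannot be merged and a new request is allocated;
The I/O from here on depends on the physical device; block device drivers also read and write using DMA.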

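As shown above, when an IO device is initialized, part of physical memory is allocated for it. This memory can be managed by the CPU's MMU and by the IOMMU on the IO bus, and serves as shared memory. Take a read as an example: when the CPU needs some content from a block device, it tells the device the memory address, the size, and the block address to read, and then returns to other work. When the block device has read the data, it writes it into the shared memory and raises an interrupt to tell the CPU that the IO has completed, and the CPU then reads the data directly from memory.
Writes work similarly, through shared memory. This frees the CPU: it does not have to wait synchronously for IO to complete or spend many cycles on it.
Block device IO virtualization passes through the IO stack twice, once in the Guest and once in the HV, which is why the stack is described in some detail here.

That completes the overview of the Linux block IO layers. It is only a brief introduction; the topic can be explored in much more depth, but space is limited.

Block device IO virtualization

Full virtualization of block devices is similar to the DMA device virtualization described for network IO, so it is not covered again. The focus here is virtio-blk.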

image_1aqo1nvbo1o615hn2kuptlbso1t.png-52.7kB

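As shown above, the virtualized block IO path is largely the same as the network IO path. The difference is in the virtio backend: virtio-net writes to a tap device, while virtio-blk writes to an image file.
Block device IO passes through the IO stack twice, once in the Guest and once in the HV. The cache mode specified for virtio actually selects how the virtio backend (v-backend below) writes to the HV block device.
At the virtualization level the Guest is unaware of these cache modes. Whatever the mode, nothing bypasses the Guest's page cache: the virtio front-end emulates driver-level operations and does not touch the upper layers of the IO stack.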

image_1aqo7picj8ks1qng1lt51errfvim.png-62.7kB

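As shown above, blue is writethrough, yellow is none, and red is writeback. The dashed line marks the layer at which the write call returns.

cache=writethrough (blue line)
The v-backend opens the image file and writes with non-direct IO plus flush: every write goes to the page cache and is flushed, and the write call returns only after the data has actually reached the disk. Writes are inevitably slower, but safety is higher.
cache=none (yellow line)
The v-backend writes the file with O_DIRECT, bypassing the HV page cache and writing straight to the disk. If the disk has a disk cache, the write returns once the data is in the disk cache. This mode keeps performance good while also keeping data safe, provided the disk cache is battery-backed. Reads, however, cannot be served from the HV page cache, so read performance takes some hit.
cache=writeback (red line)
The v-backend uses non-direct IO and returns as soon as the data is in the HV page cache, so data may be lost.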
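
The three modes differ in only two respects: whether the HV page cache is used, and whether every write is flushed before it completes. The sketch below encodes just that summary. It is not how QEMU represents cache modes internally; struct cache_policy and cache_policy_for are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* How the v-backend writes to the image file under a given cache mode. */
struct cache_policy {
    bool use_host_page_cache;  /* false means direct IO on the host */
    bool flush_on_every_write; /* true means a write completes only after a flush */
};

/* Fill *out for "writethrough", "none" or "writeback".
 * Returns false for any other mode string. */
static bool cache_policy_for(const char *mode, struct cache_policy *out)
{
    assert(mode != NULL && out != NULL);
    if (strcmp(mode, "writethrough") == 0) {
        out->use_host_page_cache = true;
        out->flush_on_every_write = true;
    } else if (strcmp(mode, "none") == 0) {
        out->use_host_page_cache = false;
        out->flush_on_every_write = false;
    } else if (strcmp(mode, "writeback") == 0) {
        out->use_host_page_cache = true;
        out->flush_on_every_write = false;
    } else {
        return false;
    }
    return true;
}
```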

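Summary

Block device IO virtualization also follows the common virtio-x model, but virtio-blk passes through the IO stack twice, which adds overhead. The groundwork above was laid to introduce the three important cache modes.

Implementing an emulator demo with the KVM API

This article describes how to write a demo virtualizer with the KVM API, that is, something equivalent to Qemu that does device emulation. It is meant for people who want to understand how KVM and Qemu work, or just for fun.

The complete code is here: https://github.com/soulxu/kvmsample

This code was actually written long ago and shared within my team to help everyone understand how kvm works. Since I am starting to write about code now, it makes a good opening.

Of course I cannot write a complete Qemu; this is only the most basic code found in Qemu. The virtual machine has one VCPU and 512000000 bytes of memory (more than enough). It can do some I/O, although the I/O only results in prints; no real device is emulated. So the Guest it can run is also very simple.

First, let's see how simple the Guest is.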

   .globl _start
    .code16
_start:
    xorw %ax, %ax

loop1:
    out %ax, $0x10
    inc %ax
    jmp loop1

  

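Even if you don't know assembly, this code is simple enough to guess what it does. The Guest is an endless loop written in AT&T assembly that runs in 8086 (real) mode and keeps writing to port 0x10.

Our goal is to get this Guest running. Now let's look at the code of our virtual machine.

We start with the main function: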

   int main(int argc, char **argv) {
    int ret = 0;
    struct kvm *kvm = kvm_init();

    if (kvm == NULL) {
        fprintf(stderr, "kvm init failed\n");
        return -1;
    }

    if (kvm_create_vm(kvm, RAM_SIZE) < 0) {
        fprintf(stderr, "create vm failed\n");
        return -1;
    }

    load_binary(kvm);

    // only support one vcpu now
    kvm->vcpu_number = 1;
    kvm->vcpus = kvm_init_vcpu(kvm, 0, kvm_cpu_thread);

    kvm_run_vm(kvm);

    kvm_clean_vm(kvm);
    kvm_clean_vcpu(kvm->vcpus);
    kvm_clean(kvm);

    return ret;
}

  

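This is the first basic principle of kvm: a virtual machine is a process. Our virtual machine starts from this main function.

First, kvm_init. It simply opens the /dev/kvm device, which is the entry point to kvm; every kvm operation is performed through ioctl on a file descriptor. The function stores the returned file descriptor in a structure I defined myself.

Next we create a vm and allocate memory for it.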

   kvm->vm_fd = ioctl(kvm->dev_fd, KVM_CREATE_VM, 0);

  

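Creating a virtual machine is simple: issue this ioctl on the kvm device. It returns a file descriptor for the new vm, which is used to operate on it.

Then we allocate memory. The most important piece here is the struct kvm_userspace_memory_region data structure.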

   /* for KVM_SET_USER_MEMORY_REGION */
struct kvm_userspace_memory_region {
        __u32 slot;
        __u32 flags;
        __u64 guest_phys_addr;
        __u64 memory_size; /* bytes */
        __u64 userspace_addr; /* start of the userspace allocated memory */
};

  

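memory_size is the size of the guest's memory. userspace_addr is the start address of the memory you allocated for it, and guest_phys_addr is the guest physical address at which this memory is mapped.

Here an anonymous mapping is created with mmap and its address is stored in userspace_addr. We then tell our vm about it: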

   ioctl(kvm->vm_fd, KVM_SET_USER_MEMORY_REGION, &(kvm->mem));

  

Note that this operates on our vm, not on the kvm device file.

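Now that we have memory, we can load the guest code. The implementation is simple: open the compiled binary and write it into the memory we allocated. The point to watch is how the guest code is compiled: we build a flat binary, with no ELF wrapping.

With memory in place, the next step is the vcpu, which is created in kvm_init_vcpu. The most important operations are these: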

   vcpu->kvm_run_mmap_size = ioctl(kvm->dev_fd, KVM_GET_VCPU_MMAP_SIZE, 0);
...
vcpu->kvm_run = mmap(NULL, vcpu->kvm_run_mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED, vcpu->vcpu_fd, 0);

  

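struct kvm_run is a data structure that holds vcpu state; as we will see shortly, it tells us the specific reason for an exit after the guest traps.

With memory and a vcpu we can run: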

   pthread_create(&(kvm->vcpus->vcpu_thread), (const pthread_attr_t *)NULL, kvm->vcpus[i].vcpu_thread_func, kvm)

  

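This is another basic kvm concept: a vcpu is a thread. Here we create a thread for the vcpu.

Finally we reach the key part, the vcpu thread. It is simply a loop. At the top of the loop we let it execute guest code: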

   ret = ioctl(kvm->vcpus->vcpu_fd, KVM_RUN, 0)

  

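After this statement runs, the guest code starts executing and the call blocks. It returns only when something happens that the hypervisor has to handle, for example an I/O operation. We then read the specific exit reason from struct kvm_run. Our guest only does I/O port operations, so when the exit reason is KVM_EXIT_IO, I print the data the guest wrote.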

That is all there is to this virtualizer. To try it, just run make.

   :~/code/kvmsample$ make
cc    -c -o main.o main.c
gcc main.c -o kvmsample -lpthread
as -32 test.S -o test.o
ld -m elf_i386 --oformat binary -N -e _start -Ttext 0x10000 -o test.bin test.o

  

Then run kvmsample

   $ ./kvmsample
read size: 712288
KVM start run
KVM_EXIT_IO
out port: 16, data: 0
KVM start run
KVM_EXIT_IO
out port: 16, data: 1
KVM start run
KVM_EXIT_IO
out port: 16, data: 2
KVM start run
KVM_EXIT_IO
out port: 16, data: 3
....

  

The code in qemu is essentially the same, and you can find this loop there too; it is just hidden behind qemu's various device frameworks.

The entry point of QEMU is int main in vl.c

int main(int argc, char **argv, char **envp)
{
    ......
    GMemVTable mem_trace = {
        .malloc = malloc_and_trace,
        .realloc = realloc_and_trace,
        .free = free_and_trace,
    }; // This data structure is prepared for glib.
       // The three values are function pointers which are required by glib if you want to use glib library.

    g_mem_set_vtable(&mem_trace);// Register these functions in glib

    g_thread_init(NULL);// If you use gLib from more than one thread, you must initialize the thread system by calling

    module_call_init(MODULE_INIT_QOM);
    // Check Code1-1 for Module_INIT_QOM definition.
    // This function will retrieve ModuleEntry in the list of Module_INIT_QOM. Actually, QEMU maintain one list for each Module Type. Then execute the init function for each ModuleEntry.
    // See module_call_init for which "init" function will be executed?

}

MODULE_INIT_TYPE (Code1-1)

typedef enum {
    MODULE_INIT_BLOCK,
    MODULE_INIT_MACHINE,
    MODULE_INIT_QAPI,
    MODULE_INIT_QOM,
    MODULE_INIT_MAX
} module_init_type;

Four types Module Definition

#define block_init(function) module_init(function, MODULE_INIT_BLOCK)
#define machine_init(function) module_init(function, MODULE_INIT_MACHINE)
#define qapi_init(function) module_init(function, MODULE_INIT_QAPI)
#define type_init(function) module_init(function, MODULE_INIT_QOM)

module_init()

#define module_init(function, type)                                         \
static void __attribute__((constructor)) do_qemu_init_ ## function(void) {  \
    register_module_init(function, type);                                   \
}

register_module_init()

void register_module_init(void (*fn)(void), module_init_type type)
{
    ModuleEntry *e;
    ModuleTypeList *l;

    e = g_malloc0(sizeof(*e));
    e->init = fn;

    l = find_type(type);

    QTAILQ_INSERT_TAIL(l, e, node);
}

module_call_init(...)

void module_call_init(module_init_type type)
{
    ModuleTypeList *l;
    ModuleEntry *e;

    l = find_type(type);

    QTAILQ_FOREACH(e, l, node) {
        e->init(); // Here, only the init function for QOM module is executed.
                   // For example, one object "x86_cpu_register_types" is registered by type_init(x86_cpu_register_types)
                   // Then only x86_cpu_register_types() is executed and this function only set up each parameter in TypeInfo
                   // TypeInfo can be found in the following.
    }
}
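
The registration trick used by module_init() can be reproduced in a few lines. The sketch below is a simplified stand-in, assuming GCC or Clang for the constructor attribute: it keeps one fixed-size table instead of QEMU's per-type lists, and the names demo_init, register_init, call_all_init and register_cpu_types are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INIT 16

typedef void (*init_fn)(void);

static init_fn init_table[MAX_INIT];
static size_t init_count;

/* Counterpart of register_module_init(): remember fn for later. */
static void register_init(init_fn fn)
{
    assert(init_count < MAX_INIT);
    init_table[init_count++] = fn;
}

/* Counterpart of module_init(): emit a constructor that runs before main()
 * and registers fn, without calling it yet. */
#define demo_init(fn)                                                   \
    static void __attribute__((constructor)) do_register_##fn(void)    \
    {                                                                   \
        register_init(fn);                                              \
    }

static int cpu_types_registered;

static void register_cpu_types(void)
{
    cpu_types_registered = 1;
}

demo_init(register_cpu_types)

/* Counterpart of module_call_init(): run every registered function. */
static void call_all_init(void)
{
    for (size_t i = 0; i < init_count; i++)
        init_table[i]();
}
```

By the time main() starts, the table already holds register_cpu_types, but the function itself only runs when call_all_init() is invoked; this is the two-phase behaviour described in the comments of module_call_init() above.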

TypeInfo: the basic parameters used to describe a QEMU Object type.

/**
 * TypeInfo:
 * @name: The name of the type.
 * @parent: The name of the parent type.
 * @instance_size: The size of the object (derivative of #Object).  If
 *   @instance_size is 0, then the size of the object will be the size of the
 *   parent object.
 * @instance_init: This function is called to initialize an object.  The parent
 *   class will have already been initialized so the type is only responsible
 *   for initializing its own members.
 * @instance_post_init: This function is called to finish initialization of
 *   an object, after all @instance_init functions were called.
 * @instance_finalize: This function is called during object destruction.  This
 *   is called before the parent @instance_finalize function has been called.
 *   An object should only free the members that are unique to its type in this
 *   function.
 * @abstract: If this field is true, then the class is considered abstract and
 *   cannot be directly instantiated.
 * @class_size: The size of the class object (derivative of #ObjectClass)
 *   for this object.  If @class_size is 0, then the size of the class will be
 *   assumed to be the size of the parent class.  This allows a type to avoid
 *   implementing an explicit class type if they are not adding additional
 *   virtual functions.
 * @class_init: This function is called after all parent class initialization
 *   has occurred to allow a class to set its default virtual method pointers.
 *   This is also the function to use to override virtual methods from a parent
 *   class.
 * @class_base_init: This function is called for all base classes after all
 *   parent class initialization has occurred, but before the class itself
 *   is initialized.  This is the function to use to undo the effects of
 *   memcpy from the parent class to the descendents.
 * @class_finalize: This function is called during class destruction and is
 *   meant to release and dynamic parameters allocated by @class_init.
 * @class_data: Data to pass to the @class_init, @class_base_init and
 *   @class_finalize functions.  This can be useful when building dynamic
 *   classes.
 * @interfaces: The list of interfaces associated with this type.  This
 *   should point to a static array that's terminated with a zero filled
 *   element.
 */
struct TypeInfo
{
    const char *name;
    const char *parent;

    size_t instance_size;
    void (*instance_init)(Object *obj);
    void (*instance_post_init)(Object *obj);
    void (*instance_finalize)(Object *obj);

    bool abstract;
    size_t class_size;

    void (*class_init)(ObjectClass *klass, void *data);
    void (*class_base_init)(ObjectClass *klass, void *data);
    void (*class_finalize)(ObjectClass *klass, void *data);
    void *class_data;

    InterfaceInfo *interfaces;
};
QEMU Part

configure_accelerator()

static int configure_accelerator(void)
{
    const char *p;
    char buf[10];
    int i, ret;
    bool accel_initialised = false;
    bool init_failed = false;

    p = qemu_opt_get(qemu_get_machine_opts(), "accel");
    if (p == NULL) {
        /* Use the default "accelerator", tcg */
        p = "tcg";
    }

    while (!accel_initialised && *p != '\0') {
        if (*p == ':') {
            p++;
        }
        p = get_opt_name(buf, sizeof (buf), p, ':'); // buf records the name of hypervisor
        for (i = 0; i < ARRAY_SIZE(accel_list); i++) { // accel_list[] can be found in the following.
            if (strcmp(accel_list[i].opt_name, buf) == 0) { // Retrieve accel_list to match the name of hypervisor
                if (!accel_list[i].available()) { // Check if this hypervisor is available
                    printf("%s not supported for this target\n",
                           accel_list[i].name);
                    continue;
                }
                *(accel_list[i].allowed) = true;
                ret = accel_list[i].init(); // Run the initialization function for the chosen hypervisor. For example, kvm -> kvm_init()
                if (ret < 0) {
                    init_failed = true;
                    fprintf(stderr, "failed to initialize %s: %s\n",
                            accel_list[i].name,
                            strerror(-ret));
                    *(accel_list[i].allowed) = false;
                } else {
                    accel_initialised = true;
                }
                break;
            }
        }
        if (i == ARRAY_SIZE(accel_list)) {
            fprintf(stderr, "\"%s\" accelerator does not exist.\n", buf);
        }
    }

    if (!accel_initialised) {
        if (!init_failed) {
            fprintf(stderr, "No accelerator found!\n");
        }
        exit(1);
    }

    if (init_failed) {
        fprintf(stderr, "Back to %s accelerator.\n", accel_list[i].name);
    }

    return !accel_initialised;
}

accel_list[]
   
static struct {
    const char *opt_name;
    const char *name;
    int (*available)(void);
    int (*init)(void);
    bool *allowed;
} accel_list[] = {
    { "tcg", "tcg", tcg_available, tcg_init, &tcg_allowed },
    { "xen", "Xen", xen_available, xen_init, &xen_allowed },
    { "kvm", "KVM", kvm_available, kvm_init, &kvm_allowed }, // kvm_available and kvm_init are both pointers to functions.
    { "qtest", "QTest", qtest_available, qtest_init, &qtest_allowed },
};

kvm_init()
int kvm_init(void)
{
    static const char upgrade_note[] =
        "Please upgrade to at least kernel 2.6.29 or recent kvm-kmod\n"
        "(see http://sourceforge.net/projects/kvm).\n";
    struct {
        const char *name;
        int num;
    } num_cpus[] = {
        { "SMP",          smp_cpus },
        { "hotpluggable", max_cpus },
        { NULL, }
    }, *nc = num_cpus;
    int soft_vcpus_limit, hard_vcpus_limit;
    KVMState *s;
    const KVMCapabilityInfo *missing_cap;
    int ret;
    int i;

    s = g_malloc0(sizeof(KVMState)); // KVMState definition could be found in the following

    /*
     * On systems where the kernel can support different base page
     * sizes, host page size may be different from TARGET_PAGE_SIZE,
     * even with KVM.  TARGET_PAGE_SIZE is assumed to be the minimum
     * page size for the system though.
     */
    assert(TARGET_PAGE_SIZE <= getpagesize());

#ifdef KVM_CAP_SET_GUEST_DEBUG
    QTAILQ_INIT(&s->kvm_sw_breakpoints);
#endif
    for (i = 0; i < ARRAY_SIZE(s->slots); i++) {
        s->slots[i].slot = i;
    }
    s->vmfd = -1;
    s->fd = qemu_open("/dev/kvm", O_RDWR); // Return the KVM fd
    if (s->fd == -1) {
        fprintf(stderr, "Could not access KVM kernel module: %m\n");
        ret = -errno;
        goto err;
    }

    ret = kvm_ioctl(s, KVM_GET_API_VERSION, 0); // Send Command "KVM_GET_API_VERSION" to KVM fd. kvm_ioctl can be found in the following.
    if (ret < KVM_API_VERSION) {                // The definition of "KVM_GET_API_VERSION" could be found in the following.
        if (ret > 0) {
            ret = -EINVAL;
        }
        fprintf(stderr, "kvm version too old\n");
        goto err;
    }

    if (ret > KVM_API_VERSION) {
        ret = -EINVAL;
        fprintf(stderr, "kvm version not supported\n");
        goto err;
    }

    /* check the vcpu limits */
    soft_vcpus_limit = kvm_recommended_vcpus(s);
    // call kvm_check_extension(s, KVM_CAP_NR_VCPUS) -> kvm_ioctl(s, KVM_CHECK_EXTENSION, KVM_CAP_NR_VCPUS). Default is 4 vcpus.
    // s: KVMState.
    // KVM_CAP_NR_VCPUS: one of API in KVM_Extension.
    // KVM_CHECK_EXTENSION: KVM API which can be recognized by KVM fd.

    hard_vcpus_limit = kvm_max_vcpus(s);
    // call kvm_check_extension(s, KVM_CAP_MAX_VCPUS) -> kvm_ioctl(s, KVM_CHECK_EXTENSION, KVM_CAP_MAX_VCPUS). Default is 4 vcpus.
    // KVM_CAP_MAX_VCPUS: one of API in KVM Extension.
    // KVM_CHECK_EXTENSION: KVM API which can be recognized by KVM fd.
    while (nc->name) {
        if (nc->num > soft_vcpus_limit) {
            fprintf(stderr,
                    "Warning: Number of %s cpus requested (%d) exceeds "
                    "the recommended cpus supported by KVM (%d)\n",
                    nc->name, nc->num, soft_vcpus_limit);

            if (nc->num > hard_vcpus_limit) {
                ret = -EINVAL;
                fprintf(stderr, "Number of %s cpus requested (%d) exceeds "
                        "the maximum cpus supported by KVM (%d)\n",
                        nc->name, nc->num, hard_vcpus_limit);
                goto err;
            }
        }
        nc++;
    }
    /* call KVM API "KVM_CREATE_VM" to create a new VM */
    s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, 0); // VM fd will be assigned to s->vmfd
    if (s->vmfd < 0) {
#ifdef TARGET_S390X
        fprintf(stderr, "Please add the 'switch_amode' kernel parameter to "
                        "your host kernel command line\n");
#endif
        ret = s->vmfd;
        goto err;
    }

    missing_cap = kvm_check_extension_list(s, kvm_required_capabilites);
    if (!missing_cap) {
        missing_cap =
            kvm_check_extension_list(s, kvm_arch_required_capabilities);
    }
    if (missing_cap) {
        ret = -EINVAL;
        fprintf(stderr, "kvm does not support %s\n%s",
                missing_cap->name, upgrade_note);
        goto err;
    }

    s->coalesced_mmio = kvm_check_extension(s, KVM_CAP_COALESCED_MMIO);

    s->broken_set_mem_region = 1;
    ret = kvm_check_extension(s, KVM_CAP_JOIN_MEMORY_REGIONS_WORKS);
    if (ret > 0) {
        s->broken_set_mem_region = 0;
    }

#ifdef KVM_CAP_VCPU_EVENTS
    s->vcpu_events = kvm_check_extension(s, KVM_CAP_VCPU_EVENTS);
#endif

    s->robust_singlestep =
        kvm_check_extension(s, KVM_CAP_X86_ROBUST_SINGLESTEP);

#ifdef KVM_CAP_DEBUGREGS
    s->debugregs = kvm_check_extension(s, KVM_CAP_DEBUGREGS);
#endif

#ifdef KVM_CAP_XSAVE
    s->xsave = kvm_check_extension(s, KVM_CAP_XSAVE);
#endif

#ifdef KVM_CAP_XCRS
    s->xcrs = kvm_check_extension(s, KVM_CAP_XCRS);
#endif

#ifdef KVM_CAP_PIT_STATE2
    s->pit_state2 = kvm_check_extension(s, KVM_CAP_PIT_STATE2);
#endif

#ifdef KVM_CAP_IRQ_ROUTING
    s->direct_msi = (kvm_check_extension(s, KVM_CAP_SIGNAL_MSI) > 0);
#endif

    s->intx_set_mask = kvm_check_extension(s, KVM_CAP_PCI_2_3);

    s->irq_set_ioctl = KVM_IRQ_LINE;
    if (kvm_check_extension(s, KVM_CAP_IRQ_INJECT_STATUS)) {
        s->irq_set_ioctl = KVM_IRQ_LINE_STATUS;
    }

#ifdef KVM_CAP_READONLY_MEM
    kvm_readonly_mem_allowed =
        (kvm_check_extension(s, KVM_CAP_READONLY_MEM) > 0);
#endif

    ret = kvm_arch_init(s);
    if (ret < 0) {
        goto err;
    }

    ret = kvm_irqchip_create(s);
    if (ret < 0) {
        goto err;
    }

    kvm_state = s;
    memory_listener_register(&kvm_memory_listener, &address_space_memory);
    memory_listener_register(&kvm_io_listener, &address_space_io);

    s->many_ioeventfds = kvm_check_many_ioeventfds();

    cpu_interrupt_handler = kvm_handle_interrupt;

    return 0;

err:
    if (s->vmfd >= 0) {
        close(s->vmfd);
    }
    if (s->fd != -1) {
        close(s->fd);
    }
    g_free(s);

    return ret;
}

KVMState
   
struct KVMState
{
    KVMSlot slots[32];
    int fd;    // KVM fd
    int vmfd;  // VM fd
    int coalesced_mmio;
    struct kvm_coalesced_mmio_ring *coalesced_mmio_ring;
    bool coalesced_flush_in_progress;
    int broken_set_mem_region;
    int migration_log;
    int vcpu_events;
    int robust_singlestep;
    int debugregs;
#ifdef KVM_CAP_SET_GUEST_DEBUG
    struct kvm_sw_breakpoint_head kvm_sw_breakpoints;
#endif
    int pit_state2;
    int xsave, xcrs;
    int many_ioeventfds;
    int intx_set_mask;
    /* The man page (and posix) say ioctl numbers are signed int, but
     * they're not.  Linux, glibc and *BSD all treat ioctl numbers as
     * unsigned, and treating them as signed here can break things */
    unsigned irq_set_ioctl;
#ifdef KVM_CAP_IRQ_ROUTING
    struct kvm_irq_routing *irq_routes;
    int nr_allocated_irq_routes;
    uint32_t *used_gsi_bitmap;
    unsigned int gsi_count;
    QTAILQ_HEAD(msi_hashtab, KVMMSIRoute) msi_hashtab[KVM_MSI_HASHTAB_SIZE];
    bool direct_msi;
#endif
};

kvm_ioctl(KVMState *s, int type, ...)

    int kvm_ioctl(KVMState *s, int type, ...)
    {
        int ret;
        void *arg;
        va_list ap;

        va_start(ap, type); // Fetch the single optional argument with va_start/va_arg/va_end.
        arg = va_arg(ap, void *);
        va_end(ap);

        trace_kvm_ioctl(type, arg);
        ret = ioctl(s->fd, type, arg); // Issue the ioctl on the KVM fd (/dev/kvm).
        if (ret == -1) {
            ret = -errno; // Report failures as a negative errno value.
        }
        return ret;
    }
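
The convention used by kvm_ioctl() (take one optional pointer argument, return the ioctl result, and fold failures into a negative errno) can be tried outside QEMU. The following is only a minimal sketch: `demo_ioctl` is a hypothetical name, not a QEMU function, and it takes the request as `unsigned long` because that is how Linux and glibc declare it.

```c
#include <errno.h>
#include <stdarg.h>
#include <sys/ioctl.h>

/*
 * Hypothetical helper modelled on the kvm_ioctl() listing above.
 * It forwards exactly one optional pointer argument to ioctl(2) and
 * converts the "-1 plus errno" failure convention into a negative
 * errno return value.
 */
int demo_ioctl(int fd, unsigned long request, ...)
{
    va_list ap;
    void *arg;
    int ret;

    /* Callers must always pass the optional argument (0 or NULL if unused). */
    va_start(ap, request);
    arg = va_arg(ap, void *);
    va_end(ap);

    ret = ioctl(fd, request, arg);
    if (ret == -1) {
        ret = -errno;   /* e.g. -EBADF for an invalid descriptor */
    }
    return ret;
}
```

With this shape a caller can write `if (ret < 0)` and later print `strerror(-ret)`, which is the pattern the QEMU listings in this post rely on.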

KVM Part

Register ioctl handler

vmx_init() -> kvm_init(...) -> misc_register(&kvm_dev), where kvm_dev points to kvm_chardev_ops

kvm_dev

    static struct miscdevice kvm_dev = {
        KVM_MINOR,
        "kvm",
        &kvm_chardev_ops,
    };

kvm_chardev_ops

    static struct file_operations kvm_chardev_ops = {
        .unlocked_ioctl = kvm_dev_ioctl, // Invoked when QEMU issues an ioctl on the KVM fd (e.g. KVM_GET_API_VERSION).
        .compat_ioctl   = kvm_dev_ioctl, // Same handler for 32-bit callers on a 64-bit kernel.
        .llseek  = noop_llseek,
    };

KVM API

The ioctl requests handled for "/dev/kvm" include the following.

KVM_GET_API_VERSION

This API just returns the API version of KVM. The request number is defined in ./include/uapi/linux/kvm.h:

    #define KVM_GET_API_VERSION       _IO(KVMIO,   0x00)  // Per the macros below, the request number packs four fields (dir, type, nr, size), each shifted to its own bit position.

    #define _IO(type,nr)  _IOC(_IOC_NONE,(type),(nr),0)

    #define _IOC(dir,type,nr,size)   \
     ((unsigned int)    \
      (((dir)  << _IOC_DIRSHIFT) |  \
       ((type) << _IOC_TYPESHIFT) |  \
       ((nr)   << _IOC_NRSHIFT) |  \
       ((size) << _IOC_SIZESHIFT)))
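
To make the packing concrete, here is a small sketch that rebuilds the request numbers by hand. The shift values are those of the generic Linux layout used on x86 (nr in bits 0-7, type in bits 8-15, size in bits 16-29, dir in bits 30-31); some architectures use different field widths, so treat them as an assumption. `KVMIO` is 0xAE in the kernel headers; the `demo_`/`DEMO_` names are made up for this illustration.

```c
#include <stdint.h>

/* Field positions of the generic ioctl layout (assumed: x86-style layout). */
enum {
    DEMO_IOC_NRSHIFT   = 0,
    DEMO_IOC_TYPESHIFT = 8,
    DEMO_IOC_SIZESHIFT = 16,
    DEMO_IOC_DIRSHIFT  = 30
};

#define DEMO_KVMIO 0xAEu   /* same value as KVMIO in linux/kvm.h */

/* Hypothetical equivalent of the _IOC() macro shown above. */
uint32_t demo_ioc(uint32_t dir, uint32_t type, uint32_t nr, uint32_t size)
{
    return (dir  << DEMO_IOC_DIRSHIFT)  |
           (type << DEMO_IOC_TYPESHIFT) |
           (nr   << DEMO_IOC_NRSHIFT)   |
           (size << DEMO_IOC_SIZESHIFT);
}

/* Extract the "type" (driver magic) field from a request number. */
uint32_t demo_ioc_type(uint32_t request)
{
    return (request >> DEMO_IOC_TYPESHIFT) & 0xFFu;
}

/* Extract the "nr" (command index) field from a request number. */
uint32_t demo_ioc_nr(uint32_t request)
{
    return (request >> DEMO_IOC_NRSHIFT) & 0xFFu;
}
```

Since `_IO()` passes no direction and no size, `demo_ioc(0, DEMO_KVMIO, 0x00, 0)` evaluates to 0xAE00, which is the value of KVM_GET_API_VERSION on this layout.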

KVM_CREATE_VM

QEMU uses this API to ask KVM to create a VM. "KVM_CREATE_VM" is defined in the same way as KVM_GET_API_VERSION.

    case KVM_CREATE_VM:
        r = kvm_dev_ioctl_create_vm(arg);
        break;

QEMU Source Code Study (3) - KVM_CREATE_VCPU

x86_cpu_register_types()
----> type_register_static(&x86_cpu_type_info)
----> TypeInfo x86_cpu_type_info.class_init = x86_cpu_common_class_init
----> x86_cpu_common_class_init(ObjectClass *oc, void *data)
----> dc->realize = x86_cpu_realizefn
----> x86_cpu_realizefn(DeviceState *dev, Error **error)
----> qemu_init_vcpu(cpu)
----> qemu_kvm_start_vcpu(cpu)
----> qemu_thread_create(cpu->thread, qemu_kvm_cpu_thread_fn, cpu)
----> qemu_kvm_cpu_thread_fn(arg)
----> kvm_cpu_exec(cpu)
----> kvm_vcpu_ioctl(cpu, KVM_RUN, 0);

How are these objects executed?

QEMU Part

kvm_init_vcpu(...)

    int kvm_init_vcpu(CPUState *cpu)
    {
        KVMState *s = kvm_state;
        long mmap_size;
        int ret;

        DPRINTF("kvm_init_vcpu\n");

        ret = kvm_vm_ioctl(s, KVM_CREATE_VCPU, (void *)kvm_arch_vcpu_id(cpu)); // Send the "KVM_CREATE_VCPU" command to the VM fd; the return value is the new vcpu fd.
        if (ret < 0) {
            DPRINTF("kvm_create_vcpu failed\n");
            goto err;
        }

        cpu->kvm_fd = ret;
        cpu->kvm_state = s;
        cpu->kvm_vcpu_dirty = true;

        mmap_size = kvm_ioctl(s, KVM_GET_VCPU_MMAP_SIZE, 0); // Send the "KVM_GET_VCPU_MMAP_SIZE" command to the KVM fd (/dev/kvm).
        if (mmap_size < 0) {
            ret = mmap_size;
            DPRINTF("KVM_GET_VCPU_MMAP_SIZE failed\n");
            goto err;
        }

        // Map the vcpu fd's shared kvm_run area into QEMU's address space, so that
        // QEMU and KVM exchange exit information through memory rather than read/write calls.
        cpu->kvm_run = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE, MAP_SHARED,
                            cpu->kvm_fd, 0);
        if (cpu->kvm_run == MAP_FAILED) {
            ret = -errno;
            DPRINTF("mmap'ing vcpu state failed\n");
            goto err;
        }

        if (s->coalesced_mmio && !s->coalesced_mmio_ring) {
            s->coalesced_mmio_ring =
                (void *)cpu->kvm_run + s->coalesced_mmio * PAGE_SIZE;
        }

        ret = kvm_arch_init_vcpu(cpu); // This function is quite long and important. I guess it is responsible for configuring the vcpu;
                                       // it calls cpu_x86_cpuid(...) to build the CPUID information presented to the guest.
        if (ret == 0) {
            qemu_register_reset(kvm_reset_vcpu, cpu);
            kvm_arch_reset_vcpu(cpu);
        }
    err:
        return ret;
    }
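
The three file descriptors involved here (the KVM fd, the VM fd and the vcpu fd) and the mmap of the kvm_run area can be reproduced in a few lines of user-space C. The sketch below is not QEMU code: `struct demo_vcpu`, `demo_vcpu_open` and `demo_vcpu_close` are hypothetical names, the vcpu id is fixed to 0, and it assumes a Linux host with the kernel headers installed. The ioctl names are the ones used in the listing above.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/kvm.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical container for the fd hierarchy: /dev/kvm -> VM -> vcpu. */
struct demo_vcpu {
    int kvm_fd;           /* fd of /dev/kvm                  */
    int vm_fd;            /* returned by KVM_CREATE_VM       */
    int vcpu_fd;          /* returned by KVM_CREATE_VCPU     */
    struct kvm_run *run;  /* shared area mapped from vcpu_fd */
    size_t run_size;      /* from KVM_GET_VCPU_MMAP_SIZE     */
};

void demo_vcpu_close(struct demo_vcpu *v)
{
    if (v->run != NULL) munmap(v->run, v->run_size);
    if (v->vcpu_fd >= 0) close(v->vcpu_fd);
    if (v->vm_fd >= 0) close(v->vm_fd);
    if (v->kvm_fd >= 0) close(v->kvm_fd);
    v->run = NULL;
    v->kvm_fd = v->vm_fd = v->vcpu_fd = -1;
}

/* Returns 0 on success or a negative errno value on failure. */
int demo_vcpu_open(struct demo_vcpu *v)
{
    int ret;
    void *map;

    v->kvm_fd = v->vm_fd = v->vcpu_fd = -1;
    v->run = NULL;
    v->run_size = 0;

    v->kvm_fd = open("/dev/kvm", O_RDWR);
    if (v->kvm_fd < 0) return -errno;

    v->vm_fd = ioctl(v->kvm_fd, KVM_CREATE_VM, 0UL);      /* system ioctl */
    if (v->vm_fd < 0) { ret = -errno; goto err; }

    v->vcpu_fd = ioctl(v->vm_fd, KVM_CREATE_VCPU, 0UL);   /* VM ioctl, vcpu id 0 */
    if (v->vcpu_fd < 0) { ret = -errno; goto err; }

    ret = ioctl(v->kvm_fd, KVM_GET_VCPU_MMAP_SIZE, 0UL);  /* system ioctl */
    if (ret < 0) { ret = -errno; goto err; }
    v->run_size = (size_t)ret;

    /* The mapping belongs to the vcpu fd, not to /dev/kvm. */
    map = mmap(NULL, v->run_size, PROT_READ | PROT_WRITE, MAP_SHARED,
               v->vcpu_fd, 0);
    if (map == MAP_FAILED) { ret = -errno; goto err; }
    v->run = map;
    return 0;

err:
    demo_vcpu_close(v);
    return ret;
}
```

On a machine without KVM support or without permission to open /dev/kvm, `demo_vcpu_open` simply returns a negative errno, so it is safe to try.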

KVM Part

kvm_vm_ioctl()

    case KVM_CREATE_VCPU:
        r = kvm_vm_ioctl_create_vcpu(kvm, arg);
        break;

kvm_vm_ioctl_create_vcpu(kvm, id)

    static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
    {
        int r;
        struct kvm_vcpu *vcpu, *v;

        vcpu = kvm_arch_vcpu_create(kvm, id); // just run kvm_x86_ops->vcpu_create(kvm, id);
        if (IS_ERR(vcpu))
            return PTR_ERR(vcpu);

        preempt_notifier_init(&vcpu->preempt_notifier, &kvm_preempt_ops);

        r = kvm_arch_vcpu_setup(vcpu);
        if (r)
            goto vcpu_destroy;

        mutex_lock(&kvm->lock);
        if (!kvm_vcpu_compatible(vcpu)) {
            r = -EINVAL;
            goto unlock_vcpu_destroy;
        }
        if (atomic_read(&kvm->online_vcpus) == KVM_MAX_VCPUS) {
            r = -EINVAL;
            goto unlock_vcpu_destroy;
        }

        kvm_for_each_vcpu(r, v, kvm)
            if (v->vcpu_id == id) {
                r = -EEXIST;
                goto unlock_vcpu_destroy;
            }

        BUG_ON(kvm->vcpus[atomic_read(&kvm->online_vcpus)]);

        /* Now it's all set up, let userspace reach it */
        kvm_get_kvm(kvm);
        r = create_vcpu_fd(vcpu);
        if (r < 0) {
            kvm_put_kvm(kvm);
            goto unlock_vcpu_destroy;
        }

        kvm->vcpus[atomic_read(&kvm->online_vcpus)] = vcpu;
        smp_wmb();
        atomic_inc(&kvm->online_vcpus);

        mutex_unlock(&kvm->lock);
        kvm_arch_vcpu_postcreate(vcpu);
        return r;

    unlock_vcpu_destroy:
        mutex_unlock(&kvm->lock);
    vcpu_destroy:
        kvm_arch_vcpu_destroy(vcpu);
        return r;
    }
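
The end of this function stores the new vcpu in the array first, issues smp_wmb(), and only then increments online_vcpus, so that a reader who sees the new count also sees the pointer. The sketch below imitates that ordering in user space with C11 atomics (a release store paired with an acquire load), which is a substitute for the kernel primitives and not the kernel's code. All `demo_` names are hypothetical, and writers are assumed to be serialized by a caller-held lock, as kvm->lock does in the listing.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define DEMO_MAX_VCPUS 4

struct demo_vcpu_slot {
    int id;
};

struct demo_vm {
    struct demo_vcpu_slot *vcpus[DEMO_MAX_VCPUS];
    atomic_int online;   /* number of published entries in vcpus[] */
};

/*
 * Publish a vcpu. Writers must be serialized by the caller (assumption).
 * Returns 0, -EINVAL if the table is full, or -EEXIST for a duplicate id.
 */
int demo_publish_vcpu(struct demo_vm *vm, struct demo_vcpu_slot *v)
{
    int n = atomic_load_explicit(&vm->online, memory_order_relaxed);

    if (n == DEMO_MAX_VCPUS)
        return -EINVAL;
    for (int i = 0; i < n; i++) {
        if (vm->vcpus[i]->id == v->id)
            return -EEXIST;
    }

    vm->vcpus[n] = v;   /* 1. fill the slot */
    /* 2. make the slot visible before the new count (role of smp_wmb) */
    atomic_store_explicit(&vm->online, n + 1, memory_order_release);
    return 0;
}

/* Lock-free reader: only indexes below the published count are valid. */
struct demo_vcpu_slot *demo_get_vcpu(struct demo_vm *vm, int idx)
{
    int n = atomic_load_explicit(&vm->online, memory_order_acquire);

    return (idx >= 0 && idx < n) ? vm->vcpus[idx] : NULL;
}
```

If the count were incremented before the slot was filled, a concurrent reader could index an entry that is still NULL, which is the situation the write barrier in the kernel code guards against.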

kvm_x86_ops (VMX implementation: vmx_x86_ops)

    001static struct kvm_x86_ops vmx_x86_ops = {

    002 .cpu_has_kvm_support = cpu_has_kvm_support,

    003 .disabled_by_bios = vmx_disabled_by_bios,

    004 .hardware_setup = hardware_setup,

    005 .hardware_unsetup = hardware_unsetup,

    006 .check_processor_compatibility = vmx_check_processor_compat,

    007 .hardware_enable = hardware_enable,

    008 .hardware_disable = hardware_disable,

    009 .cpu_has_accelerated_tpr = report_flexpriority,

    010 

    011 .vcpu_create = vmx_create_vcpu, // The construction function for vcpu

    012 .vcpu_free = vmx_free_vcpu,

    013 .vcpu_reset = vmx_vcpu_reset,

    014 

    015 .prepare_guest_switch = vmx_save_host_state,             // Save host machine state

    016 .vcpu_load = vmx_vcpu_load,

    017 .vcpu_put = vmx_vcpu_put,

    018 

    019 .update_db_bp_intercept = update_exception_bitmap,

    020 .get_msr = vmx_get_msr,

    021 .set_msr = vmx_set_msr,

    022 .get_segment_base = vmx_get_segment_base,

    023 .get_segment = vmx_get_segment,

    024 .set_segment = vmx_set_segment,

    025 .get_cpl = vmx_get_cpl,

    026 .get_cs_db_l_bits = vmx_get_cs_db_l_bits,

    027 .decache_cr0_guest_bits = vmx_decache_cr0_guest_bits,

    028 .decache_cr3 = vmx_decache_cr3,

    029 .decache_cr4_guest_bits = vmx_decache_cr4_guest_bits,

    030 .set_cr0 = vmx_set_cr0,

    031 .set_cr3 = vmx_set_cr3,

    032 .set_cr4 = vmx_set_cr4,

    033 .set_efer = vmx_set_efer,

    034 .get_idt = vmx_get_idt,

    035 .set_idt = vmx_set_idt,

    036 .get_gdt = vmx_get_gdt,

    037 .set_gdt = vmx_set_gdt,

    038 .set_dr7 = vmx_set_dr7,

    039 .cache_reg = vmx_cache_reg,

    040 .get_rflags = vmx_get_rflags,

    041 .set_rflags = vmx_set_rflags,

    042 .fpu_activate = vmx_fpu_activate,

    043 .fpu_deactivate = vmx_fpu_deactivate,

    044 

    045 .tlb_flush = vmx_flush_tlb,

    046 

    047 .run = vmx_vcpu_run,                                   // The function to run a guest VM

    048 .handle_exit = vmx_handle_exit,                        // VMEXIT handler function

    049 .skip_emulated_instruction = skip_emulated_instruction,

    050 .set_interrupt_shadow = vmx_set_interrupt_shadow,

    051 .get_interrupt_shadow = vmx_get_interrupt_shadow,

    052 .patch_hypercall = vmx_patch_hypercall,

    053 .set_irq = vmx_inject_irq,

    054 .set_nmi = vmx_inject_nmi,

    055 .queue_exception = vmx_queue_exception,

    056 .cancel_injection = vmx_cancel_injection,

    057 .interrupt_allowed = vmx_interrupt_allowed,

    058 .nmi_allowed = vmx_nmi_allowed,

    059 .get_nmi_mask = vmx_get_nmi_mask,

    060 .set_nmi_mask = vmx_set_nmi_mask,

    061 .enable_nmi_window = enable_nmi_window,

    062 .enable_irq_window = enable_irq_window,

    063 .update_cr8_intercept = update_cr8_intercept,

    064 .set_virtual_x2apic_mode = vmx_set_virtual_x2apic_mode,

    065 .vm_has_apicv = vmx_vm_has_apicv,

    066 .load_eoi_exitmap = vmx_load_eoi_exitmap,

    067 .hwapic_irr_update = vmx_hwapic_irr_update,

    068 .hwapic_isr_update = vmx_hwapic_isr_update,

    069 .sync_pir_to_irr = vmx_sync_pir_to_irr,

    070 .deliver_posted_interrupt = vmx_deliver_posted_interrupt,

    071 

    072 .set_tss_addr = vmx_set_tss_addr,

    073 .get_tdp_level = get_ept_level,

    074 .get_mt_mask = vmx_get_mt_mask,

    075 

    076 .get_exit_info = vmx_get_exit_info,

    077 

    078 .get_lpage_level = vmx_get_lpage_level,

    079 

    080 .cpuid_update = vmx_cpuid_update,

    081 

    082 .rdtscp_supported = vmx_rdtscp_supported,

    083 .invpcid_supported = vmx_invpcid_supported,

    084 

    085 .set_supported_cpuid = vmx_set_supported_cpuid,

    086 

    087 .has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,

    088 

    089 .set_tsc_khz = vmx_set_tsc_khz,

    090 .read_tsc_offset = vmx_read_tsc_offset,

    091 .write_tsc_offset = vmx_write_tsc_offset,

    092 .adjust_tsc_offset = vmx_adjust_tsc_offset,

    093 .compute_tsc_offset = vmx_compute_tsc_offset,

    094 .read_l1_tsc = vmx_read_l1_tsc,

    095 

    096 .set_tdp_cr3 = vmx_set_cr3,

    097 

    098 .check_intercept = vmx_check_intercept,

    099 .handle_external_intr = vmx_handle_external_intr,

    100};

vmx_create_vcpu

    01static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)

    02{

    03 int err;

    04 struct vcpu_vmx *vmx = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);

    05 int cpu;

    06 

    07 if (!vmx)

    08  return ERR_PTR(-ENOMEM);

    09 

    10 allocate_vpid(vmx);

    11 

    12 err = kvm_vcpu_init(&vmx->vcpu, kvm, id);

    13 if (err)

    14  goto free_vcpu;

    15 

    16 vmx->guest_msrs = kmalloc(PAGE_SIZE, GFP_KERNEL);

    17 err = -ENOMEM;

    18 if (!vmx->guest_msrs) {

    19  goto uninit_vcpu;

    20 }

    21 

    22 vmx->loaded_vmcs = &vmx->vmcs01;

    23 vmx->loaded_vmcs->vmcs = alloc_vmcs();

    24 if (!vmx->loaded_vmcs->vmcs)

    25  goto free_msrs;

    26 if (!vmm_exclusive)

    27  kvm_cpu_vmxon(__pa(per_cpu(vmxarea, raw_smp_processor_id())));

    28 loaded_vmcs_init(vmx->loaded_vmcs);

    29 if (!vmm_exclusive)

    30  kvm_cpu_vmxoff();

    31 

    32 cpu = get_cpu();

    33 vmx_vcpu_load(&vmx->vcpu, cpu);

    34 vmx->vcpu.cpu = cpu;

    35 err = vmx_vcpu_setup(vmx);

    36 vmx_vcpu_put(&vmx->vcpu);

    37 put_cpu();

    38 if (err)

    39  goto free_vmcs;

    40 if (vm_need_virtualize_apic_accesses(kvm)) {

    41  err = alloc_apic_access_page(kvm);

    42  if (err)

    43   goto free_vmcs;

    44 }

    45 

    46 if (enable_ept) {

    47  if (!kvm->arch.ept_identity_map_addr)

    48   kvm->arch.ept_identity_map_addr =

    49    VMX_EPT_IDENTITY_PAGETABLE_ADDR;

    50  err = -ENOMEM;

    51  if (alloc_identity_pagetable(kvm) != 0)

    52   goto free_vmcs;

    53  if (!init_rmode_identity_map(kvm))

    54   goto free_vmcs;

    55 }

    56 

    57 vmx->nested.current_vmptr = -1ull;

    58 vmx->nested.current_vmcs12 = NULL;

    59 

    60 return &vmx->vcpu;

    61 

    62free_vmcs:

    63 free_loaded_vmcs(vmx->loaded_vmcs);

    64free_msrs:

    65 kfree(vmx->guest_msrs);

    66uninit_vcpu:

    67 kvm_vcpu_uninit(&vmx->vcpu);

    68free_vcpu:

    69 free_vpid(vmx);

    70 kmem_cache_free(kvm_vcpu_cache, vmx);

    71 return ERR_PTR(err);

    72}

QEMU Part

Like kvm_init_vcpu(), which issues KVM_CREATE_VCPU, kvm_cpu_exec() is also called by the function qemu_kvm_cpu_thread_fn(...). For how execution reaches qemu_kvm_cpu_thread_fn(...), see the KVM_CREATE_VCPU part.

kvm_cpu_exec()

   01int kvm_cpu_exec(CPUState *cpu)

   02{

   03    struct kvm_run *run = cpu->kvm_run; // kvm_run: stores the VM exit reason and the related data (I/O, MMIO, etc.). This structure is shared memory used to exchange information between QEMU and KVM.

   04                                        // The definition of kvm_run is given later in this post.

   05    int ret, run_ret;

   06 

   07    DPRINTF("kvm_cpu_exec()\n");

   08 

   09    if (kvm_arch_process_async_events(cpu)) {

   10        cpu->exit_request = 0;

   11        return EXCP_HLT;

   12    }

   13 

   14    do {

   15        if (cpu->kvm_vcpu_dirty) {

   16            kvm_arch_put_registers(cpu, KVM_PUT_RUNTIME_STATE);

   17            cpu->kvm_vcpu_dirty = false;

   18        }

   19 

   20        kvm_arch_pre_run(cpu, run);  // Check if an interrupt should be delivered to the guest.

   21        if (cpu->exit_request) {

   22            DPRINTF("interrupt exit requested\n");

   23            /*

   24             * KVM requires us to reenter the kernel after IO exits to complete

   25             * instruction emulation. This self-signal will ensure that we

   26             * leave ASAP again.

   27             */

   28            qemu_cpu_kick_self();

   29        }

   30        qemu_mutex_unlock_iothread();

   31 

   32        run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0); // Send "KVM_RUN" command to kvm/vcpu

   33 

   34        qemu_mutex_lock_iothread();

   35        kvm_arch_post_run(cpu, run);

   36 

   37        if (run_ret < 0) {

   38            if (run_ret == -EINTR || run_ret == -EAGAIN) {

   39                DPRINTF("io window exit\n");

   40                ret = EXCP_INTERRUPT;

   41                break;

   42            }

   43            fprintf(stderr, "error: kvm run failed %s\n",

   44                    strerror(-run_ret));

   45            abort();

   46        }

   47 

   48        trace_kvm_run_exit(cpu->cpu_index, run->exit_reason);

   49        switch (run->exit_reason) {

   50        case KVM_EXIT_IO:

   51            DPRINTF("handle_io\n");

   52            kvm_handle_io(run->io.port,

   53                          (uint8_t *)run + run->io.data_offset,

   54                          run->io.direction,

   55                          run->io.size,

   56                          run->io.count);

   57            ret = 0;

   58            break;

   59        case KVM_EXIT_MMIO:

   60            DPRINTF("handle_mmio\n");

   61            cpu_physical_memory_rw(run->mmio.phys_addr,

   62                                   run->mmio.data,

   63                                   run->mmio.len,

   64                                   run->mmio.is_write);

   65            ret = 0;

   66            break;

   67        case KVM_EXIT_IRQ_WINDOW_OPEN:

   68            DPRINTF("irq_window_open\n");

   69            ret = EXCP_INTERRUPT;

   70            break;

   71        case KVM_EXIT_SHUTDOWN:

   72            DPRINTF("shutdown\n");

   73            qemu_system_reset_request();

   74            ret = EXCP_INTERRUPT;

   75            break;

   76        case KVM_EXIT_UNKNOWN:

   77            fprintf(stderr, "KVM: unknown exit, hardware reason %" PRIx64 "\n",

   78                    (uint64_t)run->hw.hardware_exit_reason);

   79            ret = -1;

   80            break;

   81        case KVM_EXIT_INTERNAL_ERROR:

   82            ret = kvm_handle_internal_error(cpu, run);

   83            break;

   84        default:

   85            DPRINTF("kvm_arch_handle_exit\n");

   86            ret = kvm_arch_handle_exit(cpu, run);

   87            break;

   88        }

   89    } while (ret == 0);

   90 

   91    if (ret < 0) {

   92        cpu_dump_state(cpu, stderr, fprintf, CPU_DUMP_CODE);

   93        vm_stop(RUN_STATE_INTERNAL_ERROR);

   94    }

   95 

   96    cpu->exit_request = 0;

   97    return ret;

   98}
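
In the KVM_EXIT_IO case the data does not live inside the io struct itself; it sits elsewhere in the mmap'ed area, at `run + run->io.data_offset`, and spans `size * count` bytes. The following sketch shows that address arithmetic for a port write (an OUT instruction). `demo_copy_pio_out` is a hypothetical helper, not part of QEMU; the field and constant names come from linux/kvm.h.

```c
#include <linux/kvm.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy the bytes the guest wrote with OUT/OUTS into 'out'.
 * Returns the number of bytes copied, or -1 if the exit is not a
 * port write or 'out' is too small.
 */
long demo_copy_pio_out(const struct kvm_run *run, uint8_t *out, size_t out_len)
{
    const uint8_t *data;
    size_t total;

    if (run->exit_reason != KVM_EXIT_IO ||
        run->io.direction != KVM_EXIT_IO_OUT) {
        return -1;
    }

    /* 'size' is the width of one access in bytes, 'count' the repeat count. */
    total = (size_t)run->io.size * run->io.count;
    if (total > out_len) {
        return -1;
    }

    /* The payload is located relative to the start of struct kvm_run. */
    data = (const uint8_t *)run + run->io.data_offset;
    memcpy(out, data, total);
    return (long)total;
}
```

A toy monitor could call this after each KVM_RUN and print the bytes written to a chosen port, which is a common way to get "serial" output from a tiny guest.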

kvm_arch_pre_run(cpu, run)

   01void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)

   02{

   03    X86CPU *x86_cpu = X86_CPU(cpu);

   04    CPUX86State *env = &x86_cpu->env;

   05    int ret;

   06 

   07    /* Inject NMI */

   08    if (cpu->interrupt_request & CPU_INTERRUPT_NMI) {

   09        cpu->interrupt_request &= ~CPU_INTERRUPT_NMI;

   10        DPRINTF("injected NMI\n");

   11        ret = kvm_vcpu_ioctl(cpu, KVM_NMI); // Send KVM a command "KVM_NMI" to inject NMI

   12        if (ret < 0) {

   13            fprintf(stderr, "KVM: injection failed, NMI lost (%s)\n",

   14                    strerror(-ret));

   15        }

   16    }

   17 

   18    if (!kvm_irqchip_in_kernel()) {

   19        /* Force the VCPU out of its inner loop to process any INIT requests

   20         * or pending TPR access reports. */

   21        if (cpu->interrupt_request &

   22            (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {

   23            cpu->exit_request = 1;

   24        }

   25 

   26        /* Try to inject an interrupt if the guest can accept it */

   27        if (run->ready_for_interrupt_injection &&

   28            (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&

   29            (env->eflags & IF_MASK)) {

   30            int irq;

   31 

   32            cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;

   33            irq = cpu_get_pic_interrupt(env);

   34            if (irq >= 0) {

   35                struct kvm_interrupt intr;

   36 

   37                intr.irq = irq;

   38                DPRINTF("injected interrupt %d\n", irq);

   39                ret = kvm_vcpu_ioctl(cpu, KVM_INTERRUPT, &intr); // Send the "KVM_INTERRUPT" command to KVM to inject this interrupt into the guest.

   40                if (ret < 0) {

   41                    fprintf(stderr,

   42                            "KVM: injection failed, interrupt lost (%s)\n",

   43                            strerror(-ret));

   44                }

   45            }

   46        }

   47 

   48        /* If we have an interrupt but the guest is not ready to receive an

   49         * interrupt, request an interrupt window exit.  This will

   50         * cause a return to userspace as soon as the guest is ready to

   51         * receive interrupts. */

   52        if ((cpu->interrupt_request & CPU_INTERRUPT_HARD)) {

   53            run->request_interrupt_window = 1;

   54        } else {

   55            run->request_interrupt_window = 0;

   56        }

   57 

   58        DPRINTF("setting tpr\n");

   59        run->cr8 = cpu_get_apic_tpr(env->apic_state);

   60    }

   61}
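
The user-space interrupt path in kvm_arch_pre_run() (the branch taken when the irqchip is not in the kernel) boils down to a two-step decision: inject now if the guest can take the interrupt, otherwise ask KVM for an interrupt-window exit. The sketch below restates only that decision with plain booleans. The struct and function names are hypothetical stand-ins for the QEMU fields named in the comments.

```c
#include <stdbool.h>

/* Simplified inputs of the decision (stand-ins for the QEMU state). */
struct demo_irq_state {
    bool ready_for_injection; /* run->ready_for_interrupt_injection          */
    bool pending_hard;        /* cpu->interrupt_request & CPU_INTERRUPT_HARD */
    bool guest_if;            /* env->eflags & IF_MASK                       */
};

struct demo_irq_decision {
    bool inject_now;      /* would issue KVM_INTERRUPT before KVM_RUN */
    bool request_window;  /* value for run->request_interrupt_window  */
};

struct demo_irq_decision demo_pre_run(struct demo_irq_state *s)
{
    struct demo_irq_decision d = { false, false };

    /* Step 1: inject if KVM, the pending flag and the guest's IF all agree. */
    if (s->ready_for_injection && s->pending_hard && s->guest_if) {
        s->pending_hard = false;
        d.inject_now = true;
    }

    /* Step 2: anything still pending needs an interrupt-window exit. */
    d.request_window = s->pending_hard;
    return d;
}
```

When the guest has interrupts disabled, nothing is injected and `request_window` becomes true; KVM then returns to user space with KVM_EXIT_IRQ_WINDOW_OPEN as soon as the guest can accept the interrupt, which is the case handled in kvm_cpu_exec().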

struct kvm_run

   001struct kvm_run {

   002 /* in */

   003 __u8 request_interrupt_window;

   004 __u8 padding1[7];

   005 

   006 /* out */

   007 __u32 exit_reason;

   008 __u8 ready_for_interrupt_injection;

   009 __u8 if_flag;

   010 __u8 padding2[2];

   011 

   012 /* in (pre_kvm_run), out (post_kvm_run) */

   013 __u64 cr8;

   014 __u64 apic_base;

   015 

   016#ifdef __KVM_S390

   017 /* the processor status word for s390 */

   018 __u64 psw_mask; /* psw upper half */

   019 __u64 psw_addr; /* psw lower half */

   020#endif

   021 union {

   022  /* KVM_EXIT_UNKNOWN */

   023  struct {

   024   __u64 hardware_exit_reason;

   025  } hw;

   026  /* KVM_EXIT_FAIL_ENTRY */

   027  struct {

   028   __u64 hardware_entry_failure_reason;

   029  } fail_entry;

   030  /* KVM_EXIT_EXCEPTION */

   031  struct {

   032   __u32 exception;

   033   __u32 error_code;

   034  } ex;

   035  /* KVM_EXIT_IO */

   036  struct {

   037#define KVM_EXIT_IO_IN  0

   038#define KVM_EXIT_IO_OUT 1

   039   __u8 direction;

   040   __u8 size; /* bytes */

   041   __u16 port;

   042   __u32 count;

   043   __u64 data_offset; /* relative to kvm_run start */

   044  } io;

   045  struct {

   046   struct kvm_debug_exit_arch arch;

   047  } debug;

   048  /* KVM_EXIT_MMIO */

   049  struct {

   050   __u64 phys_addr;

   051   __u8  data[8];

   052   __u32 len;

   053   __u8  is_write;

   054  } mmio;

   055  /* KVM_EXIT_HYPERCALL */

   056  struct {

   057   __u64 nr;

   058   __u64 args[6];

   059   __u64 ret;

   060   __u32 longmode;

   061   __u32 pad;

   062  } hypercall;

   063  /* KVM_EXIT_TPR_ACCESS */

   064  struct {

   065   __u64 rip;

   066   __u32 is_write;

   067   __u32 pad;

   068  } tpr_access;

   069  /* KVM_EXIT_S390_SIEIC */

   070  struct {

   071   __u8 icptcode;

   072   __u16 ipa;

   073   __u32 ipb;

   074  } s390_sieic;

   075  /* KVM_EXIT_S390_RESET */

   076#define KVM_S390_RESET_POR       1

   077#define KVM_S390_RESET_CLEAR     2

   078#define KVM_S390_RESET_SUBSYSTEM 4

   079#define KVM_S390_RESET_CPU_INIT  8

   080#define KVM_S390_RESET_IPL       16

   081  __u64 s390_reset_flags;

   082  /* KVM_EXIT_S390_UCONTROL */

   083  struct {

   084   __u64 trans_exc_code;

   085   __u32 pgm_code;

   086  } s390_ucontrol;

   087  /* KVM_EXIT_DCR */

   088  struct {

   089   __u32 dcrn;

   090   __u32 data;

   091   __u8  is_write;

   092  } dcr;

   093  struct {

   094   __u32 suberror;

   095   /* Available with KVM_CAP_INTERNAL_ERROR_DATA: */

   096   __u32 ndata;

   097   __u64 data[16];

   098  } internal;

   099  /* KVM_EXIT_OSI */

   100  struct {

   101   __u64 gprs[32];

   102  } osi;

   103  struct {

   104   __u64 nr;

   105   __u64 ret;

   106   __u64 args[9];

   107  } papr_hcall;

   108  /* KVM_EXIT_S390_TSCH */

   109  struct {

   110   __u16 subchannel_id;

   111   __u16 subchannel_nr;

   112   __u32 io_int_parm;

   113   __u32 io_int_word;

   114   __u32 ipb;

   115   __u8 dequeued;

   116  } s390_tsch;

   117  /* KVM_EXIT_EPR */

   118  struct {

   119   __u32 epr;

   120  } epr;

   121  /* Fix the size of the union. */

   122  char padding[256];

   123 };

   124 

   125 /*

   126  * shared registers between kvm and userspace.

   127  * kvm_valid_regs specifies the register classes set by the host

   128  * kvm_dirty_regs specified the register classes dirtied by userspace

   129  * struct kvm_sync_regs is architecture specific, as well as the

   130  * bits for kvm_valid_regs and kvm_dirty_regs

   131  */

   132 __u64 kvm_valid_regs;

   133 __u64 kvm_dirty_regs;

   134 union {

   135  struct kvm_sync_regs regs;

   136  char padding[1024];

   137 } s;

   138};

KVM Part

kvm_vcpu_ioctl(...)

   001static long kvm_vcpu_ioctl(struct file *filp,

   002      unsigned int ioctl, unsigned long arg)

   003{

   004 struct kvm_vcpu *vcpu = filp->private_data;

   005 void __user *argp = (void __user *)arg;

   006 int r;

   007 struct kvm_fpu *fpu = NULL;

   008 struct kvm_sregs *kvm_sregs = NULL;

   009 

   010 if (vcpu->kvm->mm != current->mm)

   011  return -EIO;

   012 

   013#if defined(CONFIG_S390) || defined(CONFIG_PPC) || defined(CONFIG_MIPS)

   014 /*

   015  * Special cases: vcpu ioctls that are asynchronous to vcpu execution,

   016  * so vcpu_load() would break it.

   017  */

   018 if (ioctl == KVM_S390_INTERRUPT || ioctl == KVM_INTERRUPT)

   019  return kvm_arch_vcpu_ioctl(filp, ioctl, arg);

   020#endif

   021 

   022 

   023 r = vcpu_load(vcpu);

   024 if (r)

   025  return r;

   026 switch (ioctl) {

   027 case KVM_RUN:                             // KVM_RUN

   028  r = kvm_arch_vcpu_ioctl_run(vcpu, vcpu->run);     // On x86 this eventually calls vcpu_enter_guest(vcpu) through __vcpu_run().

   029  break;

   030 case KVM_GET_REGS: {

   031  struct kvm_regs *kvm_regs;

   032 

   033  r = -ENOMEM;

   034  kvm_regs = kzalloc(sizeof(struct kvm_regs), GFP_KERNEL);

   035  r = kvm_arch_vcpu_ioctl_get_regs(vcpu, kvm_regs);

   036  if (copy_to_user(argp, kvm_regs, sizeof(struct kvm_regs)))

   037   goto out_free1;

   038  r = 0;

   039out_free1:

   040  kfree(kvm_regs);

   041  break;

   042 }

   043 case KVM_SET_REGS: {                     // KVM_SET_REGS

   044  struct kvm_regs *kvm_regs;

   045 

   046  kvm_regs = memdup_user(argp, sizeof(*kvm_regs));

   047  r = kvm_arch_vcpu_ioctl_set_regs(vcpu, kvm_regs);

   048  kfree(kvm_regs);

   049  break;

   050 }

   051 case KVM_GET_SREGS: {                    // KVM_GET_SREGS

   052  kvm_sregs = kzalloc(sizeof(struct kvm_sregs), GFP_KERNEL);

   053  r = kvm_arch_vcpu_ioctl_get_sregs(vcpu, kvm_sregs);

   054  if (copy_to_user(argp, kvm_sregs, sizeof(struct kvm_sregs)))

   055   goto out;

   056  r = 0;

   057  break;

   058 }

   059 case KVM_SET_SREGS: {                   // KVM_SET_SREGS

   060  kvm_sregs = memdup_user(argp, sizeof(*kvm_sregs));

   061  r = kvm_arch_vcpu_ioctl_set_sregs(vcpu, kvm_sregs);

   062  break;

   063 }

   064 case KVM_GET_MP_STATE: {                // KVM_GET_MP_STATE

   065  struct kvm_mp_state mp_state;

   066 

   067  r = kvm_arch_vcpu_ioctl_get_mpstate(vcpu, &mp_state);

   068  if (copy_to_user(argp, &mp_state, sizeof mp_state))

   069   goto out;

   070  r = 0;

   071  break;

   072 }

   073 case KVM_SET_MP_STATE: {               // KVM_SET_MP_STATE

   074  struct kvm_mp_state mp_state;

   075 

   076  if (copy_from_user(&mp_state, argp, sizeof mp_state))

   077   goto out;

   078  r = kvm_arch_vcpu_ioctl_set_mpstate(vcpu, &mp_state);

   079  break;

   080 }

   081 case KVM_TRANSLATE: {                 // KVM_TRANSLATE

   082  struct kvm_translation tr;

   083 

   084  if (copy_from_user(&tr, argp, sizeof tr))

   085   goto out;

   086  r = kvm_arch_vcpu_ioctl_translate(vcpu, &tr);

   087   

   088  if (copy_to_user(argp, &tr, sizeof tr))

   089   goto out;

   090  r = 0;

   091  break;

   092 }

   093 case KVM_SET_GUEST_DEBUG: {           // KVM_SET_GUEST_DEBUG

   094  struct kvm_guest_debug dbg;

   095 

   096  if (copy_from_user(&dbg, argp, sizeof dbg))

   097   goto out;

   098  r = kvm_arch_vcpu_ioctl_set_guest_debug(vcpu, &dbg);

   099  break;

   100 }

   101 case KVM_SET_SIGNAL_MASK: {           // KVM_SET_SIGNAL_MASK

   102  struct kvm_signal_mask __user *sigmask_arg = argp;

   103  struct kvm_signal_mask kvm_sigmask;

   104  sigset_t sigset, *p;

   105 

   106  p = NULL;

   107  if (argp) {

   108   r = -EFAULT;

   109   if (copy_from_user(&kvm_sigmask, argp,

   110        sizeof kvm_sigmask))

   111    goto out;

   112   r = -EINVAL;

   113   if (kvm_sigmask.len != sizeof sigset)

   114    goto out;

   115   r = -EFAULT;

   116   if (copy_from_user(&sigset, sigmask_arg->sigset,

   117        sizeof sigset))

   118    goto out;

   119   p = &sigset;

   120  }

   121  r = kvm_vcpu_ioctl_set_sigmask(vcpu, p);

   122  break;

   123 }

   124 case KVM_GET_FPU: {                     // KVM_GET_FPU

   125  fpu = kzalloc(sizeof(struct kvm_fpu), GFP_KERNEL);

   126  r = -ENOMEM;

   127  r = kvm_arch_vcpu_ioctl_get_fpu(vcpu, fpu);

   128  if (copy_to_user(argp, fpu, sizeof(struct kvm_fpu)))

   129   goto out;

   130  r = 0;

   131  break;

   132 }

   133 case KVM_SET_FPU: {                     // KVM_SET_FPU

   134  fpu = memdup_user(argp, sizeof(*fpu));

   135   

   136  r = kvm_arch_vcpu_ioctl_set_fpu(vcpu, fpu);

   137  break;

   138 }

   139 default:

   140  r = kvm_arch_vcpu_ioctl(filp, ioctl, arg);       // Default Handler

   141                                                   // kvm_arch_vcpu_ioctl(...) can be found in the following

   142 }

   143out:

   144 vcpu_put(vcpu);

   145 kfree(fpu);

   146 kfree(kvm_sregs);

   147 return r;

   148}

kvm_arch_vcpu_ioctl(filp, ioctl, arg)

   001long kvm_arch_vcpu_ioctl(struct file *filp,

   002    unsigned int ioctl, unsigned long arg)

   003{

   004 struct kvm_vcpu *vcpu = filp->private_data;

   005 void __user *argp = (void __user *)arg;

   006 int r;

   007 union {

   008  struct kvm_lapic_state *lapic;

   009  struct kvm_xsave *xsave;

   010  struct kvm_xcrs *xcrs;

   011  void *buffer;

   012 } u;

   013 

   014 u.buffer = NULL;

   015 switch (ioctl) {

   016 case KVM_GET_LAPIC: {                     // KVM_GET_LAPIC

   017  r = -EINVAL;

   018 

   019  u.lapic = kzalloc(sizeof(struct kvm_lapic_state), GFP_KERNEL);

   020 

   021  r = -ENOMEM;

   022   

   023  r = kvm_vcpu_ioctl_get_lapic(vcpu, u.lapic);

   024   

   025  r = -EFAULT;

   026  if (copy_to_user(argp, u.lapic, sizeof(struct kvm_lapic_state)))

   027   goto out;

   028  r = 0;

   029  break;

   030 }

   031 case KVM_SET_LAPIC: {                     // KVM_SET_LAPIC

   032  r = -EINVAL;

   033   

   034  u.lapic = memdup_user(argp, sizeof(*u.lapic));

   035   

   036 

   037  r = kvm_vcpu_ioctl_set_lapic(vcpu, u.lapic);

   038  break;

   039 }

   040 case KVM_INTERRUPT: {                     // KVM_INTERRUPT

   041  struct kvm_interrupt irq;

   042 

   043  r = -EFAULT;

   044  if (copy_from_user(&irq, argp, sizeof irq))

   045   goto out;

   046  r = kvm_vcpu_ioctl_interrupt(vcpu, &irq);

   047  break;

   048 }

   049 case KVM_NMI: {                           // KVM_NMI

   050  r = kvm_vcpu_ioctl_nmi(vcpu);

   051  break;

   052 }

   053 case KVM_SET_CPUID: {                     // KVM_SET_CPUID

   054  struct kvm_cpuid __user *cpuid_arg = argp;

   055  struct kvm_cpuid cpuid;

   056 

   057  r = -EFAULT;

   058  if (copy_from_user(&cpuid, cpuid_arg, sizeof cpuid))

   059   goto out;

   060  r = kvm_vcpu_ioctl_set_cpuid(vcpu, &cpuid, cpuid_arg->entries);

   061  break;

   062 }

   063 case KVM_SET_CPUID2: {                    // KVM_SET_CPUID2

   064  struct kvm_cpuid2 __user *cpuid_arg = argp;

   065  struct kvm_cpuid2 cpuid;

   066 

   067  r = -EFAULT;

   068  if (copy_from_user(&cpuid, cpuid_arg, sizeof cpuid))

   069   goto out;

   070  r = kvm_vcpu_ioctl_set_cpuid2(vcpu, &cpuid,

   071           cpuid_arg->entries);

   072  break;

   073 }

   074 case KVM_GET_CPUID2: {                                         // KVM_GET_CPUID2

   075  struct kvm_cpuid2 __user *cpuid_arg = argp;

   076  struct kvm_cpuid2 cpuid;

   077 

   078  r = -EFAULT;

   079  if (copy_from_user(&cpuid, cpuid_arg, sizeof cpuid))

   080   goto out;

   081  r = kvm_vcpu_ioctl_get_cpuid2(vcpu, &cpuid,

   082           cpuid_arg->entries);

   083  if (r)

   084   goto out;

   085  r = -EFAULT;

   086  if (copy_to_user(cpuid_arg, &cpuid, sizeof cpuid))

   087   goto out;

   088  r = 0;

   089  break;

   090 }

   091 case KVM_GET_MSRS:                                             // KVM_GET_MSRS

   092  r = msr_io(vcpu, argp, kvm_get_msr, 1);

   093  break;

   094 case KVM_SET_MSRS:                                             // KVM_SET_MSRS

   095  r = msr_io(vcpu, argp, do_set_msr, 0);

   096  break;

   097 case KVM_TPR_ACCESS_REPORTING: {                               // KVM_TPR_ACCESS_REPORTING

   098  struct kvm_tpr_access_ctl tac;

   099 

   100  r = -EFAULT;

   101  if (copy_from_user(&tac, argp, sizeof tac))

   102   goto out;

   103  r = vcpu_ioctl_tpr_access_reporting(vcpu, &tac);

   104  if (r)

   105   goto out;

   106  r = -EFAULT;

   107  if (copy_to_user(argp, &tac, sizeof tac))

   108   goto out;

   109  r = 0;

   110  break;

   111 };

   112 case KVM_SET_VAPIC_ADDR: {                                     // KVM_SET_VAPIC_ADDR

   113  struct kvm_vapic_addr va;

   114 

   115  r = -EINVAL;

   116  if (!irqchip_in_kernel(vcpu->kvm))

   117   goto out;

   118  r = -EFAULT;

   119  if (copy_from_user(&va, argp, sizeof va))

   120   goto out;

   121  r = 0;

   122  kvm_lapic_set_vapic_addr(vcpu, va.vapic_addr);

   123  break;

   124 }

   125 case KVM_X86_SETUP_MCE: {                                       // KVM_X86_SETUP_MCE

   126  u64 mcg_cap;

   127 

   128  r = -EFAULT;

   129  if (copy_from_user(&mcg_cap, argp, sizeof mcg_cap))

   130   goto out;

   131  r = kvm_vcpu_ioctl_x86_setup_mce(vcpu, mcg_cap);

   132  break;

   133 }

   134 case KVM_X86_SET_MCE: {                                         // KVM_X86_SET_MCE

   135  struct kvm_x86_mce mce;

   136 

   137  r = -EFAULT;

   138  if (copy_from_user(&mce, argp, sizeof mce))

   139   goto out;

   140  r = kvm_vcpu_ioctl_x86_set_mce(vcpu, &mce);

   141  break;

   142 }

   143 case KVM_GET_VCPU_EVENTS: {                                     // KVM_GET_VCPU_EVENTS

   144  struct kvm_vcpu_events events;

   145 

   146  kvm_vcpu_ioctl_x86_get_vcpu_events(vcpu, &events);

   147 

   148  r = -EFAULT;

   149  if (copy_to_user(argp, &events, sizeof(struct kvm_vcpu_events)))

   150   break;

   151  r = 0;

   152  break;

   153 }

   154 case KVM_SET_VCPU_EVENTS: {                                     // KVM_SET_VCPU_EVENTS

   155  struct kvm_vcpu_events events;

   156 

   157  r = -EFAULT;

   158  if (copy_from_user(&events, argp, sizeof(struct kvm_vcpu_events)))

   159   break;

   160 

   161  r = kvm_vcpu_ioctl_x86_set_vcpu_events(vcpu, &events);

   162  break;

   163 }

   164 case KVM_GET_DEBUGREGS: {                                       // KVM_GET_DEBUGREGS

   165  struct kvm_debugregs dbgregs;

   166 

   167  kvm_vcpu_ioctl_x86_get_debugregs(vcpu, &dbgregs);

   168 

   169  r = -EFAULT;

   170  if (copy_to_user(argp, &dbgregs,

   171     sizeof(struct kvm_debugregs)))

   172   break;

   173  r = 0;

   174  break;

   175 }

   176 case KVM_SET_DEBUGREGS: {                                      // KVM_SET_DEBUGREGS

   177  struct kvm_debugregs dbgregs;

   178 

   179  r = -EFAULT;

   180  if (copy_from_user(&dbgregs, argp,

   181       sizeof(struct kvm_debugregs)))

   182   break;

   183 

   184  r = kvm_vcpu_ioctl_x86_set_debugregs(vcpu, &dbgregs);

   185  break;

   186 }

   187 case KVM_GET_XSAVE: {                                         // KVM_GET_XSAVE

   188  u.xsave = kzalloc(sizeof(struct kvm_xsave), GFP_KERNEL);

   189  r = -ENOMEM;

   190  if (!u.xsave)

   191   break;

   192 

   193  kvm_vcpu_ioctl_x86_get_xsave(vcpu, u.xsave);

   194 

   195  r = -EFAULT;

   196  if (copy_to_user(argp, u.xsave, sizeof(struct kvm_xsave)))

   197   break;

   198  r = 0;

   199  break;

   200 }

   201 case KVM_SET_XSAVE: {                                        // KVM_SET_XSAVE

   202  u.xsave = memdup_user(argp, sizeof(*u.xsave));

   203  if (IS_ERR(u.xsave))

   204   return PTR_ERR(u.xsave);

   205 

   206  r = kvm_vcpu_ioctl_x86_set_xsave(vcpu, u.xsave);

   207  break;

   208 }

   209 case KVM_GET_XCRS: {                                         // KVM_GET_XCRS

   210  u.xcrs = kzalloc(sizeof(struct kvm_xcrs), GFP_KERNEL);

   211  r = -ENOMEM;

   212  if (!u.xcrs)

   213   break;

   214 

   215  kvm_vcpu_ioctl_x86_get_xcrs(vcpu, u.xcrs);

   216 

   217  r = -EFAULT;

   218  if (copy_to_user(argp, u.xcrs,

   219     sizeof(struct kvm_xcrs)))

   220   break;

   221  r = 0;

   222  break;

   223 }

   224 case KVM_SET_XCRS: {                                        // KVM_SET_XCRS

   225  u.xcrs = memdup_user(argp, sizeof(*u.xcrs));

   226  if (IS_ERR(u.xcrs))

   227   return PTR_ERR(u.xcrs);

   228 

   229  r = kvm_vcpu_ioctl_x86_set_xcrs(vcpu, u.xcrs);

   230  break;

   231 }

   232 case KVM_SET_TSC_KHZ: {                                      // KVM_SET_TSC_KHZ

   233  u32 user_tsc_khz;

   234 

   235  r = -EINVAL;

   236  user_tsc_khz = (u32)arg;

   237 

   238  if (user_tsc_khz >= kvm_max_guest_tsc_khz)

   239   goto out;

   240 

   241  if (user_tsc_khz == 0)

   242   user_tsc_khz = tsc_khz;

   243 

   244  kvm_set_tsc_khz(vcpu, user_tsc_khz);

   245 

   246  r = 0;

   247  goto out;

   248 }

   249 case KVM_GET_TSC_KHZ: {                                    // KVM_GET_TSC_KHZ

   250  r = vcpu->arch.virtual_tsc_khz;

   251  goto out;

   252 }

   253 case KVM_KVMCLOCK_CTRL: {                                  // KVM_KVMCLOCK_CTRL

   254  r = kvm_set_guest_paused(vcpu);

   255  goto out;

   256 }

   257 default:                                                   // Default: Error

   258  r = -EINVAL;

   259 }

   260out:

   261 kfree(u.buffer);

   262 return r;

   263}

vcpu_enter_guest(struct kvm_vcpu *vcpu)

   001static int vcpu_enter_guest(struct kvm_vcpu *vcpu)

   002{

   003 int r;

   004 bool req_int_win = !irqchip_in_kernel(vcpu->kvm) &&

   005  vcpu->run->request_interrupt_window;

   006 bool req_immediate_exit = false;

   007 

   008 if (vcpu->requests) {                                      // Check whether any requests are still unhandled.

   009  if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))

   010   kvm_mmu_unload(vcpu);

   011  if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))

   012   __kvm_migrate_timers(vcpu);

   013  if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))

   014   kvm_gen_update_masterclock(vcpu->kvm);

   015  if (kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu))

   016   kvm_gen_kvmclock_update(vcpu);

   017  if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {

   018   r = kvm_guest_time_update(vcpu);

   019   if (unlikely(r))

   020    goto out;

   021  }

   022  if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))

   023   kvm_mmu_sync_roots(vcpu);

   024  if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))

   025   kvm_x86_ops->tlb_flush(vcpu);

   026  if (kvm_check_request(KVM_REQ_REPORT_TPR_ACCESS, vcpu)) {

   027   vcpu->run->exit_reason = KVM_EXIT_TPR_ACCESS;

   028   r = 0;

   029   goto out;

   030  }

   031  if (kvm_check_request(KVM_REQ_TRIPLE_FAULT, vcpu)) {

   032   vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;

   033   r = 0;

   034   goto out;

   035  }

   036  if (kvm_check_request(KVM_REQ_DEACTIVATE_FPU, vcpu)) {

   037   vcpu->fpu_active = 0;

   038   kvm_x86_ops->fpu_deactivate(vcpu);

   039  }

   040  if (kvm_check_request(KVM_REQ_APF_HALT, vcpu)) {

   041   /* Page is swapped out. Do synthetic halt */

   042   vcpu->arch.apf.halted = true;

   043   r = 1;

   044   goto out;

   045  }

   046  if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))

   047   record_steal_time(vcpu);

   048  if (kvm_check_request(KVM_REQ_NMI, vcpu))

   049   process_nmi(vcpu);

   050  if (kvm_check_request(KVM_REQ_PMU, vcpu))

   051   kvm_handle_pmu_event(vcpu);

   052  if (kvm_check_request(KVM_REQ_PMI, vcpu))

   053   kvm_deliver_pmi(vcpu);

   054  if (kvm_check_request(KVM_REQ_SCAN_IOAPIC, vcpu))

   055   vcpu_scan_ioapic(vcpu);

   056 }

   057 

   058 if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {             // Check whether an event needs to be injected

   059  kvm_apic_accept_events(vcpu);

   060  if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {

   061   r = 1;

   062   goto out;

   063  }

   064 

   065  inject_pending_event(vcpu);

   066 

   067  /* enable NMI/IRQ window open exits if needed */

   068  if (vcpu->arch.nmi_pending)

   069   req_immediate_exit =

   070    kvm_x86_ops->enable_nmi_window(vcpu) != 0;

   071  else if (kvm_cpu_has_injectable_intr(vcpu) || req_int_win)

   072   req_immediate_exit =

   073    kvm_x86_ops->enable_irq_window(vcpu) != 0;

   074 

   075  if (kvm_lapic_enabled(vcpu)) {

   076   /*

   077    * Update architecture specific hints for APIC

   078    * virtual interrupt delivery.

   079    */

   080   if (kvm_x86_ops->hwapic_irr_update)

   081    kvm_x86_ops->hwapic_irr_update(vcpu,

   082     kvm_lapic_find_highest_irr(vcpu));

   083   update_cr8_intercept(vcpu);

   084   kvm_lapic_sync_to_vapic(vcpu);

   085  }

   086 }

   087 

   088 r = kvm_mmu_reload(vcpu);                   // Load VM memory page table

   089 if (unlikely(r)) {

   090  goto cancel_injection;

   091 }

   092 

   093 preempt_disable();                          // Disable preemption

   094 

   095 kvm_x86_ops->prepare_guest_switch(vcpu);    // On VMX this is vmx_save_host_state(): saves host state, including the fs and gs segment selectors

   096 if (vcpu->fpu_active)

   097  kvm_load_guest_fpu(vcpu);

   098 kvm_load_guest_xcr0(vcpu);                  // Check CR4.OSXSAVE feature.

   099 

   100 vcpu->mode = IN_GUEST_MODE;

   101 

   102 /* We should set ->mode before check ->requests,

   103  * see the comment in make_all_cpus_request.

   104  */

   105 smp_mb();                                   // Full memory barrier: orders the ->mode store above before the ->requests check below

   106 

   107 local_irq_disable();                        // Ultimately executes native_irq_disable(), i.e. the CLI instruction

   108 

   109 if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests

   110     || need_resched() || signal_pending(current)) {

   111  vcpu->mode = OUTSIDE_GUEST_MODE;

   112  smp_wmb();

   113  local_irq_enable();

   114  preempt_enable();

   115  r = 1;

   116  goto cancel_injection;

   117 }

   118 

   119 srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);  // Leave the SRCU read-side critical section before entering the guest (RCU is a synchronization mechanism added to the Linux kernel during 2.5)

   120 

   121 if (req_immediate_exit)

   122  smp_send_reschedule(vcpu->cpu);

   123 

   124 kvm_guest_enter();                                   // Starts guest-time accounting and notes an RCU quiescent state

   125 

   126 if (unlikely(vcpu->arch.switch_db_regs)) {

   127  set_debugreg(0, 7);

   128  set_debugreg(vcpu->arch.eff_db[0], 0);

   129  set_debugreg(vcpu->arch.eff_db[1], 1);

   130  set_debugreg(vcpu->arch.eff_db[2], 2);

   131  set_debugreg(vcpu->arch.eff_db[3], 3);

   132 }

   133 

   134 trace_kvm_entry(vcpu->vcpu_id);

   135 kvm_x86_ops->run(vcpu);                            // On VMX this is vmx_vcpu_run(), listed below

   136 

   137 /*

   138  * If the guest has used debug registers, at least dr7

   139  * will be disabled while returning to the host.

   140  * If we don't have active breakpoints in the host, we don't

   141  * care about the messed up debug address registers. But if

   142  * we have some of them active, restore the old state.

   143  */

   144 if (hw_breakpoint_active())

   145  hw_breakpoint_restore();

   146 

   147 vcpu->arch.last_guest_tsc = kvm_x86_ops->read_l1_tsc(vcpu,

   148          native_read_tsc());

   149 

   150 vcpu->mode = OUTSIDE_GUEST_MODE;

   151 smp_wmb();

   152 

   153 /* Interrupt is enabled by handle_external_intr() */

   154 kvm_x86_ops->handle_external_intr(vcpu);

   155 

   156 ++vcpu->stat.exits;

   157 

   158 /*

   159  * We must have an instruction between local_irq_enable() and

   160  * kvm_guest_exit(), so the timer interrupt isn't delayed by

   161  * the interrupt shadow.  The stat.exits increment will do nicely.

   162  * But we need to prevent reordering, hence this barrier():

   163  */

   164 barrier();

   165 

   166 kvm_guest_exit();

   167 

   168 preempt_enable();

   169 

   170 vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);

   171 

   172 /*

   173  * Profile KVM exit RIPs:

   174  */

   175 if (unlikely(prof_on == KVM_PROFILING)) {

   176  unsigned long rip = kvm_rip_read(vcpu);

   177  profile_hit(KVM_PROFILING, (void *)rip);

   178 }

   179 

   180 if (unlikely(vcpu->arch.tsc_always_catchup))

   181  kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

   182 

   183 if (vcpu->arch.apic_attention)

   184  kvm_lapic_sync_from_vapic(vcpu);

   185 

   186 r = kvm_x86_ops->handle_exit(vcpu);

   187 return r;

   188 

   189cancel_injection:

   190 kvm_x86_ops->cancel_injection(vcpu);

   191 if (unlikely(vcpu->arch.apic_attention))

   192  kvm_lapic_sync_from_vapic(vcpu);

   193out:

   194 return r;

   195}

vmx_vcpu_run(struct kvm_vcpu *vcpu)

   001static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)

   002{

   003 struct vcpu_vmx *vmx = to_vmx(vcpu);

   004 unsigned long debugctlmsr;

   005 

   006 /* Record the guest's net vcpu time for enforced NMI injections. */

   007 if (unlikely(!cpu_has_virtual_nmis() && vmx->soft_vnmi_blocked))

   008  vmx->entry_time = ktime_get();

   009 

   010 /* Don't enter VMX if guest state is invalid, let the exit handler

   011    start emulation until we arrive back to a valid state */

   012 if (vmx->emulation_required)

   013  return;

   014 

   015 if (vmx->nested.sync_shadow_vmcs) {

   016  copy_vmcs12_to_shadow(vmx);

   017  vmx->nested.sync_shadow_vmcs = false;

   018 }

   019 

   020 if (test_bit(VCPU_REGS_RSP, (unsigned long *)&vcpu->arch.regs_dirty))

   021  vmcs_writel(GUEST_RSP, vcpu->arch.regs[VCPU_REGS_RSP]);

   022 if (test_bit(VCPU_REGS_RIP, (unsigned long *)&vcpu->arch.regs_dirty))

   023  vmcs_writel(GUEST_RIP, vcpu->arch.regs[VCPU_REGS_RIP]);

   024 

   025 /* When single-stepping over STI and MOV SS, we must clear the

   026  * corresponding interruptibility bits in the guest state. Otherwise

   027  * vmentry fails as it then expects bit 14 (BS) in pending debug

   028  * exceptions being set, but that's not correct for the guest debugging

   029  * case. */

   030 if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)

   031  vmx_set_interrupt_shadow(vcpu, 0);

   032 

   033 atomic_switch_perf_msrs(vmx);

   034 debugctlmsr = get_debugctlmsr();

   035 

   036 vmx->__launched = vmx->loaded_vmcs->launched;

   037 asm(

   038  /* Store host registers */

   039  "push %%" _ASM_DX "; push %%" _ASM_BP ";"

   040  "push %%" _ASM_CX " \n\t" /* placeholder for guest rcx */

   041  "push %%" _ASM_CX " \n\t"

   042  "cmp %%" _ASM_SP ", %c[host_rsp](%0) \n\t"

   043  "je 1f \n\t"

   044  "mov %%" _ASM_SP ", %c[host_rsp](%0) \n\t"

   045  __ex(ASM_VMX_VMWRITE_RSP_RDX) "\n\t"

   046  "1: \n\t"

   047  /* Reload cr2 if changed */

   048  "mov %c[cr2](%0), %%" _ASM_AX " \n\t"

   049  "mov %%cr2, %%" _ASM_DX " \n\t"

   050  "cmp %%" _ASM_AX ", %%" _ASM_DX " \n\t"

   051  "je 2f \n\t"

   052  "mov %%" _ASM_AX", %%cr2 \n\t"

   053  "2: \n\t"

   054  /* Check if vmlaunch of vmresume is needed */

   055  "cmpl $0, %c[launched](%0) \n\t"

   056  /* Load guest registers.  Don't clobber flags. */

   057  "mov %c[rax](%0), %%" _ASM_AX " \n\t"

   058  "mov %c[rbx](%0), %%" _ASM_BX " \n\t"

   059  "mov %c[rdx](%0), %%" _ASM_DX " \n\t"

   060  "mov %c[rsi](%0), %%" _ASM_SI " \n\t"

   061  "mov %c[rdi](%0), %%" _ASM_DI " \n\t"

   062  "mov %c[rbp](%0), %%" _ASM_BP " \n\t"

   063#ifdef CONFIG_X86_64

   064  "mov %c[r8](%0),  %%r8  \n\t"

   065  "mov %c[r9](%0),  %%r9  \n\t"

   066  "mov %c[r10](%0), %%r10 \n\t"

   067  "mov %c[r11](%0), %%r11 \n\t"

   068  "mov %c[r12](%0), %%r12 \n\t"

   069  "mov %c[r13](%0), %%r13 \n\t"

   070  "mov %c[r14](%0), %%r14 \n\t"

   071  "mov %c[r15](%0), %%r15 \n\t"

   072#endif

   073  "mov %c[rcx](%0), %%" _ASM_CX " \n\t" /* kills %0 (ecx) */

   074 

   075  /* Enter guest mode */

   076  "jne 1f \n\t"

   077  __ex(ASM_VMX_VMLAUNCH) "\n\t"

   078  "jmp 2f \n\t"

   079  "1: " __ex(ASM_VMX_VMRESUME) "\n\t"

   080  "2: "

   081  /* Save guest registers, load host registers, keep flags */

   082  "mov %0, %c[wordsize](%%" _ASM_SP ") \n\t"

   083  "pop %0 \n\t"

   084  "mov %%" _ASM_AX ", %c[rax](%0) \n\t"

   085  "mov %%" _ASM_BX ", %c[rbx](%0) \n\t"

   086  __ASM_SIZE(pop) " %c[rcx](%0) \n\t"

   087  "mov %%" _ASM_DX ", %c[rdx](%0) \n\t"

   088  "mov %%" _ASM_SI ", %c[rsi](%0) \n\t"

   089  "mov %%" _ASM_DI ", %c[rdi](%0) \n\t"

   090  "mov %%" _ASM_BP ", %c[rbp](%0) \n\t"

   091#ifdef CONFIG_X86_64

   092  "mov %%r8,  %c[r8](%0) \n\t"

   093  "mov %%r9,  %c[r9](%0) \n\t"

   094  "mov %%r10, %c[r10](%0) \n\t"

   095  "mov %%r11, %c[r11](%0) \n\t"

   096  "mov %%r12, %c[r12](%0) \n\t"

   097  "mov %%r13, %c[r13](%0) \n\t"

   098  "mov %%r14, %c[r14](%0) \n\t"

   099  "mov %%r15, %c[r15](%0) \n\t"

   100#endif

   101  "mov %%cr2, %%" _ASM_AX "   \n\t"

   102  "mov %%" _ASM_AX ", %c[cr2](%0) \n\t"

   103 

   104  "pop  %%" _ASM_BP "; pop  %%" _ASM_DX " \n\t"

   105  "setbe %c[fail](%0) \n\t"

   106  ".pushsection .rodata \n\t"

   107  ".global vmx_return \n\t"

   108  "vmx_return: " _ASM_PTR " 2b \n\t"

   109  ".popsection"

   110       : : "c"(vmx), "d"((unsigned long)HOST_RSP),

   111  [launched]"i"(offsetof(struct vcpu_vmx, __launched)),

   112  [fail]"i"(offsetof(struct vcpu_vmx, fail)),

   113  [host_rsp]"i"(offsetof(struct vcpu_vmx, host_rsp)),

   114  [rax]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RAX])),

   115  [rbx]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RBX])),

   116  [rcx]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RCX])),

   117  [rdx]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RDX])),

   118  [rsi]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RSI])),

   119  [rdi]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RDI])),

   120  [rbp]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_RBP])),

   121#ifdef CONFIG_X86_64

   122  [r8]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R8])),

   123  [r9]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R9])),

   124  [r10]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R10])),

   125  [r11]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R11])),

   126  [r12]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R12])),

   127  [r13]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R13])),

   128  [r14]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R14])),

   129  [r15]"i"(offsetof(struct vcpu_vmx, vcpu.arch.regs[VCPU_REGS_R15])),

   130#endif

   131  [cr2]"i"(offsetof(struct vcpu_vmx, vcpu.arch.cr2)),

   132  [wordsize]"i"(sizeof(ulong))

   133       : "cc", "memory"

   134#ifdef CONFIG_X86_64

   135  , "rax", "rbx", "rdi", "rsi"

   136  , "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"

   137#else

   138  , "eax", "ebx", "edi", "esi"

   139#endif

   140       );

   141 

   142 /* MSR_IA32_DEBUGCTLMSR is zeroed on vmexit. Restore it if needed */

   143 if (debugctlmsr)

   144  update_debugctlmsr(debugctlmsr);

   145 

   146#ifndef CONFIG_X86_64

   147 /*

   148  * The sysexit path does not restore ds/es, so we must set them to

   149  * a reasonable value ourselves.

   150  *

   151  * We can't defer this to vmx_load_host_state() since that function

   152  * may be executed in interrupt context, which saves and restore segments

   153  * around it, nullifying its effect.

   154  */

   155 loadsegment(ds, __USER_DS);

   156 loadsegment(es, __USER_DS);

   157#endif

   158 

   159 vcpu->arch.regs_avail = ~((1 << VCPU_REGS_RIP) | (1 << VCPU_REGS_RSP)

   160      | (1 << VCPU_EXREG_RFLAGS)

   161      | (1 << VCPU_EXREG_CPL)

   162      | (1 << VCPU_EXREG_PDPTR)

   163      | (1 << VCPU_EXREG_SEGMENTS)

   164      | (1 << VCPU_EXREG_CR3));

   165 vcpu->arch.regs_dirty = 0;

   166 

   167 vmx->idt_vectoring_info = vmcs_read32(IDT_VECTORING_INFO_FIELD);

   168 

   169 vmx->loaded_vmcs->launched = 1;

   170 

   171 vmx->exit_reason = vmcs_read32(VM_EXIT_REASON);

   172 trace_kvm_exit(vmx->exit_reason, vcpu, KVM_ISA_VMX);

   173 

   174 vmx_complete_atomic_exit(vmx);

   175 vmx_recover_nmi_blocking(vmx);

   176 vmx_complete_interrupts(vmx);

   177}

vmx_handle_exit()

static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu)

static int (*const kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
    [EXIT_REASON_EXCEPTION_NMI]           = handle_exception,
    [EXIT_REASON_EXTERNAL_INTERRUPT]      = handle_external_interrupt,
    [EXIT_REASON_TRIPLE_FAULT]            = handle_triple_fault,
    [EXIT_REASON_NMI_WINDOW]              = handle_nmi_window,
    [EXIT_REASON_IO_INSTRUCTION]          = handle_io,
    [EXIT_REASON_CR_ACCESS]               = handle_cr,
    [EXIT_REASON_DR_ACCESS]               = handle_dr,
    [EXIT_REASON_CPUID]                   = handle_cpuid,
    [EXIT_REASON_MSR_READ]                = handle_rdmsr,
    [EXIT_REASON_MSR_WRITE]               = handle_wrmsr,
    [EXIT_REASON_PENDING_INTERRUPT]       = handle_interrupt_window,
    [EXIT_REASON_HLT]                     = handle_halt,
    [EXIT_REASON_INVD]                    = handle_invd,
    [EXIT_REASON_INVLPG]                  = handle_invlpg,
    [EXIT_REASON_RDPMC]                   = handle_rdpmc,
    [EXIT_REASON_VMCALL]                  = handle_vmcall,
    [EXIT_REASON_VMCLEAR]                 = handle_vmclear,
    [EXIT_REASON_VMLAUNCH]                = handle_vmlaunch,
    [EXIT_REASON_VMPTRLD]                 = handle_vmptrld,
    [EXIT_REASON_VMPTRST]                 = handle_vmptrst,
    [EXIT_REASON_VMREAD]                  = handle_vmread,
    [EXIT_REASON_VMRESUME]                = handle_vmresume,
    [EXIT_REASON_VMWRITE]                 = handle_vmwrite,
    [EXIT_REASON_VMOFF]                   = handle_vmoff,
    [EXIT_REASON_VMON]                    = handle_vmon,
    [EXIT_REASON_TPR_BELOW_THRESHOLD]     = handle_tpr_below_threshold,
    [EXIT_REASON_APIC_ACCESS]             = handle_apic_access,
    [EXIT_REASON_APIC_WRITE]              = handle_apic_write,
    [EXIT_REASON_EOI_INDUCED]             = handle_apic_eoi_induced,
    [EXIT_REASON_WBINVD]                  = handle_wbinvd,
    [EXIT_REASON_XSETBV]                  = handle_xsetbv,
    [EXIT_REASON_TASK_SWITCH]             = handle_task_switch,
    [EXIT_REASON_MCE_DURING_VMENTRY]      = handle_machine_check,
    [EXIT_REASON_EPT_VIOLATION]           = handle_ept_violation,
    [EXIT_REASON_EPT_MISCONFIG]           = handle_ept_misconfig,
    [EXIT_REASON_PAUSE_INSTRUCTION]       = handle_pause,
    [EXIT_REASON_MWAIT_INSTRUCTION]       = handle_invalid_op,
    [EXIT_REASON_MONITOR_INSTRUCTION]     = handle_invalid_op,
    [EXIT_REASON_INVEPT]                  = handle_invept,
};

QEMU Source Code Notes

Here are some loosely organized notes I took while reading QEMU’s source code. They cover only a few small parts of QEMU.

I’ll update this note when I encounter any of the following situations:

The implementation of some specific part is complicated
I forget something I previously understood

I find pictures easier to follow, and I use LucidChart as my drawing tool because I find it the easiest to use. However, I haven’t found a satisfactory way to represent code logic as a graph. I’ve tried various forms, such as plain flowcharts and UML sequence diagrams, so the style of my pictures is not consistent; sorry for that.

CPU emulation

CPU initialization

The CPU initialization process is illustrated in the following picture.

CPU initialization process

CPU emulation thread

Starting from QEMU 1.0, all emulated CPUs run in a single thread. Here’s an overview of what this thread is doing.

CPU thread

Emulated Memory

Emulated memory allocation

Note: the following description applies to QEMU 0.13; this changed in QEMU 1.0.

The following picture gives an overview of how emulated main memory is allocated.

allocating emulated memory

Let’s look at how main memory is allocated in QEMU. In hw/pc_piix.c, pc_init1 initializes the whole PC and calls pc_memory_init to allocate emulated memory. The memory is actually allocated by qemu_ram_alloc, which pc_memory_init calls.

QEMU needs to emulate many kinds of devices, and each kind needs its own memory, e.g. RAM, BIOS, VGA. QEMU uses a list to manage all the allocated device memory.

qemu_ram_alloc allocates a new RAMBlock node, which records:

host: the actual host memory (in QEMU)
length: the size of this memory region
offset: the ram_addr (more details following)
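
To make the three fields concrete, here is a minimal sketch of such a block list and of the two conversions between ram_addr and host address. It is not QEMU's code: the `sketch_` names, the singly linked list and the "next free offset" counter are assumptions made for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t ram_addr_t;
#define SKETCH_RAM_ADDR_INVALID ((ram_addr_t)-1)

/* Simplified stand-in for RAMBlock: only the fields listed above. */
typedef struct RAMBlock {
    uint8_t *host;          /* backing memory inside the emulator process */
    ram_addr_t offset;      /* start of this block in ram_addr space */
    ram_addr_t length;      /* size of the block in bytes */
    struct RAMBlock *next;
} RAMBlock;

static RAMBlock *sketch_ram_list;      /* hypothetical list head */
static ram_addr_t sketch_next_offset;  /* next unused ram_addr */

/* Allocate host memory and assign it the next free ram_addr range. */
ram_addr_t sketch_ram_alloc(ram_addr_t size)
{
    RAMBlock *b;

    if (size == 0)
        return SKETCH_RAM_ADDR_INVALID;
    b = malloc(sizeof(*b));
    if (!b)
        return SKETCH_RAM_ADDR_INVALID;
    b->host = calloc(1, size);
    if (!b->host) {
        free(b);
        return SKETCH_RAM_ADDR_INVALID;
    }
    b->offset = sketch_next_offset;
    b->length = size;
    b->next = sketch_ram_list;
    sketch_ram_list = b;
    sketch_next_offset += size;
    return b->offset;
}

/* ram_addr -> host pointer: walk the list to find the owning block. */
void *sketch_get_ram_ptr(ram_addr_t addr)
{
    for (RAMBlock *b = sketch_ram_list; b; b = b->next) {
        if (addr >= b->offset && addr - b->offset < b->length)
            return b->host + (addr - b->offset);
    }
    return NULL;
}

/* host pointer -> ram_addr: the reverse lookup. Returns 0 on success. */
int sketch_ram_addr_from_host(const void *ptr, ram_addr_t *out)
{
    uintptr_t p = (uintptr_t)ptr;

    for (RAMBlock *b = sketch_ram_list; b; b = b->next) {
        uintptr_t base = (uintptr_t)b->host;
        if (p >= base && p - base < b->length) {
            *out = b->offset + (p - base);
            return 0;
        }
    }
    return -1;
}
```

Both lookups are linear in the number of blocks, which is acceptable only because a machine has a handful of such blocks.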

To support pluggable devices, emulated memory needs to be releasable. This means the same guest physical address (the guest is the emulated target) may refer to different devices, and thus different host memory, at different times. I guess this is why QEMU introduces an indirection, ram_addr, between guest physical addresses and host memory. (I’m not sure about this, because I don’t see any problem with mapping guest physical memory directly to host memory.)

A ram_addr is assigned when memory is allocated. Using ram_list, it’s possible to convert between a ram_addr and a host address. Many places in the softmmu system use this ram_addr to refer to allocated memory.

After memory is allocated, it needs to be registered by calling cpu_register_physical_memory_offset so that the softmmu system knows how to access it (with an IO function or a memory read/write function). To register memory, we specify which guest physical address range the allocated memory represents. To speed up finding the ram_addr for a guest physical address (which functions like ldl_phys need), the register function sets up l1_phys_map, a data structure very similar to a page table. (TODO: more details)
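
As a rough illustration of the page-table-like lookup, the sketch below maps a 32-bit guest physical address to a ram_addr through a two-level table. The level sizes (10 + 10 + 12 bits), the `sketch_` names and the "unassigned" marker are assumptions, not QEMU's actual layout, and the start address is assumed to be page aligned.

```c
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_PAGE_BITS  12
#define SKETCH_L2_BITS    10
#define SKETCH_L1_BITS    10
#define SKETCH_L1_SIZE    (1u << SKETCH_L1_BITS)
#define SKETCH_L2_SIZE    (1u << SKETCH_L2_BITS)
#define SKETCH_UNASSIGNED UINT64_MAX

/* First level: pointers to lazily allocated leaf tables of ram_addr. */
static uint64_t *sketch_l1_phys_map[SKETCH_L1_SIZE];

/* Return the slot for the page containing gpa, allocating the leaf
 * table on demand when alloc is non-zero. */
static uint64_t *sketch_phys_page_find(uint32_t gpa, int alloc)
{
    uint32_t page = gpa >> SKETCH_PAGE_BITS;
    uint32_t l1 = page >> SKETCH_L2_BITS;
    uint32_t l2 = page & (SKETCH_L2_SIZE - 1);

    if (!sketch_l1_phys_map[l1]) {
        uint64_t *leaf;

        if (!alloc)
            return NULL;
        leaf = malloc(SKETCH_L2_SIZE * sizeof(*leaf));
        if (!leaf)
            return NULL;
        for (uint32_t i = 0; i < SKETCH_L2_SIZE; i++)
            leaf[i] = SKETCH_UNASSIGNED;
        sketch_l1_phys_map[l1] = leaf;
    }
    return &sketch_l1_phys_map[l1][l2];
}

/* Record that guest physical [start, start + size) is backed by the
 * ram_addr range beginning at ram_addr. Returns 0 on success. */
int sketch_register_physical_memory(uint32_t start, uint32_t size,
                                    uint64_t ram_addr)
{
    for (uint64_t off = 0; off < size; off += 1u << SKETCH_PAGE_BITS) {
        uint64_t *slot = sketch_phys_page_find((uint32_t)(start + off), 1);
        if (!slot)
            return -1;
        *slot = ram_addr + off;
    }
    return 0;
}

/* Guest physical address -> ram_addr, or SKETCH_UNASSIGNED. */
uint64_t sketch_phys_to_ram_addr(uint32_t gpa)
{
    uint64_t *slot = sketch_phys_page_find(gpa, 0);

    if (!slot || *slot == SKETCH_UNASSIGNED)
        return SKETCH_UNASSIGNED;
    return *slot + (gpa & ((1u << SKETCH_PAGE_BITS) - 1));
}
```

The lookup costs two array indexings regardless of how many regions are registered, which is the point of using a table instead of searching a list.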

Softmmu

The softmmu emulation uses C macros to emulate a template system. Several template header files are included multiple times by other files to generate functions for different access sizes and for accessing guest memory with different privileges.

The following picture shows the include relationships between the related files.

softmmu header file include relation
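
The idea of generating one function per access size from a single template can be sketched in one file. Note the swap: QEMU re-includes a template header with different parameters, while this sketch uses a function-generating macro so that it stays self-contained. The `guest_ram` buffer and the `sketch_ld*`/`sketch_st*` names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Tiny stand-in for guest memory. */
static uint8_t guest_ram[64];

/* One "template": expands to a load and a store function for the
 * given C type. Out-of-range accesses read as 0 and ignore writes. */
#define GEN_LD_ST(suffix, type)                                   \
    type sketch_ld##suffix(uint32_t addr)                         \
    {                                                             \
        type v = 0;                                               \
        if (addr <= sizeof(guest_ram) - sizeof(v))                \
            memcpy(&v, &guest_ram[addr], sizeof(v));              \
        return v;                                                 \
    }                                                             \
    void sketch_st##suffix(uint32_t addr, type v)                 \
    {                                                             \
        if (addr <= sizeof(guest_ram) - sizeof(v))                \
            memcpy(&guest_ram[addr], &v, sizeof(v));              \
    }

/* Instantiate the template once per access size. */
GEN_LD_ST(b, uint8_t)
GEN_LD_ST(w, uint16_t)
GEN_LD_ST(l, uint32_t)
GEN_LD_ST(q, uint64_t)
```

With a re-included header, the same effect is achieved by defining a size parameter before each `#include` and undefining it at the end of the header.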

IRQState and related function

There are 2 types of hardware IRQ:

Triggered while the electrical level is high (level-sensitive)
Triggered when the signal produces a pulse (edge-triggered)

 /* in hw/irq.h */

typedef void (*qemu_irq_handler)(void *opaque, int n, int level);


/* in hw/irq.c */
struct IRQState {
    qemu_irq_handler handler;
    void *opaque;
    int n;
};
/* qemu-common.h typedef struct IRQState *qemu_irq */

void qemu_set_irq(qemu_irq irq, int level)
{
    if (!irq)
        return;

    irq->handler(irq->opaque, irq->n, level);
}

In hw/irq.c, qemu_set_irq calls the irq handler stored in IRQState. qemu_irq_raise, qemu_irq_lower and qemu_irq_pulse simply call qemu_set_irq to adjust the level of the IRQ.

qemu_allocate_irqs allocates an array of qemu_irq, each pointing to a newly allocated IRQState. All of them share the same handler and opaque value and differ only in n.
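
The pieces above fit together as in the following sketch. `IRQState`, `qemu_irq` and `qemu_set_irq` mirror the snippet shown earlier; the `sketch_` functions and the `record_level` handler are illustrative assumptions and are not copied from hw/irq.c.

```c
#include <stdlib.h>

typedef void (*qemu_irq_handler)(void *opaque, int n, int level);

struct IRQState {
    qemu_irq_handler handler;
    void *opaque;
    int n;
};
typedef struct IRQState *qemu_irq;

void qemu_set_irq(qemu_irq irq, int level)
{
    if (!irq)
        return;
    irq->handler(irq->opaque, irq->n, level);
}

/* Allocate n lines that share handler and opaque and differ only in
 * their line number. Returns NULL on failure. */
qemu_irq *sketch_allocate_irqs(qemu_irq_handler handler, void *opaque, int n)
{
    qemu_irq *lines;
    struct IRQState *states;

    if (n <= 0)
        return NULL;
    lines = malloc((size_t)n * sizeof(*lines));
    states = malloc((size_t)n * sizeof(*states));
    if (!lines || !states) {
        free(lines);
        free(states);
        return NULL;
    }
    for (int i = 0; i < n; i++) {
        states[i].handler = handler;
        states[i].opaque = opaque;
        states[i].n = i;
        lines[i] = &states[i];
    }
    return lines;
}

/* Thin wrappers that only choose the level. */
void sketch_irq_raise(qemu_irq irq) { qemu_set_irq(irq, 1); }
void sketch_irq_lower(qemu_irq irq) { qemu_set_irq(irq, 0); }
void sketch_irq_pulse(qemu_irq irq)
{
    qemu_set_irq(irq, 1);
    qemu_set_irq(irq, 0);
}

/* Example handler: opaque is an int array holding one level per line. */
void record_level(void *opaque, int n, int level)
{
    ((int *)opaque)[n] = level;
}
```

Because the handler receives both opaque and n, a single device function can serve all of its input lines and tell them apart by number.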

i8254 as an example

TODO

Timer

I’m not very familiar with x86 hardware, so the hardware-related descriptions here may well be wrong.

Host alarm

On x86, both the main board and the CPU cores have timers. I call the main board timer the host alarm; it corresponds to the PIT. More specifically, QEMU emulates an i8254 timer attached to an i8259 interrupt controller.

`QEMUTimer`

pit_init creates a QEMUTimer whose callback is pit_irq_timer. The definition of QEMUTimer is:

 struct QEMUTimer {
  QEMUClock *clock;
  int64_t expire_time;
  QEMUTimerCB *cb;
  void *opaque;
  struct QEMUTimer *next;
};

The clock type for the PIT timer is vm_clock, which only runs while the VM runs. (There are also rt_clock and host_clock; the differences are described in comments in qemu-timer.h.) A timer must be put on the active_timers list (using qemu_mod_timer) in order to run. main_loop_wait calls qemu_run_all_timers, which then invokes the callbacks of all expired timers (those with expire time <= current time).

QEMUTimer only describes when the callback should be called. Because QEMU uses block chaining, a virtual CPU may execute a lot of translated code without returning to the main loop, so timer callbacks may not run promptly. QEMU therefore uses an alarm timer (introduced later) to avoid waiting too long before handling a timer interrupt.
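
A minimal sketch of the arm-then-run cycle, assuming a single list sorted by expiry time. The `Sketch`/`sketch_` names are hypothetical and the clock is reduced to a plain `now` argument; QEMU's real code keeps separate lists per clock.

```c
#include <stddef.h>
#include <stdint.h>

typedef void SketchTimerCB(void *opaque);

typedef struct SketchTimer {
    int64_t expire_time;
    SketchTimerCB *cb;
    void *opaque;
    struct SketchTimer *next;
} SketchTimer;

/* Active timers, kept sorted by ascending expire_time. */
static SketchTimer *sketch_active_timers;

/* Unlink a timer from the active list if it is on it. */
void sketch_del_timer(SketchTimer *t)
{
    SketchTimer **pt = &sketch_active_timers;

    while (*pt) {
        if (*pt == t) {
            *pt = t->next;
            t->next = NULL;
            return;
        }
        pt = &(*pt)->next;
    }
}

/* (Re)arm a timer: remove it, then insert it at its sorted position. */
void sketch_mod_timer(SketchTimer *t, int64_t expire_time)
{
    SketchTimer **pt = &sketch_active_timers;

    sketch_del_timer(t);
    while (*pt && (*pt)->expire_time <= expire_time)
        pt = &(*pt)->next;
    t->expire_time = expire_time;
    t->next = *pt;
    *pt = t;
}

/* Run every timer whose expiry time is not after now. Each timer is
 * unlinked before its callback runs, so a callback may re-arm it. */
void sketch_run_timers(int64_t now)
{
    while (sketch_active_timers &&
           sketch_active_timers->expire_time <= now) {
        SketchTimer *t = sketch_active_timers;

        sketch_active_timers = t->next;
        t->next = NULL;
        t->cb(t->opaque);
    }
}

/* Example callback: count how many times a timer fired. */
void sketch_count_fired(void *opaque)
{
    ++*(int *)opaque;
}
```

Keeping the list sorted means the run loop only ever inspects the head, and the head's expiry time is also the deadline an alarm timer would need to be armed for.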

Timer IRQ handler

When the timer fires, it should send an interrupt to the CPU. This is done through the timer irq handler.

In pit_initfn, a new QEMUTimer is created, and an irq handler is acquired from

The irq passed to pit_init needs explanation. Here’s the initialization function calls:

pc_init1 (pc_piix.c):
  isa_irq = qemu_allocate_irqs(isa_irq_handler, …);
  ...
  pc_basic_device_init(isa_irq, …);

pc_basic_device_init (pc.c):
  pit = pit_init(0x40, isa_reserve_irq(0));

Here, the reserved irq 0’s handler is allocated in pc_init1, so the handler is isa_irq_handler, which actually calls i8259_set_irq.

The i8259’s parent irq is the cpu irq, whose handler is pic_irq_request. That handler modifies CPUState’s interrupt_request field, which finally delivers the irq to the virtual cpu.
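
The chain from timer to CPU can be sketched as three handlers wired together. Everything here is a simplified assumption: the `Sketch*` types, the flag value and the PIC logic (which ignores masking and priorities) only mirror the roles of pit_irq_timer, i8259_set_irq and pic_irq_request.

```c
#include <stdint.h>

#define SKETCH_INTERRUPT_HARD 0x2u  /* hypothetical flag value */

typedef void (*sketch_irq_handler)(void *opaque, int n, int level);

typedef struct {
    sketch_irq_handler handler;
    void *opaque;
    int n;
} SketchIrq;

typedef struct {
    uint32_t interrupt_request;  /* polled by the CPU loop */
} SketchCPU;

typedef struct {
    uint8_t irr;        /* one pending bit per input line */
    SketchIrq *parent;  /* output line towards the CPU */
} SketchPIC;

void sketch_set_irq(SketchIrq *irq, int level)
{
    if (irq)
        irq->handler(irq->opaque, irq->n, level);
}

/* Role of pic_irq_request: reflect the line level in the CPU state. */
void sketch_cpu_irq_request(void *opaque, int n, int level)
{
    SketchCPU *cpu = opaque;

    (void)n;
    if (level)
        cpu->interrupt_request |= SKETCH_INTERRUPT_HARD;
    else
        cpu->interrupt_request &= ~SKETCH_INTERRUPT_HARD;
}

/* Role of i8259_set_irq: latch the input, then drive the parent line
 * high while any input is pending. */
void sketch_pic_set_irq(void *opaque, int n, int level)
{
    SketchPIC *pic = opaque;

    if (level)
        pic->irr |= (uint8_t)(1u << n);
    else
        pic->irr &= (uint8_t)~(1u << n);
    sketch_set_irq(pic->parent, pic->irr != 0);
}

/* Role of the timer callback: raise the PIT's line (ISA irq 0). */
void sketch_pit_fire(SketchIrq *pit_irq)
{
    sketch_set_irq(pit_irq, 1);
}
```

Each device only knows the line it drives, so the same timer code would work unchanged if its line were wired to a different interrupt controller.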

Alarm timer

To notify the cpu periodically, qemu_alarm_timer provides an interval timer:

struct qemu_alarm_timer {
    char const *name;
    int (*start)(struct qemu_alarm_timer *t);
    void (*stop)(struct qemu_alarm_timer *t);
    void (*rearm)(struct qemu_alarm_timer *t);
    void *priv;

    char expired;
    char pending;
};

Different systems can implement the alarm timer with different mechanisms. On Linux, it is implemented using timer_create(CLOCK_REALTIME) and is named “dynticks”. (Other alarm timers include “hpet”, “rtc” and “unix”.)

Each time the alarm timer fires, a signal is sent. The signal handler sets the alarm timer’s pending field if the timer has expired, so we know it fired and can rearm it later. Most importantly, the handler notifies the cpus so that they stop executing translated code and handle timers as soon as possible.

Module infrastructure

The module infrastructure provides easy initialization for devices, machines and other components in QEMU. The following description is based on QEMU v1.0.1.

Types of modules:

block
device
machine
qapi

In module.h, the module_init macro generates a function with the constructor attribute; this function registers the module initialization function. Each module type has a corresponding macro that invokes module_init with the module type specified.

 #define module_init(function, type)                                         \
static void __attribute__((constructor)) do_qemu_init_ ## function(void) {  \
    register_module_init(function, type);                                   \
}

#define block_init(function) module_init(function, MODULE_INIT_BLOCK)
#define device_init(function) module_init(function, MODULE_INIT_DEVICE)
#define machine_init(function) module_init(function, MODULE_INIT_MACHINE)
#define qapi_init(function) module_init(function, MODULE_INIT_QAPI)

Take the i8254 timer as an example

 /* In hw/i8254.c */
static void pit_register(void)
{
    isa_qdev_register(&pit_info);
}
device_init(pit_register)

/* Generated function after macro expansion. */
static void __attribute__((constructor)) do_qemu_init_pit_register(void) {
    register_module_init(pit_register, MODULE_INIT_DEVICE);
}

Each module type has a list holding all of its initialization function pointers. register_module_init finds the corresponding list and appends the function pointer to its end.

When is the initialization function called?

module_call_init gets the initialization function list for the specified module type and invokes the functions one by one. Grepping for this function finds the call sites. Here’s a list:

vl.c: module_call_init(MODULE_INIT_MACHINE)
vl.c: module_call_init(MODULE_INIT_DEVICE)
block.c: module_call_init(MODULE_INIT_BLOCK)
qemu-ga.c: module_call_init(MODULE_INIT_QAPI)
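
The register-then-call pattern can be reproduced in a few lines. The macro and function names follow the text above, but the list implementation is a simplified assumption (a plain singly linked list with a tail pointer), and `devices_registered` is a hypothetical counter standing in for real device registration. The constructor attribute requires GCC or Clang.

```c
#include <stdlib.h>

typedef enum {
    MODULE_INIT_BLOCK,
    MODULE_INIT_DEVICE,
    MODULE_INIT_MACHINE,
    MODULE_INIT_QAPI,
    MODULE_INIT_MAX
} module_init_type;

typedef struct ModuleEntry {
    void (*init)(void);
    struct ModuleEntry *next;
} ModuleEntry;

/* One list per module type; tails[] points at the slot to append to. */
static ModuleEntry *heads[MODULE_INIT_MAX];
static ModuleEntry **tails[MODULE_INIT_MAX];

/* Append fn to the list for its module type. */
void register_module_init(void (*fn)(void), module_init_type type)
{
    ModuleEntry *e = malloc(sizeof(*e));

    if (!e)
        abort();
    e->init = fn;
    e->next = NULL;
    if (!tails[type])
        tails[type] = &heads[type];
    *tails[type] = e;
    tails[type] = &e->next;
}

/* Invoke every registered init function of one type, in order. */
void module_call_init(module_init_type type)
{
    for (ModuleEntry *e = heads[type]; e; e = e->next)
        e->init();
}

/* The constructor runs before main() and only records the function. */
#define module_init(function, type)                                         \
static void __attribute__((constructor)) do_qemu_init_ ## function(void) {  \
    register_module_init(function, type);                                   \
}
#define device_init(function) module_init(function, MODULE_INIT_DEVICE)

/* Hypothetical device: its init function just bumps a counter. */
int devices_registered;

static void pit_register(void)
{
    devices_registered++;
}
device_init(pit_register)
```

Registration happens before main() runs, but nothing is initialized until module_call_init is invoked for that type, which lets the program choose the order in which block drivers, machines and devices come up.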

Coroutine in QEMU

Learning the coroutine concept

I recommend Simon Tatham’s article “Coroutines in C”.

The key is to provide a way for a function to ‘return and continue’. Here, ‘continue’ means that on the next call the function resumes executing at the point where it last “returned”.
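As a small illustration of ‘return and continue’, the sketch below uses the switch-based trick from that article: a static state variable records where to resume, and a case label inside the loop is the resume point. The function name counter_coroutine is made up for this example. Because the state lives in static variables, the function is not reentrant and there can be only one such coroutine at a time.

```c
#include <assert.h>

/* Returns 0, 1, ..., limit-1 on successive calls, then -1 on every later call. */
int counter_coroutine(int limit)
{
    static int state = 0;   /* where to resume on the next call */
    static int i;           /* static so the value survives each "return" */

    assert(limit >= 0);
    switch (state) {
    case 0:                 /* first call: start from the top */
        for (i = 0; i < limit; i++) {
            state = 1;
            return i;       /* "return" ... */
    case 1:;                /* ... and "continue" here on the next call */
        }
    }
    state = 2;              /* finished: no case matches from now on */
    return -1;
}
```

Calling counter_coroutine(3) repeatedly yields 0, 1, 2 and then -1: each call jumps straight back into the middle of the for loop instead of starting over.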

QEMU’s implementation

Commit 00dccaf1f848 introduced coroutines. Here’s the commit message:

coroutine: introduce coroutines

Asynchronous code is becoming very complex.  At the same time
synchronous code is growing because it is convenient to write.
Sometimes duplicate code paths are even added, one synchronous and the
other asynchronous.  This patch introduces coroutines which allow code
that looks synchronous but is asynchronous under the covers.

A coroutine has its own stack and is therefore able to preserve state
across blocking operations, which traditionally require callback
functions and manual marshalling of parameters.

Creating and starting a coroutine is easy:

  coroutine = qemu_coroutine_create(my_coroutine);
  qemu_coroutine_enter(coroutine, my_data);

The coroutine then executes until it returns or yields:

  void coroutine_fn my_coroutine(void *opaque) {
      MyData *my_data = opaque;

      /* do some work */

      qemu_coroutine_yield();

      /* do some more work */
  }

Yielding switches control back to the caller of qemu_coroutine_enter().
This is typically used to switch back to the main thread's event loop
after issuing an asynchronous I/O request.  The request callback will
then invoke qemu_coroutine_enter() once more to switch back to the
coroutine.

Note that if coroutines are used only from threads which hold the global
mutex they will never execute concurrently.  This makes programming with
coroutines easier than with threads.  Race conditions cannot occur since
only one coroutine may be active at any time.  Other coroutines can only
run across yield.

This coroutines implementation is based on the gtk-vnc implementation
written by Anthony Liguori <anthony@codemonkey.ws> but it has been
significantly rewritten by Kevin Wolf <kwolf@redhat.com> to use
setjmp()/longjmp() instead of the more expensive swapcontext() and by
Paolo Bonzini <pbonzini@redhat.com> for Windows Fibers support.

On Debian Linux 6, makecontext is defined, so coroutine-ucontext.c is used to provide the coroutine implementation.
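To show what the ucontext family provides, here is a minimal sketch of enter and yield built only on getcontext, makecontext and swapcontext. It is not QEMU's implementation: according to the commit message above, QEMU switches with setjmp()/longjmp() because swapcontext() is more expensive, while this sketch uses swapcontext() for every switch to keep things short. The names sketch_create, sketch_enter, sketch_yield, run_demo and the trace array are hypothetical, and a Linux/glibc system is assumed.

```c
#include <assert.h>
#include <ucontext.h>

static ucontext_t caller_ctx;            /* context of whoever entered the coroutine */
static ucontext_t co_ctx;                /* context of the coroutine itself */
static char co_stack[64 * 1024];         /* the coroutine's own stack */

static int trace[8];                     /* records the order of execution */
static int trace_len;

static void record(int step)
{
    assert(trace_len < 8);
    trace[trace_len++] = step;
}

/* Switch from the coroutine back to the caller of sketch_enter(). */
static void sketch_yield(void) { swapcontext(&co_ctx, &caller_ctx); }

/* Switch from the caller into the coroutine (start it or resume it). */
static void sketch_enter(void) { swapcontext(&caller_ctx, &co_ctx); }

static void my_coroutine(void)
{
    record(1);          /* do some work */
    sketch_yield();
    record(3);          /* do some more work */
    /* Returning here resumes uc_link, which is caller_ctx. */
}

/* Prepare the coroutine context on its own stack; returns 0 on success. */
static int sketch_create(void)
{
    if (getcontext(&co_ctx) != 0) {
        return -1;
    }
    co_ctx.uc_stack.ss_sp = co_stack;
    co_ctx.uc_stack.ss_size = sizeof(co_stack);
    co_ctx.uc_link = &caller_ctx;
    makecontext(&co_ctx, my_coroutine, 0);
    return 0;
}

/* Enter, let the coroutine yield, do caller work, then resume it.
 * Returns the number of recorded steps, or -1 if setup failed. */
static int run_demo(void)
{
    if (sketch_create() != 0) {
        return -1;
    }
    sketch_enter();     /* runs until the yield */
    record(2);          /* the caller runs while the coroutine is suspended */
    sketch_enter();     /* resumes after the yield, runs to the end */
    return trace_len;
}
```

After run_demo() the trace reads 1, 2, 3: the coroutine keeps its place across the yield because it has its own stack, which is exactly the property the commit message relies on.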

The coroutine implementation in QEMU is quite complicated. Here’s a graph showing the relationship between the related functions.

Figure: QEMU's coroutine implementation

TCG related

A useful debugging technique for passing state from translated code to the outside is to store the information in a global variable. This can be helpful for debugging and for quick, ugly hacks.

TODO

IO port emulation

All I/O-port-related devices used for PC emulation:

Format: device, port address(es), comment (source file)

vmport 0x5658 (hw/vmport.c)
firmware 0x511 (hw/fw_cfg.c)
i8259 0x20,0x4d0,0xa0,0x4d1 Programmable Interrupt Controller (hw/i8259.c)
i440FX/PIIX3 PCI Bridge 0xcf8, 0xcfc (hw/piix_pci.c)
Cirrus VGA emulator 0x3c0, 0x3b4, 0x3d4, 0x3ba, 0x3da (hw/cirrus_vga.c)
RTC emulation 0x70 (hw/mc146818rtc.c)
i8254 interval timer 0x40 (hw/i8254.c)
PC speaker 0x61 (hw/pcspk.c)
UART 0x3f8 (hw/serial.c)
Parallel PORT 0x378 (hw/parallel.c)
PC keyboard 0x60, 0x64 (hw/pckbd.c)
Port 92 0x92 (hw/pc.c)
DMA 0x0~0xf,0x81~0x83,0x87,0x89,0x8a,0x8b,0x8f,0xc0~0xce/2,0xd0,0xd2~0xde/2 (hw/dma.c)
Floppy disk 0x3f1,0x3f7 (hw/fdc.c)
IDE 0x1f0,0x170,0x3f6,0x376 (hw/ide/core.c)
ACPI 0xb2,0xb100,0xafe0,0xae00,0xae08 (hw/acpi_piix4.c)
IDE PCI PIIX 3/4 0xc000,0xc004,0xc008,0xc00c (hw/ide/piix.c dmdma_map)
Unknown 0xb000 (while executing TC, registers ioport_readx)
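A table like the one above suggests a dispatch structure in which each device registers a read handler for its port range and a guest port read is routed to whichever handler owns the address. The sketch below shows that idea in its simplest form. Everything in it is an assumption for illustration: sketch_register_ioport_read, sketch_inb, ToyDevice and toy_read are hypothetical names, the signatures are simplified and do not match QEMU's, and returning 0xff for an unassigned port is a choice made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_IOPORTS 0x10000   /* x86 has a 64K I/O port space */

typedef uint32_t (*PortReadFn)(void *opaque, uint32_t addr);

/* One handler slot and one opaque device pointer per port. */
static PortReadFn read_table[MAX_IOPORTS];
static void *opaque_table[MAX_IOPORTS];

/* Claim ports [start, start + length) for a device; returns 0 on success. */
int sketch_register_ioport_read(uint32_t start, uint32_t length,
                                PortReadFn fn, void *opaque)
{
    if (start >= MAX_IOPORTS || length > MAX_IOPORTS - start) {
        return -1;
    }
    for (uint32_t port = start; port < start + length; port++) {
        read_table[port] = fn;
        opaque_table[port] = opaque;
    }
    return 0;
}

/* Emulate a one-byte port read by dispatching to the owning device. */
uint32_t sketch_inb(uint32_t addr)
{
    assert(addr < MAX_IOPORTS);
    if (read_table[addr] == NULL) {
        return 0xff;          /* nothing registered at this port */
    }
    return read_table[addr](opaque_table[addr], addr) & 0xff;
}

/* A toy two-register device, placed at 0x70 purely as an example address. */
typedef struct ToyDevice {
    uint32_t base;
    uint8_t regs[2];
} ToyDevice;

static uint32_t toy_read(void *opaque, uint32_t addr)
{
    ToyDevice *dev = opaque;
    return dev->regs[addr - dev->base];
}

static ToyDevice toy = { 0x70, { 0x12, 0x34 } };
```

After sketch_register_ioport_read(toy.base, 2, toy_read, &toy), reads from 0x70 and 0x71 reach the toy device, while a read from 0x72 falls through to the default value.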