KVM I/O slowness on RHEL 6

This article experimentally tests how different cache policies and preallocation policies affect KVM virtual machine performance. It finds that the RAW format performs best with no host-side caching, while QCOW2 performance depends on both the cache policy and the preallocation policy. With the VirtIO controller, the write-through scenario is affected the most. The final recommendation for KVM with QCOW2 is to avoid write-through caching and to always enable metadata preallocation.

Reposted from: http://www.ilsistemista.net/index.php/virtualization/11-kvm-io-slowness-on-rhel-6.html?limitstart=0

Over one year has passed since my last virtual machine hypervisor comparison, so last week I was preparing an article with a face-to-face comparison between RHEL 6 KVM and Oracle VirtualBox 4.0. I spent several days creating some nice, automated scripts to evaluate these two products from different points of view, and I was quite confident that the benchmark session would be completed without too much trouble. So I installed Red Hat Enterprise Linux 6 (license courtesy of Red Hat Inc. - thank you guys!) on my workstation and began installing the virtual machine images.

However, the unexpected happened: under KVM, a Windows Server 2008 R2 Foundation installation took almost 3 hours, when it should normally complete in about 30-45 minutes. Similarly, the base-system installation that precedes the "real" Debian 6.0 installation took over 5 minutes, when normally it completes in about 1 minute. In short: the KVM virtual machines were affected by an awfully slow disk I/O subsystem. In previous tests I had seen that the KVM I/O subsystem was a bit slower, but not by this much; clearly, something was impairing my KVM I/O speed. I tried different combinations of virtualized disk controllers (IDE or VirtIO) and cache settings, but without success. I also changed my physical disk filesystem to EXT3, to rule out any possible, hypothetical EXT4 speed regression, but again with no results: the slow KVM I/O problem remained.

I needed a solution – and a real one: with such awfully slow I/O, the KVM guests were virtually unusable. After some wasted hours, I decided to run some targeted, systematic tests covering VM image formats, disk controllers, cache settings and preallocation policy. Now that I have found the solution and my KVM guests run at full speed, I am very happy – and I would like to share my results with you.

Testbed and methods

First, let me describe the workstation used in this round of tests; system specifications are:

  • CPU: Core-i7 860 (quad cores, eight threads) @ 2.8 GHz and with 8 MB L3 cache

  • RAM: 8 GB DDR3 (4x 2 GB) @ 1333 MHz

  • DISKS: 4x WD Green 1 TB in software RAID 10 configuration

  • OS: Red Hat Enterprise Linux 6 64 bit

The operating system was installed with the "basic server" profile and then I selectively installed the other software required (libvirtd, qemu, etc.). Key system software versions are:

  • kernel version 2.6.32-71.18.1.el6.x86_64

  • qemu-kvm version 0.12.1.2-2.113.el6_0.6.x86_64

  • libvirt version 0.8.1-27.el.x86_64

  • virt-manager version 0.8.4-8.el6.noarch

As stated before, initially all host-system partitions were formatted in EXT4, but to avoid any possible problem related to the new filesystem, I changed the VM-storing partition to EXT3.

To measure guest I/O speed, I timed the Debian 6.0 (x86_64 version) basic-system installation. This is the step that, during a Debian installation, immediately follows the partition creation and formatting phase.

Let me thank Red Hat Inc. again – and especially Justin Clift – for giving me a free RHEL 6 license.

OK – I know that you want to know why on earth KVM I/O was so slow. However, first you need to understand something about caching and preallocation policies.


On caching and preallocation

Note: on this page I try to condense some hard-to-explain concepts into very little space, so I had to make some approximations. I ask the expert reader to forgive me for the over-simplification.

Normally, a virtual guest system uses a host-side file to store its data: this file represents a virtual disk, which the guest uses as a normal, physical disk. However, from the host's point of view this virtual disk is a normal data file, and it may be subject to caching and preallocation.

In this context, caching is the process of keeping some disk-related data in physical RAM. When we use that cache to store in RAM only data previously read from the disk, we speak of a read cache, or write-through cache. When we also store in RAM data that will later be flushed to disk, we speak of a write cache, or write-back cache. A write-back cache, by holding write requests in fast RAM, has higher performance; however, it is also more prone to data loss than a write-through one, as the latter only caches read requests and immediately writes any data to disk.

As disk I/O is a very important parameter, Linux and Windows operating systems generally use a write-back policy with periodic flushes to the physical disk. However, when using a hypervisor to virtualize a guest system, data can effectively be cached twice (once in host memory and once in the guest's memory), so it is often better to disable host-side caching on the virtual disk file and let the guest system manage its own caching. Moreover, a host-side write-back policy on the virtual disk file significantly increases the risk of data loss in case of a guest crash.

KVM lets you choose one of three cache policies: no caching, write-through (read-only cache) and write-back (read and write cache). It also has a "default" setting that is effectively an alias for write-through. As you will see, picking the right caching scheme is crucial for fast guest I/O.
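As an illustration – a minimal sketch, not taken from the original article – these policies correspond to the cache= suboption of QEMU's -drive flag, which libvirt sets for you behind the scenes (the image path below is a placeholder):

  # Boot a guest with host-side caching disabled on its virtual disk
  # (cache= accepts none, writethrough or writeback on qemu-kvm 0.12)
  qemu-kvm -m 1024 -drive file=/var/lib/libvirt/images/guest.img,if=virtio,cache=none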

Now, some words about preallocation: this is the process of preparing the virtual disk file in advance to store the data written by the guest system. Generally, preallocating a file means filling it with zeros, so that the host system reserves in advance all the disk space assigned to the guest. This way, when the guest writes to the virtual disk, it never has to wait for the host system to reserve the required space. Sometimes preallocation does not fill the target file with zeros, but only prepares some of its internal data structures: in this case, we talk about metadata preallocation. The RAW disk format can use full preallocation, while QCOW2 currently uses metadata preallocation (there are some patches that force full preallocation, but they are experimental).
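For example – a minimal sketch, with a placeholder path and size – a RAW image can be fully preallocated simply by writing zeros to it with dd:

  # Reserve all 10 GB up front by filling the raw image file with zeros
  dd if=/dev/zero of=/var/lib/libvirt/images/guest.raw bs=1M count=10240

The equivalent metadata preallocation for QCOW2 is done with qemu-img, as shown later in this article.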

Why speak about caching and preallocation? Because the super-slow KVM I/O speed really boils down to these two parameters, as we are going to see.

RAW image format performance

Let's begin with some tests on the most basic disk image format – the RAW format. RAW images are very fast, but they miss a critical feature: the possibility of taking real, fast snapshots of the virtual disk. So they can be used only in situations where you do not need real snapshot support (or where you have snapshotting capability at the filesystem level – but this is another story).

How does the RAW format perform, and how does caching affect the results?

RAW image performance

As you can see, as long as you stay away from the write-through cache, RAW images are very fast. Note that with RAW images under the no-caching or write-back policies, preallocation has only a small influence.

What about the much more feature-rich QCOW2 format?

QCOW2 image format performance

The QCOW2 format is the default QEMU/KVM image format. It has some very interesting features, such as compression and encryption, but above all it enables real, file-level snapshots.

But how does it perform?

QCOW2 image performance

Mmm... without metadata preallocation, it performs very badly. Enable metadata preallocation, stay away from the write-through cache, and it performs very well.

To better compare it to the RAW format, I made a chart with the no-caching RAW and QCOW2 results:

RAW vs QCOW2 performance

While without metadata preallocation the QCOW2 format is 5X slower than RAW, with metadata preallocation enabled the two are practically tied. This proves that while the RAW format is primarily influenced by the caching setting, QCOW2 depends heavily on both the preallocation and caching policies.

The influence of the virtualized I/O controller

Another important thing to check is the influence of the virtualized I/O controller presented to the guest. KVM lets you use not only the default IDE virtual controller, but also a new, paravirtualized I/O controller called VirtIO. This virtualized controller promises better speed and lower CPU usage.

How does it affect the results?

IDE vs VirtIO controller performance

As you can see, the write-through scenario is the most affected one, while under the no-caching and write-back policies the controller choice has a smaller effect.

This does not mean that VirtIO is an unimportant project: the scope of this test was only to make sure that it does not introduce any I/O slowness. In a following article I will analyze this very promising driver in a much more complete manner.

I/O slowness cause: bad default settings

So, we can state that to obtain good I/O throughput from the QCOW2 format, two conditions must be met:

  • don't use a write-through cache

  • always use metadata preallocation.

However, using the virt-manager GUI that is normally used to create virtual disks and guest systems on Red Hat and Fedora, you cannot enable metadata preallocation on QCOW2 files. While the storage volume creation interface lets you specify whether you want to preallocate the virtual disk, this function actually only works with RAW files; if you use a QCOW2 file it does nothing.

To create a file with metadata preallocation, you must open a terminal and use the "qemu-img create" command. For example, to create a ~10 GB QCOW2 image with metadata preallocation, issue the command "qemu-img create -f qcow2 -o size=10000000000,preallocation=metadata file.img".
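You can then inspect the new image with "qemu-img info" as a quick sanity check (the exact output layout depends on your qemu version):

  # Confirm the format and virtual size of the freshly created image
  qemu-img info file.img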

Moreover, the default caching scheme is write-through. While the guest creation wizard generally disables the host-side cache correctly, if you later add a virtual disk to the guest, the disk is often added with the "default" caching policy – which is write-through.

So, if you are using Red Hat Enterprise Linux or Fedora Linux as the host operating system for your virtualization server and you plan to use the QCOW2 format, remember to manually create preallocated virtual disk files and to use the "none" cache policy (you can also use the "write-back" policy, but be warned that your guests will be more prone to data loss).
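To audit the guests you already have – a minimal sketch, where "myguest" is a placeholder domain name – you can grep the cache attribute out of the libvirt XML:

  # Print the driver line of every virtual disk; no cache attribute means the
  # guest is running with the "default" (write-through) policy
  virsh dumpxml myguest | grep "<driver"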

Conclusions

First of all, don't get me wrong: I'm very excited about the progress of KVM and libvirt. We now have not only a very robust hypervisor, but also some critical paravirtualized drivers, a good graphical interface and excellent host/guest remote management capabilities. I would like to publicly thank all the talented people involved in the realization of these great and important projects – thank you guys!

However, it's a shame that the current virt-manager GUI does not permit metadata preallocation on the QCOW2 image format, as this format is much more feature-rich than RAW. Moreover, I would like to see not only the guest creation wizard, but all the guest editing windows, default to the no-cache policy for virtual disks; but this is a secondary problem: it is not so difficult to manually change a parameter...

The first problem – no metadata preallocation on QCOW2 – is far more serious, as it cannot be overcome without resorting to the command line. This problem should really be corrected as soon as possible. In the meantime, you can use the workaround described above, and remember to always check your virtual disk caching policy – don't use the "default" or "write-through" settings.

I hope that this article helps you get the most out of the very good KVM, libvirt and related projects.


