Understanding Disk I/O


This is the first article I've ever translated, so please forgive the rough spots :-). It comes from Scout's official blog (link), and I found it excellent the first time I read it. What does it cover?

  • What is disk I/O?
  • How can you tell whether you've hit a disk I/O bottleneck?
  • How do you fix a disk I/O bottleneck?
  • The article also recommends some of Scout's monitoring plugins

About Scout

Scout is a server and application monitoring service that focuses on ease of installation and configuration. It provides alerting out of the box, helping administrators understand application behavior under different loads more quickly, and it also lets developers write plugins that extend Scout.

If you're old enough to remember floppy drives, you already know the symptoms of a disk I/O bottleneck. Loading the next scene in Oregon Trail, for example, you'd hear the drive grinding away as it read data from disk. The CPU sat idle the whole time, twiddling its thumbs while it waited for data. If the floppy drive had been faster, you'd have been running the rapids of the Columbia River by now. (A fairly literal translation; I've never used a floppy disk myself, so my feel for this passage is limited :-) )

When the disk isn't sitting right on your desk, an I/O bottleneck is much harder to notice. For web applications, I focus on four key disk I/O questions:

  • Do you have an I/O bottleneck?
  • What impacts I/O performance?
  • What's the best path to fixing an I/O bottleneck?
  • How do you monitor disk I/O?
A banana slug vs. an F-18 Hornet

Disk I/O covers the input/output operations on a physical disk. If you read data from a file on disk, the processor has to wait for the file to be read (the same goes for writes). The killer with disk access? Access time: the time the computer needs to process a data request from the processor and then retrieve the data from the storage device. Since hard disks are mechanical, you have to wait for the platter to rotate to the required sector (where the data is stored). Disk latency is around 13 ms, depending on the quality and rotational speed of the drive, while memory latency is around 83 ns. How big is the difference? If memory were an F-18 Hornet with a top speed of 1,190 mph, disk access would be a banana slug crawling along at 0.007 mph.

That's why caching data in memory matters so much for performance: the difference in latency between RAM and a hard drive is enormous.
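To make the gap concrete, here is the arithmetic behind the analogy as a quick Python sketch (using the 13 ms and 83 ns figures quoted above; the speed values are just the ones from the analogy):

```python
disk_latency = 13e-3  # ~13 ms per random disk access
ram_latency = 83e-9   # ~83 ns per memory access

ratio = disk_latency / ram_latency
print(f"RAM is roughly {ratio:,.0f}x faster than disk")  # ~156,627x

f18_mph = 1190  # the F-18's top speed from the analogy
print(f"Equivalent slug speed: {f18_mph / ratio:.4f} mph")  # ~0.0076 mph
```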

Do you have an I/O bottleneck?

Your I/O wait measurement is the most direct indicator of an I/O bottleneck. I/O wait is the percentage of time the processor spends waiting for data from the disk.

For example, say it takes 1 second to read 10,000 rows from MySQL and perform some processing on them. (Figure: breakdown of the MySQL read.)

While the MySQL rows are being retrieved, the disk is being accessed and the processor sits idle, waiting on the read. In the example above, disk access took 700 ms of that second, so I/O wait is 70%.

You can check your iowait with the top command, which ships with every flavor of Linux. (Figure: top output highlighting %iowait.)

If your I/O wait percentage is greater than 1/(# of CPU cores), your CPUs are spending a significant amount of time waiting on the disk subsystem.

In the output above, I/O wait is 12.1%. The server has 8 cores (run cat /proc/cpuinfo to check), so the threshold is 1/8 = 0.125, and 12.1% is very close to it. If I/O wait consistently hovers around this 1/(# of cores) threshold, disk access is slowing your application down.
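If you want the same number without top, here's a minimal Python sketch that computes it from /proc/stat (Linux only; the fifth field of the aggregate cpu line counts iowait jiffies):

```python
import os
import time

def cpu_jiffies():
    # First line of /proc/stat: "cpu user nice system idle iowait irq softirq steal ..."
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

before = cpu_jiffies()
time.sleep(5)  # sample over a 5-second window
after = cpu_jiffies()

delta = [b - a for a, b in zip(before, after)]
iowait_pct = 100.0 * delta[4] / sum(delta)  # index 4 = iowait

threshold = 100.0 / os.cpu_count()  # the 1/(# of cores) rule of thumb
print(f"I/O wait: {iowait_pct:.1f}% (threshold ~{threshold:.1f}%)")
```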

What impacts I/O performance?

For workloads dominated by random disk access (databases, mail servers, file servers, and so on), the metric to focus on is IOPS: the number of I/O operations performed per second.

Four primary factors impact IOPS:

  • Multidisk Arrays(多磁盘阵列) – more disks in the array means more IOPS. If a single disk delivers 150 IOPS, two disks can deliver 300.
  • Average IOPS per-drive(每个驱动器的平均IOPS) – the more IOPS each individual drive can handle, the greater the total IOPS capacity. This is largely determined by the drive's rotational speed.
  • RAID Factor(RAID 因素) – your application likely stores data on a RAID configuration, meaning multiple disks are used for reliability and redundancy. Some RAID configurations carry a significant penalty on write operations: each RAID 6 write touches 6 disks, while a RAID 1 or RAID 10 write touches only 2. The fewer disk operations per write, the higher the IOPS of the RAID set. The article linked in the original post analyzes RAID and IOPS performance in depth.
  • Read and Write Workload(读写工作负载) – if you have a high proportion of writes and a RAID configuration that performs many operations per write request (such as RAID 5 or RAID 6), your IOPS will be significantly lower.
Calculating your maximum IOPS

A very concrete way to understand your machine's maximum IOPS is to calculate its theoretical IOPS and compare that with the IOPS you actually measure; if the two are close, your I/O may be headed for trouble. Every term in the formula can be read off your hardware specs except the read/write workload, which you need software to measure: you can use a command-line tool like sar, or install Scout's monitoring plugins, and either will give you your read/write load.
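The original post presents the formula as an image that did not survive here. A common reconstruction of it, based on the RAID factors listed above (the function name and example numbers are mine, not the original author's):

```python
def theoretical_iops(num_disks, iops_per_disk, read_frac, write_frac, raid_penalty):
    """Estimate the functional IOPS of a disk array.

    raid_penalty = physical disk operations per logical write
    (roughly 2 for RAID 1/10, 4 for RAID 5, 6 for RAID 6).
    read_frac + write_frac should sum to 1.0.
    """
    raw_iops = num_disks * iops_per_disk
    return raw_iops / (read_frac + raid_penalty * write_frac)

# Example: eight 150-IOPS drives in RAID 10 with a 60/40 read/write mix
print(f"{theoretical_iops(8, 150, 0.6, 0.4, 2):.0f} IOPS")  # ~857
```

Note how the write penalty dominates: the same array with a pure read workload would deliver the full 1,200 raw IOPS.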

Once you've calculated your theoretical IOPS, compare it against the tps row of sar's output; tps is the total number of I/O transfers per second to the physical device, i.e. your actual IOPS. If your measured tps is approaching the theoretical IOPS, your machine is nearing its disk's limit. The figure below shows sample iostat output. (Figure: iostat output analysis.)
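You can also sample your actual IOPS without sar by reading /proc/diskstats directly; a minimal sketch (the device name sda is a placeholder for your own disk):

```python
import time

def completed_ops(device):
    # /proc/diskstats columns: major minor name reads_completed ... writes_completed ...
    with open("/proc/diskstats") as f:
        for line in f:
            parts = line.split()
            if parts[2] == device:
                return int(parts[3]) + int(parts[7])  # reads + writes completed
    raise ValueError(f"device {device!r} not found")

dev = "sda"  # placeholder: substitute your device
interval = 5
before = completed_ops(dev)
time.sleep(interval)
after = completed_ops(dev)
print(f"{dev}: ~{(after - before) / interval:.0f} IOPS")
```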

What’s the best path to fixing an I/O bottleneck?

Even if the banana slug followed every piece of advice in "The 4-Hour Body", it would never match the speed of an F-18 Hornet. Likewise, you can tune your disk hardware for better performance, but its mechanical complexity means it will never come close to the speed of RAM.

If you're facing a disk I/O bottleneck right now, changing your hardware may not be the fastest remedy: hardware changes can involve extensive testing, data migration, and coordination between application developers and system administrators.

When we see I/O bottlenecks at Blue Box Group, we first try tuning the company's most I/O-hungry services to cache as much data as possible in RAM. For example, we typically configure our database servers with as much RAM as possible (64 GB or more) and then let MySQL keep as much as it can in memory.

How do you monitor disk I/O?

Monitoring disk performance on data-heavy servers is essential: it lets you see how your applications affect disk performance over time. The output of top alone doesn't give you much context. Is what you're seeing normal, or just a brief spike? How would you get the I/O wait figures from two months ago?
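Tools like sar (from the sysstat package) or Scout's plugins keep that history for you. If you wanted to roll your own, a minimal sketch that appends a timestamped I/O wait sample to a CSV every minute might look like this (reusing the /proc/stat reading from earlier; file name and interval are arbitrary choices):

```python
import time
from datetime import datetime

def cpu_jiffies():
    # Aggregate jiffy counters from the first line of /proc/stat
    with open("/proc/stat") as f:
        return [int(x) for x in f.readline().split()[1:]]

prev = cpu_jiffies()
with open("iowait_history.csv", "a") as log:
    while True:
        time.sleep(60)
        cur = cpu_jiffies()
        delta = [c - p for p, c in zip(prev, cur)]
        pct = 100.0 * delta[4] / sum(delta)  # index 4 = iowait
        log.write(f"{datetime.now().isoformat()},{pct:.1f}\n")
        log.flush()
        prev = cur
```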

Scout has two key plugins for measuring your disk performance:

  • The CPU usage plugin monitors key CPU metrics, which include I/O Wait %
  • The Device Input/Output plugin provides additional I/O metrics for a given device, including the I/O Wait time in milliseconds and read/write throughput
Three takeaways
  • Disk access is slooowww(磁盘访问毕竟是慢的) – disk access speed will never come close to RAM access speed.
  • Optimize your apps first(首先对你的app进行调优) – tuning disk hardware often isn't the fastest fix. Let the I/O-heavy services in your app pull as much data as possible from a RAM cache.
  • Measure(测量或者检测) – changes to your app code can have a big impact on disk I/O. Track your key I/O metrics over time.