ANALYZING I/O PERFORMANCE IN LINUX

This article examines the key concepts behind monitoring and analyzing disk I/O performance on Linux, including the definition of IOPS, how to calculate them, choosing a RAID level, SATA versus SAS drives, and how to tune storage for a given workload. Through a case study, the author shows how to collect data with tools such as sysstat and use it to identify bottlenecks and potential improvements in the storage subsystem.

Reposted from: http://www.cmdln.org/2010/04/22/analyzing-io-performance-in-linux/



Monitoring and analyzing performance is an important task for any sysadmin. Disk I/O bottlenecks can bring applications to a crawl. What are IOPS? Should I use SATA, SAS, or FC? How many spindles do I need? What RAID level should I use? Is my system read-heavy or write-heavy? These are common questions for anyone embarking on a disk I/O analysis quest. Obligatory disclaimer: I do not consider myself an expert in storage or anything for that matter. This is just how I have done I/O analysis in the past; I welcome additions and corrections. It is also important to note that this analysis is geared toward random operations rather than sequential read/write workloads.

Let’s start at the very beginning … a very good place to start. Hey, it worked for Julie Andrews. So what are IOPS? They are input/output (I/O) operations measured per second. It’s good to note that IOPS are also referred to as transfers per second (tps). IOPS matter for applications that require frequent access to disk: databases, version control systems, and mail stores all come to mind.

Great, so now that I know what IOPS are, how do I calculate them? IOPS are a function of rotational speed (aka spindle speed), rotational latency, and seek time. The equation is pretty simple: IOPS = 1/(average seek time + average rotational latency), with both times in seconds. Scott Lowe has a good example on his techrepublic.com blog.

Sample drive:

  • Average seek time: 3 ms (0.003 s)
  • Average rotational latency: 4.5 ms (0.0045 s)
  • Calculated IOPS for this disk: 1/(0.003 + 0.0045) = about 133 IOPS
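That arithmetic is trivial to check; here is a minimal sketch (disk_iops is my own helper name, not from the article):

```python
def disk_iops(avg_seek_s: float, avg_latency_s: float) -> float:
    """Theoretical IOPS for a single drive: one random I/O takes roughly
    (average seek time + average rotational latency) seconds, so IOPS
    is the reciprocal of that sum."""
    return 1.0 / (avg_seek_s + avg_latency_s)

# The sample drive above: 3 ms average seek, 4.5 ms rotational latency.
print(round(disk_iops(0.003, 0.0045)))  # prints 133
```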

It’s great to know how to calculate a disk’s IOPS, but for the most part you can get by with commonly accepted averages. Sources vary, of course, but here is what I have seen:

Rotational Speed (rpm)   IOPS
5400                     50–80
7200                     75–100
10k                      125–150
15k                      175–210

Should I use SATA, SAS, or FC? That’s a loaded question. As with most things, the answer is “it depends”. I don’t want to get into the SATA vs. SAS debate; you can do your own research and make your own decisions based on your needs, but I will point out a few things.

These factors are key considerations when choosing what kind of drives to use.

What RAID level should I use? You know what IOPS are and how to calculate them, and you have determined what kind of drives to use; the next logical question is commonly RAID 5 vs. RAID 10. There is a difference in reliability, especially as the number of drives in your RAID set increases, but that is outside the scope of this post.

RAID Level   Write Operations   Read Operations   Notes
0            1                  1                 Write/Read: high throughput, low CPU utilization, no redundancy.
1            2                  1                 Write: only as fast as a single drive. Read: two read schemes are available: read data from both drives, or read from whichever drive returns it first. One gives higher throughput, the other faster seek times.
5            4                  1                 Write: read-modify-write requires two reads and two writes per write request; lower throughput and higher CPU use if the HBA doesn’t have a dedicated I/O processor. Read: high throughput and low CPU utilization normally; in a failed state, performance falls dramatically due to parity calculation and any rebuild operations in progress.
6            5                  1                 Write: read-modify-write requires three reads and three writes per write request; avoid a software implementation if a hardware one is available. Read: high throughput and low CPU utilization normally; in a failed state, performance falls dramatically due to parity calculation and any rebuild operations in progress.

As you can see in the table above, writes are where you take your performance hit. Now that the penalty, or RAID factor, is known for the different RAID levels, we can get a good estimate of the theoretical maximum IOPS for a RAID set (excluding caching, of course). Take the product of the number of disks and the IOPS per disk, and divide it by the sum of the %read workload and the product of the RAID factor (see the write operations column) and the %write workload.

Here is the equation:

IOPS = (d × dIOPS) / (%r + F × %w)

where:

d = number of disks
dIOPS = IOPS per disk
%r = read fraction of the workload
%w = write fraction of the workload
F = RAID factor (write operations column)
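The equation translates directly into a small helper. This is a sketch; the function name and the example numbers (eight 10k RPM disks in RAID 10 with a 70/30 read/write mix) are my own, not from the article:

```python
def raid_max_iops(d: int, d_iops: float, r_pct: float,
                  w_pct: float, f: float) -> float:
    """Theoretical maximum IOPS for a RAID set, caching excluded:
    (d * dIOPS) / (%r + F * %w), where F is the write-penalty factor
    from the write operations column (RAID 0: 1, RAID 1/10: 2, RAID 5: 4)."""
    return (d * d_iops) / (r_pct + f * w_pct)

# Eight 10k RPM disks at 125 IOPS each, RAID 10, 70% read / 30% write.
print(round(raid_max_iops(8, 125, 0.70, 0.30, 2)))  # prints 769
```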

Wait a second, where am I supposed to get %read and %write from?

You need to examine your workload. I usually turn to my favorite statistics collector, sysstat. Running sar -d -p will report activity for each block device and pretty-print the device name. I am assuming you already know which block device you want to analyze, but if you’re looking for the busiest device, just look at the tps column. The rd_sec/s and wr_sec/s columns display the number of sectors read from and written to the device. To get the percentage of reads, divide rd_sec/s by the sum of rd_sec/s and wr_sec/s.

The equations:

%r = rd_sec/s / (rd_sec/s + wr_sec/s)
%w = wr_sec/s / (rd_sec/s + wr_sec/s)

An example from my workstation:

Average for sdb rd_sec/s = 1150.80
Average for sdb wr_sec/s = 1166.53

As you can see, my workstation’s read/write workload is pretty balanced at 49.7% read and 50.3% write. Compare that to a CVS server (don’t get me started on how bad CVS is; it’s just something I have to deal with).
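The split can be reproduced from the sar averages; a minimal sketch using the workstation numbers above (rw_split is my own name):

```python
def rw_split(rd_sec_s: float, wr_sec_s: float) -> tuple[float, float]:
    """Read/write workload split from sar's average rd_sec/s and wr_sec/s."""
    total = rd_sec_s + wr_sec_s
    return rd_sec_s / total, wr_sec_s / total

# Workstation averages for sdb from the sar data above.
r, w = rw_split(1150.80, 1166.53)
print(f"{r:.1%} read, {w:.1%} write")  # prints 49.7% read, 50.3% write
```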

Average for sdb rd_sec/s = 27.78k
Average for sdb wr_sec/s = 2.07k

This server’s workload is extremely read-heavy. OK, time to analyze the performance.

In and of itself, a heavy read workload is not a problem. My problem is user complaints of slowness. I note (again from sysstat-collected metrics) that the tps, or average IOPS, on this device is about 574. Again, that’s not an issue in and of itself; we need to know what we can expect from the storage subsystem. This device happens to be SAN-based storage. The RAID set it’s on is made up of four 10k RPM FC drives in a RAID 10. Remember from the table above that IOPS for a 10k RPM drive are in the 125–150 range. We need to calculate the expected IOPS from that RAID set using the IOPS equation above, our measured read/write workloads, the number of disks, and the RAID level (10 and 1 are treated the same).

Using the high end of the scale for 10k RPM IOPS per drive results in a maximum theoretical IOPS of 561.79, which is pretty close to what I am observing (remember, cache is not taken into account). So based on these numbers it looks like my storage subsystem is saturated. I guess I’d better add some spindles. Unfortunately there is no historical data for this system, so I have no way of knowing how many tps I need to aim for.
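Plugging the measured numbers into the equation gives roughly the same figure; the small gap from 561.79 presumably comes down to how the workload percentages were rounded:

```python
# CVS server averages for sdb from the sar data above (sectors/s).
rd, wr = 27780.0, 2070.0
r_pct, w_pct = rd / (rd + wr), wr / (rd + wr)  # ~93.1% read, ~6.9% write

disks, iops_per_disk, raid_factor = 4, 150, 2  # RAID 10 treated like RAID 1
max_iops = (disks * iops_per_disk) / (r_pct + raid_factor * w_pct)
print(round(max_iops))  # prints 561
```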

Don’t get stuck where I am, having to guess how many spindles need to be added to reduce the pain; start recording your trends now! Even better, once you start collecting your statistics, set an alert for 65% or 70% utilization of theoretical maximum IOPS sustained for an extended period, with increasingly bothersome alerts going up from there. It’s never good to have to react to performance issues; it’s always better to be proactive. There was absolutely nothing wrong with the sizing of this example RAID set 2–4 years ago. Had it been under monitoring the entire time with proper thresholds set, a proper plan could have been made, and spindles could have been added before causing users any pain.
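A threshold check of that shape is simple to sketch; the function name and exact alert tiers here are my own, and any NMS can do this natively:

```python
def iops_alert_level(observed_tps: float, max_iops: float) -> str:
    """Map observed utilization of theoretical max IOPS to an alert level."""
    utilization = observed_tps / max_iops
    if utilization >= 0.90:
        return "critical"
    if utilization >= 0.70:
        return "warning"
    if utilization >= 0.65:
        return "notice"
    return "ok"

# The saturated CVS server above: 574 observed tps vs. 561.79 theoretical.
print(iops_alert_level(574, 561.79))  # prints critical
```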

If you want to use sysstat like I did, you might find the Nagios plugin I wrote, check_sar_perf, helpful. I use it with Zenoss, but it could be tied into any NMS that records performance data from a Nagios plugin.

Go forth, collect, analyze, and plan so your users aren’t calling you with issues.

19 COMMENTS
  • Jean-Francois wrote:

    The link for the check_sar_perf script seems completely unrelated.

    Other than that, this is a very good article!

  • Matt wrote:

    The link for check_sar_perf points to http://vmtoday.com/2010/04/storage-basics-part-vi-storage-workload-characterization/, which I don’t think was your intention.

  • Thanks for the comment, Jean-Francois. check_sar_perf is just an easy way to transport sar metrics for stuffing into an NMS.
    The only thing that makes it related would be the output from check_sar_perf disk sda etc. stuffed into your Cacti or Zenoss so you could see a trend.

  • lol oops, fixing it now

  • Kees wrote:

    Does anybody know if interleaved sectors are used on hard disks? In the day of floppies this could increase the data throughput considerably by choosing the right interleave.

  • Tormak wrote:

    When comparing SATA and SAS it’s important to remember that SATA is only half duplex. This is a good introduction to the disks, but I’d really like to see you follow it up (for the sake of all noobs) with a discussion about bandwidth over the bus and maybe even controllers and their limitations (including caching issues/options: write-back, write-through, cache mirroring, etc.).

  • Tormak wrote:

    Additionally, a discussion wrt tuning the VFS for specific workload performance would dovetail nicely. Maybe a future article?

  • Slappy wrote:

    Whoa, that is some high-level hardware analyzing. I hope to get to that level one day.

  • twogunmickey wrote:

    What about cache size? You failed to mention it in the article. I figure it doesn’t make that big of a difference, especially in long, constant transfers, but it has to help some? With new drives coming with even larger caches, 64 MB!

    Another question I’ve always had but have never heard anyone address: it seems to me that even though a higher-RPM drive might have better performance, a lower-RPM drive would have a longer life, especially if they were both models from the same line from the same company.

  • Vonskippy wrote:

    In your first table, you have 5200 rpm drives; it should be 5400 rpm.

  • It would be good to note that IOPS are also known as TPS, if it were anywhere close to *true*.

    Alas, it’s not. “Transactions per second”, as it’s generally used, refers to a very specific DBMS benchmark, promulgated by the Transaction Processing Performance Council: the TPC-C rating. (They have two others, but the -C is the one most commonly quoted.)

    Whether you’re that specific or not, you’re almost certainly still talking about SQL transactions, and each one of those is going to take a *lot* more than 1 IOP. Generally by 2 to 3 orders of magnitude, but 4 isn’t uncommon, and 5 or 6 isn’t unreasonable.

    And if your IOPs are taking seconds, my condolences. :-)

  • @Vonskippy: Thanks, I’ll fix it.

    @Baylink: Yes, I meant to say transfers per second; I’ll correct it. From man sar: “A transfer is an I/O request to a physical device. Multiple logical requests can be combined into a single I/O request to the device. A transfer is of indeterminate size.”

  • @Tormak: Thanks for the comments! You are right about noting SATA is only half duplex. I thought about mentioning it but didn’t want to get deep into that discussion. Since you bring it up, I’ll toss it in.
    As for tuning file systems, that does seem like it would be a good read. In fact, I noticed my example file system could use some tuning. If I can get some good data from doing that, I may write a file-system tuning follow-up.

  • My apologies for the tone; yesterday was kind of a cranky day.

    Nice bits in the rest of the piece.

    And I like your captcha.

  • @Baylink: hey, no problem :) everyone has cranky days. Thanks for the correction. As for the captcha, I agree; it was the least obnoxious one I could find. Props to the author of http://clickcha.com/.

  • phil wrote:

    Nice write-up. One point to make, though: there really is no such thing as an IOP, singular. IOPS means I/O Operations Per Second; the “OP” part is not short for OPeration, so leaving off the “S” is nonsensical. It’s like PPS in the network space (or hopefully, KPPS). The singular is just IO (for either “I/O Operation” or just I/O sans “/” for the lazy).

  • Roger Themocap wrote:

    The numbers in the equations for %r and %w are the same. The results are different.

  • @phil: Yeah, you’re right, it can be misleading I suppose. In my head I was just thinking Input Output OPeration :) kind of like saying your ATM PIN number (Personal Identification Number number). I should probably change it up in the post. Thanks for the input.

  • @Roger Themocap: Ah, you’re right. I screwed up when putting my equations into http://private.codecogs.com/components/equationeditor/equationeditor.php to generate the images. I’ll get that fixed. FYI, it’s the one for %w that’s wrong; it should be 0.503 = 1166.53/(1150.80+1166.53). Thanks for the correction :)
