A Beginner's Guide to MegaCli

MegaCli RAID Management Tutorial

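MegaCli is the official management tool that LSI provided for its MegaRAID RAID controllers. LSI has since been acquired and is now part of Broadcom, so to download MegaCli today you need to go to the Broadcom website, look under Legacy product support, and search for MegaRAID.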

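The current official tool is storcli, which covers all LSI and 3ware products. Personally I still find MegaCli more comfortable to use, and on the servers from several Chinese vendors that we run in production, MegaCli manages the RAID without trouble, so whether you switch makes little difference.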

View adapter information:

./MegaCli64 -AdpAllInfo -aALL

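The output is very long and much of it is hard to follow, but that is fine. For now a beginner only needs the first line, which says this machine has one adapter, number 0. Many MegaCli64 commands end with -a to select the adapter. I only have Adapter #0, so from here on I write -a0; -a0,1,2 and -aALL are also accepted.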

Adapter #0
==============================================================================
                    Versions
                ================
Product Name    : PERC H710 Adapter
Serial No       : 31P003R
FW Package Build: 21.1.0-0007
                    Mfg. Data
                ================
Mfg. Date       : 01/26/13
Rework Date     : 01/26/13
Revision No     : A00
Battery FRU     : N/A
...
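
As a rough illustration of the adapter selector described above, the sketch below pulls the adapter numbers out of -AdpAllInfo output and builds the trailing -a argument. The function names adapter_ids and adapter_selector are made up for this example, and the sample text is an abbreviated copy of the output shown above.

```python
import re


def adapter_ids(adp_all_info: str) -> list:
    """Return the adapter numbers found in lines such as 'Adapter #0'."""
    pattern = re.compile(r"^Adapter #(\d+)[ \t]*$", re.MULTILINE)
    return [int(m.group(1)) for m in pattern.finditer(adp_all_info)]


def adapter_selector(ids=None) -> str:
    """Build the trailing argument: -aALL, -a0 or -a0,1,2."""
    if not ids:
        return "-aALL"
    return "-a" + ",".join(str(i) for i in ids)


# Abbreviated copy of the -AdpAllInfo output shown above.
sample = """Adapter #0
Product Name    : PERC H710 Adapter
Serial No       : 31P003R
"""

ids = adapter_ids(sample)
print(ids)
print(adapter_selector(ids))
print(adapter_selector([0, 1, 2]))
```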

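View the adapter's detailed configuration. This machine has 12 disks: one is a single-disk RAID0 that holds the operating system, and the remaining 11 form a RAID5: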

./MegaCli64 -CfgDsply -aALL
==============================================================================
Adapter: 0
Product Name: PERC H710 Adapter
Memory: 512MB
BBU: Present
Serial No: 31P003R
==============================================================================
Number of DISK GROUPS: 2 #two disk groups
DISK GROUPS: 0 #disk group 0
Number of Spans: 1
SPAN: 0
Span Reference: 0x00
Number of PDs: 1
Number of VDs: 1
Number of dedicated Hotspares: 0
Virtual Disk Information:
Virtual Disk: 0 (Target Id: 0)
Name:
RAID Level: Primary-0, Secondary-0, RAID Level Qualifier-0 #RAID0
Size:2.728 TB
State: Optimal
Stripe Size: 64 KB
Number Of Drives:1
Span Depth:1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None
Physical Disk Information:
Physical Disk: 0
Enclosure Device ID: 32
Slot Number: 0
Device Id: 0
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 2.728 TB [0x15d50a3b0 Sectors]
Non Coerced Size: 2.728 TB [0x15d40a3b0 Sectors]
Coerced Size: 2.728 TB [0x15d400000 Sectors]
Firmware state: Online
SAS Address(0): 0x500056b37789abee
Connected Port Number: 0(path0) 
Inquiry Data:             (manually redacted) #the serial number appears here
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None 
Device Speed: 3.0Gb/s 
Link Speed: 3.0Gb/s 
Media Type: Hard Disk Device

DISK GROUPS: 1 #disk group 1
Number of Spans: 1
SPAN: 0
Span Reference: 0x01
Number of PDs: 11 #11 physical disks
Number of VDs: 1 #combined into 1 virtual disk
Number of dedicated Hotspares: 0
Virtual Disk Information:
Virtual Disk: 0 (Target Id: 1)
Name:
RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3 #RAID5
Size:27.285 TB
State: Optimal
Stripe Size: 64 KB
Number Of Drives:11
Span Depth:1
Default Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None
Physical Disk Information:
Physical Disk: 0 #the first physical disk
Enclosure Device ID: 32
Slot Number: 1
Device Id: 1
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 2.728 TB [0x15d50a3b0 Sectors]
Non Coerced Size: 2.728 TB [0x15d40a3b0 Sectors]
Coerced Size: 2.728 TB [0x15d400000 Sectors]
Firmware state: Online
SAS Address(0): 0x500056b37789abec
Connected Port Number: 0(path0) 
Inquiry Data:             (manually redacted) #the disk's serial number, same as on the disk label
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None 
Device Speed: 3.0Gb/s 
Link Speed: 3.0Gb/s 
Media Type: Hard Disk Device
...
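
The sizes in this output can be cross-checked with a little arithmetic. The sketch below assumes 512-byte sectors and assumes that the "TB" printed by MegaCli is a binary unit (2^40 bytes); neither is stated in the output, but under those assumptions the numbers line up with the listing above. The function names are hypothetical.

```python
SECTOR_BYTES = 512   # assumption: 512-byte logical sectors
TIB = 2 ** 40        # assumption: MegaCli's "TB" is a binary unit


def sectors_to_tib(sectors: int) -> float:
    """Convert a sector count to TiB."""
    return sectors * SECTOR_BYTES / TIB


def raid5_usable_tib(drive_count: int, coerced_sectors: int) -> float:
    """RAID5 spends one drive's worth of space on parity: (n - 1) drives usable."""
    if drive_count < 3:
        raise ValueError("RAID5 needs at least three drives")
    return (drive_count - 1) * sectors_to_tib(coerced_sectors)


coerced = 0x15d400000  # "Coerced Size" sector count from the output above

print(sectors_to_tib(coerced))        # 2.728515625
print(raid5_usable_tib(11, coerced))  # 27.28515625
```

The second figure agrees with the "Size:27.285 TB" reported for the 11-disk RAID5 virtual disk, and the first with the 2.728 TB shown per drive if MegaCli truncates rather than rounds.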

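View the information and state of every physical disk. The per-disk fields are the same as above, only without the adapter information.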

./MegaCli64 -PDList -a0
 
Adapter #0
Enclosure Device ID: 32
Slot Number: 0
Device Id: 0
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 2.728 TB [0x15d50a3b0 Sectors]
Non Coerced Size: 2.728 TB [0x15d40a3b0 Sectors]
Coerced Size: 2.728 TB [0x15d400000 Sectors]
Firmware state: Online
SAS Address(0): 0x500056b37789abee
Connected Port Number: 0(path0) 
Inquiry Data:             (manually redacted) #the disk's serial number, same as on the disk label
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None 
Device Speed: 3.0Gb/s 
Link Speed: 3.0Gb/s 
Media Type: Hard Disk Device
Enclosure Device ID: 32
Slot Number: 1
Device Id: 1
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 2.728 TB [0x15d50a3b0 Sectors]
Non Coerced Size: 2.728 TB [0x15d40a3b0 Sectors]
Coerced Size: 2.728 TB [0x15d400000 Sectors]
Firmware state: Online
SAS Address(0): 0x500056b37789abec
Connected Port Number: 0(path0) 
Inquiry Data:             (manually redacted) #the disk's serial number, same as on the disk label
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None 
Device Speed: 3.0Gb/s 
Link Speed: 3.0Gb/s 
Media Type: Hard Disk Device
...
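
Since -PDList prints a flat list of "key: value" lines, it is easy to turn into one record per disk. The sketch below is one possible way to do that, assuming (as in the output above) that every disk starts with an "Enclosure Device ID" line. parse_pdlist is a hypothetical helper, and the sample is an abbreviated copy of the output above.

```python
def parse_pdlist(text: str) -> list:
    """Split -PDList output into one dict per physical disk.

    A new record starts at every 'Enclosure Device ID' line, which is the
    first field printed for each disk in the output shown above.
    """
    disks = []
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # 'Adapter #0', blank lines and '...' carry no field
        key, value = key.strip(), value.strip()
        if key == "Enclosure Device ID":
            current = {}
            disks.append(current)
        if current is not None:
            current[key] = value
    return disks


# Abbreviated copy of the -PDList output shown above.
sample = """Adapter #0
Enclosure Device ID: 32
Slot Number: 0
Media Error Count: 0
Firmware state: Online
Enclosure Device ID: 32
Slot Number: 1
Media Error Count: 0
Firmware state: Online
"""

for disk in parse_pdlist(sample):
    print(disk["Slot Number"], disk["Firmware state"])
```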

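This output contains a lot of useful information:

1. Slot Number: the slot number, which should match the label on the outside of the chassis. If the machine holds many disks, just tell the on-site engineer that the disk in slot X is faulty and they can swap that disk directly.

2. Inquiry Data: the disk's serial number, identical to the one printed on the disk label. The label is only visible once the disk is pulled, so the serial number on the disk you pull from a slot should match that slot's Inquiry Data.

3. Firmware state: the disk's state. Online is the state we want to see. Others include Unconfigured, Offline and Failed, and most of them tell the same sad story: you will be working overtime to file a repair ticket or fix the disk.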

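4. Pay special attention to these fields: Media Error Count, Other Error Count, Predictive Failure Count and Last Predictive Failure Event Seq Number. Any of them may be non-zero. That means the disk still works but is no longer reliable, most likely because of bad sectors or bad tracks, and it should be replaced as soon as possible; keep using it and it will not be long before it fails completely. A claim often repeated online is that the larger the first three counts are, the worse the disk's condition. That is not actually the case, as the following two screenshots show.

[screenshot]
[screenshot]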
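
Items 1 and 3 in the list above combine naturally into a small check: find every disk whose Firmware state is not Online and report its slot, which is what the on-site engineer needs. The following is a minimal sketch along those lines. The function name is hypothetical, the Failed state in the sample is invented for illustration, and the sketch flags every non-Online state without judging which ones are harmless.

```python
def disks_needing_attention(pdlist_text: str) -> list:
    """Return (slot, firmware_state) for every disk that is not Online."""
    flagged = []
    slot = None
    for line in pdlist_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line
        key, value = key.strip(), value.strip()
        if key == "Slot Number":
            slot = int(value)
        elif key == "Firmware state" and not value.startswith("Online"):
            flagged.append((slot, value))
    return flagged


# Hypothetical sample: slot 1 is marked Failed purely for illustration.
sample = """Enclosure Device ID: 32
Slot Number: 0
Firmware state: Online
Enclosure Device ID: 32
Slot Number: 1
Firmware state: Failed
"""

for slot, state in disks_needing_attention(sample):
    print(f"slot {slot}: {state}")
```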

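A colleague raised this specific question with the server, RAID controller and disk vendors. Their reply:
According to earlier material, the absolute values of Media Error and Other Error do not directly reflect the state of a disk.
Based on discussions with the RAID controller and disk vendors, the recommendation is to monitor the Predictive Failure value: a non-zero value means the disk has a problem. In addition, a disk in the Failed state can be sent for repair right away.
Predictive Failure Count
Command: storcli /c0/eall/sall show all
Monitor the keyword Predictive Failure Count. The value must not be greater than 0; if there is a count, replace the corresponding disk.
Predictive Failure already covers media errors, and its scope is broader and more complete than media errors alone.
The disk's SMART subsystem already has a complete set of algorithms for assessing disk health.
The SMART algorithms take many runtime parameters of the disk into account; media errors are one of them.
SMART evaluates media errors by how fast they grow per unit of time.
When any item evaluated by the SMART subsystem reaches its threshold, the disk reports Sense Code: 01 5D 00 (FAILURE PREDICTION THRESHOLD EXCEEDED).
Any host that follows the SCSI standard (the OS SCSI subsystem, a SAS controller, a RAID controller and so on) can parse this sense code correctly.
In summary, because media errors are already covered by the disk's SMART subsystem, which reports a predictive failure according to the SCSI standard, for disks it is enough to monitor Predictive Failure behind the RAID controller, and the value must not be greater than 0.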
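
The vendor's rule (Predictive Failure Count must not be greater than 0) can be sketched as a monitoring check. The example below parses MegaCli -PDList output in the layout shown earlier in this article; the storcli output layout is not shown here, so it is not handled. The function name is hypothetical and the non-zero count in the sample is invented for illustration.

```python
def predictive_failures(pdlist_text: str) -> dict:
    """Return {slot: count} for disks whose Predictive Failure Count is > 0."""
    failing = {}
    slot = None
    for line in pdlist_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'key: value' line
        key, value = key.strip(), value.strip()
        if key == "Slot Number":
            slot = int(value)
        # Exact key match, so 'Last Predictive Failure Event Seq Number'
        # is not mistaken for the counter.
        elif key == "Predictive Failure Count" and int(value) > 0:
            failing[slot] = int(value)
    return failing


# Hypothetical sample: the count of 3 on slot 1 is invented for illustration.
sample = """Slot Number: 0
Media Error Count: 12
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Slot Number: 1
Media Error Count: 0
Predictive Failure Count: 3
Last Predictive Failure Event Seq Number: 4521
"""

for slot, count in predictive_failures(sample).items():
    print(f"replace disk in slot {slot}: Predictive Failure Count = {count}")
```

Slot 0 in the sample has a non-zero Media Error Count but is not flagged, which mirrors the vendor's advice to alert on Predictive Failure Count alone.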
