Predicting server hardware failure with mcelog

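By monitoring the Machine Check Event Log (MCE log), we successfully predicted about 90% of server hardware failures and so avoided server crashes. The method applies to systems running an x86_64 kernel: a cron job periodically reads and decodes the MCE log, and a Nagios plugin monitors the result.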

12 May

Have you ever wanted to predict that a piece of hardware in your server was failing before it actually caused the server to crash?

Sure! We all do.

Over the past few months, I have been tracking the correlation between errors logged to the Machine Check Event Log (MCElog) and hard crashes of a server or of an application running on it (mostly MySQL). So far, the correlation is about 90%: roughly 9 times out of 10, an error is logged to the MCElog before the server actually crashes. It may take days or even weeks between the logged error and the crash, but it will happen. We now actively monitor this log and replace hardware (RAM and CPUs) that shows errors before it actually fails. I thought that was pretty cool, so here is how we do it.

On Debian, there is a package for the mcelog utility, which decodes and displays the kernel messages logged to /dev/mcelog. Part of this package is a cron job that appends the decoded contents of /dev/mcelog to /var/log/mcelog every 5 minutes:

*/5 * * * * root test -x /usr/sbin/mcelog -a ! -e /etc/mcelog-disabled && /usr/sbin/mcelog --ignorenodev --filter >> /var/log/mcelog
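
The guard in that cron entry is easier to read when written out step by step. The following is a rough sketch only: collect_mce and its three parameters are hypothetical names, not part of the mcelog package, and they exist so the logic can be tried with stand-in paths. The mcelog flags are copied verbatim from the cron entry. Unlike the one-liner, the sketch returns 0 when it skips.

```shell
#!/bin/bash
# collect_mce is a hypothetical wrapper (an assumption for illustration).
# Its parameters default to the paths used in the Debian cron entry.
collect_mce() {
	local bin="${1:-/usr/sbin/mcelog}"
	local disable_flag="${2:-/etc/mcelog-disabled}"
	local logfile="${3:-/var/log/mcelog}"

	# Skip quietly if the decoder is missing or not executable.
	[ -x "$bin" ] || return 0
	# Skip quietly if the disable flag file exists.
	[ ! -e "$disable_flag" ] || return 0
	# Append rather than truncate, so earlier errors stay in the log.
	"$bin" --ignorenodev --filter >> "$logfile"
}
```

Calling collect_mce with no arguments uses the real paths. Passing a stand-in executable and a scratch log file lets you check the guard without touching /var/log.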

We modify this a little bit and add another cron job which rotates this log file on reboot:

@reboot root test -f /var/log/mcelog && mv /var/log/mcelog /var/log/mcelog.0

We do this because after a reboot, which is most likely the result of a hardware repair, we want to clear the active log file (monitored by the Nagios plugin below) so that the alert clears. If the reboot was not part of hardware maintenance, however, we still want a record of the hardware errors, so we move the log file to mcelog.0 rather than deleting it.
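
The rotation can be tried safely on a scratch file before trusting it at boot. This is a minimal sketch, assuming a helper called rotate_mcelog (a hypothetical name, not something the mcelog package provides) that takes the log path as an argument and otherwise behaves like the @reboot entry. Note that, as with the cron line, a second rotation overwrites the previous mcelog.0.

```shell
#!/bin/bash
# rotate_mcelog is a hypothetical helper (an assumption for illustration).
# With no argument it acts on the same path as the @reboot cron entry.
rotate_mcelog() {
	local logfile="${1:-/var/log/mcelog}"

	# Only rotate when there is an active log to preserve.
	if [ -f "$logfile" ]; then
		# Keeps the old errors on disk; replaces any earlier .0 file.
		mv -- "$logfile" "$logfile.0"
	fi
}
```

Running rotate_mcelog against a copy of the log in a temporary directory should leave only the .0 file behind, which is the state that lets the monitoring alert clear.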

We then have a simple Nagios plugin that monitors /var/log/mcelog for errors:

#!/bin/bash
# Nagios plugin: count decoded hardware errors in the mcelog log file.

LOGFILE=/var/log/mcelog

if [ ! -f "$LOGFILE" ]
then
	# Exit code 3 is Nagios UNKNOWN.
	echo "No logfile exists"
	exit 3
else
	# grep -c prints the number of matching lines (0 when there are none).
	ERRORS=$( grep -c "HARDWARE ERROR" "$LOGFILE" )
	if [ "$ERRORS" -eq 0 ]
	then
		echo "OK: $ERRORS hardware errors found"
		exit 0
	else
		echo "WARNING: $ERRORS hardware errors found"
		exit 1
	fi
fi
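
To see the plugin's three outcomes without waiting for real hardware errors, the same logic can be wrapped in a function that takes the log path as an argument. This is a sketch under that assumption: check_mcelog is a hypothetical name, and the UNKNOWN prefix on the first message is my addition. The return codes follow the plugin above (0 OK, 1 WARNING, 3 UNKNOWN).

```shell
#!/bin/bash
# check_mcelog is a hypothetical function (an assumption for illustration).
# It mirrors the plugin's logic but can be pointed at any file.
check_mcelog() {
	local logfile="$1"
	local errors

	if [ ! -f "$logfile" ]; then
		echo "UNKNOWN: no logfile exists"
		return 3
	fi
	# grep -c prints 0 but exits non-zero when nothing matches;
	# "|| true" keeps that from looking like a failure.
	errors=$(grep -c "HARDWARE ERROR" "$logfile" || true)
	if [ "$errors" -eq 0 ]; then
		echo "OK: $errors hardware errors found"
		return 0
	fi
	echo "WARNING: $errors hardware errors found"
	return 1
}
```

Pointing it at a scratch file that contains a few lines with the string HARDWARE ERROR (synthetic text, not real mcelog output) should produce the WARNING line, and an empty file should produce OK.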

And that's pretty much it. In just a few weeks, we have caught about a dozen hardware faults before they led to server crashes.

Disclaimer: This only works when running an x86_64 kernel, and YMMV.