A Detailed Guide to Using mcelog

Starting mcelog manually:

# mcelog --daemon

Run mcelog in daemon mode, waiting for errors from the kernel.

 

 

Starting mcelog as a background service:

RHEL 7:

    systemctl start mcelog

    systemctl enable mcelog

 

RHEL 6:

    service mcelogd start

    chkconfig mcelogd on

 

 

Viewing the mcelog log:

# vim /var/log/mcelog

 

 

Checking whether the mcelog daemon has detected any errors:

# mcelog --client

Query a currently running mcelog daemon for errors

 

 

Decoding MCE output captured during a system fault:

# mcelog --ascii < file.log

or:

# mcelog --ascii --file file.log

Decode machine check ASCII output from kernel logs

 

 

    Sample fault output:

[Hardware Error]: CPU 12: Machine Check Exception: 5 Bank 22: be200000000c110a

[Hardware Error]: RIP !INEXACT! 10:<ffffffff81014527> {mwait_idle+0x77/0xd0}

[Hardware Error]: TSC 103e7072fa77de ADDR c5f17ee00 MISC b0fe435602184086 

[Hardware Error]: PROCESSOR 0:306e4 TIME 1462390781 SOCKET 0 APIC 1

[Hardware Error]: Run the above through 'mcelog --ascii'
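Before handing the text to mcelog, it can help to see what the first line of the sample already says. The sketch below pulls the CPU number, bank number, and raw status word out of that line and names the flag bits that are set. The bit positions are the architectural IA32_MCi_STATUS flags from the Intel manuals; the function names and the regular expression are illustrative assumptions, and the model-specific decoding that mcelog performs is not attempted here.

```python
import re

# Architectural flag bits of the IA32_MCi_STATUS register.
STATUS_FLAGS = {
    "VAL": 63,    # register holds valid information
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register is valid
    "ADDRV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
}

# Hypothetical pattern for the header line shown in the sample above.
HEADER_RE = re.compile(
    r"CPU (\d+): Machine Check Exception: \S+ Bank (\d+): ([0-9a-fA-F]+)"
)


def parse_mce_header(line):
    """Return cpu, bank and status from an MCE header line, or None."""
    match = HEADER_RE.search(line)
    if match is None:
        return None
    return {
        "cpu": int(match.group(1)),
        "bank": int(match.group(2)),
        "status": int(match.group(3), 16),
    }


def status_flags(status):
    """List the names of the architectural flags set in a status word."""
    return [name for name, bit in STATUS_FLAGS.items() if (status >> bit) & 1]


if __name__ == "__main__":
    info = parse_mce_header(
        "[Hardware Error]: CPU 12: Machine Check Exception: 5 "
        "Bank 22: be200000000c110a"
    )
    print(info["cpu"], info["bank"], status_flags(info["status"]))
```

For the sample line, the UC and PCC flags are both set, which is consistent with a fatal, uncorrected error.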

 

   The leading "[Hardware Error]: " prefix must be removed from every line of file.log:

CPU 12: Machine Check Exception: 5 Bank 22: be200000000c110a

RIP !INEXACT! 10:<ffffffff81014527> {mwait_idle+0x77/0xd0}

TSC 103e7072fa77de ADDR c5f17ee00 MISC b0fe435602184086 

PROCESSOR 0:306e4 TIME 1462390781 SOCKET 0 APIC 1
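The prefix removal can be scripted instead of done by hand. This is a minimal sketch, assuming the input is a copy of the kernel log text; the function name is hypothetical. It keeps only lines that carry the prefix, drops anything in front of it (such as a timestamp), and skips the trailing "Run the above through" hint, matching the cleaned file.log shown above.

```python
PREFIX = "[Hardware Error]: "


def strip_hw_error_prefix(text):
    """Return the MCE lines of a kernel log without the PREFIX marker."""
    kept = []
    for line in text.splitlines():
        pos = line.find(PREFIX)
        if pos == -1:
            continue  # not part of the machine check report
        body = line[pos + len(PREFIX):].rstrip()
        if body.startswith("Run the above through"):
            continue  # hint line, not data for mcelog
        kept.append(body)
    return "\n".join(kept) + "\n" if kept else ""


if __name__ == "__main__":
    raw = (
        "[Hardware Error]: CPU 12: Machine Check Exception: 5 "
        "Bank 22: be200000000c110a\n"
        "[Hardware Error]: Run the above through 'mcelog --ascii'\n"
    )
    print(strip_hw_error_prefix(raw), end="")
```

The returned text can be written to file.log and then decoded with `mcelog --ascii --file file.log`.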

 

mcelog logs and accounts machine checks (in particular memory, IO, and CPU hardware errors) on modern x86 Linux systems.

mcelog is required by both 32bit x86 Linux kernels (since 2.6.30) and 64bit Linux kernels (since early 2.6 kernel releases) to log machine checks and should run on all Linux systems that need error handling.

The mcelog daemon accounts memory and some other errors in various ways. mcelog --client can be used to query a running daemon. The daemon can also execute triggers when configurable error thresholds are exceeded. This is used to implement a range of automatic predictive failure analysis algorithms, including bad page offlining and automatic cache error handling. User-defined actions can also be configured.

All errors are logged to /var/log/mcelog or syslog or the journal.

For memory errors it supports modern x86 systems with integrated memory controllers; for CPU errors all modern x86 systems are supported.

Traditionally mcelog was run as a cron job, but this usage is now deprecated. The modern way to run it is to start it at boot time and keep it running as a daemon. In addition, it can be used to decode fatal machine checks on the command line (though this is usually no longer needed on modern kernels, which log those automatically after reboot).

For installation information and how to set up a mcelog package (if you're a distributor) please see the README.

mcelog.conf reference

mcelog is configured through the /etc/mcelog.conf configuration file.

General format

optionname = value

Whitespace is currently not allowed in a value, except at the end, where it is dropped.

 

In general, all command line options that are not commands work here. See man mcelog or mcelog --help for a list. For example, to enable the --no-syslog option use:

no-syslog = yes (or no to disable)

When the option takes an argument:

logfile = /tmp/logfile
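To make the format concrete, here is a minimal sketch of a reader for `optionname = value` lines. It is not mcelog's own parser. It assumes that `#` starts a comment line, and it groups options under the bracketed section headers such as [server] that appear later in this reference. The function name is hypothetical.

```python
def parse_mcelog_conf(text):
    """Parse 'optionname = value' lines, grouped by [section] headers."""
    config = {"": {}}  # "" holds options that appear before any section
    section = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumption: '#' is a comment
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            config.setdefault(section, {})
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not an option line: {raw!r}")
        # Trailing whitespace in the value is dropped, as described above.
        config[section][name.strip()] = value.strip()
    return config


if __name__ == "__main__":
    sample = "no-syslog = yes\nlogfile = /tmp/logfile\n"
    print(parse_mcelog_conf(sample))
```

Values are kept as strings on purpose, so the caller decides how to interpret yes/no flags and thresholds.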

Below are the options that are not command line options.

 

Set cpu type for which mcelog decodes events:

cpu = type

For valid values of type, see mcelog --help. If this value is set incorrectly, the decoded output will likely be incorrect. By default, when this parameter is not set, mcelog uses the CPU it is running on. On very new kernels the mcelog events reported by the kernel also carry the CPU type, which is used when available and not overridden.

 

Enable daemon mode:

daemon = yes

By default mcelog just processes the currently pending events and exits. In daemon mode it will keep running as a daemon in the background and poll the kernel for events and then decode them.

 

Filter out known broken events by default.

filter = yes

Don't log memory errors individually. They still get accounted if that is enabled.

filter-memory-errors = yes

 

Output in undecoded raw format, which is easier for machines to read (default is decoded).

raw = yes

 

Set the CPU MHz used to decode uptime from the time stamp counter. The output is unreliable, and the setting is not needed on new kernels, which report the event time directly. Many systems don't have a linear time stamp clock, in which case the output is wrong. Normally mcelog tries to figure out whether the TSC is reliable and only then uses the current frequency. Setting a frequency forces timestamp decoding. This setting is obsolete with modern kernels, which report the time directly.

cpumhz = 1800.00
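As a rough illustration of the conversion this option implies, a cycle count divided by the clock rate gives the seconds elapsed since the counter started. The function name is hypothetical, and the sketch assumes a constant-rate TSC, which the text above warns is often not the case.

```python
def tsc_to_uptime_seconds(tsc, cpu_mhz):
    """Convert a TSC cycle count to seconds, assuming a constant clock rate."""
    if cpu_mhz <= 0:
        raise ValueError("cpu_mhz must be positive")
    return tsc / (cpu_mhz * 1_000_000)


if __name__ == "__main__":
    # 1.8e9 cycles on a 1800 MHz clock correspond to one second.
    print(tsc_to_uptime_seconds(1_800_000_000, 1800.00))
```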

 

Log output options

Log decoded machine checks in syslog (default: stdout, or syslog for the daemon)

syslog = yes

Log decoded machine checks in syslog with error level

syslog-error = yes

Never log anything to syslog

no-syslog = yes

Append log output to logfile instead of stdout. Only applies when no syslog logging is active.

logfile = filename

 

Use SMBIOS information to decode DIMMs (needs root). Using this function is currently not recommended, and it is generally not needed. The exception is memdb prepopulation, which is configured separately below.

dmi = no

 

When in daemon mode run as this user after set up. Note that the triggers will run as this user too. Setting this to non root will mean that triggers cannot take some corrective action, like offlining objects.

run-credentials-user = root

 

Group to run the daemon as. Defaults to the group of the run-credentials-user.

run-credentials-group = nobody

 

The [server] config section

User allowed to access the client socket. When set to *, any user matches. root is always allowed access. Default: root only.

client-user = root

Group allowed to access mcelog. When no group is configured, any group matches (but the user is still checked). When set to *, any group matches.

client-group = root

Path to the Unix socket for client<->server communication. When no socket-path is configured, the server will not start.

socket-path = /var/run/mcelog-client

When mcelog starts it checks if a server is already running. This configures the timeout for this check.

initial-ping-timeout = 2

The [dimm] config section

Is in-memory DIMM error tracking enabled? Only works on supported systems with an integrated memory controller. Only takes effect in daemon mode.

dimm-tracking-enabled = yes

Use DMI information from the BIOS to prepopulate the DIMM database. Note that this might not work with every BIOS and requires mcelog to run as root. The alternative is to let mcelog create DIMM objects on demand.

dmi-prepopulate = yes

Execute these triggers when the rate of corrected or uncorrected errors per DIMM exceeds the threshold. Note that when the hardware does not report DIMMs, this might be per channel instead. The default of 10/24h is reasonable for server quality DDR3 DIMMs as of 2009/10.

uc-error-trigger = dimm-error-trigger

uc-error-threshold = 1 / 24h

ce-error-trigger = dimm-error-trigger

ce-error-threshold = 10 / 24h
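The threshold values pair an error count with a time window. The sketch below splits a value such as `10 / 24h` into a count and a window in seconds. Only the `h` unit appears in this reference; the other unit letters in the table are assumptions, as is the function name.

```python
import re

# Only "h" appears in the reference above; "s", "m" and "d" are assumed here.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

THRESHOLD_RE = re.compile(r"^\s*(\d+)\s*/\s*(\d+)\s*([smhd])\s*$")


def parse_threshold(value):
    """Turn 'COUNT / SPANunit' into (count, window_seconds)."""
    match = THRESHOLD_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognized threshold: {value!r}")
    count = int(match.group(1))
    span = int(match.group(2))
    return count, span * UNIT_SECONDS[match.group(3)]


if __name__ == "__main__":
    print(parse_threshold("10 / 24h"))
```

Under these assumptions, `10 / 24h` reads as "more than 10 errors within 86400 seconds".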

 

The [socket] config section

Enable memory error accounting per socket.

socket-tracking-enabled = yes

 

Threshold and trigger for uncorrected memory errors on a socket.

mem-uc-error-trigger = socket-memory-error-trigger

 

 

mem-uc-error-threshold = 100 / 24h

 

Trigger script for corrected memory errors on a socket.

mem-ce-error-trigger = socket-memory-error-trigger

 

Threshold on when to trigger a corrected error for the socket.

 

mem-ce-error-threshold = 100 / 24h

 

Log socket error thresholds explicitly?

mem-ce-error-log = yes

 

Trigger script for uncorrected bus error events

bus-uc-threshold-trigger = bus-error-trigger

 

Trigger script for uncorrected IOMCA errors

iomca-threshold-trigger = iomca-error-trigger

 

Trigger script for other uncategorized errors

unknown-threshold-trigger = unknown-error-trigger

 

The [cache] config section

Processing of cache error thresholds reported by Intel CPUs.

cache-threshold-trigger = cache-error-trigger

 

Should cache threshold events be logged explicitly?

cache-threshold-log = yes

 

The [page] config section

Memory error accounting per 4k memory page. Threshold for the corrected memory errors trigger script.

memory-ce-threshold = 10 / 24h

 

Trigger script for corrected errors.

memory-ce-trigger = page-error-trigger
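The idea of per-page accounting with a threshold can be sketched as a sliding-window counter keyed by 4k page. This is a simplified illustration, not mcelog's actual accounting, which may work differently. The class and method names are hypothetical, and timestamps are plain seconds supplied by the caller.

```python
from collections import defaultdict, deque


class PageErrorCounter:
    """Count corrected errors per 4k page within a sliding time window."""

    def __init__(self, max_errors, window_seconds):
        self.max_errors = max_errors
        self.window = window_seconds
        self.events = defaultdict(deque)  # page number -> error timestamps

    def record(self, address, timestamp):
        """Record one error; return True once the page exceeds the threshold."""
        page = address >> 12  # 4k pages: drop the low 12 address bits
        stamps = self.events[page]
        stamps.append(timestamp)
        # Forget errors that have fallen out of the window.
        while stamps and stamps[0] <= timestamp - self.window:
            stamps.popleft()
        return len(stamps) > self.max_errors


if __name__ == "__main__":
    counter = PageErrorCounter(max_errors=10, window_seconds=24 * 3600)
    for second in range(11):
        over = counter.record(0xC5F17EE00, second)
    print(over)  # the 11th error inside the window crosses 10 / 24h
```

A caller would run the configured trigger script, or apply the configured action, at the point where `record` first returns True.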

 

Should page threshold events be logged explicitly?

memory-ce-log = yes

 

Specify the internal action mcelog takes when a page error threshold is exceeded. This is done in addition to executing the trigger script, if available.

memory-ce-action = off|account|soft|hard|soft-then-hard

memory-ce-action = soft

off             no action
account         only account errors
soft            try to soft-offline the page without killing any processes.
                This requires an up-to-date kernel. Might not be successful.
hard            try to hard-offline the page by killing processes.
                Requires an up-to-date kernel. Might not be successful.
soft-then-hard  first try to soft-offline, then try hard-offlining

 

The [trigger] config section

Maximum number of running triggers

children-max = 2

Execute triggers in this directory

directory = /etc/mcelog

More details: http://www.mcelog.org


Reposted from: https://www.cnblogs.com/DataArt/p/10374165.html

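mcelog is a tool for handling Machine Check Exception (MCE) events, and it can be used to analyze and report them. An MCE is raised when the CPU or another hardware component detects an error. Such errors may be caused by a hardware fault, or by other conditions such as overheating or unstable voltage.

mcelog obtains MCE event information by reading the /dev/mcelog device file. This information can include the error type, the error address, and the error code. mcelog can record it in a log file for later analysis and debugging.

The fields of a log record can be summarized schematically as:

```
mcelog: <CPU>: <ERROR_TYPE>: <MCG_STATUS>: <MCG_CAP>: <MCG_CTL>: <IPID>: <FLAGS>: <OTHER_INFO>
```

Here <CPU> is the number of the CPU that raised the MCE event, <ERROR_TYPE> is the error type, <MCG_STATUS>, <MCG_CAP>, and <MCG_CTL> are the values of the machine check status, capability, and control registers, <IPID> is the instruction pointer ID, <FLAGS> holds flag bits, and <OTHER_INFO> holds any other information.

When analyzing the mcelog log, keep the following points in mind:

1. Different error types call for different handling. A memory error may mean the faulty DIMM needs replacing; a CPU error may mean the faulty CPU needs replacing.
2. The error address can help locate the faulty hardware component. If the address falls within a particular memory range, that DIMM may be at fault; if it falls within an I/O port range, that I/O device may be at fault.
3. The error code can provide more detail. Some error codes may indicate, for example, whether the error was caused by overheating or by unstable voltage.

In short, mcelog is a very useful tool for quickly locating and resolving hardware faults in a system.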