A Guide to Unexpected System Restarts

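This article describes how to diagnose the cause of an unexpected reboot on a Red Hat Enterprise Linux system, covering user actions, software faults, and hardware faults. It explains how to check the /var/log/messages log and how to use the Kernel Oops Analyzer tool to quickly locate kernel crash issues.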

https://access.redhat.com/articles/206873

Updated January 2, 2018 at 23:31

Introduction

Red Hat provides a Kernel Oops Analyzer tool to help you diagnose a kernel crash. When you supply text or a file containing one or more kernel oops messages, the tool walks you through diagnosing the crash. Try the tool before you perform the manual steps below; it may find a solution for your kernel crash in seconds. You can leave feedback on the tool at Kernel Oops App Info.

While a Red Hat Enterprise Linux system will not reboot unless specifically configured to do so, there are still several instances in which an unexpected reboot can occur. At a basic level, these occurrences fall into three categories:

A deliberate action on the part of a user (fence event, shutdown commands, etc.)
A software fault on the server (kernel panic, NMI, etc.)
A hardware fault/power failure in the server (power supply failure, disk or memory corruption, etc.)

In this article we discuss how to identify these occurrences and steps to alter or prevent future occurrences.

Understanding the Environment

There are some important questions to ask when an unexpected reboot has recently occurred that will help narrow down likely causes. Taking our lead from the categories above:

Identifying deliberate actions/configurations that would cause a restart:
- Is the server in question a cluster node with an attached fence device?
- Was the software on the server performing any tasks which would change its typical resource use?
- Is the server configured with health monitoring software, such as HP ASR?
- Is there a Baseboard Management Controller (HP iLO, Dell DRAC, etc.) connected to the system?
Potential software faults will most typically leave traces in /var/log/messages, investigated in the next section.
Potential hardware faults are difficult to diagnose from an operating system level, but it remains important to note power failures, maintenance events, or other environmental occurrences around the time of the restart.

Investigating /var/log/messages

Many of the most common restart causes will leave traces in /var/log/messages. All full system restarts will begin by listing the kernel command line, so searching the message log for the phrase "Command line" is a good first step when beginning an investigation.

For example:

Sep 29 04:18:15 <hostname> kernel: Command line: ro root=LABEL=/ rhgb quiet crashkernel=128M@16M
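
As a minimal sketch of this first step, the following shell functions locate each boot marker in a log file and print the lines logged just before it. The function names are hypothetical helpers introduced here for illustration, and the log path is passed as an argument rather than assumed.

```shell
# Print the line number of every boot marker ("Command line") in a log file.
# Usage: boot_marker_lines <logfile>
boot_marker_lines() {
    grep -n 'Command line' "$1" | cut -d: -f1
}

# Print each boot marker together with the lines logged just before it.
# Usage: show_pre_boot_context <logfile> [lines_of_context]
show_pre_boot_context() {
    local logfile=$1
    local context=${2:-20}
    # -B prints the given number of lines that precede each match.
    grep -B "$context" 'Command line' "$logfile"
}
```

For example, `show_pre_boot_context /var/log/messages 50` prints the 50 lines preceding each boot, which is where the messages described next would appear.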

Starting from this point and working backwards, look for messages similar to the following. Note that these are examples of trouble indicators; the actual errors found may vary by application and release version:

User-initiated Shutdown
- shutdown: shutting down for system reboot
- init: Switching to runlevel: 6
- exiting on signal 15
- Got SIGTERM, quitting.
Veritas Cluster Fence Event
- GAB WARNING V-15-1-20138 Port h isolated due to client process failure
RHEL High-Availability Cluster Suite Fence Event
- fenced[xxxx]: fencing node "node1.example.com"
- [TOTEM ] A processor failed, forming new configuration.
- [TOTEM] The token was lost in the OPERATIONAL state.
Hardware Fault
- CPU 1: Machine Check Exception: 4 Bank 4: ba00000000070f0f
- Kernel panic - not syncing: Machine check
- Kernel panic - not syncing: Uncorrected machine check
Thermal Event/Cooling Failure
- kernel: CPUX: Temperature above threshold, cpu clock throttled
- kernel: CPUX: Core power limit notification (total events = 1)
Power Button Pressed
- received event "button/power PWRF 00000000 00000000"
Non-Maskable Interrupt Received
- kernel: Uhhuh. NMI received for unknown reason XX.
- kernel: NMI received for unknown reason 00
- kernel: Dazed and confused, but trying to continue
- kernel: Do you have a strange power saving mode enabled?
Kernel Soft Lockup
- kernel: BUG: soft lockup - CPU#7 stuck for 10s!
Task Blocked for Too Long
- kernel: INFO: task <process>:60 blocked for more than 120 seconds.

These messages may not necessarily be the root cause of the reboot, but are important clues worth investigating further.
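
The indicators listed above can be searched for in a single pass. The sketch below tags each matching log line with a category; the function name and category labels are hypothetical, and the patterns are only a subset drawn from the examples above, so they will not catch every variant a given release may log.

```shell
# Print "category: line" for each trouble indicator found in a log file.
# Usage: classify_reboot_clues <logfile>
classify_reboot_clues() {
    awk '
        /shutting down for system reboot|Switching to runlevel: 6/ { print "user-shutdown: " $0; next }
        /fenced\[[0-9]+\]: fencing node|\[TOTEM ?\]/ { print "cluster-fence: " $0; next }
        /[Mm]achine [Cc]heck/ { print "hardware-fault: " $0; next }
        /Temperature above threshold/ { print "thermal: " $0; next }
        /button\/power/ { print "power-button: " $0; next }
        /NMI received for unknown reason/ { print "nmi: " $0; next }
        /BUG: soft lockup/ { print "soft-lockup: " $0; next }
        /blocked for more than [0-9]+ seconds/ { print "hung-task: " $0; next }
    ' "$1"
}
```

Running `classify_reboot_clues /var/log/messages` prints only the tagged lines, in log order, so the entries closest to the restart appear last before each boot marker.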

Where to Go Next

If it becomes apparent that the system suffered a hang, lockup, or loss of service that caused an external application to reboot it, then an investigation of server load and performance leading up to the event is in order. By default, this data is recorded by the System Activity Reporter (SAR) facility provided by the sysstat package. Analyzing collected SAR files is detailed further in our Knowledge Base. See How to analyze and interpret sar data.
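
As a rough illustration, the sketch below builds the path of the daily sysstat data file for a given date, assuming the default /var/log/sa/saDD layout and GNU date. The helper name is hypothetical, and the date and time window in the commented example are placeholders, not values from a real incident.

```shell
# Build the path of the daily sysstat data file for a given date.
# Assumes the default /var/log/sa/saDD layout and GNU date (-d option).
# Usage: sa_file_for_date <YYYY-MM-DD>
sa_file_for_date() {
    echo "/var/log/sa/sa$(date -d "$1" +%d)"
}

# Example (not executed here): run-queue/load and CPU figures for the
# half hour before a restart logged at 04:18 on the 29th.
#   sar -q -f "$(sa_file_for_date 2023-09-29)" -s 03:45:00 -e 04:18:00
#   sar -u -f "$(sa_file_for_date 2023-09-29)" -s 03:45:00 -e 04:18:00
```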

Should none of the above messages show up in the logs, then the reboot cause can be narrowed down to an event that does not print messages to the logs. Only a limited number of operations behave this way; the most prevalent of these follow.

Kernel Panic

A Red Hat Enterprise Linux system can be configured to reboot after experiencing a kernel panic. The kernel parameter by which this is set represents the number of seconds after a panic has been experienced before a reboot command will be issued, and is exposed in the /proc filesystem:

# cat /proc/sys/kernel/panic

If this value is set to 0, this functionality is disabled. Should an unexpected restart occur when this feature is enabled, there is a strong likelihood that the system is experiencing kernel panics. In these cases, we strongly recommend configuring netdump (Red Hat Enterprise Linux 4 or below) or kdump (Red Hat Enterprise Linux 5 or above) on the affected system to gather information regarding the panic cause.

NOTE: On a Red Hat Enterprise Linux 6 system, you can often speed up analysis of a kernel panic through use of a small file called the kernel log. See RHEL6: Speeding up kernel crash / hang analysis with the kernel log for more information.
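
The value read from /proc/sys/kernel/panic can be interpreted with a small helper such as the sketch below. The function name is hypothetical; the handling of negative values follows the upstream kernel sysctl documentation rather than this article, which only describes 0 and positive values.

```shell
# Describe what a given kernel.panic value means.
# Usage: describe_panic_setting "$(cat /proc/sys/kernel/panic)"
describe_panic_setting() {
    local value=$1
    if [ "$value" -eq 0 ]; then
        echo "reboot on panic disabled"
    elif [ "$value" -gt 0 ]; then
        echo "reboot $value seconds after a panic"
    else
        # Negative values: per the upstream kernel documentation.
        echo "reboot immediately after a panic"
    fi
}
```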

SysRq

The SysRq facility contains functionality that can force an instantaneous system reboot. While shutdown commands are generally logged to the system's messages file, SysRq commands are not always captured in the same way. There are two ways a SysRq can be issued to cause a reboot. If the "Magic" SysRq key sequence has been enabled, then the key sequence Alt+PrintScreen+b will trigger a system reboot on the spot. This can be enabled and disabled with the kernel parameter kernel.sysrq, again exposed through the /proc filesystem:

# cat /proc/sys/kernel/sysrq

If this command returns 0, then triggering a SysRq command with the above key sequence is disabled. A 1 indicates that this functionality is enabled.
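
Values other than 0 and 1 are also possible: according to the upstream kernel documentation, a value greater than 1 is a bitmask of permitted SysRq functions, in which bit 128 allows reboot and poweroff. The sketch below, using a hypothetical function name, reports whether the reboot key would be honoured for a given value under that assumption.

```shell
# Report whether the SysRq reboot key ('b') is permitted for a kernel.sysrq value.
# Usage: sysrq_reboot_allowed "$(cat /proc/sys/kernel/sysrq)"
sysrq_reboot_allowed() {
    local value=$1
    if [ "$value" -eq 1 ]; then
        echo yes    # 1 enables every SysRq function
    elif [ $(( value & 128 )) -ne 0 ]; then
        echo yes    # bit 128 permits reboot/poweroff
    else
        echo no     # 0, or a bitmask without the reboot bit
    fi
}
```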

Alternatively, the file /proc/sysrq-trigger can be used to issue a SysRq command whether or not the "Magic" key sequence is enabled; the command

# echo 'b' > /proc/sysrq-trigger

will instantly trigger a system reboot. Many different clustering software suites use this file and functionality as a fencing solution. The cluster management software will monitor the cluster nodes for errors or hangs, and upon detecting that a node has become unresponsive, it will issue the above command on that node, resulting in a restart. The Red Hat High Availability clustering software does not use this functionality, but if non-Red Hat clustering software is present on the system, it is recommended to investigate what fencing solution that software employs.

IPMI and Baseboard Management Controllers

There are many pieces of software that will monitor a system for perceived performance difficulties and, if any are detected, will send an IPMI signal to a BMC on the system board to restart the poorly performing server. Different implementations of IPMI exist on different hardware platforms, including HP iLO and Dell DRAC. A frequent culprit of this type of unexpected reboot is the Automated System Recovery (ASR) functionality provided by the hp-health package on HP hardware with iLO cards. If this package is installed, one can check for ASR events with the following commands:

# hpasmcli -s "show asr"
# hpasmcli -s "show iml"

Additionally, some clustering software, including Red Hat's own, can use IPMI signals to fence unresponsive nodes. If the server in question has such hardware installed, investigating the related hardware logs and/or cluster logs can shed further light on reboot occurrences.

Failing Hardware

Should no evidence of the above be present, then the remaining piece of the equation to investigate is hardware. There have been previous cases where a bad motherboard, faulty CPU, or failing Power Supply Unit has cut power to the machine, causing a hard shutdown. This behaviour is entirely dependent on the hardware within the system, and performing full hardware diagnostics against the machine is generally the only method to rule this out as a possibility.