A Guide to Unexpected System Restarts

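This article describes in detail how to diagnose the cause of an unexpected restart on a Red Hat Enterprise Linux system, including user actions, software faults, and hardware faults. It covers how to check the /var/log/messages log and how to use the Kernel Oops Analyzer tool to quickly locate kernel crash problems.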

https://access.redhat.com/articles/206873

Updated January 2, 2018 at 23:31

Contents

  1. Introduction
  2. Understanding the Environment
  3. Investigating /var/log/messages
  4. Where to Go Next
  5. Kernel Panic
  6. SysRq
  7. IPMI and Baseboard Management Controllers
  8. Failing Hardware

Introduction

Red Hat provides a Kernel Oops Analyzer tool to help you diagnose kernel crashes. When you submit text or a file containing one or more kernel oops messages, the tool walks you through diagnosing the crash. Try the tool before performing the manual steps below; it may find a solution in seconds. You can leave feedback on the tool at Kernel Oops App Info.

While a Red Hat Enterprise Linux system will not reboot unless specifically configured to do so, there are still several instances in which an unexpected reboot can occur. At a basic level, these occurrences fall into three categories:

  • A deliberate action on the part of a user (fence event, shutdown commands, etc.)
  • A software fault on the server (kernel panic, NMI, etc.)
  • A hardware fault/power failure in the server (power supply failure, disk or memory corruption, etc.)

In this article we discuss how to identify these occurrences and steps to alter or prevent future occurrences.

Understanding the Environment

There are some important questions to ask when an unexpected reboot has recently occurred that will help narrow down likely causes. Taking our lead from the categories above:

  • Identifying deliberate actions/configurations that would cause a restart:

    • Is the server in question a cluster node with an attached fence device?
    • Was the software on the server performing any tasks which would change its typical resource use?
    • Is the server configured with health monitoring software, such as HP ASR?
    • Is there a Baseboard Management Controller (HP iLO, Dell DRAC, etc.) connected to the system?
  • Potential software faults will most typically leave traces in /var/log/messages, investigated in the next section.

  • Potential hardware faults are difficult to diagnose from an operating system level, but it remains important to note power failures, maintenance events, or other environmental occurrences around the time of the restart.

Investigating /var/log/messages

Many of the most common restart causes will leave traces in /var/log/messages. All full system restarts will begin by listing the kernel command line, so searching the message log for the phrase "Command line" is a good first step when beginning an investigation.

For example:

Sep 29 04:18:15 <hostname> kernel: Command line: ro root=LABEL=/ rhgb quiet crashkernel=128M@16M
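
The search described above can be sketched as a small shell function. This is a minimal sketch, not part of the original article: the function name find_boot_markers is hypothetical, and the default path simply follows the log file named in this section. It prints every boot marker together with its line number, so the most recent restart is the last line of output.

```shell
# Hypothetical helper: list every boot marker in a syslog-style file.
# Pass a different file as the first argument to inspect a rotated log.
find_boot_markers() {
    local logfile="${1:-/var/log/messages}"
    # grep -n prefixes each matching line with its line number in the file
    grep -n 'Command line:' "$logfile"
}
```

Calling find_boot_markers with no argument reads /var/log/messages, which normally requires root privileges.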

Starting from this point and working backwards, look for messages similar to the following. Note that these are examples of trouble indicators; the actual errors found may vary by application and release version:

  • User-initiated Shutdown

    • shutdown: shutting down for system reboot
    • init: Switching to runlevel: 6
    • exiting on signal 15
    • Got SIGTERM, quitting.
  • Veritas Cluster Fence Event

    • GAB WARNING V-15-1-20138 Port h isolated due to client process failure
  • RHEL High-Availability Cluster Suite Fence Event

    • fenced[xxxx]: fencing node "node1.example.com"
    • [TOTEM ] A processor failed, forming new configuration.
    • [TOTEM] The token was lost in the OPERATIONAL state.
  • Hardware Fault

    • CPU 1: Machine Check Exception: 4 Bank 4: ba00000000070f0f
    • Kernel panic - not syncing: Machine check
    • Kernel panic - not syncing: Uncorrected machine check
  • Thermal Event/Cooling Failure

    • kernel: CPUX: Temperature above threshold, cpu clock throttled
    • kernel: CPUX: Core power limit notification (total events = 1)
  • Power Button Pressed

    • received event "button/power PWRF 00000000 00000000"
  • Non-Maskable Interrupt Received

    • kernel: Uhhuh. NMI received for unknown reason XX.
    • kernel: NMI received for unknown reason 00
    • kernel: Dazed and confused, but trying to continue
    • kernel: Do you have a strange power saving mode enabled?
  • Kernel Soft Lockup

    • kernel: BUG: soft lockup - CPU#7 stuck for 10s!
  • Task Blocked for Too Long

    • kernel: INFO: task <process>:60 blocked for more than 120 seconds.

These messages may not necessarily be the root cause of the reboot, but are important clues worth investigating further.
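
Working backwards from the boot marker can be partly automated. The following is a rough sketch under stated assumptions: the function name scan_before_last_boot and the default window of 200 lines are hypothetical choices, and the pattern list is only a condensed form of the example indicators listed above, not an exhaustive set. It prints any indicator found in the lines immediately preceding the most recent "Command line" entry.

```shell
# Hypothetical helper: scan the lines before the latest boot marker
# for the example trouble indicators listed in this section.
scan_before_last_boot() {
    local logfile="${1:-/var/log/messages}"
    local window="${2:-200}"   # lines to inspect before the marker (arbitrary default)
    local boot_line start end

    # Line number of the last "Command line" entry, i.e. the most recent boot
    boot_line=$(grep -n 'Command line:' "$logfile" | tail -n 1 | cut -d: -f1)
    if [ -z "$boot_line" ]; then
        echo "no boot marker found in $logfile" >&2
        return 1
    fi

    # Nothing precedes a boot marker on the first line
    if [ "$boot_line" -le 1 ]; then
        return 0
    fi

    end=$((boot_line - 1))
    start=$((boot_line - window))
    if [ "$start" -lt 1 ]; then
        start=1
    fi

    # Print only the window before the boot marker, then filter it
    sed -n "${start},${end}p" "$logfile" |
        grep -E 'shutting down for system reboot|Switching to runlevel: 6|exiting on signal 15|Got SIGTERM|fencing node|A processor failed|token was lost|Machine [Cc]heck|Kernel panic|Temperature above threshold|button/power|NMI received|soft lockup|blocked for more than'
}
```

The function returns a non-zero status when no indicator is found, which by itself does not rule out any cause; it only means none of these example patterns appeared in the window.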

Where to Go Next

If it becomes apparent that the system suffered a hang, lockup, or loss of service that caused an external application to reboot it, investigate server load and performance leading up to the event. By default, the System Activity Reporter (SAR) facility provided by the sysstat package records this data. Analyzing collected SAR files is covered in more detail in our Knowledge Base; see How to analyze and interpret sar data.
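
As a starting point before consulting the Knowledge Base article, the sketch below pulls CPU, memory, and run-queue data for a time window from a SAR file. It rests on assumptions not stated in the original: the function names are hypothetical, and /var/log/sa/saDD is taken to be the location where sysstat stores the data file for day DD of the month, which may differ on your system.

```shell
# Hypothetical helper: path of the sysstat data file for a day of the month.
# The /var/log/sa/saDD layout is an assumption; adjust it if your system differs.
sa_file_for_day() {
    local day="${1#0}"   # drop a leading zero so printf does not read "09" as octal
    printf '/var/log/sa/sa%02d\n' "$day"
}

# Hypothetical helper: show activity between two times on the given day.
# Usage: sar_before_reboot 29 04:00:00 04:18:00
sar_before_reboot() {
    local day="$1" start="$2" end="$3"
    local file
    file=$(sa_file_for_day "$day")
    if [ ! -r "$file" ]; then
        echo "no readable sar data at $file" >&2
        return 1
    fi
    sar -u -f "$file" -s "$start" -e "$end"   # CPU utilization
    sar -r -f "$file" -s "$start" -e "$end"   # memory utilization
    sar -q -f "$file" -s "$start" -e "$end"   # run queue length and load average
}
```

Choose a window that ends at the timestamp of the boot marker found in /var/log/messages, so the output covers the period just before the restart.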

Should none of the above messages show up in the logs, the reboot cause can be narrowed down to an event that does not print messages to the logs. Only a limited number of operations behave this way; the most prevalent are described below.

Kernel Panic

A Red Hat Enterprise Linux system can be configured to reboot after experiencing a kernel panic. The kernel parameter that controls this sets the number of seconds to wait after a panic before a reboot is issued, and is exposed in the /proc filesystem:

# cat /proc/sys/kernel/panic

If this value is set to 0, this functionality is disabled. Should an unexpected restart occur when this feature is enabled, there is a strong likelihood that the system is experiencing kernel panics. In these cases, we strongly recommend configuring netdump (Red Hat Enterprise Linux 4 or earlier) or kdump (Red Hat Enterprise Linux 5 or later) on the affected system to gather information about the cause of the panic.
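
The check above can be wrapped in a small function that also explains the value. This is an illustrative sketch: the function name describe_panic_setting is hypothetical, and it only interprets the two cases this article describes (0 and a positive number of seconds).

```shell
# Hypothetical helper: explain the reboot-on-panic setting.
# With no argument it reads the live value from /proc/sys/kernel/panic.
describe_panic_setting() {
    local value="${1:-$(cat /proc/sys/kernel/panic)}"
    if [ "$value" -eq 0 ]; then
        echo "kernel.panic=0: automatic reboot after a panic is disabled"
    elif [ "$value" -gt 0 ]; then
        echo "kernel.panic=$value: the system reboots $value seconds after a panic"
    else
        echo "kernel.panic=$value: value not covered by this article"
    fi
}
```

Passing a value explicitly, for example describe_panic_setting 10, is useful when reviewing a setting copied from another system or from a sosreport.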

NOTE: On a Red Hat Enterprise Linux 6 system, you can often speed up analysis of a kernel panic through use of a small file called the kernel log. See RHEL6: Speeding up kernel crash / hang analysis with the kernel log for more information.

SysRq

The SysRq facility contains functionality that can force an instantaneous system reboot. While shutdown commands are generally logged to the system's messages file, SysRq commands are not always captured in the same way. There are two ways a SysRq can be issued to cause a reboot. If the "Magic" SysRq key sequence has been enabled, then the key sequence Alt+PrintScreen+b will trigger a system reboot on the spot. This can be enabled and disabled with the kernel parameter kernel.sysrq, again exposed through the /proc filesystem:

# cat /proc/sys/kernel/sysrq

If this command returns 0, triggering a SysRq command with the above key sequence is disabled. A 1 indicates that this functionality is enabled. Other values are a bitmask that enables only a subset of SysRq functions.
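
A similar sketch interprets the kernel.sysrq value. The function name describe_sysrq_setting is hypothetical; for values other than 0 and 1 it reports only that a bitmask is in effect and does not decode the individual bits.

```shell
# Hypothetical helper: explain the Magic SysRq setting.
# With no argument it reads the live value from /proc/sys/kernel/sysrq.
describe_sysrq_setting() {
    local value="${1:-$(cat /proc/sys/kernel/sysrq)}"
    case "$value" in
        0) echo "kernel.sysrq=0: the Magic SysRq key sequence is disabled" ;;
        1) echo "kernel.sysrq=1: the Magic SysRq key sequence is enabled" ;;
        *) echo "kernel.sysrq=$value: a bitmask enables only some SysRq functions" ;;
    esac
}
```

This setting governs the keyboard sequence only, so a value of 0 does not by itself rule out a SysRq-triggered reboot.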

Alternatively, the file /proc/sysrq-trigger can be used to issue a SysRq command whether or not the "Magic" key sequence is enabled. The command

# echo 'b' > /proc/sysrq-trigger

will instantly trigger a system reboot. Many clustering software suites use this file as a fencing mechanism: the cluster management software monitors the cluster nodes for errors or hangs and, upon detecting that a node has become unresponsive, issues the above command on that node, resulting in a restart. The Red Hat High Availability clustering software does not use this functionality, but if non-Red Hat clustering software is present on the system, investigate which fencing solution that software employs.

IPMI and Baseboard Management Controllers

Many pieces of software monitor a system for perceived performance problems and, if any are detected, send an IPMI signal to a BMC on the system board to restart the poorly performing server. Different implementations of IPMI exist on different hardware platforms, including HP iLO and Dell DRAC. A frequent culprit of this type of unexpected reboot is the Automated System Recovery (ASR) functionality provided by the hp-health package on HP hardware with iLO cards. If this package is installed, you can check for ASR events with the following commands:

# hpasmcli -s "show asr"
# hpasmcli -s "show iml"

Additionally, some clustering software, including Red Hat's own, can use IPMI signals to fence unresponsive nodes. If the server in question has such hardware installed, investigating the related hardware logs and/or cluster logs can shed further light on reboot occurrences.

Failing Hardware

Should no evidence of the above be present, then the remaining piece of the equation to investigate is hardware. There have been previous cases where a bad motherboard, faulty CPU, or a failing Power Supply Unit has caused power to be lost to the machine causing a hard shutdown. This behaviour is entirely dependent on the hardware within the system, and performing full hardware diagnostics against the machine is generally the only method to rule this out as a possibility.
