Kubernetes Node-Problem-Detector Tutorial

Project: node-problem-detector. This is a place for various problem detectors running on the Kubernetes nodes. Project URL: https://gitcode.com/gh_mirrors/no/node-problem-detector

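Overview

Kubernetes Node-Problem-Detector (NPD) is a daemon that runs on every Kubernetes node, detects node problems, and reports them to the API server. It watches system logs, runs custom check scripts, and uses similar mechanisms so that the cluster management layer becomes aware of node-level problems.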

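Core Features

NPD combines several monitors (system log monitoring, system stats collection, custom plugin checks, and health checks) and reports what they find to the API server as Events or NodeConditions.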

Installation and Deployment

Deploying with a DaemonSet

NPD is usually deployed as a DaemonSet so that one instance runs on every node:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-problem-detector
  namespace: kube-system
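spec:
  # apps/v1 requires a selector that matches the pod template labels
  selector:
    matchLabels:
      app: node-problem-detector
  template:
    metadata:
      labels:
        app: node-problem-detector
    spec:
      # Run under the ServiceAccount defined in the RBAC section
      serviceAccountName: node-problem-detector
      containers: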
      - name: node-problem-detector
        image: registry.k8s.io/node-problem-detector/node-problem-detector:v0.8.19
        command:
        - /node-problem-detector
        - --logtostderr
        - --config.system-log-monitor=/config/kernel-monitor.json,/config/readonly-monitor.json,/config/docker-monitor.json
        securityContext:
          privileged: true
        volumeMounts:
        - name: log
          mountPath: /var/log
          readOnly: true
        - name: kmsg
          mountPath: /dev/kmsg
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: kmsg
        hostPath:
          path: /dev/kmsg

RBAC Configuration

NPD needs permission to create Events and update node status:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-problem-detector
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: node-problem-detector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node-problem-detector  # built-in ClusterRole; a role named node-problem-detector is not defined in this manifest
subjects:
- kind: ServiceAccount
  name: node-problem-detector
  namespace: kube-system

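Configuration Details

System Log Monitor Configuration

NPD supports several ways of watching logs. The following is an example kernel monitor configuration: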

{
  "plugin": "kmsg",
  "logPath": "/dev/kmsg",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "kernel-monitor",
  "conditions": [
    {
      "type": "KernelDeadlock",
      "reason": "KernelHasNoDeadlock",
      "message": "kernel has no deadlock"
    }
  ],
  "rules": [
    {
      "type": "temporary",
      "reason": "OOMKilling",
      "pattern": "Killed process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB.*"
    },
    {
      "type": "permanent",
      "condition": "KernelDeadlock",
      "reason": "DockerHung",
      "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
    }
  ]
}

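Rule Types

Rule type	Description	Kubernetes object
temporary	Transient problem, reported when the pattern matches	Event
permanent	Persistent problem that changes the state of the referenced condition	NodeCondition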
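To make the mapping concrete, the following is a minimal sketch of how a log line could be classified against the two sample rules from the kernel monitor configuration. It is not NPD's implementation: `classify_line` is a hypothetical helper, and the patterns are hand-translated to bash extended regular expressions (`\d` becomes `[0-9]`, `\w` becomes `[[:alnum:]_]`) because bash does not understand the original escapes.

```shell
#!/bin/bash
# Hypothetical helper, not part of NPD: classify one log line against
# ERE translations of the two sample rules shown in kernel-monitor.json.
classify_line() {
  local line="$1"
  # The original patterns use \d and \w; bash ERE needs bracket expressions.
  local oom_re='Killed process [0-9]+ (.+) total-vm:[0-9]+kB, anon-rss:[0-9]+kB, file-rss:[0-9]+kB.*'
  local hung_re='task docker:[[:alnum:]_]+ blocked for more than [[:alnum:]_]+ seconds\.'

  if [[ $line =~ $oom_re ]]; then
    # temporary rule: reported as an Event
    echo "temporary -> Event reason=OOMKilling"
  elif [[ $line =~ $hung_re ]]; then
    # permanent rule: changes the KernelDeadlock condition
    echo "permanent -> NodeCondition KernelDeadlock reason=DockerHung"
  else
    echo "no match"
  fi
}

classify_line "task docker:7 blocked for more than 120 seconds."
```

A helper like this can be handy for trying out a new pattern against captured log lines before adding the rule to a configuration file.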

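Supported Detector Types

Detector type	Function	Example configuration
SystemLogMonitor	Monitors system logs	kernel-monitor.json
SystemStatsMonitor	Collects system metrics	system-stats-monitor.json
CustomPluginMonitor	Runs custom plugin checks	custom-plugin-monitor.json
HealthChecker	Performs health checks	health-checker-*.json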

Custom Plugin Monitor

The custom plugin monitor lets users detect specific node problems with scripts:

Configuration Example

{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "30s",
    "timeout": "5s",
    "max_output_length": 80,
    "concurrency": 3
  },
  "source": "ntp-monitor",
  "conditions": [
    {
      "type": "NTPProblem",
      "reason": "NTPIsUp",
      "message": "ntp service is up"
    }
  ],
  "rules": [
    {
      "type": "permanent",
      "condition": "NTPProblem",
      "reason": "NTPIsDown",
      "path": "./config/plugin/check_ntp.sh",
      "timeout": "3s"
    }
  ]
}

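Example Plugin Script

#!/bin/bash
# check_ntp.sh - check whether the NTP service is running.
# Exit codes follow the custom plugin convention: 0 = OK, 1 = problem, 2 = unknown.

# Without systemctl the state cannot be determined, so report "unknown".
if ! command -v systemctl >/dev/null 2>&1; then
    echo "systemctl not found"
    exit 2
fi

if systemctl is-active --quiet ntp; then
    echo "NTP service is running"
    exit 0
else
    echo "NTP service is not running"
    exit 1
fi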
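Before wiring a plugin into NPD, it can help to run it locally and see how its result would be interpreted. The sketch below is a hypothetical test harness, not NPD code: `plugin_status` runs a command, maps exit code 0 to OK, 1 to NonOK, and anything else to Unknown, and truncates standard output to a given length in the spirit of the `max_output_length` setting.

```shell
#!/bin/bash
# Hypothetical local test harness, not part of NPD.
# Usage: plugin_status <max_output_length> <command> [args...]
plugin_status() {
  local max_len="$1"; shift
  local output code
  # Declare first, assign second, so $? reflects the plugin and not `local`.
  output="$("$@" 2>/dev/null)"
  code=$?
  case "$code" in
    0) echo "OK: ${output:0:max_len}" ;;
    1) echo "NonOK: ${output:0:max_len}" ;;
    *) echo "Unknown: ${output:0:max_len}" ;;
  esac
}

# Stand-in for a real plugin script such as ./config/plugin/check_ntp.sh
plugin_status 80 sh -c 'echo "NTP service is not running"; exit 1'
```

To check the real script, replace the `sh -c` stand-in with the path to the plugin.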

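Problem Detection and Reporting

Detection Flow

Each monitor checks its input (a log line or a plugin result) against its rules. A match on a temporary rule produces an Event, a match on a permanent rule updates the corresponding NodeCondition, and both are reported to the API server.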

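Common Problem Types

Problem type	Detection method	Severity
Kernel deadlock	System log monitor	Critical
Read-only filesystem	System log monitor	Critical
Docker hang	System log monitor	Critical
NTP service failure	Custom plugin	Medium
Hardware error	System log monitor	Critical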

Monitoring and Alerting

Prometheus Metrics

NPD exposes a Prometheus metrics endpoint (default port 20257):

# View the metrics
curl http://localhost:20257/metrics

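Key Metrics

Metric name	Type	Description
problem_gauge	Gauge	Whether a permanent problem currently affects the node (1) or not (0), labeled by type and reason
problem_counter	Counter	Number of times a problem has been observed, labeled by reason

Further metrics depend on which monitors are enabled, so confirm the exact names against the /metrics output of your NPD version.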
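For alerting experiments, it is often enough to filter the metrics text for gauge series whose value is 1. The sketch below assumes the Prometheus text exposition format and takes the metric name as a parameter, since the exact names should be checked against your own endpoint. `active_series` is a hypothetical helper and the sample input is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: read Prometheus text format on stdin and print the
# series of the given metric whose sample value is 1.
active_series() {
  local metric="$1"
  awk -v name="$metric" '
    # Keep lines that start with "<metric>{" and whose last field equals 1
    index($0, name "{") == 1 && $NF + 0 == 1 {
      sub(/ [^ ]+$/, "")   # drop the trailing sample value
      print
    }'
}

# Made-up sample input; real data would come from the metrics endpoint.
printf '%s\n' \
  '# TYPE problem_gauge gauge' \
  'problem_gauge{reason="DockerHung",type="KernelDeadlock"} 1' \
  'problem_gauge{reason="",type="ReadonlyFilesystem"} 0' |
  active_series problem_gauge
```

Against a live node, the same function could be fed from `curl -s http://localhost:20257/metrics`.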

Troubleshooting and Testing

Testing Problem Detection Manually

# Simulate a kernel oops by writing a test line to the kernel log
sudo sh -c "echo 'kernel: BUG: unable to handle kernel NULL pointer dereference at TESTING' >> /dev/kmsg"

# Watch the Events generated for Node objects
kubectl get events -w --field-selector involvedObject.kind=Node

Debugging with Logs

Enable verbose log output:

node-problem-detector --v=4 --logtostderr

Advanced Configuration

Running Multiple Monitors

NPD can run several monitors at the same time:

node-problem-detector \
  --config.system-log-monitor=config/kernel-monitor.json,config/docker-monitor.json \
  --config.custom-plugin-monitor=config/custom-plugin-monitor.json \
  --config.system-stats-monitor=config/system-stats-monitor.json
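Each flag takes a comma-separated list of configuration files, which is easy to get wrong when the list is assembled by hand in a wrapper script. The following sketch shows one way to build the flags. `join_configs` is a hypothetical helper, and the file paths are simply the ones used in the command above.

```shell
#!/bin/bash
# Hypothetical helper: join config file paths with commas for one flag.
join_configs() {
  local IFS=,
  echo "$*"   # "$*" joins the arguments with the first character of IFS
}

# Build the argument list as an array so each flag stays a single word.
args=(
  "--config.system-log-monitor=$(join_configs config/kernel-monitor.json config/docker-monitor.json)"
  "--config.custom-plugin-monitor=$(join_configs config/custom-plugin-monitor.json)"
)

# Print the flags; a wrapper script would pass "${args[@]}" to node-problem-detector.
printf '%s\n' "${args[@]}"
```

Keeping the flags in an array avoids word-splitting surprises if a path ever contains spaces.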

Building a Custom Version

Disable components you do not need to reduce resource usage:

BUILD_TAGS="disable_custom_plugin_monitor disable_system_stats_monitor" make

Best Practices

1. Resource Limits

resources:
  limits:
    cpu: 20m
    memory: 100Mi
  requests:
    cpu: 10m
    memory: 50Mi

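2. Monitoring Configuration

Review and update the detection rules regularly so that they cover the node problems that matter to you.

3. Alerting Integration

Feed the problems NPD detects into your existing monitoring and alerting system.

4. Custom Plugin Development

Develop dedicated detection plugins for the needs of your environment.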

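Frequently Asked Questions

Q: How are problems fixed automatically after NPD detects them?

A: NPD only detects and reports problems. Remediation requires pairing it with other tools such as Descheduler or mediK8S.

Q: How do I add a custom detection rule?

A: Edit the relevant JSON configuration file and add a new rule with its pattern.

Q: How much does NPD affect node performance?

A: NPD is designed to be lightweight and usually uses little CPU and memory.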

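Summary

Kubernetes Node-Problem-Detector is an important component of cluster operations. It detects node-level problems in several ways so that operations teams can find and handle node anomalies promptly. Configured and used well, NPD can improve cluster stability and reliability.

After working through this tutorial, you should be able to:

Understand how NPD works and how it is structured
Deploy and configure NPD correctly
Extend detection with custom plugins
Integrate NPD into an existing monitoring stack
Troubleshoot and analyze problems

Remember to adjust the configuration to your environment and to update the detection rules regularly so that they stay effective.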

Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.