Apache Hadoop Container Health Checks: Custom Scripts and Liveness Probe Configuration

Introduction: Health Check Pain Points of Containerized Hadoop

Have you ever run into a "zombie" Hadoop container? The process is alive but no longer responds to requests, the YARN ResourceManager loses contact with its NodeManagers, or an HDFS DataNode looks normal yet never takes part in block replication. These problems are especially common in containerized deployments. This article walks through building an enterprise-grade health check solution for Hadoop containers: with custom check scripts and liveness probe configuration you can work toward 3-second-level fault detection, a 99.99% service-availability target, and seamless integration with the Kubernetes ecosystem.

After reading this article you will know:

  • How to extract health indicators for the five categories of core Hadoop components
  • How to develop multi-dimensional check scripts in Bash and Python
  • Strategies for tuning Kubernetes liveness/readiness probe parameters
  • How to build a complete closed loop from health checks to automatic recovery
  • How to simulate and handle common production failure scenarios

Hadoop Container Health Check Architecture Design

Health Check Indicator System

Hadoop container health needs to be evaluated along four dimensions, which together form a complete monitoring matrix:

| Check dimension | Core metrics | Threshold | Weight | Tools |
|---|---|---|---|---|
| Process state | NameNode/JournalNode process presence | 1 instance running | 30% | pgrep, jps |
| Port connectivity | HDFS (9870), YARN (8088), JobHistoryServer (19888) | TCP handshake succeeds | 25% | nc, curl |
| Service availability | /jmx endpoint response time, RPC call latency | < 500 ms | 30% | wget, custom Java client |
| Resource state | Heap memory usage, disk I/O wait time | memory < 85%, I/O < 200 ms | 15% | jstat, /proc/diskstats |
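
As a quick illustration of the port-connectivity dimension above, the following sketch probes the default web UI ports from the table with nc. The host/port values are Hadoop defaults and are assumptions to be adjusted for your deployment.

#!/usr/bin/env bash
# port_check.sh - minimal port-connectivity sketch; adjust hosts/ports to your cluster
declare -A ENDPOINTS=(
    [namenode-http]="localhost:9870"
    [resourcemanager-http]="localhost:8088"
    [jobhistory-http]="localhost:19888"
)

status=0
for name in "${!ENDPOINTS[@]}"; do
    host="${ENDPOINTS[$name]%%:*}"
    port="${ENDPOINTS[$name]##*:}"
    # -z: connect without sending data; -w 3: 3-second timeout
    if nc -z -w 3 "${host}" "${port}"; then
        echo "OK: ${name} (${host}:${port}) is reachable"
    else
        echo "ERROR: ${name} (${host}:${port}) is not reachable"
        status=1
    fi
done
exit "${status}"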

Check Flow Design

(The original article shows a Mermaid flowchart of the overall check flow here; the diagram is not reproduced.)
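
In place of the diagram, the flow can be summarized as: environment pre-check, then process check, then port check, then service-level (JMX) check, short-circuiting on the first fatal failure. Below is a minimal wrapper sketch; the script names and the /scripts/health path are assumptions that match the examples later in this article.

#!/usr/bin/env bash
# healthcheck.sh - top-level check flow sketch; paths and script names are assumptions
SCRIPT_DIR="${HEALTH_SCRIPT_DIR:-/scripts/health}"

# Ordered checks, from cheapest to most expensive; stop at the first fatal failure
CHECKS=(
    "${SCRIPT_DIR}/hadoop_env_checks.sh"       # environment pre-check
    "${SCRIPT_DIR}/port_check.sh"              # port connectivity
    "${SCRIPT_DIR}/hdfs_namenode_health.py"    # service-level (JMX) check
)

for check in "${CHECKS[@]}"; do
    if [ ! -x "${check}" ]; then
        echo "WARNING: ${check} not found or not executable, skipping"
        continue
    fi
    if ! "${check}"; then
        echo "FATAL: ${check} failed"
        exit 1
    fi
done

echo "All health checks passed"
exit 0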

Custom Health Check Script Development

Basic Environment Validation Script

At container startup, Hadoop should run a set of pre-flight environment checks so that missing prerequisites are caught before they cause runtime failures:

#!/usr/bin/env bash
# hadoop_env_checks.sh - Pre-flight environment checks for Hadoop containers

set -eo pipefail

# Memory check (at least 4 GB of total memory)
MIN_MEM_KB=$((4 * 1024 * 1024))  # 4 GB in KiB
TOTAL_MEM=$(grep MemTotal /proc/meminfo | awk '{print $2}')
if [ "${TOTAL_MEM}" -lt "${MIN_MEM_KB}" ]; then
    echo "ERROR: Insufficient memory. Required: ${MIN_MEM_KB} KiB, Total: ${TOTAL_MEM} KiB"
    exit 1
fi

# Java runtime check
if ! command -v java &> /dev/null; then
    echo "ERROR: Java runtime not found in PATH"
    exit 1
fi

# Privilege check (do not run as root)
if [ "$(id -u)" -eq 0 ]; then
    echo "ERROR: Hadoop should not run as root user"
    exit 1
fi

# Configuration file completeness check
REQUIRED_CONFS=("core-site.xml" "hdfs-site.xml" "yarn-site.xml")
for conf in "${REQUIRED_CONFS[@]}"; do
    if [ ! -f "${HADOOP_CONF_DIR}/${conf}" ]; then
        echo "ERROR: Required configuration file missing: ${conf}"
        exit 1
    fi
done

echo "Environment check passed successfully"
exit 0
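
One way to wire this pre-check into a container is to run it from the entrypoint before exec-ing the daemon. This is a sketch, not the official apache/hadoop entrypoint; the /scripts/health path is an assumption.

#!/usr/bin/env bash
# entrypoint.sh - sketch of an entrypoint that runs the pre-check before the daemon
set -euo pipefail

# Abort startup early if the environment pre-check fails
/scripts/health/hadoop_env_checks.sh

# Start the requested daemon in the foreground
case "${1:-namenode}" in
    namenode)  exec hdfs namenode ;;
    datanode)  exec hdfs datanode ;;
    *)         exec "$@" ;;
esac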

HDFS NameNode Health Check Script

A production-grade health check script for the HDFS NameNode that evaluates several dimensions of its state:

#!/usr/bin/env python3
# hdfs_namenode_health.py - Advanced NameNode health checker

import os
import sys
import time
import requests
from xml.etree import ElementTree as ET

HADOOP_CONF_DIR = os.getenv('HADOOP_CONF_DIR', '/etc/hadoop/conf')
CHECK_INTERVAL = 2  # Seconds between retries
MAX_RETRIES = 3     # Max retries for transient failures

def get_namenode_http_address():
    """从XML配置中获取NameNode HTTP地址"""
    try:
        tree = ET.parse(f"{HADOOP_CONF_DIR}/hdfs-site.xml")
        root = tree.getroot()
        
        addr_property = root.find(".//property[name='dfs.namenode.http-address']")
        if addr_property is not None:
            return addr_property.find("value").text
            
        # Fall back to the HA configuration
        nameservices = root.find(".//property[name='dfs.nameservices']")
        if nameservices is not None:
            ns = nameservices.find("value").text
            nn_ids = root.find(f".//property[name='dfs.ha.namenodes.{ns}']")
            if nn_ids is not None:
                nn_id = nn_ids.find("value").text.split(",")[0].strip()
                ha_addr = root.find(f".//property[name='dfs.namenode.http-address.{ns}.{nn_id}']")
                if ha_addr is not None:
                    return ha_addr.find("value").text
                    
        # Fall back to the default address as a last resort
        return "localhost:9870"
        
    except Exception as e:
        print(f"Error parsing configuration: {str(e)}", file=sys.stderr)
        sys.exit(1)

def check_namenode_health():
    """Run the NameNode health checks and return a Nagios-style exit code."""
    nn_http_addr = get_namenode_http_address()
    base_url = f"http://{nn_http_addr}/jmx"

    # 1. Check the NameNode HA state (active/standby)
    try:
        response = requests.get(f"{base_url}?qry=Hadoop:service=NameNode,name=NameNodeStatus",
                                timeout=5)
        response.raise_for_status()
        beans = response.json().get("beans", [])

        # The JMX servlet returns each bean as a flat dict of attribute -> value
        status_bean = next(b for b in beans
                           if b["name"] == "Hadoop:service=NameNode,name=NameNodeStatus")
        state = status_bean.get("State")

        if state not in ["active", "standby"]:
            print(f"CRITICAL: NameNode state is {state} (expected active/standby)")
            return 2
    except Exception as e:
        print(f"CRITICAL: Failed to check NameNode status: {str(e)}", file=sys.stderr)
        return 2

    # 2. Check JVM heap usage
    try:
        response = requests.get(f"{base_url}?qry=java.lang:type=Memory", timeout=5)
        response.raise_for_status()
        beans = response.json().get("beans", [])

        mem_bean = next(b for b in beans if b["name"] == "java.lang:type=Memory")
        heap = mem_bean["HeapMemoryUsage"]
        heap_usage = (heap["used"] / heap["max"]) * 100

        if heap_usage > 85:
            print(f"WARNING: High heap memory usage: {heap_usage:.1f}%")
            # Warning only; do not fail the probe
    except Exception as e:
        print(f"WARNING: Failed to check JVM memory: {str(e)}", file=sys.stderr)

    # 3. Check Safe Mode status ("Safemode" on the NameNodeInfo bean is an empty
    #    string when safe mode is off)
    try:
        response = requests.get(f"{base_url}?qry=Hadoop:service=NameNode,name=NameNodeInfo",
                                timeout=5)
        response.raise_for_status()
        beans = response.json().get("beans", [])

        info_bean = next(b for b in beans
                         if b["name"] == "Hadoop:service=NameNode,name=NameNodeInfo")
        safe_mode = info_bean.get("Safemode", "")

        if safe_mode:
            print("CRITICAL: NameNode is in Safe Mode")
            return 2
    except Exception as e:
        print(f"CRITICAL: Failed to check Safe Mode status: {str(e)}", file=sys.stderr)
        return 2

    # 4. Check the number of live DataNodes
    try:
        response = requests.get(f"{base_url}?qry=Hadoop:service=NameNode,name=FSNamesystem",
                                timeout=5)
        response.raise_for_status()
        beans = response.json().get("beans", [])

        fs_bean = next(b for b in beans
                       if b["name"] == "Hadoop:service=NameNode,name=FSNamesystem")
        live_datanodes = fs_bean.get("NumLiveDataNodes", 0)

        # Assumes a minimum of 3 DataNodes; adjust to the actual cluster size
        if live_datanodes < 3:
            print(f"WARNING: Low number of live DataNodes: {live_datanodes}")
    except Exception as e:
        print(f"WARNING: Failed to check DataNode count: {str(e)}", file=sys.stderr)

    print("OK: NameNode is healthy")
    return 0

# Main entry point
if __name__ == "__main__":
    exit_code = 2  # Default to the error state

    # Run the check with retries to ride out transient failures
    for attempt in range(MAX_RETRIES):
        exit_code = check_namenode_health()
        if exit_code == 0:
            break
        if attempt < MAX_RETRIES - 1:
            print(f"Retrying health check (attempt {attempt + 1}/{MAX_RETRIES})...", file=sys.stderr)
            time.sleep(CHECK_INTERVAL)
    
    sys.exit(exit_code)
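
To verify the same JMX beans by hand (for instance when the checker reports CRITICAL), you can query the servlet directly with curl. The bean names match the ones used in the script; port 9870 is the default NameNode HTTP port, and the /scripts/health path is an assumption.

# Manual spot checks against the NameNode JMX servlet
curl -s "http://localhost:9870/jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus" | python3 -m json.tool
curl -s "http://localhost:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem" | python3 -m json.tool
curl -s "http://localhost:9870/jmx?qry=java.lang:type=Memory" | python3 -m json.tool

# Run the checker itself and inspect its exit code (0 = healthy, 2 = critical)
python3 /scripts/health/hdfs_namenode_health.py; echo "exit code: $?"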

YARN ResourceManager Health Check Script

#!/usr/bin/env bash
# yarn_rm_health.sh - YARN ResourceManager health check script

set -eo pipefail

# Configuration
YARN_CONF_DIR="${HADOOP_CONF_DIR:-/etc/hadoop/conf}"
RM_HTTP_PORT="8088"
TIMEOUT=5
RETRIES=3
SLEEP_BETWEEN_RETRIES=2

# Read the ResourceManager web address from the configuration
get_rm_address() {
    local rm_address
    # Try yarn-site.xml first
    rm_address=$(xmllint --xpath \
        "//property[name='yarn.resourcemanager.webapp.address']/value/text()" \
        "${YARN_CONF_DIR}/yarn-site.xml" 2>/dev/null || true)
    
    if [ -z "${rm_address}" ]; then
        # Check the HA configuration
        rm_ids=$(xmllint --xpath \
            "//property[name='yarn.resourcemanager.ha.rm-ids']/value/text()" \
            "${YARN_CONF_DIR}/yarn-site.xml" 2>/dev/null || true)
        
        if [ -n "${rm_ids}" ]; then
            # Use the first RM ID
            local first_rm_id=$(echo "${rm_ids}" | cut -d',' -f1 | xargs)
            rm_address=$(xmllint --xpath \
                "//property[name='yarn.resourcemanager.webapp.address.${first_rm_id}']/value/text()" \
                "${YARN_CONF_DIR}/yarn-site.xml" 2>/dev/null || true)
        fi
    fi
    
    # Fall back to the default if nothing was found
    echo "${rm_address:-localhost:${RM_HTTP_PORT}}"
}

# Check whether the ResourceManager web UI responds
check_webui_health() {
    local rm_address=$1
    # Any HTTP 200 from the cluster info endpoint counts as "web UI alive"
    local health_url="http://${rm_address}/ws/v1/cluster/info"
    
    for ((attempt=1; attempt<=RETRIES; attempt++)); do
        if curl -s -w "%{http_code}" -o /dev/null --connect-timeout "${TIMEOUT}" "${health_url}" | grep -q "200"; then
            return 0
        fi
        if [ "${attempt}" -lt "${RETRIES}" ]; then
            sleep "${SLEEP_BETWEEN_RETRIES}"
        fi
    done
    
    echo "ERROR: ResourceManager Web UI is not responding at ${health_url}"
    return 1
}

# Check the cluster state
check_cluster_state() {
    local rm_address=$1
    local cluster_url="http://${rm_address}/ws/v1/cluster/info"
    
    local response
    response=$(curl -s --connect-timeout "${TIMEOUT}" "${cluster_url}")
    
    local state
    state=$(echo "${response}" | jq -r '.clusterInfo.state' 2>/dev/null || true)
    
    if [ "${state}" = "STARTED" ]; then
        return 0
    fi
    
    echo "ERROR: ResourceManager cluster state is '${state}' (expected 'STARTED')"
    return 1
}

# Check the number of active NodeManagers
check_nodemanagers() {
    local rm_address=$1
    local nm_url="http://${rm_address}/ws/v1/cluster/nodes"
    
    local response
    response=$(curl -s --connect-timeout "${TIMEOUT}" "${nm_url}")
    
    local active_nms
    active_nms=$(echo "${response}" | jq -r '.nodes.node[] | select(.state=="RUNNING") | .id' 2>/dev/null | wc -l)
    
    # Assumes at least 2 active NodeManagers; adjust to your environment
    if [ "${active_nms}" -ge 2 ]; then
        echo "INFO: ${active_nms} active NodeManagers detected"
        return 0
    fi
    
    echo "WARNING: Only ${active_nms} active NodeManagers detected (minimum 2 required)"
    return 1  # 非致命警告,返回0但记录警告
}

# Main check flow
main() {
    local rm_address=$(get_rm_address)
    echo "INFO: Checking ResourceManager at ${rm_address}"
    
    # Web UI availability
    if ! check_webui_health "${rm_address}"; then
        exit 1
    fi
    
    # Cluster state
    if ! check_cluster_state "${rm_address}"; then
        exit 1
    fi
    
    # NodeManager count (warning only)
    check_nodemanagers "${rm_address}" || true  # non-fatal
    
    echo "OK: ResourceManager is healthy"
    exit 0
}

main "$@"

Probe Configuration in Kubernetes

Deployment Configuration Example

The following Kubernetes Deployment for a Hadoop NameNode includes complete liveness and readiness probe settings:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hadoop-namenode
  namespace: hadoop
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hadoop-namenode
  template:
    metadata:
      labels:
        app: hadoop-namenode
    spec:
      containers:
      - name: namenode
        image: apache/hadoop:3.3.6
        command: ["/entrypoint.sh"]
        args: ["namenode"]
        ports:
        - containerPort: 9870
          name: http
        - containerPort: 8020
          name: rpc
        env:
        - name: HADOOP_CONF_DIR
          value: /etc/hadoop/conf
        - name: JAVA_HOME
          value: /usr/lib/jvm/java-8-openjdk-amd64
        resources:
          requests:
            memory: "4Gi"
            cpu: "2"
          limits:
            memory: "8Gi"
            cpu: "4"
        volumeMounts:
        - name: hadoop-conf
          mountPath: /etc/hadoop/conf
        - name: namenode-data
          mountPath: /hadoop/dfs/name
        # Mount the health check scripts
        - name: health-scripts
          mountPath: /scripts/health
        # Liveness probe
        livenessProbe:
          exec:
            command: ["/scripts/health/hdfs_namenode_health.py"]
          initialDelaySeconds: 180  # Allow ample startup time
          periodSeconds: 10         # Check every 10 seconds
          timeoutSeconds: 5         # 5-second timeout
          successThreshold: 1       # One success marks the probe as passing
          failureThreshold: 3       # Three consecutive failures trigger a restart
        # Readiness probe
        readinessProbe:
          httpGet:
            path: /jmx?qry=Hadoop:service=NameNode,name=NameNodeStatus
            port: 9870
          initialDelaySeconds: 60
          periodSeconds: 5
          timeoutSeconds: 3
          successThreshold: 1
          failureThreshold: 2
        # Startup probe (for slow cold starts)
        startupProbe:
          exec:
            command: ["pgrep", "-f", "NameNode"]
          initialDelaySeconds: 30
          periodSeconds: 10
          failureThreshold: 12      # 12 x 10 s = 120 s startup budget
      volumes:
      - name: hadoop-conf
        configMap:
          name: hadoop-config
      - name: namenode-data
        persistentVolumeClaim:
          claimName: namenode-pvc
      - name: health-scripts
        configMap:
          name: hadoop-health-scripts
          defaultMode: 0755  # Make the scripts executable
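
One way to publish the check scripts as the hadoop-health-scripts ConfigMap referenced above is sketched below; the file names are the ones used in this article and should be adapted to your layout.

# Create or update the ConfigMap that holds the health check scripts
kubectl -n hadoop create configmap hadoop-health-scripts \
    --from-file=hadoop_env_checks.sh \
    --from-file=hdfs_namenode_health.py \
    --from-file=yarn_rm_health.sh \
    --dry-run=client -o yaml | kubectl apply -f -

# Restart the Deployment so that the pods pick up the new scripts
kubectl -n hadoop rollout restart deployment/hadoop-namenode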

Probe Parameter Tuning Guide

Hadoop components differ considerably in startup time and steady-state behavior, so probe parameters should be tuned per component:

| Component | Startup time (s) | initialDelaySeconds | periodSeconds | failureThreshold | Check method |
|---|---|---|---|---|---|
| NameNode | 60-180 | 180 | 10 | 3 | Custom script |
| DataNode | 45-90 | 90 | 8 | 3 | Custom script |
| ResourceManager | 90-240 | 240 | 15 | 3 | Custom script |
| NodeManager | 30-60 | 60 | 5 | 3 | HTTP endpoint |
| HistoryServer | 30-60 | 60 | 10 | 2 | HTTP endpoint |

Probe tuning principles (a sample HTTP-endpoint probe follows this list):

  1. Set initialDelaySeconds to roughly 1.5x the component's worst-case startup time
  2. For periodSeconds, use 10-15 s for core components (NameNode/ResourceManager); non-critical components can be relaxed to 30 s
  3. failureThreshold multiplied by periodSeconds should exceed the longest transient disruption the component may experience
  4. For memory-intensive operations (such as HDFS block balancing), temporarily relax thresholds to avoid false restarts
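
For the HTTP-endpoint components in the table, the probe can be a simple curl against the daemon's web port. Below is a sketch for a NodeManager; the default web port 8042 and the /ws/v1/node/info REST path are assumptions to verify against your configuration.

#!/usr/bin/env bash
# nm_http_probe.sh - minimal HTTP-endpoint probe for a NodeManager (sketch)
NM_WEB_ADDR="${NM_WEB_ADDR:-localhost:8042}"

# -f: treat HTTP 4xx/5xx as failure; --max-time 3: 3-second overall ceiling
if curl -fs --max-time 3 "http://${NM_WEB_ADDR}/ws/v1/node/info" >/dev/null; then
    echo "OK: NodeManager web endpoint is responding"
    exit 0
fi

echo "ERROR: NodeManager web endpoint is not responding at ${NM_WEB_ADDR}"
exit 1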

Advanced Health Check Implementations

Distributed Health Check Architecture

For large Hadoop clusters, the health checking itself should be distributed so that it does not become a single point of failure:

(The original article shows a Mermaid diagram of the distributed health check architecture here; the diagram is not reproduced.)

Integrating Health Checks with Automatic Recovery

Health-status monitoring and automated recovery can be driven through Prometheus and Alertmanager:

# prometheus.rules.yml - Prometheus alerting rules
groups:
- name: hadoop_health
  rules:
  - alert: NameNodeUnhealthy
    expr: hadoop_namenode_health_status{status="healthy"} == 0
    for: 30s
    labels:
      severity: critical
      component: namenode
    annotations:
      summary: "NameNode健康检查失败"
      description: "NameNode {{ $labels.instance }} 连续3次健康检查失败"
      runbook_url: "https://wiki.example.com/hadoop/runbooks/namenode_unhealthy"

  - alert: ResourceManagerLowNodeManagers
    expr: hadoop_yarn_nodemanagers_active < 3
    for: 5m
    labels:
      severity: warning
      component: resourcemanager
    annotations:
      summary: "活跃NodeManager数量不足"
      description: "当前活跃NodeManager数量为{{ $value }}, 低于阈值3"

Production Failure Handling Case Studies

Case 1: Diagnosing a "Zombie" NameNode

Symptom: the NameNode process exists but does not respond to client requests, and the health check script wrongly reports it as healthy.

Root cause: a NameNode heap overflow triggered a GC storm; the process stayed alive but could not serve requests.

Solution:

  1. Extend the health check script with JVM GC monitoring:
# Addition to the NameNode health check script
check_jvm_gc() {
    local pid
    pid=$(pgrep -f NameNode)
    if [ -z "${pid}" ]; then
        echo "ERROR: NameNode process not found"
        return 1
    fi

    # jstat reports cumulative GC counts since JVM start, so sample the Full GC
    # counter twice and compare; the column is located by its "FGC" header to stay
    # compatible with different JDK/jstat layouts. Note: the probe timeout must
    # allow for the 10-second sampling window.
    local fgc_before fgc_after
    fgc_before=$(jstat -gc "${pid}" | awk 'NR==1{for(i=1;i<=NF;i++)if($i=="FGC")c=i} NR==2{print $c}')
    sleep 10
    fgc_after=$(jstat -gc "${pid}" | awk 'NR==1{for(i=1;i<=NF;i++)if($i=="FGC")c=i} NR==2{print $c}')

    # More than one Full GC in the 10-second window (roughly the original
    # "20 per 5 minutes" rate) indicates a GC storm
    if [ "$((fgc_after - fgc_before))" -gt 1 ]; then
        echo "ERROR: Excessive Full GC detected ($((fgc_after - fgc_before)) in 10s)"
        return 1
    fi

    return 0
}
  2. Tune the JVM parameters: increase the heap and optimize the GC policy:
-XX:+UseG1GC 
-XX:MaxGCPauseMillis=200 
-XX:InitiatingHeapOccupancyPercent=70 
-Xms8g 
-Xmx8g

Case 2: DataNode Disk I/O Congestion

Symptom: the DataNode process is healthy, but disk I/O congestion causes read and write timeouts.

Solution: add a disk I/O check to the health script:

check_disk_io() {
    local disk=$1
    local max_util=90  # Device utilization (%util) above this is treated as saturated

    # Take a 1-second iostat sample and read %util, the last field of the extended
    # device statistics, for the given device (more portable across sysstat versions
    # than fixed column positions)
    local util
    util=$(iostat -x -k 1 2 "${disk}" | grep "^${disk}" | tail -n1 | awk '{print $NF}')

    if (( $(echo "${util} > ${max_util}" | bc -l) )); then
        echo "ERROR: High disk utilization on ${disk}: ${util}%"
        return 1
    fi

    return 0
}
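
A possible way to use this in a DataNode probe is sketched below; the device list is an assumption and should be derived from the devices backing dfs.datanode.data.dir in your deployment.

# Check every device backing the DataNode data directories
DATA_DEVICES=("sda" "sdb")   # hypothetical device names; derive from dfs.datanode.data.dir

overall=0
for dev in "${DATA_DEVICES[@]}"; do
    check_disk_io "${dev}" || overall=1
done
exit "${overall}"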

Summary and Best Practices

Health Check Implementation Checklist

  1. Environment preparation

    •  Make sure all health check scripts are executable
    •  Open the JMX ports so monitoring metrics can be collected
    •  Set an appropriate log level so health checks do not add noise
  2. Script development

    •  Check multiple dimensions to avoid false positives from a single metric
    •  Emit detailed log output to simplify troubleshooting
    •  Define explicit exit codes for every script
    •  Add retries around the critical check points
  3. Deployment configuration

    •  Use different check logic for liveness and readiness probes
    •  Tune probe parameters to each component's characteristics
    •  Monitor and alert on the probes themselves
  4. Maintenance and optimization

    •  Periodically review whether the check logic is still effective
    •  Adjust thresholds as the cluster grows or shrinks
    •  Record the events triggered by health checks and keep optimizing

Looking Ahead

Containerized Hadoop health checking is moving toward more intelligent approaches:

  1. AI-driven predictive maintenance: train models on historical data to predict node health ahead of time
  2. Adaptive thresholds: automatically adjust health check thresholds based on cluster load
  3. Distributed consensus checks: cross-validate results across nodes to improve accuracy
  4. Service mesh integration: use Istio or another service mesh for finer-grained health management

With the health check approach described in this article, the stability and reliability of containerized Hadoop deployments can be improved significantly. The keys are to design comprehensive check dimensions around the characteristics of each Hadoop component and to integrate deeply with the container orchestration platform, building an automated loop of fault detection and recovery.


Statement: parts of this article were produced with AI assistance (AIGC) and are provided for reference only.
