NodeManager in unhealthy state

This post deals with an Unhealthy alert in an Ambari-managed cluster: the NodeManager process has not died, yet no MapReduce job can run. The jobs are blocked because the disks are running out of space, and the fix is to adjust the disk health check settings in yarn-site.xml or to free up disk space.

Unhealthy Node: "local-dirs and log-dirs are bad"

This alert showed up in the Ambari-managed cluster. The NodeManager had not gone down, but every MapReduce job was stuck and made no progress. Some logs also contained the following error:

2017-02-20 15:45:07,920 ERROR [WALProcedureStoreSyncThread] wal.WALProcedureStore: Unable to close the stream
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /apps/hbase/data/MasterProcWALs/state-00000000000000000058.log (inode 45740): File does not exist. Holder DFSClient_NONMAPREDUCE_-96565186_1 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3597)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3400)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3256)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:677)
at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:213)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:485)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
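
Before changing anything, it helps to confirm from the command line which NodeManagers YARN considers unhealthy and to read their health report. The commands below are the standard YARN CLI; the node id in the last call is a placeholder, substitute one reported by your own cluster.

# List every node with its state; UNHEALTHY nodes are excluded from
# scheduling, which is why accepted jobs never get containers.
yarn node -list -all

# Show only the unhealthy nodes.
yarn node -list -states UNHEALTHY

# Full node report, including the Health-Report line that carries the
# "local-dirs are bad / log-dirs are bad" message.
yarn node -status <nodemanager-host>:<port>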

In the YARN All Applications page the jobs showed as ACCEPTED but never ran and were never assigned a job URL. The root cause was that the disks were almost full, which is governed by the following setting in yarn-site.xml:

<property>
        <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
        <value>98.5</value>
</property>
Try adding the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage to yarn-site.xml. It specifies the maximum percentage of disk space utilization allowed, after which the disk is marked as bad; values can range from 0.0 to 100.0.
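
Besides the utilization percentage, the NodeManager disk health checker has a few companion settings worth reviewing. The property names and defaults below come from yarn-default.xml (verify them against your Hadoop version), and the /etc/hadoop/conf path assumes a typical Ambari/HDP layout.

# Show the disk health checker settings currently in effect on a NodeManager host.
grep -B1 -A2 'disk-health-checker' /etc/hadoop/conf/yarn-site.xml

# Related properties and their yarn-default.xml defaults:
#   yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage  -> 90.0
#   yarn.nodemanager.disk-health-checker.min-free-space-per-disk-mb                -> 0
#   yarn.nodemanager.disk-health-checker.min-healthy-disks                         -> 0.25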

You can either free up disk space or temporarily raise this utilization threshold. Raising the threshold does not address the root cause, though; the proper fix is to clean up the disks or add capacity.
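
A minimal cleanup pass might look like the sketch below. The /hadoop/yarn/local and /hadoop/yarn/log paths are only the common Ambari/HDP defaults for yarn.nodemanager.local-dirs and yarn.nodemanager.log-dirs; substitute whatever your yarn-site.xml actually points at. Freeing space alone is usually enough, since the health checker re-runs periodically and marks the disks good again once they drop below the threshold; a NodeManager restart is only needed if you changed yarn-site.xml.

# How full are the YARN local and log dirs on this NodeManager host?
df -h /hadoop/yarn/local /hadoop/yarn/log

# Largest offenders under the log dir, typically old container logs.
du -sh /hadoop/yarn/log/* | sort -rh | head -20

# Example: remove per-application log directories older than 7 days
# (make sure log aggregation has already shipped anything you still need).
find /hadoop/yarn/log -mindepth 1 -maxdepth 1 -type d -name 'application_*' -mtime +7 -exec rm -rf {} +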
