Fixing two startup failures: the YARN ResourceManager won't start, and the HRegionServer dies right after starting

Problem 1: When starting YARN in Hadoop 2.2.0, the ResourceManager process never comes up.

Check the tail of the log file: tail -n 50 yarn-dell-resourcemanager-master1.log

The following exception appears:

2016-09-09 14:41:09,341 INFO org.apache.hadoop.service.AbstractService: Service ResourceManager failed in state STARTED; cause: org.apache.hadoop.yarn.webapp.WebAppException: Error starting http server
org.apache.hadoop.yarn.webapp.WebAppException: Error starting http server
    at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:262)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:623)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:655)
    at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.main(ResourceManager.java:872)
Caused by: java.net.BindException: Port in use: 192.168.1.120:8088
    at org.apache.hadoop.http.HttpServer.openListener(HttpServer.java:742)
    at org.apache.hadoop.http.HttpServer.start(HttpServer.java:686)
    at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:257)
    ... 4 more
Caused by: java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind0(Native Method)
    at sun.nio.ch.Net.bind(Net.java:444)
    at sun.nio.ch.Net.bind(Net.java:436)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnector.java:216)
    at org.apache.hadoop.http.HttpServer.openListener(HttpServer.java:738)
    ... 6 more
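The chain of "Caused by" entries above bottoms out in "Address already in use" on 192.168.1.120:8088, the ResourceManager web UI port: some process (usually a stale ResourceManager left over from an earlier start) is still bound to it. A quick way to confirm before killing anything, sketched below; the filter is pure text processing over netstat-style output, so only the commented commands need the live host:

```shell
# port_listening: filter `netstat -tlnp`-style output down to LISTEN
# sockets on a given port. Pure text processing, testable offline.
port_listening() {
  # $1 = port number, stdin = netstat output
  grep "LISTEN" | grep ":$1 "
}

# On the master node:
#   netstat -tlnp | port_listening 8088   # shows the PID holding 8088
#   # or: lsof -i :8088
```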

Solution:

1. ps aux | grep -i resourcemanager — check how many ResourceManager processes are running on the master host.

2. Kill the stale process with kill -9 <RESOURCE_MANAGER_PID>.

3. Restart YARN from the sbin directory:

   ./stop-yarn.sh && ./start-yarn.sh

The resourcemanager process should now appear on the master node.
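The kill-and-restart sequence can be sketched as a small script. The YARN commands themselves need the live cluster, so they are shown as comments; the PID extraction is factored into a pure text helper. $HADOOP_HOME and the script names are the standard Hadoop 2.x layout.

```shell
# rm_pid: extract the PID (second column of `ps aux` output) of the
# first line matching a pattern; pure text processing.
rm_pid() {
  # $1 = pattern, stdin = ps aux output
  grep -i "$1" | grep -v grep | awk '{print $2}' | head -n 1
}

# On the master node (requires the live cluster; $HADOOP_HOME assumed set):
#   pid=$(ps aux | rm_pid resourcemanager)
#   [ -n "$pid" ] && kill -9 "$pid"
#   cd "$HADOOP_HOME/sbin" && ./stop-yarn.sh && ./start-yarn.sh
#   jps | grep ResourceManager   # should now list the process
```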


Problem 2: Sometimes the HRegionServer dies shortly after being started. Check the HBase startup log:

dell@master1:/usr/local/hbase-0.98.7-hadoop2/logs$ tail -n 100  hbase-dell-regionserver-master1.log
    at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:1286)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:862)
    at java.lang.Thread.run(Thread.java:745)
2017-01-12 10:02:23,347 FATAL [regionserver60020] regionserver.HRegionServer: ABORTING region server master1,60020,1484186540447: Unhandled: Cannot create directory /hbase/WALs/master1,60020,1484186540447. Name node is in safe mode.
Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3355)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3330)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:724)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:502)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59598)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /hbase/WALs/master1,60020,1484186540447. Name node is in safe mode.
Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE:  If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirsInt(FSNamesystem.java:3355)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.mkdirs(FSNamesystem.java:3330)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.mkdirs(NameNodeRpcServer.java:724)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.mkdirs(ClientNamenodeProtocolServerSideTranslatorPB.java:502)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59598)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)

    at org.apache.hadoop.ipc.Client.call(Client.java:1347)
    at org.apache.hadoop.ipc.Client.call(Client.java:1300)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
    at com.sun.proxy.$Proxy14.mkdirs(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at com.sun.proxy.$Proxy14.mkdirs(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.mkdirs(ClientNamenodeProtocolTranslatorPB.java:467)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:294)
    at com.sun.proxy.$Proxy15.mkdirs(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.primitiveMkdir(DFSClient.java:2394)
    at org.apache.hadoop.hdfs.DFSClient.mkdirs(DFSClient.java:2365)
    at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:817)
    at org.apache.hadoop.hdfs.DistributedFileSystem$16.doCall(DistributedFileSystem.java:813)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirsInternal(DistributedFileSystem.java:813)
    at org.apache.hadoop.hdfs.DistributedFileSystem.mkdirs(DistributedFileSystem.java:806)
    at org.apache.hadoop.fs.FileSystem.mkdirs(FileSystem.java:1933)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.<init>(FSHLog.java:408)
    at org.apache.hadoop.hbase.regionserver.wal.FSHLog.<init>(FSHLog.java:334)
    at org.apache.hadoop.hbase.regionserver.wal.HLogFactory.createHLog(HLogFactory.java:58)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.instantiateHLog(HRegionServer.java:1552)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.setupWALAndReplication(HRegionServer.java:1531)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.handleReportForDutyResponse(HRegionServer.java:1286)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:862)
    at java.lang.Thread.run(Thread.java:745)
2017-01-12 10:02:23,350 FATAL [regionserver60020] regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2017-01-12 10:02:23,367 INFO  [regionserver60020] ipc.RpcServer: Stopping server on 60020
2017-01-12 10:02:23,368 INFO  [regionserver60020] regionserver.HRegionServer: Stopping infoServer
2017-01-12 10:02:23,373 INFO  [regionserver60020] mortbay.log: Stopped SelectChannelConnector@0.0.0.0:60030
2017-01-12 10:02:23,475 INFO  [regionserver60020] snapshot.RegionServerSnapshotManager: Stopping RegionServerSnapshotManager abruptly.
2017-01-12 10:02:23,475 INFO  [regionserver60020] regionserver.HRegionServer: aborting server master1,60020,1484186540447
2017-01-12 10:02:23,475 DEBUG [regionserver60020] catalog.CatalogTracker: Stopping catalog tracker org.apache.hadoop.hbase.catalog.CatalogTracker@58465d50
2017-01-12 10:02:23,475 INFO  [regionserver60020] client.HConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x358d3e5582442fb
2017-01-12 10:02:23,485 INFO  [regionserver60020] zookeeper.ZooKeeper: Session: 0x358d3e5582442fb closed
2017-01-12 10:02:23,485 INFO  [regionserver60020-EventThread] zookeeper.ClientCnxn: EventThread shut down
2017-01-12 10:02:23,488 INFO  [regionserver60020] regionserver.HRegionServer: stopping server master1,60020,1484186540447; all regions closed.
2017-01-12 10:02:23,588 INFO  [regionserver60020] regionserver.Leases: regionserver60020 closing leases
2017-01-12 10:02:23,588 INFO  [regionserver60020] regionserver.Leases: regionserver60020 closed leases
2017-01-12 10:02:23,589 INFO  [regionserver60020] regionserver.CompactSplitThread: Waiting for Split Thread to finish...
2017-01-12 10:02:23,589 INFO  [regionserver60020] regionserver.CompactSplitThread: Waiting for Merge Thread to finish...
2017-01-12 10:02:23,589 INFO  [regionserver60020] regionserver.CompactSplitThread: Waiting for Large Compaction Thread to finish...
2017-01-12 10:02:23,589 INFO  [regionserver60020] regionserver.CompactSplitThread: Waiting for Small Compaction Thread to finish...
2017-01-12 10:02:23,636 INFO  [regionserver60020] zookeeper.ZooKeeper: Session: 0x558d3e6026242f9 closed
2017-01-12 10:02:23,636 INFO  [regionserver60020-EventThread] zookeeper.ClientCnxn: EventThread shut down
2017-01-12 10:02:23,636 INFO  [regionserver60020] regionserver.HRegionServer: stopping server master1,60020,1484186540447; zookeeper connection closed.
2017-01-12 10:02:23,636 INFO  [regionserver60020] regionserver.HRegionServer: regionserver60020 exiting
2017-01-12 10:02:23,636 ERROR [main] regionserver.HRegionServerCommandLine: Region server exiting
java.lang.RuntimeException: HRegionServer Aborted
    at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.start(HRegionServerCommandLine.java:66)
    at org.apache.hadoop.hbase.regionserver.HRegionServerCommandLine.run(HRegionServerCommandLine.java:85)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:126)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.main(HRegionServer.java:2489)
2017-01-12 10:02:23,639 INFO  [Thread-10] regionserver.ShutdownHook: Shutdown hook starting; hbase.shutdown.hook=true; fsShutdownHook=org.apache.hadoop.fs.FileSystem$Cache$ClientFinalizer@68ee3eb2
2017-01-12 10:02:23,640 INFO  [Thread-10] regionserver.ShutdownHook: Starting fs shutdown hook thread.
2017-01-12 10:02:23,641 INFO  [Thread-10] regionserver.ShutdownHook: Shutdown hook finished.
Solution:

1. hdfs dfsadmin -safemode leave — take the NameNode out of safe mode. Note that the log above warns resources on the NameNode are low: free up disk space first, or the NameNode will immediately re-enter safe mode.

2. Then start the regionservers.

Start every regionserver in the cluster:

./hbase-daemons.sh start regionserver

Or start a single regionserver:

./hbase-daemon.sh start regionserver

3. Check the HBase web UI at
http://192.168.1.120:60010/master-status
to see how many Region Servers are alive.
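The check-then-restart sequence above can be sketched as follows. The hdfs and hbase-daemon commands need the live cluster, so they appear as comments; the safe-mode status parser is a pure helper over the text that hdfs dfsadmin -safemode get prints ("Safe mode is ON" / "Safe mode is OFF").

```shell
# safemode_on: succeed if the `hdfs dfsadmin -safemode get` output
# on stdin reports that safe mode is ON.
safemode_on() {
  grep -q 'Safe mode is ON'
}

# On the cluster:
#   if hdfs dfsadmin -safemode get | safemode_on; then
#       df -h                            # free disk space on the NN first,
#       hdfs dfsadmin -safemode leave    # then leave safe mode manually
#   fi
#   cd "$HBASE_HOME/bin"
#   ./hbase-daemons.sh start regionserver   # every regionserver in the cluster
#   ./hbase-daemon.sh start regionserver    # or just the local one
```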






Reference: http://stackoverflow.com/questions/26704763/yarn-resourcetrackerservice-failed-in-state-started
