Hadoop example jobs hang when run on YARN

This article analyzes why example jobs fail when Hadoop runs on YARN in a pseudo-distributed setup. The logs point to four main kinds of warnings: a configuration mismatch, improper resource allocation, memory overruns, and an ApplicationMaster resource percentage that is set too low. Detailed fixes are given below.


Problem

In local (standalone) mode, the example jobs (pi/grep/wordcount) all run fine.

In pseudo-distributed mode, the same examples hang outright.

[root@master-node sbin]# start-dfs.sh              --start the HDFS daemons
[root@master-node sbin]# start-yarn.sh             --start the YARN daemons
[root@master-node sbin]# jps                       --check which daemons are running
12880 NameNode
13764 Jps
13431 NodeManager
13322 ResourceManager
12988 DataNode
[root@master-node sbin]# hadoop jar /opt/module/hadoop-2.7.4/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.4.jar pi 5 5        --run the pi example
Number of Maps  = 5
Samples per Map = 5
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Starting Job
19/09/04 18:42:42 INFO client.RMProxy: Connecting to ResourceManager at master-node/192.168.159.10:8032
19/09/04 18:42:43 INFO input.FileInputFormat: Total input paths to process : 5
19/09/04 18:42:44 INFO mapreduce.JobSubmitter: number of splits:5
19/09/04 18:42:44 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1567593642473_0001
19/09/04 18:42:46 INFO impl.YarnClientImpl: Submitted application application_1567593642473_0001
19/09/04 18:42:46 INFO mapreduce.Job: The url to track the job: http://master-node:8088/proxy/application_1567593642473_0001/
19/09/04 18:42:46 INFO mapreduce.Job: Running job: job_1567593642473_0001
19/09/04 18:43:03 INFO mapreduce.Job: Job job_1567593642473_0001 running in uber mode : false
19/09/04 18:43:03 INFO mapreduce.Job:  map 0% reduce 0%

The job hangs here and makes no further progress. Check the disk usage, then grep the NodeManager log yarn-root-nodemanager-master-node.log (and the ResourceManager log) for WARN/ERROR messages:

[root@master-node logs]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda5        17G  8.1G  7.7G  52% /
tmpfs           931M     0  931M   0% /dev/shm
/dev/sda1       194M   34M  151M  19% /boot
/dev/sda2       2.0G   68M  1.9G   4% /home
[root@master-node logs]# cat yarn-root-nodemanager-master-node.log | grep WARN
2019-09-04 18:40:44,573 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: The Auxilurary Service named 'mapreduce_shuffle' in the configuration is for class org.apache.hadoop.mapred.ShuffleHandler which has a name of 'httpshuffle'. Because these are not the same tools trying to send ServiceData and read Service Meta Data may have issues unless the refer to the name in the config.
2019-09-04 18:40:44,786 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: NodeManager configured with 8 G physical memory allocated to containers, which is more than 80% of the total physical memory available (1.8 G). Thrashing might happen.
2019-09-04 18:45:12,623 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1567593642473_0001_01_000001 is : 137
2019-09-04 18:47:58,519 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1567593642473_0001_01_000002 is : 65
2019-09-04 18:47:58,647 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1567593642473_0001_01_000005 is : 65
2019-09-04 18:47:58,648 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1567593642473_0001_01_000004 is : 65
2019-09-04 18:47:58,804 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1567593642473_0001_01_000006 is : 65
2019-09-04 18:47:58,805 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1567593642473_0001_01_000003 is : 65
2019-09-04 18:47:59,425 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1567593642473_0001_01_000005 and exit code: 65
2019-09-04 18:47:59,426 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1567593642473_0001_01_000004 and exit code: 65
2019-09-04 18:47:59,425 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1567593642473_0001_01_000003 and exit code: 65
2019-09-04 18:47:59,426 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1567593642473_0001_01_000002 and exit code: 65
2019-09-04 18:47:59,425 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1567593642473_0001_01_000006 and exit code: 65
2019-09-04 18:48:02,010 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 65
2019-09-04 18:48:02,016 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 65
2019-09-04 18:48:02,019 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 65
2019-09-04 18:48:02,019 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 65
2019-09-04 18:48:02,071 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 65
2019-09-04 18:48:07,304 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root OPERATION=Container Finished - Failed  TARGET=ContainerImpl    RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE   APPID=application_1567593642473_0001    CONTAINERID=container_1567593642473_0001_01_000001
2019-09-04 18:48:27,426 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root OPERATION=Container Finished - Failed  TARGET=ContainerImpl    RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE   APPID=application_1567593642473_0001    CONTAINERID=container_1567593642473_0001_01_000006
2019-09-04 18:48:27,427 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root OPERATION=Container Finished - Failed  TARGET=ContainerImpl    RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE   APPID=application_1567593642473_0001    CONTAINERID=container_1567593642473_0001_01_000004
2019-09-04 18:48:27,427 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root OPERATION=Container Finished - Failed  TARGET=ContainerImpl    RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE   APPID=application_1567593642473_0001    CONTAINERID=container_1567593642473_0001_01_000002
2019-09-04 18:48:27,428 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root OPERATION=Container Finished - Failed  TARGET=ContainerImpl    RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE   APPID=application_1567593642473_0001    CONTAINERID=container_1567593642473_0001_01_000003
2019-09-04 18:48:27,429 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=root OPERATION=Container Finished - Failed  TARGET=ContainerImpl    RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE   APPID=application_1567593642473_0001    CONTAINERID=container_1567593642473_0001_01_000005

[root@master-node logs]# cat yarn-root-resourcemanager-master-node.log |grep WARN
2019-09-05 10:03:16,373 WARN org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: maximum-am-resource-percent is insufficient to start a single application in queue, it is likely set too low. skipping enforcement to allow at least one application to start
2019-09-05 10:03:16,376 WARN org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: maximum-am-resource-percent is insufficient to start a single application in queue for user, it is likely set too low. skipping enforcement to allow at least one application to start

Looking at the logs, there are four main kinds of warnings; a sketch of the configuration properties they refer to follows the list.

1. 2019-09-04 18:40:44,573 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: The Auxilurary Service named 'mapreduce_shuffle' in the configuration is for class org.apache.hadoop.mapred.ShuffleHandler which has a name of 'httpshuffle'. Because these are not the same tools trying to send ServiceData and read Service Meta Data may have issues unless the refer to the name in the config.

2. 2019-09-04 18:40:44,786 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: NodeManager configured with 8 G physical memory allocated to containers, which is more than 80% of the total physical memory available (1.8 G). Thrashing might happen.

3. Containers exiting with exit code 65

Are you using a single-node cluster? In that case, if the input file is large and there is not enough memory, the container might not be initialized. Verify the Pig configuration files as well.

4. maximum-am-resource-percent is likely set too low

2019-09-05 10:03:16,373 WARN org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue: maximum-am-resource-percent is insufficient to start a single application in queue, it is likely set too low. skipping enforcement to allow at least one application to start
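
Warnings 1, 2 and 4 all point at configuration rather than at the job itself. Below is a minimal sketch of the properties they refer to; the concrete values (a container memory budget sized for this roughly 1.8 GB VM and a raised AM resource percentage) are illustrative assumptions rather than values taken from the cluster above, and YARN has to be restarted after changing them.

<!-- yarn-site.xml -->
<property>
  <!-- Auxiliary shuffle service referred to by warning 1 -->
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <!-- Warning 2: keep the memory handed to containers below the node's 1.8 GB of RAM
       instead of the 8 GB default (1536 is an illustrative assumption) -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>1536</value>
</property>

<!-- capacity-scheduler.xml -->
<property>
  <!-- Warning 4: fraction of queue resources that ApplicationMasters may occupy;
       the default is 0.1, and 0.5 here is an illustrative assumption -->
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>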

Cause analysis

When YARN kills a container for exceeding its memory limits, the container diagnostics report four numbers like these:

1.0 GB: the physical memory actually used by the task
1 GB: the default value of the mapreduce.map.memory.mb parameter (the physical memory limit for a map task)
2.8 GB: the virtual memory actually used by the task
2.1 GB: the virtual memory limit, i.e. mapreduce.map.memory.mb multiplied by yarn.nodemanager.vmem-pmem-ratio

Here yarn.nodemanager.vmem-pmem-ratio is the allowed ratio of virtual memory to physical memory per container; it is set in yarn-site.xml and defaults to 2.1.

Clearly the container used 2.8 GB of virtual memory while it was only allowed 1 GB × 2.1 = 2.1 GB, so YARN killed the container.

The error above came from a map task, but the same thing can happen in a reduce task; in that case the virtual memory limit is mapreduce.reduce.memory.mb × yarn.nodemanager.vmem-pmem-ratio.

Note:
Physical memory: the actual hardware (RAM).
Virtual memory: a block of logical memory backed by disk space; the disk space used for it is called swap space. It exists to make up for a shortage of physical memory.
When physical memory runs low, Linux falls back on the swap partition: the kernel writes memory pages that are not currently needed out to swap space, freeing that physical memory for other uses; when the original contents are needed again, they are read back from swap into physical memory.
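
To see how much physical memory and swap the node actually has (which is what warning 2 is complaining about), the usual check is:

[root@master-node logs]# free -m              --show physical memory and swap usage in MB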

Solutions

1. Disable the virtual memory check (not recommended):
Set yarn.nodemanager.vmem-check-enabled to false in yarn-site.xml (or in the job configuration):

<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
  <description>Whether virtual memory limits will be enforced for containers.</description>
</property>

Besides exceeding the virtual memory limit, a container can also exceed the physical memory limit; that check can likewise be disabled by setting yarn.nodemanager.pmem-check-enabled to false (see the property below).
Personally I don't think this is a good approach: if the program has a memory leak or a similar problem, disabling these checks may end up bringing the cluster down.
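
For reference, the corresponding property for the physical memory check mentioned above looks like this (the same caveats apply):

<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
  <description>Whether physical memory limits will be enforced for containers.</description>
</property>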

2. Increase mapreduce.map.memory.mb and/or mapreduce.reduce.memory.mb (recommended)
In my view this is the option to try first: it addresses not only the virtual memory limit but also the more common case where the physical memory is simply too small. A sketch is shown below.
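
A minimal sketch of what this might look like in mapred-site.xml. The 2048 MB values are illustrative assumptions; they must stay within yarn.nodemanager.resource.memory-mb and the node's actual RAM, so on a small VM like the one above you would either keep them modest or give the VM more memory first.

<!-- mapred-site.xml -->
<property>
  <!-- physical memory limit per map container (illustrative value) -->
  <name>mapreduce.map.memory.mb</name>
  <value>2048</value>
</property>
<property>
  <!-- physical memory limit per reduce container (illustrative value) -->
  <name>mapreduce.reduce.memory.mb</name>
  <value>2048</value>
</property>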

3. Moderately increase yarn.nodemanager.vmem-pmem-ratio, so that each unit of physical memory is allowed correspondingly more virtual memory; just don't push this parameter to an extreme value. A sketch follows.
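
A minimal sketch in yarn-site.xml; the default ratio is 2.1, and the value 4 below is an illustrative assumption.

<property>
  <!-- allowed ratio of virtual to physical memory per container (default 2.1) -->
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>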

4. If a task's memory usage is wildly out of proportion, the real question is whether the program itself has a memory leak, data skew, or a similar problem; fix those in the code first.
