NodeManager Crashes Unexpectedly: GC overhead limit exceeded

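This article records a NodeManager process failure encountered while running Hive jobs, analyzes the exception log in detail, identifies insufficient heap memory as the root cause, and proposes two solutions: disabling the GC overhead limit check and increasing the heap size.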


Problem Description

While running a Hive job, the NodeManager processes on node4 and node6 crashed unexpectedly. The NodeManager log showed the following:

2017-05-15 05:37:25,616 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Failed to launch container.
java.lang.OutOfMemoryError: GC overhead limit exceeded
       at org.apache.xerces.xni.XMLString.toString(Unknown Source)
       at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)
       at org.apache.xerces.xinclude.XIncludeHandler.characters(Unknown Source)
       at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
       at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
       at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
       at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
       at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
       at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
       at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
       at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
       at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2491)
       at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2479)
       at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2550)
       at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2503)
       at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2409)
       at org.apache.hadoop.conf.Configuration.get(Configuration.java:982)
       at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1032)
       at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
       at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:151)
       at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:242)
       at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:337)
       at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:334)
       at java.security.AccessController.doPrivileged(Native Method)
       at javax.security.auth.Subject.doAs(Subject.java:415)
       at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
       at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:334)
       at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:451)
       at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:428)
       at org.apache.hadoop.fs.FileContext.getLocalFSFileContext(FileContext.java:414)
       at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:204)

2017-05-15 05:37:25,618 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1493093927370_36561_01_000026 transitioned from LOCALIZED to EXITED_WITH_FAILURE
2017-05-15 05:37:25,618 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1493093927370_36561_01_000026
2017-05-15 05:37:25,618 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1493093927370_36561_01_000026 not launched. No cleanup needed to be done
2017-05-15 05:37:28,453 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Unknown localizer with localizerId container_1493093927370_36561_01_000026 is sending heartbeat. Ordering it to DIE
2017-05-15 05:37:29,181 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=anonymous    OPERATION=Container Finished - Failed   TARGET=ContainerImpl    RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE   APPID=application_1493093927370_36561  CONTAINERID=container_1493093927370_36561_01_000026
2017-05-15 05:37:29,181 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1493093927370_36561_01_000026 transitioned from EXITED_WITH_FAILURE to DONE
2017-05-15 05:37:29,181 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1493093927370_36561_01_000026 from application application_1493093927370_36561
2017-05-15 05:37:29,182 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1493093927370_36561

2017-05-15 05:37:29,891 WARN org.apache.hadoop.ipc.Client: interrupted waiting to send rpc request to server
java.lang.InterruptedException
       at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
       at java.util.concurrent.FutureTask.get(FutureTask.java:187)
       at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1050)
       at org.apache.hadoop.ipc.Client.call(Client.java:1445)
       at org.apache.hadoop.ipc.Client.call(Client.java:1403)
       at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
       at com.sun.proxy.$Proxy30.heartbeat(Unknown Source)
       at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62)
       at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:255)
       at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
       at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:128)
       at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1132)

2017-05-15 05:37:30,648 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1493093927370_36560 transitioned from RUNNING to APPLICATION_RESOURCES_CLEANINGUP
2017-05-15 05:37:30,635 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Unknown localizer with localizerId container_1493093927370_36561_01_000026 is sending heartbeat. Ordering it to DIE
2017-05-15 05:37:30,622 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Failed to launch container.
java.lang.OutOfMemoryError: GC overhead limit exceeded
       at java.lang.String.toCharArray(String.java:2748)
       at java.util.zip.ZipCoder.getBytes(ZipCoder.java:78)
       at java.util.zip.ZipFile.getEntry(ZipFile.java:306)
       at java.util.jar.JarFile.getEntry(JarFile.java:227)
       at java.util.jar.JarFile.getJarEntry(JarFile.java:210)
       at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:840)
       at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:818)
       at sun.misc.URLClassPath.findResource(URLClassPath.java:176)
       at java.net.URLClassLoader$2.run(URLClassLoader.java:551)
       at java.net.URLClassLoader$2.run(URLClassLoader.java:549)
       at java.security.AccessController.doPrivileged(Native Method)
       at java.net.URLClassLoader.findResource(URLClassLoader.java:548)
       at java.lang.ClassLoader.getResource(ClassLoader.java:1147)
       at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:227)
       at org.apache.xerces.parsers.SecuritySupport$6.run(Unknown Source)
       at java.security.AccessController.doPrivileged(Native Method)
       at org.apache.xerces.parsers.SecuritySupport.getResourceAsStream(Unknown Source)
       at org.apache.xerces.parsers.ObjectFactory.findJarServiceProvider(Unknown Source)
       at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
       at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
       at org.apache.xerces.parsers.DOMParser.<init>(Unknown Source)
       at org.apache.xerces.parsers.DOMParser.<init>(Unknown Source)
       at org.apache.xerces.jaxp.DocumentBuilderImpl.<init>(Unknown Source)
       at org.apache.xerces.jaxp.DocumentBuilderFactoryImpl.newDocumentBuilder(Unknown Source)
       at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2541)
       at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2503)
       at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2409)
       at org.apache.hadoop.conf.Configuration.get(Configuration.java:982)
       at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1032)
       at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
       at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:151)
       at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:242)

2017-05-15 05:37:31,341 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk2/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:31,335 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1493093927370_36564_000001 (auth:SIMPLE)
2017-05-15 05:37:31,334 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk1/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,020 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk2/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,020 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk4/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,021 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk5/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,021 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk6/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,021 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk7/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,021 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk8/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,022 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk9/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,022 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk10/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,022 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk11/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,022 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk12/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,023 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk13/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,023 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event APPLICATION_STOP for appId application_1493093927370_36560

2017-05-15 05:37:35,652 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1493093927370_36561_01_000014 transitioned from LOCALIZED to EXITED_WITH_FAILURE
2017-05-15 05:37:35,653 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1493093927370_36560 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
2017-05-15 05:37:35,653 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1493093927370_36561_01_000014
2017-05-15 05:37:35,653 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1493093927370_36561_01_000014 not launched. No cleanup needed to be done
2017-05-15 05:37:35,654 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1493093927370_36564_01_000006 by user anonymous
2017-05-15 05:37:35,654 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Creating a new application reference for app application_1493093927370_36564
2017-05-15 05:37:35,654 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=anonymous    IP=134.32.66.14 OPERATION=Start Container Request       TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_1493093927370_36564   CONTAINERID=container_1493093927370_36564_01_000006
2017-05-15 05:37:36,389 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler: Scheduling Log Deletion for application: application_1493093927370_36560, with delay of 259200 seconds
2017-05-15 05:37:36,389 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1493093927370_36564 transitioned from NEW to INITING
2017-05-15 05:37:36,390 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1493093927370_36564_01_000006 to application application_1493093927370_36564
2017-05-15 05:37:36,390 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=anonymous    OPERATION=Container Finished - Failed   TARGET=ContainerImpl    RESULT=FAILURE  DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE   APPID=application_1493093927370_36561  CONTAINERID=container_1493093927370_36561_01_000014
2017-05-15 05:37:36,390 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1493093927370_36561_01_000014 transitioned from EXITED_WITH_FAILURE to DONE
2017-05-15 05:37:36,390 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1493093927370_36561_01_000014 from application application_1493093927370_36561
2017-05-15 05:37:36,390 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1493093927370_36561
2017-05-15 05:37:36,390 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1493093927370_36564 transitioned from INITING to RUNNING
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1493093927370_36564_01_000006 transitioned from NEW to LOCALIZING
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1493093927370_36564
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event APPLICATION_INIT for appId application_1493093927370_36564
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got APPLICATION_INIT for service mapreduce_shuffle
2017-05-15 05:37:36,391 INFO org.apache.hadoop.mapred.ShuffleHandler: Added token for job_1493093927370_36564
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://drmcluster/tmp/hive/anonymous/7faa83f3-6bfd-4a80-96ce-e92b5dd3a1d0/hive_2017-05-15_05-36-25_483_2221195354238842281-39188/-mr-10010/c42c404d-37b4-4edd-b6cf-7e1a3a040f2e/map.xml transitioned from INIT to DOWNLOADING
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://drmcluster/tmp/hive/anonymous/7faa83f3-6bfd-4a80-96ce-e92b5dd3a1d0/hive_2017-05-15_05-36-25_483_2221195354238842281-39188/-mr-10010/c42c404d-37b4-4edd-b6cf-7e1a3a040f2e/reduce.xml transitioned from INIT to DOWNLOADING
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://drmcluster/user/anonymous/.staging/job_1493093927370_36564/job.jar transitioned from INIT to DOWNLOADING
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://drmcluster/user/anonymous/.staging/job_1493093927370_36564/job.xml transitioned from INIT to DOWNLOADING
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1493093927370_36564_01_000006
2017-05-15 05:37:37,097 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContainerMonitor,5,main] threw an Error.  Shutting down now...
java.lang.OutOfMemoryError: GC overhead limit exceeded

 

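Problem Analysis

Check the process information:

Inspecting the yarn-env.sh configuration file shows that the YARN_NODEMANAGER_HEAPSIZE parameter is not set, so the default heap size of 1000 MB is used.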
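
One way to confirm the heap that the running NodeManager actually received is to read the -Xmx flag from its command line. The sketch below uses a hypothetical helper, heap_from_cmdline (not part of Hadoop), and assumes the JVM was launched with an explicit -Xmx option:

```shell
# Hypothetical helper: print the value of the last -Xmx flag in a JVM command line.
# The last occurrence is used because later JVM flags override earlier ones.
heap_from_cmdline() {
  printf '%s\n' "$1" | tr ' ' '\n' | grep -E '^-Xmx' | tail -n 1 | sed 's/^-Xmx//'
}

# Usage on a NodeManager host (the [N] keeps grep from matching itself):
#   heap_from_cmdline "$(ps -ef | grep '[N]odeManager')"
```

With the default setting described above, this would be expected to report a value of about 1000m.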

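Root cause:

The GC overhead limit exceeded check:

This is a policy defined in HotSpot VM 1.6. It uses GC time statistics to predict whether an OOM is imminent and throws the error early, rather than waiting until the heap is completely exhausted. Sun's official definition is: "The parallel/concurrent collector throws an OutOfMemoryError when GC takes too long. 'Too long' means that more than 98% of the time is spent in GC while less than 2% of the heap is recovered. This is intended to keep an application from limping along when the heap is too small."

Ultimately, the NodeManager crashed because its heap was too small.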
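
To get a feel for how close a running JVM is to that 98% threshold, one rough approach is to take two samples of the cumulative GC time reported by `jstat -gcutil <pid>` (the GCT column, in seconds) and compare the difference with the elapsed wall-clock time. The helper below, gc_overhead_pct, is a hypothetical name, and the calculation only approximates what the JVM measures internally:

```shell
# Hypothetical helper: percentage of wall-clock time spent in GC between two samples.
# Arguments: GCT at the first sample, GCT at the second sample, seconds between samples.
gc_overhead_pct() {
  awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN { printf "%.1f\n", (b - a) / t * 100 }'
}

# Example: GCT rose from 10.0 s to 19.8 s over a 10-second window.
gc_overhead_pct 10.0 19.8 10
```

A value that stays near 100 while old-generation occupancy barely drops matches the condition described above.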

Solution

There are two ways to resolve this problem:

         Disable the GC overhead limit check by adding the JVM option -XX:-UseGCOverheadLimit;

         Increase the heap size.
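
For reference, the first option could be applied in yarn-env.sh roughly as shown below. This is a sketch that assumes the launcher script appends YARN_NODEMANAGER_OPTS to the NodeManager JVM options, as the Hadoop 2.x bin/yarn script does. Note that it only suppresses the early error and gives the process no additional memory:

```shell
# Disable the GC overhead limit check for the NodeManager JVM only.
export YARN_NODEMANAGER_OPTS="${YARN_NODEMANAGER_OPTS:-} -XX:-UseGCOverheadLimit"
```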

 

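We chose to increase the heap size and also enabled GC logging for later analysis. The changes are as follows:

1. Enable GC logging in yarn-env.sh so that the cause of the GC activity can be analyzed. The configuration change is:

YARN_OPTS="$YARN_OPTS -verbose:gc -XX:+PrintGCDetails -Xloggc:${YARN_LOG_DIR}/yarn-gc.log -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime"

2. We recommend raising the heap size to 2 GB, as follows:

export YARN_NODEMANAGER_HEAPSIZE=2048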
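
Once the NodeManager has been restarted with these settings and the GC log is being written, a quick first check is how often full collections occur. The sketch below counts lines containing "Full GC", which is how HotSpot collectors of that era typically label full collections when -XX:+PrintGCDetails is enabled. count_full_gc is a hypothetical helper name, and the exact log format varies with JVM version and collector:

```shell
# Hypothetical helper: count full-collection entries in a HotSpot GC log.
count_full_gc() {
  grep -c 'Full GC' "$1"
}

# Usage, with YARN_LOG_DIR set as in yarn-env.sh:
#   count_full_gc "${YARN_LOG_DIR}/yarn-gc.log"
```

If the count keeps climbing after the heap increase, the heap may still be too small for the workload.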

References

http://www.cnblogs.com/hucn/p/3572384.html
