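Problem description
While running Hive jobs, we found that the NodeManager processes on node4 and node6 had crashed unexpectedly. The NodeManager log shows the following: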
2017-05-15 05:37:25,616 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Failed to launch container.
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.xerces.xni.XMLString.toString(Unknown Source)
at org.apache.xerces.parsers.AbstractDOMParser.characters(Unknown Source)
at org.apache.xerces.xinclude.XIncludeHandler.characters(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.parse(Unknown Source)
at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:150)
at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2491)
at org.apache.hadoop.conf.Configuration.parse(Configuration.java:2479)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2550)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2503)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2409)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:982)
at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1032)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:151)
at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:242)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:337)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:334)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:334)
at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:451)
at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:428)
at org.apache.hadoop.fs.FileContext.getLocalFSFileContext(FileContext.java:414)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:204)
2017-05-15 05:37:25,618 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1493093927370_36561_01_000026 transitioned from LOCALIZED to EXITED_WITH_FAILURE
2017-05-15 05:37:25,618 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1493093927370_36561_01_000026
2017-05-15 05:37:25,618 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1493093927370_36561_01_000026 not launched. No cleanup needed to be done
2017-05-15 05:37:28,453 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Unknown localizer with localizerId container_1493093927370_36561_01_000026 is sending heartbeat. Ordering it to DIE
2017-05-15 05:37:29,181 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=anonymous OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1493093927370_36561 CONTAINERID=container_1493093927370_36561_01_000026
2017-05-15 05:37:29,181 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1493093927370_36561_01_000026 transitioned from EXITED_WITH_FAILURE to DONE
2017-05-15 05:37:29,181 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1493093927370_36561_01_000026 from application application_1493093927370_36561
2017-05-15 05:37:29,182 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1493093927370_36561
2017-05-15 05:37:29,891 WARN org.apache.hadoop.ipc.Client: interrupted waiting to send rpc request to server
java.lang.InterruptedException
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:400)
at java.util.concurrent.FutureTask.get(FutureTask.java:187)
at org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1050)
at org.apache.hadoop.ipc.Client.call(Client.java:1445)
at org.apache.hadoop.ipc.Client.call(Client.java:1403)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
at com.sun.proxy.$Proxy30.heartbeat(Unknown Source)
at org.apache.hadoop.yarn.server.nodemanager.api.impl.pb.client.LocalizationProtocolPBClientImpl.heartbeat(LocalizationProtocolPBClientImpl.java:62)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.localizeFiles(ContainerLocalizer.java:255)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer.runLocalization(ContainerLocalizer.java:169)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:128)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:1132)
2017-05-15 05:37:30,648 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1493093927370_36560 transitioned from RUNNING to APPLICATION_RESOURCES_CLEANINGUP
2017-05-15 05:37:30,635 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Unknown localizer with localizerId container_1493093927370_36561_01_000026 is sending heartbeat. Ordering it to DIE
2017-05-15 05:37:30,622 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Failed to launch container.
java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.lang.String.toCharArray(String.java:2748)
at java.util.zip.ZipCoder.getBytes(ZipCoder.java:78)
at java.util.zip.ZipFile.getEntry(ZipFile.java:306)
at java.util.jar.JarFile.getEntry(JarFile.java:227)
at java.util.jar.JarFile.getJarEntry(JarFile.java:210)
at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:840)
at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:818)
at sun.misc.URLClassPath.findResource(URLClassPath.java:176)
at java.net.URLClassLoader$2.run(URLClassLoader.java:551)
at java.net.URLClassLoader$2.run(URLClassLoader.java:549)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findResource(URLClassLoader.java:548)
at java.lang.ClassLoader.getResource(ClassLoader.java:1147)
at java.net.URLClassLoader.getResourceAsStream(URLClassLoader.java:227)
at org.apache.xerces.parsers.SecuritySupport$6.run(Unknown Source)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.xerces.parsers.SecuritySupport.getResourceAsStream(Unknown Source)
at org.apache.xerces.parsers.ObjectFactory.findJarServiceProvider(Unknown Source)
at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
at org.apache.xerces.parsers.ObjectFactory.createObject(Unknown Source)
at org.apache.xerces.parsers.DOMParser.<init>(Unknown Source)
at org.apache.xerces.parsers.DOMParser.<init>(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderImpl.<init>(Unknown Source)
at org.apache.xerces.jaxp.DocumentBuilderFactoryImpl.newDocumentBuilder(Unknown Source)
at org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:2541)
at org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:2503)
at org.apache.hadoop.conf.Configuration.getProps(Configuration.java:2409)
at org.apache.hadoop.conf.Configuration.get(Configuration.java:982)
at org.apache.hadoop.conf.Configuration.getTrimmed(Configuration.java:1032)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:151)
at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:242)
2017-05-15 05:37:31,341 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk2/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:31,335 INFO SecurityLogger.org.apache.hadoop.ipc.Server: Auth successful for appattempt_1493093927370_36564_000001 (auth:SIMPLE)
2017-05-15 05:37:31,334 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk1/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,020 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk2/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,020 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk4/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,021 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk5/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,021 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk6/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,021 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk7/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,021 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk8/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,022 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk9/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,022 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk10/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,022 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk11/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,022 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk12/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,023 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /chunk13/yarn/local/usercache/anonymous/appcache/application_1493093927370_36560
2017-05-15 05:37:32,023 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event APPLICATION_STOP for appId application_1493093927370_36560
2017-05-15 05:37:35,652 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1493093927370_36561_01_000014 transitioned from LOCALIZED to EXITED_WITH_FAILURE
2017-05-15 05:37:35,653 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1493093927370_36560 transitioned from APPLICATION_RESOURCES_CLEANINGUP to FINISHED
2017-05-15 05:37:35,653 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1493093927370_36561_01_000014
2017-05-15 05:37:35,653 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container container_1493093927370_36561_01_000014 not launched. No cleanup needed to be done
2017-05-15 05:37:35,654 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Start request for container_1493093927370_36564_01_000006 by user anonymous
2017-05-15 05:37:35,654 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Creating a new application reference for app application_1493093927370_36564
2017-05-15 05:37:35,654 INFO org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=anonymous IP=134.32.66.14 OPERATION=Start Container Request TARGET=ContainerManageImpl RESULT=SUCCESS APPID=application_1493093927370_36564 CONTAINERID=container_1493093927370_36564_01_000006
2017-05-15 05:37:36,389 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.NonAggregatingLogHandler: Scheduling Log Deletion for application: application_1493093927370_36560, with delay of 259200 seconds
2017-05-15 05:37:36,389 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1493093927370_36564 transitioned from NEW to INITING
2017-05-15 05:37:36,390 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Adding container_1493093927370_36564_01_000006 to application application_1493093927370_36564
2017-05-15 05:37:36,390 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=anonymous OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: EXITED_WITH_FAILURE APPID=application_1493093927370_36561 CONTAINERID=container_1493093927370_36561_01_000014
2017-05-15 05:37:36,390 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1493093927370_36561_01_000014 transitioned from EXITED_WITH_FAILURE to DONE
2017-05-15 05:37:36,390 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Removing container_1493093927370_36561_01_000014 from application application_1493093927370_36561
2017-05-15 05:37:36,390 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_STOP for appId application_1493093927370_36561
2017-05-15 05:37:36,390 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.Application: Application application_1493093927370_36564 transitioned from INITING to RUNNING
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1493093927370_36564_01_000006 transitioned from NEW to LOCALIZING
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event CONTAINER_INIT for appId application_1493093927370_36564
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got event APPLICATION_INIT for appId application_1493093927370_36564
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Got APPLICATION_INIT for service mapreduce_shuffle
2017-05-15 05:37:36,391 INFO org.apache.hadoop.mapred.ShuffleHandler: Added token for job_1493093927370_36564
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://drmcluster/tmp/hive/anonymous/7faa83f3-6bfd-4a80-96ce-e92b5dd3a1d0/hive_2017-05-15_05-36-25_483_2221195354238842281-39188/-mr-10010/c42c404d-37b4-4edd-b6cf-7e1a3a040f2e/map.xml transitioned from INIT to DOWNLOADING
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://drmcluster/tmp/hive/anonymous/7faa83f3-6bfd-4a80-96ce-e92b5dd3a1d0/hive_2017-05-15_05-36-25_483_2221195354238842281-39188/-mr-10010/c42c404d-37b4-4edd-b6cf-7e1a3a040f2e/reduce.xml transitioned from INIT to DOWNLOADING
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://drmcluster/user/anonymous/.staging/job_1493093927370_36564/job.jar transitioned from INIT to DOWNLOADING
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://drmcluster/user/anonymous/.staging/job_1493093927370_36564/job.xml transitioned from INIT to DOWNLOADING
2017-05-15 05:37:36,391 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Created localizer for container_1493093927370_36564_01_000006
2017-05-15 05:37:37,097 FATAL org.apache.hadoop.yarn.YarnUncaughtExceptionHandler: Thread Thread[ContainerMonitor,5,main] threw an Error. Shutting down now...
java.lang.OutOfMemoryError: GC overhead limit exceeded
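
In a log this noisy, the entries that matter are the OutOfMemoryError lines and the final FATAL entry that shuts the NodeManager down. A minimal sketch for pulling them out of a log file follows; the function name find_oom_events is made up for illustration, and the log path you pass in depends on your cluster layout.

```shell
# Sketch: list FATAL entries and OutOfMemoryError lines in a NodeManager log.
# find_oom_events is a hypothetical helper name, not part of Hadoop.
find_oom_events() {
  # $1 is the path to the log file.
  # -n prints line numbers so the surrounding context can be inspected later.
  grep -nE 'FATAL|java\.lang\.OutOfMemoryError' "$1"
}
```

Usage would look like `find_oom_events /path/to/nodemanager.log`, where the path is a placeholder for the actual NodeManager log on the affected node.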
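Problem analysis
Inspecting the NodeManager process and the yarn-env.sh configuration file shows that the YARN_NODEMANAGER_HEAPSIZE parameter is not set, so the NodeManager runs with the default heap size of 1000 MB.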
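
One way to confirm the effective heap size is to read the -Xmx flag from the command line of the running process. The sketch below assumes the NodeManager command line is piped in on stdin; extract_xmx is a hypothetical helper name. It prints the last -Xmx value because HotSpot normally honors the last occurrence when a flag is repeated.

```shell
# Sketch: pull the -Xmx value out of a JVM command line read from stdin.
# extract_xmx is a hypothetical helper name.
extract_xmx() {
  # Put each argument on its own line, keep the -Xmx ones,
  # take the last occurrence, and strip the flag prefix.
  tr ' ' '\n' | grep -E '^-Xmx' | tail -n 1 | sed 's/^-Xmx//'
}
# Usage (assumes a NodeManager is running on this host):
#   ps -ef | grep '[N]odeManager' | extract_xmx
```

With the default configuration described above, this would be expected to print 1000m.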
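Root cause:
The "GC overhead limit exceeded" check:
This is a policy introduced in HotSpot VM 1.6. It uses GC time statistics to predict that the JVM is about to run out of memory and throws the error early, instead of letting the application grind on while making almost no progress. Sun's official definition reads: "The parallel/concurrent collector will throw an OutOfMemoryError if too much time is being spent in garbage collection: if more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered, an OutOfMemoryError will be thrown. This feature is designed to prevent applications from running for an extended period of time while making little or no progress because the heap is too small."
Ultimately, the NodeManager crashed because its heap was too small.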
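
To make the 98% / 2% rule concrete, here is a deliberately simplified sketch of the two conditions using integer arithmetic. It is not the JVM's actual implementation; the function name and its four arguments are assumptions made for illustration.

```shell
# Simplified sketch of the 98% / 2% rule; not the JVM's actual implementation.
# Hypothetical arguments: gc_ms total_ms reclaimed_mb heap_mb
gc_overhead_exceeded() {
  # Condition 1: more than 98% of elapsed time was spent in GC,
  #              i.e. gc_ms / total_ms > 0.98
  # Condition 2: less than 2% of the heap was recovered,
  #              i.e. reclaimed_mb / heap_mb < 0.02
  # Both are cross-multiplied so that plain integer arithmetic is enough.
  [ $(( $1 * 100 )) -gt $(( $2 * 98 )) ] && [ $(( $3 * 100 )) -lt $(( $4 * 2 )) ]
}
```

For example, `gc_overhead_exceeded 990 1000 10 1000` succeeds (99% of the time in GC, 1% of the heap recovered), while `gc_overhead_exceeded 500 1000 10 1000` does not, because only half of the time went to GC.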
Solution
There are two ways to resolve this problem:
Disable the GC overhead limit check by adding the JVM option -XX:-UseGCOverheadLimit;
Increase the heap size.
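
For reference, the first option could be applied in yarn-env.sh roughly as follows. This is a sketch that assumes your Hadoop version passes YARN_NODEMANAGER_OPTS to the NodeManager JVM. Keep in mind that disabling the check does not add memory: a heap that is genuinely too small will still fail later, typically with "Java heap space".

```shell
# Sketch for yarn-env.sh: disable the GC overhead limit check for the NodeManager only.
# Assumes YARN_NODEMANAGER_OPTS is read by the yarn launcher script in your Hadoop version.
export YARN_NODEMANAGER_OPTS="${YARN_NODEMANAGER_OPTS:-} -XX:-UseGCOverheadLimit"
```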
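We chose to increase the heap size, and also enabled GC logging for later analysis. The changes are as follows:
1. Enable GC logging in yarn-env.sh so that the cause of the GC activity can be analyzed:
YARN_OPTS="$YARN_OPTS -verbose:gc -XX:+PrintGCDetails -Xloggc:${YARN_LOG_DIR}/yarn-gc.log -XX:+PrintGCTimeStamps -XX:+PrintGCApplicationConcurrentTime -XX:+PrintGCApplicationStoppedTime"
2. Increase the heap size; we recommend 2 GB (the value is in MB):
export YARN_NODEMANAGER_HEAPSIZE=2048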
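
Once the NodeManager has been restarted with GC logging enabled, the pauses recorded by -XX:+PrintGCApplicationStoppedTime can be totalled to see how much time the JVM spends stopped. The sketch below assumes the log contains lines of the form "Total time for which application threads were stopped: 0.0123 seconds"; the exact wording can vary between JDK versions, and total_stopped_seconds is a hypothetical helper name.

```shell
# Sketch: sum the pauses reported by -XX:+PrintGCApplicationStoppedTime.
# Assumes lines containing
#   "Total time for which application threads were stopped: <seconds> seconds"
# total_stopped_seconds is a hypothetical helper name.
total_stopped_seconds() {
  awk '
    /Total time for which application threads were stopped:/ {
      # The pause length is the field right after the "stopped:" token.
      for (i = 1; i < NF; i++) {
        if ($i == "stopped:") { sum += $(i + 1); n++; break }
      }
    }
    END { printf "%d pauses, %.4f seconds stopped\n", n, sum }
  ' "$1"
}
```

A typical call would be `total_stopped_seconds "${YARN_LOG_DIR}/yarn-gc.log"`. If the total keeps growing quickly after the heap increase, that suggests the heap is still too small or that something is leaking.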
References
http://www.cnblogs.com/hucn/p/3572384.html