We have recently been building custom real-time reports. Early testing showed that Flink performs well and has native windowing support, so we chose Flink as the computation framework for the real-time jobs. Once all the code was finished and ready to go live, we found that after running for a while the Flink job would either die outright or, some time later, restart from a checkpoint. While going through the logs, the job itself reported no exceptions; only after checking every TaskManager's log did we find the following in one of them:
2019-07-08T04:36:52,662 INFO [SIGTERM handler] org.apache.flink.yarn.YarnTaskManagerRunner - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2019-07-08T04:36:52,666 INFO [Thread-13] org.apache.flink.runtime.blob.PermanentBlobCache - Shutting down BLOB cache
2019-07-08T04:36:52,666 INFO [Thread-15] org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
2019-07-08T04:36:52,667 INFO [I/O manager shutdown hook] org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager removed spill file directory /mnt/disk1/yarn/usercache/hadoop/appcache/application_1556227576661_157615/flink-io-487a22f1-5ac1-4158-9030-a6f499d80356
2019-07-08T04:36:52,670 INFO [I/O manager shutdown hook] org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager removed spill file directory /mnt/disk2/yarn/usercache/hadoop/appcache/application_1556227576661_157615/flink-io-2d383cf8-0192-4a2d-901d-9ee7cfeae3e2
2019-07-08T04:36:52,671 INFO [I/O manager shutdown hook] org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager removed spill file directory /mnt/disk3/yarn/usercache/hadoop/appcache/application_1556227576661_157615/flink-io-20cf5475-3be3-48a5-b92c-cbf95434aa97
2019-07-08T04:36:52,671 INFO [I/O manager shutdown hook] org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager removed spill file directory /mnt/disk4/yarn/usercache/hadoop/appcache/application_1556227576661_157615/flink-io-8bb33b5e-e7ff-4da0-ba5c-a808e1fcb657
From this we could conclude that the program had a memory leak. Our first guess was that a misused List, Map, or similar collection kept growing until memory hit the limit and YARN killed the container (we had run into this before: a ConcurrentHashMap used to cache region data grew without bound and leaked memory, an issue that had already been fixed before this one). After reviewing the code, however, we found no such place that could cause the overflow, and the logs showed no "GC overhead limit exceeded" or similar errors, so we could basically rule out a JVM heap problem. That left the possibility of an off-heap memory bug.
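For context, the earlier cache leak followed the pattern sketched below. This is a minimal, hypothetical illustration, not our production code; the class, field, and method names are invented. The leaky version keeps every key forever, while the bounded version caps the cache with LRU eviction.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the cache-leak pattern described above.
public class RegionCache {

    // Leaky pattern: every distinct key stays in the map forever, so heap use only grows.
    private final Map<String, String> unbounded = new ConcurrentHashMap<>();

    public String lookupLeaky(String regionId) {
        return unbounded.computeIfAbsent(regionId, this::loadRegion);
    }

    // Bounded alternative: cap the entry count with LRU eviction.
    private static final int MAX_ENTRIES = 10_000;
    private final LinkedHashMap<String, String> bounded =
            new LinkedHashMap<String, String>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

    public String lookupBounded(String regionId) {
        synchronized (bounded) {
            String value = bounded.get(regionId);
            if (value == null) {
                value = loadRegion(regionId);
                bounded.put(regionId, value); // put() triggers removeEldestEntry()
            }
            return value;
        }
    }

    private String loadRegion(String regionId) {
        // Stand-in for the real lookup (e.g. a dictionary or DB query).
        return "region-" + regionId;
    }
}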
Since the TaskManager log showed nothing conclusive, I went to the node that had reported the following line and searched the YARN NodeManager log:
2019-07-08T04:36:52,662 INFO [SIGTERM handler] org.apache.flink.yarn.YarnTaskManagerRunner - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
The command was as follows:
cat yarn-hadoop-nodemanager-emr-worker-18.cluster-70637.log|grep "container_1556227576661_161684_01_000010" >>container_1556227576661_161684_01_000010.log
After downloading the filtered log and looking through it, I saw the following:
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9724 for container-id container_1556227576661_159189_01_000004: 2.0 GB of 2 GB physical memory used; 4.3 GB of 9.8 TB virtual memory used
2019-07-09 01:05:44,545 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9724 for container-id container_1556227576661_159189_01_000004: 2.0 GB of 2 GB physical memory used; 4.3 GB of 9.8 TB virtual memory used
2019-07-09 01:05:47,597 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9724 for container-id container_1556227576661_159189_01_000004: 2.0 GB of 2 GB physical memory used; 4.3 GB of 9.8 TB virtual memory used
2019-07-09 01:05:50,654 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 9724 for container-id container_1556227576661_159189_01_000004: 2.0 GB of 2 GB physical memory used; 4.3 GB of 9.8 TB virtual memory used
2019-07-09 01:05:50,654 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Process tree for container: container_1556227576661_159189_01_000004 has processes older than 1 iteration running over the configured limit. Limit=2147483648, current usage = 2150461440
2019-07-09 01:05:50,654 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Container [pid=9724,containerID=container_1556227576661_159189_01_000004] is running beyond physical memory limits. Current usage: 2.0 GB of 2 GB physical memory used; 4.3 GB of 9.8 TB virtual memory used. Killing container.
Dump of the process-tree for container_1556227576661_159189_01_000004 :
|- 9741 9724 9724 9724 (java) 128802 33208 4534272000 524661 /usr/lib/jvm/java-1.8.0/bin/java -Xms1448m -Xmx1448m -Dlog.file=/mnt/disk1/log/hadoop-yarn/containers/application_1556227576661_159189/container_1556227576661_159189_01_000004/taskmanager.log -Dlogback.configurationFile=file:./logback.xml -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskManager --configDir .
|- 9724 9722 9724 9724 (bash) 0 0 115855360 354 /bin/bash -c /usr/lib/jvm/java-1.8.0/bin/java -Xms1448m -Xmx1448m -Dlog.file=/mnt/disk1/log/hadoop-yarn/containers/application_1556227576661_159189/container_1556227576661_159189_01_000004/taskmanager.log -Dlogback.configurationFile=file:./logback.xml -Dlog4j.configuration=file:./log4j.properties org.apache.flink.yarn.YarnTaskManager --configDir . 1> /mnt/disk1/log/hadoop-yarn/containers/application_1556227576661_159189/container_1556227576661_159189_01_000004/taskmanager.out 2> /mnt/disk1/log/hadoop-yarn/containers/application_1556227576661_159189/container_1556227576661_159189_01_000004/taskmanager.err
2019-07-09 01:05:50,654 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1556227576661_159189_01_000004 transitioned from RUNNING to KILLING
2019-07-09 01:05:50,654 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1556227576661_159189_01_000004
2019-07-09 01:05:50,661 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1556227576661_159189_01_000004 is : 143
2019-07-09 01:05:50,674 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1556227576661_159189_01_000004 transitioned from KILLING to CONTAINER_CLEANEDUP_AFTER_KILL
2019-07-09 01:05:50,674 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1556227576661_159189_01_000004 transitioned from CONTAINER_CLEANEDUP_AFTER_KILL to DONE
2019-07-09 01:05:50,674 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationImpl: Removing container_1556227576661_159189_01_000004 from application application_1556227576661_159189
2019-07-09 01:05:50,674 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Considering container container_1556227576661_159189_01_000004 for log-aggregation
2019-07-09 01:05:50,674 INFO org.apache.spark.network.yarn.YarnShuffleService: Stopping container container_1556227576661_159189_01_000004
2019-07-09 01:05:53,668 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Stopping resource-monitoring for container_1556227576661_159189_01_000004
2019-07-09 01:05:58,638 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Removed completed containers from NM context: [container_1556227576661_159189_01_000004]
2019-07-09 01:06:50,674 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /mnt/disk1/yarn/usercache/hadoop/appcache/application_1556227576661_159189/container_1556227576661_159189_01_000004
2019-07-09 01:06:50,674 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /mnt/disk3/yarn/usercache/hadoop/appcache/application_1556227576661_159189/container_1556227576661_159189_01_000004
2019-07-09 01:06:50,674 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /mnt/disk2/yarn/usercache/hadoop/appcache/application_1556227576661_159189/container_1556227576661_159189_01_000004
2019-07-09 01:06:50,674 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting absolute path : /mnt/disk4/yarn/usercache/hadoop/appcache/application_1556227576661_159189/container_1556227576661_159189_01_000004
2019-07-09 08:44:39,569 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Stopping container with container Id: container_1556227576661_159189_01_000004
2019-07-09 08:44:41,580 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl: Uploading logs for container container_1556227576661_159189_01_000004. Current good log dirs are /mnt/disk1/log/hadoop-yarn/containers,/mnt/disk2/log/hadoop-yarn/containers,/mnt/disk3/log/hadoop-yarn/containers,/mnt/disk4/log/hadoop-yarn/containers
2019-07-09 08:45:41,598 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting path : /mnt/disk1/log/hadoop-yarn/containers/application_1556227576661_159189/container_1556227576661_159189_01_000004/taskmanager.out
2019-07-09 08:45:41,599 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Deleting path : /mnt/disk1/log/hadoop-yarn/containers/application_1556227576661_159189/container_1556227576661_159189_01_000004/taskmanager.err
The NodeManager log shows the TaskManager's memory usage growing steadily until it finally exceeded the physical memory requested for the container (2 GB), so a memory leak was certain. Note that the JVM heap was pinned at 1448 MB (-Xms1448m -Xmx1448m in the process dump above) and could not grow past that, so the part that kept growing had to be off-heap (native) memory. After digging further, we found that this is a RocksDB bug:
https://github.com/facebook/rocksdb/issues/4112
Note: we were running Flink 1.4.0.
So, for the sake of stability, we abandoned the RocksDBStateBackend and switched to the FsStateBackend (a sketch of the change follows below). The job has not reported any memory leak since and has been running well.
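For reference, here is a minimal sketch of the change, assuming a Flink 1.4 streaming job with checkpoints written to HDFS. The checkpoint path, interval, and job name are placeholders, not our actual configuration.

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendSwitch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 60s; interval and path below are placeholders.
        env.enableCheckpointing(60_000);

        // Previously (leaking native memory through RocksDB, needs the
        // flink-statebackend-rocksdb dependency):
        // env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));

        // Now: keep working state on the JVM heap and write checkpoint data to HDFS.
        env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

        // ... build the streaming job here ...

        env.execute("realtime-report");
    }
}

With the FsStateBackend, working state lives on the TaskManager heap, so it is bounded by -Xmx and visible to normal JVM monitoring, at the cost of state size being limited by heap memory.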