Today I helped a colleague with a job that was taking far too long to run. After the task was killed, the error message was: "Lost task tracker: tracker_xxxxxx". From the job history I could see that "Stage-2 map = 100%, reduce = 100%" had been printed for a very long time, so I suspected that dumping the output files was taking too long. Looking at the code, I found that her SQL joined two very large tables (900M+ rows × 9.7B+ rows). The 9.7B-row table is a full-snapshot table, so a filter could easily be added to bring it down to 500M+ rows. Also, when this job runs during the day it never prints "Stage-2 map = 100%, reduce = 100%" for a long stretch, which suggests the cluster is very short on resources in the early morning. Below is a similar problem that someone else ran into on the Cloudera mailing list, together with the explanation of its cause.
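The original SQL is not shown in this post, so the following is only a minimal HiveQL sketch with made-up table, column, and partition names. The point is to prune the full-snapshot table with a filter (ideally a partition predicate) before it enters the join:

-- Hypothetical tables: fact_log (~0.9B rows) and full_snapshot (~9.7B rows).
-- Filtering full_snapshot first means the join shuffles ~0.5B rows
-- instead of all 9.7B.
SELECT t1.user_id, t2.attr
FROM fact_log t1
JOIN (
    SELECT user_id, attr
    FROM full_snapshot
    WHERE dt = '20130225'   -- assumed partition column; adjust to the real schema
) t2
ON t1.user_id = t2.user_id;

If dt is a partition column, Hive can skip whole partitions at the input stage instead of reading all 9.7B rows and discarding most of them.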
Question:
Hi All
Could you please support me on this:
I have roughly 500 billion rows and 200 columns on a 250-node cluster running CDH 4.1.2.
I have a job which takes about 290,000 mappers and 2,000 reducers, and there are no jobs running in parallel.
When I run my query, about 1,400 reducers fail on various hosts with the following error:
task_201302181813_2057_r_000000  25.78%
Lost task tracker: tracker_osel400811.com:localhost/127.0.0.1:53127  reduce > copy (226117 of 292320 at 1.21 MB/s) >
25-Feb-2013 02:19:57
The map phase completed in about 6 hrs and then the reducers started.
Once this error occurs, the job reinitiates about 30,000 mappers and takes a lot more time to complete.
Could you tell me how to fix this issue?
Regards
Nirup
Cause:
Nirup:
This error means that one of your task trackers (tracker_osel400811.com) lost contact with the job tracker (JT). "Losing contact" means it failed to heartbeat for ~10 minutes and, as a result, is considered dead. Any map tasks that ran on that node would have written their output to the local disks so the reducers could copy the data during the shuffle phase. Since the machine is no longer accessible, neither is the output of those map tasks. As a result, they all must be re-run.
These errors are usually the result of significant network congestion, an actual node failure, the failure of the task tracker process, or some situation that makes the node effectively unusable (e.g. local disks filling up). First, check to see if the node in question returned to a healthy state on its own. If so, check the task tracker logs (in /var/log/hadoop/ by default). You'll almost certainly find the answer in the logs.
Best of luck.
Ref: https://groups.google.com/a/cloudera.org/forum/#!topic/cdh-user/2A8jUlXTCc4
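A note on the "~10 minutes" in the reply above: in MRv1 the window after which the JobTracker declares a task tracker lost is controlled by mapred.tasktracker.expiry.interval (in milliseconds) in mapred-site.xml. A minimal sketch of the setting, shown at its default value:

<!-- mapred-site.xml on the JobTracker (MRv1, e.g. CDH 4 running MRv1) -->
<property>
  <name>mapred.tasktracker.expiry.interval</name>
  <!-- Milliseconds without a heartbeat before a task tracker is
       declared lost; 600000 ms = 10 minutes is the default. -->
  <value>600000</value>
</property>

Raising this value only hides the symptom; if task trackers keep dropping out, the real cause (network congestion, full disks, a dying TaskTracker process) still has to be found in the logs, as the reply suggests.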