Hadoop: skip bad records

This post describes a task-failure problem in Hadoop Streaming caused by a third-party library, and a fix: enabling Hadoop's skip-bad-records mode so that individual bad records are skipped instead of failing the entire job.

I've recently been writing Hadoop programs with Hadoop Streaming. Because my mapper calls a third-party library, certain input records make the library segfault and kill the task. Hadoop then keeps relaunching the task to re-process those bad records until the configured attempt limit is reached, at which point the whole job fails. At first I worked around this by pre-processing the input to find and delete the bad records, but scanning the entire input up front every time is painful at scale.

Fortunately, Hadoop has a skip-bad-records mode that lets the framework skip over the offending records. See *Hadoop: The Definitive Guide* (Chinese edition, p. 184, section 6.5.3, "Skipping Bad Records") for details.
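As a sketch, skipping mode can be turned on from the Streaming command line with `-D` options. The property names below are the Hadoop 2.x forms (older releases use the `mapred.skip.*` equivalents such as `mapred.skip.map.max.skip.records`); the streaming jar path, HDFS paths, and attempt counts are placeholders for your cluster, not values from the original post:

```shell
# Illustrative Streaming invocation with skipping mode enabled.
# Skipping starts after 2 failed attempts; the framework then
# binary-searches the split until the bad range is at most 1 record.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.map.maxattempts=8 \
  -D mapreduce.task.skip.start.attempts=2 \
  -D mapreduce.map.skip.maxrecords=1 \
  -input /data/input \
  -output /data/output \
  -mapper mapper.py \
  -file mapper.py
```

Note that narrowing down to a single record costs extra task attempts; if you just want to discard the whole neighborhood of a failure quickly, a large `mapreduce.map.skip.maxrecords` value skips the current range without the binary search.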

This blog post also covers the topic well: http://devblog.factual.com/practical-hadoop-streaming-dealing-with-brittle-code

