Hadoop Streaming: technical notes

This post walks through how Hadoop Streaming works: the distributed computing framework, the interaction between mapper and reducer, the data-processing logic, and cross-language support. By tracing the map-reduce flow, it explains how data is transferred, sorted, and merged between mappers and reducers. It also discusses the trade-off between compressed input files and split support, and how streaming allows arbitrary executable programs to perform the map and reduce steps.


Over the past two years I have used Hadoop on and off for a few projects and read some Hadoop material, always learning just enough to get by. Now that the kbuild project needs Hadoop Streaming, I find I have forgotten a lot, so I am taking this opportunity to go over and organize the relevant Hadoop knowledge again.

Hadoop consists of two parts: the distributed file system HDFS, and the parallel computing framework map-reduce. Application development mostly means working with map-reduce; HDFS is relatively simple at the logical level, see: http://hadoop.apache.org/common/docs/current/hdfs_design.html . Map-reduce documentation: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html ; http://hadoop.apache.org/common/docs/current/streaming.html .

map generally does the distributed computation, and reduce collects the results. The programming interface map exposes is map(K1 key, V1 value, OutputCollector&lt;K2, V2&gt; output, Reporter reporter): it processes an input key-value pair and puts the result into output. The framework processes that output and passes it to reduce, whose interface is reduce(K2 key, Iterator&lt;V2&gt; values, OutputCollector&lt;K3, V3&gt; output, Reporter reporter): after processing the series of values for one key, the result again goes into output. That completes a simple map-reduce job. The framework handles the following things for us:
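The two interfaces can be illustrated with a small Python sketch (a word-count analogue; OutputCollector is modeled as a plain list, and the function names here are illustrative, not part of any Hadoop API):

```python
def word_count_map(key, value, output):
    """map(K1, V1, output): key = byte offset, value = one line of text.
    Emits (word, 1) pairs into output, like output.collect(k2, v2)."""
    for word in value.split():
        output.append((word, 1))

def word_count_reduce(key, values, output):
    """reduce(K2, values, output): receives all values for one key."""
    output.append((key, sum(values)))

map_out = []
word_count_map(0, "to be or not to be", map_out)

# The framework sorts and groups the map output by key before calling reduce:
grouped = {}
for k, v in sorted(map_out):
    grouped.setdefault(k, []).append(v)

reduce_out = []
for k, vs in grouped.items():
    word_count_reduce(k, iter(vs), reduce_out)
print(dict(reduce_out))
```

The grouping step in the middle is exactly the part the framework does for us, as described in the points below.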

1. How the number of mappers is determined, and on which tasktracker machines those mappers are started;

  First, a note on how file data is stored on HDFS: each file is split into blocks of a fixed size (say 256 MB), and each block is replicated on several machines according to the replica setting. This block splitting is independent of whether the file is compressed; it depends only on file size. A 512 MB gz file, for example, is still stored as two blocks. If the file is not compressed, it is splittable: unless the mapper count is specified, the default is one mapper per block; you can raise the mapper count (so each mapper handles less data) but not lower it. Unless you know the benefit for certain, setting the mapper count is not recommended, since it is rarely more efficient than the default. If the input file is compressed, for example in gz format, it is not splittable, so each file gets exactly one mapper.

HDFS can report which machine holds the current block, and the framework uses this information to start the mapper on that machine. Starting one mapper per block is also based on this consideration: it avoids transferring data over the network. As for compression: if the compressed file is still large, data will travel over the network during the map phase, so when using compression it is advisable to keep each compressed file within one block size. Compressed files save cluster disk space on one hand, and read performance on compressed files is also decent.
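As a back-of-the-envelope sketch of the default behavior described above (the real InputFormat also honors minimum/maximum split-size settings, so treat this as an estimate only), the mapper count for a single input file works out roughly like this:

```python
import math

def estimate_map_tasks(file_size, block_size, splittable):
    """Rough estimate of how many mappers the framework starts for one file.

    A splittable (uncompressed) file gets about one mapper per block;
    a non-splittable file (e.g. gzip) gets exactly one mapper.
    Sizes are in bytes. Simplified: real InputFormats apply extra rules.
    """
    if not splittable:
        return 1
    return max(1, math.ceil(file_size / block_size))

BLOCK = 256 * 1024 * 1024  # the 256 MB block size from the example above
print(estimate_map_tasks(512 * 1024 * 1024, BLOCK, splittable=True))   # plain text file
print(estimate_map_tasks(512 * 1024 * 1024, BLOCK, splittable=False))  # same file as .gz
```

The same 512 MB file gets two mappers when uncompressed, but only one as a gz file, which is exactly why a compressed file larger than one block forces network reads.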

2. How the user's file data is converted into key-value pairs for the map interface

Getting from a data file to the mapper's key-value pairs takes several steps: an InputFormat is constructed from the current mapper's InputSplit (each InputSplit includes the file name, start offset, end offset, the list of machines holding the data, and so on); the InputFormat contains a RecordReader implementation; this RecordReader is responsible for parsing individual key-value pairs out of the InputSplit (reading the data through the HDFS interface). For example, the commonly used TextInputFormat contains a LineRecordReader, which reads one line at a time from the InputSplit; the key it returns is the current byte offset (think SEEK_CUR) and the value is that line of text. Seen this way, it is easy to understand why a compressed file can only be handled by a single mapper: compressed data cannot be parsed starting from the middle. If you want both compression and split support, consider SequenceFile.
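The LineRecordReader behavior can be mimicked in a few lines of Python (a simplified sketch over an in-memory byte string; the real reader works over HDFS streams and handles split boundaries):

```python
def line_records(data: bytes):
    """Yield (byte_offset, line) pairs, mimicking TextInputFormat's
    LineRecordReader: key = offset of the line start, value = line text."""
    offset = 0
    for line in data.splitlines(keepends=True):
        yield offset, line.rstrip(b"\r\n").decode()
        offset += len(line)

for key, value in line_records(b"hello\nhadoop streaming\n"):
    print(key, value)
```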

There is a subtlety in this logic: an InputSplit may cut a line in the middle. How is that handled? The current logic is that the previous InputSplit reads past its end to finish the whole line, and the current InputSplit skips that incomplete first line.
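The boundary rule can be sketched like this (a simplified model over an in-memory blob; the back-up-one-byte trick makes the rule work even when a split boundary lands exactly on a line start, so no line is lost or read twice):

```python
def read_split(data: bytes, start: int, end: int) -> list:
    """Read the lines belonging to split [start, end): a split that does not
    begin at offset 0 skips its (possibly partial) first line, and every
    split reads past `end` to complete its last line."""
    pos = start
    if start > 0:
        # Back up one byte and discard through the next newline: the skipped
        # line belongs to the previous split. Starting the search at
        # start - 1 means a split that begins exactly at a line start only
        # skips the previous line's newline, not a whole line of its own.
        nl = data.find(b"\n", start - 1)
        pos = len(data) if nl == -1 else nl + 1
    lines = []
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        nxt = len(data) if nl == -1 else nl + 1
        lines.append(data[pos:nxt].rstrip(b"\n").decode())
        pos = nxt
    return lines

blob = b"aaa\nbbbb\ncc\n"   # a split boundary at byte 6 cuts "bbbb" in half
print(read_split(blob, 0, 6))   # first split finishes the cut line
print(read_split(blob, 6, 12))  # second split skips the partial first line
```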

3. The map output lands in output; how is that data organized, and how is it shipped to the right reduce

output is really a wrapper around local file writes: a collect call turns into a write to a local file, in a format you can think of as key-len, value-len, key, value. Passing the mapper output to reduce is called the shuffle. The mapper output is first sorted locally by key, and then a Partitioner decides which reduce each key is sent to. By default this is a hash of the key taken modulo the number of reduces; if we want to control exactly which reduce a key goes to, we should implement our own Partitioner.
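The default partitioning rule is a one-liner. In this sketch Python's built-in hash() stands in for Java's key.hashCode(), so the exact bucket numbers differ from Hadoop's; the property that matters is that equal keys always land on the same reduce:

```python
def default_partition(key: str, num_reduces: int) -> int:
    """Default rule described above: hash the key, take it modulo the
    reduce count. A custom Partitioner would replace this function."""
    return hash(key) % num_reduces

keys = ["apple", "banana", "apple", "cherry"]
buckets = [default_partition(k, 4) for k in keys]
# The two "apple" records are guaranteed to reach the same reduce.
assert buckets[0] == buckets[2]
```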

4. How the reduce's key and value iterator are produced

After the shuffle, the map data arrives on the reduce machine; one more merge sort is done there; then the key and its corresponding value iterator are constructed.
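This merge step can be sketched with Python's heapq.merge and itertools.groupby: each mapper delivers its partition already sorted by key, the reduce merge-sorts those runs, and grouping consecutive equal keys yields the (key, value-iterator) pairs that reduce() receives:

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Sorted (key, value) runs, one per mapper, as they arrive after the shuffle.
run_a = [("apple", 1), ("cherry", 1)]
run_b = [("apple", 1), ("banana", 1)]

merged = heapq.merge(run_a, run_b, key=itemgetter(0))  # merge sort of sorted runs
reduced = []
for key, pairs in groupby(merged, key=itemgetter(0)):
    values = (v for _, v in pairs)      # the value iterator handed to reduce()
    reduced.append((key, sum(values)))  # a word-count style reduce
print(reduced)
```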

5. How reduce gets the information in output onto HDFS

The output information only goes to local disk; moving it from local disk to HDFS is the framework's responsibility.


I still lack a detailed understanding of the process from the map's data output to the construction of the reduce input; I need to read the code.


Streaming is a jar built on top of map-reduce, mainly for cross-language support: the map and reduce can be arbitrary executable programs. Streaming starts the map program as a separate process; it writes the key-value pairs arriving at the mapper interface directly to the map process's stdin, and at the same time captures the map process's stdout, parsing each line into a key-value pair and then calling the mapper's output.collect; everything after that is handled by the framework. The reduce side works similarly, except that because reduce operates on stdin line by line, every line must carry the key-value information; there is no key plus value-iterator as in Java. If the process wants to report some progress information, it can write to stderr; streaming captures this output and recognizes whether it is progress information.
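A minimal streaming-style pair of programs, sketched as functions over iterables of lines (a word-count mapper, and a reducer that groups consecutive identical keys out of its sorted stdin, which is the pattern the paragraph above describes):

```python
from itertools import groupby

def streaming_map(lines):
    """Mapper side: one input line in, zero or more 'key\\tvalue' lines out."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def streaming_reduce(lines):
    """Reducer side: input arrives sorted by key and every line carries the
    key, so consecutive lines with the same key form one reduce group."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"

# Simulate the framework: map, sort by key (the shuffle), then reduce.
mapped = sorted(streaming_map(["b a", "a c a"]))
for out in streaming_reduce(mapped):
    print(out)
```

In a real streaming job each function would be its own script reading sys.stdin and printing to stdout; the sort between them is exactly what the framework's shuffle provides.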

