Hadoop Streaming

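This article introduces the basic usage and advanced features of Hadoop Streaming, including how to split data with custom methods, configure MapReduce jobs, and set input and output formats. It also covers how to adjust the job execution flow through command-line parameters.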

http://hadoop.apache.org/docs/r1.0.4/cn/streaming.html#%E4%BD%BF%E7%94%A8%E8%87%AA%E5%AE%9A%E4%B9%89%E7%9A%84%E6%96%B9%E6%B3%95%E5%88%87%E5%88%86%E8%A1%8C%E6%9D%A5%E5%BD%A2%E6%88%90Key%2FValue%E5%AF%B9

Introduction

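Hadoop Streaming feeds the files under the -input path to the mapper program line by line, and the reducer likewise writes its results line by line. The values of -mapper and -reducer are both executables that run as processes: a Python script, for example, reads from standard input via sys.stdin and writes with print. Each line the mapper emits is split into a key/value pair, using TAB as the default separator; the reducer's output is handled the same way.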

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /bin/wc
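
The stdin/stdout contract described above can be sketched in Python. This is a minimal word-count illustration, not code from the original article: the function names map_lines and reduce_lines and the sample data are assumptions, and the sort step only imitates what the framework does between the map and reduce phases.

```python
import itertools


def map_lines(lines):
    """Mapper logic: emit 'word<TAB>1' for every word of every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer logic: sum the counts per key.

    The input must already be sorted by key, which is how Hadoop
    delivers map output to a reducer.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


# Local simulation of map -> sort -> reduce on two sample lines.
# In real scripts, mapper.py would iterate over sys.stdin and print
# each result, and reducer.py would do the same with reduce_lines.
sample = ["a b a\n", "b c\n"]
mapped = sorted(map_lines(sample))
result = list(reduce_lines(mapped))
print(result)
```

Keeping the logic in functions that take any iterable of lines makes it possible to test the scripts locally before submitting a job.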

Hadoop Streaming parameters

Parameter	Required	Description
-input <path>	Required	DFS input file(s) for the Map step.
-output <path>	Required	DFS output directory for the Reduce step.
-mapper <cmd|JavaClassName>	Optional	Command to be run as mapper.
-combiner <cmd|JavaClassName>	Optional	Command to be run as combiner.
-reducer <cmd|JavaClassName>	Optional	Command to be run as reducer.
-file <file>	Optional	File/dir to be shipped in the job jar file. Deprecated; use the generic option "-files" instead.
-inputformat <TextInputFormat(default)|SequenceFileAsTextInputFormat|JavaClassName>	Optional	The input format class.
-outputformat <TextOutputFormat(default)|JavaClassName>	Optional	The output format class.
-partitioner <JavaClassName>	Optional	The partitioner class.
-numReduceTasks <num>	Optional	Number of reduce tasks.
-inputreader <spec>	Optional	Input recordreader spec.
-cmdenv <n>=<v>	Optional	Pass an environment variable to streaming commands.
-mapdebug <cmd>	Optional	Script to run when a map task fails.
-reducedebug <cmd>	Optional	Script to run when a reduce task fails.
-io <identifier>	Optional	Format to use for input to and output from mapper/reducer commands.
-lazyOutput	Optional	Lazily create output.
-background	Optional	Submit the job and don't wait until it completes.
-verbose	Optional	Print verbose output.
-info	Optional	Print detailed usage.
-help	Optional	Print help message.

Generic options

Parameter	Description
-conf <configuration file>	Specify an application configuration file.
-D <property=value>	Use value for given property.
-fs <local|namenode:port>	Specify a namenode.
-jt <local|resourcemanager:port>	Specify a ResourceManager.
-files <comma separated list of files>	Specify comma-separated files to be copied to the MapReduce cluster.
-libjars <comma separated list of jars>	Specify comma-separated jar files to include in the classpath.
-archives <comma separated list of archives>	Specify comma-separated archives to be unarchived on the compute machines.

-file, -cacheFile and -cacheArchive

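-file refers to files on the local machine.
The -file option ships the executables named in -mapper and -reducer, together with the files they depend on. These files are placed in the working directory ("./") of the task on every node.

Cache files are files that already reside on HDFS.
-cacheFile hdfs://host:port/path/file#linkname
The mapper and reducer can then access the file locally as ./linkname.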

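-cacheArchive hdfs://host:port/path/archivefile#linkname

For example, pack mapper.py, reducer.py and data/df.csv into app.tar.gz; this archive is the archivefile.
At run time the archive is unpacked under ./linkname, so the mapper runs as ./linkname/mapper.py and the reducer as ./linkname/reducer.py,
and both read their side data from ./linkname/data/df.csv (assuming the archive keeps the data/ directory).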

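How the three distribution options differ: -file packs local client files into the job jar, uploads it to HDFS and distributes it to the compute nodes; -cacheFile distributes a file that is already on HDFS to the compute nodes; -cacheArchive distributes an archive that is already on HDFS to the compute nodes and unpacks it there.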
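
From the script's point of view, a distributed file is just a path in the task's working directory. The sketch below shows how a mapper might load such a file once at start-up. The two-column CSV layout, the load_lookup helper and the temporary file standing in for the ./linkname symlink are all assumptions made for illustration.

```python
import csv
import os
import tempfile


def load_lookup(path):
    """Read a two-column CSV (id,name) into a dict; the layout is hypothetical."""
    with open(path, newline="") as handle:
        return {row[0]: row[1] for row in csv.reader(handle) if len(row) >= 2}


# Local stand-in for the symlink that Hadoop would create in the task
# directory for "-cacheFile hdfs://host:port/path/file#linkname".
workdir = tempfile.mkdtemp()
link_path = os.path.join(workdir, "linkname")
with open(link_path, "w", newline="") as handle:
    handle.write("1,alice\n2,bob\n")

# Inside a real mapper this would simply be load_lookup("./linkname").
lookup = load_lookup(link_path)
print(lookup["2"])
```

Loading the side data once, before the loop over input lines, avoids reopening the file for every record.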

-partitioner, -combiner

Field selection

-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

-reducer aggregate
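
With the built-in aggregate reducer, the mapper is expected to prefix each key with the name of the aggregation to apply, in the form function:key<TAB>value. The sketch below follows the LongValueSum pattern used in the Hadoop Streaming documentation; the function name aggregate_tokens and the sample line are assumptions for illustration.

```python
def aggregate_tokens(line):
    """Emit one 'LongValueSum:<word><TAB>1' record per word.

    The aggregate reducer then sums the values for each word.
    """
    for word in line.split():
        yield f"LongValueSum:{word}\t1"


# In a real mapper the lines would come from sys.stdin.
tokens = list(aggregate_tokens("a b a"))
for token in tokens:
    print(token)
```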

-jobconf

Setting the number of map and reduce tasks

-jobconf mapred.map.tasks=10
-jobconf mapred.reduce.tasks=10 # when the number of reduce tasks is set to 0, the reduce phase is skipped

Setting the key/value separator and secondary sort

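When the Map/Reduce framework reads a line from the mapper's standard output, it splits the line into a key/value pair. By default, the part of the line before the first tab character is the key and the rest of the line (excluding the tab character) is the value.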
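
The default rule can be mimicked in a few lines of Python. This only illustrates the behaviour described above and is not the framework's own code; the function name split_on_first_tab is an assumption, and the value of a line without a tab is represented here as an empty string.

```python
def split_on_first_tab(line):
    """Split a line into (key, value) at the first tab character.

    Without a tab, the whole line becomes the key and the value is empty.
    """
    key, _, value = line.rstrip("\n").partition("\t")
    return key, value


with_tab = split_on_first_tab("k\tv1\tv2\n")
without_tab = split_on_first_tab("no tab here\n")
print(with_tab, without_tab)
```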

Passing -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner enables secondary sort:

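# -partitioner: load the key-field-based partitioner used for secondary sort
# stream.map.output.field.separator: field separator for the map output
# stream.num.map.output.key.fields=4: split key from value at the 4th separator (default 1);
#   if a line has fewer separators, the whole line is the key and the value is an empty string
# map.output.key.field.separator: separator used to split the key into fields
# num.key.fields.for.partition=2: partition on the first 2 key fields, so records
#   that share them land in the same bucket
$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
    -input InputDirs \
    -output OutputDir \
    -mapper "python mapper.py" \
    -reducer "python reducer.py" \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf stream.map.output.field.separator=. \
    -jobconf stream.num.map.output.key.fields=4 \
    -jobconf map.output.key.field.separator=. \
    -jobconf num.key.fields.for.partition=2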
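
To see what these four settings do to a record, the following Python sketch simulates them locally. It is an approximation under stated assumptions: the helper names and sample records are made up, records are grouped by their partition prefix rather than by Hadoop's actual hash-based bucket assignment (several prefixes may share one reducer in a real job), and keys are sorted as plain strings.

```python
from collections import defaultdict


def split_key_value(line, sep, num_key_fields):
    """Mimic stream.map.output.field.separator / stream.num.map.output.key.fields."""
    fields = line.split(sep)
    if len(fields) <= num_key_fields:
        # Fewer separators than requested: whole line is the key, value is empty.
        return line, ""
    return sep.join(fields[:num_key_fields]), sep.join(fields[num_key_fields:])


def partition_prefix(key, sep, num_fields):
    """Mimic map.output.key.field.separator / num.key.fields.for.partition."""
    return sep.join(key.split(sep)[:num_fields])


# Hypothetical map output lines: four key fields followed by a value.
map_output = ["11.12.1.2.x", "11.14.2.3.y", "11.11.4.1.z", "11.12.1.1.w", "11.14.2.2.v"]

buckets = defaultdict(list)
for line in map_output:
    key, value = split_key_value(line, ".", 4)
    buckets[partition_prefix(key, ".", 2)].append((key, value))

# Within each group, records arrive sorted by the full 4-field key.
grouped = {prefix: sorted(records) for prefix, records in sorted(buckets.items())}
for prefix, records in grouped.items():
    print(prefix, records)
```

Partitioning on the first two fields while sorting on all four is what gives the secondary-sort effect: a reducer sees all records for a prefix together, already ordered by the remaining fields.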