Hadoop: The Definitive Guide, Study Notes 2

Prometheus1969

License: CC 4.0 BY-SA

Tags: hadoop mapreduce

Article link: https://blog.youkuaiyun.com/weixin_44336824/article/details/106331425

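This article describes how to configure a Hadoop MapReduce application, covering the Configuration class, reading configuration files, and property overriding, and how to unit test MapReduce components with MRUnit. It also explains how to run MapReduce jobs locally and on a cluster.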

MapReduce

MapReduce Application Development: Configuration

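An instance of the Configuration class represents a collection of configuration properties and their values. Each property is named by a String, and its value may be read as one of several types (for example String, int, or boolean), depending on the getter used. Configuration reads its properties from resources: XML files with a simple name-value structure. When several resources are added, a property defined in a later resource overrides the earlier definition, unless the earlier one is marked final.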

<?xml version="1.0"?>
<!-- Configuration file: configuration-1.xml -->
<configuration>
	<property>
		<name>color</name>
		<value>yellow</value>
		<description>A description of the property goes here</description>
	</property>
	<property>
		<name>size</name>
		<value>10</value>
		<description>hello size</description>
	</property>
	<property>
		<name>weight</name>
		<value>heavy</value>
		<final>true</final>
		<description>If final is true, the property is constant and cannot be overridden</description>
	</property>
</configuration>

<?xml version="1.0"?>
<!-- Configuration file: configuration-2.xml -->
<configuration>
	<property>
		<name>color</name>
		<value>red</value>
		<description>Overrides the earlier value if this file is read later</description>
	</property>
</configuration>

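// Requires: import org.apache.hadoop.conf.Configuration;
Configuration conf = new Configuration();
conf.addResource("configuration-1.xml");	// resolved from the classpath
conf.get("color");	// returns "yellow"; if configuration-2.xml is added afterward, the value is overridden to "red"
conf.getInt("size", 0);	// returns 10; the second argument is the int default used when the property is missing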
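
The override rules can be modelled in a few lines of plain Java. The sketch below is not Hadoop's Configuration class: ResourceOverrideSketch, its Property holder, and the weight=light entry in the second resource are hypothetical names and data introduced only for illustration. It assumes that a later resource replaces earlier values unless the property was already marked final.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Plain-Java model of the override rules; NOT Hadoop's Configuration class.
class ResourceOverrideSketch {
    // One <property> element: its value plus the optional <final> flag.
    static final class Property {
        final String value;
        final boolean isFinal;

        Property(String value, boolean isFinal) {
            this.value = value;
            this.isFinal = isFinal;
        }
    }

    private final Map<String, String> values = new LinkedHashMap<>();
    private final Set<String> finalNames = new HashSet<>();

    // Later resources win, except for names an earlier resource marked final.
    void addResource(Map<String, Property> resource) {
        for (Map.Entry<String, Property> e : resource.entrySet()) {
            if (finalNames.contains(e.getKey())) {
                continue; // ignore the attempted override
            }
            values.put(e.getKey(), e.getValue().value);
            if (e.getValue().isFinal) {
                finalNames.add(e.getKey());
            }
        }
    }

    String get(String name) {
        return values.get(name);
    }

    // Values are stored as strings and parsed on read; the default covers missing names.
    int getInt(String name, int defaultValue) {
        String v = values.get(name);
        return v == null ? defaultValue : Integer.parseInt(v.trim());
    }

    // Mirrors configuration-1.xml.
    static Map<String, Property> first() {
        Map<String, Property> r = new LinkedHashMap<>();
        r.put("color", new Property("yellow", false));
        r.put("size", new Property("10", false));
        r.put("weight", new Property("heavy", true));
        return r;
    }

    // Mirrors configuration-2.xml, plus a hypothetical weight entry to exercise <final>.
    static Map<String, Property> second() {
        Map<String, Property> r = new LinkedHashMap<>();
        r.put("color", new Property("red", false));
        r.put("weight", new Property("light", false));
        return r;
    }

    public static void main(String[] args) {
        ResourceOverrideSketch conf = new ResourceOverrideSketch();
        conf.addResource(first());
        conf.addResource(second());
        System.out.println(conf.get("color") + " " + conf.getInt("size", 0) + " " + conf.get("weight"));
    }
}
```

In this model, color ends up overridden by the second resource while weight keeps its first value because it was declared final.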

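Maven POM (Project Object Model): groupId is the unique identifier of the group or organization that created the project, and artifactId is the unique base name of the project's artifact. A POM for building and testing MapReduce programs should include these artifacts: hadoop-client, junit, mrunit, and hadoop-minicluster. hadoop-client contains the classes needed for HDFS and MapReduce, junit runs the unit tests, mrunit supports MapReduce test cases, and hadoop-minicluster helps run a Hadoop cluster in a single JVM.
Configuration files live under the etc/hadoop directory. Use the -conf option to switch to a different configuration file, as in the following command:
hadoop fs -conf ***.xml -ls .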

MapReduce Application Development: Testing

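MRUnit is a testing library that makes it easy to pass known inputs to a mapper or to check that a reducer's output matches expectations. It is used together with JUnit.
Example mapper test class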

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.*;
public class MaxTemperatureMapperTest {
	@Test
	public void processesValidRecord() throws IOException, InterruptedException {
		// The input record is elided in the source
		Text value = new Text("****************");
		// Test driver: the mapper class, the test input, and the expected output
		new MapDriver<LongWritable, Text, Text, IntWritable>()
			.withMapper(new MaxTemperatureMapper())
			.withInput(new LongWritable(0), value)
			.withOutput(new Text("1950"), new IntWritable(-11))
			.runTest();
	}
}

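Rather than piling more logic into the mapper, write a parser class that encapsulates the parsing logic and have the mapper call it to clean the data. Similar mappers can then reuse the parser instead of reimplementing the same logic.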
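
To illustrate the idea of a separate parser class, here is a minimal sketch. It does not use the real NCDC record layout: the comma-separated "year,temperature,quality" format, the class name SimpleRecordParser, the 9999 missing-value sentinel, and the set of accepted quality codes are all assumptions made for this example.

```java
// Hypothetical parser for simplified records shaped "year,temperature,quality",
// for example "1950,-11,1". The real NCDC record layout is different.
class SimpleRecordParser {
    private static final int MISSING_TEMPERATURE = 9999; // assumed sentinel

    private String year;
    private int airTemperature;
    private String quality;
    private boolean wellFormed;

    // Parse one record and remember its fields; malformed records are flagged, not thrown.
    void parse(String record) {
        wellFormed = false;
        String[] fields = record.split(",");
        if (fields.length != 3) {
            return;
        }
        try {
            airTemperature = Integer.parseInt(fields[1].trim());
        } catch (NumberFormatException e) {
            return;
        }
        year = fields[0].trim();
        quality = fields[2].trim();
        wellFormed = true;
    }

    // The mapper asks this single question instead of repeating the checks itself.
    boolean isValidTemperature() {
        return wellFormed
                && airTemperature != MISSING_TEMPERATURE
                && quality.matches("[01459]"); // assumed set of acceptable quality codes
    }

    String getYear() {
        return year;
    }

    int getAirTemperature() {
        return airTemperature;
    }

    public static void main(String[] args) {
        SimpleRecordParser parser = new SimpleRecordParser();
        parser.parse("1950,-11,1");
        if (parser.isValidTemperature()) {
            System.out.println(parser.getYear() + "\t" + parser.getAirTemperature());
        }
    }
}
```

A mapper built on such a class only calls parse(), checks isValidTemperature(), and writes the year and temperature, so the validation rules live in one place and can be unit tested without any MapReduce machinery.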
Example reducer test

// Requires: import java.util.Arrays;
// Requires: import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
@Test
public void returnsMaximumIntegerInValues() throws IOException, InterruptedException {
	// Test driver: the reducer class, the test input, and the expected output
	new ReduceDriver<Text, IntWritable, Text, IntWritable>()
		.withReducer(new MaxTemperatureReducer())
		.withInput(new Text("1950"), Arrays.asList(new IntWritable(10), new IntWritable(5)))
		.withOutput(new Text("1950"), new IntWritable(10))
		.runTest();
}

Testing the driver: local job runner or mini cluster

// Local job runner
@Test
public void test() throws Exception {
	Configuration conf = new Configuration();
	conf.set("fs.defaultFS", "file:///");	// use the local filesystem
	conf.set("mapreduce.framework.name", "local");	// use the local job runner
	conf.setInt("mapreduce.task.io.sort.mb", 1);
	Path input = new Path("input/ncdc/micro");
	Path output = new Path("output");
	FileSystem fs = FileSystem.getLocal(conf);
	fs.delete(output, true);	// delete old output
	MaxTemperatureDriver driver = new MaxTemperatureDriver();
	driver.setConf(conf);
	int exitCode = driver.run(new String[] {
		input.toString(), output.toString() });
	Assert.assertEquals(0, exitCode);	// org.junit.Assert: the job should succeed
	checkOutput(conf, output);	// helper that compares the actual output with the expected output
}

MapReduce Application Development: Running on a Cluster

Ant or Maven makes it easy to build the job JAR file. Dependent JAR files should be packaged in the lib subdirectory of the job JAR.
mvn package -DskipTests
Launching the job

unset HADOOP_CLASSPATH
hadoop jar hadoop-example.jar V2.MaxTemperatureDriver -conf conf/hadoop-cluster.xml input/ncdc/all max-temp
The command takes the form hadoop jar [JAR file] [package.MainClass] [-conf configuration file] [input data] [output location]

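The waitForCompletion() method of Job submits the job and, when called with true, polls and prints its progress until the job finishes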

http://resource-manager-host:8088/ => the resource manager web UI, used to browse job information

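After the job completes, each reducer produces one output file. The -getmerge option of hadoop fs merges them into a single local file. The reduce output partitions are unordered with respect to each other (a hash partition function is used), so use the sort command to order the result.
hadoop fs -getmerge max-temp max-temp-local
sort max-temp-local | tail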
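
Why the merged file needs sorting can be seen in a small plain-Java sketch. It applies the same arithmetic as Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numPartitions, but to String keys instead of Writable keys, so the actual partition numbers on a cluster would differ. The class name HashPartitionSketch and the sample years are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

class HashPartitionSketch {
    // Same arithmetic as Hadoop's default HashPartitioner, applied here to String keys.
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Each partition is kept sorted by key, like one reducer's output file.
    static List<TreeSet<String>> partition(List<String> keys, int numPartitions) {
        List<TreeSet<String>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new TreeSet<String>());
        }
        for (String key : keys) {
            parts.get(partitionFor(key, numPartitions)).add(key);
        }
        return parts;
    }

    // Concatenate the partitions in order, roughly what -getmerge does with the part files.
    static List<String> merge(List<TreeSet<String>> parts) {
        List<String> merged = new ArrayList<>();
        for (TreeSet<String> part : parts) {
            merged.addAll(part);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> years = Arrays.asList("1949", "1950", "1951", "1952");
        List<String> merged = merge(partition(years, 2));
        System.out.println("merged: " + merged);
        Collections.sort(merged); // the role played by the sort command
        System.out.println("sorted: " + merged);
    }
}
```

Each partition is sorted internally, but a key's partition depends on its hash rather than its rank, so concatenating the partitions does not in general give a globally sorted list.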
Use counters to debug a job and gather statistics about it

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
	enum Temperature {
		OVER_100	// counter
	}
	private NcdcRecordParser parser = new NcdcRecordParser();	// parser class
	@Override
	public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
		parser.parse(value);
		if (parser.isValidTemperature()) {
			int airTemperature = parser.getAirTemperature();
			// Assuming temperatures are stored in tenths of a degree, 1000 means 100 degrees
			if (airTemperature > 1000) {
				System.err.println("Temperature over 100 degrees for input: " + value);	// log the suspect record
				context.setStatus("Detected possibly corrupt record: see logs.");
				context.getCounter(Temperature.OVER_100).increment(1);	// increment the counter
			}
			context.write(new Text(parser.getYear()), new IntWritable(airTemperature));
		}
	}
}

mapred job -counter <job-id> <group-name> <counter-name> => print the value of a counter

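YARN has a log aggregation service that collects the task logs of completed applications and moves them to HDFS, where they are stored in a container file for archival. Once yarn.log-aggregation-enable is set to true, task logs can be viewed from the command line. If the service is not enabled, view the task logs at http://node-manager-host:8042/logs/user
mapred job -logs <job-id> => view task logs when the service is enabled
Log types: system daemon logs (for administrators; each Hadoop daemon produces a log file using log4j), HDFS audit logs (for administrators; a record of all HDFS requests, off by default), MapReduce job history logs (for users; a record of events such as task completion during a job run, stored in HDFS), and MapReduce task logs (for users; each task child process produces a log file using log4j)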
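Job tuning checklist: the number of mappers, the number of reducers, combiners (to reduce the amount of data sent through the shuffle), intermediate compression (compressing the mapper output), custom serialization (using a custom Writable or comparator), and shuffle tweaks (tuning the memory management parameters)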
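
The combiner item on the checklist can be illustrated without a cluster. The sketch below is a plain-Java model, not Hadoop code: CombinerSketch and its helper methods are hypothetical, and the sample temperatures are made up. It applies a max function to each simulated map task's output before the "shuffle" and compares the number of records that would be transferred.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new SimpleEntry<String, Integer>(key, value);
    }

    // Keep only the maximum per key. The same function serves as combiner and
    // reducer because max is commutative and associative.
    static Map<String, Integer> maxPerKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> out = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            out.merge(p.getKey(), p.getValue(), Integer::max);
        }
        return out;
    }

    // Made-up output of two map tasks.
    static List<Map.Entry<String, Integer>> firstMapOutput() {
        return Arrays.asList(pair("1950", 0), pair("1950", 20), pair("1950", 10));
    }

    static List<Map.Entry<String, Integer>> secondMapOutput() {
        return Arrays.asList(pair("1950", 25), pair("1950", 15));
    }

    // Records sent through the shuffle when each map task's output is combined first.
    static List<Map.Entry<String, Integer>> shuffledWithCombiner() {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        shuffled.addAll(maxPerKey(firstMapOutput()).entrySet());
        shuffled.addAll(maxPerKey(secondMapOutput()).entrySet());
        return shuffled;
    }

    // Records sent through the shuffle with no combiner.
    static List<Map.Entry<String, Integer>> shuffledWithoutCombiner() {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        shuffled.addAll(firstMapOutput());
        shuffled.addAll(secondMapOutput());
        return shuffled;
    }

    public static void main(String[] args) {
        System.out.println("records without combiner: " + shuffledWithoutCombiner().size());
        System.out.println("records with combiner: " + shuffledWithCombiner().size());
        System.out.println("result: " + maxPerKey(shuffledWithCombiner()));
    }
}
```

The final result is the same either way, but fewer records cross the shuffle when the combiner runs. This only holds for functions like max that can be applied to partial results; a mean, for example, cannot be combined this way without also carrying counts.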