Hadoop Study Notes (2)

无籽西瓜吃吗

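License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/qq_24326765/article/details/80963320

This article covers the main components of Hadoop 2.X in detail: HDFS architecture, data block management, cluster maintenance commands, the NameNode startup process, the role of the SecondaryNameNode, YARN architecture and resource management, and the MapReduce programming model.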

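II. A Deeper Look at Hadoop 2.X

1. HDFS architecture in detail

A classic case:
A company in Hangzhou originally had 40 nodes and has now added 20 nodes with larger disks and better performance. How can data be steered toward the 20 new nodes as much as possible?

Answer: bin/hdfs has a balancer subcommand (bin/hdfs balancer) whose policy and threshold can be adjusted. It evens out disk utilization across DataNodes, so the larger new nodes end up holding more data.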

Handling corrupted data blocks:

2. Where the NameNode and DataNode store their data

Storage locations

namenode:

datanode:

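3. HDFS cluster administration commands

View the cluster status (the same information as the web UI on port 50070):

bin/hdfs dfsadmin -report

Refresh the node list (use after adding or removing nodes):
bin/hdfs dfsadmin -refreshNodes

Run a command against the local file system instead of HDFS (-ls / is only an example subcommand):
bin/hdfs dfs -Dfs.defaultFS=file:/// -ls /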

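4. Installing Eclipse and Maven on Linux

Installing Maven:
① Unpack the Maven archive with tar
② Configure the environment variables

Run mvn -version to check that the installation succeeded.

If reloading the environment variables with source does not take effect:
run export directly on the command line.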

5. Obtaining a FileSystem object

6. Reading an HDFS file to the console

7. Writing a local file to HDFS

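8. The NameNode startup process in detail (the roles of fsimage and edits)

Note: when HDFS runs for a long time, the edits log grows very large and merging it takes a long time, so the next NameNode startup becomes very slow.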
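
To make the note above concrete, here is a rough plain-Java model of what startup has to do: load the last fsimage snapshot, then replay every entry of the edits log in order. It is only a sketch. The real files are binary and much richer, and the `Edit` record format and class names below are hypothetical. The point is simply that replay time grows with the length of the edits log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model only; it does not read real fsimage or edits files.
final class NameNodeStartupModel {

    // One logged namespace change: "ADD /path" or "DELETE /path" (hypothetical format).
    static final class Edit {
        final String op;
        final String path;

        Edit(String op, String path) {
            this.op = op;
            this.path = path;
        }
    }

    // Rebuild the in-memory namespace: copy the snapshot, then replay the log in order.
    static TreeSet<String> replay(TreeSet<String> fsimage, List<Edit> edits) {
        TreeSet<String> namespace = new TreeSet<>(fsimage);
        for (Edit e : edits) {
            if (e.op.equals("ADD")) {
                namespace.add(e.path);
            } else if (e.op.equals("DELETE")) {
                namespace.remove(e.path);
            } else {
                throw new IllegalArgumentException("unknown op: " + e.op);
            }
        }
        return namespace;
    }

    public static void main(String[] args) {
        TreeSet<String> fsimage = new TreeSet<>();
        fsimage.add("/user/a.txt");
        List<Edit> edits = new ArrayList<>();
        edits.add(new Edit("ADD", "/user/b.txt"));
        edits.add(new Edit("DELETE", "/user/a.txt"));
        System.out.println(replay(fsimage, edits)); // prints [/user/b.txt]
    }
}
```

Writing the replayed result back out as a new snapshot and starting an empty log is, in this model, what a checkpoint amounts to.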

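9. The auxiliary role of the SecondaryNameNode

Note: the SecondaryNameNode only plays an auxiliary role, namely merging fsimage and edits periodically. Because this local machine holds little data and sees few operations, the SecondaryNameNode can be left off.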

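10. Safemode during HDFS startup

① While the NameNode is starting:

② After the NameNode has started:

③ After the DataNodes have started:

Note: why is safe mode not switched off immediately? To give the servers a buffer period in which to stabilize before it is switched off.

④ After HDFS startup is complete: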
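
The exit rule behind these steps can be sketched in plain Java. This is a toy model, not the NameNode's actual code. It assumes the stock defaults of `dfs.namenode.safemode.threshold-pct` (0.999) and `dfs.namenode.safemode.extension` (30000 ms), and the class and method names are hypothetical.

```java
// Toy model of the safe-mode exit rule; the real NameNode logic is more involved.
final class SafeModeModel {

    // Assumed defaults of dfs.namenode.safemode.threshold-pct and
    // dfs.namenode.safemode.extension (milliseconds).
    static final double THRESHOLD_PCT = 0.999;
    static final long EXTENSION_MS = 30_000L;

    // True once DataNodes have reported a large enough share of all blocks.
    static boolean thresholdReached(long reportedBlocks, long totalBlocks) {
        if (totalBlocks == 0) {
            return true; // an empty file system has nothing to wait for
        }
        return (double) reportedBlocks / totalBlocks >= THRESHOLD_PCT;
    }

    // Safe mode may end only after the threshold has held for the extension period.
    static boolean canLeave(long reportedBlocks, long totalBlocks, long msSinceThreshold) {
        return thresholdReached(reportedBlocks, totalBlocks)
                && msSinceThreshold >= EXTENSION_MS;
    }

    public static void main(String[] args) {
        System.out.println(canLeave(998, 1000, 60_000));  // false: too few blocks reported
        System.out.println(canLeave(1000, 1000, 5_000));  // false: still inside the buffer period
        System.out.println(canLeave(1000, 1000, 30_000)); // true
    }
}
```

The second case is the buffer period described in the note: the block threshold is met, but the extension time has not yet elapsed.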

11. Entering safe mode (Safemode) manually

bin/hdfs dfsadmin -safemode enter

Leaving safe mode:

bin/hdfs dfsadmin -safemode leave

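12. YARN's evolution and the roles of its components

In Hadoop 1.x, both cluster resources and data processing were managed by MapReduce. Hadoop 2.x separates the two: cluster resources are managed by YARN, and MapReduce simply runs on YARN as one kind of application.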

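13. How YARN manages and schedules cluster resources, and how to configure a node's resources (memory and CPU cores)

Virtual CPUs: suppose a cluster has 5 i5 machines and 5 i7 machines. One i5 core and one i7 core clearly differ in processing power, so 2 i7 cores can be counted as 3 i5 cores. At that ratio, 4 i7 cores count as 6 i5 cores.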
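
The arithmetic above can be written down as a small helper. This is only a sketch: the 1.5 ratio comes from the i5/i7 example, the class and method names are hypothetical, and the resulting number is the kind of value one would then put into `yarn.nodemanager.resource.cpu-vcores` on each node.

```java
// Sketch of the virtual-core arithmetic from the i5/i7 example.
final class VcoreModel {

    // Convert physical cores to virtual cores using a per-machine ratio, rounding down.
    static int toVcores(int physicalCores, double vcoresPerCore) {
        return (int) Math.floor(physicalCores * vcoresPerCore);
    }

    public static void main(String[] args) {
        System.out.println(toVcores(2, 1.5)); // 2 i7 cores -> 3 vcores
        System.out.println(toVcores(4, 1.5)); // 4 i7 cores -> 6 vcores
        System.out.println(toVcores(4, 1.0)); // 4 i5 cores -> 4 vcores (i5 is the baseline)
    }
}
```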

Look up the corresponding settings in yarn-default.xml:

Modified values (set in yarn-site.xml):

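14. The YARN ecosystem and Slider

Slider:
Apache Slider became an Apache incubator project (home page: http://slider.incubator.apache.org/). It is incubated outside YARN itself, and its goal is to deploy users' existing services or applications directly onto YARN.
As YARN has matured, it can now host services such as HBase and Storm directly. Apache Slider derives directly from Hoya, a project that attempted to deploy HBase on YARN. Running HBase on YARN brings many benefits, including:
(1) Multiple HBase cluster instances can be deployed in one physical cluster at the same time.
(2) HBase clusters get resource isolation (something HBase cannot provide on its own).
(3) Multiple versions of HBase can be deployed in one physical cluster.
These benefits apply equally to other services deployed on YARN.
With the release of Slider, users can deploy an existing service onto a YARN cluster without modifying the service at all.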

15. Data flow in the MapReduce parallel programming model
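
As a minimal sketch of how data moves through the model, the plain-Java class below imitates the three stages for a word count: map emits (word, 1) pairs, the shuffle groups values by sorted key, and reduce sums each group. It does not use the Hadoop API, and the class and method names are hypothetical. It only mirrors the shape of the data at each stage.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java imitation of map -> shuffle -> reduce; not the Hadoop API.
final class WordCountModel {

    // Map stage: one input line in, a list of (word, 1) pairs out.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle stage: group values by key; the TreeMap keeps keys sorted.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce stage: sum the values collected for one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Wire the three stages together for a list of input lines.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=2, hdfs=1, yarn=1}
        System.out.println(run(Arrays.asList("hadoop yarn", "hadoop hdfs")));
    }
}
```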

16. Writing WordCount and running it on YARN

mapper:

reducer:

Assembling the job:

Packaging and running on YARN:

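17. Improving the WordCount code

Why extend Configured:

So that the Configuration is passed in once and can be reused many times (through getConf()).

Why implement the Tool interface:

It wraps the assembly of the job in the run method. The WordCount code written earlier already defined its own run method, so the IDE did not prompt for any unimplemented methods when the Tool interface was added.

Why use ToolRunner's run method:

ToolRunner.run parses the generic command-line options (such as -D key=value) into the Configuration, hands that Configuration to the tool, and then calls the tool's run method with the remaining arguments.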

18. Writing a MapReduce template

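import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ModuleMapReduce extends Configured implements Tool{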
	
	// TODO
	public static class ModuleMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

		@Override
		protected void setup(Mapper<LongWritable, Text, Text, IntWritable>.Context context)
				throws IOException, InterruptedException {
			// TODO
		}

		@Override
		protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
				throws IOException, InterruptedException {
			// TODO
		}
		
		@Override
		protected void cleanup(Mapper<LongWritable, Text, Text, IntWritable>.Context context)
				throws IOException, InterruptedException {
			// TODO
		}
	}

	// TODO
	public static class ModuleReducer extends Reducer<Text, IntWritable, Text, IntWritable>{

		@Override
		protected void setup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
				throws IOException, InterruptedException {
			// TODO
		}

		@Override
		protected void reduce(Text key, Iterable<IntWritable> values,
				Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
			//TODO
		}
		
		@Override
		protected void cleanup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
				throws IOException, InterruptedException {
			// TODO
		}
	}

	@Override
	public int run(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		
		// reuse the Configuration that ToolRunner passed in
		Configuration configuration=getConf();
		// job
		Job job=Job.getInstance(configuration);
		job.setJarByClass(ModuleMapReduce.class);
		// set the input path
		Path inPath=new Path(args[0]);
		FileInputFormat.addInputPath(job, inPath);
		// map
		job.setMapperClass(ModuleMapper.class);
		// TODO
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);
//*************************Shuffle*************************
		// 1) partition
//		job.setPartitionerClass(cls);
		// 2) sort
//		job.setSortComparatorClass(cls);
		// 3) combiner (optional)
//		job.setCombinerClass(cls);
		// 4) group
//		job.setGroupingComparatorClass(cls);
		
//*************************Shuffle*************************
		// reduce
		job.setReducerClass(ModuleReducer.class);
		// TODO
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		// set the number of reduce tasks (default 1)
		//job.setNumReduceTasks(2);
		// set the output path
		Path outPath=new Path(args[1]);
		FileOutputFormat.setOutputPath(job, outPath);
		// submit the job and wait for it to finish
		boolean isSuccess = job.waitForCompletion(true);
		return isSuccess?0:1;
	}
	
	public static void main(String[] args) throws Exception {
		
		Configuration configuration=new Configuration();
		// optional: compress the map output
//		configuration.set("mapreduce.map.output.compress", "true");
		// default codec: org.apache.hadoop.io.compress.DefaultCodec
//		configuration.set("mapreduce.map.output.compress.codec", "");
		int status=ToolRunner.run(configuration, new ModuleMapReduce(), args);
		System.out.println(status);
	}
}

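Enabling compression:

configuration.set("mapreduce.map.output.compress", "true");
// the codec's fully qualified class name goes in the second argument
configuration.set("mapreduce.map.output.compress.codec", "");

Codec formats:

View the source and list the subclasses of the codec class:

(Eclipse shortcut: Ctrl+T)

Note: SnappyCodec is currently the most widely used.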

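19. Data types in the MapReduce framework, with a demo

Note: key types must implement both Writable and Comparable (in practice, WritableComparable) because keys are sorted. Value types only need to implement Writable.

View the source:

Demo:

A custom type as the key: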
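
A rough sketch of such a key is shown below. So that it compiles without Hadoop on the classpath, it implements only `java.lang.Comparable` and uses the JDK's `DataOutput`/`DataInput`. In a real job the class would instead declare `implements org.apache.hadoop.io.WritableComparable<PairKey>`, whose `write` and `readFields` methods have the same signatures. The class name and its two fields (`word`, `year`) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Objects;

// Stand-in for a WritableComparable key; JDK-only so it runs without Hadoop.
final class PairKey implements Comparable<PairKey> {
    private String word = "";
    private int year;

    PairKey() { } // keys are created reflectively, so a no-arg constructor is required

    PairKey(String word, int year) {
        this.word = word;
        this.year = year;
    }

    // Serialize the fields in a fixed order (same signature as Writable.write).
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeInt(year);
    }

    // Read the fields back in the same order (same signature as Writable.readFields).
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        year = in.readInt();
    }

    // Sort order used for keys: by word first, then by year.
    @Override
    public int compareTo(PairKey other) {
        int c = word.compareTo(other.word);
        return c != 0 ? c : Integer.compare(year, other.year);
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof PairKey && compareTo((PairKey) o) == 0;
    }

    @Override
    public int hashCode() {
        return Objects.hash(word, year);
    }

    // Helper for the demo: serialize a key to bytes and read it into a fresh instance.
    static PairKey roundTrip(PairKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            PairKey copy = new PairKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A value type would need only the `write` and `readFields` pair, since values are never sorted.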

A custom type as the value:

20. The Shuffle stage of MapReduce execution

Note: compression can be applied when large map output files are written to local disk.
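
The first step of the shuffle is partitioning, which decides which reduce task receives each key. The sketch below reproduces the formula used by Hadoop's default `HashPartitioner` in plain Java. The class name is hypothetical, and a custom partitioner set through `job.setPartitionerClass` would replace this logic.

```java
// Plain-Java copy of the default hash-partitioning formula.
final class HashPartitionModel {

    // Mask off the sign bit so the result is never negative, then take the modulus.
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(partition("a", 2)); // "a".hashCode() is 97 -> partition 1
        System.out.println(partition("b", 2)); // "b".hashCode() is 98 -> partition 0
    }
}
```

Equal keys always hash to the same partition, which is what guarantees that all values for one key reach the same reducer.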

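21. Common MapReduce optimizations in practice

Setting the number of reduce tasks (default 1):

① In the job (job.setNumReduceTasks(n), as in the template above)

② In the Configuration (property mapreduce.job.reduces)

For compression, see the notes above.

Shuffle tuning:

① How many small spill files to merge at once:

factor: the merge factor

② Size of the in-memory buffer:

If, for example, data fetched by a crawler should go to files almost immediately rather than sit in the buffer, lower both this value and the spill threshold, e.g. a buffer of 1 (MB) and a threshold of 0.1.

③ Spill threshold:

④ Virtual cores allocated to each map and reduce task (default 1):

Adjust as needed.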
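
To see how items ② and ③ interact, the sketch below computes the point at which buffered map output starts spilling to disk. It assumes the two settings are `mapreduce.task.io.sort.mb` (default 100) and `mapreduce.map.sort.spill.percent` (default 0.80). The class and method names are hypothetical, and the framework's own bookkeeping is more detailed than this single multiplication.

```java
// Rough arithmetic for when the map-side sort buffer begins to spill.
final class SpillModel {

    // Bytes of buffered map output at which a spill to disk begins.
    static long spillStartBytes(int sortMb, double spillPercent) {
        return Math.round(sortMb * 1024L * 1024L * spillPercent);
    }

    public static void main(String[] args) {
        System.out.println(spillStartBytes(100, 0.80)); // defaults: 83886080 bytes (80 MB)
        System.out.println(spillStartBytes(1, 0.1));    // crawler example: 104858 bytes (about 0.1 MB)
    }
}
```

With the crawler settings, a spill begins after roughly 100 KB of output, so records reach disk almost as soon as they are produced, at the cost of many small spill files to merge later.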