Samplers in Hadoop

License: CC 4.0 BY-SA

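This article introduces the role of samplers in Hadoop and stresses that using a sampler appropriately solves the problem of unevenly partitioned data and improves efficiency. It explains in detail how to use a sampler, including setting the number of reduces and the path of the sampling file. It also discusses how the three common samplers, RandomSampler, SplitSampler and IntervalSampler, work, and gives examples of samplers in practical use, such as terasort.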

1. Why use a sampler
The following page gives a fairly reliable explanation:
http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/

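Put simply, a sampler answers the question "How to automatically find 'good' partitioning function". In many cases you cannot decide on a fixed partitioner strategy up front, so you need to know how the data is actually distributed. A poor strategy leaves the reduce nodes with unevenly sized shares of the data, which hurts efficiency considerably.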
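To make the problem concrete, here is a small sketch in plain Java with no Hadoop dependencies. The class name SkewDemo, the key values and the split points are all made up for illustration. It counts how many keys each of 4 partitions receives when the key range is cut into equal parts, and again when the boundaries are taken from the sorted data. For brevity the whole data set stands in for the sample; a real sampler only reads a subset.

```java
import java.util.Arrays;

class SkewDemo {
    // Returns the partition index for a key given ascending split points:
    // partition i receives keys k with splits[i-1] <= k < splits[i].
    static int partitionOf(int key, int[] splits) {
        int p = 0;
        while (p < splits.length && key >= splits[p]) {
            p++;
        }
        return p;
    }

    // Counts how many keys land in each of the (splits.length + 1) partitions.
    static int[] countPerPartition(int[] keys, int[] splits) {
        int[] counts = new int[splits.length + 1];
        for (int key : keys) {
            counts[partitionOf(key, splits)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical skewed keys in the range [0, 100): most of them are small.
        int[] keys = {1, 2, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 40, 70, 90, 95};

        // Fixed strategy: cut the key range [0, 100) into 4 equal parts.
        int[] fixedSplits = {25, 50, 75};

        // Data-driven strategy: use the quartiles of the sorted keys as split points.
        int[] sorted = keys.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        int[] sampledSplits = {sorted[n / 4], sorted[n / 2], sorted[3 * n / 4]};

        // prints "fixed:   [12, 1, 1, 2]"
        System.out.println("fixed:   " + Arrays.toString(countPerPartition(keys, fixedSplits)));
        // prints "sampled: [4, 4, 4, 4]"
        System.out.println("sampled: " + Arrays.toString(countPerPartition(keys, sampledSplits)));
    }
}
```

With the fixed boundaries one partition receives 12 of the 16 keys, which is the kind of imbalance a sampler is meant to avoid.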

2. How to use a sampler

The simplest example that produces sampling information looks like this:

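import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class SampleJob {
	public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
		// Create the job (Job.getInstance replaces the deprecated Job constructor)
		Configuration conf = new Configuration();
		Job job = Job.getInstance(conf, "sampler job");
		job.setJarByClass(SampleJob.class);

		Path inputPath = new Path(args[0]);
		String partitionDir = args[1];
		// Location where the generated partition (sampling) file will be stored
		Path partitionPath = new Path(partitionDir, "_partition");

		FileInputFormat.setInputPaths(job, inputPath);

		// Note: MapOutputKeyClass must match the key type that the InputFormat returns
		// when reading the input (the default TextInputFormat returns LongWritable keys).
		job.setMapOutputKeyClass(LongWritable.class);

		job.setPartitionerClass(TotalOrderPartitioner.class);
		// Pass the path of the sampling file to the job through TotalOrderPartitioner
		TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionPath);

		// The number of reduces must be set; the sampler uses it to compute the partition boundaries
		job.setNumReduceTasks(20);

		// Configure the sampler: freq = 0.1, numSamples = 1000, maxSplitsSampled = 10
		InputSampler.RandomSampler<LongWritable, Text> sampler = new InputSampler.RandomSampler<LongWritable, Text>(0.1, 1000, 10);
		InputSampler.writePartitionFile(job, sampler);
	}
}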

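As shown above, the sampling information is collected before any map task runs; it is done on the client. You can therefore generate the sampling data without running a MapReduce job at all. Keep the following points in mind:

1. The KEY and VALUE types returned by the InputFormat must match the key and value types of the InputSampler.

2. The Mapper's output key type must be the same as its input key type. (If you do not run the MapReduce job, you can simply call job.setMapOutputKeyClass to set the output key type to the input key type.)

3. The number of reduces must be set. InputSampler generates (numReduces-1) values from it, so that the key space is divided into numReduces partitions.

4. The path of the file produced by the sampler must be passed to the job with TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionPath).

Note: the data the sampler writes to the output file has the same type as the key returned by the InputFormat. So whatever type of data you want to sample, either use one of the InputFormat classes that ship with Hadoop or write your own InputFormat whose key type is the type you want to sample.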
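Point 3 can be illustrated with a rough sketch of how (numReduces-1) boundary values might be picked from a set of sampled keys: sort the samples and take evenly spaced elements. This is a simplified model in plain Java, not Hadoop's code; the class name SplitPointSketch and the sample values are hypothetical. As far as I know, the real InputSampler.writePartitionFile additionally skips duplicate keys and writes the result to a SequenceFile, both of which are left out here.

```java
import java.util.Arrays;

class SplitPointSketch {
    // Picks (numPartitions - 1) evenly spaced keys from the sorted samples.
    // Assumes there are at least numPartitions samples.
    static long[] chooseSplitPoints(long[] samples, int numPartitions) {
        if (samples.length < numPartitions) {
            throw new IllegalArgumentException("need at least numPartitions samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        long[] splits = new long[numPartitions - 1];
        float step = sorted.length / (float) numPartitions;
        for (int i = 1; i < numPartitions; i++) {
            // The i-th boundary sits i/numPartitions of the way through the sorted samples.
            splits[i - 1] = sorted[Math.round(step * i)];
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] samples = {50, 10, 80, 30, 70, 20, 60, 40};
        // 4 partitions need 3 boundaries; prints "[30, 50, 70]"
        System.out.println(Arrays.toString(chooseSplitPoints(samples, 4)));
    }
}
```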

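The partition file is usually also placed in the distributed cache:

    // partitionFile is the Path of the partition file (partitionPath in the example above)
    // and conf is the job's Configuration; the "#_partitions" fragment names the symlink.
    // Note: DistributedCache is deprecated in Hadoop 2.x; Job.addCacheFile(URI) replaces it.
    URI partitionURI = new URI(partitionFile.toString() + "#_partitions");
    DistributedCache.addCacheFile(partitionURI, conf);
    DistributedCache.createSymlink(conf);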
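The reason for caching the file is that every map task needs the boundary values to decide which reduce each key goes to. A minimal sketch of that lookup, using a binary search over the sorted split points, is shown below. The class name RangeLookup and the values are hypothetical, and plain long keys stand in for Hadoop's WritableComparable keys; the real TotalOrderPartitioner is more elaborate.

```java
import java.util.Arrays;

class RangeLookup {
    // Returns the partition for a key, given ascending split points.
    // A key equal to a split point goes to the higher partition.
    static int findPartition(long key, long[] splitPoints) {
        // binarySearch returns the index if found, or -(insertionPoint) - 1 if not.
        int pos = Arrays.binarySearch(splitPoints, key) + 1;
        return pos < 0 ? -pos : pos;
    }

    public static void main(String[] args) {
        long[] splitPoints = {30, 50, 70}; // 3 boundaries, so 4 partitions
        System.out.println(findPartition(5, splitPoints));   // prints 0
        System.out.println(findPartition(30, splitPoints));  // prints 1
        System.out.println(findPartition(45, splitPoints));  // prints 1
        System.out.println(findPartition(100, splitPoints)); // prints 3
    }
}
```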

3. Commonly used samplers

http://blog.youkuaiyun.com/andyelvis/article/details/7294811

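In Hadoop, sampling is implemented by the InputSampler class (org.apache.hadoop.mapred.lib.InputSampler in the old API; its new-API counterpart is org.apache.hadoop.mapreduce.lib.partition.InputSampler).

InputSampler implements three sampling methods: RandomSampler, SplitSampler and IntervalSampler. RandomSampler is the most time-consuming of the three.

SplitSampler, RandomSampler and IntervalSampler are all static nested classes of InputSampler, and each of them implements Sampler, an interface nested inside InputSampler.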
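The following sketch models the three strategies on in-memory lists, as I understand them; it is a simplification, not Hadoop's implementation. The names SamplerSketch, splitSampler, intervalSampler and randomSampler are hypothetical, each inner list stands in for one input split, and the seed parameter exists only to make the example repeatable. The real RandomSampler also limits how many splits it reads (maxSplitsSampled), which is omitted here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class SamplerSketch {
    // Simplified stand-in for the Sampler interface; each inner list models one input split.
    interface Sampler {
        List<Long> sample(List<List<Long>> splits);
    }

    // SplitSampler idea: take the first (numSamples / splitCount) keys of every split.
    static Sampler splitSampler(int numSamples) {
        return splits -> {
            List<Long> out = new ArrayList<>();
            int perSplit = numSamples / splits.size();
            for (List<Long> split : splits) {
                out.addAll(split.subList(0, Math.min(perSplit, split.size())));
            }
            return out;
        };
    }

    // IntervalSampler idea: keep a key whenever the kept/seen ratio is below freq.
    static Sampler intervalSampler(double freq) {
        return splits -> {
            List<Long> out = new ArrayList<>();
            long seen = 0;
            long kept = 0;
            for (List<Long> split : splits) {
                for (Long key : split) {
                    seen++;
                    if ((double) kept / seen < freq) {
                        out.add(key);
                        kept++;
                    }
                }
            }
            return out;
        };
    }

    // RandomSampler idea: visit every key, pick it with probability freq, and once
    // numSamples keys are held, overwrite a random one while lowering freq.
    static Sampler randomSampler(double freq, int numSamples, long seed) {
        return splits -> {
            Random random = new Random(seed);
            List<Long> out = new ArrayList<>();
            double f = freq;
            for (List<Long> split : splits) {
                for (Long key : split) {
                    if (random.nextDouble() <= f) {
                        if (out.size() < numSamples) {
                            out.add(key);
                        } else {
                            out.set(random.nextInt(numSamples), key);
                            f *= (numSamples - 1) / (double) numSamples;
                        }
                    }
                }
            }
            return out;
        };
    }

    public static void main(String[] args) {
        List<List<Long>> splits = Arrays.asList(
                Arrays.asList(1L, 2L, 3L, 4L, 5L),
                Arrays.asList(6L, 7L, 8L, 9L, 10L));
        System.out.println("split:    " + splitSampler(4).sample(splits));      // [1, 2, 6, 7]
        System.out.println("interval: " + intervalSampler(0.5).sample(splits)); // [1, 3, 5, 7, 9]
        System.out.println("random:   " + randomSampler(0.5, 3, 42L).sample(splits));
    }
}
```

In this model the split sampler reads only a prefix of each split, while the random sampler visits every key, which is consistent with the remark that RandomSampler is the slowest of the three.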