Hadoop TopN Implementation

it小奋

License: CC 4.0 BY-SA

Categories: Java, Hadoop. Tags: Hadoop, big data

Article link: https://blog.youkuaiyun.com/u010820702/article/details/59118493

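This article describes a way to optimize TopN sorting in big data processing. It is implemented on the MapReduce framework, uses a fixed-size ordered collection to reduce memory consumption, and includes the concrete Java implementation.

Business scenario: Big data mining takes many forms, and even the most basic big data processing technique should look at the full data set rather than a subset. Take TopN (sort and take the top N items) as an example: the entire batch of data is counted, and the target records are then filtered out of the result.

Data format: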

VERSION=1.0,PASSTIME=2016-11-3000:00:39 000,CARSTATE=1,CARPLATE=无,PLATETYPE=99,SPEED=0,PLATECOLOR=4,LOCATIONID=-1,DEVICEID=-1,DRIVEWAY=2,DRIVEDIR=4,CAPTUREDIR=1,CARCOLOR=10,CARBRAND=99,CARBRANDZW=其它,TGSID=1125,PLATECOORD=0,0,0,0,CABCOORD=0,0,0,0,IMGID1=http://11.110.248.59:9099/image/dhdfs/2016-11-30/archivefile-2016-11-30-000040-00677B0200000001:5750848/308059.jpg,IMGID2=,IMGID3=,

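The total record count is in the tens of millions, stored across multiple files. A .writed file is a marker indicating that its data file is ready to be read; the real data file has the marker's name with the .writed suffix removed. The directory layout is as follows:

Goal: across the entire data set, find the checkpoint IDs (TGSID) of the N checkpoints with the highest number of passing vehicles, for display by related modules.

Approach:

Map → (checkpoint ID, 1) → (checkpoint ID, List<Int>) {1. sum; 2. cache all of the summed results and sort them; 3. output the N records that meet the condition}

Initial optimization: do not cache the data. Instead, use a fixed-size ordered collection to cap the amount of data held, which keeps memory consumption to a minimum. This happens in the Reduce phase.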
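The fixed-size ordered collection can be tried outside Hadoop first. The following is a minimal sketch using only the JDK; the class name `BoundedTopN` and the sample counts are made up for illustration, and it assumes that all counts are distinct, since the count serves as the map key.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BoundedTopN {
    /** Keeps only the n largest counts seen so far; assumes counts are distinct. */
    static List<String> topN(Map<String, Integer> counts, int n) {
        TreeMap<Integer, String> top = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            top.put(e.getValue(), e.getKey());
            if (top.size() > n) {
                top.remove(top.firstKey()); // evict the current smallest count
            }
        }
        // descendingMap() iterates from the largest count to the smallest
        return new ArrayList<>(top.descendingMap().values());
    }

    public static void main(String[] args) {
        // Hypothetical checkpoint IDs and pass counts
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("1125", 40);
        counts.put("1126", 75);
        counts.put("1127", 12);
        counts.put("1128", 98);
        System.out.println(topN(counts, 2)); // [1128, 1126]
    }
}
```

The map never holds more than n + 1 entries, so memory use depends on N rather than on the number of distinct checkpoints.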

Initial implementation:

CarTopNMapper.java, the Mapper function

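import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class CarTopNMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        String temp = value.toString();
        if (temp.length() > 13) {
            // Drop the leading "VERSION=1.0," prefix (12 characters)
            temp = temp.substring(12);
            String[] items = temp.split(",");
            // TGSID is the field at index 14, so at least 15 fields are needed
            if (items.length > 14) {
                // TGSID (checkpoint ID) as key
                try {
                    // Strip the "TGSID=" prefix
                    String tgsid = items[14].substring(6);
                    // Validation only: non-numeric IDs throw and are skipped
                    Integer.parseInt(tgsid);
                    context.write(new Text(tgsid), new IntWritable(1));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
    }
}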

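The Mapper implementation preprocesses the data and emits only valid records downstream. Similar operations can carry extra data along with each record.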
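The field extraction can be checked in isolation before running a job. The sketch below mirrors the Mapper's parsing steps with plain JDK code. The class name `TgsidParser` is hypothetical, the `startsWith("TGSID=")` check is an extra safeguard that the Mapper does not have, and the sample line is an abbreviated, ASCII-only variant of the record shown in the data format section.

```java
import java.util.Optional;

class TgsidParser {
    /** Returns the numeric TGSID of a record line, or empty if the line is malformed. */
    static Optional<String> extractTgsid(String line) {
        if (line.length() <= 13) {
            return Optional.empty();
        }
        // Drop the 12-character "VERSION=1.0," prefix, then split into fields
        String[] items = line.substring(12).split(",");
        // Assumption of this sketch: field 14 must literally start with "TGSID="
        if (items.length <= 14 || !items[14].startsWith("TGSID=")) {
            return Optional.empty();
        }
        String tgsid = items[14].substring(6);
        try {
            Integer.parseInt(tgsid); // validation only
            return Optional.of(tgsid);
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        String line = "VERSION=1.0,PASSTIME=2016-11-30,CARSTATE=1,CARPLATE=NA,"
                + "PLATETYPE=99,SPEED=0,PLATECOLOR=4,LOCATIONID=-1,DEVICEID=-1,"
                + "DRIVEWAY=2,DRIVEDIR=4,CAPTUREDIR=1,CARCOLOR=10,CARBRAND=99,"
                + "CARBRANDZW=NA,TGSID=1125,PLATECOORD=0,0,0,0";
        System.out.println(extractTgsid(line).orElse("invalid")); // 1125
        System.out.println(extractTgsid("VERSION=1.0,broken").orElse("invalid")); // invalid
    }
}
```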

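CarTopNReduce.java makes good use of the Reducer's lifecycle methods to collect the TopN data. An ordered collection sorts the result set indirectly, and capping the size of that collection largely removes the risk of running out of memory.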

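import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

class CarTopNReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Ordered by count, ascending. Because the count is the key,
    // checkpoints with the same count overwrite each other.
    private final TreeMap<Integer, String> tm = new TreeMap<Integer, String>();
    private int N;

    @Override
    protected void setup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // Read N from the job configuration, defaulting to 10
        N = conf.getInt(CarTopN.TOPN, 10);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
            Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        // Sum the passes recorded for this checkpoint
        int weight = 0;
        for (IntWritable iw : values) {
            weight += iw.get();
        }
        tm.put(weight, key.toString());
        // Keep at most N entries by evicting the smallest count
        if (tm.size() > N) {
            tm.remove(tm.firstKey());
        }
    }

    @Override
    protected void cleanup(Reducer<Text, IntWritable, Text, IntWritable>.Context context)
            throws IOException, InterruptedException {
        // Emit the retained entries in ascending order of count
        for (Integer key : tm.keySet()) {
            context.write(new Text("byonet:" + tm.get(key)), new IntWritable(key));
        }
    }
}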
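One caveat with a TreeMap keyed by count: if two checkpoints have the same count, the second put replaces the first, so one of them disappears from the result. The sketch below shows one possible way around this, assuming ties should be kept and ordered by ID. The names `TieSafeTopN` and `Entry` are hypothetical and not part of the original program.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

class TieSafeTopN {
    /** One (checkpoint ID, count) pair; helper type for this sketch only. */
    static final class Entry {
        final String id;
        final int count;

        Entry(String id, int count) {
            this.id = id;
            this.count = count;
        }
    }

    private final int n;
    // Ordered by count first, then by ID, so equal counts remain distinct elements
    private final TreeSet<Entry> top = new TreeSet<>(
            Comparator.comparingInt((Entry e) -> e.count)
                    .thenComparing((Entry e) -> e.id));

    TieSafeTopN(int n) {
        this.n = n;
    }

    /** Adds one checkpoint total and evicts the smallest entry if over capacity. */
    void offer(String id, int count) {
        top.add(new Entry(id, count));
        if (top.size() > n) {
            top.pollFirst();
        }
    }

    /** Returns the retained IDs from the largest count to the smallest. */
    List<String> idsDescending() {
        List<String> ids = new ArrayList<>();
        for (Entry e : top.descendingSet()) {
            ids.add(e.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        TieSafeTopN topN = new TieSafeTopN(2);
        topN.offer("1125", 50);
        topN.offer("1126", 50); // same count as 1125, both are kept
        topN.offer("1127", 30);
        System.out.println(topN.idsDescending()); // [1126, 1125]
    }
}
```

When a tie falls exactly on the cut-off, this sketch evicts the entry with the lexicographically smaller ID. That tie-break is arbitrary and can be changed in the comparator.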

ITGSParition.java, the partition function. This part is optional and matters little when the key is not a composite key.

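import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// The value type must match the map output value type (IntWritable)
class ITGSParition extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask the sign bit instead of using Math.abs, which stays negative for Integer.MIN_VALUE
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}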
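A detail worth knowing when deriving a partition index from a hash code: `Math.abs` returns a negative number for `Integer.MIN_VALUE`, so `Math.abs(hash) % numPartitions` can go out of range. Masking off the sign bit, which is also what Hadoop's built-in `HashPartitioner` does, avoids this. The class name `PartitionIndex` below is made up for the sketch.

```java
class PartitionIndex {
    /** Masks off the sign bit so the result is always in [0, numPartitions). */
    static int partitionFor(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int worstCase = Integer.MIN_VALUE;
        // Math.abs(Integer.MIN_VALUE) overflows and is still Integer.MIN_VALUE
        System.out.println(Math.abs(worstCase) % 3);    // -2, not a valid partition
        System.out.println(partitionFor(worstCase, 3)); // 0
        System.out.println(partitionFor("1125".hashCode(), 3));
    }
}
```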

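CarTopN.java, the driver. The trick here is to pass the parameter N to the Mapper and Reducer through the MR framework's configuration, so that the program lives up to its name. Files are handled the usual way: files that do not meet the conditions are filtered out up front to avoid surprises. When reading LZO-compressed data with CombineTextInputFormat, it is strongly recommended to specify the maximum split size explicitly.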

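import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CarTopN {
    public static final String TOPN = "TOPN";
    private static final String MARKER_SUFFIX = ".writed";

    public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path output = new Path(args[1]);
        int n = Integer.parseInt(args[2]);
        Configuration conf = new Configuration();
        // Define N: passed to the Mapper/Reducer through the job configuration
        conf.setInt(CarTopN.TOPN, n);
        Job job = Job.getInstance(conf, "CAR_Top10_BY_TGSID");
        job.setJarByClass(cn.com.zjf.MR_04.CarTopN.class);
        job.setInputFormatClass(CombineTextInputFormat.class);
        job.setMapperClass(CarTopNMapper.class);
        // not used
        // job.setCombinerClass(Top10Combine.class);
        job.setReducerClass(CarTopNReduce.class);
        // A single Reducer sees every key, which a global TopN needs
        job.setNumReduceTasks(1);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setPartitionerClass(ITGSParition.class);
        FileSystem fs = FileSystem.get(conf);
        // Preprocess the input: read only files that are completely written
        // (marked by a .writed file) and whose size is greater than 0
        FileStatus[] markers = fs.globStatus(input, new PathFilter() {
            @Override
            public boolean accept(Path path) {
                return path.toString().endsWith(MARKER_SUFFIX);
            }
        });
        if (markers != null) {
            for (FileStatus marker : markers) {
                String name = marker.getPath().toString();
                // Strip the suffix literally; replaceAll(".writed", "") would
                // treat "." as a regex wildcard and could alter other parts of the path
                Path data = new Path(name.substring(0, name.length() - MARKER_SUFFIX.length()));
                if (fs.exists(data) && fs.getFileStatus(data).getLen() > 0) {
                    FileInputFormat.addInputPath(job, data);
                }
            }
        }
        // Maximum split size: 64 MB
        CombineTextInputFormat.setMaxInputSplitSize(job, 67108864);

        // Forcibly clear the output directory
        if (fs.exists(output)) {
            fs.delete(output, true);
        }
        FileOutputFormat.setOutputPath(job, output);

        // Report job success or failure through the exit code
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}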
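When deriving a data file name from its .writed marker, note that `String.replaceAll` takes a regular expression, so the pattern ".writed" matches any character followed by "writed", anywhere in the path. The sketch below contrasts that with a literal suffix strip. The class name `MarkerNames` and the sample path are hypothetical.

```java
class MarkerNames {
    static final String SUFFIX = ".writed";

    /** Returns the data file path for a marker path by removing the literal suffix. */
    static String dataFileFor(String markerPath) {
        if (!markerPath.endsWith(SUFFIX)) {
            throw new IllegalArgumentException("not a marker file: " + markerPath);
        }
        return markerPath.substring(0, markerPath.length() - SUFFIX.length());
    }

    public static void main(String[] args) {
        // Hypothetical path whose directory name happens to contain "writed"
        String marker = "/data/awrited/part-0001.writed";
        System.out.println(marker.replaceAll(".writed", "")); // /data//part-0001
        System.out.println(dataFileFor(marker));              // /data/awrited/part-0001
    }
}
```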

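Further optimization: since the only purpose of this program is sorting over large-scale data, it is worth reducing the I/O in the Mapper output phase. As long as the sorted result is not affected, the Combiner feature is recommended: on the Mapper side, each Mapper's output first goes through the same filtering process as the Reduce phase. Note: whether this is suitable when a single Mapper task contains multiple keys needs careful thought, because filtering on per-Mapper partial counts can discard a key whose global total belongs in the top N. Use with caution!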
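The caution above can be illustrated without Hadoop. The sketch below simulates two Mappers with hypothetical checkpoint IDs A, B and C, and compares a combiner that only sums partial counts with one that also filters down to the top 1. All names here (`CombinerSketch`, `merge`, `keepTop1`, `top1`) are made up for illustration. Summation is safe because addition can be applied in any grouping; filtering on partial counts is not.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class CombinerSketch {
    /** Sums partial counts from several mappers into global totals. */
    static Map<String, Integer> merge(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    /** Simulates a filtering combiner with N = 1: keeps only the largest entry. */
    static Map<String, Integer> keepTop1(Map<String, Integer> partial) {
        Map<String, Integer> out = new HashMap<>();
        String best = top1(partial);
        if (best != null) {
            out.put(best, partial.get(best));
        }
        return out;
    }

    /** Returns the key with the largest count, or null for an empty map. */
    static String top1(Map<String, Integer> counts) {
        String best = null;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (best == null || e.getValue() > counts.get(best)) {
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> mapper1 = new HashMap<>();
        mapper1.put("A", 10);
        mapper1.put("B", 9);
        Map<String, Integer> mapper2 = new HashMap<>();
        mapper2.put("C", 10);
        mapper2.put("B", 9);

        // Sum-only combining: B reaches 9 + 9 = 18 and wins
        List<Map<String, Integer>> summed = new ArrayList<>();
        summed.add(mapper1);
        summed.add(mapper2);
        System.out.println(top1(merge(summed))); // B

        // Filtering per mapper: B is never the local maximum, so it is lost
        List<Map<String, Integer>> filtered = new ArrayList<>();
        filtered.add(keepTop1(mapper1));
        filtered.add(keepTop1(mapper2));
        System.out.println(merge(filtered).containsKey("B")); // false
    }
}
```

Under these assumptions, a combiner that only pre-sums the (checkpoint ID, 1) pairs still cuts Mapper output without changing the final ranking, while the TopN cut itself stays in the single Reducer.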