MapReduce Case Study: Data Deduplication

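Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.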

Tags: #Mapreduce #deduplication #hadoop demo

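This article walks through a data deduplication example implemented with MapReduce. Using sample records from two files, it shows how a MapReduce job removes duplicate records. The code includes the concrete Mapper and Reducer implementations.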

Source data:

Contents of a.txt:

2012-3-1 b
2012-3-2 a
2012-3-3 b
2012-3-4 d
2012-3-5 a
2012-3-6 c
2012-3-7 d
2012-3-3 c

Contents of b.txt:

2012-3-1 a
2012-3-2 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-7 c
2012-3-3 c

Output:

2012-3-1 a
2012-3-1 b
2012-3-2 a
2012-3-2 b
2012-3-3 b
2012-3-3 c
2012-3-4 d
2012-3-5 a
2012-3-6 b
2012-3-6 c
2012-3-7 c
2012-3-7 d
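Before looking at the Hadoop job, it may help to see the idea in plain Java. The sketch below is not Hadoop code: the class name DedupSketch and its methods are made up for illustration, and a TreeMap only loosely imitates how the framework sorts and groups map output by key. It assumes ASCII input, where String ordering and the byte ordering of Hadoop's Text agree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of dedup via map -> shuffle -> reduce.
public class DedupSketch {

    // Sample records copied from a.txt and b.txt.
    static List<String> sampleInput() {
        return Arrays.asList(
            // a.txt
            "2012-3-1 b", "2012-3-2 a", "2012-3-3 b", "2012-3-4 d",
            "2012-3-5 a", "2012-3-6 c", "2012-3-7 d", "2012-3-3 c",
            // b.txt
            "2012-3-1 a", "2012-3-2 b", "2012-3-3 c", "2012-3-4 d",
            "2012-3-5 a", "2012-3-6 b", "2012-3-7 c", "2012-3-3 c");
    }

    // "Map": emit each line as a key with an empty value.
    // "Shuffle": the TreeMap groups equal keys and keeps them sorted.
    static TreeMap<String, List<String>> mapAndShuffle(List<String> lines) {
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            grouped.computeIfAbsent(line, k -> new ArrayList<>()).add("");
        }
        return grouped;
    }

    // "Reduce": ignore the grouped values and emit each key exactly once.
    static List<String> reduce(TreeMap<String, List<String>> grouped) {
        return new ArrayList<>(grouped.keySet());
    }

    static List<String> dedup(List<String> lines) {
        return reduce(mapAndShuffle(lines));
    }

    public static void main(String[] args) {
        // Prints the 12 distinct records, in the same order as the output above.
        for (String record : dedup(sampleInput())) {
            System.out.println(record);
        }
    }
}
```

The whole trick is that the record itself is the key: duplicates collapse into one group during the shuffle, so the reduce step has nothing left to do except write the key.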

//===================================JAVA CODE=========================
package gq;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/**
 * Data deduplication.
 * @author tallqi
 */
public class Dereplication {

    public static class Map extends Mapper<Object, Text, Text, Text> {

        // Reused empty value: only the key carries information.
        private static final Text EMPTY = new Text("");

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // Debug trace; goes to the task's stdout log.
            System.out.println("Map:" + value);
            // Emit the whole input line as the key, so identical lines
            // end up in the same group after the shuffle.
            context.write(value, EMPTY);
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        private static final Text EMPTY = new Text("");

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            // Debug trace; goes to the task's stdout log.
            System.out.println("Reduce:" + key);
            // reduce() is called once per distinct key: write the key once
            // and ignore the values.
            context.write(key, EMPTY);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On Hadoop 2.x and later, Job.getInstance replaces the deprecated
        // Job(Configuration, String) constructor.
        Job job = Job.getInstance(conf, "Dereplication");
        job.setJarByClass(Dereplication.class);

        // Set the Map, Combine and Reduce classes.
        job.setMapperClass(Map.class);
        // job.setCombinerClass(Reduce.class);
        job.setReducerClass(Reduce.class);

        // Output key and value types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Input and output locations.
        FileInputFormat.addInputPath(job, new Path("hdfs://h0:9000/user/tallqi/in/input"));
        FileOutputFormat.setOutputPath(job, new Path("hdfs://h0:9000/user/tallqi/in/output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
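The driver leaves job.setCombinerClass(Reduce.class) commented out. Since Reduce writes each key once and ignores the values, running the same logic as a combiner should not change the final result; it only reduces how many records each map task sends through the shuffle. Here is a rough plain-Java sketch of that claim on the sample data. The class CombinerSketch and its method names are hypothetical, there is no Hadoop dependency, and it assumes one map task per input file. In a real job the framework decides whether and how often a combiner runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical, Hadoop-free illustration of why a dedup combiner is safe.
public class CombinerSketch {

    static final List<String> A_TXT = Arrays.asList(
        "2012-3-1 b", "2012-3-2 a", "2012-3-3 b", "2012-3-4 d",
        "2012-3-5 a", "2012-3-6 c", "2012-3-7 d", "2012-3-3 c");

    static final List<String> B_TXT = Arrays.asList(
        "2012-3-1 a", "2012-3-2 b", "2012-3-3 c", "2012-3-4 d",
        "2012-3-5 a", "2012-3-6 b", "2012-3-7 c", "2012-3-3 c");

    // Stand-in for the Reduce logic: keep each distinct key once, sorted.
    static List<String> distinct(List<String> keys) {
        return new ArrayList<>(new TreeSet<>(keys));
    }

    // No combiner: every map output record crosses the shuffle.
    static List<String> shuffledWithoutCombiner() {
        List<String> shuffled = new ArrayList<>(A_TXT);
        shuffled.addAll(B_TXT);
        return shuffled;
    }

    // With a combiner: each map task deduplicates its own output first.
    static List<String> shuffledWithCombiner() {
        List<String> shuffled = new ArrayList<>(distinct(A_TXT));
        shuffled.addAll(distinct(B_TXT));
        return shuffled;
    }

    public static void main(String[] args) {
        List<String> plain = shuffledWithoutCombiner();
        List<String> combined = shuffledWithCombiner();
        System.out.println("shuffled without combiner: " + plain.size());    // 16
        System.out.println("shuffled with combiner:    " + combined.size()); // 15
        // The reducer sees fewer records but produces the same result.
        System.out.println("same final output: "
            + distinct(plain).equals(distinct(combined)));                   // true
    }
}
```

On this tiny input the saving is a single record (the repeated "2012-3-3 c" in b.txt), but the same reasoning applies to inputs where each file contains many local duplicates.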
