MapReduce Custom Object Serialization
The data is as follows:
First, prepare the data file on the local file system. Here I am using CentOS 6.7 installed with the graphical desktop.
Open a terminal and, ideally, switch to the root user so that permission issues don't get in the way;
you can refer to my earlier blog post on getting started with Linux for the details.
For each user we need to accumulate the upstream traffic and the downstream traffic and compute the total.
For example, 13897230503 above has two records; these two records must be accumulated and totalled, giving:
13897230503,500,1600,2100
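Based on the pipe-delimited format the mapper splits on and the two flow records used in the reduce example below, the input lines for this number would look roughly like this (illustrative values; the layout phone|upFlow|downFlow is an assumption, not the actual dataset):

13897230503|400|1300
13897230503|100|300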
(2) Implementation approach
map
Each map call receives one line of the log: the key is the byte offset of the line and the value is the line's content.
On output, the phone number should be the key, and the value should be a single object that carries the upstream traffic, the downstream traffic and the total traffic together.
The phone number is a Text string, but this combined value cannot be represented with a basic data type, so we need to define our own bean class and make it serializable.
key: 13897230503
value: < upFlow:100, dFlow:300, sumFlow:400 >
reduce
Each reduce call receives a phone-number key and the collection of bean objects emitted for that phone number.
For example:
key:
13897230503
value:
< upFlow:400, dFlow:1300, sumFlow:1700 >,
< upFlow:100, dFlow:300, sumFlow:400 >
Iterate over the bean collection, accumulate each field, and build a new bean object, for example:
< upFlow:400+100, dFlow:1300+300, sumFlow:1700+400 >
Finally output:
key: 13897230503
value: < upFlow:500, dFlow:1600, sumFlow:2100 >
Create the project. The IDE I use here is Eclipse.
The first class is the encapsulating JavaBean, named FlowBean.
package com;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class FlowBean implements Writable {

    int upFlow;
    int downFlow;
    double toFlow;

    // No-arg constructor required by Hadoop when deserializing the bean.
    public FlowBean() {
        super();
    }

    public FlowBean(int upFlow, int downFlow, double toFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.toFlow = toFlow;
    }

    // Convenience constructor used by the mapper: the total is derived from the two parts.
    public FlowBean(int upFlow, int downFlow) {
        this(upFlow, downFlow, upFlow + downFlow);
    }

    public int getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(int upFlow) {
        this.upFlow = upFlow;
    }

    public int getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(int downFlow) {
        this.downFlow = downFlow;
    }

    public double getToFlow() {
        return toFlow;
    }

    public void setToFlow(double toFlow) {
        this.toFlow = toFlow;
    }

    // Deserialization: fields must be read back in the same order they were written.
    @Override
    public void readFields(DataInput in) throws IOException {
        upFlow = in.readInt();
        downFlow = in.readInt();
        toFlow = in.readDouble();
    }

    // Serialization.
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(upFlow);
        out.writeInt(downFlow);
        out.writeDouble(toFlow);
    }

    @Override
    public String toString() {
        return upFlow + ":" + downFlow + ":" + toFlow;
    }
}
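To see how Hadoop drives the write and readFields methods, here is a minimal round-trip sketch (a throwaway test class added purely for illustration; it is not part of the original project). The key point is that readFields must read the fields back in exactly the order write wrote them.

package com;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical helper class, only used to demonstrate the serialization round trip.
public class FlowBeanRoundTrip {
    public static void main(String[] args) throws IOException {
        // Serialize a bean the same way the MapReduce framework would.
        FlowBean original = new FlowBean(100, 300, 400);
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        original.write(new DataOutputStream(bytes));

        // Deserialize into a fresh instance; this is why the no-arg constructor is required.
        FlowBean copy = new FlowBean();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));

        System.out.println(copy); // prints 100:300:400.0
    }
}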
The second class controls which partition each record is sent to; it takes effect on the Map side. The class name is MyPartitioner.
package com;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, FlowBean> {

    @Override
    public int getPartition(Text key, FlowBean value, int partitionNum) {
        // Take the first three digits of the key (the phone prefix) for comparison.
        String phoneAre = key.toString().substring(0, 3);
        if (phoneAre.equals("137")) {
            // First case: send the record to the first partition.
            return 0;
        }
        if (phoneAre.equals("133")) {
            return 1;
        }
        if (phoneAre.equals("138")) {
            return 2;
        }
        if (phoneAre.equals("135")) {
            return 3;
        }
        // Every other prefix goes to the last partition.
        return 4;
    }
}
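For comparison, if no custom partitioner is configured, Hadoop falls back to HashPartitioner, which spreads keys by hash code instead of by phone prefix; its getPartition is essentially the following one-liner:

// Default behaviour of org.apache.hadoop.mapreduce.lib.partition.HashPartitioner
public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}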
The third class implements the actual business logic; it contains the program entry point as well as the Map and Reduce implementations.
The class name is FlowWritable.
package com;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class FlowWritable {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        FileSystem fs = FileSystem.get(conf);
        Path inputpath = new Path(args[0]);
        Path outputpath = new Path(args[1]);
        // Remove the output directory if it already exists so the job can be rerun.
        if (fs.exists(outputpath)) {
            fs.delete(outputpath, true);
        }

        job.setJarByClass(FlowWritable.class);
        job.setJobName("Flow");
        job.setMapperClass(Map.class);
        job.setReducerClass(Red.class);

        FileInputFormat.setInputPaths(job, inputpath);
        FileOutputFormat.setOutputPath(job, outputpath);

        // The mapper emits <Text, FlowBean>, the reducer emits <Text, Text>.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Use the custom partitioner; it returns five partition ids, so run five reducers.
        job.setPartitionerClass(MyPartitioner.class);
        job.setNumReduceTasks(5);

        job.waitForCompletion(true);
    }

    public static class Map extends Mapper<LongWritable, Text, Text, FlowBean> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line has the form phone|upFlow|downFlow.
            String[] line = value.toString().split("\\|");
            FlowBean fl = new FlowBean(Integer.parseInt(line[1]), Integer.parseInt(line[2]));
            context.write(new Text(line[0]), fl);
        }
    }

    public static class Red extends Reducer<Text, FlowBean, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<FlowBean> value, Context context)
                throws IOException, InterruptedException {
            int upSum = 0;
            int downSum = 0;
            int toSum = 0;
            // Accumulate the upstream and downstream traffic of every record for this phone number.
            for (FlowBean fl : value) {
                upSum += fl.getUpFlow();
                downSum += fl.getDownFlow();
            }
            toSum = upSum + downSum;
            context.write(key, new Text(upSum + ":" + downSum + ":" + toSum));
        }
    }
}
Following the steps above implements basic MapReduce analysis, grouping, and partitioning. Important: the partition ids returned by the Map-side Partitioner must line up with the number of reduce tasks, which is set in the main method with job.setNumReduceTasks(n); MyPartitioner returns ids 0 through 4, so this job needs five reducers. Spreading the work across several reducers in this way also speeds up the computation.
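Since MyPartitioner returns partition ids 0 through 4 and the job runs five reducers, the output directory will contain five files following Hadoop's standard naming, one per partition:

part-r-00000  numbers starting with 137
part-r-00001  numbers starting with 133
part-r-00002  numbers starting with 138
part-r-00003  numbers starting with 135
part-r-00004  every other prefix

To try it out, export the three classes as a jar from Eclipse and submit it with something like hadoop jar flow.jar com.FlowWritable <input path> <output path> (the jar name here is just a placeholder). With the sample records above, part-r-00002 would contain the line 13897230503 followed by a tab and 500:1600:2100, since TextOutputFormat separates the key and value with a tab.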