In MapReduce programming there are quite a few built-in value types that implement the Comparable and Writable interfaces, such as
Text, IntWritable, LongWritable, and so on.
This time I want to define a class of my own and use it as the map output value type.
Designing the custom class MyData
Per the requirements, this class should have four fields, recording the phone number, the upstream traffic, the downstream traffic, and the total traffic.
The class needs to implement the Writable interface, which means implementing two methods:
- write, which serializes a MyData instance into a binary stream;
- readFields, which reads a MyData instance back out of a binary stream.
One thing to watch out for: write and readFields must write and read the fields in exactly the same order and with the same types.
String fields are stored in UTF format, so they are written with writeUTF(String) and read back with readUTF() (which takes no argument). long fields are handled with writeLong(long) and readLong().
The two methods take a DataOutput and a DataInput parameter, respectively.
One more thing to note is toString! Whatever toString produces is exactly what gets written to the result file!
package data;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;
public class MyData implements Writable{
private String id;
private long upPayload;
private long downPayload;
private long totalPayload;
// constructors
public MyData(){}
public MyData(String id, long upPayload, long downPayload) {
this.id = id;
this.upPayload = upPayload;
this.downPayload = downPayload;
this.totalPayload = upPayload + downPayload;
}
// deserialize: read fields in the same order they were written
@Override
public void readFields(DataInput in) throws IOException {
this.id = in.readUTF();
this.upPayload = in.readLong();
this.downPayload = in.readLong();
this.totalPayload = in.readLong();
}
// serialize: write fields in a fixed order
@Override
public void write(DataOutput out) throws IOException {
out.writeUTF(id);
out.writeLong(upPayload);
out.writeLong(downPayload);
out.writeLong(totalPayload);
}
@Override
public String toString() {
return "[upPayload=" + upPayload
+ ", downPayload=" + downPayload + ", totalPayload="
+ totalPayload + "]";
}
... // getters and setters for the four fields omitted
}
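Since java.io's DataOutputStream and DataInputStream implement the same DataOutput/DataInput interfaces, the write/readFields order contract can be sanity-checked outside Hadoop. A minimal standalone sketch (the values are hypothetical, mirroring MyData's field order):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RoundTripDemo {
    public static void main(String[] args) throws IOException {
        // serialize in the same order as MyData.write: UTF id, then three longs
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF("13726230503");
        out.writeLong(2481L);
        out.writeLong(24681L);
        out.writeLong(2481L + 24681L);

        // deserialize in the same order as MyData.readFields
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        String id = in.readUTF();
        long up = in.readLong();
        long down = in.readLong();
        long total = in.readLong();

        System.out.println(id + " " + up + " " + down + " " + total);
    }
}
```

If the read order or types ever drift from the write order, the values come back scrambled or readUTF throws, which is exactly the mistake the warning above is about.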
Designing the MapReduce job
Designing the Mapper
Read a line from the file, split it on "\t", extract the phone number id and the two traffic fields, and create a MyData to hold them.
Since I wanted to try out the custom type, the map output is designed as <Text, MyData>:
public static class HadoopTest1Mapper extends Mapper<LongWritable, Text, Text, MyData>{
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{
String strLine = value.toString();
String[] tokens = strLine.split("\t");
Text newKey = new Text(tokens[1]);
MyData newValue = new MyData("", Long.parseLong(tokens[8]), Long.parseLong(tokens[9])); // id left empty: the phone number already serves as the key
context.write(newKey, newValue);
}
}
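Note that some lines of the dataset omit the URL/category columns, so their field indices shift; the map code above assumes the full layout where tokens[8] and tokens[9] are the byte counts. A standalone sketch of the tokenization, using one full-layout line from the test data:

```java
public class SplitDemo {
    public static void main(String[] args) {
        // one tab-separated line of the test data, in the full 11-field layout
        String line = "1363157993044\t18211575961\t94-71-AC-CD-E6-18:CMCC-EASY\t"
                + "120.196.100.99\tiface.qiyi.com\t视频网站\t15\t12\t1527\t2106\t200";
        String[] tokens = line.split("\t");
        String phone = tokens[1];              // phone number -> map output key
        long up = Long.parseLong(tokens[8]);   // upstream bytes
        long down = Long.parseLong(tokens[9]); // downstream bytes
        System.out.println(phone + " " + up + " " + down);
    }
}
```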
Designing the Reducer
Just add up the traffic of all the values that share a key!
Note that the values parameter is an Iterable<MyData>.
Iterable is only an interface; the framework passes a concrete implementation in at runtime, so iterating over it here is ordinary Java polymorphism, not reflection.
And at the end, context.write can take a MyData directly as the value!
public static class HadoopTest1Reducer extends Reducer<Text, MyData, Text, MyData>{
@Override
public void reduce(Text key, Iterable<MyData> values, Context context) throws IOException, InterruptedException{
long sumUp = 0;
long sumDown = 0;
for(MyData value : values){
sumUp += value.getUpPayload();
sumDown += value.getDownPayload();
}
MyData newValue = new MyData("", sumUp, sumDown);
context.write(key, newValue);
}
}
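The summing logic can be illustrated in plain Java, with a list of up/down pairs standing in for the Iterable<MyData> the framework supplies (the two pairs below are the two records for 13560439658 in the test data):

```java
import java.util.Arrays;
import java.util.List;

public class ReduceDemo {
    public static void main(String[] args) {
        // up/down byte counts for one key, standing in for Iterable<MyData>
        List<long[]> values = Arrays.asList(
                new long[]{1116, 954},   // first record for 13560439658
                new long[]{918, 4938});  // second record for 13560439658
        long sumUp = 0, sumDown = 0;
        for (long[] v : values) {
            sumUp += v[0];
            sumDown += v[1];
        }
        System.out.println(sumUp + " " + sumDown + " " + (sumUp + sumDown));
    }
}
```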
The main function
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
String inputStr = "hdfs://127.0.0.1:9000/user/HTTP.dat";
String outputStr = "hdfs://127.0.0.1:9000/user/HTTPresult";
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "Test1");
job.setJarByClass(HadoopTest1.class);
job.setNumReduceTasks(4);
job.setMapperClass(HadoopTest1Mapper.class);
job.setReducerClass(HadoopTest1Reducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(MyData.class);
FileInputFormat.addInputPath(job, new Path(inputStr));
FileOutputFormat.setOutputPath(job, new Path(outputStr));
job.waitForCompletion(true);
}
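Since TextOutputFormat writes each record as the key, a tab, then value.toString(), a line of the part-r-xxxxx result file should look like this (a sketch with hypothetical summed values):

```java
public class OutputLineDemo {
    public static void main(String[] args) {
        // TextOutputFormat emits: key + '\t' + value.toString()
        String key = "13560439658";   // a phone-number key from the sample data
        long up = 2034, down = 5892;  // hypothetical summed payloads
        String value = "[upPayload=" + up + ", downPayload=" + down
                + ", totalPayload=" + (up + down) + "]";
        System.out.println(key + "\t" + value);
    }
}
```

This is why the toString note above matters: change toString and every result line changes with it.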
Running the experiment and results
I ran this from a Maven project in Eclipse; for details on running a Hadoop project with Maven, see http://blog.youkuaiyun.com/jianjian1992/article/details/46957811
The run results are as follows.
Test data
1363157985066 13726230503 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157995052 13826544101 5C-0E-8B-C7-F1-E0:CMCC 120.197.40.4 4 0 264 0 200
1363157991076 13926435656 20-10-7A-28-CC-0A:CMCC 120.196.100.99 2 4 132 1512 200
1363154400022 13926251106 5C-0E-8B-8B-B1-50:CMCC 120.197.40.4 4 0 240 0 200
1363157993044 18211575961 94-71-AC-CD-E6-18:CMCC-EASY 120.196.100.99 iface.qiyi.com 视频网站 15 12 1527 2106 200
1363157995074 84138413 5C-0E-8B-8C-E8-20:7DaysInn 120.197.40.4 122.72.52.12 20 16 4116 1432 200
1363157993055 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200
1363157995033 15920133257 5C-0E-8B-C7-BA-20:CMCC 120.197.40.4 sug.so.360.cn 信息安全 20 20 3156 2936 200
1363157983019 13719199419 68-A1-B7-03-07-B1:CMCC-EASY 120.196.100.82 4 0 240 0 200
1363157984041 13660577991 5C-0E-8B-92-5C-20:CMCC-EASY 120.197.40.4 s19.cnzz.com 站点统计 24 9 6960 690 200
1363157973098 15013685858 5C-0E-8B-C7-F7-90:CMCC 120.197.40.4 rank.ie.sogou.com 搜索引擎 28 27 3659 3538 200
1363157986029 15989002119 E8-99-C4-4E-93-E0:CMCC-EASY 120.196.100.99 www.umeng.com 站点统计 3 3 1938 180 200
1363157992093 13560439658 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 15 9 918 4938 200
1363157986041 13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4 3 3 180 180 200
1363157984040 13602846565 5C-0E-8B-8B-B6-00:CMCC 120.197.40.4 2052.flash2-http.qq.com 综合门户 15 12 1938 2910 200
1363157995093 13922314466 00-FD-07-A2-EC-BA:CMCC 120.196.100.82 img.qfc.cn 12 12 3008 3720 200
1363157982040 13502468823 5C-0A-5B-6A-0B-D4:CMCC-EASY 120.196.100.99 y0.ifengimg.com 综合门户 57 102 7335 110349 200
1363157986072 18320173382 84-25-DB-4F-10-1A:CMCC-EASY 120.196.100.99 input.shouji.sogou.com 搜索引擎 21 18 9531 2412 200
1363157990043 13925057413 00-1F-64-E1-E6-9A:CMCC 120.196.100.55 t3.baidu.com 搜索引擎 69 63 11058 48243 200
1363157988072 13760778710 00-FD-07-A4-7B-08:CMCC 120.196.100.82 2 2 120 120 200
1363157985066 13726238888 00-FD-07-A4-72-B8:CMCC 120.196.100.82 i02.c.aliimg.com 24 27 2481 24681 200
1363157993055 13560436666 C4-17-FE-BA-DE-D9:CMCC 120.196.100.99 18 15 1116 954 200