Traffic Statistics in Practice
Let's first review the Hadoop shuffle process.

1. During a map task, output records are written to an in-memory buffer. Before each spill, the data is sorted twice: first by the partition it belongs to, then within each partition by key.
2. Next comes the combine step (if a combiner is configured). A combiner is essentially a reducer applied to the map output before it reaches disk, with the goal of shrinking the amount of data that gets spilled (see the sketch after this list).
3. When the buffer reaches its threshold, a spill happens and produces several files on disk; these spill files are then merge-sorted into one. That is the end of the map-side shuffle.
Map task: partition -> sort -> combine -> spill (to disk) -> merge sort
A reduce task fetches the data for its own partition from every map task, merge-sorts it again, runs the reduce function, and writes the final result to HDFS.
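A minimal sketch of what a combiner looks like, assuming a hypothetical sum-style job (the class name SumCombiner and the Text/LongWritable types are illustrative, not part of this post; the sorting job below does not use a combiner because its map output value is NullWritable):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// A combiner is just a Reducer that runs on the map side before the spill.
public class SumCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {
    private final LongWritable out = new LongWritable();

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();          // pre-aggregate locally on the map side
        }
        out.set(sum);
        context.write(key, out);     // less data is spilled to disk and shuffled
    }
}
// Registered on the job with: job.setCombinerClass(SumCombiner.class);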

Note:
If the number of reduce tasks is set to 1, then any hash value mod 1 is 0, so there is effectively a single partition, partition 0. If you set two reduce tasks, there are at most two partitions, numbered 0 and 1.
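Concretely, this is the behaviour of Hadoop's default HashPartitioner; a minimal sketch that mirrors its logic (the class name HashLikePartitioner is just illustrative):

import org.apache.hadoop.mapreduce.Partitioner;

// Mirrors the default HashPartitioner:
// partition number = (non-negative hash of the key) mod numReduceTasks.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        // & Integer.MAX_VALUE clears the sign bit so the result is never negative;
        // with numReduceTasks = 1 this always returns 0.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}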
A custom Hadoop bean that is both comparable and serializable (WritableComparable):
import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class FlowBean implements WritableComparable<FlowBean> {

    private long upFlow;
    private long downFlow;
    private long sumFlow;
    private String phoneNumber;

    public FlowBean() {
    }

    public FlowBean(long downFlow, long upFlow, String phoneNumber) {
        this.downFlow = downFlow;
        this.upFlow = upFlow;
        this.phoneNumber = phoneNumber;
        this.sumFlow = upFlow + downFlow;
    }

    /**
     * Returning -1 puts this bean before the argument, so beans with a larger
     * total flow sort first (descending order by sumFlow). Note that 0 is never
     * returned, so no two beans ever compare as equal; with the default grouping
     * comparator each record therefore forms its own reduce group.
     * @param o the bean to compare against
     */
    @Override
    public int compareTo(FlowBean o) {
        return this.sumFlow > o.sumFlow ? -1 : 1;
    }

    /**
     * Serialization: from memory out to disk (or the network).
     * @param dataOutput
     * @throws IOException
     */
    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(phoneNumber);
        dataOutput.writeLong(upFlow);
        dataOutput.writeLong(downFlow);
        dataOutput.writeLong(sumFlow);
    }

    /**
     * Deserialization: from disk (or the network) back into memory,
     * reading the fields in the same order they were written.
     * @param dataInput
     * @throws IOException
     */
    @Override
    public void readFields(DataInput dataInput) throws IOException {
        phoneNumber = dataInput.readUTF();
        upFlow = dataInput.readLong();
        downFlow = dataInput.readLong();
        sumFlow = dataInput.readLong();
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    public String getPhoneNumber() {
        return phoneNumber;
    }

    public void setPhoneNumber(String phoneNumber) {
        this.phoneNumber = phoneNumber;
    }

    public void set(String phoneNumber, long upFlow, long downFlow) {
        this.phoneNumber = phoneNumber;
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    @Override
    public String toString() {
        return "\t" + upFlow + "\t" + downFlow + "\t" + sumFlow;
    }
}
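As a quick sanity check (not part of the original post; the class name FlowBeanRoundTrip and the sample number are made up), the write/readFields pair can be exercised locally with plain Java streams:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FlowBeanRoundTrip {
    public static void main(String[] args) throws IOException {
        FlowBean in = new FlowBean(200L, 100L, "13700000000"); // downFlow, upFlow, phone
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        in.write(new DataOutputStream(bytes));                 // serialize: memory -> bytes

        FlowBean out = new FlowBean();
        out.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
        // prints 13700000000 followed by 100, 200, 300 (tab-separated)
        System.out.println(out.getPhoneNumber() + out);
    }
}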
The runner code is as follows:
import org.apache.commons.io.FileUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.File;
import java.io.IOException;

public class FlowSort {

    public static class FlowSortMapper extends Mapper<Object, Text, FlowBean, NullWritable> {
        private FlowBean outKey = new FlowBean();

        @Override
        protected void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String regex = "\\s+";
            String[] split = line.split(regex);
            String phoneNumber = "";
            long upFlow = 0;
            long downFlow = 0;
            try {
                phoneNumber = split[1];
                upFlow = Long.parseLong(split[8]);
                downFlow = Long.parseLong(split[9]);
            } catch (Exception e) {
                // Count malformed lines instead of failing the whole job.
                context.getCounter("FlowSort", "splitException").increment(1);
                return;
            }
            outKey.set(phoneNumber, upFlow, downFlow);
            context.write(outKey, NullWritable.get());
        }
    }

    public static class FlowSortReducer extends Reducer<FlowBean, NullWritable, Text, FlowBean> {
        private Text outKey = new Text();

        // One reduce task handles one partition, and a partition may contain many keys
        // with different hash codes; each call to reduce() handles a single key (group).
        @Override
        protected void reduce(FlowBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            String phoneNumber = key.getPhoneNumber();
            outKey.set(phoneNumber);
            context.write(outKey, key);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        // Delete the previous local output directory so the job does not fail on an existing path.
        FileUtils.deleteDirectory(new File("E:\\IdeaProjects\\hadoopstudy\\data\\flowdata\\results"));
        Job job = Job.getInstance(configuration, "flowSort");
        job.setJarByClass(FlowSort.class);
        job.setReducerClass(FlowSort.FlowSortReducer.class);
        job.setMapOutputKeyClass(FlowBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);
        job.setNumReduceTasks(1);
        MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, FlowSort.FlowSortMapper.class);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Remember: the shuffle between map and reduce sorts records by key, which is exactly why FlowBean (with its descending compareTo) is used as the map output key here; the final output ends up ordered by total flow, largest first.
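A tiny local illustration of that ordering (assumed, not from the original post; the class name CompareToDemo and the sample numbers are made up). Sorting FlowBean instances with their compareTo gives the same largest-total-first order the shuffle produces:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class CompareToDemo {
    public static void main(String[] args) {
        List<FlowBean> beans = new ArrayList<>();
        beans.add(new FlowBean(50L, 10L, "13800000001"));   // sumFlow = 60
        beans.add(new FlowBean(500L, 100L, "13800000002")); // sumFlow = 600
        Collections.sort(beans);                            // uses compareTo: descending by sumFlow
        for (FlowBean b : beans) {
            System.out.println(b.getPhoneNumber() + b);     // 13800000002 is printed first
        }
    }
}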
All right, see you next time.