Requirement: given the data below, find the three transactions with the highest total amount within each order id (fields: order id, user id, product name, unit price, quantity), i.e. a grouped TOP-N.
order001,u001,小米6,1999.9,2
order001,u001,雀巢咖啡,99.0,2
order001,u001,安慕希,250.0,2
order001,u001,经典红双喜,200.0,4
order001,u001,防水电脑包,400.0,2
order002,u002,小米手环,199.0,3
order002,u002,榴莲,15.0,10
order002,u002,苹果,4.5,20
order002,u002,肥皂,10.0,40
Implementation approach:
map: read each line, split it on commas, and pack the fields into an OrderBean object that is emitted as the key; the key's comparison logic must rank records by transaction amount.
reduce: use a custom GroupingComparator so that records are grouped by order id, then have the reduce method output only the first N records of each group.
(1) First, all records with the same orderId must be sent to the same reduce task; if one order's records were scattered across reduce tasks, each task could only sort its own slice and the result would be meaningless. To achieve this we extend Partitioner with a custom OrderIdPartitioner class. (Before customizing, partitioning uses the hash of the whole key; here the key is an OrderBean, and since hashCode is not overridden, distinct instances will almost always hash differently.) What we need instead is to partition on the OrderBean's orderId alone.
The OrderIdPartitioner class:
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

public class OrderIdPartitioner extends Partitioner<OrderBean, NullWritable> {
    @Override
    public int getPartition(OrderBean key, NullWritable value, int numPartitions) {
        // Partition by the orderId inside the bean so that all records of one
        // order reach the same reduce task. The bitmask clears the sign bit,
        // guarding against a negative hashCode producing a negative partition.
        return (key.getOrderId().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
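As a quick sanity check outside the framework (a minimal sketch; the class name PartitionCheck and the choice of 3 partitions are illustrative only), two beans that share an orderId always land in the same partition:

import org.apache.hadoop.io.NullWritable;

public class PartitionCheck {
    public static void main(String[] args) {
        OrderIdPartitioner p = new OrderIdPartitioner();
        OrderBean a = new OrderBean();
        a.set("order001", "u001", "小米6", 1999.9f, 2);
        OrderBean b = new OrderBean();
        b.set("order001", "u001", "雀巢咖啡", 99.0f, 2);
        // Same orderId -> same partition, whatever the other fields are.
        System.out.println(p.getPartition(a, NullWritable.get(), 3)
                == p.getPartition(b, NullWritable.get(), 3)); // true
    }
}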
(2) Once records with the same orderId have been routed to the same task, the next step is OrderBean's compareTo method, which tells the framework to sort by orderId first and by total transaction amount second (i.e. sort on orderId; when orderIds are equal, sort by amount in descending order). This yields, within each order, its line items ranked by amount.
The OrderBean class:
Note: four methods must be overridden: write (serialization), readFields (deserialization), compareTo (the sort rule), and toString (the output format).
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class OrderBean implements WritableComparable<OrderBean> {
    private String orderId;
    private String userId;
    private String pdtName;
    private float price;
    private int number;
    private float amountFee;

    public void set(String orderId, String userId, String pdtName, float price, int number) {
        this.orderId = orderId;
        this.userId = userId;
        this.pdtName = pdtName;
        this.price = price;
        this.number = number;
        this.amountFee = price * number;
    }

    public String getOrderId() { return orderId; }
    public void setOrderId(String orderId) { this.orderId = orderId; }
    public String getUserId() { return userId; }
    public void setUserId(String userId) { this.userId = userId; }
    public String getPdtName() { return pdtName; }
    public void setPdtName(String pdtName) { this.pdtName = pdtName; }
    public float getPrice() { return price; }
    public void setPrice(float price) { this.price = price; }
    public int getNumber() { return number; }
    public void setNumber(int number) { this.number = number; }
    public float getAmountFee() { return amountFee; }
    public void setAmountFee(float amountFee) { this.amountFee = amountFee; }

    // Deserialization: read fields in the exact order they were written;
    // amountFee is derived, so it is recomputed instead of transmitted.
    @Override
    public void readFields(DataInput in) throws IOException {
        this.orderId = in.readUTF();
        this.userId = in.readUTF();
        this.pdtName = in.readUTF();
        this.price = in.readFloat();
        this.number = in.readInt();
        this.amountFee = this.price * this.number;
    }

    @Override
    public String toString() {
        return this.orderId + "," + this.userId + "," + this.pdtName + ","
                + this.price + "," + this.number + "," + this.amountFee;
    }

    // Serialization: amountFee is deliberately omitted (see readFields).
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.orderId);
        out.writeUTF(this.userId);
        out.writeUTF(this.pdtName);
        out.writeFloat(this.price);
        out.writeInt(this.number);
    }

    // Sort rule: by orderId first; within the same orderId, by amount descending.
    @Override
    public int compareTo(OrderBean o) {
        int byId = this.orderId.compareTo(o.getOrderId());
        return byId == 0 ? Float.compare(o.getAmountFee(), this.getAmountFee()) : byId;
    }
}
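To see the sort rule in isolation before wiring it into the job, a minimal sketch (the class name SortRuleCheck is illustrative; the sample values come from the input above):

public class SortRuleCheck {
    public static void main(String[] args) {
        OrderBean a = new OrderBean();
        a.set("order001", "u001", "小米6", 1999.9f, 2);   // amount 3999.8
        OrderBean b = new OrderBean();
        b.set("order001", "u001", "雀巢咖啡", 99.0f, 2);  // amount 198.0
        OrderBean c = new OrderBean();
        c.set("order002", "u002", "苹果", 4.5f, 20);      // amount 90.0

        System.out.println(a.compareTo(b) < 0); // true: same order, larger amount sorts first
        System.out.println(b.compareTo(c) < 0); // true: "order001" precedes "order002"
    }
}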
(3) Finally, the reduce task must be made to treat keys with the same orderId as one group. Left unmodified, group boundaries are determined by comparing whole OrderBean keys, and since no two of them compare equal (the amounts differ), every record would form its own group and the desired grouped output could not be produced. This is what the GroupingComparator controls.
The OrderIdGroupingComparator class:
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class OrderIdGroupingComparator extends WritableComparator {
    public OrderIdGroupingComparator() {
        // The second argument asks the parent class to instantiate OrderBean
        // objects for us, so that compare() receives deserialized beans.
        super(OrderBean.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        // Two keys belong to the same reduce group iff their orderIds match.
        OrderBean o1 = (OrderBean) a;
        OrderBean o2 = (OrderBean) b;
        return o1.getOrderId().compareTo(o2.getOrderId());
    }
}
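The contrast with compareTo is worth spelling out. In this minimal sketch (the class name GroupRuleCheck is illustrative; values again come from the sample data), two beans from the same order with different amounts are unequal to the sort comparator yet equal to the grouping comparator, which is exactly what lets a single reduce() call see all of an order's records:

public class GroupRuleCheck {
    public static void main(String[] args) {
        OrderBean a = new OrderBean();
        a.set("order002", "u002", "榴莲", 15.0f, 10);  // amount 150.0
        OrderBean b = new OrderBean();
        b.set("order002", "u002", "肥皂", 10.0f, 40);  // amount 400.0

        // Sort comparator: not equal, because the amounts differ.
        System.out.println(a.compareTo(b) != 0);                                 // true
        // Grouping comparator: equal, because the orderIds match.
        System.out.println(new OrderIdGroupingComparator().compare(a, b) == 0);  // true
    }
}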
At this point most of the work is done; what remains is the OrderTopnMapper class, the OrderTopnReducer class, and a client for submitting the MapReduce job (a standalone JobSubmitter class, or simply a main method as below).
The OrderTopn class:
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderTopn {

    public static class OrderTopnMapper extends Mapper<LongWritable, Text, OrderBean, NullWritable> {
        // Reusing one bean per mapper is safe: context.write() serializes it immediately.
        OrderBean orderBean = new OrderBean();
        NullWritable n = NullWritable.get();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] split = value.toString().split(",");
            orderBean.set(split[0], split[1], split[2],
                    Float.parseFloat(split[3]), Integer.parseInt(split[4]));
            context.write(orderBean, n);
        }
    }

    public static class OrderTopnReducer extends Reducer<OrderBean, NullWritable, OrderBean, NullWritable> {
        // Although reduce() receives a single key parameter, the framework reuses
        // that object: each step of the values iterator deserializes the next
        // record's fields into it. Writing the key inside the loop therefore
        // emits a different record each time.
        @Override
        protected void reduce(OrderBean key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            int i = 0;
            for (NullWritable value : values) {
                context.write(key, value);
                i++;
                if (i == 3) return; // keep only the top 3 of each group
            }
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        job.setJarByClass(OrderTopn.class);

        job.setMapperClass(OrderTopnMapper.class);
        job.setReducerClass(OrderTopnReducer.class);
        job.setPartitionerClass(OrderIdPartitioner.class);
        job.setGroupingComparatorClass(OrderIdGroupingComparator.class);

        job.setMapOutputKeyClass(OrderBean.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.setInputPaths(job, new Path("E:\\hadoopdatas\\order_topn\\input"));
        FileOutputFormat.setOutputPath(job, new Path("E:\\hadoopdatas\\order_topn\\output"));

        job.setNumReduceTasks(3);
        job.waitForCompletion(true);
    }
}
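For reference, running the job over the sample input should produce output along these lines (fields follow OrderBean.toString(); records tied on amount, such as the two 800.0 items in order001, may appear in either order, and with three reduce tasks the two orders land in separate part files):

order001,u001,小米6,1999.9,2,3999.8
order001,u001,防水电脑包,400.0,2,800.0
order001,u001,经典红双喜,200.0,4,800.0
order002,u002,小米手环,199.0,3,597.0
order002,u002,肥皂,10.0,40,400.0
order002,u002,榴莲,15.0,10,150.0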
With that, all the pieces are in place. The program can be run and debugged locally in Eclipse, or packaged into a jar and submitted to the YARN cluster on Linux via hadoop jar top-n.jar xx.oo.OrderTopn.
Summary: this task exercises three points of control: sort control (override compareTo), partition control (a custom Partitioner), and group control (a custom GroupingComparator). With these three in place, grouped top-N becomes efficient: the map and reduce methods each do only their own small part, while sorting and grouping ride on the shuffle the framework performs anyway, rather than consuming extra application-side resources.