HDPCD-Java Review Notes (5)

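This article introduces the basic principles and implementation of partitioning and sorting in MapReduce, including how the default HashPartitioner works, how to write a custom partitioner, and how to achieve total-order partitioning with TotalOrderPartitioner. It also explains the key tasks of the sort phase, namely sorting keys in their natural order and grouping equal keys, and describes how to implement a secondary sort.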


Partition and Sorting


Partitioners


All values with the same key must be sent to the same Reducer.

If no Partitioner is configured, MapReduce jobs use the HashPartitioner by default. The HashPartitioner calls the key's hashCode method (inherited from Object unless the key class overrides it) and applies the modulus operator to determine how the records are partitioned.


A group of records from the intermediate key space is assigned to each reduce node. These groups of records are called partitions, and a partition represents the input of a Reducer.

How a record gets assigned to a Reducer:

1. The Mapper outputs <key, value> pairs (records). Once the map task is complete, the partitioning of records can begin.

2. The Partitioner is an object that defines a getPartition method. Each <key, value> pair is passed into the getPartition method, along with the number of Reducers.

3. The getPartition method returns an int that determines which Reducer the <key, value> pair is sent to.


The Default Partitioner

public class HashPartitioner<K, V> extends Partitioner<K, V> {
    /** Use {@link Object#hashCode()} to partition. */
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
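
To see why the default partitioner masks the hash with Integer.MAX_VALUE before taking the modulus, the sketch below repeats the same arithmetic in plain Java with no Hadoop dependency. The class and method names are hypothetical and exist only for this illustration.

```java
// Plain-Java sketch of the HashPartitioner arithmetic (hypothetical class, no Hadoop needed).
final class HashPartitionDemo {

    // Same formula as HashPartitioner.getPartition: clear the sign bit, then take the modulus.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Naive variant without the mask; a negative hash gives a negative result.
    static int naivePartitionFor(Object key, int numReduceTasks) {
        return key.hashCode() % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 3;
        // Integer.hashCode() is the int value itself, so this key has a negative hash.
        Integer negativeKey = -7;

        System.out.println("naive:  " + naivePartitionFor(negativeKey, reducers));
        System.out.println("masked: " + partitionFor(negativeKey, reducers));

        // Equal keys always map to the same partition.
        System.out.println(partitionFor("hadoop", reducers) == partitionFor("hadoop", reducers));
    }
}
```

The masked variant always returns a value between 0 and numReduceTasks - 1, which is what a valid partition number has to be.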

Writing a Custom Partitioner

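import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // With a single Reducer, every record goes to partition 0.
        if (numReduceTasks == 1) {
            return 0;
        }

        // Mask the sign bit so a negative product cannot yield a negative partition number.
        return ((key.toString().length() * value.get()) & Integer.MAX_VALUE) % numReduceTasks;
    }
}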

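The TotalOrderPartitioner assigns keys to partitions by range, so that the combined output of all Reducers is in total order. It accomplishes this by using an external partition file that defines how the keys are split across partitions.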

Using it involves two main steps:

1. Create a partition file.

2. Share the partition file amongst all Mappers.
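
The role of the partition file can be sketched in plain Java. Assuming the file holds a sorted list of split points (one fewer than the number of Reducers), a key's partition is found by a binary search over those points. The class name and the split points "g" and "p" are hypothetical; this illustrates the idea and is not the Hadoop implementation.

```java
import java.util.Arrays;

// Plain-Java sketch of a range lookup over sorted split points (hypothetical class).
final class TotalOrderLookupDemo {

    // splitPoints must be sorted; N reducers need N - 1 split points.
    static int partitionFor(String key, String[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key) + 1;
        // When the key is not found, binarySearch returns -(insertionPoint) - 1,
        // so pos equals -(insertionPoint) and negating it gives the partition.
        return (pos < 0) ? -pos : pos;
    }

    public static void main(String[] args) {
        String[] splitPoints = {"g", "p"}; // assumed boundaries for 3 reducers
        for (String key : new String[] {"apple", "grape", "zebra"}) {
            System.out.println(key + " -> partition " + partitionFor(key, splitPoints));
        }
    }
}
```

Because every key in partition 0 sorts before every key in partition 1, and so on, concatenating the Reducer outputs in partition order yields globally sorted data.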

Create a partition file

  • job.setPartitionerClass(TotalOrderPartitioner.class);
  • InputSampler.Sampler<Text, Text> sampler = new InputSampler.RandomSampler<Text, Text>(0.1, 200, 3);
  • InputSampler.writePartitionFile(job, sampler);

hadoop jar hadoop-core.jar org.apache.hadoop.mapreduce.lib.partition.InputSampler \
    -inFormat org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat \
    -keyClass org.apache.hadoop.io.Text \
    -r 3 \
    -splitInterval 0.1 3 population_data.txt _partition.lst
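
Conceptually, the sampler collects a set of keys and the partition file stores boundaries chosen from them. The following plain-Java sketch shows one simple way to pick evenly spaced boundaries from sorted samples. It is a simplification under stated assumptions: the class name and sample data are hypothetical, duplicates are not handled, and nothing is read from or written to HDFS.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of choosing split points from sampled keys (hypothetical class).
final class SplitPointDemo {

    // Returns numPartitions - 1 roughly evenly spaced keys from a sorted copy of the samples.
    static List<String> splitPoints(String[] samples, int numPartitions) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("at least one sample is required");
        }
        String[] sorted = samples.clone();
        Arrays.sort(sorted);

        float step = sorted.length / (float) numPartitions;
        List<String> points = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            // Clamp the index so very small sample sets cannot overrun the array.
            int index = Math.min(Math.round(step * i), sorted.length - 1);
            points.add(sorted[index]);
        }
        return points;
    }

    public static void main(String[] args) {
        String[] samples = {"i", "a", "e", "c", "g", "b", "h", "d", "f"};
        System.out.println(splitPoints(samples, 3));
    }
}
```

The more representative the samples are of the real key distribution, the more evenly the records spread across the Reducers.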


Distributing the Partition File

String partitionFile = TotalOrderPartitioner.getPartitionFile(conf);
URI partitionUri = new URI(partitionFile + "#"
                  + TotalOrderPartitioner.DEFAULT_PATH);
job.addCacheFile(partitionUri);

The getPartitionFile method returns the path to the partition file.

The partitionUri is a URI that represents the path that the TotalOrderPartitioner looks up when retrieving the partition file.

The addCacheFile method adds the partition file to the LocalResource so each Container can have access to it.


Overview of Sorting


The shuffle/sort phase performs two key tasks:

1. Keys are sorted in their natural order.

2. Keys that are equal are grouped together.


Recall that the key class has to be of type WritableComparable, which forces a compareTo method to be defined. The compareTo method creates what is called a natural order for the keys.

The Grouping Comparator decides which keys are equal to each other. If two keys are equal, their values get grouped together.


Secondary Sort



The easiest way to implement a secondary sort is to move part of the value into the key to form a composite key.

Here are the three main steps to follow:

1. Write a custom key class that contains the secondary key. While it may be possible to use a built-in Hadoop key class, typically a custom one must be defined.

2. Write a custom Grouping Comparator to determine how keys are grouped.

3. Write a custom Partitioner that ensures grouped keys are sent to the same reducer.
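
The three steps above can be simulated in plain Java to show how they fit together. In this sketch the composite key holds a zip code (the grouping key) and a customer id (the secondary key). All names are hypothetical and no Hadoop classes are involved; in a real job the same logic would live in the key's compareTo method, a grouping comparator, and a Partitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of a secondary sort (hypothetical class, no Hadoop needed).
final class SecondarySortDemo {

    // Step 1: composite key made of the grouping key and the secondary key.
    static final class Key {
        final String zipCode;
        final int customerId;

        Key(String zipCode, int customerId) {
            this.zipCode = zipCode;
            this.customerId = customerId;
        }
    }

    // Sort order of the composite key: zip code first, then customer id.
    static final Comparator<Key> SORT = Comparator
            .comparing((Key k) -> k.zipCode)
            .thenComparingInt(k -> k.customerId);

    // Step 2: grouping looks at the zip code only.
    static final Comparator<Key> GROUP = Comparator.comparing((Key k) -> k.zipCode);

    // Step 3: partition on the zip code only, so a whole group reaches one reducer.
    static int partitionFor(Key key, int numReduceTasks) {
        return (key.zipCode.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Sorts the keys, then starts a new group whenever the grouping comparator sees a change.
    static Map<String, List<Integer>> reduceInput(List<Key> keys) {
        List<Key> sorted = new ArrayList<>(keys);
        sorted.sort(SORT);

        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        Key previous = null;
        for (Key k : sorted) {
            if (previous == null || GROUP.compare(previous, k) != 0) {
                groups.put(k.zipCode, new ArrayList<>());
            }
            groups.get(k.zipCode).add(k.customerId);
            previous = k;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Key> keys = Arrays.asList(
                new Key("94107", 42), new Key("10001", 7),
                new Key("94107", 3), new Key("10001", 9));
        System.out.println(reduceInput(keys));
    }
}
```

Each group corresponds to one reduce call, and within a group the customer ids arrive already sorted, which is the effect a secondary sort is meant to achieve.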


Writing Custom Keys

A custom key class needs to implement the WritableComparable interface.

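import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class CustomerKey implements WritableComparable<CustomerKey> {
    private int customerId;
    private String zipCode;

    // Natural order: zip code first, then customer id.
    @Override
    public int compareTo(CustomerKey arg0) {
        // zipCode is a String, so it is compared with compareTo rather than subtraction.
        int result = this.zipCode.compareTo(arg0.zipCode);

        return (result != 0) ? result : Integer.compare(this.customerId, arg0.customerId);
    }

    // Fields must be read in the same order in which write() emits them.
    @Override
    public void readFields(DataInput in) throws IOException {
        this.customerId = in.readInt();
        this.zipCode = in.readUTF();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(customerId);
        out.writeUTF(zipCode);
    }

    //setters and getters...
}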
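
A quick way to check a key's serialization logic is to write it to a byte array and read it back. The sketch below does this with a stand-in class that has the same two fields as CustomerKey but does not implement the Hadoop interface, so it runs with the JDK alone. PlainCustomerKey and roundTrip are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Plain-Java round trip through write/readFields (hypothetical classes, no Hadoop needed).
final class KeyRoundTripDemo {

    // Stand-in for CustomerKey with the same fields and the same field order.
    static final class PlainCustomerKey {
        int customerId;
        String zipCode;

        void write(DataOutput out) throws IOException {
            out.writeInt(customerId);
            out.writeUTF(zipCode);
        }

        void readFields(DataInput in) throws IOException {
            customerId = in.readInt();
            zipCode = in.readUTF();
        }
    }

    // Serializes the key into memory and deserializes it into a fresh instance.
    static PlainCustomerKey roundTrip(PlainCustomerKey original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                original.write(out);
            }
            PlainCustomerKey copy = new PlainCustomerKey();
            try (DataInputStream in =
                    new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PlainCustomerKey key = new PlainCustomerKey();
        key.customerId = 42;
        key.zipCode = "94107";

        PlainCustomerKey copy = roundTrip(key);
        System.out.println(copy.customerId + " " + copy.zipCode);
    }
}
```

If readFields consumed the fields in a different order than write produced them, the copy would come back corrupted or the read would fail, which is why the two methods must mirror each other.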


Writing a Group Comparator

A custom Group Comparator needs to extend the WritableComparator class.

public class CustomerGroupComparator extends WritableComparator {
    protected CustomerGroupComparator() {
        super(CustomerKey.class, true);
    }

    @SuppressWarnings("rawtypes")
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        CustomerKey lhs = (CustomerKey) a;
        CustomerKey rhs = (CustomerKey) b;

        return lhs.getZipCode().compareTo(rhs.getZipCode());
    }
}


