A Brief Introduction to the Sorting Modes in Hadoop

leibnitz09

License: CC 4.0 BY-SA

Categories: hadoop sources reading, hadoop. Tags: big data, data structures and algorithms

Article link: https://blog.youkuaiyun.com/leibnitz09/article/details/84116080

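In the MapReduce framework, sorting is a fairly important step alongside the distributed computation itself. It matters as much as ordering the rows does in a SQL query.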

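I. No sorting

When writing the job, set mapred.reduce.tasks=0 (same effect as setNumReduceTasks). That is all it takes: with no reduce phase, the map output is written out directly and is never sorted.

The result is one part file per map task (so a single part file only when there is a single map task), and the entries inside are unordered.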

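II. Default sorting (sort only within a partition)

This is also called "local sorting". The job produces several part files, and each file is sorted internally, but there is no ordering between one file and another.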
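The effect of local sorting can be imitated without a cluster. The following is a minimal sketch in plain Java, not Hadoop code: the class LocalSortDemo and the method partitionAndSort are hypothetical names, and the routing rule (non-negative hash modulo the number of reduces) is only assumed to resemble a hash partitioner.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class LocalSortDemo {
    /** Routes each key to a partition by hash, then sorts every partition on its own. */
    static List<List<Integer>> partitionAndSort(List<Integer> keys, int numReduces) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i < numReduces; i++) {
            parts.add(new ArrayList<>());
        }
        for (int key : keys) {
            // Assumed routing rule: non-negative hash modulo the number of reduces.
            int p = (Integer.hashCode(key) & Integer.MAX_VALUE) % numReduces;
            parts.get(p).add(key);
        }
        for (List<Integer> part : parts) {
            Collections.sort(part); // each "part file" is sorted independently
        }
        return parts;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = partitionAndSort(Arrays.asList(7, 2, 9, 4, 1, 8), 2);
        for (int i = 0; i < parts.size(); i++) {
            System.out.println("part-" + i + ": " + parts.get(i));
        }
        // prints:
        // part-0: [2, 4, 8]
        // part-1: [1, 7, 9]
    }
}
```

Each list is sorted, but reading part-0 and then part-1 gives 2, 4, 8, 1, 7, 9, which is not globally ordered.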

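III. Total-order sorting

Using TotalOrderPartitioner as the partitioner is enough to get this (note that, in the version discussed here, it has been removed from the mapreduce lib). You also need to call its setPartitionFile(xx), so that it can read the boundary keys estimated from a sample (there are number of reduces - 1 of them). The sampling itself is usually done with the InputSampler.RandomSampler implementation.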
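To see why number of reduces - 1 boundary keys are enough for a global order, here is a small sketch in plain Java. It is not the TotalOrderPartitioner source: TotalOrderDemo, getPartition and sortGlobally are hypothetical names, integer keys are assumed, and the boundaries are given by hand instead of being read from a partition file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class TotalOrderDemo {
    /** Finds the partition of a key, given sorted boundaries (one fewer than partitions). */
    static int getPartition(int key, int[] boundaries) {
        int pos = Arrays.binarySearch(boundaries, key);
        // Found at index i: the key equals boundary i and goes to partition i + 1.
        // Not found: binarySearch returns -(insertionPoint) - 1, so recover the insertion point.
        return pos >= 0 ? pos + 1 : -pos - 1;
    }

    /** Partitions by range, sorts each partition, then concatenates them in partition order. */
    static List<Integer> sortGlobally(List<Integer> keys, int[] boundaries) {
        List<List<Integer>> parts = new ArrayList<>();
        for (int i = 0; i <= boundaries.length; i++) {
            parts.add(new ArrayList<>());
        }
        for (int key : keys) {
            parts.get(getPartition(key, boundaries)).add(key);
        }
        List<Integer> out = new ArrayList<>();
        for (List<Integer> part : parts) {
            Collections.sort(part); // local sort, as in the default mode
            out.addAll(part);       // reading part files in order yields a global order
        }
        return out;
    }

    public static void main(String[] args) {
        int[] boundaries = {4, 8}; // 3 reduces need 2 boundaries
        System.out.println(sortGlobally(Arrays.asList(7, 2, 9, 4, 1, 8), boundaries));
        // prints: [1, 2, 4, 7, 8, 9]
    }
}
```

Because every key in partition i is smaller than every key in partition i + 1, sorting inside each partition is all that remains to be done.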

The sampling algorithm of RandomSampler is as follows:

    /**
     * Randomize the split order, then take the specified number of keys from
     * each split sampled, where each key is selected with the specified
     * probability and possibly replaced by a subsequently selected key when
     * the quota of keys from that split is satisfied.
     */
    public K[] getSample(InputFormat<K,V> inf, JobConf job) throws IOException {
      InputSplit[] splits = inf.getSplits(job, job.getNumMapTasks());
      ArrayList<K> samples = new ArrayList<K>(numSamples);
      int splitsToSample = Math.min(maxSplitsSampled, splits.length);  // how many splits to sample

      Random r = new Random();
      long seed = r.nextLong();
      r.setSeed(seed);
      LOG.debug("seed: " + seed);
      // shuffle splits: randomly swap them so the sampled splits are spread more evenly
      for (int i = 0; i < splits.length; ++i) {
        InputSplit tmp = splits[i];
        int j = r.nextInt(splits.length);
        splits[i] = splits[j];
        splits[j] = tmp;
      }
      // our target rate is in terms of the maximum number of sample splits,
      // but we accept the possibility of sampling additional splits to hit
      // the target sample keyset
      for (int i = 0; i < splitsToSample ||
                     (i < splits.length && samples.size() < numSamples); ++i) {
        RecordReader<K,V> reader = inf.getRecordReader(splits[i], job,
            Reporter.NULL);
        K key = reader.createKey();
        V value = reader.createValue();
        while (reader.next(key, value)) {
          if (r.nextDouble() <= freq) {    // keep this key with probability freq
            if (samples.size() < numSamples) {  // below the cap: add the sample directly
              samples.add(key);
            } else {
              // When exceeding the maximum number of samples, replace a
              // random element with this one, then adjust the frequency
              // to reflect the possibility of existing elements being
              // pushed out
              int ind = r.nextInt(numSamples); // otherwise overwrite a random existing sample
              if (ind != numSamples) {
                samples.set(ind, key);
              }
              freq *= (numSamples - 1) / (double) numSamples; // after a replacement, lower the chance of later ones so they do not happen too often
            }
            key = reader.createKey();
          }
        }
        reader.close();
      }
      return (K[])samples.toArray();
    }
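The keep-or-replace logic in the inner loop can be tried in isolation. The sketch below copies only that part into plain Java over a list of integers; RandomSampleDemo and sample are hypothetical names, and the seed parameter is an assumption added here so that runs are repeatable. Splits, RecordReader and the rest of the Hadoop plumbing are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class RandomSampleDemo {
    /** Keeps at most numSamples keys; each key is considered with probability freq. */
    static List<Integer> sample(List<Integer> keys, double freq, int numSamples, long seed) {
        Random r = new Random(seed);
        List<Integer> samples = new ArrayList<>(numSamples);
        for (int key : keys) {
            if (r.nextDouble() <= freq) {
                if (samples.size() < numSamples) {
                    samples.add(key); // still below the cap
                } else {
                    // Cap reached: overwrite a random slot, then lower the
                    // probability that later keys push out existing ones.
                    samples.set(r.nextInt(numSamples), key);
                    freq *= (numSamples - 1) / (double) numSamples;
                }
            }
        }
        return samples;
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add(i);
        }
        List<Integer> samples = sample(keys, 0.1, 20, 42L);
        System.out.println(samples.size() + " samples: " + samples);
    }
}
```

Whatever the input size, the result never holds more than numSamples keys, which is what keeps the later boundary computation cheap.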

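From the returned samples, Hadoop works out how the keys are distributed. I have not found where the actual algorithm is implemented, but I think it goes roughly like this:

1. Compute 100 / (number of reduces) / 100, i.e. 1 / (number of reduces), as the average share of keys each partition should receive;

2. Sort the samples;

3. Going from low to high (the reverse works too), accumulate the share of keys interval by interval; once the average share is reached (some deviation is allowed), stop adding to this interval and take its largest key as the first boundary;

4. Handle the remaining keys in the same way;

5. This may end up producing many groups of boundary values, so a further optimization pass is needed to filter them.

However, I tried implementing this and found the computation fairly complicated, because you do not know when to stop, and you also have to remember the boundary values for the different cases.

I suspect Hadoop also sets an offset value and limits the number of optimization rounds. TODO: I will keep digging through the source when I have time.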
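A simpler alternative to the accumulate-and-optimize steps above is to sort the samples and pick the keys at evenly spaced positions. As far as I can tell, InputSampler.writePartitionFile in Hadoop works along these lines, but the sketch below is a simplified standalone illustration and not the Hadoop code: SplitPointDemo and splitPoints are hypothetical names, integer keys are assumed, and a candidate equal to the previous boundary is simply dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SplitPointDemo {
    /** Picks up to numPartitions - 1 boundary keys at evenly spaced positions of the sorted samples. */
    static List<Integer> splitPoints(int[] samples, int numPartitions) {
        if (samples.length == 0 || numPartitions < 1) {
            throw new IllegalArgumentException("need at least one sample and one partition");
        }
        int[] sorted = samples.clone();
        Arrays.sort(sorted);
        float stepSize = sorted.length / (float) numPartitions;
        List<Integer> points = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            int k = Math.min(Math.round(stepSize * i), sorted.length - 1);
            int candidate = sorted[k];
            // Simplification: skip a candidate that repeats the previous boundary,
            // so heavily duplicated samples may yield fewer boundaries.
            if (points.isEmpty() || candidate > points.get(points.size() - 1)) {
                points.add(candidate);
            }
        }
        return points;
    }

    public static void main(String[] args) {
        int[] samples = {9, 1, 5, 3, 7, 2, 8, 4, 6};
        System.out.println(splitPoints(samples, 3));
        // prints: [4, 7]
    }
}
```

With these two boundaries the nine sample keys fall into three groups of three, so there is no iteration and no question of when to stop.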

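IV. Grouping (secondary sort)

This works much like the GROUP BY clause in SQL: on top of the already sorted data, records are further merged by key so that each grouping key appears only once.

The implementation is also quite simple. The process is roughly:

1. Build a composite key;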