Hadoop Source Code Series --- Reading the Mapper Source: the Output Path

I. Preface

Last time we covered MapReduce's input; this time we move on to MapReduce's output. Keep the MapReduce primitive firmly in mind:

All values sharing the "same" key form one group; the reduce method is called once per group, and inside that call you iterate over the group's data!
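To make the primitive concrete, here is a minimal WordCount-style reducer sketch (the class name and types are illustrative, not taken from the source below): the framework invokes reduce() once per key group, and the body iterates exactly that group's values.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Minimal sketch: one reduce() call per key group -- the MapReduce primitive.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) { // iterate only this key's group
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}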

II. Code

We continue with the MapTask class.

private <INKEY,INVALUE,OUTKEY,OUTVALUE>
  void runNewMapper(final JobConf job,
                    final TaskSplitIndex splitIndex,
                    final TaskUmbilicalProtocol umbilical,
                    TaskReporter reporter
                    ) throws IOException, ClassNotFoundException,
                             InterruptedException {
    // make a task context so we can get the classes
    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
      new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job, 
                                                                  getTaskID(),
                                                                  reporter);
    // make a mapper
    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
      (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
        ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
    // make the input format
    org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
      (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
    // rebuild the input split
    org.apache.hadoop.mapreduce.InputSplit split = null;
    split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
        splitIndex.getStartOffset());
    LOG.info("Processing split: " + split);

    org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
      new NewTrackingRecordReader<INKEY,INVALUE>
        (split, inputFormat, reporter, taskContext);
    
    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
    org.apache.hadoop.mapreduce.RecordWriter output = null;
    
    // get an output object
    if (job.getNumReduceTasks() == 0) {
      output = 
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
      output = new NewOutputCollector(taskContext, job, umbilical, reporter); // see Analysis 1
    }

    org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE> 
    mapContext = 
      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(), 
          input, output, 
          committer, 
          reporter, split);

    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context 
        mapperContext = 
          new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
              mapContext);

    try {
      input.initialize(split, mapperContext);
      mapper.run(mapperContext);
      mapPhase.complete();
      setPhase(TaskStatus.Phase.SORT);
      statusUpdate(umbilical);
      input.close();
      input = null;
      output.close(mapperContext);
      output = null;
    } finally {
      closeQuietly(input);
      closeQuietly(output, mapperContext);
    }
  }

Analysis 1. Constructing the output object:

 NewOutputCollector(org.apache.hadoop.mapreduce.JobContext jobContext,
                       JobConf job,
                       TaskUmbilicalProtocol umbilical,
                       TaskReporter reporter
                       ) throws IOException, ClassNotFoundException {
      collector = createSortingCollector(job, reporter); // see Analysis 1.2
      partitions = jobContext.getNumReduceTasks(); // one partition per reduce task; partitioning is a coarser concept than grouping (one partition can hold many key groups)
      if (partitions > 1) {
        partitioner = (org.apache.hadoop.mapreduce.Partitioner<K,V>)
          ReflectionUtils.newInstance(jobContext.getPartitionerClass(), job); // see Analysis 1.1
      } else {
        partitioner = new org.apache.hadoop.mapreduce.Partitioner<K,V>() {
          @Override
          public int getPartition(K key, V value, int numPartitions) {
            return partitions - 1; // with the default single reducer, every record lands in partition 0
          }
        };
      }
    }
  @Override
    public void write(K key, V value) throws IOException, InterruptedException {
      collector.collect(key, value,
                        partitioner.getPartition(key, value, partitions)); // the context's write() lands here: key, value, and partition number go into the collector's buffer
    }


Analysis 1.1: resolving the Partitioner class

public Class<? extends Partitioner<?,?>> getPartitionerClass()
    throws ClassNotFoundException {
  return (Class<? extends Partitioner<?,?>>)
    conf.getClass(PARTITIONER_CLASS_ATTR, HashPartitioner.class); // the user's class if set, otherwise HashPartitioner by default -- see Analysis 1.1.1
}
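For comparison, here is a hedged sketch of a user-supplied Partitioner (the class name and routing rule are illustrative assumptions, not Hadoop source). Registering it with job.setPartitionerClass() is what sets PARTITIONER_CLASS_ATTR, so getPartitionerClass() above returns it instead of HashPartitioner:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative only: route keys by their first character.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String s = key.toString();
    int c = s.isEmpty() ? 0 : s.charAt(0);
    return (c & Integer.MAX_VALUE) % numPartitions; // same non-negative mask as HashPartitioner
  }
}

// In the driver: job.setPartitionerClass(FirstLetterPartitioner.class);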

Analysis 1.2: the concrete implementation of createSortingCollector

 private <KEY, VALUE> MapOutputCollector<KEY, VALUE>
          createSortingCollector(JobConf job, TaskReporter reporter)
    throws IOException, ClassNotFoundException {
    MapOutputCollector.Context context =
      new MapOutputCollector.Context(this, job, reporter);

    Class<?>[] collectorClasses = job.getClasses(
      JobContext.MAP_OUTPUT_COLLECTOR_CLASS_ATTR, MapOutputBuffer.class);
    int remainingCollectors = collectorClasses.length;
    for (Class clazz : collectorClasses) {
      try {
        if (!MapOutputCollector.class.isAssignableFrom(clazz)) {
          throw new IOException("Invalid output collector class: " + clazz.getName() +
            " (does not implement MapOutputCollector)");
        }
        Class<? extends MapOutputCollector> subclazz =
          clazz.asSubclass(MapOutputCollector.class);
        LOG.debug("Trying map output collector class: " + subclazz.getName());
        MapOutputCollector<KEY, VALUE> collector =
          ReflectionUtils.newInstance(subclazz, job);
        collector.init(context); // see Analysis 1.2.1
        LOG.info("Map output collector class = " + collector.getClass().getName());
        return collector;
      } catch (Exception e) {
        String msg = "Unable to initialize MapOutputCollector " + clazz.getName();
        if (--remainingCollectors > 0) {
          msg += " (" + remainingCollectors + " more collector(s) to try)";
        }
        LOG.warn(msg, e);
      }
    }
    throw new IOException("Unable to initialize any output collector");
  }
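Since the collector classes are read from JobContext.MAP_OUTPUT_COLLECTOR_CLASS_ATTR (a comma-separated list tried in order, with MapOutputBuffer as the default), a job can supply its own collector with a built-in fallback. A minimal sketch -- to the best of my knowledge the property key is mapreduce.job.map.output.collector.class, and com.example.MyCollector is a hypothetical class implementing MapOutputCollector:

import org.apache.hadoop.conf.Configuration;

public class CollectorConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Hypothetical custom collector first; the built-in buffer as fallback.
    conf.set("mapreduce.job.map.output.collector.class",
             "com.example.MyCollector,org.apache.hadoop.mapred.MapTask$MapOutputBuffer");
    System.out.println(conf.get("mapreduce.job.map.output.collector.class"));
  }
}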

Analysis 1.2.1: initializing the collector's buffer

 public void init(MapOutputCollector.Context context
                    ) throws IOException, ClassNotFoundException {
      job = context.getJobConf();
      reporter = context.getReporter();
      mapTask = context.getMapTask();
      mapOutputFile = mapTask.getMapOutputFile();
      sortPhase = mapTask.getSortPhase();
      spilledRecordsCounter = reporter.getCounter(TaskCounter.SPILLED_RECORDS);
      partitions = job.getNumReduceTasks();
      rfs = ((LocalFileSystem)FileSystem.getLocal(job)).getRaw();

      //sanity checks
      final float spillper =
        job.getFloat(JobContext.MAP_SORT_SPILL_PERCENT, (float)0.8); // spill threshold: spilling starts once the buffer is 80% full
      final int sortmb = job.getInt(JobContext.IO_SORT_MB, 100); // buffer size, 100 MB by default
      indexCacheMemoryLimit = job.getInt(JobContext.INDEX_CACHE_MEMORY_LIMIT,
                                         INDEX_CACHE_MEMORY_LIMIT_DEFAULT);
      if (spillper > (float)1.0 || spillper <= (float)0.0) {
        throw new IOException("Invalid \"" + JobContext.MAP_SORT_SPILL_PERCENT +
            "\": " + spillper);
      }
      if ((sortmb & 0x7FF) != sortmb) {
        throw new IOException(
            "Invalid \"" + JobContext.IO_SORT_MB + "\": " + sortmb);
      }
      sorter = ReflectionUtils.newInstance(job.getClass("map.sort.class",
            QuickSort.class, IndexedSorter.class), job); // records are sorted before being spilled from the buffer to disk; QuickSort by default
      // buffers and accounting
      int maxMemUsage = sortmb << 20;
      maxMemUsage -= maxMemUsage % METASIZE;
      kvbuffer = new byte[maxMemUsage];
      bufvoid = kvbuffer.length;
      kvmeta = ByteBuffer.wrap(kvbuffer)
         .order(ByteOrder.nativeOrder())
         .asIntBuffer();
      setEquator(0);
      bufstart = bufend = bufindex = equator;
      kvstart = kvend = kvindex;

      maxRec = kvmeta.capacity() / NMETA;
      softLimit = (int)(kvbuffer.length * spillper);
      bufferRemaining = softLimit;
      LOG.info(JobContext.IO_SORT_MB + ": " + sortmb);
      LOG.info("soft limit at " + softLimit);
      LOG.info("bufstart = " + bufstart + "; bufvoid = " + bufvoid);
      LOG.info("kvstart = " + kvstart + "; length = " + maxRec);
      comparator = job.getOutputKeyComparator(); // the comparator used for sorting -- see Analysis 1.2.1.1
      keyClass = (Class<K>)job.getMapOutputKeyClass();
      valClass = (Class<V>)job.getMapOutputValueClass();
      serializationFactory = new SerializationFactory(job);
      keySerializer = serializationFactory.getSerializer(keyClass);
      keySerializer.open(bb);
      valSerializer = serializationFactory.getSerializer(valClass);
      valSerializer.open(bb);
// combiner
      final Counters.Counter combineInputCounter =
        reporter.getCounter(TaskCounter.COMBINE_INPUT_RECORDS);
      combinerRunner = CombinerRunner.create(job, getTaskID(), // the map-side combiner
                                             combineInputCounter,
                                             reporter, null);
      if (combinerRunner != null) {
        final Counters.Counter combineOutputCounter =
          reporter.getCounter(TaskCounter.COMBINE_OUTPUT_RECORDS);
        combineCollector= new CombineOutputCollector<K,V>(combineOutputCounter, reporter, job);
      } else {
        combineCollector = null;
      }

      spillInProgress = false;
      minSpillsForCombine = job.getInt(JobContext.MAP_COMBINE_MIN_SPILLS, 3); // during the merge of spill files the combiner runs again only if there are at least 3 spills
      spillThread.setDaemon(true); // spilling is handled by a separate daemon thread -- see Analysis 1.2.1.2
      spillThread.setName("SpillThread");
      spillLock.lock();

Summary: the Mapper writes its output into a buffer that is 100 MB by default; when the buffer reaches 80% full, it spills to disk! This is a tuning point: adjust the threshold by repeated bisection -- for example, first try 50%, then move back toward 80% via 70%, then 60%, narrowing in on a good value. A driver-side sketch of the relevant knobs follows.
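A hedged sketch of those two knobs in driver code; the property names correspond to the constants read above (JobContext.MAP_SORT_SPILL_PERCENT and JobContext.IO_SORT_MB), and the concrete values are illustrative only:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuning {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Spill threshold (JobContext.MAP_SORT_SPILL_PERCENT), default 0.8.
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.7f);
    // Buffer size in MB (JobContext.IO_SORT_MB), default 100.
    conf.setInt("mapreduce.task.io.sort.mb", 200);
    Job job = Job.getInstance(conf, "tuned-map-output");
    System.out.println(job.getConfiguration().get("mapreduce.task.io.sort.mb"));
  }
}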

Be careful with the Combiner: not every operation is safe to combine. Averaging is the classic counterexample, as the sketch below shows.
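A minimal, pure-Java sketch (no Hadoop types) of why: an average of partial averages is not the overall average, so a combiner that averages changes the result. The safe pattern is to have the combiner emit (sum, count) pairs and divide only in the reducer.

import java.util.Arrays;
import java.util.List;

public class AvgCombinePitfall {
  static double avg(List<Double> xs) {
    return xs.stream().mapToDouble(Double::doubleValue).average().orElse(0);
  }

  public static void main(String[] args) {
    // True average of 1, 3, 5:
    System.out.println(avg(Arrays.asList(1.0, 3.0, 5.0)));   // 3.0
    // "Combining" two partial averages (spill 1: [1, 3]; spill 2: [5]):
    double spill1 = avg(Arrays.asList(1.0, 3.0));             // 2.0
    double spill2 = avg(Arrays.asList(5.0));                  // 5.0
    System.out.println(avg(Arrays.asList(spill1, spill2)));   // 3.5 -- wrong!
  }
}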

Analysis 1.2.1.1: how the sort comparator is chosen

public RawComparator getOutputKeyComparator() {
    Class<? extends RawComparator> theClass = getClass(
      JobContext.KEY_COMPARATOR, null, RawComparator.class); // the user-defined comparator, if one was set
    if (theClass != null)
      return ReflectionUtils.newInstance(theClass, this);
    return WritableComparator.get(getMapOutputKeyClass().asSubclass(WritableComparable.class), this); // no user comparator: fall back to the key type's own comparator, so the key class must support serialization, deserialization, and comparison (i.e. be a WritableComparable)
  }

Summary: by default the framework uses the key type's own comparator, which yields dictionary (lexicographic) order for text keys; the user can also override it with a custom comparator, as sketched below.
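A hedged sketch of overriding the key comparator -- say, to sort IntWritable keys in descending order (the class is illustrative, not from the source). Registering it with job.setSortComparatorClass() populates JobContext.KEY_COMPARATOR, so getOutputKeyComparator() above returns it:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Illustrative only: flip the key type's natural (ascending) order.
public class DescendingIntComparator extends WritableComparator {
  public DescendingIntComparator() {
    super(IntWritable.class, true); // true: instantiate keys so compare() gets deserialized objects
  }

  @Override
  @SuppressWarnings({"rawtypes", "unchecked"})
  public int compare(WritableComparable a, WritableComparable b) {
    return -a.compareTo(b);
  }
}

// In the driver: job.setSortComparatorClass(DescendingIntComparator.class);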

Analysis 1.2.1.2: what the spill thread does
protected class SpillThread extends Thread {

      @Override
      public void run() {
        spillLock.lock();
        spillThreadRunning = true;
        try {
          while (true) {
            spillDone.signal();
            while (!spillInProgress) {
              spillReady.await();
            }
            try {
              spillLock.unlock();
              sortAndSpill(); // sort, then spill to disk
            } catch (Throwable t) {
              sortSpillException = t;
            } finally {
              spillLock.lock();
              if (bufend < bufstart) {
                bufvoid = kvbuffer.length;
              }
              kvstart = kvend;
              bufstart = bufend;
              spillInProgress = false;
            }
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        } finally {
          spillLock.unlock();
          spillThreadRunning = false;
        }
      }
    }

Summary: the map side writes records into the buffer while a separate thread sorts and spills the buffer's contents to disk -- the spill sort is QuickSort! The Combiner, the buffer, and the sorting all run in memory, alongside the map method itself. The Combiner runs a first time as each spill file is produced from the buffer; when those spill files are later merged it can run a second time, but only conditionally (at least 3 spill files, per minSpillsForCombine).

Analysis 1.1.1: the default HashPartitioner

public class HashPartitioner<K, V> extends Partitioner<K, V> {

  /** Use {@link Object#hashCode()} to partition. */
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks; // the canonical way to derive a non-negative partition number
  }
}
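Why the & Integer.MAX_VALUE mask? hashCode() may be negative, and Java's % keeps the sign of the dividend, so a plain modulo could produce an invalid negative partition; masking off the sign bit keeps the result in [0, numReduceTasks). A tiny standalone check (the value -7 is just an example):

public class MaskDemo {
  public static void main(String[] args) {
    int h = -7; // pretend this is a key's hashCode()
    System.out.println(h % 3);                       // -1: not a valid partition number
    System.out.println((h & Integer.MAX_VALUE) % 3); // 1: always non-negative
  }
}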

Summary 1. All of the source above is reached from output = new NewOutputCollector(taskContext, job, umbilical, reporter); so constructing the output also constructs a partitioner -- either the constant partition-0 one, the user-supplied one, or the default HashPartitioner.
Summary 2. Constructing the output also sets up the in-memory buffer.
Summary 3. Everything above is part of the output's initialization.
Summary 4. The map output (K, V) becomes (K, V, P) -- P being the partition number -- and is written into the circular in-memory buffer; when the buffer reaches 80% it is sorted and spilled (ordered by partition first, then by key within each partition), producing small spill files; those files are finally merged with merge sort -- since each spill file is already internally sorted, the merge needs only a single I/O pass.

Updates to this series are ongoing -- feel free to follow my WeChat official account, LHWorld.

Reposted from: https://www.cnblogs.com/LHWorldBlog/p/8252953.html