Chapter 4: Developing MapReduce Applications

License: CC 4.0 BY-SA

Tags: #BigData


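This article describes how to develop and run MapReduce programs, covering system parameter configuration, setting up the development environment, writing a program, local testing, running, and tuning. It focuses on the MapReduce workflow and on how the task functions setup, cleanup, and run execute, and it touches on performance tuning strategies.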

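4.1 System parameter configuration

A property marked "final" in a configuration resource cannot be overridden: once one resource declares a value final, no resource loaded after it can change that value.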
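
The rule can be modeled in a few lines of plain Java. This is only a toy sketch of the behavior, not Hadoop's Configuration class: LayeredConf and its load method are hypothetical names, and dfs.replication is used purely as an example key.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of layered configuration resources; NOT Hadoop's Configuration class. */
class LayeredConf {
    private final Map<String, String> values = new HashMap<>();
    private final Set<String> finalKeys = new HashSet<>();

    /**
     * Applies one property from a resource.
     * Returns false (and changes nothing) if an earlier resource marked the key final.
     */
    boolean load(String name, String value, boolean isFinal) {
        if (finalKeys.contains(name)) {
            return false; // the key is locked, so the later value is ignored
        }
        values.put(name, value);
        if (isFinal) {
            finalKeys.add(name);
        }
        return true;
    }

    String get(String name) {
        return values.get(name);
    }

    public static void main(String[] args) {
        LayeredConf conf = new LayeredConf();
        conf.load("dfs.replication", "3", true);   // first resource declares the value final
        conf.load("dfs.replication", "1", false);  // a later resource tries to override it
        System.out.println(conf.get("dfs.replication")); // the first value is still in effect
    }
}
```

In this model a non-final key can still be overwritten any number of times; only the first "final" declaration freezes it.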

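4.2 Configuring the development environment

Hadoop can run in three modes: standalone (local) mode, pseudo-distributed mode, and fully distributed mode.

4.3 Writing a MapReduce program

4.4 Local testing

See p. 62.

4.5 Running a MapReduce program

See p. 62.

4.6 Web user interface

See p. 65.

4.7 Performance tuning

See p. 68.

4.8 MapReduce workflow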

1. The setup function

  /**
   * Called once at the beginning of the task.
   */
  protected void setup(Context context
                       ) throws IOException, InterruptedException {
    // NOTHING
  }

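setup is called only once, after the task starts and before any data is processed. By contrast, map is called once for every key/value pair in the input split, and reduce is called once for every key.

2. The cleanup function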

  /**
   * Called once at the end of the task.
   */
  protected void cleanup(Context context
                         ) throws IOException, InterruptedException {
    // NOTHING
  }

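cleanup is called once at the end of the task, after the last record has been processed and before the task is destroyed.

3. The run function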

  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  }

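run is the function that drives the task: it calls setup once, then map for each key/value pair, and finally cleanup once.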
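
The call order can be tried out without a cluster. The sketch below is a simplified stand-in for Hadoop's Mapper, written in plain Java under the assumption that records arrive through an Iterator and output goes to a List. MiniMapper, WordCountMapper, and runOn are hypothetical names. The subclass shows one common reason to override setup and cleanup: accumulate counts in memory during map and emit them once at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Simplified stand-in for Hadoop's Mapper: same setup/map/cleanup/run shape, no Hadoop types. */
abstract class MiniMapper<V> {
    protected void setup(List<String> out) { }               // once, before any record
    protected abstract void map(V value, List<String> out);  // once per record
    protected void cleanup(List<String> out) { }             // once, after the last record

    /** Mirrors the structure of Mapper.run(): setup, loop over records, cleanup. */
    public void run(Iterator<V> records, List<String> out) {
        setup(out);
        while (records.hasNext()) {
            map(records.next(), out);
        }
        cleanup(out);
    }

    public static void main(String[] args) {
        System.out.println(WordCountMapper.runOn(Arrays.asList("a b a", "b c")));
    }
}

/** Counts words in memory and emits the totals once, in cleanup. */
class WordCountMapper extends MiniMapper<String> {
    private Map<String, Integer> counts;

    @Override
    protected void setup(List<String> out) {
        counts = new TreeMap<>(); // per-task state, created exactly once
    }

    @Override
    protected void map(String line, List<String> out) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    @Override
    protected void cleanup(List<String> out) {
        // Emit one "word<TAB>count" line per distinct word, in sorted key order.
        counts.forEach((word, n) -> out.add(word + "\t" + n));
    }

    /** Convenience helper: runs a fresh mapper over the given lines and returns its output. */
    static List<String> runOn(List<String> lines) {
        List<String> out = new ArrayList<>();
        new WordCountMapper().run(lines.iterator(), out);
        return out;
    }
}
```

Because nothing is emitted from map here, the output appears only when cleanup runs, which makes the once-per-task versus once-per-record distinction easy to observe.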

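Globally shared data in a MapReduce job

1. Reading HDFS files

Tasks can share data through files on HDFS. When several map and reduce tasks write to the same file, later writes overwrite earlier data, and the extra I/O consumes resources.

2. Setting job properties

Set a property with set() on the Configuration class, then read it inside a task with get() (through context.getConfiguration()). This approach is poorly suited to sharing large amounts of data.

3. DistributedCache

DistributedCache is a facility that MapReduce provides for caching read-only files needed by an application.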
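
A typical use is to load a small lookup table from the cached file once per task, usually in setup, and consult it from map. In Hadoop 2.x and later the DistributedCache class itself is deprecated, and files are registered with Job.addCacheFile and retrieved in the task with context.getCacheFiles. The sketch below covers only the loading step in plain Java, assuming a tab-separated file that is already on the local disk. CachedLookup and the sample data are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: loads a small tab-separated lookup file into memory. */
class CachedLookup {
    /** Parses "key<TAB>value" lines; lines without a tab or with an empty key are skipped. */
    static Map<String, String> load(Path localFile) throws IOException {
        Map<String, String> table = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(localFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab > 0) {
                    table.put(line.substring(0, tab), line.substring(tab + 1));
                }
            }
        }
        return table;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a file that the framework would have copied to the task's local disk.
        Path file = Files.createTempFile("lookup", ".tsv");
        Files.write(file, Arrays.asList("CN\tChina", "US\tUnited States"), StandardCharsets.UTF_8);

        Map<String, String> table = load(file); // in a real task this call would sit in setup()
        System.out.println(table.get("CN"));

        Files.delete(file);
    }
}
```

Since the file is read-only and loaded once, every call to map can look up keys in memory instead of going back to HDFS.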