map/reduce unit test

kokojhuang

License: CC 4.0 BY-SA

Category: Hadoop Applications

Article link: https://blog.youkuaiyun.com/kokojhuang/article/details/8292924

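Because Map/Reduce jobs run in a distributed Hadoop environment, day-to-day development of map/reduce code is inconvenient, especially debugging and testing. MRUnit, an open-source Apache project, makes it possible to unit test Map/Reduce code, so a mapper or reducer can be debugged through unit test cases and covered by a rich set of tests. Once the basic business logic has been verified on a local development machine, the job can be deployed to the distributed Hadoop environment, where only the problems introduced by distribution remain to be solved.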

The MRUnit web site already documents its usage in detail. (MRUnit web site)

The rest of this article covers the basic principles of Map/Reduce and MRUnit, and shows a few extensions to MRUnit that support more complex business scenarios.

1. MAP

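How Map works (as background for how MRUnit handles the Map phase): a map task reads the source data and applies the map operation to it. In the map phase the Hadoop Map/Reduce framework calls org.apache.hadoop.mapreduce.Mapper.run, which is implemented as follows: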

  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  }

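As the method shows, run calls setup at the start and cleanup at the end. The core work is to read records through the map context and pass each key/value pair to the concrete map logic (usually the part we implement). The map logic finally emits its results through context.write into the intermediate output (usually files), where the reduce phase picks them up.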
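The control flow of run can be modelled without Hadoop. The following is a minimal plain-JDK sketch, not Hadoop code: SketchContext, SketchMapper and WordCountSketchMapper are hypothetical names invented here, and the context simply iterates an in-memory list and collects whatever is written.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for Mapper.Context: iterates input pairs and collects output. */
class SketchContext<KI, VI, KO, VO> {
    private final Iterator<Map.Entry<KI, VI>> input;
    private Map.Entry<KI, VI> current;
    final List<Map.Entry<KO, VO>> written = new ArrayList<>();

    SketchContext(List<Map.Entry<KI, VI>> pairs) {
        this.input = pairs.iterator();
    }

    /** Advances to the next record; returns false once the input is exhausted. */
    boolean nextKeyValue() {
        if (!input.hasNext()) {
            current = null;
            return false;
        }
        current = input.next();
        return true;
    }

    KI getCurrentKey() { return current.getKey(); }
    VI getCurrentValue() { return current.getValue(); }
    void write(KO key, VO value) { written.add(Map.entry(key, value)); }
}

/** Hypothetical stand-in for Mapper: run() drives setup, one map call per record, cleanup. */
abstract class SketchMapper<KI, VI, KO, VO> {
    protected void setup(SketchContext<KI, VI, KO, VO> context) { }
    protected abstract void map(KI key, VI value, SketchContext<KI, VI, KO, VO> context);
    protected void cleanup(SketchContext<KI, VI, KO, VO> context) { }

    public void run(SketchContext<KI, VI, KO, VO> context) {
        setup(context);
        while (context.nextKeyValue()) {
            map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
        cleanup(context);
    }
}

/** Example map logic: emits (word, 1) for each whitespace-separated word of a line. */
class WordCountSketchMapper extends SketchMapper<Long, String, String, Integer> {
    @Override
    protected void map(Long offset, String line, SketchContext<Long, String, String, Integer> context) {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(word, 1);
            }
        }
    }
}
```

Running the mapper against a SketchContext built from a short list of lines leaves the emitted pairs in the context's written list, which is the property a test driver relies on.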

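With that principle in mind, what MRUnit mainly does is mock a map context that accepts the data simulated in a unit test and compares the map output with the output the test case expects. MapDriver takes the input and the expected output through withInput and withOutput, and runTest executes the map test case, as follows: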

        mapDriver.withInput(new ChunkKey(), chunkWritable);
        mapDriver.withOutput(new BytesWritable(SerializeUtil.serializeToBytes(multipleObject)), new BytesWritable(SerializeUtil.serializeToBytes(chunk)));
        mapDriver.runTest();

runTest is implemented as follows: it calls run to obtain the map output and then compares it with the expected output:

  public void runTest(final boolean orderMatters) {
    LOG.debug("Mapping input (" + inputKey + ", " + inputVal + ")");
    try {
      final List<Pair<K2, V2>> outputs = run();
      validate(outputs, orderMatters);
      validate(counterWrapper);
    } catch (final IOException ioe) {
      LOG.error("IOException in mapper", ioe);
      throw new RuntimeException("IOException in mapper: ", ioe);
    }
  }
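The article does not show validate itself. As a rough model of what such a comparison involves, here is a plain-JDK sketch under the assumption that outputs are compared with equals and that orderMatters switches between positional and multiset comparison. OutputValidatorSketch and its messages are hypothetical and are not MRUnit's implementation.

```java
import java.util.ArrayList;
import java.util.List;

class OutputValidatorSketch {
    /** Returns one message per mismatch; an empty list means the outputs match. */
    static <T> List<String> validate(List<T> expected, List<T> actual, boolean orderMatters) {
        List<String> errors = new ArrayList<>();
        if (orderMatters) {
            // Positional comparison: element i of actual must equal element i of expected.
            int n = Math.max(expected.size(), actual.size());
            for (int i = 0; i < n; i++) {
                if (i >= expected.size()) {
                    errors.add("unexpected output at " + i + ": " + actual.get(i));
                } else if (i >= actual.size()) {
                    errors.add("missing expected output at " + i + ": " + expected.get(i));
                } else if (!expected.get(i).equals(actual.get(i))) {
                    errors.add("mismatch at " + i + ": expected " + expected.get(i)
                            + " but got " + actual.get(i));
                }
            }
        } else {
            // Multiset comparison: each expected element consumes one matching actual element.
            List<T> remaining = new ArrayList<>(actual);
            for (T e : expected) {
                if (!remaining.remove(e)) {
                    errors.add("missing expected output: " + e);
                }
            }
            for (T a : remaining) {
                errors.add("unexpected output: " + a);
            }
        }
        return errors;
    }
}
```

A driver built on this idea would fail the test whenever the returned list is non-empty.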

run is implemented as follows: it mocks a map context from the input, calls the mapper's run method, and returns the map output.

    final List<Pair<K1, V1>> inputs = new ArrayList<Pair<K1, V1>>();
    inputs.add(new Pair<K1, V1>(inputKey, inputVal));
    final InputSplit inputSplit = new MockInputSplit();
    try {
      final MockMapContextWrapper<K1, V1, K2, V2> wrapper = new MockMapContextWrapper<K1, V1, K2, V2>(
          inputs, getCounters(), getConfiguration(), inputSplit);
      final Mapper<K1, V1, K2, V2>.Context context = wrapper.getMockContext();
      myMapper.run(context);
      return wrapper.getOutputs();
    } catch (final InterruptedException ie) {
      throw new IOException(ie);
    }

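As the code above shows, the key step is mocking the map context. MRUnit mainly mocks the context's nextKeyValue(), getCurrentKey() and getCurrentValue(), feeding them the values supplied through withInput, and mocks write so that output is appended to an internal List. The mocks are built with Mockito, as follows: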

    when(context.nextKeyValue()).thenAnswer(new Answer<Boolean>() {
      @Override
      public Boolean answer(final InvocationOnMock invocation) {
        if (inputs.size() > 0) {
          currentKeyValue = inputs.remove(0);
          return true;
        } else {
          currentKeyValue = null;
          return false;
        }
      }
    });
    when(context.getCurrentKey()).thenAnswer(new Answer<KEYIN>() {
      @Override
      public KEYIN answer(final InvocationOnMock invocation) {
        return currentKeyValue.getFirst();
      }
    });
    when(context.getCurrentValue()).thenAnswer(new Answer<VALUEIN>() {
      @Override
      public VALUEIN answer(final InvocationOnMock invocation) {
        return currentKeyValue.getSecond();
      }
    });

    doAnswer(new Answer<Object>() {
      @Override
      public Object answer(final InvocationOnMock invocation) {
        final Object[] args = invocation.getArguments();
        outputs.add(new Pair(copy(args[0], conf), copy(args[1], conf)));
        return null;
      }
    }).when(context).write((KEYOUT) any(), (VALUEOUT) any());
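The same stubbing idea can be tried without Mockito or Hadoop on the classpath. The sketch below uses a JDK dynamic proxy in place of Mockito and a hypothetical RecordContext interface in place of Mapper.Context (which is really a class); it is an illustration of the technique, not MRUnit's code.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified context interface. */
interface RecordContext {
    boolean nextKeyValue();
    Object getCurrentKey();
    Object getCurrentValue();
    void write(Object key, Object value);
    String getJobName(); // deliberately left unstubbed below
}

class MockContextSketch<K, V> {
    private final List<Map.Entry<K, V>> inputs;
    private Map.Entry<K, V> current;
    final List<Map.Entry<Object, Object>> outputs = new ArrayList<>();

    MockContextSketch(List<Map.Entry<K, V>> inputs) {
        this.inputs = new ArrayList<>(inputs); // mutable copy, consumed as records are read
    }

    RecordContext getMockContext() {
        InvocationHandler handler = (proxy, method, args) -> {
            switch (method.getName()) {
                case "nextKeyValue":
                    // Pop the next input pair, or signal the end of input.
                    current = inputs.isEmpty() ? null : inputs.remove(0);
                    return current != null;
                case "getCurrentKey":
                    return current.getKey();
                case "getCurrentValue":
                    return current.getValue();
                case "write":
                    // Record the output instead of writing it anywhere.
                    outputs.add(Map.entry(args[0], args[1]));
                    return null;
                default:
                    return null; // unstubbed reference-returning methods answer null
            }
        };
        return (RecordContext) Proxy.newProxyInstance(
                RecordContext.class.getClassLoader(),
                new Class<?>[] {RecordContext.class},
                handler);
    }
}
```

Any code that only talks to the context through these four methods cannot tell the proxy from a real context, which is why stubbing so few methods is enough for simple mappers. Calling an unstubbed method such as getJobName returns null here, which is the gap the next section addresses.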

1.1 Extending MRUnit in the Map phase

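Because MRUnit mocks only the most basic methods of the map context, using any other method requires extending the mocked context. As the run method shows, the mocked context is a local variable of that method, so it cannot be extended by simple injection; instead, subclass MapDriver and override run. For example, to read the map/reduce job name from the context in setup, extend it like this: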

            final Mapper<K1, V1, K2, V2>.Context context = wrapper.getMockContext();
            when(context.getJobName()).thenAnswer(new Answer<String>() {
                @Override
                public String answer(final InvocationOnMock invocation) {
                    return jobName;
                }
            });
            myMapper.run(context);
            return wrapper.getOutputs();
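The shape of this extension, a subclass that overrides run in order to add one more answer to a context it cannot otherwise reach, can be sketched in plain Java. All names below (JobContextSketch, BaseDriverSketch, JobNameDriverSketch) are hypothetical, and the "mapper" is reduced to a single method that writes the job name it reads from the context.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified context interface. */
interface JobContextSketch {
    String getJobName();
    void write(String key, String value);
}

/** Base driver: the context is a local variable of run(), so callers cannot reach it. */
class BaseDriverSketch {
    protected final List<String> outputs = new ArrayList<>();

    /** Builds the basic context, which records writes but knows no job name. */
    protected JobContextSketch newBasicContext() {
        return new JobContextSketch() {
            @Override public String getJobName() { return null; }
            @Override public void write(String key, String value) { outputs.add(key + "=" + value); }
        };
    }

    /** Stand-in for the mapper under test: emits the job name read from the context. */
    protected void runMapper(JobContextSketch context) {
        context.write("job", String.valueOf(context.getJobName()));
    }

    public List<String> run() {
        JobContextSketch context = newBasicContext();
        runMapper(context);
        return outputs;
    }
}

/** Extended driver: overrides run() to answer getJobName, delegating everything else. */
class JobNameDriverSketch extends BaseDriverSketch {
    private final String jobName;

    JobNameDriverSketch(String jobName) {
        this.jobName = jobName;
    }

    @Override
    public List<String> run() {
        final JobContextSketch basic = newBasicContext();
        JobContextSketch context = new JobContextSketch() {
            @Override public String getJobName() { return jobName; }
            @Override public void write(String key, String value) { basic.write(key, value); }
        };
        runMapper(context);
        return outputs;
    }
}
```

With the base driver the mapper sees a null job name; with the extended driver it sees the name passed to the constructor.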

2. Reduce
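MRUnit handles reduce in much the same way as map, so it is not covered in detail here. This section focuses on supporting MultipleOutputs in the reduce phase, because many map/reduce jobs use MultipleOutputs to write different results for different business cases during reduce, and at the time of writing MRUnit does not support MultipleOutputs. Someone has already published an implementation for map/reduce v1, which can be downloaded from this address. The current project uses map/reduce v2, however, so the support has to be implemented by hand. Once the basic principle of MRUnit is understood, all that is needed is to mock the write method of a MultipleOutputs object so that it appends results to a List, and then compare that List with the expected values after reduce has run.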

            final MockReduceContextWrapper<K1, V1, K2, V2> wrapper = new MockReduceContextWrapper<K1, V1, K2, V2>(
                    inputs, getCounters(), getConfiguration());
            final Reducer<K1, V1, K2, V2>.Context context = wrapper.getMockContext();
            //mock multiple output object
            MockMultipleOutputs mockMultipleOutputs = new MockMultipleOutputs(context);
            //set multiple output
            setMultipleOutputs(myReducer, mockMultipleOutputs.getMultipleOutputs());
            myReducer.run(context);
            //check output path
            Assert.assertEquals(outputPath,mockMultipleOutputs.getOutputPath());
            //get output list from mock multiple output
            return mockMultipleOutputs.getOutputs();

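The structure of a Reducer is essentially fixed, because the map/reduce framework relies on it to execute the job. To leave that structure unchanged, the mocked MultipleOutputs object is injected into the reducer through reflection. setMultipleOutputs, used above, is implemented as follows: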

     private void setMultipleOutputs(Reducer<K1, V1, K2, V2> myReducer,
                                    MultipleOutputs multipleOutputs) throws Exception {
        Field field = getFieldByType(myReducer, MultipleOutputs.class);
        if (field != null) {
            field.setAccessible(true);
            field.set(myReducer, multipleOutputs);
        }
    }
    private Field getFieldByType(Reducer<K1, V1, K2, V2> myReducer, Class<?> clazz) {
        Field[] fields = myReducer.getClass().getDeclaredFields();
        Field candidate = null;
        for (Field field : fields) {
            if (field.getType().isAssignableFrom(clazz)) {
                candidate = field;
                break;
            }
        }
        return candidate;
    }
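The reflection trick is independent of Hadoop and can be tried in isolation. The sketch below is a minimal plain-JDK version, assuming a reducer that keeps its outputs object in a private field; OutputsSketch, RecordingOutputsSketch, ReducerSketch and FieldInjectorSketch are hypothetical stand-ins, not Hadoop or MRUnit classes.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for MultipleOutputs. */
class OutputsSketch {
    void write(String name, String value) {
        // A real implementation would write to a file here.
    }
}

/** Recording replacement that a test injects instead of the real object. */
class RecordingOutputsSketch extends OutputsSketch {
    final List<String> records = new ArrayList<>();

    @Override
    void write(String name, String value) {
        records.add(name + ":" + value);
    }
}

/** Hypothetical reducer that keeps its outputs object in a private field. */
class ReducerSketch {
    private OutputsSketch outputs = new OutputsSketch();

    void reduce(String key) {
        outputs.write("url", key);
    }
}

class FieldInjectorSketch {
    /** Finds the first declared field whose type can hold an instance of the given class. */
    static Field findFieldByType(Object target, Class<?> type) {
        for (Field field : target.getClass().getDeclaredFields()) {
            // Same check as in the article; note that a field declared as Object also matches.
            if (field.getType().isAssignableFrom(type)) {
                return field;
            }
        }
        return null;
    }

    /** Sets the matching field to the replacement; returns false when no field matches. */
    static boolean inject(Object target, Class<?> type, Object replacement) {
        Field field = findFieldByType(target, type);
        if (field == null) {
            return false;
        }
        try {
            field.setAccessible(true); // the field is private in the reducer
            field.set(target, replacement);
            return true;
        } catch (IllegalAccessException e) {
            throw new IllegalStateException("could not inject " + type.getName(), e);
        }
    }
}
```

Because getDeclaredFields only inspects the concrete class, a field declared in a superclass of the reducer would not be found; a more thorough version would walk up the class hierarchy.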

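The mocked MultipleOutputs mainly stubs the write method, as shown below: it appends the written records to an internal List and captures the output path so the test can verify that the reducer writes to the correct directory: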

        try {
            doAnswer(new Answer<Object>() {
                @Override
                public Object answer(final InvocationOnMock invocation) {
                    final Object[] args = invocation.getArguments();
                    outputPath = (String) args[3];
                    outputs.add(new Pair(copy(args[1], conf), copy(args[2], conf)));
                    return null;
                }
            }).when(output).write(anyString(), any(), any(), anyString());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
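The copy(...) calls in the stub matter because a reducer may reuse the same mutable key or value object across writes, so recording a bare reference can leave every recorded entry showing the last value. The sketch below illustrates the point in plain Java, assuming a recorder with the same four-argument write shape; NamedOutputRecorderSketch is hypothetical, and a StringBuilder stands in for a reused Writable.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical recorder mirroring write(namedOutput, key, value, baseOutputPath). */
class NamedOutputRecorderSketch {
    final List<String[]> outputs = new ArrayList<>();
    String outputPath;

    void write(String namedOutput, CharSequence key, CharSequence value, String baseOutputPath) {
        // Remember the last path so the test can check the target directory.
        outputPath = baseOutputPath;
        // toString() takes a snapshot, so later changes to a reused buffer
        // cannot alter what has already been recorded.
        outputs.add(new String[] {key.toString(), value.toString()});
    }
}
```

If one StringBuilder is filled with "url-1", written, then reset to "url-2" and written again, the recorder holds two distinct keys; storing the StringBuilder reference itself would have left both entries reading "url-2".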

It is used as follows:

        reduceDriver.withInput(new BytesWritable(SerializeUtil.serializeToBytes(key)), inputList);
        reduceDriver.withMultipleOutput("/test/url/20121216/tmp_/m/url-data", new BytesWritable(SerializeUtil.serializeToBytes(url)), new BytesWritable(SerializeUtil.serializeToBytes(urlAnalyzerResult)));
        reduceDriver.runTest();

Summary: once the basic principles of MRUnit are understood, it can be extended in simple ways to support more complex business scenarios.