HBase Code Study: MapReduce

License: CC 4.0 BY-SA

Original link: http://www.cnblogs.com/zcraft/p/8687822.html

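This article describes how HBase integrates with the MapReduce framework for large-scale data processing. It covers the input and output of MapReduce jobs, how to set Scan caching, and some ready-made tools such as table copy and data import/export.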

1. Overview

HBase provides a MapReduce framework for reading from and writing to HBase tables.

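2. How it works

2.1. MapReduce input

2.1.1. Use or extend TableInputFormat; the input format generates a series of TableSplits.

2.1.2. The input format maps regions to mappers through these TableSplits: either one map per region, or the number of maps specified by mapreduce.job.maps.

2.1.3. Each TableSplit corresponds to one Mapper, i.e. one task.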
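The region-to-split mapping in 2.1 can be modelled without an HBase cluster. The following is a simplified sketch and not HBase's actual implementation: the class SplitSketch, the Range type, and the use of String row keys (with the empty string meaning "unbounded") are assumptions made for illustration. It produces one split per region that overlaps the scan range and clips each split to that range, so the number of splits is also the number of map tasks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitSketch {
    /** Half-open row-key range [start, stop); "" means unbounded on that side. */
    static final class Range {
        final String start;
        final String stop;

        Range(String start, String stop) {
            this.start = start;
            this.stop = stop;
        }

        @Override
        public String toString() {
            return "[" + start + "," + stop + ")";
        }
    }

    /** Smaller of two stop keys, treating "" as "no upper bound". */
    static String minStop(String a, String b) {
        if (a.isEmpty()) return b;
        if (b.isEmpty()) return a;
        return a.compareTo(b) <= 0 ? a : b;
    }

    /** One split per region that overlaps the scan range, clipped to the scan range. */
    static List<Range> splits(List<Range> regions, Range scan) {
        List<Range> out = new ArrayList<>();
        for (Range r : regions) {
            boolean startsBeforeScanEnds = scan.stop.isEmpty() || r.start.compareTo(scan.stop) < 0;
            boolean endsAfterScanStarts = r.stop.isEmpty() || scan.start.compareTo(r.stop) < 0;
            if (startsBeforeScanEnds && endsAfterScanStarts) {
                // "" sorts before every other key, so a plain max works for start keys.
                String start = r.start.compareTo(scan.start) > 0 ? r.start : scan.start;
                out.add(new Range(start, minStop(r.stop, scan.stop)));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Four hypothetical regions covering the whole key space.
        List<Range> regions = Arrays.asList(
                new Range("", "g"), new Range("g", "n"),
                new Range("n", "t"), new Range("t", ""));
        System.out.println(splits(regions, new Range("", "")).size()); // full scan: 4 splits
        System.out.println(splits(regions, new Range("h", "p")));      // [[h,n), [n,p)]
    }
}
```

A full-table scan over the four regions yields four splits and therefore four map tasks, while a scan restricted to [h,p) touches only two regions and yields two.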

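2.2. MapReduce output

2.2.1. Use or extend TableOutputFormat or MultiTableOutputFormat.

2.2.2. Because HBase is itself a sorted store, in most cases a map-only job can write data directly to HBase.

2.2.3. If a reducer is required, configure a reasonable number of reducers so that the load on the HBase cluster stays manageable.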

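2.3. MapReduce processing

2.3.1. Each map function takes an ImmutableBytesWritable row and a Result value as input.

2.3.2. The framework scans the table using the Scan configured in the TableSplit and calls the map function once for every Result returned.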

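3. Setting Scan caching

Precedence, from highest to lowest:

The caching value set on the Scan > hbase.client.scanner.caching configured in hbase-site.xml, or the value set through TableMapReduceUtil.setScannerCaching() > the default value, 100 (the default differs between HBase releases).

Rule of thumb: if the value is too large, the client may wait so long for each batch that the request times out; if it is too small, a single scan needs many RPCs to complete.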
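To make the precedence and the trade-off concrete, here is a small self-contained sketch. It does not call HBase: the class ScanCachingSketch, its method names, and the convention that a non-positive value means "not set" are assumptions for illustration. effectiveCaching applies the precedence described in this section, and rpcCount gives a rough estimate of the scanner round trips needed to fetch a given number of rows, ignoring the calls that open and close the scanner.

```java
public class ScanCachingSketch {
    /** Default quoted in this article; the real default depends on the HBase release. */
    static final int DEFAULT_CACHING = 100;

    /**
     * Resolves the caching value: the Scan-level setting wins, then the
     * configured value, then the default. Non-positive means "not set".
     */
    static int effectiveCaching(int scanCaching, int configuredCaching) {
        if (scanCaching > 0) return scanCaching;
        if (configuredCaching > 0) return configuredCaching;
        return DEFAULT_CACHING;
    }

    /** Approximate number of round trips needed to return totalRows rows. */
    static long rpcCount(long totalRows, int caching) {
        return (totalRows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;
        // Nothing set anywhere: falls back to the default of 100.
        System.out.println(rpcCount(rows, effectiveCaching(-1, -1)));    // 10000
        // Scan-level value 500 overrides a configured value of 1000.
        System.out.println(rpcCount(rows, effectiveCaching(500, 1000))); // 2000
    }
}
```

Raising the value from 100 to 500 cuts the estimated round trips for a million rows from 10000 to 2000, at the cost of larger batches that take longer to assemble and hold more memory on the client.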

4. Ready-made tools:

An example program must be given as the first argument.
Valid program names are:
  copytable: Export a table from local cluster to peer cluster // Copies a table from the current cluster to a target cluster.
  completebulkload: Complete a bulk data load.  // Completes a bulk load.
  export: Write table data to HDFS. // Exports a table from the cluster to HDFS.
  import: Import data written by Export.  // Imports data previously exported to HDFS into a table in the cluster.
  importtsv: Import data in TSV format.  // Imports TSV-format data into a table in the cluster.
  rowcounter: Count rows in HBase table  // Counts the rows in an HBase table.

5. Example: reading from HBase

Configuration config = HBaseConfiguration.create();
Job job = Job.getInstance(config, "ExampleRead"); // the Job constructors are deprecated
job.setJarByClass(MyReadJob.class); // class that contains mapper
Scan scan = new Scan();
scan.setCaching(500); // rows fetched per RPC; too small a value is bad for MapReduce jobs
scan.setCacheBlocks(false); // don't set to true for MR jobs
// set other scan attrs
...
TableMapReduceUtil.initTableMapperJob(
    tableName, // input HBase table name
    scan, // Scan instance to control CF and attribute selection
    MyMapper.class, // mapper
    null, // mapper output key
    null, // mapper output value
    job);
job.setOutputFormatClass(NullOutputFormat.class); // because we aren't emitting anything from mapper
boolean b = job.waitForCompletion(true);
if (!b) {
    throw new IOException("error with job!");
}
// The mapper, declared as a nested class of MyReadJob
public static class MyMapper extends TableMapper<Text, Text> {
    @Override
    public void map(ImmutableBytesWritable row, Result value, Context context) throws InterruptedException, IOException {
        // process data for the row from the Result instance.
    }
}