Comparing Two Ways to Write MapReduce Results into HBase

License: CC 4.0 BY-SA

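This post compares two ways of writing MapReduce results into HBase: calling the HBase write API directly, and going through TableReducer. Neither is particularly efficient, but TableReducer performs better than the direct write API. It also covers the TableMapper interface for reading data from HBase and two strategies for filtering HBase data, neither of which was performance-tested.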

My expertise is limited, so every performance assessment here is a subjective impression; please bear with me.

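Method 1: Using the write API provided by HBase

In setup, configure the HBase connection, check whether the table exists, and create it if it does not. In the reduce function, call table.put(put1) to write the result into HBase.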

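import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class hbaseStatisticsReducer extends Reducer<Text, Text, Text, Text> {

	public static String tablename = "statistics";
	public static String[] cfs = { "data" };

	// One table handle per reduce task, opened in setup() and closed in cleanup().
	private HTable table;

	@Override
	protected void setup(Context context) throws IOException {
		Configuration conf = context.getConfiguration();
		conf.set("hbase.rootdir", "hdfs://localhost:9000/hbase");
		conf.set("hbase.zookeeper.quorum", "localhost");
		conf.set("hbase.zookeeper.property.clientPort", "2181");

		// Create the table with its column families if it does not exist yet.
		HBaseAdmin admin = new HBaseAdmin(conf);
		try {
			if (!admin.tableExists(tablename)) {
				HTableDescriptor tableDesc = new HTableDescriptor(tablename);
				for (String cf : cfs) {
					tableDesc.addFamily(new HColumnDescriptor(cf));
				}
				admin.createTable(tableDesc);
			}
		} finally {
			admin.close();
		}
		table = new HTable(conf, tablename);
	}

	@Override
	protected void reduce(Text item, Iterable<Text> input, Context context)
			throws IOException, InterruptedException {
		// Each value has the form "<sum><separator><times>".
		int times = 0; // accumulated, but only the sum is stored by this reducer
		long sum = 0L;
		for (Text tmp : input) {
			String[] tmpstr = tmp.toString().split(StatisticsMapper.separator);
			sum += Long.parseLong(tmpstr[0]);
			times += Integer.parseInt(tmpstr[1]);
		}

		// Row key = reduce key; the sum is stored as a string in data:sum.
		Put put1 = new Put(Bytes.toBytes(item.toString()));
		put1.add(Bytes.toBytes(cfs[0]), Bytes.toBytes("sum"),
				Bytes.toBytes("" + sum));
		table.put(put1);
	}

	@Override
	protected void cleanup(Context context) throws IOException {
		if (table != null) {
			table.close();
		}
	}
}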

This approach is inefficient.

Method 2: Using the reduce interface provided by HBase

Set the HBase properties in the driver program:

conf.set("hbase.rootdir", "hdfs://172.17.238.152:9000/hbase");
conf.set("hbase.zookeeper.quorum", "172.17.238.151");
conf.set("hbase.zookeeper.property.clientPort", "2181");

and then call

		TableMapReduceUtil.initTableReducerJob(AnalysisMain.TableName, AnalysisIntoHBaseReducer.class, job);

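to set the table name, the Reducer class, and the job object.

The Reducer extends TableReducer, whose default output is an HBase Put object that gets inserted into the corresponding table. This program requires the table and its column family to be created in advance, which leaves room for improvement. Efficiency is not great either!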

import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class AnalysisIntoHBaseReducer extends
		TableReducer<Text, Text, ImmutableBytesWritable> {

	@Override
	public void reduce(Text item, Iterable<Text> input, Context context)
			throws InterruptedException, IOException {

		// Each value has the form "<sum><separator><times>".
		int times = 0;
		long sum = 0L;
		for (Text tmp : input) {
			String[] tmpstr = tmp.toString().split(AnalysisMain.separator);
			sum += Long.parseLong(tmpstr[0]);
			times += Integer.parseInt(tmpstr[1]);
		}

		// Row key = reduce key, encoded the same way for the Put and the output key.
		byte[] row = Bytes.toBytes(item.toString());

		// First Put: data:sum
		Put put1 = new Put(row);
		put1.add(Bytes.toBytes("data"), Bytes.toBytes("sum"),
				Bytes.toBytes("" + sum));
		context.write(new ImmutableBytesWritable(row), put1);

		// Second Put: data:times
		Put put2 = new Put(row);
		put2.add(Bytes.toBytes("data"), Bytes.toBytes("times"),
				Bytes.toBytes("" + times));
		context.write(new ImmutableBytesWritable(row), put2);
	}
}
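Both reducers above run the same aggregation over values shaped like "sum, separator, times". The following is a minimal stand-alone sketch of that step, so it can be tried without a cluster. The separator "," is an assumption, since the value of AnalysisMain.separator is not shown in this post, and the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Stand-alone sketch of the reduce-side aggregation; no Hadoop or HBase needed.
class ReduceAggregationSketch {

    // Hypothetical separator: the real AnalysisMain.separator is not shown in the post.
    static final String SEPARATOR = ",";

    // Returns {sum, times} accumulated over values shaped like "<sum><SEPARATOR><times>".
    static long[] aggregate(Iterable<String> values) {
        long sum = 0L;
        long times = 0L;
        for (String value : values) {
            // String.split expects a regex, so the separator is quoted to keep it literal.
            String[] parts = value.split(Pattern.quote(SEPARATOR));
            sum += Long.parseLong(parts[0]);
            times += Long.parseLong(parts[1]);
        }
        return new long[] { sum, times };
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("10,1", "25,2", "7,1");
        long[] result = aggregate(values);
        System.out.println("sum=" + result[0] + " times=" + result[1]);
    }
}
```

In the real job these two numbers become the data:sum and data:times cells of the row named after the reduce key.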

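Some counters from this job:

Map input records	10,000,000
Reduce input records	0	7,521,263	7,521,263
Reduce input groups	0	999,962	999,962
Reduce output records	1,999,924

The reduce phase took about 11 minutes (12 reduce tasks in total, 6 running at a time).

Even so, this is better than Method 1!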
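As a rough sanity check on those counters, the arithmetic can be spelled out. This is only a sketch: the constant names are made up here, and because the 11 minutes is an approximate wall-clock figure for the whole reduce phase, the throughput is a ballpark number rather than a measurement.

```java
// Stand-alone arithmetic on the counters quoted above; no Hadoop or HBase needed.
class CounterArithmetic {

    static final long REDUCE_INPUT_GROUPS = 999_962L;
    static final long REDUCE_OUTPUT_RECORDS = 1_999_924L;
    // "About 11 minutes" in the post, so anything derived from it is approximate.
    static final long REDUCE_MINUTES = 11L;

    // Number of Puts written per distinct reduce key.
    static long putsPerGroup() {
        return REDUCE_OUTPUT_RECORDS / REDUCE_INPUT_GROUPS;
    }

    // Approximate Puts written per second, summed over all reduce tasks.
    static long putsPerSecond() {
        return REDUCE_OUTPUT_RECORDS / (REDUCE_MINUTES * 60L);
    }

    public static void main(String[] args) {
        System.out.println("Puts per group: " + putsPerGroup());
        System.out.println("Approx. Puts per second: " + putsPerSecond());
    }
}
```

The output count is exactly twice the number of input groups, which matches the reducer emitting one Put for sum and one for times per key. That works out to roughly 3,000 Puts per second across the job.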

Correspondingly, for map-side access to HBase (reading data from it), there is a TableMapper class to extend, and it is fairly fast.

TableMapReduceUtil.initTableMapperJob("analysis", scan, hbaseMapper.class, Text.class, Text.class, job);

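import java.io.IOException;

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class hbaseMapper extends TableMapper<Text, Text> {

	@Override
	protected void map(ImmutableBytesWritable key, Result value, Context context)
			throws InterruptedException, IOException {

		// Read the two cells of this row and decode them as strings.
		String sum = Bytes.toString(value.getValue(testMain.family, testMain.sum));
		String times = Bytes.toString(value.getValue(testMain.family,
				testMain.times));
		// The input key carries the row key.
		String row = Bytes.toString(key.get());

		// Build the output key from fields 0, 1 and 4 of the row key.
		String[] tokens = row.split(testMain.separator);
		String newRow = tokens[0] + testMain.separator + tokens[1]
				+ testMain.separator + tokens[4];

		context.write(new Text(newRow), new Text(sum + testMain.separator
				+ times));
	}
}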

The row key can be read from the key argument.
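The row-key handling in that mapper can be tried on its own. Below is a small sketch under stated assumptions: the separator "|" is hypothetical (the value of testMain.separator is not shown here), and returning null for short keys is a choice made for this sketch, not something the original mapper does.

```java
import java.util.regex.Pattern;

// Stand-alone sketch of the row-key projection; no Hadoop or HBase needed.
class RowKeySketch {

    // Hypothetical separator: the real testMain.separator is not shown in the post.
    static final String SEPARATOR = "|";

    // Keeps fields 0, 1 and 4 of the row key, like the mapper above.
    // Returns null when the key has fewer than five fields instead of throwing.
    static String project(String rowKey) {
        // Pattern.quote makes split treat the separator as literal text, not a regex.
        String[] tokens = rowKey.split(Pattern.quote(SEPARATOR));
        if (tokens.length < 5) {
            return null;
        }
        return tokens[0] + SEPARATOR + tokens[1] + SEPARATOR + tokens[4];
    }

    public static void main(String[] args) {
        System.out.println(project("a|b|c|d|e"));
        System.out.println(project("a|b"));
    }
}
```

Quoting matters whenever the separator is a regex metacharacter. With "|" passed unquoted, String.split would split between every character instead of at the separator.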

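There are two ways to filter the data read from HBase.

1. Configure the Scan so that only the needed data is fetched.

2. Read everything and filter it in map.

I tested neither approach (the program above reads everything), so their relative performance is unknown. To quote a claim I cannot vouch for: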

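The former is more efficient than the latter when the scan covers a small number of records, but once the number of scans grows, it tends to time out without returning and then exit.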
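For the first strategy, restricting the Scan, a minimal sketch might look like the following. It needs the HBase client library on the classpath. The family and qualifier names are assumptions borrowed from the writer side of this post, and the caching value of 500 is only illustrative. None of this was benchmarked here.

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch only: builds a Scan restricted to selected columns of one family.
class ScanSketch {

    static Scan buildScan(String family, String... qualifiers) {
        Scan scan = new Scan();
        for (String qualifier : qualifiers) {
            // Only the listed columns are returned to the mapper.
            scan.addColumn(Bytes.toBytes(family), Bytes.toBytes(qualifier));
        }
        scan.setCaching(500);       // rows fetched per RPC; illustrative value
        scan.setCacheBlocks(false); // keep a one-off MR scan from filling the block cache
        return scan;
    }
}
```

A Scan built with, for example, buildScan("data", "sum", "times") would then be passed as the second argument of TableMapReduceUtil.initTableMapperJob, as in the call shown earlier.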

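One last thought: filtering in map is still inefficient, and even where restricting the Scan is usable it is not efficient either, because the default TableMapper assigns the scan of one region to one mapper. Each of my regions is over 2 GB, and the data I query spans only seven or eight regions. So I wondered whether mappers could be split on something other than regions. If that cannot be changed, the only option left is to have MR operate directly on the HDFS files underneath HBase. That still needs investigating.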