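At work we got into a discussion about whether HBase's Bloom filter can help a scan. Before that discussion I had never really thought about the question, and I simply assumed that a Bloom filter could weed out the HFiles a scan does not need. A quick Google search showed that this is not the case!
First, I read the following two articles:
Understanding and using Bloom filters in HBase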
http://zjushch.iteye.com/blog/1530143
An explanation of the HBase Bloom filter by one of HBase's main committers
http://blog.youkuaiyun.com/macyang/article/details/6182629
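With a rough understanding of Bloom filters, I then found two optimizations that HBase applies to queries on tables with a Bloom filter:
1. A get uses the Bloom filter to weed out StoreFiles that will not be needed
When a scan is initialized (a get is wrapped as a scan), shouldSeek is checked for every StoreFile. If it returns false, the StoreFile does not contain what we are looking for and is skipped: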
    if (memOnly == false
        && ((StoreFileScanner) kvs).shouldSeek(scan, columns)) {
      scanners.add(kvs);
    }

    // Excerpt of the Bloom filter check behind shouldSeek.
    // Only a get (a single-row scan) may consult the Bloom filter here.
    if (!scan.isGetScan()) {
      return true;
    }
    byte[] row = scan.getStartRow();
    switch (this.bloomFilterType) {
      case ROW:
        return passesBloomFilter(row, 0, row.length, null, 0, 0);
      case ROWCOL:
        if (columns != null && columns.size() == 1) {
          byte[] column = columns.first();
          return passesBloomFilter(row, 0, row.length, column, 0,
              column.length);
        }
        // For multi-column queries the Bloom filter is checked from the
        // seekExact operation.
        return true;
      default:
        return true;
    }
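To make the effect of this check concrete, here is a minimal toy model in plain Java. It is not HBase code: `ToyBloom`, `ToyStoreFile` and `filesToOpen` are hypothetical names made up for this sketch, and the two-probe hash is an arbitrary choice. It only illustrates the idea that a point get can drop files whose row-level Bloom filter says "definitely absent", while a range scan has to keep every file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Toy model (not HBase code): a per-file row Bloom filter prunes files for a get. */
class BloomGetSketch {

    /** Minimal Bloom filter over byte[] keys using two hash probes. */
    static final class ToyBloom {
        private static final int BITS = 1024;
        private final BitSet bits = new BitSet(BITS);

        void add(byte[] key) {
            bits.set(probe(key, 17));
            bits.set(probe(key, 31));
        }

        /** false means "definitely absent"; true means "possibly present". */
        boolean mightContain(byte[] key) {
            return bits.get(probe(key, 17)) && bits.get(probe(key, 31));
        }

        private static int probe(byte[] key, int seed) {
            int h = seed;
            for (byte b : key) {
                h = h * 131 + (b & 0xff);
            }
            return Math.floorMod(h, BITS);
        }
    }

    /** Hypothetical stand-in for a store file: a name plus a row-level Bloom filter. */
    static final class ToyStoreFile {
        final String name;
        final ToyBloom rowBloom = new ToyBloom();

        ToyStoreFile(String name, String... rows) {
            this.name = name;
            for (String r : rows) {
                rowBloom.add(bytes(r));
            }
        }

        /** Same idea as shouldSeek above: only a point get consults the filter. */
        boolean shouldSeek(boolean isGet, String startRow) {
            if (!isGet) {
                return true; // range scan: later rows may be in this file
            }
            return rowBloom.mightContain(bytes(startRow));
        }
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Names of the files that still have to be opened for this query. */
    static List<String> filesToOpen(List<ToyStoreFile> files, boolean isGet, String startRow) {
        List<String> out = new ArrayList<>();
        for (ToyStoreFile f : files) {
            if (f.shouldSeek(isGet, startRow)) {
                out.add(f.name);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<ToyStoreFile> files = Arrays.asList(
            new ToyStoreFile("hfile-1", "row-a", "row-b"),
            new ToyStoreFile("hfile-2", "row-x", "row-y"));
        System.out.println("get  row-a -> " + filesToOpen(files, true, "row-a"));  // [hfile-1]
        System.out.println("scan row-a -> " + filesToOpen(files, false, "row-a")); // [hfile-1, hfile-2]
    }
}
```

A real Bloom filter can return false positives, so a get may still open a file that turns out not to hold the row; it never skips a file that does hold it.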
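2. A scan that specifies a qualifier, on a table configured with ROWCOL, weeds out StoreFiles that will not be needed.
A scan or get that specifies a qualifier is checked through seekExactly: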
    // Seek all scanners to the start of the Row (or if the exact matching row
    // key does not exist, then to the start of the next matching Row).
    if (matcher.isExactColumnQuery()) {
      for (KeyValueScanner scanner : scanners)
        scanner.seekExactly(matcher.getStartKey(), false);
    } else {
      for (KeyValueScanner scanner : scanners)
        scanner.seek(matcher.getStartKey());
    }

    public boolean seekExactly(KeyValue kv, boolean forward)
        throws IOException {
      if (reader.getBloomFilterType() != StoreFile.BloomType.ROWCOL ||
          kv.getRowLength() == 0 || kv.getQualifierLength() == 0) {
        return forward ? reseek(kv) : seek(kv);
      }
      boolean isInBloom = reader.passesBloomFilter(kv.getBuffer(),
          kv.getRowOffset(), kv.getRowLength(), kv.getBuffer(),
          kv.getQualifierOffset(), kv.getQualifierLength());
      if (isInBloom) {
        // This row/column might be in this store file. Do a normal seek.
        return forward ? reseek(kv) : seek(kv);
      }
      // Create a fake key/value, so that this scanner only bubbles up to the top
      // of the KeyValueHeap in StoreScanner after we scanned this row/column in
      // all other store files. The query matcher will then just skip this fake
      // key/value and the store scanner will progress to the next column.
      cur = kv.createLastOnRowCol();
      return true;
    }
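Why does only ROWCOL let the scan drop a StoreFile here? The reason is simple. A scan covers a range. When a ROW Bloom filter misses, it only proves that this particular rowkey is not in the StoreFile; the next rowkey in the range may still be there, so the file cannot be skipped. A ROWCOL Bloom filter is different: a miss proves that the requested row/qualifier pair is not in the StoreFile, so the scanner does not need to do a real seek into that StoreFile for it.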
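The argument above can be checked with a small toy model. This is a sketch under loud assumptions, not HBase code: `ToyStoreFile`, `rowMissWouldLoseData` and `canSkipSeek` are hypothetical names, cells are stored as plain `"row/qualifier"` strings, and exact `HashSet`s stand in for the Bloom filters (so there are no false positives).

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Toy model (not HBase code): ROW miss vs. ROWCOL miss for a scan. */
class RowVsRowColSketch {

    /** Hypothetical store file holding sorted "row/qualifier" cells. */
    static final class ToyStoreFile {
        final TreeSet<String> cells = new TreeSet<>();   // sorted like an HFile
        final Set<String> rowBloom = new HashSet<>();    // stand-in for a ROW Bloom filter
        final Set<String> rowColBloom = new HashSet<>(); // stand-in for a ROWCOL Bloom filter

        void put(String row, String qualifier) {
            cells.add(row + "/" + qualifier);
            rowBloom.add(row);
            rowColBloom.add(row + "/" + qualifier);
        }

        /** True if the file holds any cell at or after startRow. */
        boolean hasCellAtOrAfter(String startRow) {
            return cells.ceiling(startRow) != null;
        }
    }

    /**
     * True when the ROW filter misses on the scan's start row although the
     * file still holds rows inside the scan range. Dropping the file on such
     * a miss would silently lose those rows.
     */
    static boolean rowMissWouldLoseData(ToyStoreFile f, String startRow) {
        return !f.rowBloom.contains(startRow) && f.hasCellAtOrAfter(startRow);
    }

    /**
     * True when the ROWCOL filter proves that the exact row/qualifier pair is
     * absent, so the real seek for that pair can be skipped in this file.
     */
    static boolean canSkipSeek(ToyStoreFile f, String row, String qualifier) {
        return !f.rowColBloom.contains(row + "/" + qualifier);
    }

    public static void main(String[] args) {
        ToyStoreFile f = new ToyStoreFile();
        f.put("row-c", "q1");
        f.put("row-d", "q1");

        // A scan starting at row-b: the ROW filter misses, yet row-c and row-d are in range.
        System.out.println("ROW miss would lose data: " + rowMissWouldLoseData(f, "row-b"));
        // An exact-column lookup of (row-b, q1): safe to skip the seek in this file.
        System.out.println("skip seek for row-b/q1:   " + canSkipSeek(f, "row-b", "q1"));
        // (row-c, q1) is present, so the seek has to happen.
        System.out.println("skip seek for row-c/q1:   " + canSkipSeek(f, "row-c", "q1"));
    }
}
```

In the sketch, the first check is what makes a ROW filter useless for a range scan, and the second check is the situation in which seekExactly returns a fake key instead of seeking.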
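Conclusions:
1. The Bloom filter can take effect for any kind of get (by rowkey, or by row+col); the key point is that the kind of get has to match the Bloom filter type.
2. A scan based only on rows cannot be optimized.
3. A scan that specifies a qualifier on a ROWCOL table can skip StoreFiles whose Bloom filter rules out the requested row/qualifier, which is a decent optimization. Specifying the qualifier also reduces the amount of data transferred, so scans should specify qualifiers whenever possible.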