HBase: The Definitive Guide - Reading Notes on the Advanced Usage Chapter

Advanced HBase Usage and Optimization


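This post looks at advanced HBase usage: table design, partial key scans, pagination, time series data, and Bloom filters. The focus is on how schema and key design affect performance, and on which technique fits which scenario.

Advanced HBase usage: reading notes.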

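2. Two HBase table designs: tall-narrow and flat-wide.

A tall-narrow table has few columns and many rows; a flat-wide table has few rows and many columns. Because the row key determines how finely you can query, it is better to move values out of cells and into the row key, especially when those values need to be queried. In addition, HBase only splits regions at row boundaries. The tall-narrow design works well with this, whereas in a flat-wide design a single row that grows beyond the region size cannot be split, which conflicts with the region split policy.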
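To make the difference concrete, here is a minimal sketch that converts a flat-wide layout into a tall-narrow one. It uses plain JDK maps as stand-ins for HBase rows, and the `<userId>-<messageId>` key format and the class name are assumptions chosen for illustration only.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain maps stand in for HBase rows; no HBase API is used.
final class TallNarrowSketch {

    // Flat-wide: one row per user, one column per message.
    // Tall-narrow: one row per (user, message); the message id moves into the row key.
    static TreeMap<String, String> toTallNarrow(Map<String, Map<String, String>> flatWide) {
        TreeMap<String, String> tall = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> row : flatWide.entrySet()) {
            for (Map.Entry<String, String> col : row.getValue().entrySet()) {
                // Hypothetical key format "<userId>-<messageId>".
                tall.put(row.getKey() + "-" + col.getKey(), col.getValue());
            }
        }
        return tall;
    }
}
```

In the tall-narrow form every message is its own row, so a region can split between any two messages of the same user.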

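3. Partial key scans: set a start row and a stop row to scan a contiguous range of row keys. (Using the start row + 1, that is, the start key with its last byte incremented, as the stop row is a handy trick.)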
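The "start row + 1" trick can be sketched as follows. This is a simulation under stated assumptions: a `TreeMap` ordered by unsigned byte comparison stands in for an HBase table, and `stopRowForPrefix`, `scanPrefix`, and `newTable` are hypothetical helper names, not HBase API calls.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Illustrative only: a TreeMap sorted by unsigned bytes stands in for an HBase table.
final class PartialKeyScanSketch {

    // Stop row for a prefix scan: the prefix "plus one" (last byte incremented, with carry).
    // An empty array means "no upper bound" (the prefix consisted only of 0xFF bytes).
    static byte[] stopRowForPrefix(byte[] prefix) {
        for (int i = prefix.length - 1; i >= 0; i--) {
            if (prefix[i] != (byte) 0xFF) {
                byte[] stop = Arrays.copyOf(prefix, i + 1);
                stop[i]++;
                return stop;
            }
        }
        return new byte[0];
    }

    // Returns the keys in [start, stop), the way a scan with start and stop rows would.
    static List<String> scanPrefix(TreeMap<byte[], String> table, String prefix) {
        byte[] start = prefix.getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRowForPrefix(start);
        Iterable<byte[]> keys = stop.length == 0
                ? table.tailMap(start, true).keySet()
                : table.subMap(start, true, stop, false).keySet();
        List<String> result = new ArrayList<>();
        for (byte[] key : keys) {
            result.add(new String(key, StandardCharsets.UTF_8));
        }
        return result;
    }

    static TreeMap<byte[], String> newTable() {
        return new TreeMap<byte[], String>((a, b) -> Arrays.compareUnsigned(a, b));
    }
}
```

With keys `user1-a`, `user1-b`, `user10-a`, and `user2-a`, scanning the prefix `user1-` uses the stop row `user1.` and returns only the first two keys.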

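4. Pagination: pagination can be built on a partial key scan. First narrow the range with a start key and a stop key, then apply an offset and a limit on the client side. This differs from pagination in an ordinary database, where the server applies the offset and limit.

Alternatively, PageFilter or ColumnPaginationFilter can be used for pagination.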
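A minimal sketch of the client-side approach, assuming a sorted in-memory map in place of an HBase table. The `page` method and its parameters are hypothetical names used for illustration; the filter-based variants are not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;

// Illustrative only: a NavigableMap stands in for the result of a range scan.
final class ClientSidePagingSketch {

    // Scans [startKey, stopKey), skips `offset` rows on the client, then keeps up to `limit` keys.
    static List<String> page(NavigableMap<String, String> table,
                             String startKey, String stopKey, int offset, int limit) {
        List<String> rows = new ArrayList<>();
        int seen = 0;
        for (String key : table.subMap(startKey, true, stopKey, false).keySet()) {
            if (seen++ < offset) {
                continue; // skipped rows are still read, so the cost grows with the offset
            }
            if (rows.size() == limit) {
                break; // page is full, stop reading
            }
            rows.add(key);
        }
        return rows;
    }
}
```

Because skipped rows are still transferred to the client, deep pages get more expensive; narrowing the start key keeps the scanned range small.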

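5. Time series data: data that arrives in time order, where the row key usually represents the time. Because such keys increase monotonically, all incoming data is written to a single region instead of being spread across regions, which lowers the efficiency of the cluster (updates concentrate on one particular region).

The fix is to put a non-sequential prefix in front of the timestamp, for example: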


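// numRegionServers stands for the <number of regionservers> placeholder; floorMod keeps the prefix non-negative
byte prefix = (byte) Math.floorMod(Long.hashCode(timestamp), numRegionServers);
// wrapping the prefix in a one-byte array keeps it exactly one byte long
byte[] rowkey = Bytes.add(new byte[] { prefix }, Bytes.toBytes(timestamp));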

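The cost is that the row keys for a given time range are now spread across all prefixes, so reading a time range takes one scan per prefix.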
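The effect of salting on reads can be sketched with the JDK alone. This is an illustration under assumptions: `buckets` plays the role of the number of region servers and is assumed to be at most 127 so that it fits a signed byte, and `rowKey` and `scanRanges` are hypothetical helper names.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: builds salted keys and the scan ranges needed to read them back.
final class SaltedKeySketch {

    // One-byte salt derived from the timestamp, followed by the 8-byte big-endian timestamp.
    static byte[] rowKey(long timestamp, int buckets) {
        byte prefix = (byte) Math.floorMod(Long.hashCode(timestamp), buckets);
        return ByteBuffer.allocate(1 + Long.BYTES).put(prefix).putLong(timestamp).array();
    }

    // A time-range read needs one [start, stop) scan per salt bucket.
    static List<byte[][]> scanRanges(long fromInclusive, long toExclusive, int buckets) {
        List<byte[][]> ranges = new ArrayList<>();
        for (int b = 0; b < buckets; b++) {
            byte[] start = ByteBuffer.allocate(1 + Long.BYTES).put((byte) b).putLong(fromInclusive).array();
            byte[] stop = ByteBuffer.allocate(1 + Long.BYTES).put((byte) b).putLong(toExclusive).array();
            ranges.add(new byte[][] {start, stop});
        }
        return ranges;
    }
}
```

Writes spread over `buckets` key ranges, and each time-range read fans out into the same number of scans whose results the client has to merge.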

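With a composite key, consider placing the timestamp in a secondary position, after another field. The drawback is that queries by time range alone become hard. The benefit is that the records for a given entity are stored in time order (or reverse time order).

In general, for key design: the more random the keys, the better the write performance; the more sequential the keys, the better the read performance.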
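A possible composite key with the timestamp in second position, sketched with the JDK only. The layout (entity id, a 0x00 separator, then `Long.MAX_VALUE - timestamp`) is an assumption picked to get newest-first ordering for non-negative timestamps; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: composite key = <entityId> 0x00 <Long.MAX_VALUE - timestamp>.
final class CompositeKeySketch {

    static byte[] rowKey(String entityId, long timestamp) {
        byte[] id = entityId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(id.length + 1 + Long.BYTES)
                .put(id).put((byte) 0).putLong(Long.MAX_VALUE - timestamp).array();
    }

    // Recovers the original timestamp from the last 8 bytes of a key.
    static long timestampOf(byte[] rowKey) {
        return Long.MAX_VALUE
                - ByteBuffer.wrap(rowKey, rowKey.length - Long.BYTES, Long.BYTES).getLong();
    }

    // Sorts keys by unsigned byte order (the order a scan would return them in)
    // and returns the timestamps in that order.
    static List<Long> scanOrder(List<byte[]> keys) {
        List<byte[]> sorted = new ArrayList<>(keys);
        sorted.sort((a, b) -> Arrays.compareUnsigned(a, b));
        List<Long> timestamps = new ArrayList<>();
        for (byte[] key : sorted) {
            timestamps.add(timestampOf(key));
        }
        return timestamps;
    }
}
```

All rows of one entity stay adjacent, newest first, but a query such as "everything from the last hour" across entities no longer maps to a single key range.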

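Time-ordered relations: the columns within a column family are kept sorted, so a timestamp can be stored as the column name (qualifier). This can serve as a substitute for a secondary index: the secondary indexes of a relational database can be emulated with multiple columns. This is not a recommended design, but it works when only a few indexes are needed.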
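Storing timestamps as column names can be pictured with nested sorted maps. This is a sketch under assumptions: the inner `TreeMap` models the sorted columns of one row, and the class and method names are hypothetical rather than HBase API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative only: row key -> (timestamp used as column qualifier -> value).
final class TimeOrderedColumnsSketch {

    private final TreeMap<String, TreeMap<Long, String>> rows = new TreeMap<>();

    // Stores a value under a column whose qualifier is the timestamp.
    void put(String rowKey, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(timestamp, value);
    }

    // Values of one row whose qualifier lies in [from, to), in timestamp order.
    List<String> range(String rowKey, long from, long to) {
        TreeMap<Long, String> columns = rows.get(rowKey);
        if (columns == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(columns.subMap(from, true, to, false).values());
    }
}
```

The row key stays non-temporal, while the sorted columns inside the row provide the time ordering.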

6. Bloom filters:

There are three types:

Type     Description
NONE     Disables the filter (default)
ROW      Use the row key for the filter
ROWCOL   Use the row key and column key (family+qualifier) for the filter

The original text reads:

The final question is whether to use a row or a row+column Bloom filter. The answer depends on your usage pattern. If you are doing only row scans, having the more specific row+column filter will not help at all: having a row-level Bloom filter enables you to narrow down the number of files that need to be checked, even when you do row+column read operations, but not the other way around.

The row+column Bloom filter is useful when you cannot batch updates for a specific row, and end up with store files which all contain parts of the row. The more specific row+column filter can then identify which of the files contain the data you are requesting. Obviously, if you always load the entire row, this filter is once again hardly useful, as the region server will need to load the matching block out of each file anyway.

Since the row+column filter will require more storage, you need to do the math to determine whether it is worth the extra resources. It is also interesting to know that there is a maximum number of elements a Bloom filter can hold. If you have too many cells in your store file, you might exceed that number and would need to fall back to the row-level filter.

Depending on your use case, it may be useful to enable Bloom filters, to increase the overall performance of your system. If possible, you should try to use the row-level Bloom filter, as it strikes a good balance between the additional space requirements and the gain in performance coming from its store file selection filtering. Only resort to the more costly row+column Bloom filter when you would otherwise gain no advantage from using the row-level one.

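In other words, if you use a Bloom filter, prefer the row-level one over the row+column one; the third type, ROWCOL, takes the most disk space and memory.

If you only do row scans, the row-level Bloom filter is sufficient and the row+column one is of no use.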
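The difference between ROW and ROWCOL comes down to what gets added to the filter. The following toy Bloom filter is a sketch under assumptions: its hashing is a simple double-hashing scheme chosen for brevity and is not HBase's implementation, and `rowColKey` with its `row/family:qualifier` format is a hypothetical helper.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

// Illustrative only: a toy Bloom filter, not the one HBase ships.
final class BloomSketch {

    private final BitSet bits;
    private final int size;
    private final int hashes;

    BloomSketch(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++) {
            bits.set(index(key, i));
        }
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: combines two hash values to derive the i-th bit position.
    private int index(String key, int i) {
        int h1 = Arrays.hashCode(key.getBytes(StandardCharsets.UTF_8));
        int h2 = key.hashCode() | 1;
        return Math.floorMod(h1 + i * h2, size);
    }

    // ROW-style entry: the row key alone. ROWCOL-style entry: row key plus column key.
    static String rowColKey(String row, String family, String qualifier) {
        return row + "/" + family + ":" + qualifier;
    }
}
```

A ROW-style filter holds one entry per row in a store file, while a ROWCOL-style filter holds one entry per cell, which is where the extra disk and memory cost comes from.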