HBase Hotspots: Causes and Solutions

Latest recommended article published 2024-08-05 16:14:08

License: CC 4.0 BY-SA


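HBase Hotspots
What Is a Hotspot
Rows in HBase are sorted lexicographically by rowkey. This design optimizes scans: related rows, and rows that are read together, are stored near each other. A poorly designed rowkey, however, is the root cause of hotspots. A hotspot occurs when a large number of clients direct their traffic (reads, writes, or other operations) at one node, or only a few nodes, of the cluster. That traffic can overwhelm the single machine hosting the hot region, degrading performance and possibly making the region unavailable. Other regions on the same RegionServer suffer as well, because the host can no longer serve their requests.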

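Causes
By default, a newly created HBase table has only one region
The rowkey is poorly designed
Solutions
Specify the regions (pre-split) when creating the table
Design the rowkey carefully
HBase Pre-splitting
Regions
In HBase, a table is divided into 1 to n regions, which are hosted on RegionServers. A region has two important attributes, StartKey and EndKey, which define the rowkey range it serves (start key inclusive, end key exclusive). When we read or write data, the rowkey falls within the start key to end key range of some region; the request is routed to that region, and the data is read or written there.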
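The lookup described above can be modeled in a few lines. This is only a sketch, not HBase client code: the class `RegionLookupSketch`, its methods, and the region names are hypothetical, and it assumes sorted, distinct, ASCII split keys so that Java string ordering matches byte ordering.

```java
import java.util.TreeMap;

// Hypothetical model of region lookup; this is not HBase client code.
class RegionLookupSketch {
    // Region name keyed by the region's start key. "" stands for the
    // unbounded start key of the first region.
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    // Builds a table whose regions are bounded by the given split keys
    // (assumed to be sorted and distinct).
    static RegionLookupSketch fromSplitKeys(String... splitKeys) {
        RegionLookupSketch table = new RegionLookupSketch();
        table.regionsByStartKey.put("", "region-1");
        for (String splitKey : splitKeys) {
            int nextNumber = table.regionsByStartKey.size() + 1;
            table.regionsByStartKey.put(splitKey, "region-" + nextNumber);
        }
        return table;
    }

    // A rowkey belongs to the region with the greatest start key <= rowkey,
    // because each region covers [startKey, endKey).
    String locate(String rowKey) {
        return regionsByStartKey.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        RegionLookupSketch table = RegionLookupSketch.fromSplitKeys("g", "n", "t");
        for (String rowKey : new String[] {"apple", "g", "melon", "zebra"}) {
            System.out.println(rowKey + " -> " + table.locate(rowKey));
        }
    }
}
```

A sorted map fits here because finding the owning region is a "floor" search on the start keys, which is the same reason sorted rowkeys make range scans cheap.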

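By default, when we create a table through HBaseAdmin with only a TableDescriptor, the table starts with a single region whose start and end keys are unbounded. As the region grows and reaches a size threshold, HBase picks a midKey and splits the region in two. This process is called a split, and the midKey becomes the boundary between the two resulting regions.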

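Disadvantages
When rowkeys increase monotonically, records are always written to the region with the largest start key. Regions produced by earlier splits receive no further writes and stay half full
Splits are time-consuming and resource-intensive
Advantages
With pre-split regions, a well-designed rowkey spreads concurrent requests roughly evenly across the regions, which maximizes I/O efficiency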
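Pre-splitting
Creating the table from the shell
create 'table1', 'cf1', SPLITS => ['10', '20', '30', '40']
This creates 5 regions. Each region has a startKey and an endKey, except that the first region has no startKey and the last region has no endKey:

Region 1: "" to "10"
Region 2: "10" to "20"
Region 3: "20" to "30"
Region 4: "30" to "40"
Region 5: "40" to ""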
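To make the relationship between split keys and regions explicit, here is a minimal sketch that derives the region ranges from a list of split keys. The class `PreSplitRangesSketch` and the `[start, end)` label format are illustrative assumptions, not anything HBase prints.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the key range of every region that a set of
// split keys produces. n split keys always yield n + 1 regions.
class PreSplitRangesSketch {
    static List<String> regionRanges(String... splitKeys) {
        List<String> ranges = new ArrayList<>();
        String start = ""; // the first region has no start key
        for (String splitKey : splitKeys) {
            ranges.add("[" + start + ", " + splitKey + ")");
            start = splitKey; // each split key starts the next region
        }
        ranges.add("[" + start + ", )"); // the last region has no end key
        return ranges;
    }

    public static void main(String[] args) {
        for (String range : regionRanges("10", "20", "30", "40")) {
            System.out.println(range);
        }
    }
}
```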

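Creating a pre-split table with the Java API (createTable)
The Admin interface in the HBase client package provides four methods for creating tables (the first three are synchronous, the fourth is asynchronous):

1. Create a table directly from a descriptor
This creates the table from the table descriptor alone, without specifying any regions.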

/**
 * Creates a new table. Synchronous operation.
 *
 * @param desc table descriptor for table
 * @throws IllegalArgumentException if the table name is reserved
 * @throws MasterNotRunningException if master is not running
 * @throws org.apache.hadoop.hbase.TableExistsException if table already exists (If concurrent
 *           threads, the table may have been created between test-for-existence and attempt-at-creation).
 * @throws IOException if a remote or network exception occurs
 */
void createTable(HTableDescriptor desc) throws IOException;
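2. Create a table from a descriptor, a startKey, an endKey, and a region count, with ranges assigned automatically

This creates the table from the table descriptor, the given startKey and endKey, and the number of regions. HBase creates that many regions and assigns a key range to each one, but the ranges are always contiguous and evenly sized. If some ranges of your business keys hold much more data than others, this leads to data skew. In that case you must specify the region boundaries yourself, using the third or fourth method to pre-split.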

/**
 * Creates a new table with the specified number of regions. The start key specified will become
 * the end key of the first region of the table, and the end key specified will become the start
 * key of the last region of the table (the first region has a null start key and the last region
 * has a null end key). BigInteger math will be used to divide the key range specified into enough
 * segments to make the required number of total regions. Synchronous operation.
 *
 * @param desc table descriptor for table
 * @param startKey beginning of key range
 * @param endKey end of key range
 * @param numRegions the total number of regions to create
 * @throws IllegalArgumentException if the table name is reserved
 * @throws MasterNotRunningException if master is not running
 * @throws org.apache.hadoop.hbase.TableExistsException if table already exists (If concurrent
 *           threads, the table may have been created between test-for-existence and attempt-at-creation).
 * @throws IOException
 */
void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey, int numRegions)
    throws IOException;
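The Javadoc above says the range is divided with BigInteger math. The sketch below shows one way such evenly spaced split points could be computed. It is an illustration under assumptions, not HBase's implementation: `UniformSplitSketch` is a hypothetical class, and it works on numeric keys directly, whereas real keys are byte arrays that would first be interpreted as unsigned integers (for example with `new BigInteger(1, bytes)`).

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of dividing a key range into evenly sized regions.
class UniformSplitSketch {
    // Returns numRegions - 1 split points. The first is the start key (end of
    // the first region) and the last is the end key (start of the last region).
    static List<BigInteger> splitPoints(BigInteger start, BigInteger end, int numRegions) {
        if (numRegions < 3) {
            throw new IllegalArgumentException("need at least 3 regions");
        }
        if (start.compareTo(end) >= 0) {
            throw new IllegalArgumentException("start key must be below end key");
        }
        // start and end are themselves split points, so the range between
        // them is cut into numRegions - 2 segments.
        BigInteger segments = BigInteger.valueOf(numRegions - 2);
        BigInteger span = end.subtract(start);
        List<BigInteger> points = new ArrayList<>();
        for (int i = 0; i <= numRegions - 2; i++) {
            BigInteger offset = span.multiply(BigInteger.valueOf(i)).divide(segments);
            points.add(start.add(offset));
        }
        return points;
    }

    public static void main(String[] args) {
        // 6 regions over the range 100000..500000
        System.out.println(splitPoints(BigInteger.valueOf(100000), BigInteger.valueOf(500000), 6));
    }
}
```

Because the points are spaced purely by arithmetic, they ignore how the data is actually distributed, which is exactly the skew risk the paragraph above warns about.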
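3. Create a table from a descriptor and custom split keys (synchronous)
This creates the table from the table descriptor and split keys that you define, so you can decide which key range each region covers. For example:

byte[][] splitKeys = new byte[][] { Bytes.toBytes("100000"),
    Bytes.toBytes("200000"), Bytes.toBytes("400000"),
    Bytes.toBytes("500000") };
Passing the splitKeys above to this method creates 5 regions automatically and assigns each of them its key range.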

/**
 * Creates a new table with an initial set of empty regions defined by the specified split keys.
 * The total number of regions created will be the number of split keys plus one. Synchronous
 * operation. Note : Avoid passing empty split key.
 *
 * @param desc table descriptor for table
 * @param splitKeys array of split keys for the initial regions of the table
 * @throws IllegalArgumentException if the table name is reserved, if the split keys are repeated
 *           and if the split key has empty byte array.
 * @throws MasterNotRunningException if master is not running
 * @throws org.apache.hadoop.hbase.TableExistsException if table already exists (If concurrent
 *           threads, the table may have been created between test-for-existence and attempt-at-creation).
 * @throws IOException
 */
void createTable(final HTableDescriptor desc, byte[][] splitKeys) throws IOException;
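The Javadoc above rejects repeated or empty split keys and promises "number of split keys plus one" regions. A client-side check along the same lines might look like the sketch below. `SplitKeyCheckSketch` is a hypothetical helper written for illustration; it is not part of HBase, and it relies on `Arrays.compareUnsigned`, which needs Java 9 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical pre-flight check for a split key array.
class SplitKeyCheckSketch {
    // Returns the number of regions the split keys would produce, or throws
    // if a key is empty or appears more than once.
    static int regionCountFor(byte[][] splitKeys) {
        byte[][] sorted = splitKeys.clone();
        // Sort as unsigned bytes, the order in which rowkeys are compared.
        Arrays.sort(sorted, (a, b) -> Arrays.compareUnsigned(a, b));
        for (int i = 0; i < sorted.length; i++) {
            if (sorted[i].length == 0) {
                throw new IllegalArgumentException("empty split key");
            }
            if (i > 0 && Arrays.equals(sorted[i - 1], sorted[i])) {
                throw new IllegalArgumentException("duplicate split key");
            }
        }
        return sorted.length + 1;
    }

    public static void main(String[] args) {
        byte[][] splitKeys = {
            "100000".getBytes(StandardCharsets.UTF_8),
            "200000".getBytes(StandardCharsets.UTF_8),
            "400000".getBytes(StandardCharsets.UTF_8),
            "500000".getBytes(StandardCharsets.UTF_8),
        };
        System.out.println(regionCountFor(splitKeys) + " regions");
    }
}
```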
4. Create a table from a descriptor and custom split keys (asynchronous)

/**
 * Creates a new table but does not block and wait for it to come online. Asynchronous operation.
 * To check if the table exists, use {@link #isTableAvailable} -- it is not safe to create an
 * HTable instance to this table before it is available. Note : Avoid passing empty split key.
 *
 * @param desc table descriptor for table
 * @throws IllegalArgumentException Bad table name, if the split keys are repeated and if the
 *           split key has empty byte array.
 * @throws MasterNotRunningException if master is not running
 * @throws org.apache.hadoop.hbase.TableExistsException if table already exists (If concurrent
 *           threads, the table may have been created between test-for-existence and attempt-at-creation).
 * @throws IOException
 */
void createTableAsync(final HTableDescriptor desc, final byte[][] splitKeys) throws IOException;
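HBase Rowkey Design Principles
Rowkey length principle
A rowkey is a binary byte stream. Many developers suggest keeping its length between 10 and 100 bytes, but shorter is better; the recommendation here is to stay within 16 bytes.

The reasons are as follows:

HFile, the persistent data file, stores data as KeyValues, and every KeyValue repeats the rowkey. If the rowkey is too long, say 100 bytes, then 10 million cells carry 100 × 10 million = 1 billion bytes of rowkey alone, nearly 1 GB, which severely hurts HFile storage efficiency.
MemStore caches part of the data in memory. If the rowkey is too long, effective memory utilization drops and the system can cache less data, which lowers retrieval efficiency. So the shorter the rowkey, the better.
Current operating systems are mostly 64-bit, with memory aligned to 8 bytes. Keeping the rowkey at 16 bytes, a multiple of 8, takes best advantage of this.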
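The storage arithmetic in the first reason can be checked with a few lines. This sketch only multiplies the numbers quoted above; the class `RowkeyOverheadSketch` is hypothetical, and it counts rowkey bytes only, ignoring the rest of each KeyValue as well as any compression or block encoding.

```java
// Hypothetical back-of-the-envelope calculator for rowkey storage overhead.
class RowkeyOverheadSketch {
    // Total bytes spent on rowkeys when every cell stores its own copy.
    static long rowkeyBytes(int rowkeyLength, long cellCount) {
        return Math.multiplyExact((long) rowkeyLength, cellCount);
    }

    public static void main(String[] args) {
        long cells = 10_000_000L;
        long longKeys = rowkeyBytes(100, cells);  // 100-byte rowkeys
        long shortKeys = rowkeyBytes(16, cells);  // 16-byte rowkeys
        System.out.println("100-byte rowkeys: " + longKeys + " bytes");
        System.out.println("16-byte rowkeys:  " + shortKeys + " bytes");
        System.out.println("saved:            " + (longKeys - shortKeys) + " bytes");
    }
}
```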
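Rowkey uniqueness principle
The rowkey must be unique by design. Because rowkeys are stored in lexicographic order, the design should exploit this ordering: store data that is frequently read together in one place, and store data that is likely to be accessed soon in one place.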

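Rowkey hashing principle
If the rowkey increases with a timestamp, do not put the time at the front of the key. Instead, use the high-order part of the rowkey as a hash field, generated randomly by the program, and put the time field in the low-order part. This raises the chance that data is spread evenly across RegionServers and that the load is balanced. Without a hash field, when the leading field is the time itself, all new data concentrates on one RegionServer. Reads are then concentrated on a few RegionServers as well, which creates hotspots and lowers query efficiency.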
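The effect of a leading hash field can be seen by counting how sequential timestamps spread over a handful of prefixes. The sketch below is an assumption-laden illustration: `PrefixedTimestampKeySketch`, the 4 buckets, and the `bucket-timestamp` layout are all hypothetical, and the prefix is derived from the timestamp with `Long.hashCode` instead of being generated randomly, so that the result is reproducible.

```java
import java.util.Arrays;

// Hypothetical sketch: a bucket prefix in the high-order position, with the
// timestamp in the low-order position.
class PrefixedTimestampKeySketch {
    static final int BUCKETS = 4; // assumed to be at most 10 (one digit)

    static String rowKey(long timestampMillis) {
        int bucket = Math.floorMod(Long.hashCode(timestampMillis), BUCKETS);
        return bucket + "-" + timestampMillis;
    }

    // Counts how many of `count` consecutive timestamps land in each bucket.
    static int[] countPerBucket(long firstTimestamp, int count) {
        int[] counts = new int[BUCKETS];
        for (int i = 0; i < count; i++) {
            String key = rowKey(firstTimestamp + i);
            counts[key.charAt(0) - '0']++; // the first character is the bucket
        }
        return counts;
    }

    public static void main(String[] args) {
        // Without the prefix, all 1000 keys would share one leading digit string.
        System.out.println(Arrays.toString(countPerBucket(1_700_000_000_000L, 1000)));
    }
}
```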

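Common Ways to Avoid Hotspots in HBase
Salting
Salting adds a random value to the front of the rowkey: each rowkey is given a random prefix so that it sorts differently than it otherwise would. The number of distinct prefixes should match the number of regions you want to spread the data across. Salting is useful when a few hot rowkey patterns recur among otherwise evenly distributed rowkeys.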

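Suppose you have the following rowkeys, and your table is split so that there is one region for each letter of the alphabet: keys starting with 'a' are in one region, keys starting with 'b' are in another, and so on. In this table, all rows starting with 'f' are in the same region, and their rowkeys look like this: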

foo0001
foo0002
foo0003
foo0004
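Now suppose you want to spread these rows across 4 regions. You can use 4 different salts: 'a', 'b', 'c', and 'd'. Under this scheme, each letter prefix lands in a different region. After salting, you have the following rowkeys: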

a-foo0003
b-foo0001
c-foo0004
d-foo0002
You can now write to 4 different regions, so in theory you get four times the throughput you had when all writes went to the same region.

Now, if you add another row, it is randomly assigned one of the prefixes a, b, c, or d, and ends up near one of the existing rows:

a-foo0003
b-foo0001
c-foo0003
c-foo0004
d-foo0002
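Because the assignment is random, you need to do more work if you want to retrieve rows in lexicographic order. Salting increases write throughput, but at an extra cost on reads.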
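The trade-off can be sketched as a write path and a read path. This is a minimal illustration, not HBase code: `SaltingSketch` and its methods are hypothetical, and the four salts and the `salt-rowkey` layout are taken from the example above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of salting with the four salts from the example.
class SaltingSketch {
    static final String[] SALTS = {"a", "b", "c", "d"};

    // Write path: a random salt decides which region receives the row.
    static String saltedKey(String rowKey, Random random) {
        return SALTS[random.nextInt(SALTS.length)] + "-" + rowKey;
    }

    // Read path: the salt cannot be recomputed, so a reader has to try every
    // possible salted form of the key (or scan every salt range).
    static List<String> candidateKeys(String rowKey) {
        List<String> keys = new ArrayList<>();
        for (String salt : SALTS) {
            keys.add(salt + "-" + rowKey);
        }
        return keys;
    }

    public static void main(String[] args) {
        String written = saltedKey("foo0003", new Random());
        System.out.println("written as:  " + written);
        System.out.println("must probe:  " + candidateKeys("foo0003"));
    }
}
```

One write turns into up to four lookups on the read side, which is the extra read cost mentioned above.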

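Hashing
Hashing gives a given row the same prefix every time. Hashing also spreads the load across the cluster, but reads remain predictable. With a deterministic hash, the client can reconstruct the complete rowkey and fetch a specific row with an ordinary Get.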
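A deterministic prefix could be derived as in the sketch below. The names are hypothetical, and the choice of `String.hashCode` with 4 buckets mapped to the letters a through d is an assumption made for illustration; a production design would more likely use a stronger hash such as MD5 over the key bytes.

```java
// Hypothetical sketch of a deterministic (hash-derived) prefix.
class HashPrefixSketch {
    static final int BUCKETS = 4;

    // The same rowKey always yields the same prefix, so a reader can rebuild
    // the full stored key from the logical key alone.
    static String prefixedKey(String rowKey) {
        int bucket = Math.floorMod(rowKey.hashCode(), BUCKETS);
        return (char) ('a' + bucket) + "-" + rowKey;
    }

    public static void main(String[] args) {
        String onWrite = prefixedKey("foo0001");
        String onRead = prefixedKey("foo0001"); // recomputed, not looked up
        System.out.println(onWrite + " equals " + onRead + ": " + onWrite.equals(onRead));
    }
}
```

Unlike random salting, a single Get on the recomputed key is enough to read the row back.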

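Reversing
The third way to prevent hotspots is to reverse a fixed-width or numeric rowkey, so that the part that changes most often (the least significant part) comes first. This effectively randomizes the rowkeys, but sacrifices their ordering.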

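Example of reversing a rowkey
With phone numbers as rowkeys, use the reversed phone number string as the rowkey. This avoids the hotspots caused by phone numbers sharing a small set of fixed leading digits.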
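A minimal sketch of the reversal follows. The class `ReversedKeySketch` and the phone number are made up for illustration.

```java
// Hypothetical sketch: store a phone number under its reversed digits.
class ReversedKeySketch {
    static String reverse(String rowKey) {
        return new StringBuilder(rowKey).reverse().toString();
    }

    public static void main(String[] args) {
        String phone = "13812345678";       // made-up number
        String stored = reverse(phone);     // fast-changing digits now lead
        System.out.println(stored);
        System.out.println(reverse(stored)); // reversing again restores the number
    }
}
```

Because reversal is its own inverse, the client can always recover the original number from the stored key, but a range scan over a number prefix such as 138 is no longer possible.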