HBase 101: Row key design for paging (LIMIT, OFFSET) queries

本文探讨了在HBase中实现分页查询的方法,通过设计合理的行键来支持时间顺序的用户行为记录查询,介绍了如何利用Java API进行数据的插入与检索。

摘要生成于 C知道 ,由 DeepSeek-R1 满血版支持, 前往体验 >

文章来源:http://devblog.streamy.com/2009/04/23/hbase-row-key-design-for-paging-limit-offset-queries/

 

另外一篇 CAP Theorem http://devblog.streamy.com/2009/08/24/cap-theorem/ 写的也不错,可以看一下。

 

Paging is a very common use-case for web sites and many other applications.  In relational databases, this is easily implemented with LIMIT and OFFSET, or by selecting the row number in the query and adding conditionals based on it’s value.  HBase 0.19.x, on the other hand, does not provide any queries or filters that support paging directly. After a quick example using SQL, I will show how to implement the same functionality in HBase.

Let’s assume that we have a large number of users .  Each user has performed a number of actions .  Each action has a unique identifier, a timestamp, and a name.

This is how you might get the third page of an individual users’ actions using SQL:





This utilizes secondary indexes on both userid and stamp , meaning to accomplish this query you need at least three indexes on this table as id is the primary key.  Though a simple query to write, you will run into problems as the actions table grows to millions of rows and beyond.  Insertions would look like:





HBase has no real indexes.  Rows are stored in sorted order, and columns in a family are sorted.  For more information, read the HBase Architecture page on the HBase Wiki .

Very conscious of the primary queries we will run on user-actions, we will design an HBase table to support paging queries on per-user, time-ordered lists of actions.

We will use the Java Client API for HBase, specifically the HTable class.  What we are looking for are two methods:






Please note, I am using a custom object, Action, for simplicity.  It is a client-side holder for the four action fields (id, userid, name, stamp).

There are a number of ways to store your data in HBase that will allow the getUserActions query, but in this case we will go with a very tall table design (lots of rows with few columns in them) rather than wide (lots of columns in each row).  Specifically, the difference here would be whether you have a row-per-action or a row-per-user.  We will do a row-per-action, but will be designing our row key (the primary key) to be a composite key to allow for grouping and sorting of actions, rather than just the action id.  This means we will not have random-access to an action by it’s id, so rather than defining this as the actions table (which might also exist if you needed actionid random access) we will define it as the useractions table, and we will only store a single column in a single family, content:name .

The row key that we will use in our HBase useractions table is:






It’s very important that each of these fields is fixed-length and binary so that the lexicographical/ascending byte-ordering of HBase will properly sort our rows.

The userid field will be a 4 byte, big endian integer.  reverse_order_stamp is an 8 byte, big endian long with a value of (Long.MAX_VALUE - epoch).  This is so the most recent stamp is at the top rather than the bottom.  actionid is another 4 byte, big endian integer.  Thankfully, HBase provides helpful utilties in the org.apache.hadoop.hbase.util.Bytes class to deal with this (unfortunately it lacked some key features in 0.19, so the code below makes use of the Bytes class available in 0.20/TRUNK).  Before we get into HBase code, let’s define the helper methods makeActionRow and readActionRow to deal with the composite key:






Now that we can deal with the composite keys, insertion is very straightforward:






We just serialize the fields into the composite row, and write the single column to HBase in a BatchUpdate.  Reading will deal with unserializing the fields and Scanners.  In addition to matching for the content:name column, we will also specify a startRow and stopRow so that the Scanner only returns results from the user we are looking at.  This way we do not have to worry about jumping to the next user in our code, the Scanner will just stop.






The storage of your data must be tied to how you need to query it.  Without a sophisticated query engine or indexing capabilities, you must design to take advantage of sorted rows and columns, potentially designing a table per query type.  Denormalization is okay!

In my next posts, I will show more interesting ways to use HBase for persisted dictionary/keyval/Object storage and directly address secondary indexing with HBase.

Disclaimer: The code in these examples is designed to illustrate the practical use of HBase.  While the design is sound, the code itself may not optimized for performance.

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值