HBase (1): OpenTSDB Table Design

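This post introduces the design of OpenTSDB, a distributed time series database, and how it uses HBase to store large volumes of time series data. OpenTSDB can efficiently collect, store, and serve billions of data points, which suits modern monitoring needs. The post walks through OpenTSDB's table design, in particular how it optimizes the HBase table structure for big-data workloads.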

Why learn OpenTSDB? Is it a good case study for learning how to design an HBase table? For me, the answer is definitely yes. This open source project already applies many good optimizations. This post therefore covers only how OpenTSDB designs its HBase tables; it does not cover how to use OpenTSDB or how to set it up to monitor servers. I may write about that part in the future.

First, let's go over some basic concepts in OpenTSDB.

What is OpenTSDB?

It is a distributed, scalable time series database built for modern monitoring needs. It can collect, store, and serve billions of data points with no loss of precision, and it can be used with tcollector. There are two key points here: one is time series, the other is billions of data points. So the timestamp is an important part of OpenTSDB, and OpenTSDB has a very large number of data points to deal with. (That is the main reason to study OpenTSDB's design: we also face big data, and time is also a significant field in our data.)

OpenTSDB is an open source project, and it is used by many large companies, including Yahoo, eBay, and Pinterest.

Some Concepts

data points: (time, value)
metrics: proc.loadavg.cpu
tags: hosts=haimeili, ip=127.0.0.1
metric + tags = time series
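
The relationship between these concepts can be sketched in a few lines of Python. This is only an illustration, not OpenTSDB code: `DataPoint`, `series_key`, and the value `0.5` are made up for the example.

```python
from typing import Dict, NamedTuple, Tuple


class DataPoint(NamedTuple):
    timestamp: int  # Unix time in seconds
    value: float


def series_key(metric: str, tags: Dict[str, str]) -> Tuple:
    """A time series is identified by its metric plus its (sorted) tag pairs."""
    return (metric, tuple(sorted(tags.items())))


# Hypothetical in-memory store: one list of data points per time series.
series = {}
key = series_key("proc.loadavg.cpu", {"hosts": "haimeili", "ip": "127.0.0.1"})
series.setdefault(key, []).append(DataPoint(1297574486, 0.5))
```

Sorting the tag pairs makes the key independent of the order in which the tags were supplied, so the same metric and tags always map to the same series.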

OpenTSDB uses two tables to store data: tsdb and tsdb-uid. OpenTSDB 2.0 adds two more tables, tsdb-meta and tsdb-tree.

tsdb-uid

This table maps UIDs to names and names to UIDs. There are only three kinds of qualifiers: metric, tagk and tagv. Keep in mind that the mapping goes both ways: one direction is from UID to name, the other is from name to UID.
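
A minimal in-memory sketch of this two-way mapping is shown below. It is a stand-in for the idea, not how OpenTSDB talks to HBase: the `UidTable` class is hypothetical, the 3-byte UID width is an assumption, and UIDs are assigned here by a simple per-kind counter.

```python
class UidTable:
    """In-memory stand-in for tsdb-uid: a forward and a reverse map per kind."""

    KINDS = ("metric", "tagk", "tagv")
    WIDTH = 3  # assumed UID width in bytes

    def __init__(self):
        self.name_to_uid = {kind: {} for kind in self.KINDS}
        self.uid_to_name = {kind: {} for kind in self.KINDS}

    def get_or_create(self, kind, name):
        """Return the UID for a name, assigning the next free one if needed."""
        forward = self.name_to_uid[kind]
        if name not in forward:
            uid = (len(forward) + 1).to_bytes(self.WIDTH, "big")
            forward[name] = uid                    # name -> uid
            self.uid_to_name[kind][uid] = name     # uid -> name
        return forward[name]

    def name_of(self, kind, uid):
        """Reverse lookup, used when turning stored rows back into names."""
        return self.uid_to_name[kind][uid]


table = UidTable()
metric_uid = table.get_or_create("metric", "proc.loadavg.cpu")
assert table.name_of("metric", metric_uid) == "proc.loadavg.cpu"
```

Each kind has its own UID space, so a metric and a tag key can share the same UID bytes without conflict.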

tsdb

tsdb is the main table; it stores the data points. Its rowkey is a concatenation of UIDs and time.

Rowkey format: <metric uid><timestamp><tagk1><tagv1><tagk2><tagv2>….
The timestamp is normalized to 1-hour boundaries.
All data points for an hour are stored in one row.
There are two qualifier formats: 2 bytes and 4 bytes. The 2-byte format is <12 bits><4 bits>. The first 12 bits store the offset in seconds from the row's hour boundary (the minutes and seconds). The last 4 bits are flags: the first bit says whether the value is an integer or a floating-point number, and the remaining three bits give the length of the value in bytes minus one, so a value is 1 to 8 bytes long. E.g. “000” means a 1-byte value, “001” means a 2-byte value, “011” means a 4-byte value, etc. The 4-byte format is <4 bits><22 bits><2 bits><4 bits>. The first 4 bits are a marker (“1111” identifies the millisecond-precision format). The 22 bits store the offset from the hour boundary, in milliseconds. The 2 bits are reserved. The last 4 bits are the same flags as above.
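
The layout above can be sketched in Python as follows. This is a hedged illustration, not OpenTSDB's own code: the function names are invented, the UIDs are assumed to be already-resolved byte strings, sorting the tag pairs is an assumption, the flag bit is assumed to be 1 for floating-point values with the length stored as length minus one, and the 4-byte form is assumed to carry a “1111” prefix and an offset in milliseconds.

```python
import struct


def base_time(ts: int) -> int:
    """Round a Unix timestamp (seconds) down to its 1-hour boundary."""
    return ts - ts % 3600


def row_key(metric_uid: bytes, ts: int, tag_uids) -> bytes:
    """<metric uid><base timestamp><tagk1><tagv1>... as one byte string."""
    key = metric_uid + struct.pack(">I", base_time(ts))
    for tagk_uid, tagv_uid in sorted(tag_uids):  # assumed: pairs kept sorted
        key += tagk_uid + tagv_uid
    return key


def flags(is_float: bool, value_len: int) -> int:
    """4 flag bits: 1 type bit, then 3 bits holding (length - 1)."""
    return (0x8 if is_float else 0x0) | (value_len - 1)


def qualifier_2b(ts: int, is_float: bool, value_len: int) -> bytes:
    """<12-bit offset in seconds><4-bit flags>."""
    offset = ts % 3600
    return struct.pack(">H", (offset << 4) | flags(is_float, value_len))


def qualifier_4b(offset_ms: int, is_float: bool, value_len: int) -> bytes:
    """<1111><22-bit offset in ms><2 reserved bits><4-bit flags> (assumed)."""
    return struct.pack(">I", (0xF << 28) | (offset_ms << 6) | flags(is_float, value_len))


uid = b"\x00\x00\x01"  # hypothetical 3-byte UID
key = row_key(uid, 1297574486, [(uid, uid)])
qual = qualifier_2b(1297574486, is_float=True, value_len=4)
```

With the timestamp 1297574486 and a 4-byte floating-point value, `qual` comes out as the two bytes 01010000 01101011, and the timestamp part of `key` is the four bytes that spell `MWeP` in ASCII.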

Here is one example:

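1297574486 = 2011-02-13 13:21:26 (UTC+8)
MWeP = 01001101 01010111 01100101 01010000 = 1297573200 = 2011-02-13 13:00:00 (the rowkey keeps only the hour; the minutes and seconds are cut off and stored in the qualifier)
Pk = 01010000 01101011: the first 12 bits, 010100000110 = 1286 (1286 seconds = 21 mins 26 seconds); the last 4 bits, 1011, are the flags (floating-point, 4-byte value)
1297573200 + 1286 = 1297574486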
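
The arithmetic in this example can be checked with a short Python sketch. The `decode` helper is hypothetical; it assumes the 2-byte qualifier layout described earlier, with the flag bit set for floating-point values and the length stored as length minus one. The qualifier bytes 01010000 01101011 are the ASCII characters `P` and `k`.

```python
import struct
from datetime import datetime, timedelta, timezone


def decode(ts_bytes: bytes, qualifier: bytes):
    """Split the rowkey timestamp and a 2-byte qualifier into their fields."""
    (base,) = struct.unpack(">I", ts_bytes)   # hour-aligned base time
    (q,) = struct.unpack(">H", qualifier)
    offset = q >> 4                           # first 12 bits: seconds in the hour
    flag_bits = q & 0xF                       # last 4 bits: flags
    is_float = bool(flag_bits & 0x8)
    value_len = (flag_bits & 0x7) + 1
    return base, offset, is_float, value_len


base, offset, is_float, value_len = decode(b"MWeP", b"Pk")
assert (base, offset) == (1297573200, 1286)
assert base + offset == 1297574486
assert (is_float, value_len) == (True, 4)

# The wall-clock time in the example is in UTC+8.
utc8 = timezone(timedelta(hours=8))
print(datetime.fromtimestamp(base + offset, utc8).strftime("%Y-%m-%d %H:%M:%S"))
# prints 2011-02-13 13:21:26
```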

Summary

When you design an HBase table for big data, consider concatenating fields in the rowkey to save space. If your data is time-based, think about where to store the timestamp and whether you want to store the data per second or per minute. And if your values are poorly formatted, too long, or drawn from a fixed list, you may want to map them to UIDs to save space.