HBase Basics

Latest recommended article published on 2024-10-10 11:10:29

正在输入中in

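Licensed under CC 4.0 BY-SA

Article tags: HBase

Article link: https://blog.youkuaiyun.com/mhb85470015/article/details/81133958

This article covers the basics of HBase: its logical entities (table, row, column family, column qualifier, and cell) and the concept of versions. It walks through table creation, installation, and basic operations such as insert, delete, update, and read, gives advice on row key design, and stresses how important the row key is in HBase. It also briefly touches on using HBase in a distributed environment and on questions to consider in schema design.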

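I. HBase Introduction
HBase builds its index on the row key (rowKey), column key (columnKey), and timestamp. It is a key-value store and a column-family-oriented database, and it can also be viewed as a map that keeps multiple timestamped versions.
It is designed specifically for semi-structured data and horizontal scalability.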

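HBase logical entities
Table               HBase organizes data into tables.
Row                 Within a table, data is stored by row. Each row is uniquely identified by its row key; the row key has no data type and is treated as a byte array.
Column family       Data within a row is grouped by column family. Every row in a table has the same column families; the column family name is a byte array.
Column qualifier    Data within a column family is addressed by column qualifier. Qualifiers need not be defined in advance or be consistent across rows; they are treated as byte arrays.
Cell                A row key, column family, and column qualifier together identify a cell. The data stored in a cell is its value, which is also a byte array.
Version             Cell values are versioned, and each version is identified by a timestamp; by default an operation uses the current timestamp. The number of versions retained is configured per column family (three by default in older releases; HBase 0.96 and later default to one).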

HBase's data structure is a sorted map of maps, similar to Map<RowKey, Map<ColumnFamily, Map<ColumnQualifier, Map<Version, Data>>>>
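
The nested map above can be modeled in plain Java to make the coordinates concrete. This is only an in-memory sketch, not HBase code: the class name NestedMapSketch and its methods are made up for illustration, and String keys stand in for the byte arrays HBase actually uses.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// In-memory model of the logical layout: row -> family -> qualifier -> version -> value.
// Hypothetical illustration only; HBase stores byte[] keys and values, not Strings.
class NestedMapSketch {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> table =
            new TreeMap<>();

    // Write one versioned cell value at [row, family, qualifier] with timestamp ts.
    void put(String row, String family, String qualifier, long ts, String value) {
        table.computeIfAbsent(row, k -> new TreeMap<>())
             .computeIfAbsent(family, k -> new TreeMap<>())
             // Versions are kept newest-first, so the latest timestamp is the first entry.
             .computeIfAbsent(qualifier, k -> new TreeMap<Long, String>(Comparator.<Long>reverseOrder()))
             .put(ts, value);
    }

    // Return the newest value of a cell, or null if the cell does not exist.
    String getLatest(String row, String family, String qualifier) {
        NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> families = table.get(row);
        if (families == null) return null;
        NavigableMap<String, NavigableMap<Long, String>> columns = families.get(family);
        if (columns == null) return null;
        NavigableMap<Long, String> versions = columns.get(qualifier);
        if (versions == null || versions.isEmpty()) return null;
        return versions.firstEntry().getValue();
    }
}
```

Writing the same cell twice with different timestamps keeps both versions, and getLatest returns the one with the larger timestamp.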

II. Installation
I gave up. Windows is really annoying; I could not get HBase to start no matter what I tried.

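III. Basic operations

Create the user table. A table must have at least one column family; info is the column family here.
create 'user', 'info'

List all tables
list

Show table parameters; this prints the table name and the list of column families
describe 'user'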

Coordinates: [row key, column family, column qualifier]
Example: [TheRealMT, info, name]
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// Old client API (HTablePool / HTableInterface), as used throughout this post
HTablePool pool = new HTablePool();
HTableInterface usersTable = pool.getTable("user");
Put p = new Put(Bytes.toBytes("TheRealMT"));  // row key
p.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("jack"));
p.add(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("1149412346@qq.com"));
usersTable.put(p);

An update works the same way as an insert: specify the row key, column family, and column qualifier coordinates
Put p = new Put(Bytes.toBytes("TheRealMT"));
p.add(Bytes.toBytes("info"), Bytes.toBytes("email"), Bytes.toBytes("miaohanbin@zuifuli.com"));
usersTable.put(p);

Read data; this returns all columns of all column families
Get g = new Get(Bytes.toBytes("TheRealMT"));
Result result = usersTable.get(g);

Read a specific column
Get g = new Get(Bytes.toBytes("TheRealMT"));
g.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));
Result result = usersTable.get(g);

Read a column family, then extract one column value from it
Get g = new Get(Bytes.toBytes("TheRealMT"));
g.addFamily(Bytes.toBytes("info"));
Result result = usersTable.get(g);
byte[] b = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
String name = Bytes.toString(b);

Delete a row
Delete d = new Delete(Bytes.toBytes("TheRealMT"));
usersTable.delete(d);

Delete a single column
Delete d = new Delete(Bytes.toBytes("TheRealMT"));
d.deleteColumns(Bytes.toBytes("info"), Bytes.toBytes("name"));
usersTable.delete(d);

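Versions
// Assumes the Get that produced result asked for more than one version, e.g. g.setMaxVersions(2)
List<KeyValue> names = result.getColumn(Bytes.toBytes("info"), Bytes.toBytes("name"));
byte[] current = names.get(0).getValue();      // newest version comes first
long timestamp = names.get(0).getTimestamp();
byte[] pre = names.get(1).getValue();          // previous version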
Create a table
Configuration conf = HBaseConfiguration.create();
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor desc = new HTableDescriptor("twits");  // table name
HColumnDescriptor c = new HColumnDescriptor("twits");   // column family name
c.setMaxVersions(1);                                     // keep only one version per cell
desc.addFamily(c);
admin.createTable(desc);

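In its physical model HBase stores data sorted by row key, so the row key is the single most important consideration in table design. The question is how to use row key ordering to reflect time ordering.
Row keys are best kept as fixed-length byte strings, for example md5(name) + timestamp
// Md5Utils is a helper class that is not shown in this post (it is not part of HBase)
int longLength = Long.SIZE / 8;
byte[] userHash = Md5Utils.md5sum("TheRealMT");
byte[] timestamp = Bytes.toBytes(-1 * 124546573345L); // multiplying by -1 reverses the order so the newest data sorts first
byte[] rowKey = new byte[Md5Utils.MD5_LENGTH + longLength];
int offset = 0;
offset = Bytes.putBytes(rowKey, offset, userHash, 0, userHash.length);
Bytes.putBytes(rowKey, offset, timestamp, 0, timestamp.length);
Put put = new Put(rowKey);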
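
To see why the negated timestamp puts the newest row first, the same key layout can be rebuilt with only the JDK. This is a sketch under assumptions: RowKeySketch, md5, rowKey, and compareUnsigned are illustrative names, MessageDigest stands in for the Md5Utils helper, and the comparison imitates byte-wise unsigned ordering of row keys.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// JDK-only sketch of the md5(name) + reversed-timestamp row key; not HBase code.
class RowKeySketch {
    static final int MD5_LENGTH = 16; // an MD5 digest is always 16 bytes

    static byte[] md5(String s) {
        try {
            return MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Fixed-length key: 16 hash bytes followed by 8 bytes of the negated timestamp (big-endian).
    static byte[] rowKey(String user, long timestampMillis) {
        return ByteBuffer.allocate(MD5_LENGTH + Long.BYTES)
                .put(md5(user))
                .putLong(-1 * timestampMillis)
                .array();
    }

    // Lexicographic comparison that treats each byte as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }
}
```

For the same user, the key built from the larger (more recent) timestamp compares lower, so a scan over that user's rows returns the newest entries first.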

Row key scan
// twits is an HTableInterface for the twits table; DateTime comes from Joda-Time
int longLength = Long.SIZE / 8;
byte[] userHash = Md5Utils.md5sum("TheRealMT");
byte[] startRow = Bytes.padTail(userHash, longLength);  // hash followed by zero bytes
byte[] stopRow = Bytes.padTail(userHash, longLength);
stopRow[Md5Utils.MD5_LENGTH - 1]++;                      // first key after this user's range
Scan scan = new Scan(startRow, stopRow);
ResultScanner rs = twits.getScanner(scan);
for (Result r : rs) {
   byte[] b = r.getValue(Bytes.toBytes("twits"), Bytes.toBytes("user"));
   byte[] b1 = r.getValue(Bytes.toBytes("twits"), Bytes.toBytes("email"));
   // the last 8 bytes of the row key hold the negated timestamp
   byte[] b2 = Arrays.copyOfRange(r.getRow(), Md5Utils.MD5_LENGTH, Md5Utils.MD5_LENGTH + longLength);
   DateTime dt = new DateTime(-1 * Bytes.toLong(b2));
}
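
The start and stop rows of such a prefix scan can be computed without HBase, which also makes one edge case visible: incrementing the last hash byte wraps around when that byte is 0xFF. The sketch below is an illustration with made-up names (PrefixScanBounds, startRow, stopRow); it carries the overflow into the preceding byte, which the snippet above does not do, and falls back to an empty stop row, which HBase treats as "scan to the end of the table".

```java
import java.util.Arrays;

// JDK-only sketch of prefix scan bounds; the names are hypothetical.
class PrefixScanBounds {
    // Same idea as Bytes.padTail: append `length` zero bytes.
    static byte[] padTail(byte[] a, int length) {
        return Arrays.copyOf(a, a.length + length);
    }

    // Lowest possible key that starts with the prefix.
    static byte[] startRow(byte[] prefix, int suffixLength) {
        return padTail(prefix, suffixLength);
    }

    // First key after every key that starts with the prefix (exclusive upper bound).
    static byte[] stopRow(byte[] prefix, int suffixLength) {
        byte[] stop = padTail(prefix, suffixLength);
        // Treat the prefix as a big-endian unsigned number and add one, carrying past 0xFF bytes.
        for (int i = prefix.length - 1; i >= 0; i--) {
            if (++stop[i] != 0) {
                return stop; // no overflow in this byte, done
            }
        }
        // Prefix was all 0xFF: there is no larger prefix, so leave the scan open-ended.
        return new byte[0];
    }
}
```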

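Scan with a filter
Scan s = new Scan();
s.addColumn(Bytes.toBytes("twits"), Bytes.toBytes("twit"));
// keep only cells whose value matches the regular expression (".*" on both sides, since "*TwitBase*" is not a valid regex)
Filter f = new ValueFilter(CompareOp.EQUAL, new RegexStringComparator(".*TwitBase.*"));
s.setFilter(f);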

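Atomic operations: HBase guarantees atomicity for operations on a single row, but not across multiple rows
// Option 1: atomic server-side increment
usersTable.incrementColumnValue(Bytes.toBytes("TheRealMT"), Bytes.toBytes("info"), Bytes.toBytes("tweet_count"), 1L);

// Option 2: read, compute, then write only if the value has not changed in the meantime
Get g = new Get(Bytes.toBytes("TheRealMT"));
Result r = usersTable.get(g);
long curVal = Bytes.toLong(r.getColumnLatest(Bytes.toBytes("info"), Bytes.toBytes("tweet_count")).getValue());
long incVal = curVal + 1;
Put p = new Put(Bytes.toBytes("TheRealMT"));
p.add(Bytes.toBytes("info"), Bytes.toBytes("tweet_count"), Bytes.toBytes(incVal));
// the expected value must be passed as bytes; the Put is applied only if the cell still holds curVal
usersTable.checkAndPut(Bytes.toBytes("TheRealMT"), Bytes.toBytes("info"), Bytes.toBytes("tweet_count"), Bytes.toBytes(curVal), p);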
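
checkAndPut returns a boolean that is false when the cell no longer holds the expected value, so in practice the read-compute-write sequence is usually wrapped in a retry loop. The sketch below shows that loop as an in-memory analogy only: an AtomicLong stands in for the info:tweet_count cell, and CheckAndPutSketch and incrementWithRetry are made-up names.

```java
import java.util.concurrent.atomic.AtomicLong;

// In-memory analogy for a checkAndPut retry loop; not HBase code.
class CheckAndPutSketch {
    // `cell` plays the role of the info:tweet_count cell of one row.
    static long incrementWithRetry(AtomicLong cell) {
        while (true) {
            long curVal = cell.get();        // corresponds to the Get
            long incVal = curVal + 1;        // compute the new value locally
            // corresponds to checkAndPut: write only if the cell still holds curVal
            if (cell.compareAndSet(curVal, incVal)) {
                return incVal;
            }
            // another writer got in first; read again and retry
        }
    }
}
```

For a plain counter, incrementColumnValue is the simpler choice; the check-and-write pattern is mainly useful when the new value depends on more than adding a constant.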

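Summary: HBase's physical data model is a column-family-oriented store. Each column family has multiple HFiles that hold its key-value data. Once written, an HFile cannot be changed; new values are saved in new HFiles. When data is read or compacted, the view of the data has to be merged back together in memory.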

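IV. Distributed HBase, HDFS, and MapReduce
MapReduce is a distributed computing framework with a map phase and a reduce phase. After the map phase finishes, Hadoop automatically shuffles, sorts, and groups the intermediate data, and only then does the reduce phase run. Its strength is parallel computation: it hides how the splitting, sorting, and grouping are done.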
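
The three stages can be imitated in a single JVM to show the data flow. This is a plain-Java sketch of the idea using word counting, not the Hadoop API: MapReduceSketch and its methods are illustrative names, and nothing here runs in parallel or across machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process illustration of map -> shuffle/sort -> reduce; not Hadoop code.
class MapReduceSketch {
    // Map phase: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(w, 1));
            }
        }
        return out;
    }

    // Shuffle and sort: group all values by key, with keys in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values of one key into a single result.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    // Run the three stages in order over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : shuffle(pairs).entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }
}
```

In a real job only map and reduce are written by the user; the grouping step in the middle is the part the framework performs on its own.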

Concrete MapReduce usage is omitted here; it deserves separate study later.

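V. Schema design
Questions to consider:
1. How many column families should the table have?
2. What data goes into each column family?
3. How many columns should each column family have?
4. What should the column names be?
5. What data should the cells hold?
6. How many versions should each cell keep?
7. What is the row key structure, and what information should it contain?

Row key design is critical. In some scenarios a composite row key is worth considering, for example with a timestamp appended.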

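User-follow table
Tall table design
Row key                        Column family   Column qualifier
md5(userid) + md5(followid)    f               user:username

Wide table design
Row key                        Column family   Column qualifier
user                           f               user:1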