Working with HBase from Python 3.6 via HBase-Thrift

啊！漂泊的鱼

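This article walks through installing HBase and Thrift on Ubuntu and gives complete examples of working with HBase from Python, covering table creation, reading and writing data, scanning, and deletion.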

For installing HBase on Ubuntu and basic usage, see: https://blog.youkuaiyun.com/luanpeng825485697/article/details/81027601
Reference: https://blog.youkuaiyun.com/luanpeng825485697/article/details/81048468

Installing Thrift on the local Linux machine

Thrift 0.11.0 can be downloaded from: http://thrift.apache.org/

Run the following command to install the packages Thrift depends on:

apt-get install automake bison flex g++ git libboost1.55 libevent-dev libssl-dev libtool make pkg-config

Extract and build:

tar -zxvf thrift-0.11.0.tar.gz
cd thrift-0.11.0
./configure --with-cpp --with-boost --with-python --without-csharp --with-java --without-erlang --without-perl --with-php --without-php_extension --without-ruby --without-haskell --without-go
make
make install

On the cluster Master, start the Thrift service from the bin directory of the HBase installation, /usr/local/hbase/bin:

./hbase-daemon.sh start thrift
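
Before moving on to the Python client, it can help to confirm that something is actually listening on the Thrift port. The following is a minimal sketch using only the standard library; the function name `thrift_port_open` is made up for illustration, and the default of 9090 is the default Thrift port mentioned later in this article. It only checks that a TCP connection can be made, not that the listener really is the HBase Thrift server.

```python
import socket


def thrift_port_open(host="127.0.0.1", port=9090, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    print("Thrift port reachable:", thrift_port_open())
```

If this prints False right after starting the service, waiting a few seconds and retrying is usually worth a try before digging into the HBase logs.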

Working with HBase from Python

Install the dependencies

pip install thrift
pip install hbase-thrift

Start HBase first

cd /usr/local/hbase
bin/start-hbase.sh

A simple demo

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase

# The Thrift server listens on port 9090 by default
socket = TSocket.TSocket('127.0.0.1', 9090)
socket.setTimeout(5000)  # timeout in milliseconds

transport = TTransport.TBufferedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(transport)

client = Hbase.Client(protocol)
socket.open()

print(client.getTableNames())  # list the names of all existing tables

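Under Python 3 this raises errors out of the box.

You first need to download Python 3 versions of the Hbase files and use them to replace Hbase.py and ttypes.py under /usr/local/lib/python3.6/dist-packages/hbase/.
Download from: https://github.com/626626cdllp/infrastructure/tree/master/hbase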
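
One plausible cause of the Python 3 errors, and this is an assumption rather than something the article states, is that the modules installed by hbase-thrift were written with Python 2 syntax. A quick way to check whether a given file at least parses under the interpreter you are running is sketched below; `parses_under_this_python` is a hypothetical helper name.

```python
def parses_under_this_python(path):
    """Return True if the file at `path` compiles as valid syntax for the running interpreter."""
    with open(path, "rb") as fh:
        source = fh.read()
    try:
        # compile() only parses the source; nothing in the file is executed
        compile(source, path, "exec")
    except SyntaxError:
        return False
    return True
```

For example, calling it on /usr/local/lib/python3.6/dist-packages/hbase/Hbase.py before and after the replacement shows whether the swap took effect. A file that parses can still fail at runtime, so this is only a first check.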

Common methods

createTable(tableName, columnFamilies): creates a table; returns nothing
    tableName: the table name
    columnFamilies: the column families, as a list of ColumnDescriptor objects

from hbase.ttypes import ColumnDescriptor

# Define a column family
column = ColumnDescriptor(name='cf')

# Create the table
client.createTable('test4', [column])
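
Calling createTable on a table that already exists fails, so scripts often check first. Here is a minimal sketch of that pattern built on getTableNames and createTable. The helper names `ensure_table` and `_as_text` are invented for this example, and the bytes handling is a precaution: depending on how the bindings were generated, names may come back as bytes or as str.

```python
def _as_text(name):
    """Normalise a table name that may arrive as bytes or str."""
    return name.decode("utf-8") if isinstance(name, bytes) else name


def ensure_table(client, table_name, column_descriptors):
    """Create `table_name` unless it already exists; return True if it was created.

    `client` is any object exposing getTableNames() and createTable(),
    for example an Hbase.Client instance.
    """
    existing = {_as_text(n) for n in client.getTableNames()}
    if table_name in existing:
        return False
    client.createTable(table_name, column_descriptors)
    return True
```

With the objects from the example above, usage would look like `ensure_table(client, 'test4', [column])`.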

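enableTable(tableName): enables a table; returns nothing
    tableName: the table name

# Enable the table; raises IOError if the table was not previously disabled
client.enableTable('test')

disableTable(tableName): disables a table; returns nothing
    tableName: the table name

# Disable the table; raises IOError if the table was not previously enabled
client.disableTable('test')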

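isTableEnabled(tableName): checks whether a table is enabled; returns a bool
    tableName: the table name

client.isTableEnabled('test')

getTableNames(): lists the table names; takes no arguments and returns a list of str

client.getTableNames()

getColumnDescriptors(tableName): gets information on all column families of a table; returns a dict
    tableName: the table name

client.getColumnDescriptors('test')

getTableRegions(tableName): gets all regions associated with a table; returns a list of TRegionInfo objects
    tableName: the table name

client.getTableRegions('test')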

deleteTable(tableName): deletes a table; returns nothing
    tableName: the table name

# Raises IOError(message='java.io.IOException: table does not exist…) if the table does not exist
# Raises IOError(message='org.apache.hadoop.hbase.TableNotDisabledException:…) if the table has not been disabled
client.deleteTable('test5')
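
Because deleteTable fails on a table that is still enabled, and disableTable fails on one that is already disabled, the two checks are usually combined. A minimal sketch follows; `drop_table` is a hypothetical helper name, and it relies only on the isTableEnabled, disableTable, and deleteTable calls described above.

```python
def drop_table(client, table_name):
    """Disable `table_name` if it is currently enabled, then delete it.

    `client` is any object exposing isTableEnabled(), disableTable()
    and deleteTable(), for example an Hbase.Client instance.
    """
    if client.isTableEnabled(table_name):
        client.disableTable(table_name)
    client.deleteTable(table_name)
```

A missing table still raises IOError here; the sketch deliberately does not swallow that error.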

get(tableName, row, column): gets a single cell; returns a list of hbase.ttypes.TCell objects
    tableName: the table name
    row: the row key
    column: the column

result = client.get('test', 'row1', 'cf:a')  # a list containing a single hbase.ttypes.TCell object
print(result[0].timestamp)
print(result[0].value)

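getVer(tableName, row, column, numVersions): gets up to numVersions versions of a cell; returns a list of hbase.ttypes.TCell objects
    tableName: the table name
    row: the row key
    column: the column
    numVersions: the number of versions to retrieve

result = client.getVer('test', 'row1', 'cf:a', 2)  # a list of at most 2 hbase.ttypes.TCell objects
print(result[0].timestamp)
print(result[0].value)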
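
When several versions come back, code usually wants a specific one rather than relying on list order. A small sketch that picks the newest cell by its timestamp attribute is shown below; `latest_cell` is an illustrative name, and the function works on any objects with a `timestamp` attribute, such as the TCell objects returned above.

```python
def latest_cell(cells):
    """Return the cell with the largest timestamp, or None if `cells` is empty."""
    return max(cells, key=lambda cell: cell.timestamp, default=None)
```

For example, `latest_cell(client.getVer('test', 'row1', 'cf:a', 2))` yields a single TCell or None, which avoids an IndexError from `result[0]` when nothing was found.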

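getVerTs(tableName, row, column, timestamp, numVersions): gets the versions of a cell whose timestamps are earlier than the given timestamp (keep in mind how many versions the column family retains; by default only one is kept); returns a list of hbase.ttypes.TCell objects
    tableName: the table name
    row: the row key
    column: the column
    timestamp: the timestamp
    numVersions: the number of versions to retrieve

result = client.getVerTs('test', 'row1', 'cf:a', 1513069831512, 2)  # a list of at most 2 hbase.ttypes.TCell objects
print(result[0].timestamp)
print(result[0].value)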

getRow(tableName, row): gets the data of the given row at its latest timestamp. Returns a list of hbase.ttypes.TRowResult objects, or an empty list if the row does not exist
    tableName: the table name
    row: the row key

# Row key
row = 'row1'

# Column
column = 'cf:a'

# Query
result = client.getRow('test', row)  # result is a list
for item in result:  # each item is an hbase.ttypes.TRowResult object
    print(item.row)
    print(item.columns.get('cf:a').value)  # the value; item.columns.get('cf:a') is an hbase.ttypes.TCell object
    print(item.columns.get('cf:a').timestamp)  # the timestamp of the same TCell object
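
Since getRow returns a list that may be empty, and each TRowResult keeps its cells in a `columns` mapping, a thin wrapper can make the result easier to consume. This is a minimal sketch; `row_to_dict` and `fetch_row` are names chosen for illustration, and they rely only on the `columns`, `value`, and `timestamp` attributes used in the example above.

```python
def row_to_dict(row_result):
    """Flatten one TRowResult-like object into {column: (value, timestamp)}."""
    return {
        column: (cell.value, cell.timestamp)
        for column, cell in row_result.columns.items()
    }


def fetch_row(client, table_name, row_key):
    """Return the flattened row, or None when getRow yields an empty list."""
    results = client.getRow(table_name, row_key)
    return row_to_dict(results[0]) if results else None
```

Returning None for a missing row keeps the "row does not exist" case explicit instead of letting an empty list fall through a for loop unnoticed.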

getRowWithColumns(tableName, row, columns): gets the data of the given row and columns at their latest timestamp. Returns a list of hbase.ttypes.TRowResult objects, or an empty list if the row does not exist
    tableName: the table name
    row: the row key
    columns: the columns, as a list

result = client.getRowWithColumns('test', 'row1', ['cf:a', 'df:a'])
for item in result:
    print(item.row)
    print(item.columns.get('cf:a').value)
    print(item.columns.get('cf:a').timestamp)

    print(item.columns.get('df:a').value)
    print(item.columns.get('df:a').timestamp)

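getRowTs(tableName, row, timestamp): gets the most recent data of the given row that is older than the given timestamp. Returns a list of hbase.ttypes.TRowResult objects, or an empty list if the row does not exist
    tableName: the table name
    row: the row key
    timestamp: the timestamp

result = client.getRowTs('test', 'row1', 1513069831512)

getRowWithColumnsTs(tableName, row, columns, timestamp): gets the most recent data of the given row and columns that is older than the given timestamp. Returns a list of hbase.ttypes.TRowResult objects, or an empty list if the row does not exist
    tableName: the table name
    row: the row key
    columns: the columns, as a list
    timestamp: the timestamp

result = client.getRowWithColumnsTs('test', 'row1', ['cf:a', 'cf:b', 'df:a'], 1513069831512)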

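mutateRow(tableName, row, mutations): applies a series of mutations to one row of a table. If an exception is thrown, the transaction is aborted. The current time is used as the timestamp, so all entries share the same timestamp. Returns nothing
    tableName: the table name
    row: the row key
    mutations: the mutations, as a list

from hbase.ttypes import Mutation

mutation = Mutation(column='cf:a', value='1')

# Insert the data; if column cf:a of row row1 already exists in table test, it is overwritten
client.mutateRow('test', 'row1', [mutation])

# Insert several columns of one row
mutations = [Mutation(column="colNameTest1:Name", value="Jason"), Mutation(column="colNameTest1:age", value="5")]
client.mutateRow("our_table1", "rowKey", mutations)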

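mutateRowTs(tableName, row, mutations, timestamp): applies a series of mutations to one row of a table. If an exception is thrown, the transaction is aborted. The given timestamp is used, so all entries share the same timestamp. If this is an update and the given timestamp is earlier than that of the existing data, the update is ignored. Returns nothing
    tableName: the table name
    row: the row key
    mutations: the mutations, as a list
    timestamp: the timestamp

from hbase.ttypes import Mutation

# value must be a string, otherwise an error is raised
mutation = Mutation(column='cf:a', value='2')
client.mutateRowTs('test', 'row1', [mutation], 1513070735669)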
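
The timestamps used throughout these examples, such as 1513070735669, have the magnitude of milliseconds since the Unix epoch, which is also what HBase uses for cell timestamps by default. A short standard-library sketch for producing and reading such values follows; `now_millis` and `millis_to_datetime` are hypothetical helper names.

```python
import time
from datetime import datetime, timezone


def now_millis():
    """Return the current time as integer milliseconds since the Unix epoch."""
    return int(time.time() * 1000)


def millis_to_datetime(ts_millis):
    """Convert milliseconds since the epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)


if __name__ == "__main__":
    # The timestamp from the mutateRowTs example falls on 12 December 2017 (UTC)
    print(millis_to_datetime(1513070735669).isoformat())
```

Passing `now_millis()` as the timestamp argument of the *Ts methods gives the same kind of value the server would assign on its own.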

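mutateRows(tableName, rowBatches): applies a series of batches to a table, each batch being a series of mutations on a single row. If an exception is thrown, the transaction is aborted. The current time is used as the timestamp, so all entries share the same timestamp. Returns nothing
    tableName: the table name
    rowBatches: the batches, as a list

from hbase.ttypes import Mutation, BatchMutation

# mutations is a list of Mutation objects, such as the one built in the mutateRow example
rowMutations = [BatchMutation("rowkey1", mutations), BatchMutation("rowkey2", mutations)]
client.mutateRows("our_table1", rowMutations)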
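
Building the nested list of batches by hand gets tedious for more than a couple of rows. The sketch below converts a plain nested dict into that structure. `build_batches` is an illustrative name, and the two classes are passed in as parameters (in real use they would be Mutation and BatchMutation from hbase.ttypes) so the function itself has no dependency on the bindings. Values are passed through `str()` because the values in these examples must be strings.

```python
def build_batches(rows, mutation_cls, batch_cls):
    """Convert {row_key: {column: value}} into a list of batch objects.

    mutation_cls is called as mutation_cls(column=..., value=...) and
    batch_cls as batch_cls(row_key, mutations), matching how Mutation and
    BatchMutation are constructed in the examples above.
    """
    batches = []
    for row_key, columns in rows.items():
        mutations = [
            mutation_cls(column=column, value=str(value))
            for column, value in columns.items()
        ]
        batches.append(batch_cls(row_key, mutations))
    return batches
```

Usage would look like `client.mutateRows('our_table1', build_batches(data, Mutation, BatchMutation))`, where `data` is a dict such as `{'rowkey1': {'colNameTest1:age': 5}}`.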

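mutateRowsTs(tableName, rowBatches, timestamp): applies a series of batches to a table, each batch being a series of mutations on a single row. If an exception is thrown, the transaction is aborted. The given timestamp is used, so all entries share the same timestamp. If this is an update and the given timestamp is earlier than that of the existing data, the update is ignored. Returns nothing
    tableName: the table name
    rowBatches: the batches, as a list
    timestamp: the timestamp

from hbase.ttypes import Mutation, BatchMutation

mutation = Mutation(column='cf:a', value='2')
batchMutation = BatchMutation('row1', [mutation])
client.mutateRowsTs('cx', [batchMutation], timestamp=1513135651874)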

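atomicIncrement(tableName, row, column, value): atomically increments a column; returns the column's new value
    tableName: the table name
    row: the row key
    column: the column
    value: the amount to increment by

result = client.atomicIncrement('cx', 'row1', 'cf:b', 1)
print(result)  # if the previous value was 2, this prints 3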

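deleteAll(tableName, row, column): deletes all data in the given column of the given row; returns nothing
    tableName: the table name
    row: the row key
    column: the column

client.deleteAll('cx', 'row1', 'cf:a')

deleteAllTs(tableName, row, column, timestamp): deletes all data in the given column of the given row whose timestamp is less than or equal to the given timestamp; returns nothing
    tableName: the table name
    row: the row key
    column: the column
    timestamp: the timestamp

client.deleteAllTs('cx', 'row1', 'cf:a', timestamp=1513569725685)

deleteAllRow(tableName, row): deletes an entire row; returns nothing
    tableName: the table name
    row: the row key

client.deleteAllRow('cx', 'row1')

deleteAllRowTs(tableName, row, timestamp): deletes all data in the given row whose timestamp is less than or equal to the given timestamp; returns nothing
    tableName: the table name
    row: the row key
    timestamp: the timestamp

client.deleteAllRowTs('cx', 'row1', timestamp=1513568619326)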

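scannerOpen(tableName, startRow, columns): scans the given columns of a table, starting at the given row and ending at the last row of the table. Only the latest version of each row is fetched. Returns a ScannerID (an int)
    tableName: the table name
    startRow: the start row
    columns: the column names, as a list

scannerId = client.scannerOpen('cx', 'row2', ["cf:b", "cf:c"])

scannerOpenTs(tableName, startRow, columns, timestamp): scans the given columns of a table, starting at the given row; for each row, the most recent data older than the given timestamp is fetched. Returns a ScannerID (an int)
    tableName: the table name
    startRow: the start row
    columns: the column names, as a list
    timestamp: the timestamp

scannerId = client.scannerOpenTs('cx', 'row1', ["cf:a", "cf:b", "cf:c"], timestamp=1513579065365)

scannerOpenWithStop(tableName, startRow, stopRow, columns): scans the given columns of a table, starting at the start row and ending at the stop row (the stop row itself is not returned). Returns a ScannerID (an int)
    tableName: the table name
    startRow: the start row
    stopRow: the stop row
    columns: the column names, as a list

scannerId = client.scannerOpenWithStop('cx', 'row1', 'row2', ["cf:b", "cf:c"])

scannerOpenWithStopTs(tableName, startRow, stopRow, columns, timestamp): scans the given columns of a table, starting at the start row and ending at the stop row (the stop row itself is not returned); for each row, the most recent data older than the given timestamp is fetched. Returns a ScannerID (an int)
    tableName: the table name
    startRow: the start row
    stopRow: the stop row
    columns: the column names, as a list
    timestamp: the timestamp

scannerId = client.scannerOpenWithStopTs('cx', 'row1', 'row2', ["cf:a", "cf:b", "cf:c"], timestamp=1513579065365)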

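scannerOpenWithPrefix(tableName, startAndPrefix, columns): scans the given columns of the rows whose keys start with the given prefix. The latest version of each row is fetched. Returns a ScannerID (an int)
    tableName: the table name
    startAndPrefix: the row key prefix
    columns: the column names, as a list

scannerId = client.scannerOpenWithPrefix('cx', 'row', ["cf:b", "cf:c"])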
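
Because HBase orders row keys lexicographically by bytes and the stop row is exclusive, a prefix scan can also be written as a start/stop range: start at the prefix and stop at the first key that sorts after every key with that prefix. The sketch below computes that stop key. It illustrates the idea only and is not a claim about how scannerOpenWithPrefix is implemented; `prefix_stop_row` is a made-up name.

```python
def prefix_stop_row(prefix):
    """Return the smallest row key that sorts after every key starting with `prefix`.

    `prefix` is a bytes object. An empty result means "no upper bound",
    which happens when the prefix is empty or consists only of 0xFF bytes.
    """
    data = bytearray(prefix)
    # Trailing 0xFF bytes cannot be incremented, so drop them first
    while data and data[-1] == 0xFF:
        data.pop()
    if not data:
        return b""
    data[-1] += 1
    return bytes(data)


if __name__ == "__main__":
    print(prefix_stop_row(b"row"))  # b'rox'
```

Under that assumption, scanning from 'row' up to 'rox' with scannerOpenWithStop would cover the same keys as the prefix 'row'.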

scannerGet(id): fetches the next result from the scanner with the given ScannerID; returns a list of hbase.ttypes.TRowResult objects
    id: the ScannerID

scannerId = client.scannerOpen('cx', 'row1', ["cf:b", "cf:c"])
while True:
    result = client.scannerGet(scannerId)
    if not result:  # an empty list means the scanner is exhausted
        break
    print(result)

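scannerGetList(id, nbRows): fetches up to the given number of results from the scanner with the given ScannerID; returns a list of hbase.ttypes.TRowResult objects
    id: the ScannerID
    nbRows: the number of rows to fetch

scannerId = client.scannerOpen('cx', 'row1', ["cf:b", "cf:c"])
result = client.scannerGetList(scannerId, 2)

scannerClose(id): closes the scanner; returns nothing
    id: the ScannerID

client.scannerClose(scannerId)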
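
The fetch loop and the final scannerClose call fit naturally into a generator, which also makes sure the scanner is closed if the caller stops early or an error occurs. This is a minimal sketch; `iter_scanner` is a hypothetical name and the default batch size of 100 is an arbitrary choice.

```python
def iter_scanner(client, scanner_id, batch_size=100):
    """Yield rows from an open scanner, fetching `batch_size` rows per call.

    The scanner is closed when it is exhausted, when the caller stops
    iterating and the generator is closed, or when an error is raised.
    """
    try:
        while True:
            rows = client.scannerGetList(scanner_id, batch_size)
            if not rows:  # an empty list signals the end of the scan
                break
            for row in rows:
                yield row
    finally:
        client.scannerClose(scanner_id)
```

Usage would look like `for row in iter_scanner(client, client.scannerOpen('cx', 'row1', ['cf:b', 'cf:c'])): print(row)`. Fetching in batches reduces the number of round trips compared with calling scannerGet once per row.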

Complete demo

HBase and the Thrift server must be running before you use it

from thrift.transport import TSocket, TTransport
from thrift.protocol import TBinaryProtocol
from hbase import Hbase

# The Thrift server listens on port 9090 by default
socket = TSocket.TSocket('127.0.0.1', 9090)
socket.setTimeout(5000)  # timeout in milliseconds

transport = TTransport.TBufferedTransport(socket)
protocol = TBinaryProtocol.TBinaryProtocol(transport)

client = Hbase.Client(protocol)
socket.open()

# Quick read test; this assumes a table named table1 with column family cf already exists
print(client.get('table1', 'row1', 'cf:a'))

from hbase.ttypes import ColumnDescriptor

alltable = client.getTableNames()  # get all table names
print('All tables:', alltable)
if 'test' in alltable:
    allcf = client.getColumnDescriptors('test')  # get all column families of the table
    print('Column families of test:', allcf)
    allregions = client.getTableRegions('test')  # get all regions associated with the table
    print('Regions of test:', allregions)
else:
    column1 = ColumnDescriptor(name='cf1')  # define a column family
    column3 = ColumnDescriptor(name='cf2')  # define a column family
    client.createTable('test', [column1, column3])  # create the table
    print('Created table test')

# Check whether the table is enabled
if not client.isTableEnabled('test'):
    client.enableTable('test')  # enable the table
    print('Enabled table test')

# ======= Insert / update data =======

from hbase.ttypes import Mutation

mutation = Mutation(column='cf1:a', value='1')

# Insert the data; if column cf1:a of the row already exists in table test, it is overwritten
client.mutateRow('test', 'row1', [mutation])  # apply a series of mutations to one row
client.mutateRowTs('test', 'row2', [mutation], 1513070735669)  # the timestamp can also be set explicitly
print('Inserted data')

from hbase.ttypes import Mutation, BatchMutation
mutation1 = Mutation(column='cf1:b', value='2')
mutation2 = Mutation(column='cf2:a', value='3')
mutation3 = Mutation(column='cf2:b', value='4')
batchMutation = BatchMutation('row3', [mutation1, mutation2, mutation3])
client.mutateRows('test', [batchMutation])  # apply a series of batches (each a series of mutations on one row)
client.mutateRowsTs('test', [batchMutation], timestamp=1513135651874)  # the timestamp can also be set explicitly
print('Inserted data')

result = client.atomicIncrement('test', 'row1', 'cf1:c', 1)  # atomically increment the column once; returns the column's new value
print(result)

# ======= Read data =======

result = client.get('test', 'row1', 'cf1:a')  # a list containing a single hbase.ttypes.TCell object
result = client.getVer('test', 'row1', 'cf1:a', numVersions=2)  # a list of at most 2 hbase.ttypes.TCell objects
result = client.getVerTs('test', 'row1', 'cf1:a', timestamp=0, numVersions=2)  # only versions older than the timestamp
print(result)

# Row key
row = 'row1'

# Column
column = 'cf1:a'

# Query
result = client.getRow('test', row)  # a list; the data of the given row at its latest timestamp
for item in result:  # each item is an hbase.ttypes.TRowResult object
    print('Row key:', item.row)
    print('Value:', item.columns.get(column).value)  # item.columns.get(column) is an hbase.ttypes.TCell object
    print('Timestamp:', item.columns.get(column).timestamp)

# Get the data of the given columns of the given row
result = client.getRowWithColumns('test', 'row1', ['cf1:a', 'cf2:a'])  # data of the given row and columns at the latest timestamp
for item in result:
    print('Row key:', item.row)
    cf1_a = item.columns.get('cf1:a')
    if cf1_a is not None:
        print('cf1:a value:', cf1_a.value)
        print('Timestamp:', cf1_a.timestamp)
    cf2_a = item.columns.get('cf2:a')
    if cf2_a is not None:
        print('cf2:a value:', cf2_a.value)
        print('Timestamp:', cf2_a.timestamp)

result = client.getRowTs('test', 'row1', 1513069831512)  # data of the given row older than this timestamp
print(result)

result = client.getRowWithColumnsTs('test', 'row1', ['cf1:a', 'cf1:b', 'cf2:a'], 1513069831512)  # data of the given row and columns older than this timestamp
print(result)

# ======= Scan data =======

# Each call below opens a new scanner; only the last ScannerID is used afterwards
scannerId = client.scannerOpen('test', 'row1', ["cf1:b", "cf2:a"])  # scan the given columns from the start row to the last row of the table
scannerId = client.scannerOpenTs('test', 'row1', ["cf1:b", "cf2:a"], timestamp=1513579065365)  # same, but only data older than the given timestamp
scannerId = client.scannerOpenWithStop('test', 'row1', 'row2', ["cf1:b", "cf2:a"])  # scan from the start row up to the stop row (the stop row itself is not returned)
scannerId = client.scannerOpenWithStopTs('test', 'row1', 'row2', ["cf1:b", "cf2:a"], timestamp=1513579065365)  # same, but only data older than the given timestamp
scannerId = client.scannerOpenWithPrefix('test', 'row', ["cf1:b", "cf2:a"])  # scan the given columns of rows whose key has the given prefix

while True:
    result = client.scannerGet(scannerId)  # fetch the next result from the scanner
    if not result:
        break
    print(result)

result = client.scannerGetList(scannerId, 2)  # fetch up to 2 results from the scanner

client.scannerClose(scannerId)  # close the scanner

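# ======= Delete data =======

client.deleteAll('test', 'row1', 'cf1:a')  # delete all data in the given column of the given row
client.deleteAllTs('test', 'row1', 'cf2:a', timestamp=1513569725685)  # delete data in the given row and column with a timestamp less than or equal to the given one
client.deleteAllRowTs('test', 'row1', timestamp=1513568619326)  # delete data in the given row with a timestamp less than or equal to the given one
client.deleteAllRow('test', 'row1')  # delete the entire row
if client.isTableEnabled('test'):  # a table can only be disabled while it is enabled
    client.disableTable('test')
    print('Disabled table test')
client.deleteTable('test')  # delete the table; it must exist and be disabled
print('Deleted table test')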
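
A script like this is easier to reuse when the connection is opened and closed in one place. The following is a minimal sketch of a context manager for that, assuming only that the transport object has `open()` and `close()` methods, as the Thrift transports used above do. The name `opened` is invented for this example.

```python
from contextlib import contextmanager


@contextmanager
def opened(transport):
    """Open `transport` on entry and always close it on exit, even after an error."""
    transport.open()
    try:
        yield transport
    finally:
        transport.close()
```

With the objects from the demo, usage would look like `with opened(transport): print(client.getTableNames())`, replacing the explicit `socket.open()` call.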