Importing Data into Hive: Inserting Multiple Rows into a Hive Table Using Python

This article describes how to efficiently insert large amounts of data into a Hive table from Python via Hive's streaming support, with concrete examples, including how to set up and use a Python script for the data transformation.


Hive is a data warehouse designed for querying and aggregating large datasets that reside on HDFS.

The standard INSERT INTO syntax performs poorly because:

Each statement requires a MapReduce job to be executed.

Each statement results in a new file being added to HDFS; over time this leads to very poor performance when reading from the table.
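To make the cost concrete, here is a sketch of the row-at-a-time pattern those points describe, using pyhive (the host and table names are hypothetical, and the connection details are assumptions). Every execute() call compiles and runs as its own Hive job and typically lands another small file on HDFS:

from pyhive import hive

# Hypothetical HiveServer2 endpoint and table.
conn = hive.Connection(host='hiveserver2.example.com', port=10000)
cursor = conn.cursor()

rows = [('alice', 1), ('bob', 2), ('carol', 3)]
for name, value in rows:
    # One Hive job and one new HDFS file per statement.
    cursor.execute(
        'INSERT INTO TABLE demo_table VALUES (%s, %s)', (name, value)
    )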

With that said, there is now a Streaming API for Hive / HCatalog, as detailed here.

I am faced with the need to insert data into Hive at high velocity, using Python. I am aware of the pyhive and pyhs2 libraries, but neither of them appears to make use of the Streaming API.

Has anyone successfully managed to get Python to insert many rows into Hive using the Streaming API, and how was this done?

I look forward to your insights!

Solution

A Hive user can stream a table through a script to transform the data:

ADD FILE replace-nan-with-zeros.py;

SELECT
  TRANSFORM (...)
  USING 'python replace-nan-with-zeros.py'
  AS (...)
FROM some_table;

Here is a simple Python script:

#!/usr/bin/env python
import sys

# Number of leading key columns to pass through unchanged.
kFirstColumns = 7

def main(argv):
    for line in sys.stdin:
        inputs = line.strip().split('\t')
        # Replace NaNs with zeros in every column after the first kFirstColumns.
        outputs = []
        for columnIndex, value in enumerate(inputs, start=1):
            if columnIndex > kFirstColumns:
                value = value.replace('NaN', '0.0')
            outputs.append(value)
        print('\t'.join(outputs))

if __name__ == "__main__":
    main(sys.argv[1:])
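If you want to sanity-check the transform script outside Hive, a minimal harness like the following (a sketch; the column values are made up, and the script is assumed to sit in the current directory) pipes a couple of tab-separated rows through it exactly as Hive would:

import subprocess

# Two fake rows: seven leading key columns, then numeric columns
# containing some NaNs, all tab-separated like Hive's streaming format.
sample_rows = [
    '\t'.join(['k1', 'k2', 'k3', 'k4', 'k5', 'k6', 'k7', '1.5', 'NaN', '2.0']),
    '\t'.join(['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'NaN', 'NaN', '3.5']),
]

result = subprocess.run(
    ['python', 'replace-nan-with-zeros.py'],
    input='\n'.join(sample_rows) + '\n',
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # NaNs after column 7 should now read 0.0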

Hive and Python

Python can be used as a UDF from Hive through the HiveQL TRANSFORM statement. For example, the following HiveQL invokes a Python script stored in the streaming.py file.

Linux-based HDInsight

add file wasb:///streaming.py;

SELECT TRANSFORM (clientid, devicemake, devicemodel)
  USING 'streaming.py' AS
  (clientid string, phoneLabel string, phoneHash string)
FROM hivesampletable
ORDER BY clientid LIMIT 50;

Windows-based HDInsight

add file wasb:///streaming.py;

SELECT TRANSFORM (clientid, devicemake, devicemodel)
  USING 'D:\Python27\python.exe streaming.py' AS
  (clientid string, phoneLabel string, phoneHash string)
FROM hivesampletable
ORDER BY clientid LIMIT 50;

Here's what this example does:

1. The add file statement at the beginning of the file adds the streaming.py file to the distributed cache, so it's accessible by all nodes in the cluster.

2. The SELECT TRANSFORM ... USING statement selects data from hivesampletable and passes clientid, devicemake, and devicemodel to the streaming.py script.

3. The AS clause describes the fields returned from streaming.py.

Here's the streaming.py file used by the HiveQL example.

#!/usr/bin/env python
import sys
import hashlib

while True:
    line = sys.stdin.readline()
    if not line:
        break  # end of input from Hive

    line = line.strip("\n ")
    clientid, devicemake, devicemodel = line.split("\t")

    # Build a readable label and a stable hash of it.
    phone_label = devicemake + ' ' + devicemodel
    phone_hash = hashlib.md5(phone_label.encode('utf-8')).hexdigest()
    print("\t".join([clientid, phone_label, phone_hash]))

Since we are using streaming, this script has to do the following:

1. Read data from STDIN. This is accomplished by using sys.stdin.readline() in this example.

2. The trailing newline character is removed using line.strip("\n "), since we just want the text data and not the end-of-line indicator.

3. When doing stream processing, a single line contains all the values with a tab character between each value, so line.split("\t") can be used to split the input at each tab, returning just the fields.

4. When processing is complete, the output must be written to STDOUT as a single line, with a tab between each field. This is accomplished with print("\t".join([clientid, phone_label, phone_hash])).

5. This all occurs within a while loop that repeats until no line is read, at which point break exits the loop and the script terminates.

Beyond that, the script just concatenates the input values for devicemake and devicemodel and calculates a hash of the concatenated value. Pretty simple, but it describes the basics of how any Python script invoked from Hive should function: loop, read input until there is no more, split each line of input at the tabs, process, and write a single line of tab-delimited output.
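Distilled down, a reusable skeleton for any such TRANSFORM script might look like this (a sketch; process() is a hypothetical placeholder for your per-row logic):

import sys

def process(fields):
    # Hypothetical per-row transformation; replace with real logic.
    return fields

for line in sys.stdin:                      # loop until Hive closes STDIN
    fields = line.rstrip('\n').split('\t')  # one tab-delimited row per line
    print('\t'.join(process(fields)))       # one tab-delimited row to STDOUT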
