
Hive is a data warehouse designed for querying and aggregating large datasets that reside on HDFS.
The standard INSERT INTO syntax performs poorly because:
- Each statement requires a Map/Reduce job to be executed.
- Each statement results in a new file being added to HDFS; over time this leads to very poor performance when reading from the table.
With that said, there is now a Streaming API for Hive / HCatalog, as detailed here.
I am faced with the need to insert data at velocity into Hive, using Python. I am aware of the pyhive and pyhs2 libraries, but neither of them appears to make use of the Streaming API.
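For what it's worth, here is the row-at-a-time pattern I am trying to avoid, as a minimal sketch with pyhive (the host and table names are made up; INSERT ... VALUES needs Hive 0.14 or later):

from pyhive import hive

# Hypothetical connection details - substitute a real HiveServer2 endpoint.
conn = hive.connect(host='hive-server.example.com', port=10000)
cursor = conn.cursor()

# One INSERT per row: every execute() launches its own Map/Reduce job and
# adds another small file to HDFS, which is exactly the problem above.
for name, value in [('a', 1), ('b', 2), ('c', 3)]:
    cursor.execute("INSERT INTO my_table VALUES ('%s', %d)" % (name, value))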
Has anyone successfully managed to get Python to insert many rows into Hive using the Streaming API, and how was this done?
I look forward to your insights!
Solution
Hive can stream a table through a script to transform the data:
ADD FILE replace-nan-with-zeros.py;

SELECT
    TRANSFORM (...)
    USING 'python replace-nan-with-zeros.py'
    AS (...)
FROM some_table;
Here is a simple Python script:
#!/usr/bin/env python
import sys

# Number of leading key columns to pass through untouched.
kFirstColumns = 7

def main(argv):
    for line in sys.stdin:
        line = line.strip()
        inputs = line.split('\t')

        # replace NaNs with zeros in every column after the key columns
        outputs = []
        columnIndex = 1
        for value in inputs:
            newValue = value
            if columnIndex > kFirstColumns:
                newValue = value.replace('NaN', '0.0')
            outputs.append(newValue)
            columnIndex = columnIndex + 1

        # emit the row back to Hive as tab-separated values
        print '\t'.join(outputs)

if __name__ == "__main__":
    main(sys.argv[1:])
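A script like this can be sanity-checked locally before wiring it into TRANSFORM, by piping a tab-separated row into it over stdin just as Hive would. A minimal test harness (the sample row is made up: seven key columns followed by numeric values):

import subprocess

# Build one fake input row: seven leading key columns, then numeric
# columns, one of which contains a NaN.
row = '\t'.join(['k1', 'k2', 'k3', 'k4', 'k5', 'k6', 'k7', '1.5', 'NaN', '2.0'])

# Feed the row to the script over stdin, exactly as Hive's TRANSFORM would.
p = subprocess.Popen(['python', 'replace-nan-with-zeros.py'],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = p.communicate(row + '\n')
print(out)  # the NaN after the seven key columns comes back as 0.0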
Hive and Python
Python can be used as a UDF from Hive through the HiveQL TRANSFORM statement. For example, the following HiveQL invokes a Python script stored in the streaming.py file.
Linux-based HDInsight
add file wasb:///streaming.py;
SELECT TRANSFORM (clientid, devicemake, devicemodel)
USING 'streaming.py' AS
(clientid string, phoneLabel string, phoneHash string)
FROM hivesampletable
ORDER BY clientid LIMIT 50;
Windows-based HDInsight
add file wasb:///streaming.py;
SELECT TRANSFORM (clientid, devicemake, devicemodel)
USING 'D:\Python27\python.exe streaming.py' AS
(clientid string, phoneLabel string, phoneHash string)
FROM hivesampletable
ORDER BY clientid LIMIT 50;
Here's what this example does:
1. The add file statement at the beginning of the file adds the streaming.py file to the distributed cache, so it's accessible by all nodes in the cluster.
2. The SELECT TRANSFORM ... USING statement selects data from hivesampletable and passes clientid, devicemake, and devicemodel to the streaming.py script.
3. The AS clause describes the fields returned from streaming.py.
Here's the streaming.py file used by the HiveQL example.
#!/usr/bin/env python
import sys
import string
import hashlib

while True:
    line = sys.stdin.readline()
    if not line:
        break

    # strip the trailing newline, then split the tab-delimited fields
    line = string.strip(line, "\n ")
    clientid, devicemake, devicemodel = string.split(line, "\t")

    # concatenate make and model, and emit a tab-delimited output row
    # with the client id, the label, and an MD5 hash of the label
    phone_label = devicemake + ' ' + devicemodel
    print "\t".join([clientid, phone_label, hashlib.md5(phone_label).hexdigest()])
Since we are using streaming, this script has to do the following:
1. Read data from STDIN. This is accomplished by using sys.stdin.readline() in this example.
2. The trailing newline character is removed using string.strip(line, "\n "), since we just want the text data and not the end-of-line indicator.
3. When doing stream processing, a single line contains all the values with a tab character between each value, so string.split(line, "\t") can be used to split the input at each tab, returning just the fields.
4. When processing is complete, the output must be written to STDOUT as a single line, with a tab between each field. This is accomplished with print "\t".join([clientid, phone_label, hashlib.md5(phone_label).hexdigest()]).
5. This all occurs within a while loop that repeats until no line is read, at which point break exits the loop and the script terminates.
Beyond that, the script just concatenates the input values for devicemake and devicemodel, and calculates a hash of the concatenated value. Pretty simple, but it describes the basics of how any Python script invoked from Hive should function: loop, read input until there is no more, break each line of input apart at the tabs, process, and write a single line of tab-delimited output.
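That pattern can be lifted into a reusable skeleton; here is a minimal sketch, where process() is a hypothetical placeholder for the per-row logic:

#!/usr/bin/env python
import sys

def process(fields):
    # Placeholder: turn the list of input fields into a list of output
    # fields. Replace with your own per-row logic.
    return fields

# Read rows from Hive on stdin until the stream ends, split each
# tab-delimited line into fields, and emit one tab-delimited output line.
for line in sys.stdin:
    line = line.strip('\n ')
    if not line:
        continue
    print('\t'.join(process(line.split('\t'))))

Swap in your own process() body and the same scaffold serves any TRANSFORM job.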