
Hive is a data warehouse designed for querying and aggregating large datasets that reside on HDFS.
The standard INSERT INTO syntax performs poorly because:
- Each statement requires a Map/Reduce job to be executed.
- Each statement results in a new file being added to HDFS; over time this leads to very poor performance when reading from the table.
With that said, there is now a Streaming API for Hive / HCatalog, as detailed here.
I am faced with the need to insert data at velocity into Hive, using Python. I am aware of the pyhive and pyhs2 libraries, but neither of them appears to make use of the Streaming API.
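For what it's worth, here is the row-at-a-time pattern I am trying to avoid, as a minimal sketch with pyhive (the host and table names are made up; INSERT ... VALUES needs Hive 0.14 or later):

from pyhive import hive

# Hypothetical connection details - substitute a real HiveServer2 endpoint.
conn = hive.connect(host='hive-server.example.com', port=10000)
cursor = conn.cursor()

# One INSERT per row: every execute() launches its own Map/Reduce job and
# adds another small file to HDFS, which is exactly the problem above.
for name, value in [('a', 1), ('b', 2), ('c', 3)]:
    cursor.execute("INSERT INTO my_table VALUES ('%s', %d)" % (name, value))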
Has anyone successfully managed to get Python to insert many rows into Hive using the Streaming API, and how was this done?
I look forward to your insights!
Solution
Hive can stream a table through a script to transform the data:
ADD FILE replace-nan-with-zeros.py;

SELECT
    TRANSFORM (...)
    USING 'python replace-nan-with-zeros.py'
    AS (...)
FROM some_table;
Here is a simple Python script:
#!/usr/bin/env python
import sys

# Number of leading key columns to pass through untouched.
kFirstColumns = 7

def main(argv):
    for line in sys.stdin:
        line = line.strip()
        inputs = line.split('\t')

        # replace NaNs with zeros in every column after the key columns
        outputs = []
        columnIndex = 1
        for value in inputs:
            newValue = value
            if columnIndex > kFirstColumns:
                newValue = value.replace('NaN', '0.0')
            outputs.append(newValue)
            columnIndex = columnIndex + 1

        # emit the row back to Hive as tab-separated values
        print '\t'.join(outputs)

if __name__ == "__main__":
    main(sys.argv[1:])
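A script like this can be sanity-checked locally before wiring it into TRANSFORM, by piping a tab-separated row into it over stdin just as Hive would. A minimal test harness (the sample row is made up: seven key columns followed by numeric values):

import subprocess

# Build one fake input row: seven leading key columns, then numeric
# columns, one of which contains a NaN.
row = '\t'.join(['k1', 'k2', 'k3', 'k4', 'k5', 'k6', 'k7', '1.5', 'NaN', '2.0'])

# Feed the row to the script over stdin, exactly as Hive's TRANSFORM would.
p = subprocess.Popen(['python', 'replace-nan-with-zeros.py'],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)
out, _ = p.communicate(row + '\n')
print(out)  # the NaN after the seven key columns comes back as 0.0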
Hive and Python
Python can be used as a UDF from Hive through the HiveQL TRANSFORM statement. For example, the following HiveQL invokes a Python script stored in the streaming.py file.
Linux-based HDInsight
add file wasb:///streaming.py;
SELECT TRANSFORM (clientid, devicemake, devicemodel)
USING 'streaming.py' AS
(clientid string, phoneLabel string, phoneHash string)
FROM hivesampletable
ORDER BY clientid LIMIT 50;
Windows-based HDInsight
add file wasb:///streaming.py;
SELECT TRANSFORM (clientid, devicemake, devicemodel)
USING 'D:\Python27\python.exe streaming.py' AS
(clientid string, phoneLabel string, phoneHash string)
FROM hivesampletable
ORDER BY clientid LIMIT 50;
Here's what this example does:
1. The add file statement at the beginning of the file adds the streaming.py file to the distributed cache, so it's accessible by all nodes in the cluster.
2. The SELECT TRANSFORM ... USING statement selects data from hivesampletable and passes clientid, devicemake, and devicemodel to the streaming.py script.
3. The AS clause describes the fields returned from streaming.py.
Here's the streaming.py file used by the HiveQL example.
#!/usr/bin/env python
import sys
import string
import hashlib

while True:
    line = sys.stdin.readline()
    if not line:
        break

    # strip the trailing newline, then split the tab-delimited fields
    line = string.strip(line, "\n ")
    clientid, devicemake, devicemodel = string.split(line, "\t")

    # concatenate make and model, and emit a tab-delimited output row
    # with the client id, the label, and an MD5 hash of the label
    phone_label = devicemake + ' ' + devicemodel
    print "\t".join([clientid, phone_label, hashlib.md5(phone_label).hexdigest()])
Since we are using streaming, this script has to do the following:
1. Read data from STDIN. This is accomplished by using sys.stdin.readline() in this example.
2. The trailing newline character is removed using string.strip(line, "\n "), since we just want the text data and not the end-of-line indicator.
3. When doing stream processing, a single line contains all the values with a tab character between each value, so string.split(line, "\t") can be used to split the input at each tab, returning just the fields.
4. When processing is complete, the output must be written to STDOUT as a single line, with a tab between each field. This is accomplished with print "\t".join([clientid, phone_label, hashlib.md5(phone_label).hexdigest()]).
5. This all occurs within a while loop that repeats until no line is read, at which point break exits the loop and the script terminates.
Beyond that, the script just concatenates the input values for devicemake and devicemodel, and calculates a hash of the concatenated value. Pretty simple, but it describes the basics of how any Python script invoked from Hive should function: loop, read input until there is no more, break each line of input apart at the tabs, process, and write a single line of tab-delimited output.
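That pattern can be lifted into a reusable skeleton; here is a minimal sketch, where process() is a hypothetical placeholder for the per-row logic:

#!/usr/bin/env python
import sys

def process(fields):
    # Placeholder: turn the list of input fields into a list of output
    # fields. Replace with your own per-row logic.
    return fields

# Read rows from Hive on stdin until the stream ends, split each
# tab-delimited line into fields, and emit one tab-delimited output line.
for line in sys.stdin:
    line = line.strip('\n ')
    if not line:
        continue
    print('\t'.join(process(line.split('\t'))))

Swap in your own process() body and the same scaffold serves any TRANSFORM job.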