Implementing Hive function and procedure logic in Python


License: CC 4.0 BY-SA

Tags: #hive #python #hadoop

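This article shows how to use Python to handle problems that are awkward to solve in Hive alone: a recursive organization-hierarchy query implemented with the PyHive and pandas libraries, followed by exporting the result to HDFS. It also gives a simple example of checking a condition and then running a Hive SQL statement.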


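Hive SQL cannot express some complex logic, such as recursion, and it lacks support for stored functions, procedures, and procedural blocks, so another program is needed to fill the gap. Python is a good choice: it has many open-source libraries, and it is convenient and fairly easy to pick up.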

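The example below implements a recursive organization lookup in Python and can serve as a reference for Hive development with Python. Two libraries do most of the work: PyHive connects to Hive and fetches and manipulates the data; pandas handles data analysis and processing.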

1. Processing in Python

# Load the packages
from pyhive import hive
import pandas as pd

# Define the connection to Hive
con = hive.connect(host='*.*.*.*', port=10000, auth='KERBEROS', kerberos_service_name="***")

# Create a cursor
cursor = con.cursor()
cursor.execute('use ***')
cursor.execute('select org_id from organization')

# fetchall returns all rows; fetchone returns one row at a time
vdit = cursor.fetchall()

cursor.execute("select org_id,org_name,org_subtype,parent_org_id,org_level,index_code,status_cd from organization")
result = cursor.fetchall()

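# Get the column headers from the cursor description
index = cursor.description
title = list()
for i in range(len(index)):
    # With "select * from", names come back as table.column, so strip the prefix:
    # title.append(index[i][0].split('.')[1])
    # With explicitly listed columns, use the name as-is
    title.append(index[i][0])

vdit_org = pd.DataFrame(result, columns=title)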

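# Create an empty DataFrame to collect the results
rows = pd.DataFrame(columns=['org_id','org_name','org_subtype','parent_org_id','org_level','index_code','status_cd','own_org_id'])

# Recursive function: append the row for v_org_id, then walk up to its parent
def recursive_org(v_org_id):
    global rows
    # Look up the organization
    vidx_tmp = vdit_org[vdit_org.org_id == v_org_id]
    # Re-index by status_cd so the row lines up with vtail, whose index is '1000'.
    # Note that this relies on every organization having status_cd == '1000'.
    vidx = vidx_tmp.set_index('status_cd', drop=False)
    if len(vidx) > 0:
        # Attach the own_org_id column, then append to the result
        row = pd.concat([vidx, vtail], axis=1)
        rows = pd.concat([rows, row], axis=0)
        # Continue with the parent; the recursion stops when no row matches
        p_org_id = vidx.loc['1000', 'parent_org_id']
        recursive_org(p_org_id)

# For each organization, record it and all of its ancestors under own_org_id
for i in vdit:
    v_id = i[0]
    vtail = pd.DataFrame([[i[0]]], columns=['own_org_id'], index=['1000'])
    recursive_org(v_id)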

# Export the data to a local file
rows.to_csv('d_organization.csv', index=False, header=False, sep='|')

# Close the cursor and the database connection
cursor.close()
con.close()
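The core of the script is the walk from each organization up through its parents. As a minimal sketch of that idea only, the version below uses a plain dictionary instead of a DataFrame and a loop instead of recursion. The function name `expand_ancestors`, the `parent_of` mapping, and the sample hierarchy are all hypothetical and not part of the original script; the `seen` set is an added guard against cyclic parent references, which the original does not have.

```python
def expand_ancestors(parent_of):
    """Return (org_id, own_org_id) pairs: each org plus all of its ancestors.

    parent_of maps org_id -> parent_org_id; a parent that is None or not
    present in the mapping ends the chain.
    """
    pairs = []
    for own_id in parent_of:
        seen = set()  # guards against cycles in dirty data
        current = own_id
        while current in parent_of and current not in seen:
            seen.add(current)
            pairs.append((current, own_id))
            current = parent_of[current]
    return pairs


# Hypothetical sample hierarchy: 1 is the root, 2 reports to 1, 3 reports to 2.
sample = {1: None, 2: 1, 3: 2}
pairs = expand_ancestors(sample)
for org_id, own_org_id in pairs:
    print(org_id, own_org_id)
```

Each pair corresponds to one output row of the script: `org_id` is the organization found on the way up, and `own_org_id` is the organization the walk started from. Building a list and converting it to a DataFrame once at the end also avoids calling `pd.concat` for every row.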

2. Upload the data generated by Python to HDFS

hadoop fs -put d_organization.csv hdfs://test/test/test/hive/test/python
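If the upload should run from the same Python script rather than by hand, one option is to call the command through `subprocess`. This is only a sketch: the helper names `build_put_command` and `upload` are made up for illustration, and it assumes the `hadoop` executable is on the `PATH` of the machine running the script.

```python
import subprocess


def build_put_command(local_path, hdfs_dir, overwrite=False):
    """Build the argument list for `hadoop fs -put`."""
    cmd = ["hadoop", "fs", "-put"]
    if overwrite:
        cmd.append("-f")  # -f overwrites the destination file if it exists
    cmd += [local_path, hdfs_dir]
    return cmd


def upload(local_path, hdfs_dir, overwrite=False):
    """Run the upload; raises CalledProcessError if hadoop exits non-zero."""
    subprocess.run(build_put_command(local_path, hdfs_dir, overwrite), check=True)


cmd = build_put_command("d_organization.csv", "hdfs://test/test/test/hive/test/python")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems, and `check=True` makes a failed upload stop the script instead of passing silently.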

3. Create the table in Hive and load the data

CREATE EXTERNAL TABLE if not exists `tmp_1`(
  `org_id` decimal(21,0),
  `org_name` varchar(1000),
  `org_subtype` varchar(80),
  `parent_org_id` decimal(21,0),
  `org_level` decimal(21,0),
  `index_code` varchar(1000),
  `status_cd` varchar(80),
  `own_org_id` decimal(21,0))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim'='|',
  'serialization.format'='|',
  'serialization.null.format'='')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://test/test/test/hive/test/python/tmp_1';

LOAD DATA INPATH 'hdfs://test/test/test/hive/test/python/d_organization.csv' OVERWRITE INTO TABLE tmp_1;
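The table definition only works if the exported file matches it: `|` as the field delimiter, no header row, and an empty string for NULL. The sketch below writes a small, made-up DataFrame to an in-memory buffer with the same `to_csv` options as the script so the resulting lines can be inspected; the sample values are illustrative and not taken from the real `organization` table.

```python
import io

import pandas as pd

# Hypothetical sample rows: one missing name and one name containing the delimiter
df = pd.DataFrame(
    {
        "org_id": [1, 2, 3],
        "org_name": ["HQ", None, "A|B"],
        "own_org_id": [1, 2, 3],
    }
)

buf = io.StringIO()
df.to_csv(buf, index=False, header=False, sep="|")
lines = buf.getvalue().splitlines()
for line in lines:
    print(line)
```

The missing name is written as an empty field, which lines up with `'serialization.null.format'=''`. The third row shows a pitfall: pandas wraps a value that contains the delimiter in double quotes, and LazySimpleSerDe does not strip CSV quoting, so such values are worth cleaning up or escaping before the export.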

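Note: if there is a large amount of data to process, pandas can do the processing; if all you need is a simple check followed by a single statement, PyHive alone is enough.

For example, check whether a table's row count is greater than 0 and, if so, run a statement: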

cursor.execute('select count(1) x from rb_tmp')
result = cursor.fetchall()
# fetchall returns a list of row tuples, so the count is the first column of the first row
cnt = result[0][0]
if cnt > 0:
    cursor.execute('create table yb_tmp as select * from rb_tmp')
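When the same check is needed in several places, it can be wrapped in a small helper. The sketch below is one possible shape: `run_if_nonempty` and `FakeCursor` are hypothetical names, and `FakeCursor` is only a stand-in so the example runs without a Hive cluster. With a real PyHive cursor, the helper relies only on `execute` and `fetchone`.

```python
def run_if_nonempty(cursor, table, statement):
    """Execute `statement` only when `table` has at least one row.

    `table` is interpolated into the SQL text, so it must be a trusted name.
    """
    cursor.execute(f"select count(1) from {table}")
    (count,) = cursor.fetchone()
    if count > 0:
        cursor.execute(statement)
        return True
    return False


class FakeCursor:
    """Stand-in for a PyHive cursor that reports a fixed row count."""

    def __init__(self, count):
        self.count = count
        self.executed = []

    def execute(self, sql):
        self.executed.append(sql)

    def fetchone(self):
        return (self.count,)


cur = FakeCursor(count=3)
ran = run_if_nonempty(cur, "rb_tmp", "create table yb_tmp as select * from rb_tmp")
print(ran, cur.executed)
```

The return value tells the caller whether the statement actually ran, which is useful for logging or for deciding whether later steps should be skipped.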
