In everyday work and study, the functions Hive ships with are often not enough, so we need to write our own. Hive grants users the right to extend it: by extending the relevant class or implementing the relevant interface and writing some code, you can implement whatever functionality you need.
UDF (User-Defined Function): an ordinary user-defined function; it operates on a single row and produces a single value.
UDAF (User-Defined Aggregation Function): a user-defined aggregate function; it operates on multiple rows, like the built-in sum and avg.
UDTF (User-Defined Table-Generating Function): a user-defined table-generating function, for scenarios where one input row should produce multiple output rows.
Below, we walk through writing and using each of the three kinds of custom function.
Preparation
Prepare the data file user.txt:
zs play,sing
ls sleep,eat
mwf study,sleep
hadoop fs -mkdir /external/user/
hadoop fs -put user.txt /external/user/
Create the user table in Hive:
create external table user(name string,hobby string) row format delimited fields terminated by '\t' location '/external/user';
The user table has two fields, name and hobby. We will write code to implement the following:
(1) a UDF that converts the name to uppercase;
(2) a UDTF that splits the hobby field into two columns;
(3) a UDAF that counts the total number of hobbies across all users.
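Since the table is declared with tab-delimited fields, each line of user.txt splits into a name and a comma-separated hobby list. As a quick plain-Java sketch of that structure (just an illustration using the sample rows above, not part of the Hive code):

```java
public class ParseUserLine {
    public static void main(String[] args) {
        // Sample rows from user.txt, tab-delimited as in the table definition
        String[] lines = {"zs\tplay,sing", "ls\tsleep,eat", "mwf\tstudy,sleep"};
        for (String line : lines) {
            String[] fields = line.split("\t");       // fields[0] = name, fields[1] = hobby
            String[] hobbies = fields[1].split(",");  // individual hobbies
            System.out.println(fields[0] + " has " + hobbies.length + " hobbies");
        }
    }
}
```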
Create a Maven project with the following pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.mwf</groupId>
<artifactId>test_udf01</artifactId>
<version>1.0-SNAPSHOT</version>
<dependencies>
<!-- https://mvnrepository.com/artifact/org.apache.hive/hive-exec -->
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>1.2.1</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common -->
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>2.7.3</version>
</dependency>
</dependencies>
</project>
UDF
To write a UDF, extend the UDF class and implement an evaluate() method containing the concrete logic:
import org.apache.hadoop.hive.ql.exec.UDF;

public class SimpleUDF extends UDF {
    public String evaluate(String str) {
        // Pass NULL input through, as Hive's built-in functions do
        return str == null ? null : str.toUpperCase();
    }
}
Package and test
Build the jar, upload it to the server, then in Hive add the jar, register a temporary function, and call it:
add jar test_udf.jar;
create temporary function udf as 'com.mwf.demo.SimpleUDF';
select udf(name) from user;
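With the sample data, the query should map each name to its uppercase form; the heart of evaluate() is just String.toUpperCase(), which can be checked in plain Java:

```java
public class UpperDemo {
    public static void main(String[] args) {
        // The same transformation SimpleUDF.evaluate applies per row
        String[] names = {"zs", "ls", "mwf"};
        for (String name : names) {
            System.out.println(name + " -> " + name.toUpperCase());
        }
        // zs -> ZS, ls -> LS, mwf -> MWF
    }
}
```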
UDTF
To write a UDTF, extend the GenericUDTF class and override the initialize, process, and close methods.
initialize declares the names and types of the output columns.
process is called for each input row; it returns one or more rows by calling forward().
close is called once after process has handled all rows, for any remaining cleanup work.
import java.util.ArrayList;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class SimpleUDTF extends GenericUDTF {
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        // Declare two output columns, both of type string
        ArrayList<String> colNames = new ArrayList<String>();
        colNames.add("hobby1");
        colNames.add("hobby2");
        ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(colNames, fieldOIs);
    }

    @Override
    public void process(Object[] objects) throws HiveException {
        // Split "play,sing" into the two declared columns and emit one row
        forward(objects[0].toString().split(","));
    }

    @Override
    public void close() throws HiveException {
    }
}
Package and test
add jar test_udtf.jar;
create temporary function udtf as 'com.mwf.demo.SimpleUDTF';
select udtf(hobby) from user;
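For a row like play,sing, process() emits a single two-element row that Hive presents as the columns hobby1 and hobby2. In plain Java, the split at the core of it looks like:

```java
public class SplitDemo {
    public static void main(String[] args) {
        // The same split SimpleUDTF.process performs before calling forward()
        String hobby = "play,sing";
        String[] row = hobby.split(",");
        System.out.println("hobby1=" + row[0] + ", hobby2=" + row[1]);
        // hobby1=play, hobby2=sing
    }
}
```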
UDAF
The two articles below explain UDAFs very well and are worth reading first:
https://blog.youkuaiyun.com/zyz_home/article/details/79889519 (this is the one that finally made it click for me)
https://cloud.tencent.com/info/05fa14293c68fe91f7b4670389b8f7e5.html
You need to work with the following two classes:
AbstractGenericUDAFResolver decides, based on the input types, which Evaluator should handle the call.
GenericUDAFEvaluator is where the concrete aggregation logic is implemented.
Before writing code, get familiar with GenericUDAFEvaluator's inner Mode enum and with the ObjectInspector interface; both are covered in the articles above.
The methods a GenericUDAFEvaluator needs to implement:
// Determines the input/output ObjectInspectors for each phase
public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException;
// Returns the object that holds the intermediate aggregation state
abstract AggregationBuffer getNewAggregationBuffer() throws HiveException;
// Resets the aggregation state
public void reset(AggregationBuffer agg) throws HiveException;
// Map phase: iterates over the raw column values passed in from the SQL
public void iterate(AggregationBuffer agg, Object[] parameters) throws HiveException;
// Called at the end of the map/combine phase; returns the partial aggregation result
public Object terminatePartial(AggregationBuffer agg) throws HiveException;
// Merges partial results: the combiner merging map output, or the reducer merging map/combiner output
public void merge(AggregationBuffer agg, Object partial) throws HiveException;
// Reduce phase: returns the final result
public Object terminate(AggregationBuffer agg) throws HiveException;
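To see how these methods fit together, here is a minimal plain-Java simulation of the call sequence for the sample data: iterate per row, terminatePartial at the end of each mapper, merge on the reducer, terminate for the final result. The split of rows across two mappers is an assumption for illustration only:

```java
import java.util.Arrays;
import java.util.List;

public class UdafLifecycleDemo {
    // Stands in for the AggregationBuffer: one running count
    static class Buffer { int total = 0; }

    // iterate: count the hobbies in one row
    static void iterate(Buffer buf, String hobby) {
        buf.total += hobby.split(",").length;
    }

    // terminatePartial: hand back the partial count
    static int terminatePartial(Buffer buf) { return buf.total; }

    // merge: fold a partial count into another buffer
    static void merge(Buffer buf, int partial) { buf.total += partial; }

    // terminate: final result
    static int terminate(Buffer buf) { return buf.total; }

    public static void main(String[] args) {
        // Pretend the three rows are split across two mappers (illustrative only)
        List<String> mapper1 = Arrays.asList("play,sing", "sleep,eat");
        List<String> mapper2 = Arrays.asList("study,sleep");

        Buffer b1 = new Buffer();
        for (String h : mapper1) iterate(b1, h);
        Buffer b2 = new Buffer();
        for (String h : mapper2) iterate(b2, h);

        // Reducer merges the two partial results
        Buffer reducer = new Buffer();
        merge(reducer, terminatePartial(b1));
        merge(reducer, terminatePartial(b2));
        System.out.println(terminate(reducer)); // 6
    }
}
```

Note that all state lives in the buffer objects, never in the evaluator itself; Hive may reuse one evaluator for many aggregation groups.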
The implementation:
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.parse.SemanticException;
import org.apache.hadoop.hive.ql.udf.generic.AbstractGenericUDAFResolver;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.typeinfo.TypeInfo;

public class SimpleUDAF extends AbstractGenericUDAFResolver {
    @Override
    public GenericUDAFEvaluator getEvaluator(TypeInfo[] info) throws SemanticException {
        return new TestEvaluator();
    }

    public static class TestEvaluator extends GenericUDAFEvaluator {
        // Inspector for the raw string column (PARTIAL1/COMPLETE input)
        PrimitiveObjectInspector inputOI;
        // Inspector for the partial integer result (PARTIAL2/FINAL input)
        PrimitiveObjectInspector integerOI;
        ObjectInspector outputOI;

        // Declare the input and output types for each phase
        // (PARTIAL1, PARTIAL2, FINAL, COMPLETE)
        @Override
        public ObjectInspector init(Mode m, ObjectInspector[] parameters) throws HiveException {
            super.init(m, parameters);
            if (m == Mode.PARTIAL1 || m == Mode.COMPLETE) {
                // Input is the original comma-separated hobby string
                inputOI = (PrimitiveObjectInspector) parameters[0];
            } else {
                // Input is a partial count produced by terminatePartial
                integerOI = (PrimitiveObjectInspector) parameters[0];
            }
            // Both the partial and the final result are plain Java ints
            outputOI = PrimitiveObjectInspectorFactory.javaIntObjectInspector;
            return outputOI;
        }

        // Holds the intermediate aggregation state; all per-group
        // state must live here, not in fields of the evaluator
        static class HobbyAggregationBuffer implements AggregationBuffer {
            int tempTotal = 0;
            void add(int num) { tempTotal += num; }
        }

        @Override
        public AggregationBuffer getNewAggregationBuffer() throws HiveException {
            return new HobbyAggregationBuffer();
        }

        @Override
        public void reset(AggregationBuffer aggregationBuffer) throws HiveException {
            // Reset the buffer in place so it can be reused
            ((HobbyAggregationBuffer) aggregationBuffer).tempTotal = 0;
        }

        // Map phase: count the hobbies in each row
        @Override
        public void iterate(AggregationBuffer aggregationBuffer, Object[] objects) throws HiveException {
            if (objects[0] != null) {
                HobbyAggregationBuffer hobbyAgg = (HobbyAggregationBuffer) aggregationBuffer;
                Object p = inputOI.getPrimitiveJavaObject(objects[0]);
                hobbyAgg.add(String.valueOf(p).split(",").length);
            }
        }

        // End of the map/combine phase: return the partial count
        @Override
        public Object terminatePartial(AggregationBuffer aggregationBuffer) throws HiveException {
            return ((HobbyAggregationBuffer) aggregationBuffer).tempTotal;
        }

        // Merge a partial count into the current buffer
        @Override
        public void merge(AggregationBuffer aggregationBuffer, Object o) throws HiveException {
            if (o != null) {
                HobbyAggregationBuffer hobbyAgg = (HobbyAggregationBuffer) aggregationBuffer;
                Integer partialSum = (Integer) integerOI.getPrimitiveJavaObject(o);
                hobbyAgg.add(partialSum);
            }
        }

        // Reduce phase: return the final result
        @Override
        public Object terminate(AggregationBuffer aggregationBuffer) throws HiveException {
            return ((HobbyAggregationBuffer) aggregationBuffer).tempTotal;
        }
    }
}
Package and test
add jar test_udaf.jar;
create temporary function udaf as 'com.mwf.demo.SimpleUDAF';
select udaf(hobby) from user;
The complete code is available at https://github.com/upupfeng/test_udf
Just a porter of knowledge; a little more carrying never hurts.