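----These notes are compiled from 《Hive编程指南》 (the Chinese edition of Programming Hive), Section 13.10, User-Defined Table Generating Functions,
and 《Hadoop海量数据处理:技术详解与项目实战》 by 范东来, Chapter 6, Section 6.7.3, UDTF
I. User-defined table generating function (implementing MyExplode)
1. A table generating function (UDTF) takes zero or more inputs and produces multiple columns or multiple rows of output. The built-in explode(Array a) is one example: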
> select explode(array('a','b'));
a
b
<this is the output>
2. Implement a custom myexplode function that behaves like explode for arrays of strings. The code is as follows:
package com.hive.udtf;
import java.util.ArrayList;
import java.util.List;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.hive.ql.exec.Description;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector.PrimitiveCategory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils;
import org.apache.hadoop.util.StringUtils;
/*
* Implementation of explode: turns one row holding a list into one row per element
*/
@Description(name = "myexplode",
value = "_FUNC_(list) - Returns the splits of a list into row")
public class MyExplode extends GenericUDTF{
static final Log LOG = LogFactory.getLog(MyExplode.class.getName());
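/*
* ObjectInspector (OI) for the input argument, which is a List.
* Note: declaring it as ListObjectInspector is specific enough,
* because the concrete OI that Hive passes in can differ, e.g.:
* select myexplode(array('1','2')); passes a constant List
* select myexplode(names) from data; (names is an array column) passes a lazy List
*/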
ListObjectInspector listOI;
/*
* OI for the list elements, which are of primitive type String.
* Note: the same applies to the elements: a String may be backed by a lazy, Writable, or Java String representation
*/
PrimitiveObjectInspector listElemOI;
@Override
public StructObjectInspector initialize(ObjectInspector[] argOIs)
throws UDFArgumentException {
/*
* 1. Validate the List<String> argument and set up the input OIs
*/
//Exactly one argument is required
if (argOIs.length != 1) {
throw new UDFArgumentException("MyExplode takes only one argument");
}
//The argument must be a List
if (argOIs[0].getCategory() != ObjectInspector.Category.LIST) {
throw new UDFArgumentException("MyExplode takes list as parameter");
}
//Keep the OI of the argument
this.listOI = (ListObjectInspector) argOIs[0];
ObjectInspector listElemOI = listOI.getListElementObjectInspector();
//The list elements must be String: check for the PRIMITIVE category first, then for STRING. Note that the category check comes before the cast
if (!(listElemOI.getCategory() == ObjectInspector.Category.PRIMITIVE
&& ((PrimitiveObjectInspector)listElemOI)
.getPrimitiveCategory() == PrimitiveCategory.STRING)) {
throw new UDFArgumentException("MyExplode takes string as list element");
}
//Keep the OI of the list elements
this.listElemOI = (PrimitiveObjectInspector) listElemOI;
/*
* 2. Set up the OI for the output
*/
//The output is a Struct OI with a single field
List<String> fnames = new ArrayList<String>();
fnames.add("col1");
List<ObjectInspector> fois = new ArrayList<ObjectInspector>();
//Note: a Java String OI is used here because process() forwards a String[] result
fois.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
return ObjectInspectorFactory.getStandardStructObjectInspector(
fnames, fois);
}
boolean warned = false;
@Override
public void process(Object[] args)
throws HiveException {
assert (args.length == 1);
//Extract the list elements
String[] result = new String[1];//A row for a standard Struct OI is passed as a Java array
Object p = null;
int len = listOI.getListLength(args[0]);
for (int i = 0; i < len; i++) {
//Fetch element i of args[0] through the OI
p = listOI.getListElement(args[0], i);
if (p != null) {
try {
//A conversion failure throws an exception; see the comments on getString()
result[0] = PrimitiveObjectInspectorUtils
.getString(p, listElemOI);
//System.out.println("value : " + result[0]);//prints to the console
//Emit the row
forward(result);
} catch (RuntimeException e) {
if (!warned) {
//Log the exception only once to avoid repeating the same message
warned = true;
LOG.warn(getClass().getSimpleName() + " "
+ StringUtils.stringifyException(e));
LOG.warn(getClass().getSimpleName()
+ " ignoring similar exceptions.");
}
continue;
}
}
}
}
@Override
public void close() throws HiveException {
}
}
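The row-emitting contract of process() can be tried without a Hive installation. The following is only a minimal sketch in plain Java, not Hive code: the class ExplodeSketch and its methods are hypothetical names, and a java.util.function.Consumer stands in for forward(). It shows one input list producing one single-column row per non-null element.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Hive-free model of the explode contract; every name here is hypothetical. */
final class ExplodeSketch {

    /**
     * Mirrors the shape of MyExplode.process(): emits one single-column row
     * per non-null list element. The collector stands in for forward().
     */
    static void process(List<String> input, Consumer<Object[]> collector) {
        for (String element : input) {
            if (element != null) {
                // A fresh array per row, because this collector keeps references.
                collector.accept(new Object[] { element });
            }
        }
    }

    /** Convenience wrapper that gathers the emitted rows into a list. */
    static List<Object[]> explode(List<String> input) {
        List<Object[]> rows = new ArrayList<>();
        process(input, rows::add);
        return rows;
    }

    public static void main(String[] args) {
        // Prints "a" and "b" on separate lines; the null element is skipped.
        for (Object[] row : explode(Arrays.asList("a", null, "b"))) {
            System.out.println(row[0]);
        }
    }
}
```

Unlike MyExplode, which reuses one String[] across calls to forward(), this sketch allocates a new array per row because the rows are stored in a list.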
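Note 1: initialize() must return a Struct OI instance. Each field of the Struct becomes one output column, so a table generating function can emit anything from a single column (one field in the Struct) to several columns (several fields in the Struct).
Note 2: Use org.apache.commons.logging.Log to log the places where an exception may occur at run time, which makes troubleshooting easier. For example: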
static final Log LOG = LogFactory.getLog(MyExplode.class.getName());
//e.g. how process() handles a possible exception while converting the Object data
Object p = listOI.getListElement(args[0], i);
if (p != null) {
try {
//A conversion failure throws an exception; see the comments on getString()
result[0] = PrimitiveObjectInspectorUtils
.getString(p, listElemOI);
forward(result);
} catch (RuntimeException e) {
if (!warned) {
//Log the exception only once to avoid repeating the same message
warned = true;
LOG.warn(getClass().getSimpleName() + " "
+ StringUtils.stringifyException(e));
LOG.warn(getClass().getSimpleName()
+ " ignoring similar exceptions.");
}
continue;
}
}
--Location of the Hive log (local), here for a session run as user root:
$ /tmp/root/hive.log (today's log)
$ /tmp/root/hive.log.2019-05-03 (the previous day's log)
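The warn-once pattern from Note 2 can be checked in isolation. This is a rough sketch in plain Java under stated assumptions: WarnOnceSketch is a hypothetical class, Integer.parseInt stands in for a conversion that can fail, and two in-memory lists stand in for LOG.warn() and forward(). It is not the Hive API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hive-free model of the warn-once pattern; every name here is hypothetical. */
final class WarnOnceSketch {
    private boolean warned = false;

    /** Stands in for LOG.warn(): collects the messages that would be logged. */
    final List<String> warnings = new ArrayList<>();

    /** Stands in for forward(): collects the successfully converted values. */
    final List<Integer> forwarded = new ArrayList<>();

    void process(List<String> elements) {
        for (String element : elements) {
            if (element == null) {
                continue; // null elements are skipped, as in MyExplode
            }
            try {
                // Assumed conversion step that may throw on bad input.
                forwarded.add(Integer.parseInt(element));
            } catch (NumberFormatException e) {
                if (!warned) {
                    // Only the first failure is reported; later ones are silent.
                    warned = true;
                    warnings.add(getClass().getSimpleName() + " " + e);
                    warnings.add(getClass().getSimpleName()
                            + " ignoring similar exceptions.");
                }
            }
        }
    }

    public static void main(String[] args) {
        WarnOnceSketch sketch = new WarnOnceSketch();
        sketch.process(Arrays.asList("1", "x", "2", "y"));
        System.out.println(sketch.forwarded);       // the two valid values
        System.out.println(sketch.warnings.size()); // two messages, from the first failure only
    }
}
```

Because the flag is an instance field, it stays set across calls to process() on the same object, so a long job writes at most one pair of warnings per function instance.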
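Note 3: Multi-column output also makes it possible to parse a URL column of a table into several fields. In the example below, parse_url stands for a table generating function that returns (host, path); Hive's built-in parse_url is a scalar UDF, and the built-in UDTF for this purpose is parse_url_tuple. For example: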
> select parse_url(url) as (host, path) from test;
google.com /index.html
hotmail.com /a/links.html
<this is the multi-column, multi-row output>
> from (
> select parse_url(url) as (host, path) from test ) a
> select a.path
> where a.host = 'google.com';
/index.html
<this is the output of querying the parsed result as a temporary table>
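The two-column idea in Note 3 can be sketched in plain Java. The code below is an illustration only: ParseUrlSketch, toRow, and pathsForHost are hypothetical names, the input values are assumed to be full URLs with a scheme (the notes do not show the contents of the url column), and java.net.URI does the parsing. Each row is an Object[] of two fields, matching a Struct with the fields host and path.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hive-free model of a two-column (host, path) UDTF; names are hypothetical. */
final class ParseUrlSketch {

    /** Builds one output row with two fields: host and path. */
    static Object[] toRow(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on malformed input
        return new Object[] { uri.getHost(), uri.getPath() };
    }

    /** Mirrors the subquery: parse every URL, then keep paths for one host. */
    static List<String> pathsForHost(List<String> urls, String host) {
        List<String> paths = new ArrayList<>();
        for (String url : urls) {
            Object[] row = toRow(url);
            if (host.equals(row[0])) {
                paths.add((String) row[1]);
            }
        }
        return paths;
    }

    public static void main(String[] args) {
        // Assumed sample data, chosen to match the hosts shown in the notes.
        List<String> urls = Arrays.asList(
                "http://google.com/index.html",
                "http://hotmail.com/a/links.html");
        for (String url : urls) {
            Object[] row = toRow(url);
            System.out.println(row[0] + "\t" + row[1]);
        }
        System.out.println(pathsForHost(urls, "google.com"));
    }
}
```

In a real GenericUDTF the same two fields would be declared in initialize() by adding two names and two ObjectInspectors to the Struct OI, and process() would forward a two-element array.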