Hive: Creating User-Defined Functions (UDFs)

-柚子皮-

Last modified: 2023-09-21 17:55:02

License: CC 4.0 BY-SA

Category: Database. Tags: hive

First published: 2023-04-10 22:24:22

Article link: https://blog.youkuaiyun.com/pipisorry/article/details/130071422


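Apache Hive offers two interfaces for writing user-defined functions (UDFs), one simple and one more involved:
Simple API: org.apache.hadoop.hive.ql.exec.UDF. It infers the parameter and return types through reflection, which makes development easy but gives you little control.
Complex API: org.apache.hadoop.hive.ql.udf.generic.GenericUDF. You specify the types and the implicit type-conversion logic in code.

If your function reads and returns only basic data types (the basic Hadoop and Hive writable types such as Text, IntWritable, LongWritable, DoubleWritable, and so on), the simple UDF API is sufficient.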

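Advantages:
- Simple to implement
- Supports Hive primitive types, arrays, and Maps
- Supports function overloading
Disadvantages:
- Only suitable for functions with simple logic

This approach needs little code, keeps the logic clear, and lets you implement a simple UDF quickly.

However, if you want to write a UDF that operates on nested data structures such as Map, List, and Set, you need to become familiar with the GenericUDF API.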

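Advantages:
- Supports any number of arguments of any type
- Can run different logic depending on the number and types of the arguments
- Can implement logic for initializing and releasing resources (initialize, close)
Disadvantages:
- Somewhat more complex to implement than extending UDF

Compared with extending UDF, GenericUDF is more flexible and can implement more complex functions.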

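Choosing between the two

If the function has the following characteristics, prefer extending the UDF class:

- The logic is simple, for example a function that lowercases English text
- The parameter and return types are simple: Hive primitive types, arrays, or Maps
- There is no need to initialize or release resources

Otherwise, consider extending the GenericUDF class.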

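Simple UDF API

Building a UDF only requires writing a class that extends UDF and implements one method (evaluate). Here is an example:
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

class SimpleUDFExample extends UDF {

    // Returns null for null input, otherwise the input prefixed with "Hello "
    public Text evaluate(Text input) {
        if (input == null) return null;
        return new Text("Hello " + input.toString());
    }
}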

Example 2:
import org.apache.hadoop.hive.ql.exec.UDF;

/**
 * Extends org.apache.hadoop.hive.ql.exec.UDF
 */
public class SimpleUDF extends UDF {

    /**
     * Write a method that meets these requirements:
     * 1. The method must be named evaluate
     * 2. Parameter and return types may be Java primitives, Java wrapper classes, org.apache.hadoop.io.Writable types, List, or Map
     * 3. The method must return a value; it cannot be void
     */
    public int evaluate(int a, int b) {
        return a + b;
    }

    /**
     * Function overloading is supported
     */
    public Integer evaluate(Integer a, Integer b, Integer c) {
        if (a == null || b == null || c == null)
            return 0;

        return a + b + c;
    }
}
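To make "types inferred through reflection" and "function overloading" more concrete, here is a minimal plain-Java sketch of how an evaluate overload could be picked by looking at the argument classes. It is only an illustration: AddSketch, ReflectionDispatchSketch, resolve, and call are hypothetical names, the class does not extend Hive's UDF, and Hive's real resolver additionally handles type conversion and null arguments.

```java
import java.lang.reflect.Method;
import java.util.Arrays;

final class ReflectionDispatchSketch {

    // Find a public method named "evaluate" whose parameter types match exactly.
    static Method resolve(Class<?> udfClass, Class<?>... argTypes) {
        for (Method m : udfClass.getMethods()) {
            if (m.getName().equals("evaluate")
                    && Arrays.equals(m.getParameterTypes(), argTypes)) {
                return m;
            }
        }
        throw new IllegalArgumentException("no evaluate" + Arrays.toString(argTypes));
    }

    // Resolve by the runtime classes of the (non-null) arguments, then invoke reflectively.
    static Object call(Object udf, Object... args) {
        Class<?>[] types = new Class<?>[args.length];
        for (int i = 0; i < args.length; i++) {
            types[i] = args[i].getClass();
        }
        try {
            return resolve(udf.getClass(), types).invoke(udf, args);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        AddSketch udf = new AddSketch();
        System.out.println(call(udf, 1, 2));     // picks the two-argument overload
        System.out.println(call(udf, 1, 2, 3));  // picks the three-argument overload
    }
}

// Hypothetical stand-in for a simple UDF; it does not extend Hive's UDF class.
class AddSketch {
    public Integer evaluate(Integer a, Integer b) {
        return (a == null || b == null) ? null : a + b;
    }

    public Integer evaluate(Integer a, Integer b, Integer c) {
        return (a == null || b == null || c == null) ? null : a + b + c;
    }
}
```

The point of the sketch is that nothing in the class declares its Hive types explicitly: the method signatures are the only type information, which is why the simple API is convenient but offers little control.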

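Supported parameter and return types

Hive primitive types, arrays, and Maps are supported.

Hive primitive types

On the Java side you can use Java primitive types, Java wrapper classes, or the corresponding Writable classes.

PS: for primitive types, it is best not to use Java primitives. When null is passed to a parameter declared as a Java primitive, the UDF throws an error. Java wrapper classes also allow you to check for null.

Hive type	Java primitive	Java wrapper class	hadoop.io.Writable
tinyint	byte	Byte	ByteWritable
smallint	short	Short	ShortWritable
int	int	Integer	IntWritable
bigint	long	Long	LongWritable
string	String	-	Text
boolean	boolean	Boolean	BooleanWritable
float	float	Float	FloatWritable
double	double	Double	DoubleWritable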
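The note above about null and Java primitive parameters can be reproduced in plain Java. The sketch below is an assumption-laden illustration, not Hive code: NullArgumentSketch, addPrimitive, addBoxed, and tryInvoke are hypothetical names, and it only shows that a reflective call cannot pass null into an int parameter, while a wrapper-typed parameter accepts it.

```java
import java.lang.reflect.Method;

final class NullArgumentSketch {

    // Hypothetical evaluate-style methods; not tied to Hive's UDF class.
    public static int addPrimitive(int a, int b) {
        return a + b;
    }

    public static Integer addBoxed(Integer a, Integer b) {
        return (a == null || b == null) ? null : a + b;
    }

    // Invoke a static two-argument method reflectively and report whether the call was accepted.
    static String tryInvoke(String name, Class<?> paramType, Object a, Object b) {
        try {
            Method m = NullArgumentSketch.class.getMethod(name, paramType, paramType);
            return "ok: " + m.invoke(null, a, b);
        } catch (IllegalArgumentException e) {
            // Reflection refuses to unbox null into a primitive parameter.
            return "rejected: null cannot be unboxed to a primitive";
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(tryInvoke("addPrimitive", int.class, 1, null));
        System.out.println(tryInvoke("addBoxed", Integer.class, 1, null));
    }
}
```

With wrapper types the method itself decides what a null input means, which is usually what you want for SQL NULL handling.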

Arrays and Maps

Hive type	Java type
array	List
Map<K, V>	Map<K, V>

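Complex GenericUDF API

The GenericUDF API provides a way to handle objects that are not writable types, such as struct, map, and array. With this API you manage the object inspectors for the function's arguments yourself and validate the number and types of the arguments received. An object inspector provides a uniform interface over an underlying data type, so that objects with different implementations can be accessed in a consistent way inside Hive (for example, you can implement a composite object such as a Map however you like, as long as you supply a matching object inspector).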
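The idea of an object inspector as a uniform interface over different storage formats can be sketched without any Hive classes. Everything named here (InspectorSketch, StringInspector, JAVA_STRING, UTF8_BYTES, contains) is hypothetical and far simpler than Hive's real ObjectInspector hierarchy; the sketch only assumes that the same logical string may be stored either as a java.lang.String or as UTF-8 bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

final class InspectorSketch {

    // Hypothetical, much-reduced analogue of a string object inspector.
    interface StringInspector {
        String get(Object raw);
    }

    // Values stored as plain java.lang.String.
    static final StringInspector JAVA_STRING = raw -> (String) raw;

    // Values stored as UTF-8 bytes, standing in for a Writable such as Text.
    static final StringInspector UTF8_BYTES =
            raw -> raw == null ? null : new String((byte[]) raw, StandardCharsets.UTF_8);

    // The lookup never touches the storage format; it only goes through the inspectors.
    static boolean contains(List<?> items, StringInspector itemInspector,
                            Object needle, StringInspector needleInspector) {
        String target = needleInspector.get(needle);
        for (Object item : items) {
            if (target != null && target.equals(itemInspector.get(item))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<byte[]> stored = Arrays.asList(
                "a".getBytes(StandardCharsets.UTF_8),
                "b".getBytes(StandardCharsets.UTF_8));
        System.out.println(contains(stored, UTF8_BYTES, "b", JAVA_STRING));
        System.out.println(contains(Arrays.asList("a", "b"), JAVA_STRING, "c", JAVA_STRING));
    }
}
```

The same contains logic works for both representations because the caller supplies the inspector that knows how to read each one, which is the role object inspectors play for a GenericUDF.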

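The GenericUDF API requires you to implement the following methods:
// Similar to the evaluate method of the simple API: it reads the input data and returns the result
abstract Object evaluate(GenericUDF.DeferredObject[] arguments) throws HiveException;

// Not critical: any string can be returned, but it should describe the function
abstract String getDisplayString(String[] children);

// Called once, before any evaluate() call. It receives an array of object inspectors describing the types of the function's input arguments.
// This is where you validate that the function receives the correct number and types of arguments
abstract ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException;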

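Example

import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDF;
import org.apache.hadoop.hive.serde2.objectinspector.ListObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.StringObjectInspector;

class ComplexUDFExample extends GenericUDF {

  ListObjectInspector listOI;
  StringObjectInspector elementOI;

  @Override
  public String getDisplayString(String[] arg0) {
    return this.getClass().getName();
  }

  @Override
  public ObjectInspector initialize(ObjectInspector[] arguments) throws UDFArgumentException {
    // Keep the inspectors for evaluate(); the argument checks described in the initialize() section belong before these casts
    this.listOI = (ListObjectInspector) arguments[0];
    this.elementOI = (StringObjectInspector) arguments[1];

    // The return type is boolean, so we return the matching object inspector
    return PrimitiveObjectInspectorFactory.javaBooleanObjectInspector;
  }

  @Override
  public Object evaluate(DeferredObject[] arguments) throws HiveException {

    // Use the object inspectors to extract the list and the string from the arguments
    List<?> list = this.listOI.getList(arguments[0].get());
    String arg = elementOI.getPrimitiveJavaObject(arguments[1].get());

    // Check for null values
    if (list == null || arg == null) {
      return null;
    }

    // List elements may be Text or lazy objects rather than String, so read them through the element inspector
    StringObjectInspector itemOI = (StringObjectInspector) listOI.getListElementObjectInspector();

    // Check whether the list contains the target value
    for (Object item : list) {
      if (arg.equals(itemOI.getPrimitiveJavaObject(item))) return Boolean.TRUE;
    }
    return Boolean.FALSE;
  }

}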

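initialize()

It must return an ObjectInspector instance that describes the return type of the custom UDF. The return value of initialize() determines the type that evaluate() must return.
If initialize() declares a Writable return type, evaluate() should return an instance of the corresponding Writable (for example, return new LongWritable(items_id_long);).
If initialize() declares a Java wrapper return type, evaluate() should return an instance of the corresponding Java wrapper class (for example, simply return items_id_long;).

Otherwise you may get an error like this: java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.hadoop.io.LongWritable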
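The mismatch behind that ClassCastException can be illustrated in plain Java. This is a sketch under stated assumptions, not Hive code: LongBox is a hypothetical stand-in for LongWritable, and readAsWritable is a hypothetical consumer that trusts the declared "writable" return type and casts to it, which is roughly the situation when the inspector returned by initialize() and the object returned by evaluate() disagree.

```java
final class ReturnTypeMismatchSketch {

    // Hypothetical stand-in for org.apache.hadoop.io.LongWritable.
    static final class LongBox {
        final long value;

        LongBox(long value) {
            this.value = value;
        }
    }

    // Hypothetical reader that trusts the declared "writable" return type and casts to it.
    static long readAsWritable(Object returned) {
        return ((LongBox) returned).value;
    }

    public static void main(String[] args) {
        // The returned object matches the declared type: the read succeeds.
        System.out.println(readAsWritable(new LongBox(42L)));

        // A Java wrapper is returned although a writable was declared: the cast fails.
        try {
            readAsWritable(Long.valueOf(42L));
        } catch (ClassCastException e) {
            System.out.println("ClassCastException: declared and returned types disagree");
        }
    }
}
```

The fix is always on the UDF side: either change the inspector returned by initialize() or change the object returned by evaluate() so that the two agree.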

For primitive types (byte, short, int, long, float, double, boolean, string), the inspector is available directly as a static field of PrimitiveObjectInspectorFactory.

Hive type	Writable inspector	Java wrapper inspector
tinyint	writableByteObjectInspector	javaByteObjectInspector
smallint	writableShortObjectInspector	javaShortObjectInspector
int	writableIntObjectInspector	javaIntObjectInspector
bigint	writableLongObjectInspector	javaLongObjectInspector
string	writableStringObjectInspector	javaStringObjectInspector
boolean	writableBooleanObjectInspector	javaBooleanObjectInspector
float	writableFloatObjectInspector	javaFloatObjectInspector
double	writableDoubleObjectInspector	javaDoubleObjectInspector

For complex types such as Array and Map<K, V>, the inspector is obtained through static methods of ObjectInspectorFactory.

Hive type	ObjectInspectorFactory static method	evaluate() return type
Array	getStandardListObjectInspector(ObjectInspector listElementObjectInspector)	List
Map<K, V>	getStandardMapObjectInspector(ObjectInspector mapKeyObjectInspector, ObjectInspector mapValueObjectInspector)	Map<K, V>

Example: the initialize() return value for a result of type Map<String, Integer>:
return ObjectInspectorFactory.getStandardMapObjectInspector(
    PrimitiveObjectInspectorFactory.javaStringObjectInspector, // the key is a String
    PrimitiveObjectInspectorFactory.javaIntObjectInspector // the value is an Integer
);

initialize() can include validation statements such as these:
if (arguments.length != 2) {
    throw new UDFArgumentLengthException("arrayContainsExample only takes 2 arguments: List<T>, T");
}
// 1. Check that the arguments have the expected types
ObjectInspector a = arguments[0];
ObjectInspector b = arguments[1];
if (!(a instanceof ListObjectInspector) || !(b instanceof StringObjectInspector)) {
    throw new UDFArgumentException("first argument must be a list / array, second argument must be a string");
}
this.listOI = (ListObjectInspector) a;
this.elementOI = (StringObjectInspector) b;

// 2. Check that every element of the list is a string
if (!(listOI.getListElementObjectInspector() instanceof StringObjectInspector)) {
    throw new UDFArgumentException("first argument must be a list of strings");
}

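evaluate()

The core method, which holds the logic of the custom UDF.
The implementation can be divided into three parts:
- Receive the arguments
- Run the core logic of the UDF
- Return the result

For Hive primitive types, the values passed in are Writable types:

Hive type	Java type
tinyint	ByteWritable
smallint	ShortWritable
int	IntWritable
bigint	LongWritable
string	Text
boolean	BooleanWritable
float	FloatWritable
double	DoubleWritable
Array	ArrayList
Map<K, V>	HashMap<K, V>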

Reading an external file in a UDF

Template example

import com.google.common.collect.Sets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.orc.OrcFile;
import org.apache.hadoop.hive.ql.io.orc.Reader;
import org.apache.hadoop.hive.ql.io.orc.RecordReader;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.io.LongWritable;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;


public class ABCD extends GenericUDTF {
    private static Logger logger = LoggerFactory.getLogger(ABCD.class);

    private String filePath = "000000_0";
    private static Map<Long, Set<Long>> c2cTable = new HashMap<>();
    private volatile Boolean initialized = false;

    @Override
    public void process(Object[] args) throws HiveException {
        try {
            if (!initialized) {
                c2cTable = readOrcFileAsMap(this.filePath);
                logger.info("totalCateCnt of c2cTable => " + c2cTable.size());
                initialized = true;
            }
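        } catch (IOException e) {
            // Log the cause as well, otherwise a failed load is hard to diagnose
            logger.error("----- c2cTable initialized error -----", e);
        }
    }

    // close() is abstract in GenericUDTF and must be implemented.
    // A complete UDTF also overrides initialize() to declare its output columns; this template omits it.
    @Override
    public void close() throws HiveException {
        // Nothing to release in this template
    }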

    public Map<Long, Set<Long>> readOrcFileAsMap(String fileDir) throws IOException {
        Map<Long, Set<Long>> dataMap = new HashMap<>();

        Configuration conf = new Configuration();
        URL url = Thread.currentThread().getContextClassLoader().getResource(fileDir);
        Path filePath = new Path(String.valueOf(url));
        Reader reader = OrcFile.createReader(FileSystem.getLocal(conf), filePath);
        StructObjectInspector inspector = (StructObjectInspector) reader.getObjectInspector();

        RecordReader records = reader.rows();
        Object row = null;
        while (records.hasNext()) {
            row = records.next(row);
            List valueList = inspector.getStructFieldsDataAsList(row);
            long id = ((LongWritable) valueList.get(0)).get();
            List<LongWritable> id2s = (List<LongWritable>) valueList.get(1);
            Set<Long> id2Set = Sets.newHashSet();
            for (LongWritable val : id2s) {
                id2Set.add(val.get());
            }
            dataMap.put(id, id2Set);
        }
        return dataMap;
    }
}
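In the template above, the lookup table is a static field while the initialized flag is an instance field, so the load-once behavior depends on how instances are created. One possible alternative is sketched below in plain Java: a small holder that runs a loader at most once, on first use. LazyTableSketch and its loader parameter are hypothetical; in a real UDF the loader would be something like the readOrcFileAsMap call from the template.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class LazyTableSketch {

    private final Supplier<Map<Long, Set<Long>>> loader;
    private volatile Map<Long, Set<Long>> table;   // null until the first lookup

    LazyTableSketch(Supplier<Map<Long, Set<Long>>> loader) {
        this.loader = loader;
    }

    // Double-checked locking: the loader runs at most once, on first use.
    Map<Long, Set<Long>> table() {
        Map<Long, Set<Long>> local = table;
        if (local == null) {
            synchronized (this) {
                local = table;
                if (local == null) {
                    local = Collections.unmodifiableMap(loader.get());
                    table = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        LazyTableSketch lazy = new LazyTableSketch(() -> {
            loads.incrementAndGet();               // stands in for reading the ORC file
            Map<Long, Set<Long>> m = new HashMap<>();
            m.put(1L, new HashSet<>(Collections.singleton(10L)));
            return m;
        });
        lazy.table();
        lazy.table();
        System.out.println("loads = " + loads.get() + ", size = " + lazy.table().size());
    }
}
```

Keeping the flag and the data in one volatile reference avoids the case where the flag says "initialized" but the table was never assigned because the load failed.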

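Note:
When using add file, the query fails with an error such as "failed to open file: ./***" if no MapReduce job is generated.
For example, calling udf('helloworld') directly in a test makes Hive evaluate the UDF at SQL compile time. The file is not available locally at that point, and no map or reduce task ever runs.
Solutions:
1. For a simple query that generates no task, add a group by to turn it into an aggregate query, or add set hive.fetch.task.conversion=none; for example:
set hive.fetch.task.conversion=none;
set mapred.cache.archives=hdfs://path/offline-dict.tar#data;
2. Use udf(txt) from (a union all of more than one txt), so that an MR job is generated normally.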

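Importing and using the UDF

Once the UDF is written, you can package it as a jar yourself and upload it to an HDFS directory you have permission for.
Note: do not bundle the Hive code or its dependencies (JSON libraries, kryo, and so on) into the jar. Doing so causes version conflicts and makes queries fail.
Alternatively, push the code to a UDF repository and let the deployment platform build the package automatically from the relevant branch.
Then, in your Hive SQL, add the jar and create the function before calling it.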

add file hdfs://path/表名map_text/pt=${YYYYMMDD}/000000_0;
ADD JAR target/hive-extensions-1.0-SNAPSHOT-jar-with-dependencies.jar;
CREATE TEMPORARY FUNCTION helloworld as 'com.example.SimpleUDFExample';
select helloworld(name) from people limit 1000;

ref: UDF Development Manual - UDF - Juejin (掘金)

Hive UDF Development Guide, kent7306's blog, 优快云

Hadoop Hive UDF Tutorial - Extending Hive with Custom Functions

HivePlugins - Apache Hive - Apache Software Foundation