Hive custom UDAF: computing maximum drawdown, optimized to O(n)

Article link: https://blog.youkuaiyun.com/liupinyang/article/details/120371107


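1.1 Requirement

Given a product's net value series, compute the maximum drawdown of the net value over a time interval, i.e. max((Di-Dj)/Di) with Di>=Dj, where i and j are dates, Di and Dj are the net values on those dates, and j>=i.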

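Taking September 1 to September 5 as an example, and writing D9.1 for the net value on September 1, the result is max[(D9.1-D9.1)/D9.1, (D9.1-D9.2)/D9.1, ..., (D9.4-D9.5)/D9.4, (D9.5-D9.5)/D9.5].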

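1.2 Analysis

The requirement is fairly clear: put the incoming values into a list, compute the drawdown ratio for each pair of dates, and take the maximum.

The definition requires Di>=Dj. If Di<Dj the ratio is negative, so this case needs handling: when Di-Dj<0, the value is set to 0.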
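The analysis above can be sketched outside Hive as a plain Java method. This is a minimal illustration rather than the UDAF itself: the class name `BruteForceDrawdown` and the method `maxDrawdown` are hypothetical, and the input is assumed to be already sorted by date. The sample values are the prd_id 1001 rows from the test data in section 1.3.

```java
// Hypothetical standalone helper for illustration; not part of the UDAF.
class BruteForceDrawdown {
    /**
     * Returns max((Di - Dj) / Di) over all pairs i < j, clamped at 0.
     * Assumes navs is ordered by date and contains positive values.
     */
    static double maxDrawdown(double[] navs) {
        // Starting at 0 means negative ratios (Di < Dj) never win
        double max = 0.0;
        for (int i = 0; i < navs.length; i++) {
            for (int j = i + 1; j < navs.length; j++) {
                double ratio = (navs[i] - navs[j]) / navs[i];
                if (ratio > max) {
                    max = ratio;
                }
            }
        }
        return max;
    }

    public static void main(String[] args) {
        double[] navs = {1.432, 1.232, 1.332, 1.123, 1.532};
        // Largest drop is from 1.432 down to 1.123
        System.out.println(maxDrawdown(navs));
    }
}
```

Starting the running maximum at 0 covers both special cases at once: a series with fewer than two values and a series that never falls.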

1.3 Test data

The function does not sort internally, so the data must be sorted by date before it is passed in.
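Because the drawdown depends on the order of the values, the same set of numbers in a different order can give a different result. The sketch below shows one way to put (date, value) pairs into date order before computing anything. The names `NavSorter`, `NavPoint`, and `valuesInDateOrder` are hypothetical, and dates are assumed to be ISO `yyyy-MM-dd` strings like those in the test data.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration; not part of the UDAF.
class NavSorter {
    /** One (date, net value) observation. */
    static class NavPoint {
        final LocalDate date;
        final double value;

        NavPoint(String date, double value) {
            this.date = LocalDate.parse(date); // expects yyyy-MM-dd
            this.value = value;
        }
    }

    /** Returns the net values in ascending date order; the input list is not modified. */
    static double[] valuesInDateOrder(List<NavPoint> points) {
        List<NavPoint> sorted = new ArrayList<>(points);
        sorted.sort(Comparator.comparing((NavPoint p) -> p.date));
        double[] values = new double[sorted.size()];
        for (int i = 0; i < values.length; i++) {
            values[i] = sorted.get(i).value;
        }
        return values;
    }
}
```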

prd_id	v_date	v_value
1001	2021-08-12	1.43200
1001	2021-08-22	1.23200
1001	2021-08-23	1.33200
1001	2021-08-25	1.12300
1001	2021-08-27	1.53200
1002	2021-08-17	1.50200
1002	2021-08-21	1.23200
1002	2021-08-24	1.33200
1002	2021-08-25	1.13200
1003	2021-08-12	1.03200
1003	2021-08-17	1.21200
1004	2021-08-12	1.212
1004	2021-08-13	1.415
1004	2021-08-15	1.321
1005	2021-08-22	1.332
1005	2021-08-23	1.332
Time taken: 0.099 seconds, Fetched: 16 row(s)

1.4 Code implementation

The class extends UDAF, and its inner evaluator class implements the UDAFEvaluator interface.

package com.lpy.udf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

import java.util.ArrayList;
import java.util.Collections;


/**
 * @author liuser
 * @date 2021/09/10
 */
public class UDAF_BACK extends UDAF {
    public static class AvgState {
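        // Stores the incoming net values; the drawdown is computed from them in merge
        private ArrayList<Double> doubles;
        // Holds the maximum drawdown that is returned
        private double max_back ;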

    }


    public static class AvgEvaluator implements UDAFEvaluator {
        AvgState state;

        public AvgEvaluator() {
            super();
            state = new AvgState();
            init();
        }


        /**
         * init works like a constructor and initializes the UDAF state.
         */
        public void init() {
            state.doubles = new ArrayList<Double>();
            state.max_back=0.0;
        }

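        /**
         * iterate receives each input value and accumulates it in the internal state.
         *
         * @param o the net value of the current row; null values are skipped
         * @return always true
         */
        public boolean iterate(Double o) {
            if (o != null) {
                // Append the net value to the list; rows must already be in date order
                state.doubles.add(o);
            }
            return true;
        }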

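        /**
         * terminatePartial takes no arguments. It is called once iterate has consumed
         * its rows and returns the partial state, much like a Hadoop Combiner.
         *
         * @return the partial state, or null if nothing was collected
         */
        public AvgState terminatePartial() {
            // Combiner-like step: hand the collected values on to merge
            return state.doubles == null ? null : state;
        }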


        /**
         * merge receives the result of terminatePartial and merges it into this evaluator.
         * Note that it computes the drawdown from the values of the incoming partial state
         * only, so it relies on a group's rows arriving in one partial, in date order.
         *
         * @param avgState the partial state returned by terminatePartial
         * @return always true
         */
        public boolean merge(AvgState avgState) {
            if (avgState != null) {
                ArrayList<Double> values = avgState.doubles;
                if (values.size() <= 1) {
                    // A single value (or none) has no drawdown
                    state.max_back = 0.00;
                } else {
                    // Drawdown ratio (Di - Dj) / Di for every pair with i < j
                    ArrayList<Double> arrList = new ArrayList<Double>();
                    for (int i = 0; i < values.size(); i++) {
                        for (int j = i + 1; j < values.size(); j++) {
                            arrList.add((values.get(i) - values.get(j)) / values.get(i));
                        }
                    }
                    // Negative ratios (Di < Dj) are clamped to 0
                    state.max_back = Math.max(0.0, Collections.max(arrList));
                }
            }
            return true;
        }

        /**
         * terminate returns the final result of the aggregate function.
         *
         * @return the maximum drawdown of the group
         */
        public Double terminate() {
            return state.max_back;
        }
    }

}

1.5 Build the jar

Use whichever build tool you prefer, such as Maven.

1.6 Upload to the server

Upload huice-1.0.jar to /home/liuser/udf_jar/

1.8 Use the UDAF function

Run in Hive:
hive (default)>add jar /home/liuser/udf_jar/huice-1.0.jar ;

hive (default)>create temporary function huice01 as 'com.lpy.udf.UDAF_BACK';


hive (default)> select prd_id,huice01(v_value) from test group by prd_id ;

1.9 Results

prd_id	_c1
1001	0.2157821229050279
1002	0.24633821571238357
1003	0.0
1004	0.06643109540636048
1005	0.0

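2.0 Notes

For prd_id=1003 the net value only rises (Di<Dj), so there is no drawdown and the result is 0.0.

For prd_id=1005 the two net values are equal (Di=Dj), so again there is no drawdown and the result is 0.0. A group with only one row likewise has no drawdown and is given 0.0.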

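09/24 Time complexity optimization (merge method)

The earlier version ran in O(n^2) time. After thinking about how to optimize it and several rounds of improvement, the complexity is now down to O(n).
The code is as follows.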

        public boolean merge(AvgState avgState) {
            // Nothing to merge or no values collected: max_back keeps its initial 0.0
            if (avgState == null || avgState.doubles == null || avgState.doubles.isEmpty()) {
                return true;
            }
            // Highest net value seen so far (the running peak)
            double max_so_far = avgState.doubles.get(0);
            double drawdown = 0.0;
            for (double value : avgState.doubles) {
                if (value > max_so_far) {
                    // New peak: the drawdown at this point is 0
                    max_so_far = value;
                } else {
                    // Drawdown of the current value relative to the running peak
                    drawdown = Math.max(drawdown, (max_so_far - value) / max_so_far);
                }
            }
            state.max_back = drawdown;

            return true;
        }
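A single pass is enough because, for a fixed later date j, (Di-Dj)/Di = 1 - Dj/Di grows as Di grows when net values are positive, so the best Di is simply the highest value seen up to j. The standalone sketch below illustrates this running-peak idea outside Hive. The class name `RunningPeakDrawdown` is hypothetical, and the input is assumed to be positive net values already sorted by date.

```java
// Hypothetical standalone helper for illustration; not part of the UDAF.
class RunningPeakDrawdown {
    /** Returns the maximum drawdown of navs in one pass; 0.0 for an empty series. */
    static double maxDrawdown(double[] navs) {
        if (navs.length == 0) {
            return 0.0;
        }
        double peak = navs[0];      // highest net value seen so far
        double maxDrawdown = 0.0;   // largest (peak - value) / peak seen so far
        for (double value : navs) {
            if (value > peak) {
                peak = value;       // new peak, drawdown here is 0
            } else {
                maxDrawdown = Math.max(maxDrawdown, (peak - value) / peak);
            }
        }
        return maxDrawdown;
    }

    public static void main(String[] args) {
        // prd_id 1002 and 1004 from the test data
        System.out.println(maxDrawdown(new double[]{1.502, 1.232, 1.332, 1.132}));
        System.out.println(maxDrawdown(new double[]{1.212, 1.415, 1.321}));
    }
}
```

Only the running peak and the best drawdown so far are kept, so the extra memory is constant instead of a list of all pairwise ratios.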