hive的常用函数使用

最新推荐文章于 2025-03-13 08:50:19 发布

原创最新推荐文章于 2025-03-13 08:50:19 发布

· 1.1k 阅读

4 ·

版权

文章标签：

#hive #数据仓库

hive 专栏收录该内容

1 篇文章

订阅专栏

时间类型

date：yyyy-MM-dd

timestamp: yyyy-MM-dd HH:mm:ss

日期函数

时间戳转日期格式

from_unixtime(bigint unixtime,格式)

格式可自己指定，默认是yyyy-MM-dd HH:mm:ss格式，时间戳是bigint类型

返回值：string

获取当前时间的时间戳

select unix_timestamp()

timestamp日期时间转时间戳

默认情况下：

unix_timestamp(‘yyyy-MM-dd HH:mm:ss’)

select unix_timestamp(‘2021-10-09 21:03:31’);

指定格式日期转UNIX 时间戳

select unix_timestamp(‘20211009 21:03:31’,‘yyyyMMdd HH:mm:ss’)

select unix_timestamp(‘2021-10-09’,‘yyyy-MM-dd’);

返回值：bigint

日期时间转日期函数

to_date

select to_date(‘2021-10-09 21:03:31’);

返回值：string

日期比较函数: datediff

datediff(string enddate, string startdate)

select datediff(‘2021-10-09 21:03:31’,‘2021-10-04 21:03:31’);

select datediff(‘2021-10-09’,‘2021-10-04’);

返回值: int,相减结果只能精确到天

日期增加函数: date_add

date_add(string startdate, int days)

select date_add(‘2021-10-09 21:03:31’,1);

select date_add(‘2021-10-09’,1);

返回值: string，相加结果只能精确到天

日期减少函数: date_sub

date_sub (string startdate, int days)

select date_sub(‘2021-10-09 21:03:31’,1);

select date_sub(‘2021-10-09’,1);

返回值: string

将字符串类型数据转换为日期类型date_format

date_format()

// '2021年01月14日' -> '2021-01-14'
select from_unixtime(unix_timestamp('2021年01月14日','yyyy年MM月dd日'),'yyyy-MM-dd');
// "04牛2021数加16逼" -> "2021/04/16"
select from_unixtime(unix_timestamp("04牛2021数加16逼","MM牛yyyy数加dd逼"),"yyyy/MM/dd");

数值计算

取整函数(四舍五入)：round
向上取整：ceil
向下取整：floor

字符串函数

concat('123','456'); // 123456
concat('123','456',null); // NULL

select concat_ws('#','a','b','c'); // a#b#c
select concat_ws('#','a','b','c',NULL); // a#b#c 可以指定分隔符，并且会自动忽略NULL
select concat_ws("|",cast(id as string),name,cast(age as string),gender,clazz) from students limit 10;

select substring("abcdefg",1); // abcdefg HQL中涉及到位置的时候 是从1开始计数
// '2021/01/14' -> '2021-01-14'
select concat_ws("-",substring('2021/01/14',1,4),substring('2021/01/14',6,2),substring('2021/01/14',9,2));

select split("abcde,fgh",","); // ["abcde","fgh"]
select split("a,b,c,d,e,f",",")[2]; // c

select explode(split("abcde,fgh",",")); // abcde
										//  fgh

// 解析json格式的数据
select get_json_object('{"name":"zhangsan","age":18,"score":[{"course_name":"math","score":100},{"course_name":"english","score":60}]}',"$.score[0].score"); // 100

复杂数据类型

array

建表：

create table arraytest(name string,score array<string>)ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS terminated by ',';

查询：

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-wihEW8ES-1640176811190)(C:\Users\34625\AppData\Roaming\Typora\typora-user-images\image-20210929203431937.png)]

map

建表：

create table maptest(name string,score map<string,int>)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';

查询：

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-P4xYndyx-1640176811192)(C:\Users\34625\AppData\Roaming\Typora\typora-user-images\image-20210929204151384.png)]

struct

建表：

create table structtest(name string,stu struct<height:int,weight:int,age:int,gender:string,clazz:string>)

ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 COLLECTION ITEMS TERMINATED BY ',';

查询

[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-ix85D6S9-1640176811193)(C:\Users\34625\AppData\Roaming\Typora\typora-user-images\image-20210929205624291.png)]

1、count(*)、count(1) 、count(‘字段名’) 区别

2、HQL 执行优先级：

from、where、 group by 、having、order by、join、select 、limit

3、where 条件里不支持不等式子查询，实际上是支持 in、not in、exists、not exists

4、hive中大小写不敏感

5、在hive中，数据中如果有null字符串，加载到表中的时候会变成 null （不是字符串）

如果需要判断 null，使用某个字段名 is null 这样的方式来判断

或者使用 nvl() 函数，不能直接某个字段名 == null\

常用函数

nvl函数

nvl(y,0)

if函数

if(表达式,如果表达式成立的返回值,如果表达式不成立的返回值)

if(y is null,0,y)

case when

case when y is null  then 0 else y

select  score
        ,case when score>120 then '优秀'
              when score>100 then '良好'
              when score>90 then '及格'
        else '不及格'
        end as pingfen
from default.score limit 20;

开窗函数的使用

count(*),max()

select * ,count(*) over(partition by clazz) from new_score;

select *,max(score) over(partition by clazz) from new_score;

select *,score.s-score.score from (select *,max(score) over(partition by clazz)as s from new_score)as
score;

row_number()

select *,row_number() over(partition by clazz order by score desc) from new_score;

row_number()实际上是给每一条数据打上一个标签

dense_rank() rank()

dense_rank：有并列排名，并且依次递增

rank：有并列排名，不依次递增

select *,dense_rank() over(partition by clazz order by score desc) as dense_rank, rank() over(partition by clazz order by score desc) from new_score;

percent_rank：

(rank的结果-1)/(分区内数据的个数-1)

cume_dist

计算某个窗口或分区中某个值的累积分布。

假定升序排序，则使用以下公式确定累积分布：小于等于当前值x的行数 / 窗口或partition分区内的总行数。其中，x 等于 order by 子句中指定的列的当前行中的值。

NTILE(n)

对分区内数据再分成n组，然后打上组号

max、min、avg、count、sum：基于每个partition分区内的数据做对应的计算

分组求topn

select * from (select *,row_number() over(partition by clazz order by score desc)as num from new_score
)as t where t.num<=3;

窗口帧

rows格式1:前2行+当前行+后两行
	sum(score) over (partition by clazz order by score desc rows between 2 PRECEDING and 2 FOLLOWING)
rows格式2:前记录到最末尾的总和
	sum(score) over (partition by clazz order by score desc rows between CURRENT ROW and UNBOUNDED FOLLOWING)

range格式1: 如果当前值在80,取值就会落在范围在80-2=78和80+2=82组件之内的行
	max(score) over (partition by clazz order by score desc range between 2 PRECEDING and 2 FOLLOWING)

只能运用在max、min、avg、count、sum、FIRST_VALUE、LAST_VALUE这几个窗口函数上

LAG(col,n)：往前第n行数据

LEAD(col,n)：往后第n行数据

FIRST_VALUE：取分组内排序后，截止到当前行，第一个值

LAST_VALUE：取分组内排序后，截止到当前行，最后一个值，对于并列的排名，取最后一个

select  id
        ,score
        ,clazz
        ,department
        ,lag(id,2) over (partition by clazz order by score desc) as lag_num
        ,LEAD(id,2) over (partition by clazz order by score desc) as lead_num
        ,FIRST_VALUE(id) over (partition by clazz order by score desc) as first_v_num
        ,LAST_VALUE(id) over (partition by clazz order by score desc) as last_v_num
from new_score;

Hive 行转列

lateral view explode

create table testArray2(
    name string,
    weight array<string>
)row format delimited 
fields terminated by '\t'
COLLECTION ITEMS terminated by ',';

志凯	"150","170","180"
上单	"150","180","190"



select name,col1  from testarray2 lateral view explode(weight) t1 as col1;

志凯	150
志凯	170
志凯	180
上单	150
上单	180
上单	190

select key from (select explode(map('key1',1,'key2',2,'key3',3)) as (key,value)) t;

key1
key2
key3

select name,col1,col2  from testarray2 lateral view explode(map('key1',1,'key2',2,'key3',3)) t1 as col1,col2;
志凯	key1	1
志凯	key2	2
志凯	key3	3
上单	key1	1
上单	key2	2
上单	key3	3


select name,pos,col1  from testarray2 lateral view posexplode(weight) t1 as pos,col1;

志凯	0	150
志凯	1	170
志凯	2	180
上单	0	150
上单	1	180
上单	2	190

Hive 列转行

collect_list()

// testLieToLine
name col1
志凯	150
志凯	170
志凯	180
上单	150
上单	180
上单	190

create table testLieToLine(
    name string,
    col1 int
)row format delimited 
fields terminated by '\t';


select name,collect_list(col1) from testLieToLine group by name;

// 结果
上单	["150","180","190"]
志凯	["150","170","180"]

select  t1.name
        ,collect_list(t1.col1) 
from (
    select  name
            ,col1 
    from testarray2 
    lateral view explode(weight) t1 as col1
) t1 group by t1.name;

hive自定义函数

UDF：一进一出

添加依赖：

<dependency>
            <groupId>org.apache.hive</groupId>
            <artifactId>hive-exec</artifactId>
            <version>1.2.1</version>
  </dependency>

编写代码，继承org.apache.hadoop.hive.ql.exec.UDF，实现evaluate方法，在evaluate方法中实现自己的逻辑

import org.apache.hadoop.hive.ql.exec.UDF;

public class HiveUDF extends UDF {
    // hadoop => #hadoop$
    public String evaluate(String col1) {
    // 给传进来的数据 左边加上 # 号 右边加上 $
        String result = "#" + col1 + "$";
        return result;
    }
}

打成jar包并上传至Linux虚拟机
在hive shell中，使用 add jar 路径将jar包作为资源添加到hive环境中

add jar /usr/local/soft/jars/HiveUDF2-1.0.jar;

使用jar包资源注册一个临时函数，fxxx1是你的函数名，'MyUDF’是主类名（copy reference)

create temporary function fxxx1 as 'MyUDF';

使用函数名处理数据

select fxx1(name) as fxx_name from students limit 10;

#施笑槐$
#吕金鹏$
#单乐蕊$
#葛德曜$
#宣谷芹$
#边昂雄$
#尚孤风$
#符半双$
#沈德昌$
#羿彦昌$

UDTF：一进多出

“key1:value1,key2:value2,key3:value3”

key1 value1

key2 value2

key3 value3

方法一：使用 explode+split

select split(t.col1,":")[0],split(t.col1,":")[1] 
from (select explode(split("key1:value1,key2:value2,key3:value3",",")) as col1) t;

方法二：自定UDTF

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;

public class HiveUDTF extends GenericUDTF {
    // 指定输出的列名 及 类型
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        ArrayList<String> filedNames = new ArrayList<String>();
        ArrayList<ObjectInspector> filedObj = new ArrayList<ObjectInspector>();
        filedNames.add("col1");
        filedObj.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        filedNames.add("col2");
        filedObj.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(filedNames, filedObj);
    }

    // 处理逻辑 my_udtf(col1,col2,col3)
    // "key1:value1,key2:value2,key3:value3"
    // my_udtf("key1:value1,key2:value2,key3:value3")
    public void process(Object[] objects) throws HiveException {
        // objects 表示传入的N列
        String col = objects[0].toString();
        // key1:value1  key2:value2  key3:value3
        String[] splits = col.split(",");
        for (String str : splits) {
            String[] cols = str.split(":");
            // 将数据输出
            forward(cols);
        }
    
    }
    
    // 在UDTF结束时调用
    public void close() throws HiveException {
    
    }
}

select my_udtf("key1:value1,key2:value2,key3:value3");

字段：id,col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12 共13列

数据：

a,1,2,3,4,5,6,7,8,9,10,11,12

b,11,12,13,14,15,16,17,18,19,20,21,22

c,21,22,23,24,25,26,27,28,29,30,31,32

转成3列：id,hours,value

例如：

a,1,2,3,4,5,6,7,8,9,10,11,12

a,0时,1

a,2时,2

a,4时,3

a,6时,4

…

create table udtfData(
    id string
    ,col1 string
    ,col2 string
    ,col3 string
    ,col4 string
    ,col5 string
    ,col6 string
    ,col7 string
    ,col8 string
    ,col9 string
    ,col10 string
    ,col11 string
    ,col12 string
)row format delimited fields terminated by ',';

import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

import java.util.ArrayList;

public class HiveUDTF2 extends GenericUDTF {
    @Override
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        ArrayList<String> filedNames = new ArrayList<String>();
        ArrayList<ObjectInspector> fieldObj = new ArrayList<ObjectInspector>();
        filedNames.add("col1");
        fieldObj.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        filedNames.add("col2");
        fieldObj.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
        return ObjectInspectorFactory.getStandardStructObjectInspector(filedNames, fieldObj);
    }

    public void process(Object[] objects) throws HiveException {
        int hours = 0;
        for (Object obj : objects) {
            hours = hours + 2;
            String col = obj.toString();
            ArrayList<String> cols = new ArrayList<String>();
            cols.add(hours + "时");
            cols.add(col);
            forward(cols);
        }
    }
    
    public void close() throws HiveException {
    
    }

}

添加jar资源:

add jar /usr/local/soft/HiveUDF2-1.0.jar;

注册udtf函数：

create temporary function my_udtf as 'MyUDTF';

销毁临时函数：

 drop temporary function f_up;

SQL:

select id,hours,value from udtfData lateral view my_udtf(col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11,col12) t as hours,value ;