一、Ways to create Hive tables
1. [Plain CREATE TABLE]
create table stu_info(
num int,
name string
)
row format delimited fields terminated by "\t";
Load local data: copies the local file into the table's directory
load data local inpath '/opt/datas/stu_info' into table stu_info;
Load HDFS data: moves the file on HDFS into the table's directory
load data inpath '/table_stu.txt' into table stu_info;
2. [CTAS: create table ... as select]
create table stu_as as select name from stu_info;
-》Gives the query result, together with its structure, to a new table
-》Similar to saving an intermediate result set
3. [LIKE]
LIKE existing_table_or_view_name
create table stu_like like stu_info;
-》Copies only the table structure to a new table (no data)
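A quick check that LIKE copies only the schema (a minimal sketch; the fresh copy is expected to hold 0 rows):
desc formatted stu_like;
select count(*) from stu_like;   -- 0 rows: structure copied, data not copied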
[databases]
create database db_emp;
[Employee table]
create table emp(
empno int comment 'employee number',
ename string comment 'employee name',
job string comment 'job title',
mgr int comment 'manager number',
hiredate string comment 'hire date',
sal double comment 'salary',
comm double comment 'bonus',
deptno int comment 'department number'
)
row format delimited fields terminated by '\t';
load data local inpath '/opt/datas/emp.txt' into table emp;
[Department table]
create table dept(
deptno int comment 'department number',
dname string comment 'department name',
loc string comment 'location'
)
row format delimited fields terminated by '\t';
load data local inpath '/opt/datas/dept.txt' into table dept;
[Overwriting table data with overwrite] Internally this includes a delete step: the existing data is removed before the new data is loaded into the table
load data local inpath '/opt/datas/dept.txt' overwrite into table dept;
二、External tables
Keyword: [EXTERNAL]
[LOCATION hdfs_path] shares data: the table reads data from an existing path on HDFS
create table emp1(
empno int comment 'employee number',
ename string comment 'employee name',
job string comment 'job title',
mgr int comment 'manager number',
hiredate string comment 'hire date',
sal double comment 'salary',
comm double comment 'bonus',
deptno int comment 'department number'
)
row format delimited fields terminated by '\t'
LOCATION '/user/hive/warehouse/db_emp.db/emp';
If you change the data of table emp, emp1 changes as well
If you change the data of table emp1, emp changes as well
Drop the table
drop table emp1;
-》Both tables share a single copy of the data, so the shared data ends up deleted
-》Dropping the table removes its metadata (emp1)
-》and also removes the table's folder on HDFS
Create an external table
create external table dept_ext2(
deptno int comment 'department number',
dname string comment 'department name',
loc string comment 'location'
)
row format delimited fields terminated by '\t'
LOCATION '/user/hive/warehouse/db_emp.db/dept';
Table type (shown by desc formatted): Table Type: EXTERNAL_TABLE
Drop the table: drop table dept_ext2;
-》Dropping an external table only removes the metadata (dept_ext2)
-》The corresponding folder and data files are not deleted
-》A common pattern is to create a managed (internal) table first and then create several external tables on top of it as needed
-》External tables are mainly about data safety
Differences between Hive managed (internal) tables and external tables:
1) At creation time: creating a managed table moves the data into the path the warehouse points to; creating an external table only records the path where the data lives and never moves the data.
2) At drop time: dropping a managed table deletes both the metadata and the data, while dropping an external table deletes only the metadata and keeps the data. External tables are therefore safer, allow more flexible data organization, and make data sharing easier.
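The table type mentioned above can be verified directly; a brief sketch using desc formatted:
desc formatted dept_ext2;   -- shows "Table Type: EXTERNAL_TABLE"
desc formatted dept;        -- a managed table shows "Table Type: MANAGED_TABLE"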
三、Temporary tables [TEMPORARY]
create temporary table dept_tmp(
deptno int,
dname string,
loc string
)
row format delimited fields terminated by '\t';
load data local inpath '/opt/datas/dept.txt' into table dept_tmp;
Data storage path (Location):
Location: hdfs://hadoop01.com:8020/tmp/hive/hadoop/23a93177-f22f-4035-a2e3-c51cf315625f/_tmp_space.db/962463c2-6563-47a8-9e62-2a1e8eb6ed19
When the Hive session is closed:
the temporary table is dropped automatically
it can also be dropped manually with drop
create TEMPORARY table dept_tmp(
deptno int ,
dname string ,
loc string
)
row format delimited fields terminated by '\t'
LOCATION '/user/hive/warehouse/db_emp.db/dept';
When the Hive session is closed:
the temporary table's data is deleted automatically
it can also be dropped manually with drop, which removes the temporary table's data and data files
So be careful when a temporary table shares a data directory with a managed table: when the Hive session is closed, the managed table's data is deleted along with it
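A safer sketch for scratch data: let the temporary table keep its own copy via CTAS instead of pointing LOCATION at a managed table's directory (dept_tmp2 is just an illustrative name):
create temporary table dept_tmp2 as select * from dept;
-- dropping dept_tmp2 or closing the session then only removes the temporary copy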
四、Partitioned tables
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
Regular table: select * from logs where `date`='2018120'
Execution: the whole table is scanned first, then the filter is applied
Partitioned table: select * from logs where `date`='2018120' and hour='00'
Execution: only the data under the matching partition folder is read
Partition columns are logical: they exist as folders on HDFS rather than inside the data files,
and they must not be columns that already appear in the data
[Single-level partition]
create table emp_part(
empno int ,
ename string ,
job string ,
mgr int ,
hiredate string,
sal double ,
comm double ,
deptno int
)PARTITIONED BY(`date` string)
row format delimited fields terminated by '\t';
load data local inpath '/opt/datas/emp.txt' into table emp_part partition(`date`='2018120');
load data local inpath '/opt/datas/emp.txt' into table emp_part partition(`date`='2018121');
select * from emp_part where `date`='2018120';
[Two-level partition]
create table emp_part2(
empno int ,
ename string ,
job string ,
mgr int ,
hiredate string,
sal double ,
comm double ,
deptno int
)PARTITIONED BY(`date` string,hour string)
row format delimited fields terminated by '\t';
load data local inpath '/opt/datas/emp.txt' into table emp_part2 partition(`date`='2018120',hour='01');
load data local inpath '/opt/datas/emp.txt' into table emp_part2 partition(`date`='2018120',hour='02');
select * from emp_part2 where `date`='2018120';
select * from emp_part2 where `date`='2018120' and hour='01';
-》The main benefit of partitioned tables is better query efficiency
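Partitions can also be listed and added explicitly; a small sketch against emp_part (the partition value '2018122' is only illustrative):
show partitions emp_part;
alter table emp_part add if not exists partition (`date`='2018122');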
Bucketed tables: give higher processing efficiency, help with joins, and support data sampling (buckets can be created inside partitions for an even finer-grained split)
[CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
create table emp_bu2(
empno int ,
ename string ,
job string ,
mgr int ,
hiredate string,
sal double ,
comm double ,
deptno int
)Partitioned by (`date` string,hour string) CLUSTERED BY (empno) INTO 4 BUCKETS
row format delimited fields terminated by '\t';
Create the table first, then set:
set hive.enforce.bucketing = true;
insert overwrite table emp_bu select * from emp;   (emp_bu is presumably a non-partitioned bucketed table defined the same way, without partitioned by)
insert overwrite table emp_bu2 partition(`date`='20123213') select * from emp;
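Bucketing also enables sampling with tablesample; a minimal sketch against emp_bu2 (assuming the buckets above have been populated):
select * from emp_bu2 tablesample(bucket 1 out of 4 on empno);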
五、Hive analytic (window) functions
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
Analytic functions with over: an analytic function computes a group-based aggregate, but unlike an aggregate function it returns multiple rows per group rather than a single row per group.
-》Main use: process the rows within each group and output the processed result alongside them
-》Requirement 1: list all employees in department 10, sorted by salary in descending order (ascending is the default)
select * from emp where deptno=10 order by sal desc;
Result:
7839 KING PRESIDENT NULL 1981-11-17 5000.0 NULL 10
7782 CLARK MANAGER 7839 1981-6-9 2450.0 NULL 10
7934 MILLER CLERK 7782 1982-1-23 1300.0 NULL 10
-》Requirement 2: show the highest salary of each employee's department as an extra last column
select empno,ename, deptno,sal, max(sal) over(partition by deptno order by sal desc ) as sal_111 from emp ;
Result:
7839 KING 10 5000.0 5000.0
7782 CLARK 10 2450.0 5000.0
7934 MILLER 10 1300.0 5000.0
7788 SCOTT 20 3000.0 3000.0
7902 FORD 20 3000.0 3000.0
7566 JONES 20 2975.0 3000.0
7876 ADAMS 20 1100.0 3000.0
7369 SMITH 20 800.0 3000.0
7698 BLAKE 30 2850.0 2850.0
7499 ALLEN 30 1600.0 2850.0
7844 TURNER 30 1500.0 2850.0
7654 MARTIN 30 1250.0 2850.0
7521 WARD 30 1250.0 2850.0
7900 JAMES 30 950.0 2850.0
-》Requirement 3: rank the employees within each department by salary, showing a unique sequence number in the last column
select empno,ename, deptno,sal, row_number() over(partition by deptno order by sal desc ) as sal_111 from emp ;
Result:
7839 KING 10 5000.0 1
7782 CLARK 10 2450.0 2
7934 MILLER 10 1300.0 3
7788 SCOTT 20 3000.0 1
7902 FORD 20 3000.0 2
7566 JONES 20 2975.0 3
7876 ADAMS 20 1100.0 4
7369 SMITH 20 800.0 5
7698 BLAKE 30 2850.0 1
7499 ALLEN 30 1600.0 2
7844 TURNER 30 1500.0 3
7654 MARTIN 30 1250.0 4
7521 WARD 30 1250.0 5
7900 JAMES 30 950.0 6
-》Requirement 4: get the top two salaries of each department (nested subquery)
select empno,ename, deptno,sal from(select empno,ename, deptno,sal,row_number() over (partition by deptno order by sal desc) as rn from emp) as tmp where rn < 3;
Result:
7839 KING 10 5000.0
7782 CLARK 10 2450.0
7788 SCOTT 20 3000.0
7902 FORD 20 3000.0
7698 BLAKE 30 2850.0
7499 ALLEN 30 1600.0
Always keep the distinction between analytic functions and aggregate functions in mind!
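To make the distinction concrete, a minimal sketch of the same aggregation written both ways:
-- aggregate function: one row per department
select deptno, max(sal) from emp group by deptno;
-- analytic function: every employee row, with the department maximum alongside
select deptno, ename, sal, max(sal) over(partition by deptno) as dept_max from emp;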
六、Importing and exporting data in Hive
[Import]
1. load data [local]
-》Local: copies the data file into the table's directory on HDFS; suitable for most scenarios
load data local inpath 'local_path' into table tablename;
-》HDFS: moves the data file into the table's directory on HDFS; suitable for large datasets already on HDFS
load data inpath 'hdfs_path' into table tablename;
2. load data + overwrite [overwrite existing data]
load data [local] inpath 'path' overwrite into table tablename;
-》Suitable for tables that are rewritten repeatedly, typically temporary/staging tables
3. Subquery: as select
create table tb2 as select * from tb1;   (the table is created and then filled with the query result)
-》Suitable for saving query results
4. insert
insert into table select sql;      -》append
insert overwrite table select sql; -》overwrite
Test: create table tb like dept;
insert into table tb select * from dept;
Inserting a single row in a relational database: insert into tablename values(...);
In Hive: insert into table tb values(50,'AA','bb');
Note: this approach is only suitable for very small amounts of data; avoid it when the data volume is large
show tables then lists a temporary table named values__tmp__table__1, which disappears when the session is closed
5. location: point the table at existing data when it is created
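A minimal sketch of the location approach (the table name dept_loc and the HDFS path /data/dept_ext are illustrative): the table is created directly over files that already sit on HDFS, so no load step is needed:
create external table dept_loc(
deptno int,
dname string,
loc string
)
row format delimited fields terminated by '\t'
LOCATION '/data/dept_ext';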
[Export]
1. insert
insert overwrite [local] directory 'path' select sql;
-》Local filesystem
insert overwrite local directory '/opt/datas/emp_in01' select * from emp;
insert overwrite local directory '/opt/datas/emp_in01' row format delimited fields terminated by "\t" select * from emp;
Note: specify the field delimiter when exporting row data
-》HDFS
insert overwrite directory '/emp_insert' select * from emp;
insert overwrite directory '/emp_insert' row format delimited fields terminated by "\t" select * from emp;
2. bin/hdfs dfs -get xxx to download the data files
hive> dfs -get xxx (from inside the Hive CLI)
3. bin/hive -e or -f combined with >> (append) or > (overwrite)
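A brief sketch of option 3 (the output path and the emp.sql script are illustrative):
bin/hive -e "select * from db_emp.emp" > /opt/datas/emp_query.txt
bin/hive -f /opt/datas/emp.sql >> /opt/datas/emp_query.txt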
4. Sqoop: import and export
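A hedged sketch of a Sqoop export from the Hive warehouse directory into MySQL (connection string, credentials, and target database are placeholders, not taken from this setup):
sqoop export \
--connect jdbc:mysql://hadoop01.com:3306/test \
--username root --password 123456 \
--table dept \
--export-dir /user/hive/warehouse/db_emp.db/dept \
--input-fields-terminated-by '\t'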
七、Common HQL statements
[Filtering]
where, limit, distinct, between ... and, is null, is not null
select * from emp where sal >3000;
select * from emp limit 5;
select distinct deptno from emp;
select * from emp where sal between 1000 and 3000;
select empno,ename from emp where comm is null;
select empno,ename from emp where comm is not null;
[Aggregate functions]
count, sum, max, min, avg, group by, having
select avg(sal) avg_sal from emp;
Group by department and compute the average salary of each department
Every column in the select list must either be wrapped in an aggregate function or appear in the group by clause
select deptno,avg(sal) from emp group by deptno;
10 2916.6666666666665
20 2175.0
30 1566.6666666666667
select deptno,max(job),avg(sal) from emp group by deptno;
10 PRESIDENT 2916.6666666666665
20 MANAGER 2175.0
30 SALESMAN 1566.6666666666667
select deptno,avg(sal) from emp group by deptno,job;
having and where behave similarly; both filter rows
They can be used together; where is applied first, then having
select deptno,avg(sal) avg_sal from emp group by deptno having avg_sal >2000 ;
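A combined sketch with both filters in one query (the thresholds are arbitrary): where removes rows before grouping, having removes groups afterwards:
select deptno, avg(sal) avg_sal from emp where sal > 1000 group by deptno having avg_sal > 2000;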
[join]
left join, right join, inner join (equi-join), full join
Create two tables, A and B:
Table A:              Table B:
ID  Name              ID  phone
1   Zhangsan          1   111
2   Lisi              2   112
3   Wangwu            3   113
5   Zhaoliu           4   114
load data local inpath '/opt/datas/A' into table A;
load data local inpath '/opt/datas/B' into table B;
Equi (inner) join: only the IDs that match in both tables appear
select aaa.ID,aaa.Name ,bbb.ID ,bbb.phone from A aaa join B bbb on aaa.ID=bbb.ID;
Left join: the left table is the base; fields with no match come back as null
select aaa.ID,aaa.Name ,bbb.ID ,bbb.phone from A aaa left join B bbb on aaa.ID=bbb.ID;
Right join: the right table is the base; fields with no match come back as null
select aaa.ID,aaa.Name ,bbb.ID ,bbb.phone from A aaa right join B bbb on aaa.ID=bbb.ID;
Full join: rows from both tables appear whether they match or not; fields with no match are null
select aaa.ID,aaa.Name ,bbb.ID ,bbb.phone from A aaa full join B bbb on aaa.ID=bbb.ID;
八、Sorting in Hive/MapReduce and related runtime parameters
-》Set the amount of data each reducer processes (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
<property>
<name>hive.exec.reducers.bytes.per.reducer</name>
<value>256000000</value>
<description>size per reducer.The default is 256Mb, i.e if the input size is 1G, it will use 4 reducers.</description>
</property>
-》Set the maximum number of reducers (default 1009):
set hive.exec.reducers.max=<number>
<property>
<name>hive.exec.reducers.max</name>
<value>1009</value>
<description>
max number of reducers will be used. If the one specified in the configuration parameter mapred.reduce.tasks is
negative, Hive will use this one as the max number of reducers when automatically determine number of reducers.
</description>
</property>
-》Set the actual number of reducers used by a job (default is 1):
set mapreduce.job.reduces=<number>
1. order by: global sort; configuring multiple reducers has little effect (the final sort still runs in a single reducer)
select * from emp order by sal;
2. sort by: local sort; sorts the output within each reducer. With the reducer count set to 4, export the result to the local filesystem:
insert overwrite local directory '/opt/datas/emp_sort' row format delimited fields terminated by "\t" select * from emp sort by sal;
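The reducer count for this export is set per session before running the statement above; a minimal sketch using the parameter described earlier:
set mapreduce.job.reduces=4;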
3. distribute by (implemented as MapReduce partitioning)
-》distributes rows to reducers by the specified column
-》distribute first, then sort; usually used together with sort by
insert overwrite local directory '/opt/datas/emp_dist' row format delimited fields terminated by "\t" select * from emp distribute by deptno sort by sal;
4. cluster by: equivalent to combining distribute by and sort by on the same column
-》the sort order cannot be specified with desc or asc
-》rows with the same salary end up together; with this dataset there happen to be few duplicate salaries
insert overwrite local directory '/opt/datas/emp_cluster' row format delimited fields terminated by "\t" select * from emp cluster by sal;
九、Custom UDF functions
UDF: one row in, one row out
=》Example: convert string case
Hive Maven dependencies:
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>2.7.3</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-exec</artifactId>
<version>1.2.1</version>
</dependency>
<dependency>
<groupId>org.apache.hive</groupId>
<artifactId>hive-jdbc</artifactId>
<version>1.2.1</version>
</dependency>
Aliyun mirror repository for downloading dependencies:
<repositories>
<repository>
<id>nexus-aliyun</id>
<name>Nexus aliyun</name>
<url>http://maven.aliyun.com/nexus/content/groups/public</url>
</repository>
</repositories>
Extend the UDF class and implement evaluate(Text str):
package cmz.Hive20;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class Lower extends UDF {
    public Text evaluate(Text str) {
        if (str == null) {
            return null;
        }
        if (StringUtils.isBlank(str.toString())) {
            return null;
        }
        return new Text(str.toString().toLowerCase());
    }

    // test
    public static void main(String[] args) {
        System.out.println(new Lower().evaluate(new Text("TAYLOR swift")));
    }
}
Package name.Class name:
cmz.Hive20.Lower
-》Package the code as a jar and upload it to the Linux host
-》Add the jar to Hive
add jar /opt/datas/lower_hive.jar;
-》Create a function
create temporary function my_lower as 'cmz.Hive20.Lower';
List functions: show functions;
-》Use the function
select empno,ename, my_lower(ename) lower_name from emp;
十、HiveServer2 and the JDBC client
-》Run Hive as a service so that clients can connect to it
-》Start the server: bin/hiveserver2
Start it in the background: bin/hiveserver2 &
-》Start the client: bin/beeline
-》Show the client help: bin/beeline --help
beeline> !help (no trailing semicolon)
From the docs: beeline -u jdbc:hive2://localhost:10000/default -n scott -w password_file
-》Connect to HiveServer2:
-n specifies the username (a Linux user)
-p specifies the password (that Linux user's password)
bin/beeline -u jdbc:hive2://hadoop01.com:10000/ -n hadoop -p 123456
or
bin/beeline
!connect jdbc:hive2://hadoop01.com:10000
Hive JDBC remote connection
Docs: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-JDBC
十一、Hive deployment modes and virtual columns
Hive can be configured in three modes:
--based on where Hive is installed and where the metastore is set up
Embedded mode: uses the bundled Derby database
Local mode: the metastore is kept in MySQL, and MySQL runs on the same machine as Hive
Remote mode: the metastore is kept in MySQL, and MySQL runs on a different machine from Hive
For remote mode, MySQL must serve the whole cluster, so the metastore has to be configured explicitly
Configuration: specify the host on which the metastore service runs
<property>
<name>hive.metastore.uris</name>
<value/>
<description>Thrift URI for the remote metastore. Used by metastore client to connect to remote metastore.</description>
</property>
Start the service: bin/hive --service metastore &
-》Fetch mode
<property>
<name>hive.fetch.task.conversion</name>
<value>more</value>
<description>
Expects one of [none, minimal, more].
Some select queries can be converted to single FETCH task minimizing latency.
Currently the query should be single sourced not having any subquery and should not have
any aggregations or distincts (which incurs RS), lateral views and joins.
0. none : disable hive.fetch.task.conversion
1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
2. more : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
</description>
</property>
-》none: every query runs MapReduce; fetch conversion is disabled
-》minimal: select *, filters on partition columns, and limit run without MapReduce
-》more: plain select, filters, and limit run without MapReduce
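The value can also be changed per session, using one of the options listed above; for example:
set hive.fetch.task.conversion=more;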
-》Virtual columns (Virtual Columns)
INPUT__FILE__NAME: the input file name; shows the full path of the file the row came from
BLOCK__OFFSET__INSIDE__FILE: the row's byte offset within the file
select *, BLOCK__OFFSET__INSIDE__FILE from dept;
10 ACCOUNTING NEW YORK 0
20 RESEARCH DALLAS 23
30 SALES CHICAGO 42
40 OPERATIONS BOSTON 59
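A similar sketch for INPUT__FILE__NAME, run against the partitioned table above (the limit just keeps the output short):
select empno, ename, INPUT__FILE__NAME from emp_part limit 3;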