Hive adds several complex types that traditional relational databases lack:
1) array: an ordered collection of elements of the same type, indexed from 0
array01('英语','化学','物理')
array01[1] --> '化学'
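You can experiment without creating a table: Hive's built-in array() constructor and size() function work directly in a select (a minimal sketch):
hive> select array('英语','化学','物理')[1];   -- returns 化学 (indexing starts at 0)
hive> select size(array('英语','化学','物理')); -- returns 3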
In a traditional relational database, each hobby needs its own column:
name   like1  like2  like3  like4  like5
张三   足球   汽车   读书   喝茶   NULL
李四   篮球   电影   读书   音乐   美食
In Hive, a single array column holds them all:
name   likes
张三   足球,汽车,读书,喝茶
李四   篮球,电影,读书,音乐,美食
create table s1(
id int,
name string,
likes array<string>
)
row format delimited
fields terminated by ','
collection items terminated by '-'
lines terminated by '\n';
s1.txt
1,张三,足球-汽车-读书-喝茶
2,李四,篮球-电影-读书-音乐-美食
load data local inpath '/home/hyxy/Desktop/s1.txt' overwrite into table s1;
select * from s1;
hive> select likes[0] from s1;
足球
篮球
hive> select * from s1 where likes[0]='足球';
1 张三 ["足球","汽车","读书","喝茶"]
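Note that the filter likes[0]='足球' only inspects the first element. To test membership anywhere in the array, the built-in array_contains function can be used (a sketch):
hive> select * from s1 where array_contains(likes,'读书');   -- matches both rows, since both students like 读书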
The EXPLODE function:
explode expands a complex array or map value in a single Hive row into multiple rows:
hive>select id,name,hobby from s1 lateral view explode(likes) likes as hobby;
OK
1 张三 足球
1 张三 汽车
1 张三 读书
1 张三 喝茶
2 李四 篮球
2 李四 电影
2 李四 读书
2 李四 音乐
2 李四 美食
hive>select hobby,count(*) as c1 from s1 lateral view explode(likes) likes as hobby group by hobby order by c1 desc;
读书 2
音乐 1
足球 1
美食 1
篮球 1
电影 1
汽车 1
喝茶 1
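The lateral view syntax used above is necessary because a UDTF such as explode cannot appear alongside ordinary columns in the same select list; a sketch of what does and does not work:
hive> select id, explode(likes) from s1;   -- fails: a UDTF cannot be mixed with other select expressions
hive> select explode(likes) from s1;       -- works, but the id and name columns are lost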
2) map: a collection of key-value pairs; elements are accessed by key with the same bracket syntax as arrays
map01('first','max','second','demo')
map01['first'] --> 'max'
With an array, scores are stored by position: array01[90,80,95] --> 语 数 英
With a map, each score is stored under its key: map01 --> {'语':90,'数':80,'英':95}
declared type: map<string,int>
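As with arrays, the built-in map() constructor lets you try this without a table (a minimal sketch):
hive> select map('语',90,'数',80,'英',95)['数'];   -- returns 80; arguments alternate key, value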
create table s2(
id int,
name string,
deductions map<string,string>
)
row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';
s2.txt
1,leo,英语:四级-日语:三级
2,marry,英语:六级-俄语:一级
load data local inpath '/home/hyxy/Desktop/s2.txt' overwrite into table s2;
select * from s2;
1 leo {"英语":"四级","日语":"三级"}
2 marry {"英语":"六级","俄语":"一级"}
Find the students whose English level is 四级:
select * from s2 where deductions["英语"]="四级";
1 leo {"英语":"四级","日语":"三级"}
Query each student's Japanese level:
select deductions["日语"] from s2;
OK
三级
NULL
explode splits the map into rows (keys in one column, values in another):
select explode(deductions) as (note_name,note_level) from s2;
OK
英语 四级
日语 三级
英语 六级
俄语 一级
lateral view explode, keeping the id and name columns alongside the exploded pairs:
select id,name,note_name,note_level from s2 lateral view explode(deductions) d as note_name,note_level;
OK
1 leo 英语 四级
1 leo 日语 三级
2 marry 英语 六级
2 marry 俄语 一级
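Hive also has built-ins for pulling a map apart without explode: map_keys() returns all keys, map_values() all values, and size() the number of entries (a sketch):
hive> select map_keys(deductions), map_values(deductions), size(deductions) from s2;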
3) struct: like a struct in C; fields are accessed with the "dot" notation
e.g. a column of type struct<班级:string,学号:int>
holding the row {'10班',11001}:
colname.班级 --> '10班'
A struct groups fields of different types in a single value; think of it as a map with a fixed schema.
Fields used below: city (城市), province (省), area (区域), code (邮编)
create table s3(
id int,
address struct<city:string,province:string,area:string,code:int>
)
row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';
s3.txt
1,大连-辽宁-东北部-116000
2,青岛-山东-东部-211000
3,大连-辽宁-东北部-116000
load data local inpath '/home/hyxy/Desktop/s3.txt' overwrite into table s3;
select * from s3;
1 {"city":"大连","province":"辽宁","area":"东北部","code":116000}
2 {"city":"青岛","province":"山东","area":"东部","code":211000}
3 {"city":"大连","province":"辽宁","area":"东北部","code":116000}
Query the city field:
select address.city from s3;
OK
大连
青岛
大连
select * from s3 where address.city='大连';
1 {"city":"大连","province":"辽宁","area":"东北部","code":116000}
3 {"city":"大连","province":"辽宁","area":"东北部","code":116000}
select address.city,count(1) from s3 group by address.city;
大连 2
青岛 1
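Analogous to array() and map(), the built-in named_struct() constructor builds a struct value inline, and the dot operator works on it directly (a minimal sketch):
hive> select named_struct('city','大连','code',116000).city;   -- returns 大连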
A combined example:
drop table students;
create table students(
sno int,
sname string,
sage int,
slikes array<string>,
snote map<string,string>,
sscore struct<chinese:int,english:int,math:int>
)
row format delimited fields terminated by ','
collection items terminated by '-'
map keys terminated by ':'
lines terminated by '\n';
student.txt
1,leo,25,足球-篮球-吉他,英语:四级-日语:三级,90-80-100
2,mili,25,足球-篮球-吉他,英语:六级-日语:一级,80-80-90
3,lin,25,足球-美食-吉他,英语:四级-日语:一级,90-90-90
load data local inpath '/home/hyxy/Desktop/student.txt' overwrite into table students;
select * from students;
1 leo 25 ["足球","篮球","吉他"] {"英语":"四级","日语":"三级"} {"chinese":90,"english":80,"math":100}
2 mili 25 ["足球","篮球","吉他"] {"英语":"六级","日语":"一级"} {"chinese":80,"english":80,"math":90}
3 lin 25 ["足球","美食","吉他"] {"英语":"四级","日语":"一级"} {"chinese":90,"english":90,"math":90}
insert into ... is not allowed: a plain INSERT cannot populate columns of complex types
select sum(sscore) from students;      -- not allowed: cannot aggregate a whole struct
select sum(sscore.math) from students; -- allowed: a single struct field can be aggregated
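All three complex columns can be combined in one query; a sketch mixing array indexing, map lookup and struct field access:
hive> select sname, slikes[0], snote['英语'], sscore.math from students;
-- e.g. for leo: 足球  四级  100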
Example:
--------------------------------------------------------------------------------
You only need to tell Hive, at table-creation time, which column and row delimiters the data uses; Hive can then parse the data.
Hive's default column delimiter: the control character Ctrl+A (octal \001, i.e. \x01)
Hive's default row delimiter: the newline character \n
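Spelled out explicitly, the following DDL is equivalent to omitting the row format clause altogether (a sketch; t_default is a hypothetical table name):
create table t_default(id int, name string)
row format delimited
fields terminated by '\001'   -- Ctrl+A, the default
lines terminated by '\n';     -- also the default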
Creating the table from beeline:
$>beeline -u jdbc:hive2://master:10000/default -n hyxy -p 123456
create table employees(name string,salary float,subordinates array<string>,
deductions map<string,float>,
address struct<street: string,city:string,state:string,zip:int>);
Load the data into the cluster:
$>cp /mnt/hgfs/工具/employees.txt .
$>LOAD DATA LOCAL INPATH '/home/hyxy/Desktop/employees.txt' OVERWRITE INTO TABLE employees;
hive>use default;
hive>select * from employees;
1. Querying the array
hive>select name, subordinates[1] from employees where name = 'John Doe';
2. Querying the map
hive>select name,deductions['State Taxes'] from employees where name = 'John Doe';
3. Querying the struct
hive>select name,address.state from employees where name = 'John Doe';
Hive order by vs. sort by (Programming Hive §6.5):
----------------------------------------------------------------------------
t1.txt
1
3
4
57
23
45
6
-----------------------------------------------------------------------------
0. Create the test data
hive>create table test(id int);
hive>load data local inpath '/home/hyxy/Desktop/t1.txt' into table test;
1. order by, a global (total) sort:
hive>select * from test order by id;
2. sort by, a per-reducer (partial) sort
With the default single reducer, the ResourceManager UI on port 8088 shows 1 reduce task:
select * from test sort by id;
With 3 reducers, rows are hash-partitioned across them (roughly by id % 3); port 8088 now shows 3 reduce tasks (see the distribute by sketch below):
hive>set mapreduce.job.reduces=3;
hive>select * from test sort by id;
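To control explicitly which reducer each row goes to (for example by id % 3), combine distribute by with sort by; rows are then sorted within each reducer's partition (a sketch):
hive> set mapreduce.job.reduces=3;
hive> select * from test distribute by (id % 3) sort by id;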
order by is always a global sort: even after set mapreduce.job.reduces=3, the job still runs with a single reducer.
Indexes:
1. Create an index:
hive>create index test_id_index
>on table test(id)
>as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
>with deferred rebuild;
2. Query the index table: an index table (table type INDEX_TABLE) is generated automatically in the hive database;
hive>select * from hive.hive__test_test_id_index__;
Note: the freshly created index table contains no data yet;
The index table has three columns:
id: the indexed column;
_bucketname: the location of the data file on HDFS;
_offsets: the byte offsets of the matching rows within that file;
3. Rebuild the index table to generate the index data (note: no table keyword after on):
hive>alter index test_id_index on hive.test rebuild;
hive> select * from hive.hive__test_test_id_index__;
OK
3 hdfs://mycluster/user/hive/warehouse/hive.db/test/test_order.txt [0,9]
6 hdfs://mycluster/user/hive/warehouse/hive.db/test/test_order.txt [13]
9 hdfs://mycluster/user/hive/warehouse/hive.db/test/test_order.txt [17]
12 hdfs://mycluster/user/hive/warehouse/hive.db/test/test_order.txt [6]
1 hdfs://mycluster/user/hive/warehouse/hive.db/test/test_order.txt [19]
4 hdfs://mycluster/user/hive/warehouse/hive.db/test/test_order.txt [2,21]
7 hdfs://mycluster/user/hive/warehouse/hive.db/test/test_order.txt [15]
2 hdfs://mycluster/user/hive/warehouse/hive.db/test/test_order.txt [23]
5 hdfs://mycluster/user/hive/warehouse/hive.db/test/test_order.txt [4]
8 hdfs://mycluster/user/hive/warehouse/hive.db/test/test_order.txt [11]
Insert a row and query the index table again: the new row does not show up until the index is rebuilt:
hive> insert into test values(555);
hive> select * from hive.hive__test_test_id_index__;
Rebuild the index: this launches a MapReduce job that rescans the table and reloads the index table:
alter index test_id_index on test rebuild;
hive> select * from hive.hive__test_test_id_index__;
Compare the same filter on symbol across three tables: stocks (no index), stocks_index (compact index), stocks_bitmap (bitmap index)
4. Prepare the data: the stocks table [2558 KB] (Programming Hive p.56, §4.3.2)
Upload the data [2558 KB]:
$>hadoop fs -put /mnt/hgfs/2.安装环境/data/stocks/stocks.csv /data/stocks
(1) No index, stocks:
hive>create external table if not exists stocks (exchange1 string,symbol string,ymd
string,price_open float,price_high float,price_low float,price_close float,
volume int,price_adj_close float)
row format delimited fields terminated by ','
location '/data/stocks';
(2) Compact index table stocks_index:
hive>create table stocks_index as select * from stocks;
Create the index:
hive>create index stocks_id_index
on table stocks_index(symbol)
as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
with deferred rebuild;
hive> select * from default__stocks_index_stocks_id_index__;
Rebuild the index:
hive> alter index stocks_id_index on stocks_index rebuild;
(3) Bitmap index table stocks_bitmap:
### Create a bitmap index; bitmap indexes are generally used on columns with few distinct values:
hive>create table stocks_bitmap as select * from stocks;
Create the index:
hive> create index stocks_id_index
on table stocks_bitmap(symbol)
as 'BITMAP'
with deferred rebuild;
Rebuild the index:
hive> alter index stocks_id_index on stocks_bitmap rebuild;
### Indexes do not take effect by default
Enable and use the indexes (see the default configuration file for these properties):
【【【
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set hive.optimize.index.filter=true;
set hive.optimize.index.filter.compact.minsize=0;
】】】
Property notes:
hive.optimize.index.filter: whether to automatically use indexes (default: false)
hive.optimize.index.filter.compact.minsize: the minimum input size for which a compact index is automatically applied
-----------------------------------------------------------------
1) No index, execution time: 58637
hive> select year(ymd), avg(price_close)
from stocks
where exchange1 = 'NASDAQ' and symbol = 'AAPL'
group by year(ymd);
2) Compact index, execution time: 56630
hive> select year(ymd), avg(price_close)
from stocks_index
where exchange1 = 'NASDAQ' and symbol = 'AAPL'
group by year(ymd);
3) Bitmap index, execution time: 55344
hive> select year(ymd), avg(price_close)
from stocks_bitmap
where exchange1 = 'NASDAQ' and symbol = 'AAPL'
group by year(ymd);
5. Show the indexes
hive>show formatted index on stocks ;
show formatted index on stocks_index ;
show formatted index on stocks_bitmap ;
6. Drop an index (again, no table keyword after on):
hive>drop index stocks_id_index on stocks ;
Storage formats:
1. The default storage format is plain text:
stored as textfile;
TextFile is Hive's default file format. The data is stored uncompressed, so it costs more disk space and is more expensive to parse.
Loading data into Hive from local files only works with the textfile format.
2. Binary storage formats:
sequence files, avro files, parquet files, rcfile and orcfile;
The sequence file format can split a file into blocks and compress the blocks in a splittable way.
To use the sequence file format in Hive, specify STORED AS SEQUENCEFILE in the create table statement.
Sequence files support 3 compression modes: NONE, RECORD and BLOCK; the default is RECORD (record level).
In practice, BLOCK-level compression performs best and stays splittable. Like many compression properties, this one is not specific to Hive: it can be defined in Hadoop's mapred-site.xml, in Hive's hive-site.xml, or set in a script or before a query when needed:
set mapred.output.compression.type=BLOCK
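A sketch of converting stocks into a block-compressed sequence file (stocks_seq is a hypothetical table name; the set properties are the standard Hadoop/Hive compression switches):
hive> set hive.exec.compress.output=true;
hive> set mapred.output.compression.type=BLOCK;
hive> create table stocks_seq stored as sequencefile as select * from stocks;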
Binary formats fall into two classes: row-oriented and column-oriented.
Hive itself supports two row-oriented formats, Avro and Sequence, both general-purpose, splittable and compressible.
Hive also supports column-oriented formats: Parquet, RCFile and ORCFile.
3. Convert to parquet:
hive> create table stocks_parquet stored as parquet as select * from stocks;
Note: the original stocks table [400,000 rows] is 21 MB; stored as parquet, the data file on HDFS is 6 MB, a compression ratio of roughly 3x;
4. Convert to rcfile:
hive> create table stocks_rcfile stored as rcfile as select * from stocks;
Note: the original stocks table [400,000 rows] is 21 MB; stored as rcfile, the data file on HDFS is 16 MB, a compression ratio of only about 1.3x;
5. Convert to orcfile:
hive> create table stocks_orcfile stored as orcfile as select * from stocks;
Note: the original stocks table [400,000 rows] is 21 MB; stored as orcfile, the data file on HDFS is 5 MB, a compression ratio of roughly 4x;
Check the sizes of the generated files through the web UI
6. Performance comparison:
hive>select symbol,avg(price_close) from stocks group by symbol;
execution time: exec/fetch time: 0.227/1.580 sec
hive>select symbol,avg(price_close) from stocks_parquet group by symbol;
execution time: exec/fetch time: 0.144/2.846 sec
hive>select symbol,avg(price_close) from stocks_rcfile group by symbol;
execution time: exec/fetch time: 0.114/1.238 sec
hive>select symbol,avg(price_close) from stocks_orcfile group by symbol;
execution time: exec/fetch time: 0.129/2.027 sec
User-defined functions (UDFs) are a powerful feature that lets users extend HiveQL:
1) List Hive's built-in UDFs
show functions;
2) The concat function concatenates multiple input strings into a single output string
hive> select concat(id,name) from s1;
OK
1张三
2李四
Writing a custom UDF:
0. Add the dependency jars: unpack the Hive distribution and copy the 108 jars under D:\jar包\apache-hive-1.2.1-bin\lib into the project (drop the non-jar files such as .pom)
1. Create a Java class that extends UDF
2. Implement an evaluate() method;
package com.hyxy;
import org.apache.hadoop.hive.ql.exec.UDF;
// A simple UDF that adds two int columns: sumfunction(a, b) returns a + b
public class MyFunction extends UDF {
    public int evaluate(int a, int b) {
        return a + b;
    }
}
3. Build the jar (no need to select a main class, just confirm!) and upload it to Linux;
$> cp /mnt/hgfs/工具/hiveJar.jar /home/hyxy/Desktop
4. Load the jar that contains the custom function;
hive>add jar /home/hyxy/Desktop/hiveJar.jar ;
Added [/home/hyxy/Desktop/hiveJar.jar] to class path
Added resources: [/home/hyxy/Desktop/hiveJar.jar]
hive> create temporary function {function name} as 'package.ClassName';
hive> create temporary function sumfunction as 'com.hyxy.MyFunction';
Test:
1) Create a table with two int columns and insert a record
hive> create table t3(i int,j int);
hive> insert into t3 values(3,5);
hive> select * from t3;
OK
3 5
2) Query through the custom function
hive> select sumfunction(i,j) from t3;
OK
8
Drop the temporary function:
hive>drop temporary function sumfunction;
5. Categories of user-defined functions:
a. UDF: one row in --> one row out
b. UDAF: many rows in --> one row out
c. UDTF: one row in --> many rows out (a table)
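Built-in functions illustrate each category (a sketch reusing the s1 table from above):
hive> select length(name) from s1;   -- UDF: one row in, one row out
hive> select count(*) from s1;       -- UDAF: many rows in, one row out
hive> select explode(likes) from s1; -- UDTF: one row in, many rows out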