Hive 是基于Hadoop构建的一套数据仓库分析系统,他以SQL查询的方式来分析存储在Hadoop分布式文件系统中的数据,将结构化的数据文件映射成一张表,并且提供了完整的SQL查询功能,可以将SQL语句转化成为MapReduce任务运行。这一套SQL就是HiveSQL。对于不熟悉MapReduce的人来说,使用类似SQL的语句就可以很方便地通过查询语句来分析数据。而MapReduce的开发人员,也可以把自己编写的mapper和reducer作为插件支持hive进行更加复杂的数据分析工作。
HiveSQL与关系数据库中的SQL略有不同:HiveSQL支持绝大多数的DDL,DML以及连接查询、条件查询和聚合函数。但是HIVE不支持OLTP,也不支持实时查询。
一、 DDL操作(Data Definition Language)
1. 创建表
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[CLUSTERED BY (col_name, col_name, ...)
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
[ROW FORMAT row_format]
[ STORED AS file_format ]
[LOCATION hdfs_path]
- STORED AS
1). External: 外部表和内部表
CREATE TABLE database.table1
(
column1 STRING COMMENT 'comment1',
column2 INT COMMENT 'comment2'
);
CREATE EXTERNAL TABLE IF NOT EXISTS database.table1
(
column1 STRING COMMENT 'comment1',
column2 STRING COMMENT 'comment2'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LOCATION 'hdfs:///data/dw/asf/20120201';
下面是,当hdfs中的文件用LZO压缩后,如何创建外部表,当然你需要hadoop-gpl的支持才能以文本形式读取lzo文件。
CREATE EXTERNAL TABLE IF NOT EXISTS database.table1
(
column1 STRING COMMENT 'comment1',
column2 STRING COMMENT 'comment2'
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat" OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
LOCATION 'hdfs:///data/dw/asf/20120201';
2)创建分区表:
create table day_table (id int, content string) partitioned by (dt string);
多分区表:
create table day_hour_table (id int, content string) partitioned by (dt string, hour string);
双分区表,按天和小时分区,在表结构中新增加了dt和hour两列。表文件夹目录示意图(多分区表):
ALTER TABLE table_name ADD partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ... partition_spec: : PARTITION (partition_col = partition_col_value, partition_col = partiton_col_value, ...)
ALTER TABLE day_table ADD
PARTITION (dt='2008-08-08', hour='08')
location '/path/pv1.txt'
PARTITION (dt='2008-08-08', hour='09')
location '/path/pv2.txt';
f、删除分区:
ALTER TABLE table_name DROP partition_spec, partition_spec,...
用户可以用 ALTER TABLE DROP PARTITION 来删除分区。分区的元数据和数据将被一并删除。例:
g、 数据加载进分区表中语法:ALTER TABLE day_hour_table DROP PARTITION (dt='2008-08-08', hour='09');
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]LOAD DATA INPATH '/user/pv.txt' INTO TABLE day_hour_table PARTITION(dt='2008-08- 08', hour='08'); LOAD DATA local INPATH '/user/hua/*' INTO TABLE day_hour partition(dt='2010-07- 07');
当数据被加载至表中时,不会对数据进行任何转换。Load操作只是将数据复制至Hive表对应的位置。数据加载时在表下自动创建一个目录,文件存放在该分区下。h、 基于分区的查询的语句SELECT day_table.* FROM day_table WHERE day_table.dt>= '2008-08-08';
查看分区语句:
总结:1、在 Hive 中,表中的一个 Partition 对应于表下的一个目录,所有的 Partition 的数据都存储在最子集的目录中。hive> show partitions day_hour_table; OK dt=2008-08-08/hour=08 dt=2008-08-08/hour=09 dt=2008-08-09/hour=09
2、 总的说来partition就是辅助查询,缩小查询范围,加快数据的检索速度和对数据按照一定的规格和条件进行管理。
3、 外部表也是一种表,普通表有分区,外部表也是有分区的。所以如果是基于分区表创建的外部表一定要对外部表执行ALTER TABLE table_name ADD PARTITION。否则是根本访问不到数据的。
3). 创建Bucket表
CREATE TABLE par_table(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(date STRING, pos STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED ‘\t’
FIELDS TERMINATED BY '\n'
STORED AS SEQUENCEFILE;
对于每一个表(table)或者分区, Hive可以进一步组织成桶,也就是说桶是更为细粒度的数据范围划分。Hive也是 针对某一列进行桶的组织。Hive采用对列值哈希,然后除以桶的个数求余的方式决定该条记录存放在哪个桶当中。
把表(或者分区)组织成桶(Bucket)有两个理由:
(1)获得更高的查询处理效率。桶为表加上了额外的结构,Hive 在处理有些查询时能利用这个结构。具体而言,连接两个在(包含连接列的)相同列上划分了桶的表,可以使用 Map 端连接 (Map-side join)高效的实现。比如JOIN操作。对于JOIN操作两个表有一个相同的列,如果对这两个表都进行了桶操作。那么将保存相同列值的桶进行JOIN操作就可以,可以大大较少JOIN的数据量。
(2)使取样(sampling)更高效。在处理大规模数据集时,在开发和修改查询的阶段,如果能在数据集的一小部分数据上试运行查询,会带来很多方便。
http://blog.youkuaiyun.com/wisgood/article/details/17186107create table bucketed_user(id int,name string) clustered by (id) sorted by(name) into 4 buckets row format delimited fields terminated by '\t' stored as textfile;
4)复制空表
CREATE TABLE empty_key_value_store
LIKE key_value_store;
显示所有表:
SHOW TABLES;
2. 删除表和截断表
删除表会移除表的元数据和数据,而HDFS上的数据,如果配置了Trash,会移到.Trash/Current目录下。
删除外部表时,表中的数据不会被删除。
DROP TABLE table_name;
DROP TABLE IF EXISTS table_name;
从表或者表分区删除所有行,不指定分区,将截断表中的所有分区,也可以一次指定多个分区,截断多个分区。
TRUNCATE TABLE table_name;
TRUNCATE TABLE table_name PARTITION (dt='20080808');
3. 修改表
修改表结构
表添加一列 :
hive> ALTER TABLE pokes ADD COLUMNS (new_col INT);
添加一列并增加列字段注释
hive> ALTER TABLE invites ADD COLUMNS (new_col2 INT COMMENT 'a comment');
重命名表:
hive> ALTER TABLE events RENAME TO 3koobecaf;
删除列:
hive> DROP TABLE pokes;
增加、删除分区
重命名表
增加/更新列
增加表的元数据信息
改变表文件格式与组织
创建/删除视图
创建数据库
二、DML 操作:元数据存储
hive不支持 用insert语句一条一条的进 行插入 操作,也 不支持 update操作。数据是以load的方式加载到建立好的表中。数据一旦导入就不可以修改。向数据表内加载文件
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
加载本地数据,同时给定分区信息
例如:加载本地数据,同时给定分区信息:
加载DFS数据 ,同时给定分区信息:
hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
The above command will load data from an HDFS file/directory to the table. Note that loading data from HDFS will result in moving the file/directory. As a result, the operation is almost instantaneous.
OVERWRITE
将查询结果插入Hive表
INSERT INTO
三、DQL 操作:数据查询SQL
SELECT * FROM test SORT BY amount DESC LIMIT 5
例如
按先件查询
hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';
将查询数据输出至目录:
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';
将查询结果输出至本地目录:
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
选择所有列到本地目录 :
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key < 100;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' select a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a WHERE a.ds='<DATE>';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;
将一个表的统计结果插入另一个表中:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
JOIN
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 and src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION(ds='2008-04-08', hr='12') SELECT src.key WHERE src.key >= 200 and src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
将文件流直接插入文件:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
This streams the data in the map phase through the script /bin/cat (like hadoop streaming). Similarly - streaming can be used on the reduce side (please see the Hive Tutorial or examples)
3.2 基于Partition的查询
3.3 Join
table_reference JOIN table_factor [join_condition]
| table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
| table_reference LEFT SEMI JOIN table_reference join_condition
table_reference:
table_factor
| join_table
table_factor:
tbl_name [alias]
| table_subquery alias
| ( table_references )
join_condition:
ON equality_expression ( AND equality_expression )*
equality_expression:
expression = expression
ON (a.id = b.id AND a.department = b.department)
ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
WHERE a.ds='2010-07-07' AND b.ds='2010-07-07‘
ON (c.key=d.key AND d.ds='2009-07-07' AND c.ds='2009-07-07')
FROM a
WHERE a.key in
(SELECT b.key
FROM B);
FROM a LEFT SEMI JOIN b on (a.key = b.key)
4. 从SQL到HiveQL应转变的习惯
4、Hive不支持将数据插入现有的表或分区中,
仅支持覆盖重写整个表,示例如下:
- INSERT OVERWRITE TABLE t1
- SELECT * FROM t2;
4、hive不支持INSERT INTO xxx VALUES(XXXX), UPDATE, DELETE操作
这样的话,就不要很复杂的锁机制来读写数据。
INSERT INTO syntax is only available starting in version 0.8。INSERT INTO就是在表或分区中追加数据。
5、hive支持嵌入mapreduce程序,来处理复杂的逻辑
如:
- FROM (
- MAP doctext USING 'python wc_mapper.py' AS (word, cnt)
- FROM docs
- CLUSTER BY word
- ) a
- REDUCE word, cnt USING 'python wc_reduce.py';
--doctext: 是输入
--word, cnt: 是map程序的输出
--CLUSTER BY: 将wordhash后,又作为reduce程序的输入
并且map程序、reduce程序可以单独使用,如:
- FROM (
- FROM session_table
- SELECT sessionid, tstamp, data
- DISTRIBUTE BY sessionid SORT BY tstamp
- ) a
- REDUCE sessionid, tstamp, data USING 'session_reducer.sh';
--DISTRIBUTE BY: 用于给reduce程序分配行数据
6、hive支持将转换后的数据直接写入不同的表,还能写入分区、hdfs和本地目录。
这样能免除多次扫描输入表的开销。
- FROM t1
- INSERT OVERWRITE TABLE t2
- SELECT t3.c2, count(1)
- FROM t3
- WHERE t3.c1 <= 20
- GROUP BY t3.c2
- INSERT OVERWRITE DIRECTORY '/output_dir'
- SELECT t3.c2, avg(t3.c1)
- FROM t3
- WHERE t3.c1 > 20 AND t3.c1 <= 30
- GROUP BY t3.c2
- INSERT OVERWRITE LOCAL DIRECTORY '/home/dir'
- SELECT t3.c2, sum(t3.c1)
- FROM t3
- WHERE t3.c1 > 30
- GROUP BY t3.c2;
5. 实际示例
创建一个表
CREATE TABLE u_data (
userid INT,
movieid INT,
rating INT,
unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '/t'
STORED AS TEXTFILE;
下载示例数据文件,并解压缩
wget http://www.grouplens.org/system/files/ml-data.tar__0.gz
tar xvzf ml-data.tar__0.gz
加载数据到表中:
LOAD DATA LOCAL INPATH 'ml-data/u.data'
OVERWRITE INTO TABLE u_data;
统计数据总量:
SELECT COUNT(1) FROM u_data;
现在做一些复杂的数据分析:
创建一个 weekday_mapper.py: 文件,作为数据按周进行分割
import sys
import datetime
for line in sys.stdin:
line = line.strip()
userid, movieid, rating, unixtime = line.split('/t')
生成数据的周信息
weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
print '/t'.join([userid, movieid, rating, str(weekday)])
使用映射脚本
//创建表,按分割符分割行中的字段值
CREATE TABLE u_data_new (
userid INT,
movieid INT,
rating INT,
weekday INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '/t';
//将python文件加载到系统
add FILE weekday_mapper.py;
将数据按周进行分割
INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;
SELECT weekday, COUNT(1)
FROM u_data_new
GROUP BY weekday;
处理Apache Weblog 数据
将WEB日志先用正则表达式进行组合,再按需要的条件进行组合输入到表中
add jar ../build/contrib/hive_contrib.jar;
CREATE TABLE apachelog (
host STRING,
identity STRING,
user STRING,
time STRING,
request STRING,
status STRING,
size STRING,
referer STRING,
agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|//[[^//]]*//]) ([^ /"]*|/"[^/"]*/") (-|[0-9]*) (-|[0-9]*)(?: ([^ /"]*|/"[^/"]*/") ([^ /"]*|/"[^/"]*/"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
)
STORED AS TEXTFILE;