Hive Syntax Experiments and Conclusions

jsjw18

License: CC 4.0 BY-SA

Category: hadoop. Tags: hadoop, hive

Article link: https://blog.youkuaiyun.com/victor_ww/article/details/39345733

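This post summarizes hands-on experience with Hive: the restrictions of multi-insert mode, queries not showing column headers, overwriting data, copying table structure, version-dependent query support such as in and having, plus exists and nested queries. It also covers field delimiters, date formats, things to watch when loading data, and managing internal and external tables, and closes with details to mind when handling strings and partitions.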

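Important: use the overwrite keyword with care, especially when the target is a directory. Double-check the path, because any files already in that directory are overwritten directly.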
hive> insert into area_t values('1','1','1',now(),'1','1',2,2);
NoViableAltException(26@[])
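Conclusion: this form is not supported on the version tested (0.13). The insert ... values syntax was only added in Hive 0.14.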
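A single-row insert can still be expressed on 0.13 by selecting constants from an existing table, which is the same trick the post uses further down. The sketch below assumes the eight-column area_t defined later in this post and uses a date literal in place of now(); the values form is shown only for comparison and assumes Hive 0.14 or later.

```sql
-- Hive 0.13 and earlier: emulate a single-row insert with a SELECT of constants.
-- area is used only as a row source; LIMIT 1 keeps exactly one row.
INSERT INTO TABLE area_t
SELECT '1', '1', '1', '2014-09-17', '1', '1', 2, 2
FROM area
LIMIT 1;

-- Hive 0.14 and later (assumption: not available on the version tested here).
INSERT INTO TABLE area_t
VALUES ('1', '1', '1', '2014-09-17', '1', '1', 2, 2);
```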

hive> insert into table area_t select areacode,areaname,'1',gxrq,parentcode,bz,flags,flags1 from area limit 15;
Conclusion: insert into appends to the existing data.

hive> insert overwrite table area_t select areacode,areaname,'1',gxrq,parentcode,bz,flags,flags1 from area limit 15;
Conclusion: insert overwrite replaces the existing data.

hive> insert overwrite directory '/user/lifeng' select * from area;
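Conclusion: into cannot be used when writing to a directory (only overwrite), and the directory path must be enclosed in quotes.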

hive> from area
> insert into table area_t select areacode,areaname,'1',gxrq,parentcode,bz,flags,flags1 limit 10;
Conclusion: this is the basic FROM-first form.

hive> from area
> insert into table area_t select areacode,areaname,'1',gxrq,parentcode,bz,flags,flags1 limit 10
> insert into table area_t select areacode,areaname,'1',gxrq,parentcode,bz,flags,flags1 order by areacode desc limit 15;
FAILED: SemanticException [Error 10087]: The same output cannot be present multiple times: area_t

Conclusion: in multi-insert mode, the same table cannot appear as an output more than once.
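Multi-insert is meant for writing one scan of the source to several different outputs. A minimal sketch, assuming the second target is area_t_cp (the copy created with create table ... like later in this post) and that it has the same columns as area_t:

```sql
-- One pass over area feeds two different output tables.
FROM area
INSERT INTO TABLE area_t
  SELECT areacode, areaname, '1', gxrq, parentcode, bz, flags, flags1
  LIMIT 10
INSERT INTO TABLE area_t_cp
  SELECT areacode, areaname, '1', gxrq, parentcode, bz, flags, flags1
  ORDER BY areacode DESC
  LIMIT 15;
```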

By default, no query in the Hive CLI displays column headers (that is, field names).
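Headers can be switched on per session. A minimal sketch, assuming the Hive CLI and its hive.cli.print.header property:

```sql
-- Ask the CLI to print a header row before the result rows.
set hive.cli.print.header=true;

select parentcode from area limit 20;
```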

hive> select [all] parentcode from area limit 20;

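Conclusion: all is optional (the square brackets only mark it as optional and are not typed). With or without it, every matching row is returned, duplicates included.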

hive> select all parentcode from area order by parentcode limit 20;
Conclusion: the rows are sorted first and then the first 20 are taken. order by is a global sort and runs with a single reduce task.

hive> select all parentcode from area sort by parentcode limit 20;
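Conclusion: in this test sort by launched two jobs and took longer. It only sorts locally within each reducer, so the result is not guaranteed to be globally ordered when more than one reducer runs.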
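The difference between the two clauses is easier to see with more than one reducer. The sketch below is illustrative only: the reducer count of 2 is an arbitrary assumption, and mapred.reduce.tasks is the older property name that 0.13-era clusters accept.

```sql
-- Assumption: force two reducers so the per-reducer behaviour is visible.
set mapred.reduce.tasks=2;

-- Each reducer sorts only its own rows; the combined output is not globally ordered.
select parentcode from area sort by parentcode;

-- distribute by routes equal parentcode values to the same reducer,
-- and sort by then orders the rows inside each reducer.
select parentcode from area distribute by parentcode sort by parentcode;

-- A total order still needs order by (single reducer).
select parentcode from area order by parentcode;
```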

hive> select distinct parentcode from area order by parentcode limit 20;
Conclusion: the values are deduplicated and sorted, then the first 20 rows are taken.

hive> select yxaccno from area a,area_t b where a.areacode = b.yxaccno;
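Conclusion: no rows were returned in this test. The comma-style equi-join is valid syntax from Hive 0.13 on, so the empty result more likely comes from key values that do not match exactly (the trim() example further down shows areacode carrying extra whitespace) than from unsupported syntax.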

hive> select b.yxaccno from area a right join area_t b on a.areacode = b.yxaccno;
Conclusion: lists all rows of area_t.

hive> select a.areacode,b.yxaccno from area_t b left join area a on a.areacode = b.yxaccno;
Conclusion: lists all rows of area_t.

hive> select a.areacode,b.yxaccno from area_t b inner join area a on a.areacode = b.yxaccno;
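Conclusion: no rows were returned in this test. inner join with an equality condition is standard Hive, so this again points to join keys that do not match exactly rather than to unsupported syntax.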

hive> select a.areacode,b.yxaccno from area_t b full join area a on a.areacode = b.yxaccno;
Conclusion: lists all rows of both area and area_t.

hive> select a.areacode,b.yxaccno from area_t b join area a on a.areacode = b.yxaccno;
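Conclusion: no rows were returned in this test. A bare join is an inner join, so the result is empty for the same reason: the key values do not match exactly.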
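One way to check whether whitespace is the cause of the empty joins is to trim both keys in the join condition. This is a diagnostic sketch under that assumption, not a confirmed explanation of the results above:

```sql
-- Assumption: the keys differ only by leading/trailing whitespace.
select a.areacode, b.yxaccno
from area_t b
join area a
  on trim(a.areacode) = trim(b.yxaccno);

-- Comparing lengths before and after trimming shows whether padding is present.
select areacode, length(areacode), length(trim(areacode))
from area
limit 5;
```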

hive> select parentcode,count(1),sum(sons) from area group by parentcode;
Conclusion: returns the aggregate statistics for each parentcode.

hive> show functions;
Conclusion: lists all available functions.

hive> describe function substr;
Conclusion: shows how the given function is used.

hive> show databases;
Conclusion: lists all databases.

hive> use dw_testing;
Conclusion: switches to the dw_testing database.

hive> show tables;
Conclusion: lists all tables in the current database.

hive> show tables '*t';
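Conclusion: lists the tables whose names end in 't'. The character '_' does not match an arbitrary single character here; it only stands for itself.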
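The pattern language of show tables is small: '*' matches any run of characters and '|' separates alternatives. A short sketch, using table-name prefixes taken from this post:

```sql
-- Tables whose names start with area.
show tables 'area*';

-- Tables whose names start with area or with mid.
show tables 'area*|mid*';
```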

hive> desc area_t;
Conclusion: shows the table structure.

hive> alter table area_t add columns(create_time date comment '创建时间');
Conclusion: adds a column with a comment.

hive> alter table area_t rename to area_new;
Conclusion: renames the table.

hive> select areacode from area where gxrq > '0' limit 2;
hive> select areacode from area where gxrq is not null limit 2;
hive> select areacode from area where gxrq = '2008/9/23 14:10:09' limit 2;
Conclusion: for time comparisons, the two approaches above cannot be used.
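If the time values are stored as text in a non-standard layout, one option is to parse them explicitly before comparing. The sketch below assumes gxrq in area is a string shaped like '2008/9/23 14:10:09'; the post does not show the column type of area, so the pattern is a guess to adjust.

```sql
-- Assumption: gxrq is a string in yyyy/M/d HH:mm:ss layout.
-- unix_timestamp(value, pattern) parses it into seconds since the epoch.
select areacode
from area
where unix_timestamp(gxrq, 'yyyy/M/d HH:mm:ss')
      >= unix_timestamp('2008-09-23 00:00:00')
limit 2;
```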

hive> select areacode from area where areacode = '7777580' limit 2;
Conclusion: this also returns no rows. The whitespace has to be trimmed before the value matches: select areacode from area where trim(areacode) = '7777580' limit 2;

hive> insert into table area_t select '1','1','1','2014-09-17','2','2',3,3,'2014-09-16' from area limit 1;
hive> select * from area_t;
Conclusion: date values can be inserted as strings; a cast is applied automatically.

hive> alter table area_t replace columns(create_time date);
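Conclusion:
1. The column list is replaced, so every column except create_time is dropped from the table definition. Each column must be given as name plus type, otherwise an error is raised.
2. The file contents in HDFS are not deleted; only the metadata changes.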

hive> dfs -cat /hive/warehouse/dw_testing.db/area_t/*;
Conclusion: prints the contents of the table's files.

hive> alter table area_t change create_time update_time date;
Conclusion: renames the column.

hive> alter table area_t add columns(name varchar(30),age int);
Conclusion: adds several columns at once.

hive> alter table area_t change id cid int first;
Conclusion: renames id to cid and moves it to the first position.

hive> alter table area_t change username name varchar(30) after cid;
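Conclusion: renames username to name and places it right after the cid column. The column definition (name and type) must be restated in the change clause for the position move to work.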

hive> DESCRIBE EXTENDED area;
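Conclusion: in this test, for the external table this showed the field names together with the metadata, while for an internal table it only showed the field names.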

hive> desc area_t;
Conclusion: for both internal and external tables, this shows the table structure (field name, type, length, comment).

hive> drop table area_t;
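Conclusion: dropping an internal table deletes both the metadata and the data files; dropping an external table deletes only the metadata and leaves the data files in place.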
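The external case can be tried on a throwaway table. In the sketch below the table name area_ext, its two columns and the HDFS location /tmp/area_ext are all hypothetical and chosen only for illustration:

```sql
-- Hypothetical external table; name, columns and location are illustrative.
create external table area_ext (
  areacode varchar(20),
  areaname varchar(100)
)
row format delimited fields terminated by ','
stored as textfile
location '/tmp/area_ext';

-- Removes only the metadata; files under /tmp/area_ext stay in HDFS.
drop table area_ext;

-- The files can still be listed afterwards from the Hive CLI.
dfs -ls /tmp/area_ext;
```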

create table AREA_T
(
yxaccno VARCHAR(20),
yxaccname VARCHAR(100),
dbckm CHAR(1),
gxrq DATE,
yhid VARCHAR(10),
bz VARCHAR(200),
areatb int,
levels int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\' STORED AS TEXTFILE;
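Conclusion: always specify the field delimiter when creating a table. Otherwise Hive falls back to its default delimiter, Ctrl-A (\001), which a comma-separated file will not match.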

hive> load data local inpath '/usr/local/wonhigh/hivearea_t_file.txt' into table area_t;
Conclusion: appends the local file's data to the internal table.

hive> load data local inpath '/usr/local/wonhigh/hivearea_t_file.txt' overwrite into table area_t;

Conclusion: replaces all data in the internal table with the local file's data.

hive> create table area_t_cp like area_t;
Conclusion: copies the structure of an internal table.
hive> create table area_cp like area;
Conclusion: copies the structure of an external table.
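create table ... like copies only the definition. When the rows are wanted as well, create table ... as select (CTAS) is the usual alternative. A minimal sketch, where area_t_full is a hypothetical table name; note that the new table's storage format is whatever the CTAS statement specifies or Hive's default, not necessarily that of area_t:

```sql
-- Structure and data in one step (CTAS). area_t_full is a hypothetical name.
create table area_t_full
row format delimited fields terminated by ','
stored as textfile
as
select * from area_t;
```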

hive> select colthno,count(*) from mid where trim(colthno) in('BYW34U09DJ1BM4','BYW35G03DU1BM4','BYW35N32DP1BL4','BYW35N32DU1BL4') group by colthno having trim(colthno) = 'BYW34U09DJ1BM4';

Conclusion: version 0.13 supports in and having.

hive> select * from mid b where b.colthno in(select a.colthno from mid_cp a where a.price > 0 limit 10) limit 20;

Conclusion: a subquery is supported inside in.

hive> select b.* from mid b where exists (select 1 from mid_cp a where a.colthno = b.colthno) limit 20;

Conclusion: exists is supported.

hive> select a.colthno,a.price from (select colthno,sum(round(price,2)) price from mid where trim(colthno) in('BYW34U09DJ1BM4','BYW35G03DU1BM4','BYW35N32DP1BL4','BYW35N32DU1BL4') group by colthno having colthno <> 'BYW34U09DJ1BM4') a order by a.colthno;

Conclusion: nested queries are supported.

hive> drop database dc_retail_mdm cascade;

Conclusion: when dc_retail_mdm still contains tables, cascade is required to drop it.

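Summary:

1. Always specify the data delimiter when creating a table.

2. Flat files must not keep a header row (the line of field names).

3. Save flat files as UTF-8 to avoid garbled characters.

4. For dates: date holds only a calendar date, and in flat files it must be written as yyyy-MM-dd. timestamp holds date plus time and accepts two forms: yyyy-MM-dd HH:mm:ss or yyyy-MM-dd HH:mm:ss.fffffffff.

5. Be very careful with the path whenever overwrite is used.

6. Avoid '\t' as the flat-file delimiter where possible, since field values may themselves contain tabs or similar whitespace.

7. With loaded data, a value longer than the column's declared length is truncated automatically, keeping the characters from the first position onward; a value whose type does not match and cannot be converted becomes null.

8. By default, no Hive query shows column headers (field names) in its result listing.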
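Items 4 and 7 can be spot-checked with a few casts. The sketch below uses area only as a row source and takes its literal values from nowhere in particular; the comments describe the expected behaviour of Hive's conversion rules, which is worth confirming on the version in use.

```sql
-- A string in yyyy-MM-dd form converts to date.
select cast('2014-09-17' as date) from area limit 1;

-- A slash-separated string is not in yyyy-MM-dd form; expected to come back as NULL.
select cast('2014/09/17' as date) from area limit 1;

-- timestamp accepts an optional fractional-seconds part.
select cast('2014-09-17 10:20:30.123' as timestamp) from area limit 1;

-- varchar(3) keeps only the leading three characters of a longer value.
select cast('abcdef' as varchar(3)) from area limit 1;

-- A non-numeric string cannot be converted to int and becomes NULL.
select cast('abc' as int) from area limit 1;
```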