Hive (2): Data Definition and Data Manipulation

Article link: https://blog.youkuaiyun.com/qq_16953611/article/details/82317083

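This article covers basic Hive SQL operations: creating and altering tables and creating views in the data definition language (DDL); insert, update, and delete operations in the data manipulation language (DML); and select, aggregation, and limit in the data query language (DQL).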

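Data Definition and Data Manipulation

I. DDL

1.1 Creating tables

Create a managed (internal) table. Tables are managed by default; this is generally the right choice when Hive both creates the table and writes its data.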

create table movie_table(
    movieId STRING,
    title STRING,
    genres STRING
)row format delimited fields terminated by ','
 stored as textfile
 location '/movie_table';

Create an external table

create external table rating_table(
    userId STRING,
    movieId STRING,
    rating STRING,
    ts     STRING
)row format delimited fields terminated by ','
 stored as textfile
 location '/rating_table';
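
The practical difference between the two kinds of table shows up when a table is dropped: for a managed table Hive removes both the metadata and the data files, while for an external table it removes only the metadata. A minimal sketch of that behaviour follows. It works on scratch copies so the tables above stay intact; the `_tmp` table names and the `/tmp/rating_table_tmp` path are hypothetical.

```sql
-- Scratch copies with the same schema (names and path are hypothetical).
create table movie_table_tmp like movie_table;
create external table rating_table_tmp like rating_table
location '/tmp/rating_table_tmp';

-- The "Table Type" row reads MANAGED_TABLE or EXTERNAL_TABLE.
describe formatted movie_table_tmp;
describe formatted rating_table_tmp;

-- Managed table: metadata and data files are both deleted.
drop table movie_table_tmp;

-- External table: only the metadata is deleted;
-- anything under '/tmp/rating_table_tmp' stays on HDFS.
drop table rating_table_tmp;
```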

Create a partitioned table

create external table rating_table_p(
    userId STRING,
    movieId STRING,
    rating STRING
)partitioned by(dt STRING)
 row format delimited 
 fields terminated by '\t'
 lines terminated by '\n';
-- List the partitions of this table
show partitions rating_table_p;

Create a bucketed table (in effect, this designates an index-like field)

create external table rating_table_b(
    userId  STRING,
    movieId STRING,
    rating  STRING
)clustered by (userId) into 32 buckets;
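
Bucketing only pays off if rows are actually written into the matching bucket files, which in practice means filling the table with insert ... select rather than copying files into place. A minimal sketch, assuming rating_table already holds data. The `hive.enforce.bucketing` setting applies to Hive 1.x only; from Hive 2.0 bucketing is always enforced and the property no longer exists.

```sql
-- Hive 1.x only; omit this line on Hive 2.0 and later,
-- where the property was removed.
set hive.enforce.bucketing=true;

-- Rows are hashed on userId and written to one of the 32 bucket files.
insert overwrite table rating_table_b
select userId, movieId, rating from rating_table;

-- Read a single bucket, roughly 1/32 of the users.
select * from rating_table_b
tablesample(bucket 1 out of 32 on userId)
limit 10;
```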

Create an empty table with the same schema as an existing table

create table empty_test_table
like rating_table;

1.2 Altering table structure

Add a column with a column comment

alter table rating_table add columns(new_col INT comment 'new com');

Rename a table

alter table rating_table rename to rename_table;

Create a view

create view shorter_join as
   select * from people join cart
   on (cart.people_id=people.id) where firstname='john';

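Show commands

show tables;
show databases;
show partitions table_name;
show functions;
-- Older syntax; Hive 2.0 and later separate the column with a space:
-- describe extended table_name col_name;
describe extended table_name.col_name;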

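II. Common DML

Includes: insert, update, delete, and insert into. Update and delete work only on transactional (ACID) tables, which are available from Hive 0.14.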

1.file->table

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
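
Two concrete uses of this syntax, as a sketch. The file paths and the dt value are hypothetical. With LOCAL the file is copied from the client machine's file system; without it, a file already on HDFS is moved into the table's directory.

```sql
-- Copy a local file into movie_table, replacing the existing data.
load data local inpath '/root/hive_test/movies.csv'
overwrite into table movie_table;

-- Move an HDFS file into one partition of the partitioned table.
load data inpath '/tmp/ratings_2018-09-01.tsv'
into table rating_table_p partition (dt='2018-09-01');

-- The new partition should now be listed.
show partitions rating_table_p;
```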

2.table->file

-- Export to the local file system
insert overwrite local directory '/root/hive_test/1.txt' select * from movie_table;
-- Export to HDFS
insert overwrite directory '/root/hive_test/1.txt' select * from movie_table;
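
By default the exported files use Hive's internal field separator (Ctrl-A, `\001`), which is awkward to read with other tools. Here is a sketch of an export with an explicit delimiter, assuming Hive 0.11 or later; the output directory name is hypothetical. The target is treated as a directory even if it looks like a file name, and its existing contents are overwritten.

```sql
-- Write comma-separated files into a local directory.
insert overwrite local directory '/root/hive_test/movie_export'
row format delimited fields terminated by ','
select movieId, title, genres from movie_table;
```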

III. Common DQL

1. Basic select

-- 1. select ... from ...
select name,salary from employees;

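2. Aggregation

-- 2. Aggregate functions such as count(), sum(), and avg() are usually combined with group by
select word,count(word) from wc group by  word;
  -- Note: setting hive.map.aggr=true can improve aggregation performance
  -- With count(distinct col), if col is a partition column, a bug can make the query return 0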
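
A sketch of these points against the rating_table defined earlier. Map-side aggregation is switched on for the session, and the last query shows a commonly used alternative to count(distinct ...) that counts groups from a subquery instead. Whether it helps depends on the data and the Hive version, so treat it as an option to test and not a guaranteed fix.

```sql
-- Enable map-side partial aggregation for this session.
set hive.map.aggr=true;

-- Number of ratings and average rating per movie.
-- rating is stored as STRING, so cast it before averaging.
select movieId,
       count(*) as cnt,
       avg(cast(rating as double)) as avg_rating
from rating_table
group by movieId;

-- Count distinct users through a subquery instead of count(distinct userId).
select count(*) as user_cnt
from (select userId from rating_table group by userId) t;
```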

3. Table-generating functions

-- A common function: explode()
select word,count(*) from (
   select explode(split(sentence,' ')) as word from article
) t 
 group by word;
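
When explode() is used directly in a select list it cannot be combined with other columns, so the usual alternative is lateral view, which joins each source row to the rows generated from it. A sketch assuming the same article table with a sentence column; the aliases w and cnt are arbitrary.

```sql
-- Word count with lateral view, showing the ten most frequent words.
select word, count(*) as cnt
from article
lateral view explode(split(sentence, ' ')) w as word
group by word
order by cnt desc
limit 10;
```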

4. limit: a typical query returns many rows; limit caps the number of rows returned.

select * from employees limit 10;
-- If the query matches more than 10 rows, only 10 are returned

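5. case ... when ... then: maps the values of a single column to new values in the query result.

select name ,salary,
       case
           when salary <5000.0  then 'low'
           when salary >=5000.0 and salary <7000.0  then 'middle'
           -- the 10000.0 upper bound is an assumed value, added so that
           -- 7000.0 counts as 'high' and the else branch is reachable
           when salary >=7000.0 and salary <10000.0  then 'high'
           else 'very high'
       end as bracket from employees;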

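6. where: filters rows by a condition. Note that a column alias defined in the select list cannot be used in the where clause, as in traditional SQL.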
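
One way around the alias restriction is to compute the aliased expression in a subquery and filter in the outer query. A sketch assuming the employees table used above; the 10% deduction and the alias net_salary are made up for illustration.

```sql
-- Fails: net_salary is not visible in the where clause.
-- select name, salary * 0.9 as net_salary
-- from employees where net_salary > 5000.0;

-- Works: the alias is defined in the subquery and filtered outside it.
select e.name, e.net_salary
from (
    select name, salary * 0.9 as net_salary
    from employees
) e
where e.net_salary > 5000.0;
```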

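7. order by, sort by, and distribute by

1. order by performs a global sort. All rows pass through a single reducer, so it is slow on large data sets.

2. sort by sorts within each reducer, so the output is ordered only per reducer and the key ranges of different reducers can overlap.

For that reason distribute by is often placed before sort by: it controls how the map output is partitioned, sending rows with the same key to the same reducer.

When distribute by and sort by name the same columns in ascending order, cluster by can replace the pair.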
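
The three forms side by side, as a sketch against the rating_table defined earlier. Because rating is declared as STRING there, the ordering on that column is lexicographic, not numeric.

```sql
-- Global sort: one reducer handles every row.
select userId, movieId, rating from rating_table order by userId;

-- Rows with the same userId go to the same reducer; each reducer
-- sorts its rows by userId and then by rating, highest first.
select userId, movieId, rating
from rating_table
distribute by userId
sort by userId, rating desc;

-- Shorthand for "distribute by userId sort by userId" (ascending only).
select userId, movieId, rating from rating_table cluster by userId;
```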

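8. join: works as in ordinary SQL, but only equi-joins are supported (non-equality conditions in the on clause were added in Hive 2.2.0).

Join types include inner join, left join, right join, and full outer join.

8.1 An optimization: when three or more tables are joined and every on clause uses the same join key, Hive runs only one MapReduce job, which speeds up the query.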
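
As a sketch of the difference, assume two extra tables next to the ones defined earlier: tag_table(movieId, tag) and user_table(userId, name). Both are hypothetical and not part of the original article. The job counts in the comments describe plain reduce-side joins; automatic map-join conversion can change them.

```sql
-- Same join key (movieId) in every on clause: a single MapReduce job.
select m.title, r.rating, t.tag
from movie_table m
join rating_table r on (m.movieId = r.movieId)
join tag_table t on (m.movieId = t.movieId);

-- Different join keys (movieId, then userId): two MapReduce jobs.
select m.title, r.rating, u.name
from movie_table m
join rating_table r on (m.movieId = r.movieId)
join user_table u on (r.userId = u.userId);
```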

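8.2 Hive also assumes that the last table in the query is the largest one: it tries to buffer the other tables in memory and streams the last one through. Tables are therefore usually listed from left to right in order of increasing size, smallest first. The order can be overridden with hints: /*+ STREAMTABLE(table_name) */ names the large table to stream, and /*+ MAPJOIN(table_name) */ names the small table to load into memory.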
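
A sketch of both hints, assuming rating_table is the large table and movie_table the small one. When a table has an alias, the hint refers to the alias. On recent Hive versions the MAPJOIN hint is ignored unless hive.ignore.mapjoin.hint is set to false, and hive.auto.convert.join usually performs the conversion automatically.

```sql
-- Stream the large table (alias r) and buffer the other one,
-- even though r is not the last table in the query.
select /*+ STREAMTABLE(r) */ m.title, r.rating
from rating_table r
join movie_table m on (r.movieId = m.movieId);

-- Load the small table (alias m) into memory and join on the map side.
select /*+ MAPJOIN(m) */ m.title, r.rating
from rating_table r
join movie_table m on (r.movieId = m.movieId);
```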