HiveSQL_hive create table like会启mr任务吗-优快云博客

HiveSQL DDL操作数据库

准备操作

操作数据库

HiveSQL DDL--hive表映射HDFS文件

HQL DML语句--数据导入-load data方式

HQL DML语句--数据导入-insert方式

HiveSQL DDL操作数据库

准备操作

在正式使用Hive之前，务必确保你的环境是ok的，即：打开虚拟机之后，需要做的事情是：

1.在node1中启动hadoop环境，历史任务服务，metastore服务，hiveserver2服务。

start-all.sh

mapred --daemon start historyserver

nohup hive --service metastore &

nohup hive --service hiveserver2 &

lsof -i:10000

操作数据库

1. 查看所有的数据库.

show databases ;

2.切换数据库.

use day05;

3.创建数据库,默认是存储到: HDFS的 /user/hive/warehouse/数据库名.db 这里.

create database day06;      -- 如果重复执行会报错.
create database if not exists day06;      -- 如果day06数据库不存在, 我们再创建. 存在就什么都不操作.

4. 创建数据库的时候, 指定数据库在HDFS上的存储位置.

create database day07 location '/aa';          -- 引号中写的是 HDFS的路径

5. 查看具体的数据库.

show create database day07;     -- 查看数据库的简单信息.
desc database day07;            -- 查看数据库的详细详细.

6. 切换到day07数据库, 创建1个表, 我们看看.

use day07;
create table stu(id int, name string);

7. 删除数据库.

drop database day06;            -- 因为是空库, 所以会直接删除.
drop database day07 cascade;   -- 如果要删除的数据库不是空库, 则后续加上cascade关键字, 会连同数据库, 表一起删除.

HiveSQL DDL--hive表映射HDFS文件

目的: 给大家演示下Hive的本质, 就是我们手动创建Hive表, 然后把源文件上传到该Hive表所在的HDFS路径下, 文件内容会被直接映射到Hive表中.

然后就能写HiveSQL分析表数据(即: HDFS文件数据了)

1.基于HDFS文件格式, 创建Hive表.

create table if not exists stu(
    id int,         -- 学生id
    name string,    -- 学生姓名
    gender string,  -- 性别
    age int,        -- 年龄
    major string    -- 专业
) row format delimited fields terminated by ',';    -- 指定行格式分隔符为逗号, 即: 按照逗号切割文件内容, 获取每列数据.

2.把源文件上传到该Hive表的HDFS路径下.

3. 查看表数据.

select * from stu;

4.可以写HiveSQL对HDFS文件内容做分析了.

select count(1) from stu where age >= 20;

HQL DDL语句--完整建表格式

/*
HQL DDL语句-之 (完整的)建表语法:
    create [external] table 表名(
        列名1 数据类型 comment '字段的描述信息',
        列名2 数据类型 comment '字段的描述信息',
        列名3 数据类型 comment '字段的描述信息',
        ......
    ) comment '表的描述信息'
    分区          partitioned by(分区字段1 数据类型 comment '字段的描述信息', 分区字段2...)
    分桶          clustered by(分桶字段1, 分桶字段2...) sorted by (排序字段1 asc | desc, 排序字段2...) into 桶的个数 buckets
    行格式切割符   row format delimited | SerDe '指定其它的SerDe类, 即: 不同的切割方式'
    存储方式      stored as TextFile | Orc      行存储或者列存储
    存储位置      location hdfs的文件路径
    表属性信息    tblproperties('属性名'='属性值') 例如: 内外部表, 创建者信息, 压缩协议...
    ;

HQL 常用的数据类型:
    原生类型:
        int         整数
        double      小数
        string      字符串
        timestamp   时间戳, 单位: 毫秒
        date        日期
    复杂类型;
        array   列表(序列)
        map     映射
        struct  结构体
        union   联合体
*/

内部表和外部表区别

/*
 内部表和外部表的区别:
    1. 建表格式不同.
        内部表: 直接创建即可, 默认就是内部表.
        外部表: 建表是需要加 external关键字.
    2. 权限不同, 是否会删除源文件.
        内部表: 也叫受管理表, 删除内部表时, 不仅会删除元数据(Hive中查不到了), 还会删除源文件(HDFS也查不到了)
        外部表: 只会删除元数据(Hive中查不到了), 不会删除源文件(HDFS中还在)

1. 创建数据库, 切库.

create database day06;
use day06;
show tables;

2. 创建内部表, 射手表.

create table t_archer_inner(
    id            int comment 'ID',
    name          string comment '英雄',
    hp_max        int comment '最大生命',
    mp_max        int comment '最大法力',
    attack_max    int comment '最高物攻',
    defense_max   int comment '最大物防',
    attack_range  string comment '攻击范围',
    role_main     string comment '主要定位',
    role_assist   string comment '次要定位'
) comment '射手表'
row format delimited fields terminated by '\t';

3. 上传源文件到上述Hive表的HDFS路径下, 然后就可以查看表数据了.

select * from t_archer_inner;

4. 查看表的字段(发现中文注释乱码)

desc t_archer_inner;        -- 查看表结构(简单信息, 列名, 数据类型, 描述信息)
desc formatted t_archer_inner;  -- 查看表的详细信息, 其中: Table type表示表的类型: MANAGEN_TABLE(受管理表, 内部表)  EXTERNAL_TABLE(外部表)

5. 删除内部表

drop table t_archer_inner;      -- 不仅会删除元数据, 还会删除源文件.

6. 创建外部表

create external table t_archer_outer(
    id            int comment 'ID',
    name          string comment '英雄',
    hp_max        int comment '最大生命',
    mp_max        int comment '最大法力',
    attack_max    int comment '最高物攻',
    defense_max   int comment '最大物防',
    attack_range  string comment '攻击范围',
    role_main     string comment '主要定位',
    role_assist   string comment '次要定位'
) comment '射手表'
row format delimited fields terminated by '\t';

7. 上传源文件到上述Hive表的HDFS路径下, 然后就可以查看表数据了.

select * from t_archer_outer;

8. 查看外部表的详细信息

desc formatted t_archer_outer;

9. 删除外部表

drop table t_archer_outer;      -- 只会删除元数据, 不会删除源文件.

解决hive中文注释乱码问题

具体步骤如下:
1. 去MySQL的hive3数据库中, 修改Hive的码表信息. 因为Hive的元数据(表名, 列名, 数据类型, 描述信息等)都是在MySQL中存储的.
即: 执行如下的代码

        -- （1）修改表字段注解和表注解
        use hive3;
        alter table COLUMNS_V2 modify column COMMENT varchar(256) character set utf8;
        alter table TABLE_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;

        -- （2）修改分区字段注解
        alter table PARTITION_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8 ;
        alter table PARTITION_KEYS modify column PKEY_COMMENT varchar(4000) character set utf8;

        -- （3）修改索引注解
        alter table INDEX_PARAMS modify column PARAM_VALUE varchar(4000) character set utf8;

2. 去Linux中(即: node1机器), 修改hive软件的配置信息, 配置文件是: /export/server/hive/conf/hive-site.xml 文件.

把如下的内容, 添加到 <configuration> </configuration>标签中即可.

<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://node1:3306/hive3?createDatabaseIfNotExist=true&useSSL=false&useUnicode=true&characterEncoding=UTF-8</value>
</property>

3. 在node1机器中, 关闭metastore服务, hiveserver2服务, 然后重启这两个服务.

kill -9 pid值 pid值
nohup hive --service metastore &
nohup hive --service hiveserver2 &

4. 删除刚才创建的hive表, 然后重新创建即可, 此时: 注释乱码问题已经解决了.

HQL DDL--备份表和删除表数据

备份表

-- 1. 备份表, 方式1: 只复制表结构.
drop table t1;
create table t1 like t_archer_inner;        -- 底层不会转成MR任务, 而是直接执行.
-- 1.1 查看表结构
desc t1;                  -- 查看表结构信息, 简单信息
desc formatted t1;        -- 查看表结构信息, 详细信息
show create table t1;     -- 查看表结构信息, 建表信息

-- 1.2 查看备份表数据
select * from t1;       -- 没有数据

-- 2. 备份表, 方式2: 复制表结构及表数据
drop table t2;
create table t2 as select * from t_archer_inner limit 3;        -- 会转MR

-- 2.1 查看表结构
desc t2;                  -- 查看表结构信息, 简单信息
desc formatted t2;        -- 查看表结构信息, 详细信息
show create table t2;     -- 查看表结构信息, 建表信息

-- 2.2 查看表数据.
select * from t2;

删除表

-- 1. truncate table 删除表数据, 只针对内部表有效, 针对于外部表无效.
desc formatted t1; -- MANAGED_TABLE, 内部表.
-- 1.1 删除内部表数据.
select * from t1;

-- 1.2 尝试删除外部表数据, 发现: 报错.
select * from t_archer_outer;
desc formatted t_archer_outer;  -- EXTERNAL_TABLE, 外部表.
truncate table t_archer_outer;  -- 报错, truncate table只针对于内部表有效, 针对于外部表无效.

HQL DDL--查看及修改表信息

-- 1. 查看表信息.
desc t_archer_inner;                -- 查看简单信息, 列名, 数据类型, 描述信息
desc formatted  t_archer_inner;     -- 查看详细信息.
show create table t_archer_inner;   -- 查看建表信息
-- 2. 修改表名, 格式: alter table 旧表名 rename to 新表名
alter table t_archer_inner rename to t_archer_i;
-- alter table t_archer_i rename t_archer_inner;   -- 不能省略to, 报错.
-- 3. 修改表的存储路径, 看看就行了, 不建议做, 建议: 统一存储, 统一管理.
select * from t_archer_i;       -- 如果是修改hive表在HDFS的存储路径了, 则hive表数据可能会丢失.
show create table t_archer_i;
-- 格式: alter table 表名 set location 'hdfs文件的路径'
alter table t_archer_i set location '/aa';
-- 4. 修改表属性信息, 内外部表切换.
desc formatted t_archer_i;
alter table t_archer_i set tblproperties('EXTERNAL'='true');    -- 设置为: 外部表, EXTERNAL必须大写, 后边的true或者false大小写均可(建议大写)
alter table t_archer_i set tblproperties('EXTERNAL'='FALSE');   -- 设置为: 内部表, EXTERNAL必须大写, 后边的true或者false大小写均可(建议大写)
-- 5. 修改列的相关操作.
-- 5.0 查看表结构(列名, 数据类型)
desc t_archer_i;
select * from t_archer_i;
-- 5.1 给表新增列.
alter table t_archer_i add columns (kongfu string comment '功夫');
-- 5.2 单独修改指定的列.  注意: string不能直接转成int类型, 反之可以.
-- 格式: alter table 表名 change 旧列名 新列名 新数据类型;
alter table t_archer_i change attack_max a_m string;
alter table t_archer_i change kongfu kf string;     -- 修改kongfu列名为 kf
alter table t_archer_i change kongfu kf int;        -- 报错, string不能直接转int
-- 5.3 修改表中给所有的列.
-- 格式: alter table 表名 replace columns(列1 数据类型, 列2 数据类型)
alter table t_archer_i replace columns (new_id int, new_name string);   -- 用新的2列, 替换之前所有的列(10列)

HQL DQL语句--建表-默认切割符

在创建Hive表的时候, 我们可以直接指定行格式切割符, 即: 按照什么规则来切割HDFS源文件的每行数据.
如果我们没有指定行格式切割符, 则hive表会用默认的行格式切割符, 即: '\001', 它是1个特殊的字符.
在Linu文件x系统中, 显示为: ^A 快捷键: ctrl + v, ctrl + a
在windows文件系统中, 显示为: SOH
在HDFS文件系统中, 显示为: 口

-- 场景1: 建表, 手动插入数据.
-- 1. 建表.
create table test(
    id int comment '编号',
    name string comment '姓名'
) ;     -- 如果没有指定行格式分隔符, 其实相当于是: row format delimited fields terminated by '\001'

-- 2. 手动往表中添加数据.
insert into test values(1, '乔峰');

-- 3.查看表数据.
-- show create table test;
select * from test;

-- 场景2: 建表, 用于映射指定的文本数据.
drop table team_ace_player;
-- 1. 建表.
create table team_ace_player(
    id int comment '战队编号',
    team_name string comment '战队名',
    ace_name string comment '明星玩家名'
);   -- 默认的切割符是'\001', 不写, 其实相当于写了 row format delimited fields terminated by '\001';

-- 2. 上传源文件.
-- 3. 查看表结果.
select * from team_ace_player;

HQL DDL语句--快速映射表

Hive的本质就是: 把HDFS文件映射成Hive表, 然后就可以写HQL来操作它了, 底层会被解析成MR任务, 交由Yarn调度执行, 所需的数据源及执行结果会保存到HDFS上...

-- 1. 创建表.
create table products(
    id int comment '商品id',
    name string comment '商品名',
    price int comment '商品价格',
    cid string comment '分类id'
) comment '商品表'
row format delimited fields terminated by ',';
-- 2. 上传源文件到Hive表的HDFS路径下.
-- 3. 查看表结果.
select * from products;

HQL DML语句--数据导入-load data方式

HQL DML语句详解:
1. 你要分清楚数据导入和数据导出分别指的是什么.
数据导入:
Linux, Windows => Hive表中
数据导出:
Hive表 => windows, Linux
2. 数据导入相关语法如下:
方式1: load data 方式
方式2: insert into方式
3. 数据导出相关语法如下:
insert overwrite

数据导入之 load data语法详解:
格式:
load data [local] inpath '源文件路径' [overwrite] into table 表名 [partition by(分区字段1, 分区字段2...)];
格式详解:
load data 固定格式, 表示: 数据导入
local 如果不写, 表示从HDFS路径导入到Hive表, 如果写了, 代表从Linux路径导入到Hive表.
Linux系统是本地文件系统, 文件路径前缀为: file:/// 而HDFS文件路径的前缀为: hdfs://node1:8020/
inpath 后边跟的是具体的(源)文件路径
overwrite 如果写了, 就是覆盖写入, 不过不写, 就是追加写入.
into table 表示具体导入数据到哪个表中.
partition by 表示具体的分区, 即: 把数据导入到哪个文件夹中. 回顾: 分区 = 分文件夹
如果是从Linux文件系统, 导入数据到Hive表, 是

从Linux系统中拷贝一份上传到HDFS中的.
如果是从HDFS文件系统, 导入数据到Hive表, 是从HDFS系统中剪切该文件到该Hive表的HDFS路径下.
load data导入的时候, Linux是拷贝, HDFS是剪切.

-- 0. 切库
use day06;
-- 1. 创建王者英雄表, 存储所有英雄的信息.
create table t_all_hero(
    id            int comment 'ID',
    name          string comment '英雄',
    hp_max        int comment '最大生命',
    mp_max        int comment '最大法力',
    attack_max    int comment '最高物攻',
    defense_max   int comment '最大物防',
    attack_range  string comment '攻击范围',
    role_main     string comment '主要定位',
    role_assist   string comment '次要定位'
) comment '王者荣耀英雄表'
row format delimited fields terminated by '\t';

-- 2. load方式加载数据到上述的表中.
-- 方式1: 从Linux路径 -> 导入到Hive表中.
-- 加载 射手数据, Linux路径, 文件前缀是: file:/// 还可以省略不写.
load data local inpath 'file:///export/hivedata/archer.txt' into table t_all_hero;      -- 底层不转MR, 相当于直接把源文件上传到HDFS中.
-- 加载 刺客数据
load data local inpath '/export/hivedata/assassin.txt' into table t_all_hero;
-- 加载 战士数据, 写了overwrite, 表示: 覆盖写入.
load data local inpath '/export/hivedata/warrior.txt' overwrite into table t_all_hero;

-- 方式2: 从HDFS路径 -> 导入到Hive表中.
-- 加载 法师数据到表中.   overwrite: 不写就是追加, 写了就是覆盖.   local不写就是HDFS路径, 写了就是Linux路径.
-- 细节: hdfs://node1:8020 前缀可以省略不写
load data inpath 'hdfs://node1:8020/hivedata/mage.txt' into table t_all_hero;
-- 细节: hdfs://node1:8020 前缀可以省略不写,
load data inpath '/hivedata/tank.txt' overwrite into table t_all_hero;

-- 3. 查看表数据.
select * from t_all_hero;

HQL DML语句--数据导入-insert方式

格式:
insert into | overwrite table 表名 [partition by(分区字段1, 分区字段2...)] select 语句;
格式解释:
1. into是追加, overwrite是覆盖.
2. into的时候, table关键字可以省略不写, overwrite的时候, table关键字必须写.
3. 上述的语句, 底层会转成MR来执行.


-- 1. 创建t_all_hero_tmp, 临时的王者荣耀英雄表.
create table t_all_hero_tmp(
    id            int comment 'ID',
    name          string comment '英雄',
    hp_max        int comment '最大生命',
    mp_max        int comment '最大法力',
    attack_max    int comment '最高物攻',
    defense_max   int comment '最大物防',
    attack_range  string comment '攻击范围',
    role_main     string comment '主要定位',
    role_assist   string comment '次要定位'
) comment '王者荣耀英雄表'
row format delimited fields terminated by '\t';

-- 2. 往上述的表中添加数据.
insert into table t_all_hero_tmp select * from t_all_hero limit 5;
insert into t_all_hero_tmp select * from t_all_hero limit 5; -- 细节: into的时候, table可以省略不写.

-- overwrite: 表示 覆盖写入
insert overwrite table t_all_hero_tmp select * from t_all_hero limit 3;     -- table不能省略

-- 3. 查看结果.
select * from t_all_hero_tmp;

HQL DML语句--导出数据

格式:
insert overwrite [local] directory '存储该文件的路径' [row format delimited fields terminated by '行格式分隔符']
select 语句;
1. 不写local就是HDFS路径, 写了就是Linux路径.
2. 行格式分隔符, 表示导出文件后, 内容之间的分隔符.
3. 导出的时候, 是覆盖导出的, 建议要导出到的目录, 内容为空, 否则啥都没有

-- 1. 查看数据.
select * from t_all_hero;       -- 10条数据

-- 2. 导出 t_all_hero表的数据到 hdfs的 /hivedata 目录下, 该目录下是有3个文件的.
insert overwrite directory '/hivedata' select * from t_all_hero;        -- 默认分隔符.

-- 3. 路径同上, 指定 分隔符
insert overwrite directory '/hivedata' row format delimited fields terminated by ':'
select * from t_all_hero;

-- 4. 导出 t_all_hero表的数据到 linux的 /hivedata 目录下, 该目录下是有3个文件的.
insert overwrite local directory '/export/hivedata' row format delimited fields terminated by '$'
select * from t_all_hero;

Hive的第1代客户端的作用

Hive的第1代客户端指的是: Linux路径下 /export/server/hive/bin/hive 这个脚本.
它的作用有 4 个:
1. 可以启动元数据服务和 hiveserver2服务.
nohup hive --service metastore &
nohup hive --service hiveserver2 &
2. 可以充当客户端使用, 编写并执行HQL代码.
3. 可以临时执行一次 1个 HQL语句.
hive -e 'Hive SQL语句' -- execute: 执行的意思.
4. 可以临时执行一次 1个 HQL脚本.
hive -f hive脚本 -- 建议后缀名为.hql file: 文件

1. hive -e的方式导出数据, 注意: 如下的命令要在Linux中执行.
hive -e 'select * from day06.t_all_hero;' > 1.txt 把HQL执行结果覆盖写入到 1.txt文件中, 注意: 在Linux中执行.

2. hive -f的方式导出数据. 注意: 脚本后缀名最好是 hql
hive -f my.xyz >> 1.txt 把my.xyz脚本的执行结果, 追加到1.txt文件中.
hive -f my.hql >> 1.txt hive sql脚本文件的后缀名最好为: hql

HQL DDL语句--分区表

-- 1. 建表.
create table t_all_hero(
    id            int comment 'ID',
    name          string comment '英雄',
    hp_max        int comment '最大生命',
    mp_max        int comment '最大法力',
    attack_max    int comment '最高物攻',
    defense_max   int comment '最大物防',
    attack_range  string comment '攻击范围',
    role_main     string comment '主要定位',
    role_assist   string comment '次要定位'
) comment '射手表'
row format delimited fields terminated by '\t';
-- 2. 上传源文件到该hive表的HDFS路径下.
-- 3. 查看表数据.
select * from t_all_hero;

分区表-静态分区

-- 1. 创建分区表, 指定分区字段.
create table t_all_hero_part(
    id            int comment 'ID',
    name          string comment '英雄',
    hp_max        int comment '最大生命',
    mp_max        int comment '最大法力',
    attack_max    int comment '最高物攻',
    defense_max   int comment '最大物防',
    attack_range  string comment '攻击范围',
    role_main     string comment '主要定位',
    role_assist   string comment '次要定位'
) comment '射手表'
partitioned by (role string comment '角色字段-充当分区字段')  -- 分区字段必须是表中没有的字段.
row format delimited fields terminated by '\t';

-- 2. 如何往分区表中添加数据呢?
-- 方式1: 静态分区,手动指定分区字段和字段值.  
load data local inpath '/export/hivedata/archer.txt' into table t_all_hero_part partition(role='sheshou');
load data local inpath '/export/hivedata/assassin.txt' into table t_all_hero_part partition(role='cike');
load data local inpath '/export/hivedata/mage.txt' into table t_all_hero_part partition(role='fashi');
load data local inpath '/export/hivedata/support.txt' into table t_all_hero_part partition(role='fuzhu');
load data local inpath '/export/hivedata/tank.txt' into table t_all_hero_part partition(role='tanke');
load data local inpath '/export/hivedata/warrior.txt' into table t_all_hero_part partition(role='zhanshi');

-- 3. 查询分区表的数据.
select * from t_all_hero_part;

-- 4. 查询出所有的射手数据，如果你设置了分区表, 查询时, 建议带上分区字段.
select * from t_all_hero_part where role_main='archer';     -- 依旧进行了全表扫描.
select * from t_all_hero_part where role='sheshou';         -- 精准扫描某个分区(目录), 避免全表扫描.

分区表-动态分区

1. 在进行动态分区的时候, 建议: 手动关闭严格模式.
2. 分区的严格模式要求: 在进行动态分区的时候, 至少要有1个静态分区. 如果都是动态分区, 则: 报错.
静态分区: partition(role='sheshou') partition(分区字段='值')
动态分区: partition(role) partition(分区字段)
3. 动态分区不支持load方式加载数据, 采用 insert into | overwrite方式导入数据.
4. set 参数名=值; 是在设置参数值.
set 参数名; 是在是在获取(查看)该参数的值.

-- 1. 创建分区表.
create table t_all_hero_part_dynamic(
    id            int comment 'ID',
    name          string comment '英雄',
    hp_max        int comment '最大生命',
    mp_max        int comment '最大法力',
    attack_max    int comment '最高物攻',
    defense_max   int comment '最大物防',
    attack_range  string comment '攻击范围',
    role_main     string comment '主要定位',
    role_assist   string comment '次要定位'
) comment '射手表'
partitioned by (role string comment '角色字段-充当分区字段')  -- 核心细节: 分区字段必须是表中没有的字段.
row format delimited fields terminated by '\t';

-- 2. 关闭严格模式.
set hive.exec.dynamic.partition.mode=nonstrict;     -- nonstrict 非严格模式, strict: 严格模式(默认)

-- 3. 动态分区的方式, 添加数据.
insert into table t_all_hero_part_dynamic partition(role)
select *, role_main from t_all_hero;

-- 4. 查询分区表的数据.
select * from t_all_hero_part_dynamic;

-- 5.  查询出所有的射手数据，如果你设置了分区表, 查询时, 建议带上分区字段.
select * from t_all_hero_part_dynamic where role_main='archer';     -- 依旧进行了全表扫描.
select * from t_all_hero_part_dynamic where role='sheshou';         -- 精准扫描某个分区(目录), 避免全表扫描.

分区表--多级分区

多级分区一般用时间来分区, 可以是: 年, 月, 日...

多级分区的时候, 分区层级不建议超过 3级, 一般是: 年, 月2级就够了.

-- 1. 创建商品表, 按照: 年, 月分区.
create table products(
    pid int,
    pname string,
    price int,
    cid string
) comment '商品表'
partitioned by (year int, month int)        -- 按照年, 月分区, 2级分区
row format delimited fields terminated by ',';

-- 2. 查看所有的分区信息.
show partitions products;

-- 3. 手动添加分区.
-- 3.1 添加1个分区.
alter table products add partition(year=2023, month=1);
-- 3.2 添加多个分区.
alter table products add partition(year=2023, month=4) partition(year=2023, month=5) partition(year=2023, month=11);
alter table products add partition(year=2024, month=1) partition(year=2024, month=5) partition(year=2024, month=10);

-- 4. 修改分区.  
alter table products partition(year=2024, month=10) rename to partition(year=2024, month=08);

-- 5. 删除分区.
alter table products drop partition(year=2023, month=4);    -- 删除2023年4月 这个分区
alter table products drop partition(month=1);               -- 删除所有的1月 这个分区
alter table products drop partition(year=2023);             -- 删除2023年 及其所有的子分区.

-- 6. 查看商品表数据.
select * from products;

-- 7. 精准查看某个分区的数据.
select * from products where year=2023;             -- 查找2023年 分区内, 所有的数据.
select * from products where year=2023 and month=1; -- 查找2023年, 1月 分区内, 所有的数据.

分桶表

概述:
分桶 = 分文件, 相当于把数据根据分桶字段, 拆分成N个文件.
作用:
1. 方便进行数据采样.
2. 减少join的次数, 提高查询效率.

1. 分桶字段必须是表中已有的字段.
2. 分桶数量 = HDFS文件系统中, 最终的文件数量.
3. 分桶规则用的是: 哈希取模分桶法, 简单来说, 就是根据分桶字段计算它的哈希值, 然后和桶的个数取余, 余数为几, 就进哪个桶.
哈希值: 程序根据值的内容, 内存地址值等信息, 计算出来的1个数字.
例如:
select hash('乔峰'); 哈希值为: -870432061
select hash('乔峰') % 3; 取余结果为: -1 3是假设一共有3个桶.
select abs(-10); 计算绝对值的: 10
select abs(hash('乔峰')) % 3; 取余结果为: 1
4. set mapreduce.job.reduces=n; 可以设置ReduceTask任务的数量, 这个设置只在分桶查询中会用到.
分桶建表的时候, 不用该参数.
5. 分桶表的数据不建议load data方式或者手动上传, 而是: insert into | overwrite的方式添加.

-- 1. 创建普通的学生表(充当数据源表), 上传源文件, 然后查看数据.
create table student(
    sid int,
    name string,
    gender string,
    age int,
    major string
) comment '学生信息表'
row format delimited fields terminated by ',';

select * from student;

-- 2. 创建分桶表-学生表, 按照学生id进行分桶, 分成 3 个桶.
create table student_buckets(
    sid int,
    name string,
    gender string,
    age int,
    major string
) comment '学生信息表'
clustered by (sid) into 3 buckets       -- 按照学生id进行分桶, 分成 3 个桶.  即: 根据sid的哈希值 和 3取余, 余数为几, 就进哪个桶.
row format delimited fields terminated by ',';

-- 3. 往 分桶表中插入数据.
insert into table student_buckets select * from student;

-- 4. 查看分桶表结果.
select * from student_buckets;

分桶表--分桶+排序

-- 1. 创建分桶表-学生表, 按照学生id进行分桶, 分成 3 个桶, 桶内部按照 age 降序排列.
drop table student_buckets_sort;
create table student_buckets_sort(
    sid int,
    name string,
    gender string,
    age int,
    major string
) comment '学生信息表'
-- clustered by (sid) sorted by (sid) into 3 buckets      -- 按照学生id进行分桶, 分成 3 个桶, 桶内部按照 sid 升序排列.
clustered by (sid) sorted by (age desc) into 3 buckets    -- 按照学生id进行分桶, 分成 3 个桶, 桶内部按照 age 降序排列.
row format delimited fields terminated by ',';

-- 2.往上述的分桶排序表中, 添加数据.
insert into student_buckets_sort select * from student;

-- 3. 查询结果.
select * from student_buckets_sort;

分桶表--分桶规则

分桶规则: 分桶字段的哈希值(绝对值形式) % 桶的个数, 余数是几, 就进哪个分桶.

select md5('pwd111');       -- 130353326a7bfab601f57757033b5b4a
select sha1('pwd111');      -- 5434831585d4109e2fba9b12a2ca26f8e1734c2e
select sha2('pwd111', 256);      -- 07c9f005ea0fb70fc48531e25a3a13e434e721dbc858c0b799568c3cafb2534c
select sha2('pwd111', 512);      -- 56c6ac5aec911eda1ff78fde625d26a49b1a43a5770dd5059e84e051ffda3e0a66e8b05dd724bd7e57631e395ddccb9a2dd7eb6ff8bfd8e17c765c2316df8365
select crc32('pwd111');          -- 4152129734

-- 查看函数的说明文档.
describe function extended sha2;

-- 获取哈希值.
select hash(123);       -- 整数的哈希值, 是本身.
select hash('乔峰');     -- 哈希值: -870432061
select hash('乔峰') % 3; -- -1
select abs(-10);        -- 计算绝对值的, 10

-- 分桶规则如下
select abs(hash('乔峰')) % 3;     -- 1

复杂类型--array

-- 数据格式为: "zhangsan	beijing,shanghai,tianjin,hangzhou"
-- 1. 建表.
create table t_array(
    name string comment '姓名',
    city array<string> comment '出差城市'   -- array类似于Python的列表, 就是容器类型.  <string>意思是: 泛型, array中只能存储字符串类型数据.
)
row format delimited fields terminated by '\t'  -- 切割后, 数据格式为: "zhangsan",  "beijing,shanghai,tianjin,hangzhou"
collection items terminated by ',';             -- 切割后, 数据格式为: "zhangsan",  ["beijing", "shanghai" ,"tianjin", "hangzhou"]

-- 2. 上传源文件, Hive表自动解析.

-- 3. 查看表结果.
select * from t_array;

-- 4. 完成如下的需求.
-- 查询所有数据
select * from t_array;

-- 查询city数组中第一个元素
select name, city[0] from t_array;     -- 字段名[索引], 索引从刚开始计数.

-- 查询city数组中元素的个数
select name, city, size(city) from t_array;     -- size(复杂类型) 可以查看该复杂类型的元素个数.

-- 因为 array_contains()函数返回结果就是True 或者False, 所以可以直接写.
select name, city from t_array where array_contains(city, 'tianjin');       -- 要city列 包含tianjin的
select name, city from t_array where not array_contains(city, 'tianjin');   -- 要city列 不包含tianjin的

复杂类型--struct

结构体类型较之于数组类型, 区别是: 结构体类型可以设置子列的类型和名称.
对比为: array<string>, struct<name:string, age:int>
相同点是: 都可以存储多个元素.

-- 1. 建表.       数据源格式为: "1#周杰轮:11"
create table t_struct(
    id int,
    info struct<name:string, age:int>
)
row format delimited fields terminated by '#'  -- 切割后: "1", "周杰轮:11"
collection items terminated by ':';     -- collection items 是负责切割: 数组, 结构体的

-- 2. 上传源文件.

-- 3. 查看表数据.
select * from t_struct;

-- 4. 细节: 列名.子列名 可以获取子列的信息.
select id, info.name, info.age from t_struct;

复杂类型--map

map映射类型: 类似于Python的字典, Java中的Map集合, 存储的是键值对数据, 左边的叫: 键(key), 右边的叫: 值(value)

-- 数据格式为:  "1,林杰均,father:林大明#mother:小甜甜#brother:小甜,28"
-- 1. 建表.
create table t_map(
    id int comment '编号',
    name string comment '姓名',
    members map<string, string> comment '家庭成员',     -- 左边的string: 键的类型,  右边的string: 值的类型.
    age int comment '年龄'
)
row format delimited fields terminated by ','  -- 切完后, 数据为: 1, "林杰均", "father:林大明#mother:小甜甜#brother:小甜", 28
collection items terminated by '#'  -- 切完后, 数据为: 1, "林杰均", ["father:林大明", "mother:小甜甜", "brother:小甜"], 28
map keys terminated by  ':';  -- 切完后, 数据为: 1, "林杰均", {"father" : "林大明", "mother" : "小甜甜", "brother" : "小甜"}, 28
-- 2. 上传源文件.

-- 3. 查看表数据.
select * from t_map;


-- 4. 完成如下的需求.
-- 4.1 查询全部
select * from t_map;
-- 4.2 查询father、mother这两个map的key
select id, name, age, members['father'], members['mother'] from t_map;
-- 4.3 查询全部map的key，使用map_keys函数，结果是array类型
select *, map_keys(members) from t_map;
-- 4.4 查询全部map的value，使用map_values函数，结果是array类型
select *, map_values(members) member_value from t_map;      -- 给列起别名, as可以省略不写.
select *, map_values(members) `值` from t_map;              -- 如果列名和关键字重名了, 或者有中文, 加上: 反引号``即可.
-- 4.5 查询map类型的KV对数量
select *, size(members) from t_map;

-- 4.6 查询map的key中有brother的数据
-- step1: 获取所有的键.
select *, map_keys(members) from t_map;
-- step2: 判断是否包含 brother 这个"键"
select *, array_contains(map_keys(members), 'brother') from t_map;
-- step3: 筛选出指定的数据即可.
select * from t_map where array_contains(map_keys(members), 'brother');

HQL DQL语句--基本查询

MySQL和Hive单表查询语法对比

MySQL中, 完整的单表查询, 语法结构如下:
select
[distinct] 列1, 列2...
from 表名
where 组前筛选
group by 分组字段
having 组后筛选
order by 排序字段 asc | desc
limit 起始索引, 数据条数;

Hive中, 完整的单表查询, 语法结构如下:
[CTE表达式]
select
[distinct | all] 列1, 列2...
from 表名
where 组前筛选
group by 分组字段
having 组后筛选
order by 排序字段 asc | desc
cluster by 分桶排序字段 | distribute by 分桶字段 sort by 排序字段 asc | desc
limit 起始索引, 数据条数;

对比上述的 MySQL 和 Hive的单表语法, 会发现, 有如下3点不同:
1. Hive支持CTE表达式的写法, 可以临时的存储某些语句的执行结果, 方便查询.
2. hive中除了distinct关键字之外, 还支持all关键字写法.
3. hive支持分桶查询, 即: cluster by 分桶排序字段
4. 如果只分桶就写 distribute by 分桶字段, 如果还要排序就写 sort by 排序字段 asc | desc
如果分桶和排序字段是同一个字段, 则可以直接写 cluster by, 即: cluster by = distribute by + sort by

HQL DQL--基本查询

-- 2. 建表.
CREATE TABLE orders (
    orderId bigint COMMENT '订单id',
    orderNo string COMMENT '订单编号',
    shopId bigint COMMENT '门店id',
    userId bigint COMMENT '用户id',
    orderStatus tinyint COMMENT '订单状态 -3:用户拒收 -2:未付款的订单 -1：用户取消 0:待发 1:配送中 2:用户确认收货',
    goodsMoney double COMMENT '商品金额',
    deliverMoney double COMMENT '运费',
    totalMoney double COMMENT '订单金额（包括运费）',
    realTotalMoney double COMMENT '实际订单金额（折扣后金额）',
    payType tinyint COMMENT '支付方式,0:未知;1:支付宝，2：微信;3、现金；4、其他',
    isPay tinyint COMMENT '是否支付 0:未支付 1:已支付',
    userName string COMMENT '收件人姓名',
    userAddress string COMMENT '收件人地址',
    userPhone string COMMENT '收件人电话',
    createTime timestamp COMMENT '下单时间',
    payTime timestamp COMMENT '支付时间',
    totalPayFee int COMMENT '总支付金额'
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- 3. 上传源文件, 或者 load data方式加载数据.
-- 4. 查看表数据.
select * from orders;   -- 1000条

-- 5. 完成如下的需求.
-- 5.1 查询所有
select * from orders;   -- 1000条
select * from orders limit 100;   -- 10条, 实际开发中写法
-- 5.2 查询单列
select userName, orderId, totalMoney from orders;
-- 5.3 查询数据量
select count(orderId) from orders;        -- 1000条
-- 5.4 过滤广东省订单
select * from orders where userAddress like '广东省%';
-- 5.5 找出广东省单笔营业额最大的订单
-- 思路1: 排序.
select * from orders where userAddress like '广东省%' order by realTotalMoney desc limit 1;
-- 思路2: 子查询.
select *
from
    orders
where
      userAddress like '广东省%'
  and realTotalMoney = (select max(realTotalMoney) from orders where userAddress like '广东省%');

-- 5.6 统计未支付、已支付各自的人数
-- 方式1: 分组统计.
select
    case
        when isPay = 0 then '未支付'
        when isPay = 1 then '已支付'
    end as isPay,
    case isPay
        when 0 then '未支付'
        when 1 then '已支付'
    end as isPay2,
    count(orderId)
from
    orders
group by
    isPay;

-- 5.7 在已付款订单中，统计每个用户最高的一笔消费金额
select userId, max(realTotalMoney) from orders where isPay = 1 group by userId;
-- 5.8 统计每个用户的平均订单消费额
select userId, userName, avg(realTotalMoney) from orders  group by userId, userName;
-- 升级需求: 保留两位小数.
select userId, userName, round(avg(realTotalMoney), 2) as avg_money from orders  group by userId, userName;
-- 5.9 统计每个用户的平均订单消费额，过滤大于10000的数据
select userId, userName, avg(realTotalMoney) as avg_money from orders  group by userId, userName having avg_money > 10000;
-- 四舍五入, 保留两位小数, 组后筛选, 写法如下.
select
    userId,
    userName,
    round(avg(realTotalMoney), 2) as avg_money
from
    orders
group by
    userId, userName
having
    round(avg(realTotalMoney), 2) > 10000;

HQL DQL语句--join连接查询

join连接查询是多表查询的方式, 它具体有 6种查询方式, 分别是: 交叉连接, 内连接, 左外连接, 右外连接, 满外连接(全外连接), 左半连接, 具体如下:

-- 1. 建表, 上传源文件.
--table1: 员工表
CREATE TABLE employee(
   id int,      -- 员工id
   name string,
   deg string,
   salary int,
   dept string
 ) row format delimited fields terminated by ',';

--table2:员工家庭住址信息表
CREATE TABLE employee_address (
    id int,     -- 员工id
    hno string,
    street string,
    city string
) row format delimited fields terminated by ',';

-- 2. 查看表数据.
select * from employee;
select * from employee_address;

-- 3. 演示各种连接查询.
-- 3.1 交叉连接查询 cross join, 查询结果是: 两张表的笛卡尔积, 即: 表A总条数 * 表B总条数, 会有大量的脏数据, 一般不用.
-- 写法.
select * from employee cross join employee_address;     -- 25条

-- 3.2 内连接查询, inner join, 其中inner可以省略不写, 查询结果为: 表的交集.
-- 显式内连接
select * from employee e1 inner join employee_address e2 on e1.id = e2.id;  -- 4条
select * from employee e1 join employee_address e2 on e1.id = e2.id;  -- 4条


-- 3.3 左外连接查询, left outer join, 其中outer可以省略不写, 查询结果为: 左表的全集 + 表的交集.
select * from employee e1 left outer join employee_address e2 on e1.id = e2.id; -- 5条
select * from employee e1 left join employee_address e2 on e1.id = e2.id; -- 5条

-- 3.4 右外连接查询, right outer join, 其中outer可以省略不写, 查询结果为: 右表的全集 + 表的交集.
select * from employee e1 right outer join employee_address e2 on e1.id = e2.id; -- 5条
select * from employee e1 right join employee_address e2 on e1.id = e2.id; -- 5条

-- 3.5 满外连接(全外连接)查询, full outer join, 其中outer可以省略不写, 查询结果为: 左表全集 + 右表全集 + 表的交集，满外连接 = 左外连接 + 右外连接 查询结果.
select * from employee e1 full outer join employee_address e2 on e1.id = e2.id; -- 6条
select * from employee e1 full join employee_address e2 on e1.id = e2.id; -- 6条

-- 3.6 左半连接, left semi join, , 查询结果为: 表的交集，左半连接 相当于 内连接的查询结果, 只要左表部分.
select * from employee e1 left semi join employee_address e2 on e1.id = e2.id; -- 4条, 只有左表的数据.

HQL DQL语句--分桶查询

分桶查询介绍:
概述:
分桶查询就是根据分桶字段, 把表数据分成n份, 但是: 是逻辑划分, 不是物理划分.
逻辑划分: 类似于分组, 就是根据分桶字段值进行分组, HDFS上文件还是1个.
物理划分: 就是昨天的分桶建表, HDFS上存储数据的时候, 已经变成了N个文件.
格式:
cluster by 分桶排序字段 | distribute by 分桶字段 sort by 排序字段 asc | desc
注意:
1. 分桶查询是逻辑分桶, 把数据分成n组进行查询, 但是HDFS上还是1个文件.
分桶建表是物理分桶, 数据在HDFS上会被存储到N个文件几种.
2. 分桶查询需要用到 set mapreduce.job.reduces = n; 这个n就是ReduceTask任务的数量, 即: 分几个桶.
3. mapreduce.job.reduces 参数的值默认是 -1, 即: 程序会按照数据量, 任务量自动分配ReduceTask的个数, 一般是1个, 即: 1个桶.
4. distribute by 表示分桶, sort by表示对桶内的数据进行排序(局部排序), 如果分桶字段和排序字段是同1个, 则可以简写为: cluster by
即: cluster by = distribute by + sort by

-- 1. 建表, 上传源文件, 查看表数据.
create table student(
    sid int,
    name string,
    gender string,
    age int,
    major string
) comment '学生信息表'
row format delimited fields terminated by ',';
select * from student;      -- 22条

-- 2. 按照性别进行分桶, 分成2个桶...
-- 分桶查询是, 需要设置ReduceTask的任务数, 有几个ReduceTask, 就有几个桶.
set mapreduce.job.reduces = 2;  -- 默认是-1, 即: 程序会根据数据量, 任务量, 自动给出 ReduceTask任务数, 一般是: 1
select * from student distribute by gender;        -- 根据gender分桶, 且根据gender升序排列.

-- 3. 按照id分成3个桶, 注意: 不排序.
set mapreduce.job.reduces = 3;
select * from student distribute by sid;

-- 4. 上述的数据, 虽然分成了3个桶, 但是不方便查看数据, 因此, 加入: 排序.
-- 按照id分成3个桶, 然后按照年龄降序排列.
select * from student distribute by sid sort by age;    -- 默认是升序.
select * from student distribute by sid sort by age desc;    -- 降序

-- 5. 如果分桶字段和排序字段是同1个, 则可以简写为: cluster by
-- 根据sid分成3个桶, 且根据sid升序排列.
select * from student distribute by sid sort by sid;    -- 默认是升序.
select * from student cluster by sid;       -- 效果同上.
-- select * from student cluster by sid desc;       -- 报错, cluster by默认升序.

HQL DQL--随机抽样

随机抽样解释:
概述:
随机采样指的是 tablesample()函数, 通过它, 我们可以用类似于分桶的思路, 对数据进行采样.
格式:
select .. from 表名 tablesample(bucket x out of y on 列名 | rand());
格式解释:
1. y表示: 把数据分成y个桶. 逻辑分桶.
2. x表示: 从y个分桶中, 获取第x份数据.
3. 如果是列名方式, 列名采样, 在列名等其它条件不发生改变的情况下, 每次采样获取的数据都是一样的.
4. 如果是rand()函数, 随机采样, 则: 每次获取的数据都是不一样的.
5. x 不能大于 y

-- 1. 查看原表数据.
select * from student;      -- 22条, 12条男, 10条女

-- 2. 按照 性别分成2个桶, 取第1份数据.
select * from student tablesample ( bucket 1 out of 2 on gender);   -- 10条, 女
-- 3. 按照 性别分成2个桶, 取第2份数据.
select * from student tablesample ( bucket 2 out of 2 on gender);   -- 12条, 男

-- 4. 演示rand()函数
select rand();      -- 随机数, 生成0.0 ~ 1.0之间的数字, 包左不包右, 也叫: 前闭后开.  [0.0, 1.0)

-- 5. 随机采样, 分成3个桶, 取第1份数据.
select * from student tablesample ( bucket 1 out of 3 on rand());   -- 每次获取的数据条数都不一样.
select * from student tablesample ( bucket 4 out of 3 on rand());   -- 报错, x不能大于y

-- 也可以换成其它的表进行采样.
select * from orders tablesample ( bucket 1 out of 3 on rand());

HQL DQL--正则查询

正则查询介绍:
概述:
正则表达式不独属于任意的一种语言, 市场上绝大多数的语言都支持它, 例如: Java, Python, JavaScript, HiveSQL...
正则的规则是通用的, 但是正则表达式的校验格式(语法)稍有不同.
正则表达式: 正确的, 符合特定规则的字符串. (RegExp, 全称: regular expression)
常用的正则规则(通用):
. 代表任意的1个字符
\. 代表1个普通的. 没有任何的特殊含义.
[abc] 代表a,b,c中任意的1个字符.
[^abc] 代表除了a,b,c以外, 任意的1个字符
\d 代表任意的1个数字, 等价于 [0-9]
\w 代表1个单词字符, 即: 数字, 字母, 下划线, 等价于: [a-zA-Z0-9_]
\S 代表任意的1个非空字符
\\ 代表1个\
^ 代表开头
$ 代表结尾
? 数量词, 代表前边的内容出现0次或者1次.
* 数量词, 代表前边的内容出现0次或多次.
+ 数量词, 代表前边的内容出现1次或者多次.
a{n} 数量词, 代表a恰好出现n次, 多一次少一次都不行.
a{n,} 数量词, 代表a至少出现n次, 至多出现无数次.
a{n,m} 数量词, 代表a至少出现n次, 至多出现m次, 包括n和m
HiveSQL中, 正则校验格式如下:
字符串 rlike '正则表达式'

-- 1. 校验字符串是否以a开头.
select '1abc' rlike '^a.*';   -- false
select 'a' rlike '^a.*';   -- true

-- 2. 校验字符串以x结尾.
select 'abcx1' rlike '.*x$';    -- false
select 'abcx' rlike '.*x$';     -- true

-- 3. 校验字符串中是否包含连续的3个a
select '11aaaabc' rlike '^.*a{3}.*$';   -- true
select '11aa' rlike '^.*a{3}.*$';       -- false

-- 4. 校验字符串是否不以a,b,c任意的1个字母开头.
select 'a123' rlike '^[^abc].*';     -- false
select 'bcd' rlike '^[^abc].*';     -- false
select 'c123cd' rlike '^[^abc].*';     -- false
select 'd123' rlike '^[^abc].*';     -- true

-- 5. 校验手机号是否合法.
-- 规则: 1. 纯数字组成.  2.长度必须是11位.  3.第1位数字必须是1.  4.第2位数字可以是: 3-9
select '13112345678' rlike '^1[3-9][0-9]{9}$';
select '13112345678' rlike '^1[3-9]\\d{9}$';

-- 6. 完成如下的需求
-- 6.1 查找广东省的数据
select * from orders where userAddress like '广东省%';        -- 模糊查询
select * from orders where userAddress rlike '^广东省.*';     -- 正则查询

-- 6.2 查找用户地址是：xx省 xx市 xx区的数据
select * from orders where userAddress rlike '.*省.*市.*区';     -- 正则查询

-- 6.3 查找用户姓为张、王、邓
select * from orders where userName rlike '^[张王邓]\\S+';     -- \S: 代表任意的1个非空白字符
select * from orders where userName rlike '^[张王邓].+';

-- 6.4 查找手机号符合：188****0*** 规则
select * from orders where userPhone rlike '^188\\*{4}0\\d{3}$';

HQL DQL--union联合查询

union 联合查询介绍:
概述:
就是对表数据做纵向拼接的, 类似于书本的上, 下册, 有两种拼接方式.
分类:
union distinct -- distinct可以省略不写, 合并并去重.
union all -- 合并不去重.
细节:
1. union all 和 distinct 作用类似, 都是纵向合并, 区别就是: 去不去重的问题.
2. 直接写union 默认是 union distinct, 即: 去重合并.
3. 要进行union联合查询的数据, 列的个数, 对应的数据类型都要一致.
4. 如果条件写到上边的查询语句后, 则: 单独作用于该SQL.
如果条件写到下边的查询语句后, 则: 作用于全局.

-- 1. 查看数据源表.
use day07;      -- 为了方便演示.
show tables;

select * from day07.student;              -- 22条;
select * from day07.student_buckets;      -- 22条;

-- 2. union all演示, 合并 不去重.
select * from student
union all
select * from student_buckets;      -- 44条;

-- 3. union distinct演示, 合并,去重.
select * from student
union distinct
select * from student_buckets;      -- 22条;

-- 4. union distinct演示, 合并,去重.
select * from student
union   -- distinct 可以省略不写.
select * from student_buckets;      -- 22条;

-- 5. 如果条件写到上边的 查询语句后, 则: 单独作用于该SQL.
(select * from student limit 3)     -- 3条
union all
select * from student_buckets;      -- 25条;

-- 6. 如果条件写到下边的 查询语句后, 则: 作用于全局.
select * from student                       -- 22条
union all
select * from student_buckets limit 3;      -- 3条;

-- 6. 如果仅仅是单独作用于某1个SQL, 则可以用括号括起来.
select * from student                         -- 22条
union all
(select * from student_buckets limit 3);      -- 25条;

HQL DQL语句--虚拟列查询

虚拟列查询介绍:
概述:
虚拟列是Hive内置的可以在查询语句中使用的特殊标记，可以查询数据本身的详细参数。
大白话:
虚拟列是Hive内置的, 可以直接用的参数, 辅助我们进行查询的.
Hive目前可用3个虚拟列：
INPUT__FILE__NAME 显示数据行所在的具体文件类似于: 你在哪个班.
BLOCK__OFFSET__INSIDE__FILE 显示数据行所在文件的偏移量, 偏移量从0开始计数. 类似于: 你是班级的第几个人.
文件内容如下: 对应的行偏移量:
95001,李勇,男,20,CS 0
95002,刘晨,女,19,IS 23
95003,王敏,女,22,MA 46
ROW__OFFSET__INSIDE__BLOCK 显示数据所在HDFS块的偏移量, 即: 该行数据在HDFS文件的那个Block块中. 类似于: 你在班级的哪个组.
此虚拟列需要设置：SET hive.exec.rowoffset=true 才可使用
注意:
1. 中文在utf-8码表中占 3 个字节, 在gbk码表中占2个字节.
2. 数字, 字母, 特殊符号无论在什么码表, 都只占 1个字节.
3. HDFS文件和Linux文件类似, 默认的换行符都是: \n

-- 1. 查看数据源表.
use day08;
select * from student;

-- 2. 查看 day08#student 数据表映射的是哪个文件.
select *, INPUT__FILE__NAME from student;       -- hdfs://node1.itcast.cn:8020/user/hive/warehouse/day08.db/student/students.txt

-- 3. 查看 day08#student 数据表每行数据 在映射文件中的 起始索引(行偏移量)
select *, BLOCK__OFFSET__INSIDE__FILE from student;     -- 行偏移量从: 0开始.
-- 上述方式可以帮助我们筛选出中间的某些数据.
select *, BLOCK__OFFSET__INSIDE__FILE from student where BLOCK__OFFSET__INSIDE__FILE > 100; -- 获取行偏移量大于100的数据.

-- 4. 查看 day08#student 数据表每行数据 在HDFS的文件的哪个Block块中.  因为HDFS文件是分块存储的, 每个块的大小是128MB.
-- 核心细节: 设置HDFS的块偏移量, block块从 0 开始计数.
SET hive.exec.rowoffset=true;
select *, ROW__OFFSET__INSIDE__BLOCK from student;      -- 因为映射的students.txt文件较小, 达不到128MB, 所以该文件只有1个Block块.

-- 5. 汇总查看
select *,
       INPUT__FILE__NAME,               -- 该行数据在 那个HDFS文件中.             类似于: 你在哪个班.
       BLOCK__OFFSET__INSIDE__FILE,     -- 该行数据在 HDFS文件中的 行偏移量        类似于: 你是咱们班的第几个学生
       ROW__OFFSET__INSIDE__BLOCK       -- 该行数据在 HDFS文件的 哪个Block块.      类似于: 你在班级的哪个组.
from
    student;

HQL DQL语句--CTE表达式

CTE表达式介绍:
概述:
全称叫 common table expression, 公共表表达式, 就是用于临时存储某些结果数据的.
格式:
with cte表达式的名字 as (
select 语句, 即: 要被临时存储的内容
)
select ... from cte表达式的名字;
写法:
1. 标准写法.
2. from风格.
3. 链式编程.
4. 结合union all 或者 join查询.
5. 用数据表永久存储结果.
6. 用视图临时存储结果.

-- 1. cte表达式入门写法.
with t1 as (
    select * from student
)
select * from t1;

-- 2. from风格.
with t1 as (
    select * from student
)
from t1 select sid, name;

-- 3. 链式编程(链式风格)
with t1 as ( select * from student),            -- 22条, 全列
     t2 as ( select * from t1 limit 10),        -- 10条, 全列
     t3 as ( select sid, name, age from t2)     -- 10条, 3列
select * from t3;

-- 4. cte结合union语句一起使用, 联合查询.
-- 3. 链式编程(链式风格)
with t1 as ( select * from student),            -- 22条, 全列
     t2 as ( select * from t1 limit 10)         -- 10条, 全列
select * from t1
union all           -- 如果不写, 默认是 distinct
select * from t2;       -- 32条, 5列

-- 扩展, 演示下 full outer join 全外连接.
with t1 as ( select * from student),            -- 22条, 全列
     t2 as ( select * from t1 limit 10)         -- 10条, 全列
select * from t1
full join t2
on t1.sid = t2.sid;     -- full join: 满外连接, 横向拼接.  22条, 10列,  10行全列, 12行(5列值, 5列空)

-- 5. 我们可以用 数据表 把cte结果 永久存储.
show tables;

create table hg1 as
with t1 as (
    select * from student
)
select sid, name, age from t1 limit 10;

-- 查看备份结果.
select * from hg1;

-- 6. 我们可以用 视图 把cte结果 永久存储.
show tables;

create view hg2 as      -- 视图是存储在内存中的, 临时存储.
with t1 as (
    select * from student
)
select sid, name, age from t1 limit 5;

select * from hg2;  -- 从视图中查询数据.