hive里面的连接操作其实跟SQL还是差不多的...
数据准备:创建数据-->创建表-->导入数据
首先创建两个原始数据的文件,这两个文件分别有三列,第一列是id、第二列是名称、第三列是另外一个表的id。通过第二列可以明显的看到两个表做连接查询的结果:
[xingoo@localhost tmp]$ cat aa.txt
1 a 3
2 b 4
3 c 1
[xingoo@localhost tmp]$ cat bb.txt
1 xxx 2
2 yyy 3
3 zzz 5
接下来创建两个表,需要注意的是表的字段分隔符为空格,另一个表可以直接基于当前的表创建。
hive> create table aa
> (a string,b string,c string)
> row format delimited
> fields terminated by ' ';
OK
Time taken: 0.19 seconds
hive> create table bb like aa;
OK
Time taken: 0.188 seconds
查看两个表的结构:
hive> describe aa;
OK
a string
b string
c string
Time taken: 0.068 seconds, Fetched: 3 row(s)
hive> describe bb;
OK
a string
b string
c string
Time taken: 0.045 seconds, Fetched: 3 row(s)
下面可以基于本地的文件,导入数据
hive> load data local inpath '/usr/tmp/aa.txt' overwrite into table aa;
Loading data to table test.aa
OK
Time taken: 0.519 seconds
hive> load data local inpath '/usr/tmp/bb.txt' overwrite into table bb;
Loading data to table test.bb
OK
Time taken: 0.321 seconds
内连接
内连接即基于on语句,仅列出表1和表2符合连接条件的数据。
hive> select * from aa a join bb b on a.c=b.a;
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 1264 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
3 c 1 1 xxx 2
1 a 3 3 zzz 5
Time taken: 17.083 seconds, Fetched: 2 row(s)
左连接
左连接是显示左边的表的所有数据,如果有右边表与之对应,则显示;否则显示null
ive> select * from aa a left outer join bb b on a.c=b.a;
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 1282 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1 a 3 3 zzz 5
2 b 4 NULL NULL NULL
3 c 1 1 xxx 2
Time taken: 16.048 seconds, Fetched: 3 row(s)
右连接
类似左连接,同理。
hive> select * from aa a right outer join bb b on a.c=b.a;
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 1306 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
3 c 1 1 xxx 2
NULL NULL NULL 2 yyy 3
1 a 3 3 zzz 5
Time taken: 15.483 seconds, Fetched: 3 row(s)
全连接
相当于表1和表2的数据都显示,如果没有对应的数据,则显示Null.
hive> select * from aa a full outer join bb b on a.c=b.a;
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 4026 HDFS Write: 270 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
3 c 1 1 xxx 2
NULL NULL NULL 2 yyy 3
1 a 3 3 zzz 5
2 b 4 NULL NULL NULL
Time taken: 1.689 seconds, Fetched: 4 row(s)
左半开连接
这个比较特殊,SEMI-JOIN仅仅会显示表1的数据,即左边表的数据。但是效率会比左连接快,因为他会先拿到表1的数据,然后在表2中查找,只要查找到结果立马就返回数据。
hive> select * from aa a left semi join bb b on a.c=b.a;
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 1366 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1 a 3
3 c 1
Time taken: 16.811 seconds, Fetched: 2 row(s)
笛卡尔积
笛卡尔积会针对表1和表2的每条数据做连接...
hive> select * from aa join bb;
MapReduce Jobs Launched:
Stage-Stage-3: HDFS Read: 1390 HDFS Write: 90 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1 a 3 1 xxx 2
2 b 4 1 xxx 2
3 c 1 1 xxx 2
1 a 3 2 yyy 3
2 b 4 2 yyy 3
3 c 1 2 yyy 3
1 a 3 3 zzz 5
2 b 4 3 zzz 5
3 c 1 3 zzz 5