Join Operations on Dataset

This article takes a close look at the six overloads of Spark's join operator, comparing the different join types and demonstrating inner joins, left outer joins, and more through worked examples. It is aimed at Spark beginners and data engineers.

From the official docs

As of Spark 2.4, the join operator has six overloads:

  • Variant 1:
def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame

Join with another DataFrame, using the given join expression. 
The following performs a full outer join between df1 and df2.

// Scala:
import org.apache.spark.sql.functions._
df1.join(df2, $"df1Key" === $"df2Key", "outer")

// Java:
import static org.apache.spark.sql.functions.*;
df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");
right
Right side of the join.

joinExprs
Join expression.

joinType
Type of join to perform. Default inner. Must be one of: 
inner, cross, 
outer, full, full_outer, 
left, left_outer, 
right, right_outer, 
left_semi, left_anti.

Since
2.0.0
  • Variant 2:
def join(right: Dataset[_], joinExprs: Column): DataFrame
Inner join with another DataFrame, using the given join expression
(joinType is omitted here because it defaults to inner).

// The following two are equivalent:
df1.join(df2, $"df1Key" === $"df2Key")
df1.join(df2).where($"df1Key" === $"df2Key")
Since
2.0.0
  • Variant 3:

def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame

Equi-join with another DataFrame using the given columns. 
A cross join with a predicate is specified as an inner join. 
If you would explicitly like to perform a cross join use the crossJoin method.
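As the paragraph above notes, an explicit Cartesian product goes through the crossJoin method instead. A minimal sketch (assuming two existing DataFrames df1 and df2):

```scala
// Explicit Cartesian product: every row of df1 paired with every row of df2.
// No join predicate is involved.
df1.crossJoin(df2).show()
```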

Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

right
Right side of the join operation.

usingColumns
Names of the columns to join on. These columns must exist on both sides.
(i.e. the join columns must have the same name in both DataFrames)

joinType
Type of join to perform. Default inner. Must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.

Since
2.0.0

Note
If you perform a self-join using this function without aliasing the input DataFrames, 
you will NOT be able to reference any columns after the join, 
since there is no way to disambiguate which side of the join you would like to reference.
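To make the self-join caveat concrete, one common workaround is to alias both sides before joining. A sketch, where the emp DataFrame and its columns are hypothetical:

```scala
// Hypothetical self-join matching each employee to their manager.
// Aliasing the two sides ("e" and "m") makes the columns unambiguous afterwards.
val joined = emp.as("e")
  .join(emp.as("m"), $"e.manager_id" === $"m.emp_id", "left_outer")
  .select($"e.name".as("employee"), $"m.name".as("manager"))
```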
  • Variant 4:

def join(right: Dataset[_], usingColumns: Seq[String]): DataFrame
Inner equi-join with another DataFrame using the given columns.

Different from other join functions, the join columns will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

// Joining df1 and df2 using the columns "user_id" and "user_name"
df1.join(df2, Seq("user_id", "user_name"))
right
Right side of the join operation.

usingColumns
Names of the columns to join on. These columns must exist on both sides.
(i.e. the join columns must have the same name in both DataFrames)

Since
2.0.0

Note
If you perform a self-join using this function without aliasing the input DataFrames, 
you will NOT be able to reference any columns after the join, 
since there is no way to disambiguate which side of the join you would like to reference.
  • Variant 5:
def join(right: Dataset[_], usingColumn: String): DataFrame

Inner equi-join with another DataFrame using the given column.

Different from other join functions, the join column will only appear once in the output, i.e. similar to SQL's JOIN USING syntax.

// Joining df1 and df2 using the column "user_id"
df1.join(df2, "user_id")
right
Right side of the join operation.

usingColumn
Name of the column to join on. This column must exist on both sides.
(i.e. the join column must have the same name in both DataFrames)

Since
2.0.0

Note
If you perform a self-join using this function without aliasing the input DataFrames, 
you will NOT be able to reference any columns after the join, 
since there is no way to disambiguate which side of the join you would like to reference.
  • Variant 6:
def join(right: Dataset[_]): DataFrame
Join with another DataFrame.

Behaves as an INNER JOIN and requires a subsequent join predicate.

right
Right side of the join operation.

Since
2.0.0

Quick takeaways

  • In everyday aggregation work, inner and left_outer are by far the most common, full shows up occasionally, and the rest are rarely needed.
  • Check whether the two join columns have exactly the same name. If they do, call the overload that takes usingColumn (a single String) or usingColumns (a Seq(), typically used for two or more join columns, though passing a single column also works, in which case it behaves like usingColumn). If the names differ, you must write an expression (joinExprs).
  • How do you write the joinExprs: Column expression? The official examples already show it, but it can be confusing at first, so the cases below walk through it with simple examples.
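One case the takeaways above do not cover: when a join needs more than one condition, the joinExprs expression can combine several equalities with &&. A sketch, where the key and dt column names are made up:

```scala
// Join on two hypothetical columns at once; "key" and "dt" are placeholders.
df1.join(df2, df1("key") === df2("key") && df1("dt") === df2("dt"), "left_outer")
```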

Worked examples

  • Test data
    Contents of name.txt:
1,Jack
2,Rose
3,Lily
4,Lucy
7,Rivers

Contents of age.txt:

1,18
2,19
3,20
4,21
5,22
6,23
  • Test code
    The SparkSession entry point is standard boilerplate and needs no further discussion:
val spark = SparkSession
      .builder()
      .appName(this.getClass.getSimpleName)
      .master("local[*]")
      .getOrCreate()

First, test with the join column named id in both DataFrames:

import spark.implicits._
val df_name = spark.read.textFile("./data/name.txt")
      .map(_.split(","))
      .map(x => (x(0), x(1)))
      .toDF("id", "name")
      .cache()

val df_age = spark.read.textFile("./data/age.txt")
      .map(_.split(","))
      .map(x => (x(0), x(1)))
      .toDF("id", "age")
      .cache()

(1) Inner join

df_name.join(df_age, usingColumn = "id").show()
df_name.join(df_age, usingColumns = Seq("id")).show()
df_name.join(df_age, usingColumns = Seq("id"), joinType = "inner").show()

Result:

+---+----+---+
| id|name|age|
+---+----+---+
|  1|Jack| 18|
|  2|Rose| 19|
|  3|Lily| 20|
|  4|Lucy| 21|
+---+----+---+

(2) Left outer join

df_name.join(df_age, usingColumns = Seq("id"), joinType = "left_outer").show()

Result:

+---+------+----+
| id|  name| age|
+---+------+----+
|  1|  Jack|  18|
|  2|  Rose|  19|
|  3|  Lily|  20|
|  4|  Lucy|  21|
|  7|Rivers|null|
+---+------+----+

Now test with different join column names (nid and aid):

import spark.implicits._
val df_name = spark.read.textFile("./data/name.txt")
      .map(_.split(","))
      .map(x => (x(0), x(1)))
      .toDF("nid", "name")
      .cache()

val df_age = spark.read.textFile("./data/age.txt")
      .map(_.split(","))
      .map(x => (x(0), x(1)))
      .toDF("aid", "age")
      .cache()

(1) Inner join

df_name.join(df_age, joinExprs = $"nid" === $"aid").show()
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "inner").show()
df_name.join(df_age).where(condition = $"nid" === $"aid").show()

Result:

+---+----+---+---+
|nid|name|aid|age|
+---+----+---+---+
|  1|Jack|  1| 18|
|  2|Rose|  2| 19|
|  3|Lily|  3| 20|
|  4|Lucy|  4| 21|
+---+----+---+---+

(2) Left outer join

df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "left_outer").show()

Result:

+---+------+----+----+
|nid|  name| aid| age|
+---+------+----+----+
|  1|  Jack|   1|  18|
|  2|  Rose|   2|  19|
|  3|  Lily|   3|  20|
|  4|  Lucy|   4|  21|
|  7|Rivers|null|null|
+---+------+----+----+

There is one more style (referencing columns through the DataFrame itself), which works whether or not the join column names match.
(1) Both join columns named id

df_name.join(df_age, df_name("id") === df_age("id")).show()

Result:

+---+----+---+---+
| id|name| id|age|
+---+----+---+---+
|  1|Jack|  1| 18|
|  2|Rose|  2| 19|
|  3|Lily|  3| 20|
|  4|Lucy|  4| 21|
+---+----+---+---+

This particular form is rarely used, though: once the result is written out, there are two columns both named id, so selecting id becomes awkwardly ambiguous.
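If you do need this style with identically named columns, one way out (a sketch based on the DataFrames above) is to drop the right-hand copy immediately after the join:

```scala
// Keep only the left-hand "id" so the result has no duplicate column names.
df_name.join(df_age, df_name("id") === df_age("id"))
  .drop(df_age("id"))
  .show()
```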

(2) Join columns named nid and aid

df_name.join(df_age, df_name("nid") === df_age("aid")).show()

Result:

+---+----+---+---+
|nid|name|aid|age|
+---+----+---+---+
|  1|Jack|  1| 18|
|  2|Rose|  2| 19|
|  3|Lily|  3| 20|
|  4|Lucy|  4| 21|
+---+----+---+---+

By this point you should have a solid grasp of the various join usages.

Differences among the 11 JoinTypes

Strictly speaking this comparison is hardly necessary, since most of these JoinTypes rarely come up in practice, but here is a quick side-by-side:

println("--------inner-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "inner").show()
println("--------cross-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "cross").show()
println("--------outer-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "outer").show()
println("--------full-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "full").show()
println("--------full_outer-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "full_outer").show()
println("--------left-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "left").show()
println("--------left_outer-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "left_outer").show()
println("--------right-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "right").show()
println("--------right_outer-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "right_outer").show()
println("--------left_semi-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "left_semi").show()
println("--------left_anti-------")
df_name.join(df_age, joinExprs = $"nid" === $"aid", joinType = "left_anti").show()

Result:

--------inner-------
+---+----+---+---+
|nid|name|aid|age|
+---+----+---+---+
|  1|Jack|  1| 18|
|  2|Rose|  2| 19|
|  3|Lily|  3| 20|
|  4|Lucy|  4| 21|
+---+----+---+---+

--------cross-------
+---+----+---+---+
|nid|name|aid|age|
+---+----+---+---+
|  1|Jack|  1| 18|
|  2|Rose|  2| 19|
|  3|Lily|  3| 20|
|  4|Lucy|  4| 21|
+---+----+---+---+

--------outer-------
+----+------+----+----+
| nid|  name| aid| age|
+----+------+----+----+
|   7|Rivers|null|null|
|   3|  Lily|   3|  20|
|null|  null|   5|  22|
|null|  null|   6|  23|
|   1|  Jack|   1|  18|
|   4|  Lucy|   4|  21|
|   2|  Rose|   2|  19|
+----+------+----+----+

--------full-------
+----+------+----+----+
| nid|  name| aid| age|
+----+------+----+----+
|   7|Rivers|null|null|
|   3|  Lily|   3|  20|
|null|  null|   5|  22|
|null|  null|   6|  23|
|   1|  Jack|   1|  18|
|   4|  Lucy|   4|  21|
|   2|  Rose|   2|  19|
+----+------+----+----+

--------full_outer-------
+----+------+----+----+
| nid|  name| aid| age|
+----+------+----+----+
|   7|Rivers|null|null|
|   3|  Lily|   3|  20|
|null|  null|   5|  22|
|null|  null|   6|  23|
|   1|  Jack|   1|  18|
|   4|  Lucy|   4|  21|
|   2|  Rose|   2|  19|
+----+------+----+----+

--------left-------
+---+------+----+----+
|nid|  name| aid| age|
+---+------+----+----+
|  1|  Jack|   1|  18|
|  2|  Rose|   2|  19|
|  3|  Lily|   3|  20|
|  4|  Lucy|   4|  21|
|  7|Rivers|null|null|
+---+------+----+----+

--------left_outer-------
+---+------+----+----+
|nid|  name| aid| age|
+---+------+----+----+
|  1|  Jack|   1|  18|
|  2|  Rose|   2|  19|
|  3|  Lily|   3|  20|
|  4|  Lucy|   4|  21|
|  7|Rivers|null|null|
+---+------+----+----+

--------right-------
+----+----+---+---+
| nid|name|aid|age|
+----+----+---+---+
|   1|Jack|  1| 18|
|   2|Rose|  2| 19|
|   3|Lily|  3| 20|
|   4|Lucy|  4| 21|
|null|null|  5| 22|
|null|null|  6| 23|
+----+----+---+---+

--------right_outer-------
+----+----+---+---+
| nid|name|aid|age|
+----+----+---+---+
|   1|Jack|  1| 18|
|   2|Rose|  2| 19|
|   3|Lily|  3| 20|
|   4|Lucy|  4| 21|
|null|null|  5| 22|
|null|null|  6| 23|
+----+----+---+---+

--------left_semi-------
+---+----+
|nid|name|
+---+----+
|  1|Jack|
|  2|Rose|
|  3|Lily|
|  4|Lucy|
+---+----+

--------left_anti-------
+---+------+
|nid|  name|
+---+------+
|  7|Rivers|
+---+------+
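The last two results are worth a comment: left_semi and left_anti return only the left-hand columns. left_semi keeps the left rows that have at least one match on the right (like SQL's WHERE EXISTS), while left_anti keeps the rows that have none (like WHERE NOT EXISTS). Roughly equivalent SQL formulations, assuming the two DataFrames are registered as temp views named name and age:

```scala
df_name.createOrReplaceTempView("name")
df_age.createOrReplaceTempView("age")
// ≈ left_semi: left rows with at least one match on the right
spark.sql("SELECT * FROM name WHERE EXISTS (SELECT 1 FROM age WHERE age.aid = name.nid)").show()
// ≈ left_anti: left rows with no match on the right
spark.sql("SELECT * FROM name WHERE NOT EXISTS (SELECT 1 FROM age WHERE age.aid = name.nid)").show()
```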
