Spark SQL: NULL Semantics

Original article, last modified 2025-09-18 15:36:43

License: CC 4.0 BY-SA

First published 2024-07-11 07:29:50

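1. NULL handling in comparison operators
2. NULL handling in logical operators
3. NULL handling in expressions
4. NULL handling in condition expressions of WHERE, HAVING and JOIN clauses
5. NULL handling in GROUP BY and DISTINCT
6. NULL handling in ORDER BY
7. NULL handling in UNION, INTERSECT and EXCEPT
8. NULL handling in EXISTS and NOT EXISTS subqueries
9. NULL handling in IN and NOT IN subqueries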

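A table consists of a set of rows, and each row contains a set of columns. A column is associated with a data type and represents a specific attribute of an entity (for example, age is a column of an entity called person). Sometimes the value of a column for a particular row is not known at the time the row comes into existence. In SQL, such values are represented as NULL. This article details the semantics of NULL handling in various operators, expressions and other SQL constructs.
The following shows the schema layout and data of a table named person. The data contains NULL values in the age column, and the table is used in the examples throughout the sections below.
TABLE: person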

Id	Name	Age
100	Joe	30
200	Marry	NULL
300	Mike	18
400	Fred	50
500	Albert	NULL
600	Michelle	30
700	Dan	50

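1. NULL handling in comparison operators

Apache Spark supports the standard comparison operators such as ">", ">=", "=", "<" and "<=". The result of these operators is unknown (NULL) when one or both operands are unknown (NULL). To compare NULL values for equality, Spark provides a null-safe equal operator ("<=>"), which returns False when exactly one of the operands is NULL and True when both operands are NULL. The following table illustrates the behavior of the comparison operators when one or both operands are NULL: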

Left Operand	Right Operand	>	>=	=	<	<=	<=>
NULL	Any value	NULL	NULL	NULL	NULL	NULL	False
Any value	NULL	NULL	NULL	NULL	NULL	NULL	False
NULL	NULL	NULL	NULL	NULL	NULL	NULL	True
Examples:

-- Normal comparison operators return `NULL` when one of the operands is `NULL`.
SELECT 5 > null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- Normal comparison operators return `NULL` when both operands are `NULL`.
SELECT null = null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- Null-safe equal operator returns `False` when exactly one of the operands is `NULL`.
SELECT 5 <=> null AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|            false|
+-----------------+

-- Null-safe equal operator returns `True` when both operands are `NULL`.
SELECT NULL <=> NULL AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+
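
The comparison table above can be made concrete with a small Python model in which `None` stands in for SQL `NULL`. This is only a sketch of the rules, not how Spark evaluates them, and the helper names `sql_gt`, `sql_eq` and `null_safe_eq` are invented for this illustration.

```python
# Illustrative model only: None plays the role of SQL NULL.
# The helper names are hypothetical and not part of any Spark API.

def sql_gt(left, right):
    """Model of `left > right`: NULL if either operand is NULL."""
    if left is None or right is None:
        return None
    return left > right

def sql_eq(left, right):
    """Model of `left = right`: NULL if either operand is NULL."""
    if left is None or right is None:
        return None
    return left == right

def null_safe_eq(left, right):
    """Model of `left <=> right`: always True or False, never NULL."""
    if left is None or right is None:
        return left is None and right is None
    return left == right

print(sql_gt(5, None))           # None
print(sql_eq(None, None))        # None
print(null_safe_eq(5, None))     # False
print(null_safe_eq(None, None))  # True
```

The only difference between `sql_eq` and `null_safe_eq` is the branch taken when an operand is missing, which is exactly the last column of the table.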

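2. NULL handling in logical operators

Spark supports the standard logical operators such as AND, OR and NOT. These operators take boolean expressions as arguments and return a boolean value.
The following tables illustrate the behavior of the logical operators when one or both operands are NULL.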

Left Operand	Right Operand	OR	AND
True	NULL	True	NULL
False	NULL	NULL	False
NULL	True	True	NULL
NULL	False	NULL	False
NULL	NULL	NULL	NULL

operand	NOT
NULL	NULL

Examples:

-- `OR` returns `true` when one operand is `true`, even if the other is `NULL`.
SELECT (true OR null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+

-- `OR` returns `NULL` when one operand is `NULL` and the other is `false`.
SELECT (null OR false) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

-- `NOT` returns `NULL` when its operand is `NULL`.
SELECT NOT(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+
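
The truth tables above follow the usual three-valued logic, which can be sketched in Python with `True`, `False` and `None` (for NULL). The functions `sql_and`, `sql_or` and `sql_not` are hypothetical helpers written for this illustration, not Spark functions.

```python
# Illustrative model only: True, False and None (for SQL NULL / unknown).

def sql_not(value):
    """NOT NULL is NULL; otherwise ordinary negation."""
    return None if value is None else not value

def sql_and(left, right):
    """False wins over NULL; NULL wins over True."""
    if left is False or right is False:
        return False
    if left is None or right is None:
        return None
    return True

def sql_or(left, right):
    """True wins over NULL; NULL wins over False."""
    if left is True or right is True:
        return True
    if left is None or right is None:
        return None
    return False

# Reproduce the rows of the truth table that involve NULL.
pairs = [(True, None), (False, None), (None, True), (None, False), (None, None)]
for left, right in pairs:
    print(left, right, "OR:", sql_or(left, right), "AND:", sql_and(left, right))
```

A definite answer is returned whenever one operand alone decides the result (`true` for OR, `false` for AND); otherwise the unknown operand makes the whole expression unknown.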

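3. NULL handling in expressions

Comparison operators and logical operators are treated as expressions in Spark. Besides these two kinds, Spark supports other forms of expressions such as function expressions and cast expressions. Expressions in Spark can be broadly classified as:

Null-intolerant expressions
Expressions that can process NULL value operands
- The result of these expressions depends on the expression itself.

3.1 NULL handling in null-intolerant expressions

A null-intolerant expression returns NULL when one or more of its arguments are NULL. Most expressions fall into this category.
Examples: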

SELECT concat('John', null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT positive(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT to_date(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

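3.2 Expressions that can process NULL value operands

This class of expressions is designed to handle NULL values. The result depends on the expression itself. For example, the function expression isnull returns true on NULL input and false on non-NULL input, whereas the function coalesce returns the first non-NULL value in its list of operands. However, coalesce returns NULL when all of its operands are NULL. Below is an incomplete list of expressions in this category.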

COALESCE
NULLIF
IFNULL
NVL
NVL2
ISNAN
NANVL
ISNULL
ISNOTNULL
ATLEASTNNONNULLS
IN

Examples:

SELECT isnull(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             true|
+-----------------+

-- Returns the first occurrence of non `NULL` value.
SELECT coalesce(null, null, 3, null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|                3|
+-----------------+

-- Returns `NULL` as all its operands are `NULL`. 
SELECT coalesce(null, null, null, null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|             null|
+-----------------+

SELECT isnan(null) AS expression_output;
+-----------------+
|expression_output|
+-----------------+
|            false|
+-----------------+
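
To show what "the result depends on the expression itself" means, here is a rough Python model of three of the listed functions, again using `None` for NULL. The function bodies are assumptions written for this sketch and only mirror the commonly documented behavior of `coalesce`, `nullif` and `nvl2`; they are not Spark code.

```python
# Illustrative model only: None plays the role of SQL NULL.

def coalesce(*values):
    """Return the first non-NULL argument, or NULL if there is none."""
    for value in values:
        if value is not None:
            return value
    return None

def nullif(left, right):
    """Return NULL when both operands are equal non-NULL values, else left."""
    if left is not None and right is not None and left == right:
        return None
    return left

def nvl2(expr, if_not_null, if_null):
    """Return if_not_null when expr is non-NULL, otherwise if_null."""
    return if_not_null if expr is not None else if_null

print(coalesce(None, None, 3, None))  # 3
print(coalesce(None, None))           # None
print(nullif(2, 2))                   # None
print(nullif(2, 3))                   # 2
print(nvl2(None, "a", "b"))           # b
```

Unlike the null-intolerant expressions of section 3.1, none of these helpers returns NULL merely because one argument is NULL.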

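3.3 NULL handling in built-in aggregate expressions

Aggregate functions compute a single result by processing a set of input rows. The rules for how aggregate functions handle NULL values are:

NULL values are ignored from processing by all aggregate functions.
- The only exception to this rule is the COUNT(*) function.
Some aggregate functions return NULL when all input values are NULL or the input data set is empty. These functions are:
- MAX
- MIN
- SUM
- AVG
- EVERY
- ANY
- SOME

Examples: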

-- `count(*)` does not skip `NULL` values.
SELECT count(*) FROM person;
+--------+
|count(1)|
+--------+
|       7|
+--------+

-- `NULL` values in column `age` are skipped from processing.
SELECT count(age) FROM person;
+----------+
|count(age)|
+----------+
|         5|
+----------+

-- `count(*)` on an empty input set returns 0. This is unlike the other
-- aggregate functions, such as `max`, which return `NULL`.
SELECT count(*) FROM person where 1 = 0;
+--------+
|count(1)|
+--------+
|       0|
+--------+

-- `NULL` values are excluded from computation of maximum value.
SELECT max(age) FROM person;
+--------+
|max(age)|
+--------+
|      50|
+--------+

-- `max` returns `NULL` on an empty input set.
SELECT max(age) FROM person where 1 = 0;
+--------+
|max(age)|
+--------+
|    null|
+--------+
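
The aggregate rules can be checked by hand against the age column of person. The following Python sketch uses `None` for NULL; `count_star`, `count_col` and `sql_max` are hypothetical stand-ins for `count(*)`, `count(age)` and `max(age)`.

```python
# Illustrative model only: the age column of `person`, with None for NULL.
ages = [30, None, 18, 50, None, 30, 50]

def count_star(rows):
    """count(*) counts rows, so NULL values are not skipped."""
    return len(rows)

def count_col(values):
    """count(col) skips NULL values."""
    return sum(1 for value in values if value is not None)

def sql_max(values):
    """max skips NULLs and returns NULL if nothing is left."""
    non_null = [value for value in values if value is not None]
    return max(non_null) if non_null else None

print(count_star(ages))       # 7
print(count_col(ages))        # 5
print(sql_max(ages))          # 50
print(count_star([]))         # 0
print(sql_max([]))            # None
print(sql_max([None, None]))  # None
```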

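4. NULL handling in condition expressions of WHERE, HAVING and JOIN clauses

The WHERE and HAVING operators filter rows based on a user-specified condition. The JOIN operator combines rows from two tables based on a join condition. For all three operators, the condition expression is a boolean expression that can return True, False or Unknown (NULL). A condition is satisfied only when it evaluates to True; rows for which it evaluates to False or Unknown are dropped.
Examples: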

-- Persons whose age is unknown (`NULL`) are filtered out from the result set.
SELECT * FROM person WHERE age > 0;
+--------+---+
|    name|age|
+--------+---+
|Michelle| 30|
|    Fred| 50|
|    Mike| 18|
|     Dan| 50|
|     Joe| 30|
+--------+---+

-- `IS NULL` expression is used in disjunction to select the persons
-- with unknown (`NULL`) records.
SELECT * FROM person WHERE age > 0 OR age IS NULL;
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

-- Persons with unknown (`NULL`) age are dropped: `max(age)` is `NULL` for that
-- group, so the `HAVING` condition is not satisfied.
SELECT age, count(*) FROM person GROUP BY age HAVING max(age) > 18;
+---+--------+
|age|count(1)|
+---+--------+
| 50|       2|
| 30|       2|
+---+--------+

-- A self join case with a join condition `p1.age = p2.age AND p1.name = p2.name`.
-- The persons with unknown age (`NULL`) are filtered out by the join operator.
SELECT * FROM person p1, person p2
    WHERE p1.age = p2.age
    AND p1.name = p2.name;
+--------+---+--------+---+
|    name|age|    name|age|
+--------+---+--------+---+
|Michelle| 30|Michelle| 30|
|    Fred| 50|    Fred| 50|
|    Mike| 18|    Mike| 18|
|     Dan| 50|     Dan| 50|
|     Joe| 30|     Joe| 30|
+--------+---+--------+---+

-- The age columns from both legs of the join are compared using null-safe equal,
-- which is why the persons with unknown age (`NULL`) are qualified by the join.
SELECT * FROM person p1, person p2
    WHERE p1.age <=> p2.age
    AND p1.name = p2.name;
+--------+----+--------+----+
|    name| age|    name| age|
+--------+----+--------+----+
|  Albert|null|  Albert|null|
|Michelle|  30|Michelle|  30|
|    Fred|  50|    Fred|  50|
|    Mike|  18|    Mike|  18|
|     Dan|  50|     Dan|  50|
|   Marry|null|   Marry|null|
|     Joe|  30|     Joe|  30|
+--------+----+--------+----+
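
The filtering rule (a row survives only if the condition is exactly True) can be sketched in Python as follows. The `where` and `sql_gt` helpers are made up for this illustration, and the row order simply follows the input list rather than Spark's output order.

```python
# Illustrative model only: (name, age) rows of `person`, with None for NULL.
person = [("Joe", 30), ("Marry", None), ("Mike", 18), ("Fred", 50),
          ("Albert", None), ("Michelle", 30), ("Dan", 50)]

def sql_gt(left, right):
    """Model of `left > right`: NULL if either operand is NULL."""
    if left is None or right is None:
        return None
    return left > right

def where(rows, predicate):
    """Keep a row only when the predicate is exactly True (not False, not NULL)."""
    return [row for row in rows if predicate(row) is True]

# WHERE age > 0: the predicate is NULL for Marry and Albert, so they are dropped.
known = where(person, lambda row: sql_gt(row[1], 0))
print([name for name, _ in known])
# ['Joe', 'Mike', 'Fred', 'Michelle', 'Dan']

# WHERE age > 0 OR age IS NULL: the extra test is True for the NULL rows.
everyone = where(person, lambda row: sql_gt(row[1], 0) is True or row[1] is None)
print(len(everyone))  # 7
```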

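5. NULL handling in GROUP BY and DISTINCT

As discussed in Section 1 (NULL handling in comparison operators), two NULL values are not equal. However, for the purposes of grouping and distinct processing, two or more values with NULL data are grouped into the same bucket. This behavior conforms to the SQL standard and to other enterprise database management systems.
Examples: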

-- `NULL` values are put in one bucket in `GROUP BY` processing.
SELECT age, count(*) FROM person GROUP BY age;
+----+--------+
| age|count(1)|
+----+--------+
|null|       2|
|  50|       2|
|  30|       2|
|  18|       1|
+----+--------+

-- All `NULL` ages are considered one distinct value in `DISTINCT` processing.
SELECT DISTINCT age FROM person;
+----+
| age|
+----+
|null|
|  50|
|  30|
|  18|
+----+
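
Python dictionaries and sets happen to treat `None` as equal to itself, so they make a convenient model of the "one bucket for all NULLs" rule. This is an analogy only and says nothing about how Spark implements grouping.

```python
from collections import Counter

# Illustrative model only: the age column of `person`, with None for NULL.
ages = [30, None, 18, 50, None, 30, 50]

# GROUP BY age with count(*): None is an ordinary key, so both NULLs share a bucket.
groups = Counter(ages)
print(groups[None])                        # 2
print(groups[50], groups[30], groups[18])  # 2 2 1

# SELECT DISTINCT age: all NULLs collapse into a single distinct value.
distinct = set(ages)
print(len(distinct))  # 4
```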

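6. NULL handling in ORDER BY

Spark SQL supports a null ordering specification in the ORDER BY clause. When processing ORDER BY, Spark places all NULL values either first or last, depending on the null ordering specification. By default, NULL values are placed first for ascending order and last for descending order.
Examples:

-- `NULL` values are shown first and the other values
-- are sorted in ascending order.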
SELECT age, name FROM person ORDER BY age;
+----+--------+
| age|    name|
+----+--------+
|null|   Marry|
|null|  Albert|
|  18|    Mike|
|  30|Michelle|
|  30|     Joe|
|  50|    Fred|
|  50|     Dan|
+----+--------+

-- Column values other than `NULL` are sorted in ascending
-- order and `NULL` values are shown last.
SELECT age, name FROM person ORDER BY age NULLS LAST;
+----+--------+
| age|    name|
+----+--------+
|  18|    Mike|
|  30|Michelle|
|  30|     Joe|
|  50|     Dan|
|  50|    Fred|
|null|   Marry|
|null|  Albert|
+----+--------+

-- Column values other than `NULL` are sorted in descending
-- order and `NULL` values are shown last.
SELECT age, name FROM person ORDER BY age DESC NULLS LAST;
+----+--------+
| age|    name|
+----+--------+
|  50|    Fred|
|  50|     Dan|
|  30|Michelle|
|  30|     Joe|
|  18|    Mike|
|null|   Marry|
|null|  Albert|
+----+--------+
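
NULL placement in a sort can be modelled by splitting the rows into a NULL part and a non-NULL part. In the Python sketch below, `order_by_age` is a hypothetical helper; the defaults it assumes (NULLs first for ascending, last for descending) follow Spark's ORDER BY documentation.

```python
# Illustrative model only: (age, name) rows of `person`, with None for NULL.
rows = [(30, "Joe"), (None, "Marry"), (18, "Mike"), (50, "Fred"),
        (None, "Albert"), (30, "Michelle"), (50, "Dan")]

def order_by_age(rows, descending=False, nulls_first=None):
    """Sort by age with explicit NULL placement.

    When nulls_first is None, the assumed default applies:
    NULLs first for ascending order, NULLs last for descending order.
    """
    if nulls_first is None:
        nulls_first = not descending
    nulls = [row for row in rows if row[0] is None]
    non_nulls = sorted((row for row in rows if row[0] is not None),
                       key=lambda row: row[0], reverse=descending)
    return nulls + non_nulls if nulls_first else non_nulls + nulls

# ORDER BY age
print([age for age, _ in order_by_age(rows)])
# [None, None, 18, 30, 30, 50, 50]

# ORDER BY age NULLS LAST
print([age for age, _ in order_by_age(rows, nulls_first=False)])
# [18, 30, 30, 50, 50, None, None]

# ORDER BY age DESC NULLS LAST
print([age for age, _ in order_by_age(rows, descending=True, nulls_first=False)])
# [50, 50, 30, 30, 18, None, None]
```

Only the ages are printed because the relative order of rows with equal age is not specified by the queries above.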

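7. NULL handling in UNION, INTERSECT and EXCEPT

In the context of set operations, NULL values are compared for equality in a null-safe manner. This means that when comparing rows, two NULL values are considered equal, unlike with the regular EqualTo (=) operator.
Examples: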

CREATE VIEW unknown_age AS SELECT * FROM person WHERE age IS NULL;

-- Only common rows between two legs of `INTERSECT` are in the
-- result set. The comparison between columns of the rows is done
-- in a null-safe manner.
SELECT name, age FROM person
    INTERSECT
    SELECT name, age from unknown_age;
+------+----+
|  name| age|
+------+----+
|Albert|null|
| Marry|null|
+------+----+

-- Rows with `NULL` age appear in both legs of the `EXCEPT`, so they are not in
-- the output. This shows that the comparison happens in a null-safe manner.
SELECT age, name FROM person
    EXCEPT
    SELECT age, name FROM unknown_age;
+---+--------+
|age|    name|
+---+--------+
| 30|     Joe|
| 50|    Fred|
| 30|Michelle|
| 18|    Mike|
| 50|     Dan|
+---+--------+

-- Performs `UNION` operation between two sets of data.
-- The comparison between columns of the rows is done in a
-- null-safe manner.
SELECT name, age FROM person
    UNION 
    SELECT name, age FROM unknown_age;
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|     Joe|  30|
|Michelle|  30|
|   Marry|null|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
+--------+----+
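
Python sets of tuples compare `None` with `None` as equal, which matches the null-safe row comparison used by the set operators. The sketch below mirrors the three queries above on that basis; it is an analogy, not Spark's implementation.

```python
# Illustrative model only: (name, age) rows, with None for NULL.
person = {("Joe", 30), ("Marry", None), ("Mike", 18), ("Fred", 50),
          ("Albert", None), ("Michelle", 30), ("Dan", 50)}
unknown_age = {row for row in person if row[1] is None}

# INTERSECT: rows with NULL age match each other across both legs.
print(sorted(person & unknown_age))
# [('Albert', None), ('Marry', None)]

# EXCEPT: the rows with NULL age are removed because they match.
print(sorted(name for name, _ in person - unknown_age))
# ['Dan', 'Fred', 'Joe', 'Michelle', 'Mike']

# UNION: duplicates, including the rows with NULL age, are collapsed.
print(len(person | unknown_age))  # 7
```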

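8. NULL handling in EXISTS and NOT EXISTS subqueries

Spark allows EXISTS and NOT EXISTS expressions inside a WHERE clause. These are boolean expressions that return either TRUE or FALSE. In other words, EXISTS is a membership condition that returns TRUE when the subquery it refers to returns one or more rows. Similarly, NOT EXISTS is a non-membership condition that returns TRUE when the subquery returns no rows. Neither expression is affected by the presence of NULL in the subquery result. They are normally faster because they can be converted to semijoins / anti-semijoins without special provisions for null awareness.
Examples: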

-- Even if subquery produces rows with `NULL` values, the `EXISTS` expression
-- evaluates to `TRUE` as the subquery produces 1 row.
SELECT * FROM person WHERE EXISTS (SELECT null);
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+

-- `NOT EXISTS` expression returns `FALSE`. It returns `TRUE` only when the
-- subquery produces no rows. In this case, the subquery returns 1 row.
SELECT * FROM person WHERE NOT EXISTS (SELECT null);
+----+---+
|name|age|
+----+---+
+----+---+

-- `NOT EXISTS` expression returns `TRUE`.
SELECT * FROM person WHERE NOT EXISTS (SELECT 1 WHERE 1 = 0);
+--------+----+
|    name| age|
+--------+----+
|  Albert|null|
|Michelle|  30|
|    Fred|  50|
|    Mike|  18|
|     Dan|  50|
|   Marry|null|
|     Joe|  30|
+--------+----+
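
Because EXISTS looks only at whether the subquery produced any rows, a model of it never inspects the values. In this Python sketch, `exists` is a hypothetical helper and a subquery result is represented as a list of tuples.

```python
# Illustrative model only: a subquery result is a list of row tuples.

def exists(subquery_rows):
    """EXISTS depends on the row count only, never on the values in the rows."""
    return len(subquery_rows) > 0

print(exists([(None,)]))      # True: one row, even though its only value is NULL
print(not exists([(None,)]))  # False
print(not exists([]))         # True: the subquery produced no rows
```

The result is always True or False, which is why no special handling of NULL is needed here.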

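9. NULL handling in IN and NOT IN subqueries

Spark allows IN and NOT IN expressions inside the WHERE clause of a query. Unlike the EXISTS expression, the IN expression can return TRUE, FALSE or UNKNOWN (NULL). Conceptually, an IN expression is semantically equivalent to a set of equality conditions separated by the disjunctive operator (OR). For example, c1 IN (1, 2, 3) is semantically equivalent to (c1 = 1 OR c1 = 2 OR c1 = 3).
As far as NULL values are concerned, the semantics can be deduced from the NULL handling in the comparison operator (=) and the logical operator (OR). To summarize, the rules for computing the result of an IN expression are:

TRUE is returned when the non-NULL value in question is found in the list.
FALSE is returned when the non-NULL value is not found in the list and the list does not contain NULL values.
UNKNOWN is returned when the value is NULL, or when the non-NULL value is not found in the list and the list contains at least one NULL value.

When the list contains NULL, NOT IN never returns TRUE. It returns FALSE if the value is found in the list and UNKNOWN otherwise, because IN returns UNKNOWN for a value that is not found in a list containing NULL, and NOT UNKNOWN is again UNKNOWN.
Examples: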

-- The subquery has only `NULL` value in its result set. Therefore,
-- the result of `IN` predicate is UNKNOWN.
SELECT * FROM person WHERE age IN (SELECT null);
+----+---+
|name|age|
+----+---+
+----+---+

-- The subquery has `NULL` value in the result set as well as a valid 
-- value `50`. Rows with age = 50 are returned. 
SELECT * FROM person
    WHERE age IN (SELECT age FROM VALUES (50), (null) sub(age));
+----+---+
|name|age|
+----+---+
|Fred| 50|
| Dan| 50|
+----+---+

-- Since the subquery has a `NULL` value in its result set, the `NOT IN`
-- predicate is never TRUE (FALSE for age 50, UNKNOWN otherwise). Hence, no
-- rows are qualified for this query.
SELECT * FROM person
    WHERE age NOT IN (SELECT age FROM VALUES (50), (null) sub(age));
+----+---+
|name|age|
+----+---+
+----+---+
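
The three IN rules listed in this section translate directly into a small Python model, with `None` for UNKNOWN/NULL. The helpers `sql_in` and `sql_not_in` are invented for this sketch; they follow the rules exactly as stated and do not model any corner case (such as an empty list) that the rules do not mention.

```python
# Illustrative model only: None plays the role of NULL / UNKNOWN.

def sql_in(value, candidates):
    """Three-valued IN: returns True, False, or None (UNKNOWN)."""
    if value is None:
        return None
    if any(c is not None and c == value for c in candidates):
        return True
    if any(c is None for c in candidates):
        return None
    return False

def sql_not_in(value, candidates):
    """NOT IN is the three-valued negation of IN."""
    result = sql_in(value, candidates)
    return None if result is None else not result

print(sql_in(50, [50, None]))      # True
print(sql_in(30, [50, None]))      # None
print(sql_in(30, [50, 18]))        # False
print(sql_not_in(50, [50, None]))  # False
print(sql_not_in(30, [50, None]))  # None

# WHERE age NOT IN (50, NULL): no age can make the predicate True.
ages = [30, None, 18, 50, None, 30, 50]
print([a for a in ages if sql_not_in(a, [50, None]) is True])  # []
```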