Drop rows where every column is null or NaN:
df.na.drop("all")
Drop rows where any column is null or NaN:
df.na.drop("any")
Example:
scala> df.show
+----+-------+--------+-------------------+-----+----------+
| id|zipcode| type| city|state|population|
+----+-------+--------+-------------------+-----+----------+
| 1| 704|STANDARD| null| PR| 30100|
| 2| 704| null|PASEO COSTA DEL SUR| PR| null|
| 3| 709| null| BDA SAN LUIS| PR| 3700|
| 4| 76166| UNIQUE| CINGULAR WIRELESS| TX| 84000|
| 5| 76177|STANDARD| null| TX| null|
|null| null| null| null| null| null|
| 7| 76179|STANDARD| null| TX| null|
+----+-------+--------+-------------------+-----+----------+
scala> df.na.drop("all").show()
+---+-------+--------+-------------------+-----+----------+
| id|zipcode| type| city|state|population|
+---+-------+--------+-------------------+-----+----------+
| 1| 704|STANDARD| null| PR| 30100|
| 2| 704| null|PASEO COSTA DEL SUR| PR| null|
| 3| 709| null| BDA SAN LUIS| PR| 3700|
| 4| 76166| UNIQUE| CINGULAR WIRELESS| TX| 84000|
| 5| 76177|STANDARD| null| TX| null|
| 7| 76179|STANDARD| null| TX| null|
+---+-------+--------+-------------------+-----+----------+
scala> df.na.drop().show()
+---+-------+------+-----------------+-----+----------+
| id|zipcode| type| city|state|population|
+---+-------+------+-----------------+-----+----------+
| 4| 76166|UNIQUE|CINGULAR WIRELESS| TX| 84000|
+---+-------+------+-----------------+-----+----------+
scala> df.na.drop("any").show()
+---+-------+------+-----------------+-----+----------+
| id|zipcode| type| city|state|population|
+---+-------+------+-----------------+-----+----------+
| 4| 76166|UNIQUE|CINGULAR WIRELESS| TX| 84000|
+---+-------+------+-----------------+-----+----------+
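The console output above can be reproduced with a small self-contained snippet (a sketch assuming Spark is on the classpath and running in local mode; the values mirror the sample table):

```scala
import org.apache.spark.sql.SparkSession

// Local-mode session just for the demo (assumption: spark-sql is available).
val spark = SparkSession.builder().master("local[1]").appName("na-drop-demo").getOrCreate()
import spark.implicits._

// Rebuild the sample table; None becomes null in the DataFrame.
val df = Seq(
  (Some(1), Some(704),   Some("STANDARD"), None,                        Some("PR"), Some(30100)),
  (Some(2), Some(704),   None,             Some("PASEO COSTA DEL SUR"), Some("PR"), None),
  (Some(3), Some(709),   None,             Some("BDA SAN LUIS"),        Some("PR"), Some(3700)),
  (Some(4), Some(76166), Some("UNIQUE"),   Some("CINGULAR WIRELESS"),   Some("TX"), Some(84000)),
  (Some(5), Some(76177), Some("STANDARD"), None,                        Some("TX"), None),
  (None,    None,        None,             None,                        None,       None),
  (Some(7), Some(76179), Some("STANDARD"), None,                        Some("TX"), None)
).toDF("id", "zipcode", "type", "city", "state", "population")

// "all" drops only the fully-null row; "any" keeps only the fully-populated row.
val noAllNull = df.na.drop("all")   // 6 rows remain
val noAnyNull = df.na.drop("any")   // 1 row remains (id = 4)
```

Note that drop() with no arguments is equivalent to drop("any").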
Drop rows that are null in a given set of columns:
// Assuming the file holds one column name per line; sparkEnv.sc is this codebase's SparkContext
val nameArray = sparkEnv.sc.textFile("/master/abc.txt").collect()
val dropped = df.na.drop("all", nameArray)   // collect() already returns Array[String]
df.na.drop(Seq("population", "type"))        // drop rows where population or type is null
Drop rows where a specified column is null (e.g. drop rows whose create_time is null):
df.na.drop("all", Seq("create_time"))
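Applied to the sample table, the column-subset variants behave as follows (again a sketch assuming local-mode Spark; only the relevant columns are rebuilt here):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("na-drop-subset").getOrCreate()
import spark.implicits._

// Only the columns involved in the subset check are needed for this demo.
val df = Seq(
  (Some(1), Some("STANDARD"), Some(30100)),
  (Some(2), None,             None),
  (Some(3), None,             Some(3700)),
  (Some(4), Some("UNIQUE"),   Some(84000)),
  (Some(5), Some("STANDARD"), None),
  (None,    None,             None),
  (Some(7), Some("STANDARD"), None)
).toDF("id", "type", "population")

// "any" over a subset: drop rows where type OR population is null.
val anySubset = df.na.drop(Seq("type", "population"))        // keeps ids 1 and 4
// "all" over a subset: drop rows only where BOTH type AND population are null.
val allSubset = df.na.drop("all", Seq("type", "population")) // drops ids 2 and 6
```

Columns outside the subset (here, id) are ignored: the fully-null row 6 is dropped by both variants, but row 5 survives the "all" variant because type is populated.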
Method signatures:
def drop(): DataFrame
Returns a new DataFrame that drops rows containing any null or NaN values.
def drop(how: String): DataFrame
Returns a new DataFrame that drops rows containing null or NaN values.
If how is "any", then drop rows containing any null or NaN values. If how is "all", then drop rows only if every column is null or NaN for that row.
def drop(how: String, cols: Seq[String]): DataFrame
(Scala-specific) Returns a new DataFrame that drops rows containing null or NaN values in the specified columns.
If how is "any", then drop rows containing any null or NaN values in the specified columns. If how is "all", then drop rows only if every specified column is null or NaN for that row.
def drop(how: String, cols: Array[String]): DataFrame
Returns a new DataFrame that drops rows containing null or NaN values in the specified columns.
If how is "any", then drop rows containing any null or NaN values in the specified columns. If how is "all", then drop rows only if every specified column is null or NaN for that row.
def drop(cols: Seq[String]): DataFrame
(Scala-specific) Returns a new DataFrame that drops rows containing any null or NaN values in the specified columns.
def drop(cols: Array[String]): DataFrame
Returns a new DataFrame that drops rows containing any null or NaN values in the specified columns.
More signatures:
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.DataFrameNaFunctions
References:
Many Spark usage examples: https://sparkbyexamples.com/spark/spark-dataframe-drop-rows-with-null-values/
Sample code and dataset: https://github.com/spark-examples/spark-scala-examples (CSV path: src/main/resources/small_zipcode.csv)
https://www.jianshu.com/p/39852729736a
This article explains how to use the na.drop() function on an Apache Spark DataFrame to remove fully-null rows ("all") and rows containing any nulls ("any"), with example code. It also covers dropping rows based on specific columns or a given list of column names, along with the various method signatures and links to related examples.