Save the following content as small_zipcode.csv:
id,zipcode,type,city,state,population
1,704,STANDARD,,PR,30100
2,704,,PASEO COSTA DEL SUR,PR,
3,709,,BDA SAN LUIS,PR,3700
4,76166,UNIQUE,CINGULAR WIRELESS,TX,84000
5,76177,STANDARD,,TX,
,,,,,
7,76179,STANDARD,,TX,
Launch the spark-shell interactive REPL and read the file:
val filePath = "small_zipcode.csv"
val df = spark.read.options(
  Map("inferSchema" -> "true", "delimiter" -> ",", "header" -> "true")).csv(filePath)
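The same reader configuration can also be written with chained .option calls instead of an options map; a minimal equivalent sketch, assuming the same filePath as above:

```scala
// Equivalent reader setup, one option at a time.
val df2 = spark.read.
  option("inferSchema", "true").   // infer column types from the data
  option("delimiter", ",").        // "," is already the CSV default
  option("header", "true").        // first line holds the column names
  csv(filePath)
```

Either form produces the same DataFrame; the map form is convenient when options are built programmatically.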
scala> df.show
+----+-------+--------+-------------------+-----+----------+
| id|zipcode| type| city|state|population|
+----+-------+--------+-------------------+-----+----------+
| 1| 704|STANDARD| null| PR| 30100|
| 2| 704| null|PASEO COSTA DEL SUR| PR| null|
| 3| 709| null| BDA SAN LUIS| PR| 3700|
| 4| 76166| UNIQUE| CINGULAR WIRELESS| TX| 84000|
| 5| 76177|STANDARD| null| TX| null|
|null| null| null| null| null| null|
| 7| 76179|STANDARD| null| TX| null|
+----+-------+--------+-------------------+-----+----------+
The "all" mode removes only rows in which every column is null:
scala> df.na.drop("all").show()
+---+-------+--------+-------------------+-----+----------+
| id|zipcode| type| city|state|population|
+---+-------+--------+-------------------+-----+----------+
| 1| 704|STANDARD| null| PR| 30100|
| 2| 704| null|PASEO COSTA DEL SUR| PR| null|
| 3| 709| null| BDA SAN LUIS| PR| 3700|
| 4| 76166| UNIQUE| CINGULAR WIRELESS| TX| 84000|
| 5| 76177|STANDARD| null| TX| null|
| 7| 76179|STANDARD| null| TX| null|
+---+-------+--------+-------------------+-----+----------+
With no arguments, na.drop() defaults to the "any" mode and removes every row that contains at least one null:
scala> df.na.drop().show()
+---+-------+------+-----------------+-----+----------+
| id|zipcode| type| city|state|population|
+---+-------+------+-----------------+-----+----------+
| 4| 76166|UNIQUE|CINGULAR WIRELESS| TX| 84000|
+---+-------+------+-----------------+-----+----------+
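na.drop also accepts a column subset, which is often what a cleaning step actually needs. A sketch against the same df; both overloads (a Seq of column names, and a mode plus a Seq) are part of the standard DataFrameNaFunctions API:

```scala
// Drop rows with a null in the "population" column only;
// nulls in other columns (e.g. city) are left alone.
df.na.drop(Seq("population")).show()

// Mode plus subset: drop rows where all of the listed
// columns are null at the same time.
df.na.drop("all", Seq("type", "population")).show()
```

On this data, the first call keeps only rows 1, 3, and 4 (the rows with a non-null population), while the second drops row 2 and the all-empty row.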
Reference:
Many more Spark usage examples: https://sparkbyexamples.com/spark/spark-dataframe-drop-rows-with-null-values/
This article showed how to read and clean a CSV file from the Spark shell: dropping rows where every value is null versus keeping only fully non-null rows, with the na.drop() method and its output at each step.