Selected DataFrame operations in Spark and what they do


License: CC 4.0 BY-SA

Tags: #spark

This article covers how to create a Spark DataFrame, its Action operations (show, collect, foreach, describe, and others), and conditional queries and join operations, including where, filter, select, groupBy, join, and agg.


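A DataFrame in Spark SQL is similar to a table in a relational database. The queries you would run against one or more tables in a relational database can all be expressed through the DataFrame API; see the official documentation: link

I. Creating a DataFrame

Spark SQL can create a DataFrame from an existing RDD, Parquet files, JSON files, Hive tables, or another relational database reached over JDBC. The example here uses MySQL: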

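Connection code:

	import org.apache.spark.sql.SparkSession
	object sparksqlDataframe {
	  def main(args: Array[String]): Unit = {
	    // Local session with two worker threads
	    val spark = SparkSession.builder().appName("sparkSql").master("local[2]").getOrCreate()
	    // Load the tb_food table from MySQL over JDBC (the MySQL JDBC driver must be on the classpath)
	    val jdbcDF = spark.read
	      .format("jdbc")
	      .option("url","jdbc:mysql://localhost:3306/app_food?useUnicode=true&characterEncoding=utf-8")
	      .option("dbtable","tb_food")
	      .option("user","root")
	      .option("password","123456")
	      .load()
	    // The DataFrame operations in the following sections go here
	    spark.stop()
	  }
	}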

II. DataFrame Action operations

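show(numRows: Int, truncate: Boolean/Int): Unit
show() can be called with no arguments, or with an Int and/or a Boolean. By default it prints the first 20 rows and truncates each cell to 20 characters, replacing the overflow with an ellipsis. In every variant the cells are right-aligned.

- A single Int: prints that many rows.
- A single Boolean: whether to truncate strings longer than 20 characters; the default is true.
- An Int and a Boolean: prints that many rows, and truncates strings longer than 20 characters when the Boolean is true.
- Two Ints: the first is the number of rows to print; when the second is greater than 0, strings longer than that many characters are truncated.
  Usage: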
```
   jdbcDF.show(3, false)         
```

collect

collect(): Array[T] returns an array containing all rows of the Dataset.

Usage:
```
 	val array = jdbcDF.collect()
```
collectAsList() : List[T]

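Returns a Java list (List) containing all rows of the Dataset. Running this on a large dataset moves every row to the driver and can cause an OutOfMemoryError that crashes the program.
Usage:

		val list = jdbcDF.collectAsList()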

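describe(cols: String*): DataFrame

Computes statistics for one or more numeric and string columns: count, mean, stddev (standard deviation), min, and max. If no columns are given, it computes the statistics for all numeric and string columns.
Usage: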
```
    jdbcDF.describe("FOOD_ID","FOOD_NAME","PREPARE").show()
```
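
To make the shape of the describe result concrete, here is a minimal sketch. It assumes a tiny in-memory table in place of tb_food; the object name DescribeSketch and the sample rows are made up for illustration. The result is itself a DataFrame whose first column, summary, names the statistic, and whose values are strings.

```scala
import org.apache.spark.sql.SparkSession

object DescribeSketch {
  // Returns the statistics of the PREPARE column as statistic name -> value (both strings).
  def run(): Map[String, String] = {
    val spark = SparkSession.builder().appName("describeSketch").master("local[2]").getOrCreate()
    try {
      import spark.implicits._
      // Hypothetical sample rows standing in for the MySQL table
      val foods = Seq((1, "rice", 10), (2, "soup", 20), (3, "salad", 30))
        .toDF("FOOD_ID", "FOOD_NAME", "PREPARE")
      // One row per statistic: count, mean, stddev, min, max
      val stats = foods.describe("PREPARE")
      stats.collect().map(r => r.getString(0) -> r.getString(1)).toMap
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```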
foreach
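foreach(func: ForeachFunction[T]): Unit (Java-specific) runs func on each element of the Dataset.
foreach(f: (T) ⇒ Unit): Unit applies the function f to every row.
Others
head() / first(): T returns the first row.
head(n: Int): Array[T] / take(n: Int): Array[T] returns the first n rows.
takeAsList(n: Int): List[T] returns the first n rows as a List.
take moves the data to the application's driver, so a very large n can cause an OutOfMemoryError and crash the process.
count(): Long returns the number of rows in the Dataset.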

III. Conditional queries and join operations on a DataFrame

where conditions
where(conditionExpr: String): Dataset[T] / filter(conditionExpr: String): Dataset[T]
Filters rows using the given SQL expression.
Usage:
```
 	jdbcDF.where("FOOD_ID > 20 AND FOOD_NAME LIKE '%菜%'").show()
```
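where(condition: Column): Dataset[T] / filter(condition: Column): Dataset[T] where is an alias of filter. These overloads take a Column condition instead of a SQL string; the effect is the same as above.
Usage: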
```
peopleDs.where($"age" > 15)
peopleDs.filter($"age" > 15)
```
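
The two filtering styles above can be compared side by side. This is a minimal sketch under assumptions: the object name WhereFilterSketch and the sample rows are hypothetical, and an in-memory DataFrame stands in for jdbcDF.

```scala
import org.apache.spark.sql.SparkSession

object WhereFilterSketch {
  // Returns the FOOD_IDs kept by the SQL-string form and by the Column form.
  def run(): (Seq[Int], Seq[Int]) = {
    val spark = SparkSession.builder().appName("whereFilterSketch").master("local[2]").getOrCreate()
    try {
      import spark.implicits._
      // Hypothetical sample rows
      val foods = Seq((10, "rice"), (21, "soup"), (35, "salad")).toDF("FOOD_ID", "FOOD_NAME")
      // SQL expression passed as a string
      val bySql = foods.where("FOOD_ID > 20").select("FOOD_ID").as[Int].collect().toSeq.sorted
      // The same condition written as a Column expression
      val byColumn = foods.filter($"FOOD_ID" > 20).select("FOOD_ID").as[Int].collect().toSeq.sorted
      (bySql, byColumn)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```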
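Selecting specific columns
select(col: String, cols: String*): DataFrame selects a set of columns. This overload accepts only the names of existing columns, not expressions.
Usage: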
```
	ds.select("colA", "colB")
	ds.select($"colA", $"colB")
```

select(cols: Column*): DataFrame selects a set of column-based expressions.
Usage:

   ds.select($"colA", $"colB" + 1)
   jdbcDF.select(jdbcDF( "id" ), jdbcDF( "id") + 1 )

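selectExpr(exprs: String*): DataFrame selects a set of SQL expressions, so you can call functions (including UDFs) on columns, assign aliases, and so on.
Usage:
```
	jdbcDF.selectExpr("id" , "c3 as time" , "round(c4)" )
	ds.selectExpr("colA", "colB as newName", "abs(colC)")
	// Equivalent to the selectExpr call above; expr comes from org.apache.spark.sql.functions
	ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
```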
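
As a rough illustration of what selectExpr produces, the sketch below runs three SQL expressions against a small in-memory DataFrame. The object name SelectExprSketch, the aliases next and absC, and the sample values are assumptions made for the example.

```scala
import org.apache.spark.sql.SparkSession

object SelectExprSketch {
  // Returns the output column names and the projected rows.
  def run(): (Seq[String], Seq[(Int, Int, Double)]) = {
    val spark = SparkSession.builder().appName("selectExprSketch").master("local[2]").getOrCreate()
    try {
      import spark.implicits._
      // Hypothetical sample rows
      val df = Seq((1, -2.6), (2, 3.4)).toDF("colA", "colC")
      // A plain column, an arithmetic expression with an alias, and a function call with an alias
      val out = df.selectExpr("colA", "colA + 1 as next", "abs(colC) as absC")
      val rows = out.orderBy("colA").collect()
        .map(r => (r.getInt(0), r.getInt(1), r.getDouble(2))).toSeq
      (out.columns.toSeq, rows)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```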
- apply(colName: String): Column / col(colName: String): Column selects a column by name and returns it as a Column.
  Usage:
```
val x = jdbcDF.col("FOOD_ID")
val y = jdbcDF.apply("FOOD_NAME")   
```
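- drop(col: Column): DataFrame / drop(colNames: String*): DataFrame returns a new DataFrame without the dropped columns. The Column overload removes one column per call, while the String overload accepts several names. If the schema does not contain a given column name, the call is a no-op.
  Usage: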
```
jdbcDF.drop("id")
jdbcDF.drop(jdbcDF("id"))
```
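
The behaviour of drop is easiest to see on the resulting schema. The following is a minimal sketch with a hypothetical object name (DropSketch) and made-up columns; it shows a single-column drop, a multi-column drop, and the no-op case for a name that is not in the schema.

```scala
import org.apache.spark.sql.SparkSession

object DropSketch {
  // Returns the remaining column names for three drop calls.
  def run(): (Seq[String], Seq[String], Seq[String]) = {
    val spark = SparkSession.builder().appName("dropSketch").master("local[2]").getOrCreate()
    try {
      import spark.implicits._
      // Hypothetical sample row
      val df = Seq((1, "a", true)).toDF("id", "name", "flag")
      val one = df.drop("id").columns.toSeq              // one column removed
      val several = df.drop("id", "flag").columns.toSeq  // several names in one call
      val missing = df.drop("missing").columns.toSeq     // unknown name: schema unchanged
      (one, several, missing)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```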

limit: limit(n: Int): Dataset[T]

limit returns the first n rows of the DataFrame as a new DataFrame. Unlike take and head, limit is not an Action: it returns a Dataset, whereas the others return an Array.
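
The difference between limit and take can be sketched as follows. The object name LimitVsTakeSketch and the sample ids are assumptions; the orderBy is only there to make the selected rows deterministic.

```scala
import org.apache.spark.sql.SparkSession

object LimitVsTakeSketch {
  // Returns the row count of the limited Dataset and the ids fetched by take.
  def run(): (Long, Seq[Int]) = {
    val spark = SparkSession.builder().appName("limitVsTakeSketch").master("local[2]").getOrCreate()
    try {
      import spark.implicits._
      val ids = Seq(3, 1, 5, 2, 4).toDF("id")
      // limit is a transformation: nothing runs until an action is called on the result
      val limited = ids.orderBy("id").limit(2)
      // take is an action: it runs the query and brings an Array[Row] to the driver
      val taken = ids.orderBy("id").take(2)
      (limited.count(), taken.map(_.getInt(0)).toSeq)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```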
Sorting
orderBy(sortExprs: Column*): Dataset[T] / sort(sortExprs: Column*): Dataset[T] returns a Dataset sorted by the given columns, in ascending or descending order.
Usage:
```
   	ds.sort($"col1", $"col2".desc)
   	jdbcDF.orderBy(jdbcDF("c4").desc)
```

- sort(sortCol: String, sortCols: String*): Dataset[T] / orderBy(sortCol: String, sortCols: String*): Dataset[T] Same effect; these overloads take column names directly as strings.
  Usage:
				
			ds.sort("sortcol")
			ds.sort($"sortcol")
			ds.sort($"sortcol".asc)
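
A small sketch of mixed-direction sorting, assuming a hypothetical object name (SortSketch) and made-up rows: the primary key is sorted descending and ties are broken by a second column in ascending order.

```scala
import org.apache.spark.sql.SparkSession

object SortSketch {
  // Returns the rows ordered by score descending, then by name ascending.
  def run(): Seq[(String, Int)] = {
    val spark = SparkSession.builder().appName("sortSketch").master("local[2]").getOrCreate()
    try {
      import spark.implicits._
      // Hypothetical sample rows
      val df = Seq(("b", 2), ("a", 2), ("c", 1)).toDF("name", "score")
      df.sort($"score".desc, $"name".asc).as[(String, Int)].collect().toSeq
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```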

groupBy:
- groupBy(col1: String, cols: String*): RelationalGroupedDataset takes column names as strings.
- groupBy(cols: Column*): RelationalGroupedDataset takes Column objects. Both group the Dataset by the given columns so that aggregations can be run on each group.
  Usage:
```
 jdbcDF.groupBy("c1" )
 jdbcDF.groupBy( jdbcDF( "c1"))
 ds.groupBy($"department").avg()
```
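
Since groupBy only returns a RelationalGroupedDataset, an aggregation has to follow it before any rows come back. Here is a minimal sketch; the object name GroupBySketch and the department data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object GroupBySketch {
  // Returns the average age per department.
  def run(): Map[String, Double] = {
    val spark = SparkSession.builder().appName("groupBySketch").master("local[2]").getOrCreate()
    try {
      import spark.implicits._
      // Hypothetical sample rows
      val staff = Seq(("eng", 30), ("eng", 40), ("ops", 50)).toDF("department", "age")
      // avg("age") turns the grouped data back into a DataFrame: (department, avg(age))
      staff.groupBy("department").avg("age")
        .collect().map(r => r.getString(0) -> r.getDouble(1)).toMap
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```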
distinct
- distinct(): Dataset[T] returns a new Dataset that contains only the unique rows of this Dataset. It is an alias for dropDuplicates.
  Usage:
```
 jdbcDF.distinct()
```
- dropDuplicates(colNames: Array[String]): Dataset[T] / dropDuplicates(col1: String, cols: String*): Dataset[T] takes column names as strings and returns a new Dataset with duplicate rows removed, comparing only the given columns.
  Usage:
```
jdbcDF.dropDuplicates(Seq("c1"))
```
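
The sketch below contrasts distinct, which compares whole rows, with dropDuplicates on a subset of columns. The object name DedupSketch and the sample rows are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

object DedupSketch {
  // Returns the row counts after distinct() and after dropDuplicates on c1 only.
  def run(): (Long, Long) = {
    val spark = SparkSession.builder().appName("dedupSketch").master("local[2]").getOrCreate()
    try {
      import spark.implicits._
      // Hypothetical sample rows; (2, "a") appears twice
      val df = Seq((1, "a"), (1, "b"), (2, "a"), (2, "a")).toDF("c1", "c2")
      val wholeRows = df.distinct().count()              // unique (c1, c2) pairs
      val byC1 = df.dropDuplicates(Seq("c1")).count()    // one row per c1 value
      (wholeRows, byC1)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```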

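Aggregation

agg(expr: Column, exprs: Column*): DataFrame aggregates over the entire Dataset without groups. It is commonly used together with groupBy.
Usage:

  	jdbcDF.agg("id" -> "max", "c4" -> "sum") // maximum of id and sum of c4
  	ds.agg(max($"age"), avg($"salary"))
  	ds.groupBy().agg(max($"age"), avg($"salary")) // same as the line above: ds.agg is shorthand for ds.groupBy().agg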
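
A self-contained version of the two agg styles might look like the sketch below. The object name AggSketch and the sample rows are hypothetical; max and avg are imported from org.apache.spark.sql.functions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, max}

object AggSketch {
  // Returns (max age, average salary) computed with Column expressions and with name -> function pairs.
  def run(): ((Int, Double), (Int, Double)) = {
    val spark = SparkSession.builder().appName("aggSketch").master("local[2]").getOrCreate()
    try {
      import spark.implicits._
      // Hypothetical sample rows
      val people = Seq(("a", 20, 100.0), ("b", 30, 200.0)).toDF("name", "age", "salary")
      val viaColumns = people.agg(max($"age"), avg($"salary")).head()
      val viaPairs = people.agg("age" -> "max", "salary" -> "avg").head()
      ((viaColumns.getInt(0), viaColumns.getDouble(1)), (viaPairs.getInt(0), viaPairs.getDouble(1)))
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```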

UNION
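- union(other: Dataset[T]): Dataset[T] returns a new Dataset containing the union of the rows in this Dataset and another Dataset. This is equivalent to UNION ALL in SQL. For a SQL-style set union that removes duplicate rows, follow this function with distinct. As is standard in SQL, this function resolves columns by position, not by name.
  Usage:
```
 jdbcDF.union(jdbcDF.limit(1))
```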
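
To illustrate the UNION ALL behaviour, the following sketch counts the rows before and after adding distinct. The object name UnionSketch and the two sample DataFrames are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

object UnionSketch {
  // Returns the row counts of union (keeps duplicates) and union followed by distinct.
  def run(): (Long, Long) = {
    val spark = SparkSession.builder().appName("unionSketch").master("local[2]").getOrCreate()
    try {
      import spark.implicits._
      // Hypothetical sample rows; (2, "y") is present in both inputs
      val a = Seq((1, "x"), (2, "y")).toDF("id", "v")
      val b = Seq((2, "y"), (3, "z")).toDF("id", "v")
      val unionAll = a.union(b).count()            // duplicates kept, like UNION ALL
      val unionSet = a.union(b).distinct().count() // duplicates removed, like UNION
      (unionAll, unionSet)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```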
join
- join(right: Dataset[_]): DataFrame joins with another DataFrame. It behaves as an INNER JOIN and requires a subsequent join predicate.
  Usage:
```
 joinDF1.join(joinDF2)
```
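- join(right: Dataset[_], usingColumn: String): DataFrame performs an inner equi-join with another DataFrame on the given column. Unlike the other join functions, the join column appears only once in the output, similar to the JOIN USING syntax in SQL.
  Usage: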
```
  df1.join(df2, "user_id")
  joinDF1.join(joinDF2, "id")  
```
- join(right: Dataset[_], usingColumns: Seq[String]): DataFrame joins on several columns at once; otherwise the effect is the same as above.
  Usage:
```
  df1.join(df2, Seq("user_id", "user_name"))
  joinDF1.join(joinDF2, Seq("id", "name"))
```

- join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame / join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame joins with another DataFrame on the given columns or the given join expression, using the specified join type. joinType must be one of: inner, cross, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.
  Usage:

   // Scala:
   import org.apache.spark.sql.functions._
   df1.join(df2, $"df1Key" === $"df2Key", "outer")
   joinDF1.join(joinDF2, Seq("id", "name"), "inner")
   // Java:
   import static org.apache.spark.sql.functions.*;
   df1.join(df2, col("df1Key").equalTo(col("df2Key")), "outer");

- join(right: Dataset[_], joinExprs: Column): DataFrame performs an inner join using the given Column join expression.
  Usage:

  // The following two are equivalent:
  df1.join(df2, $"df1Key" === $"df2Key")
  df1.join(df2).where($"df1Key" === $"df2Key")
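
The practical difference between joining on a column name and joining on a Column expression shows up in the output schema. This is a minimal sketch under assumptions: the object name JoinSketch and the users and orders data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object JoinSketch {
  // Returns the columns of a USING-style join, the columns of an expression join,
  // and the row count of a left outer join.
  def run(): (Seq[String], Seq[String], Long) = {
    val spark = SparkSession.builder().appName("joinSketch").master("local[2]").getOrCreate()
    try {
      import spark.implicits._
      // Hypothetical sample rows; only id 1 exists on both sides
      val users = Seq((1, "ann"), (2, "bob")).toDF("id", "name")
      val orders = Seq((1, "tea"), (3, "jam")).toDF("id", "item")
      val usingColumn = users.join(orders, "id")                        // id appears once
      val byExpression = users.join(orders, users("id") === orders("id")) // id appears twice
      val leftOuter = users.join(orders, Seq("id"), "left_outer")       // keeps every user
      (usingColumn.columns.toSeq, byExpression.columns.toSeq, leftOuter.count())
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```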

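Statistics for specific columns
- stat
  stat: DataFrameStatFunctions. The stat method is used to compute statistics on one column or between columns, such as covariance and correlation. It returns a DataFrameStatFunctions object.
  Usage:
```
 jdbcDF.stat.freqItems(Seq ("c1") , 0.3) // items of c1 that occur in more than 30% of rows (approximate; may include false positives)
 ds.stat.freqItems(Seq("a")) // frequent items in the column named "a", using the default support of 1%
```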
Intersection and difference
- intersect returns the rows that appear in both DataFrames; it is equivalent to INTERSECT in SQL.
  intersect(other: Dataset[T]): Dataset[T]
  Usage:
```
  jdbcDF.intersect(jdbcDF.limit(1))
```
- except returns the rows that are in this DataFrame but not in the other one.
  except(other: Dataset[T]): Dataset[T]
  Usage:
```
  jdbcDF.except(jdbcDF.limit(1))
```
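
Both set operations can be checked on two small overlapping DataFrames. The sketch below uses a hypothetical object name (SetOpsSketch) and made-up ids.

```scala
import org.apache.spark.sql.SparkSession

object SetOpsSketch {
  // Returns the ids common to both inputs and the ids present only in the first input.
  def run(): (Seq[Int], Seq[Int]) = {
    val spark = SparkSession.builder().appName("setOpsSketch").master("local[2]").getOrCreate()
    try {
      import spark.implicits._
      val a = Seq(1, 2, 3).toDF("id")
      val b = Seq(2, 3, 4).toDF("id")
      val common = a.intersect(b).as[Int].collect().sorted.toSeq // rows in both a and b
      val onlyInA = a.except(b).as[Int].collect().sorted.toSeq   // rows in a but not in b
      (common, onlyInA)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```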
Working with column names
- Rename a column of the DataFrame
  withColumnRenamed(existingName: String, newName: String): DataFrame If the given column name does not exist, the call is a no-op.
  Usage:
```
 jdbcDF.withColumnRenamed( "id" , "idx" )
```
- withColumn(colName: String, col: Column): DataFrame adds a column named colName to the DataFrame; if colName already exists, the existing column is replaced.
  Usage:
```
  jdbcDF.withColumn("id2", jdbcDF("id")) 
```
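
The renaming and column-adding rules described above can be sketched together. The object name ColumnNameSketch and the sample row are assumptions; the example covers a successful rename, a rename of a missing column, a new column, and an overwritten column.

```scala
import org.apache.spark.sql.SparkSession

object ColumnNameSketch {
  // Returns the schemas after each operation, plus the overwritten id value.
  def run(): (Seq[String], Seq[String], Seq[String], Int) = {
    val spark = SparkSession.builder().appName("columnNameSketch").master("local[2]").getOrCreate()
    try {
      import spark.implicits._
      // Hypothetical sample row
      val df = Seq((1, "a")).toDF("id", "name")
      val renamed = df.withColumnRenamed("id", "idx").columns.toSeq   // id becomes idx
      val untouched = df.withColumnRenamed("nope", "x").columns.toSeq // unknown name: no-op
      val added = df.withColumn("id2", df("id") + 1).columns.toSeq    // new column appended
      val replaced = df.withColumn("id", df("id") * 10)               // existing column overwritten
      (renamed, untouched, added, replaced.head().getInt(0))
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```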