Spark Review 5: DataFrame API Operations

This post walks through data processing with the Apache Spark DataFrame API: reading a CSV file directly, selecting specific fields from a DataFrame, displaying data, handling files without a header row, and converting an RDD into a DataFrame. It also shows how to give the result meaningful column names for tabular display.


1. Reading a file directly:

scala> val userDF=spark.read.format("csv").option("header","true").option("delimiter",",").load("file:///home/data/users.csv")
userDF: org.apache.spark.sql.DataFrame = [user_id: string, locale: string ... 5 more fields]

scala> userDF.printSchema
root
 |-- user_id: string (nullable = true)
 |-- locale: string (nullable = true)
 |-- birthyear: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- joinedAt: string (nullable = true)
 |-- location: string (nullable = true)
 |-- timezone: string (nullable = true)
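
Every column comes back as a string, because a CSV file carries no type information. If typed columns are needed, Spark's built-in inferSchema read option can sample the file and guess the types. A sketch of the same read with that option added (typedDF is just an illustrative name):

scala> val typedDF=spark.read.format("csv").option("header","true").option("inferSchema","true").load("file:///home/data/users.csv")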

2. Selecting a subset of fields from an existing DataFrame to build a new one:

We keep only the first four fields (user_id, locale, birthyear, gender):

scala> val userDF2=userDF.select("user_id","locale","birthyear","gender")
userDF2: org.apache.spark.sql.DataFrame = [user_id: string, locale: string ... 2 more fields]

scala> userDF2.printSchema
root
 |-- user_id: string (nullable = true)
 |-- locale: string (nullable = true)
 |-- birthyear: string (nullable = true)
 |-- gender: string (nullable = true)
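
select also accepts column expressions, so fields can be transformed as they are selected. A minimal sketch, assuming birthyear should become an integer (it was read as a string):

scala> import org.apache.spark.sql.functions.col
scala> val userDF3=userDF2.select(col("user_id"), col("birthyear").cast("int").as("birthyear"))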

3. Displaying data (20 rows by default):

scala> userDF2.show
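
show also has overloads to control the row count and the truncation of long values:

scala> userDF2.show(5)         // first 5 rows only
scala> userDF2.show(5, false)  // first 5 rows, long values not truncated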

4. Reading a file without a header row:
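
(The transcript omits how data1 was created; it would have been loaded with header set to false, so Spark assigns the default column names _c0, _c1, _c2. The file path below is hypothetical:)

scala> val data1=spark.read.format("csv").option("header","false").load("file:///home/data/train.csv")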

scala> import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.col

scala> val data2=data1.select(col("_c0").as("interested"),col("_c1").as("user_id"),col("_c2").as("event_id"))
data2: org.apache.spark.sql.DataFrame = [interested: string, user_id: string ... 1 more field]

scala> data2.printSchema
root
 |-- interested: string (nullable = true)
 |-- user_id: string (nullable = true)
 |-- event_id: string (nullable = true)


scala> data2.show(5,false)
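
For reference, the positional toDF method does the same renaming in a single call (data3 is just an illustrative name):

scala> val data3=data1.toDF("interested","user_id","event_id")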

5. Creating a new DataFrame from an existing RDD:

scala> val wf=sc.textFile("file:///home/data/test.txt").flatMap(line =>line.split(" ").map(x =>(x,1))).reduceByKey(_+_)
wf: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[35] at reduceByKey at <console>:28

scala> wf.foreach(println)
(d,2)
(up,1)
(a,4)
(b,1)
(day,2)
(hello,8)
(good,2)
(study,1)
(c,1)


1. From RDD to DataFrame:

scala> val wcDF=wf.toDF
wcDF: org.apache.spark.sql.DataFrame = [_1: string, _2: int]

scala> wcDF.printSchema
root
 |-- _1: string (nullable = true)
 |-- _2: integer (nullable = false)
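
In spark-shell the implicit conversion behind toDF is already in scope; in a standalone application it needs import spark.implicits._ first. toDF can also take the target column names directly, skipping the default _1/_2:

scala> val wcDF2=wf.toDF("word","count")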

2. From DataFrame to tabular data (renaming columns):

scala> val wc=wcDF.withColumnRenamed("_1","word").withColumnRenamed("_2","count")
wc: org.apache.spark.sql.DataFrame = [word: string, count: int]

scala> wc.printSchema
root
 |-- word: string (nullable = true)
 |-- count: integer (nullable = false)

scala> wc.show
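
With meaningful column names in place, the DataFrame can also be registered as a temporary view and queried with SQL, e.g. to sort words by frequency (the view name here is illustrative):

scala> wc.createOrReplaceTempView("wordcount")
scala> spark.sql("SELECT word, `count` FROM wordcount ORDER BY `count` DESC").show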
