Big Data: Spark SQL

Licensed under CC 4.0 BY-SA

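This article introduces the core features of Spark SQL: several ways to create DataFrames and Datasets, and how to work with data using SQL statements and the DSL. It also shows how to create views, run multi-table queries, and use different data sources.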

Spark SQL

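Spark SQL is the Spark module for processing structured data. It is not designed for unstructured data.

Features

Easy integration (it ships with Spark, so no separate installation is needed)
Uniform data access (structured sources such as JDBC, JSON, Hive, and Parquet files can all serve as Spark SQL data sources)
Hive compatibility (data stored in Hive can be read into Spark SQL and queried there)
Standard connectivity (JDBC and ODBC)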

Creating a DataFrame

1. Using a case class

The grade.txt file

06140411	Mr.Wu	102	110	106	318
06140407	Mr.Zhi	60	98	80	238
06140404	Mr.Zhang	98	31	63	192
06140403	Mr.Zhang	105	109	107	321
06140406	Mr.Xie	57	87	92	236
06140408	Mr.Guo	102	102	50	254
06140402	Mr.Li	54	61	64	179
06140401	Mr.Deng	83	76	111	270
06140409	Mr.Zhang	70	56	91	217
06140412	Mr.Yao	22	119	112	253
06140410	Mr.Su	45	65	80	190
06140405	Mr.Zheng	79	20	26	125

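Scala code

// Run in spark-shell, where sc and spark are predefined and spark.implicits._ is already imported
// Define the schema as a case class
case class info(studentID: String,studentName: String,chinese: Int,math: Int,english: Int,totalGrade: Int)
// Read the file and split each line on tabs
val rdd = sc.textFile("/root/grade.txt").map(_.split("\t"))
// Map each array of fields to an info instance
val rdd1 = rdd.map(x => info(x(0),x(1),x(2).toInt,x(3).toInt,x(4).toInt,x(5).toInt))
// Convert the RDD to a DataFrame (outside spark-shell this needs import spark.implicits._)
val df = rdd1.toDF
df.show

Result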

+---------+-----------+-------+----+-------+----------+                         
|studentID|studentName|chinese|math|english|totalGrade|
+---------+-----------+-------+----+-------+----------+
| 06140411|      Mr.Wu|    102| 110|    106|       318|
| 06140407|     Mr.Zhi|     60|  98|     80|       238|
| 06140404|   Mr.Zhang|     98|  31|     63|       192|
| 06140403|   Mr.Zhang|    105| 109|    107|       321|
| 06140406|     Mr.Xie|     57|  87|     92|       236|
| 06140408|     Mr.Guo|    102| 102|     50|       254|
| 06140402|      Mr.Li|     54|  61|     64|       179|
| 06140401|    Mr.Deng|     83|  76|    111|       270|
| 06140409|   Mr.Zhang|     70|  56|     91|       217|
| 06140412|     Mr.Yao|     22| 119|    112|       253|
| 06140410|      Mr.Su|     45|  65|     80|       190|
| 06140405|   Mr.Zheng|     79|  20|     26|       125|
+---------+-----------+-------+----+-------+----------+

2. Using SparkSession with an explicit schema

The grade.txt file

06140411	Mr.Wu	102	110	106	318
06140407	Mr.Zhi	60	98	80	238
06140404	Mr.Zhang	98	31	63	192
06140403	Mr.Zhang	105	109	107	321
06140406	Mr.Xie	57	87	92	236
06140408	Mr.Guo	102	102	50	254
06140402	Mr.Li	54	61	64	179
06140401	Mr.Deng	83	76	111	270
06140409	Mr.Zhang	70	56	91	217
06140412	Mr.Yao	22	119	112	253
06140410	Mr.Su	45	65	80	190
06140405	Mr.Zheng	79	20	26	125

Scala code

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
// Define the schema explicitly with StructType
val mySchema = StructType(List(
  StructField("studentID", StringType),
  StructField("studentName", StringType),
  StructField("chinese", IntegerType),
  StructField("math", IntegerType),
  StructField("english", IntegerType),
  StructField("totalGrade", IntegerType)))
val rdd = sc.textFile("/root/grade.txt").map(_.split("\t"))
// Each Row must match the field order and types of the schema
val rdd1 = rdd.map(x => Row(x(0),x(1),x(2).toInt,x(3).toInt,x(4).toInt,x(5).toInt))
val df = spark.createDataFrame(rdd1,mySchema)
df.show

3. Reading a file that carries its own format

The student.json file

{"studentID":"06140401", "studentName":"Mr.Deng"}
{"studentID":"06140402", "studentName":"Mr.Li"}
{"studentID":"06140403", "studentName":"Mr.Zhang"}
{"studentID":"06140404", "studentName":"Mr.Zhang"}
{"studentID":"06140405", "studentName":"Mr.Zheng"}
{"studentID":"06140406", "studentName":"Mr.Xie"}
{"studentID":"06140407", "studentName":"Mr.Zhi"}
{"studentID":"06140408", "studentName":"Mr.Guo"}
{"studentID":"06140409", "studentName":"Mr.Zhang"}
{"studentID":"06140410", "studentName":"Mr.Su"}
{"studentID":"06140411", "studentName":"Mr.Wu"}
{"studentID":"06140412", "studentName":"Mr.Yao"}

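Scala code

// Shorthand form
val df = spark.read.json("/root/temp/student.json")
// Equivalent generic form (redefining df is allowed in spark-shell, but not in compiled code)
val df = spark.read.format("json").load("/root/temp/student.json")
df.show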

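Operating on a DataFrame

DSL statements

In the examples below, df1 is the grade DataFrame and df2 is the student DataFrame; both are registered as views in the SQL section that follows.

// Select columns
df1.select($"studentName",$"chinese",$"math",$"english",$"totalGrade").show

// Keep only rows with a total grade above 300
df1.filter($"totalGrade">300).show

// Count rows per roomID. This assumes a DataFrame with a roomID column; the grade.txt data shown earlier has none
df1.groupBy($"roomID").count.show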
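
The DSL calls above can be tried without grade.txt. The following is a minimal sketch, assuming a local SparkSession and a few rows copied from the sample data. The names `grades`, `highScorers`, and `perName` are illustrative and do not come from the original. Because the sample data has no roomID column, the grouping here uses studentName instead.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("dsl-sketch").getOrCreate()
import spark.implicits._

// A few rows taken from the grade.txt sample
val grades = Seq(
  ("06140411", "Mr.Wu", 102, 110, 106, 318),
  ("06140404", "Mr.Zhang", 98, 31, 63, 192),
  ("06140403", "Mr.Zhang", 105, 109, 107, 321)
).toDF("studentID", "studentName", "chinese", "math", "english", "totalGrade")

// select: project two columns
grades.select($"studentName", $"totalGrade").show()

// filter: keep rows whose total grade is above 300
val highScorers = grades.filter($"totalGrade" > 300)
highScorers.show()

// groupBy + count: number of rows per student name
val perName = grades.groupBy($"studentName").count()
perName.show()
```

Each DSL call returns a new DataFrame, so the steps can be chained or kept in separate values as shown here.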

SQL statements

The queries below use two tables, grade and student.

Creating views

df1.createOrReplaceTempView("grade")
df2.createOrReplaceTempView("student")

spark.sql("select studentID,totalGrade from grade").show

spark.sql("select count(*) as studentNum from student").show

Multi-table query (inner join)

// Column names are qualified because studentName may exist in both views
spark.sql("select student.studentName,grade.totalGrade from grade,student where grade.studentID = student.studentID order by grade.studentID").show
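
A comma-separated FROM list combined with a WHERE condition, as used above, is an implicit inner join. As a hedged sketch, the same kind of query can be written with explicit JOIN ... ON syntax. The view names `grade_demo` and `student_demo` and their in-memory rows are stand-ins assumed for illustration, chosen so that they do not overwrite the grade and student views.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("join-sketch").getOrCreate()
import spark.implicits._

// Stand-in data (assumed): two grade rows and two student rows from the samples
Seq(("06140402", 179), ("06140401", 270))
  .toDF("studentID", "totalGrade")
  .createOrReplaceTempView("grade_demo")
Seq(("06140401", "Mr.Deng"), ("06140402", "Mr.Li"))
  .toDF("studentID", "studentName")
  .createOrReplaceTempView("student_demo")

// Explicit inner join with table aliases and qualified column names
val joined = spark.sql(
  """select s.studentName, g.totalGrade
    |from grade_demo g
    |join student_demo s on g.studentID = s.studentID
    |order by g.studentID""".stripMargin)
joined.show()
```

The explicit form keeps the join condition next to the tables it connects, which tends to be easier to read once more than two tables are involved.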

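Views in Spark SQL

The relevant methods are createGlobalTempView, createOrReplaceGlobalTempView, createOrReplaceTempView, and createTempView.

(1) Temporary view (local view): valid only in the current session (createOrReplaceTempView, createTempView)

(2) Global temporary view: visible from different sessions of the same Spark application. It is registered in the global_temp namespace, which behaves like a database, so it is queried as global_temp.<view name> (createGlobalTempView, createOrReplaceGlobalTempView)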
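
The difference in scope can be checked with a second session. This is a minimal sketch under stated assumptions: a local SparkSession, two made-up view names (`student_local`, `student_global`), and the default global temporary database name global_temp.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Try

// Assumption: a local session; in spark-shell, getOrCreate returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("view-scope-sketch").getOrCreate()
import spark.implicits._

val students = Seq(("06140401", "Mr.Deng"), ("06140402", "Mr.Li"))
  .toDF("studentID", "studentName")

// One session-scoped view and one global view over the same data
students.createOrReplaceTempView("student_local")
students.createOrReplaceGlobalTempView("student_global")

// A second session inside the same Spark application
val other = spark.newSession()

// The global view resolves through the global_temp namespace
val globalVisible = Try(other.sql("select * from global_temp.student_global").count()).isSuccess
// The local view does not exist in the other session, so analysis fails
val localVisible = Try(other.sql("select * from student_local").count()).isSuccess

println(s"global view visible: $globalVisible, local view visible: $localVisible")
```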

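Creating a Dataset

1. Using a sequence

Scala code

case class Person(name: String,age: Int)
// toDS yields a typed Dataset[Person]; toDF would yield an untyped DataFrame instead
val ds = Seq(Person("destiny",18),Person("freedom",20)).toDS
ds.show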

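2. Using JSON data

case class Student(studentID: String,studentName: String)
val df = spark.read.format("json").load("/root/temp/student.json")
// as[Student] converts the DataFrame to a typed Dataset[Student]; field names must match the JSON keys
df.as[Student].show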

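3. Using data in other formats

// Read a text file as a Dataset[String], one element per line
val ds = spark.read.text("/root/temp/spark_workCount.txt").as[String]
// Split lines into words and keep words longer than 3 characters
val word = ds.flatMap(_.split(" ")).filter(_.length > 3)
word.show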

// Word count: pair each word with 1, group by the word, then count each group
val word = ds.flatMap(_.split(" ")).map((_,1)).groupByKey(_._1).count
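
The word count above depends on a text file that is not shown. The following sketch applies the same flatMap, map, groupByKey, and count chain to two in-memory lines; the input strings and the names `lines` and `counts` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("wordcount-sketch").getOrCreate()
import spark.implicits._

// Made-up input standing in for the text file
val lines = Seq("spark sql spark", "dataset spark").toDS()

// Same chain as above: split into words, pair with 1, group by word, count
val counts = lines
  .flatMap(_.split(" "))
  .map((_, 1))
  .groupByKey(_._1)
  .count()

counts.show()
```

The result is a Dataset[(String, Long)], so it can be collected into a Map or processed further with typed operations.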

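Operating on a Dataset

// ds must be a Dataset that has a totalGrade column, such as the grade Dataset ds1 defined below
ds.where($"totalGrade" >= 250).show

Multi-table query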

case class Grade(studentID: String,chinese: Int,math: Int,english: Int,totalGrade: Int)
case class Student(studentID: String,studentName: String)
// Build the grade Dataset from a tab-separated file
val gradeRdd = sc.textFile("/root/temp/gradeSheet.txt").map(_.split("\t"))
val ds1 = gradeRdd.map(x => Grade(x(0),x(1).toInt,x(2).toInt,x(3).toInt,x(4).toInt)).toDS
// Build the student Dataset the same way
val studentRdd = sc.textFile("/root/temp/studentSheet.txt").map(_.split("\t"))
val ds2 = studentRdd.map(x => Student(x(0),x(1))).toDS

ds1.join(ds2,"studentID").show

ds1.join(ds2,"studentID").where("totalGrade >= 250").show
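
Dataset.join returns an untyped DataFrame, so the case class types are lost after the join above. As a sketch of an alternative that the original does not use, joinWith keeps both sides typed. The in-memory rows below are copied from the samples and stand in for gradeSheet.txt and studentSheet.txt, whose contents are not shown.

```scala
import org.apache.spark.sql.SparkSession

case class Grade(studentID: String, chinese: Int, math: Int, english: Int, totalGrade: Int)
case class Student(studentID: String, studentName: String)

// Assumption: a local session; in spark-shell, getOrCreate returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("joinwith-sketch").getOrCreate()
import spark.implicits._

// Stand-in data (assumed) for the two tab-separated files
val ds1 = Seq(Grade("06140401", 83, 76, 111, 270), Grade("06140402", 54, 61, 64, 179)).toDS()
val ds2 = Seq(Student("06140401", "Mr.Deng"), Student("06140402", "Mr.Li")).toDS()

// joinWith yields Dataset[(Grade, Student)] instead of a DataFrame
val pairs = ds1.joinWith(ds2, ds1("studentID") === ds2("studentID"))

// Typed filter and map over the joined pairs
val top = pairs
  .filter(_._1.totalGrade >= 250)
  .map(p => (p._2.studentName, p._1.totalGrade))

top.show()
```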

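Using data sources

The load and save functions

Scala code

// load reads Parquet by default (controlled by spark.sql.sources.default)
val ds = spark.read.load("/root/temp/users.parquet")
// save also writes Parquet files by default
ds.select($"name",$"favorite_color").write.save("/root/temp/parquet")
val ds1 = spark.read.load("/root/temp/parquet")
ds1.show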

The mode function

df.write.mode("overwrite").save("/root/temp/parquet")
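
Besides "overwrite", the writer accepts "append", "ignore", and "error" (the default, also spelled "errorifexists"); the SaveMode enum offers the same choices. The sketch below writes to a throwaway temporary directory rather than /root/temp/parquet, uses made-up rows, and assumes the session's default filesystem is the local one.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{SaveMode, SparkSession}

// Assumption: a local session; in spark-shell, getOrCreate returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("save-mode-sketch").getOrCreate()
import spark.implicits._

// Assumption: a temporary output path on the local filesystem
val out = Files.createTempDirectory("save-mode-sketch").resolve("parquet").toString

// Made-up rows with the same column names as the example above
val df = Seq(("a", "blue"), ("b", "red")).toDF("name", "favorite_color")

df.write.mode(SaveMode.Overwrite).save(out) // replace whatever is at the path
df.write.mode(SaveMode.Append).save(out)    // add rows to the existing data
df.write.mode(SaveMode.Ignore).save(out)    // the path exists, so this write is skipped

// Two rows from the overwrite plus two from the append
val rowCount = spark.read.load(out).count()
println(rowCount)
```

With the default mode, a second write to an existing path fails instead of silently replacing data, which is why the example above sets "overwrite" explicitly.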

The saveAsTable function

df.select($"name").write.saveAsTable("table1")
spark.sql("select * from table1").show

The option function

Supports schema merging

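// Write two Parquet partitions that share the column single but differ otherwise
val df = sc.makeRDD(1 to 6).map(x => (x,x*2)).toDF("single","double")
df.write.mode("overwrite").save("/root/temp/table/key=1")
val df1 = sc.makeRDD(7 to 10).map(x => (x,x*3)).toDF("single","triple")
df1.write.mode("overwrite").save("/root/temp/table/key=2")
// mergeSchema combines both schemas; the partition column key is derived from the directory names
val df2 = spark.read.option("mergeSchema",true).parquet("/root/temp/table")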
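
To see what the merged schema looks like, the same steps can be run against a temporary directory. This is a sketch under assumptions: a local SparkSession, a local default filesystem, a made-up temporary path instead of /root/temp/table, and smaller made-up ranges of numbers.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("merge-schema-sketch").getOrCreate()
import spark.implicits._

// Assumption: a temporary base directory on the local filesystem
val base = Files.createTempDirectory("merge-schema-sketch").toString

// Partition key=1 has columns single and double
Seq(1, 2, 3).map(x => (x, x * 2)).toDF("single", "double")
  .write.mode("overwrite").parquet(s"$base/key=1")
// Partition key=2 has columns single and triple
Seq(4, 5).map(x => (x, x * 3)).toDF("single", "triple")
  .write.mode("overwrite").parquet(s"$base/key=2")

// Read both partitions with schema merging enabled
val merged = spark.read.option("mergeSchema", "true").parquet(base)
merged.printSchema()

val columns = merged.columns.toSet
println(columns)
```

Rows from a partition that lacks a column get null in that column, so double is null for key=2 and triple is null for key=1.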
