Apache Spark SQL
Spark SQL is an ETL (Extract, Transform, Load) tool built on top of RDDs. On top of the RDD abstraction, Spark SQL introduces Dataset/DataFrame.
These two classes offer RDD-like functionality, meaning users can still apply higher-order operators such as map, flatMap, and filter, while also supporting named, column-based queries. In other words, Dataset/DataFrame provides two APIs for manipulating data, and those APIs give the Spark engine extra information that the system can use to optimize the computation. Spark SQL currently offers two ways to interact with it:
1) SQL scripts
2) The Dataset API (strongly-typed and untyped operations)
Datasets & DataFrames
A Dataset is a distributed collection of data. It is a new API introduced in Spark 1.6 that builds on RDDs (strongly typed, usable with lambda expressions) while also benefiting from Spark SQL's optimized execution engine, so transformations written against Datasets can improve on plain RDD operators in both functionality and performance. You can therefore think of a Dataset as an enhanced RDD. Besides being created from JVM arrays or collections, a Dataset can also be obtained by converting any existing RDD. Python does not have support for the Dataset API.
A DataFrame is a special case of a Dataset. A Dataset can hold elements of any object type, whereas the elements of a DataFrame are always of type Row. Querying Rows is very similar to working with a ResultSet in a traditional database: each Row represents one element of the DataFrame, analogous to a row in a database table, and its fields can be accessed by index or by column name. While the Dataset API's language coverage is limited, the DataFrame API is fully supported across Scala, Java, R, and Python.
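A minimal sketch of Row access by position and by column name (a hypothetical two-column DataFrame; import spark.implicits._ is assumed to be in scope):
import org.apache.spark.sql.Row

val df = List((1, "zhangsan"), (2, "wangwu")).toDF("id", "name")
val firstRow: Row = df.first()
val id   = firstRow.getInt(0)              // access by ordinal position
val name = firstRow.getAs[String]("name")  // access by column name
println(s"$id -> $name")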
Quick Start
- Add the dependencies
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.4.3</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.3</version>
</dependency>
- Word count (untyped)
//1. Create a SparkSession
val spark=SparkSession.builder()
.appName("wordcount")
.master("local[6]")
.getOrCreate()
//2. Import Spark's implicit conversions/enhancements
import spark.implicits._
//3. Create the Dataset
val lines=Array("this is a demo","hello spark")
val wordRDD = spark.sparkContext.makeRDD(lines)
.flatMap(_.split("\\s+"))
.map((_,1))
val ds:Dataset[(String,Int)]=wordRDD.toDS()
//4. Run SQL-style operators on the Dataset
ds.groupBy($"_1") //无类型操作
.sum("_2").as("total")
.withColumnRenamed("_1","word")
.withColumnRenamed("sum(_2)","total")
.show()
//5. Stop Spark
spark.stop()
- Word count (strongly typed)
//1. Create a SparkSession
val spark=SparkSession.builder()
.appName("wordcount")
.master("local[6]")
.getOrCreate()
//2. Import Spark's implicit conversions/enhancements
import spark.implicits._
//3. Create the Dataset
val lines=Array("this is a demo","hello spark")
val wordRDD = spark.sparkContext.makeRDD(lines)
.flatMap(_.split("\\s+"))
.map((_,1))
val ds:Dataset[(String,Int)]=wordRDD.toDS()
//4. Run typed aggregation on the Dataset (requires the typed helpers)
import org.apache.spark.sql.expressions.scalalang.typed
ds.groupByKey(t=>t._1)
  .agg(typed.sum[(String,Int)](tuple=>tuple._2).name("total"))
.show()
//5. Stop Spark
spark.stop()
Dataset & Dataframe
Dataset create
Datasets are similar to RDDs, except that Spark SQL has its own serialization mechanism, called Encoders, that is independent of the Java/Kryo serialization used by RDDs. Unlike RDD serialization, because Datasets also support untyped operations that reference only column names rather than the element type, Spark SQL can often skip deserialization when executing operators, which improves performance.
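Normally import spark.implicits._ supplies these encoders implicitly; the sketch below only makes them explicit and assumes the Person case class from the example that follows is defined at top level:
import org.apache.spark.sql.{Dataset, Encoder, Encoders}

val personEncoder: Encoder[Person] = Encoders.product[Person]    // encoder derived from the case class
val ds: Dataset[Person] = spark.createDataset(
  Seq(Person(1,"zhangsan",18,true)))(personEncoder)               // encoder passed explicitly
ds.show()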
- case-class
case class Person(id:Int,name:String,age:Int,sex:Boolean)
val person: Dataset[Person] =List(Person(1,"zhangsan",18,true),Person(2,"wangwu",28,true)).toDS()
person.select($"id",$"name")
.show()
+---+--------+
| id| name|
+---+--------+
| 1|zhangsan|
| 2| wangwu|
+---+--------+
- Tuple
val person: Dataset[(Int,String,Int,Boolean)] =List((1,"zhangsan",18,true),(2,"wangwu",28,true)).toDS()
person.select($"_1",$"_2")
.show()
+---+--------+
| _1| _2|
+---+--------+
| 1|zhangsan|
| 2| wangwu|
+---+--------+
- JSON data
{"name":"张三","age":18}
{"name":"lisi","age":28}
{"name":"wangwu","age":38}
case class User(name:String,age:Long)// JSON numbers are read as Long by default
spark.read.json("file:///D:/demo/json/").as[User]
.show()
+---+------+
|age| name|
+---+------+
| 18| 张三|
| 28| lisi|
| 38|wangwu|
+---+------+
- From an RDD
- Tuple
val userRDD = spark.sparkContext.makeRDD(List((1,"张三",true,18,15000.0)))
userRDD.toDS().show()
+---+----+----+---+-------+
| _1|  _2|  _3| _4|     _5|
+---+----+----+---+-------+
|  1|张三|true| 18|15000.0|
+---+----+----+---+-------+
- case-class
val userRDD = spark.sparkContext.makeRDD(List(Person(1,"张三",18,true)))
userRDD.toDS().show()
+---+----+---+----+
| id|name|age| sex|
+---+----+---+----+
|  1|张三| 18|true|
+---+----+---+----+
DataFrame create
A DataFrame is a dataset with named columns, so users can operate on columns directly; for that reason almost all recommended DataFrame operations are untyped. You can also think of a DataFrame simply as a dataset of type Dataset[Row].
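A short sketch of this equivalence, assuming the User(id,name,sex) case class from the case-class example below is defined at top level and spark.implicits._ is imported:
import org.apache.spark.sql.{DataFrame, Dataset, Row}

val df: DataFrame = List(User(1,"zs",true)).toDF()
val asRows: Dataset[Row]   = df          // DataFrame is just a type alias for Dataset[Row]
val asUsers: Dataset[User] = df.as[User] // and can be converted back to a typed Dataset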
- JSON file
val dataFrame:DataFrame = spark.read.json("file:///D:/demo/json")
dataFrame.printSchema()
dataFrame.show()
- case-class
case class User(id:Int,name:String,sex:Boolean)
var userDF=List(User(1,"zs",true)).toDF()
userDF.show()
+---+----+----+
| id|name| sex|
+---+----+----+
| 1| zs|true|
+---+----+----+
- Tuple
var userDF=List((1,"zs",true)).toDF("id","name","sex")
userDF.show()
+---+----+----+
| id|name| sex|
+---+----+----+
| 1| zs|true|
+---+----+----+
- From an RDD
- Tuple
var userDF = spark.sparkContext.parallelize(List((1,"zs",true)))
             .toDF("id","name","sex") // column names can be specified here
userDF.show()
+---+----+----+
| id|name| sex|
+---+----+----+
|  1|  zs|true|
+---+----+----+
- case-class
var userDF = spark.sparkContext.parallelize(List(User(1,"zs",true)))
             .toDF("id","uname","sex")
userDF.show()
+---+-----+----+
| id|uname| sex|
+---+-----+----+
|  1|   zs|true|
+---+-----+----+
- RDD[Row] to DataFrame
var userRDD:RDD[Row] = spark.sparkContext.parallelize(List(User(1,"zs",true)))
                       .map(u=>Row(u.id,u.name,u.sex))
var schema = new StructType()
             .add("id",IntegerType)
             .add("name",StringType)
             .add("sex",BooleanType)
var userDF = spark.createDataFrame(userRDD,schema)
userDF.show()
+---+----+----+
| id|name| sex|
+---+----+----+
|  1|  zs|true|
+---+----+----+
- JavaBean
public class JavaUser implements Serializable {
    private Integer id;
    private String name;
    private Boolean sex;
    // getters/setters and constructors
}
var userRDD:RDD[JavaUser] = spark.sparkContext.makeRDD(List(new JavaUser(1,"zs",true)))
var userDF = spark.createDataFrame(userRDD,classOf[JavaUser])
userDF.show()
- case-class
case class ScalaUser(id:Int,name:String,sex:Boolean)
var userRDD:RDD[ScalaUser] = spark.sparkContext.makeRDD(List(ScalaUser(1,"zs",true)))
var userDF = spark.createDataFrame(userRDD)
userDF.show()
+---+----+----+
| id|name| sex|
+---+----+----+
|  1|  zs|true|
+---+----+----+
- Tuple
var userRDD:RDD[(Int,String,Boolean)] = spark.sparkContext.makeRDD(List((1,"zs",true)))
var userDF = spark.createDataFrame(userRDD)
userDF.show()
+---+---+----+
| _1| _2|  _3|
+---+---+----+
|  1| zs|true|
+---+---+----+
DataFrame Operations (Untyped)
printSchema()
Prints the DataFrame's schema (the column headers).
var df=List((1,"zs",true)).toDF("id","name","sex")
df.printSchema()
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- sex: boolean (nullable = false)
show
var df=List((1,"zs",true)).toDF("id","name","sex")
df.show()// prints the first 20 rows
+---+----+----+
| id|name| sex|
+---+----+----+
|  1|  zs|true|
+---+----+----+
select
var df=List((1,"zs",true,1,15000),(2,"ls",false,1,15000))
.toDF("id","name","sex","dept","salary")
df.select($"id",$"name",$"salary")
.show()
+---+----+------+
| id|name|salary|
+---+----+------+
| 1| zs| 15000|
| 2| ls| 15000|
+---+----+------+
selectExpr
var df=List((1,"zs",true,1,15000),(2,"ls",false,1,18000))
.toDF("id","name","sex","dept","salary")
//equivalent: df.select($"id",$"name",$"sex",$"dept",$"salary" * 12 as "annual_salary")
df.selectExpr("id","name","sex","dept","salary * 12 as annual_salary")
.show()
+---+----+-----+----+-------------+
| id|name| sex|dept|annual_salary|
+---+----+-----+----+-------------+
| 1| zs| true| 1| 180000|
| 2| ls|false| 1| 216000|
+---+----+-----+----+-------------+
withColumn
df.select($"id",$"name",$"sex",$"dept",$"salary")
.withColumn("annula_salary",$"salary" * 12)
.show()
+---+----+-----+----+------+-------------+
| id|name| sex|dept|salary|annual_salary|
+---+----+-----+----+------+-------------+
| 1| zs| true| 1| 15000| 180000|
| 2| ls|false| 1| 18000| 216000|
+---+----+-----+----+------+-------------+
withColumnRenamed
df.select($"id",$"name",$"sex",$"dept",$"salary")
.withColumn("annula_salary",$"salary" * 12)
.withColumnRenamed("dept","department")
.withColumnRenamed("name","username")
.show()
+---+--------+-----+----------+------+-------------+
| id|username| sex|department|salary|annual_salary|
+---+--------+-----+----------+------+-------------+
| 1| zs| true| 1| 15000| 180000|
| 2| ls|false| 1| 18000| 216000|
+---+--------+-----+----------+------+-------------+
drop
df.select($"id",$"name",$"sex",$"dept",$"salary")
.withColumn("annula_salary",$"salary" * 12)
.withColumnRenamed("dept","department")
.withColumnRenamed("name","username")
.drop("sex")
.show()
+---+--------+----------+------+-------------+
| id|username|department|salary|annual_salary|
+---+--------+----------+------+-------------+
| 1| zs| 1| 15000| 180000|
| 2| ls| 1| 18000| 216000|
+---+--------+----------+------+-------------+
dropDuplicates
Removes duplicate rows based on the given columns.
var df=List((1,"zs",true,1,15000),
(2,"ls",false,1,18000),
(3,"ww",false,1,19000),
(4,"zl",false,1,18000)
)
.toDF("id","name","sex","dept","salary")
df.select($"id",$"name",$"sex",$"dept",$"salary")
.dropDuplicates("sex","salary")
.show()
+---+----+-----+----+------+
| id|name| sex|dept|salary|
+---+----+-----+----+------+
| 3| ww|false| 1| 19000|
| 1| zs| true| 1| 15000|
| 2| ls|false| 1| 18000|
+---+----+-----+----+------+
orderBy|sort
var df=List((1,"zs",true,1,15000),
(2,"ls",false,2,18000),
(3,"ww",false,2,14000),
(4,"zl",false,1,18000),
(4,"zl",false,1,16000)
)
.toDF("id","name","sex","dept","salary")
df.select($"id",$"name",$"sex",$"dept",$"salary")
.orderBy($"salary" desc,$"id" asc)
//.sort($"salary" desc,$"id" asc)
.show()
+---+----+-----+----+------+
| id|name| sex|dept|salary|
+---+----+-----+----+------+
| 2| ls|false| 2| 18000|
| 4| zl|false| 1| 18000|
| 4| zl|false| 1| 16000|
| 1| zs| true| 1| 15000|
| 3| ww|false| 2| 14000|
+---+----+-----+----+------+
groupBy
df.select($"id",$"name",$"sex",$"dept",$"salary")
.groupBy($"dept")
.max("salary")
.show()
+----+-----------+
|dept|max(salary)|
+----+-----------+
| 1| 18000|
| 2| 18000|
+----+-----------+
Similar aggregation operators: max, min, avg|mean, sum, count.
agg
var df=List((1,"zs",true,1,15000),
(2,"ls",false,2,18000),
(3,"ww",false,2,14000),
(4,"zl",false,1,18000),
(4,"zl",false,1,16000)
)
.toDF("id","name","sex","dept","salary")
import org.apache.spark.sql.functions._
df.select($"id",$"name",$"sex",$"dept",$"salary")
.groupBy($"dept")
.agg(max("salary") as "max_salary",avg("salary") as "avg_salary")
.show()
+----+----------+------------------+
|dept|max_salary| avg_salary|
+----+----------+------------------+
| 1| 18000|16333.333333333334|
| 2| 18000| 16000.0|
+----+----------+------------------+
Aggregations can also be passed as a Map from column name to aggregate function:
var df=List((1,"zs",true,1,15000),
(2,"ls",false,2,18000),
(3,"ww",false,2,14000),
(4,"zl",false,1,18000),
(4,"zl",false,1,16000)
)
.toDF("id","name","sex","dept","salary")
df.select($"id",$"name",$"sex",$"dept",$"salary")
.groupBy($"dept")
.agg(Map("salary"->"max","id"->"count"))
.show()
+----+-----------+---------+
|dept|max(salary)|count(id)|
+----+-----------+---------+
| 1| 18000| 3|
| 2| 18000| 2|
+----+-----------+---------+
limit
df.select($"id",$"name",$"sex",$"dept",$"salary")
.orderBy($"id" desc).limit(4)
.show()
+---+----+-----+----+------+
| id|name| sex|dept|salary|
+---+----+-----+----+------+
| 5| zl|false| 1| 16000|
| 4| zl|false| 1| 18000|
| 3| ww|false| 2| 14000|
| 2| ls|false| 2| 18000|
+---+----+-----+----+------+
where
var df=List((1,"zs",true,1,15000),
(2,"ls",false,2,18000),
(3,"ww",false,2,14000),
(4,"zl",false,1,18000),
(5,"win7",false,1,16000)
)
.toDF("id","name","sex","dept","salary")
df.select($"id",$"name",$"sex",$"dept",$"salary")
//where("(name like '%s%' and salary > 15000) or name = 'win7'")
.where(($"name" like "%s%" and $"salary" > 15000) or $"name" ==="win7" )
.show()
+---+----+-----+----+------+
| id|name| sex|dept|salary|
+---+----+-----+----+------+
| 2| ls|false| 2| 18000|
| 5|win7|false| 1| 16000|
+---+----+-----+----+------+
pivot (rows to columns)
var scoreDF=List((1,"math",85),
(1,"chinese",80),
(1,"english",90),
(2,"math",90),
(2,"chinese",80)
).toDF("id","course","score")
import org.apache.spark.sql.functions._
//select id,max(case course when 'math' then score else 0 end )as math ,max(case course when 'chinese' then score else 0 end) as chinese from t_course group by id;
scoreDF.selectExpr("id",
"case course when 'math' then score else 0 end as math",
"case course when 'chinese' then score else 0 end as chinese",
"case course when 'english' then score else 0 end as english")
.groupBy("id")
.agg(max($"math"),max($"chinese"),max($"english"))
.show()
+---+---------+------------+------------+
| id|max(math)|max(chinese)|max(english)|
+---+---------+------------+------------+
| 1| 85| 80| 90|
| 2| 90| 80| 0|
+---+---------+------------+------------+
var scoreRDD=List((1,"math",85),
(1,"chinese",80),
(1,"english",90),
(2,"math",90),
(2,"chinese",80)
)
scoreRDD.toDF("id","course","score")
.groupBy("id")
// pivot rows to columns; the second argument lists the expected pivot values
.pivot("course",scoreRDD.map(t=>t._2).distinct)
.max("score")
.show()
+---+----+-------+-------+
| id|math|chinese|english|
+---+----+-------+-------+
| 1| 85| 80| 90|
| 2| 90| 80| null|
+---+----+-------+-------+
na
var scoreRDD=List((1,"math",85),
(1,"chinese",80),
(1,"english",90),
(2,"math",90),
(2,"chinese",80),
(3,"math",100)
)
scoreRDD.toDF("id","course","score")
.groupBy("id")
// pivot rows to columns; the second argument lists the expected pivot values
.pivot("course",scoreRDD.map(t=>t._2).distinct)
.max("score")
.na.fill(Map("english" -> -1,"chinese" -> 0))
.show()
+---+----+-------+-------+
| id|math|chinese|english|
+---+----+-------+-------+
| 1| 85| 80| 90|
| 3| 100| 0| -1|
| 2| 90| 80| -1|
+---+----+-------+-------+
over (window functions)
Compute each employee's salary rank within their department:
select id,name,salary ,dept,(select count(*)+1 from t_user u2 where u1.dept=u2.dept and u1.salary < u2.salary) from t_user u1
select id,name,salary, sum(1) over(partition by dept order by salary desc ) from t_user
var df=List((1,"zs",true,1,15000),
(2,"ls",false,2,18000),
(3,"ww",false,2,14000),
(4,"zl",false,1,18000),
(5,"win7",false,1,16000)
)
.toDF("id","name","sex","dept","salary")
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w1 = Window.partitionBy("dept")
.orderBy($"salary" desc)
.rangeBetween(Window.unboundedPreceding, Window.currentRow)
val w2 = Window.partitionBy("dept")
.orderBy($"salary" desc)
.rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val w3= Window
.orderBy($"salary" desc)
.rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.select("id","name","dept","salary")
.withColumn("rank",count("id") over(w1))
.withColumn("avg",avg("salary") over(w2))
.withColumn("diff",max("salary") over(w2))//计算所在部门 salary-本部门最高
.withColumn("all_max",max("salary") over(w3))//计算全公司最高薪资
.select($"id",$"name",$"dept",$"salary",$"rank",$"avg",$"diff"-$"salary" as "chaju",$"all_max")
.show()
+---+----+----+------+----+------------------+-----+-------+
| id|name|dept|salary|rank| avg|chaju|all_max|
+---+----+----+------+----+------------------+-----+-------+
| 4| zl| 1| 18000| 1|16333.333333333334| 0| 18000|
| 2| ls| 2| 18000| 1| 16000.0| 0| 18000|
| 5|win7| 1| 16000| 2|16333.333333333334| 2000| 18000|
| 1| zs| 1| 15000| 3|16333.333333333334| 3000| 18000|
| 3| ww| 2| 14000| 2| 16000.0| 4000| 18000|
+---+----+----+------+----+------------------+-----+-------+
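The same ranking can also be expressed with the built-in ranking functions instead of counting over a frame; a brief sketch reusing the df from this section (ranking functions use the default frame, so no rangeBetween is given):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byDept = Window.partitionBy("dept").orderBy($"salary".desc)

df.select($"id",$"name",$"dept",$"salary")
  .withColumn("row_number", row_number() over byDept)
  .withColumn("rank",       rank()       over byDept)
  .withColumn("dense_rank", dense_rank() over byDept)
  .show()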
join
var userCostDF=spark.sparkContext.parallelize(List(
UserCost(1,"电脑配件",100),
UserCost(1,"母婴用品",100),
UserCost(1,"生活用品",100),
UserCost(2,"居家美食",79),
UserCost(2,"消费电子",80),
UserCost(2,"生活用品",100)
)).toDF().withColumnRenamed("id","uid")
val categories = userCostDF.select("category")
.as[(String)]
.rdd.distinct.collect()
var userDF=spark.sparkContext.parallelize(List(
User(1,"张晓三",true,18,15000),
User(2,"李晓四",true,18,18000),
User(3,"王晓五",false,18,10000)
)).toDF()
userDF.join(userCostDF,$"id"===$"uid","left_outer")
.drop("uid")
.groupBy("id","name")
.pivot($"category",categories)
.sum("totalCost")
.na.fill(0.0)
.show()
+---+------+--------+--------+--------+--------+--------+
| id| name|电脑配件|生活用品|母婴用品|居家美食|消费电子|
+---+------+--------+--------+--------+--------+--------+
| 1|张晓三| 100.0| 100.0| 100.0| 0.0| 0.0|
| 3|王晓五| 0.0| 0.0| 0.0| 0.0| 0.0|
| 2|李晓四| 0.0| 100.0| 0.0| 79.0| 80.0|
+---+------+--------+--------+--------+--------+--------+
cube (multi-dimensional aggregation)
import org.apache.spark.sql.functions._
List((110,50,80,80),(120,60,95,75),(120,50,96,70))
.toDF("height","weight","IQ","EQ")
.cube($"height",$"weight")
.agg(avg("IQ"),avg("EQ"))
.show()
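cube aggregates over every combination of the grouping columns, emitting null for the dimensions that are rolled up in a given row; rollup is the hierarchical variant that only keeps prefixes of the column list. A sketch with the same data:
import org.apache.spark.sql.functions._

List((110,50,80,80),(120,60,95,75),(120,50,96,70))
  .toDF("height","weight","IQ","EQ")
  .rollup($"height",$"weight")   // groupings: (height, weight), (height), ()
  .agg(avg("IQ"),avg("EQ"))
  .show()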
Dataset Operations (Strongly typed)
Strongly typed operators work against the element type, while Spark SQL generally recommends the column-based DataFrame operations, so in most cases the typed operators are not the preferred choice.
val lines=Array("this is a demo","hello spark")
val wordRDD = spark.sparkContext.makeRDD(lines)
.flatMap(_.split("\\s+"))
.map((_,1))
import org.apache.spark.sql.expressions.scalalang.typed
val ds:Dataset[(String,Int)]=wordRDD.toDS()
// run typed aggregation on the Dataset
ds.groupByKey(t=>t._1)
.agg(typed.sum[(String,Int)](tuple=>tuple._2).name("total"))
.filter(tuple=>tuple._1.contains("o"))
.show()
+-----+-----+
|value|total|
+-----+-----+
|hello| 1.0|
| demo| 1.0|
+-----+-----+
SQL Queries
Single-row query
var userDF= List((1,"zs",true,18,15000,1))
.toDF("id","name","sex","age","salary","dept")
userDF.createTempView("t_employee")
val sql="select * from t_employee where name = 'zs'"
spark.sql(sql).show()
+---+----+----+---+------+----+
| id|name| sex|age|salary|dept|
+---+----+----+---+------+----+
| 1| zs|true| 18| 15000| 1|
+---+----+----+---+------+----+
Fuzzy (LIKE) query
var userDF= List((1,"zs",true,18,15000,1))
.toDF("id","name","sex","age","salary","dept")
userDF.createTempView("t_employee")
val sql="select * from t_employee where name like '%s%'"
spark.sql(sql).show()
Sorted query
var userDF= List(
(1,"zs",true,18,15000,1),
(2,"ls",false,18,12000,1))
.toDF("id","name","sex","age","salary","dept")
//create a view
userDF.createTempView("t_employee")
val sql=
"""
select * from t_employee
where salary > 10000
order by salary desc
"""
spark.sql(sql).show()
LIMIT query
var userDF= List(
(1,"zs",true,18,15000,1),
(2,"ls",false,18,12000,1),
(3,"ww",false,18,16000,2))
.toDF("id","name","sex","age","salary","dept")
//create a view
userDF.createTempView("t_employee")
val sql=
"""
select * from t_employee
where salary > 10000
order by salary desc
limit 2
"""
spark.sql(sql).show()
Group-by query
var userDF= List(
(1,"zs",true,18,15000,1),
(2,"ls",false,18,12000,1),
(3,"ww",false,18,16000,2))
.toDF("id","name","sex","age","salary","dept")
userDF.createTempView("t_employee")
val sql=
"""
select dept ,avg(salary) as avg_salary from t_employee
group by dept
order by avg_salary desc
"""
spark.sql(sql).show()
+----+----------+
|dept|avg_salary|
+----+----------+
|   2|   16000.0|
|   1|   13500.0|
+----+----------+
HAVING filter
var userDF= List(
(1,"zs",true,18,15000,1),
(2,"ls",false,18,12000,1),
(3,"ww",false,18,16000,2))
.toDF("id","name","sex","age","salary","dept")
//create a view
userDF.createTempView("t_employee")
val sql=
"""
select dept ,avg(salary) as avg_salary from t_employee
group by dept
having avg_salary > 13500
order by avg_salary desc
"""
spark.sql(sql).show()
+----+----------+
|dept|avg_salary|
+----+----------+
|   2|   16000.0|
+----+----------+
case-when
var userDF= List(
(1,"zs",true,18,15000,1),
(2,"ls",false,18,12000,1),
(3,"ww",false,18,16000,2))
.toDF("id","name","sex","age","salary","dept")
//create a view
userDF.createTempView("t_employee")
val sql=
"""
select id,name,case sex when true then '男' else '女' end as sex_alias
from t_employee
"""
spark.sql(sql).show()
+---+----+---------+
| id|name|sex_alias|
+---+----+---------+
| 1| zs| 男|
| 2| ls| 女|
| 3| ww| 女|
+---+----+---------+
Row-to-column with CASE WHEN
var scoreDF=List((1, "语文", 100),
(1, "数学", 100),
(1, "英语", 100),
(2, "数学", 79),
(2, "语文", 80),
(2, "英语", 100)).toDF("id","course","score")
scoreDF.createOrReplaceTempView("t_course")
val sql=
"""
select id,
max(case course when '数学' then score else 0 end) as math,
max(case course when '英语' then score else 0 end) as english,
max(case course when '语文' then score else 0 end) as chinese
from t_course
group by id
"""
spark.sql(sql).show()
+---+----+-------+-------+
| id|math|english|chinese|
+---+----+-------+-------+
| 1| 100| 100| 100|
| 2| 79| 100| 80|
+---+----+-------+-------+
pivot
var scoreDF=List((1, "语文", 100),
(1, "数学", 100),
(1, "英语", 100),
(2, "数学", 79),
(2, "语文", 80),
(2, "英语", 100)).toDF("id","course","score")
scoreDF.createOrReplaceTempView("t_course")
val sql=
"""
select * from t_course
pivot(max(score) for course in ('数学','语文','英语'))
"""
spark.sql(sql).show()
Note that when writing this SQL, every column other than the aggregated column and the pivot (output) column is implicitly treated as a GROUP BY key.
+---+----+----+----+
| id|数学|语文|英语|
+---+----+----+----+
| 1| 100| 100| 100|
| 2| 79| 80| 100|
+---+----+----+----+
Joins
spark.sql("select * from t_user left join t_dept on deptNo=id").show()
+-------+---+-------+-----+--------+------+---+----+
| name|age| salary| sex| job|deptNo| id|name|
+-------+---+-------+-----+--------+------+---+----+
|Michael| 29|20000.0| true| MANAGER| 1| 1|研发|
| Andy| 30|15000.0| true|SALESMAN| 1| 1|研发|
| Justin| 19| 8000.0| true| CLERK| 1| 1|研发|
| Kaine| 20|20000.0| true| MANAGER| 2| 2|设计|
| Lisa| 19|18000.0|false|SALESMAN| 2| 2|设计|
+-------+---+-------+-----+--------+------+---+----+
Spark supports inner, left outer, right outer, and full outer joins.
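The same join types can be requested through the DataFrame API by passing the join type as a string; a sketch assuming userDF and deptDF with the deptNo/id columns used in the SQL above:
// joinType: "inner", "left_outer", "right_outer", "full_outer", ...
userDF.join(deptDF, userDF("deptNo") === deptDF("id"), "full_outer").show()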
Subqueries
spark.sql("select * ,(select count(t1.salary) from t_user t1 where (t1.deptNo = t2.deptNo) group by t1.deptNo) as total from t_user t2 left join t_dept on t2.deptNo=id order by t2.deptNo asc,t2.salary desc").show()
+-------+---+-------+-----+--------+------+---+----+-----+
| name|age| salary| sex| job|deptNo| id|name|total|
+-------+---+-------+-----+--------+------+---+----+-----+
|Michael| 29|20000.0| true| MANAGER| 1| 1|研发| 3|
| Andy| 30|15000.0| true|SALESMAN| 1| 1|研发| 3|
| Justin| 19| 8000.0| true| CLERK| 1| 1|研发| 3|
| Kaine| 20|20000.0| true| MANAGER| 2| 2|设计| 2|
| Lisa| 19|18000.0|false|SALESMAN| 2| 2|设计| 2|
+-------+---+-------+-----+--------+------+---+----+-----+
Note that Spark only supports nested subqueries of this kind.
Window functions
In ordinary statistical analysis we usually rely on aggregate functions, which collapse n rows into a single row. Databases also provide window (analytic) functions, which keep one output row per input row. If every record returned by a query is one floor of a building, a window function opens a window on each floor so that every floor can see the whole building, or part of it.
- Output each employee's information together with the average salary of their department
spark.sql("select u.name,u.deptNo,u.salary,d.name,avg(salary) over(partition by deptNo ) avg from t_user u left join t_dept d on deptNo=id").show()
+-------+------+-------+----+------------------+
| name|deptNo| salary|name| avg|
+-------+------+-------+----+------------------+
|Michael| 1|20000.0|研发|14333.333333333334|
| Andy| 1|15000.0|研发|14333.333333333334|
| Justin| 1| 8000.0|研发|14333.333333333334|
| Kaine| 2|20000.0|设计| 19000.0|
| Lisa| 2|18000.0|设计| 19000.0|
+-------+------+-------+----+------------------+
Equivalent form:
select u2.name,u2.deptNo,u2.salary,d.name,(select avg(salary) from t_user u1 where u1.deptNo=u2.deptNo group by u1.deptNo) avgSalary from t_user u2 left join t_dept d on deptNo=id
- Compute each employee's salary rank within their own department
spark.sql("select u.name,u.deptNo,u.salary,d.name,sum(1) over(partition by deptNo order by salary desc) rank from t_user u left join t_dept d on deptNo=id").show()
+-------+------+-------+----+----+
| name|deptNo| salary|name|rank|
+-------+------+-------+----+----+
|Michael| 1|20000.0|研发| 1|
| Andy| 1|15000.0|研发| 2|
| Justin| 1| 8000.0|研发| 3|
| Kaine| 2|20000.0|设计| 1|
| Lisa| 2|18000.0|设计| 2|
+-------+------+-------+----+----+
Here sum(1) can be replaced with ROW_NUMBER().
- Compute each employee's salary rank within the whole company
spark.sql("select t_user.name,deptNo,salary,t_dept.name,ROW_NUMBER() over(order by salary desc) rank from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+----+
| name|deptNo| salary|name|rank|
+-------+------+-------+----+----+
| Justin| 1| 8000.0|研发| 5|
| Andy| 1|15000.0|研发| 4|
|Michael| 1|20000.0|研发| 1|
| Kaine| 2|20000.0|设计| 2|
| Lisa| 2|18000.0|设计| 3|
+-------+------+-------+----+----+
As you can see, ROW_NUMBER() only reflects the order in which rows happen to be emitted: rows with the same salary still receive different numbers, so it is not a true ranking. For ranking, the RANK() or DENSE_RANK() functions are normally used instead.
- Salary ranking with the RANK function (gaps after ties)
spark.sql("select t_user.name,deptNo,salary,t_dept.name,RANK() over(order by salary desc) rank from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+----+
| name|deptNo| salary|name|rank|
+-------+------+-------+----+----+
| Justin| 1| 8000.0|研发| 5|
| Andy| 1|15000.0|研发| 4|
|Michael| 1|20000.0|研发| 1|
| Kaine| 2|20000.0|设计| 1|
| Lisa| 2|18000.0|设计| 3|
+-------+------+-------+----+----+
- Salary ranking with the DENSE_RANK function (consecutive ranks)
spark.sql("select t_user.name,deptNo,salary,t_dept.name,DENSE_RANK() over(order by salary desc) rank from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+----+
| name|deptNo| salary|name|rank|
+-------+------+-------+----+----+
| Justin| 1| 8000.0|研发| 4|
| Andy| 1|15000.0|研发| 3|
|Michael| 1|20000.0|研发| 1|
| Kaine| 2|20000.0|设计| 1|
| Lisa| 2|18000.0|设计| 2|
+-------+------+-------+----+----+
- Compute each employee's gap to the highest salary in their department
spark.sql("select t_user.name,deptNo,salary,t_dept.name,(salary- max(salary) over(partition by deptNo order by salary desc rows between unbounded preceding and current row)) diff from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+--------+
| name|deptNo| salary|name| diff|
+-------+------+-------+----+--------+
|Michael| 1|20000.0|研发| 0.0|
| Andy| 1|15000.0|研发| -5000.0|
| Justin| 1| 8000.0|研发|-12000.0|
| Kaine| 2|20000.0|设计| 0.0|
| Lisa| 2|18000.0|设计| -2000.0|
+-------+------+-------+----+--------+
- Compute each employee's gap to the lowest salary in their department
spark.sql("select t_user.name,deptNo,salary,t_dept.name,(salary- min(salary) over(partition by deptNo order by salary desc rows between current row and unbounded following)) diff from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+-------+
| name|deptNo| salary|name| diff|
+-------+------+-------+----+-------+
| Justin| 1| 8000.0|研发| 0.0|
| Andy| 1|15000.0|研发| 7000.0|
|Michael| 1|20000.0|研发|12000.0|
| Kaine| 2|20000.0|设计| 2000.0|
| Lisa| 2|18000.0|设计| 0.0|
+-------+------+-------+----+-------+
- Compute each employee's gap to the company-wide average salary
spark.sql("select t_user.name,deptNo,salary,t_dept.name,(salary- avg(salary) over(order by salary rows between unbounded preceding and unbounded following)) diff from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+-------+
| name|deptNo| salary|name| diff|
+-------+------+-------+----+-------+
|Michael| 1|20000.0|研发| 3800.0|
| Andy| 1|15000.0|研发|-1200.0|
| Justin| 1| 8000.0|研发|-8200.0|
| Lisa| 2|18000.0|设计| 1800.0|
| Kaine| 2|20000.0|设计| 3800.0|
+-------+------+-------+----+-------+
Summary
aggregate_function(column) over ([partition by column] order by column asc [rows between start_offset and end_offset])
Offset values:
preceding: accumulate over the N rows before the current row (within the partition); use unbounded to start from the partition's first row. N is the offset counted backwards from the current row.
following: the opposite of preceding, accumulating over the N rows after the current row (within the partition); use unbounded to run to the end of the partition. N is the offset counted forwards from the current row.
current row: the current row itself, i.e. an offset of 0.
Note: the N-preceding, M-following, and current row bounds all include the row at that offset.
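The same frame specification is available in the DataFrame API through a WindowSpec; a brief sketch (spark.implicits._ assumed in scope, offsets expressed with rowsBetween):
import org.apache.spark.sql.expressions.Window

// rows between unbounded preceding and current row, partitioned by deptNo, ordered by salary desc
val frame = Window.partitionBy("deptNo")
  .orderBy($"salary".desc)
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
// a concrete frame such as "2 preceding and current row" would be .rowsBetween(-2, 0)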
User-defined functions
Scalar (single-row) functions
Compute the annual salary based on the employee's job title:
spark.udf.register("annual_salary",(job:String,salary:Double)=>{
job match {
case "MANAGER" => salary*12 + 5000000
case "SALESMAN" => salary*12 + 100000
case "CLERK" => salary*12 + 20000
case _ => salary*12
}
})
spark.sql("select * ,annual_salary(job,salary) annual_salary from t_user").show()
+-------+---+-------+-----+--------+------+-------------+
| name|age| salary| sex| job|deptNo|annual_salary|
+-------+---+-------+-----+--------+------+-------------+
|Michael| 29|20000.0| true| MANAGER| 1| 5240000.0|
| Andy| 30|15000.0| true|SALESMAN| 1| 280000.0|
| Justin| 19| 8000.0| true| CLERK| 1| 116000.0|
| Kaine| 20|20000.0| true| MANAGER| 2| 5240000.0|
| Lisa| 19|18000.0|false|SALESMAN| 2| 316000.0|
+-------+---+-------+-----+--------+------+-------------+
The registered UDF can also be invoked through the API:
userDS.selectExpr("name","age","job","salary","annual_salary(job,salary) as annual_salary").show()
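Alternatively, the same logic can be wrapped with org.apache.spark.sql.functions.udf and used directly as a column expression without registering it; a sketch reusing the userDS from above:
import org.apache.spark.sql.functions.udf

val annualSalary = udf((job:String,salary:Double)=> job match {
  case "MANAGER"  => salary*12 + 5000000
  case "SALESMAN" => salary*12 + 100000
  case "CLERK"    => salary*12 + 20000
  case _          => salary*12
})

userDS.withColumn("annual_salary", annualSalary($"job",$"salary")).show()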
Aggregate functions (optional)
Untyped aggregation (SQL)
Define a subclass of UserDefinedAggregateFunction:
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, DoubleType, StructType}
class DeptSalarySum extends UserDefinedAggregateFunction{
//schema of the input arguments
override def inputSchema: StructType = {
new StructType().add("salary","double")
}
//schema of the aggregation buffer (intermediate state)
override def bufferSchema: StructType = {
new StructType().add("total","double")
}
//data type of the result
override def dataType: DataType = {
DoubleType
}
//whether the function always returns the same output for the same input
override def deterministic: Boolean = true
override def initialize(buffer: MutableAggregationBuffer): Unit = {
buffer.update(0,0.0D)
}
//partial aggregation: fold one input row into the buffer
override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
val history = buffer.getDouble(0)
buffer.update(0,history + input.getAs[Double](0))
}
//merge two intermediate buffers
override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
buffer1.update(0,buffer1.getDouble(0)+buffer2.getDouble(0))
}
//produce the final result
override def evaluate(buffer: Row): Any = {
buffer.getDouble(0)
}
}
Register the aggregate function
spark.udf.register("deptSalarySum",new DeptSalarySum)
Call the aggregate function
spark.sql("select deptNo,deptSalarySum(salary) as salary from t_user group by deptNo").show()
+------+---------------------+
|deptNo|deptsalarysum(salary)|
+------+---------------------+
| 1| 43000.0|
| 2| 38000.0|
+------+---------------------+
Strongly typed aggregation (API)
For strongly typed aggregation, implement the Aggregator interface and define a custom state class that holds the intermediate value during aggregation; this is used to aggregate over a typed Dataset.
DeptSalarySate
class DeptSalarySate(var initValue:Double=0 ) {
def add(v:Double): Unit ={
initValue += v
}
def get():Double={
initValue
}
}
DeptSalaryAggregator
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
class DeptSalaryAggregator extends Aggregator[Employee,DeptSalarySate,Double]{
override def zero: DeptSalarySate = new DeptSalarySate()
override def reduce(b: DeptSalarySate, a: Employee): DeptSalarySate ={
b.add(a.salary)
b
}
override def merge(b1: DeptSalarySate, b2: DeptSalarySate): DeptSalarySate = {
b1.add(b2.get())
b1
}
override def finish(reduction: DeptSalarySate): Double = {
reduction.get()
}
override def bufferEncoder: Encoder[DeptSalarySate] = {
//Encoders.product[DeptSalarySate]
Encoders.kryo(classOf[DeptSalarySate])
}
override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}
Test program
var myagg=new DeptSalaryAggregator().toColumn.name("avg")
userDS.select(myagg).show()
+-------+
| avg|
+-------+
|81000.0|
+-------+
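The same Aggregator can also be applied per key via groupByKey on the typed Dataset; a brief sketch, assuming the Employee class exposes a deptNo field as the earlier examples suggest:
val deptSum = new DeptSalaryAggregator().toColumn.name("dept_salary_sum")

userDS.groupByKey(_.deptNo)   // KeyValueGroupedDataset[Int, Employee]
  .agg(deptSum)               // one summed salary per department
  .show()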
Load/Save
Parquet files
Writing Parquet files
Parquet is just a storage format: it is language- and platform-independent and not tied to any particular data-processing framework.
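The Employee and Dept case classes used throughout this Load/Save section are not shown in the original snippets; a plausible sketch, with field names and order inferred from how the text files are parsed below (adjust to your actual data):
case class Employee(name:String, age:Int, salary:Double, sex:Boolean, job:String, deptNo:Int)
case class Dept(id:Int, name:String)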
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
.map(line => line.split(","))
.map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
.toDS().as("u")
val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
.map(line => line.split(","))
.map(t => Dept(t(0).toInt, t(1)))
.toDS().as("d")
userDS.select("name","sex","salary","deptNo")
.withColumn("annual_salary",$"salary"*12)
.join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
.drop("id")
.toDF("name","sex","salary","dept_no","annual_salary","dept_name")
.write
.save("file:///Users/jiangzz/Desktop/results/parquet/results.parquet")
spark.close()
Reading Parquet files
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
//equivalent to spark.read.parquet
spark.read.load("file:///Users/jiangzz/Desktop/results/parquet/results.parquet")
.show()
spark.close()
+-------+-----+-------+-------+-------------+---------+
| name| sex| salary|dept_no|annual_salary|dept_name|
+-------+-----+-------+-------+-------------+---------+
|Michael| true|20000.0| 1| 240000.0| 研发|
| Andy| true|15000.0| 1| 180000.0| 研发|
| Justin| true| 8000.0| 1| 96000.0| 研发|
| Kaine| true|20000.0| 2| 240000.0| 设计|
| Lisa|false|18000.0| 2| 216000.0| 设计|
+-------+-----+-------+-------+-------------+---------+
JSON format
Writing JSON
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
.map(line => line.split(","))
.map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
.toDS().as("u")
val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
.map(line => line.split(","))
.map(t => Dept(t(0).toInt, t(1)))
.toDS().as("d")
userDS.select("name","sex","salary","deptNo")
.withColumn("annual_salary",$"salary"*12)
.join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
.drop("id")
.toDF("name","sex","salary","dept_no","annual_salary","dept_name")
.write
.mode(SaveMode.Ignore)
.format("json")
.save("file:///Users/jiangzz/Desktop/results/json/")
spark.close()
Reading JSON
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
spark.sparkContext.setLogLevel("FATAL")
import spark.implicits._
spark.read.json("file:///Users/jiangzz/Desktop/results/json/")
.show()
spark.close()
CSV format
Writing
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
.map(line => line.split(","))
.map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
.toDS().as("u")
val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
.map(line => line.split(","))
.map(t => Dept(t(0).toInt, t(1)))
.toDS().as("d")
userDS.select("name","sex","salary","deptNo")
.withColumn("annual_salary",$"salary"*12)
.join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
.drop("id")
.toDF("name","sex","salary","dept_no","annual_salary","dept_name")
.write
.mode(SaveMode.Ignore)
.format("csv")
.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
.save("file:///Users/jiangzz/Desktop/results/csv/")
spark.close()
Reading
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
spark.read
.option("sep", ";")
.option("inferSchema", "true")
.option("header", "true")
.csv("file:///Users/jiangzz/Desktop/results/csv/")
.show()
spark.close()
ORC format
ORC (Optimized Row Columnar) is a columnar storage format from the Hadoop ecosystem. It appeared in early 2013, originating in Apache Hive, where it was introduced to reduce storage footprint and speed up Hive queries.
Writing
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
.map(line => line.split(","))
.map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
.toDS().as("u")
val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
.map(line => line.split(","))
.map(t => Dept(t(0).toInt, t(1)))
.toDS().as("d")
userDS.select("name","sex","salary","deptNo")
.withColumn("annual_salary",$"salary"*12)
.join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
.drop("id")
.toDF("name","sex","salary","dept_no","annual_salary","dept_name")
.write
.mode(SaveMode.Ignore)
.format("orc")
.option("orc.bloom.filter.columns", "favorite_color")
.option("orc.dictionary.key.threshold", "1.0")
.save("file:///Users/jiangzz/Desktop/results/orc/")
spark.close()
Reading
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
.map(line => line.split(","))
.map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
.toDS().as("u")
val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
.map(line => line.split(","))
.map(t => Dept(t(0).toInt, t(1)))
.toDS().as("d")
spark.read
.option("orc.bloom.filter.columns", "favorite_color")
.option("orc.dictionary.key.threshold", "1.0")
.orc("file:///Users/jiangzz/Desktop/results/orc/")
.show()
spark.close()
Reading files directly with SQL
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
val parquetDF = spark.sql("SELECT * FROM parquet.`file:///Users/jiangzz/Desktop/results/parquet/results.parquet`")
val jsonDF = spark.sql("SELECT * FROM json.`file:///Users/jiangzz/Desktop/results/json/`")
val orcDF = spark.sql("SELECT * FROM orc.`file:///Users/jiangzz/Desktop/results/orc/`")
//parquetDF.show()
//jsonDF.show()
//csvDF.show()
orcDF.show()
spark.close()
Reading from JDBC databases
Reading from MySQL
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.47</version>
</dependency>
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
spark.read
.format("jdbc")
.option("url", "jdbc:mysql://CentOS:3306/test")
.option("dbtable", "t_user")
.option("user", "root")
.option("password", "root")
.load().createTempView("t_user")
spark.sql("select * from t_user").show()
spark.close()
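For larger tables, the JDBC source can read in parallel when given a numeric partition column and bounds; a sketch reusing the same connection settings (the bounds here are illustrative):
spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://CentOS:3306/test")
  .option("dbtable", "t_user")
  .option("user", "root")
  .option("password", "root")
  .option("partitionColumn", "deptNo") // must be a numeric column
  .option("lowerBound", "1")
  .option("upperBound", "10")
  .option("numPartitions", "4")
  .load()
  .show()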
Writing to MySQL
val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()
import spark.implicits._
spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
.map(line => line.split(","))
.map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
.toDS().as("u")
val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
.map(line => line.split(","))
.map(t => Dept(t(0).toInt, t(1)))
.toDS().as("d")
import java.util.Properties
val props = new Properties()
props.put("user", "root")
props.put("password", "root")
userDS.select("name","sex","salary","deptNo")
.withColumn("annual_salary",$"salary"*12)
.join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
.drop("id")
.toDF("name","sex","salary","dept_no","annual_salary","dept_name")
.write
.mode("append")
.jdbc("jdbc:mysql://CentOS:3306/test","t_user",props)
spark.close()