Using Spark SQL

This article takes a deep dive into Apache Spark SQL: it introduces the core Dataset and DataFrame abstractions, provides a quick-start guide, and covers creating, transforming, and querying data, including loading and saving data, SQL queries, user-defined functions, and aggregate functions, with the goal of helping developers process data more efficiently.


Apache Spark SQL

Spark SQL is an ETL (Extract-Transform-Load) tool built on top of RDDs. On top of RDDs, Spark SQL abstracts two classes, Dataset and DataFrame, which provide RDD-like functionality: users can still apply higher-order operators such as map, flatMap, and filter, and they also get column-based, named queries. In other words, Dataset/DataFrame expose two APIs for manipulating data, and these APIs give the Spark engine more information about the computation, allowing the system to apply additional optimizations. Spark SQL currently offers two ways of interacting with data: 1) SQL statements and 2) the Dataset API (strongly typed and untyped operations).

Datasets & DataFrames

A Dataset is a distributed collection of data. The Dataset API was introduced in Spark 1.6; it is built on top of RDDs (strongly typed, usable with lambda expressions) while also benefiting from the Spark SQL execution engine's optimizations, so transforming data with a Dataset is both richer and faster than using raw RDD operators. A Dataset can therefore be thought of as an enhanced RDD. Besides being created from JVM arrays or collections, any RDD can be converted into a Dataset. Python does not have support for the Dataset API.

A DataFrame is a special case of a Dataset. A Dataset can hold elements of any object type, whereas a DataFrame's elements are always of type Row; querying Rows is very similar to working with a ResultSet in a traditional database. Each Row represents one element of the DataFrame, analogous to a row in a database table, and its fields can be accessed by index or by column name. While Dataset support varies across language APIs, the DataFrame API is well supported in Scala, Java, R, and Python.
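
For illustration, here is a minimal sketch of reading Row fields by position and by column name (it assumes an existing SparkSession named spark and import spark.implicits._):

val df = Seq((1,"zhangsan"),(2,"wangwu")).toDF("id","name")
df.collect().foreach { row =>
  val id   = row.getInt(0)              // access by position
  val name = row.getAs[String]("name")  // access by column name
  println(s"$id -> $name")
}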

Quick Start

  • Add the dependencies
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.3</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.3</version>
</dependency>
  • Word count (untyped)
//1. Create the SparkSession
val spark=SparkSession.builder()
    .appName("wordcount")
    .master("local[6]")
    .getOrCreate()

//2. Import Spark's implicit conversions/enrichments
import spark.implicits._

//3. Create the Dataset
val lines=Array("this is a demo","hello spark")
val wordRDD = spark.sparkContext.makeRDD(lines)
.flatMap(_.split("\\s+"))
.map((_,1))

val ds:Dataset[(String,Int)]=wordRDD.toDS()
//4. Run SQL-style operators on the Dataset
ds.groupBy($"_1") //untyped operation
    .sum("_2").as("total")
    .withColumnRenamed("_1","word")
    .withColumnRenamed("sum(_2)","total")
    .show()

//5. Stop Spark
spark.stop()
  • Word count (strongly typed)
//1. Create the SparkSession
val spark=SparkSession.builder()
    .appName("wordcount")
    .master("local[6]")
    .getOrCreate()

//2. Import Spark's implicit conversions/enrichments
import spark.implicits._

//3. Create the Dataset
val lines=Array("this is a demo","hello spark")
val wordRDD = spark.sparkContext.makeRDD(lines)
    .flatMap(_.split("\\s+"))
    .map((_,1))

val ds:Dataset[(String,Int)]=wordRDD.toDS()
//4. Run typed aggregation on the Dataset (needs the typed helpers)
import org.apache.spark.sql.expressions.scalalang.typed
ds.groupByKey(t=>t._1)
    .agg(typed.sum[(String,Int)](tuple=>tuple._2).name("total"))
    .show()

//5. Stop Spark
spark.stop()

Dataset & Dataframe

Dataset create

A Dataset is similar to an RDD, except that Spark SQL has its own serialization mechanism, called Encoders, which is independent of the Spark RDD (Java/Kryo) serializers. Unlike RDD serialization, because Datasets support untyped operations that only refer to column names rather than element types, Spark SQL can skip the deserialization step when executing operators, which improves execution efficiency.
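
As a hedged sketch of how Encoders surface in the API (they are normally supplied implicitly by spark.implicits._; the Person case class is the one defined below):

import org.apache.spark.sql.{Encoder, Encoders}

val tupleEncoder: Encoder[(String, Int)] = Encoders.tuple(Encoders.STRING, Encoders.scalaInt)
val personEncoder: Encoder[Person]       = Encoders.product[Person] // encoder for a case class

// pass the encoder explicitly instead of relying on the implicit one
val ds = spark.createDataset(Seq(("spark", 1), ("sql", 2)))(tupleEncoder)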

  • case-class
case class Person(id:Int,name:String,age:Int,sex:Boolean)

val person: Dataset[Person] =List(Person(1,"zhangsan",18,true),Person(2,"wangwu",28,true)).toDS()
person.select($"id",$"name")
.show()
+---+--------+
| id|    name|
+---+--------+
|  1|zhangsan|
|  2|  wangwu|
+---+--------+
  • Tuple
val person: Dataset[(Int,String,Int,Boolean)] =List((1,"zhangsan",18,true),(2,"wangwu",28,true)).toDS()
person.select($"_1",$"_2")
.show()
+---+--------+
| _1|      _2|
+---+--------+
|  1|zhangsan|
|  2|  wangwu|
+---+--------+
  • JSON data
{"name":"张三","age":18} 
{"name":"lisi","age":28} 
{"name":"wangwu","age":38}
case class User(name:String,age:Long) //JSON numbers are read as Long by default
spark.read.json("file:///D:/demo/json/").as[User]
        .show()
+---+------+
|age|  name|
+---+------+
| 18|  张三|
| 28|  lisi|
| 38|wangwu|
+---+------+
  • rdd

    • Tuple
    val userRDD = spark.sparkContext.makeRDD(List((1,"张三",true,18,15000.0)))
    userRDD.toDS().show()
    
    +---+----+----+---+-------+
    | _1|  _2|  _3| _4|     _5|
    +---+----+----+---+-------+
    |  1|张三|true| 18|15000.0|
    +---+----+----+---+-------+
    
    • case-class
    val userRDD = spark.sparkContext.makeRDD(List(Person(1,"张三",18,true)))
    userRDD.toDS().show()
    
    +---+----+---+----+
    | id|name|age| sex|
    +---+----+---+----+
    |  1|张三| 18|true|
    +---+----+---+----+
    

Dataframe create

A DataFrame is a Dataset organized into named columns. Users work directly with columns, so virtually all recommended DataFrame operations are untyped. A DataFrame can also be regarded as a Dataset[Row].
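
A minimal sketch of converting between the two views (assuming the User case class defined just below and import spark.implicits._):

import org.apache.spark.sql.{DataFrame, Dataset}

val df: DataFrame     = List((1,"zs",true)).toDF("id","name","sex")
val ds: Dataset[User] = df.as[User]   // typed view; columns are matched by name
val back: DataFrame   = ds.toDF()     // back to the untyped Dataset[Row]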

  • JSON file
val dataFrame:DataFrame = spark.read.json("file:///D:/demo/json")
dataFrame.printSchema()
dataFrame.show()
  • case-class
case class User(id:Int,name:String,sex:Boolean)

var userDF=List(User(1,"zs",true)).toDF()
userDF.show()
+---+----+----+
| id|name| sex|
+---+----+----+
|  1|  zs|true|
+---+----+----+
  • Tuple
var userDF=List((1,"zs",true)).toDF("id","name","sex")
userDF.show()
+---+----+----+
| id|name| sex|
+---+----+----+
|  1|  zs|true|
+---+----+----+
  • From an RDD

    • Tuple
    var userDF= spark.sparkContext.parallelize(List((1,"zs",true)))
    .toDF("id","name","sex") //column names can be specified
    userDF.show()
    
    +---+----+----+
    | id|name| sex|
    +---+----+----+
    |  1|  zs|true|
    +---+----+----+
    
    • case-class
    var userDF= spark.sparkContext.parallelize(List(User(1,"zs",true)))
    .toDF("id","uname","sex")
    userDF.show()
    
    +---+-----+----+
    | id|uname| sex|
    +---+-----+----+
    |  1|   zs|true|
    +---+-----+----+
    
    • Converting an RDD[Row] to a DataFrame
    var userRDD:RDD[Row]= spark.sparkContext.parallelize(List(User(1,"zs",true)))
    	.map(u=>Row(u.id,u.name,u.sex))
    
    var schema=new StructType()
        .add("id",IntegerType)
        .add("name",StringType)
        .add("sex",BooleanType)
    
    var userDF=spark.createDataFrame(userRDD,schema)
    userDF.show()
    
    +---+----+----+
    | id|name| sex|
    +---+----+----+
    |  1|  zs|true|
    +---+----+----+
    
    • Javabean
    public class JavaUser implements Serializable {
        private Integer id;
        private String name;
        private Boolean sex;
       // get/set
    }
    
    var userRDD:RDD[JavaUser]=spark.sparkContext.makeRDD(List(new JavaUser(1,"zs",true)))
    var userDF=spark.createDataFrame(userRDD,classOf[JavaUser])
    userDF.show()
    
    • case-class
    case class ScalaUser(id:Int,name:String,sex:Boolean)
    
    var userRDD:RDD[ScalaUser]=spark.sparkContext.makeRDD(List(ScalaUser(1,"zs",true)))
    var userDF=spark.createDataFrame(userRDD)
    userDF.show()
    
    +---+----+----+
    | id|name| sex|
    +---+----+----+
    |  1|  zs|true|
    +---+----+----+
    
    • Tuple
    var userRDD:RDD[(Int,String,Boolean)]=spark.sparkContext.makeRDD(List((1,"zs",true)))
    var userDF=spark.createDataFrame(userRDD)
    userDF.show()
    
    +---+---+----+
    | _1| _2|  _3|
    +---+---+----+
    |  1| zs|true|
    +---+---+----+
    

DataFrame Operations (Untyped)

printSchema()

Prints the DataFrame's schema (the column headers).

var df=List((1,"zs",true)).toDF("id","name","sex")
df.printSchema()
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- sex: boolean (nullable = false)

show

var df=List((1,"zs",true)).toDF("id","name","sex")
df.show() //prints the first 20 rows
+---+----+----+
| id|name| sex|
+---+----+----+
|  1|  zs|true|
+---+----+----+

select

var df=List((1,"zs",true,1,15000),(2,"ls",false,1,15000))
.toDF("id","name","sex","dept","salary")

df.select($"id",$"name",$"salary")
.show()
+---+----+------+
| id|name|salary|
+---+----+------+
|  1|  zs| 15000|
|  2|  ls| 15000|
+---+----+------+

selectExpr

var df=List((1,"zs",true,1,15000),(2,"ls",false,1,18000))
.toDF("id","name","sex","dept","salary")
//equivalent to df.select($"id",$"name",$"sex",$"dept",$"salary" * 12 as "annual_salary")
df.selectExpr("id","name","sex","dept","salary * 12 as annual_salary") 
.show()
+---+----+-----+----+-------------+
| id|name|  sex|dept|annual_salary|
+---+----+-----+----+-------------+
|  1|  zs| true|   1|       180000|
|  2|  ls|false|   1|       216000|
+---+----+-----+----+-------------+

withColumn

df.select($"id",$"name",$"sex",$"dept",$"salary")
.withColumn("annula_salary",$"salary" * 12)
.show()
+---+----+-----+----+------+-------------+
| id|name|  sex|dept|salary|annula_salary|
+---+----+-----+----+------+-------------+
|  1|  zs| true|   1| 15000|       180000|
|  2|  ls|false|   1| 18000|       216000|
+---+----+-----+----+------+-------------+

withColumnRenamed

df.select($"id",$"name",$"sex",$"dept",$"salary")
    .withColumn("annula_salary",$"salary" * 12)
    .withColumnRenamed("dept","department")
    .withColumnRenamed("name","username")
    .show()
+---+--------+-----+----------+------+-------------+
| id|username|  sex|department|salary|annula_salary|
+---+--------+-----+----------+------+-------------+
|  1|      zs| true|         1| 15000|       180000|
|  2|      ls|false|         1| 18000|       216000|
+---+--------+-----+----------+------+-------------+

drop

df.select($"id",$"name",$"sex",$"dept",$"salary")
    .withColumn("annula_salary",$"salary" * 12)
    .withColumnRenamed("dept","department")
    .withColumnRenamed("name","username")
    .drop("sex")
    .show()
+---+--------+----------+------+-------------+
| id|username|department|salary|annula_salary|
+---+--------+----------+------+-------------+
|  1|      zs|         1| 15000|       180000|
|  2|      ls|         1| 18000|       216000|
+---+--------+----------+------+-------------+

dropDuplicates

Removes duplicate rows (here, by the sex and salary columns).
var df=List((1,"zs",true,1,15000),
            (2,"ls",false,1,18000),
            (3,"ww",false,1,19000),
            (4,"zl",false,1,18000)
           )
.toDF("id","name","sex","dept","salary")

df.select($"id",$"name",$"sex",$"dept",$"salary")
.dropDuplicates("sex","salary")
.show()
+---+----+-----+----+------+
| id|name|  sex|dept|salary|
+---+----+-----+----+------+
|  3|  ww|false|   1| 19000|
|  1|  zs| true|   1| 15000|
|  2|  ls|false|   1| 18000|
+---+----+-----+----+------+

orderBy|sort

var df=List((1,"zs",true,1,15000),
            (2,"ls",false,2,18000),
            (3,"ww",false,2,14000),
            (4,"zl",false,1,18000),
            (4,"zl",false,1,16000)
           )
.toDF("id","name","sex","dept","salary")

df.select($"id",$"name",$"sex",$"dept",$"salary")
.orderBy($"salary" desc,$"id" asc)
//.sort($"salary" desc,$"id" asc)
.show()
+---+----+-----+----+------+
| id|name|  sex|dept|salary|
+---+----+-----+----+------+
|  2|  ls|false|   2| 18000|
|  4|  zl|false|   1| 18000|
|  4|  zl|false|   1| 16000|
|  1|  zs| true|   1| 15000|
|  3|  ww|false|   2| 14000|
+---+----+-----+----+------+

groupBy

df.select($"id",$"name",$"sex",$"dept",$"salary")
    .groupBy($"dept")
    .max("salary")
    .show()
+----+-----------+
|dept|max(salary)|
+----+-----------+
|   1|      18000|
|   2|      18000|
+----+-----------+

Similar operators: max, min, avg|mean, sum, count.
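
For instance, a quick sketch of avg and count on the same DataFrame (these methods are available directly on the grouped result, no extra import needed):

df.select($"id",$"name",$"sex",$"dept",$"salary")
    .groupBy($"dept")
    .avg("salary")   // mean("salary") is equivalent
    .show()

df.groupBy($"dept")
    .count()         // number of rows per department
    .show()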

agg

var df=List((1,"zs",true,1,15000),
            (2,"ls",false,2,18000),
            (3,"ww",false,2,14000),
            (4,"zl",false,1,18000),
            (4,"zl",false,1,16000)
           )
.toDF("id","name","sex","dept","salary")

import org.apache.spark.sql.functions._
df.select($"id",$"name",$"sex",$"dept",$"salary")
.groupBy($"dept")
.agg(max("salary") as "max_salary",avg("salary") as "avg_salary")
.show()
+----+----------+------------------+
|dept|max_salary|        avg_salary|
+----+----------+------------------+
|   1|     18000|16333.333333333334|
|   2|     18000|           16000.0|
+----+----------+------------------+

Aggregations can also be specified by passing a Map:

var df=List((1,"zs",true,1,15000),
            (2,"ls",false,2,18000),
            (3,"ww",false,2,14000),
            (4,"zl",false,1,18000),
            (4,"zl",false,1,16000)
           )
.toDF("id","name","sex","dept","salary")

df.select($"id",$"name",$"sex",$"dept",$"salary")
.groupBy($"dept")
.agg(Map("salary"->"max","id"->"count"))
.show()
+----+-----------+---------+
|dept|max(salary)|count(id)|
+----+-----------+---------+
|   1|      18000|        3|
|   2|      18000|        2|
+----+-----------+---------+

limit

df.select($"id",$"name",$"sex",$"dept",$"salary")
.orderBy($"id" desc).limit(4)
.show()
+---+----+-----+----+------+
| id|name|  sex|dept|salary|
+---+----+-----+----+------+
|  5|  zl|false|   1| 16000|
|  4|  zl|false|   1| 18000|
|  3|  ww|false|   2| 14000|
|  2|  ls|false|   2| 18000|
+---+----+-----+----+------+

where

var df=List((1,"zs",true,1,15000),
            (2,"ls",false,2,18000),
            (3,"ww",false,2,14000),
            (4,"zl",false,1,18000),
            (5,"win7",false,1,16000)
           )
.toDF("id","name","sex","dept","salary")

df.select($"id",$"name",$"sex",$"dept",$"salary")
//where("(name like '%s%' and salary > 15000) or name = 'win7'")
.where(($"name" like "%s%" and $"salary" > 15000) or $"name" ==="win7" )
.show()
+---+----+-----+----+------+
| id|name|  sex|dept|salary|
+---+----+-----+----+------+
|  2|  ls|false|   2| 18000|
|  5|win7|false|   1| 16000|
+---+----+-----+----+------+

pivot (rows to columns)

var scoreDF=List((1,"math",85),
                 (1,"chinese",80),
                 (1,"english",90),
                 (2,"math",90),
                 (2,"chinese",80)
                ).toDF("id","course","score")
import org.apache.spark.sql.functions._
//select id,max(case course when 'math' then score else 0 end )as math ,max(case course when 'chinese' then score else 0 end) as chinese from t_course group by id;
scoreDF.selectExpr("id",
                   "case course when 'math' then score else 0 end as math",
                   "case course when 'chinese' then score else 0 end  as chinese",
                   "case course when 'english' then score else 0 end as english")
.groupBy("id")
.agg(max($"math"),max($"chinese"),max($"english"))
.show()
+---+---------+------------+------------+
| id|max(math)|max(chinese)|max(english)|
+---+---------+------------+------------+
|  1|       85|          80|          90|
|  2|       90|          80|           0|
+---+---------+------------+------------+
var scoreRDD=List((1,"math",85),
                  (1,"chinese",80),
                  (1,"english",90),
                  (2,"math",90),
                  (2,"chinese",80)
                 )
scoreRDD.toDF("id","course","score")
.groupBy("id")
//       pivot column   allowed values
.pivot("course",scoreRDD.map(t=>t._2).distinct)
.max("score")
.show()

+---+----+-------+-------+
| id|math|chinese|english|
+---+----+-------+-------+
|  1|  85|     80|     90|
|  2|  90|     80|   null|
+---+----+-------+-------+

na

var scoreRDD=List((1,"math",85),
                  (1,"chinese",80),
                  (1,"english",90),
                  (2,"math",90),
                  (2,"chinese",80),
                  (3,"math",100)
                 )
scoreRDD.toDF("id","course","score")
.groupBy("id")
//       pivot column   allowed values
.pivot("course",scoreRDD.map(t=>t._2).distinct)
.max("score")
.na.fill(Map("english" -> -1,"chinese" -> 0))
.show()
+---+----+-------+-------+
| id|math|chinese|english|
+---+----+-------+-------+
|  1|  85|     80|     90|
|  3| 100|      0|     -1|
|  2|  90|     80|     -1|
+---+----+-------+-------+

over (window functions)

Compute each employee's salary rank within their department:

select id,name,salary ,dept,(select count(*)+1 from t_user u2 where u1.dept=u2.dept and u1.salary < u2.salary) from t_user u1 
select id,name,salary, sum(1) over(partition by dept order by salary desc ) from t_user	
var df=List((1,"zs",true,1,15000),
            (2,"ls",false,2,18000),
            (3,"ww",false,2,14000),
            (4,"zl",false,1,18000),
            (5,"win7",false,1,16000)
           )
.toDF("id","name","sex","dept","salary")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val w1 = Window.partitionBy("dept")
    .orderBy($"salary" desc)
    .rangeBetween(Window.unboundedPreceding, Window.currentRow)
val w2 = Window.partitionBy("dept")
    .orderBy($"salary" desc)
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val w3= Window
    .orderBy($"salary" desc)
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)

df.select("id","name","dept","salary")
    .withColumn("rank",count("id") over(w1))
    .withColumn("avg",avg("salary") over(w2))
    .withColumn("diff",max("salary") over(w2))//highest salary in the department (used below to compute the gap to each salary)
    .withColumn("all_max",max("salary") over(w3))//highest salary in the whole company
    .select($"id",$"name",$"dept",$"salary",$"rank",$"avg",$"diff"-$"salary" as "chaju",$"all_max")
    .show()
+---+----+----+------+----+------------------+-----+-------+
| id|name|dept|salary|rank|               avg|chaju|all_max|
+---+----+----+------+----+------------------+-----+-------+
|  4|  zl|   1| 18000|   1|16333.333333333334|    0|  18000|
|  2|  ls|   2| 18000|   1|           16000.0|    0|  18000|
|  5|win7|   1| 16000|   2|16333.333333333334| 2000|  18000|
|  1|  zs|   1| 15000|   3|16333.333333333334| 3000|  18000|
|  3|  ww|   2| 14000|   2|           16000.0| 4000|  18000|
+---+----+----+------+----+------------------+-----+-------+

join

var userCostDF=spark.sparkContext.parallelize(List(
    UserCost(1,"电脑配件",100),
    UserCost(1,"母婴用品",100),
    UserCost(1,"生活用品",100),
    UserCost(2,"居家美食",79),
    UserCost(2,"消费电子",80),
    UserCost(2,"生活用品",100)
)).toDF().withColumnRenamed("id","uid")

val categories = userCostDF.select("category")
.as[(String)]
.rdd.distinct.collect()

var userDF=spark.sparkContext.parallelize(List(
    User(1,"张晓三",true,18,15000),
    User(2,"李晓四",true,18,18000),
    User(3,"王晓五",false,18,10000)
)).toDF()

userDF.join(userCostDF,$"id"===$"uid","left_outer")
    .drop("uid")
    .groupBy("id","name")
    .pivot($"category",categories)
    .sum("totalCost")
    .na.fill(0.0)
    .show()

+---+------+--------+--------+--------+--------+--------+
| id|  name|电脑配件|生活用品|母婴用品|居家美食|消费电子|
+---+------+--------+--------+--------+--------+--------+
|  1|张晓三|   100.0|   100.0|   100.0|     0.0|     0.0|
|  3|王晓五|     0.0|     0.0|     0.0|     0.0|     0.0|
|  2|李晓四|     0.0|   100.0|     0.0|    79.0|    80.0|
+---+------+--------+--------+--------+--------+--------+

cube (multi-dimensional aggregation)

The cube operator computes aggregates for every combination of the grouping columns, including subtotals and the grand total (missing dimensions appear as null).

import org.apache.spark.sql.functions._
List((110,50,80,80),(120,60,95,75),(120,50,96,70))
    .toDF("height","weight","IQ","EQ")
    .cube($"height",$"weight")
    .agg(avg("IQ"),avg("EQ"))
    .show()

Dataset Operations (Strongly typed)

Because strongly typed operations work on the element type, while Spark SQL recommends the column-based DataFrame operations, the typed operators are generally no longer recommended.

val lines=Array("this is a demo","hello spark")
val wordRDD = spark.sparkContext.makeRDD(lines)
.flatMap(_.split("\\s+"))
.map((_,1))

import org.apache.spark.sql.expressions.scalalang.typed
val ds:Dataset[(String,Int)]=wordRDD.toDS()
//4. Run typed operators on the Dataset
ds.groupByKey(t=>t._1)
.agg(typed.sum[(String,Int)](tuple=>tuple._2).name("total"))
.filter(tuple=>tuple._1.contains("o"))
.show()
+-----+-----+
|value|total|
+-----+-----+
|hello|  1.0|
| demo|  1.0|
+-----+-----+

SQL Queries

Single-row query

var userDF= List((1,"zs",true,18,15000,1))
      .toDF("id","name","sex","age","salary","dept")

userDF.createTempView("t_employee")
val sql="select * from t_employee where name = 'zs'"
spark.sql(sql).show()
+---+----+----+---+------+----+
| id|name| sex|age|salary|dept|
+---+----+----+---+------+----+
|  1|  zs|true| 18| 15000|   1|
+---+----+----+---+------+----+

Fuzzy query (LIKE)

var userDF= List((1,"zs",true,18,15000,1))
      .toDF("id","name","sex","age","salary","dept")

userDF.createTempView("t_employee")
val sql="select * from t_employee where name like '%s%'"
spark.sql(sql).show()

Ordered query

 var userDF= List(
     (1,"zs",true,18,15000,1),
     (2,"ls",false,18,12000,1))
      .toDF("id","name","sex","age","salary","dept")

    //create the view
    userDF.createTempView("t_employee")
    val sql=
      """
        select * from t_employee
        where salary > 10000
        order by salary desc
      """
    spark.sql(sql).show()

LIMIT query

var userDF= List(
     (1,"zs",true,18,15000,1),
     (2,"ls",false,18,12000,1),
     (3,"ww",false,18,16000,2))
      .toDF("id","name","sex","age","salary","dept")

    //create the view
    userDF.createTempView("t_employee")
    val sql=
      """
        select * from t_employee
        where salary > 10000
        order by salary desc
        limit 2
      """
    spark.sql(sql).show()

Group-by query

var userDF= List(
    (1,"zs",true,18,15000,1),
    (2,"ls",false,18,12000,1),
    (3,"ww",false,18,16000,2))
.toDF("id","name","sex","age","salary","dept")

userDF.createTempView("t_employee")
    val sql=
      """
        select dept ,avg(salary) as avg_slalary from t_employee
        group by dept
        order by avg_slalary desc
      """
    spark.sql(sql).show()
+----+-----------+
|dept|avg_slalary|
+----+-----------+
|   2|    16000.0|
|   1|    13500.0|
+----+-----------+

HAVING filter

 var userDF= List(
     (1,"zs",true,18,15000,1),
     (2,"ls",false,18,12000,1),
     (3,"ww",false,18,16000,2))
      .toDF("id","name","sex","age","salary","dept")

    //create the view
    userDF.createTempView("t_employee")
    val sql=
      """
        select dept ,avg(salary) as avg_slalary from t_employee
        group by dept
        having avg_slalary > 13500
        order by avg_slalary desc
      """
    spark.sql(sql).show()
+----+-----------+
|dept|avg_slalary|
+----+-----------+
|   2|    16000.0|
+----+-----------+

case-when

var userDF= List(
    (1,"zs",true,18,15000,1),
    (2,"ls",false,18,12000,1),
    (3,"ww",false,18,16000,2))
.toDF("id","name","sex","age","salary","dept")

//create the view
userDF.createTempView("t_employee")
val sql=
	"""
        select id,name,case sex when true then '男' else '女' end as sex_alias
        from t_employee
      """
spark.sql(sql).show()
+---+----+---------+
| id|name|sex_alias|
+---+----+---------+
|  1|  zs|       男|
|  2|  ls|       女|
|  3|  ww|       女|
+---+----+---------+

Rows to columns

var scoreDF=List((1, "语文", 100),
      (1, "数学", 100),
      (1, "英语", 100),
      (2, "数学", 79),
      (2, "语文", 80),
      (2, "英语", 100)).toDF("id","course","score")
    scoreDF.createOrReplaceTempView("t_course")

    val sql=
      """
        select id,
          max(case course when '数学' then score else 0 end) as math,
          max(case course when '英语' then score else 0 end) as english,
          max(case course when '语文' then score else 0 end) as chinese
        from t_course
        group by id
      """
    spark.sql(sql).show()
+---+----+-------+-------+
| id|math|english|chinese|
+---+----+-------+-------+
|  1| 100|    100|    100|
|  2|  79|    100|     80|
+---+----+-------+-------+

pivot

 var scoreDF=List((1, "语文", 100),
      (1, "数学", 100),
      (1, "英语", 100),
      (2, "数学", 79),
      (2, "语文", 80),
      (2, "英语", 100)).toDF("id","course","score")
    scoreDF.createOrReplaceTempView("t_course")

    val sql=
      """
        select * from t_course
        pivot(max(score) for course in ('数学','语文','英语'))
      """
    spark.sql(sql).show()

Note that when writing this SQL, every column other than the aggregated column and the pivot column acts as an implicit GROUP BY column (a sketch illustrating this follows the output below).

+---+----+----+----+
| id|数学|语文|英语|
+---+----+----+----+
|  1| 100| 100| 100|
|  2|  79|  80| 100|
+---+----+----+----+
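
To control which columns take part in the implicit grouping, the input can first be narrowed with a subquery; a hedged sketch:

val sqlWithSubquery =
  """
    select * from (select id, course, score from t_course)
    pivot(max(score) for course in ('数学','语文','英语'))
  """
spark.sql(sqlWithSubquery).show()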

Joins

spark.sql("select * from t_user left join t_dept on deptNo=id").show()
+-------+---+-------+-----+--------+------+---+----+
|   name|age| salary|  sex|     job|deptNo| id|name|
+-------+---+-------+-----+--------+------+---+----+
|Michael| 29|20000.0| true| MANAGER|     1|  1|研发|
|   Andy| 30|15000.0| true|SALESMAN|     1|  1|研发|
| Justin| 19| 8000.0| true|   CLERK|     1|  1|研发|
|  Kaine| 20|20000.0| true| MANAGER|     2|  2|设计|
|   Lisa| 19|18000.0|false|SALESMAN|     2|  2|设计|
+-------+---+-------+-----+--------+------+---+----+

Spark supports inner join, left outer, right outer, and full outer joins.

Subqueries

spark.sql("select * ,(select count(t1.salary) from t_user t1 where (t1.deptNo = t2.deptNo)  group by t1.deptNo) as total from t_user t2 left join t_dept on t2.deptNo=id order by t2.deptNo asc,t2.salary desc").show()
+-------+---+-------+-----+--------+------+---+----+-----+
|   name|age| salary|  sex|     job|deptNo| id|name|total|
+-------+---+-------+-----+--------+------+---+----+-----+
|Michael| 29|20000.0| true| MANAGER|     1|  1|研发|    3|
|   Andy| 30|15000.0| true|SALESMAN|     1|  1|研发|    3|
| Justin| 19| 8000.0| true|   CLERK|     1|  1|研发|    3|
|  Kaine| 20|20000.0| true| MANAGER|     2|  2|设计|    2|
|   Lisa| 19|18000.0|false|SALESMAN|     2|  2|设计|    2|
+-------+---+-------+-----+--------+------+---+----+-----+

Note that Spark only supports correlated subqueries that use equality (=) predicates.

Window Functions

In ordinary statistical analysis, aggregate functions are typically used; their characteristic is that they collapse n rows into a single row. Databases also provide windowed analysis: a window function computes a value for every row over a window of related rows without collapsing them. If each record returned by a query is pictured as one floor of a tall building, a window function opens a window on each floor so that the floor can see the whole building, or a part of it.

  • Output each employee together with the average salary of their department
spark.sql("select u.name,u.deptNo,u.salary,d.name,avg(salary) over(partition by deptNo ) avg from t_user u left join t_dept d on deptNo=id").show()
+-------+------+-------+----+------------------+
|   name|deptNo| salary|name|               avg|
+-------+------+-------+----+------------------+
|Michael|     1|20000.0|研发|14333.333333333334|
|   Andy|     1|15000.0|研发|14333.333333333334|
| Justin|     1| 8000.0|研发|14333.333333333334|
|  Kaine|     2|20000.0|设计|           19000.0|
|   Lisa|     2|18000.0|设计|           19000.0|
+-------+------+-------+----+------------------+

Equivalent form:

select u2.name,u2.deptNo,u2.salary,d.name,(select avg(salary) from t_user u1 where u1.deptNo=u2.deptNo group by u1.deptNo) avgSalary from t_user u2 left join t_dept d on deptNo=id
  • Compute each employee's salary rank within their own department
spark.sql("select u.name,u.deptNo,u.salary,d.name,sum(1) over(partition by deptNo order by salary desc) rank from t_user u left join t_dept d on deptNo=id").show()
+-------+------+-------+----+----+
|   name|deptNo| salary|name|rank|
+-------+------+-------+----+----+
|Michael|     1|20000.0|研发|   1|
|   Andy|     1|15000.0|研发|   2|
| Justin|     1| 8000.0|研发|   3|
|  Kaine|     2|20000.0|设计|   1|
|   Lisa|     2|18000.0|设计|   2|
+-------+------+-------+----+----+

Here sum(1) can be replaced with ROW_NUMBER().

  • Compute each employee's salary rank across the whole company
spark.sql("select t_user.name,deptNo,salary,t_dept.name,ROW_NUMBER() over(order by salary desc) rank from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+----+
|   name|deptNo| salary|name|rank|
+-------+------+-------+----+----+
| Justin|     1| 8000.0|研发|   5|
|   Andy|     1|15000.0|研发|   4|
|Michael|     1|20000.0|研发|   1|
|  Kaine|     2|20000.0|设计|   2|
|   Lisa|     2|18000.0|设计|   3|
+-------+------+-------+----+----+

As the output shows, ROW_NUMBER() simply numbers rows in the order they are emitted, so it cannot express a true ranking when there are ties. For rankings, RANK() or DENSE_RANK() is normally used instead.

  • Salary ranking with RANK (ranks may be non-consecutive)
spark.sql("select t_user.name,deptNo,salary,t_dept.name,RANK() over(order by salary desc) rank from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+----+
|   name|deptNo| salary|name|rank|
+-------+------+-------+----+----+
| Justin|     1| 8000.0|研发|   5|
|   Andy|     1|15000.0|研发|   4|
|Michael|     1|20000.0|研发|   1|
|  Kaine|     2|20000.0|设计|   1|
|   Lisa|     2|18000.0|设计|   3|
+-------+------+-------+----+----+
  • Salary ranking with DENSE_RANK (consecutive ranks)
spark.sql("select t_user.name,deptNo,salary,t_dept.name,DENSE_RANK() over(order by salary desc) rank from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+----+
|   name|deptNo| salary|name|rank|
+-------+------+-------+----+----+
| Justin|     1| 8000.0|研发|   4|
|   Andy|     1|15000.0|研发|   3|
|Michael|     1|20000.0|研发|   1|
|  Kaine|     2|20000.0|设计|   1|
|   Lisa|     2|18000.0|设计|   2|
+-------+------+-------+----+----+
  • Compute each employee's salary gap to the highest salary in their department
spark.sql("select t_user.name,deptNo,salary,t_dept.name,(salary- max(salary) over(partition by deptNo order by salary desc rows between  unbounded preceding and current row)) diff from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+--------+
|   name|deptNo| salary|name|    diff|
+-------+------+-------+----+--------+
|Michael|     1|20000.0|研发|     0.0|
|   Andy|     1|15000.0|研发| -5000.0|
| Justin|     1| 8000.0|研发|-12000.0|
|  Kaine|     2|20000.0|设计|     0.0|
|   Lisa|     2|18000.0|设计| -2000.0|
+-------+------+-------+----+--------+
  • Compute each employee's salary gap to the lowest salary in their department
spark.sql("select t_user.name,deptNo,salary,t_dept.name,(salary- min(salary) over(partition by deptNo order by salary desc rows between  current row and unbounded following)) diff from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+-------+
|   name|deptNo| salary|name|   diff|
+-------+------+-------+----+-------+
| Justin|     1| 8000.0|研发|    0.0|
|   Andy|     1|15000.0|研发| 7000.0|
|Michael|     1|20000.0|研发|12000.0|
|  Kaine|     2|20000.0|设计| 2000.0|
|   Lisa|     2|18000.0|设计|    0.0|
+-------+------+-------+----+-------+
  • Compute each employee's salary gap to the company-wide average salary
spark.sql("select t_user.name,deptNo,salary,t_dept.name,(salary- avg(salary) over(order by salary  rows between  unbounded preceding and unbounded following)) diff from t_user left join t_dept on deptNo=id order by deptNo").show()
+-------+------+-------+----+-------+
|   name|deptNo| salary|name|   diff|
+-------+------+-------+----+-------+
|Michael|     1|20000.0|研发| 3800.0|
|   Andy|     1|15000.0|研发|-1200.0|
| Justin|     1| 8000.0|研发|-8200.0|
|   Lisa|     2|18000.0|设计| 1800.0|
|  Kaine|     2|20000.0|设计| 3800.0|
+-------+------+-------+----+-------+

Summary

aggregate_function(column) over ([partition by column] order by column asc [rows between start_offset and end_offset])

Possible values for the offsets:

preceding: accumulate over the previous N rows (within the partition); use unbounded to start from the first row of the partition. N is the offset backwards from the current row.
following: the opposite of preceding, accumulating over the following N rows (within the partition); use unbounded to accumulate to the end of the partition. N is the offset forwards from the current row.
current row: the current row itself, i.e. an offset of 0.

Note: the "previous N", "following M", and current row boundaries above all include the row at that offset.
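
A hedged sketch applying this frame syntax, computing a running salary total within each department over the same t_user view:

spark.sql(
  """
    select name, deptNo, salary,
           sum(salary) over(partition by deptNo order by salary desc
                            rows between unbounded preceding and current row) as running_total
    from t_user
  """).show()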

User-Defined Functions

Scalar functions

Compute the annual salary based on the employee's job title:

spark.udf.register("annual_salary",(job:String,salary:Double)=>{
  job match {
    case "MANAGER" => salary*12 + 5000000
    case "SALESMAN" => salary*12 + 100000
    case "CLERK" => salary*12 + 20000
    case _ => salary*12
  }
})
spark.sql("select * ,annual_salary(job,salary) annual_salary from t_user").show()
+-------+---+-------+-----+--------+------+-------------+
|   name|age| salary|  sex|     job|deptNo|annual_salary|
+-------+---+-------+-----+--------+------+-------------+
|Michael| 29|20000.0| true| MANAGER|     1|    5240000.0|
|   Andy| 30|15000.0| true|SALESMAN|     1|     280000.0|
| Justin| 19| 8000.0| true|   CLERK|     1|     116000.0|
|  Kaine| 20|20000.0| true| MANAGER|     2|    5240000.0|
|   Lisa| 19|18000.0|false|SALESMAN|     2|     316000.0|
+-------+---+-------+-----+--------+------+-------------+

To call it through the DataFrame API instead:

userDS.selectExpr("name","age","job","salary","annual_salary(job,salary) as annual_salary").show()
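
A hedged sketch of the same logic defined as a column-level UDF via org.apache.spark.sql.functions.udf, which avoids registration when working purely through the DataFrame/Dataset API:

import org.apache.spark.sql.functions.udf

val annualSalary = udf((job: String, salary: Double) => job match {
  case "MANAGER"  => salary * 12 + 5000000
  case "SALESMAN" => salary * 12 + 100000
  case "CLERK"    => salary * 12 + 20000
  case _          => salary * 12
})

userDS.withColumn("annual_salary", annualSalary($"job", $"salary")).show()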

Aggregate functions (for reference)

Untyped aggregation (SQL)

Define a subclass of UserDefinedAggregateFunction:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, DoubleType, StructType}

class DeptSalarySum  extends UserDefinedAggregateFunction{
  //schema of the input arguments
  override def inputSchema: StructType = {
    new StructType().add("salary","double")
  }
  //schema of the intermediate aggregation buffer
  override def bufferSchema: StructType = {
    new StructType().add("total","double")
  }
  //type of the return value
  override def dataType: DataType = {
    DoubleType
  }
  //the function is deterministic: the same input always produces the same output
  override def deterministic: Boolean = true

  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer.update(0,0.0D)
  }

  //partial (per-partition) aggregation
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val history = buffer.getDouble(0)
    buffer.update(0,history + input.getAs[Double](0))
  }
  //merge intermediate buffers
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1.update(0,buffer1.getDouble(0)+buffer2.getDouble(0))
  }
  //produce the final result
  override def evaluate(buffer: Row): Any = {
    buffer.getDouble(0)
  }
}

Register the aggregate function:

spark.udf.register("deptSalarySum",new DeptSalarySum)

Invoke the aggregate function:

spark.sql("select deptNo,deptSalarySum(salary) as salary from t_user group by deptNo").show()
+------+---------------------+
|deptNo|deptsalarysum(salary)|
+------+---------------------+
|     1|              43000.0|
|     2|              38000.0|
+------+---------------------+

Strongly typed aggregation (API)

For strongly typed aggregation, you implement the Aggregator interface and define a state class that holds the intermediate values of the aggregation. It is used to aggregate over the contents of a Dataset.

DeptSalarySate

class DeptSalarySate(var initValue:Double=0 ) {
  def add(v:Double): Unit ={
    initValue += v
  }
  def get():Double={
    initValue
  }
}

DeptSalaryAggregator

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

class DeptSalaryAggregator extends Aggregator[Employee,DeptSalarySate,Double]{
  override def zero: DeptSalarySate = new DeptSalarySate()

  override def reduce(b: DeptSalarySate, a: Employee): DeptSalarySate ={
    b.add(a.salary)
    b
  }

  override def merge(b1: DeptSalarySate, b2: DeptSalarySate): DeptSalarySate = {
    b1.add(b2.get())
    b1
  }

  override def finish(reduction: DeptSalarySate): Double = {
    reduction.get()
  }

  override def bufferEncoder: Encoder[DeptSalarySate] = {
    //Encoders.product[DeptSalarySate]
    Encoders.kryo(classOf[DeptSalarySate])
  }

  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}


Test program

var myagg=new DeptSalaryAggregator().toColumn.name("avg")
userDS.select(myagg).show()
+-------+
|    avg|
+-------+
|81000.0|
+-------+
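
A hedged sketch of applying the same Aggregator per department with a typed groupByKey (this assumes the Employee case class used elsewhere in this article exposes a deptNo field):

userDS.groupByKey(_.deptNo)
    .agg(new DeptSalaryAggregator().toColumn.name("dept_salary_sum"))
    .show()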

Load/Save

Parquet files

Writing Parquet files

Parquet is just a storage format; it is language- and platform-independent and is not tied to any particular data-processing framework.

val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
      .map(line => line.split(","))
      .map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
      .toDS().as("u")

val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
      .map(line => line.split(","))
      .map(t => Dept(t(0).toInt, t(1)))
      .toDS().as("d")

userDS.select("name","sex","salary","deptNo")
      .withColumn("annual_salary",$"salary"*12)
      .join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
      .drop("id")
      .toDF("name","sex","salary","dept_no","annual_salary","dept_name")
      .write
      .save("file:///Users/jiangzz/Desktop/results/parquet/results.parquet")

spark.close()

Reading Parquet files

val spark = SparkSession.builder()
                .master("local[6]")
                .appName("sql")
                .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")
//equivalent to spark.read.parquet
spark.read.load("file:///Users/jiangzz/Desktop/results/parquet/results.parquet")
      .show()

spark.close()
+-------+-----+-------+-------+-------------+---------+
|   name|  sex| salary|dept_no|annual_salary|dept_name|
+-------+-----+-------+-------+-------------+---------+
|Michael| true|20000.0|      1|     240000.0|     研发|
|   Andy| true|15000.0|      1|     180000.0|     研发|
| Justin| true| 8000.0|      1|      96000.0|     研发|
|  Kaine| true|20000.0|      2|     240000.0|     设计|
|   Lisa|false|18000.0|      2|     216000.0|     设计|
+-------+-----+-------+-------+-------------+---------+

JSON format

Writing JSON

val spark = SparkSession.builder()
.master("local[6]")
.appName("sql")
.getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
    .map(line => line.split(","))
    .map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
    .toDS().as("u")

val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
    .map(line => line.split(","))
    .map(t => Dept(t(0).toInt, t(1)))
    .toDS().as("d")

userDS.select("name","sex","salary","deptNo")
    .withColumn("annual_salary",$"salary"*12)
    .join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
    .drop("id")
    .toDF("name","sex","salary","dept_no","annual_salary","dept_name")
    .write
    .mode(SaveMode.Ignore)
    .format("json")
    .save("file:///Users/jiangzz/Desktop/results/json/")

spark.close()

Reading JSON

val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()
spark.sparkContext.setLogLevel("FATAL")
import spark.implicits._

spark.read.json("file:///Users/jiangzz/Desktop/results/json/")
    .show()

spark.close()

CSV format

Writing

val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
    .map(line => line.split(","))
    .map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
    .toDS().as("u")

val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
    .map(line => line.split(","))
    .map(t => Dept(t(0).toInt, t(1)))
    .toDS().as("d")

userDS.select("name","sex","salary","deptNo")
    .withColumn("annual_salary",$"salary"*12)
    .join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
    .drop("id")
    .toDF("name","sex","salary","dept_no","annual_salary","dept_name")
    .write
    .mode(SaveMode.Ignore)
    .format("csv")
    .option("sep", ";")
    .option("inferSchema", "true")
    .option("header", "true")
    .save("file:///Users/jiangzz/Desktop/results/csv/")

spark.close()

Reading

val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")

spark.read
    .option("sep", ";")
    .option("inferSchema", "true")
    .option("header", "true")
    .csv("file:///Users/jiangzz/Desktop/results/csv/")
    .show()

spark.close()

ORC format

ORC (Optimized Row Columnar) is a columnar storage format from the Hadoop ecosystem. It dates back to early 2013 and originated in Apache Hive, where it was introduced to reduce Hadoop storage space and speed up Hive queries.

Writing

val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
    .map(line => line.split(","))
    .map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
    .toDS().as("u")

val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
    .map(line => line.split(","))
    .map(t => Dept(t(0).toInt, t(1)))
    .toDS().as("d")

userDS.select("name","sex","salary","deptNo")
    .withColumn("annual_salary",$"salary"*12)
    .join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
    .drop("id")
    .toDF("name","sex","salary","dept_no","annual_salary","dept_name")
    .write
    .mode(SaveMode.Ignore)
    .format("orc")
    .option("orc.bloom.filter.columns", "favorite_color")
    .option("orc.dictionary.key.threshold", "1.0")
    .save("file:///Users/jiangzz/Desktop/results/orc/")

spark.close()

Reading

val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
    .map(line => line.split(","))
    .map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
    .toDS().as("u")

val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
    .map(line => line.split(","))
    .map(t => Dept(t(0).toInt, t(1)))
    .toDS().as("d")

spark.read
    .option("orc.bloom.filter.columns", "favorite_color")
    .option("orc.dictionary.key.threshold", "1.0")
    .orc("file:///Users/jiangzz/Desktop/results/orc/")
    .show()

spark.close()

Reading files directly with SQL

val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")

val parqeutDF = spark.sql("SELECT * FROM parquet.`file:///Users/jiangzz/Desktop/results/parquet/results.parquet`")
val jsonDF = spark.sql("SELECT * FROM json.`file:///Users/jiangzz/Desktop/results/json/`")
val orcDF = spark.sql("SELECT * FROM orc.`file:///Users/jiangzz/Desktop/results/orc/`")
//parqeutDF.show()
//jsonDF.show()
//csvDF.show()
orcDF.show()
spark.close()

JDBC databases

Reading from MySQL

<dependency>
    <groupId>mysql</groupId>
    <artifactId>mysql-connector-java</artifactId>
    <version>5.1.47</version>
</dependency>
val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")
spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://CentOS:3306/test")
    .option("dbtable", "t_user")
    .option("user", "root")
    .option("password", "root")
    .load().createTempView("t_user")


spark.sql("select * from t_user").show()

spark.close()
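
For larger tables the read can be parallelized with Spark's built-in JDBC partitioning options; a hedged sketch (partitionColumn must be a numeric, date, or timestamp column; deptNo is assumed here):

spark.read
    .format("jdbc")
    .option("url", "jdbc:mysql://CentOS:3306/test")
    .option("dbtable", "t_user")
    .option("user", "root")
    .option("password", "root")
    .option("partitionColumn", "deptNo") // assumed numeric column
    .option("lowerBound", "1")
    .option("upperBound", "10")
    .option("numPartitions", "4")
    .load()
    .show()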

Writing to MySQL

val spark = SparkSession.builder()
    .master("local[6]")
    .appName("sql")
    .getOrCreate()

import spark.implicits._

spark.sparkContext.setLogLevel("FATAL")
val userDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_employee.txt")
    .map(line => line.split(","))
    .map(t => Employee(t(0), t(1).toInt, t(2).toDouble, t(3).toBoolean, t(4), t(5).toInt))
    .toDS().as("u")

val deptDS = spark.sparkContext.textFile("file:///Users/jiangzz/Desktop/words/t_dept.txt")
    .map(line => line.split(","))
    .map(t => Dept(t(0).toInt, t(1)))
    .toDS().as("d")

val props = new Properties()
    props.put("user", "root")
    props.put("password", "root")

userDS.select("name","sex","salary","deptNo")
    .withColumn("annual_salary",$"salary"*12)
    .join(deptDS,$"u.deptNo" ===$"d.id" ,"left_outer")
    .drop("id")
    .toDF("name","sex","salary","dept_no","annual_salary","dept_name")
    .write
    .mode("append")
    .jdbc("jdbc:mysql://CentOS:3306/test","t_user",props)

spark.close()