SparkSQL

SparkSQL Concepts

Spark SQL is a Spark module for structured data processing. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give the engine extra information about the structure of the data and the computation being performed, and Spark uses this information internally to optimize and adjust the job. There are several ways to interact with Spark SQL, including the Dataset API and SQL, and the two can be mixed freely. One use of Spark SQL is to execute SQL queries; Spark SQL can also be used to read data from an existing Hive installation. When SQL is run from another programming language the results are returned as a Dataset/DataFrame, and you can also interact with the SQL interface through the command line or over JDBC/ODBC.

A Dataset is a distributed collection of data and a new interface added in Spark 1.6. It provides the benefits of RDDs (strong typing, the ability to use powerful lambda functions) together with the benefits of Spark SQL's optimized execution engine. A Dataset can be constructed from JVM objects and then manipulated with functional transformations (for example map, flatMap, filter). The Dataset API is currently available in Scala and Java; Python support for Datasets is still incomplete.
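
As a quick illustration of those typed transformations, the lambdas operate directly on the element type (a minimal sketch, assuming an existing SparkSession named spark and that import spark.implicits._ is in scope):

case class Person(id: Int, name: String, age: Int)

// the element type is preserved, so the lambdas work on Person objects directly
val people = List(Person(1, "zhangsan", 20), Person(2, "lisi", 30)).toDS()
people.filter(p => p.age > 25)       // typed filter
  .map(p => p.name.toUpperCase)      // typed map, yields a Dataset[String]
  .show()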

A DataFrame is a Dataset organized into named columns; conceptually it is equivalent to a table in a relational database. DataFrames can be constructed from many sources, such as structured data files, Hive tables, or external databases. A DataFrame is simply a Dataset of Row objects, i.e. a DataFrame is a Dataset[Row].

SparkSession

  • Dependency
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.3</version>
</dependency>

If you get the error Exception in thread "main" java.lang.NoClassDefFoundError: org/codehaus/janino/InternalCompilerException, try adding the following dependency:

<dependency>
    <groupId>org.codehaus.janino</groupId>
    <artifactId>janino</artifactId>
    <version>3.0.8</version>
</dependency>

The entry point to all functionality in Spark is the SparkSession class; create one with SparkSession.builder():

val spark = SparkSession.builder()
          .appName("hellosql")
          .master("local[10]")
           .getOrCreate()
//this implicit conversion is usually needed, mainly to convert RDDs to DataFrames/Datasets
import spark.implicits._

//turn off Spark logging
spark.sparkContext.setLogLevel("FATAL")
spark.stop()

Dataset

Datasets are similar to RDDs, but instead of using Java serialization or Kryo they use a specialized Encoder to serialize objects for processing and for transmission over the network. The Encoder is dynamically generated code that uses a format allowing Spark to perform many operations such as filtering, sorting and hashing without deserializing the bytes back into objects.
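
The required Encoders are normally brought in implicitly through import spark.implicits._, but they can also be obtained explicitly from org.apache.spark.sql.Encoders (a minimal sketch, assuming an existing SparkSession named spark):

import org.apache.spark.sql.{Encoder, Encoders}

case class Person(id: Int, name: String, age: Int, sex: Boolean)

val personEncoder: Encoder[Person] = Encoders.product[Person]   // Encoder for a case class
val ds = spark.createDataset(Seq(Person(1, "enming", 23, true)))(personEncoder)
ds.printSchema()   // the Encoder carries the schema; no Java/Kryo serialization is involved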

  • Case class collection
package demo1

import org.apache.spark.sql.{Dataset, SparkSession}

object Testsql {
  case class Person(id:Int,name:String,age:Int,sex:Boolean)
  
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("Testsql").getOrCreate()
   import spark.implicits._
    val dataset:Dataset[Person] = List(Person(1,"enming",23,true),Person(2,"haonan",23,false)).toDS()
    dataset.select($"name",$"sex").show()

    //turn off logging
    spark.sparkContext.setLogLevel("FATAL")
    spark.stop()
  }
}

  • Tuple
package demo1

import org.apache.spark.sql.{Dataset, SparkSession}

object Testsql {
  case class Person(id:Int,name:String,age:Int,sex:Boolean)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("Testsql").getOrCreate()
   import spark.implicits._

    val dataset:Dataset[(Int,String,Int,Boolean)] = List((1,"haonan",23,false),(2,"enming",23,true)).toDS()
    dataset.select($"_1",$"_2").show()

    //turn off logging
    spark.sparkContext.setLogLevel("FATAL")
    spark.stop()
  }
}

  • Loading JSON data
{"name":"张三","age":18}
{"name":"lisi","age":28}
{"name":"wangwu","age":38}
package demo1

import org.apache.spark.sql.{Dataset, SparkSession}

object Testsql {
  case class Person(name:String,age:Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("Testsql").getOrCreate()
   import spark.implicits._

    val dataset = spark.read.json("E:///test.json").as[Person]
    dataset.show()

    //turn off logging
    spark.sparkContext.setLogLevel("FATAL")
    spark.stop()
  }
}

DataFrame

A DataFrame is a Dataset organized into named columns; conceptually it is equivalent to a table in a relational database. DataFrames can be constructed from many sources, such as structured data files, Hive tables, or external databases. A DataFrame is simply a Dataset of Row objects, i.e. a DataFrame is a Dataset[Row].
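
Because a DataFrame is just a Dataset[Row], you can move between the two views freely (a minimal sketch, assuming spark.implicits._ is in scope):

case class Person(name: String, age: Long)   // defined outside main, as in the examples below

val df = List(("zhanghaonan", 18L), ("fanenming", 20L)).toDF("name", "age")  // a DataFrame, i.e. Dataset[Row]
val ds = df.as[Person]                                                       // the same data as a typed Dataset[Person]
val df2 = ds.toDF()                                                          // back to an untyped DataFrame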

  • Loading a JSON file
package demo1
import org.apache.spark.sql.{Dataset, SparkSession}

object Testsql {
  case class Person(name:String,age:Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("Testsql").getOrCreate()
   import spark.implicits._

    val dataframe = spark.read.json("E:///test.json")
    dataframe.show()

    //turn off logging
    spark.sparkContext.setLogLevel("FATAL")
    spark.stop()
  }
}
  • case-class
package demo1
import org.apache.spark.sql.{Dataset, SparkSession}

object Testsql {
  case class Person(name:String,age:Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("Testsql").getOrCreate()
   import spark.implicits._

    List(Person("haonan",18),Person("enming",18)).toDF("name","age").show()
    //turn off logging
    spark.sparkContext.setLogLevel("FATAL")
    spark.stop()
  }
}

  • Tuple
package demo1
import org.apache.spark.sql.{Dataset, SparkSession}

object Testsql {
  case class Person(name:String,age:Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("Testsql").getOrCreate()
   import spark.implicits._

   List(("zhanghaonan",18),("fanenming",20)).toDF("name","age").show()
    
    //turn off logging
    spark.sparkContext.setLogLevel("FATAL")
    spark.stop()
  }
}

  • Converting from an RDD (most flexible)
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

case class Person(name:String,age:Long)
def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
    .appName("hellosql")
    .master("local[10]")
    .getOrCreate()
    import spark.implicits._

    val lines = spark.sparkContext.parallelize(List("zhangsan,20", "lisi,30"))
    .map(line => Row(line.split(",")(0), line.split(",")(1).toInt))

    val structType = new StructType(Array(StructField("name",StringType,true),StructField("age",IntegerType,true)))
    val frame = spark.createDataFrame(lines,structType)

    frame.show()

    //turn off Spark logging
    spark.sparkContext.setLogLevel("FATAL")
    spark.stop()
}

DataFrame Operators

The test data has the following format:

Michael,29,2000,true
Andy,30,5000,true
Justin,19,1000,true
Kaine,20,5000,true
Lisa,19,1000,false
select
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

case class Person(name:String,age:Long)
def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
    .appName("hellosql")
    .master("local[10]")
    .getOrCreate()
    import spark.implicits._

//Michael,29,2000,true
var rdd=  spark.sparkContext.textFile("file:///D:/people.txt")
.map(_.split(","))
.map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean))

var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true):: new StructField("sex",BooleanType,true)::Nil
spark.createDataFrame(rdd,StructType(fields))
.select($"name",$"age",$"sex",$"salary",$"salary" * 12 as "年薪")
.show()

    //turn off Spark logging
    spark.sparkContext.setLogLevel("FATAL")
    spark.stop()
}
filter
var rdd=  spark.sparkContext.textFile("file:///D:/people.txt")
.map(_.split(","))
.map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean))

var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true):: new StructField("sex",BooleanType,true)::Nil
spark.createDataFrame(rdd,StructType(fields))
    .select($"name",$"age",$"sex",$"salary",$"salary" * 12 as "年薪")
    .filter($"name" === "Michael" or $"年薪" <  60000)
.show()
where
//Michael,29,2000,true
var rdd=  spark.sparkContext.textFile("file:///D:/people.txt")
.map(_.split(","))
.map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean))

var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true):: new StructField("sex",BooleanType,true)::Nil
spark.createDataFrame(rdd,StructType(fields))
.select($"name",$"age",$"sex",$"salary",$"salary" * 12 as "year_salary") //不允许别名中有 中文 bug
.where("(name = 'Michael') or ( year_salary <= 24000) ")
.show()
var rdd=  spark.sparkContext.textFile("file:///D:/people.txt")
  .map(_.split(","))
  .map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean))

var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true):: new StructField("sex",BooleanType,true)::Nil
spark.createDataFrame(rdd,StructType(fields))
    .select($"name",$"age",$"sex",$"salary",$"salary" * 12 as "年薪")
    .where($"name" === "Michael" or $"年薪" <= 24000)
    .show()
withColumn
//Michael,29,2000,true
var rdd=  spark.sparkContext.textFile("file:///D:/people.txt")
.map(_.split(","))
.map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean))

var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true):: new StructField("sex",BooleanType,true)::Nil
spark.createDataFrame(rdd,StructType(fields))
.select($"name",$"age",$"sex",$"salary",$"salary" * 12 as "年薪")
.where($"name" === "Michael" or $"年薪" <= 24000)
.withColumn("年终奖",$"年薪" * 0.8)
.show()
groupBy
var rdd=  spark.sparkContext.textFile("file:///D:/people.txt")
.map(_.split(","))
.map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean))

var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true):: new StructField("sex",BooleanType,true)::Nil
spark.createDataFrame(rdd,StructType(fields))
    .select($"age",$"sex")
    .groupBy($"sex")
    .avg("age")
.show()
agg
var rdd=  spark.sparkContext.textFile("file:///D:/people.txt")
.map(_.split(","))
.map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean))

var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true):: new StructField("sex",BooleanType,true)::Nil
import org.apache.spark.sql.functions._
spark.createDataFrame(rdd,StructType(fields))
.select($"age",$"sex",$"salary")
.groupBy($"sex")
.agg(sum($"salary") as "toatalSalary",avg("age") as "avgAge",max($"salary"))
.show()
join

Prepare the following data in dept.txt:

1,销售部门
2,研发部门
3,媒体运营
4,后勤部门

people.txt

Michael,29,2000,true,1
Andy,30,5000,true,1
Justin,19,1000,true,2
Kaine,20,5000,true,2
Lisa,19,1000,false,3
//Michael,29,2000,true,1
var rdd=  spark.sparkContext.textFile("file:///D:/people.txt")
.map(_.split(","))
.map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean,arr(4).trim().toInt))

var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true):: new StructField("sex",BooleanType,true)::
new StructField("deptno",IntegerType,true)::Nil

val user = spark.createDataFrame(rdd,StructType(fields)).as("user")

var dept =  spark.sparkContext.textFile("file:///D:/dept.txt")
.map(line =>(line.split(",")(0).toInt,line.split(",")(1)))
.toDF("deptno","deptname").as("dept")

user.select($"name",$"user.deptno")
.join(dept,$"dept.deptno" === $"user.deptno")
.show()
drop
userDF.select($"deptno",$"salary" )
.groupBy($"deptno")
.agg(sum($"salary") as "总薪资",avg($"salary") as "平均值",max($"salary") as "最大值")
.join(deptDF,$"dept.deptno" === $"user.deptno")
.drop($"dept.deptno")
.show()

orderBy

userDF.select($"deptno",$"salary" )
.groupBy($"deptno")
.agg(sum($"salary") as "总薪资",avg($"salary") as "平均值",max($"salary") as "最大值")
.join(deptDF,$"dept.deptno" === $"user.deptno")
.drop($"dept.deptno")
.orderBy($"总薪资" asc)
.show()
map
userDF.map(row => (row.getString(0),row.getInt(1))).show()

By default Spark SQL serializes the values involved when it executes a query. Encoders for common types are provided by the system; for a type that has no built-in Encoder, you need to declare an implicit Encoder yourself.

implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
userDF.map(row => row.getValuesMap[Any](List("name","age","salary")))
.foreach(map=>{
  var name=map.getOrElse("name","")
  var age=map.getOrElse("age",0)
  var salary=map.getOrElse("salary",0.0)
  println(name+" "+age+" "+salary)
})
flatMap
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
userDF.flatMap(row => row.getValuesMap(List("name","age")))
    .map(item => item._1 +" -> "+item._2)
    .show()
limit(take(n))
var rdd=  spark.sparkContext.textFile("file:///D:/people.txt")
.map(_.split(","))
.map(arr=>Row(arr(0),arr(1).trim().toInt,arr(2).trim().toDouble,arr(3).trim().toBoolean,arr(4).trim().toInt))

var fields=new StructField("name",StringType,true)::new StructField("age",IntegerType,true)::new StructField("salary",DoubleType,true):: new StructField("sex",BooleanType,true)::
new StructField("deptno",IntegerType,true)::Nil

val user = spark.createDataFrame(rdd,StructType(fields)).as("user")

var dept =  spark.sparkContext.textFile("file:///D:/dept.txt")
.map(line =>(line.split(",")(0).toInt,line.split(",")(1)))
.toDF("deptno","deptname").as("dept")

user.select($"name",$"deptno" as "u_dept")
.join(dept,$"dept.deptno" === $"u_dept")
.drop("u_dept")
.orderBy($"deptno" desc)

.limit(3)
.show()

Obtaining a DataFrame with SQL

Michael,29,20000,true,MANAGER,1
Andy,30,15000,true,SALESMAN,1
Justin,19,8000,true,CLERK,1
Kaine,20,20000,true,MANAGER,2
Lisa,19,18000,false,SALESMAN,2
SQL query
package demo2

import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SparkSession}

object TestSQL {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TestSQL").master("local[4]").getOrCreate()
    import spark.implicits._
    val rdd = spark.sparkContext.textFile("file:///E:/people.txt")
      .map(line => {
        val tokens = line.split(",")
        Row(tokens(0), tokens(1).toInt, tokens(2).toDouble, tokens(3).toBoolean, tokens(4), tokens(5).toInt)
      })
    var fields = new StructField("name",StringType,true)::
    new StructField("age",IntegerType,true)::
    new StructField("salary",DoubleType,true)::
    new StructField("sex",BooleanType,true)::
    new StructField("job",StringType,true)::
    new StructField("deptno",IntegerType,true)::Nil

    val userDF = spark.createDataFrame(rdd,StructType(fields))

    //create a temporary view
    userDF.createTempView("t_user")

    spark.sql("select * from t_user where name like '%M%' or salary between 10000 and 20000").show()

    spark.sparkContext.setLogLevel("FATAL")
    spark.stop()
  }
}

group by
package demo2

import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SparkSession}

object TestSQL {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TestSQL").master("local[4]").getOrCreate()
    import spark.implicits._
    val rdd = spark.sparkContext.textFile("file:///E:/people.txt")
      .map(line => {
        val tokens = line.split(",")
        Row(tokens(0), tokens(1).toInt, tokens(2).toDouble, tokens(3).toBoolean, tokens(4), tokens(5).toInt)
      })
    var fields = new StructField("name",StringType,true)::
      new StructField("age",IntegerType,true)::
      new StructField("salary",DoubleType,true)::
      new StructField("sex",BooleanType,true)::
      new StructField("job",StringType,true)::
      new StructField("deptno",IntegerType,true)::Nil

    val userDF = spark.createDataFrame(rdd,StructType(fields))

    //create a temporary view
    userDF.createTempView("t_user")

    spark.sql("select deptno,max(salary),avg(salary),sum(salary),count(1) from t_user group by deptno").show()

    spark.sparkContext.setLogLevel("FATAL")
    spark.stop()
  }
}

Filtering with having
val rdd = spark.sparkContext.textFile("file:///D:/people.txt")
.map(line => {
    val tokens = line.split(",")
    Row(tokens(0), tokens(1).toInt, tokens(2).toDouble, tokens(3).toBoolean, tokens(4), tokens(5).toInt)
})
var fields=new StructField("name",StringType,true)::
new StructField("age",IntegerType,true)::
new StructField("salary",DoubleType,true)::
new StructField("sex",BooleanType,true)::
new StructField("job",StringType,true)::
new StructField("deptno",IntegerType,true)::Nil

val userDF = spark.createDataFrame(rdd,StructType(fields))

//create a temporary view
userDF.createTempView("t_user")

spark.sql("select deptno,max(salary),avg(salary),sum(salary),count(1) total from t_user group by deptno having total > 2 ")
.show()

//turn off Spark logging
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
Table join
val userRdd = spark.sparkContext.textFile("file:///D:/people.txt")
.map(line => {
    val tokens = line.split(",")
    Row(tokens(0), tokens(1).toInt, tokens(2).toDouble, tokens(3).toBoolean, tokens(4), tokens(5).toInt)
})
val deptRdd = spark.sparkContext.textFile("file:///D:/dept.txt")
.map(line => {
    val tokens = line.split(",")
    Row(tokens(0).toInt, tokens(1))
})
var userFields=new StructField("name",StringType,true)::
new StructField("age",IntegerType,true)::
new StructField("salary",DoubleType,true)::
new StructField("sex",BooleanType,true)::
new StructField("job",StringType,true)::
new StructField("deptno",IntegerType,true)::Nil

var deptFields=new StructField("deptno",IntegerType,true)::
new StructField("name",StringType,true)::Nil

spark.createDataFrame(userRdd,StructType(userFields)).createTempView("t_user")
spark.createDataFrame(deptRdd,StructType(deptFields)).createTempView("t_dept")

spark.sql("select u.*,d.name from t_user u left join t_dept d on u.deptno=d.deptno")
.show()

//turn off Spark logging
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
limit
val userRdd = spark.sparkContext.textFile("file:///D:/people.txt")
.map(line => {
    val tokens = line.split(",")
    Row(tokens(0), tokens(1).toInt, tokens(2).toDouble, tokens(3).toBoolean, tokens(4), tokens(5).toInt)
})
val deptRdd = spark.sparkContext.textFile("file:///D:/dept.txt")
.map(line => {
    val tokens = line.split(",")
    Row(tokens(0).toInt, tokens(1))
})
var userFields=new StructField("name",StringType,true)::
new StructField("age",IntegerType,true)::
new StructField("salary",DoubleType,true)::
new StructField("sex",BooleanType,true)::
new StructField("job",StringType,true)::
new StructField("deptno",IntegerType,true)::Nil

var deptFields=new StructField("deptno",IntegerType,true)::
new StructField("name",StringType,true)::Nil

spark.createDataFrame(userRdd,StructType(userFields)).createTempView("t_user")
spark.createDataFrame(deptRdd,StructType(deptFields)).createTempView("t_dept")

spark.sql("select u.*,d.name from t_user u left join t_dept d on u.deptno=d.deptno order by u.age asc limit 8")
.show()

//turn off Spark logging
spark.sparkContext.setLogLevel("FATAL")
spark.stop()
Subquery
val userRdd = spark.sparkContext.textFile("file:///D:/people.txt")
.map(line => {
    val tokens = line.split(",")
    Row(tokens(0), tokens(1).toInt, tokens(2).toDouble, tokens(3).toBoolean, tokens(4), tokens(5).toInt)
})
val deptRdd = spark.sparkContext.textFile("file:///D:/dept.txt")
.map(line => {
    val tokens = line.split(",")
    Row(tokens(0).toInt, tokens(1))
})
var userFields=new StructField("name",StringType,true)::
new StructField("age",IntegerType,true)::
new StructField("salary",DoubleType,true)::
new StructField("sex",BooleanType,true)::
new StructField("job",StringType,true)::
new StructField("deptno",IntegerType,true)::Nil

var deptFields=new StructField("deptno",IntegerType,true)::
new StructField("name",StringType,true)::Nil

spark.createDataFrame(userRdd,StructType(userFields)).createTempView("t_user")
spark.createDataFrame(deptRdd,StructType(deptFields)).createTempView("t_dept")

spark.sql("select * from (select name,age,salary from t_user)")
.show()

//turn off Spark logging
spark.sparkContext.setLogLevel("FATAL")
spark.stop()

Window Functions

In ordinary statistical analysis we normally rely on aggregate functions, whose defining feature is that they collapse n rows into a single row. Databases also support another kind of analysis, windowed statistics: a window function keeps every row and attaches to it a value computed over a window of related rows. If every record returned by a query is thought of as one floor of a tall building, a window function opens a window on each floor, letting that floor see the whole building or a part of it.

  • Query each department's employees together with the department's average salary
Michael,29,20000,true,MANAGER,1
Andy,30,15000,true,SALESMAN,1
Justin,19,8000,true,CLERK,1
Kaine,20,20000,true,MANAGER,2
Lisa,19,18000,false,SALESMAN,2

+-------+---+-------+-----+--------+------+
|   name|age| salary|  sex|     job|deptno|
+-------+---+-------+-----+--------+------+
|Michael| 29|20000.0| true| MANAGER|     1|
|   Jimi| 25|20000.0| true|SALESMAN|     1|
|   Andy| 30|15000.0| true|SALESMAN|     1|
| Justin| 19| 8000.0| true|   CLERK|     1|
|  Kaine| 20|20000.0| true| MANAGER|     2|
|   Lisa| 19|18000.0|false|SALESMAN|     2|
+-------+---+-------+-----+--------+------+
val userRdd = spark.sparkContext.textFile("file:///D:/people.txt")
.map(line => {
    val tokens = line.split(",")
    Row(tokens(0), tokens(1).toInt, tokens(2).toDouble, tokens(3).toBoolean, tokens(4), tokens(5).toInt)
})

var userFields=new StructField("name",StringType,true)::
new StructField("age",IntegerType,true)::
new StructField("salary",DoubleType,true)::
new StructField("sex",BooleanType,true)::
new StructField("job",StringType,true)::
new StructField("deptno",IntegerType,true)::Nil


spark.createDataFrame(userRdd,StructType(userFields)).createTempView("t_user")

spark.sql("select *, avg(salary) over(partition by deptno) as avgSalary from t_user")
.show()

//turn off Spark logging
spark.sparkContext.setLogLevel("FATAL")
spark.stop()

Result:

+-------+---+-------+-----+--------+------+------------------+
|   name|age| salary|  sex|     job|deptno|         avgSalary|
+-------+---+-------+-----+--------+------+------------------+
|Michael| 29|20000.0| true| MANAGER|     1|14333.333333333334|
|   Andy| 30|15000.0| true|SALESMAN|     1|14333.333333333334|
| Justin| 19| 8000.0| true|   CLERK|     1|14333.333333333334|
|  Kaine| 20|20000.0| true| MANAGER|     2|           19000.0|
|   Lisa| 19|18000.0|false|SALESMAN|     2|           19000.0|
+-------+---+-------+-----+--------+------+------------------+
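
The same statistic can also be expressed with the DataFrame API through a window specification (a sketch, assuming a userDF DataFrame built from the same rows and schema as the t_user view, with spark.implicits._ in scope):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

val byDept = Window.partitionBy($"deptno")          // one window per department
userDF.withColumn("avgSalary", avg($"salary").over(byDept))
  .show()
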
ROW_NUMBER()
  • Rank employees by salary within their department
spark.sql("select * , ROW_NUMBER() over(partition by deptno order by salary DESC) as rank from t_user")
      .show()
+-------+---+-------+-----+--------+------+----+
|   name|age| salary|  sex|     job|deptno|rank|
+-------+---+-------+-----+--------+------+----+
|Michael| 29|20000.0| true| MANAGER|     1|   1|
|   Andy| 30|15000.0| true|SALESMAN|     1|   2|
| Justin| 19| 8000.0| true|   CLERK|     1|   3|
|  Kaine| 20|20000.0| true| MANAGER|     2|   1|
|   Lisa| 19|18000.0|false|SALESMAN|     2|   2|
+-------+---+-------+-----+--------+------+----+
  • Rank employees by salary across the whole company
spark.sql("select * , ROW_NUMBER() over(order by salary DESC) as rank from t_user")
      .show()
+-------+---+-------+-----+--------+------+----+
|   name|age| salary|  sex|     job|deptno|rank|
+-------+---+-------+-----+--------+------+----+
|Michael| 29|20000.0| true| MANAGER|     1|   1|
|  Kaine| 20|20000.0| true| MANAGER|     2|   2|
|   Lisa| 19|18000.0|false|SALESMAN|     2|   3|
|   Andy| 30|15000.0| true|SALESMAN|     1|   4|
| Justin| 19| 8000.0| true|   CLERK|     1|   5|
+-------+---+-------+-----+--------+------+----+

The ROW_NUMBER() function only assigns each row its position in the order defined by the window; it does not compute a real ranking, because tied values still receive different numbers.

DENSE_RANK()
  • Rank employees by salary across the company
val sql="select * , DENSE_RANK() over(order by salary DESC)  rank  from t_emp"
spark.sql(sql).show()
+-------+---+-------+-----+--------+------+----+
|   name|age| salary|  sex|     job|deptno|rank|
+-------+---+-------+-----+--------+------+----+
|Michael| 29|20000.0| true| MANAGER|     1|   1|
|   Jimi| 25|20000.0| true|SALESMAN|     1|   1|
|  Kaine| 20|20000.0| true| MANAGER|     2|   1|
|   Lisa| 19|18000.0|false|SALESMAN|     2|   2|
|   Andy| 30|15000.0| true|SALESMAN|     1|   3|
| Justin| 19| 8000.0| true|   CLERK|     1|   4|
+-------+---+-------+-----+--------+------+----+
  • Rank employees by salary within their department
val sql="select * , DENSE_RANK() over(partition by deptno order by salary DESC)  rank  from t_emp"
spark.sql(sql).show()
+-------+---+-------+-----+--------+------+----+
|   name|age| salary|  sex|     job|deptno|rank|
+-------+---+-------+-----+--------+------+----+
|Michael| 29|20000.0| true| MANAGER|     1|   1|
|  Kaine| 20|20000.0| true| MANAGER|     2|   1|
|   Lisa| 19|18000.0|false|SALESMAN|     2|   2|
|   Andy| 30|15000.0| true|SALESMAN|     1|   3|
| Justin| 19| 8000.0| true|   CLERK|     1|   4|
+-------+---+-------+-----+--------+------+----+
RANK()

This function is similar to DENSE_RANK(); the difference is that the ranks computed by RANK() are not consecutive, so after a tie the next rank skips ahead.

  • Rank employees by salary within their department
val sql="select * , RANK() over(partition by deptno order by salary DESC)  rank  from t_user"
spark.sql(sql).show()
+-------+---+-------+-----+--------+------+----+
|   name|age| salary|  sex|     job|deptno|rank|
+-------+---+-------+-----+--------+------+----+
|Michael| 29|20000.0| true| MANAGER|     1|   1|
|  Kaine| 20|20000.0| true| MANAGER|     2|   1|
|   Lisa| 19|18000.0|false|SALESMAN|     2|   3|
|   Andy| 30|15000.0| true|SALESMAN|     1|   4|
| Justin| 19| 8000.0| true|   CLERK|     1|   5|
+-------+---+-------+-----+--------+------+----+

User-Defined Functions

Scalar (single-row) functions

package demo3
import org.apache.spark.sql.types._
import org.apache.spark.sql.{Row, SparkSession}

object TestSQL {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TestSQL").master("local[4]").getOrCreate()
    import spark.implicits._
    val rdd = spark.sparkContext.textFile("file:///E:/people.txt")
      .map(line => {
        val tokens = line.split(",")
        Row(tokens(0), tokens(1).toInt, tokens(2).toDouble, tokens(3).toBoolean, tokens(4), tokens(5).toInt)
      })
    var fields = new StructField("name",StringType,true)::
      new StructField("age",IntegerType,true)::
      new StructField("salary",DoubleType,true)::
      new StructField("sex",BooleanType,true)::
      new StructField("job",StringType,true)::
      new StructField("deptno",IntegerType,true)::Nil
    val frame = spark.createDataFrame(rdd,StructType(fields))
    //create a temporary view
    frame.createTempView("t_user")

    spark.udf.register("yearSalary",(job:String,salary:Double)=>{
      job match {
        case "MANAGER" =>salary*14
        case "SALESMAN" => salary*16
        case "CLERK" => salary*13
        case _ => salary*12
      }
    })
    spark.sql("select name,salary,yearSalary(job,salary) as yearSalary from t_user").show()

    spark.sparkContext.setLogLevel("FATAL")
    spark.stop()
  }
}
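
The same logic can also be wrapped with org.apache.spark.sql.functions.udf and applied through the DataFrame API instead of SQL (a sketch, reusing the frame DataFrame and the spark.implicits._ import from the example above):

import org.apache.spark.sql.functions.udf

// column function equivalent to the registered "yearSalary" UDF above
val yearSalary = udf((job: String, salary: Double) => job match {
  case "MANAGER"  => salary * 14
  case "SALESMAN" => salary * 16
  case "CLERK"    => salary * 13
  case _          => salary * 12
})

frame.select($"name", $"salary", yearSalary($"job", $"salary") as "yearSalary").show()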

Aggregate functions

Untyped aggregation (Spark SQL)
  • order.txt
1,苹果,4.5,2,001
2,橘子,2.5,5,001
3,机械键盘,800,1,002
  • MySumAggregateFunction
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, DoubleType, IntegerType, StructType}

class MySumAggregateFunction extends UserDefinedAggregateFunction{
  //input schema of the aggregate function; the field names themselves do not matter
  override def inputSchema: StructType = {
    new StructType().add("price",DoubleType).add("count",IntegerType)
  }
  //schema of the intermediate aggregation buffer
  override def bufferSchema: StructType = {
      new StructType().add("totalCost",DoubleType)
  }
  //type of the final result value
  override def dataType: DataType = DoubleType
  //usually needs no special handling; just return true
  override def deterministic: Boolean = true
  //set the initial value of the aggregation buffer
  override def initialize(buffer: MutableAggregationBuffer): Unit = {
    //initialize the first buffer field to 0
    buffer.update(0,0.0)
  }
  //partial (per-partition) aggregation
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val price = input.getAs[Double](0)
    val count = input.getAs[Int](1)
    val historyCost = buffer.getDouble(0)
    buffer.update(0,historyCost+(price*count))
  }
  //merge partial results; the merged value must be written back into buffer1
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    var totalCost=buffer1.getDouble(0)+buffer2.getDouble(0)
    buffer1.update(0,totalCost)
  }
  //produce the final result
  override def evaluate(buffer: Row): Any = {
    buffer.getDouble(0)
  }
}
  • Total spend per userid
 case class OrderLog(price: Double, count: Int,userid:String)
  def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
          .appName("hellosql")
          .master("local[10]")
          .getOrCreate()
    import spark.implicits._
    

    var orderDF=  spark.sparkContext.textFile("file:///D:/order.txt")
      .map(_.split(","))
      .map(arr=> OrderLog(arr(2).toDouble,arr(3).toInt,arr(4)))
      .toDF().createTempView("t_order")

    spark.udf.register("customsum",new MySumAggregateFunction())
    spark.sql("select userid , customsum(price,count) as totalCost from t_order group by userid").show()

    //turn off Spark logging
    spark.sparkContext.setLogLevel("FATAL")
    spark.stop()
  }
Typed / strongly-typed aggregation (DataFrame API)
  • AverageState
case class AverageState(var sum: Double, var total: Int)
  • MyAggregator
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.expressions.Aggregator

class MyAggregator extends Aggregator[GenericRowWithSchema,AverageState,Double] {
  //initial value of the buffer
  override def zero: AverageState = AverageState(0.0,0)

  //partial (per-partition) reduce
  override def reduce(b: AverageState, a: GenericRowWithSchema): AverageState ={
      var sum=b.sum + a.getAs[Int]("count") * a.getAs[Double]("price")
      var count=b.total+1
      b.copy(sum,count)
  }
  //merge partial buffers
  override def merge(b1: AverageState, b2: AverageState): AverageState = {
    b1.copy(b1.sum+b2.sum,b1.total+b2.total)
  }
  //produce the final result
  override def finish(reduction: AverageState): Double = {
    reduction.sum/reduction.total
  }
  //encoder for the intermediate buffer
  override def bufferEncoder: Encoder[AverageState] = {
    Encoders.product[AverageState]
  }
  //encoder for the final result
  override def outputEncoder: Encoder[Double] = {
    Encoders.scalaDouble
  }
}
  • Usage
 var orderDF=  spark.sparkContext.textFile("file:///D:/order.txt")
      .map(_.split(","))
      .map(arr=> OrderLog(arr(2).toDouble,arr(3).toInt,arr(4)))
      .toDF()
    val avg = new MyAggregator().toColumn.name("avgCost")

    orderDF.groupBy($"userid").agg(avg).show()
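
Alternatively, the Aggregator can be parameterized with the case class itself and applied to a typed Dataset via groupByKey, which avoids dealing with GenericRowWithSchema (a sketch under the same assumptions: the OrderLog and AverageState case classes above, an existing SparkSession named spark, and spark.implicits._ in scope):

import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

class MyTypedAggregator extends Aggregator[OrderLog, AverageState, Double] {
  override def zero: AverageState = AverageState(0.0, 0)
  override def reduce(b: AverageState, a: OrderLog): AverageState =
    b.copy(b.sum + a.price * a.count, b.total + 1)                 // per-partition reduce
  override def merge(b1: AverageState, b2: AverageState): AverageState =
    b1.copy(b1.sum + b2.sum, b1.total + b2.total)                  // merge partial buffers
  override def finish(r: AverageState): Double = r.sum / r.total   // final result
  override def bufferEncoder: Encoder[AverageState] = Encoders.product[AverageState]
  override def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val orderDS = spark.sparkContext.textFile("file:///D:/order.txt")
  .map(_.split(","))
  .map(arr => OrderLog(arr(2).toDouble, arr(3).toInt, arr(4)))
  .toDS()

orderDS.groupByKey(_.userid)
  .agg(new MyTypedAggregator().toColumn.name("avgCost"))
  .show()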