一、DataFrame vs. RDD
1. Similarities: both are distributed datasets and both are suited to large-scale data processing.
2. Differences:
(1) An RDD has no knowledge of the internal structure of its records.
(2) A DataFrame carries detailed structural information (a schema), much like a database table.
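A minimal sketch of the difference, assuming a local SparkSession named spark (the Person case class and sample rows are made up for illustration): Spark only sees opaque Person objects inside the RDD, while the DataFrame also knows the column names and types.
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("RddVsDataFrame").master("local[2]").getOrCreate()
import spark.implicits._

val personRDD = spark.sparkContext.parallelize(Seq(Person("Tom", 20), Person("Amy", 25)))
//the RDD holds Person objects, but Spark has no column-level information about them
personRDD.foreach(println)

val personDF = personRDD.toDF()
//the DataFrame exposes column names and types through its schema
personDF.printSchema()
//root
// |-- name: string (nullable = true)
// |-- age: integer (nullable = false)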
二、DataFrame Data Sources and Basic API Operations
1. DataFrame data sources
See the official docs: http://spark.apache.org/docs/2.1.0/sql-programming-guide.html#datasets-and-dataframes
Loading data into a DataFrame:
val spark = SparkSession.builder().appName("DataFrameApp").master("local[2]").getOrCreate()
//load a JSON file as a DataFrame
val peopleDF = spark.read.format("json").load("datas/people.json")
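The same read API covers the other built-in sources as well; a brief sketch (the parquet and text file paths below are hypothetical):
//shorthand equivalent of the format("json").load(...) call above
val peopleDF2 = spark.read.json("datas/people.json")
//other built-in Spark 2.x sources include parquet, csv, text, jdbc, ...
val usersDF = spark.read.parquet("datas/users.parquet")
val linesDF = spark.read.text("datas/people.txt")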
2. Basic API operations
(1) printSchema: print the schema of the DataFrame
peopleDF.printSchema()
(2) show: display the first 20 rows of the DataFrame
peopleDF.show() //20 rows by default
//or show 30 rows
peopleDF.show(30)
(3) select: query all values of a given column
peopleDF.select("name").show()
(4) col: return the Column object for the given column name, so it can be used in expressions
//query several columns and apply a calculation to one of them
peopleDF.select(peopleDF.col("name"),(peopleDF.col("age")+10).as("age2")).show()
(5) where/filter: filter rows by the value of a column
peopleDF.filter(peopleDF.col("age")>19).show()
or
//people whose name starts with 'M'
studentDF.filter("SUBSTR(name,0,1)='M'").show
(6) groupBy: group by a column and then apply aggregation operations
peopleDF.groupBy("age").count().show()
(7) sort/orderBy: global sort, similar to ORDER BY in Hive; sorts the entire dataset by the given columns
//sort by salary descending, then by age ascending
PersonDF.sort($"salary".desc,$"empAge".asc).select("empName","salary","empAge").show
Or (verified on the server):
studentDF.sort("name","id").show
or
studentDF.sort(studentDF("name").asc, studentDF("id").desc).show
(8) limit: return the first N rows; a quick example is shown below
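A one-line sketch using the peopleDF loaded above (the row count 5 is arbitrary):
peopleDF.limit(5).show()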
(9) sortWithinPartitions: sort the data within each partition (local sort)
PersonDF.sortWithinPartitions($"salary".desc,$"empAge".asc)
.select("empName","salary","empAge").show
3. Code
(1) Example 1
package _0729DF
import _0728sql.Avg_UDAF
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.functions._
/**
 * Spark DSL syntax
 * Spark SQL exists mainly to reach a wider audience, and SQL is cheaper to maintain.
 *
 * 1. For easier maintenance, I recommend SQL.
 * 2. But SQL gives you less fine-grained control
 *    (e.g. under data skew: joining a large table with a small one, broadcast variables, filter).
 * 3. Prefer SQL where you can; if execution time needs tuning, rewrite parts of it in the DSL.
 * 4. Spark is fast, but very sensitive to memory; when using Spark, data skew is often the bigger concern.
 */
case class Person1(empName:String,empAge:Int,empGender:String,salary:Float,deptNo:Int)
case class Dept1(deptNo:Int,deptName:String)
object DSLdemo extends App{
// 1. create the contexts
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("DSLdemo")
val sc = SparkContext.getOrCreate(conf)
val sqlContext = new HiveContext(sc)
sqlContext.udf.register("gender2Num",(gender:String) => {
gender.toUpperCase() match {
case "M" => 0
case "F" => 1
case _ => -1
}
})
val rddp = sc.parallelize(Array(
Person1("张三",21,"M",3231.2f,1),
Person1("里斯",23,"M",3412.2f,1),
Person1("王五",52,"M",3562.2f,1),
Person1("赵六",21,"M",5655.2f,1),
Person1("小花",23,"F",6574.2f,2),
Person1("小红",21,"F",4547.2f,1),
Person1("小明",61,"M",3231.2f,2),
Person1("mary",43,"F",4354.2f,1),
Person1("leo",53,"M",3231.2f,1),
Person1("jack",12,"M",1111.2f,3),
Person1("jane",34,"F",2624.2f,1),
Person1("lili",41,"F",6754.2f,3)
))
val rddd = sc.parallelize(Array(
Dept1(1,"部门1"),
Dept1(2,"部门2"),
Dept1(4,"部门4")
))
import sqlContext.implicits._
val deptDF: DataFrame = rddd.toDF()
val PersonDF: DataFrame = rddp.toDF()
PersonDF.registerTempTable("p")
deptDF.cache()
PersonDF.cache()
//DSL
println("===========================DSL===============================")
/**
 * select empName, XXX, XXX from PersonDF
 *
 * select is usually used to pick out the columns you need,
 * to re-arrange the column order (with a Dataset[Person] the column order may change),
 * and to apply functions inside the select: case when, substring, ...
 */
//option 1: pass column names as strings
PersonDF.select("empName","empAge","empGender").show()
//option 2: pass Column objects
//$"..." (string-to-column) comes from import sqlContext.implicits._,
//while col(...) requires import org.apache.spark.sql.functions._
PersonDF.select($"empName" as("empName1"),$"empAge" as("empAge1"),$"empGender" as("empGender1")).show()
PersonDF.select(col("empName").as("empName2"),
col("empAge").as("empAge2"),
col("empGender").as("empGender2")).show()
//option 3: call a registered UDF via selectExpr:
PersonDF.selectExpr("empName","empAge","gender2Num(empGender) as sex").show()
println("===========================where/filter======================================")
PersonDF.where("empAge > 30 AND empGender = 'M' AND deptNo = 1").show
PersonDF.where("empAge > 30").where("empGender = 'M' AND deptNo = 1").show
PersonDF.filter($"empAge" > 30 && $"empGender" === "F").show
println("=============================sort===========================================")
//covers both global sorting and per-partition (local) sorting
//global sort
PersonDF.sort("salary").select("empName","salary").show
PersonDF.sort($"salary".desc).select("empName","salary").show
//sort by salary descending, then by age ascending
PersonDF.sort($"salary".desc,$"empAge".asc).select("empName","salary","empAge").show
println("============================order by=========================================")
PersonDF.repartition(5).orderBy($"salary".desc,$"empAge".asc).select("empName","salary","empAge").show
//local (per-partition) sort
println("============================sortWithinPartitions===============================")
PersonDF.sortWithinPartitions($"salary".desc,$"empAge".asc)
.select("empName","salary","empAge").show
//aggregation operations
println("============================group by==================================")
//the following has a bug: applying two aggregations to the same column fails
//this is an issue in Spark 1.6; newer Spark versions do not have it
PersonDF.groupBy("empGender")
.agg(
"salary" -> "avg",
"salary" -> "sum"
).show
//workaround:
PersonDF.groupBy("empGender")
.agg(
avg("salary").as("avg1"),
sum("salary").as("sum1"),
count(lit(1)).as("cnt")
).show
//or use SQL directly
sqlContext.sql(
"""
|select empGender,sum(salary) as sum_sql,avg(salary) as avg_sql
|from p
|group by empGender
""".stripMargin).show
println("==============================自定义函数===================================")
//call the custom aggregate function registered above (Avg_UDAF)
sqlContext.udf.register("self_avg",Avg_UDAF)
PersonDF.groupBy("empGender")
.agg(
"salary" -> "self_avg"
).withColumn("test",lit(0))
.show
println("==============================limit===================================")
PersonDF.limit(4).show()
println("==============================join==========================================")
/**
 * left join, right join, inner join, full outer join, left semi join
 */
//this would fail: which table does deptno come from? (the column is ambiguous)
//PersonDF.join(deptDF,$"deptno"===$"deptno").show
PersonDF.join(deptDF,"deptno").show
//if both sides keep a column with the same name and it becomes ambiguous later,
//select the specific column through its DataFrame, or rename the columns (which can get verbose)
PersonDF.join(deptDF,PersonDF.col("deptno")===deptDF.col("deptno"))
//.toDF("newCol1Name","newCol2Name")
.select(PersonDF.col("deptno"),PersonDF.col("salary")).show
deptDF.toDF("d1","dname").join(PersonDF,$"deptno"===$"d1").show
println("--------------------------左右关联等-------------------------------")
PersonDF.join(deptDF,Seq("deptNo"),"left").show
PersonDF.join(deptDF,Seq("deptNo"),"right").show
PersonDF.join(deptDF,Seq("deptNo"),"full").show
//select * from deptDF where deptNo in (select deptNo from PersonDF)
deptDF.join(PersonDF,Seq("deptNo"),"leftsemi").show
println("=============================casewhen====================================")
/**
 * select *,
 *   case when salary <= 2000 then "底薪"
 *        when salary > 2000 and salary <= 4000 then "中等水平"
 *        else "高薪"
 *   end as salarylevel
 * from PersonDF
 *
 * //based on known columns, tag each record with a finer-grained label
 */
PersonDF.select(PersonDF.col("empName"),PersonDF.col("salary"),
when(PersonDF.col("salary") <= 2000,"底薪")
.when(PersonDF.col("salary") > 2000 && PersonDF.col("salary") <= 4000,"中等水平")
.otherwise("高薪").as("salarylevel")
).show()
println("========================窗口函数====================================")
/**
 * 1. Window functions require HiveContext.
 * 2. To use HiveContext, add the JVM options: -XX:PermSize=128M -XX:MaxPermSize=256M
 * 3. Group by department and take the top 3 salaries within each department
 *    (this can be done with row_number, or with union all):
 * select * from
 *   (select *,
 *      row_number() over (partition by deptno order by salary desc) as rnk
 *    from PersonDF) a
 * where rnk <= 3
 *
 * select * from PersonDF where deptno = 1 order by salary desc limit 3
 * union all
 * select * from PersonDF where deptno = 2 order by salary desc limit 3
 * union all
 * select * from PersonDF where deptno = 3 order by salary desc limit 3
 */
val w = Window.partitionBy("deptNo").orderBy($"salary".desc,$"empAge".asc)
PersonDF.select($"empName",$"empAge",$"deptNo",$"salary",
row_number().over(w).as("rnk")
).where($"rnk".leq(3)).show
// println("=========================union ALL============================")
// PersonDF.select($"empName",$"empAge",$"deptNo",$"salary"
// ).where($"deptNo" === 1).sort($"salary".desc,$"empAge".asc).limit(3)
// .unionAll(PersonDF.select($"empName",$"empAge",$"deptNo",$"salary"
// ).where($"deptNo" === 2).sort($"salary".desc,$"empAge".asc).limit(3))
// .unionAll(PersonDF.select($"empName",$"empAge",$"deptNo",$"salary"
// ).where($"deptNo" === 3).sort($"salary".desc,$"empAge".asc).limit(3)).show
// //Thread.sleep(1232323l)
// deptDF.unpersist()
// PersonDF.unpersist()
}
(2) Example 2
package MoocSparkSQL
import org.apache.spark.sql.SparkSession
/**
 * Common operations on a DataFrame
 */
object DataFrameCase {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder().appName("DataFrameRDDApp").master("local[2]").getOrCreate()
// RDD ==> DataFrame
val rdd = spark.sparkContext.textFile("datas/student.data") //this file is not included! 20181001
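//(the student.data file is not part of these notes; it is assumed here, hypothetically,
// to contain "|"-delimited lines of id|name|phone|email, e.g. 1|Anna|13800000000|anna@example.com)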
//note: the implicit conversions must be imported first
import spark.implicits._
val studentDF = rdd.map(_.split("\\|")).map(line => Student(line(0).toInt, line(1), line(2), line(3))).toDF()
//show displays only the first 20 rows by default
studentDF.show
studentDF.show(30)
studentDF.show(30, false)
studentDF.take(10)
studentDF.first()
studentDF.head(3)
studentDF.select("email").show(30,false)
studentDF.filter("name=''").show
studentDF.filter("name='' OR name='NULL'").show
//people whose name starts with 'M'
studentDF.filter("SUBSTR(name,0,1)='M'").show
studentDF.sort(studentDF("name")).show
studentDF.sort(studentDF("name").desc).show
studentDF.sort("name","id").show
studentDF.sort(studentDF("name").asc, studentDF("id").desc).show
studentDF.select(studentDF("name").as("student_name")).show
val studentDF2 = rdd.map(_.split("\\|")).map(line => Student(line(0).toInt, line(1), line(2), line(3))).toDF()
studentDF.join(studentDF2, studentDF.col("id") === studentDF2.col("id")).show
spark.stop()
}
case class Student(id: Int, name: String, phone: String, email: String)
}