Spark Data Sources
In Spark SQL, a wide variety of data sources can be used.
Note: load and save default to Parquet files, a columnar storage format.
Example:
Read the users.parquet file:
val userDF = spark.read.load("/usr/local/tmp_files/users.parquet")
userDF.printSchema
userDF.show
val userDF = spark.read.load("/usr/local/tmp_files/emp.json")
Save a Parquet file:
userDF.select($"name",$"favorite_color").write.save("/usr/local/tmp_files/parquet")
Read back the file just written:
val userDF1 = spark.read.load("/usr/local/tmp_files/parquet/part-00000-1ab4e661-32c6-441a-b320-79d")   ---> not recommended (points at a single part file)
In production:
val userDF2 = spark.read.load("/usr/local/tmp_files/parquet")
This reads all the files under the parquet directory at once.
Reading a JSON file requires specifying the format.
Two ways:
val userDF = spark.read.format("json").load("/usr/local/tmp_files/emp.json")
val userDF3 = spark.read.json("/usr/local/tmp_files/emp.json")
About the save function:
When calling save, you can specify a save mode: append, overwrite, and so on.
userDF2.write.save("/usr/local/tmp_files/parquet")
If you then run the same save again:
userDF2.write.save("/usr/local/tmp_files/parquet")
an error is reported:
org.apache.spark.sql.AnalysisException: path file:/usr/local/tmp_files/parquet already exists.;
To overwrite when saving:
userDF2.write.mode("overwrite").save("/usr/local/tmp_files/parquet")
Save the result as a table:
userDF2.select($"name").write.saveAsTable("table1")
scala> userDF.select($"name").write.saveAsTable("table2")
scala> spark.sql("select * from table2").show
+------+
| name|
+------+
|Alyssa|
| Ben|
+------+
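The saved table can also be read back through the catalog; a minimal sketch, assuming the table1 created above:
spark.table("table1").show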
Schema merging is supported.
At the start of a project the table structure is simple, so the schema is simple;
as the project grows, the schema becomes more and more complex.
Example:
Create a DataFrame from an RDD:
val df1 = sc.makeRDD(1 to 5).map( i => (i,i*2)).toDF("single","double")
"single","double" 是表结构
df1.show
df1.write.mode("overwrite").save("/usr/local/tmp_files/test_table/key=1")
val df2 = sc.makeRDD(6 to 10).map( i => (i,i*3)).toDF("single","triple")
df2.show
df2.write.mode("overwrite").save("/usr/local/tmp_files/test_table/key=2")
Merge the two parts:
val df3 = spark.read.parquet("/usr/local/tmp_files/test_table")                               ---> without the mergeSchema option, the schemas are not merged
val df3 = spark.read.option("mergeSchema",true).parquet("/usr/local/tmp_files/test_table")    ---> merges the schemas of both parts
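As a quick check (a sketch reusing df3 from above), printing the schema should show the union of the columns from both parts plus the partition column key:
df3.printSchema
df3.show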
JSON files
Read a JSON file and generate a DataFrame:
val peopleDF = spark.read.json("/usr/local/tmp_files/people.json")
peopleDF.printSchema
peopleDF.createOrReplaceTempView("peopleView")
spark.sql("select * from peopleView").show
Spark SQL provides a unified access interface: for different data sources, once the data has been read in and a DataFrame has been produced, the operations are exactly the same.
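For example, a minimal sketch assuming a hypothetical people.csv file with a header row: only the reading call changes, everything afterwards is identical:
val peopleCsvDF = spark.read.format("csv").option("header", true).load("/usr/local/tmp_files/people.csv")
peopleCsvDF.createOrReplaceTempView("peopleCsvView")
spark.sql("select * from peopleCsvView").show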
JDBC
Use JDBC to access relational databases and load the data into Spark for analysis and processing.
Method 1:
val mysqlDF = spark.read.format("jdbc")
.option("url","jdbc:mysql://192.168.109.1:3306/company?serverTimezone=UTC&characterEncoding=utf-8")
.option("user","root")
.option("password","123456")
.option("driver","com.mysql.jdbc.Driver")
.option("dbtable","emp").load
mysqlDF.show
Method 2:
Define a Properties object:
import java.util.Properties
val mysqlProps = new Properties()
mysqlProps.setProperty("user","root")
mysqlProps.setProperty("password","123456")
val mysqlDF1 = spark.read.jdbc("jdbc:mysql://192.168.109.1:3306/company?serverTimezone=UTC&characterEncoding=utf-8","emp",mysqlProps)
mysqlDF1.show
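Writing back over JDBC is symmetric; a sketch, assuming a hypothetical target table emp_copy in the same database and the mysqlProps defined above:
mysqlDF1.write.mode("overwrite").jdbc("jdbc:mysql://192.168.109.1:3306/company?serverTimezone=UTC&characterEncoding=utf-8","emp_copy",mysqlProps)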
Using MySQL as a data source:
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object JdbcSource {
  def main(args: Array[String]): Unit = {
    // 1. Create a SparkSession for Spark SQL
    val sparkSession: SparkSession = SparkSession.builder().appName("JdbcSource")
      .master("local[2]").getOrCreate()

    // 2. Load the data source over JDBC (ODBC would work similarly)
    val urlData: DataFrame = sparkSession.read.format("jdbc").options(Map(
      "url" -> "jdbc:mysql://localhost:3306/urlcount",
      "driver" -> "com.mysql.jdbc.Driver",
      "dbtable" -> "url_data",
      "user" -> "root",
      "password" -> "root"
    )).load()

    // For testing:
    // urlData.printSchema()
    // urlData.show()

    // 3. Filter the data: keep rows with uid > 2
    val fData: Dataset[Row] = urlData.filter(x => {
      // Column 0 is uid because it is the first column in the url_data table's schema
      x.getAs[Int](0) > 2
    })
    fData.show()

    sparkSession.stop()
  }
}
Using Hive
This is fairly common.
(*) Spark SQL is fully compatible with Hive
(*) Some configuration is required
Copy the following files into the spark/conf directory:
Hive configuration file: hive-site.xml
Hadoop configuration files: core-site.xml hdfs-site.xml
After configuring, restart Spark.
Start Hadoop and Hive.
spark.sql("create table comany.emp_0410(empno Int,ename String,job String,mgr String,hiredate String,sal Int,comm String,deptno Int)row format delimited fields terminated by ','")
Output (write) formats:
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

object JdbcSource1 {
  def main(args: Array[String]): Unit = {
    // 1. Create a SparkSession for Spark SQL
    val sparkSession: SparkSession = SparkSession.builder().appName("JdbcSource")
      .master("local[2]").getOrCreate()
    import sparkSession.implicits._

    // 2. Load the data source
    val urlData: DataFrame = sparkSession.read.format("jdbc").options(Map(
      "url" -> "jdbc:mysql://localhost:3306/urlcount",
      "driver" -> "com.mysql.jdbc.Driver",
      "dbtable" -> "url_data",
      "user" -> "root",
      "password" -> "root"
    )).load()

    // 3. Keep rows with uid > 2 and select two columns
    val r: Dataset[Row] = urlData.filter($"uid" > 2)
    val rs: DataFrame = r.select($"xueyuan", $"number_one")

    // Write as plain text (the text source accepts only a single string column)
    // rs.write.text("d:/saveText")
    // Write as JSON
    // rs.write.json("d:/saveJson")
    // Write as CSV
    // rs.write.csv("d:/saveCsv")
    // Write as Parquet
    rs.write.parquet("d:/savePar")
    rs.show()

    sparkSession.stop()
  }
}
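To check the result, the Parquet output can be read back before sparkSession.stop() (a sketch reusing the d:/savePar directory written above):
val saved: DataFrame = sparkSession.read.parquet("d:/savePar")
saved.show()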