Contents
1. Using SQLContext
In Spark 1.x, the entry point to Spark SQL is the SQLContext.
The entry point into all functionality in Spark SQL is the SQLContext class, or one of its descendants. To create a basic SQLContext, all you need is a SparkContext.
SQLContextApp.scala
package cn.ljj.spark

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Using SQLContext
 */
object SQLContextApp {

  def main(args: Array[String]): Unit = {
    val path = args(0)

    // 1) Create the corresponding context
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)
    val sqlContext = new SQLContext(sc)

    // 2) Do the processing: read a JSON file
    val people = sqlContext.read.format("json").load(path)
    people.printSchema()
    people.show()

    // 3) Release resources
    sc.stop()
  }
}
Submit and run
spark-submit \
--name SQLContextApp \
--class cn.ljj.spark.SQLContextApp \
--master local[2] \
/usr/local/mylib/sql-1.0.jar \
/usr/local/app/spark-2.2.0/examples/src/main/resources/people.json
2. Using HiveContext
In Spark 1.x, another entry point to Spark SQL is HiveContext.
In addition to the basic SQLContext, you can also create a HiveContext, which provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, you do not need to have an existing Hive setup, and all of the data sources available to a SQLContext are still available.
HiveContextApp.scala
package cn.ljj.spark

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

/**
 * Using HiveContext
 * When submitting, pass the MySQL driver JAR onto the classpath via --jars
 */
object HiveContextApp {

  def main(args: Array[String]): Unit = {
    // 1) Create the corresponding context
    val sparkConf = new SparkConf()
    val sc = new SparkContext(sparkConf)
    val hiveContext = new HiveContext(sc)

    // 2) Do the processing: load a Hive table
    hiveContext.table("student").show

    // 3) Release resources
    sc.stop()
  }
}
Build
mvn clean package -DskipTests
Submit and run
spark-submit \
--name HiveContextApp \
--class cn.ljj.spark.HiveContextApp \
--master local[2] \
--jars /usr/local/jars/mysql-connector-java-5.1.27-bin.jar \
/usr/local/mylib/sql-1.0.jar
3. Using SparkSession
In Spark 2.x, the entry point to Spark SQL is SparkSession.
The entry point into all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder():
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
SparkSessionApp.scala
package cn.ljj.spark

import org.apache.spark.sql.SparkSession

/**
 * Using SparkSession
 */
object SparkSessionApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSessionApp").master("local[2]").getOrCreate()

    val people = spark.read.json("/usr/local/app/spark-2.2.0/examples/src/main/resources/people.json")
    people.show()

    spark.stop()
  }
}
4. spark-shell
1) Copy hive-site.xml into spark/conf/ so that Spark can access Hive.
2) Start spark-shell
spark-shell --master local[2]
or
spark-shell --jars <jar file> --master local[2]
3) Usage
spark.sql("<SQL statement>").show
4) Start spark-sql
spark-sql --jars <jar file> --master local[2]
Usage: type SQL statements directly.
5) Using thriftserver/beeline (the default port is 10000)
./start-thriftserver.sh \
--master local[2] \
--jars <jar file> \
--hiveconf hive.server2.thrift.port=14000
# Connect to the thriftserver with beeline
beeline -u jdbc:hive2://localhost:14000 -n bigdata
6) How does the thriftserver differ from plain spark-shell/spark-sql?
- Each spark-shell or spark-sql session is a separate Spark application.
- The thriftserver is always a single Spark application no matter how many clients (beeline/code) connect to it; this solves the data-sharing problem, since multiple clients can share the same data.
5. Programming against the thriftserver
1) Add the dependency to pom.xml
<!-- JDBC dependency for accessing the thriftserver -->
<dependency>
  <groupId>org.spark-project.hive</groupId>
  <artifactId>hive-jdbc</artifactId>
  <version>1.2.1.spark2</version>
</dependency>
2)SparkSQLThriftServerApp.scala
package com.ljj.spark

import java.sql.DriverManager

/**
 * Access Spark SQL through JDBC
 */
object SparkSQLThriftServerApp {

  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    val conn = DriverManager.getConnection("jdbc:hive2://localhost:14000", "root", "")
    val pstmt = conn.prepareStatement("select name,age,score from student")
    val rs = pstmt.executeQuery()
    while (rs.next()) {
      println("name: " + rs.getString("name") +
        ", age: " + rs.getInt("age") +
        ", score: " + rs.getDouble("score"))
    }

    rs.close()
    pstmt.close()
    conn.close()
  }
}
3) Start the thriftserver
start-thriftserver.sh \
--master local[2] \
--jars /usr/local/jars/mysql-connector-java-5.1.27-bin.jar \
--hiveconf hive.server2.thrift.port=14000
4) Run the program
5) Check the result
6. DataFrame & Dataset
1 Overview
A Dataset is a distributed collection of data.
A DataFrame is a Dataset organized into named columns.
Dataset: a distributed collection of data.
DataFrame: a distributed collection of data organized into columns (column name, column type, column value).
1) DataFrame vs. RDD (a sketch of the difference follows this list)
RDD (Resilient Distributed Dataset):
- java/scala ⇒ JVM
- python ⇒ Python runtime
DataFrame:
- java/scala/python ⇒ logical plan
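To make the comparison concrete, here is a minimal sketch (the RddVsDataFrame object name is made up for this illustration; infos.txt is the file prepared in section 3 below) that filters the same records first through the RDD API and then through the DataFrame API:

import org.apache.spark.sql.SparkSession

object RddVsDataFrame {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddVsDataFrame").master("local[2]").getOrCreate()
    import spark.implicits._

    // With an RDD, Spark only sees opaque records; parsing and filtering are done by hand.
    val rdd = spark.sparkContext.textFile("/usr/local/data/infos.txt")
    rdd.map(_.split(","))
      .filter(fields => fields(2).toInt > 30)
      .collect()
      .foreach(fields => println(fields.mkString(", ")))

    // With a DataFrame, the same data has named, typed columns,
    // so the filter is expressed declaratively and can be optimized.
    val df = rdd.map(_.split(","))
      .map(f => (f(0).toInt, f(1), f(2).toInt))
      .toDF("id", "name", "age")
    df.filter($"age" > 30).show()

    spark.stop()
  }
}

With the RDD, the filter logic is invisible to Spark; with the DataFrame, the named columns feed a logical plan that the engine can optimize in the same way for any language.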
2 Common DataFrame API operations
1)DataFrameApp.scala
package cn.ljj.spark

import org.apache.spark.sql.SparkSession

/**
 * Basic DataFrame API operations
 */
object DataFrameApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataFrameApp").master("local[2]").getOrCreate()

    // Load a JSON file into a DataFrame
    val peopleDF = spark.read.format("json").
      load("/usr/local/app/spark-2.2.0/examples/src/main/resources/people.json")

    // Print the schema of the DataFrame
    peopleDF.printSchema()

    // Show the first 20 rows of the dataset
    peopleDF.show()

    // Select all values of one column: select name from table-name
    peopleDF.select("name").show()

    // Select several columns and compute on a column: select name, age+10 as age2 from table-name
    peopleDF.select(peopleDF.col("name"), (peopleDF.col("age") + 10).as("age2")).show()

    // Filter rows by a column value: select * from table-name where age > 19
    peopleDF.filter(peopleDF.col("age") > 19).show()

    // Group by a column, then aggregate: select age, count(1) from table-name group by age
    peopleDF.groupBy("age").count().show()

    spark.stop()
  }
}
3 Interoperating between DataFrames and RDDs
Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection based approach leads to more concise code and works well when you already know the schema while writing your Spark application.
The second method for creating Datasets is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct Datasets when the columns and their types are not known until runtime.
Prepare the input file infos.txt:
1,zhangsan,20
2,lisi,30
3,wangwu,40
1) Method 1: reflection. Prerequisite: the fields and their types are known in advance.
2) Method 2: programmatic, used when the first approach does not fit (the columns are not known in advance).
DataFrameRDDApp.scala
package cn.ljj.spark

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

/**
 * Interoperating between DataFrames and RDDs
 */
object DataFrameRDDApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataFrameRDDApp").master("local[2]").getOrCreate()

    // Reflection approach
    // inferReflection(spark)

    // Programmatic approach
    program(spark)

    spark.stop()
  }

  def program(spark: SparkSession): Unit = {
    // RDD ==> DataFrame
    val rdd = spark.sparkContext.textFile("/usr/local/data/infos.txt")

    val infoRDD = rdd.map(_.split(",")).map(line => Row(line(0).toInt, line(1), line(2).toInt))
    val structType = StructType(Array(StructField("id", IntegerType, true),
      StructField("name", StringType, true),
      StructField("age", IntegerType, true)))
    val infoDF = spark.createDataFrame(infoRDD, structType)
    infoDF.printSchema()
    infoDF.show()

    // Work through the DataFrame API
    infoDF.filter(infoDF.col("age") > 30).show()

    // Work through SQL
    infoDF.createOrReplaceTempView("infos")
    spark.sql("select * from infos where age>30").show()
  }

  def inferReflection(spark: SparkSession): Unit = {
    // RDD ==> DataFrame
    val rdd = spark.sparkContext.textFile("/usr/local/data/infos.txt")

    // The implicit conversions must be imported
    import spark.implicits._
    val infoDF = rdd.map(_.split(",")).map(line => Info(line(0).toInt, line(1), line(2).toInt)).toDF()
    infoDF.show()

    infoDF.filter(infoDF.col("age") > 30).show()

    // SQL style
    infoDF.createOrReplaceTempView("infos")
    spark.sql("select * from infos where age>30").show()
  }

  case class Info(id: Int, name: String, age: Int)
}
4 Dataset
1) Usage
DatasetApp.scala
package cn.ljj.spark

import org.apache.spark.sql.SparkSession

/**
 * Dataset operations
 */
object DatasetApp {

  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("DatasetApp")
      .master("local[2]").getOrCreate()

    // Note: the implicit conversions must be imported
    import spark.implicits._

    val path = "file:///usr/local/data/sales.csv"

    // How does Spark parse a CSV file? Read the header row and infer the schema.
    val df = spark.read.option("header", "true").option("inferSchema", "true").csv(path)
    df.show

    val ds = df.as[Sales]
    ds.map(line => line.itemId).show

    spark.stop()
  }

  case class Sales(transactionId: Int, customerId: Int, itemId: Int, amountPaid: Double)
}
7. Accessing external data sources with Spark SQL
1 Overview
1) Background
- Every Spark application starts with loading data and ends with saving data.
- Loading and saving data is not easy.
- Datasets are stored in various formats/systems.
2) Goal: easy loading/saving of DataFrames
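As a rough sketch of that unified load/save API (the ExternalSourceApp object and the input/output paths below are placeholders, not taken from the examples in this section):

import org.apache.spark.sql.{SaveMode, SparkSession}

object ExternalSourceApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ExternalSourceApp").master("local[2]").getOrCreate()

    // Load: pick a format and a path through the generic DataFrameReader.
    val df = spark.read.format("json").load("/path/to/input.json")

    // Save: the same data can be written back in a different format.
    df.write.mode(SaveMode.Overwrite).format("parquet").save("/path/to/output.parquet")

    spark.stop()
  }
}

The sections below apply this same read/write pattern to Parquet files, Hive tables, and MySQL tables.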
2 Working with Parquet files
package com.ljj.spark

import org.apache.spark.sql.SparkSession

object ParquetApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ParquetApp").master("local[2]").getOrCreate()

    // Load the data
    val userDF = spark.read.format("parquet")
      .load("/usr/local/app/spark-2.2.0/examples/src/main/resources/users.parquet")
    userDF.printSchema()
    userDF.show()

    userDF.select("name", "favorite_color").show()
    userDF.select("name", "favorite_color").write.format("json").save("/usr/local/data/jsonout")

    // Spark SQL's default read format is parquet
    spark.read.load("/Users/Mac/app/spark-2.2.0-bin-2.6.0-cdh5.7.0/examples/src/main/resources/users.parquet").show()

    // Another way to load
    spark.read.format("parquet")
      .option("path", "/usr/local/app/spark-2.2.0/examples/src/main/resources/users.parquet")
      .load().show()

    spark.stop()
  }
}
3 Working with Hive tables
Run inside spark-shell:
spark.sql("select name,age from student").write.saveAsTable("student_1")
4 Working with MySQL tables
import java.util.Properties

// Loading data from a JDBC source

// First approach: options on the generic reader
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/hive")
  .option("dbtable", "hive.TBLS")
  .option("user", "root")
  .option("password", "rootroot")
  .load()

// Second approach: a Properties object
val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "root")
val jdbcDF2 = spark.read
  .jdbc("jdbc:mysql://localhost:3306/hive", "hive.TBLS", connectionProperties)

// Saving data to a JDBC source
jdbcDF.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/hive")
  .option("dbtable", "hive.TBLS")
  .option("user", "root")
  .option("password", "rootroot")
  .save()

jdbcDF2.write
  .jdbc("jdbc:mysql://localhost:3306/hive", "hive.TBLS", connectionProperties)

// Specifying create table column data types on write
jdbcDF.write
  .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
  .jdbc("jdbc:mysql://localhost:3306/hive", "hive.TBLS", connectionProperties)