Getting Started with Spark SQL

Table of Contents

1. Using SQLContext

2. Using HiveContext

3. Using SparkSession

4. spark-shell

5. Programming against the thriftserver

6. DataFrame & Dataset

  1 Overview

  2 Common DataFrame API operations

  3 Interoperating between DataFrames and RDDs

  4 Dataset

7. Spark SQL and external data sources

  1 Overview

  2 Working with Parquet files

  3 Working with Hive tables

  4 Working with MySQL tables


1. Using SQLContext

In Spark 1.x, the entry point for Spark SQL is SQLContext.

The entry point into all functionality in Spark SQL is the SQLContext class, or one of its descendants. To create a basic SQLContext, all you need is a SparkContext.

SQLContextApp.scala

package cn.ljj.spark

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Using SQLContext
  */

object SQLContextApp {

  def main(args: Array[String]): Unit = {
    val path = args(0)

    // 1) Create the SparkContext and SQLContext
    val sparkConf = new SparkConf()

    val sc = new SparkContext(sparkConf)

    val sqlContext = new SQLContext(sc)

    // 2) Do the actual work: load a JSON file
    val people = sqlContext.read.format("json").load(path)
    people.printSchema()
    people.show()

    // 3) Release resources
    sc.stop()
  }
}

Submit and run

spark-submit \
--name SQLContextApp \
--class cn.ljj.spark.SQLContextApp \
--master local[2] \
/usr/local/mylib/sql-1.0.jar \
/usr/local/app/spark-2.2.0/examples/src/main/resources/people.json

2. Using HiveContext

In Spark 1.x, Spark SQL also provides HiveContext as an entry point.

In addition to the basic SQLContext, you can also create a HiveContext, which provides a superset of the functionality provided by the basic SQLContext. Additional features include the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the ability to read data from Hive tables. To use a HiveContext, you do not need to have an existing Hive setup, and all of the data sources available to a SQLContext are still available.

HiveContextApp.scala

package cn.ljj.spark

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

/**
  * Using HiveContext
  * When submitting, pass the MySQL driver jar to the classpath with --jars
  */
object HiveContextApp {

  def main(args: Array[String]): Unit = {

    // 1) Create the SparkContext and HiveContext
    val sparkConf = new SparkConf()

    val sc = new SparkContext(sparkConf)

    val hiveContext = new HiveContext(sc)

    // 2) Do the actual work: load a Hive table
    hiveContext.table("student").show

    // 3) Release resources
    sc.stop()
  }
}

Build

mvn clean package -DskipTests

Submit and run

spark-submit \
--name HiveContextApp \
--class cn.ljj.spark.HiveContextApp \
--master local[2] \
--jars /usr/local/jars/mysql-connector-java-5.1.27-bin.jar \
/usr/local/mylib/sql-1.0.jar

3. Using SparkSession

In Spark 2.x, the entry point for Spark SQL is SparkSession.

The entry point into all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder():

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
 

SparkSessionApp.scala

package cn.ljj.spark

import org.apache.spark.sql.SparkSession

/**
  * Using SparkSession
  */
object SparkSessionApp {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().appName("SparkSessionApp").master("local[2]").getOrCreate()

    val people = spark.read.json("/usr/local/app/spark-2.2.0/examples/src/main/resources/people.json")

    people.show()

    spark.stop()
  }
}

4. spark-shell
 

1) Copy hive-site.xml to spark/conf/ so that Spark can access the Hive metastore

2) Start spark-shell

spark-shell --master local[2]
or
spark-shell --jars <jar file> --master local[2]

3) Usage

spark.sql("<SQL statement>").show

4) Start spark-sql

spark-sql --jars <jar file> --master local[2]

Usage: type SQL statements directly at the prompt

5) Using thriftserver/beeline (the default port is 10000)

./start-thriftserver.sh \
--master local[2] \
--jars <jar file> \
--hiveconf hive.server2.thrift.port=14000

# Connect to the thriftserver with beeline
beeline -u jdbc:hive2://localhost:14000 -n bigdata

6) How does the thriftserver differ from a regular spark-shell/spark-sql session?

  • Every spark-shell or spark-sql session is a separate Spark application.
  • The thriftserver is always a single Spark application no matter how many clients (beeline/code) connect to it, which solves the data-sharing problem: multiple clients can share data (see the sketch below).
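
A minimal sketch of that sharing, assuming the thriftserver from section 5 below is already running on port 14000 and a Hive table named student exists (both are assumptions); it uses the same hive-jdbc dependency introduced in section 5. Because both connections talk to the same Spark application, the table cached by client A is exactly what client B queries.

import java.sql.DriverManager

object SharedThriftServerSketch {

  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")

    // Two independent clients, but only one Spark application behind them
    val connA = DriverManager.getConnection("jdbc:hive2://localhost:14000", "root", "")
    val connB = DriverManager.getConnection("jdbc:hive2://localhost:14000", "root", "")

    // Client A caches the table inside the shared Spark application
    connA.createStatement().execute("CACHE TABLE student")

    // Client B's query reads the data that client A cached
    val rs = connB.createStatement().executeQuery("select count(1) from student")
    while (rs.next()) println("rows: " + rs.getLong(1))

    rs.close()
    connA.close()
    connB.close()
  }
}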

5. Programming against the thriftserver

1) Add the dependency to pom.xml

<!-- JDBC dependency for accessing the thriftserver -->
    <dependency>
        <groupId>org.spark-project.hive</groupId>
        <artifactId>hive-jdbc</artifactId>
        <version>1.2.1.spark2</version>
    </dependency>

2)SparkSQLThriftServerApp.scala

package com.ljj.spark

import java.sql.DriverManager

/**
  * Access the thriftserver via JDBC
  */
object SparkSQLThriftServerApp {

  def main(args: Array[String]): Unit = {

    Class.forName("org.apache.hive.jdbc.HiveDriver")

    val conn=DriverManager.getConnection("jdbc:hive2://localhost:14000","root","")
    val pstmt=conn.prepareStatement("select name,age,score from student")
    val rs=pstmt.executeQuery()
    while (rs.next()){
      println("name: "+rs.getString("name")+
      ", age: "+rs.getInt("age")+
      ", score: "+rs.getDouble("score"))
    }

    rs.close()
    pstmt.close()
    conn.close()
  }
}

3) Start the thriftserver

start-thriftserver.sh \
--master local[2] \
--jars /usr/local/jars/mysql-connector-java-5.1.27-bin.jar \
--hiveconf hive.server2.thrift.port=14000

4) Run the program
5) Results: the name, age, and score of each row are printed to the console

6. DataFrame & Dataset

1 Overview

A Dataset is a distributed collection of data.
A DataFrame is a Dataset organized into named columns.

Dataset: a distributed collection of data.
DataFrame: a distributed collection of data organized into named columns (column name, column type, column value).
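
A minimal sketch of how the two relate in Spark 2.x (the Person case class and the sample rows are made up for illustration): a DataFrame is simply a Dataset[Row], so the same data can be viewed either as untyped rows with named columns or as a strongly typed Dataset.

import org.apache.spark.sql.SparkSession

object DataFrameVsDatasetSketch {

  case class Person(name: String, age: Long)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataFrameVsDatasetSketch").master("local[2]").getOrCreate()
    import spark.implicits._

    // A DataFrame: rows organized into named, typed columns
    val df = Seq(("zhangsan", 20L), ("lisi", 30L)).toDF("name", "age")
    df.printSchema()

    // The same data as a strongly typed Dataset[Person]
    val ds = df.as[Person]
    ds.filter(_.age > 25).show()

    spark.stop()
  }
}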

1) DataFrame vs. RDD

RDD (Resilient Distributed Dataset): a resilient, distributed collection of data

  • Java/Scala ⇒ runs on the JVM
  • Python ⇒ runs in the Python runtime

DataFrame:

  • Java/Scala/Python ⇒ all compile down to the same logical plan (see the sketch below)
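
A minimal sketch of how to inspect that plan (the path is the people.json file shipped with Spark, as used elsewhere in this article): explain(true) prints the parsed, analyzed, and optimized logical plans plus the physical plan that a DataFrame query is translated into, regardless of the language the query was written in.

import org.apache.spark.sql.SparkSession

object LogicalPlanSketch {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LogicalPlanSketch").master("local[2]").getOrCreate()

    val peopleDF = spark.read.json("/usr/local/app/spark-2.2.0/examples/src/main/resources/people.json")

    // Prints the logical plans (parsed, analyzed, optimized) and the physical plan
    peopleDF.filter(peopleDF.col("age") > 19).select("name").explain(true)

    spark.stop()
  }
}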

2 Common DataFrame API operations

1)DataFrameApp.scala

package cn.ljj.spark

import org.apache.spark.sql.SparkSession

/**
  * Basic DataFrame API operations
  */
object DataFrameApp {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder().appName("DataFrameApp").master("local[2]").getOrCreate()

    // Load a JSON file into a DataFrame
    val peopleDF = spark.read.format("json").
      load("/usr/local/app/spark-2.2.0/examples/src/main/resources/people.json")

    // Print the DataFrame's schema
    peopleDF.printSchema()

    // Show the first 20 rows of the dataset
    peopleDF.show()

    // Select a single column: select name from table-name
    peopleDF.select("name").show()

    // Select several columns and compute a derived column: select name, age+10 as age2 from table-name
    peopleDF.select(peopleDF.col("name"), (peopleDF.col("age") + 10).as("age2")).show()

    // Filter on a column value: select * from table-name where age > 19
    peopleDF.filter(peopleDF.col("age") > 19).show()

    // Group by a column and aggregate: select age, count(1) from table-name group by age
    peopleDF.groupBy("age").count().show()

    spark.stop()

  }
}

3 Interoperating between DataFrames and RDDs

Spark SQL supports two different methods for converting existing RDDs into Datasets. The first method uses reflection to infer the schema of an RDD that contains specific types of objects. This reflection based approach leads to more concise code and works well when you already know the schema while writing your Spark application.

The second method for creating Datasets is through a programmatic interface that allows you to construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows you to construct Datasets when the columns and their types are not known until runtime.

Prepare the input file: infos.txt

1,zhangsan,20 

2,lisi,30 

3,wangwu,40

1) Method 1: reflection. Prerequisite: the fields and their types are known in advance.
2) Method 2: programmatic. Use this when the first approach does not fit (the columns are not known ahead of time).

DataFrameRDDApp.scala

package cn.ljj.spark

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

/**
  * Interoperating between DataFrames and RDDs
  */
object DataFrameRDDApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DataFrameApp").master("local[2]").getOrCreate()

    // Reflection-based approach
//    inferReflection(spark)

    // Programmatic approach
    program(spark)

    spark.stop()
  }

  def program(spark:SparkSession): Unit ={

    // RDD ==> DataFrame
    val rdd=spark.sparkContext.textFile("/usr/local/data/infos.txt")

    val infoRDD=rdd.map(_.split(",")).map(line=>Row(line(0).toInt,line(1),line(2).toInt))

    val structType=StructType(Array(StructField("id",IntegerType,true),
      StructField("name",StringType,true),
      StructField("age",IntegerType,true)))

    val infoDF=spark.createDataFrame(infoRDD,structType)
    infoDF.printSchema()
    infoDF.show()

    // Query via the DataFrame API
    infoDF.filter(infoDF.col("age") > 30).show()

    // Query via SQL on a temporary view
    infoDF.createOrReplaceTempView("infos")
    spark.sql("select * from infos where age>30").show()
  }


  def inferReflection(spark:SparkSession): Unit ={

    // RDD ==> DataFrame
    val rdd=spark.sparkContext.textFile("/usr/local/data/infos.txt")

    // Implicit conversions must be imported
    import spark.implicits._
    val infoDF=rdd.map(_.split(",")).map(line=>Info(line(0).toInt,line(1),line(2).toInt)).toDF()

    infoDF.show()

    infoDF.filter(infoDF.col("age")>30).show()

    // SQL approach
    infoDF.createOrReplaceTempView("infos")
    spark.sql("select * from infos where age>30").show()
  }

  case class Info(id: Int, name: String, age: Int)

}

4 Dataset

1) Usage
DatasetApp.scala

package cn.ljj.spark

import org.apache.spark.sql.SparkSession

/**
 * Dataset operations
 */
object DatasetApp {

  def main(args: Array[String]) {
    val spark = SparkSession.builder().appName("DatasetApp")
      .master("local[2]").getOrCreate()

    // Note: implicit conversions must be imported
    import spark.implicits._

    val path = "file:///usr/local/data/sales.csv"

    // How does Spark parse a CSV file?
    val df = spark.read.option("header","true").option("inferSchema","true").csv(path)
    df.show

    val ds = df.as[Sales]
    ds.map(line => line.itemId).show
    
    spark.stop()
  }

  case class Sales(transactionId:Int,customerId:Int,itemId:Int,amountPaid:Double)
}

7. Spark SQL and external data sources

1 Overview

1) Background

  • Every Spark application starts with loading data and ends with saving data.
  • Loading and saving data is not easy.
  • Datasets are stored in a variety of formats/systems.

2) Goal: easy loading and saving of DataFrames (a minimal sketch follows).
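
A minimal sketch of that unified load/save API, previewing the pattern the following sections use in more detail (the output path /usr/local/data/people_parquet is an arbitrary choice for this sketch): the same read/write calls handle different formats, and only the format string changes.

import org.apache.spark.sql.SparkSession

object LoadSaveSketch {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LoadSaveSketch").master("local[2]").getOrCreate()

    // Load: the generic reader, parameterized only by the format string
    val peopleDF = spark.read.format("json")
      .load("/usr/local/app/spark-2.2.0/examples/src/main/resources/people.json")

    // Save: the generic writer, here converting the JSON input to Parquet
    peopleDF.write.format("parquet").save("/usr/local/data/people_parquet")

    spark.stop()
  }
}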

2 Working with Parquet files

package com.ljj.spark

import org.apache.spark.sql.SparkSession

object ParquetApp {

  def main(args: Array[String]): Unit = {

    val spark=SparkSession.builder().appName("ParquetApp").master("local[2]").getOrCreate()

    // Load the data
    val userDF=spark.read.format("parquet")
      .load("/usr/local/app/spark-2.2.0/examples/src/main/resources/users.parquet")

    userDF.printSchema()
    userDF.show()

    userDF.select("name","favorite_color").show()

    userDF.select("name","favorite_color").write.format("json").save("/usr/local/data/jsonout")

    // Parquet is Spark SQL's default read format
    spark.read.load("/usr/local/app/spark-2.2.0/examples/src/main/resources/users.parquet").show()

    // An alternative way to load
    spark.read.format("parquet")
      .option("path","/usr/local/app/spark-2.2.0/examples/src/main/resources/users.parquet")
      .load().show()

    spark.stop()


  }

}

3 Working with Hive tables

Run in spark-shell:

spark.sql("select name,age from student").write.saveAsTable("student_1")
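
In spark-shell, Hive support is already wired up through the hive-site.xml copied in section 4. A minimal sketch of doing the same from a standalone application, assuming hive-site.xml is on the classpath and the student table exists: enableHiveSupport() must be called on the builder, otherwise spark.table and saveAsTable cannot reach the Hive metastore.

import org.apache.spark.sql.SparkSession

object HiveTableSketch {

  def main(args: Array[String]): Unit = {
    // enableHiveSupport() connects the SparkSession to the Hive metastore
    val spark = SparkSession.builder()
      .appName("HiveTableSketch")
      .master("local[2]")
      .enableHiveSupport()
      .getOrCreate()

    // Read an existing Hive table and write a projection back as a new Hive table
    spark.table("student").select("name", "age").write.saveAsTable("student_1")

    spark.stop()
  }
}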

4 Working with MySQL tables

// Loading data from a JDBC source

// Option 1: the generic format("jdbc") reader
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/hive")
  .option("dbtable", "hive.TBLS")
  .option("user", "root")
  .option("password", "rootroot")
  .load()

// Option 2: the read.jdbc() shortcut
import java.util.Properties

val connectionProperties = new Properties()
connectionProperties.put("user", "root")
connectionProperties.put("password", "root")

val jdbcDF2 = spark.read
  .jdbc("jdbc:mysql://localhost:3306/hive", "hive.TBLS", connectionProperties)

// Saving data to a JDBC source

jdbcDF.write
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/hive")
  .option("dbtable", "hive.TBLS")
  .option("user", "root")
  .option("password", "rootroot")
  .save()

jdbcDF2.write
  .jdbc("jdbc:mysql://localhost:3306/hive", "hive.TBLS", connectionProperties)

// Specifying create table column data types on write
jdbcDF.write
  .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
  .jdbc("jdbc:mysql://localhost:3306/hive", "hive.TBLS", connectionProperties)

 
