SparkSQL (Part 1)

http://spark.apache.org/docs/latest/sql-getting-started.html

scala>     val df = spark.read.json("file:///home/wzj/app/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]                

scala> df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)


scala> df.show()
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

scala> df.createTempView("people")
scala> spark.sql("select * from people")
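A view registered with createTempView is scoped to the current SparkSession and throws if the name is already taken. A quick sketch of the related variants (assumes an active SparkSession `spark` and the `df` from above; not from the original post):

```scala
// Session-scoped view: fails with AnalysisException if "people" exists
df.createTempView("people")

// Same scope, but silently overwrites an existing view
df.createOrReplaceTempView("people")

// Cross-session view, registered under the reserved global_temp database
df.createGlobalTempView("people_g")
spark.sql("select * from global_temp.people_g").show()
```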


===== Error =====

The spark.sql call touches the Hive metastore (backed here by MySQL) and fails because the MySQL JDBC driver is not on the classpath:

Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
  at org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:58)
  at org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
  at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
  ... 184 more
The fix: restart spark-shell with the MySQL connector jar added via --jars:

[wzj@hadoop001 bin]$ ./spark-shell --jars ~/lib/mysql-connector-java-5.1.27-bin.jar 
Spark session available as 'spark'.

scala> 
scala>     val df = spark.read.json("file:///home/wzj/app/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala>     df.printSchema()
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)


scala>     df.createTempView("people")
scala>     spark.sql("select * from people").show
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+

The same steps via the programmatic API:

import org.apache.spark.sql.SparkSession

object SparkSessionApp {

  def main(args: Array[String]): Unit = {
    // Entry point for Spark SQL: build (or reuse) a SparkSession
    val spark = SparkSession.builder
      .master("local")
      .appName("test")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()
    val df = spark.read.json("data/test.json")
    df.printSchema()
    // Register the DataFrame as a session-scoped temporary view
    df.createTempView("people")
    spark.sql("select * from people").show()
    spark.close()
  }
}
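The SQL query above can equally be expressed through the DataFrame API, with no temp view at all. A sketch under the same assumptions (local master, the example's data/test.json path):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object DataFrameApiApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName("df-api")
      .getOrCreate()
    val df = spark.read.json("data/test.json")
    // Equivalent to: spark.sql("select name, age from people where age > 20")
    df.select(col("name"), col("age"))
      .filter(col("age") > 20)
      .show()
    spark.close()
  }
}
```

Both forms compile down to the same logical plan, so there is no performance difference; the SQL route is mainly convenient when queries arrive as strings.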

When using spark-sql, you must also pass --driver-class-path in addition to --jars, otherwise the same driver-not-found error occurs (--jars ships the jar to executors, but the driver JVM needs the JDBC driver on its own classpath when it connects to the metastore):

[wzj@hadoop001 bin]$ ./spark-sql --jars ~/lib/mysql-connector-java-5.1.27-bin.jar --driver-class-path ~/lib/mysql-connector-java-5.1.27-bin.jar 
spark-sql (default)> 
                   > 
                   > show databases;
databaseName
default
wzj
wzj_dw
wzj_erp
Time taken: 2.756 seconds, Fetched 4 row(s)
20/04/16 01:08:24 INFO thriftserver.SparkSQLCLIDriver: Time taken: 2.756 seconds, Fetched 4 row(s)
spark-sql (default)> use wzj_dw;
spark-sql (default)> show tables;
database	tableName	isTemporary
wzj_dw	domain_size	false
wzj_dw	dws_access_domain_traffic	false
wzj_dw	dws_access_province_traffic	false
wzj_dw	dws_isp_response_size	false
wzj_dw	dws_path_response_size	false
wzj_dw	ods_access	false
wzj_dw	ods_access_user	false
wzj_dw	ods_domain_userid	false
Time taken: 0.107 seconds, Fetched 8 row(s)
20/04/16 01:08:37 INFO thriftserver.SparkSQLCLIDriver: Time taken: 0.107 seconds, Fetched 8 row(s)
spark-sql (default)> select * from domain_size;

domain	time	traffic
gifshow.com	2020/01/01	5
yy.com	2020/01/01	4
huya.com	2020/01/01	1
gifshow.com	2020/01/20	6
gifshow.com	2020/02/01	8
yy.com	2020/01/20	5
gifshow.com	2020/02/02	7
Time taken: 1.865 seconds, Fetched 7 row(s)
20/04/16 01:09:36 INFO thriftserver.SparkSQLCLIDriver: Time taken: 1.865 seconds, Fetched 7 row(s)
spark-sql (default)> 

Note: RDD cache() is lazy (nothing is materialized until an action runs), whereas Spark SQL's CACHE TABLE statement is eager (the table is scanned and cached as soon as the statement executes).
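The difference can be seen in a short sketch (assumes an active SparkSession `spark` and the `people` view registered earlier; not from the original post):

```scala
// Lazy: cache() only marks the RDD for caching; no job runs yet.
val rdd = spark.sparkContext.textFile("data/test.json")
rdd.cache()    // nothing happens here
rdd.count()    // the first action both computes and populates the cache

// Eager: CACHE TABLE materializes the table the moment it executes.
spark.sql("CACHE TABLE people")        // a Spark job runs immediately
spark.sql("CACHE LAZY TABLE people")   // opts back into lazy behavior
spark.sql("UNCACHE TABLE people")      // drops the cached data
```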


Welcome to follow my WeChat official account for more discussion.