http://spark.apache.org/docs/latest/sql-getting-started.html
scala> val df = spark.read.json("file:///home/wzj/app/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
scala> df.show()
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
scala> df.createTempView("people")
scala> spark.sql("select * from people")
===== Error: the MySQL JDBC driver for the Hive metastore is missing from the CLASSPATH =====
Caused by: org.datanucleus.store.rdbms.connectionpool.DatastoreDriverNotFoundException: The specified datastore driver ("com.mysql.jdbc.Driver") was not found in the CLASSPATH. Please check your CLASSPATH specification, and the name of the driver.
at org.datanucleus.store.rdbms.connectionpool.AbstractConnectionPoolFactory.loadDriver(AbstractConnectionPoolFactory.java:58)
at org.datanucleus.store.rdbms.connectionpool.BoneCPConnectionPoolFactory.createConnectionPool(BoneCPConnectionPoolFactory.java:54)
at org.datanucleus.store.rdbms.ConnectionFactoryImpl.generateDataSources(ConnectionFactoryImpl.java:238)
... 184 more
The fix is to restart spark-shell with the MySQL driver jar supplied via --jars:
[wzj@hadoop001 bin]$ ./spark-shell --jars ~/lib/mysql-connector-java-5.1.27-bin.jar
Spark session available as 'spark'.
scala>
scala> val df = spark.read.json("file:///home/wzj/app/spark/examples/src/main/resources/people.json")
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]
scala> df.printSchema()
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
scala> df.createTempView("people")
scala> spark.sql("select * from people").show
+----+-------+
| age| name|
+----+-------+
|null|Michael|
| 30| Andy|
| 19| Justin|
+----+-------+
The same steps via the SparkSession API in a standalone application:
import org.apache.spark.sql.SparkSession

object SparkSessionApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName("test")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    val df = spark.read.json("data/test.json")
    df.printSchema()
    df.createTempView("people")
    spark.sql("select * from people").show()

    spark.stop()
  }
}
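One caveat with the code above: createTempView throws an AnalysisException if a view with that name is already registered in the session. A hedged sketch of the safer alternatives, using a small inline dataset instead of the json file so it is self-contained (the view names here are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[1]").appName("views").getOrCreate()
import spark.implicits._

// Inline stand-in for people.json, so the sketch runs without any files.
val df = Seq(("Michael", Option.empty[Long]), ("Andy", Some(30L)), ("Justin", Some(19L)))
  .toDF("name", "age")

// createOrReplaceTempView is idempotent: safe to call repeatedly,
// unlike createTempView, which fails if the name is already taken.
df.createOrReplaceTempView("people")

// A global temp view is shared across sessions of the same application
// and is resolved through the reserved global_temp database.
df.createGlobalTempView("people_g")
val names = spark.sql("select name from global_temp.people_g where age is not null")
names.show()
```

In interactive use (spark-shell), createOrReplaceTempView avoids the re-registration error when you re-run a cell or paste a block twice.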
If you use spark-sql instead, you must also pass --driver-class-path in addition to --jars, otherwise the same driver-not-found error occurs:
[wzj@hadoop001 bin]$ ./spark-sql --jars ~/lib/mysql-connector-java-5.1.27-bin.jar --driver-class-path ~/lib/mysql-connector-java-5.1.27-bin.jar
spark-sql (default)> show databases;
databaseName
default
wzj
wzj_dw
wzj_erp
Time taken: 2.756 seconds, Fetched 4 row(s)
20/04/16 01:08:24 INFO thriftserver.SparkSQLCLIDriver: Time taken: 2.756 seconds, Fetched 4 row(s)
spark-sql (default)> use wzj_dw;
database tableName isTemporary
wzj_dw domain_size false
wzj_dw dws_access_domain_traffic false
wzj_dw dws_access_province_traffic false
wzj_dw dws_isp_response_size false
wzj_dw dws_path_response_size false
wzj_dw ods_access false
wzj_dw ods_access_user false
wzj_dw ods_domain_userid false
Time taken: 0.107 seconds, Fetched 8 row(s)
20/04/16 01:08:37 INFO thriftserver.SparkSQLCLIDriver: Time taken: 0.107 seconds, Fetched 8 row(s)
spark-sql (default)> select * from domain_size;
domain time traffic
gifshow.com 2020/01/01 5
yy.com 2020/01/01 4
huya.com 2020/01/01 1
gifshow.com 2020/01/20 6
gifshow.com 2020/02/01 8
yy.com 2020/01/20 5
gifshow.com 2020/02/02 7
Time taken: 1.865 seconds, Fetched 7 row(s)
20/04/16 01:09:36 INFO thriftserver.SparkSQLCLIDriver: Time taken: 1.865 seconds, Fetched 7 row(s)
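The Hive tables that spark-sql queries above can also be read from application code by enabling Hive support on the SparkSession. A hedged sketch, assuming hive-site.xml is on the classpath and the MySQL metastore driver is supplied as above (database and table names taken from the session):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .master("local")
  .appName("hive-query")
  // Requires spark-hive on the classpath plus hive-site.xml, and, as with
  // spark-shell above, the MySQL driver jar for the metastore connection.
  .enableHiveSupport()
  .getOrCreate()

spark.sql("use wzj_dw")
// Aggregate the domain_size table shown in the spark-sql session.
spark.sql("select domain, sum(traffic) as total from domain_size group by domain").show()

spark.stop()
```

Without enableHiveSupport(), the session uses an in-memory catalog and cannot see the Hive databases listed by show databases above.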
Note: cache on an RDD is lazy (nothing is materialized until the first action), whereas the CACHE TABLE statement in Spark SQL is eager by default.
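A hedged sketch of that difference, on local data instead of the tables above. rdd.cache() (and df.cache()) only mark the data for caching and populate the cache on the first action; the SQL statement CACHE TABLE scans and caches immediately, and CACHE LAZY TABLE restores the lazy behavior:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[1]").appName("cache-demo").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(1 to 100)
rdd.cache()   // lazy: only marks the RDD; nothing is computed or stored yet
rdd.count()   // the first action actually materializes the cached partitions

// Demo table with the same shape as domain_size from the session above.
val df = Seq(("gifshow.com", 5), ("yy.com", 4)).toDF("domain", "traffic")
df.createOrReplaceTempView("domain_size_demo")

// Eager: the scan and caching happen right here, not on the next query.
// "cache lazy table domain_size_demo" would defer it like rdd.cache().
spark.sql("cache table domain_size_demo")
```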