Based on the Spark 1.5.1 overview
I. Entry point:
- val sc: SparkContext // an existing SparkContext
- val sqlContext = new org.apache.spark.sql.SQLContext(sc)
- import sqlContext.implicits._ // enables implicit conversions such as rdd.toDF()
The default dialect of SQLContext is "sql", which parses plain SQL statements. Besides SQLContext there is also HiveContext, which comes with a more complete parser; its default dialect (spark.sql.dialect) is "hiveql".
II. Creating DataFrames
DataFrames can currently be created from RDDs, Hive tables, and other data sources; see the next post for details.
- val sc: SparkContext
- val sqlContext = new org.apache.spark.sql.SQLContext(sc)
- val df = sqlContext.read.json("examples/src/main/resources/people.json")
- df.show() // prints the DataFrame contents as a table
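For the sample people.json that ships with Spark, show() prints output along these lines:
- +----+-------+
- | age|   name|
- +----+-------+
- |null|Michael|
- |  30|   Andy|
- |  19| Justin|
- +----+-------+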
III. The DataFrame DSL
1. show: print the contents
2. printSchema: print the schema as a tree
3. select: project a subset of columns from the original DataFrame
4. filter: filter rows
5. groupBy: group rows
6. count: count rows
...
See the next post for details; a short sketch follows this list.
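A quick illustration, applying these operators to the df built in section II (the calls mirror the official guide's examples):
- df.printSchema()
- df.select("name").show()
- df.select(df("name"), df("age") + 1).show() // column arithmetic
- df.filter(df("age") > 21).show()
- df.groupBy("age").count().show()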
IV. Running SQL
- val sqlContext = ...
- val df = sqlContext.sql("SELECT * FROM table")
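sql() can only reference tables registered with the context; a minimal sketch reusing the df from section II ("people" here is a table name chosen for illustration):
- df.registerTempTable("people")
- val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")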
V. Schema inference
1. Reflecting the schema from records of a known structure, using case classes
- val sqlContext = new org.apache.spark.sql.SQLContext(sc)
- import sqlContext.implicits._ // needed for the toDF() conversion below
- case class Person(name: String, age: Int)
- // Parse each line of the text file into a Person, then convert the RDD to a DataFrame
- val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt)).toDF()
- people.registerTempTable("people")
- val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")
- // Columns of a Row can be accessed by ordinal ...
- teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
- // ... by field name ...
- teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)
- // ... or all at once, as a Map from field name to value
- teenagers.map(_.getValuesMap[Any](List("name", "age"))).collect().foreach(println)
2. Specifying the schema programmatically, without case classes
- val sqlContext = new org.apache.spark.sql.SQLContext(sc)
- val people = sc.textFile("examples/src/main/resources/people.txt")
- // The schema is encoded in a string of field names
- val schemaString = "name age"
- import org.apache.spark.sql.Row
- import org.apache.spark.sql.types.{StructType, StructField, StringType}
- // Generate the schema from the field names
- val schema =
-   StructType(
-     schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
- // Convert each line of the RDD to a Row
- val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
- // Apply the schema to the RDD of Rows
- val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
- peopleDataFrame.registerTempTable("people")
- val results = sqlContext.sql("SELECT name FROM people")
- results.map(t => "Name: " + t(0)).collect().foreach(println)
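The snippet above declares every field as a StringType. As a hypothetical variant, a typed age column would look like this (the Row values must then match the declared types):
- import org.apache.spark.sql.types.IntegerType
- val typedSchema = StructType(Seq(
-     StructField("name", StringType, true),
-     StructField("age", IntegerType, true)))
- val typedRowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
- val typedDF = sqlContext.createDataFrame(typedRowRDD, typedSchema)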
VI. Data sources
1. Loading from and saving to Parquet files (Parquet is the default format)
- val df = sqlContext.read.load("examples/src/main/resources/users.parquet")
- df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
2. Manually specifying the format, instead of relying on the default as above
- val df = sqlContext.read.format("json").load("examples/src/main/resources/people.json")
- df.select("name", "age").write.format("parquet").save("namesAndAges.parquet")
3. Hive tables
Spark with Hive support must be built with the -Phive and -Phive-thriftserver flags, and the datanucleus jars under lib_managed/jars as well as hive-site.xml must be placed in the expected locations.
See the Spark 1.5.1 overview: http://spark.apache.org/docs/latest/sql-programming-guide.html
- val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
- sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
- sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
- // Queries are expressed in HiveQL
- sqlContext.sql("FROM src SELECT key, value").collect().foreach(println)
4. JDBC
The corresponding JDBC driver must first be on the Spark classpath. Every worker node must also be able to load the driver, which can be arranged by placing the driver jars on the classpath of each node.
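A minimal read sketch following the pattern in the 1.5 programming guide; the connection URL and table name below are placeholders:
- val jdbcDF = sqlContext.read.format("jdbc").options(
-     Map("url" -> "jdbc:postgresql:dbserver",
-         "dbtable" -> "schema.tablename")).load()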
Reposted from: http://blog.youkuaiyun.com/yueqian_zhu/article/details/49616563