1. import org.apache.spark.sql.functions._
Spark SQL's built-in functions require this import.
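A quick sketch of what the import enables (df and its name column are hypothetical):
// col, upper, lit, avg, min, ... all come from the functions object
val df2 = df.select(col("name"), upper(col("name")).as("name_upper"), lit(1).as("flag"))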
2. Avoid join operations where possible; window functions are preferred, since a window computes per-group aggregates in a single pass with no separate aggregate-then-join step. Note that Window comes from org.apache.spark.sql.expressions.Window; a self-contained sketch follows the snippet below.
val resdf = sdf.withColumn("weight_avg", avg("weight").over(Window.partitionBy("sex")))
  .withColumn("weight_min", min("weight").over(Window.partitionBy("sex")))
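A minimal runnable sketch of the same idea, assuming the spark session from point 5 is in scope (the sample data is hypothetical; import spark.implicits._ enables toDF on a local Seq):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
// Hypothetical sample data: (name, sex, weight)
val sdf = Seq(("a", "M", 70.0), ("b", "F", 55.0), ("c", "M", 80.0))
  .toDF("name", "sex", "weight")
// Attach the per-sex average and minimum weight to every row in one pass
val w = Window.partitionBy("sex")
val resdf = sdf
  .withColumn("weight_avg", avg("weight").over(w))
  .withColumn("weight_min", min("weight").over(w))
resdf.show()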
3. Reading a CSV file in Spark 2.x; the header option controls whether the first line is treated as column names. Note the escaped backslashes in the Windows path, and that csv already returns a DataFrame, so a trailing .toDF is unnecessary:
val sdf1 = spark.read.option("header", "true").csv("E:\\traindata\\ml-100k\\test.csv")
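By default every column is read as string; Spark's inferSchema option samples the data and picks column types instead. A sketch with the same path:
val sdf1 = spark.read
  .option("header", "true")        // treat the first line as column names
  .option("inferSchema", "true")   // infer column types from the data
  .csv("E:\\traindata\\ml-100k\\test.csv")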
4. printSchema() (or printSchema; Scala lets you drop the empty parentheses) prints the schema.
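For example, on the CSV DataFrame from point 3 (the column names are an assumption; without inferSchema every column is string):
sdf1.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- age: string (nullable = true)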
5. Starting a SparkSession (requires import org.apache.spark.sql.SparkSession):
val spark = SparkSession
  .builder()
  .appName(this.getClass.getName)
  .master("local[2]")  // run locally with 2 worker threads
  .getOrCreate()
6. Row.fieldIndex resolves a DataFrame field name to its position, from which the value can be read; see the sketch below.
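A minimal sketch (the DataFrame and column name are hypothetical):
val row = someDF.first()           // an org.apache.spark.sql.Row
val idx = row.fieldIndex("name")   // resolve the column name to its index
val name = row.getString(idx)      // read the value at that position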
7. spark.close() (equivalent to spark.stop()) shuts down the SparkSession when you are done.
8. Building a DataFrame from a programmatically constructed schema:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import spark.implicits._  // needed for the Encoder used by map below
// Create an RDD
val peopleRDD = spark.sparkContext.textFile("examples/src/main/resources/people.txt")
// The schema is encoded in a string
val schemaString = "name age"
// Generate the schema based on the string of schema
val fields = schemaString.split(" ")
.map(fieldName => StructField(fieldName, StringType, nullable = true))
val schema = StructType(fields)
// Convert records of the RDD (people) to Rows
val rowRDD = peopleRDD
.map(_.split(","))
.map(attributes => Row(attributes(0), attributes(1).trim))
// Apply the schema to the RDD
val peopleDF = spark.createDataFrame(rowRDD, schema)
// Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")
// SQL can be run over a temporary view created using DataFrames
val results = spark.sql("SELECT name FROM people")
// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
results.map(attributes => "Name: " + attributes(0)).show()
// +-------------+
// | value|
// +-------------+
// |Name: Michael|
// | Name: Andy|
// | Name: Justin|
// +-------------+
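As the comment above notes, a column can also be read by field name; getAs is a standard Row method:
results.map(row => "Name: " + row.getAs[String]("name")).show()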
This article covered key points of Spark SQL programming: importing the functions package, preferring window functions over joins, reading data from CSV files, starting a SparkSession, reading DataFrame field values by index, and closing the Spark connection.