The SparkCore programming model is the RDD.
The SparkSQL programming model is the DataFrame/Dataset.
The SparkSQL programming entry point is SparkSession.
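For reference, a minimal sketch of building the entry point (the master, app name, and sample rows below are placeholders, not taken from the examples in this post):
import org.apache.spark.sql.SparkSession

object SessionDemo {
  def main(args: Array[String]): Unit = {
    // SparkSession is the single entry point for Spark SQL
    val spark = SparkSession.builder
      .master("local[2]")      // placeholder: run locally with 2 threads
      .appName("session-demo") // placeholder app name
      .getOrCreate()

    // the $-syntax and encoders used later come from the session's implicits
    import spark.implicits._
    val df = Seq(("Michael", 29), ("Andy", 30)).toDF("name", "age")
    df.show()

    spark.stop()
  }
}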
Three ways to write select (a combined runnable sketch follows the filter list below):
- df.select("column1", "column2")
- df.select(df("column1"), df("column2"))
- import spark.implicits._
  val frame = df.select($"column1", $"column2")
Three ways to write filter:
If the value is numeric, write it directly; if it is a string, wrap it in double quotes.
frame.where is equivalent to frame.filter; in the source you can see that where simply delegates to filter:
def where(condition: Column): Dataset[T] = filter(condition)
- frame.filter(df("column") === value).show()
- frame.filter("column = value").show()
- frame.filter('column === value).show()
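A runnable sketch combining the select and filter forms above, assuming a SparkSession named spark is in scope and df is a DataFrame with a string column column1 and a numeric column column2:
import spark.implicits._ // enables the $"..." and 'column syntax

// three equivalent select forms
df.select("column1", "column2").show()
df.select(df("column1"), df("column2")).show()
df.select($"column1", $"column2").show()

// three equivalent filter forms (where simply delegates to filter)
df.filter(df("column2") === 19).show()
df.filter("column2 = 19").show()
df.filter('column2 === 19).show()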
Reading data
Reading goes through DataFrameReader.
1. spark.read.format("json/text/jdbc…").load(path)
   Returns a DataFrame.
   This is equivalent to:
2. spark.read.json(path)
   This also returns a DataFrame (i.e. a Dataset[Row]); the shorthand goes through the same code path as #1 underneath.
Writing data
Writing goes through DataFrameWriter:
df.write.format("…").save(path)
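Putting reader and writer together, a minimal round trip (the paths are placeholders and spark is an existing SparkSession):
// generic reader: format + load, returns a DataFrame
val df = spark.read.format("json").load("in/people.json") // placeholder input path

// shorthand, same code path underneath
val df2 = spark.read.json("in/people.json")

// generic writer: format + save
df.write.format("json").save("out/people_json") // placeholder output path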
Reading and writing JSON data
import org.apache.spark.sql.SparkSession

object sourceApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName("test")
      .getOrCreate()
    json(spark)
    spark.stop()
  }

  def json(spark: SparkSession): Unit = {
    val df = spark.read.format("json").load("D:\\bigdata\\ruozedata-spark\\ruozedata-spark-sql\\data\\people.json")
    // select x,x,x,...
    val frame = df.select("age", "name")
    // df.select(df("age"), df("name"))
    // import spark.implicits._
    // val frame = df.select($"age", $"name")
    val result = frame.filter(df("age") === 19)
    // frame.filter("age = 19").show()
    // frame.filter('age === 19).show()
    result.write.format("json").save("out")
  }
}
When the job is run again it fails because the output path already exists, so we can switch to append or overwrite mode instead.
The relevant source:
def mode(saveMode: String): DataFrameWriter[T] = {
  this.mode = saveMode.toLowerCase(Locale.ROOT) match {
    case "overwrite" => SaveMode.Overwrite
    case "append" => SaveMode.Append
    case "ignore" => SaveMode.Ignore
    case "error" | "errorifexists" | "default" => SaveMode.ErrorIfExists
    case _ => throw new IllegalArgumentException(s"Unknown save mode: $saveMode. " +
      "Accepted save modes are 'overwrite', 'append', 'ignore', 'error', 'errorifexists'.")
  }
  this
}
Note: append does not append to the existing result files; it writes a new set of result files into the output directory.
So the write can be changed to:
result.write.format("json").mode("overwrite").save("out")
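The same thing can also be expressed with the SaveMode enum instead of the string form, e.g.:
import org.apache.spark.sql.SaveMode

result.write
  .format("json")
  .mode(SaveMode.Overwrite) // or SaveMode.Append / SaveMode.Ignore / SaveMode.ErrorIfExists
  .save("out")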
Reading and writing text files
import org.apache.spark.sql.SparkSession

object sourceApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName("test")
      .getOrCreate()
    text(spark)
    spark.stop()
  }

  def text(spark: SparkSession): Unit = {
    val value = spark.read.textFile("D:\\bigdata\\ruozedata-spark\\ruozedata-spark-sql\\data\\people.txt")
    import spark.implicits._
    // the text sink expects a single string column, so join each line's fields into one string
    val value1 = value.map(x => {
      val strings = x.split(",")
      strings(0) + strings(1)
    })
    value1.write.format("text")
      .mode("overwrite")
      // specify the compression codec
      .option("compression", "gzip")
      .save("out")
  }
}
The compression codec short names supported in the source:
private val shortCompressionCodecNames = Map(
  "none" -> null,
  "uncompressed" -> null,
  "bzip2" -> classOf[BZip2Codec].getName,
  "deflate" -> classOf[DeflateCodec].getName,
  "gzip" -> classOf[GzipCodec].getName,
  "lz4" -> classOf[Lz4Codec].getName,
  "snappy" -> classOf[SnappyCodec].getName)
Reading and writing CSV
def csv(spark: SparkSession): Unit = {
  val df = spark.read.format("csv")
    .option("header", "true")      // first line is a header row
    .option("sep", ";")            // field delimiter
    .option("inferSchema", "true") // infer column types instead of reading everything as string
    .load("D:\\bigdata\\ruozedata-spark\\ruozedata-spark-sql\\data\\people.csv")
  df.printSchema()
}
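The heading covers writing as well, so here is a minimal sketch of the corresponding CSV write for the df loaded above (output path and options are placeholders):
df.write.format("csv")
  .option("header", "true") // write a header row
  .option("sep", ";")       // keep the same delimiter
  .mode("overwrite")
  .save("out_csv")          // placeholder output path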
Reading and writing via JDBC
def jdbc(spark: SparkSession): Unit = {
  val df = spark.read.format("jdbc")
    .option("url", "jdbc:mysql://hadoop001:3306")
    .option("dbtable", "offsets.offsets_storage")
    .option("user", "root")
    .option("password", "123456")
    .load()
  import spark.implicits._
  df.printSchema()
  // filter the rows just read and write them back to another table over JDBC
  df.filter('partitions === 0)
    .write.format("jdbc")
    .option("url", "jdbc:mysql://hadoop001:3306")
    .option("dbtable", "offsets.offsets_storage_2")
    .option("user", "root")
    .option("password", "123456")
    .save()
}
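Equivalently, the read side can use the jdbc() shorthand with a java.util.Properties object (the connection details are the same placeholders as above):
import java.util.Properties

val props = new Properties()
props.put("user", "root")
props.put("password", "123456")

val df2 = spark.read.jdbc("jdbc:mysql://hadoop001:3306", "offsets.offsets_storage", props)
df2.printSchema()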