Background
1. Every Spark application starts with loading data and ends with saving data
2. Loading and saving data is not easy
3. Parsing raw data: text/json/parquet
4. Converting between data formats
5. Datasets are stored in various formats/systems
For users:
It should be convenient to load data quickly from different sources (JSON, Parquet, RDBMS), process it in a mixed fashion (e.g., join JSON data with Parquet data), and write the result back in a chosen format (JSON, Parquet) to a target system (HDFS, S3).
That is why Spark SQL 1.2 introduced the external Data Source API.
Goals of the Data Source API:
1. Developers: build libraries for various data sources
2. Users: easily load/save DataFrames
Read: spark.read.format(format)
format
built-in: json, parquet, jdbc, csv
packages: third-party sources, not built into Spark
See: https://spark-packages.org/
Write: people.write.format("parquet").save("path")
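The read/write pattern above can be sketched end to end. This is a minimal example, assuming a local SparkSession; the input/output paths and the `people` name are placeholders for illustration, not from the original notes.

```scala
import org.apache.spark.sql.SparkSession

object DataSourceApiSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataSourceApiSketch")
      .master("local[*]") // local mode for a quick demo
      .getOrCreate()

    // Read: the format string selects the data source implementation
    // (a built-in one here; a package would be referenced the same way).
    // "people.json" is a hypothetical input path.
    val people = spark.read.format("json").load("people.json")

    // Write: save the DataFrame back out in a different format.
    people.write.format("parquet").save("output/people.parquet")

    spark.stop()
  }
}
```

Note the asymmetry the notes highlight: the same DataFrame read from one source (JSON) can be written to another format (Parquet) with no conversion code of your own.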
Working with Parquet file data
1. spark.read.format("parquet").load(path)
2. df.write.format("parquet").save(path)
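A round-trip sketch of the two Parquet calls above, again assuming a local SparkSession; the paths are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object ParquetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ParquetSketch")
      .master("local[*]")
      .getOrCreate()

    // 1. Read a Parquet dataset ("input/users.parquet" is a placeholder).
    val df = spark.read.format("parquet").load("input/users.parquet")

    // For built-in formats there is also a shorthand:
    // val df = spark.read.parquet("input/users.parquet")

    // 2. Write it back out as Parquet.
    df.write.format("parquet").save("output/users.parquet")

    spark.stop()
  }
}
```

Because Parquet is a built-in format, `spark.read.parquet(path)` and `df.write.parquet(path)` are equivalent shorthands for the `format(...)` calls shown in the notes.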