Method 1: Creating a DataFrame with the toDF function in Spark
By importing the Spark SQL implicits, a local sequence (Seq), an Array, or an RDD can be converted to a DataFrame, as long as Spark can determine a data type for the contents (a short Array sketch appears at the end of Method 1 below). Mind the Spark and Scala versions, otherwise the function will not be found (note that Spark 2.x is pre-built with Scala 2.11 except version 2.4.2, which is pre-built with Scala 2.12; Spark 3.0+ is pre-built with Scala 2.12).
Creating a DataFrame from a local Seq with toDF:
import org.apache.spark.sql.SparkSession

// sc here is a SparkSession (not a SparkContext); enableHiveSupport() requires the spark-hive dependency and can be omitted if Hive is not used
val sc = SparkSession.builder().appName("userImage").master("local").enableHiveSupport().getOrCreate()
import sc.implicits._
val custInfo = Seq((1,"321034202008105678","Frank","18910983245","10101010101010101010",java.sql.Date.valueOf("2020-08-10")),
(2,"321034202008105679","Mike","13510983245","10101010101010101020",java.sql.Date.valueOf("2020-08-10")),
(3,"321034202008105680","Kade","13910983245","10101010101010101030",java.sql.Date.valueOf("2020-08-10"))).toDF("id","id_card","name","phone_no","debit_card","regster_date")
custInfo.show()
//+---+------------------+-----+-----------+--------------------+------------+
//| id| id_card| name| phone_no| debit_card|regster_date|
//+---+------------------+-----+-----------+--------------------+------------+
//| 1|321034202008105678|Frank|18910983245|10101010101010101010| 2020-08-10|
//| 2|321034202008105679| Mike|13510983245|10101010101010101020| 2020-08-10|
//| 3|321034202008105680| Kade|13910983245|10101010101010101030| 2020-08-10|
//+---+------------------+-----+-----------+--------------------+------------+
Note: if toDF() is called without column names, the default column names are "_1", "_2", …
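A minimal sketch of that default naming, reusing the implicits imported above (the two rows here are made up):
val unnamed = Seq((1, "Frank"), (2, "Mike")).toDF()
unnamed.printSchema()
//root
// |-- _1: integer (nullable = false)
// |-- _2: string (nullable = true)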
Creating a DataFrame from an RDD with toDF:
import org.apache.spark.sql.Row

val rdd = sc.sparkContext.parallelize(Seq(
Row("身份证", "321034202008105678","HNIX", java.sql.Date.valueOf("2020-08-10")),
Row("身份证", "321034202008105678","MKDN", java.sql.Date.valueOf("2020-08-10")),
Row("身份证", "321034202008105679","MKDN", java.sql.Date.valueOf("2020-08-10")),
Row("身份证", "321034202008105680","MKDN", java.sql.Date.valueOf("2020-08-10")),
Row("手机号", "18910983245","MKDN", java.sql.Date.valueOf("2020-08-10")),
Row("手机号", "18910983245","HNIX", java.sql.Date.valueOf("2020-08-10")),
Row("手机号", "13510983245","MKDN", java.sql.Date.valueOf("2020-08-10")),
Row("银行卡号", "10101010101010101010","MKDN", java.sql.Date.valueOf("2020-08-10"))
))
import sc.implicits._
val userImage1 = rdd.map(x => (x.getString(0),x.getString(1),x.getString(2),x.getDate(3))).toDF("dim_name","dim_value","label_code","ope_date")
userImage1.show()
//+--------+--------------------+----------+----------+
//|dim_name| dim_value|label_code| ope_date|
//+--------+--------------------+----------+----------+
//| 身份证| 321034202008105678| HNIX|2020-08-10|
//| 身份证| 321034202008105678| MKDN|2020-08-10|
//| 身份证| 321034202008105679| MKDN|2020-08-10|
//| 身份证| 321034202008105680| MKDN|2020-08-10|
//| 手机号| 18910983245| MKDN|2020-08-10|
//| 手机号| 18910983245| HNIX|2020-08-10|
//| 手机号| 13510983245| MKDN|2020-08-10|
//|银行卡号|10101010101010101010| MKDN|2020-08-10|
//+--------+--------------------+----------+----------+
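The intro above also mentions arrays; a local Array converts the same way once turned into a Seq. A minimal sketch with made-up label counts:
// calling .toSeq first avoids relying on a chained implicit conversion from Array
val arrDF = Array(("HNIX", 2), ("MKDN", 6)).toSeq.toDF("label_code", "cnt")
arrDF.show()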
Method 2: Creating a DataFrame with the createDataFrame function in Spark
A DataFrame can also be created by calling createDataFrame on the SparkSession (or, in older code, on a SQLContext). As with toDF, the input data can be a local collection or an RDD; a sketch using a local Seq of case classes follows the example output below.
Creating from Row objects plus a schema:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val schema = StructType(List(
StructField("dim_name", StringType, nullable = false),
StructField("dim_value", StringType, nullable = false),
StructField("label_code", StringType, nullable = true),
StructField("ope_date", DateType, nullable = true)
))
val rdd = sc.sparkContext.parallelize(Seq(
Row("身份证", "321034202008105678","HNIX", java.sql.Date.valueOf("2020-08-10")),
Row("身份证", "321034202008105678","MKDN", java.sql.Date.valueOf("2020-08-10")),
Row("身份证", "321034202008105679","MKDN", java.sql.Date.valueOf("2020-08-10")),
Row("身份证", "321034202008105680","MKDN", java.sql.Date.valueOf("2020-08-10")),
Row("手机号", "18910983245","MKDN", java.sql.Date.valueOf("2020-08-10")),
Row("手机号", "18910983245","HNIX", java.sql.Date.valueOf("2020-08-10")),
Row("手机号", "13510983245","MKDN", java.sql.Date.valueOf("2020-08-10")),
Row("银行卡号", "10101010101010101010","MKDN", java.sql.Date.valueOf("2020-08-10"))
))
val userImage = sc.createDataFrame(rdd, schema)
userImage.show()
//+--------+--------------------+----------+----------+
//|dim_name| dim_value|label_code| ope_date|
//+--------+--------------------+----------+----------+
//| 身份证| 321034202008105678| HNIX|2020-08-10|
//| 身份证| 321034202008105678| MKDN|2020-08-10|
//| 身份证| 321034202008105679| MKDN|2020-08-10|
//| 身份证| 321034202008105680| MKDN|2020-08-10|
//| 手机号| 18910983245| MKDN|2020-08-10|
//| 手机号| 18910983245| HNIX|2020-08-10|
//| 手机号| 13510983245| MKDN|2020-08-10|
//|银行卡号|10101010101010101010| MKDN|2020-08-10|
//+--------+--------------------+----------+----------+
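For the local-collection case, createDataFrame can also take a Seq of case class instances (or tuples) and infer the schema by reflection. A minimal sketch with a hypothetical Label case class:
case class Label(dim_name: String, dim_value: String, label_code: String)
val localDF = sc.createDataFrame(Seq(
Label("身份证", "321034202008105678", "HNIX"),
Label("手机号", "18910983245", "MKDN")
))
localDF.show()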
Method 3: Creating a DataFrame directly from a file
Creating from a Parquet file:
sc.read.parquet("/path/to/file.parquet")
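As a quick round trip (assuming write access to a hypothetical D:\test\cust_info path), the custInfo DataFrame from Method 1 can be written out and read back:
custInfo.write.mode("overwrite").parquet("D:\\test\\cust_info")
val parquetDF = sc.read.parquet("D:\\test\\cust_info")
parquetDF.show()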
Creating from a JSON file:
val json = sc.read.json("D:\\test\\test.json")
json.show()
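By default read.json expects one JSON object per line (JSON Lines); for a pretty-printed file that spans multiple lines, the multiLine option (available since Spark 2.2) can be enabled. A sketch, with hypothetical file contents shown in the comments:
// expected default layout of D:\test\test.json (one object per line):
//   {"name": "Frank", "phone_no": "18910983245"}
//   {"name": "Mike", "phone_no": "13510983245"}
// for a single pretty-printed JSON document instead, enable multiLine:
val multiLineJson = sc.read.option("multiLine", "true").json("D:\\test\\test.json")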
Creating from a CSV file (built-in CSV support is available from Spark 2.0 onwards):
val loadDF = sc.read
.option("header", "true")        // treat the first line as column headers
.option("mode", "DROPMALFORMED") // drop rows that fail to parse
.csv("D:\\test\\test.csv")
loadDF.show()
If Chinese characters appear garbled after loading a file, convert the file to UTF-8 first.
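Alternatively, for CSV the source charset can usually be declared on the reader itself; a sketch assuming a hypothetical GBK-encoded file:
val gbkDF = sc.read
.option("header", "true")
.option("encoding", "GBK") // charset of the source file; the default is UTF-8
.csv("D:\\test\\test_gbk.csv")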