Creating DataFrames in Spark

This article covers three ways to create a DataFrame in Spark: the toDF function, which converts a local sequence or an RDD; the createDataFrame function on a SparkSession (or the older SQLContext); and loading data directly from parquet, JSON, and CSV files. Keep in mind that the pairing of Spark and Scala versions affects which functions resolve; a mismatch can produce "function not found" errors. If Chinese text comes out garbled, consider converting the file encoding to UTF-8.


Method 1: Creating a DataFrame with the toDF function
After importing the Spark SQL implicits, a local sequence (Seq), array, or RDD can be converted to a DataFrame, as long as the element types map to Spark data types. Pay attention to the Spark and Scala versions, or the function will not be found. (Note that Spark 2.x is pre-built with Scala 2.11 except version 2.4.2, which is pre-built with Scala 2.12. Spark 3.0+ is pre-built with Scala 2.12.)

Creating a DataFrame from a local Seq with toDF:

import org.apache.spark.sql.SparkSession

// Note: the variable is named sc here, but it is a SparkSession, not a SparkContext
val sc = SparkSession.builder().appName("userImage").master("local").enableHiveSupport().getOrCreate()
import sc.implicits._

val custInfo = Seq((1,"321034202008105678","Frank","18910983245","10101010101010101010",java.sql.Date.valueOf("2020-08-10")),
      (2,"321034202008105679","Mike","13510983245","10101010101010101020",java.sql.Date.valueOf("2020-08-10")),
      (3,"321034202008105680","Kade","13910983245","10101010101010101030",java.sql.Date.valueOf("2020-08-10"))).toDF("id","id_card","name","phone_no","debit_card","register_date")

custInfo.show()

//+---+------------------+-----+-----------+--------------------+-------------+
//| id|           id_card| name|   phone_no|          debit_card|register_date|
//+---+------------------+-----+-----------+--------------------+-------------+
//|  1|321034202008105678|Frank|18910983245|10101010101010101010|   2020-08-10|
//|  2|321034202008105679| Mike|13510983245|10101010101010101020|   2020-08-10|
//|  3|321034202008105680| Kade|13910983245|10101010101010101030|   2020-08-10|
//+---+------------------+-----+-----------+--------------------+-------------+

Note: if you call toDF() without column names, the default names are "_1", "_2", ...
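A minimal sketch of the default naming, reusing the sc session and implicits from above:

// Without explicit names, toDF falls back to _1, _2, ...
val pairs = Seq((1, "a"), (2, "b")).toDF()
pairs.printSchema()

// root
//  |-- _1: integer (nullable = false)
//  |-- _2: string (nullable = true)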

Creating a DataFrame from an RDD with toDF:

import org.apache.spark.sql.Row

val rdd = sc.sparkContext.parallelize(Seq(
      Row("身份证", "321034202008105678","HNIX", java.sql.Date.valueOf("2020-08-10")),
      Row("身份证", "321034202008105678","MKDN", java.sql.Date.valueOf("2020-08-10")),
      Row("身份证", "321034202008105679","MKDN", java.sql.Date.valueOf("2020-08-10")),
      Row("身份证", "321034202008105680","MKDN", java.sql.Date.valueOf("2020-08-10")),
      Row("手机号", "18910983245","MKDN", java.sql.Date.valueOf("2020-08-10")),
      Row("手机号", "18910983245","HNIX", java.sql.Date.valueOf("2020-08-10")),
      Row("手机号", "13510983245","MKDN", java.sql.Date.valueOf("2020-08-10")),
      Row("银行卡号", "10101010101010101010","MKDN", java.sql.Date.valueOf("2020-08-10"))
    ))
    
import sc.implicits._

// Pull each Row's fields into a tuple so that toDF can infer the column types
val userImage1 = rdd.map(x => (x.getString(0),x.getString(1),x.getString(2),x.getDate(3))).toDF("dim_name","dim_value","label_code","ope_date")
userImage1.show()

//+--------+--------------------+----------+----------+
//|dim_name|           dim_value|label_code|  ope_date|
//+--------+--------------------+----------+----------+
//|  身份证|  321034202008105678|      HNIX|2020-08-10|
//|  身份证|  321034202008105678|      MKDN|2020-08-10|
//|  身份证|  321034202008105679|      MKDN|2020-08-10|
//|  身份证|  321034202008105680|      MKDN|2020-08-10|
//|  手机号|         18910983245|      MKDN|2020-08-10|
//|  手机号|         18910983245|      HNIX|2020-08-10|
//|  手机号|         13510983245|      MKDN|2020-08-10|
//|银行卡号|10101010101010101010|      MKDN|2020-08-10|
//+--------+--------------------+----------+----------+

(Figure: loading a local collection as an RDD via SparkContext)
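The implicits also cover a Seq or RDD of case classes, in which case the field names become the column names automatically. A minimal sketch, reusing the session above (the UserLabel case class is invented for illustration):

// Hypothetical case class; its field names become the DataFrame's column names
case class UserLabel(dim_name: String, dim_value: String, label_code: String)

val labelDF = sc.sparkContext
      .parallelize(Seq(UserLabel("身份证", "321034202008105678", "HNIX")))
      .toDF()
labelDF.printSchema()

// root
//  |-- dim_name: string (nullable = true)
//  |-- dim_value: string (nullable = true)
//  |-- label_code: string (nullable = true)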

Method 2: Creating a DataFrame with the createDataFrame function
createDataFrame, available on SparkSession (and, in older code, on SQLContext), is another way to build a DataFrame. As with toDF, the input can be a local collection or an RDD.

Creating from Rows plus an explicit schema:

import org.apache.spark.sql.types._

// Declare the schema explicitly: column name, type, and nullability for each field
val schema = StructType(List(
  StructField("dim_name", StringType, nullable = false),
  StructField("dim_value", StringType, nullable = false),
  StructField("label_code", StringType, nullable = true),
  StructField("ope_date", DateType, nullable = true)
))
import org.apache.spark.sql.Row

val rdd = sc.sparkContext.parallelize(Seq(
      Row("身份证", "321034202008105678","HNIX", java.sql.Date.valueOf("2020-08-10")),
      Row("身份证", "321034202008105678","MKDN", java.sql.Date.valueOf("2020-08-10")),
      Row("身份证", "321034202008105679","MKDN", java.sql.Date.valueOf("2020-08-10")),
      Row("身份证", "321034202008105680","MKDN", java.sql.Date.valueOf("2020-08-10")),
      Row("手机号", "18910983245","MKDN", java.sql.Date.valueOf("2020-08-10")),
      Row("手机号", "18910983245","HNIX", java.sql.Date.valueOf("2020-08-10")),
      Row("手机号", "13510983245","MKDN", java.sql.Date.valueOf("2020-08-10")),
      Row("银行卡号", "10101010101010101010","MKDN", java.sql.Date.valueOf("2020-08-10"))
    ))
// createDataFrame pairs the RDD[Row] with the schema defined above
val userImage = sc.createDataFrame(rdd, schema)
userImage.show()

//+--------+--------------------+----------+----------+
//|dim_name|           dim_value|label_code|  ope_date|
//+--------+--------------------+----------+----------+
//|  身份证|  321034202008105678|      HNIX|2020-08-10|
//|  身份证|  321034202008105678|      MKDN|2020-08-10|
//|  身份证|  321034202008105679|      MKDN|2020-08-10|
//|  身份证|  321034202008105680|      MKDN|2020-08-10|
//|  手机号|         18910983245|      MKDN|2020-08-10|
//|  手机号|         18910983245|      HNIX|2020-08-10|
//|  手机号|         13510983245|      MKDN|2020-08-10|
//|银行卡号|10101010101010101010|      MKDN|2020-08-10|
//+--------+--------------------+----------+----------+
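createDataFrame also accepts a local Seq of tuples or case classes directly, inferring the schema by reflection; a minimal sketch:

// Schema is inferred from the tuple types; rename the default _1, _2 columns afterwards
val dims = sc.createDataFrame(Seq(
      ("身份证", "321034202008105678"),
      ("手机号", "18910983245")
    )).toDF("dim_name", "dim_value")
dims.show()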

Method 3: Creating a DataFrame directly from files

From a parquet file:

val parquetDF = sc.read.parquet("/path/to/file.parquet")
parquetDF.show()
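Parquet files carry their schema with the data, so nothing needs to be declared on read. A minimal round-trip sketch (the output path is invented):

// Write the custInfo DataFrame from above as parquet, then read it back;
// column names and types are stored in the file and restored automatically
custInfo.write.mode("overwrite").parquet("D:\\test\\cust_info.parquet")
val restored = sc.read.parquet("D:\\test\\cust_info.parquet")
restored.printSchema()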

From a JSON file:

val json = sc.read.json("D:\\test\\test.json")
json.show()
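Note that the JSON reader expects JSON Lines input by default, i.e. one complete JSON object per line. For a pretty-printed, multi-line JSON document, the multiLine option (available since Spark 2.2) can be enabled; a minimal sketch:

// A file holding one JSON document spread over several lines needs multiLine
val multiLineJson = sc.read.option("multiLine", "true").json("D:\\test\\test.json")
multiLineJson.show()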

From a CSV file (Spark 2.0+ ships a built-in CSV reader; on Spark 1.x the external com.databricks.spark.csv package was required instead):

val loadDF = sc.read
      .option("header", "true")        // treat the first line as column headers
      .option("mode", "DROPMALFORMED") // silently drop rows that fail to parse
      .csv("D:\\test\\test.csv")
loadDF.show()

If Chinese characters come out garbled after loading a file, convert the file to UTF-8 first.
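Alternatively, the CSV reader can be told the source encoding directly through the encoding option (GBK below is an assumed example of a non-UTF-8 encoding):

// Read a GBK-encoded CSV without converting the file itself;
// "encoding" (alias "charset") names the source file's character set
val gbkDF = sc.read
      .option("header", "true")
      .option("encoding", "GBK")
      .csv("D:\\test\\test.csv")
gbkDF.show()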
