When using Spark SQL you may run into the following error:
java.lang.UnsupportedOperationException: Cannot create encoder for Option of Product type, because Product type is represented as a row, and the entire row can not be null in Spark SQL like normal databases. You can wrap your type with Tuple1 if you do want top level null Product objects, e.g. instead of creating `Dataset[Option[MyClass]]`, you can do something like `val ds: Dataset[Tuple1[MyClass]] = Seq(Tuple1(MyClass(...)), Tuple1(null)).toDS`
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:52)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
This problem is usually caused by Spark failing to find an encoder when converting a Dataset[Row] (or the result of a map) into the corresponding typed Dataset. The fix is to add an implicit encoder for that type, typically:
implicit val registerKryoEncoder = Encoders.kryo[MyClass]
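That one-liner only helps if the implicit encoder is in scope where the typed operation runs. A minimal, self-contained sketch (MyClass, its fields, and the local master are placeholders of mine, not from the original job):
import org.apache.spark.sql.{Encoders, SparkSession}

case class MyClass(id: Int, age: Int, name: String)

val spark = SparkSession.builder().appName("kryo-encoder-demo").master("local[*]").getOrCreate()

// With a Kryo encoder the whole object is serialized into one binary "value" column,
// so column-level SQL on the fields is no longer possible, but typed operations compile.
implicit val registerKryoEncoder = Encoders.kryo[MyClass]

val ds = spark.createDataset(Seq(MyClass(1, 20, "a")))
ds.printSchema() // a single column: value: binary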
Background:
When I first wrote this Spark job I was using the Spark 1.x style, so sc.textFile(ads_channel_type_path).map(...) returned an RDD. Then I switched the entry point to SparkSession, like this:
val spark = SparkSession
.builder()
.appName("Spark SQL basic example")
.config("spark.some.config.option", "some-value")
.getOrCreate()
With otherwise identical code, still reading a file from HDFS, spark.read.textFile(basePath).map(...) now returns a Dataset, and resubmitting the job produced the error above. The root cause is still that no encoder can be found when converting to the corresponding typed Dataset, but the Kryo workaround shown earlier did not help here.
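To make the type difference concrete, here is a minimal sketch (using the spark session and basePath from above):
// Spark 1.x style, via the SparkContext: map runs on an RDD and needs no encoder
val lines1: org.apache.spark.rdd.RDD[String] = spark.sparkContext.textFile(basePath)

// Spark 2.x style, via the SparkSession: textFile returns Dataset[String],
// and Dataset.map requires an implicit Encoder for its result type
val lines2: org.apache.spark.sql.Dataset[String] = spark.read.textFile(basePath)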
Here is my pseudocode:
// create the SparkSession
val spark = SparkSession
.builder()
.appName(appname)
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.kryoserializer.buffer.max","1024m")
.getOrCreate()
// note: map here returns a Dataset, not an RDD
val rdd = spark.read.textFile(path).map(line => {
  // business logic omitted ...
  if (event == "show") {
    // business logic omitted ...
    if (currentDay == day) {
      // business logic omitted ...
      Option((id, age, name)) // return value
    } else {
      None // return value
    }
  } else {
    None // return value
  }
})
.toDF("id", "age", "name") // to DataFrame
.cache()
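Why does this fail? The closure returns Option((id, age, name)) or None, so (assuming id and age are Ints and name is a String) the map result type is Option[(Int, Int, String)]: a top-level Option of a Product, which is exactly what the exception forbids. The error message itself suggests wrapping in Tuple1 instead; a tiny illustration using the MyClass placeholder from that message:
import org.apache.spark.sql.Dataset
import spark.implicits._

case class MyClass(id: Int, age: Int, name: String)

// Dataset[Option[MyClass]] is not supported, but wrapping the possibly-null value in Tuple1 is:
val ds: Dataset[Tuple1[MyClass]] = Seq(Tuple1(MyClass(1, 20, "a")), Tuple1(null)).toDS()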
So what should we do? The problem is still that no matching encoder is found when converting a Dataset[Row] to the corresponding typed Dataset. Have a look at the example in this article:
The key content is as follows:
There is nothing unexpected here. You're trying to use code which has been written with Spark 1.x and is no longer supported in Spark 2.0:
- in 1.x DataFrame.map is ((Row) ⇒ T)(ClassTag[T]) ⇒ RDD[T]
- in 2.x Dataset[Row].map is ((Row) ⇒ T)(Encoder[T]) ⇒ Dataset[T]
To be honest it didn't make much sense in 1.x either. Independent of version you can simply use DataFrame API:
import org.apache.spark.sql.functions.{when, lower}
val df = Seq(
(2012, "Tesla", "S"), (1997, "Ford", "E350"),
(2015, "Chevy", "Volt")
).toDF("year", "make", "model")
df.withColumn("make", when(lower($"make") === "tesla", "S").otherwise($"make"))
If you really want to use map you should use statically typed Dataset:
import spark.implicits._
case class Record(year: Int, make: String, model: String)
df.as[Record].map {
case tesla if tesla.make.toLowerCase == "tesla" => tesla.copy(make = "S")
case rec => rec
}
or at least return an object which will have implicit encoder:
df.map {
case Row(year: Int, make: String, model: String) =>
(year, if(make.toLowerCase == "tesla") "S" else make, model)
}
Finally if for some completely crazy reason you really want to map over Dataset[Row] you have to provide required encoder:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
// Yup, it would be possible to reuse df.schema here
val schema = StructType(Seq(
StructField("year", IntegerType),
StructField("make", StringType),
StructField("model", StringType)
))
val encoder = RowEncoder(schema)
df.map {
case Row(year, make: String, model) if make.toLowerCase == "tesla" =>
Row(year, "S", model)
case row => row
} (encoder)
Read the description and the solution above carefully: when you map over Rows you must supply an encoder, because a Dataset is strongly typed. So in my code I can change Option to Row and pass an encoder. Pseudocode:
// create the SparkSession
val spark = SparkSession
.builder()
.appName(appname)
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.kryoserializer.buffer.max","1024m")
.getOrCreate()
// A Dataset is strongly typed, so when mapping to Rows you must supply an encoder;
// likewise, converting a DataFrame to a typed Dataset needs .as[SomeClass]
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("age", IntegerType),
  StructField("name", StringType)
))
val encoder = RowEncoder(schema)
// note: map here returns a Dataset, and we pass the encoder explicitly
val rdd = spark.read.textFile(path).map(line => {
  // business logic omitted ...
  if (event == "show") {
    // business logic omitted ...
    if (currentDay == day) {
      // business logic omitted ...
      Row(id, age, name) // return value
    } else {
      Row.empty // return value
    }
  } else {
    Row.empty // return value
  }
})(encoder)
.filter(_ != Row.empty)
.toDF("id", "age", "name") // to DataFrame
.cache()
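As a side note, a possible alternative sketch (my own, not from the quoted answer): keep returning Option/None but use flatMap instead of map. The encoder is then only needed for the tuple itself, which import spark.implicits._ already provides, the None cases simply drop out, and no Row.empty filtering is needed. This assumes each valid line is comma-separated "id,age,name" with numeric id and age:
import spark.implicits._

val df2 = spark.read.textFile(path)
  .flatMap { line =>
    // keep only well-formed lines; everything else becomes None and is dropped
    line.split(",") match {
      case Array(id, age, name) => Some((id.toInt, age.toInt, name))
      case _                    => None
    }
  }
  .toDF("id", "age", "name")
  .cache()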
When you hit a strange exception, think it over and read the official documentation a bit more; it really helps!
Many thanks to the author of the post below for the guidance!!!
Spark SQL study notes:
https://segmentfault.com/a/1190000010039233?utm_source=tag-newest