Problem Description
Usage example:
spark.read.text("/path/to/spark/README.md")
Currently (as of Spark 2.4.3), spark.read.text only supports UTF-8. Reading a file in any other encoding (for example GBK) produces garbled characters (mojibake) in the returned DataFrame.
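The mojibake can be reproduced without Spark: decoding GBK-encoded bytes as UTF-8 yields different characters. A minimal sketch (the sample string is an arbitrary assumption):

```scala
import java.nio.charset.StandardCharsets

// A string containing Chinese characters, encoded to GBK bytes,
// as they would appear on disk in a GBK-encoded file.
val original = "中文"
val gbkBytes = original.getBytes("GBK")

// Decoding those bytes as UTF-8 -- effectively what spark.read.text
// does -- does not give the original string back.
val asUtf8 = new String(gbkBytes, StandardCharsets.UTF_8)
println(asUtf8 == original) // false: the text is garbled
```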
Solution 1: read with the Hadoop API and decode the raw bytes
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext

val encoding = "gbk"
val filePath = "/path/to/spark/README.md"

// Read the file with the Hadoop API, which exposes the raw bytes of each
// line as a Text object, then decode those bytes with the desired charset.
val rdd = sc.hadoopFile(filePath, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map {
  case (_, t) => Row(new String(t.getBytes, 0, t.getLength, encoding))
}

// Same single-column schema that spark.read.text produces.
val schema = StructType(Seq(StructField("value", StringType, nullable = true)))
spark.createDataFrame(rdd, schema)
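The byte-level conversion inside the map above can be checked on its own, without Spark or Hadoop. A small round-trip sketch (the sample text is an assumption):

```scala
// Encode a sample string to GBK bytes, mimicking the byte range that
// Hadoop's Text object holds when the underlying file is GBK-encoded.
val sample = "编码测试"
val bytes = sample.getBytes("gbk")

// The same call used in the map: decode a byte range with the charset.
val decoded = new String(bytes, 0, bytes.length, "gbk")
println(decoded == sample) // true: the round trip is lossless
```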
Solution 2: decode the value column with the decode function
import org.apache.spark.sql.functions.{col, decode}

// Reuses encoding and filePath from Solution 1. This works because the text
// source passes the raw bytes of each line through into the value column,
// so decode can reinterpret them with the correct charset.
spark.read.format("text").load(filePath).select(decode(col("value"), encoding).as("value"))