Problem Description
Usage example:
spark.read.text("/path/to/spark/README.md")
Currently (as of Spark 2.4.3), spark.read.text only supports UTF-8. Reading a file in any other encoding (for example GBK) produces garbled characters (mojibake) in the returned DataFrame.
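The mojibake can be reproduced without Spark: decoding GBK-encoded bytes as UTF-8 yields different characters. A minimal sketch (the sample string is an arbitrary assumption):

```scala
import java.nio.charset.StandardCharsets

// A string containing Chinese characters, encoded to GBK bytes,
// as they would appear on disk in a GBK-encoded file.
val original = "中文"
val gbkBytes = original.getBytes("GBK")

// Decoding those bytes as UTF-8 -- effectively what spark.read.text
// does -- does not give the original string back.
val asUtf8 = new String(gbkBytes, StandardCharsets.UTF_8)
println(asUtf8 == original) // false: the text is garbled
```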
Solution 1: read with the Hadoop API and decode the raw bytes
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().getOrCreate()
val sc = spark.sparkContext

val encoding = "gbk"
val filePath = "/path/to/spark/README.md"

// Read the file with the Hadoop API, which exposes the raw bytes of each
// line as a Text object, then decode those bytes with the desired charset.
val rdd = sc.hadoopFile(filePath, classOf[TextInputFormat], classOf[LongWritable], classOf[Text]).map {
  case (_, t) => Row(new String(t.getBytes, 0, t.getLength, encoding))
}

// Same single-column schema that spark.read.text produces.
val schema = StructType(Seq(StructField("value", StringType, nullable = true)))
spark.createDataFrame(rdd, schema)
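The byte-level conversion inside the map above can be checked on its own, without Spark or Hadoop. A small round-trip sketch (the sample text is an assumption):

```scala
// Encode a sample string to GBK bytes, mimicking the byte range that
// Hadoop's Text object holds when the underlying file is GBK-encoded.
val sample = "编码测试"
val bytes = sample.getBytes("gbk")

// The same call used in the map: decode a byte range with the charset.
val decoded = new String(bytes, 0, bytes.length, "gbk")
println(decoded == sample) // true: the round trip is lossless
```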
Solution 2: decode the value column with the decode function
import org.apache.spark.sql.functions.{col, decode}

// Reuses encoding and filePath from Solution 1. This works because the text
// source passes the raw bytes of each line through into the value column,
// so decode can reinterpret them with the correct charset.
spark.read.format("text").load(filePath).select(decode(col("value"), encoding).as("value"))