Spark -- LabelEncoder: Encoding Multiple Columns with StringIndexer

TheBiiigBlue

Licensed under CC 4.0 BY-SA

Category: Spark. Tags: SparkML

Article link: https://blog.youkuaiyun.com/Aeve_imp/article/details/97121317

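A business requirement called for LabelEncoder-style encoding of multiple columns. Spark ML provides label encoding through StringIndexer, but StringIndexer (as of Spark 2.4) operates on a single column. Looping over the columns and fitting and transforming each one separately does not fit Spark's design and is very inefficient, so we use a Pipeline to handle this case.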
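
To make concrete what label encoding means here, the following plain-Scala sketch (no Spark involved) reproduces the kind of mapping StringIndexer builds with its default frequency-descending ordering, where the most frequent value receives index 0.0. The names `LabelEncodeSketch` and `labelEncode` are hypothetical, and the alphabetical tie-break is an assumption made only to keep the sketch deterministic; Spark's own tie-breaking may differ between versions.

```scala
object LabelEncodeSketch {
  // Map each distinct value to an index, most frequent value first.
  // Ties are broken alphabetically (an assumption made for this sketch).
  def labelEncode(values: Seq[String]): Map[String, Double] = {
    val counts = values.groupBy(identity).map { case (v, occurrences) => (v, occurrences.size) }
    counts.toSeq
      .sortBy { case (v, n) => (-n, v) }
      .zipWithIndex
      .map { case ((v, _), idx) => v -> idx.toDouble }
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val colors = Seq("red", "blue", "red", "green", "red", "blue")
    val mapping = labelEncode(colors)
    // Encode the column by looking each value up in the mapping
    println(colors.map(mapping))
  }
}
```

Doing this for many columns means building one such mapping per column, which is the per-column work that the Pipeline approach bundles together.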
A map operation creates one StringIndexer per column, and a Pipeline then transforms all the columns in a single pass. The code is as follows:

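  // Imports needed by this method. OneHotEncoder.DIS_COLS and NullValueCheck
  // are helpers from the author's own project and are not shown here.
  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.feature.StringIndexer
  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{DataFrame, SparkSession}

  /**
    * @Author: TheBigBlue
    * @Description: Spark 2.4 does not yet support multi-column encoding, so a Pipeline is used to encode multiple columns
    * @Date: 2019/2/13
    * @param spark     : the active SparkSession
    * @param inputDF   : the input DataFrame
    * @param configMap : configuration map; must contain the list of discrete columns to encode
    * @Return: (dictionary DataFrame, transformed DataFrame)
    **/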
  def invokeLabelEncoder(spark: SparkSession, inputDF: DataFrame, configMap: Map[String, Any]): (DataFrame, DataFrame) = {
    // OneHotEncoder.DIS_COLS is a project-specific config key, not part of Spark's OneHotEncoder
    val disColsList: List[String] = configMap.getOrElse(OneHotEncoder.DIS_COLS, null).asInstanceOf[List[String]]
    if (disColsList == null || disColsList.isEmpty) throw new Exception("The list of discrete columns is empty!")
    // Discrete columns the user selected for encoding
    val userSelectCols = disColsList.toArray
    inputDF.cache()
    // Check for null values
    val nullValueCount = NullValueCheck.countNullValue(inputDF, userSelectCols)
    if (nullValueCount > 0) {
      throw new Exception("The input data contains " + nullValueCount + " rows with null values!")
    }
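    // Build one StringIndexer per column and run them all in a single Pipeline
    val indexers = userSelectCols.map(col => {
      new StringIndexer().setInputCol(col).setOutputCol(col + "_indexed")
    })
    // Transformed data
    val finalDF = new Pipeline().setStages(indexers).fit(inputDF).transform(inputDF).cache()
    // count() is an action, so it materializes the cached result
    println(finalDF.count())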
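    // Build the dictionary DataFrame: (column name, original value, encoded label).
    // Iterating over userSelectCols instead of matching "_indexed" in the schema
    // avoids picking up unrelated input columns whose names contain "_indexed".
    val dictRDD: RDD[(String, String, Double)] = finalDF.rdd.flatMap(row => {
      userSelectCols.map(originalCol => {
        // getAs[Any] keeps the type parameter from being inferred as Nothing
        (originalCol, row.getAs[Any](originalCol).toString, row.getAs[Double](originalCol + "_indexed"))
      })
    })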
    val dictDF = spark.createDataFrame(dictRDD)
      .toDF("columns", "properties", "labels")
      .dropDuplicates("columns", "properties", "labels")
      .orderBy("columns", "labels")
    (dictDF, finalDF)
  }
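
The method above derives the dictionary by scanning every transformed row and then dropping duplicates. As a possible alternative, the same mapping can be read from the fitted models, since each fitted stage of the Pipeline is a `StringIndexerModel` whose `labels` array lists the original values in index order. The sketch below is not the author's code: it assumes the Spark 2.4 API, and the object name, the `buildDictionary` helper, and the sample data are hypothetical.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.{DataFrame, SparkSession}

object LabelDictionarySketch {
  // Fit one StringIndexer per column in a single Pipeline and read the
  // (column, original value, encoded label) triples from the fitted models.
  def buildDictionary(df: DataFrame, cols: Array[String]): Seq[(String, String, Double)] = {
    val indexers = cols.map(c => new StringIndexer().setInputCol(c).setOutputCol(c + "_indexed"))
    val model = new Pipeline().setStages(indexers).fit(df)
    model.stages.toSeq.flatMap {
      // The position of a value in `labels` is its encoded index
      case m: StringIndexerModel =>
        m.labels.zipWithIndex.map { case (value, idx) => (m.getInputCol, value, idx.toDouble) }.toSeq
      case _ => Seq.empty[(String, String, Double)]
    }
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("label-dictionary-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data with two discrete columns
    val df = Seq(("red", "S"), ("blue", "M"), ("red", "M"), ("red", "L"), ("blue", "M")).toDF("color", "size")

    val dictDF = buildDictionary(df, Array("color", "size"))
      .toDF("columns", "properties", "labels")
      .orderBy("columns", "labels")
    dictDF.show()
    spark.stop()
  }
}
```

Because the dictionary comes from the models, its size depends on the number of distinct values rather than on the number of rows. On Spark 3.0 and later, StringIndexer also accepts `setInputCols` and `setOutputCols`, which may make the per-column Pipeline unnecessary.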