Feature Transformation 05: Converting Label Indices and Assembling Features

License: CC 4.0 BY-SA

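This post walks through label and feature transformations in machine learning: StringIndexer maps category labels to numeric indices, IndexToString reverses that mapping, VectorIndexer detects categorical features and indexes them, OneHotEncoder performs one-hot encoding, VectorAssembler combines several features into a single vector, and SQLTransformer creates new features with SQL statements. These transformations matter both when training a model and when predicting with it.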

Notes compiled on: January 20, 2017
Notes compiled by: 王小草

1.StringIndexer

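StringIndexer converts a categorical label column into numeric indices. Categories are ordered by frequency, from most to least frequent, and assigned the indices 0, 1, 2, ...

If the input column is numeric, its values are first cast to strings and then indexed in the same way.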

For example, take the categorical variable in the second column below:

id	category
0	a
1	b
2	c
3	a
4	a
5	c

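There are three categories here: a, b and c. a is the most frequent, so it gets index 0, followed by c (index 1) and then b (index 2). The resulting indices appear in the third column below: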

id	category	categoryIndex
0	a	0.0
1	b	2.0
2	c	1.0
3	a	0.0
4	a	0.0
5	c	1.0
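
The frequency-based ordering can be sketched in plain Scala, without Spark. This is only an illustration of the idea, not Spark's implementation: the object and method names are made up for this example, and breaking ties alphabetically is an assumption of the sketch.

```scala
object StringIndexerSketch {
  // Build a label -> index map: the most frequent label gets 0.0, the next 1.0, ...
  // Ties are broken alphabetically here; that tie-break is an assumption of this sketch.
  def fitLabels(values: Seq[String]): Map[String, Double] = {
    val counts = values.groupBy(identity).map { case (label, hits) => (label, hits.size) }
    counts.toSeq
      .sortBy { case (label, count) => (-count, label) }
      .map(_._1)
      .zipWithIndex
      .map { case (label, idx) => label -> idx.toDouble }
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val categories = Seq("a", "b", "c", "a", "a", "c")
    val labelToIndex = fitLabels(categories)
    // Look up each row's label, mirroring the categoryIndex column above
    println(categories.map(labelToIndex))
  }
}
```

For the six rows above, `fitLabels` maps a to 0.0, c to 1.0 and b to 2.0, which matches the categoryIndex column.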

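What if the training set has three categories, but the test set or newly arriving data contains a fourth category, d? There are two strategies to choose from, set with setHandleInvalid:
1. Throw an exception (the default, "error").
2. Skip every row that contains the unseen category ("skip").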

With the second strategy, suppose the following new data arrives:

id	category
0	a
1	b
2	c
3	d

Because d did not appear when the indexer was fitted, its row is dropped automatically:

id	category	categoryIndex
0	a	0.0
1	b	2.0
2	c	1.0
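
The difference between the two strategies can be illustrated with a small plain-Scala sketch. It does not call Spark; the object and method names are hypothetical, and the label map is simply the one fitted on the training data above.

```scala
object UnseenLabelSketch {
  // Label -> index mapping learned from the training data above
  val labelToIndex: Map[String, Double] = Map("a" -> 0.0, "c" -> 1.0, "b" -> 2.0)

  // "skip" strategy: rows whose label was never seen during fitting are dropped
  def transformSkip(rows: Seq[(Int, String)]): Seq[(Int, String, Double)] =
    rows.flatMap { case (id, label) =>
      labelToIndex.get(label).map(idx => (id, label, idx))
    }

  // "error" strategy: the first unseen label aborts the whole transformation
  def transformError(rows: Seq[(Int, String)]): Seq[(Int, String, Double)] =
    rows.map { case (id, label) =>
      val idx = labelToIndex.getOrElse(label,
        throw new IllegalArgumentException(s"Unseen label: $label"))
      (id, label, idx)
    }
}
```

Applied to the four new rows, `transformSkip` keeps rows 0 to 2 and drops the row with d, while `transformError` throws as soon as it reaches d.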

Code:

import org.apache.spark.ml.feature.StringIndexer

val df = spark.createDataFrame(
  Seq((0, "a"),
      (1, "b"),
      (2, "c"),
      (3, "a"),
      (4, "a"),
      (5, "c"))
).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

val indexed = indexer.fit(df).transform(df)
indexed.show()

2.IndexToString

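IndexToString is the exact inverse of StringIndexer: it converts label indices back into label strings.
A common pattern is to convert string labels to indices with StringIndexer up front, which makes the later computation easier, and then convert the indices in the predictions or final output back to the original string labels.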

Code:

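import org.apache.log4j.{Level, Logger}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.sql.SparkSession

object FeatureTransform01 {

  def main(args: Array[String]): Unit = {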

    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

    val conf = new SparkConf().setAppName("FeatureTransform01").setMaster("local")
    val sc = new SparkContext(conf)

    val spark = SparkSession
      .builder()
      .appName("Feature Extraction")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    // Create a DataFrame
    val df = spark.createDataFrame(Seq(
      (0, "a"),
      (1, "b"),
      (2, "c"),
      (3, "a"),
      (4, "a"),
      (5, "c")
    )).toDF("id", "category")

    // Convert the string labels to label indices
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
    val indexed = indexer.transform(df)

    println(s"Transformed string column '${indexer.getInputCol}' " +
      s"to indexed column '${indexer.getOutputCol}'")
    indexed.show()


    // Convert the label indices back to strings
    val converter = new IndexToString()
      .setInputCol("categoryIndex")
      .setOutputCol("originalCategory")

    val converted = converter.transform(indexed)

    println(s"Transformed indexed column '${converter.getInputCol}' back to original string " +
      s"column '${converter.getOutputCol}' using labels in metadata")
    converted.select("id", "categoryIndex", "originalCategory").show()


    sc.stop()

  }

}

Output:

Transformed string column 'category' to indexed column 'categoryIndex'
+---+--------+-------------+
| id|category|categoryIndex|
+---+--------+-------------+
|  0|       a|          0.0|
|  1|       b|          2.0|
|  2|       c|          1.0|
|  3|       a|          0.0|
|  4|       a|          0.0|
|  5|       c|          1.0|
+---+--------+-------------+

Transformed indexed column 'categoryIndex' back to original string column 'originalCategory' using labels in metadata
+---+-------------+----------------+
| id|categoryIndex|originalCategory|
+---+-------------+----------------+
|  0|          0.0|               a|
|  1|          2.0|               b|
|  2|          1.0|               c|
|  3|          0.0|               a|
|  4|          0.0|               a|
|  5|          1.0|               c|
+---+-------------+----------------+
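
The reverse lookup itself is just a positional lookup into the ordered label list. The sketch below shows that idea in plain Scala; the object name is hypothetical, and the label array is written out by hand here, whereas in the Spark example the labels come from the metadata that StringIndexer attaches to the column.

```scala
object IndexToStringSketch {
  // Labels in index order, as learned in the example above: index 0 -> a, 1 -> c, 2 -> b
  val labels: Array[String] = Array("a", "c", "b")

  // Convert an index back to its label by position
  def toLabel(index: Double): String = labels(index.toInt)

  def main(args: Array[String]): Unit = {
    val categoryIndex = Seq(0.0, 2.0, 1.0, 0.0, 0.0, 1.0)
    println(categoryIndex.map(toLabel))
  }
}
```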

3.VectorIndexer

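Given a column of feature vectors, VectorIndexer uses the maxCategories parameter to automatically identify which features are categorical, converts those features to category indices, and outputs a new feature vector in which the categorical features are represented by indices.

Models that handle categorical features, such as decision trees, need those features as category indices, so this conversion is done before training them.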

Code:

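    import org.apache.spark.ml.feature.VectorIndexer

    // Load the data
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Set maxCategories to 10: features with at most 10 distinct values are
    // treated as categorical and converted to indices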
    val indexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexed")
      .setMaxCategories(10)

    val indexerModel = indexer.fit(data)

    val categoricalFeatures: Set[Int] = indexerModel.categoryMaps.keys.toSet
    println(s"Chose ${categoricalFeatures.size} categorical features: " +
      categoricalFeatures.mkString(", "))

    // Transform the data
    val indexedData = indexerModel.transform(data)
    indexedData.show()
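
To make the role of maxCategories concrete, here is a rough plain-Scala sketch of the decision rule: a feature counts as categorical when it has at most maxCategories distinct values. The names are hypothetical and the index assignment (ascending order of the distinct values) is an assumption of this sketch; Spark's actual VectorIndexer may assign indices differently.

```scala
object VectorIndexerSketch {
  // For each feature position, collect its distinct values; the feature is treated
  // as categorical when it has at most maxCategories distinct values.
  // Assumption of this sketch: indices follow the ascending order of the values.
  def categoryMaps(rows: Seq[Array[Double]], maxCategories: Int): Map[Int, Map[Double, Int]] = {
    val numFeatures = rows.head.length
    (0 until numFeatures).flatMap { j =>
      val distinctValues = rows.map(_(j)).distinct.sorted
      if (distinctValues.size <= maxCategories) Some(j -> distinctValues.zipWithIndex.toMap)
      else None
    }.toMap
  }

  // Replace categorical values by their index; continuous features stay unchanged
  def transform(row: Array[Double], maps: Map[Int, Map[Double, Int]]): Array[Double] =
    row.zipWithIndex.map { case (value, j) =>
      maps.get(j).map(m => m(value).toDouble).getOrElse(value)
    }
}
```

With maxCategories = 2, a feature that only takes the values 1.0 and 2.0 is indexed as 0 and 1, while a feature with three or more distinct values is passed through untouched.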

4.OneHotEncoder

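One-hot encoding is needed in many places. It turns one column of a categorical variable into several binary columns.
Logistic regression, for example, expects categorical features to be one-hot encoded.

For example: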

a
b
c

becomes:

a 1 0 0 
b 0 1 0
c 0 0 1

Code:

    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    // Create a DataFrame of categorical labels
    val df = spark.createDataFrame(Seq(
      (0, "a"),
      (1, "b"),
      (2, "c"),
      (3, "a"),
      (4, "a"),
      (5, "c"),
      (6, "d")
    )).toDF("id", "category")

    // Convert the string categories to label indices
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
    val indexed = indexer.transform(df)

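    // One-hot encode the label indices
    // (OneHotEncoder is a plain Transformer in Spark 2.x; from Spark 3.0 it is an
    // Estimator and needs .fit(indexed) before .transform)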
    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")

    val encoded = encoder.transform(indexed)
    encoded.show()


    sc.stop()

Output:

+---+--------+-------------+-------------+
| id|category|categoryIndex|  categoryVec|
+---+--------+-------------+-------------+
|  0|       a|          0.0|(3,[0],[1.0])|
|  1|       b|          3.0|    (3,[],[])|
|  2|       c|          1.0|(3,[1],[1.0])|
|  3|       a|          0.0|(3,[0],[1.0])|
|  4|       a|          0.0|(3,[0],[1.0])|
|  5|       c|          1.0|(3,[1],[1.0])|
|  6|       d|          2.0|(3,[2],[1.0])|
+---+--------+-------------+-------------+

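The fourth column uses the sparse vector representation.
(3,[0],[1.0]) means the vector has length 3, the position with index 0 holds the value 1, and all other positions are 0.
The vector has length 3 even though there are four categories because OneHotEncoder drops the last category by default (dropLast = true). That is why b, which has index 3.0, is encoded as the all-zero vector (3,[],[]).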
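
The encoding rule, including the dropped last category, can be sketched in a few lines of plain Scala. This is an illustration with hypothetical names that produces dense arrays rather than Spark's sparse vectors.

```scala
object OneHotSketch {
  // Encode a category index as a dense 0/1 vector.
  // With dropLast = true the vector has numCategories - 1 slots and the
  // last category is represented by all zeros.
  def encode(index: Int, numCategories: Int, dropLast: Boolean = true): Array[Double] = {
    val size = if (dropLast) numCategories - 1 else numCategories
    val vec = Array.fill(size)(0.0)
    if (index < size) vec(index) = 1.0
    vec
  }

  def main(args: Array[String]): Unit = {
    // Four categories, as in the example above (indices 0 to 3)
    (0 to 3).foreach(i => println(s"$i -> ${encode(i, 4).mkString("[", ", ", "]")}"))
  }
}
```

Passing `dropLast = false` gives every category its own slot, at the cost of one redundant column.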

5.VectorAssembler

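VectorAssembler selects features from several columns and combines them into a single feature vector.

For example, here are three features (hour, mobile and userFeatures):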

id	hour	mobile	userFeatures	clicked
0	18	1.0	[0.0, 10.0, 0.5]	1.0

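To get a convenient input format for the model, we want to combine the three features into one feature vector stored in a single column: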

id	hour	mobile	userFeatures	clicked	features
0	18	1.0	[0.0, 10.0, 0.5]	1.0	[18.0, 1.0, 0.0, 10.0, 0.5]

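Code:

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.linalg.Vectors

    // Create a DataFrame with three features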
    val dataset = spark.createDataFrame(
      Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0))
    ).toDF("id", "hour", "mobile", "userFeatures", "clicked")

    // Merge the three features into a single feature vector
    val assembler = new VectorAssembler()
      .setInputCols(Array("hour", "mobile", "userFeatures"))
      .setOutputCol("features")

    val output = assembler.transform(dataset)
    println("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
    output.show(false)

Output:

+---+----+------+--------------+-------+-----------------------+
|id |hour|mobile|userFeatures  |clicked|features               |
+---+----+------+--------------+-------+-----------------------+
|0  |18  |1.0   |[0.0,10.0,0.5]|1.0    |[18.0,1.0,0.0,10.0,0.5]|
+---+----+------+--------------+-------+-----------------------+
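
Conceptually, the assembler flattens scalar and vector columns into one array in the order the input columns are listed. A minimal plain-Scala sketch of that flattening, with hypothetical names and plain arrays standing in for Spark vectors:

```scala
object VectorAssemblerSketch {
  // Flatten a mix of scalar and array-valued columns into one feature array,
  // keeping the order in which the columns are listed.
  def assemble(columns: Seq[Any]): Array[Double] =
    columns.flatMap {
      case i: Int           => Seq(i.toDouble)
      case d: Double        => Seq(d)
      case v: Array[Double] => v.toSeq
      case other            => throw new IllegalArgumentException(s"Unsupported value: $other")
    }.toArray

  def main(args: Array[String]): Unit = {
    // hour, mobile, userFeatures from the example row above
    val features = assemble(Seq(18, 1.0, Array(0.0, 10.0, 0.5)))
    println(features.mkString("[", ",", "]"))
  }
}
```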

6.SQLTransformer

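SQLTransformer lets you extract or derive new features with a SQL statement. Currently (version 2.1.0) only statements of the form "SELECT ... FROM __THIS__ ..." are supported, where "__THIS__" stands for the underlying table of the input dataset.

For example: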

SELECT a, a + b AS a_b FROM __THIS__
SELECT a, SQRT(b) AS b_sqrt FROM __THIS__ where a > 5
SELECT a, b, SUM(c) AS c_sum FROM __THIS__ GROUP BY a, b
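
To make the semantics of these statements concrete, the first two can be mirrored on an ordinary in-memory collection. The sketch below is plain Scala, not Spark; the Record type and the sample values used with it are hypothetical and only stand in for a table with columns a and b.

```scala
object SqlStatementSketch {
  // Hypothetical record type with the columns a and b used in the statements above
  case class Record(a: Int, b: Double)

  // SELECT a, a + b AS a_b FROM __THIS__
  def addColumns(rows: Seq[Record]): Seq[(Int, Double)] =
    rows.map(r => (r.a, r.a + r.b))

  // SELECT a, SQRT(b) AS b_sqrt FROM __THIS__ where a > 5
  def sqrtWhere(rows: Seq[Record]): Seq[(Int, Double)] =
    rows.filter(_.a > 5).map(r => (r.a, math.sqrt(r.b)))
}
```

Each statement either adds a derived column to every row or filters rows first and then derives the column, which is exactly what the transformer does to the input dataset.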

Code:

import org.apache.spark.ml.feature.SQLTransformer

val df = spark.createDataFrame(
  Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")

val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")

sqlTrans.transform(df).show()

The code above transforms: