Feature Transformation 05 - Label Index Conversion and Feature Combination

This post walks through label and feature transformations in machine learning with Spark: StringIndexer converts categorical labels into numeric indices, IndexToString reverses that conversion, VectorIndexer detects categorical features in vectors and indexes them, OneHotEncoder performs one-hot encoding, VectorAssembler combines multiple feature columns into a single vector, and SQLTransformer derives new features via SQL statements. These transformations are essential when training models and making predictions.

Notes compiled: January 20, 2017
Author: 王小草


1.StringIndexer

Converts a categorical label column into numeric indices. Categories are ordered by descending frequency and assigned indices 0, 1, 2, ... accordingly.

If the input is numeric, the values are first cast to strings and then indexed in the same way.

For example, take the categorical variable in the second column below:

id  category
0   a
1   b
2   c
3   a
4   a
5   c

There are three categories: a, b, c. Since a has the highest frequency it is indexed 0, followed by c, then b. The conversion adds the corresponding index as a third column:

id  category  categoryIndex
0   a         0.0
1   b         2.0
2   c         1.0
3   a         0.0
4   a         0.0
5   c         1.0

What if the training set has 3 categories, but the test set (or new incoming data) contains a 4th category d? Two mechanisms are available:
1. Throw an exception (the default).
2. Skip every row containing the unseen category.

Under the second mechanism, suppose a new batch of data arrives:

id  category
0   a
1   b
2   c
3   d

Since d never appeared when the indexer was fitted, its row is dropped automatically:

id  category  categoryIndex
0   a         0.0
1   b         2.0
2   c         1.0

Code:

import org.apache.spark.ml.feature.StringIndexer

val df = spark.createDataFrame(
  Seq((0, "a"),
      (1, "b"),
      (2, "c"),
      (3, "a"),
      (4, "a"),
      (5, "c"))
).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

val indexed = indexer.fit(df).transform(df)
indexed.show()
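The index assignment above can be sketched in plain Scala without Spark. `fitLabels` below is a hypothetical helper for illustration, not Spark's implementation; the alphabetical tie-break is an assumption.

```scala
// Sketch of StringIndexer's fit step: labels are sorted by descending
// frequency, so the most frequent label receives index 0.
// (Assumed tie-break: alphabetical order.)
def fitLabels(column: Seq[String]): Map[String, Double] =
  column.groupBy(identity)
    .toSeq
    .sortBy { case (label, occ) => (-occ.size, label) } // frequency desc, then label
    .zipWithIndex
    .map { case ((label, _), idx) => label -> idx.toDouble }
    .toMap

val labels = fitLabels(Seq("a", "b", "c", "a", "a", "c"))
println(labels) // a -> 0.0, c -> 1.0, b -> 2.0

// "Skip" behavior for unseen categories: rows whose label was not seen
// at fit time are simply dropped.
val indexedSkip = Seq("a", "b", "c", "d").flatMap(labels.get)
println(indexedSkip) // List(0.0, 2.0, 1.0) -- the "d" row is dropped
```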

2.IndexToString

The exact inverse of StringIndexer: converts label indices back into label strings.
Typically, string category labels are first converted to indices with StringIndexer for convenient computation, and in the final predictions or output the indices are converted back to the original string labels.

Code (full program):

import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

object FeatureTransform01 {

  def main(args: Array[String]) {

    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)

    val conf = new SparkConf().setAppName("FeatureTransform01").setMaster("local")
    val sc = new SparkContext(conf)

    val spark = SparkSession
      .builder()
      .appName("Feature Extraction")
      .config("spark.some.config.option", "some-value")
      .getOrCreate()

    // Create a DataFrame
    val df = spark.createDataFrame(Seq(
      (0, "a"),
      (1, "b"),
      (2, "c"),
      (3, "a"),
      (4, "a"),
      (5, "c")
    )).toDF("id", "category")

    // Convert string labels into index labels
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
    val indexed = indexer.transform(df)

    println(s"Transformed string column '${indexer.getInputCol}' " +
      s"to indexed column '${indexer.getOutputCol}'")
    indexed.show()


    // Convert index labels back into strings
    val converter = new IndexToString()
      .setInputCol("categoryIndex")
      .setOutputCol("originalCategory")

    val converted = converter.transform(indexed)

    println(s"Transformed indexed column '${converter.getInputCol}' back to original string " +
      s"column '${converter.getOutputCol}' using labels in metadata")
    converted.select("id", "categoryIndex", "originalCategory").show()


    sc.stop()

  }

}

Output:

Transformed string column 'category' to indexed column 'categoryIndex'
+---+--------+-------------+
| id|category|categoryIndex|
+---+--------+-------------+
|  0|       a|          0.0|
|  1|       b|          2.0|
|  2|       c|          1.0|
|  3|       a|          0.0|
|  4|       a|          0.0|
|  5|       c|          1.0|
+---+--------+-------------+

Transformed indexed column 'categoryIndex' back to original string column 'originalCategory' using labels in metadata
+---+-------------+----------------+
| id|categoryIndex|originalCategory|
+---+-------------+----------------+
|  0|          0.0|               a|
|  1|          2.0|               b|
|  2|          1.0|               c|
|  3|          0.0|               a|
|  4|          0.0|               a|
|  5|          1.0|               c|
+---+-------------+----------------+
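Conceptually, IndexToString just applies the inverse of the fitted label mapping (stored in column metadata). A minimal plain-Scala sketch, with illustrative names:

```scala
// The fitted labels in index order: index 0 = most frequent label.
val labels = Array("a", "c", "b")

// IndexToString conceptually maps a numeric index back to labels(index).
def indexToString(index: Double): String = labels(index.toInt)

println(Seq(0.0, 2.0, 1.0).map(indexToString)) // List(a, b, c)
```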

3.VectorIndexer

Given a column of feature vectors, VectorIndexer automatically detects which features are categorical based on the maxCategories parameter, converts those categorical values into indices, and outputs a new feature-vector column represented entirely by numbers.

Before training models such as decision trees, string category labels need to be converted into index labels in this way.

Code (fragment; assumes `import org.apache.spark.ml.feature.VectorIndexer` and a SparkSession created as above):

    // Read in the data
    val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

    // Set maxCategories to 10: features with at most 10 distinct values are treated as categorical and indexed
    val indexer = new VectorIndexer()
      .setInputCol("features")
      .setOutputCol("indexed")
      .setMaxCategories(10)

    val indexerModel = indexer.fit(data)

    val categoricalFeatures: Set[Int] = indexerModel.categoryMaps.keys.toSet
    println(s"Chose ${categoricalFeatures.size} categorical features: " +
      categoricalFeatures.mkString(", "))

    // Transform
    val indexedData = indexerModel.transform(data)
    indexedData.show()
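The maxCategories decision can be sketched in plain Scala. This is an illustrative simplification, not Spark's implementation: for each feature position, if the number of distinct values does not exceed maxCategories, that feature is treated as categorical and its values are mapped to indices.

```scala
// Each row is one feature vector; a "column" is one feature position.
val rows = Seq(
  Array(1.0, -1.2),
  Array(0.0,  0.0),
  Array(0.0, -0.5)
)
val maxCategories = 2

// A feature is categorical if it has at most maxCategories distinct values.
val columns = rows.head.indices.map(i => rows.map(_(i)))
val categoryMaps: Map[Int, Map[Double, Double]] =
  columns.zipWithIndex.collect {
    case (col, i) if col.distinct.size <= maxCategories =>
      i -> col.distinct.sorted.zipWithIndex
            .map { case (v, idx) => v -> idx.toDouble }.toMap
  }.toMap

// Feature 0 has values {0.0, 1.0} -> categorical;
// feature 1 has 3 distinct values -> continuous, left untouched.
println(categoryMaps.keys) // Set(0)
```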

4.OneHotEncoder

One-hot encoding is needed in many places. It converts a single categorical column into multiple binary columns; logistic regression, for example, typically requires one-hot encoded categorical features.

For example:

a
b
c

becomes:

a  1 0 0
b  0 1 0
c  0 0 1

Code (fragment; assumes imports for StringIndexer and OneHotEncoder):

    // Create a column of categorical labels
    val df = spark.createDataFrame(Seq(
      (0, "a"),
      (1, "b"),
      (2, "c"),
      (3, "a"),
      (4, "a"),
      (5, "c"),
      (6, "d")
    )).toDF("id", "category")

    // Convert string categories into index labels
    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
    val indexed = indexer.transform(df)

    // One-hot encode the index labels
    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")

    val encoded = encoder.transform(indexed)
    encoded.show()


    sc.stop()

Output:

+---+--------+-------------+-------------+
| id|category|categoryIndex|  categoryVec|
+---+--------+-------------+-------------+
|  0|       a|          0.0|(3,[0],[1.0])|
|  1|       b|          3.0|    (3,[],[])|
|  2|       c|          1.0|(3,[1],[1.0])|
|  3|       a|          0.0|(3,[0],[1.0])|
|  4|       a|          0.0|(3,[0],[1.0])|
|  5|       c|          1.0|(3,[1],[1.0])|
|  6|       d|          2.0|(3,[2],[1.0])|
+---+--------+-------------+-------------+

The fourth column uses the sparse vector representation:
(3,[0],[1.0]) means a vector of length 3 with value 1.0 at index 0 and 0.0 everywhere else. Note that OneHotEncoder drops the last category by default (dropLast=true), which is why b (index 3.0, the last of the four categories) is encoded as the all-zero vector (3,[],[]).
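The encoding with the dropLast behavior can be sketched in plain Scala (a minimal dense-vector sketch; Spark actually emits sparse vectors, and `oneHot` is an illustrative helper):

```scala
// One-hot encode a category index with dropLast=true semantics:
// with numCategories categories, the vector has length numCategories - 1,
// and the last category maps to the all-zero vector.
def oneHot(index: Int, numCategories: Int): Array[Double] = {
  val vec = Array.fill(numCategories - 1)(0.0)
  if (index < numCategories - 1) vec(index) = 1.0
  vec
}

println(oneHot(0, 4).mkString("[", ",", "]")) // [1.0,0.0,0.0]  (category a)
println(oneHot(3, 4).mkString("[", ",", "]")) // [0.0,0.0,0.0]  (last category b, dropped)
```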

5.VectorAssembler

Selects multiple feature columns and combines them into a single feature vector.

For example, here are 3 kinds of features:

id  hour  mobile  userFeatures      clicked
0   18    1.0     [0.0, 10.0, 0.5]  1.0

To make the model input convenient, we want to combine the 3 kinds of features into one feature vector held in a single column:

id  hour  mobile  userFeatures      clicked  features
0   18    1.0     [0.0, 10.0, 0.5]  1.0      [18.0, 1.0, 0.0, 10.0, 0.5]

Code (fragment; assumes imports for VectorAssembler and org.apache.spark.ml.linalg.Vectors):

    // Create a DataFrame with 3 feature columns
    val dataset = spark.createDataFrame(
      Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0))
    ).toDF("id", "hour", "mobile", "userFeatures", "clicked")

    // Combine the three feature columns into a single feature vector
    val assembler = new VectorAssembler()
      .setInputCols(Array("hour", "mobile", "userFeatures"))
      .setOutputCol("features")

    val output = assembler.transform(dataset)
    println("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
    output.show(false)

Output:

+---+----+------+--------------+-------+-----------------------+
|id |hour|mobile|userFeatures  |clicked|features               |
+---+----+------+--------------+-------+-----------------------+
|0  |18  |1.0   |[0.0,10.0,0.5]|1.0    |[18.0,1.0,0.0,10.0,0.5]|
+---+----+------+--------------+-------+-----------------------+
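For a single row, the assembly is just concatenation. A plain-Scala sketch of what VectorAssembler does per row:

```scala
// Scalar columns are appended as single values; vector columns are
// spliced in element by element, preserving the input-column order.
val hour = 18.0
val mobile = 1.0
val userFeatures = Array(0.0, 10.0, 0.5)

val features = Array(hour, mobile) ++ userFeatures
println(features.mkString("[", ",", "]")) // [18.0,1.0,0.0,10.0,0.5]
```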

6.SQLTransformer

SQLTransformer lets you extract or derive new features with a SQL statement. Currently (as of version 2.1.0) it only supports statements of the form "SELECT ... FROM __THIS__ ...", where "__THIS__" refers to the underlying table of the input dataset.

For example:

SELECT a, a + b AS a_b FROM __THIS__
SELECT a, SQRT(b) AS b_sqrt FROM __THIS__ where a > 5
SELECT a, b, SUM(c) AS c_sum FROM __THIS__ GROUP BY a, b

Code (assumes `import org.apache.spark.ml.feature.SQLTransformer`):

val df = spark.createDataFrame(
  Seq((0, 1.0, 3.0), (2, 2.0, 5.0))).toDF("id", "v1", "v2")

val sqlTrans = new SQLTransformer().setStatement(
  "SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4 FROM __THIS__")

sqlTrans.transform(df).show()

The code above converts:

id  v1   v2
0   1.0  3.0
2   2.0  5.0

into:

id  v1   v2   v3   v4
0   1.0  3.0  4.0  3.0
2   2.0  5.0  7.0  10.0
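Row by row, the SQL statement above is equivalent to this plain-Scala transformation (an illustrative sketch, not Spark code): each output row keeps the original columns and adds v3 = v1 + v2 and v4 = v1 * v2.

```scala
// Input rows as (id, v1, v2) tuples.
val rows = Seq((0, 1.0, 3.0), (2, 2.0, 5.0))

// SELECT *, (v1 + v2) AS v3, (v1 * v2) AS v4
val transformed = rows.map { case (id, v1, v2) => (id, v1, v2, v1 + v2, v1 * v2) }
transformed.foreach(println) // (0,1.0,3.0,4.0,3.0) and (2,2.0,5.0,7.0,10.0)
```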