Is it OK to mix categorical and continuous data for SVM (Support Vector Machines)?

Training an SVM on mixed-type data
This post looks at how to handle a dataset containing both continuous and categorical features with a support vector machine (SVM), and describes ways of converting categorical data into numerical form, such as 1-of-K coding.

I have a dataset like

+--------+------+-------------------+
| income | year |        use        |
+--------+------+-------------------+
|  46328 | 1989 | COMMERCIAL EXEMPT |
|  75469 | 1998 | CONDOMINIUM       |
|  49250 | 1950 | SINGLE FAMILY     |
|  82354 | 2001 | SINGLE FAMILY     |
|  88281 | 1985 | SHOP & HOUSE      |
+--------+------+-------------------+

I embed it into a LIBSVM format vector space

+1 1:46328 2:1989 3:1
-1 1:75469 2:1998 4:1
+1 1:49250 2:1950 5:1
-1 1:82354 2:2001 5:1
+1 1:88281 2:1985 6:1

Feature indices:

  • 1 is "income"
  • 2 is "year"
  • 3 is "use/COMMERCIAL EXEMPT"
  • 4 is "use/CONDOMINIUM"
  • 5 is "use/SINGLE FAMILY"
  • 6 is "use/SHOP & HOUSE"

Is it OK to train a support vector machine (SVM) with a mix of continuous (year, income) and categorical (use) data like this?

  1. If you are sure the categorical attribute is actually ordinal, just treat it as a numerical attribute.
  2. If not, use a coding trick to turn it into a numerical attribute. Following the suggestion of the LIBSVM authors, you can simply use 1-of-K coding. For instance, suppose a 1-dimensional categorical attribute takes values in {A, B, C}. Turn it into 3-dimensional numbers such that A = (1, 0, 0), B = (0, 1, 0), C = (0, 0, 1). Of course, this adds extra dimensions to your problem, but that is not a serious issue for a modern SVM solver, whether you use a linear or a kernel type; a minimal sketch of this encoding follows below.
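For concreteness, here is a minimal Python sketch of that 1-of-K encoding applied to the table above, writing rows out in LIBSVM's sparse format. The label column, the output file name, and the feature-index layout (1 = income, 2 = year, categories from index 3 upward) are assumptions for illustration only.

# Minimal sketch: 1-of-K encode the "use" column and write LIBSVM lines.
rows = [
    (+1, 46328, 1989, "COMMERCIAL EXEMPT"),
    (-1, 75469, 1998, "CONDOMINIUM"),
    (+1, 49250, 1950, "SINGLE FAMILY"),
    (-1, 82354, 2001, "SINGLE FAMILY"),
    (+1, 88281, 1985, "SHOP & HOUSE"),
]

use_index = {}                               # category -> feature index, in order seen
for _, _, _, use in rows:
    use_index.setdefault(use, len(use_index) + 3)

with open("train.libsvm", "w") as f:
    for label, income, year, use in rows:
        f.write(f"{label:+d} 1:{income} 2:{year} {use_index[use]}:1\n")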

You should spell out "SVM", at least once. –   Peter Flom  Feb 21 '13 at 1:28
Make sure you scale that data! –   Patrick Caldon  Feb 21 '13 at 2:16
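On the scaling point, here is a small sketch, assuming scikit-learn is available, of bringing the continuous columns into [0, 1] so they sit in the same range as the 0/1 dummy features; the specific scaler choice is illustrative, not prescribed by the question.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[46328, 1989],
              [75469, 1998],
              [49250, 1950],
              [82354, 2001],
              [88281, 1985]], dtype=float)

scaler = MinMaxScaler()             # fit on the training split only
X_scaled = scaler.fit_transform(X)  # income and year now lie in [0, 1]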

1 Answer


Yes! But maybe not in the way you mean. In my research I frequently create categorical features from continuously valued ones using an algorithm like recursive partitioning. I usually use this approach with the SVMLight implementation of support vector machines, but I've used it with LibSVM as well. You'll need to be sure you assign your partitioned categorical features to a specific place in your feature vector during training and classification; otherwise your model is going to end up jumbled.

Edit: That is to say, when I've done this, I assign the first n elements of the vector to the binary values associated with the output of recursive partitioning. In binary feature modeling, you just have a giant vector of 0's and 1's, so everything looks the same to the model unless you explicitly indicate where different features are. This is probably overly specific, as I imagine most SVM implementations will do this on their own, but if you like to write your own, it might be something to think about!
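A rough Python sketch of that fixed-position idea, with hand-picked cut points standing in for the output of recursive partitioning (the thresholds and slot layout are assumptions for illustration):

import numpy as np

CUTS = [1960, 1990]                       # assumed split points for "year"

def year_to_binary(year):
    vec = np.zeros(len(CUTS) + 1)         # one slot per partition cell
    vec[np.searchsorted(CUTS, year)] = 1  # the same slot at train and test time
    return vec

year_to_binary(1950)   # array([1., 0., 0.])
year_to_binary(1998)   # array([0., 0., 1.])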

 
 
thanks Kyle, can you be a little more specific? What do you mean "assign your partitioned categorical features to a specific place"? –   Seamus Abshere  Feb 21 '13 at 2:42
 
@SeamusAbshere No problem! I edited my answer to address this! –   Kyle.  Feb 21 '13 at 3:01
 
I feel like I've heard that libsvm does what you're talking about automatically - any thoughts? –  Seamus Abshere  Feb 21 '13 at 15:31
 
@SeamusAbshere I imagine you're right, but I don't know for sure. Now that I think about it, I'm not sure how it could work any other way. –   Kyle.  Feb 21 '13 at 15:51
 
Emboldened by @Kyle's answer, I wrote a Ruby library (VectorEmbed) that does this conversion (embedding) automatically, both for categorical (using Murmur32 hashes) and continuous data. It outputs libsvm-formatted files. –   Seamus Abshere  Mar 29 '13 at 0:38
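A rough Python sketch of the hashing trick that comment describes, with zlib.crc32 standing in for Murmur32 purely for illustration; the bucket count and index offset are assumptions, not part of VectorEmbed's API.

import zlib

N_BUCKETS = 2 ** 18                        # assumed size of the hashed feature space

def hashed_index(column, value, offset=3):
    # Map a (column, category) pair to a stable LIBSVM feature index.
    key = f"{column}={value}".encode("utf-8")
    return offset + zlib.crc32(key) % N_BUCKETS

hashed_index("use", "SINGLE FAMILY")       # same index on every run
hashed_index("use", "CONDOMINIUM")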