Is it OK to mix categorical and continuous data for SVM (Support Vector Machines)?

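This post discusses how to use a support vector machine (SVM) on a dataset that contains both continuous and categorical features, and describes ways to convert categorical data into numeric form, such as 1-of-K coding.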


I have a dataset like

+--------+------+-------------------+
| income | year |        use        |
+--------+------+-------------------+
|  46328 | 1989 | COMMERCIAL EXEMPT |
|  75469 | 1998 | CONDOMINIUM       |
|  49250 | 1950 | SINGLE FAMILY     |
|  82354 | 2001 | SINGLE FAMILY     |
|  88281 | 1985 | SHOP & HOUSE      |
+--------+------+-------------------+

I embed it into a LIBSVM format vector space

+1 1:46328 2:1989 3:1
-1 1:75469 2:1998 4:1
+1 1:49250 2:1950 5:1
-1 1:82354 2:2001 5:1
+1 1:88281 2:1985 6:1

Feature indices:

1 is "income"
2 is "year"
3 is "use/COMMERCIAL EXEMPT"
4 is "use/CONDOMINIUM"
5 is "use/SINGLE FAMILY"
6 is "use/SHOP & HOUSE"

Is it OK to train a support vector machine (SVM) with a mix of continuous (year, income) and categorical (use) data like this?

If you are sure the categorical attribute is actually ordinal, just treat it as a numerical attribute.
If not, use a coding scheme to turn it into numerical attributes. Following the suggestion of the LIBSVM authors, you can simply use 1-of-K coding. For instance, suppose a single categorical attribute takes values in {A, B, C}. Turn it into three numbers such that A = (1,0,0), B = (0,1,0), C = (0,0,1). Of course, this adds one dimension per category to your problem, but I think that is not a serious problem for a modern SVM solver, whether you use a linear or a kernel SVM.
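As a rough illustration of the 1-of-K coding described above, here is a minimal Python sketch. The helper names `one_of_k` and `to_libsvm` are made up for this example; the rows, labels, and feature indices are the ones from the question. It is only one way to write the encoding, not code taken from LIBSVM.

```python
def one_of_k(value, categories):
    """Return the 1-of-K vector for `value`, given an ordered list of categories."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return tuple(1 if value == c else 0 for c in categories)

# The {A, B, C} example from the text.
print(one_of_k("B", ["A", "B", "C"]))  # (0, 1, 0)

# The rows from the question: (label, income, year, use).
rows = [
    (+1, 46328, 1989, "COMMERCIAL EXEMPT"),
    (-1, 75469, 1998, "CONDOMINIUM"),
    (+1, 49250, 1950, "SINGLE FAMILY"),
    (-1, 82354, 2001, "SINGLE FAMILY"),
    (+1, 88281, 1985, "SHOP & HOUSE"),
]

# Order fixes the feature index of each category: 3, 4, 5, 6.
use_values = ["COMMERCIAL EXEMPT", "CONDOMINIUM", "SINGLE FAMILY", "SHOP & HOUSE"]

def to_libsvm(label, income, year, use):
    """Format one row as a sparse LIBSVM line; zero entries are simply omitted."""
    use_index = 3 + use_values.index(use)
    return f"{label:+d} 1:{income} 2:{year} {use_index}:1"

for row in rows:
    print(to_libsvm(*row))
```

Because the LIBSVM format is sparse, only the single nonzero entry of each 1-of-K vector is written out, which is why every line in the question carries just one `use` index.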

edited Jun 12 '13 at 9:55 by COOLSerdash

asked Feb 21 '13 at 0:56 by Seamus Abshere

You should spell out "SVM", at least once. – Peter Flom♦ Feb 21 '13 at 1:28

Make sure you scale that data! – Patrick Caldon Feb 21 '13 at 2:16
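To make the scaling advice concrete: income is in the tens of thousands and year is around 2000, while the 1-of-K features are 0 or 1, so the raw magnitudes differ by several orders. The sketch below rescales the two continuous columns from the question to [0, 1] with a simple min-max rule. The function name `min_max_scale` is hypothetical, and min-max is only one possible choice of scaling.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Linearly map `values` onto [lo, hi]; a constant column maps to `lo`."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [lo for _ in values]
    return [lo + (hi - lo) * (v - vmin) / (vmax - vmin) for v in values]

incomes = [46328, 75469, 49250, 82354, 88281]
years = [1989, 1998, 1950, 2001, 1985]

scaled_incomes = min_max_scale(incomes)
scaled_years = min_max_scale(years)

for income, year in zip(scaled_incomes, scaled_years):
    print(f"1:{income:.4f} 2:{year:.4f}")
```

Whatever minimum and maximum are computed on the training data should be reused unchanged when scaling new data at prediction time.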


1 Answer


5 votes, accepted

Yes! But maybe not in the way you mean. In my research I frequently create categorical features from continuous-valued ones using an algorithm like recursive partitioning. I usually use this approach with the SVMLight implementation of support vector machines, but I've used it with LibSVM as well. You'll need to be sure you assign your partitioned categorical features to the same place in your feature vector during both training and classification, otherwise your model is going to end up jumbled.

Edit: That is to say, when I've done this, I assign the first n elements of the vector to the binary values associated with the output of recursive partitioning. In binary feature modeling, you just have a giant vector of 0's and 1's, so everything looks the same to the model, unless you explicitly indicate where different features are. This is probably overly specific, as I imagine most SVM implementations will do this on their own, but, if you like to program your own, it might be something to think about!
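A small sketch of what that fixed layout could look like, under stated assumptions: the cut points below are invented for illustration (a recursive partitioning step would normally supply them), and `bin_indicator` and `featurize` are hypothetical helper names. Each continuous value becomes a block of binary indicators, and each block always occupies the same positions in the vector.

```python
from bisect import bisect_right

# Hypothetical cut points; in practice they would come from recursive partitioning.
INCOME_CUTS = [50000, 80000]  # 3 bins, always stored at positions 0..2
YEAR_CUTS = [1960, 1990]      # 3 bins, always stored at positions 3..5

def bin_indicator(value, cuts):
    """Return a binary vector with a single 1 marking the bin that holds `value`."""
    vec = [0] * (len(cuts) + 1)
    vec[bisect_right(cuts, value)] = 1
    return vec

def featurize(income, year):
    """Concatenate the blocks in a fixed order, identical for training and prediction."""
    return bin_indicator(income, INCOME_CUTS) + bin_indicator(year, YEAR_CUTS)

print(featurize(46328, 1989))  # [1, 0, 0, 0, 1, 0]
print(featurize(88281, 1985))  # [0, 0, 1, 0, 1, 0]
```

The point of the fixed order is that position 4 always means "year in the middle bin"; if the blocks were ordered differently at prediction time, the model would read the same bits as different features.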

edited Feb 21 '13 at 3:01

answered Feb 21 '13 at 2:01 by Kyle.

thanks Kyle, can you be a little more specific? What do you mean "assign your partitioned categorical features to a specific place"? – Seamus Abshere Feb 21 '13 at 2:42

@SeamusAbshere No problem! I edited my answer to address this! – Kyle. Feb 21 '13 at 3:01

I feel like I've heard that libsvm does what you're talking about automatically - any thoughts? – Seamus Abshere Feb 21 '13 at 15:31

@SeamusAbshere I imagine you're right, but I don't know for sure. Now that I think about it, I'm not sure how it could work any other way. – Kyle. Feb 21 '13 at 15:51

Emboldened by @Kyle's answer, I wrote a Ruby library (VectorEmbed) that does this conversion (embedding) automatically, both for categorical (using Murmur32 hashes) and continuous data. It outputs libsvm-formatted files. – Seamus Abshere Mar 29 '13 at 0:38
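For readers curious how hashing can stand in for an explicit category-to-index table, here is a minimal sketch of the general idea. It is not VectorEmbed's actual scheme: it uses `zlib.crc32` from the Python standard library as a stand-in for a Murmur hash, and `NUM_BUCKETS`, `OFFSET`, `hashed_index`, and `to_libsvm_hashed` are all names and values assumed for this example.

```python
import zlib

NUM_BUCKETS = 2 ** 20  # assumed size of the hashed index space
OFFSET = 3             # indices 1 and 2 stay reserved for income and year

def hashed_index(feature, value):
    """Map a (feature, category) pair to a stable feature index without a lookup table."""
    key = f"{feature}/{value}".encode("utf-8")
    return OFFSET + zlib.crc32(key) % NUM_BUCKETS

def to_libsvm_hashed(label, income, year, use):
    """Format one row as a LIBSVM line, with indices in ascending order."""
    return f"{label:+d} 1:{income} 2:{year} {hashed_index('use', use)}:1"

print(to_libsvm_hashed(+1, 46328, 1989, "COMMERCIAL EXEMPT"))
print(to_libsvm_hashed(-1, 75469, 1998, "CONDOMINIUM"))
```

The trade-off is that two different categories can collide on the same index; a large bucket count makes that unlikely but not impossible.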