Is it OK to mix categorical and continuous data for SVM (Support Vector Machines)?

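This post discusses how to use a support vector machine (SVM) on a dataset that contains both continuous and categorical features, and describes ways to convert categorical data into numerical data, such as 1-of-K encoding.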

I have a dataset like this:

+--------+------+-------------------+
| income | year |        use        |
+--------+------+-------------------+
|  46328 | 1989 | COMMERCIAL EXEMPT |
|  75469 | 1998 | CONDOMINIUM       |
|  49250 | 1950 | SINGLE FAMILY     |
|  82354 | 2001 | SINGLE FAMILY     |
|  88281 | 1985 | SHOP & HOUSE      |
+--------+------+-------------------+

I embed it into a vector space in LIBSVM format:

+1 1:46328 2:1989 3:1
-1 1:75469 2:1998 4:1
+1 1:49250 2:1950 5:1
-1 1:82354 2:2001 5:1
+1 1:88281 2:1985 6:1

Feature indices:

  • 1 is "income"
  • 2 is "year"
  • 3 is "use/COMMERCIAL EXEMPT"
  • 4 is "use/CONDOMINIUM"
  • 5 is "use/SINGLE FAMILY"
  • 6 is "use/SHOP & HOUSE"
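For illustration, the embedding above can be reproduced with a few lines of Python. This is only a sketch: `FEATURE_INDEX` and `to_libsvm_line` are hypothetical names, the labels are copied from the example lines, and the index map is the one listed above.

```python
# Fixed index map, copied from the feature list above.
FEATURE_INDEX = {
    "income": 1,
    "year": 2,
    "use/COMMERCIAL EXEMPT": 3,
    "use/CONDOMINIUM": 4,
    "use/SINGLE FAMILY": 5,
    "use/SHOP & HOUSE": 6,
}


def to_libsvm_line(label, income, year, use):
    """Format one record as a sparse LIBSVM line (hypothetical helper)."""
    features = {
        FEATURE_INDEX["income"]: income,
        FEATURE_INDEX["year"]: year,
        # 1-of-K: only the index of the matching category is set to 1;
        # the other "use/..." indices are zero and therefore omitted.
        FEATURE_INDEX["use/" + use]: 1,
    }
    # LIBSVM expects index:value pairs in ascending index order.
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label:+d} {pairs}"


rows = [
    (+1, 46328, 1989, "COMMERCIAL EXEMPT"),
    (-1, 75469, 1998, "CONDOMINIUM"),
    (+1, 49250, 1950, "SINGLE FAMILY"),
    (-1, 82354, 2001, "SINGLE FAMILY"),
    (+1, 88281, 1985, "SHOP & HOUSE"),
]
lines = [to_libsvm_line(*row) for row in rows]
for line in lines:
    print(line)
```

An unknown `use` value raises a `KeyError` here, which is probably what you want: a category that was never assigned an index cannot be embedded consistently.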

Is it OK to train a support vector machine (SVM) with a mix of continuous (year, income) and categorical (use) data like this?

  1. If you are sure the categorical attribute is actually ordinal, then just treat it as a numerical attribute.
  2. If not, use a coding trick to turn it into numerical attributes. Following the suggestion of the author of libsvm, you can simply use 1-of-K coding. For instance, suppose a one-dimensional categorical attribute takes values from {A, B, C}. Turn it into three numerical attributes such that A = (1,0,0), B = (0,1,0), C = (0,0,1). Of course, this adds one dimension per category value to your problem, but I think that is not a serious problem for a modern SVM solver (whether you use a linear or a kernel type).
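A minimal sketch of the two options, using the {A, B, C} example (the helper names `one_of_k` and `sq_dist` are made up for this illustration). It also shows why the choice matters: under 1-of-K coding every pair of distinct categories is equally far apart, while an integer coding makes A and C look farther apart than A and B.

```python
def one_of_k(value, categories):
    """Return the 1-of-K (one-hot) tuple for `value` (hypothetical helper)."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return tuple(1 if value == c else 0 for c in categories)


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


CATEGORIES = ("A", "B", "C")

# Option 2: 1-of-K coding, one dimension per category value.
encoded = {c: one_of_k(c, CATEGORIES) for c in CATEGORIES}

# Option 1: integer coding, only sensible if A < B < C is a real ordering.
ordinal = {c: (i,) for i, c in enumerate(CATEGORIES)}

print(encoded)
print(sq_dist(encoded["A"], encoded["B"]), sq_dist(encoded["A"], encoded["C"]))
print(sq_dist(ordinal["A"], ordinal["B"]), sq_dist(ordinal["A"], ordinal["C"]))
```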

You should spell out "SVM", at least once. –   Peter Flom  Feb 21 '13 at 1:28
Make sure you scale that data! –   Patrick Caldon  Feb 21 '13 at 2:16
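To make the scaling comment concrete: income is in the tens of thousands while the category indicators are 0 or 1, so without scaling the income column dominates any distance or dot product. LIBSVM ships an `svm-scale` tool for this; the snippet below is only a rough pure-Python sketch of min-max scaling to [0, 1], with `min_max_scale` being a hypothetical helper name.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Linearly rescale `values` so the smallest maps to lo and the largest to hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant column carries no information; map everything to lo.
        return [lo for _ in values]
    span = vmax - vmin
    return [lo + (hi - lo) * (v - vmin) / span for v in values]


incomes = [46328, 75469, 49250, 82354, 88281]
years = [1989, 1998, 1950, 2001, 1985]

scaled_incomes = min_max_scale(incomes)
scaled_years = min_max_scale(years)

for inc, yr in zip(scaled_incomes, scaled_years):
    print(f"{inc:.3f} {yr:.3f}")
```

Whatever minimum and maximum you compute on the training data should be reused unchanged when scaling test data, otherwise the two sets end up in different coordinate systems.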

1 Answer

Accepted answer (5 votes)

Yes! But maybe not in the way you mean. In my research I frequently create categorical features from continuous-valued ones using an algorithm like recursive partitioning. I usually use this approach with the SVMLight implementation of support vector machines, but I've used it with LibSVM as well. You'll need to make sure each partitioned categorical feature is assigned to the same fixed position in your feature vector during both training and classification; otherwise your model will end up jumbled.

Edit: That is to say, when I've done this, I assign the first n elements of the vector to the binary values produced by recursive partitioning. In binary feature modeling you just have a giant vector of 0s and 1s, so everything looks the same to the model unless you explicitly indicate where the different features are. This is probably overly specific, as I imagine most SVM implementations will do this on their own, but if you like to program your own, it might be something to think about!
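Here is a small sketch of what "a specific place in the feature vector" could look like. The cut points are invented for the example and stand in for whatever a recursive partitioning step would produce; `bin_indicator` and `embed` are hypothetical names. The point is only that the same function, with the same cut points and the same concatenation order, is used at training and at classification time.

```python
from bisect import bisect_right

# Hypothetical cut points; in practice these would come from the
# partitioning step and be stored alongside the trained model.
INCOME_CUTS = [50000, 80000]  # 3 bins -> vector positions 0..2
YEAR_CUTS = [1970, 1995]      # 3 bins -> vector positions 3..5


def bin_indicator(value, cuts):
    """Return a 0/1 list with a single 1 marking the bin that contains `value`."""
    bins = [0] * (len(cuts) + 1)
    bins[bisect_right(cuts, value)] = 1
    return bins


def embed(income, year):
    # The concatenation order is the contract: income bins first, then year bins.
    return bin_indicator(income, INCOME_CUTS) + bin_indicator(year, YEAR_CUTS)


vec = embed(46328, 1989)
print(vec)
```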

thanks Kyle, can you be a little more specific? What do you mean "assign your partitioned categorical features to a specific place"? –   Seamus Abshere  Feb 21 '13 at 2:42
 
@SeamusAbshere No problem! I edited my answer to address this! –   Kyle.  Feb 21 '13 at 3:01
 
I feel like I've heard that libsvm does what you're talking about automatically - any thoughts? –  Seamus Abshere  Feb 21 '13 at 15:31
 
@SeamusAbshere I imagine you're right, but I don't know for sure. Now that I think about it, I'm not sure how it could work any other way. –   Kyle.  Feb 21 '13 at 15:51
 
Emboldened by @Kyle's answer, I wrote a Ruby library (VectorEmbed) that does this conversion (embedding) automatically, both for categorical (using Murmur32 hashes) and continuous data. It outputs libsvm-formatted files. –   Seamus Abshere  Mar 29 '13 at 0:38
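For readers curious about the hashing idea, here is a rough Python sketch. It is not VectorEmbed's code and makes several assumptions: CRC32 from the standard library stands in for MurmurHash, `NUM_BUCKETS` is an arbitrary size, and continuous values are simply passed through under the hashed index of their column name. The names `hashed_index` and `embed_hashed` are hypothetical.

```python
import zlib

NUM_BUCKETS = 2 ** 20  # assumed size of the hashed index space


def hashed_index(feature_name, num_buckets=NUM_BUCKETS):
    """Map a feature name to a 1-based index (CRC32 as a stand-in hash)."""
    return zlib.crc32(feature_name.encode("utf-8")) % num_buckets + 1


def embed_hashed(label, record):
    """Format a dict record as a LIBSVM line without a precomputed index map."""
    features = {}
    for key, value in record.items():
        if isinstance(value, str):
            # Categorical: hash "column/value" and set that index to 1.
            features[hashed_index(f"{key}/{value}")] = 1
        else:
            # Continuous: hash the column name and keep the raw value.
            features[hashed_index(key)] = value
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label:+d} {pairs}"


line = embed_hashed(+1, {"income": 46328, "year": 1989, "use": "COMMERCIAL EXEMPT"})
print(line)
```

The upside is that no index map has to be kept in sync between training and classification, since the same name always hashes to the same index. The downside is that two different features can collide on one index; in this sketch the later one silently overwrites the earlier one.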