Is it OK to mix categorical and continuous data for SVM (Support Vector Machines)?

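This article discusses how to use a support vector machine (SVM) on a dataset that contains both continuous and categorical features, and describes ways to convert categorical data to numeric data, such as 1-of-K encoding.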

I have a dataset like

+--------+------+-------------------+
| income | year |        use        |
+--------+------+-------------------+
|  46328 | 1989 | COMMERCIAL EXEMPT |
|  75469 | 1998 | CONDOMINIUM       |
|  49250 | 1950 | SINGLE FAMILY     |
|  82354 | 2001 | SINGLE FAMILY     |
|  88281 | 1985 | SHOP & HOUSE      |
+--------+------+-------------------+

I embed it into a vector space in LIBSVM format:

+1 1:46328 2:1989 3:1
-1 1:75469 2:1998 4:1
+1 1:49250 2:1950 5:1
-1 1:82354 2:2001 5:1
+1 1:88281 2:1985 6:1

Feature indices:

  • 1 is "income"
  • 2 is "year"
  • 3 is "use/COMMERCIAL EXEMPT"
  • 4 is "use/CONDOMINIUM"
  • 5 is "use/SINGLE FAMILY"
  • 6 is "use/SHOP & HOUSE"
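As a minimal sketch of the embedding above (Python; the names `rows`, `build_use_index`, and `to_libsvm_line` are made up for illustration and are not part of LIBSVM), each distinct `use` value gets its own index after the two continuous features, in order of first appearance:

```python
# Rows from the table above, paired with the labels from the LIBSVM lines.
rows = [
    (+1, 46328, 1989, "COMMERCIAL EXEMPT"),
    (-1, 75469, 1998, "CONDOMINIUM"),
    (+1, 49250, 1950, "SINGLE FAMILY"),
    (-1, 82354, 2001, "SINGLE FAMILY"),
    (+1, 88281, 1985, "SHOP & HOUSE"),
]

FIRST_CATEGORY_INDEX = 3  # indices 1 and 2 are taken by income and year


def build_use_index(rows):
    """Map each distinct `use` value to an index, in order of first appearance."""
    index = {}
    for _label, _income, _year, use in rows:
        if use not in index:
            index[use] = FIRST_CATEGORY_INDEX + len(index)
    return index


def to_libsvm_line(label, income, year, use, use_index):
    """Format one row; indicator features that are 0 are simply left out."""
    return f"{label:+d} 1:{income} 2:{year} {use_index[use]}:1"


use_index = build_use_index(rows)
lines = [to_libsvm_line(*row, use_index) for row in rows]
print("\n".join(lines))
```

Because the format is sparse, only the one active `use` indicator appears on each line.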

Is it OK to train a support vector machine (SVM) with a mix of continuous (year, income) and categorical (use) data like this?

  1. If you are sure the categorical attribute is actually ordinal, just treat it as a numerical attribute.
  2. If not, use a coding trick to turn it into numerical attributes. Following the suggestion of the author of libsvm, you can simply use 1-of-K coding. For instance, suppose a 1-dimensional categorical attribute takes values from {A,B,C}. Turn it into a 3-dimensional vector such that A=(1,0,0), B=(0,1,0), C=(0,0,1). Of course, this adds K-1 extra dimensions for every attribute with K categories, but I think that is not a serious problem for a modern SVM solver, whether you use a linear or a kernel SVM.
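The two options can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the ordinal levels `small`/`medium`/`large` are an invented example, not taken from the dataset in the question.

```python
def ordinal_code(value, ordered_levels):
    """Option 1: an ordinal attribute becomes its rank among the ordered levels."""
    return ordered_levels.index(value)


def one_of_k(value, categories):
    """Option 2: a vector with a 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]


# Hypothetical ordinal attribute: the order of the levels carries meaning.
levels = ["small", "medium", "large"]
rank = ordinal_code("medium", levels)

# The {A, B, C} example: the order is arbitrary, so use one indicator each.
categories = ["A", "B", "C"]
encoded = {value: one_of_k(value, categories) for value in categories}
print(rank, encoded)
```

With 1-of-K coding no category is numerically "closer" to another, which is the point of avoiding a single integer code for unordered values.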

You should spell out "SVM", at least once. –   Peter Flom  Feb 21 '13 at 1:28
Make sure you scale that data! –   Patrick Caldon  Feb 21 '13 at 2:16
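The comment does not say how to scale. One common choice, assumed here, is min-max scaling of each continuous column to [0, 1], so that income and year end up on the same footing as the 0/1 indicator features. A rough sketch, with hypothetical helper names:

```python
def fit_min_max(values):
    """Learn the (low, high) range of a column from the training data."""
    return min(values), max(values)


def apply_min_max(value, low, high):
    """Scale a value using a previously learned range; a constant column maps to 0.0."""
    if high == low:
        return 0.0
    return (value - low) / (high - low)


# Continuous columns from the table in the question.
incomes = [46328, 75469, 49250, 82354, 88281]
years = [1989, 1998, 1950, 2001, 1985]

income_range = fit_min_max(incomes)
year_range = fit_min_max(years)

scaled_incomes = [apply_min_max(v, *income_range) for v in incomes]
scaled_years = [apply_min_max(v, *year_range) for v in years]
print(scaled_incomes)
print(scaled_years)
```

The range should be learned on the training set only and then reused unchanged when scaling data at prediction time.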

1 Answer

up vote 5 down vote accepted

Yes! But maybe not in the way you mean. In my research I frequently create categorical features from continuous-valued ones using an algorithm like recursive partitioning. I usually use this approach with the SVMLight implementation of support vector machines, but I've used it with LibSVM as well. You'll need to be sure you assign your partitioned categorical features to the same fixed positions in your feature vector during both training and classification, otherwise your model is going to end up jumbled.

Edit: That is to say, when I've done this, I assign the first n elements of the vector to the binary values associated with the output of recursive partitioning. In binary feature modeling, you just have a giant vector of 0's and 1's, so everything looks the same to the model, unless you explicitly indicate where different features are. This is probably overly specific, as I imagine most SVM implementations will do this on their own, but, if you like to program your own, it might be something to think about!
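A small sketch of that bookkeeping, in Python. The class `FeatureLayout` and the income bin names and cutoffs are hypothetical, chosen only to show the idea: positions are assigned once at training time and the same mapping is reused at classification time.

```python
class FeatureLayout:
    """Assigns each binary feature name a fixed position, then freezes the mapping."""

    def __init__(self):
        self._position = {}
        self._frozen = False

    def fit(self, names):
        """Assign positions in order of first appearance (training time only)."""
        for name in names:
            if name not in self._position:
                self._position[name] = len(self._position)
        self._frozen = True
        return self

    def encode(self, active_names):
        """Build a 0/1 vector; names never seen in training are ignored."""
        if not self._frozen:
            raise RuntimeError("call fit() before encode()")
        vector = [0] * len(self._position)
        for name in active_names:
            position = self._position.get(name)
            if position is not None:
                vector[position] = 1
        return vector


# Hypothetical bins, e.g. produced by recursive partitioning of income.
bins = ["income<=50000", "50000<income<=80000", "income>80000"]
layout = FeatureLayout().fit(bins)

train_vector = layout.encode(["income>80000"])
test_vector = layout.encode(["income>80000", "bin-not-seen-in-training"])
print(train_vector, test_vector)
```

Both vectors put the same bin in the same slot, so a weight learned for a position during training refers to the same feature at classification time.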

thanks Kyle, can you be a little more specific? What do you mean "assign your partitioned categorical features to a specific place"? –   Seamus Abshere  Feb 21 '13 at 2:42
 
@SeamusAbshere No problem! I edited my answer to address this! –   Kyle.  Feb 21 '13 at 3:01
 
I feel like I've heard that libsvm does what you're talking about automatically - any thoughts? –  Seamus Abshere  Feb 21 '13 at 15:31
 
@SeamusAbshere I imagine you're right, but I don't know for sure. Now that I think about it, I'm not sure how it could work any other way. –   Kyle.  Feb 21 '13 at 15:51
 
Emboldened by @Kyle's answer, I wrote a Ruby library (VectorEmbed) that does this conversion (embedding) automatically, both for categorical (using Murmur32 hashes) and continuous data. It outputs libsvm-formatted files. –   Seamus Abshere  Mar 29 '13 at 0:38
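The hashing idea in that comment can be sketched roughly as follows. This is not VectorEmbed's actual implementation: Python's standard library has no Murmur hash, so `zlib.crc32` is swapped in here, and the bucket count, index offset, and function names are all assumptions made for illustration.

```python
import zlib

N_BUCKETS = 2 ** 20      # assumed number of hashed feature slots
FIRST_HASHED_INDEX = 3   # indices 1 and 2 stay reserved for income and year


def hashed_index(feature, value):
    """Map a categorical feature/value pair to a stable feature index."""
    key = f"{feature}/{value}".encode("utf-8")
    return FIRST_HASHED_INDEX + zlib.crc32(key) % N_BUCKETS


def to_libsvm_line(label, income, year, use):
    """Format one row, with the category's index derived from a hash."""
    return f"{label:+d} 1:{income} 2:{year} {hashed_index('use', use)}:1"


line = to_libsvm_line(+1, 46328, 1989, "COMMERCIAL EXEMPT")
print(line)
```

Hashing removes the need to store a category-to-index table, at the cost that two different values can occasionally collide in the same bucket.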