I have a dataset like this:
+--------+------+-------------------+
| income | year | use               |
+--------+------+-------------------+
|  46328 | 1989 | COMMERCIAL EXEMPT |
|  75469 | 1998 | CONDOMINIUM       |
|  49250 | 1950 | SINGLE FAMILY     |
|  82354 | 2001 | SINGLE FAMILY     |
|  88281 | 1985 | SHOP & HOUSE      |
+--------+------+-------------------+
I encode it as feature vectors in LIBSVM format:
+1 1:46328 2:1989 3:1
-1 1:75469 2:1998 4:1
+1 1:49250 2:1950 5:1
-1 1:82354 2:2001 5:1
+1 1:88281 2:1985 6:1
Feature indices:
- 1 is "income"
- 2 is "year"
- 3 is "use/COMMERCIAL EXEMPT"
- 4 is "use/CONDOMINIUM"
- 5 is "use/SINGLE FAMILY"
- 6 is "use/SHOP & HOUSE"
Is it OK to train a support vector machine (SVM) with a mix of continuous (year, income) and categorical (use) data like this?
- If you are sure the categorical attribute is actually ordinal, just treat it as a numerical attribute.
- If not, use a coding trick to turn it into numerical attributes. As the authors of LIBSVM suggest, you can simply use 1-of-K coding. For instance, suppose a single categorical attribute takes values in {A,B,C}. Turn it into three numerical dimensions such that A=(1,0,0), B=(0,1,0), C=(0,0,1); features 3 to 6 in the question are already coded this way. Of course, this adds one dimension per category to your problem, but I think that is not a serious problem for a modern SVM solver, whether you adopt a linear or a kernel type.
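To make the 1-of-K coding concrete, here is a minimal sketch in plain Python that rebuilds the LIBSVM lines from the question. The helper `to_libsvm_lines` and its parameter `first_category_index` are hypothetical names of my own, not part of LIBSVM, and I am assuming each category gets its index in order of first appearance, which is what the question's numbering looks like.

```python
# Rows copied from the question: (label, income, year, use).
rows = [
    (+1, 46328, 1989, "COMMERCIAL EXEMPT"),
    (-1, 75469, 1998, "CONDOMINIUM"),
    (+1, 49250, 1950, "SINGLE FAMILY"),
    (-1, 82354, 2001, "SINGLE FAMILY"),
    (+1, 88281, 1985, "SHOP & HOUSE"),
]


def to_libsvm_lines(rows, first_category_index=3):
    """Hypothetical helper: 1-of-K code the 'use' column as sparse LIBSVM lines."""
    category_index = {}  # category name -> feature index
    lines = []
    for label, income, year, use in rows:
        # Assumption: a new category takes the next free index,
        # in order of first appearance.
        if use not in category_index:
            category_index[use] = first_category_index + len(category_index)
        # Only the active category is written, with value 1; indices that
        # are left out of a sparse LIBSVM line are treated as 0.
        lines.append(f"{label:+d} 1:{income} 2:{year} {category_index[use]}:1")
    return lines, category_index


lines, category_index = to_libsvm_lines(rows)
for line in lines:
    print(line)
```

Because the format is sparse, each row stores a single extra `index:1` entry no matter how many categories exist, so the additional dimensions cost very little in the file itself.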