Kaggle Learning: Machine Learning from Disaster (2) - SVM

This article introduces the basic principle of the SVM (Support Vector Machine): a nonlinear mapping transforms samples from a low-dimensional space into a high-dimensional one where they become linearly separable, while kernel functions let SVM avoid the computational cost and the curse of dimensionality. The article covers four common kernel types, discusses the SVC class in sklearn and its parameters, and finally demonstrates SVM on the Titanic dataset, where its prediction accuracy beats logistic regression and random forest.


1. Introduction to SVM

(The following is adapted from the Baidu Baike entry for SVM.)

The SVM method uses a nonlinear mapping φ to map the sample space into a high-dimensional (possibly infinite-dimensional) feature space (a Hilbert space), so that a problem that is not linearly separable in the original sample space becomes linearly separable in the feature space. In short: raise the dimension, then linearize.

Raising the dimension means mapping the samples into a higher-dimensional space. In general this increases computational complexity and can even trigger the "curse of dimensionality", so the approach was long avoided. For classification and regression problems, however, a sample set that cannot be handled linearly in the low-dimensional sample space can often be separated (or regressed) by a linear hyperplane in a high-dimensional feature space.

Raising the dimension usually makes computation more complex, but the SVM method sidesteps this difficulty elegantly: by applying the kernel expansion theorem, the explicit form of the nonlinear mapping is never needed. Because the linear learner is built in the high-dimensional feature space, it adds almost no computational complexity compared with a linear model, and to some extent it avoids the "curse of dimensionality". All of this is owed to the theory of kernel expansion and computation.
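To make the kernel trick concrete, here is a minimal sketch (an illustration added here, not from the original article) for the degree-2 polynomial kernel K(x, y) = (x·y)^2: the kernel value computed in the original space equals an inner product under an explicit feature map φ, so φ itself never has to be evaluated when training an SVM.

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

# Explicit degree-2 feature map: phi(v) = (v1^2, sqrt(2)*v1*v2, v2^2)
def phi(v):
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

explicit = phi(x) @ phi(y)  # inner product computed in the feature space
implicit = (x @ y) ** 2     # kernel evaluated in the original space

print(explicit, implicit)   # both print 121.0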

Choosing different kernel functions yields different SVMs. The four most commonly used kernels are listed below (a sketch of how they map onto scikit-learn's API follows the list):

1. Linear kernel: K(x, y) = x·y
2. Polynomial kernel: K(x, y) = [(x·y) + 1]^d
3. Radial basis function (RBF) kernel: K(x, y) = exp(-|x - y|^2 / d^2)
4. Two-layer neural network (sigmoid) kernel: K(x, y) = tanh(a(x·y) + b)
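As a rough sketch (not from the original article), the four kernels above correspond to SVC's kernel parameter as follows. Note that scikit-learn parameterizes 'poly' and 'sigmoid' slightly differently, with a gamma factor on the inner product:

from sklearn import svm

# The "two-layer neural network" kernel is called 'sigmoid' in scikit-learn.
svm.SVC(kernel='linear')                         # K(x, y) = x·y
svm.SVC(kernel='poly', degree=3, coef0=1.0)      # K(x, y) = (gamma*(x·y) + coef0)^degree
svm.SVC(kernel='rbf', gamma=1.0)                 # K(x, y) = exp(-gamma*|x - y|^2), gamma plays the role of 1/d^2
svm.SVC(kernel='sigmoid', gamma=1.0, coef0=0.0)  # K(x, y) = tanh(gamma*(x·y) + coef0)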

2. Python's SVC Package

scikit-learn ships with the SVC package; its official documentation is at:

http://scikit-learn.org/stable/modules/svm.html#svm

For classification, the official Python usage example is as follows:


>>> from sklearn import svm
>>> X = [[0, 0], [1, 1]]
>>> y = [0, 1]
>>> clf = svm.SVC()
>>> clf.fit(X, y)  
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
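Once fitted, the model can predict labels for new samples (continuing the official documentation example):

>>> clf.predict([[2., 2.]])
array([1])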

SVC parameter descriptions (see http://blog.youkuaiyun.com/szlcw1/article/details/52336824 for details):

1. C: the penalty parameter, default 1.0. A larger C penalizes the slack variables more heavily, pushing them toward 0; misclassification then costs more and the model tends toward classifying the entire training set correctly, so training accuracy is high but generalization is weak. A smaller C lowers the cost of misclassification, tolerating some errors as noise, which generalizes better.

2. kernel: the kernel function, default 'rbf'; one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'.

3. degree: the degree of the 'poly' kernel, default 3; ignored by all other kernels.

4. gamma: kernel coefficient for 'rbf', 'poly', and 'sigmoid'. The default 'auto' uses 1/n_features.

5. coef0: the constant term of the kernel function; only significant for 'poly' and 'sigmoid'.

6. probability: whether to enable probability estimates, default False.

7. shrinking: whether to use the shrinking heuristic, default True.

8. tol: tolerance for the stopping criterion, default 1e-3.

9. cache_size: size of the kernel cache in MB, default 200.

10. class_weight: class weights, passed as a dict; sets the parameter C of a given class to weight*C (the C of C-SVC).

11. verbose: enable verbose output.

12. max_iter: maximum number of iterations; -1 means no limit.

13. decision_function_shape: 'ovo', 'ovr', or None. Documented as default=None in older versions, although the fitted model output above shows 'ovr' as the default in the version used here.

14. random_state: int seed for the pseudo-random number generator used when shuffling the data.

The main parameters to tune are C, kernel, degree, gamma, and coef0; a tuning sketch is shown below.
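As a hedged sketch of how such tuning might be done (using synthetic data as a stand-in for a real feature matrix; not part of the original article), GridSearchCV can search a parameter grid with cross-validation:

from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a real feature matrix.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

param_grid = {
    "C": [0.1, 1.0, 10.0],
    "kernel": ["rbf", "poly"],
    "gamma": [0.01, 0.1, 1.0],
}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)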

3. Hands-on Practice

Applying sklearn's SVC to make predictions on the Titanic data; the code is as follows:

#SVM
from sklearn import svm

# `titanic` is the preprocessed DataFrame from the earlier post in this series
# ("Sex" and "Embarked" already encoded numerically, missing values filled).
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]

clf = svm.SVC()
# All defaults; equivalent to:
#clf = svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, probability=False,
#              tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape=None)

clf.fit(titanic[predictors], titanic["Survived"])
print(clf.score(titanic[predictors], titanic["Survived"]))  # accuracy on the training data

With all parameters left at their defaults, the result is 0.83, better than LR (0.76) and RF (0.81). Note, however, that clf.score here is evaluated on the same data the model was fit on, so all three numbers are training accuracies rather than estimates of generalization.
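Because the score above is measured on the training data, a fairer comparison would use cross-validation. A minimal sketch, reusing titanic and predictors from the snippet above:

from sklearn.model_selection import cross_val_score

scores = cross_val_score(svm.SVC(), titanic[predictors], titanic["Survived"], cv=5)
print(scores.mean())  # mean held-out accuracy across 5 folds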
