Linear and Nonlinear Classification with scikit-learn

Licensed under CC 4.0 BY-SA


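The Iris dataset is one of the most widely used datasets in machine learning, and this article uses it to work through classification systematically. The raw data is available from http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data.
Classifiers are commonly divided into linear and nonlinear ones. This article covers the perceptron, logistic regression, SVM (support vector machine), decision trees, and random forests under the linear heading, and kernel SVM and k-nearest neighbors under the nonlinear one. Strictly speaking, only the perceptron, logistic regression, and the linear SVM learn a linear decision boundary; decision trees and random forests partition the feature space into axis-aligned regions, so their boundaries are generally nonlinear.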
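To make the distinction concrete, here is a minimal sketch on a toy XOR dataset. The data and all variable names are assumptions chosen for illustration and are not part of the Iris workflow. No single straight line separates the two XOR classes, so a linear model such as the perceptron cannot fit all four points, while a 1-nearest-neighbor classifier can.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier

# Toy XOR data (illustrative assumption): not linearly separable
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

# Linear classifier: its decision boundary is a single straight line
linear_clf = Perceptron(random_state=0).fit(X_xor, y_xor)
linear_acc = linear_clf.score(X_xor, y_xor)

# Nonlinear classifier: each point takes the label of its nearest training point
knn_clf = KNeighborsClassifier(n_neighbors=1).fit(X_xor, y_xor)
knn_acc = knn_clf.score(X_xor, y_xor)

print("perceptron training accuracy:", linear_acc)
print("1-NN training accuracy:", knn_acc)
```

The perceptron misclassifies at least one of the four points regardless of its weights, which is what "not linearly separable" means in practice.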

I. Linear Classification

(1) The Perceptron Algorithm

Before classifying the data, we preprocess it.

from sklearn import datasets
import numpy as np
# The Iris dataset ships with scikit-learn
iris = datasets.load_iris()
# Feature matrix: petal length and petal width (columns 2 and 3)
X = iris.data[:,[2,3]]
# Class labels of the flowers
y = iris.target

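# Split the dataset into a training set and a test set
# (train_test_split lives in sklearn.model_selection; the old
#  sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
# test_size = 0.3 holds out 30% of the samples as the test set;
# random_state = 0 fixes the random seed, so the shuffled split is reproducible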
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
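As a quick sanity check on the split, the sketch below repeats it in a self-contained form and inspects the result. It assumes a scikit-learn release that provides `sklearn.model_selection`; the names `X_train_again` and `same_split` are illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 150 samples in total: 105 go to training, 45 to testing
print(X_train.shape, X_test.shape)

# Splitting again with the same seed yields exactly the same partition
X_train_again, _, _, _ = train_test_split(
    X, y, test_size=0.3, random_state=0)
same_split = np.array_equal(X_train, X_train_again)
print("reproducible split:", same_split)
```

Changing `random_state` to another integer gives a different, but equally reproducible, partition.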

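# Standardize the feature matrix
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Estimate the mean and standard deviation of each feature from the training set
sc.fit(X_train)
# Standardize the training features with those statistics
X_train_std = sc.transform(X_train)
# Standardize the test set with the same training-set mean and standard deviation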
X_test_std = sc.transform(X_test)
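The following self-contained sketch checks what standardization does here: after the transform, each training feature has mean 0 and standard deviation 1, and the test set is scaled with the training statistics rather than its own. It relies on the `mean_` and `scale_` attributes of a fitted `StandardScaler`; the name `manual_test_std` is illustrative.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

sc = StandardScaler()
sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

# Training features are centered and scaled to unit variance
print("train mean:", X_train_std.mean(axis=0))
print("train std: ", X_train_std.std(axis=0))

# The test set is transformed with the training mean and standard deviation,
# so its own mean and std are close to, but not exactly, 0 and 1
manual_test_std = (X_test - sc.mean_) / sc.scale_
print("matches manual formula:", np.allclose(X_test_std, manual_test_std))
```

Fitting the scaler on the training set only keeps information from the test set out of the preprocessing step.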

# Helper function for plotting decision regions
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
import warnings


def versiontuple(v):
    return tuple(map(int, (v.split("."))))


def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):

    # setup marker generator and color map
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # plot the decision surface
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0]<