I. Discriminative and generative models: the basics
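Example: to decide whether a melon is good or bad, the discriminative approach learns one model from historical data, then extracts the melon's features and directly predicts the probability that it is good and the probability that it is bad.
Example: the generative approach first learns a model of good melons from the features of good melons, then a model of bad melons from the features of bad melons. To classify a new melon, extract its features, evaluate their probability under the good-melon model and again under the bad-melon model, and predict whichever class gives the higher probability.
Another example:
Suppose your task is to identify which language an utterance belongs to. Someone walks up and says a sentence, and you have to tell whether it is Chinese, English, French, and so on. There are two ways to do this:
1. Learn every language. You spend a great deal of effort learning Chinese, English, French and the rest, where "learning" means knowing what kind of speech corresponds to which language. When someone speaks to you, you can tell which language it is.
2. Do not learn each language; learn only the differences between them, then classify. That is, you learn that Chinese, English and the other languages differ in pronunciation, and learning those differences is enough.
The first is the generative approach; the second is the discriminative approach.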
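A generative model is a full probabilistic model of all variables, whereas a discriminative model is a conditional probability model of the target variable given the values of the observed variables. A generative model can therefore simulate (that is, generate) the distribution of any variable in the model, while a discriminative model can only produce the target variable conditioned on the observed variables. Because a discriminative model does not model the distribution of the observed variables, it cannot express more complex relationships between observed and target variables. For this reason generative models are the better fit for unsupervised tasks such as clustering.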
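The contrast can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic, the generative side is a hand-written per-class Gaussian with independent features, the discriminative side is scikit-learn's `LogisticRegression`, and all variable names are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic two-class data (illustrative only).
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(4.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Generative side: estimate P(y) and a Gaussian P(x | y) for each class.
priors = np.array([np.mean(y == c) for c in (0, 1)])
means = np.array([X[y == c].mean(axis=0) for c in (0, 1)])
stds = np.array([X[y == c].std(axis=0) for c in (0, 1)])

def generative_posterior(x):
    # log P(x | y) summed over features, plus log P(y), then normalised (Bayes' rule).
    log_lik = -0.5 * np.sum(((x - means) / stds) ** 2
                            + np.log(2 * np.pi * stds ** 2), axis=1)
    log_joint = log_lik + np.log(priors)
    joint = np.exp(log_joint - log_joint.max())
    return joint / joint.sum()

# Discriminative side: fit P(y | x) directly.
disc = LogisticRegression().fit(X, y)

x_new = np.array([3.5, 3.8])
gen_post = generative_posterior(x_new)
disc_post = disc.predict_proba(x_new.reshape(1, -1))[0]
print("generative posterior:    ", gen_post)
print("discriminative posterior:", disc_post)

# Only the generative model can produce new feature vectors for a class.
samples = rng.normal(means[1], stds[1], size=(3, 2))
print("samples drawn from the class-1 model:\n", samples)
```

Both models can classify `x_new`, but only the generative one holds enough information (the class-conditional distributions) to sample new points.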
Conditional probability: the probability that event A occurs given that event B has occurred. It is written P(A|B) and read "the probability of A given B".
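Bayes' formula:
P(X|Y) = P(Y|X) * P(X) / P(Y)
P(X) is the probability that event X occurs, also called the prior probability;
P(Y|X) is the probability that event Y occurs given that X has occurred, also called the likelihood;
P(X|Y) is the probability that event X occurred given that Y has occurred, also called the posterior probability.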
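Maximum likelihood estimation (MLE) is a method for estimating the parameters of a probability model: it chooses the parameter values under which the observed data are most probable.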
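As a small illustration of MLE, the sketch below estimates the parameter p of a Bernoulli model ("this melon is good") from a handful of made-up observations. The data, the grid search and the function name `log_likelihood` are assumptions for this example only; the point is that the value maximising the likelihood matches the closed-form answer, the sample mean.

```python
import numpy as np

# Illustrative observations: 1 = good melon, 0 = bad melon.
obs = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])

def log_likelihood(p):
    # Bernoulli log-likelihood of the observations for parameter p.
    return np.sum(obs * np.log(p) + (1 - obs) * np.log(1 - p))

# Brute-force search over a grid of candidate parameter values.
grid = np.linspace(0.01, 0.99, 99)
p_grid = grid[np.argmax([log_likelihood(p) for p in grid])]

# Closed-form MLE for a Bernoulli parameter: the fraction of ones.
p_closed_form = obs.mean()
print(p_grid, p_closed_form)
```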
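Conditional probability: the probability that a melon is good given that its color is green.
Prior probability: the probability of the "cause" as suggested by common sense, experience or statistics, here the probability that a melon's color is green.
Posterior probability: the probability of the "cause" inferred after the "effect" is known. In other words, if we already know the melon is good, what is the probability that its color is green? Relating the posterior to the prior is the job of Bayesian decision theory.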
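A short worked calculation may help fix the three terms. The numbers below are hypothetical, chosen only so that the arithmetic is easy to follow; X stands for "the melon is green" and Y for "the melon is good".

```python
# Hypothetical numbers, chosen only to illustrate Bayes' formula.
p_green = 0.4                  # prior P(X): the melon is green
p_good_given_green = 0.75      # likelihood P(Y|X): good, given green
p_good_given_not_green = 0.25  # P(Y | not X): good, given not green

# Law of total probability: P(Y) = P(Y|X) P(X) + P(Y|not X) P(not X)
p_good = (p_good_given_green * p_green
          + p_good_given_not_green * (1 - p_green))

# Posterior P(X|Y) = P(Y|X) P(X) / P(Y)
p_green_given_good = p_good_given_green * p_green / p_good
print(p_good, p_green_given_good)
```

With these made-up inputs, knowing that the melon is good raises the probability that it is green from the prior 0.4 to a posterior of 2/3.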
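Under the attribute conditional independence assumption, the posterior over several attributes can be written as:
P(c|x) = P(c) * P(x|c) / P(x) = (P(c) / P(x)) * prod_{i=1..d} P(xi|c)
where d is the number of attributes and xi is the value of x on the i-th attribute.
P(x) is the same for every class, so the Bayes decision rule (pick the class with the largest posterior) gives the naive Bayes expression:
h_nb(x) = argmax_c P(c) * prod_{i=1..d} P(xi|c)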
Naive Bayes implementation:
# coding: utf-8
# Bayes' rule: P(y|x) = P(x|y) * P(y) / P(x)
import numpy as np
import pandas as pd


class Naive_Bayes:
    def __init__(self):
        self.classes = None
        self.class_prior = None
        self.cond_prob = None

    # Training: estimate the class priors and the class-conditional probabilities
    def nb_fit(self, X, y):
        labels = y[y.columns[0]]
        self.classes = labels.unique()
        class_count = labels.value_counts()
        # Class prior P(y = c)
        self.class_prior = class_count / len(labels)
        print('==class_prior:', self.class_prior)
        # Class-conditional probability P(x_col = v | y = c)
        self.cond_prob = dict()
        for col in X.columns:
            for c in self.classes:
                counts = X.loc[(labels == c).values, col].value_counts()
                for v in counts.index:
                    self.cond_prob[(col, v, c)] = counts[v] / class_count[c]
        print('==cond_prob:', self.cond_prob)
        return self.classes, self.class_prior, self.cond_prob

    # Predict a new instance: argmax over c of P(y = c) * prod_i P(x_i | y = c)
    def predict(self, X_test):
        res = []
        for c in self.classes:
            p = self.class_prior[c]
            for col, v in X_test.items():
                # A value never seen together with class c gets probability 0 (no smoothing)
                p *= self.cond_prob.get((col, v, c), 0.0)
            res.append(p)
        return self.classes[np.argmax(res)]


if __name__ == "__main__":
    x1 = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
    x2 = ['S', 'M', 'M', 'S', 'S', 'S', 'M', 'M', 'L', 'L', 'L', 'M', 'M', 'L', 'L']
    y = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
    df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})
    print('==df:\n', df)
    X = df[['x1', 'x2']]
    y = df[['y']]
    X_test = {'x1': 2, 'x2': 'S'}
    nb = Naive_Bayes()
    classes, class_prior, prior = nb.nb_fit(X, y)
    print('Predicted class for the test sample:', nb.predict(X_test))
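Naive Bayes classifier code:
The naive Bayes classifier adopts the "attribute conditional independence assumption": given the class, all attributes are assumed to be mutually independent. In other words, each attribute is assumed to influence the classification result independently of the others.
This version uses Gaussian naive Bayes (GaussianNB), whose probability density function is
P(xi|c) = 1 / (sqrt(2 * pi) * sigma) * exp(-(xi - mu)^2 / (2 * sigma^2))
where mu and sigma are the mean and standard deviation of attribute i within class c.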
import math


class NaiveBayes:
    def __init__(self):
        self.model = None

    # Mean (expected value)
    @staticmethod
    def mean(X):
        """Compute the mean.
        Param: X : list or np.ndarray
        Return:
            avg : float
        """
        return sum(X) / float(len(X))

    # Standard deviation (population form, divides by N)
    def stdev(self, X):
        """Compute the standard deviation.
        Param: X : list or np.ndarray
        Return:
            res : float
        """
        avg = self.mean(X)
        return math.sqrt(sum([pow(x - avg, 2) for x in X]) / float(len(X)))

    # Probability density function
    def gaussian_probability(self, x, mean, stdev):
        """Gaussian density of x for the given mean and standard deviation.
        Parameters:
        ----------
        x : input value
        mean : mean
        stdev : standard deviation
        Return:
            res : float, density of x under that Gaussian
        """
        exponent = math.exp(-(math.pow(x - mean, 2) /
                              (2 * math.pow(stdev, 2))))
        return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

    # Summarise the training samples of one class
    def summarize(self, train_data):
        """Compute the mean and standard deviation of every feature column.
        Param: train_data : list of samples belonging to one class
        Return : [(mean, stdev), ...], one pair per feature
        """
        return [(self.mean(i), self.stdev(i)) for i in zip(*train_data)]

    # Per-class mean and standard deviation
    def fit(self, X, y):
        labels = list(set(y))
        data = {label: [] for label in labels}
        for f, label in zip(X, y):
            data[label].append(f)
        self.model = {
            label: self.summarize(value) for label, value in data.items()
        }
        print(self.model)  # mean and standard deviation of every feature, per class
        return 'gaussianNB train done!'

    # Compute the likelihoods
    def calculate_probabilities(self, input_data):
        """Compute the density of the input under each class's Gaussians.
        Parameter:
            input_data : one input sample
        Return:
            probabilities : {label : p}
        """
        # self.model: {0.0: [(5.0, 0.37), (3.42, 0.40)], 1.0: [(5.8, 0.449), (2.7, 0.27)]}
        # input_data: [1.1, 2.2]
        probabilities = {}
        for label, value in self.model.items():
            probabilities[label] = 1
            for i in range(len(value)):
                mean, stdev = value[i]
                probabilities[label] *= self.gaussian_probability(
                    input_data[i], mean, stdev)
        return probabilities

    # Predicted class: the label with the largest product of densities
    def predict(self, X_test):
        # e.g. {0.0: 2.9680340789325763e-27, 1.0: 3.5749783019849535e-26}
        label = sorted(self.calculate_probabilities(X_test).items(), key=lambda x: x[-1])[-1][0]
        return label

    # Accuracy on a test set
    def score(self, X_test, y_test):
        right = 0
        for X, y in zip(X_test, y_test):
            label = self.predict(X)
            if label == y:
                right += 1
        return right / float(len(X_test))


def test_bayes_model():
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    iris = load_iris()
    X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)
    print(len(X_train))
    print(len(y_train))
    model = NaiveBayes()
    model.fit(X_train, y_train)
    print(model.predict([4.4, 3.2, 1.3, 0.2]))


if __name__ == '__main__':
    test_bayes_model()
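The class above multiplies raw densities, which can become very small (the example comment shows values around 1e-27) and will underflow with many features. A common alternative is to add log-densities instead. The sketch below is an assumption-laden illustration, not part of the original class: `gaussian_log_pdf` and `log_score` are hypothetical helpers, the per-class summaries are made-up numbers in the same `[(mean, stdev), ...]` shape that `summarize()` produces, and `log_prior` is an optional hook because the class above does not include the class prior.

```python
import math

def gaussian_log_pdf(x, mean, stdev):
    # Log of the Gaussian density; avoids forming tiny densities directly.
    return (-0.5 * math.log(2 * math.pi * stdev ** 2)
            - (x - mean) ** 2 / (2 * stdev ** 2))

def log_score(x_row, summaries, log_prior=0.0):
    # summaries: [(mean, stdev), ...], one pair per feature.
    return log_prior + sum(gaussian_log_pdf(x, m, s)
                           for x, (m, s) in zip(x_row, summaries))

# Illustrative per-class summaries for two features (made-up numbers).
model = {0: [(5.0, 0.4), (3.4, 0.4)], 1: [(5.9, 0.5), (2.8, 0.3)]}
x_row = [5.1, 3.5]

scores = {label: log_score(x_row, s) for label, s in model.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Because the logarithm is monotonic, the class with the largest log-score is the same class the product of densities would select.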
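A Bayesian network example with pgmpy:
pgmpy is a Python package for probabilistic graphical models. It covers Bayesian networks, Markov chain Monte Carlo sampling and other common graphical-model tools, together with inference methods.
The figure below is the example of the quality of a student's recommendation letter; it shows the directed graph and the probability tables:
Code: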
# coding: utf-8
# git clone https://github.com/pgmpy/pgmpy
# cd pgmpy
# python setup.py install
from pgmpy.factors.discrete import TabularCPD
# Newer pgmpy releases have renamed this class (for example to BayesianNetwork);
# adjust the import if this one is not available in your version.
from pgmpy.models import BayesianModel

student_model = BayesianModel([('D', 'G'),
                               ('I', 'G'),
                               ('G', 'L'),
                               ('I', 'S')])
# Grade node
grade_cpd = TabularCPD(
    variable='G',  # node name
    variable_card=3,  # number of values the node can take
    values=[[0.3, 0.05, 0.9, 0.5],  # conditional probability table of the node
            [0.4, 0.25, 0.08, 0.3],
            [0.3, 0.7, 0.02, 0.2]],
    evidence=['I', 'D'],  # parent nodes
    evidence_card=[2, 2]  # number of values of each parent
)
# Exam difficulty node
difficulty_cpd = TabularCPD(
    variable='D',
    variable_card=2,
    values=[[0.6, 0.4]]
)
# Intelligence node
intel_cpd = TabularCPD(
    variable='I',
    variable_card=2,
    values=[[0.7, 0.3]]
)
# Recommendation letter node
letter_cpd = TabularCPD(
    variable='L',
    variable_card=2,
    values=[[0.1, 0.4, 0.99],
            [0.9, 0.6, 0.01]],
    evidence=['G'],
    evidence_card=[3]
)
# SAT score node
sat_cpd = TabularCPD(
    variable='S',
    variable_card=2,
    values=[[0.95, 0.2],
            [0.05, 0.8]],
    evidence=['I'],
    evidence_card=[2]
)
student_model.add_cpds(
    grade_cpd,
    difficulty_cpd,
    intel_cpd,
    letter_cpd,
    sat_cpd
)
print(student_model.get_cpds())
print('Active trail nodes of D:', student_model.active_trail_nodes('D'))
print('Active trail nodes of I:', student_model.active_trail_nodes('I'))
print(student_model.local_independencies('G'))
# print(student_model.get_independencies())
# print(student_model.to_markov_model())

# Bayesian inference
from pgmpy.inference import VariableElimination
student_infer = VariableElimination(student_model)
prob_G = student_infer.query(variables=['G'])
print('Marginal distribution of the grade, prob_G:', prob_G)
prob_G = student_infer.query(
    variables=['G'],
    evidence={'I': 1, 'D': 0})
print('Grade distribution for an intelligent student (I=1, D=0), prob_G:', prob_G)
# prob_G = student_infer.query(
#     variables=['G'],
#     evidence={'I': 0, 'D': 1})
# print(prob_G)

# # Generate data
# import numpy as np
# import pandas as pd
#
# raw_data = np.random.randint(low=0, high=2, size=(1000, 5))
# data = pd.DataFrame(raw_data, columns=['D', 'I', 'G', 'L', 'S'])
# data.head()
#
#
# # Define the model
# from pgmpy.models import BayesianModel
# from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator
#
# model = BayesianModel([('D', 'G'), ('I', 'G'), ('I', 'S'), ('G', 'L')])
#
# # Train the model with maximum likelihood estimation
# model.fit(data, estimator=MaximumLikelihoodEstimator)
# for cpd in model.get_cpds():
#     # Print the conditional probability distribution
#     print("CPD of {variable}:".format(variable=cpd.variable))
#     print(cpd)
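The marginal that `VariableElimination` returns for G can be checked by brute-force enumeration, since G depends only on I and D. The sketch below uses NumPy only and copies the tables from the code above; it assumes the CPD columns are ordered with the first evidence variable (I) varying slowest, which is how the table above is laid out.

```python
import numpy as np

p_i = np.array([0.7, 0.3])   # P(I)
p_d = np.array([0.6, 0.4])   # P(D)
# P(G | I, D); columns ordered (I=0,D=0), (I=0,D=1), (I=1,D=0), (I=1,D=1)
p_g_given_id = np.array([[0.3, 0.05, 0.9, 0.5],
                         [0.4, 0.25, 0.08, 0.3],
                         [0.3, 0.7, 0.02, 0.2]])

# I and D have no parents, so P(I, D) = P(I) P(D); flatten in the same column order.
p_id = np.outer(p_i, p_d).ravel()

# Marginal: P(G) = sum over (I, D) of P(G | I, D) P(I, D)
p_g = p_g_given_id @ p_id
print('P(G) =', p_g)

# Conditioning on I=1, D=0 simply selects the matching column.
p_g_given_i1_d0 = p_g_given_id[:, 2]
print('P(G | I=1, D=0) =', p_g_given_i1_d0)
```

Enumeration is only practical for tiny networks like this one; variable elimination exists to avoid summing over every joint configuration.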
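II. Machine learning
Detailed notes on kNN: https://blog.youkuaiyun.com/fanzonghao/article/details/86411102
Detailed notes on decision trees: https://blog.youkuaiyun.com/fanzonghao/article/details/85246720
1. SVM: finding the maximum margin
Optimum under equality constraints
Optimum under inequality constraints: use the KKT conditions
The resulting classifier:
f(x) = sign( sum_i alpha_i * y_i * K(x_i, x) + b )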
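In other words, the larger C (the penalty coefficient applied to the slack variables) is, the more the model has high variance and low bias and tends to overfit;
the smaller C is, the more the model has low variance and high bias and tends to underfit.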
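One way to see the effect of C is to count support vectors on overlapping data: a small C tolerates margin violations, so the margin is wide and many points end up as support vectors, while a large C penalises violations heavily and narrows the margin. The sketch below is illustrative only; the synthetic blobs, the two C values and the variable names are assumptions made for this example.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping synthetic clusters (illustrative data).
X, y = make_blobs(n_samples=200, centers=[[0, 0], [3, 3]],
                  cluster_std=1.5, random_state=0)

n_support = {}
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # n_support_ holds the number of support vectors per class.
    n_support[C] = int(clf.n_support_.sum())
    print("C={}: {} support vectors, train accuracy {:.3f}".format(
        C, n_support[C], clf.score(X, y)))
```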
SVM example using the SMO algorithm:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt


def create_data():
    iris = load_iris()
    df = pd.DataFrame(iris.data, columns=iris.feature_names)
    df['label'] = iris.target
    df.columns = [
        'sepal length', 'sepal width', 'petal length', 'petal width', 'label'
    ]
    # First 100 rows (two classes), first two features plus the label
    data = np.array(df.iloc[:100, [0, 1, -1]])
    # Relabel class 0 as -1 so the labels are {-1, +1}
    for i in range(len(data)):
        if data[i, -1] == 0:
            data[i, -1] = -1
    return data[:, :2], data[:, -1]


X, y = create_data()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print('==X_train.shape:', X_train.shape)
print('==y_train.shape:', y_train.shape)
# Lowercase color codes: uppercase single-letter colors are rejected by recent matplotlib
plt.scatter(X[:50, 0], X[:50, 1], label='0', color='r')
plt.scatter(X[50:, 0], X[50:, 1], label='1', color='g')
plt.legend()
# plt.show()


# For a linear kernel, w = sum_i alpha_i * y_i * x_i
class SVM:
    def __init__(self, max_iter=100, kernel='linear'):
        self.max_iter = max_iter
        self._kernel = kernel

    def init_args(self, features, labels):
        self.m, self.n = features.shape  # m: number of samples, n: feature dimension
        self.X = features
        self.Y = labels
        self.b = 0.0
        self.alpha = np.ones(self.m)
        # Cache every E_i in a list
        self.E = [self._E(i) for i in range(self.m)]
        # Penalty parameter C (upper bound on every alpha)
        self.C = 1.0

    def _KKT(self, i):
        y_g = self._g(i) * self.Y[i]
        if self.alpha[i] == 0:
            return y_g >= 1
        elif 0 < self.alpha[i] < self.C:
            return y_g == 1
        else:
            return y_g <= 1

    # g(x): prediction for the training sample X[i]
    def _g(self, i):
        r = self.b
        for j in range(self.m):
            r += self.alpha[j] * self.Y[j] * self.kernel(self.X[i], self.X[j])
        return r

    # E(x): difference between the prediction g(x) and the label y
    def _E(self, i):
        return self._g(i) - self.Y[i]

    # Kernel function
    def kernel(self, x1, x2):
        if self._kernel == 'linear':
            return sum([x1[k] * x2[k] for k in range(self.n)])
        elif self._kernel == 'poly':
            return (sum([x1[k] * x2[k] for k in range(self.n)]) + 1)**2
        return 0

    def _init_alpha(self):
        # The outer loop first visits samples with 0 < alpha < C and checks the KKT conditions
        index_list = [i for i in range(self.m) if 0 < self.alpha[i] < self.C]
        # then falls back to the rest of the training set
        non_satisfy_list = [i for i in range(self.m) if i not in index_list]
        index_list.extend(non_satisfy_list)
        for i in index_list:
            if self._KKT(i):
                continue
            E1 = self.E[i]
            # If E1 is non-negative pick the smallest E2; if E1 is negative pick the largest
            if E1 >= 0:
                j = min(range(self.m), key=lambda x: self.E[x])
            else:
                j = max(range(self.m), key=lambda x: self.E[x])
            return i, j
        # Every sample satisfies the KKT conditions
        return None

    def _compare(self, _alpha, L, H):
        if _alpha > H:
            return H
        elif _alpha < L:
            return L
        else:
            return _alpha

    def fit(self, features, labels):
        self.init_args(features, labels)
        for t in range(self.max_iter):
            # Pick the pair of multipliers to optimise
            pair = self._init_alpha()
            if pair is None:
                # No KKT violation left: stop instead of failing on tuple unpacking
                break
            i1, i2 = pair
            # Clipping bounds
            if self.Y[i1] == self.Y[i2]:
                L = max(0, self.alpha[i1] + self.alpha[i2] - self.C)
                H = min(self.C, self.alpha[i1] + self.alpha[i2])
            else:
                L = max(0, self.alpha[i2] - self.alpha[i1])
                H = min(self.C, self.C + self.alpha[i2] - self.alpha[i1])
            E1 = self.E[i1]
            E2 = self.E[i2]
            # eta = K11 + K22 - 2 * K12
            eta = self.kernel(self.X[i1], self.X[i1]) + self.kernel(
                self.X[i2],
                self.X[i2]) - 2 * self.kernel(self.X[i1], self.X[i2])
            if eta <= 0:
                continue
            # Modified here: per the book (pp. 130-131) this should be E1 - E2
            alpha2_new_unc = self.alpha[i2] + self.Y[i2] * (E1 - E2) / eta
            alpha2_new = self._compare(alpha2_new_unc, L, H)
            alpha1_new = self.alpha[i1] + self.Y[i1] * self.Y[i2] * (
                self.alpha[i2] - alpha2_new)
            b1_new = -E1 - self.Y[i1] * self.kernel(self.X[i1], self.X[i1]) * (
                alpha1_new - self.alpha[i1]) - self.Y[i2] * self.kernel(
                    self.X[i2],
                    self.X[i1]) * (alpha2_new - self.alpha[i2]) + self.b
            b2_new = -E2 - self.Y[i1] * self.kernel(self.X[i1], self.X[i2]) * (
                alpha1_new - self.alpha[i1]) - self.Y[i2] * self.kernel(
                    self.X[i2],
                    self.X[i2]) * (alpha2_new - self.alpha[i2]) + self.b
            if 0 < alpha1_new < self.C:
                b_new = b1_new
            elif 0 < alpha2_new < self.C:
                b_new = b2_new
            else:
                # Take the midpoint
                b_new = (b1_new + b2_new) / 2
            # Update the parameters
            self.alpha[i1] = alpha1_new
            self.alpha[i2] = alpha2_new
            self.b = b_new
            self.E[i1] = self._E(i1)
            self.E[i2] = self._E(i2)
        return 'train done!'

    def predict(self, data):
        r = self.b
        for i in range(self.m):
            r += self.alpha[i] * self.Y[i] * self.kernel(data, self.X[i])
        return 1 if r > 0 else -1

    def score(self, X_test, y_test):
        right_count = 0
        for i in range(len(X_test)):
            result = self.predict(X_test[i])
            if result == y_test[i]:
                right_count += 1
        return right_count / len(X_test)

    # def _weight(self):
    #     # linear model
    #     yx = self.Y.reshape(-1, 1) * self.X
    #     self.w = np.dot(yx.T, self.alpha)
    #     return self.w


svm = SVM(max_iter=200)
svm.fit(X_train, y_train)
score = svm.score(X_test, y_test)
print('===score:', score)
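For a linear kernel the multipliers can be folded back into an explicit weight vector, w = sum_i alpha_i * y_i * x_i, which is what the commented-out `_weight` method computes. The sketch below checks this on a two-point toy problem whose dual solution can be worked out by hand (alpha = 0.25 for both points); the data and variable names are made up for this illustration.

```python
import numpy as np

# Toy problem: one positive and one negative point, both support vectors.
X_sv = np.array([[1.0, 1.0], [-1.0, -1.0]])
y_sv = np.array([1.0, -1.0])
# Dual solution for this toy problem: maximising 2a - 4a^2 gives a = 0.25.
alpha = np.array([0.25, 0.25])

# w = sum_i alpha_i * y_i * x_i
w = (alpha * y_sv) @ X_sv
# For a support vector on the margin, y_i * (w . x_i + b) = 1, so b = y_i - w . x_i.
b = y_sv[0] - w @ X_sv[0]

margins = y_sv * (X_sv @ w + b)
print('w =', w, 'b =', b, 'margins =', margins)
```

Both points sit exactly on the margin (functional margin 1), as expected for support vectors of a hard-margin solution.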
SVM example: classifying the fruit dataset with scikit-learn:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
import matplotlib.patches as mpatches
from matplotlib.colors import ListedColormap


def plot_class_regions_for_classifier(clf, X, y, X_test=None, y_test=None, title=None,
                                      target_names=None, plot_decision_regions=True):
    """
    Visualise the classification result of a classifier.
    Works only for data with two features.
    """
    num_classes = np.amax(y) + 1
    color_list_light = ['#FFFFAA', '#EFEFEF', '#AAFFAA', '#AAAAFF']
    color_list_bold = ['#EEEE00', '#000000', '#00CC00', '#0000CC']
    cmap_light = ListedColormap(color_list_light[0:num_classes])
    cmap_bold = ListedColormap(color_list_bold[0:num_classes])
    h = 0.03  # grid step
    k = 0.5   # padding around the data when building the grid
    x_plot_adjust = 0.1
    y_plot_adjust = 0.1
    plot_symbol_size = 50
    x_min = X[:, 0].min()
    x_max = X[:, 0].max()
    y_min = X[:, 1].min()
    y_max = X[:, 1].max()
    x2, y2 = np.meshgrid(np.arange(x_min-k, x_max+k, h), np.arange(y_min-k, y_max+k, h))
    # Predict the class of every grid point to draw the decision regions
    P = clf.predict(np.c_[x2.ravel(), y2.ravel()])
    P = P.reshape(x2.shape)
    plt.figure()
    if plot_decision_regions:
        plt.contourf(x2, y2, P, cmap=cmap_light, alpha=0.8)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, s=plot_symbol_size, edgecolor='black')
    plt.xlim(x_min - x_plot_adjust, x_max + x_plot_adjust)
    plt.ylim(y_min - y_plot_adjust, y_max + y_plot_adjust)
    if X_test is not None:
        plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cmap_bold, s=plot_symbol_size,
                    marker='^', edgecolor='black')
        train_score = clf.score(X, y)
        test_score = clf.score(X_test, y_test)
        # Guard against title=None before appending the scores
        title = (title or '') + "\nTrain score = {:.2f}, Test score = {:.2f}".format(train_score, test_score)
    if target_names is not None:
        legend_handles = []
        for i in range(0, len(target_names)):
            patch = mpatches.Patch(color=color_list_bold[i], label=target_names[i])
            legend_handles.append(patch)
        plt.legend(loc=0, handles=legend_handles)
    if title is not None:
        plt.title(title)
    plt.show()


# Load the dataset
fruits_df = pd.read_table('fruit_data_with_colors.txt')
X = fruits_df[['width', 'height']]
y = fruits_df['fruit_label'].copy()
# Set every label that is not apple to 0
y[y != 1] = 0
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/4, random_state=0)
print(y_test.shape)
# Different values of C
c_values = [0.0001, 1, 100]
for c_value in c_values:
    # Build the model
    svm_model = SVC(C=c_value, kernel='rbf')
    # Train the model
    svm_model.fit(X_train, y_train)
    # Evaluate the model
    y_pred = svm_model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    print('C={}, accuracy: {:.3f}'.format(c_value, acc))
    # Visualise
    plot_class_regions_for_classifier(svm_model, X_test.values, y_test.values, title='C={}'.format(c_value))
Two-dimensional Gaussian distribution
Replacing the kernel with 'linear'
2. Ensemble learning
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler


def load_data():
    # Load the dataset
    fruits_df = pd.read_table('fruit_data_with_colors.txt')
    print('Number of samples:', len(fruits_df))
    # Dictionary mapping target labels to fruit names
    fruit_name_dict = dict(zip(fruits_df['fruit_label'], fruits_df['fruit_name']))
    # Split the dataset
    X = fruits_df[['mass', 'width', 'height', 'color_score']]
    y = fruits_df['fruit_label']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/4, random_state=0)
    print('Total samples: {}, training samples: {}, test samples: {}'.format(len(X), len(X_train), len(X_test)))
    return X_train, X_test, y_train, y_test


# Feature normalisation
def minmax_scaler(X_train, X_test):
    scaler = MinMaxScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    # The scaler has now learned the min and max from the training set,
    # so the test set only needs transform
    X_test_scaled = scaler.transform(X_test)
    for i in range(4):
        print('Before scaling, training feature {}: max {:.3f}, min {:.3f}'.format(i + 1,
              X_train.iloc[:, i].max(),
              X_train.iloc[:, i].min()))
        print('After scaling, training feature {}: max {:.3f}, min {:.3f}'.format(i + 1,
              X_train_scaled[:, i].max(),
              X_train_scaled[:, i].min()))
    return X_train_scaled, X_test_scaled


def stack(X_train_scaled, y_train, X_test_scaled, y_test):
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from mlxtend.classifier import StackingClassifier
    clf1 = KNeighborsClassifier(n_neighbors=1)
    clf2 = SVC(kernel='linear')
    clf3 = DecisionTreeClassifier()
    lr = LogisticRegression(C=100)
    # The base classifiers feed a logistic-regression meta-classifier
    sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                              meta_classifier=lr)
    clf1.fit(X_train_scaled, y_train)
    clf2.fit(X_train_scaled, y_train)
    clf3.fit(X_train_scaled, y_train)
    sclf.fit(X_train_scaled, y_train)
    print('kNN test accuracy: {:.3f}'.format(clf1.score(X_test_scaled, y_test)))
    print('SVM test accuracy: {:.3f}'.format(clf2.score(X_test_scaled, y_test)))
    print('DT test accuracy: {:.3f}'.format(clf3.score(X_test_scaled, y_test)))
    print('Stacking test accuracy: {:.3f}'.format(sclf.score(X_test_scaled, y_test)))


if __name__ == '__main__':
    X_train, X_test, y_train, y_test = load_data()
    X_train_scaled, X_test_scaled = minmax_scaler(X_train, X_test)
    stack(X_train_scaled, y_train, X_test_scaled, y_test)
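2.1 Boosting
- Boosting starts from a base learner and trains repeatedly to obtain a sequence of base learners, then combines them into a strong learner.
- Boosting is a sequential strategy: the base learners depend on one another, and each new learner is built from the previous ones.
- Representative algorithms/models:
  - AdaBoost
  - Boosting trees
  - Gradient boosting decision trees (GBDT)
2.1.1 AdaBoost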
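A minimal sketch of one AdaBoost round, assuming the standard binary formulation with labels in {-1, +1}: compute the weighted error of the weak learner, derive its vote weight alpha = 0.5 * ln((1 - err) / err), then raise the weights of the misclassified samples. The labels and predictions are made up for illustration; a real run would obtain `y_pred` from a fitted weak learner.

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, 1, -1, -1, -1])   # the weak learner gets the last sample wrong

# Start from uniform sample weights.
w = np.full(len(y_true), 1 / len(y_true))

# Weighted error of the weak learner.
err = np.sum(w[y_pred != y_true])
# Vote weight of this learner in the final ensemble.
alpha = 0.5 * np.log((1 - err) / err)

# Misclassified samples (y_true * y_pred = -1) are scaled up, correct ones down.
w_new = w * np.exp(-alpha * y_true * y_pred)
w_new /= w_new.sum()

print('err =', err, 'alpha =', alpha)
print('updated weights:', w_new)
```

After the update the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.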
2.1.2 GBDT
def gbdt(X_train_scaled, y_train, X_test_scaled, y_test):
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import GridSearchCV
    # Grid-search the learning rate with 3-fold cross-validation
    parameters = {'learning_rate': [0.001, 0.01, 0.1, 1, 10, 100]}
    clf = GridSearchCV(GradientBoostingClassifier(), parameters, cv=3, scoring='accuracy')
    clf.fit(X_train_scaled, y_train)
    print('Best parameters:', clf.best_params_)
    print('Best validation score:', clf.best_score_)
    print('Test accuracy: {:.3f}'.format(clf.score(X_test_scaled, y_test)))
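To show what gradient boosting does internally, here is a minimal sketch for the simplest case, regression with squared loss, where the negative gradient is just the residual. It is not the classifier used above: the synthetic data, the tree depth, the learning rate and the number of rounds are all assumptions made for this illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic 1-D regression data (illustrative).
X = np.linspace(0, 6, 200).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
# Start from a constant prediction: the mean of the targets.
pred = np.full_like(y, y.mean())
mse_history = [np.mean((y - pred) ** 2)]

for _ in range(50):
    # For squared loss, the negative gradient is the residual.
    residual = y - pred
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    # Each new tree corrects a fraction of the error left by the previous ones.
    pred += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - pred) ** 2))

print('initial MSE: {:.4f}, final MSE: {:.4f}'.format(mse_history[0], mse_history[-1]))
```

The sequential dependence is visible in the loop: every tree is fitted to what the earlier trees got wrong, which is why boosting cannot be trained in parallel.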
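2.2 Bagging
- Bagging is a parallel strategy: the base learners do not depend on one another and can be built at the same time.
- Representative algorithms/models:
  - Random forests
  - The Dropout strategy in neural networks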
import warnings
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier

warnings.filterwarnings('ignore')
X, y = make_circles(n_samples=300, noise=0.15, factor=0.5, random_state=233)
plt.scatter(X[y == 0, 0], X[y == 0, 1])
plt.scatter(X[y == 1, 0], X[y == 1, 1])
# plt.show()
X_train, X_test, y_train, y_test = train_test_split(X, y)
print('X_train.shape=', X_train.shape)
print('X_test.shape=', X_test.shape)
print(y_test)

print('===========knn==============')
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)
print('knn accuracy={}'.format(knn_clf.score(X_test, y_test)))
print('\n')

print('===========logistic regression==============')
log_clf = LogisticRegression()
log_clf.fit(X_train, y_train)
print('logistic regression accuracy={}'.format(log_clf.score(X_test, y_test)))
print('\n')

print('===========SVM==============')
svm_clf = SVC()
svm_clf.fit(X_train, y_train)
print('SVM accuracy={}'.format(svm_clf.score(X_test, y_test)))
print('\n')

print('===========Decision tree==============')
dt_clf = DecisionTreeClassifier()
dt_clf.fit(X_train, y_train)
print('Decision tree accuracy={}'.format(dt_clf.score(X_test, y_test)))
print('\n')

print('===========ensemble classifier==============')
voting_clf = VotingClassifier(estimators=[('knn', KNeighborsClassifier()),
                                          ('logistic', LogisticRegression()),
                                          ('SVM', SVC()),
                                          ('decision tree', DecisionTreeClassifier())],
                              voting='hard')  # strict majority vote
voting_clf.fit(X_train, y_train)
print('voting classifier accuracy={}'.format(voting_clf.score(X_test, y_test)))
print('\n')

print('===========random forest==============')
rf_clf = RandomForestClassifier(n_estimators=500,  # 500 trees
                                max_depth=6,  # maximum depth of each tree
                                bootstrap=True,  # sample with replacement
                                oob_score=True,  # validate on the samples that were not drawn
                                )
rf_clf.fit(X, y)  # oob_score=True, so fit on the whole dataset directly
print('rf accuracy={}'.format(rf_clf.oob_score_))
print('\n')

print('===========extreme random tree==============')
ex_clf = ExtraTreesClassifier(n_estimators=500,
                              max_depth=6,
                              bootstrap=True,
                              oob_score=True)
ex_clf.fit(X, y)
print('extreme random tree accuracy={}'.format(ex_clf.oob_score_))
print('\n')

print('===========Adaboost classifier==============')
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(),
                             n_estimators=500,
                             learning_rate=0.3)
ada_clf.fit(X_train, y_train)
print('Adaboost accuracy={}'.format(ada_clf.score(X_test, y_test)))
print('\n')
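One strength of the random forest algorithm is that it exploits randomness to make the model more robust. If the forest has N trees, N training sets are drawn at random from the original data, each tree is trained on its own set, and the forest's prediction is obtained by aggregating the predictions of the individual trees.
Because decision trees are the building blocks of a random forest, many of its hyperparameters are the same as those of a decision tree. Two further hyperparameters deserve attention. One is bootstrap (True or False), which controls whether each tree's training set is drawn with replacement. The other is oob_score: when sampling with replacement, each tree's sample leaves out roughly 37% (about 1/e) of the data, so with oob_score set to True there is no need to split the data into a training set and a test set; the samples a tree never saw are used to estimate the model's accuracy.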
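The out-of-bag fraction can be checked with a quick simulation: draw n indices with replacement and count how many of the n original samples never appear. For large n the expected fraction is (1 - 1/n)^n, which approaches 1/e, about 0.368. The sample size and seed below are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# One bootstrap sample: n indices drawn with replacement from range(n).
sample_idx = rng.integers(0, n, size=n)

# Samples that were never drawn are "out of bag" for this tree.
oob_fraction = 1 - len(np.unique(sample_idx)) / n
print('simulated OOB fraction: {:.4f}'.format(oob_fraction))
print('theoretical limit 1/e:  {:.4f}'.format(np.exp(-1)))
```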
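In the run above, Extremely Randomized Trees gave the highest accuracy. Like a random forest, it considers only a random subset of the features at each split (that is, part of the features rather than all of them); in addition, it draws the split thresholds at random instead of searching for the best one. In other words, for a feature set X, a random forest randomizes the rows (bootstrap samples) and the candidate columns at each split, while Extremely Randomized Trees also randomizes the split points themselves. Note that scikit-learn's ExtraTreesClassifier does not bootstrap the rows by default; the code above enables bootstrap so that the OOB score is available.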
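The relationship between Boosting/Bagging and bias/variance
- Put simply, Boosting improves weak classifiers by reducing bias, while Bagging reduces variance;
- Boosting:
  - The basic idea of Boosting is to keep reducing the model's training error (by fitting the residuals or by increasing the weights of misclassified samples), which strengthens the model's learning capacity and thereby reduces bias;
  - Boosting does not reduce variance noticeably, because the base learners are strongly correlated during training and lack independence.
- Bagging: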
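  - Averaging the predictions of n independent, uncorrelated models reduces the variance to 1/n of the original;
  - Assuming the base classifiers make errors independently (each with probability below 0.5), the probability that more than half of them are wrong falls as the number of base classifiers grows.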
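- Diagram of the relationship between generalization error, bias, variance, overfitting, underfitting, and model complexity (model capacity):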
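Both Bagging claims above can be checked numerically. The sketch below simulates n independent predictors with equal variance to confirm that their average has about 1/n of that variance, and uses the binomial distribution to compute the probability that a majority of independent classifiers is wrong. The independence assumption is the idealised case from the text; real bagged models are correlated, so the gain is smaller in practice. The function name `majority_error` and all numbers are illustrative.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
n_models, sigma = 25, 1.0

# Each column is one independent model's prediction noise.
preds = rng.normal(0.0, sigma, size=(100_000, n_models))
var_single = preds[:, 0].var()
var_mean = preds.mean(axis=1).var()
print('single model variance: {:.4f}'.format(var_single))
print('variance of the average of {} models: {:.4f}'.format(n_models, var_mean))

def majority_error(n, p):
    # P(more than half of n independent classifiers are wrong),
    # where each one is wrong with probability p.
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 3, 11, 51):
    print('n={:2d}: majority error = {:.4f}'.format(n, majority_error(n, 0.3)))
```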