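Section 4: The Naive Bayes Method
The naive Bayes (NB) method is a classification method based on Bayes' theorem and the assumption that features are conditionally independent given the class. Given a training set, it first learns the joint probability distribution of inputs and outputs under the conditional independence assumption; then, for a given input $x$, it uses Bayes' theorem to find the output $y$ with the largest posterior probability.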
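Common naive Bayes variants include:
- Gaussian naive Bayes: suited to continuous features assumed to be normally distributed within each class
- Bernoulli naive Bayes: suited to binary (0/1) features, each modeled by a Bernoulli distribution
- Multinomial naive Bayes: suited to discrete count features, such as word counts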
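Advantages and disadvantages of naive Bayes:
- Advantages: learning and prediction are efficient and easy to implement; it remains effective with little training data and can handle classification problems, including multi-class ones
- Disadvantages: classification accuracy is not always high; the feature independence assumption makes naive Bayes simple, but it can cost some classification accuracy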
I. Learning and Classification with Naive Bayes
1. Basic Method
Let the input space $\mathcal{X} \subseteq \mathbf{R}^{n}$ be a set of $n$-dimensional vectors, and let the output space be the set of class labels $\mathcal{Y}=\left\{c_{1}, c_{2}, \cdots, c_{K}\right\}$. The input is a feature vector $x \in \mathcal{X}$ and the output is a class label $y \in \mathcal{Y}$. $X$ is a random vector defined on the input space $\mathcal{X}$, $Y$ is a random variable defined on the output space $\mathcal{Y}$, and $P(X, Y)$ is the joint probability distribution of $X$ and $Y$. The training set
$$T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}$$
is generated independently and identically distributed (i.i.d.) from $P(X, Y)$.
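Naive Bayes learns the joint probability distribution $P(X, Y)$ from the training set. Specifically, it learns the following prior and conditional probability distributions. The prior probability distribution:
$$P\left(Y=c_{k}\right), \quad k=1,2, \cdots, K$$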
The conditional probability distribution:
$$P\left(X=x | Y=c_{k}\right)=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} | Y=c_{k}\right), \quad k=1,2, \cdots, K$$
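Together these give the joint probability distribution, since $P\left(X=x, Y=c_{k}\right)=P\left(X=x | Y=c_{k}\right) P\left(Y=c_{k}\right)$.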
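The conditional probability distribution $P\left(X=x | Y=c_{k}\right)$ has an exponential number of parameters, so estimating it directly is infeasible in practice. Indeed, if $x^{(j)}$ can take $S_{j}$ values, $j=1,2, \cdots, n$, and $Y$ can take $K$ values, then the number of parameters is $K \prod_{j=1}^{n} S_{j}$.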
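To get a feel for how quickly this count grows, the sketch below compares $K \prod_{j} S_{j}$ with $K \sum_{j} S_{j}$, the number of conditional probabilities that remain once each feature is modeled separately per class. The function names and the example sizes are hypothetical, chosen only for illustration.

```python
from math import prod

def full_joint_param_count(num_classes, values_per_feature):
    """Number of probabilities P(X=x | Y=c_k) with no independence assumption."""
    return num_classes * prod(values_per_feature)

def naive_param_count(num_classes, values_per_feature):
    """Number of probabilities P(X^(j)=a_jl | Y=c_k) when each feature is modeled on its own."""
    return num_classes * sum(values_per_feature)

# Hypothetical setting: 2 classes, 20 features with 3 possible values each.
K = 2
S = [3] * 20
print(full_joint_param_count(K, S))  # 2 * 3**20
print(naive_param_count(K, S))       # 2 * 60
```

The first count grows exponentially in the number of features, the second only linearly.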
Naive Bayes makes a conditional independence assumption about the conditional probability distribution. Because this is a rather strong assumption, the method is called "naive". Specifically, the conditional independence assumption is:
$$\begin{aligned} P\left(X=x | Y=c_{k}\right) &=P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} | Y=c_{k}\right) \\ &=\prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right) \end{aligned}$$
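Naive Bayes in effect learns the mechanism that generates the data, so it is a generative model. The conditional independence assumption says that the features used for classification are independent of one another once the class is fixed. This assumption makes naive Bayes simple, but it sometimes costs some classification accuracy.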
To classify a given input $x$, naive Bayes uses the learned model to compute the posterior distribution $P\left(Y=c_{k} | X=x\right)$ and outputs the class with the largest posterior probability as the class of $x$. The posterior is computed with Bayes' theorem:
$$P\left(Y=c_{k} | X=x\right)=\frac{P\left(X=x | Y=c_{k}\right) P\left(Y=c_{k}\right)}{\sum_{k} P\left(X=x | Y=c_{k}\right) P\left(Y=c_{k}\right)}$$
Substituting the conditional independence assumption into Bayes' theorem gives:
$$P\left(Y=c_{k} | X=x\right)=\frac{P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)}, \quad k=1,2, \cdots, K$$
This is the basic formula of naive Bayes classification. The naive Bayes classifier can therefore be written as:
$$y=f(x)=\arg \max _{c_{k}} \frac{P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)}$$
Note that the denominator in this expression is the same for every $c_{k}$, so:
$$y=\arg \max _{c_{k}} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)$$
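A small numeric check makes this last step concrete: dividing by the denominator rescales every class score by the same factor, so the winning class does not change. The priors and likelihood values below are made-up numbers, not taken from any dataset.

```python
# Hypothetical priors P(Y=c_k) and likelihoods prod_j P(X^(j)=x^(j) | Y=c_k) for one input x.
priors = {"c1": 0.5, "c2": 0.3, "c3": 0.2}
likelihoods = {"c1": 0.02, "c2": 0.05, "c3": 0.01}

# Numerator of Bayes' theorem for each class.
scores = {c: priors[c] * likelihoods[c] for c in priors}

# The denominator is the same for every class: the sum of all numerators.
evidence = sum(scores.values())
posteriors = {c: s / evidence for c, s in scores.items()}

best_by_score = max(scores, key=scores.get)
best_by_posterior = max(posteriors, key=posteriors.get)
print(best_by_score, best_by_posterior)
```

Both selections agree, which is why the classifier only needs the unnormalized numerators.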
2. What Posterior Maximization Means
Naive Bayes assigns an instance to the class with the largest posterior probability, which is equivalent to minimizing the expected risk. Suppose the 0-1 loss function is chosen:
$$L(Y, f(X))=\left\{\begin{array}{ll}{1,} & {Y \neq f(X)} \\ {0,} & {Y=f(X)}\end{array}\right.$$
where $f(X)$ is the classification decision function. The expected risk function is then:
$$R_{\mathrm{exp}}(f)=E[L(Y, f(X))]$$
The expectation is taken with respect to the joint distribution $P(X, Y)$. Conditioning on $X$ gives:
$$R_{\mathrm{exp}}(f)=E_{X} \sum_{k=1}^{K}\left[L\left(c_{k}, f(X)\right)\right] P\left(c_{k} | X\right)$$
To minimize the expected risk, it suffices to minimize it pointwise for each $X=x$, which gives:
$$\begin{aligned} f(x) &=\arg \min _{y \in \mathcal{Y}} \sum_{k=1}^{K} L\left(c_{k}, y\right) P\left(c_{k} | X=x\right) \\ &=\arg \min _{y \in \mathcal{Y}} \sum_{k=1}^{K} P\left(y \neq c_{k} | X=x\right) \\ &=\arg \min _{y \in \mathcal{Y}}\left(1-P\left(y=c_{k} | X=x\right)\right) \\ &=\arg \max _{y \in \mathcal{Y}} P\left(y=c_{k} | X=x\right) \end{aligned}$$
In this way, the expected risk minimization criterion yields the posterior maximization criterion:
$$f(x)=\arg \max _{c_{k}} P\left(c_{k} | X=x\right)$$
This is the principle that naive Bayes adopts.
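The equivalence can be checked numerically for a single input: under 0-1 loss, the conditional risk of predicting class $y$ is $1-P(y | X=x)$, so the class with the smallest risk is the class with the largest posterior. The posterior values and function names below are hypothetical.

```python
# Hypothetical posterior P(c_k | X=x) for one input x.
posterior = {"c1": 0.2, "c2": 0.7, "c3": 0.1}

def zero_one_loss(true_label, predicted_label):
    """0-1 loss: 1 for a wrong prediction, 0 for a correct one."""
    return 0 if true_label == predicted_label else 1

def conditional_risk(predicted_label):
    """Expected 0-1 loss of predicting `predicted_label`, averaged over the posterior."""
    return sum(zero_one_loss(c, predicted_label) * p for c, p in posterior.items())

risks = {c: conditional_risk(c) for c in posterior}
min_risk_label = min(risks, key=risks.get)
max_posterior_label = max(posterior, key=posterior.get)
print(risks)
print(min_risk_label, max_posterior_label)
```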
II. Parameter Estimation for Naive Bayes
1. Maximum Likelihood Estimation
In naive Bayes, learning means estimating $P\left(Y=c_{k}\right)$ and $P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)$. Maximum likelihood estimation can be used for both. The maximum likelihood estimate of the prior probability $P\left(Y=c_{k}\right)$ is:
$$P\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}{N}, \quad k=1,2, \cdots, K$$
Let the set of possible values of the $j$-th feature $x^{(j)}$ be $\left\{a_{j 1}, a_{j 2}, \cdots, a_{j S_{j}}\right\}$. The maximum likelihood estimate of the conditional probability $P\left(X^{(j)}=a_{j l} | Y=c_{k}\right)$ is:
$$P\left(X^{(j)}=a_{j l} | Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}$$
$$j=1,2, \cdots, n ; \quad l=1,2, \cdots, S_{j} ; \quad k=1,2, \cdots, K$$
Here $x_{i}^{(j)}$ is the $j$-th feature of the $i$-th sample, $a_{j l}$ is the $l$-th possible value of the $j$-th feature, and $I$ is the indicator function.
2. Learning and Classification Algorithm
Input: training data $T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}$, where $x_{i}=\left(x_{i}^{(1)}, x_{i}^{(2)}, \cdots, x_{i}^{(n)}\right)^{\mathrm{T}}$, $x_{i}^{(j)}$ is the $j$-th feature of the $i$-th sample, $x_{i}^{(j)} \in\left\{a_{j 1}, a_{j 2}, \cdots, a_{j S_{j}}\right\}$, $a_{j l}$ is the $l$-th possible value of the $j$-th feature, $j=1,2, \cdots, n$, $l=1,2, \cdots, S_{j}$, $y_{i} \in\left\{c_{1}, c_{2}, \cdots, c_{K}\right\}$; and an instance $x$.
Output: the class of instance $x$.
- Compute the prior and conditional probabilities:
$$\begin{array}{l}{P\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}{N}, \quad k=1,2, \cdots, K} \\ {P\left(X^{(j)}=a_{j l} | Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}} \\ {j=1,2, \cdots, n ; \quad l=1,2, \cdots, S_{j} ; \quad k=1,2, \cdots, K}\end{array}$$
- For the given instance $x=\left(x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right)^{\mathrm{T}}$, compute:
$$P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right), \quad k=1,2, \cdots, K$$
- Determine the class of instance $x$:
$$y=\arg \max _{c_{k}} P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)$$
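The three steps can be sketched in plain Python for discrete features. This is a minimal illustration using the maximum likelihood estimates, not a reference implementation; the function names `fit_nb` and `predict_nb` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def fit_nb(X, y):
    """Step 1: estimate P(Y=c_k) and P(X^(j)=a_jl | Y=c_k) by counting."""
    n_samples = len(y)
    class_counts = Counter(y)
    prior = {c: class_counts[c] / n_samples for c in class_counts}
    # joint_counts[(j, value, c)] = number of samples with x^(j)=value and y=c
    joint_counts = defaultdict(int)
    for features, label in zip(X, y):
        for j, value in enumerate(features):
            joint_counts[(j, value, label)] += 1
    cond = {key: count / class_counts[key[2]] for key, count in joint_counts.items()}
    return prior, cond

def predict_nb(prior, cond, x):
    """Steps 2 and 3: score every class and return the one with the largest score."""
    scores = {}
    for c, p in prior.items():
        score = p
        for j, value in enumerate(x):
            # A (feature, value, class) combination never seen in training has estimate 0.
            score *= cond.get((j, value, c), 0.0)
        scores[c] = score
    return max(scores, key=scores.get), scores

# Hypothetical toy data: two discrete features, two classes.
X_toy = [("a", "u"), ("a", "v"), ("b", "u"), ("b", "v"), ("b", "v")]
y_toy = [0, 0, 1, 1, 1]
prior, cond = fit_nb(X_toy, y_toy)
label, scores = predict_nb(prior, cond, ("b", "v"))
print(label, scores)
```

In this toy run class 0 gets a score of exactly 0, because the value `b` never occurs together with class 0 in the training data.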
Example 1: Using the training data in the table below, learn a naive Bayes classifier and determine the class label $y$ of $x=(2, S)^{T}$. In the table, $X^{(1)}$ and $X^{(2)}$ are features with value sets $A_{1}=\{1,2,3\}$ and $A_{2}=\{S, M, L\}$, and $Y$ is the class label, $Y \in C=\{1,-1\}$.
Sample | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
$X^{(1)}$ | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 3 | 3 | 3 | 3 | 3 |
$X^{(2)}$ | S | M | M | S | S | S | M | M | L | L | L | M | M | L | L |
$Y$ | -1 | -1 | 1 | 1 | -1 | -1 | -1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | -1 |
from IPython.display import Image
Image(filename="./data/4_2.png",width=500)
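The example can also be worked through in code. The sketch below transcribes the table, applies the maximum likelihood estimates with exact fractions, and scores both classes for $x=(2, S)^{T}$. Variable and function names are illustrative.

```python
from fractions import Fraction

# Training data transcribed from the table above.
X1 = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
X2 = list("SMMSSSMMLLLMMLL")
Y = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]

def prior(c):
    """Maximum likelihood estimate of P(Y=c)."""
    return Fraction(Y.count(c), len(Y))

def conditional(feature, value, c):
    """Maximum likelihood estimate of P(X^(j)=value | Y=c)."""
    joint = sum(1 for f, y in zip(feature, Y) if f == value and y == c)
    return Fraction(joint, Y.count(c))

x = (2, "S")
scores = {c: prior(c) * conditional(X1, x[0], c) * conditional(X2, x[1], c) for c in (1, -1)}
prediction = max(scores, key=scores.get)
print(scores)      # {1: Fraction(1, 45), -1: Fraction(1, 15)}
print(prediction)  # -1
```

Since $\frac{1}{15}>\frac{1}{45}$, the predicted label is $y=-1$.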
3. Bayesian Estimation
With maximum likelihood estimation, an estimated probability can turn out to be 0. This distorts the posterior computation and biases the classification. The remedy is Bayesian estimation. Specifically, the Bayesian estimate of the conditional probability is:
$$P_{\lambda}\left(X^{(j)}=a_{j l} | Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)+\lambda}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)+S_{j} \lambda}$$
where $\lambda \geqslant 0$. This is equivalent to adding a positive number $\lambda>0$ to the frequency of each value of the random variable. When $\lambda=0$ it reduces to maximum likelihood estimation. A common choice is $\lambda=1$, which is called Laplace smoothing. Clearly, for $\lambda>0$ and any $l=1,2, \cdots, S_{j}$, $k=1,2, \cdots, K$:
$$\begin{array}{l}{P_{\lambda}\left(X^{(j)}=a_{j l} | Y=c_{k}\right)>0} \\ {\sum_{l=1}^{S_{j}} P_{\lambda}\left(X^{(j)}=a_{j l} | Y=c_{k}\right)=1}\end{array}$$
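The second property follows in one line. Write $N_{k}=\sum_{i} I\left(y_{i}=c_{k}\right)$ and $N_{j l k}=\sum_{i} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)$ (shorthand introduced here for brevity), and use the fact that $\sum_{l} N_{j l k}=N_{k}$:

```latex
\sum_{l=1}^{S_j} P_\lambda\left(X^{(j)}=a_{jl} | Y=c_k\right)
= \sum_{l=1}^{S_j} \frac{N_{jlk}+\lambda}{N_k+S_j\lambda}
= \frac{\sum_{l=1}^{S_j} N_{jlk} + S_j\lambda}{N_k+S_j\lambda}
= \frac{N_k+S_j\lambda}{N_k+S_j\lambda} = 1
```

So the smoothed estimates still form a valid probability distribution over the $S_{j}$ values.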
Similarly, the Bayesian estimate of the prior probability is:
$$P_{\lambda}\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)+\lambda}{N+K \lambda}$$
Example 2: For the problem in Example 1, estimate the probabilities with Laplace smoothing, that is, with $\lambda=1$.
Image(filename="./data/4_1.png",width=500)
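For reference, the arithmetic can be written out from the counts in the table of Example 1 with $\lambda=1$, $K=2$, and $S_{1}=S_{2}=3$. Only the probabilities needed for $x=(2, S)^{T}$ are shown.

```latex
\begin{aligned}
P_\lambda(Y=1) &= \frac{9+1}{15+2} = \frac{10}{17} \\
P_\lambda(Y=-1) &= \frac{6+1}{15+2} = \frac{7}{17} \\
P_\lambda(X^{(1)}=2 | Y=1) &= \frac{3+1}{9+3} = \frac{4}{12} \\
P_\lambda(X^{(2)}=S | Y=1) &= \frac{1+1}{9+3} = \frac{2}{12} \\
P_\lambda(X^{(1)}=2 | Y=-1) &= \frac{2+1}{6+3} = \frac{3}{9} \\
P_\lambda(X^{(2)}=S | Y=-1) &= \frac{3+1}{6+3} = \frac{4}{9} \\
P_\lambda(Y=1)\, P_\lambda(X^{(1)}=2 | Y=1)\, P_\lambda(X^{(2)}=S | Y=1)
  &= \frac{10}{17}\cdot\frac{4}{12}\cdot\frac{2}{12} = \frac{5}{153} \approx 0.0327 \\
P_\lambda(Y=-1)\, P_\lambda(X^{(1)}=2 | Y=-1)\, P_\lambda(X^{(2)}=S | Y=-1)
  &= \frac{7}{17}\cdot\frac{3}{9}\cdot\frac{4}{9} = \frac{28}{459} \approx 0.0610
\end{aligned}
```

The second product is larger, so the predicted label is again $y=-1$.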
III. Code Implementation
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from collections import Counter
import math
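def load_data():
    # Load iris and keep the first 100 rows (classes 0 and 1) as a binary problem
    iris=load_iris()
    df=pd.DataFrame(iris.data,columns=iris.feature_names)
    df["label"]=iris.target
    df.columns=["sepal length","sepal width","petal length","petal width","label"]
    data=np.array(df.iloc[:100,:])
    # Features are all columns except the last; the label is the last column
    return data[:,:-1],data[:,-1]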
X,y=load_data()
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3)
X_test[0],y_test[0]
(array([4.5, 2.3, 1.3, 0.3]), 0.0)
1. A Custom GaussianNB
The likelihood of each feature is assumed to be Gaussian within each class.
Probability density function:
$$P\left(x_{i} | y_{k}\right)=\frac{1}{\sqrt{2 \pi \sigma_{y k}^{2}}} \exp \left(-\frac{\left(x_{i}-\mu_{y k}\right)^{2}}{2 \sigma_{y k}^{2}}\right)$$
Mean (mathematical expectation): $\mu$; variance: $\sigma^{2}=\frac{\sum(X-\mu)^{2}}{N}$
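As a sanity check on the density formula, the sketch below implements it directly and compares the result with `statistics.NormalDist` from the Python standard library (available since Python 3.8). The function names and the sample values are illustrative.

```python
import math
from statistics import NormalDist

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x, following the formula above."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
    return coefficient * exponent

def population_variance(values):
    """Variance with N in the denominator, as in the definition above."""
    mu = sum(values) / len(values)
    return sum((v - mu) ** 2 for v in values) / len(values)

# Illustrative numbers only.
x, mu, sigma = 1.3, 1.5, 0.2
print(gaussian_pdf(x, mu, sigma))
print(NormalDist(mu, sigma).pdf(x))
```

The two printed values should agree up to floating-point rounding.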
class NaiveBayes(object):
    def __init__(self):
        # Maps each label to a list of (mean, stdev) pairs, one pair per feature
        self.model=None
    # Mean (mathematical expectation)
    @staticmethod
    def mean(X):
        return sum(X)/float(len(X))
    # Standard deviation (square root of the population variance)
    def stdev(self,X):
        avg=self.mean(X)
        return math.sqrt(sum([pow(x-avg,2) for x in X])/float(len(X)))
    # Gaussian probability density function
    def gaussian_probability(self,x,mean,stdev):
        exponent=math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
        # The normalizing constant is 1/(sqrt(2*pi)*stdev)
        return (1/(math.sqrt(2*math.pi)*stdev))*exponent
    # Per-feature (mean, stdev) over the samples of one class
    def summarize(self,train_data):
        summaries=[(self.mean(i),self.stdev(i)) for i in zip(*train_data)]
        return summaries
    # Compute the mean and standard deviation separately for each class
    def fit(self,X,y):
        labels=list(set(y))
        data={label:[] for label in labels}
        for f,label in zip(X,y):
            data[label].append(f)
        self.model={label:self.summarize(value) for label,value in data.items()}
        return "GaussianNB train done"
    # Product of the per-feature densities for each class
    # (class priors are not included, which amounts to assuming equal priors)
    def calculate_probabilities(self,input_data):
        probabilities={}
        for label,value in self.model.items():
            probabilities[label]=1
            for i in range(len(value)):
                mean,stdev=value[i]
                probabilities[label]*=self.gaussian_probability(input_data[i],mean,stdev)
        return probabilities
    # Predicted class: the label with the largest score
    def predict(self,X_test):
        label=sorted(self.calculate_probabilities(X_test).items(),key=lambda x:x[-1])[-1][0]
        return label
    # Accuracy on a test set
    def score(self,X_test,y_test):
        right=0
        for X,y in zip(X_test,y_test):
            label=self.predict(X)
            if label==y:
                right+=1
        return right/float(len(X_test))
model=NaiveBayes()
model.fit(X_train,y_train)
'GaussianNB train done'
print(model.predict([4.4,3.2,1.3,0.2]))
0.0
model.score(X_test,y_test)
1.0
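One practical caveat about multiplying densities: with many features the product can underflow to 0.0 in floating point. A common remedy, sketched below as a standalone function rather than a change to the class above, is to sum log densities instead; the argmax is unchanged because the logarithm is monotonic. The model layout mirrors `self.model` (a dict mapping each label to a list of `(mean, stdev)` pairs); the function names, the optional `log_prior` argument, and the toy numbers are assumptions made for illustration.

```python
import math

def gaussian_log_density(x, mean, stdev):
    """Log of the Gaussian density N(mean, stdev^2) at x."""
    return -0.5 * math.log(2.0 * math.pi * stdev ** 2) - (x - mean) ** 2 / (2.0 * stdev ** 2)

def predict_log_space(model, x, log_prior=None):
    """Return the label with the largest sum of log densities (plus log prior, if given)."""
    scores = {}
    for label, summaries in model.items():
        score = 0.0 if log_prior is None else log_prior[label]
        for value, (mean, stdev) in zip(x, summaries):
            score += gaussian_log_density(value, mean, stdev)
        scores[label] = score
    return max(scores, key=scores.get), scores

# Toy model with made-up (mean, stdev) pairs for two features and two classes.
toy_model = {0.0: [(5.0, 0.35), (1.5, 0.2)], 1.0: [(5.9, 0.5), (4.3, 0.45)]}
label, scores = predict_log_space(toy_model, [4.4, 1.3])
print(label, scores)
```

Passing `log_prior` (for example the log of each class frequency) would add the prior term $P(Y=c_{k})$ from the classification formula.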
2. sklearn naive_bayes
from sklearn.naive_bayes import GaussianNB
clf=GaussianNB()
clf.fit(X_train,y_train)
GaussianNB(priors=None, var_smoothing=1e-09)
clf.score(X_test,y_test)
1.0
clf.predict([[4.4,3.2,1.3,0.2]])
array([0.])