1. Introduction
Bayes' theorem:

P(c|x) = P(c) P(x|c) / P(x)
From this formula it is clear that the class-conditional probability P(x|c) is a joint probability over all attributes, which is hard to estimate directly from a finite training set. To get around this, we assume all attributes are conditionally independent given the class, which yields the naive Bayes formula:

P(c|x) = (P(c) / P(x)) * Π_{i=1}^{d} P(x_i | c)
where d is the number of attributes and x_i is the value of x on the i-th attribute.
Since P(x) is the same for every class, we can drop it and obtain the classifier:

h(x) = argmax_c P(c) * Π_{i=1}^{d} P(x_i | c)
Because a long product of small probabilities easily underflows, we instead maximize the log-likelihood:

h(x) = argmax_c [ log P(c) + Σ_{i=1}^{d} log P(x_i | c) ]
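The underflow is easy to demonstrate numerically: a product of many small probabilities collapses to 0.0 in double precision, while the sum of their logarithms stays finite and comparable across classes. A minimal sketch (the probability values here are made up purely for illustration):

```python
import math

# 500 hypothetical per-attribute likelihoods, each a small probability
probs = [1e-3] * 500

# Naive product underflows to exactly 0.0 in IEEE double precision
product = 1.0
for p in probs:
    product *= p
print(product)    # 0.0 -- all information lost

# Summing the logs keeps the magnitude representable
log_sum = sum(math.log(p) for p in probs)
print(log_sum)    # a finite negative number, still usable for argmax
```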
Let D_c denote the set of class-c samples in the training set D. The prior is then estimated as:

P(c) = |D_c| / |D|
For a discrete attribute, let D_{c,x_i} be the subset of D_c whose i-th attribute takes the value x_i; the conditional probability is estimated as:

P(x_i | c) = |D_{c,x_i}| / |D_c|
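As a concrete instance of this counting: in the 17-sample dataset below, 色泽 = 青绿 appears in 3 of the 8 positive samples and in 3 of the 9 negative ones, so the empirical conditionals are 3/8 and 3/9. A sketch of the computation (the (色泽, label) pairs are copied from the dataset below):

```python
# (色泽, label) pairs extracted from the 17-sample dataset below
data = [
    ('青绿', '是'), ('乌黑', '是'), ('乌黑', '是'), ('青绿', '是'),
    ('浅白', '是'), ('青绿', '是'), ('乌黑', '是'), ('乌黑', '是'),
    ('乌黑', '否'), ('青绿', '否'), ('浅白', '否'), ('浅白', '否'),
    ('青绿', '否'), ('浅白', '否'), ('乌黑', '否'), ('浅白', '否'),
    ('青绿', '否'),
]

def cond_prob(value, label):
    # P(x_i | c) = |D_{c,x_i}| / |D_c|, estimated by counting
    in_class = [v for v, y in data if y == label]
    return in_class.count(value) / len(in_class)

print(cond_prob('青绿', '是'))   # 3/8 = 0.375
print(cond_prob('青绿', '否'))   # 3/9 ≈ 0.333
```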
For a continuous attribute, we assume p(x_i | c) follows a normal distribution, where μ_{c,i} and σ_{c,i} are the mean and standard deviation of the i-th attribute over the class-c samples:

p(x_i | c) = 1 / (√(2π) σ_{c,i}) * exp( -(x_i − μ_{c,i})² / (2 σ_{c,i}²) )
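For example, the class-conditional density of 密度 = 0.697 under the positive class can be computed directly from the mean and standard deviation of the 8 positive samples' 密度 column (values copied from the dataset below; the biased estimator, dividing by the class size, matches the code later in this post). Note the result is a density, so it can exceed 1:

```python
import math

# 密度 values of the 8 positive (好瓜 = 是) samples from the dataset below
densities = [0.697, 0.774, 0.634, 0.608, 0.556, 0.403, 0.481, 0.437]

mu = sum(densities) / len(densities)                              # sample mean
sigma = math.sqrt(sum((d - mu) ** 2 for d in densities) / len(densities))

x = 0.697
p = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)
print(round(mu, 3), round(sigma, 3), round(p, 3))   # mean ≈ 0.574
```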
Sometimes an attribute value never co-occurs with some class in the training set, making the estimated probability 0 and wiping out the entire product, which distorts the classification. We therefore commonly apply the "Laplace correction" to smooth the estimates:

P̂(c) = (|D_c| + 1) / (|D| + N)
P̂(x_i | c) = (|D_{c,x_i}| + 1) / (|D_c| + N_i)

where N is the number of classes and N_i is the number of possible values of the i-th attribute.
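A concrete illustration of the correction, with made-up counts (8 samples in the class, an attribute with 3 possible values, and a tested value that never appears with the class):

```python
count = 0    # times this attribute value co-occurs with the class (zero here)
n_c = 8      # |D_c|: samples of this class
n_i = 3      # N_i: number of possible values of the attribute

unsmoothed = count / n_c                # 0.0 -- zeroes out the whole product
smoothed = (count + 1) / (n_c + n_i)    # 1/11 ≈ 0.091, small but nonzero
print(unsmoothed, smoothed)
```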
2. Code Implementation
The data is as follows (columns: 编号, 色泽, 根蒂, 敲声, 纹理, 脐部, 触感, 密度, 含糖率, 好瓜):
1,青绿,蜷缩,浊响,清晰,凹陷,硬滑,0.697,0.46,是
2,乌黑,蜷缩,沉闷,清晰,凹陷,硬滑,0.774,0.376,是
3,乌黑,蜷缩,浊响,清晰,凹陷,硬滑,0.634,0.264,是
4,青绿,蜷缩,沉闷,清晰,凹陷,硬滑,0.608,0.318,是
5,浅白,蜷缩,浊响,清晰,凹陷,硬滑,0.556,0.215,是
6,青绿,稍蜷,浊响,清晰,稍凹,软粘,0.403,0.237,是
7,乌黑,稍蜷,浊响,稍糊,稍凹,软粘,0.481,0.149,是
8,乌黑,稍蜷,浊响,清晰,稍凹,硬滑,0.437,0.211,是
9,乌黑,稍蜷,沉闷,稍糊,稍凹,硬滑,0.666,0.091,否
10,青绿,硬挺,清脆,清晰,平坦,软粘,0.243,0.267,否
11,浅白,硬挺,清脆,模糊,平坦,硬滑,0.245,0.057,否
12,浅白,蜷缩,浊响,模糊,平坦,软粘,0.343,0.099,否
13,青绿,稍蜷,浊响,稍糊,凹陷,硬滑,0.639,0.161,否
14,浅白,稍蜷,沉闷,稍糊,凹陷,硬滑,0.657,0.198,否
15,乌黑,稍蜷,浊响,清晰,稍凹,软粘,0.36,0.37,否
16,浅白,蜷缩,浊响,模糊,平坦,硬滑,0.593,0.042,否
17,青绿,蜷缩,沉闷,稍糊,稍凹,硬滑,0.719,0.103,否
Python code:
import math
import numpy as np

# Load the dataset: one sample per line, comma-separated
def loadData(filename):
    dataSet = []
    with open(filename, encoding='utf-8') as fr:
        for line in fr.readlines():
            lineArr = line.strip().split(',')
            dataSet.append(lineArr)
    labels = ['编号', '色泽', '根蒂', '敲声', '纹理', '脐部', '触感', '密度', '含糖率', '好瓜']
    return dataSet, labels
# Naive Bayes classifier: score the test sample under both classes
def trainbayes(dataSet, labels):
    test = dataSet[0]            # use the first sample (first row) as the test sample
    pn = [3, 3, 3, 3, 3, 2]      # number of possible values of each of the 6 discrete attributes
    # log class priors with Laplace correction: 8 positive, 9 negative samples, 2 classes
    pc = np.log((8 + 1) / (17 + 2))   # log-score of the positive class (是)
    nc = np.log((9 + 1) / (17 + 2))   # log-score of the negative class (否)
    for i in range(1, 7):        # the 6 discrete attribute columns
        c = np.zeros(2)
        for j in range(17):      # all 17 samples
            if test[i] == dataSet[j][i] and dataSet[j][-1] == '是':
                c[0] += 1
            elif test[i] == dataSet[j][i] and dataSet[j][-1] == '否':
                c[1] += 1
        # conditional probabilities with Laplace correction
        pc += np.log((c[0] + 1) / (8 + pn[i - 1]))
        nc += np.log((c[1] + 1) / (9 + pn[i - 1]))
    # per-class sums of 密度 (column 7) and 含糖率 (column 8), for the means
    sumyes1 = 0.0; sumno1 = 0.0; sumyes2 = 0.0; sumno2 = 0.0
    for i in dataSet:
        if i[-1] == '是':
            sumyes1 += float(i[7])
            sumyes2 += float(i[8])
        elif i[-1] == '否':
            sumno1 += float(i[7])
            sumno2 += float(i[8])
    meanyes1 = sumyes1 / 8; meanno1 = sumno1 / 9
    meanyes2 = sumyes2 / 8; meanno2 = sumno2 / 9
    varyes1 = 0.0; varno1 = 0.0; varyes2 = 0.0; varno2 = 0.0
    for i in dataSet:            # squared deviations, each class around its own mean
        if i[-1] == '是':
            varyes1 += (float(i[7]) - meanyes1) ** 2
            varyes2 += (float(i[8]) - meanyes2) ** 2
        elif i[-1] == '否':
            varno1 += (float(i[7]) - meanno1) ** 2
            varno2 += (float(i[8]) - meanno2) ** 2
    # standard deviations (biased estimator, dividing by the class size)
    varyes1 = np.sqrt(varyes1 / 8); varyes2 = np.sqrt(varyes2 / 8)
    varno1 = np.sqrt(varno1 / 9); varno2 = np.sqrt(varno2 / 9)
    # add the log-densities of the two continuous attributes
    pc += np.log(normalDistribution(float(test[7]), meanyes1, varyes1)) \
        + np.log(normalDistribution(float(test[8]), meanyes2, varyes2))
    nc += np.log(normalDistribution(float(test[7]), meanno1, varno1)) \
        + np.log(normalDistribution(float(test[8]), meanno2, varno2))
    if pc > nc:
        print(pc, nc, '是')
    else:
        print(pc, nc, '否')
# Gaussian density for a continuous attribute
# x -- value; u -- mean; sigma -- standard deviation (not the variance)
def normalDistribution(x, u, sigma):
    return np.exp(-(x - u) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

if __name__ == '__main__':
    dataSet, labels = loadData('watermelon.txt')
    trainbayes(dataSet, labels)