Common application scenarios for naive Bayes:
1. Text classification
2. Image recognition
Core idea of Bayesian decision theory:
Let D be a dataset and (x, y) a data point in D. Let P_1(x,y) denote the probability that (x, y) belongs to class 1, and P_2(x,y) the probability that it belongs to class 2. Then:
1. If P_1(x,y) > P_2(x,y), classify (x, y) as class 1;
2. If P_2(x,y) > P_1(x,y), classify (x, y) as class 2.
Bayes' rule:
P(x|y) = \frac{P(y|x) * P(x)}{P(y)}
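As a quick sanity check, Bayes' rule can be verified numerically on a tiny hypothetical joint distribution (the probability values below are made up purely for illustration):

```python
# Verify P(x|y) = P(y|x) * P(x) / P(y) on a hypothetical joint
# distribution over two binary variables; p[(x, y)] are made-up values.
p = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def marginal_x(x):          # P(x), summing over y
    return p[(x, 0)] + p[(x, 1)]

def marginal_y(y):          # P(y), summing over x
    return p[(0, y)] + p[(1, y)]

x, y = 1, 1
direct = p[(x, y)] / marginal_y(y)              # P(x|y) by definition
p_y_given_x = p[(x, y)] / marginal_x(x)         # P(y|x) by definition
via_bayes = p_y_given_x * marginal_x(x) / marginal_y(y)
print(abs(direct - via_bayes) < 1e-12)  # True: both routes agree
```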
Implementation principle: text classification
Task: decide whether a sentence is abusive.
Text dataset: D = {D_1, D_2, ……, D_m}, where each D_i is a sentence, e.g. "my dog has flea problems";
Class labels: L = {l_1, l_2, ……, l_m}, where each l_i is 0 or 1 (1 marks an abusive sentence, 0 a normal one).
The model's job: given a sentence D_j, decide whether it is abusive or normal.
Implementation principle:
1. First, for each word d_i in the sentence D_j = {d_1, d_2, ……, d_n}, compute its abusive probability: p_1(d_i) = the probability that a sentence is abusive given that d_i appears in it, and p_0(d_i) = the probability that it is normal;
2. Next, compute the probability that D_j is abusive, p_1(D_j) = \sum(p_1(d_i)), and the probability that D_j is normal, p_0(D_j) = \sum(p_0(d_i));
3. If p_1(D_j) > p_0(D_j), then D_j is abusive; if p_1(D_j) < p_0(D_j), then D_j is normal.
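The three steps above can be sketched in a few lines, assuming the per-word probabilities from step 1 are already known (the numbers below are invented for illustration):

```python
# Hypothetical per-word class probabilities, as if produced by step 1:
p1_word = {'dog': 0.2, 'stupid': 0.9}   # p1(d_i): abusive probability per word
p0_word = {'dog': 0.8, 'stupid': 0.1}   # p0(d_i): normal probability per word

sentence = ['stupid', 'dog']
p1 = sum(p1_word[w] for w in sentence)  # step 2: sum p1(d_i) over the sentence
p0 = sum(p0_word[w] for w in sentence)  # step 2: sum p0(d_i) over the sentence
label = 1 if p1 > p0 else 0             # step 3: pick the larger score
print(label)  # 1 — 'stupid' dominates, so the sentence is judged abusive
```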
Here, p_1(d_i) and p_0(d_i) are computed with Bayes' rule:
p_0(d_i) = p(0|d_i) = \frac{p(d_i|0) * p(0)}{p(d_i)};
p_1(d_i) = p(1|d_i) = \frac{p(d_i|1) * p(1)}{p(d_i)}
p_1(d_i): the probability that D_j is abusive when the word d_i appears in it
p(d_i|1): the probability that d_i appears in abusive sentences
p(1): the fraction of all sentences that are abusive
p(d_i): the probability that d_i appears across all sentences
p_1(D_j) = \sum(p_1(d_i)) = \sum\frac{p(d_i|1) * p(1)}{p(d_i)} = p(1) * \sum\frac{p(d_i|1)}{p(d_i)}
p_0(D_j) = \sum(p_0(d_i)) = \sum\frac{p(d_i|0) * p(0)}{p(d_i)} = p(0) * \sum\frac{p(d_i|0)}{p(d_i)}
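Plugging hypothetical counts into the two formulas above shows how the per-class scores come out (every count below is an assumption for illustration, not data from the corpus used later):

```python
# Made-up word counts in abusive (class 1) and normal (class 0) sentences:
count1 = {'stupid': 3, 'dog': 2}   # occurrences inside abusive sentences
count0 = {'stupid': 0, 'dog': 1}   # occurrences inside normal sentences
n1, n0 = 7, 9                      # total word occurrences per class
p_abusive = 0.5                    # p(1); p(0) = 1 - p_abusive

def p_word(w):                     # p(d_i): overall frequency of word w
    return (count1[w] + count0[w]) / (n1 + n0)

sentence = ['stupid', 'dog']
p1 = p_abusive * sum((count1[w] / n1) / p_word(w) for w in sentence)
p0 = (1 - p_abusive) * sum((count0[w] / n0) / p_word(w) for w in sentence)
print('abusive' if p1 > p0 else 'normal')  # prints: abusive
```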
import numpy as np
# Build the sample sentences and their class labels
def loadDataSet():
    postingList = [['my','dog','has','flea','problems','help','please'],
                   ['maybe','note','take','him','to','dog','park','stupid'],
                   ['my','dalmation','is','so','cute','i','love','him'],
                   ['stop','posting','stupid','worthless','garbage'],
                   ['mr','licks','ate','my','steak','how','to','stop','him'],
                   ['quit','buying','worthless','dog','food','stupid']]
    classVec = [0,1,0,1,0,1]  # class labels: 1 = abusive, 0 = normal
    return postingList, classVec
# Collect every unique word in dataSet into one vocabulary list
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)
    return list(vocabSet)
# Convert a list of words into a 0/1 presence vector over vocabList
def setOfWords2Vec(vocabList, inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1
        else:
            print('The word %s is not in my Vocabulary!' % word)
    return returnVec
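The set-of-words encoding can be checked in isolation; the tiny vocabulary and sentence below are assumptions for illustration, restating the same logic as `setOfWords2Vec`:

```python
# Each output position marks whether the corresponding vocabulary word
# appears in the sentence — presence only, not counts.
vocab = ['my', 'dog', 'stupid']          # hypothetical vocabulary
sentence = ['stupid', 'dog', 'stupid']   # hypothetical input
vec = [1 if w in sentence else 0 for w in vocab]
print(vec)  # [0, 1, 1] — 'stupid' appears twice but is still marked 1
```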
# Estimate the per-word class probabilities p_1(d_i) and p_0(d_i)
def trainNB0(trainMatrix, trainCategory):
    numTrainDocs = len(trainMatrix)   # number of documents
    numWords = len(trainMatrix[0])    # vocabulary size (deduplicated words)
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # p(1): fraction of abusive documents
    p0Num = np.zeros(numWords)
    p1Num = np.zeros(numWords)
    p0Denom = 0.0
    p1Denom = 0.0
    pNum = np.zeros(numWords)
    pDenom = 0.0
    for i in range(numTrainDocs):
        pNum += trainMatrix[i]            # occurrences of each word over all sentences
        pDenom += sum(trainMatrix[i])     # total word occurrences over all sentences
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]         # occurrences of each word in abusive sentences
            p1Denom += sum(trainMatrix[i])  # total word occurrences in abusive sentences
        else:
            p0Num += trainMatrix[i]         # occurrences of each word in normal sentences
            p0Denom += sum(trainMatrix[i])  # total word occurrences in normal sentences
    pVect = pNum / pDenom      # p(d_i)
    p1Vect = p1Num / p1Denom   # p(d_i|1)
    p0Vect = p0Num / p0Denom   # p(d_i|0)
    # Bayes' rule per word: p_1(d_i) = p(d_i|1)*p(1)/p(d_i), likewise for class 0
    p1Vect = p1Vect / pVect * pAbusive
    p0Vect = p0Vect / pVect * (1.0 - pAbusive)
    return p1Vect, p0Vect
# Classify a document vector by comparing its class-1 and class-0 scores
def classifyNB(vec2Classify, p0Vect, p1Vect):
    vec2Classify = np.array(vec2Classify)
    p1 = sum(vec2Classify * p1Vect) / sum(vec2Classify)
    p0 = sum(vec2Classify * p0Vect) / sum(vec2Classify)
    if p1 > p0:
        return 1, p1, p0
    else:
        return 0, p1, p0
listOPosts, listClasses = loadDataSet()    # sample sentences listOPosts and labels listClasses
myVocabList = createVocabList(listOPosts)  # deduplicated vocabulary built from listOPosts
trainMat = []
for val in listOPosts:
    trainMat.append(setOfWords2Vec(myVocabList, val))  # vectorize each sentence into trainMat
p1Vect, p0Vect = trainNB0(trainMat, listClasses)
entSet = ['help','my','food']
entVec = setOfWords2Vec(myVocabList, entSet)
a, p1, p0 = classifyNB(entVec, p1Vect, p0Vect)
print(entSet, ' ——> class as : ', a, '======\n',
      ' p1 = %.2f ' % (p1*100), '======\n',
      ' p0 = %.2f ' % (p0*100), '======\n')
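One caveat with the zero-initialized counts in trainNB0: a word that never appears in one class gets probability 0 for that class, which can zero out the class's contribution entirely. A common refinement (Laplace smoothing, not part of the code above) starts every count at 1; the sketch below shows the idea on made-up count vectors:

```python
import numpy as np

# Laplace-smoothed per-class word frequencies: counts start at 1 and the
# denominator at 2, so no word ever gets probability exactly 0.
def smoothed_word_probs(class_rows):
    # class_rows: word-count vectors for one class (same layout as
    # trainMat rows above — this layout is an assumption here)
    num = np.ones(len(class_rows[0]))   # start every word count at 1
    denom = 2.0                         # start the total at 2
    for row in class_rows:
        num += np.asarray(row)
        denom += sum(row)
    return num / denom

rows = [[1, 0, 2], [0, 0, 1]]           # made-up count vectors
probs = smoothed_word_probs(rows)
print(probs)                            # no entry is exactly 0
```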