(Machine Learning) Naive Bayes and Its Implementation Principles

Common application scenarios for naive Bayes:

1. Text classification
2. Image recognition

Core idea of Bayesian decision theory:

Let D be a dataset and (x, y) a data point in D. Let $P_1(x,y)$ denote the probability that (x, y) belongs to class 1, and $P_2(x,y)$ the probability that it belongs to class 2. Then:
1. If $P_1(x,y) > P_2(x,y)$, classify (x, y) as class 1;
2. If $P_2(x,y) > P_1(x,y)$, classify (x, y) as class 2.
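As a quick illustration with hypothetical numbers: if $P_1(x,y) = 0.7$ and $P_2(x,y) = 0.3$, then $P_1(x,y) > P_2(x,y)$, so (x, y) is assigned to class 1.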

Bayes' rule:

$$P(x \mid y) = \frac{P(y \mid x) \, P(x)}{P(y)}$$
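A small worked example with hypothetical values: suppose $P(y \mid x) = 0.6$, $P(x) = 0.5$, and $P(y) = 0.4$. Then

$$P(x \mid y) = \frac{0.6 \times 0.5}{0.4} = 0.75$$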

Implementation principle: text classification

Task: decide whether a sentence is abusive.
Text dataset: $D = \{D_1, D_2, \ldots, D_m\}$, where each $D_i$ is a sentence, e.g. "my dog has flea problems".
Class labels: $L = \{l_1, l_2, \ldots, l_m\}$, where each $l_i$ is 0 or 1 (1 marks an abusive sentence, 0 a normal one).
Goal of the model: given a sentence $D_j$, decide whether it is abusive or normal.
Approach:
1. First compute, for each word $d_i$ in the sentence $D_j = \{d_1, d_2, \ldots, d_n\}$, its per-class probabilities: $p_1(d_i)$, the probability that a sentence is abusive given that $d_i$ appears, and $p_0(d_i)$, the probability that it is normal;
2. Then compute the probability that $D_j$ is abusive, $p_1(D_j) = \sum_i p_1(d_i)$, and the probability that it is normal, $p_0(D_j) = \sum_i p_0(d_i)$;
3. If $p_1(D_j) > p_0(D_j)$, classify $D_j$ as abusive; if $p_1(D_j) < p_0(D_j)$, classify it as normal.

Here $p_1(d_i)$ (and likewise $p_0(d_i)$) is computed with Bayes' rule:

$$p_0(d_i) = p(0 \mid d_i) = \frac{p(d_i \mid 0) \, p(0)}{p(d_i)}$$
$$p_1(d_i) = p(1 \mid d_i) = \frac{p(d_i \mid 1) \, p(1)}{p(d_i)}$$

where:
$p_1(d_i)$: the probability that a sentence $D_j$ is abusive given that the word $d_i$ appears in it;
$p(d_i \mid 1)$: the probability that $d_i$ appears in an abusive sentence;
$p(1)$: the fraction of abusive sentences among all sentences;
$p(d_i)$: the probability that $d_i$ appears across all sentences.

Substituting into the sums:

$$p_1(D_j) = \sum_i p_1(d_i) = \sum_i \frac{p(d_i \mid 1) \, p(1)}{p(d_i)} = p(1) \sum_i \frac{p(d_i \mid 1)}{p(d_i)}$$
$$p_0(D_j) = \sum_i p_0(d_i) = \sum_i \frac{p(d_i \mid 0) \, p(0)}{p(d_i)} = p(0) \sum_i \frac{p(d_i \mid 0)}{p(d_i)}$$
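Plugging in hypothetical but internally consistent numbers makes the formula concrete. For a two-word sentence with $p(1) = p(0) = 0.5$, $\frac{p(d_1 \mid 1)}{p(d_1)} = 1.5$, $\frac{p(d_2 \mid 1)}{p(d_2)} = 0.4$, $\frac{p(d_1 \mid 0)}{p(d_1)} = 0.5$, and $\frac{p(d_2 \mid 0)}{p(d_2)} = 1.6$:

$$p_1(D_j) = 0.5 \times (1.5 + 0.4) = 0.95, \qquad p_0(D_j) = 0.5 \times (0.5 + 1.6) = 1.05$$

Since $p_0(D_j) > p_1(D_j)$, this sentence would be classified as normal.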

import numpy as np

#Create the sample sentences and their class labels
def loadDataSet():
    postingList = [['my','dog','has','flea','problems','help','please'],
                   ['maybe','not','take','him','to','dog','park','stupid'],
                   ['my','dalmation','is','so','cute','i','love','him'],
                   ['stop','posting','stupid','worthless','garbage'],
                   ['mr','licks','ate','my','steak','how','to','stop','him'],
                   ['quit','buying','worthless','dog','food','stupid']]
    classVec = [0,1,0,1,0,1]  #class labels: 1 = abusive post, 0 = normal post
    return postingList,classVec

#Build the vocabulary: collect every word appearing in dataSet, de-duplicated, into one list
def createVocabList(dataSet):
    vocabSet = set([])
    for document in dataSet:
        vocabSet = vocabSet | set(document)  #union with this document's words
    return list(vocabSet)

#Convert a list of words into a 0/1 vector over the vocabulary (set-of-words model)
def setOfWords2Vec(vocabList,inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] = 1  #mark the word as present
        else:
            print('The word %s is not in my Vocabulary!' % word)
    return returnVec
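
A common variant, not used in the pipeline below, counts word occurrences instead of only marking presence (a bag-of-words model); a minimal sketch under that assumption:

#Hypothetical bag-of-words variant: count occurrences instead of presence
def bagOfWords2Vec(vocabList,inputSet):
    returnVec = [0]*len(vocabList)
    for word in inputSet:
        if word in vocabList:
            returnVec[vocabList.index(word)] += 1  #increment the count for each occurrence
    return returnVec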
#Compute each word's class probability from the training matrix
def trainNB0(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)   #number of documents
    numWords = len(trainMatrix[0])    #vocabulary size (distinct words)
    pAbusive = sum(trainCategory) / float(numTrainDocs)  #fraction of abusive documents, p(1)
    p0Num = np.zeros(numWords)
    p1Num = np.zeros(numWords)
    p0Denom = 0.0
    p1Denom = 0.0
    pNum = np.zeros(numWords)
    pDenom = 0.0
    for i in range(numTrainDocs):
        pNum += trainMatrix[i]         #occurrences of each word over all sentences
        pDenom += sum(trainMatrix[i])  #total word occurrences over all sentences
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]         #occurrences of each word in abusive sentences
            p1Denom += sum(trainMatrix[i])  #total word occurrences in abusive sentences
        else:
            p0Num += trainMatrix[i]         #occurrences of each word in normal sentences
            p0Denom += sum(trainMatrix[i])  #total word occurrences in normal sentences
    pVect = pNum / pDenom     #p(d_i): word probability over all sentences
    p1Vect = p1Num / p1Denom  #p(d_i|1): word probability within abusive sentences
    p0Vect = p0Num / p0Denom  #p(d_i|0): word probability within normal sentences
    for i in range(len(pVect)):
        p1Vect[i] = p1Vect[i] / pVect[i] * pAbusive          #p_1(d_i) = p(d_i|1)*p(1)/p(d_i)
        p0Vect[i] = p0Vect[i] / pVect[i] * (1.0 - pAbusive)  #p_0(d_i) = p(d_i|0)*p(0)/p(d_i)
    return p1Vect,p0Vect
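
As an aside: with np.zeros initialization, any word absent from one class gets probability 0 for that class. A common remedy in product-based naive Bayes (used, for example, in Machine Learning in Action) is Laplace smoothing plus log probabilities, which avoids zeros and numeric underflow. A minimal sketch of that variant, not part of the pipeline above:

#Hypothetical smoothed, log-space variant of trainNB0 (assumes the same inputs)
def trainNB0Smoothed(trainMatrix,trainCategory):
    numTrainDocs = len(trainMatrix)
    numWords = len(trainMatrix[0])
    pAbusive = sum(trainCategory) / float(numTrainDocs)
    p0Num = np.ones(numWords)  #Laplace smoothing: start every count at 1
    p1Num = np.ones(numWords)
    p0Denom = 2.0              #and every denominator at 2
    p1Denom = 2.0
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = np.log(p1Num / p1Denom)  #log p(d_i|1), safe to sum without underflow
    p0Vect = np.log(p0Num / p0Denom)  #log p(d_i|0)
    return p0Vect,p1Vect,pAbusive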

#Classify an input vector: compare the average per-word class probabilities
def classifyNB(vec2Classify,p0Vect,p1Vect):
    p1 = sum(vec2Classify*p1Vect) / sum(vec2Classify)  #mean p_1(d_i) over the words present
    p0 = sum(vec2Classify*p0Vect) / sum(vec2Classify)  #mean p_0(d_i) over the words present
    if p1 > p0:
        return 1,p1,p0
    else:
        return 0,p1,p0
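
The matching classifier for the hypothetical log-space variant sketched above would sum the log word probabilities and add the log class prior:

#Hypothetical classifier paired with trainNB0Smoothed (log-space)
def classifyNBLog(vec2Classify,p0Vect,p1Vect,pAbusive):
    vec = np.array(vec2Classify)
    p1 = np.sum(vec * p1Vect) + np.log(pAbusive)        #log p(D|1) + log p(1)
    p0 = np.sum(vec * p0Vect) + np.log(1.0 - pAbusive)  #log p(D|0) + log p(0)
    return 1 if p1 > p0 else 0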

listOPosts,listClasses = loadDataSet()     #load the sample sentences listOPosts and their labels listClasses
myVocabList = createVocabList(listOPosts)  #build the de-duplicated vocabulary myVocabList
trainMat = []
for val in listOPosts:
    a = setOfWords2Vec(myVocabList,val)
    trainMat.append(a)                     #convert the sentence dataset listOPosts into the vector set trainMat
p1Vect,p0Vect = trainNB0(trainMat,listClasses)
entSet = ['help','my','food']
entVec = setOfWords2Vec(myVocabList,entSet)
a,p1,p0 = classifyNB(entVec,p1Vect,p0Vect)
print(entSet,' --> class as : ',a,'======\n',
      ' p1 = %.2f ' % (p1*100),'======\n',
      ' p0 = %.2f ' % (p0*100),'======\n')
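
To sanity-check the other class, one could also classify a hypothetical test sentence built from words that the training data labels abusive, reusing the functions above:

#Hypothetical second test: words drawn only from the abusive training posts
testSet = ['stupid','garbage']
testVec = setOfWords2Vec(myVocabList,testSet)
label,p1,p0 = classifyNB(testVec,p1Vect,p0Vect)
print(testSet,' --> class as : ',label)  #expected to lean toward class 1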