Machine Learning in Action - Classification Based on Probability Theory: Naive Bayes

This post gives an accessible introduction to the naive Bayes classification method. Starting from basic probability theory, it explains how to make decisions with Bayes' rule, walks through the probability calculations and the training algorithm using abusive-comment classification as the running example, discusses how to deal with zero probabilities and numerical underflow, and provides Python code throughout. It also shows how to apply naive Bayes to filtering spam e-mail and to analyzing regional tendencies in personal ads.


My earlier posts on k-nearest neighbors and decision trees did not explain each function and its test cases step by step, which left readers who are not working through the book staring at a wall of code. From here on I will explain in detail what every function does along with its corresponding test case, and I will revise the earlier posts when I find time.

Just as with ordinary probabilities, if event A is more likely than event B, our "decision machinery" picks A; otherwise it picks B. First we need one piece of background: conditional probability, the probability that event A occurs given that event B has already occurred. It is written P(A|B), read "the probability of A given B", and computed as P(A|B) = P(AB)/P(B); the derivation is easy to look up if you want it.
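As a quick sanity check (my own toy example, not from the book), we can verify P(A|B) = P(AB)/P(B) by brute-force enumeration of a fair die, with B = "the roll is greater than 3" and A = "the roll is even":

outcomes = [1, 2, 3, 4, 5, 6]          # the six equally likely rolls
B = [x for x in outcomes if x > 3]     # event B: roll greater than 3
AB = [x for x in B if x % 2 == 0]      # event A and B: roll > 3 and even

p_B = len(B) / float(len(outcomes))    # P(B)  = 3/6
p_AB = len(AB) / float(len(outcomes))  # P(AB) = 2/6
print(p_AB / p_B)                      # P(A|B) = 2/3, i.e. about 0.6667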

Bayes' rule tells us how to swap the condition and the outcome in a conditional probability: if we know P(x|c), we can get P(c|x) from P(c|x) = (P(x|c)*P(c)) / P(x), which means we can compute an unknown quantity from known ones. Here P(c1|x,y) and P(c2|x,y) mean: given a data point described by the two features x and y, the probability that it belongs to class c1 and the probability that it belongs to class c2. If P(c1|x,y) is greater than P(c2|x,y), the point belongs to c1; otherwise it belongs to c2.

From the formula P(c|x) = (P(x|c)*P(c)) / P(x) we get P(c1|x) = (P(x|c1)*P(c1)) / P(x) and P(c2|x) = (P(x|c2)*P(c2)) / P(x). To classify a point with feature x we compare P(c1|x) with P(c2|x), and since both are divided by the same P(x), dropping that division does not change which one is larger, and it saves us from having to estimate one more quantity. So comparing P(c1|x) with P(c2|x) reduces to comparing P(x|c1)*P(c1) with P(x|c2)*P(c2).
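A tiny sketch of that decision rule, with the likelihoods and priors made up purely for illustration:

p_x_given_c1, p_c1 = 0.20, 0.5   # assumed P(x|c1) and prior P(c1)
p_x_given_c2, p_c2 = 0.05, 0.5   # assumed P(x|c2) and prior P(c2)

score_c1 = p_x_given_c1 * p_c1   # proportional to P(c1|x); the common P(x) is dropped
score_c2 = p_x_given_c2 * p_c2   # proportional to P(c2|x)
print('c1' if score_c1 > score_c2 else 'c2')   # prints 'c1'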

Didn't quite follow that? No problem, the example below makes it easy to see what all the P's, c1, c2 and x are about.

Example: suppose we have a message board and we want it to automatically decide whether a user's comment is abusive. How do we do that? We train the machine so it can make the decision itself; the decision will have some error, and in our simple example the error is actually fairly large. We classify based on individual words: "stupid", say, counts as abusive while "like" does not. Here we treat every word as independent of every other word, i.e. whether one word appears has nothing to do with the other words; this is exactly the "naive" in naive Bayes, we only consider the simplest possible assumption. In reality some words are strongly tied to others: "to", for example, is closely related to "want", the forms of "be" and so on, and usually appears right after them or in front of a verb. The book also points out that naive Bayes additionally assumes every word is equally important, which is clearly unrealistic, but for now we accept both assumptions so we can learn the basic idea.
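Under that independence assumption, the probability of a whole message given a class factors into a product over its words: P(w1, w2, ..., wn | c) = P(w1|c) * P(w2|c) * ... * P(wn|c). A tiny sketch with invented per-word probabilities:

# invented conditional probabilities P(word | abusive), for illustration only
p_word_given_abusive = {'stupid': 0.15, 'dog': 0.05, 'worthless': 0.10}

message = ['stupid', 'dog']
p_message = 1.0
for w in message:
    p_message *= p_word_given_abusive[w]   # P(w1, w2 | c) = P(w1|c) * P(w2|c)
print(p_message)                           # 0.0075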

I am running python3.6 here. Later on there is a file-reading step with an encoding issue I could not resolve under python3.6, so I will switch to python2.7 at that point. Note that to use Chinese characters under 2.7 you must put #encoding=utf-8 at the very top of the file.

Let's first create bayes.py and put the following code in it:

# encoding=utf-8
from numpy import *

"""
Create two lists: one containing several sentences, the other holding the label of each sentence, where 1 means abusive.
Think of it as one document split into 6 sentences.
"""
def loadDataSet():
    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
    classVec = [0, 1, 0, 1, 0, 1]
    return postingList, classVec

"""这个集合返回的是刚才6句话中所有的单词, 重复多次出现的只统计一次, 也就是统计所有出现的单词"""
def createVocabList(dataSet):
    vacabSet = set([])
    for document in dataSet:
        vacabSet = vacabSet | set(document)
    return vacabSet

"""
First create a list with as many elements as the vocabulary set returned by createVocabList, all initialized to 0 (meaning "not seen").
For every word in inputSet that is in the vocabulary, add 1 to its slot (note that += 1 counts repeated words, i.e. this is really the bag-of-words variant; the book's set-of-words version uses = 1, which makes no difference for this data set).
"""
def setOfWords2Vec(vacabList, inputSet):
    returnVec = [0] * len(vacabList)
    for word in inputSet:
        if word in vacabList:
            """此处要转换为list, set没有index这个属性, pyhton3不转换list提示错误"""
            returnVec[list(vacabList).index(word)] += 1
        else:
            print("the word %s is not in my Vocabulary!" % word)
    return returnVec

Now let's create a test.py:

#encoding=utf-8
import bayes


listOfPosts, listClasses = bayes.loadDataSet()
print(listOfPosts)
print(listClasses)
"""
Output:
[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
 ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
 ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
 ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
 ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
 ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
[0, 1, 0, 1, 0, 1]
"""

myVocabList = bayes.createVocabList(listOfPosts)
print(myVocabList)
"""
Output:
set(['cute', 'love', 'help', 'garbage', 'quit', 'I', 'problems', 'is', 'park', 'stop', 'flea', 'dalmation', 'licks', 'food', 'not', 'him', 'buying', 'posting', 'has', 'worthless', 'ate', 'to', 'maybe', 'please', 'dog', 'how', 'stupid', 'so', 'take', 'mr', 'steak', 'my'])
"""

returnVec = bayes.setOfWords2Vec(myVocabList, listOfPosts[0])
print(returnVec)
"""
Output:
[0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1]
"""

Next comes the training algorithm: computing probabilities from the word vectors. The pseudocode is:

Count the number of documents in each class
For every training document:
    For each class:
        If a token (word) appears in the document -> increment the count for that token (word)
        Increment the count of total tokens (words)
For each class:
    For each token (word):
        Divide the count of that token (word) by the total token count to get the conditional probability
Return the conditional probabilities of each class

Hard to follow the pseudocode? Let's look at the function below (it goes into the bayes.py we created earlier):

def trainNB0(trainMatrix, trainCategory):
    """获取总的文档数目numTrainDocs"""
    numTrainDocs = len(trainMatrix)
    """获取每个文档的词条(单词)的数目"""
    numWords = len(trainMatrix[0])
    """
    Since every entry of trainCategory is either 1 or 0, sum(trainCategory) is the number of class-1 (abusive) documents.
    Convert to float so the division is not truncated (this matters under Python 2's integer division).
    """
    pAbusive = float(sum(trainCategory)) / numTrainDocs
    """
    Create two vectors with one entry per vocabulary word, all initialized to 1 (why 1 is explained below).
    Conceptually these would be
    p0num = zeros(numWords)
    p1num = zeros(numWords)
    i.e. they "should" start at 0, but we deliberately start them at 1.
    """
    p0num = ones(numWords)
    p1num = ones(numWords)
    """
    The totals are likewise initialized to 2; conceptually they would start at
    p0Denom = 0.0
    p1Denom = 0.0
    """
    p0Denom = 2.0
    p1Denom = 2.0
    """
    The if/else below implements:
    for each class:
        if a token (word) appears in the document -> increment the count for that token (word)
        increment the count of total tokens (words)
    """
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:
            p1num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:
            p0num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    """
    The return below implements:
    for each class:
        for each token (word):
            divide the count of that token (word) by the total token count to get the conditional probability
    and returns the (log) conditional probabilities of each class
    """
    return log(p0num / p0Denom), log(p1num / p1Denom), pAbusive

In the end p0num/p0Denom and p1num/p1Denom are, for every word in the vocabulary, the probability of that word given class 0 (non-abusive) and given class 1 (abusive). But when we use the Bayes classifier on a document we multiply many of these probabilities together to get the probability of the document belonging to a class, and then pick the class with the larger value; for class 1 (abusive) that product is P(x0|1)P(x1|1)P(x2|1)...P(xn|1). If we had initialized the counts to 0 and the input document contained a word that never appeared in the training documents of that class, its probability would be 0, and since 0 times anything is 0 the whole product would collapse to 0. To reduce the impact of such zeros we initialize every word count to 1 (and the denominators to 2). This smoothing does introduce a small error of its own, but it is far better than the problem caused by zeros.
Beyond that we also have to worry about numerical underflow: multiplying many small numbers can make the final result so tiny that it rounds down to zero and the comparison becomes meaningless, so we use logarithms to deal with it.
[Figure: f(x) and ln(f(x)) plotted on the interval (0, 1)]
As the figure shows, when f(x) lies between 0 and 1, ln(f(x)) rises and falls together with f(x) (the logarithm is monotonically increasing), and f(x1)f(x2) can be turned into ln(f(x1)) + ln(f(x2)), so multiplication becomes addition and the underflow problem disappears. That is why the function no longer returns the probability vectors themselves but their logarithms, which we later add instead of multiplying, i.e. it returns log(p0num/p0Denom) and log(p1num/p1Denom).
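A quick sketch (my own illustration, not from the book) of the underflow problem and the log fix:

from math import log

probs = [0.01] * 200                  # 200 small per-word probabilities

product = 1.0
for p in probs:
    product *= p
print(product)                        # 0.0 -- the true value 1e-400 underflows to zero

log_sum = sum(log(p) for p in probs)  # ln(p1*p2*...*pn) = ln(p1) + ... + ln(pn)
print(log_sum)                        # about -921.03, still perfectly usable for comparison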

Let's test it in test.py:

trainMat = []
for i in range(len(listOfPosts)):
    trainMat.append(bayes.setOfWords2Vec(myVocabList, listOfPosts[i]))
p0, p1, pAbusive = bayes.trainNB0(trainMat, listClasses)
print(p0)
print(p1)
print(pAbusive)
"""
Output:
[-2.56494936 -2.56494936 -2.56494936 -2.56494936 -2.56494936 -2.56494936
 -3.25809654 -2.56494936 -3.25809654 -2.56494936 -1.87180218 -2.56494936
 -2.56494936 -3.25809654 -3.25809654 -2.56494936 -3.25809654 -2.56494936
 -3.25809654 -3.25809654 -2.56494936 -2.56494936 -2.56494936 -3.25809654
 -3.25809654 -2.56494936 -2.56494936 -3.25809654 -3.25809654 -2.56494936
 -2.56494936 -2.15948425]
[-3.04452244 -3.04452244 -3.04452244 -3.04452244 -3.04452244 -3.04452244
 -2.35137526 -3.04452244 -1.65822808 -3.04452244 -3.04452244 -2.35137526
 -3.04452244 -2.35137526 -2.35137526 -3.04452244 -1.94591015 -2.35137526
 -2.35137526 -2.35137526 -3.04452244 -3.04452244 -3.04452244 -2.35137526
 -2.35137526 -3.04452244 -3.04452244 -2.35137526 -2.35137526 -3.04452244
 -1.94591015 -2.35137526]
0.5
"""

Next let's test what we have written. Add the following to bayes.py:

"""
The first argument is the input converted to a word-count vector, e.g. [1, 2, 3, 1, 1, ...]; the other arguments are what their names suggest.
We compute the log score for class 1 and for class 0; as discussed above, the logarithm rises and falls with the probability,
so comparing the log values p1 and p0 is the same as comparing the underlying probabilities, and we return whichever class scores higher.
"""
def classifyNB(vec2classify, p0Vec, p1Vec, pClass1):
    p1 = sum(vec2classify * p1Vec) + log(pClass1)
    p0 = sum(vec2classify * p0Vec) + log(1.0 - pClass1)
    # print( "p0 is: ", p0, ", p1 is: ", p1)
    if p1 > p0:
        return 1
    else:
        return 0


def testingNB():
    """首先得到单词向量和分类向量"""
    listOPosts, listclasses = loadDataSet()
    """根据得到的单词向量得到不含重复词汇的set集"""
    myVocabList = createVocabList(listOPosts)
    trainMat = []
    """
    Convert each sentence into a vector over myVocabList.
    For example, if the sentence list listOPosts were [['a', 'b', 'c'],
                                                       ['a', 'c', 'd'],
                                                       ['e', 'f']]
    then myVocabList would be ['a', 'b', 'c', 'd', 'e', 'f'] (the order depends on the set, not necessarily alphabetical).
    setOfWords2Vec turns each sentence into the corresponding count vector:
    ['a', 'b', 'c'] maps to [1, 1, 1, 0, 0, 0] against ['a', 'b', 'c', 'd', 'e', 'f'],
    and ['e', 'f'] maps to [0, 0, 0, 0, 1, 1].
    Each resulting row is appended to trainMat.
    """
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
    """调用函数得到相应的log值和为1的概率"""
    p0Vec, p1Vec, pAbusive = trainNB0(trainMat, listclasses)
    """下面是一些测试"""
    testEntry = ['love', 'my', 'dalmation']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, " classified as: ", classifyNB(thisDoc, p0Vec, p1Vec, pAbusive))
    testEntry = ['stupid', 'garbage']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, " classified as: ", classifyNB(thisDoc, p0Vec, p1Vec, pAbusive))
    testEntry = ['stupid', 'garbage', 'food']
    thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
    print(testEntry, " classified as: ", classifyNB(thisDoc, p0Vec, p1Vec, pAbusive))

Now let's call what we just wrote. Add to test.py:

bayes.testingNB()
"""
Output:
['love', 'my', 'dalmation']  classified as:  0
['stupid', 'garbage']  classified as:  1
['stupid', 'garbage', 'food']  classified as:  1
"""

Good. Now let's use the same idea and the code we have already written for a simple real example: filtering spam e-mail with naive Bayes.
Up to this point everything has been run under python3.6; for the program below we switch to python2.7:

"""解析我们的文件, 解析为字符串列表, 然后去掉我们长度小于3的单词, 因为诸如is, of之类的单词实际上不用于分类"""
def textParse(bigString):
    import re
    listOfTikens = re.split(r'\W+', bigString)  # \W+ rather than \W*: splitting on an empty match is deprecated in newer Python
    return [tok.lower() for tok in listOfTikens if len(tok) > 2]
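"""
A quick sketch of what textParse does to a sample string (my own example, not from the book):
textParse('Hi there, is THIS working??')  ->  ['there', 'this', 'working']
('Hi' and 'is' are dropped because they are not longer than 2 characters)
"""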


def spamTest():
    docList = []
    classes = []
    fullText = []
    """打开我们的文件, 放到我们的列表中"""
    for i in range(1, 26):
        docList.append(textParse(open('spam/%d.txt' % i).read()))
        fullText.extend(textParse(open('spam/%d.txt' % i).read()))
        classes.append(1)
        """
        The two lines below run fine under python2.7 but raise an encoding error under python3.6.
        I have not found the cause yet; presumably one of the ham files is not valid UTF-8, even though
        the spam files above parse without trouble. (Passing an explicit encoding, e.g.
        open(..., encoding='ISO-8859-1'), will usually work around it in python3.)
       """
        docList.append(textParse(open('ham/%d.txt' % i).read()))
        fullText.extend(textParse(open('ham/%d.txt' % i).read()))
        classes.append(0)
    """这下面的思路基本上与刚才的相同"""
    vocabList = createVocabList(docList)
    trainingSet = list(range(50))  # 50 documents were loaded above (25 spam + 25 ham)
    testSet = []
    """这里我们随机选取出10篇文章, 剩下的用于训练, 选出的用于测试"""
    for i in range(10):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del (trainingSet[randIndex])
    errorCount = 0
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classes[docIndex])
    p0, p1, pAbusive = trainNB0(array(trainMat), array(trainClasses))
    for docIndex in testSet:
        if classifyNB(array(setOfWords2Vec(vocabList, docList[docIndex])), p0, p1, pAbusive) != classes[docIndex]:
            errorCount += 1
    print("'the error rate is : ", float(errorCount) / len(testSet))

Add to test.py:

bayes.spamTest()
"""
Output:
("the error rate is : ", 0.1)
"""

The book's next example uses a naive Bayes classifier to discover regional attitudes from personal ads. We take people from two American cities, analyze the personal ads they post, and ask whether the two cities differ in the words people use; if they do differ, which words are most characteristic of each city? In short, only the data source changes: instead of reading files we now read RSS feeds, and the analysis of the text is the same as before. One thing I did not figure out is where the 'entries' and 'summary' in feed1['entries'][i]['summary'] come from, i.e. how you are supposed to know those keys; the book just writes it that way, and I am still a beginner.
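One way to answer that question is simply to inspect what feedparser gives back: feedparser.parse() returns a dictionary-like object, so you can list its keys and the keys of each entry. A small exploratory sketch (note the feed may come back empty if the URL no longer serves RSS):

import feedparser

ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
print(ny.keys())                         # top-level keys, 'entries' among them
if len(ny['entries']) > 0:
    print(ny['entries'][0].keys())       # per-post keys, 'summary' among them
    print(ny['entries'][0]['summary'])   # the text of the first post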

Here we go back to python3.6:

"""该函数将词汇按出现的频率排序返回前30个最常出现的, 也就是出现次数最多的30个"""
def calMostFreq(vocabList, fullText):
    import operator
    freqDict = {}
    for word in vocabList:
        freqDict[word] = fullText.count(word)
    sortedFreq = sorted(freqDict.items(), key=operator.itemgetter(1), reverse=True)
    return sortedFreq[:30]

"""该函数类似于spamTest(), 但是它访问的是RSS源而不是文件"""
def localWords(feed1, feed0):
    classes = []
    fullText = []
    docList = []
    """根据短的那一个来选取"""
    minLength = min(len(feed1['entries']), len(feed0['entries']))
    for i in range(minLength):
        docList.append(textParse(feed1['entries'][i]['summary']))
        classes.append(1)
        fullText.extend(textParse(feed1['entries'][i]['summary']))
        docList.append(textParse(feed0['entries'][i]['summary']))
        classes.append(0)
        fullText.extend(textParse(feed0['entries'][i]['summary']))
    # print(docList)
    vocabList = createVocabList(docList)
    freqDict = calMostFreq(vocabList, fullText)
    """calMostFreq returns (word, count) pairs, so use pair[0] to get the word itself"""
    for pair in freqDict:
        if pair[0] in vocabList:
            vocabList.remove(pair[0])
    trainingSet = list(range(2*minLength))
    testSet = []
    for i in range(20):
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []
    trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(setOfWords2Vec(vocabList, docList[docIndex]))
        trainClasses.append(classes[docIndex])
    p0Vec, p1Vec, pAbuive = trainNB0(array(trainMat), array(trainClasses))
    errorCount = 0.0
    for docIndex in testSet:
        wordVecotr = setOfWords2Vec(vocabList, docList[docIndex])
        if classifyNB(array(wordVecotr), p0Vec, p1Vec, pAbuive) != classes[docIndex]:
            errorCount += 1.0
    print("the error rate is: ", float(errorCount)/len(testSet))
    return vocabList, p0Vec, p1Vec

"""将出现较多的单词打印出来"""
def getTopWord(ny, sf):
    import operator
    vocabList, p0Vec, p1Vec = localWords(ny, sf)
    topNY = []
    topSF = []
    for i in range(len(p0Vec)):
        if p0Vec[i] > -6.0:
            """此处需要将vocabList转换为list才能用数字索引, 否则会提示'set' ... index...之类的错误"""
            topSF.append((list(vocabList)[i],p0Vec[i]))
        if p1Vec[i] > -6.0:
            topNY.append((list(vocabList)[i],p1Vec[i]))
    sortedTopNY = sorted(topNY, key=lambda pair: pair[1], reverse=True)
    sortedTopSF = sorted(topSF, key=lambda pair: pair[1], reverse=True)
    print("SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF")
    for word in sortedTopSF:
        print(word[0])
    print("NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY")
    for word in sortedTopNY:
        print(word[0])

To test it, add the following code to test.py:

import feedparser

ny = feedparser.parse('http://newyork.craigslist.org/stp/index.rss')
sf = feedparser.parse('http://sfbay.craigslist.org/stp/index.rss')
bayes.localWords(ny, sf)
bayes.getTopWord(ny, sf)
"""
Output:
the error rate is:  0.35
the error rate is:  0.2
SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF**SF
and
for
the
you
looking
fit
with
hello
love
like
single
time
friends
see
can
latino
will
try
full
student
have
not
experience
would
pareja
active
host
first
from
out
old
some
miss
fun
movie
para
healthy
playas
fan
nudistas
that
massage
bay
female
very
say
older
who
frequent
what
resorts
lugares
yeah
book
sincere
corny
inquiry
comic
mujer
spooking
than
rican
dates
also
burlingame
busca
jump
hang
times
cool
beaches
sloop
originally
fore
hazel
preferably
daddy
craigslist
cheesy
sail
seeks
need
face
drama
fulfilling
parejas
hit
long
message
upon
nice
purposes
doesn
towels
entertainment
straight
clear
releases
myself
favors
share
etc
daysailer
como
range
basis
sexual
providing
around
owner
wide
bbc
body
latinos
come
couple
maybe
comedy
needs
butterflies
attached
enbusca
messing
take
available
employed
tall
white
phone
200lb
big
passionate
guardians
seriosly
asistir
this
randomely
420
east
pretty
nudist
may
dark
man
lighter
her
lunch
weird
only
galaxy
acknowledge
plus
eyes
little
anos
semi
real
dick
mature
theater
does
person
wait
details
huge
talk
business
must
midnight
tummy
reading
thanks
further
hispanic
kisses
woman
simply
movies
just
disease
searching
exchanges
oils
between
interest
free
family
number
latest
slim
email
college
think
admit
puerto
table
short
training
companies
year
true
term
friendship
nothing
self
someone
toner
early
most
open
till
photo
bit
sucked
lol
exist
but
anyways
sometime
too
possess
people
comes
partner
attractive
living
water
friend
amigos
attend
dude
show
europe
nude
honest
making
companion
action
going
married
companionship
your
6ft
when
role
customer
new
please
weekend
assist
their
drink
hate
5th
matters
allow
jus
guys
soon
très
oja
favorite
unless
loyal
sleepless
kobergen
chat
1980
vallejo
minded
friendly
weekends
raised
amp
road
asking
dinner
state
dudes
chilled
life
goes
picture
presentable
race
fingers
hanging
buddy
fell
lady
hard
has
bbw
touched
into
support
know
went
told
species
relaxing
20s
lives
obvious
cute
last
lived
mind
cock
each
lupe
find
connection
requested
side
way
interesting
receiving
hypnotic
other
dominant
whatever
hey
pics
meet
travel
confirmation
wonderful
explore
names
emphasizing
circle
one
talking
cure
years
conversation
sports
director
could
never
wouldn
attempts
cold
want
greatest
per
care
fla
started
adventure
half
ready
welcom
thru
upscale
wanna
dutchs
senior
use
upper
giving
kind
excellent
sleep
pic
father
century
company
resourceful
sounds
dog
anime
chic
dig
preference
previous
cares
shy
41yrs
eastern
late
possibility
soccer
creed
work
bored
smoke
afternoon
crossdresser
forward
amuse
join
include
155
off
lbs
alcoholic
sweet
backround
yourself
trivia
wonderfully
anyone
drinks
bro
numbers
hopefully
underneath
blunts
wondered
type
team
they
about
retired
bigger
wonder
safe
proud
kid
described
necessarily
1973
name
row
contact
actually
gentlemen
though
bring
persion
considered
loves
offer
platonic
mean
intelligent
prefer
interested
any
quality
sex
hearing
slithery
bobaloo
concord
aloud
diseases
goo
personality
twenty
away
general
even
many
where
possible
here
world
feel
light
goddess
definitely
skinned
sit
blond
skinny
look
appreciate
generous
thoroughly
smoking
perverted
found
middle
being
packing
relax
city
professional
something
passion
speak
chill
involved
liquor
bud
drifting
ship
age
boat
waist
were
nights
back
expect
youthful
genuine
sight
gaming
dressing
black
40s
area
average
turn
don
assistance
trace
nails
size
latin
funny
maids
let
loved
gently
asian
give
few
supply
papers
compassionate
grass
cal
small
date
among
sense
younger
supplied
green
mention
dec
getting
hand
soothing
truthfully
fucking
good
swgm
drive
shape
summer
still
iso
truly
submissive
ever
yorkville
musician
had
kids
perhaps
problem
divorced
muscles
haven
night
artist
legs
queens
likes
over
well
them
tonight
set
canada
then
might
cuddling
massages
home
sensual
descent
really
sinc
discrete
personalities
desire
friday
clue
rare
more
56yo
because
info
30s
after
party
better
while
odd
moment
nerd
spanish
pointers
right
trips
stuff
play
normal
dont
was
there
guy
ages
ladies
failed
truck
read
secondary
dank
jetsons
same
saree
our
are
easy
lately
mutt
been
traveling
nerdy
why
shopping
crushing
dining
remember
racial
trip
seen
hip
dress
spoil
meeting
soulmate
write
place
rob
teeth
intoxicating
send
educated
ethnic
ongoing
girl
longterm
build
get
thoroughbred
put
all
italian
enjoy
how
since
least
should
got
def
athlete
fashion
able
make
important
fact
male
driver
karen
enclosed
serve
41trip
inch
catch
NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY**NY
you
and
for
the
looking
with
that
could
there
your
hang
old
any
white
can
but
way
hey
nice
from
out
man
just
someone
are
who
what
when
chat
than
chilled
lady
bbw
cool
meet
years
soccer
smoke
forward
though
interested
time
goo
here
look
generous
relax
professional
something
involved
age
friends
don
hispanic
good
woman
drive
shape
truly
perhaps
try
well
email
then
think
year
more
friendship
after
guy
read
love
have
attractive
not
dude
send
educated
get
all
got
going
male
catch
role
please
assist
drink
hate
5th
matters
allow
guys
très
sincere
unless
loyal
sleepless
kobergen
1980
minded
weekends
amp
asking
state
dudes
life
presentable
race
fingers
hanging
buddy
fell
touched
into
support
species
relaxing
obvious
lived
each
connection
requested
side
originally
interesting
preferably
hypnotic
other
dominant
pics
confirmation
need
explore
emphasizing
long
host
director
upon
first
attempts
doesn
care
adventure
thru
upscale
wanna
senior
use
upper
favors
kind
excellent
sleep
pic
century
some
company
resourceful
sounds
chic
cares
shy
41yrs
late
sexual
creed
work
bored
afternoon
amuse
fun
alcoholic
sweet
backround
yourself
body
drinks
bro
hopefully
underneath
type
they
about
wonder
safe
proud
described
couple
maybe
1973
row
gentlemen
loves
mean
quality
sex
hearing
slithery
aloud
twenty
available
away
possible
tall
phone
light
skinned
sit
blond
appreciate
east
found
may
being
dark
speak
chill
liquor
drifting
ship
boat
were
back
little
youthful
real
black
fit
turn
assistance
trace
size
maids
let
loved
gently
asian
talk
give
business
must
compassionate
grass
sense
younger
supplied
green
mention
getting
hand
soothing
further
truthfully
fucking
simply
swgm
summer
iso
submissive
yorkville
musician
will
divorced
muscles
artist
legs
slim
over
tonight
college
canada
female
sensual
descent
really
sinc
discrete
student
desire
clue
rare
56yo
nothing
party
better
while
odd
very
moment
spanish
right
stuff
play
open
bit
ladies
failed
truck
secondary
dank
jetsons
older
same
easy
mutt
been
why
friend
remember
racial
seen
hip
place
ethnic
ongoing
longterm
build
thoroughbred
put
companion
italian
how
least
should
def
athlete
fashion
make
important
driver
karen
serve
41trip
"""

For reference, the complete bayes.py is simply all of the functions above collected into a single file, and test.py is the test code accumulated along the way.
