A Spelling Corrector: An Application of Naive Bayes

Spelling corrector
This article walks through a small Python spelling corrector built on the naive Bayes principle. In a little over 20 lines of code it corrects common spelling mistakes, and the code is explained in detail.

How to Write a Spelling Corrector

The author (Peter Norvig) wrote this little program during a plane flight; it corrects the spelling of single words, and the underlying idea is naive Bayes.

The code is only a little over 20 lines, yet concise and powerful, and well worth studying. The author also gives a brief introduction to the naive Bayes reasoning behind it.

(The code below is annotated in detail; the comments draw on the write-up by 凉茶方便面.)

import re, collections
 
# Return the list of words in the text
# A regular expression pulls out the runs of letters, lower-cased
def words(text): return re.findall('[a-z]+', text.lower())
 
# Train: build the model
# This just counts how many times each word occurs, which is what we need to estimate word frequencies
# Strictly speaking these are not raw counts: the default count (for words that never appear in big.txt) is 1,
# so every tallied frequency is one higher than the true number of occurrences ("smoothing"),
# which keeps unseen words from getting probability zero
def train(features):
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model
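
# A quick sanity check of the smoothing behaviour (purely illustrative):
# train(['spam', 'spam', 'ham']) yields counts {'spam': 3, 'ham': 2},
# and looking up any word that never appeared returns the default 1.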
# Read the whole text and count the words in it
NWORDS = train(words(open('big.txt').read()))
 
# The lowercase alphabet
alphabet = 'abcdefghijklmnopqrstuvwxyz'
 
# Build the set of all words at edit distance 1 from the given word
def edits1(word):
    # Split the word into every (prefix, suffix) pair, which makes the edits below easy to generate
    # (e.g. 'if' is split into [('', 'if'), ('i', 'f'), ('if', '')])
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Deletions: drop the first character of the suffix b
    # For 'if' this gives ['f', 'i']; only the first two splits qualify, because the third has the empty suffix ''
    deletes    = [a + b[1:] for a, b in splits if b]
    # Transpositions: swap two adjacent letters
    # For 'if' this gives ['fi']; only the first split qualifies, because the suffix must have at least 2 letters
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
    # Replacements: substitute one letter of the suffix b with each letter of the alphabet
    # For 'if' this gives:
    # ['af', 'bf', 'cf', 'df', 'ef', 'ff', 'gf', 'hf', 'if', 'jf', 'kf', 'lf', 'mf',
    #  'nf', 'of', 'pf', 'qf', 'rf', 'sf', 'tf', 'uf', 'vf', 'wf', 'xf', 'yf', 'zf',
    #  'ia', 'ib', 'ic', 'id', 'ie', 'if', 'ig', 'ih', 'ii', 'ij', 'ik', 'il', 'im',
    #  'in', 'io', 'ip', 'iq', 'ir', 'is', 'it', 'iu', 'iv', 'iw', 'ix', 'iy', 'iz']
    replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    # Insertions: insert one letter at each split point
    # For 'if' this gives:
    # ['aif', 'bif', 'cif', 'dif', 'eif', 'fif', 'gif', 'hif', 'iif', 'jif', 'kif',
    #  'lif', 'mif', 'nif', 'oif', 'pif', 'qif', 'rif', 'sif', 'tif', 'uif', 'vif',
    #  'wif', 'xif', 'yif', 'zif', 'iaf', 'ibf', 'icf', 'idf', 'ief', 'iff', 'igf',
    #  'ihf', 'iif', 'ijf', 'ikf', 'ilf', 'imf', 'inf', 'iof', 'ipf', 'iqf', 'irf',
    #  'isf', 'itf', 'iuf', 'ivf', 'iwf', 'ixf', 'iyf', 'izf', 'ifa', 'ifb', 'ifc',
    #  'ifd', 'ife', 'iff', 'ifg', 'ifh', 'ifi', 'ifj', 'ifk', 'ifl', 'ifm', 'ifn',
    #  'ifo', 'ifp', 'ifq', 'ifr', 'ifs', 'ift', 'ifu', 'ifv', 'ifw', 'ifx', 'ify', 'ifz']
    inserts    = [a + c + b     for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)
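# Note: for a word of length n, edits1 produces up to 54n+25 strings
# (n deletes + (n-1) transposes + 26n replaces + 26(n+1) inserts) before set() removes duplicates.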
 
# Build the set of known words at edit distance 2 from the given word
# e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS
#   1. generate every word at edit distance 1 from the given word
#   2. for each of those, generate every word at edit distance 1 again (i.e. edit distance 2 from the original)
#   3. keep only the results that appear in the text, i.e. only real words are accepted
def known_edits2(word):
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

# Of the strings in words, keep only those that occur in the text
# (anything that never appears in big.txt is treated as a non-word and is not a candidate)
def known(words): return set(w for w in words if w in NWORDS)
 
# Pick the candidate word with the highest frequency
def correct(word):
    # This relies on how `or` works in Python: if known([word]) is non-empty (truthy) it is returned;
    # otherwise known(edits1(word)) is evaluated, and if that is non-empty it is returned, and so on.
    # The reasoning:
    # 1. if the word itself is known, it is already spelled correctly, so return it
    # 2. if it is unknown, look at the known words one edit away and pick the most probable of those
    # 3. if there are none one edit away, fall back to the known words two edits away
    # 4. if the word is unknown and nothing within two edits is known either, return the word unchanged
    # `or` in Python: `x or y` means "if x is false, then y, else x";
    #   the second argument is only evaluated when the first one is falsy,
    #   so once a term is truthy the remaining terms are never computed
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    # Of the candidates, return the one with the largest count in the NWORDS dictionary
    return max(candidates, key=NWORDS.get)
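
With big.txt (Norvig's corpus of roughly a million words of running text) in the working directory, a quick check of the annotated version might look like this; the first two expected results are the ones reported in Norvig's write-up, the third just illustrates the fallback case, and the actual output depends on the corpus:

print(correct('speling'))    # 'spelling' (one edit away)
print(correct('korrectud'))  # 'corrected' (two edits away)
print(correct('xyzzyabc'))   # nothing known within two edits, so it should come back unchanged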

Below is the original code from the official website (the comments there are a bit awkward):

import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
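
The official version exposes the unigram language model directly through P, which makes it easy to inspect the pieces one at a time. A small sketch follows; the numbers are approximate and depend on big.txt, so treat them as illustrative:

print(P('the'))                       # roughly 0.07, since 'the' makes up about 7% of the corpus
print(correction('speling'))          # 'spelling' in Norvig's write-up
print(sorted(known(edits1('thew'))))  # known words one edit from 'thew', e.g. 'the', 'thaw', 'threw'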

How It Works: Some Probability Theory

The call correction(w) tries to choose the most likely spelling correction for w. There is no way to know for sure (for example, should "lates" be corrected to "late" or "latest" or "lattes" or ...?), which suggests we use probabilities. We are trying to find the correction c, out of all possible candidate corrections, that maximizes the probability that c is the intended correction, given the original word w:

argmax_{c ∈ candidates} P(c|w)

By Bayes' Theorem this is equivalent to:

argmax_{c ∈ candidates} P(c) P(w|c) / P(w)

Since P(w) is the same for every possible candidate c, we can factor it out, giving:

argmax_{c ∈ candidates} P(c) P(w|c)

The four parts of this expression are:

  1. Selection Mechanism: argmax
     We choose the candidate with the highest combined probability.

  2. Candidate Model: c ∈ candidates
     This tells us which candidate corrections, c, to consider.

  3. Language Model: P(c)
     The probability that c appears as a word of English text. For example, occurrences of "the" make up about 7% of English text, so we should have P(the) = 0.07.

  4. Error Model: P(w|c)
     The probability that w would be typed in a text when the author meant c. For example, P(teh|the) is relatively high, but P(theeexyz|the) would be very low.

One obvious question is: why take a simple expression like P(c|w) and replace it with a more complex expression involving two models rather than one? The answer is that P(c|w) is already conflating two factors, and it is easier to separate the two out and deal with them explicitly. Consider the misspelled word w="thew" and the two candidate corrections c="the" and c="thaw". Which has a higher P(c|w)? Well, "thaw" seems good because the only change is "a" to "e", which is a small change. On the other hand, "the" seems good because "the" is a very common word, and while adding a "w" seems like a larger, less probable change, perhaps the typist's finger slipped off the "e". The point is that to estimate P(c|w) we have to consider both the probability of c and the probability of the change from c to w anyway, so it is cleaner to formally separate the two factors.
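
To make that trade-off concrete, here is a toy calculation with made-up numbers; only P(the) ≈ 0.07 comes from the text above, and the rest are invented purely for illustration. (The program itself never estimates P(w|c) numerically: its error model simply prefers the word itself, then known words one edit away, then two edits away.)

# Hypothetical figures, purely for illustration; not measured from big.txt.
P_c      = {'the': 0.07,   'thaw': 0.00001}   # language model P(c)
P_w_of_c = {'the': 0.0001, 'thaw': 0.001}     # error model P(w='thew' | c)

scores = {c: P_c[c] * P_w_of_c[c] for c in P_c}
print(scores)                        # roughly {'the': 7e-06, 'thaw': 1e-08}
print(max(scores, key=scores.get))   # 'the': the commoner word wins despite the less likely typo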

 


How to test it:

tests1 = { 'access': 'acess', 'accessing': 'accesing', 'accommodation':
'accomodation acommodation acomodation', 'account': 'acount', 'address':
'adress adres', 'addressable': 'addresable', 'arranged': 'aranged arrainged',
'arrangeing': 'aranging', 'arrangement': 'arragment', 'articles': 'articals',
'aunt': 'annt anut arnt', 'auxiliary': 'auxillary', 'available': 'avaible',
'awful': 'awfall afful', 'basically': 'basicaly', 'beginning': 'begining',
'benefit': 'benifit', 'benefits': 'benifits', 'between': 'beetween', 'bicycle':
'bicycal bycicle bycycle', 'biscuits': 
'biscits biscutes biscuts bisquits buiscits buiscuts', 'built': 'biult', 
'cake': 'cak', 'career': 'carrer',
'cemetery': 'cemetary semetary', 'centrally': 'centraly', 'certain': 'cirtain',
'challenges': 'chalenges chalenges', 'chapter': 'chaper chaphter chaptur',
'choice': 'choise', 'choosing': 'chosing', 'clerical': 'clearical',
'committee': 'comittee', 'compare': 'compair', 'completely': 'completly',
'consider': 'concider', 'considerable': 'conciderable', 'contented':
'contenpted contende contended contentid', 'curtains': 
'cartains certans courtens cuaritains curtans curtians curtions', 'decide': 'descide', 'decided':
'descided', 'definitely': 'definately difinately', 'definition': 'defenition',
'definitions': 'defenitions', 'description': 'discription', 'desiccate':
'desicate dessicate dessiccate', 'diagrammatically': 'diagrammaticaally',
'different': 'diffrent', 'driven': 'dirven', 'ecstasy': 'exstacy ecstacy',
'embarrass': 'embaras embarass', 'establishing': 'astablishing establising',
'experience': 'experance experiance', 'experiences': 'experances', 'extended':
'extented', 'extremely': 'extreamly', 'fails': 'failes', 'families': 'familes',
'february': 'febuary', 'further': 'futher', 'gallery': 'galery gallary gallerry gallrey', 
'hierarchal': 'hierachial', 'hierarchy': 'hierchy', 'inconvenient':
'inconvienient inconvient inconvinient', 'independent': 'independant independant',
'initial': 'intial', 'initials': 'inetials inistals initails initals intials',
'juice': 'guic juce jucie juise juse', 'latest': 'lates latets latiest latist', 
'laugh': 'lagh lauf laught lugh', 'level': 'leval',
'levels': 'levals', 'liaison': 'liaision liason', 'lieu': 'liew', 'literature':
'litriture', 'loans': 'lones', 'locally': 'localy', 'magnificent': 
'magnificnet magificent magnifcent magnifecent magnifiscant magnifisent magnificant',
'management': 'managment', 'meant': 'ment', 'minuscule': 'miniscule',
'minutes': 'muinets', 'monitoring': 'monitering', 'necessary': 
'neccesary necesary neccesary necassary necassery neccasary', 'occurrence':
'occurence occurence', 'often': 'ofen offen offten ofton', 'opposite': 
'opisite oppasite oppesite oppisit oppisite opposit oppossite oppossitte', 'parallel': 
'paralel paralell parrallel parralell parrallell', 'particular': 'particulaur',
'perhaps': 'perhapse', 'personnel': 'personnell', 'planned': 'planed', 'poem':
'poame', 'poems': 'poims pomes', 'poetry': 'poartry poertry poetre poety powetry', 
'position': 'possition', 'possible': 'possable', 'pretend': 
'pertend protend prtend pritend', 'problem': 'problam proble promblem proplen',
'pronunciation': 'pronounciation', 'purple': 'perple perpul poarple',
'questionnaire': 'questionaire', 'really': 'realy relley relly', 'receipt':
'receit receite reciet recipt', 'receive': 'recieve', 'refreshment':
'reafreshment refreshmant refresment refressmunt', 'remember': 'rember remeber rememmer rermember',
'remind': 'remine remined', 'scarcely': 'scarcly scarecly scarely scarsely', 
'scissors': 'scisors sissors', 'separate': 'seperate',
'singular': 'singulaur', 'someone': 'somone', 'sources': 'sorces', 'southern':
'southen', 'special': 'speaical specail specal speical', 'splendid': 
'spledid splended splened splended', 'standardizing': 'stanerdizing', 'stomach': 
'stomac stomache stomec stumache', 'supersede': 'supercede superceed', 'there': 'ther',
'totally': 'totaly', 'transferred': 'transfred', 'transportability':
'transportibility', 'triangular': 'triangulaur', 'understand': 'undersand undistand', 
'unexpected': 'unexpcted unexpeted unexspected', 'unfortunately':
'unfortunatly', 'unique': 'uneque', 'useful': 'usefull', 'valuable': 'valubale valuble', 
'variable': 'varable', 'variant': 'vairiant', 'various': 'vairious',
'visited': 'fisited viseted vistid vistied', 'visitors': 'vistors',
'voluntary': 'volantry', 'voting': 'voteing', 'wanted': 'wantid wonted',
'whether': 'wether', 'wrote': 'rote wote'}
 
tests2 = {'forbidden': 'forbiden', 'decisions': 'deciscions descisions',
'supposedly': 'supposidly', 'embellishing': 'embelishing', 'technique':
'tecnique', 'permanently': 'perminantly', 'confirmation': 'confermation',
'appointment': 'appoitment', 'progression': 'progresion', 'accompanying':
'acompaning', 'applicable': 'aplicable', 'regained': 'regined', 'guidelines':
'guidlines', 'surrounding': 'serounding', 'titles': 'tittles', 'unavailable':
'unavailble', 'advantageous': 'advantageos', 'brief': 'brif', 'appeal':
'apeal', 'consisting': 'consisiting', 'clerk': 'cleark clerck', 'component':
'componant', 'favourable': 'faverable', 'separation': 'seperation', 'search':
'serch', 'receive': 'recieve', 'employees': 'emploies', 'prior': 'piror',
'resulting': 'reulting', 'suggestion': 'sugestion', 'opinion': 'oppinion',
'cancellation': 'cancelation', 'criticism': 'citisum', 'useful': 'usful',
'humour': 'humor', 'anomalies': 'anomolies', 'would': 'whould', 'doubt':
'doupt', 'examination': 'eximination', 'therefore': 'therefoe', 'recommend':
'recomend', 'separated': 'seperated', 'successful': 'sucssuful succesful',
'apparent': 'apparant', 'occurred': 'occureed', 'particular': 'paerticulaur',
'pivoting': 'pivting', 'announcing': 'anouncing', 'challenge': 'chalange',
'arrangements': 'araingements', 'proportions': 'proprtions', 'organized':
'oranised', 'accept': 'acept', 'dependence': 'dependance', 'unequalled':
'unequaled', 'numbers': 'numbuers', 'sense': 'sence', 'conversely':
'conversly', 'provide': 'provid', 'arrangement': 'arrangment',
'responsibilities': 'responsiblities', 'fourth': 'forth', 'ordinary':
'ordenary', 'description': 'desription descvription desacription',
'inconceivable': 'inconcievable', 'data': 'dsata', 'register': 'rgister',
'supervision': 'supervison', 'encompassing': 'encompasing', 'negligible':
'negligable', 'allow': 'alow', 'operations': 'operatins', 'executed':
'executted', 'interpretation': 'interpritation', 'hierarchy': 'heiarky',
'indeed': 'indead', 'years': 'yesars', 'through': 'throut', 'committee':
'committe', 'inquiries': 'equiries', 'before': 'befor', 'continued':
'contuned', 'permanent': 'perminant', 'choose': 'chose', 'virtually':
'vertually', 'correspondence': 'correspondance', 'eventually': 'eventully',
'lonely': 'lonley', 'profession': 'preffeson', 'they': 'thay', 'now': 'noe',
'desperately': 'despratly', 'university': 'unversity', 'adjournment':
'adjurnment', 'possibilities': 'possablities', 'stopped': 'stoped', 'mean':
'meen', 'weighted': 'wagted', 'adequately': 'adequattly', 'shown': 'hown',
'matrix': 'matriiix', 'profit': 'proffit', 'encourage': 'encorage', 'collate':
'colate', 'disaggregate': 'disaggreagte disaggreaget', 'receiving':
'recieving reciving', 'proviso': 'provisoe', 'umbrella': 'umberalla', 'approached':
'aproached', 'pleasant': 'plesent', 'difficulty': 'dificulty', 'appointments':
'apointments', 'base': 'basse', 'conditioning': 'conditining', 'earliest':
'earlyest', 'beginning': 'begining', 'universally': 'universaly',
'unresolved': 'unresloved', 'length': 'lengh', 'exponentially':
'exponentualy', 'utilized': 'utalised', 'set': 'et', 'surveys': 'servays',
'families': 'familys', 'system': 'sysem', 'approximately': 'aproximatly',
'their': 'ther', 'scheme': 'scheem', 'speaking': 'speeking', 'repetitive':
'repetative', 'inefficient': 'ineffiect', 'geneva': 'geniva', 'exactly':
'exsactly', 'immediate': 'imediate', 'appreciation': 'apreciation', 'luckily':
'luckeley', 'eliminated': 'elimiated', 'believe': 'belive', 'appreciated':
'apreciated', 'readjusted': 'reajusted', 'were': 'wer where', 'feeling':
'fealing', 'and': 'anf', 'false': 'faulse', 'seen': 'seeen', 'interrogating':
'interogationg', 'academically': 'academicly', 'relatively': 'relativly relitivly',
'traditionally': 'traditionaly', 'studying': 'studing',
'majority': 'majorty', 'build': 'biuld', 'aggravating': 'agravating',
'transactions': 'trasactions', 'arguing': 'aurguing', 'sheets': 'sheertes',
'successive': 'sucsesive sucessive', 'segment': 'segemnt', 'especially':
'especaily', 'later': 'latter', 'senior': 'sienior', 'dragged': 'draged',
'atmosphere': 'atmospher', 'drastically': 'drasticaly', 'particularly':
'particulary', 'visitor': 'vistor', 'session': 'sesion', 'continually':
'contually', 'availability': 'avaiblity', 'busy': 'buisy', 'parameters':
'perametres', 'surroundings': 'suroundings seroundings', 'employed':
'emploied', 'adequate': 'adiquate', 'handle': 'handel', 'means': 'meens',
'familiar': 'familer', 'between': 'beeteen', 'overall': 'overal', 'timing':
'timeing', 'committees': 'comittees commitees', 'queries': 'quies',
'econometric': 'economtric', 'erroneous': 'errounous', 'decides': 'descides',
'reference': 'refereence refference', 'intelligence': 'inteligence',
'edition': 'ediion ediition', 'are': 'arte', 'apologies': 'appologies',
'thermawear': 'thermawere thermawhere', 'techniques': 'tecniques',
'voluntary': 'volantary', 'subsequent': 'subsequant subsiquent', 'currently':
'curruntly', 'forecast': 'forcast', 'weapons': 'wepons', 'routine': 'rouint',
'neither': 'niether', 'approach': 'aproach', 'available': 'availble',
'recently': 'reciently', 'ability': 'ablity', 'nature': 'natior',
'commercial': 'comersial', 'agencies': 'agences', 'however': 'howeverr',
'suggested': 'sugested', 'career': 'carear', 'many': 'mony', 'annual':
'anual', 'according': 'acording', 'receives': 'recives recieves',
'interesting': 'intresting', 'expense': 'expence', 'relevant':
'relavent relevaant', 'table': 'tasble', 'throughout': 'throuout', 'conference':
'conferance', 'sensible': 'sensable', 'described': 'discribed describd',
'union': 'unioun', 'interest': 'intrest', 'flexible': 'flexable', 'refered':
'reffered', 'controlled': 'controled', 'sufficient': 'suficient',
'dissension': 'desention', 'adaptable': 'adabtable', 'representative':
'representitive', 'irrelevant': 'irrelavent', 'unnecessarily': 'unessasarily',
'applied': 'upplied', 'apologised': 'appologised', 'these': 'thees thess',
'choices': 'choises', 'will': 'wil', 'procedure': 'proceduer', 'shortened':
'shortend', 'manually': 'manualy', 'disappointing': 'dissapoiting',
'excessively': 'exessively', 'comments': 'coments', 'containing': 'containg',
'develop': 'develope', 'credit': 'creadit', 'government': 'goverment',
'acquaintances': 'aquantences', 'orientated': 'orentated', 'widely': 'widly',
'advise': 'advice', 'difficult': 'dificult', 'investigated': 'investegated',
'bonus': 'bonas', 'conceived': 'concieved', 'nationally': 'nationaly',
'compared': 'comppared compased', 'moving': 'moveing', 'necessity':
'nessesity', 'opportunity': 'oppertunity oppotunity opperttunity', 'thoughts':
'thorts', 'equalled': 'equaled', 'variety': 'variatry', 'analysis':
'analiss analsis analisis', 'patterns': 'pattarns', 'qualities': 'quaties', 'easily':
'easyly', 'organization': 'oranisation oragnisation', 'the': 'thw hte thi',
'corporate': 'corparate', 'composed': 'compossed', 'enormously': 'enomosly',
'financially': 'financialy', 'functionally': 'functionaly', 'discipline':
'disiplin', 'announcement': 'anouncement', 'progresses': 'progressess',
'except': 'excxept', 'recommending': 'recomending', 'mathematically':
'mathematicaly', 'source': 'sorce', 'combine': 'comibine', 'input': 'inut',
'careers': 'currers carrers', 'resolved': 'resoved', 'demands': 'diemands',
'unequivocally': 'unequivocaly', 'suffering': 'suufering', 'immediately':
'imidatly imediatly', 'accepted': 'acepted', 'projects': 'projeccts',
'necessary': 'necasery nessasary nessisary neccassary', 'journalism':
'journaism', 'unnecessary': 'unessessay', 'night': 'nite', 'output':
'oputput', 'security': 'seurity', 'essential': 'esential', 'beneficial':
'benificial benficial', 'explaining': 'explaning', 'supplementary':
'suplementary', 'questionnaire': 'questionare', 'employment': 'empolyment',
'proceeding': 'proceding', 'decision': 'descisions descision', 'per': 'pere',
'discretion': 'discresion', 'reaching': 'reching', 'analysed': 'analised',
'expansion': 'expanion', 'although': 'athough', 'subtract': 'subtrcat',
'analysing': 'aalysing', 'comparison': 'comparrison', 'months': 'monthes',
'hierarchal': 'hierachial', 'misleading': 'missleading', 'commit': 'comit',
'auguments': 'aurgument', 'within': 'withing', 'obtaining': 'optaning',
'accounts': 'acounts', 'primarily': 'pimarily', 'operator': 'opertor',
'accumulated': 'acumulated', 'extremely': 'extreemly', 'there': 'thear',
'summarys': 'sumarys', 'analyse': 'analiss', 'understandable':
'understadable', 'safeguard': 'safegaurd', 'consist': 'consisit',
'declarations': 'declaratrions', 'minutes': 'muinutes muiuets', 'associated':
'assosiated', 'accessibility': 'accessability', 'examine': 'examin',
'surveying': 'servaying', 'politics': 'polatics', 'annoying': 'anoying',
'again': 'agiin', 'assessing': 'accesing', 'ideally': 'idealy', 'scrutinized':
'scrutiniesed', 'simular': 'similar', 'personnel': 'personel', 'whereas':
'wheras', 'when': 'whn', 'geographically': 'goegraphicaly', 'gaining':
'ganing', 'requested': 'rquested', 'separate': 'seporate', 'students':
'studens', 'prepared': 'prepaired', 'generated': 'generataed', 'graphically':
'graphicaly', 'suited': 'suted', 'variable': 'varible vaiable', 'building':
'biulding', 'required': 'reequired', 'necessitates': 'nessisitates',
'together': 'togehter', 'profits': 'proffits'}
 
# Run the corrector over a dict of {correct_word: 'misspelling1 misspelling2 ...'} pairs
# and report how many were fixed correctly.
def spelltest(tests, bias=None, verbose=False):
    import time
    n, bad, unknown, start = 0, 0, 0, time.perf_counter()
    if bias:
        for target in tests: NWORDS[target] += bias
    for target, wrongs in tests.items():
        for wrong in wrongs.split():
            n += 1
            w = correct(wrong)
            if w != target:
                bad += 1
                unknown += (target not in NWORDS)
                if verbose:
                    print('%r => %r (%d); expected %r (%d)' % (
                        wrong, w, NWORDS[w], target, NWORDS[target]))
    return dict(bad=bad, n=n, bias=bias, pct=int(100. - 100.*bad/n),
                unknown=unknown, secs=int(time.perf_counter()-start))

if __name__ == '__main__':
    print(spelltest(tests1))
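
In Norvig's write-up, tests1 serves as the development set and tests2 as a held-out final test set. A minimal way to evaluate both, once development is finished (the accuracy figures you get depend on big.txt and on the regular expression used in words()):

print(spelltest(tests1))                  # development set
print(spelltest(tests2))                  # final test set
# print(spelltest(tests1, verbose=True))  # list each miss for error analysis

The returned dict reports n (number of test cases), bad (misses), pct (percent corrected), unknown (misses whose target never occurs in big.txt), and secs (elapsed time).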

 

