Word Segmentation with the FMM Algorithm


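This article introduces the basic principle of the FMM (Forward Maximum Matching) word segmentation algorithm and walks through a Python implementation with concrete code. It first defines a preprocessing step that removes punctuation from the sentence, then shows how words are matched by scanning forward from the current position, starting from a maximum word length. Finally, it runs the algorithm on a sample sentence and on a text file.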

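The simplest form of FMM is greedy. Starting at the current position, take the next n characters, where n is the maximum word length (maxwordLength, 4 by default in the code). If those n characters form a word that appears in the dictionary, accept it. If not, try the first n-1 characters, and so on down to a single character, which is kept as a token of its own when nothing longer matches. Once a word is accepted, matching resumes at the character right after it and continues until the end of the sentence.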
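The greedy procedure described above can be sketched in a few lines. This is a minimal Python 3 illustration, not the article's implementation: the function name `fmm_segment` and the sample dictionary are made up for the example, and the input is assumed to be a plain Unicode string with no punctuation or ASCII handling.

```python
def fmm_segment(text, dictionary, max_word_length=4):
    """Greedy forward maximum matching over a Unicode string."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink by one character.
        longest = min(max_word_length, len(text) - i)
        for n in range(longest, 0, -1):
            candidate = text[i:i + n]
            # Accept a dictionary word, or fall back to a single character.
            if candidate in dictionary or n == 1:
                tokens.append(candidate)
                i += n
                break
    return tokens


# Hypothetical dictionary, chosen to show the greedy behaviour.
words = {"研究", "研究生", "生命", "起源"}
print(fmm_segment("研究生命起源", words))
# ['研究生', '命', '起源']
```

The example also shows the known weakness of the greedy choice: the longest match 研究生 is taken first, so 生命 can no longer be found and 命 is left as a single-character token.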

Tags: Python

Code snippets (3)

[Code][Python] Code

# -*- coding: gbk -*-
# Python 2 code. Assumption: this file is saved as GBK, which is what the
# ConvertGBKtoUTF calls on string literals in the test snippet imply.
import re


def PreProcess(sentence, edcode="utf-8"):
    # Decode the byte string to unicode, then replace punctuation with spaces.
    sentence = sentence.decode(edcode)
    sentence = re.sub(u"[。，,！……!《》<>\"':：？\?、\|“”‘’；]", " ", sentence)
    return sentence


def FMM(sentence, diction, result=None, maxwordLength=4, edcode="utf-8"):
    # Use a fresh list per call; a list as the default value would be shared
    # and would accumulate tokens across calls.
    if result is None:
        result = []
    i = 0
    sentence = PreProcess(sentence, edcode)
    length = len(sentence)
    while i < length:
        # Find an ASCII word: a run of letters, digits and - + # @ _ .
        tempi = i
        tok = sentence[i:i + 1]
        while re.search(r"[0-9A-Za-z\-\+#@_\.]{1}", tok) is not None:
            i = i + 1
            tok = sentence[i:i + 1]
        if i - tempi > 0:
            result.append(sentence[tempi:i].lower().encode(edcode))
        # Find a Chinese word.
        left = len(sentence[i:])
        if left == 1:
            # Only one character remains, so matching is over.
            # Add it unless it is blank.
            if sentence[i:] != " ":
                result.append(sentence[i:].encode(edcode))
            return result
        m = min(left, maxwordLength)
        # Try the longest candidate first, then shorter ones.
        for j in xrange(m, 0, -1):
            leftword = sentence[i:j + i].encode(edcode)
            if LookUp(leftword, diction):
                # The candidate is in the dictionary: accept it.
                i = j + i
                result.append(leftword)
                break
            elif j == 1:
                # Nothing matched at any length: keep the single character
                # unless it is blank.
                if leftword.decode(edcode) != " ":
                    result.append(leftword)
                i = i + 1
    return result


def LookUp(word, dictionary):
    # Dictionary keys are byte strings in the same encoding as edcode.
    return word in dictionary


def ConvertGBKtoUTF(sentence):
    # Re-encode a GBK byte string as UTF-8.
    return sentence.decode('gbk').encode('utf-8')

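[Code][Python] Code

# Build a small test dictionary; only the keys matter, the values are unused.
dictions = {}
dictions["ab"] = 1
dictions["cd"] = 2
dictions["abc"] = 1
dictions["ss"] = 1
# Python 2: the non-ASCII literals below are byte strings in the source
# file's encoding (GBK is assumed), so convert them to UTF-8 keys.
dictions[ConvertGBKtoUTF("好的")] = 1
dictions[ConvertGBKtoUTF("真的")] = 1

sentence = "asdfa好的是这样吗vasdiw呀真的daf dasfiw asid是吗？"
s = FMM(ConvertGBKtoUTF(sentence), dictions)
# Print one token per line.
for i in s:
    print(i.decode("utf-8"))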
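To make the expected result of the test above concrete, here is a Python 3 sketch that follows the same steps on the same sentence: punctuation is replaced with spaces, each ASCII run becomes one lowercase token, and everything else goes through longest-match lookup. The names `fmm_mixed`, `PUNCTUATION` and `ASCII_RUN` are hypothetical. Because Python 3 strings are already Unicode, the encode/decode steps and `ConvertGBKtoUTF` are left out, so this is an approximation of the article's code rather than a line-by-line port.

```python
import re

# Punctuation to blank out and the characters that form an ASCII word.
PUNCTUATION = re.compile("[。，,！…!《》<>\"':：？?、|“”‘’；]")
ASCII_RUN = re.compile(r"[0-9A-Za-z\-+#@_.]+")


def fmm_mixed(sentence, dictionary, max_word_length=4):
    text = PUNCTUATION.sub(" ", sentence)
    tokens = []
    i = 0
    while i < len(text):
        # An ASCII run starting at position i becomes one lowercase token.
        run = ASCII_RUN.match(text, i)
        if run:
            tokens.append(run.group().lower())
            i = run.end()
            continue
        # Otherwise take the longest dictionary word, or a single character.
        for n in range(min(max_word_length, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in dictionary or n == 1:
                if not candidate.isspace():  # drop blanks left by punctuation
                    tokens.append(candidate)
                i += n
                break
    return tokens


dictions = {"ab", "cd", "abc", "ss", "好的", "真的"}
sentence = "asdfa好的是这样吗vasdiw呀真的daf dasfiw asid是吗？"
print(fmm_mixed(sentence, dictions))
# ['asdfa', '好的', '是', '这', '样', '吗', 'vasdiw', '呀', '真的',
#  'daf', 'dasfiw', 'asid', '是', '吗']
```

Only 好的 and 真的 are found as dictionary words. The ASCII entries such as "ab" never match here, because each ASCII run is consumed whole before any dictionary lookup takes place.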

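[Code][Python] Code

# Segment a GBK-encoded text file line by line.
test = open("test.txt", "r")
for line in test:
    # Strip the line ending and pass a fresh list, so that tokens from
    # earlier lines are not carried over into this line's result.
    s = FMM(ConvertGBKtoUTF(line.rstrip("\r\n")), dictions, [])
    for i in s:
        print(i.decode("utf-8"))
test.close()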
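A loop like the one above calls the segmenter once per line, so each call should start from an empty result list. In Python, a list written as a default argument value is created once, when the function is defined, and is shared by every call that omits that argument. The sketch below uses two hypothetical helper functions (not part of the article's code) to show the difference.

```python
def collect_shared(token, result=[]):
    # The default list is created once and reused by every call.
    result.append(token)
    return result


def collect_fresh(token, result=None):
    # A new list is created on each call unless the caller passes one.
    if result is None:
        result = []
    result.append(token)
    return result


print(collect_shared("好的"))  # ['好的']
print(collect_shared("真的"))  # ['好的', '真的']
print(collect_fresh("好的"))   # ['好的']
print(collect_fresh("真的"))   # ['真的']
```

With the shared default, tokens from one call leak into the next. Using `None` as the default, or passing an explicit empty list, keeps each result separate.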
