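The simplest idea behind the FMM algorithm is a greedy one: starting at the current position, look ahead n characters. If those n characters form a word that appears in the dictionary, accept it; if not, try n-1 characters, and so on. Once a match is found (say all n characters form a dictionary word), continue from position n+1, and repeat until the end of the sentence. In the code below, a single character that matches nothing is kept as a token on its own.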
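The greedy loop described above can be sketched in a few lines. This is a minimal illustration rather than the implementation that follows: it assumes Python 3 (where strings are already Unicode), a `set` as the dictionary, and a hypothetical function name `fmm_sketch`; it does no punctuation or ASCII handling.

```python
def fmm_sketch(text, dictionary, max_word_length=4):
    """Greedy forward maximum matching over a Unicode string (illustrative only)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        longest = min(max_word_length, len(text) - i)
        for n in range(longest, 0, -1):
            candidate = text[i:i + n]
            # Accept a dictionary word, or fall back to a single character.
            if candidate in dictionary or n == 1:
                tokens.append(candidate)
                i += n
                break
    return tokens


print(fmm_sketch("abcd", {"ab", "abc", "cd"}))  # ['abc', 'd']
```

The example also shows the cost of the greedy choice: "abc" is taken because it is the longest match, so "cd" can no longer be found and "d" is left as a single character.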
Code snippets (3)
[Code][Python]
# Written for Python 2, where str is a byte string: the input is decoded to
# unicode for matching and every token is encoded back to bytes.
import re


def PreProcess(sentence, edcode="utf-8"):
    # Decode the byte string and replace punctuation with spaces.
    sentence = sentence.decode(edcode)
    sentence = re.sub(u"[。,,!……!《》<>\"'::?\?、\|“”‘’;]", " ", sentence)
    return sentence


def FMM(sentence, diction, result=None, maxwordLength=4, edcode="utf-8"):
    # Use a fresh list per call; a list default would be shared across calls.
    if result is None:
        result = []
    i = 0
    sentence = PreProcess(sentence, edcode)
    length = len(sentence)
    while i < length:
        # Collect a run of ASCII word characters as a single token.
        tempi = i
        tok = sentence[i:i + 1]
        while re.search(r"[0-9A-Za-z\-\+#@_\.]", tok) is not None:
            i = i + 1
            tok = sentence[i:i + 1]
        if i - tempi > 0:
            result.append(sentence[tempi:i].lower().encode(edcode))
        # Match a Chinese word starting at position i.
        left = len(sentence[i:])
        if left == 1:
            # Only one character remains: keep it unless it is a blank, then stop.
            if sentence[i:] != " ":
                result.append(sentence[i:].encode(edcode))
            return result
        m = min(left, maxwordLength)

        # Try the longest candidate first and shrink it one character at a time.
        for j in range(m, 0, -1):
            leftword = sentence[i:j + i].encode(edcode)
            if LookUp(leftword, diction):
                # The candidate is in the dictionary: accept it and move on.
                i = j + i
                result.append(leftword)
                break
            elif j == 1:
                # Nothing matched: keep the single character unless it is a blank.
                if leftword.decode(edcode) != " ":
                    result.append(leftword)
                i = i + 1
    return result


def LookUp(word, dictionary):
    # True if the word is a key of the dictionary.
    return word in dictionary


def ConvertGBKtoUTF(sentence):
    # Re-encode a GBK byte string as UTF-8.
    return sentence.decode('gbk').encode('utf-8')
[Code][Python]
# Build a small sample dictionary. The Chinese literals are converted from
# GBK to UTF-8 so that they match the default edcode used by FMM.
dictions = {}
dictions["ab"] = 1
dictions["cd"] = 2
dictions["abc"] = 1
dictions["ss"] = 1
dictions[ConvertGBKtoUTF("好的")] = 1
dictions[ConvertGBKtoUTF("真的")] = 1
sentence = "asdfa好的是这样吗vasdiw呀真的daf dasfiw asid是吗?"
s = FMM(ConvertGBKtoUTF(sentence), dictions)
# Print one token per line.
for i in s:
    print(i.decode("utf-8"))
[Code][Python]
# Segment every line of test.txt, which is expected to be GBK-encoded.
with open("test.txt", "r") as test:
    for line in test:
        s = FMM(ConvertGBKtoUTF(line), dictions)
        for i in s:
            print(i.decode("utf-8"))
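Because the loop calls `FMM` once per line without passing `result`, it matters how that parameter's default is defined. In Python, a list literal used as a default value is created once, when the function is defined, and is shared by every call, so tokens from earlier lines would carry over into the results for later lines. A minimal sketch of the difference, using two hypothetical helper functions that are not part of the original code:

```python
def collect_shared(token, result=[]):
    # The default list is created once, when the function is defined.
    result.append(token)
    return result


def collect_fresh(token, result=None):
    # A None default lets every call start from a new list.
    if result is None:
        result = []
    result.append(token)
    return result


# Copy each return value so later calls cannot change what was captured.
first_shared = list(collect_shared("a"))
second_shared = list(collect_shared("b"))
print(first_shared, second_shared)                # ['a'] ['a', 'b']
print(collect_fresh("a"), collect_fresh("b"))     # ['a'] ['b']
```

Defaulting to `None` and creating the list inside the function is the usual way to give each call its own result; passing an explicit `result=[]` at each call site has the same effect.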
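This article introduces the basic principle of the FMM (Forward Maximum Matching) word segmentation algorithm and one way to implement it in Python, with concrete code. It first defines a preprocessing step that replaces the punctuation in a sentence with spaces, then shows how candidate words are matched forward, starting from the maximum word length and shrinking. Finally, the algorithm is run on a sample sentence and on the lines of a test file.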



