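Goal
Given a string and its associated vocabulary, implement the forward maximum matching algorithm to segment the string into words.
String to segment
# Text to segment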
sentence = "经常有意见分歧"
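Vocabulary
# Dictionary; the value stored with each word is its frequency. The frequencies
# are only examples, are not used by the algorithm, and may be changed freely.
Dict = {"经常": 0.1,
        "经": 0.05,
        "有": 0.1,
        "常": 0.001,
        "有意见": 0.1,
        "歧": 0.001,
        "意见": 0.2,
        "分歧": 0.2,
        "见": 0.05,
        "意": 0.05,
        "见分歧": 0.05,
        "分": 0.1}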
Expected segmentation
# Target output
target = ['经常', '有意见', '分歧']
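The target contains "有意见" rather than "有" followed by "意见" because forward maximum matching always takes the longest dictionary word that starts at the current position. A minimal sketch of that single greedy step follows; `longest_prefix` and the reduced `vocab` set are hypothetical names used only for illustration.

```python
# Hypothetical helper: return the longest dictionary word that is a prefix of text.
def longest_prefix(text, vocab, max_len):
    # Try the widest window first, then shrink it one character at a time.
    for size in range(min(max_len, len(text)), 0, -1):
        if text[:size] in vocab:
            return text[:size]
    return None  # no dictionary word starts at this position

# Reduced vocabulary, just enough to show the greedy choice.
vocab = {"有", "有意见", "意见", "意", "见"}
print(longest_prefix("有意见分歧", vocab, max_len=3))  # 有意见
```

Both "有" and "有意见" are valid prefixes here, but the three-character candidate is tested first and wins.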
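Implementation
# Forward maximum matching
def forward_match(sentence, Dict):
    # Only the keys are needed; a set gives constant-time membership tests.
    vocab = set(Dict)
    # The longest dictionary word bounds the size of the matching window.
    max_length = max((len(word) for word in vocab), default=0)
    result = []
    while sentence:
        # Try the longest candidate first, then shrink it one character at a time.
        for j in range(min(max_length, len(sentence)), 0, -1):
            word = sentence[:j]
            if word in vocab:
                break
        else:
            # No dictionary word starts here: emit a single character so the
            # loop always advances instead of spinning forever.
            word = sentence[:1]
        result.append(word)
        sentence = sentence[len(word):]
    return result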
Test
if __name__ == "__main__":
    match = forward_match(sentence, Dict)
    print(match)
Output:
['经常', '有意见', '分歧']
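To see how this result comes about, it can help to record which candidates are tried at each position. The following is a self-contained sketch, not the implementation above: `forward_match_trace` is a hypothetical generator, it takes the vocabulary as a plain set, and it assumes that a character with no dictionary match is emitted as a one-character token.

```python
def forward_match_trace(sentence, vocab):
    """Yield (candidates_tried, chosen_word) for each greedy step."""
    max_len = max((len(w) for w in vocab), default=0)
    pos = 0
    while pos < len(sentence):
        tried = []
        chosen = sentence[pos]  # assumed fallback: a single character
        # Shrink the window from the longest possible candidate down to one character.
        for size in range(min(max_len, len(sentence) - pos), 0, -1):
            candidate = sentence[pos:pos + size]
            tried.append(candidate)
            if candidate in vocab:
                chosen = candidate
                break
        yield tried, chosen
        pos += len(chosen)

vocab = {"经常", "经", "有", "常", "有意见", "歧", "意见", "分歧", "见", "意", "见分歧", "分"}
for tried, chosen in forward_match_trace("经常有意见分歧", vocab):
    print(tried, "->", chosen)
# ['经常有', '经常'] -> 经常
# ['有意见'] -> 有意见
# ['分歧'] -> 分歧
```

The first step rejects "经常有" before accepting "经常"; the other two steps match on their first, longest candidate.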