Word Segmentation with the Maximum Matching Algorithm

License: CC 4.0 BY-SA

Tags:

#NLP #word-segmentation #maximum-matching #HMM #lexical-analysis

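The principle of the maximum matching algorithm: at each step, split one word off the left end of the sentence. That word is the longest of all dictionary words that match at the current position; if no dictionary word matches, a single character is taken.
First, we implement a maximum matching algorithm: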

def max_match(sentence, dictionary):
    """Forward maximum matching: greedily take the longest dictionary word."""
    if sentence == "":
        return []
    # Fall back to a single character when no dictionary word matches.
    word_end = 1
    # Try candidate prefixes from longest to shortest.
    for i in range(len(sentence), 0, -1):
        if sentence[:i] in dictionary:
            word_end = i
            break

    word = sentence[:word_end]
    remainder = sentence[word_end:]
    return [word] + max_match(remainder, dictionary)
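
The recursive version makes one call per output word, so a very long input can hit Python's recursion limit. As a minimal sketch (not the code from this article), the same greedy rule can be written as a loop, with the candidate length capped at the longest dictionary entry. The name `max_match_iter` and the toy dictionary are assumptions made purely for illustration.

```python
def max_match_iter(sentence, dictionary):
    """Iterative forward maximum matching (illustrative sketch)."""
    # No dictionary word can be longer than this, so longer candidates are skipped.
    max_len = max((len(w) for w in dictionary), default=1)
    words = []
    start = 0
    while start < len(sentence):
        # Fall back to a single character when nothing in the dictionary matches.
        end = start + 1
        # Try multi-character candidates from longest to shortest.
        for length in range(min(max_len, len(sentence) - start), 1, -1):
            if sentence[start:start + length] in dictionary:
                end = start + length
                break
        words.append(sentence[start:end])
        start = end
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"自然", "语言", "自然语言", "处理"}
print("/".join(max_match_iter("自然语言处理", toy_dict)))  # 自然语言/处理
```

Capping the candidate length avoids testing prefixes that are longer than any dictionary word, which matters once sentences get long.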

Next, we test how well this maximum matching segmenter works.
First, load jieba's dictionary:

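import jieba

dic = {}

# jieba.get_dict_file() returns the dictionary file opened in binary mode.
dic_file = jieba.get_dict_file()
for line in dic_file:
    # Each line has the form: word frequency part-of-speech tag
    word, count, tag = line.decode("utf-8").split()
    dic[word] = count
dic_file.close()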
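
To experiment without jieba installed, the same parsing loop can be run against an in-memory file. The sketch below assumes the `word frequency tag` line layout that the loop above unpacks. The helper name `load_dict` and the sample lines are made up for illustration and are not taken from jieba's real dictionary.

```python
import io


def load_dict(binary_file):
    """Parse 'word frequency tag' lines from a binary file into {word: frequency}."""
    result = {}
    for raw_line in binary_file:
        fields = raw_line.decode("utf-8").split()
        # Skip blank or malformed lines instead of failing on unpacking.
        if len(fields) < 2:
            continue
        word, count = fields[0], fields[1]
        result[word] = int(count)
    return result


# Hypothetical sample data, invented for this example.
sample = io.BytesIO("自然语言 100 n\n处理 200 v\n\n爱 300 v\n".encode("utf-8"))
toy = load_dict(sample)
print(toy)
```

Only membership in the dictionary matters for maximum matching, so the frequency values are kept here just to mirror the original loop.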

Then we use this dictionary to segment a sentence:

sentence = "我是张晨，我爱自然语言处理"
seperator = "/"

print("max match cut result:")
max_match_words = seperator.join(max_match(sentence, dic))
print(max_match_words)

The output is:

max match cut result:
我/是/张/晨/，/我/爱/自然语言/处理

Now we segment the same sentence with jieba, with the HMM disabled:

print("\njieba cut result:")
jieba_cut_words = seperator.join(jieba.cut(sentence, HMM=False))
print(jieba_cut_words)

The output is:

jieba cut result:
我/是/张/晨/，/我/爱/自然语言/处理

As the output shows, maximum matching and jieba (with the HMM disabled) produce the same segmentation for this sentence.

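There is a problem, though: "张晨" is a named entity and should be a single word rather than being split into "张" and "晨". Because the dictionary we use does not contain "张晨", it is an out-of-vocabulary (OOV) word, and this kind of maximum matching cannot handle out-of-vocabulary words.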
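
The effect of a missing dictionary entry can be reproduced with a small sketch. The toy dictionary below is hypothetical, and the compact `greedy_cut` helper is only a stand-in for the maximum matching function defined earlier. It shows that the name is split into single characters until it is added to the dictionary.

```python
def greedy_cut(sentence, dictionary):
    """Compact forward maximum matching, used here only for the demonstration."""
    words = []
    while sentence:
        # Longest prefix found in the dictionary, or one character as a fallback.
        end = next(
            (i for i in range(len(sentence), 1, -1) if sentence[:i] in dictionary),
            1,
        )
        words.append(sentence[:end])
        sentence = sentence[end:]
    return words


# Hypothetical toy dictionary that does not contain the name "张晨".
toy_dict = {"我", "是", "张", "晨"}
before = greedy_cut("我是张晨", toy_dict)
print("/".join(before))  # 我/是/张/晨

# Once the name is registered, the same greedy rule keeps it whole.
toy_dict.add("张晨")
after = greedy_cut("我是张晨", toy_dict)
print("/".join(after))  # 我/是/张晨
```

Adding entries by hand only works for names known in advance, which is why a statistical model is useful for unseen words.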

To handle out-of-vocabulary words, we can enable jieba's HMM (hidden Markov model):

print("\njieba cut result with HMM:")
jieba_cut_words = seperator.join(jieba.cut(sentence, HMM=True))
print(jieba_cut_words)

The output is:

jieba cut result with HMM:
我/是/张晨/，/我/爱/自然语言/处理

As shown, once the HMM is enabled, the out-of-vocabulary word "张晨" is recognized correctly.
The complete code can be downloaded from my GitHub.