Handling out-of-vocabulary (OOV) words in a corpus

Original post · last modified 2022-11-03 10:15:20 · 1.9k views


Licensed under CC 4.0 BY-SA

Tags: #nlp

First published 2021-01-15 13:59:16


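This post looks at how Word2Vec and FastText differ in dealing with out-of-vocabulary (OOV) words. With Word2Vec, OOV words are simply removed from the input so that only in-vocabulary tokens remain. FastText has a built-in OOV mechanism: it builds a vector for an unknown word from its character n-grams, so the result lies close to the vectors of similar-looking words. Which strategy you choose affects how well unseen vocabulary is understood and represented in an NLP pipeline.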

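A word-vector model trained with word2vec cannot solve the OOV problem on its own: a word that is not in the trained vocabulary has no vector.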

Approach 1: drop the OOV words, following a snippet from the word2vec tooling

import logging

import gensim

logger = logging.getLogger(__name__)

# Load the trained model
wvmodel = gensim.models.Word2Vec.load(r'E:\pyCharmProject_new\word2vecTest\model\wiki_corpus.bin')

# document1 and document2 are lists of tokens prepared by the caller.
# Remove out-of-vocabulary words. Membership is tested on wvmodel.wv
# (the KeyedVectors), which works in both gensim 3.x and 4.x.
len_pre_oov1 = len(document1)
len_pre_oov2 = len(document2)
document1 = [token for token in document1 if token in wvmodel.wv]
document2 = [token for token in document2 if token in wvmodel.wv]
diff1 = len_pre_oov1 - len(document1)
diff2 = len_pre_oov2 - len(document2)
if diff1 > 0 or diff2 > 0:
    logger.info('Removed %d and %d OOV words from document 1 and 2 (respectively).', diff1, diff2)
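
The filtering step can also be tried without a trained model file. The following is only a minimal sketch under stated assumptions: the helper `remove_oov` and the toy vocabulary `toy_vocab` are hypothetical names introduced for illustration and are not part of gensim. Any object that supports the `in` operator can stand in for the vocabulary.

```python
import logging
from typing import Container, List, Tuple

logger = logging.getLogger(__name__)


def remove_oov(tokens: List[str], vocab: Container[str]) -> Tuple[List[str], int]:
    """Return the in-vocabulary tokens and the number of tokens dropped.

    Hypothetical helper for illustration; not a gensim function.
    """
    kept = [token for token in tokens if token in vocab]
    return kept, len(tokens) - len(kept)


# Hypothetical toy vocabulary standing in for a trained model's vocabulary.
toy_vocab = {"natural", "language", "processing"}
doc = ["natural", "language", "processing", "rocks"]

kept, dropped = remove_oov(doc, toy_vocab)
if dropped > 0:
    logger.info("Removed %d OOV words.", dropped)
```

With a real model, passing `wvmodel.wv` as `vocab` should behave the same way, since gensim's KeyedVectors supports membership tests. Note that a document consisting only of OOV words comes back empty, so the caller should check for that before computing anything on the result.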