Gensim Study Notes 1: The Corpora Module and Vector Space Representation

License: CC 4.0 BY-SA

Tags: #Gensim #NLP

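This article shows how to use the Gensim library for text preprocessing and vectorization, including building a dictionary and generating bag-of-words vectors, and discusses how to reduce memory consumption and persist a corpus to disk.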

This series is based on the official Gensim tutorials. It is my own reworking of them rather than a strict translation.

Before We Start

To enable logging, this is all you need:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

Converting Text to Vectors

Suppose we have a corpus named documents that contains nine documents, each consisting of a single sentence.

documents = ["Human machine interface for lab abc computer applications",
              "A survey of user opinion of computer system response time",
              "The EPS user interface management system",
              "System and human system engineering testing of EPS",
              "Relation of user perceived response time to error measurement",
              "The generation of random binary unordered trees",
              "The intersection graph of paths in trees",
              "Graph minors IV Widths of trees and well quasi ordering",
              "Graph minors A survey"]

Text Preprocessing

The first step is to tokenize the text and remove stop words:

from pprint import pprint
stopWordsList = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stopWordsList]
         for document in documents]
pprint(texts)

[['human', 'machine', 'interface', 'lab', 'abc', 'computer', 'applications'],
 ['survey', 'user', 'opinion', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'management', 'system'],
 ['system', 'human', 'system', 'engineering', 'testing', 'eps'],
 ['relation', 'user', 'perceived', 'response', 'time', 'error', 'measurement'],
 ['generation', 'random', 'binary', 'unordered', 'trees'],
 ['intersection', 'graph', 'paths', 'trees'],
 ['graph', 'minors', 'iv', 'widths', 'trees', 'well', 'quasi', 'ordering'],
 ['graph', 'minors', 'survey']]

Next, we remove the words that occur only once in the whole corpus:

# Count how often each token occurs across the whole corpus
frequency = {}
for doc in texts:
    for token in doc:
        frequency.setdefault(token, 0)
        frequency[token] += 1

# Keep only the tokens that occur more than once
texts = [[token for token in doc if frequency[token] > 1] for doc in texts]
pprint(texts)

[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]

Of course, a real application would likely tokenize differently; the approach used here is deliberately crude.
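As one possible refinement, the sketch below tokenizes with a regular expression so that punctuation does not stick to words. The helper names (tokenize, drop_rare_tokens), the token pattern, and the sample sentences are assumptions made for illustration; they are not part of Gensim or of the original tutorial.

```python
import re
from collections import Counter

# Hypothetical helpers; the stop-word list mirrors the one used above.
STOP_WORDS = frozenset("for a of the and to in".split())
TOKEN_PATTERN = re.compile(r"[a-z0-9]+")


def tokenize(document):
    """Lowercase, keep alphanumeric runs, and drop stop words."""
    return [tok for tok in TOKEN_PATTERN.findall(document.lower())
            if tok not in STOP_WORDS]


def drop_rare_tokens(token_lists, min_count=2):
    """Remove tokens whose total count over the corpus is below min_count."""
    counts = Counter(tok for doc in token_lists for tok in doc)
    return [[tok for tok in doc if counts[tok] >= min_count]
            for doc in token_lists]


# Made-up sentences with punctuation, to show the difference from split().
sample = ["Human-machine interface, for lab ABC!",
          "The EPS user interface (management system)."]
tokens = [tokenize(doc) for doc in sample]
filtered = drop_rare_tokens(tokens)
print(tokens)
print(filtered)
```

With plain split(), "interface," and "interface" would count as different tokens; the pattern-based version treats them as the same word.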

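Vectorization

The steps above complete the tokenization of the whole corpus. Next, we need to vectorize it.
Here we use the simplest option, the bag-of-words model. In this model each document is represented as one long vector in which each entry is the number of times the corresponding word occurs. Strictly speaking this is not a one-hot encoding, because the entries are counts rather than 0/1 indicators.
To do this, we associate every word in the corpus with a unique integer ID, which is what gensim.corpora.Dictionary does. The dictionary defines the vocabulary of all the words we will work with.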

# Importing gensim directly may emit warnings, so suppress them first
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning, module='gensim')
from gensim import corpora
dictionary = corpora.Dictionary(texts)
# A corpora.Dictionary object can be saved to a file
# dictionary.save(r'C:\Users\zuoyiping\Desktop\Doc.dict')
print(dictionary)

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)

The dictionary above stores the mapping between words and their integer IDs. Its vocabulary contains 12 entries, and the information we want is easy to inspect:

print('ID of each token:\n', dictionary.token2id)
print('Document frequency of each token ID:\n', dictionary.dfs)

ID of each token:
 {'computer': 0, 'human': 1, 'interface': 2, 'response': 3, 'survey': 4, 'system': 5, 'time': 6, 'user': 7, 'eps': 8, 'trees': 9, 'graph': 10, 'minors': 11}
Document frequency of each token ID:
 {1: 2, 2: 2, 0: 2, 4: 2, 7: 3, 5: 3, 3: 2, 6: 2, 8: 2, 9: 3, 10: 3, 11: 2}
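The values in dictionary.dfs are document frequencies: the number of documents that contain a token, not the total number of times it occurs. The standard-library sketch below recomputes both quantities for the filtered corpus from the preprocessing step. The names doc_freq and coll_freq are illustrative, and this is not how Gensim computes them internally.

```python
from collections import Counter

# The filtered corpus produced by the preprocessing step above.
texts = [['human', 'interface', 'computer'],
         ['survey', 'user', 'computer', 'system', 'response', 'time'],
         ['eps', 'user', 'interface', 'system'],
         ['system', 'human', 'system', 'eps'],
         ['user', 'response', 'time'],
         ['trees'],
         ['graph', 'trees'],
         ['graph', 'minors', 'trees'],
         ['graph', 'minors', 'survey']]

# Document frequency: count each token at most once per document.
doc_freq = Counter(tok for doc in texts for tok in set(doc))
# Collection frequency: count every single occurrence.
coll_freq = Counter(tok for doc in texts for tok in doc)

print(doc_freq['system'], coll_freq['system'])
```

The token 'system' occurs four times in total but in only three documents, which agrees with the entry 5: 3 in the dfs output above.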

A Dictionary can also be extended later by adding more documents:

add_documents(documents, prune_at=2000000)
"""
documents (iterable of iterable of str) – Input corpus. All tokens should be already tokenized and normalized.
"""

With the mapping in place, a given document can be converted into a bag-of-words representation. Two methods do most of the work:

doc2bow(document, allow_update=False, return_missing=False)
"""
    Convert document into a bag-of-words (BoW) vector
    Parameters: 
        document: (list of str)

"""
doc2idx(document, unknown_word_index=-1)
"""
    Convert document into the corresponding list of token IDs
    Parameters: 
        document (list of str)
"""

newDoc = "Human response interaction human".lower().split(' ')
oneHotVec = dictionary.doc2bow(newDoc)
docId = dictionary.doc2idx(newDoc)
print('Token IDs:')
print(docId)
print('Bag-of-words vector:')
print(oneHotVec)

Token IDs:
[1, 3, -1, 1]
Bag-of-words vector:
[(1, 2), (3, 1)]

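The new document newDoc has been converted into the bag-of-words vector oneHotVec (despite the variable name, it holds counts). Note that Gensim represents bag-of-words vectors sparsely, as a list of (ID, count) pairs; any ID that does not appear in the list has an implicit count of 0. The word "interaction" is not in the dictionary, so doc2idx reports it as -1 and doc2bow ignores it.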
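To make the sparse format concrete, here is a minimal sketch that expands a list of (ID, count) pairs into a dense vector. The function name sparse_to_dense is hypothetical; Gensim ships its own helper, gensim.matutils.sparse2full, which is the one to prefer in real code.

```python
def sparse_to_dense(bow, num_terms):
    """Expand a list of (token_id, count) pairs into a dense list."""
    dense = [0] * num_terms
    for token_id, count in bow:
        dense[token_id] = count
    return dense


bow = [(1, 2), (3, 1)]            # the sparse vector printed above
dense = sparse_to_dense(bow, 12)  # the dictionary has 12 tokens
print(dense)
```

For a vocabulary of 12 words the difference is negligible, but with hundreds of thousands of words almost every entry of the dense vector would be zero, which is why the sparse form is the default.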
Now we convert the original corpus into a list of bag-of-words vectors:

corpus = [dictionary.doc2bow(doc) for doc in texts]
pprint(corpus)

[[(0, 1), (1, 1), (2, 1)],
 [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
 [(2, 1), (5, 1), (7, 1), (8, 1)],
 [(1, 1), (5, 2), (8, 1)],
 [(3, 1), (6, 1), (7, 1)],
 [(9, 1)],
 [(9, 1), (10, 1)],
 [(9, 1), (10, 1), (11, 1)],
 [(4, 1), (10, 1), (11, 1)]]

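With that, the text has been turned into vectors that can be processed further. The bag-of-words model is only the simplest possible model, and it is rarely used directly in practice. For real applications the vectors usually need further transformation, a topic we continue with in the following tutorials.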

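Reducing Memory Usage

In the previous section we obtained vectors for our documents. One serious problem remains: we loaded all the documents into memory at once and converted them there. For a large corpus that is very expensive.
Instead, we can use iterators to build a "memory-friendly" corpus. For example, we can write the raw corpus to a file with one document per line, then read it line by line and yield one bag-of-words vector at a time.
Suppose the documents above have been written to a file named mycorpus.txt: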

class MyCorpus:
    def __init__(self, filepath, dictionary):
        self.file = filepath
        self.dict = dictionary

    def __iter__(self):
        with open(self.file, encoding='utf-8') as f:
            for line in f:
                text = line.lower().split()
                yield self.dict.doc2bow(text)


memoryFriendlyCorpus = MyCorpus(r'./Data/mycorpus.txt', dictionary)
for vec in memoryFriendlyCorpus:
    print(vec)

[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]

The same idea applies to building the Dictionary, so it is not repeated here.
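For readers who want to see the streaming idea applied to vocabulary building, here is a sketch that uses only the standard library. The helper names, the document-frequency threshold, and the alphabetical ID assignment are assumptions for illustration, so the resulting IDs will not match Gensim's. With Gensim itself, corpora.Dictionary accepts any iterable of token lists, so a generator over the file's lines can be passed to it directly.

```python
import os
import tempfile
from collections import Counter

STOP_WORDS = frozenset("for a of the and to in".split())


def iter_token_lists(path):
    """Yield one token list per line without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield [tok for tok in line.lower().split()
                   if tok not in STOP_WORDS]


def build_token2id(path, min_doc_freq=2):
    """Stream the file once, count document frequencies, then assign IDs."""
    doc_freq = Counter()
    for tokens in iter_token_lists(path):
        doc_freq.update(set(tokens))
    kept = sorted(tok for tok, df in doc_freq.items() if df >= min_doc_freq)
    return {tok: idx for idx, tok in enumerate(kept)}


# A tiny stand-in for mycorpus.txt so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Graph minors A survey\n"
              "The intersection graph of paths in trees\n"
              "The generation of random binary unordered trees\n")
    corpus_path = tmp.name

token2id = build_token2id(corpus_path)
os.remove(corpus_path)
print(token2id)
```

Only one line of the file is held in memory at a time; the only structure that grows with the corpus is the counter over distinct tokens.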

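Serialization Formats for a Corpus

To persist a corpus, we usually serialize it and store it on disk. There are many formats to choose from, and Gensim provides a serializer for each of them.
These serializers are classes of the form corpora.xxxCorpus, where xxxCorpus is the class for the corresponding format. They are used as follows:
- Serialize and save: corpora.xxxCorpus.serialize(filepath, corpus)
- Load: corpora.xxxCorpus(filepath)

The most commonly used ones are: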
- MmCorpus
- LowCorpus
- SvmLightCorpus
- BleiCorpus

See the official API documentation for the others. Here we use MmCorpus as the example.

sampleCorpus = [[(1, 0.5), (3, 4)], []]
# Serialize the corpus and write it to disk
corpora.MmCorpus.serialize('./Data/sampleCorpus.mm', sampleCorpus)
# Load it back
sampleCorpus = corpora.MmCorpus('./Data/sampleCorpus.mm')
print(sampleCorpus)
for doc in sampleCorpus:
    print(doc)

MmCorpus(2 documents, 4 features, 2 non-zero entries)
[(1, 0.5), (3, 4.0)]
[]
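MmCorpus stores the corpus in the Matrix Market format. As a rough illustration of what such a file contains, the sketch below renders a sparse corpus as Matrix Market coordinate text, assuming the standard layout: a header line, a line with the number of rows, columns, and non-zero entries, then one 1-based "row column value" line per entry. The function to_matrix_market is hypothetical, and the exact file Gensim writes may differ in its details.

```python
def to_matrix_market(corpus, num_terms):
    """Render a sparse corpus as Matrix Market coordinate text."""
    # Documents become rows and token IDs become columns, both 1-based.
    entries = [(doc_no, term_id + 1, weight)
               for doc_no, doc in enumerate(corpus, start=1)
               for term_id, weight in doc]
    lines = ["%%MatrixMarket matrix coordinate real general",
             f"{len(corpus)} {num_terms} {len(entries)}"]
    lines += [f"{row} {col} {weight}" for row, col, weight in entries]
    return "\n".join(lines)


sample = [[(1, 0.5), (3, 4)], []]
mm_text = to_matrix_market(sample, num_terms=4)
print(mm_text)
```

The second line of the rendered text carries the same three numbers that the MmCorpus summary above reports: 2 documents, 4 features, and 2 non-zero entries.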

Note that the MmCorpus object is iterable, which again keeps memory usage low. Other objects can be persisted as well; a Dictionary, for example, has save and load methods.

dictionary.save('./Data/sampleDict.dict')
dictionary = corpora.Dictionary.load('./Data/sampleDict.dict')
print(dictionary)

Dictionary(12 unique tokens: ['computer', 'human', 'interface', 'response', 'survey']...)