Hands-On with Gensim, Python's Natural Language Processing Library

License: CC 4.0 BY-SA

Demystifying Gensim: Building Your First Text Topic Model from Scratch

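Imagine you have a large pile of disorganized documents, like an old bookshelf that has never been sorted. Gensim is the librarian that helps you sort those documents into categories. In this section, we will use Gensim to build a simple LDA (Latent Dirichlet Allocation) topic model.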
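
LDA treats each document as a bag of words: word order is discarded and only the per-document word counts are kept. As a rough illustration of that representation, here is a plain-Python sketch that maps tokens to integer ids and counts them. The helper names `build_vocab` and `to_bow` and the two sample sentences are hypothetical and not part of Gensim; Gensim's own dictionary may assign ids in a different order, so treat this as similar in spirit rather than an exact reproduction.

```python
from collections import Counter


def build_vocab(texts):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for tokens in texts:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab


def to_bow(tokens, vocab):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())


# Two tiny made-up documents, already lowercased and split on whitespace
texts = [
    "oil prices rise as oil supply falls".split(),
    "bank raises interest rates".split(),
]

vocab = build_vocab(texts)
bows = [to_bow(tokens, vocab) for tokens in texts]

# Each document is now a list of (token_id, count) pairs
for bow in bows:
    print(bow)
```

In the first document the token "oil" appears twice, so its id is paired with a count of 2. A topic model only ever sees these id and count pairs, not the original sentences.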

First, install Gensim:

pip install gensim

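Next, prepare some text data. Here we use the English Reuters corpus from the nltk library as the example (install nltk with pip install nltk if you do not have it yet):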

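import nltk
import gensim
from gensim import corpora
from nltk.corpus import reuters

# Fetch the Reuters corpus on first use (nothing is re-downloaded if it is up to date)
nltk.download("reuters", quiet=True)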

# Load the first 100 Reuters documents as sample texts
documents = [reuters.raw(file_id) for file_id in reuters.fileids()[:100]]

# Preprocess: lowercase each document and split it on whitespace
texts = [document.lower().split() for document in documents]

# Build the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train the LDA model
lda_model = gensim.models.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=15)

# Show the keywords for each topic
for idx, topic in lda_model.print_topics(-1):
    print(f"Topic: