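Corpora and Vector Spaces
With gensim installed, you have a tool built for very large text collections, so let's put it to work.
If you want to see logging output, set it up first: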
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
From Strings to Vectors
The corpus below holds nine documents as plain strings, each consisting of a single sentence.
>>> from gensim import corpora
>>>
>>> documents = ["Human machine interface for lab abc computer applications",
...              "A survey of user opinion of computer system response time",
...              "The EPS user interface management system",
...              "System and human system engineering testing of EPS",
...              "Relation of user perceived response time to error measurement",
...              "The generation of random binary unordered trees",
...              "The intersection graph of paths in trees",
...              "Graph minors IV Widths of trees and well quasi ordering",
...              "Graph minors A survey"]
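First, we tokenize the documents and remove common words, then drop any word that appears only once in the whole corpus: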
>>> # remove common words and tokenize
>>> stoplist = set('for a of the and to in'.split())
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
...          for document in documents]
>>>
>>> # remove words that appear only once
>>> from collections import defaultdict
>>> frequency = defaultdict(int)
>>> for text in texts:
...     for token in text:
...         frequency[token] += 1
...
>>> texts = [[token for token in text if frequency[token] > 1]
...          for text in texts]
>>>
>>> from pprint import pprint  # pretty-printer
>>> pprint(texts)
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
['

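This post shows how to process text data with Gensim: converting strings to vectors, streaming a corpus, understanding the different corpus formats, and working with NumPy and SciPy. It walks through the path from raw text to a vectorized representation with examples, and mentions corpus formats such as Matrix Market.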
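To make the "text to vector" step concrete, here is a minimal plain-Python sketch of the bag-of-words idea, applied to the first four tokenized documents shown above. It is an illustration only, not gensim's implementation: the helper names `build_vocabulary` and `to_bow` are hypothetical, and the integer ids assigned here (order of first appearance) will generally differ from the ids gensim assigns.

```python
from collections import Counter

# First four tokenized documents from the preprocessing output above.
texts = [
    ["human", "interface", "computer"],
    ["survey", "user", "computer", "system", "response", "time"],
    ["eps", "user", "interface", "system"],
    ["system", "human", "system", "eps"],
]


def build_vocabulary(tokenized_docs):
    """Hypothetical helper: give each distinct token an integer id,
    numbered in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Hypothetical helper: turn a token list into sorted (token_id, count)
    pairs. Tokens missing from the vocabulary are silently skipped."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


token2id = build_vocabulary(texts)
bow_corpus = [to_bow(doc, token2id) for doc in texts]

for vector in bow_corpus:
    print(vector)
```

Each document becomes a sparse vector: only the words that actually occur are stored, each with its count. The fourth document, for example, contains "system" twice, so its pair for that word carries a count of 2.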