from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
corpus = [
'This is the first document.',
'This is the second second document.',
'And the third one.',
'Is this the first document?',
]
CountVectorizer是通过fit_transform函数将文本中的词语转换为词频矩阵
get_feature_names()可看到所有文本的关键字
vocabulary_可看到所有文本的关键字和其位置
toarray()可看到词频矩阵的结果
vectorizer = CountVectorizer()
count = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(vectorizer.vocabulary_)
print(count.toarray())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
[[0 1 1 1 0 0 1 0 1]
[0 1 0 1 0 2 1 0 1]
[1 0 0 0 1 0 1 1 0]
[0 1 1 1 0 0 1 0 1]]
TfidfTransformer是统计CountVectorizer中每个词语的tf-idf权值
transf