[Machine Learning] Text Topics
1 TF-IDF
Commonly used for extracting keywords from text:
- TF (term frequency) = number of occurrences of the term in this document / total number of terms in the document
- IDF (inverse document frequency) = log(total number of documents in the corpus / (number of documents containing the term + 1))
- For a given document, compute the TF-IDF value of every term, sort in descending order, and take the top few terms as the document's keywords (a minimal worked sketch follows this list)
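Before the library versions below, here is a minimal hand-rolled sketch of these formulas; the toy whitespace-tokenized corpus is a made-up illustration (real text would be tokenized first, e.g. with jieba):
import math
from collections import Counter

# toy corpus, already tokenized (hypothetical example data)
docs = [
    'apple banana apple'.split(),
    'banana cherry'.split(),
    'cherry durian'.split(),
]

def tf_idf(doc, docs):
    scores = {}
    for word, count in Counter(doc).items():
        tf = count / len(doc)                    # occurrences / total terms in this document
        df = sum(1 for d in docs if word in d)   # documents containing the term
        idf = math.log(len(docs) / (df + 1))     # smoothed inverse document frequency
        scores[word] = tf * idf
    # descending TF-IDF: the top terms are the keywords
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(tf_idf(docs[0], docs))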
import jieba
import numpy as np
texts = [
    '...',
    '...',
    '...',
    '...'
]
# sklearn
# For Chinese documents, tokenize with jieba first
x_train = [" ".join(jieba.cut(text)) for text in texts[:3]]
x_test = [" ".join(jieba.cut(text)) for text in texts[3:]]
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
vectorizer = CountVectorizer()
tf_idf_transformer = TfidfTransformer()
x_train_tf_idf = tf_idf_transformer.fit_transform(vectorizer.fit_transform(x_train))
x_test_tf_idf = tf_idf_transformer.transform(vectorizer.transform(x_test))
x_train_weight = x_train_tf_idf.toarray()
x_test_weight = x_test_tf_idf.toarray()
words = vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn
for w in x_train_weight:
    # indices of terms in descending order of TF-IDF weight
    loc = np.argsort(-w)
    # print the top 5 terms and their weights for this document
    for i in range(5):
        print('{}: {} {}'.format(str(i + 1), words[loc[i]], w[loc[i]]))
    print('\n')
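As a side note, scikit-learn also ships TfidfVectorizer, which fuses the two steps above into one; a minimal equivalent sketch:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()
# same result as CountVectorizer followed by TfidfTransformer
x_train_tfidf = tfidf_vectorizer.fit_transform(x_train)
x_test_tfidf = tfidf_vectorizer.transform(x_test)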
# gensim
x_train = [jieba.lcut(text) for text in texts[:3]]
x_test = [jieba.lcut(text) for text in texts[3:]]
from gensim import corpora
from gensim import models
# Build the vocabulary
dic = corpora.Dictionary(x_train)
# Convert each document to a bag-of-words: a list of (word_id, count) pairs
x_train_bow = [dic.doc2bow(sentence) for sentence in x_train]
tfidf = models.TfidfModel(x_train_bow)
tfidf_vec = []
for sentence in x_test:
    # sentence is already a token list, so pass it to doc2bow directly
    word_bow = dic.doc2bow(sentence)
    word_tfidf = tfidf[word_bow]
    tfidf_vec.append(word_tfidf)
# Prints a list of (word_id, tfidf) pairs per document
print(tfidf_vec)
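Since the output above is keyed by word id, a small follow-up sketch (reusing dic and tfidf_vec from above) maps ids back to tokens and sorts to recover the top keywords:
for word_tfidf in tfidf_vec:
    # sort (word_id, score) pairs by descending score and keep the top 5
    top = sorted(word_tfidf, key=lambda pair: -pair[1])[:5]
    # dic[word_id] maps an id back to its token
    print([(dic[word_id], round(score, 3)) for word_id, score in top])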
# jieba
import jieba.analyse
# Uses jieba's built-in IDF table by default; a custom one can also be supplied
keywords = jieba.analyse.extract_tags(texts[0], topK=5)
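For reference, extract_tags can also return the weights alongside the keywords, and jieba.analyse.set_idf_path swaps in a custom IDF table ('my_idf.txt' below is a hypothetical path):
# keywords together with their TF-IDF weights
keywords = jieba.analyse.extract_tags(texts[0], topK=5, withWeight=True)
# load a custom IDF file (one "word idf_value" pair per line); path is a placeholder
jieba.analyse.set_idf_path('my_idf.txt')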
For large-scale corpora, TF-IDF can also be computed with Spark; see Spark MLlib TF-IDF – Example.
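A minimal PySpark sketch of that pipeline; the SparkSession setup and the tiny inline DataFrame are assumptions for illustration:
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.appName('tfidf-example').getOrCreate()
df = spark.createDataFrame(
    [(0, 'apple banana apple'), (1, 'banana cherry')], ['id', 'text'])
# whitespace tokenization -> hashed term counts -> IDF rescaling
words = Tokenizer(inputCol='text', outputCol='words').transform(df)
tf = HashingTF(inputCol='words', outputCol='raw_features').transform(words)
idf_model = IDF(inputCol='raw_features', outputCol='features').fit(tf)
idf_model.transform(tf).select('id', 'features').show(truncate=False)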
2 LDA
Commonly used for text topic analysis.
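A minimal gensim sketch, reusing dic and x_train_bow from the TF-IDF example above; num_topics=2 and passes=10 are arbitrary illustrative choices:
from gensim import models

# fit LDA on the bag-of-words corpus
lda = models.LdaModel(corpus=x_train_bow, id2word=dic, num_topics=2, passes=10)
# each topic is a weighted mixture of words
for topic_id, topic in lda.print_topics(num_words=5):
    print(topic_id, topic)
# topic distribution of an unseen (tokenized) document
print(lda[dic.doc2bow(x_test[0])])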
