
Text Preprocessing and the LDA Topic Model
This post walks through a text preprocessing pipeline and topic extraction with an LDA topic model. nltk handles tokenization, stop-word removal, and stemming; gensim trains the LDA model. A handful of sample documents are processed and analyzed to show how key information can be extracted from text.
```python
# -*- coding: utf-8 -*-
from nltk.tokenize import RegexpTokenizer
from stop_words import get_stop_words
from nltk.stem.porter import PorterStemmer
from gensim.models.ldamodel import LdaModel
from gensim import corpora, models, similarities


def main():
    doc_a = "Brocolli is good to eat. My brother likes to eat good brocolli, but not my mother."
    doc_b = "My mother spends a lot of time driving my brother around to baseball practice."
    doc_c = "Some health experts suggest that driving may cause increased tension and blood pressure."
    doc_d = "I often feel pressure to perform well at school, but my mother never seems to drive my brother to do better."
    doc_e = "Health professionals say that brocolli is good for your health."

    # compile sample documents into a list
    doc_set = [doc_a, doc_b, doc_c, doc_d, doc_e]

    # Tokenize on word characters, drop English stop words, and stem with Porter
    tokenizer = RegexpTokenizer(r'\w+')
    p_stemmer = PorterStemmer()
    en_stop = get_stop_words('en')
    texts = []
    for raw in doc_set:
        raw = raw.lower()
        tokens = tokenizer.tokenize(raw)
        stopped_tokens = [i for i in tokens if i not in en_stop]
        stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
        texts.append(stemmed_tokens)
    # Map each token to an integer id, then encode every document as a
    # bag-of-words vector of (token_id, count) pairs
    dictionary = corpora.Dictionary(texts)
    print(dictionary)
    print(dictionary.token2id)
    corpus = [dictionary.doc2bow(text) for text in texts]
    print(corpus)
    """
    num_topics: 必须。LDA 模型要求用户决定应该生成多少个主题。由于我们的文档集很小,所以我们只生成三个主题。
    id2word:必须。LdaModel 类要求我们之前的 dictionary  id 都映射成为字符串。
    passes:可选。模型遍历语料库的次数。遍历的次数越多,模型越精确。但是对于非常大的语料库,遍历太多次会花费很长的时间。
    """
    ldamodel=LdaModel(corpus,num_topics=2,id2word=dictionary,passes=20)
    print(ldamodel.print_topics(num_topics=2, num_words=4))

    # Branch 1: build a TF-IDF model over the same bag-of-words corpus
    tfidf = models.TfidfModel(corpus)
    print(tfidf)
    # Transform the whole corpus into TF-IDF space
    corpus_tfidf = tfidf[corpus]
    print(corpus_tfidf)
    # Build a similarity index over the TF-IDF vectors; num_features must
    # cover every token id in the dictionary
    similarity = similarities.Similarity('Similarity-tfidf-index', corpus_tfidf,
                                         num_features=len(dictionary))
    print(similarity)
    new_sensence = "My mother spends a lot of time driving my brother around to baseball practice"
    tokens = tokenizer.tokenize(new_sensence.lower())
    tokens1 = [i for i in tokens if not i in en_stop]
    new_sen = [p_stemmer.stem(i) for i in tokens1]
    test_corpus_1 = dictionary.doc2bow(new_sen)
    vec_tfidf = tfidf[test_corpus_1]
    print vec_tfidf
    id2token={value:key for key,value in dictionary.token2id.items()}
    print id2token
    for (key,freq) in vec_tfidf:
        print id2token[key],freq
if __name__ == '__main__':
    main()
```
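The listing builds the `Similarity` index but never actually queries it. As a minimal sketch (assuming the `similarity` and `vec_tfidf` objects from `main()` are still in scope), looking the new sentence's TF-IDF vector up in the index scores it against every training document:

```python
# Minimal sketch: query the Similarity index built above.
# Assumes `similarity` and `vec_tfidf` from main() are still in scope.
sims = similarity[vec_tfidf]  # cosine similarity against each indexed document
for doc_id, score in enumerate(sims):
    print(doc_id, score)
```

Since the query sentence is `doc_b` almost verbatim, document 1 should come back with the highest score.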
LDA (Latent Dirichlet Allocation) is a generative topic model for documents, also described as a three-layer Bayesian probabilistic model with word, topic, and document layers. As a generative model, it assumes each word in an article is produced by the process "the article selects a topic with some probability, and that topic selects a word with some probability"; both the document-to-topic distribution and the topic-to-word distribution are multinomial[^2].

LDA is mainly used to classify document text under specific topics. For each document it builds topics together with their associated words. The model has been shown to give accurate results for topic-modeling use cases, but the input files need some cleaning and preprocessing before use[^1].

Contrasting discriminative and generative models: a discriminative model describes how the label Y is produced and does not model the features themselves, which suits classifiers and regression analysis; LDA, as a generative model, models the data X and the label Y jointly, which makes it better suited to unsupervised analysis[^3].

In natural language processing, judging whether documents are related requires looking at their semantics, and topic models are an effective tool for semantic mining; LDA is one of the more effective among them. Even when two sentences share no words at all, LDA may still judge them similar, because it mines the semantic information of the documents[^4].

### Code Example

The following example implements a simple LDA model with Python's `gensim` library:

```python
from gensim import corpora, models
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

# Sample documents
documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey"
]

# Tokenize and remove stop words
stop_words = set(stopwords.words('english'))
texts = []
for doc in documents:
    tokens = word_tokenize(doc.lower())
    filtered_tokens = [token for token in tokens if token.isalpha() and token not in stop_words]
    texts.append(filtered_tokens)

# Build the dictionary
dictionary = corpora.Dictionary(texts)

# Build the corpus
corpus = [dictionary.doc2bow(text) for text in texts]

# Train the LDA model
lda_model = models.LdaModel(corpus=corpus,
                            id2word=dictionary,
                            num_topics=2,  # number of topics
                            random_state=100,
                            update_every=1,
                            chunksize=100,
                            passes=10,
                            alpha='auto',
                            per_word_topics=True)

# Print the keywords of each topic
for idx, topic in lda_model.print_topics(-1):
    print('Topic: {} \nWords: {}'.format(idx, topic))
```
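To make the generative story concrete ("the article selects a topic, the topic selects a word"), here is a minimal sketch of sampling one toy document from the two multinomial layers. The two topics, the four-word vocabulary, and all probabilities below are invented for illustration; they are not part of the article or of any fitted model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 2 topics over a 4-word vocabulary
vocab = ["brocolli", "health", "drive", "baseball"]
alpha = [0.5, 0.5]                        # Dirichlet prior over topics per document
beta = np.array([[0.6, 0.3, 0.05, 0.05],  # topic 0: food/health words
                 [0.05, 0.1, 0.45, 0.4]]) # topic 1: driving/sports words

# Document -> topic distribution (multinomial parameters drawn from the Dirichlet)
theta = rng.dirichlet(alpha)

# Generate each word: the document picks a topic, the topic picks a word
doc = []
for _ in range(8):
    z = rng.choice(2, p=theta)   # document selects a topic with probability theta
    w = rng.choice(4, p=beta[z]) # topic selects a word with probability beta[z]
    doc.append(vocab[w])

print(theta, doc)
```

Training an LDA model, as `gensim` does above, runs this story in reverse: given only the documents, it infers the `theta` and `beta` distributions.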