【NLP】Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) is a topic model used to discover the hidden topics in text data. Through random assignment followed by iterative refinement, LDA estimates each document's distribution over topics and each topic's distribution over words, ultimately producing useful topic features.


1. How the LDA Topic Model Works

Intro:

Consider the following sentences:

I like to eat broccoli and bananas.
I ate a banana and spinach smoothie for breakfast.
Chinchillas and kittens are cute.
My sister adopted a kitten yesterday.
Look at this cute hamster munching on a piece of broccoli.

Q: What is latent Dirichlet allocation?
A: It automatically discovers which topic each sentence belongs to.
For example, suppose the sentences above belong to two topics, A and B. LDA would report something like:

第1、2句: 100% Topic A
第3、4句: 100% Topic B
第5句: 60% Topic A, 40% Topic B
Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (Topic A can be interpreted as being about food)
Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, … (Topic B can be interpreted as being about cute animals)

Steps:

Suppose you have a set of documents and a fixed number of topics K, and you want LDA to learn each document's topic mixture and the words that make up each topic:

  • Go through each document, and randomly assign each word in the document to one of the K topics.
  • Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).
  • So to improve on them, for each document d…
    • Go through each word w in d…
      • And for each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t, over all documents, that come from this word w. Then reassign w a new topic, choosing topic t with probability p(topic t | document d) * p(word w | topic t). (According to our generative model, this is essentially the probability that topic t generated word w, so it makes sense to resample the current word’s topic with this probability. This glosses over a couple of details, in particular the use of priors/pseudocounts in these probabilities.)
      • In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.
  • After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).
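The steps above can be sketched as a collapsed Gibbs sampler. The toy corpus below reuses the five example sentences; K, the number of iterations, and the Dirichlet pseudocounts `alpha` and `beta` are illustrative assumptions, not values from the text.

```python
# Minimal collapsed Gibbs sampling sketch for LDA (illustrative, not production code).
import random
random.seed(0)

docs = [
    "i like to eat broccoli and bananas".split(),
    "i ate a banana and spinach smoothie for breakfast".split(),
    "chinchillas and kittens are cute".split(),
    "my sister adopted a kitten yesterday".split(),
    "look at this cute hamster munching on a piece of broccoli".split(),
]
K = 2                    # number of topics (assumed)
alpha, beta = 0.1, 0.01  # Dirichlet priors / pseudocounts (assumed)
vocab = sorted({w for d in docs for w in d})
V = len(vocab)

# Step 1: randomly assign each word in each document to one of the K topics.
z = [[random.randrange(K) for _ in d] for d in docs]

# Step 2: the counts implied by this random assignment already give (rough)
# topic representations of documents and word distributions of topics.
ndk = [[0] * K for _ in docs]                     # words in doc d assigned to topic k
nkw = [{w: 0 for w in vocab} for _ in range(K)]   # times word w is assigned to topic k
nk = [0] * K                                      # total words assigned to topic k
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        t = z[d][i]
        ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

# Step 3: repeatedly resample each word's topic, holding all other assignments fixed.
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            ndk[d][t] -= 1; nkw[t][w] -= 1; nk[t] -= 1   # remove current assignment
            # weight for topic k ∝ p(topic k | doc d) * p(word w | topic k), with priors
            weights = [(ndk[d][k] + alpha) * (nkw[k][w] + beta) / (nk[k] + V * beta)
                       for k in range(K)]
            t = random.choices(range(K), weights=weights)[0]
            z[d][i] = t
            ndk[d][t] += 1; nkw[t][w] += 1; nk[t] += 1

# Step 4: estimate each document's topic mixture from the final assignments.
theta = [[(ndk[d][k] + alpha) / (len(docs[d]) + K * alpha) for k in range(K)]
         for d in range(len(docs))]
for d, mix in enumerate(theta):
    print(f"doc {d}: " + ", ".join(f"topic {k}: {p:.2f}" for k, p in enumerate(mix)))
```

Each row of `theta` is one document's topic mixture and sums to 1 by construction; with only five tiny documents the learned topics will be noisy, but the structure of the algorithm matches the steps above.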

2. LDA 参数学习

See the sklearn official documentation.

3. Generating Topic Features with LDA

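One common pattern is to use a fitted model's `transform()` to turn each document into a dense K-dimensional topic-mixture vector, usable as features for a downstream model. The corpus and `n_components=2` below are illustrative assumptions.

```python
# Turn documents into K-dimensional topic-distribution features (illustrative sketch).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "I like to eat broccoli and bananas.",
    "I ate a banana and spinach smoothie for breakfast.",
    "Chinchillas and kittens are cute.",
    "My sister adopted a kitten yesterday.",
    "Look at this cute hamster munching on a piece of broccoli.",
]

X = CountVectorizer().fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Each row is one document's topic mixture: a dense feature vector summing to 1.
features = lda.transform(X)
print(features.shape)
```

These low-dimensional topic features can then replace (or complement) the raw high-dimensional word counts when training a classifier.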
