LDA Notes

License: CC 4.0 BY-SA

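These notes walk through the LDA model: the Dirichlet priors on the document-topic distribution θ and the topic-word distribution ϕ, and the multinomial distributions of the topic assignments Z and the words W. The focus is on using collapsed Gibbs sampling to improve the topic assignments, with a small example illustrating the properties of the multinomial distribution. The notes also explain how the hyperparameters α and β affect the topic and word distributions.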

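$\theta$ : document-topic distribution, $\theta_m \sim \mathrm{Dirichlet}(\alpha)$ , where $\alpha$ is the hyperparameter of the Dirichlet distribution, a K-dimensional vector (K is the number of topics)
$\phi$ : topic-word distribution, $\phi_z \sim \mathrm{Dirichlet}(\beta)$ , where $\beta$ is the hyperparameter of the Dirichlet distribution, a V-dimensional vector (V is the vocabulary size)
Z: the topic of each word, $z_{mn} \sim \mathrm{Multi}(\theta_m)$
W: the words, $w_{mn} \sim \mathrm{Multi}(\phi_{z_{mn}})$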

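LDA assumes that a document is generated as follows:

1. Randomly choose a topic distribution
2. For each word in the document:
- Randomly choose a topic from the topic distribution
- Randomly choose a word from that topic

Note: step 1 requires a distribution over distributions, which is the role of the Dirichlet prior.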
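
The generative process above can be sketched in a few lines of NumPy. This is only an illustration under assumed toy values: K, V, N and the symmetric hyperparameters 0.5 are not from the notes, and the variable names are chosen here for readability.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

K, V, N = 2, 5, 8          # topics, vocabulary size, words in the document (toy values)
alpha = np.full(K, 0.5)    # Dirichlet hyperparameter for theta (assumed value)
beta = np.full(V, 0.5)     # Dirichlet hyperparameter for phi (assumed value)

# One topic-word distribution per topic: phi[k] ~ Dirichlet(beta)
phi = rng.dirichlet(beta, size=K)      # shape (K, V), each row sums to 1

# Step 1: draw the document's topic distribution, theta ~ Dirichlet(alpha)
theta = rng.dirichlet(alpha)           # shape (K,)

# Step 2: for each word position, draw a topic, then a word from that topic
z = rng.choice(K, size=N, p=theta)                   # topic of each word
w = np.array([rng.choice(V, p=phi[k]) for k in z])   # word id of each word

print("theta:", theta)
print("topics:", z)
print("words:", w)
```

Each draw of `theta` is itself a probability vector, which is what "a distribution over distributions" means in practice.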

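The goal of LDA is to estimate $\theta$ and $\phi$ , that is, which words are important for which topic and which topics are important for a given document.
There are two ways to infer the topic and word distributions of an LDA model: Gibbs sampling, and an EM algorithm based on variational inference.
$\alpha$ : the higher the value, the more likely each document is to contain a mixture of most of the topics rather than a single topic.
$\beta$ : the higher the value, the more likely each topic is to contain a mixture of most of the words rather than a few specific words.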
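
The effect of the hyperparameter is easy to see by sampling. The sketch below compares a small and a large symmetric $\alpha$ and measures how much weight the dominant topic receives on average; the values 0.1 and 10.0, the topic count, and the helper name `mean_max_weight` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5  # number of topics (toy value)

def mean_max_weight(alpha_value, n_samples=2000):
    """Average size of the largest component of a symmetric Dirichlet sample."""
    samples = rng.dirichlet(np.full(K, alpha_value), size=n_samples)
    return samples.max(axis=1).mean()

sparse = mean_max_weight(0.1)   # small alpha: mass concentrates on few topics
smooth = mean_max_weight(10.0)  # large alpha: mass spreads over most topics
print(sparse, smooth)
```

With a small value the largest component dominates each sample, while with a large value every component stays near 1/K. The same reasoning applies to $\beta$ and the words within a topic.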

Collapsed Gibbs sampling

Tokenization, stop-word removal and stemming are omitted here. The example uses a corpus of 8 short documents.

raw_docs=["eat turkey on turkey day holiday",
    "i like to eat cake on holiday",
    "turkey trot race on thanksgiving holiday",
    "snail race the turtle",
    "time travel space race",
    "movie on thanksgiving",
    "movie at air and space museum is cool movie",
    "aspiring movie star"]
raw_docs_=[doc.split() for doc in raw_docs]

Build the dictionary, assign an id to every word, and replace the words in each document with their ids:

from gensim import corpora
dictionary=corpora.Dictionary(raw_docs_)
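#dictionary assigns a unique integer id to every word that occurs in the corpus; building it also collects word counts and other related statistics.
#dictionary.token2id is a dict of the form {word: word id}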
docs=[[dictionary.token2id[word] for word in doc] for doc in raw_docs_]
"""
Output of docs:
[2, 1, 3, 1, 0, 4]
[6, 8, 5, 2, 7, 3, 4]
[1, 9, 10, 3, 11, 4]
[12, 10, 13, 14]
[17, 16, 15, 10]
[18, 3, 11]
[18, 20, 19, 22, 15, 23, 24, 21, 18]
[25, 18, 26]
"""

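Initialize the topic assignment list: give every word of every document a random topic
Word-topic matrix: K $\times$ V, the number of times each word is assigned to each topic
Document-topic matrix: M $\times$ K, the number of words in each document assigned to each topic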

import numpy as np
M=len(docs) #number of documents
K=2 #number of topics
V=len(dictionary.token2id) #vocabulary size

#@wt: the count of each word being assigned to each topic
wt=np.zeros((K,V))

#@dt: the number of words assigned to each topic for each document
dt=np.zeros((M,K))

#@ta: topic assignment list
ta=[]
for di in range(M):
    ta_di=[]
    for w in docs[di]:
        t=np.random.randint(0,K)
        ta_di.append(t)

        wt[t][w] = wt[t][w]+1
        dt[di][t] = dt[di][t]+1
    ta.append(ta_di)
"""
Output of ta:
[1, 0, 0, 0, 1, 0]
[1, 0, 0, 0, 0, 0, 0]
[0, 1, 1, 1, 0, 0]
[0, 1, 1, 0]
[1, 1, 0, 0]
[1, 0, 0]
[0, 0, 1, 1, 1, 1, 1, 1, 0]
[0, 0, 1]
Output of dt:
array([[ 4.,  2.],
       [ 6.,  1.],
       [ 3.,  3.],
       [ 2.,  2.],
       [ 2.,  2.],
       [ 2.,  1.],
       [ 3.,  6.],
       [ 2.,  1.]])
Output of wt:
array([[ 0.,  3.,  3.,  3.,  1.,  1.,  1.,  1.,  0.,  1.,  2.,  0.,  0.,
         1.,  1.,  0.,  0.,  1.,  3.,  0.,  0.,  0.,  0.,  0.,  1.,  0.,
         1.],
       [ 1.,  0.,  1.,  0.,  1.,  0.,  0.,  0.,  1.,  2.,  0.,  1.,  1.,
         0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  0.,  1.,
         0.]])
"""

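This random assignment already gives a topic representation for each document and a word distribution for each topic, but it is clearly not a good one. Gibbs sampling is therefore used to improve the assignment.
For each document, go through every word w and assign it a new topic, drawn according to the following formula:
$$P(z_i=j \mid z_{-i}, w_i, d_i) \propto \frac{C_{w_ij}^{WT}+\eta}{\sum_{w=1}^WC_{wj}^{WT}+W\eta} \cdot \frac{C_{d_ij}^{DT}+\alpha}{\sum_{t=1}^TC_{d_it}^{DT}+T\alpha}$$
Left-hand side:
$P(z_i=j)$ : the probability that token i is assigned to topic j
$z_{-i}$ : the topic assignments of all other tokens
$w_i$ : the word id of token i
$d_i$ : the document that contains token i

Right-hand side:
W: the vocabulary size (number of distinct words in the corpus), the same as V defined earlier
T: the number of topics, the same as K defined earlier
$C^{WT}$ : wt, so $C_{wj}^{WT}$ is the number of times word w is assigned to topic j
$\sum_{w=1}^WC_{wj}^{WT}$ : the total number of tokens assigned to topic j
$C^{DT}$ : dt, so $C_{d_ij}^{DT}$ is the number of tokens in document $d_i$ assigned to topic j
$\sum_{t=1}^TC_{d_it}^{DT}$ : the total number of tokens in document $d_i$
$\alpha$ : the parameter that controls the topic distribution of the documents
$\eta$ , i.e. $\beta$ : the parameter that controls the word distribution of the topics
All counts exclude the current token i.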
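
As a worked example, the sampling probability (a word-side ratio times a document-side ratio, then normalized over topics) can be evaluated on a tiny count matrix. This is a sketch under assumptions: the function name `topic_conditional` is hypothetical, the counts are made up, and alpha = eta = 1.0 is chosen only to keep the arithmetic simple.

```python
import numpy as np

def topic_conditional(wt, dt, d, w, alpha, eta):
    """Normalized P(z_i = j | ...) for a token with word id w in document d.

    wt (topics x vocabulary) and dt (documents x topics) are count matrices
    that already exclude the token being resampled.
    """
    T, W = wt.shape
    word_part = (wt[:, w] + eta) / (wt.sum(axis=1) + W * eta)
    doc_part = (dt[d] + alpha) / (dt[d].sum() + T * alpha)
    p = word_part * doc_part
    return p / p.sum()

# Toy counts (made up for illustration): 2 topics, 3 words, 1 document
wt = np.array([[2.0, 0.0, 1.0],
               [0.0, 3.0, 1.0]])
dt = np.array([[3.0, 4.0]])

p = topic_conditional(wt, dt, d=0, w=0, alpha=1.0, eta=1.0)
print(p)  # [14/19, 5/19], roughly [0.737, 0.263]
```

The document-side denominator does not depend on j, so leaving it out does not change the normalized probabilities. Here the word-side evidence (word 0 has only ever been seen in topic 0) outweighs the document's slight preference for topic 1.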

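alpha=1.0 #assumed value; the original does not give one
eta=0.001 #assumed value; the original does not give one
n_iter=100 #number of sweeps over the corpus (assumed)
for it in range(n_iter):
    for m,doc in enumerate(docs):
        for n,w in enumerate(doc):
            z=ta[m][n]
            #z_-i: exclude the token being sampled from the wt and dt count matrices
            dt[m][z]-=1
            wt[z][w]-=1
            #sampling; the document-side denominator is the same for every topic, so it is dropped
            left=(wt[:,w]+eta)/(np.sum(wt,axis=1)+V*eta)
            right=(dt[m]+alpha)
            p_z=left*right
            n_z=np.random.multinomial(1,p_z/p_z.sum()).argmax()
            #store the newly sampled topic n_z
            dt[m][n_z]+=1
            wt[n_z][w]+=1
            ta[m][n]=n_z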

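After the specified number of iterations, the two distributions are estimated from the final counts:
$$\phi_{ij}=\frac{C_{ij}^{WT}+\eta}{\sum_{k=1}^WC_{kj}^{WT}+W\eta}$$
$\phi_{ij}$ is the probability of word i in topic j

$$\theta_{dj}=\frac{C_{dj}^{DT}+\alpha}{\sum_{k=1}^TC_{dk}^{DT}+T\alpha}$$
$\theta_{dj}$ is the proportion of topic j in document d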
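
Both estimates are smoothed, normalized versions of the count matrices, so they take only a few lines to compute. The sketch below assumes the layout used in these notes (wt is topics by words, dt is documents by topics); the function name `estimate_phi_theta` and the toy counts are made up for illustration.

```python
import numpy as np

def estimate_phi_theta(wt, dt, alpha, eta):
    """Smoothed point estimates from the final count matrices."""
    T, W = wt.shape
    # phi[j, i]: probability of word i under topic j (stored topic-major, like wt)
    phi = (wt + eta) / (wt.sum(axis=1, keepdims=True) + W * eta)
    # theta[d, j]: proportion of topic j in document d
    theta = (dt + alpha) / (dt.sum(axis=1, keepdims=True) + T * alpha)
    return phi, theta

# Toy counts (made up): 2 topics, 3 words, 2 documents
wt = np.array([[2.0, 0.0, 1.0],
               [0.0, 3.0, 1.0]])
dt = np.array([[2.0, 1.0],
               [1.0, 3.0]])

phi, theta = estimate_phi_theta(wt, dt, alpha=1.0, eta=1.0)
print(phi)    # row 0 is [1/2, 1/6, 1/3]
print(theta)  # rows are [0.6, 0.4] and [1/3, 2/3]
```

Every row of `phi` and `theta` sums to 1, and the smoothing terms keep words or topics with zero counts from getting probability 0.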

Multinomial distribution

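$$P(x_1,\dots,x_d \mid n,\theta)=\frac{n!}{x_1!\cdots x_d!}\prod_{i=1}^d\theta_i^{x_i},\qquad \sum_{i=1}^dx_i=n$$
A single trial has d possible outcomes, and outcome i occurs with probability $\theta_i$ . Over n trials, outcome i occurs $x_i$ times. The distribution of x is given by the formula above.
When n=1, the formula simplifies to:
$$P(x \mid \theta)=\prod_{i=1}^d\theta_i^{x_i}$$
where exactly one $x_i$ equals 1 and all the others are 0.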

numpy.random.multinomial(n=10, pvals=[0.2,0.4,0.4], size = 1)
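This draws one three-dimensional count vector, for example [2,7,1]: each element lies between 0 and 10, and the three elements sum to 10. With size = 1000 we get 1000 such vectors. In one run their mean was [2.013,4.058,3.929], which is close to n times pvals, [2,4,4]. In other words, the relative frequencies approach the probabilities [0.2,0.4,0.4].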
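
The experiment can be reproduced with the sketch below. It uses NumPy's `default_rng` Generator rather than the legacy `numpy.random.multinomial` function; the seed is arbitrary, so the exact numbers will differ from the ones quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, pvals = 10, [0.2, 0.4, 0.4]

samples = rng.multinomial(n, pvals, size=1000)  # shape (1000, 3)
print(samples.sum(axis=1)[:5])   # every row sums to n
print(samples.mean(axis=0))      # close to n * pvals = [2, 4, 4]
print(samples.mean(axis=0) / n)  # close to pvals

# n = 1 gives a one-hot vector; argmax turns it into a category index,
# which is how the Gibbs sampling loop draws a new topic
one_hot = rng.multinomial(1, pvals)
print(one_hot, one_hot.argmax())
```

The n = 1 case at the end is the same trick as `np.random.multinomial(1, p).argmax()` in the sampling code: a single multinomial trial is a draw from a categorical distribution.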