Latent Dirichlet Allocation Model: Study Notes, Part 1



1. Meaning of the Model

With plate notation, the dependencies among the many variables can be captured concisely. The boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. M denotes the number of documents, N the number of words in a document. Thus:

\alpha is the parameter of the Dirichlet prior on the per-document topic distributions,
\beta is the parameter of the Dirichlet prior on the per-topic word distributions,
\theta_i is the topic distribution for document i,
\phi_k is the word distribution for topic k,
z_{ij} is the topic for the j-th word in document i, and
w_{ij} is the specific word.
[Figure: plate notation for smoothed LDA]

The w_{ij} are the only observable variables; all the other variables are latent. In practice, the basic LDA model is usually extended to a smoothed version to obtain better results. The plate notation for the smoothed model is shown in the figure above, where K denotes the number of topics considered in the model and:

\phi is a K \times V matrix (V is the size of the vocabulary), each row of which is the word distribution of a topic; since each row sums to 1, it is a Markov (row-stochastic) matrix.
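To make the row-stochastic structure concrete, here is a minimal numpy sketch (the toy sizes and the symmetric prior value are assumptions for illustration, not part of the model) that draws each row of \phi from a Dirichlet, so every row is a valid word distribution:

```python
import numpy as np

K, V = 5, 100        # toy sizes: 5 topics over a 100-word vocabulary (assumed)
beta = 0.01          # symmetric Dirichlet prior on word distributions (assumed)

# Draw each of the K rows of phi from Dirichlet(beta, ..., beta) over V words.
phi = np.random.dirichlet(np.full(V, beta), size=K)   # shape (K, V)

# Each row sums to 1, i.e. phi is a Markov (row-stochastic) matrix.
assert np.allclose(phi.sum(axis=1), 1.0)
```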

The generative process behind the model is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes the following generative process for each document i in a corpus D:

1. Choose \theta_i \sim \mathrm{Dir}(\alpha), where i \in \{1,\dots,M\} and \mathrm{Dir}(\alpha) is the Dirichlet distribution with parameter \alpha.

2. Choose \phi_k \sim \mathrm{Dir}(\beta), where k \in \{1,\dots,K\}.

3. For each of the words w_{ij}, where j \in \{1,\dots,N_i\}:

(a) Choose a topic z_{i,j} \sim \mathrm{Multinomial}(\theta_i).
(b) Choose a word w_{i,j} \sim \mathrm{Multinomial}(\phi_{z_{i,j}}).

(Note that the Multinomial distribution here refers to the Multinomial with only one trial. It is formally equivalent to the categorical distribution.)

The lengths N_i are treated as independent of all the other data-generating variables (\theta and z). The subscript is often dropped, as in the plate diagrams shown here.
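As a minimal end-to-end sketch of the three steps above (the corpus sizes, document-length distribution, and hyperparameter values here are illustrative assumptions, not prescribed by the model), the whole generative process can be simulated as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, V = 3, 4, 50          # documents, topics, vocabulary size (toy values, assumed)
alpha, beta = 0.1, 0.01     # symmetric Dirichlet hyperparameters (assumed)
N = rng.poisson(20, size=M) + 1   # document lengths, independent of theta and z

# Step 2: per-topic word distributions phi_k ~ Dir(beta)
phi = rng.dirichlet(np.full(V, beta), size=K)      # shape (K, V)

corpus = []
for i in range(M):
    # Step 1: per-document topic distribution theta_i ~ Dir(alpha)
    theta = rng.dirichlet(np.full(K, alpha))
    doc = []
    for j in range(N[i]):
        z = rng.choice(K, p=theta)     # step 3(a): topic ~ Categorical(theta_i)
        w = rng.choice(V, p=phi[z])    # step 3(b): word ~ Categorical(phi_z)
        doc.append(w)
    corpus.append(doc)

print(corpus[0])   # word indices of the first (toy) document
```

With a small \alpha such as 0.1, each simulated document tends to draw its words from only a few topics, which is exactly the sparsity the Dirichlet priors are chosen to encourage (see the table below).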

2. Mathematical Definition

A formal description of smoothed LDA is as follows:

Definition of variables in the model:

| Variable | Type | Meaning |
|---|---|---|
| K | integer | number of topics (e.g. 50) |
| V | integer | number of words in the vocabulary (e.g. 50,000 or 1,000,000) |
| M | integer | number of documents |
| N_{d=1 \dots M} | integer | number of words in document d |
| N | integer | total number of words in all documents; sum of all N_d values, i.e. N = \sum_{d=1}^{M} N_d |
| \alpha_{k=1 \dots K} | positive real | prior weight of topic k in a document; usually the same for all topics; normally a number less than 1, e.g. 0.1, to prefer sparse topic distributions, i.e. few topics per document |
| \boldsymbol\alpha | K-dimensional vector of positive reals | collection of all \alpha_k values, viewed as a single vector |
| \beta_{w=1 \dots V} | positive real | prior weight of word w in a topic; usually the same for all words; normally a number much less than 1, e.g. 0.001, to strongly prefer sparse word distributions, i.e. few words per topic |
| \boldsymbol\beta | V-dimensional vector of positive reals | collection of all \beta_w values, viewed as a single vector |
| \phi_{k=1 \dots K, w=1 \dots V} | probability (real number between 0 and 1) | probability of word w occurring in topic k |
| \boldsymbol\phi_{k=1 \dots K} | V-dimensional vector of probabilities, which must sum to 1 | distribution of words in topic k |
| \theta_{d=1 \dots M, k=1 \dots K} | probability (real number between 0 and 1) | probability of topic k occurring in document d for a given word |
| \boldsymbol\theta_{d=1 \dots M} | K-dimensional vector of probabilities, which must sum to 1 | distribution of topics in document d |
| z_{d=1 \dots M, w=1 \dots N_d} | integer between 1 and K | identity of the topic of word w in document d |
| \mathbf{Z} | N-dimensional vector of integers between 1 and K | identity of the topic of all words in all documents |
| w_{d=1 \dots M, w=1 \dots N_d} | integer between 1 and V | identity of word w in document d |
| \mathbf{W} | N-dimensional vector of integers between 1 and V | identity of all words in all documents |

We can then mathematically describe the random variables as follows:

\begin{array}{lcl}
\boldsymbol\phi_{k=1 \dots K} &\sim& \operatorname{Dirichlet}_V(\boldsymbol\beta) \\
\boldsymbol\theta_{d=1 \dots M} &\sim& \operatorname{Dirichlet}_K(\boldsymbol\alpha) \\
z_{d=1 \dots M, w=1 \dots N_d} &\sim& \operatorname{Categorical}_K(\boldsymbol\theta_d) \\
w_{d=1 \dots M, w=1 \dots N_d} &\sim& \operatorname{Categorical}_V(\boldsymbol\phi_{z_{dw}})
\end{array}
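Writing out what these sampling statements imply, the joint distribution over all hidden and observed variables factorizes as (a standard step toward inference in this model):

P(\mathbf{W}, \mathbf{Z}, \boldsymbol\theta, \boldsymbol\phi ; \boldsymbol\alpha, \boldsymbol\beta) = \prod_{k=1}^{K} P(\boldsymbol\phi_k ; \boldsymbol\beta) \; \prod_{d=1}^{M} \Big( P(\boldsymbol\theta_d ; \boldsymbol\alpha) \prod_{w=1}^{N_d} P(z_{dw} \mid \boldsymbol\theta_d) \, P(w_{dw} \mid \boldsymbol\phi_{z_{dw}}) \Big)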


Source of these notes:
