1. Model interpretation
With plate notation, the dependencies among the many variables can be captured concisely. The boxes are “plates” representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. M denotes the number of documents, N the number of words in a document. Thus:
- α is the parameter of the Dirichlet prior on the per-document topic distributions,
- β is the parameter of the Dirichlet prior on the per-topic word distribution,
- θ_i is the topic distribution for document i,
- φ_k is the word distribution for topic k,
- z_{ij} is the topic for the jth word in document i, and
- w_{ij} is the specific word.
The w_{ij} are the only observable variables; all the other variables are latent. The basic LDA model is usually extended to a smoothed version to obtain better results. The plate notation is shown on the right, where K denotes the number of topics considered in the model and:
- φ is a K×V (V is the size of the vocabulary) Markov matrix, each row of which denotes the word distribution of a topic.
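The structure of φ can be illustrated with a small simulation. The sketch below (toy sizes K = 4 and V = 10, and a symmetric prior β = 0.01 chosen purely for illustration) samples each row of φ from a Dirichlet distribution and checks the "Markov matrix" property that every row sums to 1:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 4, 10   # 4 topics, 10-word vocabulary (toy sizes)
beta = 0.01    # small beta: each topic concentrates its mass on few words

# Each row phi[k] is topic k's distribution over the vocabulary.
phi = rng.dirichlet(np.full(V, beta), size=K)

print(phi.shape)                        # (4, 10): a K x V matrix
print(np.allclose(phi.sum(axis=1), 1))  # True: rows are probability distributions
```

Because β is much less than 1, most of each row's probability mass lands on a handful of words, which is the "sparse word distribution" behavior discussed in the table below.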
The generative process behind LDA is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. LDA assumes the following generative process for each document in a corpus D:
1. Choose θ_i ~ Dir(α), where i ∈ {1, …, M} and Dir(α) is the Dirichlet distribution with parameter α.
2. Choose φ_k ~ Dir(β), where k ∈ {1, …, K}.
3. For each of the word positions (i, j), where i ∈ {1, …, M} and j ∈ {1, …, N_i}:
   - (a) Choose a topic z_{ij} ~ Multinomial(θ_i).
   - (b) Choose a word w_{ij} ~ Multinomial(φ_{z_{ij}}).
(Note that the Multinomial distribution here refers to the Multinomial with only one trial. It is formally equivalent to the categorical distribution.)
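The generative process above can be sketched directly in code. The following is a minimal simulation, assuming toy corpus sizes (M = 5 documents, K = 3 topics, V = 8 vocabulary words) and illustrative hyperparameters α = 0.1, β = 0.01; a single draw from `rng.choice` with probabilities plays the role of the one-trial multinomial (i.e. categorical) distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, M = 3, 8, 5                 # topics, vocabulary size, documents (toy)
alpha, beta = 0.1, 0.01           # sparse Dirichlet hyperparameters
N = rng.integers(5, 10, size=M)   # document lengths, drawn independently

theta = rng.dirichlet(np.full(K, alpha), size=M)  # step 1: theta_i ~ Dir(alpha)
phi = rng.dirichlet(np.full(V, beta), size=K)     # step 2: phi_k ~ Dir(beta)

corpus = []
for i in range(M):
    words = []
    for j in range(N[i]):
        z = rng.choice(K, p=theta[i])  # step 3(a): z_ij ~ Categorical(theta_i)
        w = rng.choice(V, p=phi[z])    # step 3(b): w_ij ~ Categorical(phi_{z_ij})
        words.append(w)
    corpus.append(words)

print([len(doc) for doc in corpus])  # each document has its sampled length N_i
```

Running this produces M lists of word indices; real corpora are modeled the same way, just with V in the tens of thousands and the latent θ, φ, and z inferred rather than sampled.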
The lengths N_i are treated as independent of all the other data-generating variables (w and z). The subscript is often dropped, as in the plate diagrams shown here.
2. Mathematical definition
A formal description of smoothed LDA is as follows:
Variable | Type | Meaning |
---|---|---|
K | integer | number of topics (e.g. 50) |
V | integer | number of words in the vocabulary (e.g. 50,000 or 1,000,000) |
M | integer | number of documents |
N_{d=1…M} | integer | number of words in document d |
N | integer | total number of words in all documents; sum of all N_d |
α_{k=1…K} | positive real | prior weight of topic k in a document; usually the same for all topics; normally a number less than 1, e.g. 0.1, to prefer sparse topic distributions, i.e. few topics per document |
α | K-dimensional vector of positive reals | collection of all α_k values, viewed as a single vector |
β_{w=1…V} | positive real | prior weight of word w in a topic; usually the same for all words; normally a number much less than 1, e.g. 0.001, to strongly prefer sparse word distributions, i.e. few words per topic |
β | V-dimensional vector of positive reals | collection of all β_w values, viewed as a single vector |
φ_{k=1…K, w=1…V} | probability (real number between 0 and 1) | probability of word w occurring in topic k |
φ_{k=1…K} | V-dimensional vector of probabilities, which must sum to 1 | distribution of words in topic k |
θ_{d=1…M, k=1…K} | probability (real number between 0 and 1) | probability of topic k occurring in document d for a given word |
θ_{d=1…M} | K-dimensional vector of probabilities, which must sum to 1 | distribution of topics in document d |
z_{d=1…M, w=1…N_d} | integer between 1 and K | identity of topic of word w in document d |
Z | N-dimensional vector of integers between 1 and K | identity of topic of all words in all documents |
w_{d=1…M, w=1…N_d} | integer between 1 and V | identity of word w in document d |
W | N-dimensional vector of integers between 1 and V | identity of all words in all documents |
We can then mathematically describe the random variables as follows: