Naive Bayes Theorem And Application - Application
The Naive Bayes classifier can be applied to text classification tasks, such as classifying emails as spam or not spam, categorizing news articles, and detecting sentiment.
Autoclass: Naive Bayes for clustering
Autoclass is just like Naive Bayes, but it is designed for unsupervised learning. We are given unlabeled training data D1, ..., DN, where Di = ⟨xi1, ..., xik⟩ and k is the number of attributes of instance i; there are no class labels such as win or fail. The goal is to learn a Naive Bayes model. We introduce two quantities: P(C), the probability of class C, and P(Xi|C), the probability of attribute Xi given class C.
To solve this problem, we use maximum likelihood estimation, just like the Naive Bayes model discussed in Naive Bayes Theorem And Application - Theorem.
Parameters:
1. θC = P(C=T)
2. 1−θC = P(C=F)
3. P(Xi=T|C=T) = θTi
4. P(Xi=F|C=T) = 1−θTi
5. P(Xi=T|C=F) = θFi
6. P(Xi=F|C=F) = 1−θFi
7. θ = ⟨θC, θT1, ..., θTn, θF1, ..., θFn⟩
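As a quick illustration, the parameter vector can be stored in a small data structure. This is a minimal sketch in Python; the names NBParams, theta_C, theta_T, and theta_F are mine, not notation from the notes, and the starting values are arbitrary.

```python
# Minimal sketch of the parameter vector theta for a two-class Naive Bayes
# model over n binary attributes (names are illustrative, not from the notes).
from dataclasses import dataclass
from typing import List

@dataclass
class NBParams:
    theta_C: float        # P(C = T)
    theta_T: List[float]  # theta_T[i] = P(X_i = T | C = T)
    theta_F: List[float]  # theta_F[i] = P(X_i = T | C = F)

# An arbitrary starting point for a model with n = 2 attributes:
theta = NBParams(theta_C=0.6, theta_T=[0.7, 0.4], theta_F=[0.2, 0.5])
```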
The approach to this problem is to find the θ that maximizes the likelihood of the data, P(D|θ) = ∏j P(Dj|θ), where each factor P(Dj|θ) sums over both possible values of the hidden class.
EM (Expectation Maximization)
The expectation–maximization algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. (Wikipedia)
The problem now is that the data is not fully observed (there are no labels).
1. If we knew the sufficient statistics of the data, we could choose parameter values that maximize the likelihood, just as discussed in the theorem essay.
2. If we knew the model parameters, we could compute a probability distribution over the missing values (here, the class labels). From these, we get the expected sufficient statistics.
Expected sufficient statistics
From the observed data and the model parameters, we get the probability of every possible completion of the data (i.e., every possible guess of the labels). Each completion then defines sufficient statistics (from which we can find the θ that maximizes the likelihood). The expected sufficient statistics are the expectation, taken over all possible completions, of the sufficient statistics of each completion.
Now we can give the general form of the EM algorithm:
Repeat:
    θold = θ
    E-step (Expectation): compute the expected sufficient statistics.
    M-step (Maximization): choose θ so as to maximize the likelihood given the expected sufficient statistics.
Until θ is close to θold.
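To make the loop concrete, here is a minimal sketch of that general form in Python. The callables e_step and m_step are placeholders for the model-specific computations described in the rest of this essay, θ is assumed to be a flat tuple of probabilities, and the tolerance is an arbitrary choice.

```python
# Generic EM loop sketch: alternate E- and M-steps until theta stops changing.
# e_step(theta, data) should return expected sufficient statistics;
# m_step(stats, data) should return the parameters maximizing the likelihood.
def run_em(theta, data, e_step, m_step, tol=1e-6, max_iters=1000):
    for _ in range(max_iters):
        theta_old = theta
        stats = e_step(theta, data)   # E-step: expected sufficient statistics
        theta = m_step(stats, data)   # M-step: maximize the expected likelihood
        if all(abs(a - b) < tol for a, b in zip(theta, theta_old)):
            break                     # theta is close to theta_old
    return theta
```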
Example: E-step
Suppose the dataset consists of two instances with two binary attributes each: D = [[F, T], [T, T]].
A completion adds a class label to every instance; in other words, it appends a label vector to the data matrix. Since each of the two instances can be labeled T or F, there are four possible completions of D: Completion1, Completion2, Completion3, and Completion4.
Example: M-step
The probability of Completion1 is proportional to the product of the probabilities of all its entries under the current parameters:
P(Completion1)∝0.3∗0.7∗0.2∗0.3∗0.3∗0.2=0.000756
With the same procedure, we can get:
P(Completion2)∝0.3∗0.7∗0.2∗0.7∗0.9∗0.6=0.015876
P(Completion3)∝0.7∗0.1∗0.6∗0.3∗0.3∗0.2=0.000756
P(Completion4)∝0.7∗0.1∗0.6∗0.7∗0.9∗0.6=0.015876
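The four products above can be reproduced with a short script. Note that the dataset shape and the parameter values used below (θC = 0.7, θT = (0.9, 0.6), θF = (0.3, 0.2)) are reconstructed from the factors in the products, so treat them as assumptions rather than values stated explicitly in the surviving text.

```python
from itertools import product

# Reconstructed (assumed) initial parameters:
theta_C = 0.7           # P(C = T)
theta_T = [0.9, 0.6]    # P(X_i = T | C = T)
theta_F = [0.3, 0.2]    # P(X_i = T | C = F)

# Assumed dataset: two instances with two binary attributes, D = [[F, T], [T, T]],
# with T encoded as 1 and F as 0.
D = [[0, 1], [1, 1]]

def joint(x, c):
    """P(C = c, X = x) under the current parameters (c = 1 means class T)."""
    p = theta_C if c == 1 else 1 - theta_C
    cond = theta_T if c == 1 else theta_F
    for xi, t in zip(x, cond):
        p *= t if xi == 1 else 1 - t
    return p

# Enumerate all 2^N label completions and print each unnormalized probability.
# The order (F,F), (F,T), (T,F), (T,T) corresponds to Completion1..Completion4.
for labels in product([0, 1], repeat=len(D)):
    p = 1.0
    for x, c in zip(D, labels):
        p *= joint(x, c)
    print(labels, round(p, 6))   # 0.000756, 0.015876, 0.000756, 0.015876
```

Normalizing these four values gives the probability of each completion, which is what weights its sufficient statistics.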
Iteration with EM steps
Now one iteration is finished: we "guessed" the completions (that is, the labels) and maximized the parameters θ with respect to them. We can now start a new iteration with the new θ. The explanation of the expectation symbols is given in Naive Bayes Theorem And Application - Theorem.
Now we can compute the maximum likelihood estimates (the 2nd M-step) again:
EM algorithm in Naive Bayes
In fact, the number of completions is exponential in the number of instances.
Key observation
We don't care about the exact completions, only the expected sufficient statistics, and each instance contributes separately to the expected sufficient statistics. Therefore we can:
1. enumerate the completions of each instance separately;
2. compute the probability of each completion;
3. compute the expected contribution of that instance to the sufficient statistics.
E-step for Naive Bayes:
Expectation according to the initial parameters θ:
1. E[NT] is the expected number of instances in which the class is T.
2. Each instance has some probability of its class being T.
3. Each instance contributes that probability to E[NT].
4. In symbols: E[NT] = Σj P(C = T | Dj, θ).
5. E[NTi,T] is the expected number of times the class is T when Xi is T. If an instance has Xi ≠ T, it contributes 0 to E[NTi,T].
6. If an instance has Xi = T, it contributes the probability that its class is T to E[NTi,T].
7. In symbols: E[NTi,T] = Σ_{j : xij = T} P(C = T | Dj, θ).
M-step for Naive Bayes:
Maximize the likelihood according to the expected sufficient statistics:
1. θC = E[NT] / N, the expected fraction of instances whose class is T.
2. θTi = E[NTi,T] / E[NT], the expected fraction of class-T instances in which Xi is T.
3. θFi = E[NFi,T] / (N − E[NT]), the expected fraction of class-F instances in which Xi is T.
These are the same maximum likelihood estimates as in the supervised case, with expected counts in place of observed counts.
For notational convenience, we encode T as 1 and F as 0. Then, for an instance Xj, the probability of attribute i taking value xij can be written compactly as θTi^xij · (1 − θTi)^(1 − xij) under class T, and analogously with θFi under class F.
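With that encoding, the per-instance quantity needed in the E-step, the posterior probability that an instance's class is T, can be written as a short function. This is a sketch in my own notation; the values in the final comment use the reconstructed parameters from the completion example above.

```python
def posterior_T(x, theta_C, theta_T, theta_F):
    """q = P(C = T | x, theta) for one instance x, with T encoded as 1 and F as 0."""
    p_T = theta_C
    p_F = 1 - theta_C
    for xi, tT, tF in zip(x, theta_T, theta_F):
        # theta**x * (1 - theta)**(1 - x) selects the right factor for x in {0, 1}
        p_T *= tT ** xi * (1 - tT) ** (1 - xi)
        p_F *= tF ** xi * (1 - tF) ** (1 - xi)
    return p_T / (p_T + p_F)

# With the assumed parameters: posterior_T([1, 1], 0.7, [0.9, 0.6], [0.3, 0.2]) ≈ 0.95
```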
Autoclass
Set θC, θTi, and θFi to arbitrary values for all attributes. Then repeat the two EM steps until convergence:
1. Expectation step
2. Maximization step
In the expectation step:
For each instance Dj:
    q = P(C = T | Dj, θ)    (the probability that the class of Dj is T under the current parameters)
    E[NT] += q
    for each attribute i:
        if xij == T:
            E[NTi,T] += q
            E[NFi,T] += (1 − q)
In the maximization step:
θC = E[NT] / N
For each attribute i: θTi = E[NTi,T] / E[NT] and θFi = E[NFi,T] / (N − E[NT])
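Putting the two steps together, a complete Autoclass-style EM routine for binary data might look like the sketch below. It follows the pseudocode above and reuses the posterior_T helper sketched earlier; the function name and the fixed iteration count are my own choices, and there is no smoothing of zero counts.

```python
def autoclass_em(D, theta_C, theta_T, theta_F, iters=50):
    """EM for a two-class Naive Bayes model over binary data.
    D is a list of instances, each a list of 0/1 attribute values (T = 1, F = 0)."""
    N = len(D)
    n = len(D[0])
    for _ in range(iters):
        # Expectation step: accumulate expected sufficient statistics.
        E_NT = 0.0               # expected number of class-T instances
        E_NTi_T = [0.0] * n      # expected count of instances with X_i = T and class T
        E_NFi_T = [0.0] * n      # expected count of instances with X_i = T and class F
        for x in D:
            q = posterior_T(x, theta_C, theta_T, theta_F)  # P(C = T | x, theta)
            E_NT += q
            for i, xi in enumerate(x):
                if xi == 1:
                    E_NTi_T[i] += q
                    E_NFi_T[i] += 1 - q
        # Maximization step: maximize the likelihood given the expected counts.
        theta_C = E_NT / N
        theta_T = [E_NTi_T[i] / E_NT for i in range(n)]
        theta_F = [E_NFi_T[i] / (N - E_NT) for i in range(n)]
    return theta_C, theta_T, theta_F
```

Running it on the reconstructed two-instance example with the assumed initial parameters gives a first-iteration θC close to the value shown in the worked example below.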
Example of Autoclass
E-step: we are given an initial "guess" of the parameters θ, the dataset D, and the initial expected sufficient statistics (all counts starting at 0).
Details of the EM steps in Autoclass
The 1st E-step, for instance 1:
After the loop over instance 1, the expected counts are:
For instance 2:
After the loop over instance 2, the expected counts are:
In the first M-step, we maximize the parameters according to the expected counts:
θC = E[NT] / N = 1.45 / 2 ≈ 0.72
For attribute 1:
For attribute 2:
Convergence
EM improves the likelihood on every iteration, and it is guaranteed to converge to a maximum of the likelihood function, but that maximum may only be a local one. A practical tip when using EM: don't start it with symmetric parameter values, and in particular don't start with uniform values.
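One common way to follow that tip is to break symmetry with a small amount of randomness in the initial parameters. The sketch below is a generic heuristic, not a procedure from the original notes; the ranges are arbitrary.

```python
import random

def random_init(n, seed=None):
    """Asymmetric random starting point for a two-class Naive Bayes model with
    n binary attributes, so that EM does not start from a symmetric/uniform point."""
    rng = random.Random(seed)
    theta_C = rng.uniform(0.3, 0.7)
    theta_T = [rng.uniform(0.2, 0.8) for _ in range(n)]
    theta_F = [rng.uniform(0.2, 0.8) for _ in range(n)]
    return theta_C, theta_T, theta_F
```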
Reference
Most of the content in this essay comes from CMU machine learning course notes; unfortunately, I have forgotten the source link. Sorry!