Naive Bayes Theorem And Application - Application
The Naive Bayes classifier can be applied to text classification tasks, such as classifying emails as spam or not spam, categorizing news articles, and detecting sentiment.
Autoclass: Naive Bayes for clustering
Autoclass is just like Naive Bayes, but it is designed for unsupervised learning. We are given unlabeled training data D1, ..., DN, where Di = ⟨xi1, ..., xik⟩ and k is the number of attributes of instance i; there are no class labels such as win or fail. The goal is to learn a Naive Bayes model. We introduce two quantities: P(C), the probability of class C, and P(Xi|C), the probability of attribute Xi given class C.
To solve this problem, we use maximum likelihood estimation, just like the Naive Bayes model discussed in Naive Bayes Theorem And Application - Theorem.
Parameters:
1. θC = P(C=T)
2. 1−θC = P(C=F)
3. P(Xi=T|C=T) = θTi
4. P(Xi=F|C=T) = 1−θTi
5. P(Xi=T|C=F) = θFi
6. P(Xi=F|C=F) = 1−θFi
7. θ = ⟨θC, θT1, ..., θTn, θF1, ..., θFn⟩
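As a quick illustration, the parameter vector can be stored in a small data structure. This is a minimal sketch in Python; the names NBParams, theta_C, theta_T, and theta_F are mine, not notation from the notes, and the starting values are arbitrary.

```python
# Minimal sketch of the parameter vector theta for a two-class Naive Bayes
# model over n binary attributes (names are illustrative, not from the notes).
from dataclasses import dataclass
from typing import List

@dataclass
class NBParams:
    theta_C: float        # P(C = T)
    theta_T: List[float]  # theta_T[i] = P(X_i = T | C = T)
    theta_F: List[float]  # theta_F[i] = P(X_i = T | C = F)

# An arbitrary starting point for a model with n = 2 attributes:
theta = NBParams(theta_C=0.6, theta_T=[0.7, 0.4], theta_F=[0.2, 0.5])
```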
The approach to this problem is to find the θ that maximizes the likelihood of the data, P(D|θ) = ∏j P(Dj|θ), where each factor P(Dj|θ) sums over both possible values of the hidden class.
EM (Expectation Maximization)
The expectation–maximization algorithm is an iterative method for finding maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. (Wikipedia)
The problem now is that the data is not fully observed (there are no labels).
1. If we knew the sufficient statistics of the data, we could choose parameter values that maximize the likelihood, just as discussed in the theorem essay.
2. If we knew the model parameters, we could compute a probability distribution over the missing values (here, the class labels). From these, we get the expected sufficient statistics.
Expected sufficient statistics
From the observed data and the model parameters, we get the probability of every possible completion of the data (i.e., every possible guess of the labels). Each completion then defines sufficient statistics (from which we can find the θ that maximizes the likelihood). The expected sufficient statistics are the expectation, taken over all possible completions, of the sufficient statistics of each completion.
Now we can give the general form of the EM algorithm:
Repeat:
    θold = θ
    E-step (Expectation): compute the expected sufficient statistics.
    M-step (Maximization): choose θ so as to maximize the likelihood given the expected sufficient statistics.
Until θ is close to θold.
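To make the loop concrete, here is a minimal sketch of that general form in Python. The callables e_step and m_step are placeholders for the model-specific computations described in the rest of this essay, θ is assumed to be a flat tuple of probabilities, and the tolerance is an arbitrary choice.

```python
# Generic EM loop sketch: alternate E- and M-steps until theta stops changing.
# e_step(theta, data) should return expected sufficient statistics;
# m_step(stats, data) should return the parameters maximizing the likelihood.
def run_em(theta, data, e_step, m_step, tol=1e-6, max_iters=1000):
    for _ in range(max_iters):
        theta_old = theta
        stats = e_step(theta, data)   # E-step: expected sufficient statistics
        theta = m_step(stats, data)   # M-step: maximize the expected likelihood
        if all(abs(a - b) < tol for a, b in zip(theta, theta_old)):
            break                     # theta is close to theta_old
    return theta
```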
Example: E-step
Suppose the dataset consists of two instances with two binary attributes each: D = [[F, T], [T, T]].
A completion adds a class label to every instance; in other words, it appends a label vector to the data matrix. Since each of the two instances can be labeled T or F, there are four possible completions of D: Completion1, Completion2, Completion3, and Completion4.
Example: M-step
The probability of Completion1 is proportional to the product of the probabilities of all its entries under the current parameters:
P(Completion1)∝0.3∗0.7∗0.2∗0.3∗0.3∗0.2=0.000756
With the same procedure, we can get:
P(Completion2)∝0.3∗0.7∗0.2∗0.7∗0.9∗0.6=0.015876
P(Completion3)∝0.7∗0.1∗0.6∗0.3∗0.3∗0.2=0.000756
P(Completion4)∝0.7∗0.1∗0.6∗0.7∗0.9∗0.6=0.015876
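The four products above can be reproduced with a short script. Note that the dataset shape and the parameter values used below (θC = 0.7, θT = (0.9, 0.6), θF = (0.3, 0.2)) are reconstructed from the factors in the products, so treat them as assumptions rather than values stated explicitly in the surviving text.

```python
from itertools import product

# Reconstructed (assumed) initial parameters:
theta_C = 0.7           # P(C = T)
theta_T = [0.9, 0.6]    # P(X_i = T | C = T)
theta_F = [0.3, 0.2]    # P(X_i = T | C = F)

# Assumed dataset: two instances with two binary attributes, D = [[F, T], [T, T]],
# with T encoded as 1 and F as 0.
D = [[0, 1], [1, 1]]

def joint(x, c):
    """P(C = c, X = x) under the current parameters (c = 1 means class T)."""
    p = theta_C if c == 1 else 1 - theta_C
    cond = theta_T if c == 1 else theta_F
    for xi, t in zip(x, cond):
        p *= t if xi == 1 else 1 - t
    return p

# Enumerate all 2^N label completions and print each unnormalized probability.
# The order (F,F), (F,T), (T,F), (T,T) corresponds to Completion1..Completion4.
for labels in product([0, 1], repeat=len(D)):
    p = 1.0
    for x, c in zip(D, labels):
        p *= joint(x, c)
    print(labels, round(p, 6))   # 0.000756, 0.015876, 0.000756, 0.015876
```

Normalizing these four values gives the probability of each completion, which is what weights its sufficient statistics.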
Iteration with EM steps
Now one iteration is finished: we "guessed" the completions (that is, the labels) and maximized the parameters θ with respect to them. We can now start a new iteration with the new θ. The explanation of the expectation symbols is given in Naive Bayes Theorem And Application - Theorem.
Now we can compute the maximum likelihood estimates (the 2nd M-step) again:
EM algorithm in Naive Bayes
In fact, the number of completions is exponential in the number of instances.
Key observation
We don't care about the exact completions, only the expected sufficient statistics, and each instance contributes separately to the expected sufficient statistics. Therefore we can:
1. enumerate the completions of each instance separately;
2. compute the probability of each completion;
3. compute the expected contribution of that instance to the sufficient statistics.
E-step for Naive Bayes:
Expectation according to the initial parameters θ:
1. E[NT] is the expected number of instances in which the class is T.
2. Each instance has some probability of its class being T.
3. Each instance contributes that probability to E[NT].
4. In symbols: E[NT] = Σj P(C = T | Dj, θ).
5. E[NTi,T] is the expected number of times the class is T when Xi is T. If an instance has Xi ≠ T, it contributes 0 to E[NTi,T].
6. If an instance has Xi = T, it contributes the probability that its class is T to E[NTi,T].
7. In symbols: E[NTi,T] = Σ_{j : xij = T} P(C = T | Dj, θ).
M-step for Naive Bayes:
Maximize the likelihood according to the expected sufficient statistics:
1. θC = E[NT] / N, the expected fraction of instances whose class is T.
2. θTi = E[NTi,T] / E[NT], the expected fraction of class-T instances in which Xi is T.
3. θFi = E[NFi,T] / (N − E[NT]), the expected fraction of class-F instances in which Xi is T.
These are the same maximum likelihood estimates as in the supervised case, with expected counts in place of observed counts.
For notational convenience, we encode T as 1 and F as 0. Then, for an instance Xj, the probability of attribute i taking value xij can be written compactly as θTi^xij · (1 − θTi)^(1 − xij) under class T, and analogously with θFi under class F.
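With that encoding, the per-instance quantity needed in the E-step, the posterior probability that an instance's class is T, can be written as a short function. This is a sketch in my own notation; the values in the final comment use the reconstructed parameters from the completion example above.

```python
def posterior_T(x, theta_C, theta_T, theta_F):
    """q = P(C = T | x, theta) for one instance x, with T encoded as 1 and F as 0."""
    p_T = theta_C
    p_F = 1 - theta_C
    for xi, tT, tF in zip(x, theta_T, theta_F):
        # theta**x * (1 - theta)**(1 - x) selects the right factor for x in {0, 1}
        p_T *= tT ** xi * (1 - tT) ** (1 - xi)
        p_F *= tF ** xi * (1 - tF) ** (1 - xi)
    return p_T / (p_T + p_F)

# With the assumed parameters: posterior_T([1, 1], 0.7, [0.9, 0.6], [0.3, 0.2]) ≈ 0.95
```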
Autoclass
Set θC, θTi, and θFi to arbitrary values for all attributes. Then repeat the two EM steps until convergence:
1. Expectation step
2. Maximization step
In the expectation step:
For each instance Dj:
    q = P(C = T | Dj, θ)    (the probability that the class of Dj is T under the current parameters)
    E[NT] += q
    for each attribute i:
        if xij == T:
            E[NTi,T] += q
            E[NFi,T] += (1 − q)
In the maximization step:
θC = E[NT] / N
For each attribute i: θTi = E[NTi,T] / E[NT] and θFi = E[NFi,T] / (N − E[NT])
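Putting the two steps together, a complete Autoclass-style EM routine for binary data might look like the sketch below. It follows the pseudocode above and reuses the posterior_T helper sketched earlier; the function name and the fixed iteration count are my own choices, and there is no smoothing of zero counts.

```python
def autoclass_em(D, theta_C, theta_T, theta_F, iters=50):
    """EM for a two-class Naive Bayes model over binary data.
    D is a list of instances, each a list of 0/1 attribute values (T = 1, F = 0)."""
    N = len(D)
    n = len(D[0])
    for _ in range(iters):
        # Expectation step: accumulate expected sufficient statistics.
        E_NT = 0.0               # expected number of class-T instances
        E_NTi_T = [0.0] * n      # expected count of instances with X_i = T and class T
        E_NFi_T = [0.0] * n      # expected count of instances with X_i = T and class F
        for x in D:
            q = posterior_T(x, theta_C, theta_T, theta_F)  # P(C = T | x, theta)
            E_NT += q
            for i, xi in enumerate(x):
                if xi == 1:
                    E_NTi_T[i] += q
                    E_NFi_T[i] += 1 - q
        # Maximization step: maximize the likelihood given the expected counts.
        theta_C = E_NT / N
        theta_T = [E_NTi_T[i] / E_NT for i in range(n)]
        theta_F = [E_NFi_T[i] / (N - E_NT) for i in range(n)]
    return theta_C, theta_T, theta_F
```

Running it on the reconstructed two-instance example with the assumed initial parameters gives a first-iteration θC close to the value shown in the worked example below.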
Example of Autoclass
E-step: we are given an initial "guess" of the parameters θ, the dataset D, and the initial expected sufficient statistics (all counts starting at 0).
Details of the EM steps in Autoclass
The 1st E-step, for instance 1:
After the loop over instance 1, the expected counts are:
For instance 2:
After the loop over instance 2, the expected counts are:
In the first M-step, we maximize the parameters according to the expected counts:
θC = E[NT] / N = 1.45 / 2 ≈ 0.72
For attribute 1:
For attribute 2:
Convergence
EM improves the likelihood on every iteration, and it is guaranteed to converge to a maximum of the likelihood function, but that maximum may only be a local one. A practical tip when using EM: don't start it with symmetric parameter values, and in particular don't start with uniform values.
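One common way to follow that tip is to break symmetry with a small amount of randomness in the initial parameters. The sketch below is a generic heuristic, not a procedure from the original notes; the ranges are arbitrary.

```python
import random

def random_init(n, seed=None):
    """Asymmetric random starting point for a two-class Naive Bayes model with
    n binary attributes, so that EM does not start from a symmetric/uniform point."""
    rng = random.Random(seed)
    theta_C = rng.uniform(0.3, 0.7)
    theta_T = [rng.uniform(0.2, 0.8) for _ in range(n)]
    theta_F = [rng.uniform(0.2, 0.8) for _ in range(n)]
    return theta_C, theta_T, theta_F
```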
Reference
Most of the content in this essay comes from CMU machine learning course notes; unfortunately, I have forgotten the source link. Sorry!