[Machine Learning] Generative learning algorithm-GDA and NB

This article compares generative and discriminative learning algorithms and introduces two typical generative learning algorithms in detail: Gaussian discriminant analysis and Naive Bayes. The core principles of both algorithms and their parameter estimation methods are derived through mathematical formulas.

1. Generative vs. Discriminative

Generally speaking,

  • A generative learning algorithm models how the data was generated in order to categorize a signal. It asks the question: based on my generation assumption, which category is most likely to generate this signal?
  • A discriminative algorithm does not care about how the data was generated; it simply categorizes a given signal.

From a statistical point of view, suppose you have input data x and want to classify it into a category y.

  • A generative learning algorithm learns the joint probability distribution P(x,y) and then uses Bayes' rule to transform P(x,y) into P(y|x) for classification.
  • A discriminative algorithm learns the conditional probability distribution P(y|x) directly, which is the natural distribution for classifying a given example x into a class y. This is why algorithms that model it directly are called discriminative algorithms.

2. Generative learning algorithm

The main goal of a generative learning algorithm is to learn the class-conditional distribution P(x|y). Then, combining it with the prior P(y) through Bayes' rule, we have

$$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}$$

and make predictions by

$$\hat{y} = \arg\max_y P(y \mid x) = \arg\max_y \frac{P(x \mid y)\,P(y)}{P(x)} = \arg\max_y P(x \mid y)\,P(y)$$

3. Two generative learning algorithms

3.1 Gaussian discriminant analysis

3.1.1 Core assumption of GDA

The core assumption of Gaussian discriminant analysis is that the class-conditional distribution $p(x \mid y)$ is Gaussian, which means $x$ must be continuous-valued.

Here we take the 2-class case as an example. Assume

$$x \mid y=1 \sim \mathcal{N}(\mu_1, \Sigma), \qquad x \mid y=0 \sim \mathcal{N}(\mu_0, \Sigma)$$

And set the prior distribution as

$$P(y) = \phi^{\,y}(1-\phi)^{1-y}$$
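To make the generative story concrete, here is a minimal numpy sketch that samples data exactly as the model assumes: first draw $y \sim \mathrm{Bernoulli}(\phi)$, then draw $x \mid y \sim \mathcal{N}(\mu_y, \Sigma)$. The function name `sample_gda` and the parameter values in the usage lines are illustrative, not from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gda(n, phi, mu0, mu1, Sigma):
    """Sample n points from the GDA generative story:
    y ~ Bernoulli(phi), then x | y ~ N(mu_y, Sigma)."""
    y = rng.binomial(1, phi, size=n)                  # class labels
    means = np.where((y == 1)[:, None], mu1, mu0)     # per-example mean mu_y
    X = np.array([rng.multivariate_normal(m, Sigma) for m in means])
    return X, y

# Illustrative (hypothetical) parameter values.
X, y = sample_gda(200, phi=0.4,
                  mu0=np.array([0.0, 0.0]),
                  mu1=np.array([2.0, 1.0]),
                  Sigma=np.eye(2))
```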

Now we use the maximum likelihood method to estimate the parameters $(\phi, \mu_0, \mu_1, \Sigma)$. The log-likelihood function is

$$L(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{n} P(x^{(i)}, y^{(i)}) = \log \prod_{i=1}^{n} P(x^{(i)} \mid y^{(i)})\, P(y^{(i)})$$

(For comparison, the logistic-regression log-likelihood is the conditional one: $L(\theta) = \log \prod_{i=1}^{n} P(y^{(i)} \mid x^{(i)})$.)

Here are the maximum likelihood estimates of those parameters

$$
\begin{aligned}
\phi &= \frac{\sum_{i=1}^{n} 1\{y^{(i)}=1\}}{n} \\
\mu_0 &= \frac{\sum_{i=1}^{n} 1\{y^{(i)}=0\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)}=0\}} \\
\mu_1 &= \frac{\sum_{i=1}^{n} 1\{y^{(i)}=1\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)}=1\}} \\
\Sigma &= \frac{\sum_{i=1}^{n} \left(x^{(i)} - \mu_{y^{(i)}}\right)\left(x^{(i)} - \mu_{y^{(i)}}\right)^{T}}{n}
\end{aligned}
$$

Then, using Bayes' rule, we get the classification rule

$$\hat{y} = \arg\max \left\{ P(x \mid y=1)\,P(y=1),\; P(x \mid y=0)\,P(y=0) \right\}$$
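Below is a minimal numpy sketch of the maximum likelihood estimates and the classification rule above. The names `fit_gda` and `predict_gda` are illustrative; because the two Gaussians share $\Sigma$, their common normalizing constant cancels and only the quadratic terms plus the log-priors need to be compared.

```python
import numpy as np

def fit_gda(X, y):
    """MLE for 2-class GDA with a shared covariance.
    X: (n, d) continuous features, y: (n,) labels in {0, 1}."""
    n = X.shape[0]
    phi = np.mean(y == 1)                            # estimate of P(y = 1)
    mu0 = X[y == 0].mean(axis=0)                     # class-0 mean
    mu1 = X[y == 1].mean(axis=0)                     # class-1 mean
    diff = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = diff.T @ diff / n                        # pooled covariance
    return phi, mu0, mu1, Sigma

def predict_gda(X, phi, mu0, mu1, Sigma):
    """argmax over {P(x|y=0)P(y=0), P(x|y=1)P(y=1)}, computed in log space."""
    Sigma_inv = np.linalg.inv(Sigma)

    def log_score(mu, prior):
        d = X - mu
        return -0.5 * np.sum(d @ Sigma_inv * d, axis=1) + np.log(prior)

    return (log_score(mu1, phi) > log_score(mu0, 1 - phi)).astype(int)
```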

3.1.2 GDA vs. logistic regression

We now give an alternative form of the GDA classification rule:

$$
\begin{aligned}
f(x) &= \log\frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \log\frac{P(x \mid y=1)\,P(y=1)}{P(x \mid y=0)\,P(y=0)} \\
&= \log\frac{\exp\!\left\{-\tfrac{1}{2}(x-\mu_1)^{T}\Sigma^{-1}(x-\mu_1)\right\}\phi}{\exp\!\left\{-\tfrac{1}{2}(x-\mu_0)^{T}\Sigma^{-1}(x-\mu_0)\right\}(1-\phi)} \\
&= \left(x - \frac{\mu_0+\mu_1}{2}\right)^{T}\Sigma^{-1}(\mu_1-\mu_0) + \log\phi - \log(1-\phi) \\
&= \theta^{T}x + b
\end{aligned}
$$

where $\theta = \Sigma^{-1}(\mu_1 - \mu_0)$ and

$$b = -\frac{1}{2}(\mu_0 + \mu_1)^{T}\Sigma^{-1}(\mu_1 - \mu_0) + \log\phi - \log(1-\phi).$$

So the posterior probability $P(y=1 \mid x)$ can be written as

$$P(y=1 \mid x) = \frac{\exp(\theta^{T}x + b)}{1 + \exp(\theta^{T}x + b)}$$

which is exactly the form of the logistic (sigmoid) function.
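As a quick check of the derivation, the sketch below converts fitted GDA parameters into the linear score $\theta^{T}x + b$ and passes it through a sigmoid. It assumes the output of the (illustrative) `fit_gda` from the earlier sketch; the function names here are likewise illustrative.

```python
import numpy as np

def gda_to_logistic(phi, mu0, mu1, Sigma):
    """theta and b of the equivalent logistic form derived above."""
    Sigma_inv = np.linalg.inv(Sigma)
    theta = Sigma_inv @ (mu1 - mu0)
    b = (-0.5 * (mu0 + mu1) @ Sigma_inv @ (mu1 - mu0)
         + np.log(phi) - np.log(1 - phi))
    return theta, b

def posterior_y1(X, theta, b):
    """P(y = 1 | x) = sigmoid(theta^T x + b)."""
    z = X @ theta + b
    return 1.0 / (1.0 + np.exp(-z))
```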

So far, we have seen that Gaussian discriminant analysis yields a posterior $P(y=1 \mid x)$ of exactly the logistic-regression form. More generally, if the class-conditional distribution $P(x \mid y)$ belongs to an exponential family, the resulting posterior $P(y \mid x)$ is also of logistic form.

Which is better, GDA or logistic regression?
GDA makes stronger modeling assumptions, and is more data efficient (i.e., requires less training data to learn “well”) when the modeling assumptions are correct or at least approximately correct. Logistic regression makes weaker assumptions, and is significantly more robust to deviations from modeling assumptions. Specifically, when the data is indeed non-Gaussian, then in the limit of large datasets, logistic regression will almost always do better than GDA. For this reason, in practice logistic regression is used more often than GDA.

3.2 Naive Bayes

3.2.1 Core assumption of NB

The core assumption of NB is that the features $x_i$ are conditionally independent given $y$, which says

$$P(x_1, \ldots, x_p \mid y) = \prod_{i=1}^{p} P(x_i \mid y)$$

and that each feature $x_i$ takes discrete values.

Then the classification rule is

$$
\begin{aligned}
\hat{y} &= \arg\max_i P(y=i \mid x) = \arg\max_i \frac{P(x \mid y=i)\,P(y=i)}{\sum_j P(x \mid y=j)\,P(y=j)} \\
&= \arg\max_i \frac{P(y=i)\prod_{k=1}^{p} P(x_k \mid y=i)}{\sum_j P(y=j)\prod_{k=1}^{p} P(x_k \mid y=j)} \\
&= \arg\max_i P(y=i)\prod_{k=1}^{p} P(x_k \mid y=i)
\end{aligned}
$$

Assume $\theta_{ijk} = P(x_i = x_{ij} \mid y = k)$, where each feature $x_i$ takes 2 values ($j \in \{1,2\}$) and $y$ takes 2 values ($k \in \{0,1\}$). Under the naive Bayes assumption there are only $2p$ free conditional parameters (plus the class prior) to estimate, whereas a full model of $P(x_1,\ldots,x_p \mid y)$ would require $2^{p+1} - 2 = 2(2^{p} - 1)$ parameters. We use the maximum likelihood method to estimate the parameters. The log-likelihood function is

$$L(\theta) = \log \prod_{i=1}^{n} P(x^{(i)}, y^{(i)}) = \log \prod_{i=1}^{n} P(y^{(i)})\, P(x^{(i)} \mid y^{(i)}) = \log \prod_{i=1}^{n} P(y^{(i)}) \prod_{k=1}^{p} P(x_k^{(i)} \mid y^{(i)})$$

and the maximum likelihood estimates are
$$\theta_{ijk} = \frac{\sum_{l=1}^{n} 1\{x_i^{(l)} = x_{ij},\; y^{(l)} = k\}}{\sum_{l=1}^{n} 1\{y^{(l)} = k\}}$$

$$\phi_i = \frac{\sum_{l=1}^{n} 1\{y^{(l)} = i\}}{n}$$

Finally, we classify $x$ by

$$\hat{y} = \arg\max \left\{ \phi_0 \prod_{i=1}^{p} \theta_{ij0},\; \phi_1 \prod_{i=1}^{p} \theta_{ij1} \right\}$$
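Here is a minimal numpy sketch of the estimates and decision rule above, assuming binary features $x_i \in \{0,1\}$ and two classes. `fit_nb` stores one free parameter $P(x_i = 1 \mid y = k)$ per feature and class (its complement gives $P(x_i = 0 \mid y = k)$), and `predict_nb` works in log space; both names are illustrative.

```python
import numpy as np

def fit_nb(X, y):
    """MLE for naive Bayes with binary features and y in {0, 1}.
    Returns phi[k] = P(y = k) and theta[i, k] = P(x_i = 1 | y = k)."""
    p = X.shape[1]
    phi = np.array([np.mean(y == 0), np.mean(y == 1)])
    theta = np.empty((p, 2))
    for k in (0, 1):
        theta[:, k] = X[y == k].mean(axis=0)  # fraction of class-k examples with x_i = 1
    return phi, theta

def predict_nb(X, phi, theta):
    """argmax_k P(y = k) * prod_i P(x_i | y = k), computed in log space.
    Note: a zero count makes some theta exactly 0 or 1, so log() blows up;
    this is the problem Laplace smoothing (next section) fixes."""
    log_theta = np.log(theta)       # log P(x_i = 1 | y = k)
    log_1m = np.log(1 - theta)      # log P(x_i = 0 | y = k)
    log_lik = X[:, :, None] * log_theta + (1 - X)[:, :, None] * log_1m
    scores = log_lik.sum(axis=1) + np.log(phi)   # shape (n, 2)
    return scores.argmax(axis=1)
```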

3.2.2 Laplace smoothing

There is a potential danger in the algorithm described in the last section: it can produce $\theta_{ij0} = 0$ or $\theta_{ij1} = 0$ whenever the training data happens to contain no example satisfying the condition in the numerator, which leads to the problem

$$\phi_0 \prod_{i=1}^{p} \theta_{ij0} = 0, \qquad \phi_1 \prod_{i=1}^{p} \theta_{ij1} = 0$$

and the classification rule becomes $\hat{y} = \arg\max\{0, 0\}$, so we cannot make a prediction. Laplace smoothing overcomes this problem by estimating
$$\theta_{ijk} = \frac{\sum_{l=1}^{n} 1\{x_i^{(l)} = x_{ij},\; y^{(l)} = k\} + 1}{\sum_{l=1}^{n} 1\{y^{(l)} = k\} + 2}, \qquad \phi_i = \frac{\sum_{l=1}^{n} 1\{y^{(l)} = i\} + 1}{n + 2}$$

For multiclass ($K$-class) classification, we use

$$\theta_{ijk} = \frac{\sum_{l=1}^{n} 1\{x_i^{(l)} = x_{ij},\; y^{(l)} = k\} + 1}{\sum_{l=1}^{n} 1\{y^{(l)} = k\} + 2}, \qquad \phi_i = \frac{\sum_{l=1}^{n} 1\{y^{(l)} = i\} + 1}{n + K}$$
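Below is a sketch of the smoothed estimates under the same binary-feature assumptions as the previous sketch: add 1 to every count and 2 to each feature denominator, with the prior denominator becoming $n + K$ for $K$ classes as in the last formula. `fit_nb_laplace` is an illustrative name; it is a drop-in replacement for the earlier `fit_nb`.

```python
import numpy as np

def fit_nb_laplace(X, y, K=2):
    """Laplace-smoothed naive Bayes estimates for binary features.
    theta[i, k] = (count(x_i = 1, y = k) + 1) / (count(y = k) + 2)
    phi[k]      = (count(y = k) + 1) / (n + K)"""
    n, p = X.shape
    phi = np.array([(np.sum(y == k) + 1.0) / (n + K) for k in range(K)])
    theta = np.empty((p, K))
    for k in range(K):
        n_k = np.sum(y == k)
        theta[:, k] = (X[y == k].sum(axis=0) + 1.0) / (n_k + 2)  # never exactly 0 or 1
    return phi, theta
```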
