[Machine Learning] Generative learning algorithm-GDA and NB

This article compares generative and discriminative learning algorithms and introduces two typical generative learning algorithms in detail: Gaussian discriminant analysis and Naive Bayes. The core principles of both algorithms and their parameter estimation methods are derived through mathematical formulas.

1. Generative vs. Discriminative

Generally speaking,

  • A generative learning algorithm models how the data was generated in order to categorize a signal. It asks the question: based on my generation assumption, which category is most likely to generate this signal?
  • A discriminative algorithm does not care about how the data was generated; it simply categorizes a given signal.

From a statistical point of view, suppose you have input data x and want to classify it into a category y.

  • A generative learning algorithm learns the joint probability distribution P(x,y) and then uses Bayes' rule to transform P(x,y) into P(y|x) for classification.
  • A discriminative algorithm learns the conditional probability distribution P(y|x) directly, which is the natural distribution for classifying a given example x into a class y. This is why algorithms that model it directly are called discriminative algorithms.

2. Generative learning algorithm

The main goal of a generative learning algorithm is to learn the class-conditional distribution P(x|y). Then, combining it with the prior P(y) through Bayes' rule, we have

$$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)}$$

and make predictions by

$$\hat{y} = \arg\max_y P(y \mid x) = \arg\max_y \frac{P(x \mid y)\,P(y)}{P(x)} = \arg\max_y P(x \mid y)\,P(y)$$

3. Two generative learning algorithms

3.1 Gaussian discriminant analysis

3.1.1 Core assumption of GDA

The core assumption of Gaussian discriminant analysis is that the class-conditional distribution $p(x \mid y)$ is Gaussian, which means $x$ must be continuous-valued.

Here we take the 2-class case as an example. Assume

$$x \mid y=1 \sim \mathcal{N}(\mu_1, \Sigma), \qquad x \mid y=0 \sim \mathcal{N}(\mu_0, \Sigma)$$

And set the prior distribution as

$$P(y) = \phi^{\,y}(1-\phi)^{1-y}$$
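To make the generative story concrete, here is a minimal numpy sketch that samples data exactly as the model assumes: first draw $y \sim \mathrm{Bernoulli}(\phi)$, then draw $x \mid y \sim \mathcal{N}(\mu_y, \Sigma)$. The function name `sample_gda` and the parameter values in the usage lines are illustrative, not from the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gda(n, phi, mu0, mu1, Sigma):
    """Sample n points from the GDA generative story:
    y ~ Bernoulli(phi), then x | y ~ N(mu_y, Sigma)."""
    y = rng.binomial(1, phi, size=n)                  # class labels
    means = np.where((y == 1)[:, None], mu1, mu0)     # per-example mean mu_y
    X = np.array([rng.multivariate_normal(m, Sigma) for m in means])
    return X, y

# Illustrative (hypothetical) parameter values.
X, y = sample_gda(200, phi=0.4,
                  mu0=np.array([0.0, 0.0]),
                  mu1=np.array([2.0, 1.0]),
                  Sigma=np.eye(2))
```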

Now we use the maximum likelihood method to estimate the parameters $(\phi, \mu_0, \mu_1, \Sigma)$. The log-likelihood function is

$$L(\phi, \mu_0, \mu_1, \Sigma) = \log \prod_{i=1}^{n} P(x^{(i)}, y^{(i)}) = \log \prod_{i=1}^{n} P(x^{(i)} \mid y^{(i)})\, P(y^{(i)})$$

(For comparison, the logistic-regression log-likelihood is the conditional one: $L(\theta) = \log \prod_{i=1}^{n} P(y^{(i)} \mid x^{(i)})$.)

Here are the maximum likelihood estimates of those parameters

$$
\begin{aligned}
\phi &= \frac{\sum_{i=1}^{n} 1\{y^{(i)}=1\}}{n} \\
\mu_0 &= \frac{\sum_{i=1}^{n} 1\{y^{(i)}=0\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)}=0\}} \\
\mu_1 &= \frac{\sum_{i=1}^{n} 1\{y^{(i)}=1\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{y^{(i)}=1\}} \\
\Sigma &= \frac{\sum_{i=1}^{n} \left(x^{(i)} - \mu_{y^{(i)}}\right)\left(x^{(i)} - \mu_{y^{(i)}}\right)^{T}}{n}
\end{aligned}
$$

Then, using Bayes' rule, we get the classification rule

$$\hat{y} = \arg\max \left\{ P(x \mid y=1)\,P(y=1),\; P(x \mid y=0)\,P(y=0) \right\}$$
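Below is a minimal numpy sketch of the maximum likelihood estimates and the classification rule above. The names `fit_gda` and `predict_gda` are illustrative; because the two Gaussians share $\Sigma$, their common normalizing constant cancels and only the quadratic terms plus the log-priors need to be compared.

```python
import numpy as np

def fit_gda(X, y):
    """MLE for 2-class GDA with a shared covariance.
    X: (n, d) continuous features, y: (n,) labels in {0, 1}."""
    n = X.shape[0]
    phi = np.mean(y == 1)                            # estimate of P(y = 1)
    mu0 = X[y == 0].mean(axis=0)                     # class-0 mean
    mu1 = X[y == 1].mean(axis=0)                     # class-1 mean
    diff = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = diff.T @ diff / n                        # pooled covariance
    return phi, mu0, mu1, Sigma

def predict_gda(X, phi, mu0, mu1, Sigma):
    """argmax over {P(x|y=0)P(y=0), P(x|y=1)P(y=1)}, computed in log space."""
    Sigma_inv = np.linalg.inv(Sigma)

    def log_score(mu, prior):
        d = X - mu
        return -0.5 * np.sum(d @ Sigma_inv * d, axis=1) + np.log(prior)

    return (log_score(mu1, phi) > log_score(mu0, 1 - phi)).astype(int)
```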

3.1.2 GDA vs. logistic regression

We now give an alternative form of the GDA classification rule:

$$
\begin{aligned}
f(x) &= \log\frac{P(y=1 \mid x)}{P(y=0 \mid x)} = \log\frac{P(x \mid y=1)\,P(y=1)}{P(x \mid y=0)\,P(y=0)} \\
&= \log\frac{\exp\!\left\{-\tfrac{1}{2}(x-\mu_1)^{T}\Sigma^{-1}(x-\mu_1)\right\}\phi}{\exp\!\left\{-\tfrac{1}{2}(x-\mu_0)^{T}\Sigma^{-1}(x-\mu_0)\right\}(1-\phi)} \\
&= \left(x - \frac{\mu_0+\mu_1}{2}\right)^{T}\Sigma^{-1}(\mu_1-\mu_0) + \log\phi - \log(1-\phi) \\
&= \theta^{T}x + b
\end{aligned}
$$

where $\theta = \Sigma^{-1}(\mu_1 - \mu_0)$ and

$$b = -\frac{1}{2}(\mu_0 + \mu_1)^{T}\Sigma^{-1}(\mu_1 - \mu_0) + \log\phi - \log(1-\phi).$$

So the posterior probability $P(y=1 \mid x)$ can be written as

$$P(y=1 \mid x) = \frac{\exp(\theta^{T}x + b)}{1 + \exp(\theta^{T}x + b)}$$

which is exactly the form of the logistic (sigmoid) function.
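As a quick check of the derivation, the sketch below converts fitted GDA parameters into the linear score $\theta^{T}x + b$ and passes it through a sigmoid. It assumes the output of the (illustrative) `fit_gda` from the earlier sketch; the function names here are likewise illustrative.

```python
import numpy as np

def gda_to_logistic(phi, mu0, mu1, Sigma):
    """theta and b of the equivalent logistic form derived above."""
    Sigma_inv = np.linalg.inv(Sigma)
    theta = Sigma_inv @ (mu1 - mu0)
    b = (-0.5 * (mu0 + mu1) @ Sigma_inv @ (mu1 - mu0)
         + np.log(phi) - np.log(1 - phi))
    return theta, b

def posterior_y1(X, theta, b):
    """P(y = 1 | x) = sigmoid(theta^T x + b)."""
    z = X @ theta + b
    return 1.0 / (1.0 + np.exp(-z))
```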

So far, we have seen that Gaussian discriminant analysis yields a posterior $P(y=1 \mid x)$ of exactly the logistic-regression form. More generally, if the class-conditional distribution $P(x \mid y)$ belongs to an exponential family, the resulting posterior $P(y \mid x)$ is also of logistic form.

Which is better, GDA or logistic regression?
GDA makes stronger modeling assumptions, and is more data efficient (i.e., requires less training data to learn “well”) when the modeling assumptions are correct or at least approximately correct. Logistic regression makes weaker assumptions, and is significantly more robust to deviations from modeling assumptions. Specifically, when the data is indeed non-Gaussian, then in the limit of large datasets, logistic regression will almost always do better than GDA. For this reason, in practice logistic regression is used more often than GDA.

3.2 Naive Bayes

3.2.1 Core assumption of NB

The core assumption of NB is that the features $x_i$ are conditionally independent given $y$, which says

$$P(x_1, \ldots, x_p \mid y) = \prod_{i=1}^{p} P(x_i \mid y)$$

and that each feature $x_i$ takes discrete values.

Then the classification rule is

$$
\begin{aligned}
\hat{y} &= \arg\max_i P(y=i \mid x) = \arg\max_i \frac{P(x \mid y=i)\,P(y=i)}{\sum_j P(x \mid y=j)\,P(y=j)} \\
&= \arg\max_i \frac{P(y=i)\prod_{k=1}^{p} P(x_k \mid y=i)}{\sum_j P(y=j)\prod_{k=1}^{p} P(x_k \mid y=j)} \\
&= \arg\max_i P(y=i)\prod_{k=1}^{p} P(x_k \mid y=i)
\end{aligned}
$$

Assume $\theta_{ijk} = P(x_i = x_{ij} \mid y = k)$, where each feature $x_i$ takes 2 values ($j \in \{1,2\}$) and $y$ takes 2 values ($k \in \{0,1\}$). Under the naive Bayes assumption there are only $2p$ free conditional parameters (plus the class prior) to estimate, whereas a full model of $P(x_1,\ldots,x_p \mid y)$ would require $2^{p+1} - 2 = 2(2^{p} - 1)$ parameters. We use the maximum likelihood method to estimate the parameters. The log-likelihood function is

$$L(\theta) = \log \prod_{i=1}^{n} P(x^{(i)}, y^{(i)}) = \log \prod_{i=1}^{n} P(y^{(i)})\, P(x^{(i)} \mid y^{(i)}) = \log \prod_{i=1}^{n} P(y^{(i)}) \prod_{k=1}^{p} P(x_k^{(i)} \mid y^{(i)})$$

and the maximum likelihood estimates are
$$\theta_{ijk} = \frac{\sum_{l=1}^{n} 1\{x_i^{(l)} = x_{ij},\; y^{(l)} = k\}}{\sum_{l=1}^{n} 1\{y^{(l)} = k\}}$$

$$\phi_i = \frac{\sum_{l=1}^{n} 1\{y^{(l)} = i\}}{n}$$

Finally, we classify $x$ by

$$\hat{y} = \arg\max \left\{ \phi_0 \prod_{i=1}^{p} \theta_{ij0},\; \phi_1 \prod_{i=1}^{p} \theta_{ij1} \right\}$$
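Here is a minimal numpy sketch of the estimates and decision rule above, assuming binary features $x_i \in \{0,1\}$ and two classes. `fit_nb` stores one free parameter $P(x_i = 1 \mid y = k)$ per feature and class (its complement gives $P(x_i = 0 \mid y = k)$), and `predict_nb` works in log space; both names are illustrative.

```python
import numpy as np

def fit_nb(X, y):
    """MLE for naive Bayes with binary features and y in {0, 1}.
    Returns phi[k] = P(y = k) and theta[i, k] = P(x_i = 1 | y = k)."""
    p = X.shape[1]
    phi = np.array([np.mean(y == 0), np.mean(y == 1)])
    theta = np.empty((p, 2))
    for k in (0, 1):
        theta[:, k] = X[y == k].mean(axis=0)  # fraction of class-k examples with x_i = 1
    return phi, theta

def predict_nb(X, phi, theta):
    """argmax_k P(y = k) * prod_i P(x_i | y = k), computed in log space.
    Note: a zero count makes some theta exactly 0 or 1, so log() blows up;
    this is the problem Laplace smoothing (next section) fixes."""
    log_theta = np.log(theta)       # log P(x_i = 1 | y = k)
    log_1m = np.log(1 - theta)      # log P(x_i = 0 | y = k)
    log_lik = X[:, :, None] * log_theta + (1 - X)[:, :, None] * log_1m
    scores = log_lik.sum(axis=1) + np.log(phi)   # shape (n, 2)
    return scores.argmax(axis=1)
```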

3.2.2 Laplace smoothing

There is a potential danger in the algorithm described in the last section: it can produce $\theta_{ij0} = 0$ or $\theta_{ij1} = 0$ whenever the training data happens to contain no example satisfying the condition in the numerator, which leads to the problem

$$\phi_0 \prod_{i=1}^{p} \theta_{ij0} = 0, \qquad \phi_1 \prod_{i=1}^{p} \theta_{ij1} = 0$$

and the classification rule becomes $\hat{y} = \arg\max\{0, 0\}$, so we cannot make a prediction. Laplace smoothing overcomes this problem by estimating
$$\theta_{ijk} = \frac{\sum_{l=1}^{n} 1\{x_i^{(l)} = x_{ij},\; y^{(l)} = k\} + 1}{\sum_{l=1}^{n} 1\{y^{(l)} = k\} + 2}, \qquad \phi_i = \frac{\sum_{l=1}^{n} 1\{y^{(l)} = i\} + 1}{n + 2}$$

For multiclass ($K$-class) classification, we use

$$\theta_{ijk} = \frac{\sum_{l=1}^{n} 1\{x_i^{(l)} = x_{ij},\; y^{(l)} = k\} + 1}{\sum_{l=1}^{n} 1\{y^{(l)} = k\} + 2}, \qquad \phi_i = \frac{\sum_{l=1}^{n} 1\{y^{(l)} = i\} + 1}{n + K}$$
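Below is a sketch of the smoothed estimates under the same binary-feature assumptions as the previous sketch: add 1 to every count and 2 to each feature denominator, with the prior denominator becoming $n + K$ for $K$ classes as in the last formula. `fit_nb_laplace` is an illustrative name; it is a drop-in replacement for the earlier `fit_nb`.

```python
import numpy as np

def fit_nb_laplace(X, y, K=2):
    """Laplace-smoothed naive Bayes estimates for binary features.
    theta[i, k] = (count(x_i = 1, y = k) + 1) / (count(y = k) + 2)
    phi[k]      = (count(y = k) + 1) / (n + K)"""
    n, p = X.shape
    phi = np.array([(np.sum(y == k) + 1.0) / (n + K) for k in range(K)])
    theta = np.empty((p, K))
    for k in range(K):
        n_k = np.sum(y == k)
        theta[:, k] = (X[y == k].sum(axis=0) + 1.0) / (n_k + 2)  # never exactly 0 or 1
    return phi, theta
```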
