Logistic Regression & Least Square Probability Classification

This article takes a close look at two probabilistic classifiers: logistic regression and the least squares probability classifier. It first reviews the basics of logistic regression, including the role of the likelihood function and the use of gradient ascent to find the optimal parameters. It then explains how the least squares probability classifier works, deriving an analytic solution with the help of a regularization term and discussing how to avoid negative probability estimates. Finally, both classifiers are illustrated on a Gaussian kernel model.


1. Logistic Regression

The likelihood function, as described by Wikipedia:

https://en.wikipedia.org/wiki/Likelihood_function

plays a key role in statistical inference, especially in methods that estimate parameters from a set of statistics. In this article we will make full use of it.
Pattern recognition works by learning the posterior probability $p(y|x)$ of a pattern $x$ belonging to class $y$. Given a pattern $x$, we assign it to the class $y$ whose posterior probability is maximal, i.e.

$$\hat{y} = \arg\max_{y=1,\dots,c} p(y|x)$$

The posterior probability can be interpreted as the credibility that pattern $x$ belongs to class $y$.
In the logistic regression algorithm, we use a log-linear model for the posterior probability:

$$q(y|x,\theta) = \frac{\exp\left(\sum_{j=1}^{b}\theta_j^{(y)}\phi_j(x)\right)}{\sum_{y'=1}^{c}\exp\left(\sum_{j=1}^{b}\theta_j^{(y')}\phi_j(x)\right)}$$

Note that the denominator is a normalization term that makes the outputs sum to one over the classes. Logistic regression is then defined by the following optimization problem:

$$\max_{\theta}\ \sum_{i=1}^{m}\log q(y_i|x_i,\theta)$$
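
Assuming the training samples are drawn independently, this objective is just the log-likelihood of the model, which connects back to the likelihood function mentioned at the start:

$$\sum_{i=1}^{m}\log q(y_i|x_i,\theta) = \log\prod_{i=1}^{m} q(y_i|x_i,\theta)$$

so logistic regression performs maximum likelihood estimation of $\theta$.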

We can solve it with a stochastic gradient ascent method:

  1. Initialize $\theta$.
  2. Pick a training sample $(x_i, y_i)$ at random.
  3. Update $\theta = (\theta^{(1)T},\dots,\theta^{(c)T})^T$ along the direction of gradient ascent:
     $$\theta^{(y)} \leftarrow \theta^{(y)} + \epsilon\,\nabla_y J_i(\theta), \quad y = 1,\dots,c$$
     where (a short derivation is sketched after this list)
     $$\nabla_y J_i(\theta) = -\frac{\exp\left(\theta^{(y)T}\phi(x_i)\right)\phi(x_i)}{\sum_{y'=1}^{c}\exp\left(\theta^{(y')T}\phi(x_i)\right)} + \begin{cases}\phi(x_i) & (y = y_i)\\ 0 & (y \neq y_i)\end{cases}$$
  4. Repeat steps 2 and 3 until $\theta$ reaches a suitable precision.
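
For reference, here is a brief derivation of the per-sample gradient used in step 3. Writing $J_i(\theta) = \log q(y_i|x_i,\theta)$ with the log-linear model above,

$$J_i(\theta) = \theta^{(y_i)T}\phi(x_i) - \log\sum_{y'=1}^{c}\exp\left(\theta^{(y')T}\phi(x_i)\right),$$

and differentiating with respect to $\theta^{(y)}$ gives

$$\nabla_y J_i(\theta) = \begin{cases}\phi(x_i) & (y = y_i)\\ 0 & (y \neq y_i)\end{cases} - \frac{\exp\left(\theta^{(y)T}\phi(x_i)\right)\phi(x_i)}{\sum_{y'=1}^{c}\exp\left(\theta^{(y')T}\phi(x_i)\right)},$$

which matches the update rule above.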

Take the Gaussian kernel model as an example:

$$q(y|x,\theta) \propto \exp\left(\sum_{j=1}^{n}\theta_j^{(y)} K(x, x_j)\right)$$

Not familiar with the Gaussian kernel model? Refer to this article:

http://blog.youkuaiyun.com/philthinker/article/details/65628280
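
For reference, the kernel assumed in the code below is the standard Gaussian kernel with bandwidth $h$; the code sets `hh` $= 2h^2 = 2$, i.e. $h = 1$:

$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2h^2}\right)$$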

Here is the corresponding MATLAB code:

% Generate 3-class 1-D training data: n/c samples per class,
% centered at -3, 0 and 3 respectively.
n=90; c=3; y=ones(n/c,1)*(1:c); y=y(:);
x=randn(n/c,c)+repmat(linspace(-3,3,c),n/c,1); x=x(:);

% Stochastic gradient ascent for kernel logistic regression.
% hh = 2*h^2 is the Gaussian kernel bandwidth; t0 holds theta (n-by-c).
hh=2*1^2; t0=randn(n,c);
for o=1:n*1000
    i=ceil(rand*n); yi=y(i); ki=exp(-(x-x(i)).^2/hh);  % random sample i; ki = kernel vector phi(x_i)
    ci=exp(ki'*t0); t=t0-0.1*(ki*ci)/(1+sum(ci));      % class scores; subtract the softmax part of the update (step size 0.1)
    t(:,yi)=t(:,yi)+0.1*ki;                            % add phi(x_i) to the true-class column
    if norm(t-t0)<0.000001                             % stop once the update is small enough
        break;
    end
    t0=t;
end

% Evaluate the learned model on a grid of test points.
N=100; X=linspace(-5,5,N)';
K=exp(-(repmat(X.^2,1,n)+repmat((x.^2)',N,1)-2*X*x')/hh);  % N-by-n kernel matrix between grid and training points

figure(1); clf; hold on; axis([-5,5,-0.3,1.8]);
C=exp(K*t); C=C./repmat(sum(C,2),1,c);   % softmax over classes: estimated q(y|x) on the grid
plot(X,C(:,1),'b-');
plot(X,C(:,2),'r--');
plot(X,C(:,3),'g:');
plot(x(y==1),-0.1*ones(n/c,1),'bo');     % training samples of each class, drawn below the axis
plot(x(y==2),-0.2*ones(n/c,1),'rx');
plot(x(y==3),-0.1*ones(n/c,1),'gv');
legend('q(y=1|x)','q(y=2|x)','q(y=3|x)');

(Figure: estimated posterior probabilities $q(y|x)$ for the three classes, together with the training samples.)
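
As a usage note, here is a minimal sketch of how a single new test point would be classified with the learned parameters: take the arg-max of the estimated posterior, as in $\hat{y} = \arg\max_y q(y|x,\theta)$. It assumes the variables `x`, `t` and `hh` from the code above; `xt` is a hypothetical test input.

xt = 0.5;                      % hypothetical test input
kt = exp(-(x - xt).^2 / hh);   % Gaussian kernel values between xt and the training points (n-by-1)
qt = exp(kt' * t);             % unnormalized class scores (1-by-c)
qt = qt / sum(qt);             % estimated posterior q(y|xt), y = 1,...,c
[~, y_hat] = max(qt);          % predicted class label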

2. Least Square Probability Classification

In the least squares probability classifier, a linearly parameterized model is used to express the posterior probability:

$$q(y|x,\theta^{(y)}) = \sum_{j=1}^{b}\theta_j^{(y)}\phi_j(x) = \theta^{(y)T}\phi(x), \quad y = 1,\dots,c$$

Each of these models depends on its own parameter vector $\theta^{(y)} = (\theta_1^{(y)},\dots,\theta_b^{(y)})^T$ for class $y$, which is different from the parameterization used by the logistic classifier. Learning these models means minimizing the following squared error:

$$J_y(\theta^{(y)}) = \frac{1}{2}\int\left(q(y|x,\theta^{(y)}) - p(y|x)\right)^2 p(x)\,dx = \frac{1}{2}\int q(y|x,\theta^{(y)})^2 p(x)\,dx - \int q(y|x,\theta^{(y)})\,p(y|x)\,p(x)\,dx + \frac{1}{2}\int p(y|x)^2 p(x)\,dx$$

where $p(x)$ is the probability density underlying the training set $\{x_i\}_{i=1}^{n}$.
By Bayes' rule,

$$p(y|x)\,p(x) = p(x, y) = p(x|y)\,p(y)$$

Hence $J_y$ can be reformulated as

$$J_y(\theta^{(y)}) = \frac{1}{2}\int q(y|x,\theta^{(y)})^2 p(x)\,dx - \int q(y|x,\theta^{(y)})\,p(x|y)\,p(y)\,dx + \frac{1}{2}\int p(y|x)^2 p(x)\,dx$$

Note that the first and second terms above are expectations with respect to $p(x)$ and $p(x|y)$ respectively, which usually cannot be computed directly. The last term is independent of $\theta^{(y)}$ and can therefore be omitted.
Since $p(x|y)$ is the probability density of samples $x$ belonging to class $y$, the two expectations can be estimated by the following sample averages, where $n_y$ denotes the number of training samples with $y_i = y$:

$$\frac{1}{n}\sum_{i=1}^{n} q(y|x_i,\theta^{(y)})^2, \qquad \frac{1}{n_y}\sum_{i:y_i=y} q(y|x_i,\theta^{(y)})\,p(y)$$

Next, approximating $p(y)$ by $n_y/n$ and introducing a regularization term, we obtain the following training criterion:

$$\hat{J}_y(\theta^{(y)}) = \frac{1}{2n}\sum_{i=1}^{n} q(y|x_i,\theta^{(y)})^2 - \frac{1}{n}\sum_{i:y_i=y} q(y|x_i,\theta^{(y)}) + \frac{\lambda}{2n}\|\theta^{(y)}\|^2$$

Let $\pi^{(y)} = (\pi_1^{(y)},\dots,\pi_n^{(y)})^T$ with $\pi_i^{(y)} = \begin{cases}1 & (y_i = y)\\ 0 & (y_i \neq y)\end{cases}$, and let $\Phi$ denote the $n \times b$ design matrix with entries $\Phi_{ij} = \phi_j(x_i)$. Then

$$\hat{J}_y(\theta^{(y)}) = \frac{1}{2n}\theta^{(y)T}\Phi^T\Phi\,\theta^{(y)} - \frac{1}{n}\theta^{(y)T}\Phi^T\pi^{(y)} + \frac{\lambda}{2n}\|\theta^{(y)}\|^2.$$
This is a convex optimization problem, so we obtain the analytic solution by setting the gradient to zero:

$$\hat{\theta}^{(y)} = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T\pi^{(y)}.$$
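
For completeness, the analytic solution follows from the first-order condition

$$\nabla_{\theta^{(y)}}\hat{J}_y(\theta^{(y)}) = \frac{1}{n}\Phi^T\Phi\,\theta^{(y)} - \frac{1}{n}\Phi^T\pi^{(y)} + \frac{\lambda}{n}\theta^{(y)} = 0,$$

which rearranges directly to the expression for $\hat{\theta}^{(y)}$ above.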
To avoid negative estimates of the posterior probability, we clip negative outputs at zero and renormalize:

$$\hat{p}(y|x) = \frac{\max\left(0, \hat{\theta}^{(y)T}\phi(x)\right)}{\sum_{y'=1}^{c}\max\left(0, \hat{\theta}^{(y')T}\phi(x)\right)}$$

We again take the Gaussian kernel model as an example:

% Generate the same 3-class 1-D training data as before.
n=90; c=3; y=ones(n/c,1)*(1:c); y=y(:);
x=randn(n/c,c)+repmat(linspace(-3,3,c),n/c,1); x=x(:);

% hh = 2*h^2: kernel bandwidth; l: regularization parameter lambda.
hh=2*1^2; x2=x.^2; l=0.1; N=100; X=linspace(-5,5,N)';
k=exp(-(repmat(x2,1,n)+repmat(x2',n,1)-2*x*(x'))/hh);    % n-by-n kernel matrix between training points
K=exp(-(repmat(X.^2,1,n)+repmat(x2',N,1)-2*X*(x'))/hh);  % N-by-n kernel matrix between grid and training points
for yy=1:c
    yk=(y==yy); ky=k(:,yk);                              % kernel columns of class yy
    ty=(ky'*ky+l*eye(sum(yk)))\(ky'*yk);                 % regularized least squares solution for class yy
    Kt(:,yy)=max(0,K(:,yk)*ty);                          % clip negative outputs at zero
end
ph=Kt./repmat(sum(Kt,2),1,c);                            % normalize over classes to get the posterior estimate

figure(1); clf; hold on; axis([-5,5,-0.3,1.8]);
plot(X,ph(:,1),'b-');                    % estimated posterior for class 1
plot(X,ph(:,2),'r--');                   % estimated posterior for class 2
plot(X,ph(:,3),'g:');                    % estimated posterior for class 3
plot(x(y==1),-0.1*ones(n/c,1),'bo');     % training samples of each class, drawn below the axis
plot(x(y==2),-0.2*ones(n/c,1),'rx');
plot(x(y==3),-0.1*ones(n/c,1),'gv');
legend('q(y=1|x)','q(y=2|x)','q(y=3|x)');

(Figure: estimated posterior probabilities for the three classes under the least squares probability classifier, together with the training samples.)
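
As before, a minimal usage sketch, assuming the variables `ph` and `X` from the code above: the predicted label at each grid point is the arg-max of the estimated posterior.

[~, y_grid] = max(ph, [], 2);   % predicted class for every point in X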

3. Summary

Logistic regression is well suited to small sample sets because of its simplicity. However, once the number of samples becomes large, it is better to turn to the least squares probability classifier, whose parameters can be obtained analytically.
