Logistic Regression & Least Square Probability Classification

This article takes a close look at two probabilistic classifiers: logistic regression and the least squares probability classifier. It first reviews the basics of logistic regression, including the role of the likelihood function and the use of gradient ascent to find the optimal parameters. It then explains how the least squares probability classifier works, deriving an analytic solution with the help of a regularization term and discussing how to avoid negative probability estimates. Finally, both classifiers are illustrated on a Gaussian kernel model.


1. Logistic Regression

The likelihood function, as described by Wikipedia:

https://en.wikipedia.org/wiki/Likelihood_function

plays a key role in statistical inference, especially in methods that estimate parameters from a set of statistics. In this article we will make full use of it.
Pattern recognition works by learning the posterior probability $p(y|x)$ of a pattern $x$ belonging to class $y$. Given a pattern $x$, we assign it to the class $y$ whose posterior probability is maximal, i.e.

$$\hat{y} = \arg\max_{y=1,\dots,c} p(y|x)$$

The posterior probability can be interpreted as the credibility that pattern $x$ belongs to class $y$.
In the logistic regression algorithm, we use a log-linear model for the posterior probability:

$$q(y|x,\theta) = \frac{\exp\left(\sum_{j=1}^{b}\theta_j^{(y)}\phi_j(x)\right)}{\sum_{y'=1}^{c}\exp\left(\sum_{j=1}^{b}\theta_j^{(y')}\phi_j(x)\right)}$$

Note that the denominator is a normalization term that makes the outputs sum to one over the classes. Logistic regression is then defined by the following optimization problem:

$$\max_{\theta}\ \sum_{i=1}^{m}\log q(y_i|x_i,\theta)$$
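
Assuming the training samples are drawn independently, this objective is just the log-likelihood of the model, which connects back to the likelihood function mentioned at the start:

$$\sum_{i=1}^{m}\log q(y_i|x_i,\theta) = \log\prod_{i=1}^{m} q(y_i|x_i,\theta)$$

so logistic regression performs maximum likelihood estimation of $\theta$.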

We can solve it with a stochastic gradient ascent method:

  1. Initialize $\theta$.
  2. Pick a training sample $(x_i, y_i)$ at random.
  3. Update $\theta = (\theta^{(1)T},\dots,\theta^{(c)T})^T$ along the direction of gradient ascent:
     $$\theta^{(y)} \leftarrow \theta^{(y)} + \epsilon\,\nabla_y J_i(\theta), \quad y = 1,\dots,c$$
     where (a short derivation is sketched after this list)
     $$\nabla_y J_i(\theta) = -\frac{\exp\left(\theta^{(y)T}\phi(x_i)\right)\phi(x_i)}{\sum_{y'=1}^{c}\exp\left(\theta^{(y')T}\phi(x_i)\right)} + \begin{cases}\phi(x_i) & (y = y_i)\\ 0 & (y \neq y_i)\end{cases}$$
  4. Repeat steps 2 and 3 until $\theta$ reaches a suitable precision.
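
For reference, here is a brief derivation of the per-sample gradient used in step 3. Writing $J_i(\theta) = \log q(y_i|x_i,\theta)$ with the log-linear model above,

$$J_i(\theta) = \theta^{(y_i)T}\phi(x_i) - \log\sum_{y'=1}^{c}\exp\left(\theta^{(y')T}\phi(x_i)\right),$$

and differentiating with respect to $\theta^{(y)}$ gives

$$\nabla_y J_i(\theta) = \begin{cases}\phi(x_i) & (y = y_i)\\ 0 & (y \neq y_i)\end{cases} - \frac{\exp\left(\theta^{(y)T}\phi(x_i)\right)\phi(x_i)}{\sum_{y'=1}^{c}\exp\left(\theta^{(y')T}\phi(x_i)\right)},$$

which matches the update rule above.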

Take the Gaussian kernel model as an example:

$$q(y|x,\theta) \propto \exp\left(\sum_{j=1}^{n}\theta_j^{(y)} K(x, x_j)\right)$$

Not familiar with the Gaussian kernel model? Refer to this article:

http://blog.youkuaiyun.com/philthinker/article/details/65628280
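
For reference, the kernel assumed in the code below is the standard Gaussian kernel with bandwidth $h$; the code sets `hh` $= 2h^2 = 2$, i.e. $h = 1$:

$$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2h^2}\right)$$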

Here is the corresponding MATLAB code:

% Generate 3-class 1-D training data: n/c samples per class,
% centered at -3, 0 and 3 respectively.
n=90; c=3; y=ones(n/c,1)*(1:c); y=y(:);
x=randn(n/c,c)+repmat(linspace(-3,3,c),n/c,1); x=x(:);

% Stochastic gradient ascent for kernel logistic regression.
% hh = 2*h^2 is the Gaussian kernel bandwidth; t0 holds theta (n-by-c).
hh=2*1^2; t0=randn(n,c);
for o=1:n*1000
    i=ceil(rand*n); yi=y(i); ki=exp(-(x-x(i)).^2/hh);  % random sample i; ki = kernel vector phi(x_i)
    ci=exp(ki'*t0); t=t0-0.1*(ki*ci)/(1+sum(ci));      % class scores; subtract the softmax part of the update (step size 0.1)
    t(:,yi)=t(:,yi)+0.1*ki;                            % add phi(x_i) to the true-class column
    if norm(t-t0)<0.000001                             % stop once the update is small enough
        break;
    end
    t0=t;
end

% Evaluate the learned model on a grid of test points.
N=100; X=linspace(-5,5,N)';
K=exp(-(repmat(X.^2,1,n)+repmat((x.^2)',N,1)-2*X*x')/hh);  % N-by-n kernel matrix between grid and training points

figure(1); clf; hold on; axis([-5,5,-0.3,1.8]);
C=exp(K*t); C=C./repmat(sum(C,2),1,c);   % softmax over classes: estimated q(y|x) on the grid
plot(X,C(:,1),'b-');
plot(X,C(:,2),'r--');
plot(X,C(:,3),'g:');
plot(x(y==1),-0.1*ones(n/c,1),'bo');     % training samples of each class, drawn below the axis
plot(x(y==2),-0.2*ones(n/c,1),'rx');
plot(x(y==3),-0.1*ones(n/c,1),'gv');
legend('q(y=1|x)','q(y=2|x)','q(y=3|x)');

(Figure: estimated posterior probabilities $q(y|x)$ for the three classes, together with the training samples.)
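
As a usage note, here is a minimal sketch of how a single new test point would be classified with the learned parameters: take the arg-max of the estimated posterior, as in $\hat{y} = \arg\max_y q(y|x,\theta)$. It assumes the variables `x`, `t` and `hh` from the code above; `xt` is a hypothetical test input.

xt = 0.5;                      % hypothetical test input
kt = exp(-(x - xt).^2 / hh);   % Gaussian kernel values between xt and the training points (n-by-1)
qt = exp(kt' * t);             % unnormalized class scores (1-by-c)
qt = qt / sum(qt);             % estimated posterior q(y|xt), y = 1,...,c
[~, y_hat] = max(qt);          % predicted class label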

2. Least Square Probability Classification

In the least squares probability classifier, a linearly parameterized model is used to express the posterior probability:

$$q(y|x,\theta^{(y)}) = \sum_{j=1}^{b}\theta_j^{(y)}\phi_j(x) = \theta^{(y)T}\phi(x), \quad y = 1,\dots,c$$

Each of these models depends on its own parameter vector $\theta^{(y)} = (\theta_1^{(y)},\dots,\theta_b^{(y)})^T$ for class $y$, which is different from the parameterization used by the logistic classifier. Learning these models means minimizing the following squared error:

$$J_y(\theta^{(y)}) = \frac{1}{2}\int\left(q(y|x,\theta^{(y)}) - p(y|x)\right)^2 p(x)\,dx = \frac{1}{2}\int q(y|x,\theta^{(y)})^2 p(x)\,dx - \int q(y|x,\theta^{(y)})\,p(y|x)\,p(x)\,dx + \frac{1}{2}\int p(y|x)^2 p(x)\,dx$$

where $p(x)$ is the probability density underlying the training set $\{x_i\}_{i=1}^{n}$.
By Bayes' rule,

$$p(y|x)\,p(x) = p(x, y) = p(x|y)\,p(y)$$

Hence $J_y$ can be reformulated as

$$J_y(\theta^{(y)}) = \frac{1}{2}\int q(y|x,\theta^{(y)})^2 p(x)\,dx - \int q(y|x,\theta^{(y)})\,p(x|y)\,p(y)\,dx + \frac{1}{2}\int p(y|x)^2 p(x)\,dx$$

Note that the first and second terms above are expectations with respect to $p(x)$ and $p(x|y)$ respectively, which usually cannot be computed directly. The last term is independent of $\theta^{(y)}$ and can therefore be omitted.
Since $p(x|y)$ is the probability density of samples $x$ belonging to class $y$, the two expectations can be estimated by the following sample averages, where $n_y$ denotes the number of training samples with $y_i = y$:

$$\frac{1}{n}\sum_{i=1}^{n} q(y|x_i,\theta^{(y)})^2, \qquad \frac{1}{n_y}\sum_{i:y_i=y} q(y|x_i,\theta^{(y)})\,p(y)$$

Next, approximating $p(y)$ by $n_y/n$ and introducing a regularization term, we obtain the following training criterion:

$$\hat{J}_y(\theta^{(y)}) = \frac{1}{2n}\sum_{i=1}^{n} q(y|x_i,\theta^{(y)})^2 - \frac{1}{n}\sum_{i:y_i=y} q(y|x_i,\theta^{(y)}) + \frac{\lambda}{2n}\|\theta^{(y)}\|^2$$

Let $\pi^{(y)} = (\pi_1^{(y)},\dots,\pi_n^{(y)})^T$ with $\pi_i^{(y)} = \begin{cases}1 & (y_i = y)\\ 0 & (y_i \neq y)\end{cases}$, and let $\Phi$ denote the $n \times b$ design matrix with entries $\Phi_{ij} = \phi_j(x_i)$. Then

$$\hat{J}_y(\theta^{(y)}) = \frac{1}{2n}\theta^{(y)T}\Phi^T\Phi\,\theta^{(y)} - \frac{1}{n}\theta^{(y)T}\Phi^T\pi^{(y)} + \frac{\lambda}{2n}\|\theta^{(y)}\|^2.$$
This is a convex optimization problem, so we obtain the analytic solution by setting the gradient to zero:

$$\hat{\theta}^{(y)} = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T\pi^{(y)}.$$
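
For completeness, the analytic solution follows from the first-order condition

$$\nabla_{\theta^{(y)}}\hat{J}_y(\theta^{(y)}) = \frac{1}{n}\Phi^T\Phi\,\theta^{(y)} - \frac{1}{n}\Phi^T\pi^{(y)} + \frac{\lambda}{n}\theta^{(y)} = 0,$$

which rearranges directly to the expression for $\hat{\theta}^{(y)}$ above.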
To avoid negative estimates of the posterior probability, we clip negative outputs at zero and renormalize:

$$\hat{p}(y|x) = \frac{\max\left(0, \hat{\theta}^{(y)T}\phi(x)\right)}{\sum_{y'=1}^{c}\max\left(0, \hat{\theta}^{(y')T}\phi(x)\right)}$$

We again take the Gaussian kernel model as an example:

% Generate the same 3-class 1-D training data as before.
n=90; c=3; y=ones(n/c,1)*(1:c); y=y(:);
x=randn(n/c,c)+repmat(linspace(-3,3,c),n/c,1); x=x(:);

% hh = 2*h^2: kernel bandwidth; l: regularization parameter lambda.
hh=2*1^2; x2=x.^2; l=0.1; N=100; X=linspace(-5,5,N)';
k=exp(-(repmat(x2,1,n)+repmat(x2',n,1)-2*x*(x'))/hh);    % n-by-n kernel matrix between training points
K=exp(-(repmat(X.^2,1,n)+repmat(x2',N,1)-2*X*(x'))/hh);  % N-by-n kernel matrix between grid and training points
for yy=1:c
    yk=(y==yy); ky=k(:,yk);                              % kernel columns of class yy
    ty=(ky'*ky+l*eye(sum(yk)))\(ky'*yk);                 % regularized least squares solution for class yy
    Kt(:,yy)=max(0,K(:,yk)*ty);                          % clip negative outputs at zero
end
ph=Kt./repmat(sum(Kt,2),1,c);                            % normalize over classes to get the posterior estimate

figure(1); clf; hold on; axis([-5,5,-0.3,1.8]);
plot(X,ph(:,1),'b-');                    % estimated posterior for class 1
plot(X,ph(:,2),'r--');                   % estimated posterior for class 2
plot(X,ph(:,3),'g:');                    % estimated posterior for class 3
plot(x(y==1),-0.1*ones(n/c,1),'bo');     % training samples of each class, drawn below the axis
plot(x(y==2),-0.2*ones(n/c,1),'rx');
plot(x(y==3),-0.1*ones(n/c,1),'gv');
legend('q(y=1|x)','q(y=2|x)','q(y=3|x)');

(Figure: estimated posterior probabilities for the three classes under the least squares probability classifier, together with the training samples.)
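
As before, a minimal usage sketch, assuming the variables `ph` and `X` from the code above: the predicted label at each grid point is the arg-max of the estimated posterior.

[~, y_grid] = max(ph, [], 2);   % predicted class for every point in X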

3. Summary

Logistic regression is well suited to small sample sets because of its simplicity. However, once the number of samples becomes large, it is better to turn to the least squares probability classifier, whose parameters can be obtained analytically.
