Maximum Likelihood Linear Regression
Suppose we have a data set
$$S = \{(x^{(i)}, y^{(i)})\ :\ i = 1, \dots, m\}$$
where $x^{(i)} \in \mathbb{R}^n$, i.e. each input has $n$ features, and
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$
where $\epsilon^{(i)}$ is an error term that captures either unmodeled effects or random noise. Let's assume that the $\epsilon^{(i)}$'s are distributed i.i.d. (independently and identically distributed) according to a Gaussian distribution with mean zero and variance $\sigma^2$, written $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$. The pdf of $\epsilon^{(i)}$ is given by
$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$$
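The noise model can be checked numerically. The following sketch simulates data from $y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$; the sample size, feature count, true parameters, and noise level are all hypothetical values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: m = 1000 samples, n = 2 features, arbitrary true parameters.
m, n = 1000, 2
theta_true = np.array([2.0, -1.0])
sigma = 0.5

X = rng.normal(size=(m, n))           # design matrix; rows are the x^(i)'s
eps = rng.normal(0.0, sigma, size=m)  # i.i.d. Gaussian noise, eps^(i) ~ N(0, sigma^2)
y = X @ theta_true + eps              # y^(i) = theta^T x^(i) + eps^(i)

# The sample mean and variance of eps should be close to 0 and sigma^2 = 0.25.
print(eps.mean(), eps.var())
```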
Because $\epsilon^{(i)} = y^{(i)} - \theta^T x^{(i)}$, the pdf can also be written as
$$p(y^{(i)}\mid x^{(i)};\theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\right)$$
Notice that the notation $p(y^{(i)}\mid x^{(i)};\theta)$ indicates that this is the distribution of $y^{(i)}$ given $x^{(i)}$, parameterized by $\theta$. Since $\theta$ is not a random variable, the formula is not a probability conditioned on $\theta$. We can also write the distribution as $y^{(i)}\mid x^{(i)};\theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$. Given an input matrix $X = (x^{(1)}, x^{(2)}, \dots, x^{(m)})^T$ and $\theta$, the distribution of the $y^{(i)}$'s is given by $p(\vec{y}\mid X;\theta)$. When we wish to explicitly view this as a function of $\theta$, we call it the likelihood function:
$$L(\theta) = L(\theta; X, \vec{y}) = p(\vec{y}\mid X;\theta)$$
Note that by the independence assumption on the $\epsilon^{(i)}$'s, this can be written as
$$L(\theta) = \prod_{i=1}^m p(y^{(i)}\mid x^{(i)};\theta) = \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\right)$$
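This product of per-sample densities can be evaluated directly. A minimal sketch, using a small synthetic data set (all sizes and parameter values are illustrative assumptions): the likelihood is larger at the data-generating $\theta$ than at a perturbed one.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small hypothetical data set drawn from the model.
m, n, sigma = 50, 2, 0.5
theta = np.array([2.0, -1.0])
X = rng.normal(size=(m, n))
y = X @ theta + rng.normal(0.0, sigma, size=m)

def likelihood(theta, X, y, sigma):
    """L(theta) = prod_i N(y^(i) | theta^T x^(i), sigma^2)."""
    resid = y - X @ theta
    dens = np.exp(-resid**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return dens.prod()

# The likelihood near the true theta exceeds that of a perturbed theta.
print(likelihood(theta, X, y, sigma) > likelihood(theta + 0.5, X, y, sigma))
```

Note that a product of $m$ small densities shrinks exponentially in $m$ and eventually underflows floating point, which is one practical reason to work with the log likelihood introduced next.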
Now, given this probabilistic model relating the $y^{(i)}$'s and the $x^{(i)}$'s, the principle of maximum likelihood says that we should choose $\theta$ so as to make the observed data as probable as possible. So we face an optimization problem:
$$\max_\theta L(\theta)$$
Since $\log$ is strictly increasing, maximizing $L(\theta)$ is equivalent to maximizing its logarithm. We therefore define the log likelihood:
$$\begin{aligned}
\ell(\theta) &= \log L(\theta) \\
&= \log \prod_{i=1}^m \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\right) \\
&= \sum_{i=1}^m \log \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y^{(i)}-\theta^T x^{(i)})^2}{2\sigma^2}\right) \\
&= m\log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2}\sum_{i=1}^m \left(y^{(i)}-\theta^T x^{(i)}\right)^2
\end{aligned}$$
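The last simplification can be verified numerically: the sum of per-sample log densities equals the closed form $m\log\frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2}\sum_i (y^{(i)}-\theta^T x^{(i)})^2$. The data below are synthetic, with illustrative sizes and parameters.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data from the same model as before.
m, n, sigma = 100, 2, 0.5
theta = np.array([2.0, -1.0])
X = rng.normal(size=(m, n))
y = X @ theta + rng.normal(0.0, sigma, size=m)

def log_likelihood_sum(theta):
    """ell(theta) as a sum of per-sample log densities."""
    resid = y - X @ theta
    return np.sum(np.log(np.exp(-resid**2 / (2 * sigma**2))
                         / (np.sqrt(2 * np.pi) * sigma)))

def log_likelihood_closed(theta):
    """ell(theta) via the simplified form m*log(1/(sqrt(2*pi)*sigma)) - SSE/(2*sigma^2)."""
    resid = y - X @ theta
    return (m * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma))
            - np.sum(resid**2) / (2 * sigma**2))

# Both expressions agree up to floating-point error.
print(np.isclose(log_likelihood_sum(theta), log_likelihood_closed(theta)))
```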
Since the first term $m\log\frac{1}{\sqrt{2\pi}\sigma}$ does not depend on $\theta$ and $\frac{1}{2\sigma^2}$ is a positive constant, maximizing $\ell(\theta)$ is the same as minimizing the sum of squared errors. Moreover, scaling the objective by a positive constant (for example by $\frac{1}{m}$, which turns the sum into an empirical expectation) does not change the minimizer. The maximum likelihood estimate is therefore
$$\hat{\theta} = \arg\min_\theta \sum_{i=1}^m \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$
which is exactly the ordinary least-squares cost.
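The least-squares problem has the closed-form solution given by the normal equations, $\hat{\theta} = (X^T X)^{-1} X^T \vec{y}$. A sketch on synthetic data (sizes and parameter values are assumptions for the demo), checking it against NumPy's direct least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data generated from the model.
m, n, sigma = 500, 2, 0.5
theta_true = np.array([2.0, -1.0])
X = rng.normal(size=(m, n))
y = X @ theta_true + rng.normal(0.0, sigma, size=m)

# Maximum likelihood = least squares; closed form via the normal equations.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq minimizes ||y - X theta||^2 directly; both should agree.
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_hat, theta_lstsq))
print(theta_hat)  # close to theta_true, up to noise
```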