Maximum Likelihood for Linear Regression
Suppose we have a data set $S = \{(x^{(i)}, y^{(i)}) : i = 1, \dots, m\}$, where $x^{(i)} \in \mathbb{R}^n$ (each $x$ has $n$ features) and

$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)}$$
where $\epsilon^{(i)}$ is an error term that captures either unmodeled effects or random noise. Let's assume that the $\epsilon^{(i)}$'s are distributed i.i.d. (independently and identically distributed) according to a Gaussian distribution with mean zero and variance $\sigma^2$, written $\epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$. The pdf of $\epsilon^{(i)}$ is given by
$$p(\epsilon^{(i)}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(\epsilon^{(i)})^2}{2\sigma^2}\right)$$
Since $\epsilon^{(i)} = y^{(i)} - \theta^T x^{(i)}$, the pdf can also be written as
$$p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$
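As a quick sanity check, this model is easy to simulate. The sketch below draws synthetic data from $y = \theta^T x + \epsilon$ with Gaussian noise and evaluates the conditional density; the values of `theta_true`, `sigma`, `m`, and `n` are illustrative choices, not from the text.

```python
import numpy as np

# Simulate the assumed model: y = theta^T x + eps, eps ~ N(0, sigma^2).
# theta_true, sigma, m, n are illustrative values.
rng = np.random.default_rng(0)
m, n = 100, 3
theta_true = np.array([2.0, -1.0, 0.5])
sigma = 0.3

X = rng.normal(size=(m, n))            # rows are the x^(i)'s
eps = rng.normal(0.0, sigma, size=m)   # i.i.d. Gaussian noise
y = X @ theta_true + eps               # y^(i) = theta^T x^(i) + eps^(i)

def p_y_given_x(y_i, x_i, theta, sigma):
    """Density p(y^(i) | x^(i); theta), per the Gaussian formula above."""
    mu = theta @ x_i
    return np.exp(-(y_i - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
```

At $y^{(i)} = \theta^T x^{(i)}$ the density attains its maximum value $1 / (\sqrt{2\pi}\,\sigma)$, which is a convenient spot check.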
Notice that the notation $p(y^{(i)} \mid x^{(i)}; \theta)$ indicates that this is the distribution of $y^{(i)}$ given $x^{(i)}$, parameterized by $\theta$. Here $\theta$ is not a random variable, so the formula is not a probability conditioned on $\theta$. We can write the distribution as $y^{(i)} \mid x^{(i)}; \theta \sim \mathcal{N}(\theta^T x^{(i)}, \sigma^2)$. Given an input matrix $X = (x^{(1)}, x^{(2)}, \dots, x^{(m)})^T$ and $\theta$, the distribution of the $y^{(i)}$'s is given by $p(\vec{y} \mid X; \theta)$. When we wish to explicitly view this as a function of $\theta$, we call it the likelihood function:
$$L(\theta) = L(\theta; X, \vec{y}) = p(\vec{y} \mid X; \theta)$$
Note that by the independence assumption on the $\epsilon^{(i)}$'s, this can be written as
$$L(\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}; \theta) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right)$$
Now we have a probabilistic model relating the $y^{(i)}$'s and the $x^{(i)}$'s. The principle of maximum likelihood says that we should choose $\theta$ so as to make the data as probable as possible, so we face an optimization problem:
$$\max_{\theta} L(\theta)$$
Since $\log$ is strictly increasing, maximizing $L(\theta)$ is equivalent to maximizing any monotone transform of it. We therefore define the log likelihood:
$$\begin{aligned}
\ell(\theta) = \log L(\theta)
&= \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \\
&= \sum_{i=1}^{m} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y^{(i)} - \theta^T x^{(i)})^2}{2\sigma^2}\right) \\
&= m \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2
\end{aligned}$$
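The last step of this derivation is easy to verify numerically: summing the per-example log densities should reproduce the closed form. The sketch below uses synthetic, purely illustrative data.

```python
import numpy as np

# Check that sum_i log p(y^(i) | x^(i); theta) equals the closed form
# m*log(1/(sqrt(2*pi)*sigma)) - (1/(2*sigma^2)) * sum of squared residuals.
# All data here is synthetic and illustrative.
rng = np.random.default_rng(1)
m, n = 50, 2
X = rng.normal(size=(m, n))
theta = np.array([1.5, -0.7])
sigma = 0.5
y = X @ theta + rng.normal(0.0, sigma, size=m)

def log_likelihood(theta, X, y, sigma):
    # Sum over i of log p(y^(i) | x^(i); theta)
    resid = y - X @ theta
    return np.sum(-resid ** 2 / (2 * sigma ** 2) - np.log(np.sqrt(2 * np.pi) * sigma))

closed_form = m * np.log(1.0 / (np.sqrt(2 * np.pi) * sigma)) \
    - np.sum((y - X @ theta) ** 2) / (2 * sigma ** 2)
```

The two quantities agree to floating-point precision for any choice of $\theta$ and $\sigma > 0$.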
Since the first term does not depend on $\theta$, maximizing $\ell(\theta)$ is the same as minimizing the sum of squared residuals. Scaling the objective does not change the minimizer, so we may also divide by $m$ and view it as an empirical expectation. The maximum likelihood estimate is therefore

$$\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{m} \left(y^{(i)} - \theta^T x^{(i)}\right)^2$$

which is exactly the least-squares cost function: under the Gaussian noise assumption, maximum likelihood estimation of $\theta$ corresponds to ordinary least squares.
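One standard way to solve this minimization in closed form (not derived in this excerpt) is via the normal equations $\hat{\theta} = (X^T X)^{-1} X^T y$. The sketch below applies them to synthetic data; `theta_true` and the noise level are illustrative choices.

```python
import numpy as np

# Solve arg min_theta sum_i (y^(i) - theta^T x^(i))^2 via the normal
# equations theta_hat = (X^T X)^{-1} X^T y, on synthetic data.
rng = np.random.default_rng(2)
m, n = 200, 3
theta_true = np.array([1.0, 2.0, -0.5])
X = rng.normal(size=(m, n))
y = X @ theta_true + rng.normal(0.0, 0.1, size=m)

# Solve the linear system (X^T X) theta = X^T y instead of forming an
# explicit inverse, which is cheaper and more numerically stable.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With low noise and enough samples, `theta_hat` lands close to `theta_true`, and it matches the least-squares solution returned by `np.linalg.lstsq`.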