People are often confused about the meaning or purpose of a likelihood function for a continuous probability distribution (say, a Gaussian). What the heck is that? Indeed, any attempt to find some meaning in it is doomed to failure. Why? Because the likelihood itself is meaningless!
By definition, the likelihood of a model is the probability of generating the observed data under that model, so in the discrete case its meaning is obvious, but things get confusing in the continuous case. Given a data point x, in the discrete case we can immediately read off its probability from the distribution function. However, if x lives in some continuous space, the function is actually a probability density, and by definition the probability of generating any single given point is 0! In other words, the likelihood should be 0 in every case... But how could one use a function that is identically 0 as a likelihood?
The trick is to divide the continuous space into infinitesimally small discrete cells. Then the probability is not 0, but an infinitesimally small amount! Formally, given a small enough positive number ε > 0, Pr(x) ≈ ε·p(x), i.e. some tiny amount multiplied by p(x), the density at x. The likelihood still makes no sense to us on its own, but it is no longer 0; it is an infinitesimally small amount. Recall L'Hôpital's rule: it allows us to compare two quantities that both tend to 0. So the point here is that even though we don't know the exact value of the likelihood (an infinitesimally small amount), we can still compare two likelihoods and decide which one is better!
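The approximation Pr(x) ≈ ε·p(x) can be checked numerically. Below is a minimal sketch (the point x = 1.0 and ε = 1e-6 are arbitrary choices for illustration) that compares the exact probability of a tiny interval around x under a standard Gaussian, computed from the CDF, against ε times the density:

```python
import math

def gauss_pdf(x, mu=0.0, sigma=1.0):
    # Density of a Gaussian N(mu, sigma^2) at x
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gauss_cdf(x, mu=0.0, sigma=1.0):
    # Cumulative distribution function via the error function
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

x, eps = 1.0, 1e-6                           # arbitrary point and cell width
exact = gauss_cdf(x + eps) - gauss_cdf(x)    # exact probability of [x, x + eps]
approx = eps * gauss_pdf(x)                  # the eps * p(x) approximation
print(exact, approx)                         # the two agree closely for small eps
```

As ε shrinks, the two values agree to ever more decimal places, which is exactly the sense in which the probability of a cell is "ε times the density."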
So what happens when we compare two infinitesimally small likelihoods? We get a ratio R = ε·p(x) / ε·q(x) = p(x)/q(x), which is essentially what L'Hôpital's rule tells us. So to decide which model's likelihood is larger, we only need the density functions: comparing the densities is equivalent to comparing the probabilities! In other words, since the infinitesimally small factor ε cancels in the division and we still get the right answer, we can, technically, ignore that factor. Only the value of the density function matters!
So, instead of defining the likelihood as a probability, we redefine it in terms of the probability density function. Notice that this definition is itself meaningless: we cannot explain what the value of a density-based likelihood function means. But, mathematically, its correctness for the purpose of comparing different likelihoods derives directly from L'Hôpital's rule. So now we can understand the likelihood in the general case:
Given a sequence of i.i.d. data points x1, x2, ..., xN, and a family of models indexed by a parameter t, what is the likelihood that some model t generated the data?
Traditionally, we would write Pr(x1, x2, ..., xN; t) = Pr(x1; t) Pr(x2; t) ... Pr(xN; t) = p(x1; t) p(x2; t) ... p(xN; t) · ε^N → 0 (as ε → 0).
But now, since we know that only the value of the density matters, we can simply define L(x1, x2, ..., xN | t) = p(x1; t) p(x2; t) ... p(xN; t).
And we also know that the model with the maximum L(x1, x2, ..., xN | t) also has the maximum infinitesimally small probability, compared with the other infinitesimally small probabilities.
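This maximum-likelihood comparison can be sketched directly. The data points and the candidate parameter values below are hypothetical, chosen only to illustrate the idea: each candidate t is a Gaussian mean, L(data | t) is the product of densities as defined above, and we pick the t that maximizes it:

```python
import math

def gauss_pdf(x, mu, sigma=1.0):
    # Density of a Gaussian N(mu, sigma^2) at x
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def likelihood(data, mu):
    # L(data | mu): product of densities of the i.i.d. points
    L = 1.0
    for x in data:
        L *= gauss_pdf(x, mu)
    return L

data = [1.2, 0.7, 1.9, 1.1, 0.4]        # hypothetical i.i.d. sample (mean 1.06)
candidates = [0.0, 0.5, 1.0, 1.5, 2.0]  # hypothetical candidate values of t
best = max(candidates, key=lambda t: likelihood(data, t))
print(best)  # -> 1.0, the candidate closest to the sample mean
```

In practice one maximizes the sum of log-densities instead of the raw product, since a product of many small densities underflows for large N; the maximizer is the same because log is monotone.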