7. Parameter Estimation
- Model and parameters
- Properties of good estimators
- Unbiasedness, consistency
- UMVUE, efficiency
- MLE
- Bayesian Estimation
- why?
- Prior and Posterior
- Conjugate distribution
- Limitations
Reason: statistical estimation is not the general estimation problem.
- Formulation:
$$X_1, X_2, \ldots, X_n \ i.i.d. \sim f(x;\theta), \quad \theta \in E \text{ unknown}$$
$$\text{Estimator: } \hat\theta = \phi(X), \quad \phi: \mathbb{R}^{n} \rightarrow E$$
Properties of Good Estimators:
Correctness:
- Unbiasedness: the expectation of the estimator over the sampling distribution equals the population parameter being estimated.
$$E[\phi(X)]=\theta \quad \text{for } X \sim f(x;\theta)$$

- Consistency: as the sample size grows, the estimator converges to the population parameter being estimated.
$$\phi(X) \rightarrow \theta \text{ in probability for } X \sim f(x;\theta)$$
- Example:
$$s^{2}=\frac{1}{n-1} \sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2} \ \text{(unbiased)}, \qquad \hat{\sigma}^{2}=\frac{1}{n} \sum_{i=1}^{n}\left(X_{i}-\overline{X}\right)^{2} \ \text{(biased, but consistent)}$$
Accuracy:

- Efficiency:
- UMVUE is very restrictive; efficiency is a weaker condition.
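The unbiased-vs-consistent contrast above can be checked by simulation. A minimal sketch (sample size, true variance, and trial count are all assumed for illustration): averaged over many samples, $s^2$ lands on $\sigma^2$ while $\hat\sigma^2$ is pulled down by the factor $(n-1)/n$.

```python
import random
import statistics

random.seed(0)
sigma2 = 4.0     # true variance (assumed)
n = 10           # small sample size, where the 1/n bias is visible
trials = 2000

s2_vals, sigma2hat_vals = [], []
for _ in range(trials):
    x = [random.gauss(0.0, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    s2_vals.append(ss / (n - 1))       # unbiased estimator s^2
    sigma2hat_vals.append(ss / n)      # biased (but consistent) sigma-hat^2

mean_s2 = statistics.fmean(s2_vals)
mean_sigma2hat = statistics.fmean(sigma2hat_vals)
# E[s^2] = sigma^2 = 4, while E[sigma-hat^2] = (n-1)/n * sigma^2 = 3.6 here
print(mean_s2, mean_sigma2hat)
```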

Maximum Likelihood Estimation
Why?
MLE is a framework for designing consistent and efficient estimators under very general conditions.
Formulation
- The likelihood function:
- The likelihood function:
$$L(X;\theta)=\prod_{i=1}^{n} f\left(X_{i};\theta\right), \qquad X_1,\ldots,X_n \ i.i.d. \sim f(x;\theta), \quad \theta \in E \text{ unknown}$$
- MLE: for given data samples $X = x$,
$$\hat\theta=\underset{\theta \in E}{\operatorname{argmax}}\, L(x;\theta), \quad \text{i.e. } L(x;\hat\theta)=\max_{\theta \in E} L(x;\theta)$$
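A worked sketch of the formulation above, with an assumed example (not from the notes): for $X_i \sim \text{Exp}(\lambda)$ with $f(x;\lambda)=\lambda e^{-\lambda x}$, the log-likelihood $n\log\lambda - \lambda\sum x_i$ has the closed-form maximizer $\hat\lambda = 1/\overline{x}$.

```python
import math
import random

random.seed(1)
true_lambda = 2.0  # assumed true parameter
x = [random.expovariate(true_lambda) for _ in range(2000)]

def log_likelihood(lam):
    # log L(x; lambda) = n*log(lambda) - lambda * sum(x)
    return len(x) * math.log(lam) - lam * sum(x)

lambda_hat = len(x) / sum(x)   # closed-form MLE: 1 / sample mean

# Sanity check: nearby parameter values give a smaller log-likelihood.
assert log_likelihood(lambda_hat) > log_likelihood(lambda_hat * 0.9)
assert log_likelihood(lambda_hat) > log_likelihood(lambda_hat * 1.1)
print(lambda_hat)
```

With 2000 samples the estimate lands close to the true rate, illustrating consistency.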
Limitations:
- Solving the MLE, even numerically, can be very challenging.
- MLE does not guarantee good performance in finite samples.


Bayesian Estimation
With Bayesian estimation, we can easily update our estimator as samples are collected sequentially.
Formulation:
- $\theta \in E$ is treated as a random variable
- $f_{0}(\theta)$: the prior of $\theta$
- $f_{1}(\theta)$: the posterior, which gives the distribution of $\theta$ conditional on the data
$$f_{1}(\theta)= f(\theta \mid X)=\frac{L(x;\theta)\, f_{0}(\theta)}{\int_{E} L(x;u)\, f_{0}(u)\, du}$$


Sequential Bayesian Estimation
Intuitively, if more data $X_{n+1},\ldots,X_{n+m}$ is available, we can take the previous posterior $f_1$ as the new prior and update the belief again using the new data only:
$$f_{2}(\theta)=\frac{L(x_{n+1:n+m};\theta)\, f_{1}(\theta)}{\int_{E} L(x_{n+1:n+m};u)\, f_{1}(u)\, du}$$
where $L(x_{n+1:n+m};\theta)$ is the likelihood of the new samples only.
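The conjugate case makes this concrete. A minimal sketch with an assumed Beta-Bernoulli setup: a Beta(a, b) prior on the success probability, updated with k successes in n trials, yields a Beta(a + k, b + n − k) posterior, and the batch update agrees with the sequential one.

```python
# Beta-Bernoulli conjugacy: prior Beta(a, b), data = (successes, trials).
def update(a, b, successes, trials):
    """One Bayesian update of a Beta(a, b) prior with binomial data."""
    return a + successes, b + (trials - successes)

prior = (1, 1)                  # uniform prior on [0, 1]

# Batch: 7 successes in 10 trials, all at once.
batch = update(*prior, 7, 10)

# Sequential: 3 successes in 4 trials, then 4 successes in 6 trials,
# taking the first posterior as the new prior.
step1 = update(*prior, 3, 4)
step2 = update(*step1, 4, 6)

print(batch, step2)   # both are Beta(8, 4)
```

Because the posterior stays in the Beta family, each update is just two additions, which is why conjugate priors make sequential estimation cheap.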
Limitations:
- It depends on the prior, which can be any distribution on E. A very strong prior could lead to an inconsistent estimation.
- In the information-based trade example, what will happen if we pick p0 = 1?
On the other hand, a weak prior could lead to slow convergence.
- In the information-based trade example, what will happen if we pick p0 = 1?
- The computation of the posterior could be very costly when the parameter space E is large.
8. Confidence Interval
- Three constructions of CI for i.i.d samples:
- normal
- t
- bootstrap
- When and how?
Central Limit Theorem
- Theorem: $\{X_i\}$ is a sequence of i.i.d. samples of $X$ with $E[X]=\mu$ and $Var(X)=\sigma^2$. Then,
$$\frac{\sqrt{n}}{\sigma}\left(\overline{X}_{n}-\mu\right) \Rightarrow N(0,1)$$
- Therefore, when n is "large", for any $a > 0$,
$$P\left(\left|\frac{\sqrt{n}}{\sigma}\left(\overline{X}_{n}-\mu\right)\right|>a\right) \approx P(|Z|>a)$$
where Z is a standard normal r.v.
Confidence Interval (z-distribution)
- For any confidence level $a$, we simply choose $\phi$ such that $P(|Z|>\phi)=1-a$; then the level-$a$ confidence interval is
$$\left[\overline{X}_{n}-\phi \frac{\sigma}{\sqrt{n}},\ \overline{X}_{n}+\phi \frac{\sigma}{\sqrt{n}}\right]$$
- A 95% CI means: if we repeated the sampling 100 times, roughly 95 of the resulting intervals would contain the true value and about 5 would not.
- The standard error (s.e.) of the sample mean is $\sigma_{\overline{x}}=\sigma / \sqrt{n}$.
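A minimal sketch of the z-based CI above, assuming a known $\sigma$ and using $\phi = 1.96$ for the 95% level (data simulated, all parameters assumed):

```python
import random

random.seed(2)
sigma = 3.0        # assumed known population standard deviation
n = 400
x = [random.gauss(10.0, sigma) for _ in range(n)]   # true mean 10 (assumed)

xbar = sum(x) / n
se = sigma / n ** 0.5          # standard error of the sample mean
phi = 1.96                     # P(|Z| <= 1.96) ~ 0.95
ci = (xbar - phi * se, xbar + phi * se)
print(ci)                      # half-length is 1.96 * 3 / 20 = 0.294
```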

The Effect of Sample Size
- The magnitude of the estimation error, measured by the half length of the CI, is
$$\phi \frac{\sigma}{\sqrt{n}}$$
- In order to have estimation error ≈ ε, we need the sample size
$$n \approx \frac{\phi^{2} \sigma^{2}}{\varepsilon^{2}}$$
Intuitively, to improve the estimation accuracy by 10 times, we need to enlarge the sample size by 100 times.
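Plugging assumed numbers into the sample-size formula above makes the quadratic trade-off concrete ($\phi$, $\sigma$, and $\varepsilon$ are all illustrative choices):

```python
# n ~ phi^2 * sigma^2 / epsilon^2 for a 95% CI (phi = 1.96)
phi, sigma, eps = 1.96, 3.0, 0.1    # assumed values
n = phi ** 2 * sigma ** 2 / eps ** 2
print(round(n))                      # about 3457 samples

# Halving epsilon (2x accuracy) quadruples the required sample size.
n_half = phi ** 2 * sigma ** 2 / (eps / 2) ** 2
print(round(n_half))
```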
CI for Small Samples
- Theorem (CI of t-distribution): If $X_1, X_2, \ldots, X_n$ are i.i.d. samples of a normal distribution $N(\mu, \sigma^2)$, then
$$\frac{\sqrt{n}}{s}\left(\overline{X}_{n}-\mu\right) \sim t(n-1),$$
a t-distribution with $n-1$ degrees of freedom.
- Remark:
- The t-distribution is more dispersed than the normal.
- When $n \to \infty$, $t(n-1) \Rightarrow N(0,1)$.

Bootstrap
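The notes leave this section blank; a minimal sketch of one common construction, the percentile bootstrap CI, with simulated data (the data, B, and level are all assumed): resample the data with replacement many times and take empirical quantiles of the resampled means.

```python
import random
import statistics

random.seed(3)
data = [random.gauss(5.0, 2.0) for _ in range(50)]   # assumed sample
sample_mean = statistics.fmean(data)

B = 2000
boot_means = []
for _ in range(B):
    resample = random.choices(data, k=len(data))     # draw with replacement
    boot_means.append(statistics.fmean(resample))

boot_means.sort()
# 95% percentile CI: the 2.5% and 97.5% empirical quantiles
ci = (boot_means[int(0.025 * B)], boot_means[int(0.975 * B)])
print(sample_mean, ci)
```

No normality or known-variance assumption is needed, which is why the bootstrap is the usual fallback when the other two constructions don't apply.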


9. Significance Test
- Formulation of general hypothesis test
- Parameter space
- Hypothesis / Alternative
- Hypothesis testing
- Significance test
- 5 steps
- What is the intuition
- How to choose the hypothesis and alternative
- How to interpret the p-value
- Type I and II errors
Steps of a Significance Test
- Assumptions: underlying probability model for population
- Hypothesis: Formulate the statement or prediction in your research problem into a statement about the population parameter.
- Test Statistic: the test statistic measures how "far" the point estimate of the parameter is from its null-hypothesis value, assuming the null hypothesis is true.
- P-Value: the tail probability beyond the observed value of the test statistic, presuming the null hypothesis is true; it measures how implausible the observed result is under H0.
- Conclusion: Report and interpret the p-value in the context of the study. Make a decision about H0 based on p-value.
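The five steps above can be sketched end to end for a one-proportion z-test; the data and hypotheses here are assumed for illustration:

```python
import math

# 1. Assumptions: n Bernoulli trials, n large enough for the CLT.
# 2. Hypothesis: H0: pi = 0.5 vs Ha: pi != 0.5 (two-sided).
n, successes, pi0 = 200, 116, 0.5     # assumed data
p_hat = successes / n

# 3. Test statistic: standardized distance of p_hat from pi0 under H0.
se0 = math.sqrt(pi0 * (1 - pi0) / n)
z = (p_hat - pi0) / se0

# 4. P-value: two-sided tail probability under the standard normal.
def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

p_value = 2.0 * (1.0 - std_normal_cdf(abs(z)))

# 5. Conclusion: compare the p-value with the chosen significance level.
print(z, p_value)
```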
Type I & Type II errors & Interpreting P-Value



Inference on Single Variables
Population proportion
- z-test
- Difference from CI
- Small sample: binomial test
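When n is too small for the z approximation, the binomial test computes the tail probability exactly from the binomial pmf. A sketch with assumed numbers (H0: π = 0.5, one-sided alternative):

```python
import math

n, k, pi0 = 12, 9, 0.5   # assumed: 9 successes in 12 trials

# Exact one-sided p-value: P(X >= k) for X ~ Binomial(n, pi0)
p_value = sum(math.comb(n, i) * pi0 ** i * (1 - pi0) ** (n - i)
              for i in range(k, n + 1))
print(p_value)           # (220 + 66 + 12 + 1) / 4096 ~ 0.073
```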


Population mean
- t-test
- Relation with CI
- Small sample: bootstrap


Inference on Two Variables
- Independent samples
- Population proportion: z-test
- Population mean: t-test
- Small sample: permutation test
- Paired data: t-test for a single variable (on the differences)


- standard error of z:
$$z=\frac{\left(p_{1}-p_{2}\right)-\left(\pi_{1}-\pi_{2}\right)}{\sqrt{\frac{p_{1}\left(1-p_{1}\right)}{n_{1}}+\frac{p_{2}\left(1-p_{2}\right)}{n_{2}}}}$$
- standard error of u: generally not required; it will be given directly.
- Conclude CI: given our estimate of the standard error for the estimated mean or proportion difference, we can construct the confidence interval for the mean or proportion difference:
$$\left[(\overline{x}-\overline{y})-\phi_{\alpha}\, se,\ (\overline{x}-\overline{y})+\phi_{\alpha}\, se\right]$$
The coefficient $\phi_{\alpha}$ is determined by α and the model assumptions (normal distribution for proportions, t-distribution for means).
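The two-proportion formula above can be applied directly; a sketch with assumed counts for the two groups, at the 95% level:

```python
import math

# Assumed data: group 1 has 60 of 150 successes, group 2 has 45 of 150.
n1, x1 = 150, 60
n2, x2 = 150, 45
p1, p2 = x1 / n1, x2 / n2

# Standard error of the estimated proportion difference.
se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
phi = 1.96                       # 95% level, normal model
diff = p1 - p2
ci = (diff - phi * se, diff + phi * se)
print(diff, ci)
```

Here the interval straddles zero, so at the 5% level this (assumed) data would not reject equal proportions.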
Permutation Test
Tests whether two samples come from the same population, i.e. follow the same distribution.
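A minimal sketch of the permutation test, with simulated data (all settings assumed): under H0 the group labels are exchangeable, so we shuffle them many times and see how often the shuffled mean difference is at least as extreme as the observed one.

```python
import random

random.seed(4)
x = [random.gauss(0.0, 1.0) for _ in range(30)]   # assumed group 1
y = [random.gauss(1.0, 1.0) for _ in range(30)]   # assumed group 2 (shifted)

observed = abs(sum(x) / len(x) - sum(y) / len(y))
pooled = x + y

count, B = 0, 2000
for _ in range(B):
    random.shuffle(pooled)                        # relabel the pooled data
    px, py = pooled[:len(x)], pooled[len(x):]
    diff = abs(sum(px) / len(px) - sum(py) / len(py))
    if diff >= observed:
        count += 1

p_value = count / B   # fraction of permutations at least as extreme
print(observed, p_value)
```

Because the groups really differ here, the permutation p-value comes out small; with identical groups it would be roughly uniform on [0, 1].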


Paired data


10. Multiple Regression
- Assumptions
- Interpretation of estimation results
- Inference methods:
- t-test for single coefficient
- F-test for nested models
- Residual analysis
Assumptions(linear regression model)
$$y_{i}=\beta_{0}+\sum_{k=1}^{p} \beta_{k} g_{k}\left(x_{ik}\right)+\varepsilon_{i}$$
where the functions $g_k$ are known. Besides, we assume the following conditions on $\varepsilon_i$:
- Independence: the $\varepsilon_i$ are independent.
- Zero mean: $E[\varepsilon \mid x] = 0$ for all possible values of $x = (x_1, \ldots, x_m)$.
- Equal variance: $Var(\varepsilon \mid x) = \sigma^2$.
- Normality: the $\varepsilon_i$ are normal conditional on $x$.
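Under these assumptions with $p = 1$ and $g_1(x) = x$, the least-squares fit has a closed form; a sketch with simulated data (all parameters assumed):

```python
import random

random.seed(5)
beta0, beta1, sigma = 2.0, 3.0, 0.5   # assumed true model
x = [random.uniform(0.0, 10.0) for _ in range(500)]
y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]

# Closed-form least-squares estimates for simple linear regression:
# b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
print(b0, b1)
```

With 500 observations and small noise, the estimates land very close to the assumed (2, 3), as the model assumptions predict.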
T-test & F-test


Residual analysis

- DW-test: tests whether the residuals are independent; the null hypothesis is that the residuals are uncorrelated.
- JB-test: tests whether the residuals are normally distributed; the null hypothesis is that they are normal.
Assumptions(logistic regression)


n=nrow(data)
tpr=fpr=rep(0,n)
# compute TPR and FPR, using each predicted probability as a threshold
for (i in 1:n)
{
  threshold=data$prob[i]
  tp=sum(data$prob>threshold&data$obs==1)   # true positives
  fp=sum(data$prob>threshold&data$obs==0)   # false positives
  fn=sum(data$prob<=threshold&data$obs==1)  # false negatives
  tn=sum(data$prob<=threshold&data$obs==0)  # true negatives
  tpr[i]=tp/(tp+fn)  # true positive rate (sensitivity)
  fpr[i]=fp/(fp+tn)  # false positive rate (1 - specificity)
}
# sort by FPR so the ROC curve is drawn left to right
ord=order(fpr,tpr)
plot(fpr[ord],tpr[ord],type='l',ylim=c(0,1),xlim=c(0,1),main='ROC')
abline(a=0,b=1)  # 45-degree reference line (random classifier)