NOTES
1 The relationship between variables
function relation: one variable can be expressed as a function of the other, $y = f(x)$.
correlation: there is no exact, deterministic relationship between the two variables.
In most cases, two variables cannot be described by a function, but we can still analyze their correlation.
- scatter plot: visual inspection of the relationship.
- correlation coefficient: a number computed from the data that measures the degree of correlation between the variables.
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^2 - (\sum x)^2}\cdot\sqrt{n\sum y^2 - (\sum y)^2}}$$
Here, $r$ is called the linear correlation coefficient, or Pearson's correlation coefficient.
$\vert r \vert \geq 0.8$: high correlation
$0.5 \leq \vert r \vert < 0.8$: moderate correlation
$0.3 \leq \vert r \vert < 0.5$: low correlation
$\vert r \vert < 0.3$: no correlation
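As a minimal sketch, the formula above can be computed directly from the sums of the data (function name `pearson_r` is my own choice, not from the notes):

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient via the computational formula
    r = (n*Sxy - Sx*Sy) / (sqrt(n*Sxx - Sx^2) * sqrt(n*Syy - Sy^2))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / (
        math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
    )

# perfectly linear data, so r should equal 1
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
print(pearson_r(x, y))  # 1.0 (up to floating-point error)
```

In practice one would use a library routine such as `scipy.stats.pearsonr`, which also returns a p-value.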
- Pearson's correlation coefficient assumes the data are normally distributed. If they are not, the Spearman correlation coefficient can be used instead.
- significance testing of $r$
population correlation coefficient: $\rho$
sample correlation coefficient: $r$
hypotheses:
$H_0: \rho = 0$; $H_1: \rho \neq 0$
test statistic:
$$t = \vert r \vert \sqrt{\frac{n-2}{1-r^2}} \sim t(n-2)$$
decision:
If $\vert t \vert > t_{\alpha/2}(n-2)$, reject $H_0$: there is a significant linear relationship between the population variables.
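The decision rule above can be sketched with the sample values plugged into the test statistic. The numbers $r = 0.75$, $n = 12$ are a hypothetical example of mine, and the two-sided critical value $t_{0.025}(10) \approx 2.228$ is taken from a standard $t$-table:

```python
import math

def t_statistic(r, n):
    """t = |r| * sqrt((n-2)/(1-r^2)); under H0: rho = 0 it follows t(n-2)."""
    return abs(r) * math.sqrt((n - 2) / (1 - r * r))

r, n = 0.75, 12            # hypothetical sample correlation and sample size
t = t_statistic(r, n)
t_crit = 2.228             # t_{0.025}(10), from a t-table (alpha = 0.05)
print(t, t > t_crit)       # t is about 3.59 here, so we reject H0
```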
2 Unary linear regression
2.1 Regression model
Population regression equation:
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$
Estimated regression equation:
$$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$$
Assumptions:
A1: the relationship between the dependent variable and the independent variable is linear.
A2: the variable $x$ is not random and must take at least two different values.
A3: $E(\varepsilon) = 0$
A4: the variance is constant around the regression line, independent of $x$: $\mathrm{var}(y) = \sigma^2$
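The coefficients $\hat\beta_0, \hat\beta_1$ of the estimated regression equation are typically obtained by ordinary least squares. A minimal sketch using the standard closed-form formulas (function name `fit_line` is my own):

```python
def fit_line(x, y):
    """Ordinary least-squares estimates for y = b0 + b1*x:
    b1 = (n*Sxy - Sx*Sy) / (n*Sxx - Sx^2),  b0 = ybar - b1*xbar."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b0 = sy / n - b1 * sx / n
    return b0, b1

# data generated exactly from y = 1 + 2x, so the fit recovers (1, 2)
b0, b1 = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)  # 1.0 2.0
```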