Definition 1: The $\textbf{covariance}$ between two sample random variables $x$ and $y$ is a measure of the linear association between the two variables, defined by the formula

$$
cov(x,y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})\tag{1}
$$
Note: Covariance is similar to variance, except that covariance is defined for two variables ($x$ and $y$ above), whereas variance is defined for only one. In fact, $cov(x,x) = var(x)$.
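Covariance can be thought of as a sum of matches and mismatches over the pairs of data elements of $x$ and $y$: a match occurs when both elements of a pair lie on the same side of their respective means; a mismatch occurs when one element is above its mean and the other is below its mean.
When the matches outweigh the mismatches, the covariance is positive; when the mismatches outweigh the matches, it is negative. The absolute value of the covariance indicates the strength of the linear relationship between $x$ and $y$: the stronger the linear relationship, the larger the covariance. The size of the covariance is also affected by the scale of the data elements, so to remove this scale factor the correlation coefficient is used as a scale-free measure of linear relationship.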
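As a minimal sketch of the sample covariance formula, the following plain-Python function uses the $n-1$ denominator. The function name `sample_covariance` and the data values are assumptions chosen only for illustration; they do not come from the original text.

```python
def sample_covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Each product is positive for a "match" (same side of the means)
    # and negative for a "mismatch" (opposite sides of the means).
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


# Hypothetical data, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = sample_covariance(x, y)
var_x = sample_covariance(x, x)  # cov(x, x) is the sample variance of x
cov_scaled = sample_covariance([10 * xi for xi in x], y)  # x rescaled by 10

print(cov_xy, var_x, cov_scaled)
```

Rescaling $x$ by a factor of 10 rescales the covariance by the same factor, which is the scale dependence that the correlation coefficient is meant to remove.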
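Definition 2: The correlation coefficient between two sample variables $x$ and $y$ is a scale-free measure of the linear association between the two variables, given by the formula

$$
r_{xy} = \frac{cov(x,y)}{s_xs_y}\tag{2}
$$

where $s_x$ and $s_y$ are the sample standard deviations of $x$ and $y$.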
We also use the term $\textbf{coefficient of determination}$ for $r^2$.
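A short sketch of the sample correlation coefficient and the coefficient of determination, computed as the covariance divided by the product of the sample standard deviations. The name `sample_correlation` and the data are illustrative assumptions, not part of the original text.

```python
import math


def sample_correlation(x, y):
    """Sample correlation: covariance divided by s_x * s_y."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = math.sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    # Raises ZeroDivisionError if either variable is constant (s_x or s_y is 0).
    return cov_xy / (s_x * s_y)


# Hypothetical data, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = sample_correlation(x, y)
r_squared = r ** 2  # coefficient of determination
r_rescaled = sample_correlation([10 * xi for xi in x], y)

print(r, r_squared, r_rescaled)
```

Unlike the covariance, the value of `r` is unchanged (up to floating-point rounding) when $x$ is rescaled by a positive constant.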
Note: Just as we saw for the variance in $\textbf{Measures of Variability}$, the covariance can be calculated as

$$
\begin{aligned}
\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})
&= \frac{1}{n-1}\left(\sum_{i=1}^{n}x_iy_i-\bar{x}\sum_{i=1}^{n}y_i-\bar{y}\sum_{i=1}^{n}x_i+n\bar{x}\bar{y}\right)\\
&= \frac{1}{n-1}\left(\sum_{i=1}^{n}x_iy_i-n\bar{x}\bar{y}\right)
\end{aligned}\tag{3}
$$
Therefore, we can also calculate the correlation coefficient as

$$
r_{xy} = \frac{\sum_{i = 1}^{n}x_iy_i-n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n}x_i^2-n\bar{x}^2}\sqrt{\sum_{i=1}^{n}y_i^2-n\bar{y}^2}}\tag{4}
$$
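To check that the raw-sum form of the correlation coefficient agrees with the definition based on deviations from the means, here is a minimal sketch that computes both on the same data. The function names and the data values are hypothetical.

```python
import math


def correlation_from_raw_sums(x, y):
    """Correlation computed from sums of x*y, x**2 and y**2."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    sum_yy = sum(yi * yi for yi in y)
    numerator = sum_xy - n * x_bar * y_bar
    denominator = math.sqrt(sum_xx - n * x_bar ** 2) * math.sqrt(sum_yy - n * y_bar ** 2)
    return numerator / denominator


def correlation_from_deviations(x, y):
    """Correlation computed from deviations about the means."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    # The common 1 / (n - 1) factors cancel between numerator and denominator.
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_raw = correlation_from_raw_sums(x, y)
r_dev = correlation_from_deviations(x, y)

print(r_raw, r_dev)
```

The two forms are algebraically identical. In floating-point arithmetic, the raw-sum form can lose precision when the means are large relative to the spread of the data, so the deviation form is often the safer choice in code.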
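Property 1: $-1\leq r\leq 1$.
Note: If $r$ is close to 1, then $x$ and $y$ are positively correlated. A positive linear correlation means that high values of $x$ are associated with high values of $y$, and low values of $x$ are associated with low values of $y$.
If $r$ is close to $-1$, then $x$ and $y$ are negatively correlated. A negative linear correlation means that high values of $x$ are associated with low values of $y$, and low values of $x$ are associated with high values of $y$.
When $r$ is close to 0, there is little linear relationship between $x$ and $y$.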
Note: We have defined covariance and the correlation coefficient for data samples. We can also define covariance and the correlation coefficient for populations, based on their probability density function (pdf).
Definition 3: The $\textbf{covariance}$ between two random variables $x$ and $y$ for a population with discrete or continuous pdf is defined by

$$
cov(x,y) = E[(x-\mu_{x})(y-\mu_{y})]\tag{5}
$$

where $E[\cdot]$ is the expectation function and $\mu_x$, $\mu_y$ are the population means of $x$ and $y$.
Definition 4: The $\textbf{(Pearson's product moment)}$ correlation coefficient for two variables $x$ and $y$ for a population with discrete or continuous pdf is

$$
\rho = \frac{cov(x,y)}{\sigma_x\sigma_y}\tag{6}
$$

where $\sigma_x$ and $\sigma_y$ are the population standard deviations of $x$ and $y$.
Property 2: $-1\leq\rho\leq 1$.
Property 3: $cov(x,y) = E[xy]-\mu_x\mu_y$.
Property 4: If $x$ and $y$ are independent, then $cov(x,y) = 0$.
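Independence implies zero covariance, but zero covariance does not by itself imply independence. The sketch below illustrates this with a small discrete population, using the identity $cov(x,y) = E[xy]-\mu_x\mu_y$. The population ($x$ uniform on $\{-1, 0, 1\}$ and $y = x^2$) is a hypothetical example chosen for illustration, not one taken from the original text.

```python
from fractions import Fraction

# Hypothetical discrete population: x is uniform on {-1, 0, 1} and y = x**2.
outcomes = [(-1, 1), (0, 0), (1, 1)]
p = Fraction(1, 3)  # probability of each (x, y) outcome

mu_x = sum(p * x for x, _ in outcomes)
mu_y = sum(p * y for _, y in outcomes)
e_xy = sum(p * x * y for x, y in outcomes)
cov_xy = e_xy - mu_x * mu_y  # cov(x, y) = E[xy] - mu_x * mu_y

# Independence would require P(x = 0, y = 0) == P(x = 0) * P(y = 0).
p_joint = sum(p for x, y in outcomes if x == 0 and y == 0)
p_x0 = sum(p for x, _ in outcomes if x == 0)
p_y0 = sum(p for _, y in outcomes if y == 0)
factorizes = p_joint == p_x0 * p_y0

print(cov_xy, p_joint, p_x0 * p_y0, factorizes)
```

Here the covariance is exactly zero even though $y$ is completely determined by $x$: covariance only measures linear association.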
Property 5:

$$
\begin{aligned}
var(x+y) &= var(x)+var(y)+2cov(x,y)\\
var(x-y) &= var(x)+var(y)-2cov(x,y)
\end{aligned}\tag{7}
$$
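The sample versions of variance and covariance (with the $n-1$ denominator) satisfy the same two identities, since they follow from expanding the squared deviations. A minimal numeric check, with a hypothetical helper `sample_covariance` and illustrative data:

```python
def sample_covariance(x, y):
    """Sample covariance with the n - 1 denominator; cov(x, x) is the variance."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)


# Hypothetical data, used only for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

total = [xi + yi for xi, yi in zip(x, y)]
diff = [xi - yi for xi, yi in zip(x, y)]

var_x = sample_covariance(x, x)
var_y = sample_covariance(y, y)
cov_xy = sample_covariance(x, y)

var_sum = sample_covariance(total, total)   # left side: var(x + y)
var_diff = sample_covariance(diff, diff)    # left side: var(x - y)
rhs_sum = var_x + var_y + 2 * cov_xy        # right side for the sum
rhs_diff = var_x + var_y - 2 * cov_xy       # right side for the difference

print(var_sum, rhs_sum, var_diff, rhs_diff)
```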
Note: It turns out that $r$ is not an unbiased estimate of $\rho$. A relatively unbiased estimate of $\rho^2$ is given by the $\textbf{adjusted coefficient of determination}$ $r_{adj}^2$:

$$
r_{adj}^2 = 1-\frac{(1-r^2)(n-1)}{n-2}\tag{8}
$$
While $r_{adj}^2$ is a better estimate of the population coefficient of determination, especially for small values of $n$, for large values of $n$ it is easy to see that $r_{adj}^2\approx r^2$. Note too that $r_{adj}^2\leq r^2$, and while $r_{adj}^2$ can be negative, this is relatively rare.
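The behavior of the adjusted coefficient of determination can be sketched numerically. The function name `adjusted_r_squared` and the input values below are assumptions for illustration only.

```python
def adjusted_r_squared(r_squared, n):
    """Adjusted coefficient of determination: 1 - (1 - r^2)(n - 1) / (n - 2)."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    return 1 - (1 - r_squared) * (n - 1) / (n - 2)


# Hypothetical inputs, used only for illustration.
small_sample = adjusted_r_squared(0.6, 5)      # noticeably below r^2 = 0.6
large_sample = adjusted_r_squared(0.6, 1000)   # very close to r^2 = 0.6
negative_case = adjusted_r_squared(0.1, 5)     # weak correlation, small n

print(small_sample, large_sample, negative_case)
```

With $r^2 = 0.6$, the adjustment is substantial at $n = 5$ and nearly vanishes at $n = 1000$; with a weak correlation and a small sample, the adjusted value drops below zero.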