Basic Concepts of Correlation

Definition 1: The $\textbf{covariance}$ between two sample random variables $x$ and $y$ is a measure of the linear association between the two variables, defined by the formula
$$
cov(x,y) = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})\tag{1}
$$
Note: Covariance is similar to variance, except that covariance is defined for two variables ($x$ and $y$ above), whereas variance is defined for only one. In fact, $cov(x,x) = var(x)$.
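
As a minimal sketch of equation (1), the snippet below computes the sample covariance directly from the definition and checks that $cov(x,x)$ agrees with the sample variance. The helper name `sample_cov` and the data values are made up for illustration and do not come from the article.

```python
from statistics import mean, variance

def sample_cov(x, y):
    """Sample covariance per equation (1), using the n - 1 divisor."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)

# Made-up illustrative data (an assumption, not from the article).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

print(sample_cov(x, y))               # covariance of x and y
print(sample_cov(x, x), variance(x))  # cov(x, x) matches the sample variance
```

`statistics.variance` also divides by $n-1$, which is why the two values on the last line agree.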

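The covariance can be thought of as the sum of matches and mismatches among the pairs of data elements of $x$ and $y$: a match occurs when both elements of a pair lie on the same side of their respective means; a mismatch occurs when one element of the pair is above its mean and the other is below its mean.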

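When the matches outweigh the mismatches, the covariance is positive; when the mismatches outweigh the matches, the covariance is negative. The absolute value of the covariance indicates the strength of the linear relationship between $x$ and $y$: for data on a given scale, the stronger the linear relationship, the larger the covariance in absolute value. The magnitude of the covariance is also affected by the scale of the data elements, so to remove this scale factor the correlation coefficient is used as a scale-free measure of linear relationship.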

Definition 2: The correlation coefficient between two sample variables $x$ and $y$ is a scale-free measure of the linear association between the two variables, given by the formula
$$
r_{xy} = \frac{cov(x,y)}{s_xs_y}\tag{2}
$$
We also use the term $\textbf{coefficient of determination}$ for $r^2$.
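
The following sketch illustrates why the correlation coefficient is described as scale-free: rescaling $x$ changes the covariance but leaves $r$ unchanged. The helper names (`sample_cov`, `sample_corr`), the data, and the factor of 100 are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import mean

def sample_cov(x, y):
    """Sample covariance with the n - 1 divisor."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)

def sample_corr(x, y):
    """Correlation per equation (2): cov(x, y) / (s_x * s_y)."""
    return sample_cov(x, y) / sqrt(sample_cov(x, x) * sample_cov(y, y))

# Made-up illustrative data (an assumption, not from the article).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
x_scaled = [100.0 * xi for xi in x]  # same measurements in a smaller unit

print(sample_cov(x, y), sample_cov(x_scaled, y))    # covariance grows with the scale
print(sample_corr(x, y), sample_corr(x_scaled, y))  # correlation stays the same
print(sample_corr(x, y) ** 2)                       # coefficient of determination r^2
```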

Note: Just as we saw for the variance in $\textbf{Measures of Variability}$, the covariance can be calculated as
$$
\begin{aligned}
\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) &= \frac{1}{n-1}\left(\sum_{i=1}^{n}x_iy_i-\bar{x}\sum_{i=1}^{n}y_i-\bar{y}\sum_{i=1}^{n}x_i+n\bar{x}\bar{y}\right)\\
&= \frac{1}{n-1}\left(\sum_{i=1}^{n}x_iy_i-n\bar{x}\bar{y}\right)
\end{aligned}\tag{3}
$$
Therefore, we can also compute the correlation coefficient as
$$
r_{xy} = \frac{\sum_{i = 1}^{n}x_iy_i-n\bar{x}\bar{y}}{\sqrt{\sum_{i=1}^{n}x_i^2-n\bar{x}^2}\sqrt{\sum_{i=1}^{n}y_i^2-n\bar{y}^2}}\tag{4}
$$
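
As a rough check that the raw-sum form agrees with the definition, the sketch below computes the correlation coefficient both ways on made-up data. The function names and data are hypothetical. In floating point the raw-sum form can lose precision when the means are large relative to the spread of the data, so this is meant as an illustration rather than a recommended implementation.

```python
from math import sqrt

def corr_from_sums(x, y):
    """Correlation computed from raw sums, as in the computational formula."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar
    sxx = sum(xi * xi for xi in x) - n * x_bar ** 2
    syy = sum(yi * yi for yi in y) - n * y_bar ** 2
    return sxy / (sqrt(sxx) * sqrt(syy))

def corr_from_deviations(x, y):
    """Correlation computed from deviations about the means."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Made-up illustrative data (an assumption, not from the article).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

print(corr_from_sums(x, y))
print(corr_from_deviations(x, y))
```

The $n-1$ divisors cancel between numerator and denominator, which is why neither function needs them.
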
Property 1: $-1\leq r\leq 1$.

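Note: If $r$ is close to 1, then $x$ and $y$ are positively correlated. A positive linear correlation means that higher values of $x$ are associated with higher values of $y$, and lower values of $x$ are associated with lower values of $y$.

If $r$ is close to -1, then $x$ and $y$ are negatively correlated. A negative linear correlation means that higher values of $x$ are associated with lower values of $y$, and lower values of $x$ are associated with higher values of $y$.

When $r$ is close to 0, there is little linear relationship between $x$ and $y$.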

Note: We have defined covariance and the correlation coefficient for data samples. We can also define covariance and the correlation coefficient for populations, based on their probability density function (pdf).

Definition 3: The $\textbf{covariance}$ between two random variables $x$ and $y$ for a population with discrete or continuous pdf is defined by
$$
cov(x,y) = E[(x-\mu_{x})(y-\mu_{y})]\tag{5}
$$
where $E[\cdot]$ is the expectation function.

Definition 4: The $\textbf{(Pearson's product moment)}$ correlation coefficient for two variables $x$ and $y$ for a population with discrete or continuous pdf is
$$
\rho = \frac{cov(x,y)}{\sigma_x\sigma_y}\tag{6}
$$
Property 2: $-1\leq\rho\leq1$.

Property 3: $cov(x,y) = E[xy]-\mu_x\mu_y$
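
The identity $cov(x,y) = E[xy]-\mu_x\mu_y$ follows from the population definition $cov(x,y) = E[(x-\mu_x)(y-\mu_y)]$ by expanding the product and using linearity of expectation, with $E[x]=\mu_x$ and $E[y]=\mu_y$. A short derivation, spelled out here as a supplement:

$$
\begin{aligned}
cov(x,y) &= E[(x-\mu_x)(y-\mu_y)]\\
&= E[xy]-\mu_yE[x]-\mu_xE[y]+\mu_x\mu_y\\
&= E[xy]-\mu_x\mu_y-\mu_x\mu_y+\mu_x\mu_y\\
&= E[xy]-\mu_x\mu_y
\end{aligned}
$$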

Property 4: $cov(x,y) = 0$ if $x$ and $y$ are independent. The converse does not hold in general: zero covariance does not by itself imply independence.
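
Independence implies zero covariance, but zero covariance alone does not imply independence. The sketch below works through a standard counterexample with exact fractions: a hypothetical discrete population in which $x$ takes the values $-1, 0, 1$ with equal probability and $y = x^2$. The setup is an illustration chosen here, not an example from the article.

```python
from fractions import Fraction

# Hypothetical population: x is -1, 0, or 1 with probability 1/3 each, and y = x**2.
support = [-1, 0, 1]
p = Fraction(1, 3)

mu_x = sum(p * x for x in support)             # E[x]
mu_y = sum(p * x ** 2 for x in support)        # E[y] = E[x^2]
e_xy = sum(p * x * x ** 2 for x in support)    # E[xy] = E[x^3]
cov_xy = e_xy - mu_x * mu_y                    # cov(x, y) = E[xy] - mu_x * mu_y

print(cov_xy)  # the covariance is exactly zero

# Dependence check: x = 0 forces y = 0, so the joint probability is 1/3,
# while independence would require P(x = 0) * P(y = 0) = 1/3 * 1/3.
p_joint = p
p_product = p * p
print(p_joint, p_product)
```

Here $y$ is completely determined by $x$, yet the relationship is not linear, so the covariance does not detect it.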

Property 5:
$$
\begin{aligned}
var(x+y) &= var(x)+var(y)+2cov(x,y)\\
var(x-y) &= var(x)+var(y)-2cov(x,y)
\end{aligned}\tag{7}
$$
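
The same identities hold for sample variances and covariances when both use the $n-1$ divisor, which makes them easy to check numerically. The sketch below does so on made-up data; `sample_cov` is a hypothetical helper written for this illustration.

```python
from statistics import mean, variance

def sample_cov(x, y):
    """Sample covariance with the n - 1 divisor."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)

# Made-up illustrative data (an assumption, not from the article).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

sums = [xi + yi for xi, yi in zip(x, y)]
diffs = [xi - yi for xi, yi in zip(x, y)]

# Left-hand side and right-hand side of each identity, side by side.
print(variance(sums), variance(x) + variance(y) + 2 * sample_cov(x, y))
print(variance(diffs), variance(x) + variance(y) - 2 * sample_cov(x, y))
```
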
Note: It turns out that $r$ is not an unbiased estimate of $\rho$. A relatively unbiased estimate of $\rho^2$ is given by the $\textbf{adjusted coefficient of determination}$ $r_{adj}^2$:
$$
r_{adj}^2 = 1-\frac{(1-r^2)(n-1)}{n-2}\tag{8}
$$
While $r_{adj}^2$ is a better estimate of the population coefficient of determination, especially for small values of $n$, for large values of $n$ it is easy to see that $r_{adj}^2\approx r^2$. Note too that $r_{adj}^2\leq r^2$, and while $r_{adj}^2$ can be negative, this is relatively rare.
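
To make the behavior of the adjusted coefficient of determination concrete, the sketch below evaluates $1-\frac{(1-r^2)(n-1)}{n-2}$ for a few sample sizes. The function name `adjusted_r2` and the chosen values of $r$ and $n$ are assumptions for illustration.

```python
def adjusted_r2(r, n):
    """Adjusted coefficient of determination: 1 - (1 - r^2)(n - 1) / (n - 2)."""
    if n <= 2:
        raise ValueError("n must be greater than 2")
    return 1 - (1 - r ** 2) * (n - 1) / (n - 2)

r = 0.5  # made-up sample correlation, so r^2 = 0.25
for n in (5, 50, 5000):
    # The adjusted value approaches r^2 from below as n grows.
    print(n, r ** 2, adjusted_r2(r, n))

# With a weak correlation and a small sample, the adjusted value drops below zero.
print(adjusted_r2(0.1, 5))
```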
