Covariance and Correlation

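This article takes a close look at the differences and connections between two statistical concepts, covariance and correlation. Covariance indicates the direction of the linear relationship between variables, while correlation measures both the strength and the direction of that relationship, and its values are standardized. The article presents their mathematical definitions and formulas, and shows how the covariance matrix and the correlation matrix are constructed from a data matrix.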

Demystifying the terms

Covariance indicates the direction of the linear relationship between variables.

Correlation on the other hand measures both the strength and direction of the linear relationship between two variables.

Correlation is a function of the covariance. What sets them apart is that correlation values are standardized, whereas covariance values are not.

Defining the terms mathematically

Covariance

$$
\begin{aligned}
cov(x,y) &= E[(x - \mu_x)(y - \mu_y)] \\
&= E[xy] - E[x]E[y]
\end{aligned}
$$
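
The second equality in the definition of covariance follows from linearity of expectation. A short derivation, writing $\mu_x = E[x]$ and $\mu_y = E[y]$ and treating both as constants:

$$
\begin{aligned}
E[(x - \mu_x)(y - \mu_y)] &= E[xy - \mu_y x - \mu_x y + \mu_x \mu_y] \\
&= E[xy] - \mu_y E[x] - \mu_x E[y] + \mu_x \mu_y \\
&= E[xy] - \mu_x \mu_y - \mu_x \mu_y + \mu_x \mu_y \\
&= E[xy] - E[x]E[y]
\end{aligned}
$$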

If we have only a single variable $x$, then

$$
\begin{aligned}
cov(x, x) &= E[(x - \mu_x)(x - \mu_x)] \\
&= E[(x - \mu_x)^2] \\
&= var(x) = \sigma^2(x) = \sigma^2_x \\
\text{Let } var(x) &:= s^2 \hspace{1cm} \text{sample variance}
\end{aligned}
$$

Expanding these for a sample of $n$ observations, we get

$$
\begin{aligned}
s^2 = cov(x, x) &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \\
cov(x,y) &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}
\end{aligned}
$$

The numerator of the first equation is called the sum of squared deviations, and the numerator of the second is called the sum of cross products.
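
As a quick check of the two sample formulas, here is a minimal Python sketch that computes the sum of cross products and divides by $n - 1$. The helper name `sample_cov` and the data are made up for illustration; they are not from the original text.

```python
import statistics


def sample_cov(xs, ys):
    """Sample covariance: sum of cross products divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross products of the deviations from each mean.
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / (n - 1)


# Illustrative data only.
xs = [2.0, 4.0, 6.0, 8.0]
ys = [1.0, 3.0, 2.0, 6.0]

s2 = sample_cov(xs, xs)    # covariance of x with itself is the sample variance
s_xy = sample_cov(xs, ys)  # sample covariance of x and y

print(s2, statistics.variance(xs))  # the two values should agree
print(s_xy)
```

Passing the same list twice turns the sum of cross products into the sum of squared deviations, which is why `sample_cov(xs, xs)` agrees with the standard library's `statistics.variance`.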

Correlation

$$
corr(x,y) = \frac{cov(x,y)}{\sigma_x \sigma_y} = \frac{E[(x - \mu_x)(y - \mu_y)]}{\sigma_x \sigma_y}
$$

For sample data, the standard deviations $\sigma_x, \sigma_y$ are replaced by the sample standard deviations $s_x, s_y$, and the expectation by the sample covariance.

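The correlation coefficient takes values in $[-1, 1]$. Its sign gives the direction of the relationship: a positive value means that as one variable increases the other tends to increase, and a negative value means the other tends to decrease. Its magnitude gives the strength, with $\pm 1$ indicating a perfect linear relationship.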
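
To make the range and the role of the sign concrete, the following sketch computes the sample correlation for three made-up data sets. The function name `sample_corr` and all numbers are illustrative assumptions. The $n - 1$ factors in the covariance and in the two standard deviations cancel, so the code works directly with the sums.

```python
import math


def sample_corr(xs, ys):
    """Sample correlation: cross products over the root of the squared deviations."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    # Raises ZeroDivisionError if either variable is constant (zero variance).
    return sxy / math.sqrt(sxx * syy)


# Illustrative data only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
up = [2.0 * x + 1.0 for x in xs]    # exact increasing linear relation
down = [-3.0 * x for x in xs]       # exact decreasing linear relation
noisy = [2.0, 1.0, 4.0, 3.0, 5.0]   # increasing on the whole, but not exactly linear

r_up = sample_corr(xs, up)
r_down = sample_corr(xs, down)
r_noisy = sample_corr(xs, noisy)
print(r_up, r_down, r_noisy)
```

The two exact linear relations sit at the ends of the range, while the noisy data set lands strictly between 0 and 1.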

Data-matrix representation of covariance and correlation

$$
X = \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{m1} & \cdots & x_{mn} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1 & \cdots & \mathbf{x}_n \end{bmatrix}
$$

The order of $X$ is $m \times n$.

We call each row an item (or subject) and each column a variable.

Now we can calculate the sample mean of the $j$th variable:

$$
\bar{x}_j = \frac{1}{m}\sum_{i=1}^{m} x_{ij}
$$

Similarly, the mean of the $i$th row is

$$
\bar{x}_i = \frac{1}{n}\sum_{j=1}^{n} x_{ij}
$$

We can then define the covariance matrix:

$$
\begin{aligned}
S = \frac{1}{m}
\begin{bmatrix} (\mathbf{x}_1 - \bar{\mathbf{x}}_1)^T \\ \vdots \\ (\mathbf{x}_n - \bar{\mathbf{x}}_n)^T \end{bmatrix}
\begin{bmatrix} \mathbf{x}_1 - \bar{\mathbf{x}}_1 & \cdots & \mathbf{x}_n - \bar{\mathbf{x}}_n \end{bmatrix}
&= \begin{bmatrix} s_{1}^2 & \cdots & s_{1n} \\ \vdots & \ddots & \vdots \\ s_{n1} & \cdots & s_{n}^2 \end{bmatrix} \\
\text{where } s_j^2 &= \frac{1}{m}\sum_{i=1}^{m}(x_{ij} - \bar{x}_j)^2 \hspace{1cm} \text{variance of the } j\text{th variable} \\
s_{jk} &= \frac{1}{m}\sum_{i=1}^{m}(x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k) \hspace{1cm} \text{covariance between the } j\text{th and } k\text{th variables} \\
\bar{x}_j &= \frac{1}{m}\sum_{i=1}^{m} x_{ij} \hspace{1cm} \text{mean of the } j\text{th variable}
\end{aligned}
$$

Here $\bar{\mathbf{x}}_j$ denotes the $m$-vector whose entries all equal $\bar{x}_j$, so that $\mathbf{x}_j - \bar{\mathbf{x}}_j$ is the centered $j$th column.

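We can see that the covariance matrix is an $n \times n$ symmetric matrix. Note that this definition divides by $m$, the number of rows; dividing by $m - 1$ instead gives the unbiased sample covariance, in line with the $n - 1$ denominator of the scalar formulas earlier.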
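
The construction of the covariance matrix can be sketched with NumPy as follows. The data matrix is made up for illustration (5 items, 3 variables), and the variable names `D` and `S` are this sketch's own choices. `np.cov` is used only as a cross-check: `rowvar=False` tells it that columns are variables, and `bias=True` selects the $1/m$ normalization used in the definition.

```python
import numpy as np

# Illustrative data only: m = 5 items (rows), n = 3 variables (columns).
X = np.array([
    [2.0, 1.0, 10.0],
    [4.0, 3.0, 8.0],
    [6.0, 2.0, 7.0],
    [8.0, 6.0, 3.0],
    [5.0, 3.0, 7.0],
])
m, n = X.shape

col_means = X.mean(axis=0)   # mean of each variable (column)
D = X - col_means            # centered data, shape (m, n)
S = (D.T @ D) / m            # n x n covariance matrix with 1/m scaling

print(S)
print(np.allclose(S, S.T))                                  # symmetric
print(np.allclose(S, np.cov(X, rowvar=False, bias=True)))   # matches NumPy
```

Swapping `/ m` for `/ (m - 1)` and dropping `bias=True` gives the unbiased version instead.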

Then we can define the correlation matrix:

$$
\begin{aligned}
R &= \frac{1}{m}
\begin{bmatrix} (\mathbf{x}_1 - \bar{\mathbf{x}}_1)^T / s_1 \\ \vdots \\ (\mathbf{x}_n - \bar{\mathbf{x}}_n)^T / s_n \end{bmatrix}
\begin{bmatrix} (\mathbf{x}_1 - \bar{\mathbf{x}}_1) / s_1 & \cdots & (\mathbf{x}_n - \bar{\mathbf{x}}_n) / s_n \end{bmatrix} \\
&= \begin{bmatrix} 1 & r_{12} & \cdots & r_{1n} \\ \vdots & \vdots & \ddots & \vdots \\ r_{n1} & \cdots & \cdots & 1 \end{bmatrix}
\end{aligned}
$$
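
The correlation matrix can be sketched the same way: center each column, divide it by its standard deviation, and form the scaled product. The data and the names `Z` and `R` are illustrative assumptions, and the sketch assumes no column is constant (otherwise the division by $s_j$ fails). `np.corrcoef` serves as a cross-check.

```python
import numpy as np

# Illustrative data only: m = 5 items (rows), n = 3 variables (columns).
X = np.array([
    [2.0, 1.0, 10.0],
    [4.0, 3.0, 8.0],
    [6.0, 2.0, 7.0],
    [8.0, 6.0, 3.0],
    [5.0, 3.0, 7.0],
])
m, n = X.shape

D = X - X.mean(axis=0)                 # centered columns
s = np.sqrt((D ** 2).mean(axis=0))     # standard deviation of each column (1/m scaling)
Z = D / s                              # standardized columns
R = (Z.T @ Z) / m                      # n x n correlation matrix

print(R)
print(np.allclose(np.diag(R), 1.0))                    # ones on the diagonal
print(np.allclose(R, np.corrcoef(X, rowvar=False)))    # matches NumPy
```

Because each column is divided by its own standard deviation, the choice between $1/m$ and $1/(m-1)$ cancels out, provided the same one is used for both the scaling and the product.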

Covariance versus Correlation

  • Covariance carries units: the product of the units of the two variables.
    Correlation is dimensionless.

  • Covariance can take any value in $(-\infty, +\infty)$.
    Correlation lies in $[-1, 1]$.
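
A small sketch of the first point: changing the unit of one variable rescales the covariance but leaves the correlation unchanged. The heights and weights below are invented numbers, used only to illustrate the effect.

```python
import numpy as np

# Illustrative data only.
height_m = np.array([1.60, 1.72, 1.68, 1.81, 1.75])   # metres
weight_kg = np.array([55.0, 68.0, 63.0, 80.0, 72.0])  # kilograms
height_cm = height_m * 100.0                           # same heights in centimetres

# np.cov and np.corrcoef return 2 x 2 matrices; entry [0, 1] is the cross term.
cov_m = np.cov(height_m, weight_kg)[0, 1]     # units: m * kg
cov_cm = np.cov(height_cm, weight_kg)[0, 1]   # units: cm * kg
r_m = np.corrcoef(height_m, weight_kg)[0, 1]
r_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print(cov_m, cov_cm)   # the second is 100 times the first
print(r_m, r_cm)       # equal up to rounding
```

This is why covariances measured in different units cannot be compared directly, while correlations can.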
