Spearman's rank correlation coefficient and Pearson correlation coefficient in detail


In statistics, Spearman's rank correlation coefficient or Spearman's rho, named after Charles Spearman and often denoted by the Greek letter \rho (rho) or as r_s, is a nonparametric measure of statistical dependence between two variables. It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.

Spearman's coefficient, like any correlation calculation, is appropriate for both continuous and discrete variables, including ordinal variables.[1][2] Spearman's \rho and Kendall's \tau can be formulated as special cases of a more general correlation coefficient.

Definition and calculation

The Spearman correlation coefficient is defined as the Pearson correlation coefficient between the ranked variables.[3]

For a sample of size n, the n raw scores X_i, Y_i are converted to ranks \operatorname{rg} X_i, \operatorname{rg} Y_i, and r_s is computed from:

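r_s = \rho_{\operatorname{rg}_X,\operatorname{rg}_Y} = \frac {\operatorname{cov}(\operatorname{rg}_X,\operatorname{rg}_Y)} { \sigma_{\operatorname{rg}_X} \sigma_{\operatorname{rg}_Y} }
where
  • \rho denotes the usual Pearson correlation coefficient, applied to the rank variables,
  • \operatorname{cov}(\operatorname{rg}_X,\operatorname{rg}_Y) is the covariance of the rank variables,
  • \sigma_{\operatorname{rg}_X} and \sigma_{\operatorname{rg}_Y} are the standard deviations of the rank variables.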

Only if all n ranks are distinct integers can it be computed using the popular formula

 r_s = 1 - \frac {6 \sum d_i^2}{n(n^2 - 1)},
where
  • d_i = \operatorname{rg}(X_i) - \operatorname{rg}(Y_i) is the difference between the two ranks of each observation,
  • n is the number of observations.
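The two definitions above can be compared with a minimal Python sketch. The helper names `ranks`, `pearson`, and `spearman_shortcut`, and the sample data, are illustrative assumptions and do not come from the source. For data without ties, the shortcut formula and the Pearson correlation of the ranks should agree.

```python
from math import sqrt

def ranks(values):
    """Rank distinct values in ascending order, starting at 1."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)

def spearman_shortcut(x, y):
    """1 - 6*sum(d_i^2) / (n(n^2 - 1)); valid only when all ranks are distinct."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Made-up data without ties (an assumption for illustration only).
x = [2.0, 7.5, 1.0, 4.2, 9.9]
y = [10.0, 3.0, 8.0, 9.0, 1.0]

r_definition = pearson(ranks(x), ranks(y))  # Pearson applied to the ranks
r_shortcut = spearman_shortcut(x, y)        # popular d_i^2 formula
print(r_definition, r_shortcut)
```

Here the ranks are (2, 4, 1, 3, 5) and (5, 2, 3, 4, 1), so \sum d_i^2 = 34 and both routes give −0.7.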

Identical values are usually each assigned fractional ranks equal to the average of their positions in the ascending order of the values, which is equivalent to averaging over all possible permutations.

If ties are present in the data set, the shortcut formula based on d_i yields incorrect results: only if all ranks are distinct in both variables does \sigma_{\operatorname{rg}_X} \sigma_{\operatorname{rg}_Y} = \operatorname{Var}{\operatorname{rg}_X} = \operatorname{Var}{\operatorname{rg}_Y} = (n^2 - 1)/12 hold, which is what produces the denominator 2n \operatorname{Var}{\operatorname{rg}_X} = n(n^2 - 1)/6 (cf. tetrahedral number T_{n-1}). The first equation, which normalizes by the standard deviations, may be used even when ranks are normalized to [0, 1] ("relative ranks"), because it is insensitive to both translation and linear scaling.
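To illustrate how ties break the shortcut formula, here is a small Python sketch. The function names and the four-point data set are hypothetical, chosen only to make the effect visible. Tied values receive the average of their positions, and the two formulas then give different answers.

```python
from math import sqrt

def average_ranks(values):
    """Ascending ranks; tied values share the mean of the positions they occupy."""
    order = sorted(values)
    first, count = {}, {}
    for pos, v in enumerate(order, start=1):
        first.setdefault(v, pos)           # first position of this value
        count[v] = count.get(v, 0) + 1     # how many times it occurs
    # mean of positions first .. first + count - 1
    return [first[v] + (count[v] - 1) / 2 for v in values]

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)

x = [1, 2, 2, 3]   # contains a tie (illustrative data)
y = [1, 2, 3, 4]
rx, ry = average_ranks(x), average_ranks(y)

n = len(x)
r_shortcut = 1 - 6 * sum((p - q) ** 2 for p, q in zip(rx, ry)) / (n * (n * n - 1))
r_definition = pearson(rx, ry)
print(rx, r_shortcut, r_definition)
```

With this data the shortcut gives 0.95, while the Pearson correlation of the ranks is about 0.9487, so only the latter should be trusted when ties occur.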

This method should also not be used in cases where the data set is truncated; that is, when the Spearman correlation coefficient is desired for the top X records (whether by pre-change rank or post-change rank, or both), the user should use the Pearson correlation coefficient formula given above.[citation needed]

The standard error of the coefficient (σ) was determined by Pearson in 1907 and Gosset in 1920. It is

 \sigma_{r_s} = \frac{ 0.6325 }{ \sqrt{n-1} }

Example

In this example, the raw data in the table below are used to calculate the correlation between a person's IQ and the number of hours spent in front of the TV per week.

IQ, X_i    Hours of TV per week, Y_i
106        7
86         0
100        27
101        50
99         28
103        29
97         20
113        12
112        6
110        17

Firstly, evaluate d^2_i. To do so use the following steps, reflected in the table below.

  1. Sort the data by the first column (X_i). Create a new column x_i and assign it the ranked values 1,2,3,...n.
  2. Next, sort the data by the second column (Y_i). Create a fourth column y_i and similarly assign it the ranked values 1,2,3,...n.
  3. Create a fifth column d_i to hold the differences between the two rank columns (x_i and y_i).
  4. Create one final column d^2_i to hold the value of column d_i squared.
IQ, X_i    Hours of TV per week, Y_i    rank x_i    rank y_i    d_i    d^2_i
86         0                            1           1           0      0
97         20                           2           6           −4     16
99         28                           3           8           −5     25
100        27                           4           7           −3     9
101        50                           5           10          −5     25
103        29                           6           9           −3     9
106        7                            7           3           4      16
110        17                           8           5           3      9
112        6                            9           2           7      49
113        12                           10          4           6      36

With d^2_i found, add them to find \sum d_i^2 = 194. The value of n is 10. These values can now be substituted back into the equation \rho = 1 - \frac {6 \sum d_i^2}{n(n^2 - 1)} to give

 \rho = 1- {\frac {6\times194}{10(10^2 - 1)}}

which evaluates to ρ = −29/165 = −0.175757575..., with a P-value of 0.627188 (using the t distribution).
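The arithmetic of this example can be re-checked with a short Python sketch. The variable names are illustrative; the data are the ten (IQ, TV hours) pairs from the example. The t statistic shown is the usual t = r \sqrt{(n-2)/(1-r^2)} with n − 2 degrees of freedom, which is assumed here to be the "t distribution" approach the text refers to; the P-value itself is not recomputed.

```python
from fractions import Fraction
from math import sqrt

iq = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
tv = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]

def ranks(values):
    """Ascending ranks starting at 1 (the data contain no ties)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

n = len(iq)
d_squared = [(rx - ry) ** 2 for rx, ry in zip(ranks(iq), ranks(tv))]
sum_d2 = sum(d_squared)

# Exact rational arithmetic so the result can be compared with -29/165.
rho = 1 - Fraction(6 * sum_d2, n * (n * n - 1))

# Assumed test statistic: t with n - 2 degrees of freedom.
t_stat = float(rho) * sqrt((n - 2) / (1 - float(rho) ** 2))
print(sum_d2, rho, float(rho), t_stat)
```

The sketch yields \sum d_i^2 = 194 and ρ = −29/165, matching the values above; the t statistic comes out at roughly −0.505.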

Chart of the data presented. It can be seen that there might be a negative correlation, but that the relationship does not appear definitive.

This low value shows that the correlation between IQ and hours spent watching TV is very low, although the negative value suggests that the longer the time spent watching television the lower the IQ. In the case of ties in the original values, this formula should not be used; instead, the Pearson correlation coefficient should be calculated on the ranks (where ties are given ranks, as described above).


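Pearson correlation coefficient

The Pearson correlation coefficient, also called the Pearson product-moment correlation coefficient, is a statistic that reflects how closely two variables are related. Put another way, it can be used to compute the similarity of two vectors (it is applied in text classification based on the vector space model and in user-preference recommender systems).

The Pearson correlation coefficient is computed as follows: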

\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E((X-\mu_X)(Y-\mu_Y))}{\sigma_X \sigma_Y} = \frac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E^2(X)} \sqrt{E(Y^2) - E^2(Y)}}

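The numerator is the covariance, and the denominator is the product of the standard deviations of the two variables. Clearly, neither X nor Y may have a standard deviation of 0.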
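As a rough sketch of the expectation form of this formula, the following Python replaces each expectation with a sample mean. The names `mean` and `pearson` and the data are illustrative assumptions. The zero-standard-deviation case is rejected explicitly, as the text requires.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def pearson(x, y):
    """(E[XY] - E[X]E[Y]) / (sqrt(E[X^2] - E[X]^2) * sqrt(E[Y^2] - E[Y]^2))."""
    e_x, e_y = mean(x), mean(y)
    e_xy = mean([a * b for a, b in zip(x, y)])
    sd_x = sqrt(mean([a * a for a in x]) - e_x ** 2)
    sd_y = sqrt(mean([b * b for b in y]) - e_y ** 2)
    if sd_x == 0 or sd_y == 0:
        # The coefficient is undefined when either variable is constant.
        raise ValueError("standard deviation is zero; correlation undefined")
    return (e_xy - e_x * e_y) / (sd_x * sd_y)

x = [1.0, 2.0, 3.0, 4.0]   # illustrative values
y = [2.0, 4.0, 5.0, 9.0]
r = pearson(x, y)
print(r)
```

For this data the covariance is 2.75 and the variances are 1.25 and 6.5, so r = 2.75 / \sqrt{8.125}, roughly 0.965. The E[X^2] − E[X]^2 form is convenient but can lose precision in floating point when the mean is large relative to the spread; subtracting the mean first is the more robust choice.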

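As the linear relationship between two variables strengthens, the correlation coefficient tends to 1 or −1: to 1 for a positive correlation and to −1 for a negative one. When two variables are independent the correlation coefficient is 0, but the converse does not hold. For example, take y = x^2 with X uniformly distributed on [−1, 1]: then E(XY) = E(X^3) = 0 and E(X) = 0, so \rho_{X,Y} = 0, yet x and y are clearly not independent. "Uncorrelated" and "independent" are therefore two different things. When Y and X follow a joint normal distribution, independence and uncorrelatedness are equivalent.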
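The y = x^2 counterexample can be checked numerically. The Python sketch below uses a symmetric grid of points on [−1, 1] as a stand-in for the uniform distribution (an assumption made so the sums are exact), with `Fraction` to avoid rounding error.

```python
from fractions import Fraction

# Symmetric grid on [-1, 1]; a discrete stand-in for the uniform distribution.
xs = [Fraction(i, 100) for i in range(-100, 101)]
ys = [x * x for x in xs]          # y is fully determined by x

def mean(values):
    return sum(values) / len(values)

e_x = mean(xs)
e_xy = mean([x * y for x, y in zip(xs, ys)])   # equals the mean of x^3
covariance = e_xy - e_x * mean(ys)
print(e_x, e_xy, covariance)
```

Both E(X) and E(XY) are exactly 0 on this grid, so the covariance (and hence the correlation) is 0 even though y is a deterministic function of x.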

For centered data (centering means subtracting the sample mean from every value, so that the mean becomes 0), E(X) = E(Y) = 0, and we have:

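\rho_{X,Y} = \frac{E(XY)}{\sqrt{E(X^2)} \sqrt{E(Y^2)}} = \frac{\frac{1}{N} \sum_{i=1}^{N} X_i Y_i}{\sqrt{\frac{1}{N} \sum_{i=1}^{N} X_i^2} \sqrt{\frac{1}{N} \sum_{i=1}^{N} Y_i^2}} = \frac{\sum_{i=1}^{N} X_i Y_i}{\sqrt{\sum_{i=1}^{N} X_i^2} \sqrt{\sum_{i=1}^{N} Y_i^2}} = \frac{\sum_{i=1}^{N} X_i Y_i}{\|X\| \|Y\|}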

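That is, the correlation coefficient can be viewed as the cosine of the angle between the (centered) sample vectors drawn from the two random variables.

Furthermore, once the vectors X and Y are normalized so that \|X\| = \|Y\| = 1, the correlation coefficient is simply their dot product: \rho_{X,Y} = X \cdot Y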
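A short Python sketch of this geometric view follows. The helper names and the data are illustrative assumptions. It centers two vectors, then computes the coefficient once as a cosine and once as the dot product of the unit-length vectors.

```python
from math import sqrt

def center(v):
    """Subtract the sample mean from every component."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (sqrt(dot(u, u)) * sqrt(dot(v, v)))

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = sqrt(dot(v, v))
    return [a / norm for a in v]

x = [1.0, 3.0, 2.0, 6.0]   # illustrative values
y = [2.0, 5.0, 1.0, 8.0]
xc, yc = center(x), center(y)

rho_cosine = cosine(xc, yc)                    # cosine of the centered vectors
rho_dot = dot(normalize(xc), normalize(yc))    # dot product after normalization
print(rho_cosine, rho_dot)
```

Both values equal 19 / \sqrt{420}, about 0.927, up to floating-point rounding.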

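Spearman rank correlation coefficient

First, note that there are other kinds of rank correlation coefficient, such as Kendall's rank correlation coefficient.

The Pearson linear correlation coefficient has two limitations:

  1. The data must be assumed to be drawn in pairs from a normal distribution.
  2. The data must be at least interval-scaled (equally spaced) within their logical range.

Several approaches exist for more general situations, and the Spearman rank correlation coefficient is one of them. It is a nonparametric (distribution-free) method for measuring the strength of association between variables. When there are no repeated values, if one variable is a strictly monotonic function of the other, the Spearman rank correlation coefficient is +1 or −1, and the variables are said to be perfectly Spearman rank correlated. Note how this differs from perfect Pearson correlation: the Pearson coefficient is +1 or −1 only when the two variables are linearly related.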
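The contrast between monotonic and linear relationships can be sketched in Python. The helper names and the choice of y = x^3 are assumptions for illustration. The relationship is strictly increasing but not linear, so the rank-based coefficient reaches 1 while the Pearson coefficient stays below 1.

```python
from math import sqrt

def ranks(values):
    """Ascending ranks starting at 1 (no ties in this data)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / sqrt(var_a * var_b)

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]     # strictly increasing, but not linear

r_pearson = pearson(x, y)
r_spearman = pearson(ranks(x), ranks(y))   # Spearman = Pearson on ranks
print(r_pearson, r_spearman)
```

Here the Pearson coefficient is 304 / \sqrt{103900}, about 0.943, whereas the Spearman coefficient is 1.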

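Sort the raw data x_i and y_i in descending order, and let x'_i and y'_i denote the positions of the original x_i and y_i in the sorted lists. x'_i and y'_i are called the ranks of x_i and y_i, and the rank difference is d_i = x'_i − y'_i. The Spearman rank correlation coefficient is:

\rho_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}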

Position  Original X  Sorted X  Rank of X  Original Y  Sorted Y  Rank of Y  Rank difference d_i
1         12          546       5          1           78        6          −1
2         546         45        1          78          46        1          0
3         13          32        4          2           45        5          −1
4         45          13        2          46          6         2          0
5         32          12        3          6           2         4          −1
6         2           2         6          45          1         3          3

For the data in the table above, the Spearman rank correlation coefficient is 1 − 6 × (1 + 1 + 1 + 9) / (6 × 35) = 0.6571.
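The calculation for this six-point table can be reproduced with a few lines of Python. The function name `descending_ranks` is an illustrative assumption; the data are the X and Y columns of the table. The sign of each d_i depends on the order of subtraction, but only the squares enter the formula.

```python
x = [12, 546, 13, 45, 32, 2]
y = [1, 78, 2, 46, 6, 45]

def descending_ranks(values):
    """Position of each value after sorting from largest to smallest (1-based)."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

rx, ry = descending_ranks(x), descending_ranks(y)
d = [a - b for a, b in zip(rx, ry)]          # d_i = x'_i - y'_i
n = len(x)
rho_s = 1 - 6 * sum(k * k for k in d) / (n * (n * n - 1))
print(rx, ry, d, round(rho_s, 4))
```

The ranks come out as (5, 1, 4, 2, 3, 6) and (6, 1, 5, 2, 4, 3), the squared differences sum to 12, and the coefficient rounds to 0.6571.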

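Look up the table of critical values for the rank correlation test (the critical value grows as the significance level shrinks):

n    Significance level
     0.05     0.01
5    0.900    1.000
6    0.829    0.943
7    0.714    0.893

For n = 6, 0.6571 < 0.829, so even at the 0.05 significance level (and therefore also at 0.01) the correlation is not significant, and X and Y are regarded as uncorrelated.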

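If the raw data contain repeated values, the tied values are each given the average of their ranks, for example:

Original X  Rank  Adjusted rank
0.8         5     5
1.2         4     (4+3)/2 = 3.5
1.2         3     (4+3)/2 = 3.5
2.3         2     2
18          1     1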
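The tie adjustment in this table can be sketched in Python as follows. The function name is hypothetical; ranks are taken in descending order to match the table.

```python
def descending_average_ranks(values):
    """Descending ranks; tied values get the mean of the positions they occupy."""
    order = sorted(values, reverse=True)
    adjusted = []
    for v in values:
        # 1-based positions of every occurrence of v in the sorted list
        positions = [i + 1 for i, w in enumerate(order) if w == v]
        adjusted.append(sum(positions) / len(positions))
    return adjusted

x = [0.8, 1.2, 1.2, 2.3, 18]
adjusted = descending_average_ranks(x)
print(adjusted)
```

The two occurrences of 1.2 occupy positions 3 and 4, so each receives rank 3.5, and the result is [5.0, 3.5, 3.5, 2.0, 1.0]. This version rescans the sorted list for every value, which is fine for small samples but quadratic in n.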

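The Spearman rank correlation coefficient was presumably derived from the rank-sum test, since the two look very much alike.

The difference between correlation and similarity

The Pearson correlation coefficient of X = (1, 2, 3) and Y = (4, 5, 6) equals 1, which says that X and Y are exactly linearly related (in fact Y = X + 3).

The similarity of X and Y, however, is not 1: measured by cosine distance, the distance between X and Y is strictly greater than 0 (their cosine similarity is about 0.975).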
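A brief Python sketch makes the distinction concrete. The helper names are illustrative assumptions, and "cosine distance" is taken here to mean 1 minus the cosine similarity of the raw, uncentered vectors.

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between the raw (uncentered) vectors."""
    return dot(u, v) / (sqrt(dot(u, u)) * sqrt(dot(v, v)))

def pearson(u, v):
    """Pearson correlation: cosine similarity after centering each vector."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine_similarity([a - mu for a in u], [b - mv for b in v])

x = [1, 2, 3]
y = [4, 5, 6]

r = pearson(x, y)
cos_sim = cosine_similarity(x, y)
cos_dist = 1 - cos_sim        # assumed definition of cosine distance
print(r, cos_sim, cos_dist)
```

The Pearson coefficient is 1 (up to rounding), while the cosine similarity is 32 / \sqrt{1078}, about 0.9746, so the cosine distance is about 0.025 rather than 0. Centering removes the offset of 3 between the vectors, which is exactly what the raw cosine measure still sees.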

