Comparing Statistical Correlation Coefficients: Pearson, Spearman, and Kendall Rank Correlation - CSDN Blog

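This post introduces the three main correlation coefficients in statistics (Pearson, Spearman, and Kendall rank correlation) and discusses where each applies and how they differ when measuring the association between variables. The Pearson coefficient assumes normally distributed data, whereas Spearman and Kendall are non-parametric methods that work on ranked data or on continuous variables converted to ranks. Kendall's tau usually takes smaller values than Spearman's rho, is less sensitive to error, and suits small samples. The post also shows how to compute these coefficients in Python and points out that the different ways of computing Kendall's tau can change the result.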


overview

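The three main statistical correlation coefficients are the Pearson correlation coefficient, the Spearman rank correlation coefficient, and the Kendall rank correlation coefficient.

Correlation coefficient: a measure of how strongly two things (which we call variables when working with data) are related.

Given two variables X and Y, the resulting correlation coefficient can be read as follows:

(1) When the correlation coefficient is 0, X and Y have no linear relationship (for the rank-based coefficients, no monotonic relationship). This alone does not rule out other kinds of dependence.

(2) When Y increases (decreases) as X increases (decreases), the two variables are positively correlated and the coefficient lies between 0.00 and 1.00.

(3) When Y decreases (increases) as X increases (decreases), the two variables are negatively correlated and the coefficient lies between -1.00 and 0.00.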
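A coefficient of 0 only rules out a linear trend (Pearson) or a monotonic trend (rank-based coefficients). As a minimal sketch, using a made-up symmetric dataset that is not from the original post, y below is fully determined by x and yet all three coefficients come out as zero:

```python
import numpy as np
from scipy import stats

# Hypothetical data: y depends on x exactly, but not monotonically.
x = np.arange(-3, 4)   # -3, -2, ..., 3
y = x ** 2

r, _ = stats.pearsonr(x, y)
rho, _ = stats.spearmanr(x, y)
tau, _ = stats.kendalltau(x, y)
print(r, rho, tau)  # all three are zero up to floating-point rounding
```

This is why a near-zero coefficient is best read as "no linear or monotonic trend" rather than "no relationship".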

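Name	Prerequisites
Pearson Correlation Coefficient	Normal distribution, independent observations, continuous variables
Spearman Rank	Ranked pairs, or continuous variables converted to ranks
Kendall Rank	Ranked pairs, or continuous variables converted to ranks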

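As the table above shows, the Pearson correlation coefficient has especially strict requirements. Because it assumes a distribution, it naturally relies on that distribution's parameters, the mean and the variance, which makes it a parametric method. The other two relax the distributional assumption and are not computed from the mean and variance, so they are non-parametric methods. They are also more widely applicable.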
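One way to see the parametric versus rank-based distinction: Spearman's coefficient is the Pearson coefficient applied to the ranks instead of the raw values. The sketch below uses made-up sample values (not from the original post) and relies on scipy.stats.rankdata, which assigns average ranks to ties by default:

```python
from scipy import stats

# Hypothetical sample values, chosen only for illustration.
x = [10, 20, 30, 40, 50, 60]
y = [1.2, 0.8, 3.5, 2.9, 10.0, 4.1]

# Replace each value by its rank (ties would receive the average rank).
rx = stats.rankdata(x)
ry = stats.rankdata(y)

r_on_ranks, _ = stats.pearsonr(rx, ry)
rho, _ = stats.spearmanr(x, y)
print(r_on_ranks, rho)  # the two values agree up to floating-point rounding
```

Because only the ranks enter the calculation, the large value 10.0 in y has no more influence than any other top-ranked value would.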

Formulas
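For reference, the standard textbook definitions for a sample of $n$ pairs $(x_i, y_i)$ without ties can be written as below. The symbols $d_i$ (difference between the rank of $x_i$ and the rank of $y_i$), $C$ (number of concordant pairs) and $D$ (number of discordant pairs) are notation assumed here for illustration, not taken from the original post.

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}$$

$$\tau = \frac{C - D}{\frac{1}{2}n(n-1)}$$

The first is Pearson's coefficient, the second Spearman's rho, and the third Kendall's tau. Only the first uses the sample means; the other two depend on the data through ranks or pairwise orderings alone.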

Other

Pearson's requirements are strict, so what is the difference between the other two?

Kendall's tau: usually smaller values than Spearman's rho. Calculated from concordant and discordant pairs. Insensitive to error. P-values are more accurate with smaller sample sizes.

Spearman's rho: usually larger values than Kendall's tau. Calculated from deviations (differences between ranks). Much more sensitive to error and discrepancies in the data.

In short, Spearman's rho is more sensitive to error than Kendall's tau, and Kendall's tau gives more accurate p-values for small samples.

As far as I know, this is why Kendall's tau is applicable in a wider range of situations.
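To make the "usually smaller" remark concrete, here is a small sketch on made-up ranks (not data from the original post) in which two adjacent pairs are swapped. It illustrates the typical pattern only and is not a proof that tau is always smaller than rho:

```python
from scipy import stats

# Hypothetical ranks with two adjacent swaps and no ties.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

rho, _ = stats.spearmanr(x, y)
tau, _ = stats.kendalltau(x, y)
print(rho, tau)  # rho is about 0.8, tau is about 0.6
```

Here 8 of the 10 pairs are concordant and 2 are discordant, giving tau = (8 - 2) / 10, while the squared rank differences sum to 4, giving rho = 1 - 6 * 4 / (5 * 24).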


Python implementation

import random

from scipy import stats

# Draw two samples of 10 integers from 0..99; the fixed seed makes the run repeatable.
random.seed(0)
x = random.choices(range(100), k=10)
y = random.choices(range(100), k=10)

# Each function returns the coefficient together with a p-value.
result = stats.pearsonr(x, y)
print("pearsonr", result)
result = stats.spearmanr(x, y)
print("spearmanr", result)
result = stats.kendalltau(x, y)
print("kendalltau", result)

output

pearsonr (0.20365851700261028, 0.5725197247700218)
spearmanr SpearmanrResult(correlation=0.23928181646454103, pvalue=0.5055248490013543)
kendalltau KendalltauResult(correlation=0.23002185311411807, pvalue=0.36396207510247747)

Note

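In fact, Kendall's tau can be computed in three ways (commonly called tau-a, tau-b, and tau-c). SciPy's kendalltau uses the second formula (tau-b) by default:
$\tau_2 = \frac{|\text{Concordant pairs}| - |\text{Discordant pairs}|}{\sqrt{(N_3 - N_1)(N_3 - N_2)}}$
where $N_3 = \frac{1}{2}N(N-1)$, $N_1 = \sum_{i=1}^{s}\frac{1}{2}U_i(U_i-1)$, $N_2 = \sum_{i=1}^{t}\frac{1}{2}V_i(V_i-1)$
Group the equal elements of X into small sets. s is the number of such sets in X (for example, if X contains 1 2 3 4 3 3 2, then s is 2, because only 2 and 3 are repeated), and $U_i$ is the number of elements in the i-th set. $N_2$ is computed in the same way from Y, where t is the number of tied sets in Y and $V_i$ is the size of the i-th one.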
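The tie-corrected formula can be checked numerically. The sketch below is an illustration under stated assumptions rather than SciPy's own implementation: the helper names tie_term and kendall_tau_b are hypothetical, X is the example from the text, and Y is a made-up sample with one tied pair.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

import numpy as np
from scipy import stats


def tie_term(values):
    """Sum of k*(k-1)/2 over groups of k equal values (N_1 or N_2)."""
    return sum(k * (k - 1) // 2 for k in Counter(values).values())


def kendall_tau_b(x, y):
    """Tau-b from the tie-corrected formula (hypothetical helper)."""
    n = len(x)
    # Concordant minus discordant pairs; tied pairs contribute 0.
    c_minus_d = sum(
        np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
        for i, j in combinations(range(n), 2)
    )
    n3 = n * (n - 1) // 2
    n1, n2 = tie_term(x), tie_term(y)
    return c_minus_d / sqrt((n3 - n1) * (n3 - n2))


x = [1, 2, 3, 4, 3, 3, 2]   # X from the text: tied groups {2, 2} and {3, 3, 3}
y = [2, 1, 4, 6, 3, 5, 5]   # hypothetical Y with one tied group {5, 5}

manual = kendall_tau_b(x, y)
scipy_tau, _ = stats.kendalltau(x, y)
print(tie_term(x), manual, scipy_tau)
```

For this X the tie term is 1 + 3 = 4, and the hand-computed value should match the one returned by scipy.stats.kendalltau up to rounding.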

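Papers [1,2,3,4] and others, however, usually use

$\tau_1 = \frac{|\text{Concordant pairs}| - |\text{Discordant pairs}|}{|\text{Concordant pairs}| + |\text{Discordant pairs}|}$
Consider the $C_n^2$ ways of picking two positions. A pair is concordant if the ranks at those two positions are ordered the same way in X and in Y, and discordant if they are ordered in opposite ways. When the ranks are equal (a tie), different papers handle the pair differently.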

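import numpy as np

def kendalltau_way1(x, y):
    # Sum of sign products over all pairs i < j: +1 for a concordant pair,
    # -1 for a discordant pair, 0 when either variable is tied.
    count = 0
    n = len(x)
    for i in range(n - 1):
        for j in range(i + 1, n):
            count += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    # Divide by the total number of pairs, N(N-1)/2.
    return count / (n * (n - 1) / 2)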

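In general, when ties are present $\tau_2$ is larger in magnitude than $\tau_1$, because the tie terms subtracted in the denominator make the denominator smaller. The reason is that when computing $\tau_1$, a pair whose ranks are tied in X or in Y is credited half to |Concordant pairs| and half to |Discordant pairs|, so the numerator is unaffected and the denominator stays at $N_3 = \frac{1}{2}N(N-1)$.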
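The effect of the tie correction can be seen on a small made-up sample (not from the original post). In this sketch the hypothetical helper kendall_tau_a divides by N(N-1)/2, and SciPy's default kendalltau supplies the tie-corrected value:

```python
from itertools import combinations

from scipy import stats


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_tau_a(x, y):
    """Kendall's tau with the plain N(N-1)/2 denominator (hypothetical helper)."""
    n = len(x)
    s = sum(
        sign(x[i] - x[j]) * sign(y[i] - y[j])
        for i, j in combinations(range(n), 2)
    )
    return s / (n * (n - 1) / 2)


x = [1, 2, 2, 3]   # one tied pair in x
y = [1, 2, 3, 3]   # one tied pair in y

tau_a = kendall_tau_a(x, y)
tau_b, _ = stats.kendalltau(x, y)
print(tau_a, tau_b)  # about 0.667 and 0.8
```

Both values share the numerator 4. The plain denominator is 6, while the tie-corrected one is sqrt((6 - 1) * (6 - 1)) = 5, which is why the second value is larger.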