Notes on Applied Statistics and Its Implementation in R (Extra Part 3): Correlation Analysis with Missing Values

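This post looks at correlation analysis in R on data that contain missing values. It compares how the correlation coefficient changes under different settings of the use argument, including everything, pairwise, complete.obs and pairwise.complete.obs, and uses an example to explain why a data set with missing values can yield different correlation coefficients.

Yesterday a classmate came to me with some questions about computing correlation coefficients in R, so this post discusses correlation analysis with missing values, and in particular how R handles missing data when computing correlations.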

Contents

1 Problem description
2 How R handles missing values in correlation analysis
3 The “Pairwise-complete correlation considered dangerous” case

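1 Problem description

Correlation analysis is a basic building block of data analysis and exploratory analysis; with a new data set in hand, it is usually one of the first things to run. My classmate ran into the following problem. The data used here are ones I simulated from random distributions so that their basic distributional characteristics resemble those of my classmate's data. The key feature is that the fourth column contains missing values.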
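For readers who want to follow along, a data set of this kind can be sketched as below. This is a hypothetical stand-in, not the data from the post: the column names a, b, c and d, the sample size, the seed, the normal distribution and the number of missing values are all assumptions chosen for illustration.

```r
# Hypothetical stand-in for the data described above (all settings are assumptions)
set.seed(42)                 # arbitrary seed, only for reproducibility
n <- 40                      # assumed sample size

dat <- data.frame(
  a = rnorm(n),
  b = rnorm(n),
  c = rnorm(n),
  d = rnorm(n)
)

# Remove 10 randomly chosen values from the fourth column only
dat$d[sample(n, 10)] <- NA

colSums(is.na(dat))          # number of missing values per column
sum(complete.cases(dat))     # number of rows without any missing value
```

With these settings, only column d has missing values, and 30 of the 40 rows are complete.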

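Some problems then showed up when computing the actual correlation coefficients.

The output shows clearly that when only b and c are used, the correlation coefficient and p-value are 0.24 and 0.13, but when b, c and d all enter the computation, they become 0.19 and 0.24. What causes the difference?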
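The same kind of discrepancy can be reproduced with base R alone. The sketch below uses simulated data (an assumption, so the numbers will not match 0.24 and 0.19) and assumes that the three-variable run drops every row in which any variable is missing. Under that assumption, adding d changes the b–c correlation because the rows where d is NA are no longer used.

```r
# Simulated data: b and c are complete, d has missing values (all assumptions)
set.seed(1)
n <- 40
dat <- data.frame(b = rnorm(n), c = rnorm(n), d = rnorm(n))
dat$d[sample(n, 10)] <- NA

# b and c alone: every row is usable
r_two <- cor(dat$b, dat$c)

# b, c and d together with listwise deletion: rows with a missing d are dropped
r_three <- cor(dat[, c("b", "c", "d")], use = "complete.obs")["b", "c"]

# The same value obtained by restricting b and c to the complete rows by hand
keep <- complete.cases(dat)
r_manual <- cor(dat$b[keep], dat$c[keep])

c(two_vars = r_two, three_vars = r_three, manual = r_manual)

# The p-values differ as well, because both r and the sample size change
cor.test(dat$b, dat$c)$p.value
cor.test(dat$b[keep], dat$c[keep])$p.value
```

Here r_three and r_manual agree, while r_two is computed from a different set of rows.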

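2 How R handles missing values in correlation analysis

On inspection, the key turned out to be the choice of the use argument. The values it can take mainly include pairwise, complete, complete.obs, pairwise.complete.obs and everything. Their meanings are described in turn below. All of them are, in effect, settings for how the covariance in the correlation formula is computed.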

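pairwise: each correlation is computed from the observations available for that pair of variables.
complete/complete.obs: only complete cases (rows with no missing value in any variable) are used. I have not found any difference between the two so far; in base R's cor(), use is resolved by partial matching, so complete selects complete.obs.
pairwise.complete.obs: the correlation for each pair of columns is computed from the vectors obtained by omitting, pair by pair, the rows that have a missing value in either column.
everything: missing values are not handled at all, so they propagate directly into the correlation matrix and the p-value computation. In other words, no correlation coefficient or p-value can be computed for a variable that contains NA.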
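The options above can be compared side by side with base R's cor(). This is a minimal sketch on simulated data (the data, seed and column names are assumptions). It covers only the correlation matrix, not p-values, and it does not speak for other functions that accept a use argument.

```r
# Simulated data: only column d has missing values (all assumptions)
set.seed(7)
n <- 30
dat <- data.frame(b = rnorm(n), c = rnorm(n), d = rnorm(n))
dat$d[1:8] <- NA

m_every <- cor(dat, use = "everything")             # default: NA propagates
m_comp  <- cor(dat, use = "complete.obs")           # listwise (casewise) deletion
m_pair  <- cor(dat, use = "pairwise.complete.obs")  # deletion per pair of columns

# cor() partially matches `use`, so these abbreviations select the same methods
identical(cor(dat, use = "complete"), m_comp)
identical(cor(dat, use = "pairwise"), m_pair)

# everything: any pair involving d is NA, while b-c is still computed
m_every["b", "d"]
m_every["b", "c"]

# pairwise keeps all 30 rows for b-c; complete.obs keeps only the 22 complete rows
m_pair["b", "c"]
m_comp["b", "c"]
```

For the b–c entry, the pairwise result equals the plain cor(dat$b, dat$c), whereas the complete.obs result is based on the 22 rows in which d is also observed.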

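Since, as noted above, these settings concern the covariance computation, the help page of cov, R's covariance function, is also useful for understanding them. The original text reads: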

If use is “everything”, NAs will propagate conceptually, i.e., a resulting value will be NA whenever one of its contributing observations is NA. If use is “all.obs”, then the presence of missing observations will produce an error. If use is “complete.obs” then missing values are handled by casewise deletion (and if there are no complete cases, that gives an error). “na.or.complete” is the same unless there are no complete cases, that gives NA. Finally, if use has the value “pairwise.complete.obs” then the correlation or covariance between each pair of variables is computed using all complete pairs of observations on those variables. This can result in covariance or correlation matrices which are not positive semi-definite, as well as NA entries if there are no complete pairs for that pair of variables. For cov and var, “pairwise.complete.obs” only works with the “pearson” method. Note that (the equivalent of) var(d