Using and Abusing Data Visualization: Anscombe’s Quartet and Cheating Bonferroni

(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)

Anscombe’s quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x, y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties.
Let’s load and view the data. There’s a built-in dataset, but I munged the data into a tidy format and included it in an R package that I wrote primarily for myself.
# If you don't have Tmisc installed, first install devtools, then
# install from GitHub:
# install.packages('devtools')
# devtools::install_github('stephenturner/Tmisc')
library(Tmisc)
data(quartet)
str(quartet)
## 'data.frame':    44 obs. of  3 variables:
##  $ set: Factor w/ 4 levels "I","II","III",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ x  : int  10 8 13 9 11 14 6 4 12 7 ...
##  $ y  : num  8.04 6.95 7.58 8.81 8.33 ...
set    x      y
I     10   8.04
I      8   6.95
I     13   7.58
II    10   9.14
II     8   8.14
II    13   8.74
III   10   7.46
III    8   6.77
III   13  12.74
IV     8   6.58
IV     8   5.76
IV     8   7.71
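(If you'd rather not install Tmisc, here's a minimal sketch of the same munging starting from base R's built-in anscombe data frame. It assumes a recent tidyr (>= 1.0) for pivot_longer, and the quartet_alt name is just for illustration.)
# Alternative sketch: tidy the built-in `anscombe` data frame into the
# same set/x/y layout, without installing Tmisc.
library(dplyr)
library(tidyr)
quartet_alt = anscombe %>%
  pivot_longer(everything(),
               names_to = c(".value", "set"),
               names_pattern = "([xy])([1-4])") %>%
  mutate(set = factor(set, labels = c("I", "II", "III", "IV"))) %>%
  arrange(set)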
Now, let’s compute the mean and standard deviation of both x and y, and the correlation coefficient between x and y for each dataset.
library(dplyr)
quartet %>%
  group_by(set) %>%
  summarize(mean(x), sd(x), mean(y), sd(y), cor(x,y))
## Source: local data frame [4 x 6]
##
##   set mean(x) sd(x) mean(y) sd(y) cor(x, y)
## 1   I       9  3.32     7.5  2.03     0.816
## 2  II       9  3.32     7.5  2.03     0.816
## 3 III       9  3.32     7.5  2.03     0.816
## 4  IV       9  3.32     7.5  2.03     0.817
Looks like each dataset has the same mean and standard deviation for both x and y, and the same correlation coefficient between x and y.
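In fact, the four sets also share (nearly) the same least-squares fit, roughly y = 3 + 0.5x. Here's a quick sketch verifying that with the same grouped-dplyr approach before we plot:
quartet %>%
  group_by(set) %>%
  summarize(intercept = coef(lm(y ~ x))[1],
            slope = coef(lm(y ~ x))[2])
# each set comes out near intercept = 3, slope = 0.5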
Now, let’s plot y versus x for each set with a linear regression trendline displayed on each plot:
library(ggplot2)
p = ggplot(quartet, aes(x, y)) + geom_point()
p = p + geom_smooth(method = lm, se = FALSE)
p = p + facet_wrap(~set)
p

This classic example really illustrates the importance of looking at your data, not just the summary statistics and model parameters you compute from it.
With that said, you can’t use data visualization to “cheat” your way into statistical significance. I recently had a collaborator who wanted some help automating a data visualization task so that she could decide which correlations to test. This is a terrible idea, and it’s going to get you in serious type I error trouble. To see what I mean, consider an experiment where you have a single outcome and lots of potential predictors to test individually. For example, some outcome and a bunch of SNPs or gene expression measurements. You can’t just visually inspect all those relationships then cherry-pick the ones you want to evaluate with a statistical hypothesis test, thinking that you’ve outsmarted your way around a painful multiple-testing correction.
Here’s a simple simulation showing why that doesn’t fly. In this example, I’m simulating 100 samples with a single outcome variable y and 64 different predictor variables, x. I might be interested in which x variable is associated with my y (e.g., which of my many gene expression measurements is associated with measured liver toxicity). But in this case, both x and y are random numbers. That is, I know for a fact that the null hypothesis is true, because that’s what I’ve simulated. Now we can make a scatterplot of each predictor variable against our outcome and look at those plots.
library(dplyr)
set.seed(42)
ndset = 64
n = 100
d = data_frame(
  set = factor(rep(1:ndset, each = n)),
  x = rnorm(n * ndset),
  y = rep(rnorm(n), ndset))
d
## Source: local data frame [6,400 x 3]
##
##    set       x       y
## 1    1  1.3710  1.2546
## 2    1 -0.5647  0.0936
## 3    1  0.3631 -0.0678
## 4    1  0.6329  0.2846
## 5    1  0.4043  1.0350
## 6    1 -0.1061 -2.1364
## 7    1  1.5115 -1.5967
## 8    1 -0.0947  0.7663
## 9    1  2.0184  1.8043
## 10   1 -0.0627 -0.1122
## .. ...     ...     ...
ggplot(d, aes(x, y)) + geom_point() + geom_smooth(method = lm) + facet_wrap(~set)

Now, if I were to go through this data and compute the p-value for the linear regression of each x on y, I’d get a uniform distribution of p-values, my type I error would be right where it should be, and my FDR- and Bonferroni-corrected p-values would almost all be 1. This is what we expect: remember, the null hypothesis is true.
library(dplyr)
results = d %>%
  group_by(set) %>%
  do(mod = lm(y ~ x, data = .)) %>%
  summarize(set = set, p = anova(mod)$"Pr(>F)"[1]) %>%
  mutate(bon = p.adjust(p, method = "bonferroni")) %>%
  mutate(fdr = p.adjust(p, method = "fdr"))
results
## Source: local data frame [64 x 4]
##
##    set      p   bon   fdr
## 1    1 0.2738 1.000 1.000
## 2    2 0.2125 1.000 1.000
## 3    3 0.7650 1.000 1.000
## 4    4 0.2094 1.000 1.000
## 5    5 0.8073 1.000 1.000
## 6    6 0.0132 0.844 0.844
## 7    7 0.4277 1.000 1.000
## 8    8 0.7323 1.000 1.000
## 9    9 0.9323 1.000 1.000
## 10  10 0.1600 1.000 1.000
## .. ...    ...   ...   ...
library(qqman)
qq(results$p)
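The QQ plot should hug the diagonal. As one more quick check (a sketch), the raw rejection rate at p < 0.05 should sit near 5% when every null is true:
# Under the global null, about 5% of the 64 raw p-values should fall
# below 0.05 by chance alone.
mean(results$p < 0.05)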

BUT, if I were to look at those plots above and cherry-pick which hypotheses to test based on how strong the correlation looks, my type I error will skyrocket. In the plot above, the x variables 6, 28, 41, and 49 look like they have a particularly strong correlation with my outcome, y. What happens if I try to do the statistical test on only those variables?
results %>% filter(set %in% c(6, 28, 41, 49))
## Source: local data frame [4 x 4]
##
##   set      p   bon   fdr
## 1   6 0.0132 0.844 0.844
## 2  28 0.0338 1.000 1.000
## 3  41 0.0624 1.000 1.000
## 4  49 0.0898 1.000 1.000
When I do that, my p-values for those four tests are all below 0.1, with two below 0.05 (and I’ll say it again, the null hypothesis is true in this experiment, because I’ve simulated random data). In other words, my type I error is now completely out of control, with more than 50% false positives at a p < 0.05 level. You’ll notice that the Bonferroni and FDR-corrected p-values are still not significant, correcting for all 64 tests.
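Here's a rough sketch (my own illustration, not part of the original analysis) that quantifies the inflation: rerun the whole null experiment many times, each time cherry-picking the four strongest-looking correlations (equivalently, the four smallest p-values) out of 64 and testing only those at p < 0.05.
set.seed(42)
nsim = 1000
hits = replicate(nsim, {
  y = rnorm(100)
  # p-value for the slope of y ~ x for each of 64 independent null
  # predictors; with equal n, a smaller p means a stronger-looking correlation
  p = sapply(1:64, function(i) summary(lm(y ~ rnorm(100)))$coefficients[2, 4])
  # cherry-pick the 4 smallest p-values and test only those
  any(sort(p)[1:4] < 0.05)
})
mean(hits)  # theory says 1 - 0.95^64, about 0.96
Because the smallest of 64 uniform p-values falls below 0.05 with probability 1 - 0.95^64, about 0.96, cherry-picking by eye virtually guarantees at least one "significant" false positive.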
The moral of the story here is to always look at your data, but don’t “cheat” by choosing which statistical tests to perform based solely on that visualization exercise.