Using and Abusing Data Visualization: Anscombe’s Quartet and Cheating Bonferroni

(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)

Anscombe’s quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed. Each dataset consists of eleven (x, y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data before analyzing it and the effect of outliers on statistical properties.
Let’s load and view the data. There’s a built-in dataset, but I munged the data into a tidy format and included it in an R package that I wrote primarily for myself.
# If you don't have Tmisc installed, first install devtools, then
# install from GitHub:
# install.packages('devtools')
# devtools::install_github('stephenturner/Tmisc')
library(Tmisc)
data(quartet)
str(quartet)
## 'data.frame':    44 obs. of  3 variables:
##  $ set: Factor w/ 4 levels "I","II","III",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ x  : int  10 8 13 9 11 14 6 4 12 7 ...
##  $ y  : num  8.04 6.95 7.58 8.81 8.33 ...
set    x      y
I     10   8.04
I      8   6.95
I     13   7.58
II    10   9.14
II     8   8.14
II    13   8.74
III   10   7.46
III    8   6.77
III   13  12.74
IV     8   6.58
IV     8   5.76
IV     8   7.71
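(If you'd rather not install Tmisc, here's a minimal sketch of the same munging starting from base R's built-in anscombe data frame. It assumes a recent tidyr (>= 1.0) for pivot_longer, and the quartet_alt name is just for illustration.)
# Alternative sketch: tidy the built-in `anscombe` data frame into the
# same set/x/y layout, without installing Tmisc.
library(dplyr)
library(tidyr)
quartet_alt = anscombe %>%
  pivot_longer(everything(),
               names_to = c(".value", "set"),
               names_pattern = "([xy])([1-4])") %>%
  mutate(set = factor(set, labels = c("I", "II", "III", "IV"))) %>%
  arrange(set)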
Now, let’s compute the mean and standard deviation of both x and y, and the correlation coefficient between x and y for each dataset.
library(dplyr)
quartet %>%
  group_by(set) %>%
  summarize(mean(x), sd(x), mean(y), sd(y), cor(x,y))
## Source: local data frame [4 x 6]
##
##   set mean(x) sd(x) mean(y) sd(y) cor(x, y)
## 1   I       9  3.32     7.5  2.03     0.816
## 2  II       9  3.32     7.5  2.03     0.816
## 3 III       9  3.32     7.5  2.03     0.816
## 4  IV       9  3.32     7.5  2.03     0.817
Looks like each dataset has the same mean and standard deviation for both x and y, and the same correlation coefficient between x and y.
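In fact, the four sets also share (nearly) the same least-squares fit, roughly y = 3 + 0.5x. Here's a quick sketch verifying that with the same grouped-dplyr approach before we plot:
quartet %>%
  group_by(set) %>%
  summarize(intercept = coef(lm(y ~ x))[1],
            slope = coef(lm(y ~ x))[2])
# each set comes out near intercept = 3, slope = 0.5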
Now, let’s plot y versus x for each set with a linear regression trendline displayed on each plot:
library(ggplot2)
p = ggplot(quartet, aes(x, y)) + geom_point()
p = p + geom_smooth(method = lm, se = FALSE)
p = p + facet_wrap(~set)
p

This classic example really illustrates the importance of looking at your data, not just the summary statistics and model parameters you compute from it.
With that said, you can’t use data visualization to “cheat” your way into statistical significance. I recently had a collaborator who wanted some help automating a data visualization task so that she could decide which correlations to test. This is a terrible idea, and it’s going to get you in serious type I error trouble. To see what I mean, consider an experiment where you have a single outcome and lots of potential predictors to test individually. For example, some outcome and a bunch of SNPs or gene expression measurements. You can’t just visually inspect all those relationships then cherry-pick the ones you want to evaluate with a statistical hypothesis test, thinking that you’ve outsmarted your way around a painful multiple-testing correction.
Here’s a simple simulation showing why that doesn’t fly. In this example, I’m simulating 100 samples with a single outcome variable y and 64 different predictor variables, x. I might be interested in which x variable is associated with my y (e.g., which of my many gene expression measurements is associated with measured liver toxicity). But in this case, both x and y are random numbers. That is, I know for a fact that the null hypothesis is true, because that’s what I’ve simulated. Now we can make a scatterplot of each predictor variable against our outcome and look at those plots.
library(dplyr)
set.seed(42)
ndset = 64
n = 100
d = data_frame(
  set = factor(rep(1:ndset, each = n)),
  x = rnorm(n * ndset),
  y = rep(rnorm(n), ndset))
d
## Source: local data frame [6,400 x 3]
##
##    set       x       y
## 1    1  1.3710  1.2546
## 2    1 -0.5647  0.0936
## 3    1  0.3631 -0.0678
## 4    1  0.6329  0.2846
## 5    1  0.4043  1.0350
## 6    1 -0.1061 -2.1364
## 7    1  1.5115 -1.5967
## 8    1 -0.0947  0.7663
## 9    1  2.0184  1.8043
## 10   1 -0.0627 -0.1122
## .. ...     ...     ...
ggplot(d, aes(x, y)) + geom_point() + geom_smooth(method = lm) + facet_wrap(~set)

Now, if I were to go through this data and compute the p-value for the linear regression of each x on y, I’d get a uniform distribution of p-values, my type I error would be right where it should be, and my FDR- and Bonferroni-corrected p-values would almost all be 1. This is what we expect: remember, the null hypothesis is true.
library(dplyr)
results = d %>%
  group_by(set) %>%
  do(mod = lm(y ~ x, data = .)) %>%
  summarize(set = set, p = anova(mod)$"Pr(>F)"[1]) %>%
  mutate(bon = p.adjust(p, method = "bonferroni")) %>%
  mutate(fdr = p.adjust(p, method = "fdr"))
results
## Source: local data frame [64 x 4]
##
##    set      p   bon   fdr
## 1    1 0.2738 1.000 1.000
## 2    2 0.2125 1.000 1.000
## 3    3 0.7650 1.000 1.000
## 4    4 0.2094 1.000 1.000
## 5    5 0.8073 1.000 1.000
## 6    6 0.0132 0.844 0.844
## 7    7 0.4277 1.000 1.000
## 8    8 0.7323 1.000 1.000
## 9    9 0.9323 1.000 1.000
## 10  10 0.1600 1.000 1.000
## .. ...    ...   ...   ...
library(qqman)
qq(results$p)
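The QQ plot should hug the diagonal. As one more quick check (a sketch), the raw rejection rate at p < 0.05 should sit near 5% when every null is true:
# Under the global null, about 5% of the 64 raw p-values should fall
# below 0.05 by chance alone.
mean(results$p < 0.05)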

BUT, if I were to look at those plots above and cherry-pick which hypotheses to test based on how strong the correlation looks, my type I error will skyrocket. In the plot above, the x variables 6, 28, 41, and 49 look like they have a particularly strong correlation with my outcome, y. What happens if I try to do the statistical test on only those variables?
results %>% filter(set %in% c(6, 28, 41, 49))
## Source: local data frame [4 x 4]
##
##   set      p   bon   fdr
## 1   6 0.0132 0.844 0.844
## 2  28 0.0338 1.000 1.000
## 3  41 0.0624 1.000 1.000
## 4  49 0.0898 1.000 1.000
When I do that, my p-values for those four tests are all below 0.1, with two below 0.05 (and I’ll say it again, the null hypothesis is true in this experiment, because I’ve simulated random data). In other words, my type I error is now completely out of control, with more than 50% false positives at a p < 0.05 level. You’ll notice that the Bonferroni and FDR-corrected p-values are still not significant, correcting for all 64 tests.
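Here's a rough sketch (my own illustration, not part of the original analysis) that quantifies the inflation: rerun the whole null experiment many times, each time cherry-picking the four strongest-looking correlations (equivalently, the four smallest p-values) out of 64 and testing only those at p < 0.05.
set.seed(42)
nsim = 1000
hits = replicate(nsim, {
  y = rnorm(100)
  # p-value for the slope of y ~ x for each of 64 independent null
  # predictors; with equal n, a smaller p means a stronger-looking correlation
  p = sapply(1:64, function(i) summary(lm(y ~ rnorm(100)))$coefficients[2, 4])
  # cherry-pick the 4 smallest p-values and test only those
  any(sort(p)[1:4] < 0.05)
})
mean(hits)  # theory says 1 - 0.95^64, about 0.96
Because the smallest of 64 uniform p-values falls below 0.05 with probability 1 - 0.95^64, about 0.96, cherry-picking by eye virtually guarantees at least one "significant" false positive.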
The moral of the story here is to always look at your data, but don’t “cheat” by choosing which statistical tests to perform based solely on that visualization exercise.