文章目录
Below I did a complete example of a parametric test to illustrate some important concepts.
Question
Suppose we obtained a random sample (which is representative of the US population), and we grouped the individuals into different groups according to which age decade they are in (the AgeDecade
column, whose values are strings). The data is about the individuals’ heartbeat per minute (pulse rate), which corresponds to the Pulse
column in the data below. In this problem, we want to test if the collected data provide evidence that the average pulse rate of people aged 20-29 is different from the average pulse rate of people aged 30-39. You may assume pulse rates follow a normal distribution (for the entire population as well as for each age category). We run 10000 repetitions.
Solution
1. State the hypotheses
- H 0 : μ 30 − 39 − μ 20 − 29 = 0 H_0:\mu_{30-39}-\mu_{20-29}=0 H0:μ30−39−μ20−29=0
-
H
a
:
μ
30
−
39
−
μ
20
−
29
≠
0
H_a: \mu_{30-39}-\mu_{20-29}\ne 0
Ha:μ30−39−μ20−29=0
where μ \mu μ stands for the mean pulse rate for the corresponding US population, and the test statistic of interest is thus the difference x ˉ 30 − 39 − x ˉ 20 − 29 \bar x_{30-39}-\bar x_{20-29} xˉ30−39−xˉ20−29 - Equivalently, the hypotheses may be stated as:
- Null hypothesis: the average pulse rate of people aged 20-29 is not different from the average pulse rate of people aged 30-39
- Alternative hypothesis: the average pulse rate of people aged 20-29 is different from the average pulse rate of people aged 30-39.
2. Importing and examining the data
- You can refer to the code and outputs below
# load the data (.csv file located in the same file/working directory)
nhanes <- read.csv("nhanes.csv")
# Examine the data
head(nhanes)
summary(nhanes)
![]() | Fig.1 head(nhanes) |
---|
![]() | Fig.2 summary(nhanes) |
---|
3. Exploratory analysis of the data
- From the boxplot below, we can infer that the test statistic is negative i.e. there is at least a difference in the means of the observed sample
- -0.7768498 is this difference
# side-by-side boxplot our sample
boxplot(Pulse ~ AgeDecade, data = nhanes, horizontal = TRUE)
![]() | Fig.2 boxplot() |
---|
# Calculating the test statistic of the sample
elder <- nhanes$Pulse[nhanes$AgeDecade=="30-39"]
elder_mean_pulse <- mean(elder)
younger <- nhanes$Pulse[nhanes$AgeDecade=="20-29"]
younger_mean_pulse <- mean(younger)
obs_diff <- elder_mean_pulse - younger_mean_pulse
obs_diff # -0.7768498 is the result
4. Paramatrizing
- Parametrizing the population under the null hypothesis of no difference using the obtained data
- Because we assumed both groups of sample are from the same population, we can pool them together to estimate that hypothetical population. Moreover, since the sample is a random sample, we can assume the sample is representative of the “null population”, so the sample mean is going to estimate the population mean, and the sample standard deviation is going to estimate the population standard deviation.
mean <- mean(nhanes$Pulse)
mean # 75.15403
sd <- sd(nhanes$Pulse)
sd # 11.78723
# N(mean = 75.15403, sd = 11.78723) is the null population
5. Executing the bootstrap test
-
The general implementation is straightforward. However, I must mention one big thing to notice in this step (I’ve made this mistake in the past):
Because we have a two-tailed test, the order of your subtraction does not really matter (i.e. you can calculate either x ˉ 20 − 29 − x ˉ 30 − 39 \bar x_{20-29}-\bar x_{30-39} xˉ20−29−xˉ30−39 or vice versa. Here, we assume we stritcly follow the hypotheses to use x ˉ 30 − 39 − x ˉ 20 − 29 \bar x_{30-39}-\bar x_{20-29} xˉ30−39−xˉ20−29, and we denote our observation to be o b s _ d i f f = − 0.7768498 obs\_diff = -0.7768498 obs_diff=−0.7768498. But one thing should always be implemented: you always want to make sure that when you are calculating the empirical p p p-value, you are looking for the probability that, under the null hypothesis, a random sample produces a difference in sample means that is at least as extreme or more extreme than your observed sample’s difference, which meansP ( ( x ˉ 30 − 39 − x ˉ 20 − 29 = o b s _ d i f f ) ⏟ a s e x t r e m e a s o u r o b s e r v a t i o n ∨ ( x ˉ 30 − 39 − x ˉ 20 − 29 < o b s _ d i f f ) ∨ ( x ˉ 30 − 39 − x ˉ 20 − 29 > o b s _ d i f f ⏟ m o r e e x t r e m e t h a n o u r o b s e r v a t i o n ) ∣ H 0 ) P(\underbrace {(\bar x_{30-39}-\bar x_{20-29}=obs\_diff)}_{as ~extreme ~as ~our ~observation} \lor \underbrace{(\bar x_{30-39}-\bar x_{20-29}<obs\_diff) \lor (\bar x_{30-39}-\bar x_{20-29}>obs\_diff}_{more~extreme~than~our ~observation})|H_0) P(as extreme as our observation (xˉ30−39−xˉ20−29=obs_diff)∨more extreme than our observation (xˉ30−39−xˉ20−29<obs_diff)∨(xˉ30−39−xˉ20−29>obs_diff)∣H0). Thus, we have two
abs()
function used in the lineempirical_p <- mean(abs(diffs) >= abs(obs_diff))
below.
Moreover, we have toround()
the sampled values because the pulse rate should be integers
diffs <- rep(NA,10000)
for(i in seq_along(diffs)){
group1 <- round(rnorm(length(elder), mean = mean,sd = sd))
group2 <- round(rnorm(length(younger),mean = mean,sd = sd))
diff <- mean(group1) - mean(group2)
diffs[i] <- diff
}
empirical_p <- mean(abs(diffs) >= abs(obs_diff))
# notice the two abs() above
# Moreover, recall that mean() applied
# to a logical vector computes the proportion
empirical_p # this is 0.241
6. Draw conclusion (statistical inference)
- Given the assumption that random sampling is implemented, we do a parametric bootstrap test. The empirical p-value, obtained by drawing random samples from a normal distribution
N
(
75.15403
,
11.78723
)
N(75.15403, 11.78723)
N(75.15403,11.78723) 10000 times, is
0.241
, which is large. This is weak evidence against the null hypothesis of no difference, so we fail to reject the null hypothesis and may suggest that there is no difference between the average pulse rate of people aged 20-29 and the average pulse rate of people aged 30-39.
Full Code
nhanes <- read.csv("nhanes.csv")
head(nhanes)
summary(nhanes)
boxplot(Pulse ~ AgeDecade, data = nhanes, horizontal = TRUE)
elder <- nhanes$Pulse[nhanes$AgeDecade=="30-39"]
elder_mean_pulse <- mean(elder)
younger <- nhanes$Pulse[nhanes$AgeDecade=="20-29"]
younger_mean_pulse <- mean(younger)
obs_diff <- elder_mean_pulse - younger_mean_pulse
obs_diff
diffs <- rep(NA,10000)
for(i in seq_along(diffs)){
group1 <- round(rnorm(length(elder), mean = mean,sd = sd))
group2 <- round(rnorm(length(younger),mean = mean,sd = sd))
diff <- mean(group1) - mean(group2)
diffs[i] <- diff
}
empirical_p <- mean(abs(diffs) >= abs(obs_diff))
empirical_p