Go Over a Complete Example of Parametric Bootstrap Test with R

本文链接：https://blog.youkuaiyun.com/Bill_Wang_01/article/details/114762255

本文提供了一个使用R进行参数化引导检验的完整案例，以测试20-29岁和30-39岁人群平均脉搏率是否不同。通过陈述假设、导入数据、探索性分析、参数化、执行引导测试和得出统计推断，得出结论为样本差异的观察值对应的p值较大，未能拒绝原假设，即两个年龄段的平均脉搏率无显著差异。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

文章目录

Below I did a complete example of a parametric test to illustrate some important concepts.

Question

Suppose we obtained a random sample (which is representative of the US population), and we grouped the individuals into different groups according to which age decade they are in (the AgeDecade column, whose values are strings). The data is about the individuals’ heartbeat per minute (pulse rate), which corresponds to the Pulse column in the data below. In this problem, we want to test if the collected data provide evidence that the average pulse rate of people aged 20-29 is different from the average pulse rate of people aged 30-39. You may assume pulse rates follow a normal distribution (for the entire population as well as for each age category). We run 10000 repetitions.

Solution

1. State the hypotheses

$H_0:\mu_{30-39}-\mu_{20-29}=0$
$H_a: \mu_{30-39}-\mu_{20-29}\ne 0$
where $\mu$ stands for the mean pulse rate for the corresponding US population, and the test statistic of interest is thus the difference $\bar x_{30-39}-\bar x_{20-29}$
Equivalently, the hypotheses may be stated as:
- Null hypothesis: the average pulse rate of people aged 20-29 is not different from the average pulse rate of people aged 30-39
- Alternative hypothesis: the average pulse rate of people aged 20-29 is different from the average pulse rate of people aged 30-39.

2. Importing and examining the data

- You can refer to the code and outputs below

# load the data (.csv file located in the same file/working directory)
nhanes <- read.csv("nhanes.csv")

# Examine the data
head(nhanes)
summary(nhanes)

	Fig.1 `head(nhanes)`

	Fig.2 `summary(nhanes)`

3. Exploratory analysis of the data

From the boxplot below, we can infer that the test statistic is negative i.e. there is at least a difference in the means of the observed sample
-0.7768498 is this difference

# side-by-side boxplot our sample
boxplot(Pulse ~ AgeDecade, data = nhanes, horizontal = TRUE)

	Fig.2 `boxplot()`

# Calculating the test statistic of the sample
elder <- nhanes$Pulse[nhanes$AgeDecade=="30-39"]
elder_mean_pulse <- mean(elder)

younger <- nhanes$Pulse[nhanes$AgeDecade=="20-29"]
younger_mean_pulse <- mean(younger)

obs_diff <- elder_mean_pulse - younger_mean_pulse
obs_diff #  -0.7768498 is the result

4. Paramatrizing

Parametrizing the population under the null hypothesis of no difference using the obtained data
Because we assumed both groups of sample are from the same population, we can pool them together to estimate that hypothetical population. Moreover, since the sample is a random sample, we can assume the sample is representative of the “null population”, so the sample mean is going to estimate the population mean, and the sample standard deviation is going to estimate the population standard deviation.

mean <- mean(nhanes$Pulse)
mean # 75.15403

sd <- sd(nhanes$Pulse)
sd # 11.78723
# N(mean = 75.15403, sd = 11.78723) is the null population

5. Executing the bootstrap test

The general implementation is straightforward. However, I must mention one big thing to notice in this step (I’ve made this mistake in the past):
Because we have a two-tailed test, the order of your subtraction does not really matter (i.e. you can calculate either $\bar x_{20-29}-\bar x_{30-39}$ or vice versa. Here, we assume we stritcly follow the hypotheses to use $\bar x_{30-39}-\bar x_{20-29}$ , and we denote our observation to be $obs\_diff = -0.7768498$ . But one thing should always be implemented: you always want to make sure that when you are calculating the empirical $p$ -value, you are looking for the probability that, under the null hypothesis, a random sample produces a difference in sample means that is at least as extreme or more extreme than your observed sample’s difference, which means

$P(\underbrace {(\bar x_{30-39}-\bar x_{20-29}=obs\_diff)}_{as ~extreme ~as ~our ~observation} \lor \underbrace{(\bar x_{30-39}-\bar x_{20-29}<obs\_diff) \lor (\bar x_{30-39}-\bar x_{20-29}>obs\_diff}_{more~extreme~than~our ~observation})|H_0)$ . Thus, we have two abs() function used in the line empirical_p <- mean(abs(diffs) >= abs(obs_diff)) below.
Moreover, we have to round() the sampled values because the pulse rate should be integers

diffs <- rep(NA,10000)
for(i in seq_along(diffs)){
  group1 <- round(rnorm(length(elder),  mean = mean,sd = sd))
  group2 <- round(rnorm(length(younger),mean = mean,sd = sd))
  diff <- mean(group1) - mean(group2)
  diffs[i] <- diff
}
empirical_p <- mean(abs(diffs) >= abs(obs_diff)) 
# notice the two abs() above
# Moreover, recall that mean() applied 
# to a logical vector computes the proportion
empirical_p # this is  0.241

6. Draw conclusion (statistical inference)

Given the assumption that random sampling is implemented, we do a parametric bootstrap test. The empirical p-value, obtained by drawing random samples from a normal distribution $N (75.15403, 11.78723)$ 10000 times, is 0.241, which is large. This is weak evidence against the null hypothesis of no difference, so we fail to reject the null hypothesis and may suggest that there is no difference between the average pulse rate of people aged 20-29 and the average pulse rate of people aged 30-39.

Full Code

nhanes <- read.csv("nhanes.csv")


head(nhanes)
summary(nhanes)


boxplot(Pulse ~ AgeDecade, data = nhanes, horizontal = TRUE) 


elder <- nhanes$Pulse[nhanes$AgeDecade=="30-39"]
elder_mean_pulse <- mean(elder)
younger <- nhanes$Pulse[nhanes$AgeDecade=="20-29"]
younger_mean_pulse <- mean(younger)
obs_diff <- elder_mean_pulse - younger_mean_pulse
obs_diff


diffs <- rep(NA,10000)
for(i in seq_along(diffs)){
  group1 <- round(rnorm(length(elder),  mean = mean,sd = sd))
  group2 <- round(rnorm(length(younger),mean = mean,sd = sd))
  diff <- mean(group1) - mean(group2)
  diffs[i] <- diff
}
empirical_p <- mean(abs(diffs) >= abs(obs_diff))
empirical_p