What is a p-value?


While I was refreshing my stats knowledge recently, I had some challenges following along with the explanation that was given in class, so I decided to write one out for myself for future reference.

For context, we are using the bootstrapping methods (that I’ve referenced previously) for simulating null and sampling distributions (rather than standard statistical formulae), and so the methodology is a bit different from what would be done in traditional statistics courses.

Definition

First of all, the definition of the p-value is: the probability of obtaining the observed statistic or a “more extreme” value (by “extreme” we just mean further in the direction favoured by the alternate hypothesis) if the null hypothesis is true. (If you need to know more about what null and alternate hypotheses are, Wikipedia has a pretty good definition.)

Note: There’s a whole other discussion to be had about what p-values mean if you assume that the alternate hypothesis is true, but that is for another day!

What does this actually mean?

The definition is all well and good, but what does it mean when actually using it? There are two things we want to consider: how it informs decision making, and how to calculate it.

Decision Making

The p-value helps us make a decision. Because of the way we construct our assumptions, when calculated, the p-value tells us the probability of committing a Type I error if the null hypothesis is true. (A Type I error is when you incorrectly reject the null hypothesis - usually we would consider making Type I errors to be ‘bad,’ so we want to make as few of them as possible, and so are happy to err on the side of caution and make this chance quite low)

A low p-value is often considered to be less than 0.05 in business and research, and 0.01 in medicine, but the cut-off could be any value appropriate to the situation. A p-value of 0.05 means that, if the null hypothesis were true, there would be a 5% chance of observing a statistic as extreme as (or more extreme than) the one we got. With this reasoning, at low p-values we typically reject the null hypothesis. That is, we act on the assumption that the observed statistic came from a population where the alternate hypothesis is true.

In standard methodology we don’t just calculate the p-value and decide if it’s low enough, we pick a threshold beforehand (a priori - sister wave!), which is called the α level/value.

So if we calculate the p-value and it is below the α we make a decision to act as if the alternate hypothesis is true.
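As a minimal sketch of this decision rule (both numbers here are placeholders, chosen just for illustration):

```python
ALPHA = 0.05    # threshold chosen a priori, before looking at the data
p_val = 0.0294  # placeholder p-value, e.g. from a simulated null distribution

# Reject the null hypothesis only when the p-value falls below the α level
reject_null = p_val < ALPHA
print(reject_null)
```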

Calculating

We know that we need to calculate a p-value. How do we do that? There are two things to consider: what do our null and alternate hypotheses say? (which tells us generally what we’ll need to calculate) And what values do we use to make the calculation? (how to actually do it).

At this point it is really better to start visualizing with a specific example, so here goes.

The Example

Let’s say that you were looking at whether eldest children have different IQs than their siblings.

While directionality is indicated, the general null hypothesis for this study is that there is no difference between the siblings. With this information we can simulate what the comparisons of siblings in the population would look like if the null hypothesis is true.

Simulating the Population

To note, when conducting hypothesis testing, we always simulate the null population and then compare to the observed statistic.

To do this, we reference the null hypothesis. As we discussed above, the general null hypothesis is that there is no difference. That is, if we took the average IQ for all eldest children and compared that to the average IQ for all of their siblings, the difference would be zero. For the sake of argument, let’s also simulate the population using a standard deviation of 0.75. (More technically, this would be the standard deviation of the sampling distribution of the differences that we created from the sample we took for our experiment) This allows us to set up the following representation of the null population.
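A minimal sketch of simulating this null population with numpy (the normal shape and the sample size of 10,000 are assumptions for illustration; the post only specifies a mean of 0 and a standard deviation of 0.75):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the sketch is reproducible

# Null population: differences in sibling IQ centred on 0 with spread 0.75
dist = rng.normal(loc=0.0, scale=0.75, size=10_000)

print(dist.mean(), dist.std())  # both should be close to 0 and 0.75
```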

The aqua space shows the spread of all of the differences in IQ between siblings that would be seen if the null hypothesis is true (the null population), and the blue dashed line shows the null mean (0). We can use this distribution to help us calculate our p-values.

Hypotheses Version 1: Eldest Children have Higher IQs

In the first version we are going to create hypotheses in line with the directionality indicated by the study.

Hypotheses

To test if eldest siblings have higher IQs, we would set up our hypotheses like this:

H0: μeldest - μnon-eldest ≤ 0
H1: μeldest - μnon-eldest > 0

That is:

Null Hypothesis: The difference between the average IQ of eldest children and the average IQ of non-eldest children is equal to or less than zero.
Alternate Hypothesis: The difference between the average IQ of eldest children and the average IQ of non-eldest children is greater than zero.

Calculation

Let’s say that in our sample we observed that the average difference in IQ for eldest children compared to their siblings was 1.4 points. We can call this our observed mean. In terms of our null population distribution, it would fall here:

The red line shows the observed difference.

We can now pick up from where we left off when we started talking about how to calculate the p-value. Now that we have the distribution (aqua shaded area), the null mean (blue dashed line) and the observed mean (red line), we have everything we need to calculate the p-value. To do this we are going to calculate the area for part of the aqua shaded distribution, but which part?

Here is where we consult our alternate hypothesis and look at the direction of the arrow. If the sign is greater than (with the tail pointing to the right), then we need to calculate the shaded area to the right of our observed value (the red line), or the proportion of values from our null distribution that are greater than the observed mean.

The code (using numpy, and where ‘dist’ is an array representing the distribution) for this would be as follows:

p_val = (dist > 1.4).mean()

This compares each value in the distribution to 1.4 and creates an array of these comparisons (for each comparison, if the null value is greater than the observed statistic, the result will be True; if not, it will be False). When calculating the mean, comparisons that were True are evaluated as 1 and comparisons that were False are evaluated as 0.

The value for ‘p_val’ will be:

0.0294

Therefore, 2.94% of values from our null distribution fall to the right, or are above, our observed mean. Looking at how much of the aqua shading is to the right of the red line compared to the rest of the shading, this seems to be about right.
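The boolean-mean trick can be checked on a toy example (the five values here are made up purely for illustration):

```python
import numpy as np

dist = np.array([0.2, 1.5, -0.3, 2.0, 1.0])  # hypothetical tiny "distribution"
mask = dist > 1.4    # elementwise comparison: [False, True, False, True, False]
p_val = mask.mean()  # True counts as 1, False as 0, so this is 2 out of 5

print(p_val)
```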

Conclusion

Because we are doing some pretty casual research, let’s assume that our α value is 0.05. Our calculated p-value is below this, and so we would reject the null hypothesis and say, “Yes, on average, eldest siblings do have a greater IQ than their younger siblings!”

Hypotheses Version 2: Eldest Children have Lower IQs

Now let’s switch it up and do the opposite version of the above hypotheses.

Hypotheses

This time we are looking at whether eldest siblings have lower IQs, so we would set up our hypotheses like this:

H0: μeldest - μnon-eldest ≥ 0
H1: μeldest - μnon-eldest < 0

That is:

Null Hypothesis: The difference between the average IQ of eldest children and the average IQ of non-eldest children is equal to or greater than zero.
Alternate Hypothesis: The difference between the average IQ of eldest children and the average IQ of non-eldest children is less than zero.

Calculation

For the sake of simplicity, let’s keep all of the previous values the same - the average difference for the null hypothesis is 0, the standard deviation of the sampling distribution is 0.75, and the difference in IQ that is observed between eldest children and their siblings for our sample is 1.4. The chart will therefore look exactly the same:

But, because of our change in the hypotheses, what we calculate for the p-value will be different.

This time, because the alternate hypothesis talks about “less than 0” (with the tail of the sign pointing to the left), the area that we are going to calculate is everything shaded aqua that is to the left of the red line, or the proportion of values from our null distribution that are less than the observed mean.

To do this, we switch the direction of the sign in our code:

p_val = (dist < 1.4).mean()

This time, the value calculated for ‘p_val’ will be:

0.9706

Again, when we look at how much aqua is to the left of the red line, this seems to make sense.

As a side note, because all we did was switch the signs, the total of this and the previously calculated p-value is one.
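We can check this complement relationship on a simulated null distribution (as before, the draws are assumed normal with the post’s mean of 0 and standard deviation of 0.75):

```python
import numpy as np

rng = np.random.default_rng(0)
dist = rng.normal(0.0, 0.75, 100_000)

p_upper = (dist > 1.4).mean()  # Version 1: "greater than"
p_lower = (dist < 1.4).mean()  # Version 2: "less than"

# With continuous draws, a value exactly equal to 1.4 is essentially
# impossible, so the two proportions partition the whole distribution
print(p_upper + p_lower)
```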

Conclusion

This p-value is WELL above our previously established α level and so in this case we would fail to reject the null hypothesis.

Our conclusion would be, “We don’t have evidence that, on average, eldest children have a lower IQ than their younger siblings.”

Hypotheses Version 3: There is a difference in the IQs of eldest children and their younger siblings

Note that what is particular about this hypothesis is that a direction isn’t specified, we are only looking for a difference. The difference could be higher or lower, and so we will need to consider that in our calculations.

Hypotheses

H0: μeldest - μnon-eldest = 0
H1: μeldest - μnon-eldest ≠ 0

That is:

Null Hypothesis: There is no difference between the average IQ of eldest children and the average IQ of non-eldest children.
Alternate Hypothesis: There is a difference between the average IQ of eldest children and the average IQ of non-eldest children.

Calculation

Again, we’re keeping everything the same. As a reminder, the chart will look like this:

However, because of our hypothesis, here’s where things get a bit different. We need to test any difference that we’ve observed between the null mean (0) and the observed mean (1.4) in both directions.

In other words, if we added an extra line to our chart to represent the difference between the null mean and the observed mean on the other side, we need to calculate the area that is outside both of the red lines.

Here’s how we do that.

First, we look at the direction of the observed mean in relation to the null mean. In our case, the observed mean is greater than the null mean. So first of all, we are going to calculate all of the values of the null distribution that are higher than the observed mean, or the right/upper tail (we’ve done this before in our first example).

p_upper = (dist > 1.4).mean()

The value of ‘p_upper’ will be:

0.0294

Then we need to look at the other side/tail. For us that will be the left/lower tail. The calculation goes like this:

diff_means = 0 - 1.4
lower_compare = 0 + diff_means
p_lower = (dist < lower_compare).mean()

The first line calculates the difference between the null mean and the observed mean. The second line adds this to the null mean to get the mirrored point on the other side of the distribution. The third line then counts the proportion of values in the distribution that fall below this left-hand line.

The value for ‘p_lower’ will be:

0.0321

To find the total p-value, we add ‘p_upper’ and ‘p_lower’ together.

p_val = p_upper + p_lower

The p-value will be:

0.0615
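The three steps can be collected into one helper function (a sketch; `two_tailed_p` is a name I’ve made up, not something from the post, and the null distribution is again simulated as normal for illustration):

```python
import numpy as np

def two_tailed_p(dist, observed, null_mean=0.0):
    """Proportion of null-distribution values at least as far from
    the null mean as the observed statistic, in either direction."""
    diff = abs(observed - null_mean)
    p_upper = (dist > null_mean + diff).mean()  # right/upper tail
    p_lower = (dist < null_mean - diff).mean()  # mirrored left/lower tail
    return p_upper + p_lower

rng = np.random.default_rng(0)
dist = rng.normal(0.0, 0.75, 100_000)
print(two_tailed_p(dist, 1.4))  # should land near the 0.06 region found above
```

Because the helper mirrors the observed difference around the null mean, it gives the same answer whether the observed statistic sits above or below the null mean.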

Conclusion

Again, our α level is 0.05. Our p-value (0.0615) is greater than this and so we would fail to reject the null hypothesis.

We would say, “We don’t have evidence of a difference, on average, between the IQ of eldest children and their younger siblings.”

One of the interesting things here is the impact of not selecting a direction in our hypothesis. If we select a direction, we have more ‘space’ within which to find a p-value that is lower than our α level compared to when we are just looking for a difference. (This is why, when I did my stats education, they strongly suggested that we use hypotheses that looked for a difference rather than a direction. You had to be VERY sure that you expected a certain direction if you wanted to pick that, to prevent a greater chance of Type I errors.)
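The ‘space’ point can be seen numerically: for the same observed value, a symmetric null distribution makes the two-tailed p-value roughly double the one-tailed one (simulated normal draws again, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
dist = rng.normal(0.0, 0.75, 100_000)

one_tailed = (dist > 1.4).mean()
two_tailed = (dist > 1.4).mean() + (dist < -1.4).mean()

# The two-tailed test needs a more extreme observation to clear the same α
print(one_tailed, two_tailed)
```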

Final Consideration - What does it MEAN?

Here’s where things can get interesting. When we are doing null hypothesis statistical testing (NHST) people can get very focused on p-values. A result being significant is everything, and if the result is not significant, the work is often put to the side.

But p-values are not the only things that we should consider in reality. Let’s think about the current situation, let’s say that the study used the methodology of our first example and so the greater average IQ for eldest siblings was found. What would you do as an employer? Would you include questions of sibling order in your hiring process to give you a better chance of hiring a more intelligent workforce?

If you look at the size of the difference (called the effect size), we’re talking about less than 2 points of difference in IQ (in a range that, for most people, runs from 70 to 130). Can you imagine the lawsuits that could be directed towards your company if you declined to hire someone because they weren’t an older sibling? In fact, this article suggests that in 4 out of 10 cases, the later-born is still smarter than their older sibling. That’s an awful lot of potential lawsuits for what would seem to be a very tiny potential increase in productivity. It would seem that the downsides of making decisions based on this information would be MUCH greater than any benefit of using it.

This is why our instructor encourages us to consider not just statistical significance but also practical significance of our hypothesis testing.
