Quantiles and Q-Q Plots

License: CC 4.0 BY-SA

Tags: #quantiles #QQ-plot

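This article explains the basic principles of the Q-Q plot and its implementation in R, including how sample quantiles are computed and how the quantile function is used to obtain them. It also covers the empirical cumulative distribution function (ecdf) and the role of ppoints in generating probability points, and compares how qqnorm, qqplot and qqline are used when drawing Q-Q and P-P plots.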

Q-Q Plot Basics

Sample Quantiles

quantile(x, ...)
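Given a series $x$, quantile returns the quantile corresponding to a given cumulative probability $p$.
There are 9 methods (types) for computing sample quantiles $^1$:

For type $i$ ($1 \le i \le 9$), the quantile at probability $p$ is computed as
$Q_i(p) = (1 - \gamma)\ x_j + \gamma\ x_{j+1},$

where
$\frac{j-m}{n} \le p < \frac{(j+1)-m}{n}, \quad j = \lfloor np + m \rfloor, \quad g = (np + m) - j, \quad \gamma = f(j, g)$
$x_j$ is the $j$-th order statistic;
$n$ is the length of $x$ (the sample size);
$m$ is a constant whose value depends on the type $i$;
$j$ is determined by $j = \lfloor np + m \rfloor$;
$g$ is the fractional gap left over after taking $j$: $g = (np + m) - j$;
$\gamma$ is determined jointly by $j$ and $g$ (see the table below):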

type	$m$	$\gamma$	$p_k$	description
1	$0$	$\gamma = \begin{cases} 0, & g = 0 \\ 1, & \text{otherwise} \end{cases}$	-	Inverse of the empirical distribution function.
2	$0$	$\gamma = \begin{cases} 0.5, & g = 0 \\ 1, & \text{otherwise} \end{cases}$	-	Similar to type 1 but with averaging at discontinuities. (SAS default)
3	$-1/2$	$\gamma = \begin{cases} 0, & g = 0 \text{ and } j \text{ is even} \\ 1, & \text{otherwise} \end{cases}$	-	Nearest even order statistic. (SAS default till ca. 2010)
4	$0$	$\gamma = g$	$\frac{k}{n}$	Linear interpolation of the empirical cdf.
5	$1/2$	$\gamma = g$	$\frac{k - 0.5}{n}$	A piecewise linear function whose knots are the values midway through the steps of the empirical cdf. Popular amongst hydrologists.
6	$p$	$\gamma = g$	$\frac{k}{n + 1}$	Thus $p_k = E[F(x_k)]$. Used by Minitab and by SPSS.
7	$1 - p$	$\gamma = g$	$\frac{k - 1}{n - 1}$	In this case $p_k = \text{mode}[F(x_k)]$. Used by S.
8	$\frac{p+1}{3}$	$\gamma = g$	$\frac{k - 1/3}{n + 1/3}$	Then $p_k \approx \text{median}[F(x_k)]$. The resulting quantile estimates are approximately median-unbiased regardless of the distribution of $x$.
9	$p/4 + 3/8$	$\gamma = g$	$\frac{k - 3/8}{n + 1/4}$	The resulting quantile estimates are approximately unbiased for the expected order statistics if $x$ is normally distributed.

For types 4 to 9, $Q_i(p)$ is a continuous function of $p$ with $\gamma = g$; equivalently, the quantile is obtained by linear interpolation between the points $(p_k, x_k)$, where $x_k$ is the $k$-th order statistic.
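The formula $(1-\gamma)\,x_j + \gamma\,x_{j+1}$ can be checked numerically against quantile(). The following is a minimal sketch for the continuous types 4 to 9 only, where $\gamma = g$; the helper name manual_quantile and the sample data are hypothetical, and clamping $j$ to the range $1..n$ at the boundaries is an assumption of this sketch, not a description of R's internal code.

```r
# Hypothetical helper: sample quantile for the continuous types 4-9,
# computed directly as (1 - g) * x_j + g * x_{j+1} with j = floor(n*p + m).
manual_quantile <- function(x, p, type = 7) {
  stopifnot(type %in% 4:9, all(p >= 0), all(p <= 1))
  xs <- sort(x)                      # order statistics x_1 <= ... <= x_n
  n <- length(xs)
  m <- switch(as.character(type),    # the constant m for each type
              "4" = 0,
              "5" = 1 / 2,
              "6" = p,
              "7" = 1 - p,
              "8" = (p + 1) / 3,
              "9" = p / 4 + 3 / 8)
  h <- n * p + m
  j <- floor(h)
  g <- h - j                         # for types 4-9, gamma = g
  # Assumption of this sketch: indices outside 1..n are clamped to the ends.
  lo <- xs[pmin(pmax(j, 1), n)]
  hi <- xs[pmin(pmax(j + 1, 1), n)]
  (1 - g) * lo + g * hi
}

x <- c(2.1, 3.5, 5.0, 7.2, 8.8, 9.4, 12.0)   # illustrative data
probs <- c(0.1, 0.25, 0.5, 0.9)
for (type in 4:9) {
  manual  <- manual_quantile(x, probs, type = type)
  builtin <- quantile(x, probs, type = type, names = FALSE)
  cat("type", type, "agrees:", isTRUE(all.equal(manual, builtin)), "\n")
}
```

Types 1 to 3 are left out on purpose: their $\gamma$ depends on whether $g$ is exactly zero, which needs care with floating-point comparison.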

Empirical Cumulative Distribution Function

ecdf(x, ...)

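$F_n(t) = \frac{\#\{x_i \leq t\}}{n} = \frac{\sum_{i=1}^{n} \text{Indicator}(x_i \leq t)}{n}.$
Here Indicator is the indicator function:
Indicator(TRUE) = 1; Indicator(FALSE) = 0.

It follows that, stepping through the order statistics $x_i$ in increasing order, each additional element raises the cumulative probability by $1/n$ (tied values raise it by a multiple of $1/n$ at once).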
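The $1/n$ step can be seen directly in R. This is a small sketch with made-up data; manual_ecdf is a hypothetical helper that evaluates the definition of $F_n(t)$ literally, for comparison with the function returned by ecdf().

```r
x <- c(3.2, 1.5, 4.8, 2.7, 9.1)   # illustrative data, no ties
n <- length(x)
Fn <- ecdf(x)                      # ecdf() returns a step function

# Hypothetical helper: proportion of observations <= t, straight from the definition.
manual_ecdf <- function(x, t) {
  vapply(t, function(ti) mean(x <= ti), numeric(1))
}

# Evaluated at the order statistics, F_n climbs by 1/n each time.
steps <- Fn(sort(x))
print(steps)                       # 0.2 0.4 0.6 0.8 1.0

# Both versions agree at arbitrary points, including outside the data range.
t <- c(0, 3, 4.8, 10)
print(cbind(t, builtin = Fn(t), manual = manual_ecdf(x, t)))
```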

ppoints

Generates $n$ probability points spread "evenly" over [0, 1].

> ppoints
function (n, a = if (n <= 10) 3/8 else 1/2) 
{
    if (length(n) > 1L) 
        n <- length(n)
    if (n > 0) 
        (1L:n - a)/(n + 1 - 2 * a)
    else numeric()
}

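From the online help for ppoints we learn $^2$:

The probability points produced by ppoints are symmetric on [0, 1]: $p_i + p_{n-i+1} = 1;\ i = 1..n$;
By default, when $n \le 10$, $a = 3/8$, which gives $p_i = \frac{i-3/8}{n+1/4}$; when $n > 10$, $a = 1/2$, which gives $p_i = \frac{i-0.5}{n}$. With these defaults, ppoints is typically used to produce the cumulative probabilities for the standard normal distribution.

qqnorm uses ppoints to produce the corresponding standard normal quantiles: x <- qnorm(ppoints(n))[order(order(y))]. Here order(order(y)) gives the rank of each element of y, so each observation is paired with the theoretical quantile of the same rank.

Different values of a correspond to different type values of quantile(): since $p_i = \frac{i-a}{n+1-2a}$, the choices $a = 1/2,\ 0,\ 1,\ 1/3,\ 3/8$ reproduce the $p_k$ of types 5, 6, 7, 8 and 9 respectively.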
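The properties above are easy to confirm numerically. The sketch below uses an arbitrary $n$; the mapping from a to quantile types is obtained by comparing $(i-a)/(n+1-2a)$ with the $p_k$ expressions listed for types 5 to 9, and the variable names are only illustrative.

```r
n <- 8
i <- seq_len(n)

# Default for n <= 10 is a = 3/8, i.e. p_i = (i - 3/8) / (n + 1/4).
p_small <- ppoints(n)
print(all.equal(p_small, (i - 3/8) / (n + 1/4)))

# Default for n > 10 is a = 1/2, i.e. p_i = (i - 0.5) / n.
p_large <- ppoints(20)
print(all.equal(p_large, (1:20 - 0.5) / 20))

# Symmetry: p_i + p_{n-i+1} = 1.
print(p_small + rev(p_small))

# Each a reproduces the p_k of one continuous quantile type.
a_by_type <- c("5" = 1/2, "6" = 0, "7" = 1, "8" = 1/3, "9" = 3/8)
p_type6 <- ppoints(n, a = a_by_type[["6"]])   # k / (n + 1)
p_type7 <- ppoints(n, a = a_by_type[["7"]])   # (k - 1) / (n - 1)
print(rbind(type6 = p_type6, type7 = p_type7))
```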

Q-Q Plots

qqnorm

function (y, ylim, main = "Normal Q-Q Plot", xlab = "Theoretical Quantiles", 
    ylab = "Sample Quantiles", plot.it = TRUE, datax = FALSE, ...) 
{
    if (has.na <- any(ina <- is.na(y))) {
        yN <- y
        y <- y[!ina]
    }
    if (0 == (n <- length(y))) 
        stop("y is empty or has only NAs")
    if (plot.it && missing(ylim)) 
        ylim <- range(y)
    x <- qnorm(ppoints(n))[order(order(y))]
    if (has.na) {
        y <- x
        x <- yN
        x[!ina] <- y
        y <- yN
    }
    if (plot.it) 
        if (datax) 
            plot(y, x, main = main, xlab = ylab, ylab = xlab, 
                xlim = ylim, ...)
        else plot(x, y, main = main, xlab = xlab, ylab = ylab, 
            ylim = ylim, ...)
    invisible(if (datax) list(x = y, y = x) else list(x = x, y = y))
}

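The implementation shows that y holds the sample quantiles to be compared, x holds the theoretical standard normal quantiles generated through ppoints, and plot is then called to draw the scatter plot.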
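The key line of qqnorm can be reproduced by hand. This is a sketch on simulated data (the seed, sample size and distribution parameters are arbitrary); plot.it = FALSE is used so that only the computed coordinates are returned and no graphics device is needed.

```r
set.seed(1)
y <- rnorm(15, mean = 10, sd = 2)   # simulated sample, illustrative only
n <- length(y)

# Coordinates computed by qqnorm without drawing anything.
qq <- qqnorm(y, plot.it = FALSE)

# Manual version: theoretical quantiles are sorted, so they are re-indexed
# by the rank of each observation to line up with y in its original order.
ranks <- order(order(y))
x_manual <- qnorm(ppoints(n))[ranks]

print(all.equal(qq$x, x_manual))    # same theoretical quantiles
print(all.equal(qq$y, y))           # y is returned unchanged

# Sorting both coordinates gives the points in left-to-right plotting order.
print(head(cbind(theoretical = sort(qq$x), sample = sort(qq$y))))
```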

qqplot

> qqplot
function (x, y, plot.it = TRUE, xlab = deparse(substitute(x)), 
    ylab = deparse(substitute(y)), ...) 
{
    sx <- sort(x)
    sy <- sort(y)
    lenx <- length(sx)
    leny <- length(sy)
    if (leny < lenx) 
        sx <- approx(1L:lenx, sx, n = leny)$y
    if (leny > lenx) 
        sy <- approx(1L:leny, sy, n = lenx)$y
    if (plot.it) 
        plot(sx, sy, xlab = xlab, ylab = ylab, ...)
    invisible(list(x = sx, y = sy))
}

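The implementation shows that qqplot can draw a Q-Q plot as well as a P-P plot, depending on whether the x and y passed in are quantiles or cumulative probabilities. When the two inputs differ in length, the longer one is linearly interpolated with approx down to the length of the shorter one.
qqplot is not restricted to the standard normal distribution: it can plot sample quantiles against the theoretical quantiles of any distribution, and it can also be used to check whether two series x and y follow the same (arbitrary) distribution.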
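Both uses can be sketched as follows. The data are simulated and the exponential distribution is only an example of a non-normal reference; plot.it = FALSE returns the coordinates without opening a graphics device.

```r
set.seed(42)
x <- rexp(30)    # shorter sample (illustrative)
y <- rexp(50)    # longer sample (illustrative)

# Two samples of different length: the longer one is interpolated
# down to the length of the shorter one before pairing.
res <- qqplot(x, y, plot.it = FALSE)
print(c(length(res$x), length(res$y)))    # 30 30

# The same interpolation done by hand with approx().
sy_manual <- approx(seq_along(y), sort(y), n = length(x))$y
print(all.equal(res$y, sy_manual))

# One sample against the theoretical quantiles of an arbitrary distribution,
# here Exp(1), using ppoints() to choose the probabilities.
theo <- qexp(ppoints(length(x)))
res2 <- qqplot(theo, x, plot.it = FALSE)
print(head(cbind(theoretical = res2$x, sample = res2$y)))
```

Dropping plot.it = FALSE draws the plots; points close to a straight line suggest the two distributions have the same shape.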

qqline

> qqline
function (y, datax = FALSE, distribution = qnorm, probs = c(0.25, 
    0.75), qtype = 7, ...) 
{
    stopifnot(length(probs) == 2, is.function(distribution))
    y <- quantile(y, probs, names = FALSE, type = qtype, na.rm = TRUE)
    x <- distribution(probs)
    if (datax) {
        slope <- diff(x)/diff(y)
        int <- x[1L] - slope * y[1L]
    }
    else {
        slope <- diff(y)/diff(x)
        int <- y[1L] - slope * x[1L]
    }
    abline(int, slope, ...)
}

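The implementation shows that, by default, qqline takes the two quartiles Q1 and Q3 of y, computes the corresponding quartiles x of the theoretical distribution, and draws a straight line through $(x_1, y_1), (x_2, y_2)$.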
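The slope and intercept that qqline passes to abline can be computed separately, which is handy when the line is needed outside base graphics. The sketch below repeats the datax = FALSE branch with the default arguments on simulated data; it does not call qqline itself, since that requires an open plot.

```r
set.seed(7)
y <- rnorm(40, mean = 5, sd = 3)   # simulated sample, illustrative only

probs <- c(0.25, 0.75)                                  # default probs
yq <- quantile(y, probs, names = FALSE, type = 7)       # sample Q1, Q3
xq <- qnorm(probs)                                      # theoretical Q1, Q3

# Line through (xq[1], yq[1]) and (xq[2], yq[2]).
slope <- diff(yq) / diff(xq)
intercept <- yq[1] - slope * xq[1]
print(c(intercept = intercept, slope = slope))

# For a normal reference the slope is a robust estimate of the standard
# deviation (IQR divided by the normal IQR), and the intercept is the
# midpoint of the two sample quartiles.
print(all.equal(slope, IQR(y) / (2 * qnorm(0.75))))
print(all.equal(intercept, mean(yq)))
```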