QQ图基本知识
Sample Quantiles 样本分位数
quantile(x, ...)
给定一个系列xxx,可以求出给定累积概率ppp对应的分位数。
计算分位数有9种方法1^11:
假设方法iii(1≤i≤91 \le i \le 91≤i≤9),对应概率p的计算公式是:
Q(p)=(1−γ) xj+γ xj+1,
Q(p) = (1 - \gamma)\ x_j + \gamma\ x_{j+1},
Q(p)=(1−γ) xj+γ xj+1,
j−mn≤p<(j+1)−mnj=floor(np+m)g=(np+m)−jγ=f(j,g)
\frac{j-m}{n} ≤ p < \frac{(j+1)-m}{n} \\
j=floor(np+m) \\
g = (np + m) - j \\
\gamma = f(j, g)
nj−m≤p<n(j+1)−mj=floor(np+m)g=(np+m)−jγ=f(j,g)
xjx_jxj是第jjj个顺序统计量;
nnn是xxx的长度(样本量);
mmm是个常数,不同的方法iii取不同的值;
jjj的值由公式:j=floor(np+m)j=floor(np+m)j=floor(np+m)确定;
还有个jjj的gap值ggg:g=(np+m)−jg = (np + m) - jg=(np+m)−j
γ\gammaγ值由jjj和ggg值共同确定(如下表):
type | mmm value | γ\gammaγ value | desc |
---|---|---|---|
1 | 000 | γ={0; if g=0;1; if others \gamma = \begin{cases} & 0; \text{ if g=0;} \\ &1; \text { if others } \end{cases}γ={0; if g=0;1; if others | Inverse of empirical distribution function |
2 | 000 | γ={0.5; if g=0;1; if others \gamma = \begin{cases} & 0.5; \text{ if g=0;} \\ &1; \text {\ \ if others } \end{cases}γ={0.5; if g=0;1; if others | Similar to type 1 but with averaging at discontinuities. (SAS default) |
3 | −1/2-1/2−1/2 | γ={0; if g=0 and j is even1; if others \gamma = \begin{cases} & 0; \text{ if g=0 and j is even} \\ &1; \text {\ \ if others } \end{cases}γ={0; if g=0 and j is even1; if others | Nearest even order statistic (SAS default till ca. 2010) |
4 | 000 | p[k]=knp[k] =\frac{ k }np[k]=nk | linear interpolation of the empirical cdf. |
5 | 1/21/21/2 | p[k]=k−0.5np[k] = \frac{k - 0.5} np[k]=nk−0.5 | That is a piecewise linear function where the knots are the values midway through the steps of the empirical cdf. This is popular amongst hydrologists. |
6 | ppp | p[k]=kn+1p[k] = \frac{k }{n + 1}p[k]=n+1k | Thus p[k]=E[F(x[k])]p[k] = E[F(x[k])]p[k]=E[F(x[k])]. This is used by Minitab and by SPSS. |
7 | 1−p1-p1−p | p[k]=k−1n−1p[k] =\frac {k - 1} {n - 1}p[k]=n−1k−1 | In this case, p[k]=mode[F(x[k])]p[k] = mode[F(x[k])]p[k]=mode[F(x[k])]. This is used by S. |
8 | p+13\frac{p+1}33p+1 | p[k]=k−1/3n+1/3p[k] = \frac{k - 1/3} {n + 1/3}p[k]=n+1/3k−1/3 | Then p[k] =~ median[F(x[k])]. The resulting quantile estimates are approximately median-unbiased regardless of the distribution of xxx. |
9 | p/4+3/8p/4 + 3/8p/4+3/8 | p[k]=k−3/8n+1/4p[k] = \frac{k - 3/8} {n + 1/4}p[k]=n+1/4k−3/8 | The resulting quantile estimates are approximately unbiased for the expected order statistics if xxx is normally distributed. |
Empirical Cumulative Distribution Function 经验累积分布函数
ecdf(x, ...)
Fn(t)=#{xi≤t}n=∑i=1nIndicator(xi≤t)n.
F_n(t) = \frac{\#\{x_i \leq t\}}{n} = \frac{\sum_{i=1}^{n} Indicator(x_i \leq t)}{n}.
Fn(t)=n#{xi≤t}=n∑i=1nIndicator(xi≤t).
其中,Indicator是指示函数,
Indicator(TRUE) = 1; Indicator(FALSE) = 0
;
我们可以看见,对于顺序统计量xix_ixi,每往后增加一个元素,累积概率增加1/n1/n1/n。
ppoints
再[0, 1]上,“均匀”地产生nnn个概率点
> ppoints
function (n, a = if (n <= 10) 3/8 else 1/2)
{
if (length(n) > 1L)
n <- length(n)
if (n > 0)
(1L:n - a)/(n + 1 - 2 * a)
else numeric()
}
从ppoints的在线帮助可以知道2^22:
- ppoints产生的概率点在[0, 1]上是对称的,pi+pn−i+1=1; i=1..np_i + p_{n-i + 1} = 1;\ i=1..npi+pn−i+1=1; i=1..n;
- 默认情况下,n≤10n \le 10n≤10时,a=3/8a= 3/8a=3/8, 此时,pi=i−3/8n+1/4p_i = \frac{i-3/8}{n+1/4}pi=n+1/4i−3/8; n>10n > 10n>10时,a=1/2a= 1/2a=1/2, 此时,pi=i−0.5np_i = \frac{i-0.5}{n}pi=ni−0.5; 此时,ppoints一般用于产生标准正态分布对应的累积概率。
ppoints用在qqnorm中产生标准正态分布对应的分位点。
x <- qnorm(ppoints(n))[order(order(y))]
- 不同a的值,对应quantile()函数的type。
QQ图
qqnorm
function (y, ylim, main = "Normal Q-Q Plot", xlab = "Theoretical Quantiles",
ylab = "Sample Quantiles", plot.it = TRUE, datax = FALSE, ...)
{
if (has.na <- any(ina <- is.na(y))) {
yN <- y
y <- y[!ina]
}
if (0 == (n <- length(y)))
stop("y is empty or has only NAs")
if (plot.it && missing(ylim))
ylim <- range(y)
x <- qnorm(ppoints(n))[order(order(y))]
if (has.na) {
y <- x
x <- yN
x[!ina] <- y
y <- yN
}
if (plot.it)
if (datax)
plot(y, x, main = main, xlab = ylab, ylab = xlab,
xlim = ylim, ...)
else plot(x, y, main = main, xlab = xlab, ylab = ylab,
ylim = ylim, ...)
invisible(if (datax) list(x = y, y = x) else list(x = x, y = y))
}
有函数的实现可以知道,y是待比较的样本分位点,x是通过ppoints产生的标准正态分布的理论分位点,调用plot画出散点图。
qqplot
> qqplot
function (x, y, plot.it = TRUE, xlab = deparse(substitute(x)),
ylab = deparse(substitute(y)), ...)
{
sx <- sort(x)
sy <- sort(y)
lenx <- length(sx)
leny <- length(sy)
if (leny < lenx)
sx <- approx(1L:lenx, sx, n = leny)$y
if (leny > lenx)
sy <- approx(1L:leny, sy, n = lenx)$y
if (plot.it)
plot(sx, sy, xlab = xlab, ylab = ylab, ...)
invisible(list(x = sx, y = sy))
}
从函数的实现可以发现,qqplot可以画出QQ图,也能画出PP图,取决于传入的x和y是分位点还是累积概率点
qqplot不局限于是不是满足标准正态分布,可以画出样本分位点和任意分布的理论分位点的QQ图。也可以比较两个系列y,x是否满足同一分布(任意分布)。
qqline
> qqline
function (y, datax = FALSE, distribution = qnorm, probs = c(0.25,
0.75), qtype = 7, ...)
{
stopifnot(length(probs) == 2, is.function(distribution))
y <- quantile(y, probs, names = FALSE, type = qtype, na.rm = TRUE)
x <- distribution(probs)
if (datax) {
slope <- diff(x)/diff(y)
int <- x[1L] - slope * y[1L]
}
else {
slope <- diff(y)/diff(x)
int <- y[1L] - slope * x[1L]
}
abline(int, slope, ...)
}
有函数的实现可以知道,qqline默认从y选取Q1和Q3两个分位点,x根据理论分布产生对应的四分位点,过(x1,y1),(x2,y2)(x_1, y_1), (x_2, y_2)(x1,y1),(x2,y2)画一条直线。
Ref:
- R语言:help(quantile)
- R语言:help(ppoints)