Interpreting the Summary Metrics of GLMs in R: Poisson Regression as an Example

Residual

In ordinary least squares, a residual is associated with the $i$-th observation. The term residual comes from the residual sum of squares (RSS), which is defined as

$$r_i = y_i - f(x_i), \qquad \mathrm{RSS} = \sum_{i=1}^n r_i^2$$

where the residual $r_i$ is defined as the difference between the observed value $y_i$ and the predicted value $f(x_i)$.

For GLMs, there are several ways of specifying residuals. We will cover four types: response residuals, working residuals, Pearson residuals, and deviance residuals.
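
The snippets below reuse three objects: model.pois (a fitted Poisson GLM), expected (the observed outcomes), and estimates (the fitted means). As a minimal setup sketch so the snippets are reproducible, assuming the built-in warpbreaks data (an assumption of this sketch, not the dataset used in the original post):

# Assumed setup: fit a Poisson GLM on the built-in warpbreaks data
model.pois <- glm(breaks ~ wool + tension, data = warpbreaks,
                  family = poisson)
expected  <- warpbreaks$breaks   # observed outcomes y_i
estimates <- fitted(model.pois)  # fitted means, i.e. f_hat(x_i)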

Response residuals

For type = "response", the conventional residual on the response level is computed, that is,
$$r_i = y_i - \hat f(x_i)$$

# type = "response"
res.response1 <- residuals(model.pois, type = "response")
res.response2 <- expected - estimates
all.equal(res.response1, res.response2)
## [1] TRUE

Working residuals

For type = "working", the residuals are normalized by the estimates f ^ ( x i ) \hat f(x_i) f^(xi):

$$r_i = \frac{y_i - \hat f(x_i)}{\hat f(x_i)}$$

# type = "working"
res.working1 <- residuals(model.pois, type="working")
res.working2 <- (expected - estimates) / estimates
all.equal(res.working1, res.working2)
## [1] TRUE

Pearson residuals

For type = "pearson", the Pearson residuals are computed. They are obtained by normalizing the residuals by the square root of the estimate:

$$r_i = \frac{y_i - \hat f(x_i)}{\sqrt{\hat f(x_i)}}$$

# type = "pearson"
res.pearson1 <- residuals(model.pois, type="pearson")
res.pearson2 <- (expected - estimates) / sqrt(estimates)
all.equal(res.pearson1, res.pearson2)
## [1] TRUE

Deviance residuals

Deviance residuals are defined in terms of the deviance, so we first introduce the deviance itself.

Deviance

The deviance of a model is given by

$$D(y, \hat\mu) = 2\left(\log p(y \mid \hat\theta_s) - \log p(y \mid \hat\theta_0)\right)$$

where

  • $y$ is the outcome.
  • $\hat\mu$ is the model's estimate.
  • $\hat\theta_s$ and $\hat\theta_0$ are the parameters of the fitted saturated and proposed models, respectively. A saturated model has as many parameters as training points, that is, $p = n$, and thus fits the data perfectly. The proposed model can be any other model.
  • $p(y \mid \theta)$ is the likelihood of the data given the model.

The deviance indicates the extent to which the likelihood of the saturated model exceeds the likelihood of the proposed model. If the proposed model has a good fit, the deviance will be small. If the proposed model has a bad fit, the deviance will be high. For example, for the Poisson model, the deviance is
$$D = 2 \sum_{i=1}^n \left[ y_i \log\left(\frac{y_i}{\hat\mu_i}\right) - (y_i - \hat\mu_i) \right]$$
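
As a quick numerical check of this formula (a sketch assuming the model.pois setup above):

# Check the Poisson deviance formula against R's deviance() function;
# the ifelse() guard handles y == 0 (0 * log 0 = 0)
y  <- expected
mu <- estimates
D  <- 2 * sum(y * log(ifelse(y == 0, 1, y / mu)) - (y - mu))
all.equal(D, deviance(model.pois))
## [1] TRUE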

For the detailed derivation, see the reference on deriving the Poisson regression deviance; the steps are as follows.

  1. First, suppose there is a single predictor $x$, that $Y$ follows a Poisson distribution, and that the mean of $Y$ is a function of $x$, i.e.
    $$\mathbb{E}(Y \mid x) = \lambda(x)$$

  2. Poisson regression uses $\log$ as the link function, i.e. $\log(\lambda(x)) = \beta_0 + \beta_1 x$.

  3. The likelihood function is then
    $$L(\beta_0, \beta_1; y) = \prod_{i=1}^n \frac{e^{-\lambda(x_i)} \lambda(x_i)^{y_i}}{y_i!} = \prod_{i=1}^n \frac{e^{-e^{\beta_0 + \beta_1 x_i}} \left(e^{\beta_0 + \beta_1 x_i}\right)^{y_i}}{y_i!}$$

The log-likelihood function is:
$$l(\beta_0, \beta_1; y) = \sum_{i=1}^n y_i(\beta_0 + \beta_1 x_i) - \sum_{i=1}^n e^{\beta_0 + \beta_1 x_i} - \sum_{i=1}^n \log(y_i!) \qquad (1)$$

Written in terms of the fitted means $\hat\mu_i$, this becomes
$$l(\hat\mu) = \sum_{i=1}^n y_i \log(\hat\mu_i) - \sum_{i=1}^n \hat\mu_i - \sum_{i=1}^n \log(y_i!) \qquad (1')$$

  4. Use $\mu_1, \mu_2, \cdots, \mu_n$ as the parameters of the saturated model (a saturated model gives every data point its own parameter).

The log-likelihood of the saturated model is then:
$$l(\mu) = \sum_{i=1}^n y_i \log(\mu_i) - \sum_{i=1}^n \mu_i - \sum_{i=1}^n \log(y_i!) \qquad (2)$$

To find the optimal estimate of $\mu_i$ in (2), take the partial derivative of (2) and set it to zero:
$$\frac{\partial}{\partial \mu_i} l(\mu) = \frac{y_i}{\mu_i} - 1 = 0$$
Solving gives the parameter estimates of the saturated model:
$$\hat\mu_i = y_i$$
Substituting this result into (2) yields the log-likelihood of the saturated model:
$$l(\hat\mu) = \sum_{i=1}^n y_i \log(y_i) - \sum_{i=1}^n y_i - \sum_{i=1}^n \log(y_i!) \qquad (3)$$

  5. By the definition of the deviance, $D(y,\hat\mu) = 2\left(\log p(y \mid \hat\theta_s) - \log p(y \mid \hat\theta_0)\right)$, computing $2 \cdot [(3) - (1')]$ gives
    $$D(y, \hat\mu) = 2 \left[ \sum_{i=1}^n y_i \log\frac{y_i}{\hat\mu_i} - \sum_{i=1}^n (y_i - \hat\mu_i) \right]$$
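
This identity can also be checked numerically in R (a sketch assuming the model.pois setup above; lfactorial(y) computes $\log(y!)$):

# Compare 2 * (saturated log-likelihood - proposed log-likelihood)
# with R's deviance(); the ifelse() guard handles y == 0
y <- expected
ll.sat  <- sum(y * log(ifelse(y == 0, 1, y)) - y - lfactorial(y))
ll.prop <- as.numeric(logLik(model.pois))
all.equal(2 * (ll.sat - ll.prop), deviance(model.pois))
## [1] TRUE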

Note: the deviance is sometimes written $D$, with $D^{*}$ denoting the scaled deviance, i.e. $D$ divided by a dispersion parameter $\phi$:

$$D^{*} = \frac{D}{\phi}$$

For example, the dispersion parameter of the Poisson distribution is 1, while that of the normal distribution is $\sigma^2$.

Accordingly, the deviance of the normal distribution works out to
$$D(y, \mu) = (y - \mu)^2$$
and the scaled deviance is
$$D^{*}(y, \mu) = \frac{(y - \mu)^2}{\sigma^2}$$
For details, see https://stats.stackexchange.com/questions/315788/deviance-in-simple-linear-regression.

Deviance residuals

In R, the deviance residuals represent the contributions of individual samples to the deviance D. More specifically, they are defined as the signed square roots of the unit deviances.

What do these residuals look like in practice? For example, for the Poisson distribution, the deviance residuals are defined as follows. (A small side question: is the quantity under the square root always non-negative? It is: the bracketed term, viewed as a function of $\hat\mu_i$, is convex and attains its minimum value of zero at $\hat\mu_i = y_i$, so it is always $\geq 0$.)

$$r_i = \operatorname{sign}(y_i - \hat\mu_i) \cdot \sqrt{2\left[ y_i \log\left(\frac{y_i}{\hat\mu_i}\right) - (y_i - \hat\mu_i) \right]}$$

# type = "deviance"
res.dev1 <- residuals(model.pois, type = "deviance")
res.dev2 <- residuals(model.pois)
poisson.dev <- function (y, mu) 
    # unit deviance
    2 * (y * log(ifelse(y == 0, 1, y/mu)) - (y - mu))
res.dev3 <- sqrt(poisson.dev(expected, estimates)) * 
        ifelse(expected > estimates, 1, -1)
all.equal(res.dev1, res.dev2, res.dev3)
## [1] TRUE

An example of deviance residuals

We can obtain the deviance residuals of our model using the residuals function:

summary(residuals(model.pois))

Suppose we get the following output:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.0451 -2.4138 -0.8169 -0.1840  1.4265  8.7946

Since the median deviance residual is close to zero, our model is not biased in one direction (i.e. the outcome is neither over- nor underestimated).

Tips:

Note that, for ordinary least-squares models, the deviance residual is identical to the conventional residual.

For the derivation, see:
https://bookdown.org/egarpor/PM-UC3M/glm-deviance.html#eq:dev
or
https://stats.stackexchange.com/questions/315788/deviance-in-simple-linear-regression
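
A quick sketch of this fact, fitting a Gaussian GLM (ordinary least squares) on the assumed warpbreaks data:

# For the Gaussian family, deviance residuals coincide with the
# conventional response residuals
model.gauss <- glm(breaks ~ wool + tension, data = warpbreaks,
                   family = gaussian)
all.equal(residuals(model.gauss, type = "deviance"),
          residuals(model.gauss, type = "response"))
## [1] TRUE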

Summary of residuals

  • Response residuals (conventional residuals): $r_i = y_i - \hat f(x_i)$; the conventional residual on the response level.
  • Working residuals: $r_i = \frac{y_i - \hat f(x_i)}{\hat f(x_i)}$; normalized by the estimates.
  • Pearson residuals: $r_i = \frac{y_i - \hat f(x_i)}{\sqrt{\hat f(x_i)}}$; normalized by the square root of the estimates.
  • Deviance residuals: $r_i = \operatorname{sign}(y_i - \hat\mu_i) \cdot \sqrt{2\left[ y_i \log\left(\frac{y_i}{\hat\mu_i}\right) - (y_i - \hat\mu_i) \right]}$; the contribution of each sample to the deviance $D$.

Null Deviance and Residual Deviance

Illustrating the null model, proposed model, and saturated model, using normal regression as an example

This part draws on the video: https://www.youtube.com/watch?v=9T0wlKdew6I&t=0s

We assume that the mean is the only unknown parameter (i.e. the variance is known).

As shown in the figure, the top left is the null model, where the data are described by a single parameter; the top right is the saturated model, where every data point has its own parameter, making the model fully saturated; at the bottom is the proposed model, i.e. the model we actually use and want to evaluate, which can be anything between the null model and the saturated model.

Since we have already introduced the deviance, understanding the null and residual deviance is no longer a challenge. Let us restate the definition of the deviance:

$$D(y, \hat\mu) = 2\left(\log p(y \mid \hat\theta_s) - \log p(y \mid \hat\theta_0)\right)$$

The null and residual deviance differ in $\hat\theta_0$:

  • Null deviance: $\hat\theta_0$ refers to the null model (i.e. an intercept-only model)
  • Residual deviance: $\hat\theta_0$ refers to the trained model

How can we interpret these two quantities?

  • Null deviance: A low null deviance implies that the data can be modeled well merely using the intercept. If the null deviance is low, you should consider using few features for modeling the data.
  • Residual deviance: A low residual deviance implies that the model you have trained is appropriate. Congratulations!
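
Both quantities can be read off the fitted object; a small sketch (assuming model.pois from above), including a rough chi-squared comparison of the two deviances:

# Null and residual deviance are stored on the fitted glm object
model.pois$null.deviance  # deviance of the intercept-only model
model.pois$deviance       # deviance of the proposed (trained) model
# A rough test of whether the predictors improve on the intercept-only model
pchisq(model.pois$null.deviance - model.pois$deviance,
       df = model.pois$df.null - model.pois$df.residual,
       lower.tail = FALSE)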

[Figure: illustration of the null, proposed, and saturated models]

A confusing pair: Deviance Residuals vs. Residual Deviance

  • Deviance Residuals: $r_i = \operatorname{sign}(y_i - \hat\mu_i) \cdot \sqrt{2\left[ y_i \log\left(\frac{y_i}{\hat\mu_i}\right) - (y_i - \hat\mu_i) \right]}$. A type of residual, focusing on the individual sample points, as distinct from the other three kinds of residuals discussed above.
  • Residual Deviance: $\mathrm{Residual\ Deviance} = \sum_i \left[(\mathrm{Deviance\ Residual})_i\right]^2 = D(y, \hat\mu) = 2\left(\log p(y \mid \hat\theta_s) - \log p(y \mid \hat\theta_0)\right)$. A type of deviance, focusing on the model as a whole: twice the difference in log-likelihood between the saturated model and the proposed model, as distinct from the null deviance (twice the difference in log-likelihood between the saturated model and the null model).
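
The identity in the second item is easy to verify in R (a sketch assuming model.pois from above):

# The residual deviance equals the sum of squared deviance residuals
all.equal(sum(residuals(model.pois, type = "deviance")^2),
          deviance(model.pois))
## [1] TRUE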

Other outputs of the summary function

Here, I deal with the other outputs of the GLM summary function: the dispersion parameter, the AIC, and the statement about Fisher scoring iterations.

Dispersion parameter

Dispersion (variability/scatter/spread) simply indicates whether a distribution is wide or narrow. The GLM function can use a dispersion parameter to model the variability.

However, for the Poisson and binomial families the dispersion parameter is fixed at 1. It is estimated only for methods based on quasi-likelihood estimation, such as family = "quasipoisson" or family = "quasibinomial". These methods are particularly suited for dealing with overdispersion.
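
A sketch of the difference, refitting the assumed warpbreaks model with a quasi-Poisson family:

# With family = poisson, summary() reports dispersion fixed at 1;
# with family = quasipoisson, it is estimated from the Pearson residuals
model.qpois <- glm(breaks ~ wool + tension, data = warpbreaks,
                   family = quasipoisson)
summary(model.pois)$dispersion   # 1
summary(model.qpois)$dispersion  # estimated; > 1 indicates overdispersion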

AIC

The Akaike information criterion (AIC) is an information-theoretic measure that describes the quality of a model. It is defined as

$$\mathrm{AIC} = 2p - 2\ln(\hat L)$$

where $p$ is the number of model parameters and $\hat L$ is the maximum of the likelihood function. A model with a low AIC is characterized by low complexity (small $p$) and a good fit (large $\hat L$).
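
As a sketch, the AIC reported by summary() can be recomputed from the log-likelihood (assuming model.pois from above):

# AIC = 2p - 2 * logLik; the "df" attribute of logLik() is the
# number of estimated parameters p
ll <- logLik(model.pois)
all.equal(2 * attr(ll, "df") - 2 * as.numeric(ll), AIC(model.pois))
## [1] TRUE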

References

  1. Primary reference: Interpreting Generalized Linear Models

  2. Secondary reference: Notes for Predictive Modeling, Section 5.5

  3. YouTube video: Deviance Residuals

  4. Book: Beyond Multiple Linear Regression: Applied Generalized Linear Models and Multilevel Models in R, Chapter 4

  5. Derivation of the Poisson regression deviance

  6. Wikipedia: Deviance

  7. https://stats.stackexchange.com/questions/5135/interpretation-of-rs-lm-output

  8. https://stats.stackexchange.com/questions/315788/deviance-in-simple-linear-regression

  9. https://stats.stackexchange.com/questions/108995/interpreting-residual-and-null-deviance-in-glm-r

  10. Data science in insurance: an R intro
