R for Data Science总结之——modelr（1）

最新推荐文章于 2021-11-12 16:56:35 发布

我要养只哈士奇

最新推荐文章于 2021-11-12 16:56:35 发布

阅读量2.4k

点赞数

CC 4.0 BY-SA版权

分类专栏： R Data Science R语言数据挖掘tidyverse框架文章标签： R Data Science

本文链接：https://blog.youkuaiyun.com/weixin_38423453/article/details/82924683

本文介绍了R语言中modelr库在数据科学中的应用，通过实例展示了如何构建、解构和评估线性模型。内容涵盖最简单的线性模型、模型解剖、公式和模型家族、公式变换，以及处理缺失值和不同模型类型，如GLM、GAM和决策树等。

R for Data Science总结之——modelr（1）

数学模型是用来提供一个数据集的低维总结性描述，通常而言，R语言内置的线性模型lm()函数已经可以用来描绘绝大多数数学模型，这一章简要介绍数学模型机理和其作用。

library(tidyverse)

library(modelr)
options(na.action = na.warn)

最简单的例子

ggplot(sim1, aes(x, y)) + 
  geom_point()

在这里插入图片描述
这个数据集看起来可以用最简单的线性模型来描述：

y = a_0 + a_1 * x

这里我们用geom_abline()以及生成的均匀分布数据集来描述这个过程：

models <- tibble(
  a1 = runif(250, -20, 40),
  a2 = runif(250, -5, 5)
)

ggplot(sim1, aes(x, y)) + 
  geom_abline(aes(intercept = a1, slope = a2), data = models, alpha = 1/4) +
  geom_point()

在这里插入图片描述
通过随机生成斜率和截距我们得到了这么多模型，但要通过一些方法来评定他们的好坏，例如最小二乘法等。首先给出一个模型下在每个x值对应的y值：

model1 <- function(a, data) {
  a[1] + data$x * a[2]
}
model1(c(7, 1.5), sim1)
#>  [1]  8.5  8.5  8.5 10.0 10.0 10.0 11.5 11.5 11.5 13.0 13.0 13.0 14.5 14.5
#> [15] 14.5 16.0 16.0 16.0 17.5 17.5 17.5 19.0 19.0 19.0 20.5 20.5 20.5 22.0
#> [29] 22.0 22.0

然后给出真实点与模型上点的残差，并使用最小二乘法：

measure_distance <- function(mod, data) {
  diff <- data$y - model1(mod, data)
  sqrt(mean(diff ^ 2))
}
measure_distance(c(7, 1.5), sim1)
#> [1] 2.67

这里我们要使用purrr的map系列函数计算所有模型的差别：

sim1_dist <- function(a1, a2) {
  measure_distance(c(a1, a2), sim1)
}

models <- models %>% 
  mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))
models
#> # A tibble: 250 x 3
#>       a1      a2  dist
#>    <dbl>   <dbl> <dbl>
#> 1 -15.2   0.0889  30.8
#> 2  30.1  -0.827   13.2
#> 3  16.0   2.27    13.2
#> 4 -10.6   1.38    18.7
#> 5 -19.6  -1.04    41.8
#> 6   7.98  4.59    19.3
#> # ... with 244 more rows

从中选择差别最小的十个把它们画出来：

ggplot(sim1, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(
    aes(intercept = a1, slope = a2, colour = -dist), 
    data = filter(models, rank(dist) <= 10)
  )

在这里插入图片描述
我们也可以把所有模型斜率和截距值画成散点图选出最好的几个：

ggplot(models, aes(a1, a2)) +
  geom_point(data = filter(models, rank(dist) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = -dist))

在这里插入图片描述
以上模型都是用随机数产生，但也可以用expand.grid()生成规整的网格数据：

grid <- expand.grid(
  a1 = seq(-5, 20, length = 25),
  a2 = seq(1, 3, length = 25)
  ) %>% 
  mutate(dist = purrr::map2_dbl(a1, a2, sim1_dist))

grid %>% 
  ggplot(aes(a1, a2)) +
  geom_point(data = filter(grid, rank(dist) <= 10), size = 4, colour = "red") +
  geom_point(aes(colour = -dist))

在这里插入图片描述
从中选出最好的十个模型与数据集做对比：

ggplot(sim1, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(
    aes(intercept = a1, slope = a2, colour = -dist), 
    data = filter(grid, rank(dist) <= 10)
  )

在这里插入图片描述
而R里给出了一个类似梯度下降法的优化方法optim()，可以从给出的最初值开始，找到最优解：

best <- optim(c(0, 0), measure_distance, data = sim1)
best$par
#> [1] 4.22 2.05

ggplot(sim1, aes(x, y)) + 
  geom_point(size = 2, colour = "grey30") + 
  geom_abline(intercept = best$par[1], slope = best$par[2])

这里我们再回到线性模型，他们的基本格式或称family为：

y = a_1 + a_2 * x_1 + a_3 * x_2 + … + a_n * x_(n - 1)

在这里我们直接用lm()函数对数据集进行建模：

sim1_mod <- lm(y ~ x, data = sim1)
coef(sim1_mod)
#> (Intercept)           x 
#>        4.22        2.05

其结果和optim()一模一样

模型解剖

这一节我们深入剖析模型预测值以及残差，首先使用modelr::data_grid()生成平整的数据集：

grid <- sim1 %>% 
  data_grid(x) 
grid
#> # A tibble: 10 x 1
#>       x
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5
#> 6     6
#> # ... with 4 more rows

再使用modelr::add_predictions()添加预测值：

grid <- grid %>% 
  add_predictions(sim1_mod) 
grid
#> # A tibble: 10 x 2
#>       x  pred
#>   <int> <dbl>
#> 1     1  6.27
#> 2     2  8.32
#> 3     3 10.4 
#> 4     4 12.4 
#> 5     5 14.5 
#> 6     6 16.5 
#> # ... with 4 more rows

可视化结果：

ggplot(sim1, aes(x)) +
  geom_point(aes(y = y)) +
  geom_line(aes(y = pred), data = grid, colour = "red", size = 1)

在这里插入图片描述
再用add_residuals()添加残差：

sim1 <- sim1 %>% 
  add_residuals(sim1_mod)
sim1
#> # A tibble: 30 x 3
#>       x     y  resid
#>   <int> <dbl>  <dbl>
#> 1     1  4.20 -2.07 
#> 2     1  7.51  1.24 
#> 3     1  2.13 -4.15 
#> 4     2  8.99  0.665
#> 5     2 10.2   1.92 
#> 6     2 11.3   2.97 
#> # ... with 24 more rows

理解残差的方法有很多，最简单的是画频数多边形图：

ggplot(sim1, aes(resid)) + 
  geom_freqpoly(binwidth = 0.5)

在这里插入图片描述
又或者直接用散点图进行描述：

ggplot(sim1, aes(x, resid)) + 
  geom_ref_line(h = 0) +
  geom_point()

在这里插入图片描述

公式和模型家族

最常见的公式可对应：

y ~ x
y = a_1 + a_2 * x

而model_matrix()可以展现一个模型的定义过程：


Type to search
R for Data Science
Welcome
1 Introduction
I Explore
2 Introduction
3 Data visualisation
4 Workflow: basics
5 Data transformation
6 Workflow: scripts
7 Exploratory Data Analysis
8 Workflow: projects
II Wrangle
9 Introduction
10 Tibbles
11 Data import
12 Tidy data
13 Relational data
14 Strings
15 Factors
16 Dates and times
III Program
17 Introduction
18 Pipes
19 Functions
20 Vectors
21 Iteration
IV Model
22 Introduction
23 Model basics
23.1 Introduction
23.2 A s