多元线性回归建模
文章大纲 (Article Outline)
- Dataset description 数据集描述
- Exploratory data analysis 探索性数据分析
- A simple linear regression model fitting 简单的线性回归模型拟合
- Model interpretation 模型解释
- MLR regression model fitting and interpretation MLR回归模型的拟合和解释
- Hypothesis testing 假设检验
- Stepwise regression 逐步回归
目标 (Aim)
The aim of this article to illustrate how to fit a multiple linear regression model in the R statistical programming language and interpret the coefficients. Here, we are going to use the Salary dataset for demonstration.
本文旨在说明如何在R统计编程语言中拟合多元线性回归模型并解释系数。 在这里,我们将使用Salary数据集进行演示。
数据集说明 (Dataset Description)
The 2008–09 nine-month academic salary for Assistant Professors, Associate Professors and Professors in a college in the U.S. The data were collected as part of the on-going effort of the college’s administration to monitor salary differences between male and female faculty members [1].
美国某大学的助理教授,副教授和教授在2008-09年度的9个月学术薪水数据收集是该大学行政部门不断努力的一部分,目的是监测男女教职员工之间的薪资差异[ 1]。
The data frame includes 397 observations and 6 variables.
数据框包含397个观测值和6个变量。
rank (I1): a factor with levels AssocProf, AsstProf, Prof
等级(I1) :等级为AssocProf,AsstProf,Professor的因子
discipline (I2): a factor with levels A (“theoretical” departments) or B (“applied” departments).
学科(I2) :A级(“理论”部门)或B级(“应用”部门)的因素。
yrs.since.phd (I3): years since PhD.
yrs.since.phd(I3 ):自博士学位以来的年份。
yrs.service (I4): years of service.
yrs.service(I4) :服务年限。
sex (I5): a factor with levels Female and Male
性别(I5) :女性和男性水平的因素
salary (D): nine-month salary, in dollars.
工资(D) :9个月的工资,以美元为单位。
Where I: Independent variable; D: Dependent/Outcome variable
我在哪里:自变量; D :因果/结果变量
加载库 (Load Libraries)
The first step is to start installing and loading R libraries
第一步是开始安装和加载R库
# use install.packages( ) function for installationlibrary(tidyverse) # data loading, manipulation and plotting
library(carData) # Salary dataset
library(broom) # tidy model output
Print dataset details
打印数据集详细信息
Let’s print different inbuilt datasets offered by the carData package.
让我们打印carData包提供的不同内置数据集。
data(package = "carData")

Glimpse of data
数据一览
Let's take a look at the inbuilt Salary dataset structure. For that one can utilize the str( ) function on glimpse( ) function from the dplyr library (which is included in the tidyverse library).
让我们看一下内置的Salary数据集结构。 为此,可以使用dplyr库(包含在tidyverse库中)的glimpse ()函数上的str()函数。
glimpse(Salaries)

The dataset includes 397 observations and 6 variables. Rank, discipline and sex are of categorical type while yrs.since.phd, yrs.service and salary are of integer type.
数据集包括397个观测值和6个变量。 等级,纪律和性别是分类类型,而yrs.since.phd,yrs.service和薪水是整数类型。
探索性分析 (Exploratory Analysis)
Before starting to model let’s perform some exploratory data analysis. Let’s see how the salary varies across different ranks. For that, we can plot a bee swarm + box plot combination. From the plot, we can observe that as the rank increases the salary also increases. The mean salary (blue dot) for Male is comparatively higher as compared to female.
在开始建模之前,让我们进行一些探索性数据分析。 让我们看看不同级别的薪水如何变化。 为此,我们可以绘制蜂群+箱形图的组合。 从图中可以看出,随着职级的增加,薪水也随之增加。 男性的平均工资(蓝点)高于女性。
ggplot(Salaries, aes(x = sex, y = salary, color = sex)) +
geom_boxplot()+
geom_point(size = 2, position = position_jitter(width = 0.2)) +
stat_summary(fun.y = mean, geom = "point", shape = 20, size = 6, color = "blue")+
theme_classic() +
facet_grid(.~rank)

Check gender/sex labels
检查性别/性别标签
Before we process for the detailed analysis let’s first fit a simple linear regression model where we predict the salary based on gender category. To check the current levels of sex we can use the levels( ) function and supply the column name. By default, the R treats the first level as a reference level (here female).
在进行详细分析之前,首先让我们拟合一个简单的线性回归模型,在该模型中,我们根据性别类别预测工资。 要检查当前的性别水平,我们可以使用level()函数并提供列名。 默认情况下,R将第一级视为参考级(此处为女性)。
levels(Salaries$sex)

Fitting a simple linear model
拟合简单的线性模型
Let’s fit a simple linear regression model with lm( ) function by supplying the formula and dataset.
通过提供公式和数据集,让我们使用lm()函数拟合一个简单的线性回归模型。
Formula = salary (~) is predicted by sex
公式=工资(〜)由性别预测
Then print the model summary using the summary( ) function.
然后使用summary()函数打印模型摘要。
lm1 <- lm(salary~sex, data = Salaries)
summary(lm1)

Interpretation of Coefficients
系数解释
You can observe that the average salary for a female is about 101002 dollars. The male person earns on an average of 14088 dollars more compared to a female person.
您可以观察到,女性的平均工资约为101002美元。 男性平均收入比女性多14088美元。
Releveling sex variable
重新调整性别变量
We can alter the levels to set male as the reference level. To do so just use the relevel( ) function and supply the column and reference level.
我们可以更改级别以将男性设置为参考级别。 为此,只需使用relevel()函数并提供列和参考级别。
Salaries$sex <- relevel(Salaries$sex, ref = "Male")levels(Salaries$sex)

Fitting model after releveling
重新调平后拟合模型
Now if we again fit the model we can now observe a negative sign for female coefficients. The interpretation will be the female person earns on an average of 14088 dollars less (115090$ — 14088$) compared to a male person.
现在,如果再次拟合模型,我们现在可以观察到女性系数为负号。 解释是,与男性相比,女性的平均收入要少14088美元(115090美元— 14088美元)。
lm2 <- lm(salary~sex, data = Salaries)
summary(lm2)

创建一个完整的模型 (Create a complete model)
Let’s fit a multiple linear regression model by supplying all independent variables. The ~ symbol indicates predicted by and dot (.) at the end indicates all independent variables except the dependent variable (salary).
通过提供所有自变量来拟合多元线性回归模型。 〜符号表示由预测,结尾处的点(。)表示除因变量(薪水)外的所有自变量。
lm_total <- lm(salary~., data = Salaries)
summary(lm_total)

系数解释 (Interpretation of Coefficients)
Here we can observe that a person gets an average salary of 65955.2 dollars. The associate professor is set to the reference level. You can interpret that as ranking increases i.e., from assistant to associate to the professor, the average salary also increases. let’s interpret a continuous variable to say “years of service”. As years of service increases by 1 year, the average salary drops by 489.5 dollars holding all other variables constant.
在这里,我们可以看到一个人的平均工资为65955.2美元。 副教授设置为参考水平。 您可以解释为,随着排名的增加,即从助理到副教授再到教授,平均薪水也有所提高。 让我们将连续变量解释为“服务年限”。 随着服务年限增加1年,在保持所有其他变量不变的情况下,平均工资下降了489.5美元。
Similarly, here the diciplineA is the reference category. The discipline B (applied departments) is significantly associated with an average increase of 14417.6 dollars in salary compared to discipline A (theoretical departments) holding other variables at constant.
同样,这里的纪律A是参考类别。 与保持其他变量不变的学科A(理论部门)相比,学科B(应用部门)的薪水显着增加1441.76美元。
逐步回归 (Stepwise regression)
To escape the problem of multicollinearity (correlation among independent variables) and to filter out essential variables/features from a large set of variables, a stepwise regression usually performed. The R language offers forward, backwards and both type of stepwise regression. One can fit a backward stepwise regression using the step( ) function by supplying the initial model object and direction argument. The process starts with initially fitting all the variables and after that, with each iteration, it starts eliminating variables one by one if the variable does not improve the model fit. The AIC metric is used for checking model fit improvement.
为了避免多重共线性问题(独立变量之间的相关性)并从大量变量中滤除基本变量/特征,通常执行逐步回归。 R语言提供前进,后退和两种类型的逐步回归。 通过提供初始模型对象和方向参数,可以使用step()函数拟合向后逐步回归。 该过程从最初拟合所有变量开始,然后在每次迭代中,如果变量不能改善模型拟合,则开始逐个消除变量。 AIC指标用于检查模型拟合的改进。
step(lm_total, direction = "backward")

Here, you can see that it eliminated the sex variable from the model but it hardly caused any improvement in the AIC value.
在这里,您可以看到它从模型中消除了性别变量,但几乎没有改善AIC值。
拟合改进的模型 (Fitting the improved model)
Now, let's refit the model with the best model variables suggested by the stepwise process.
现在,让我们使用逐步过程建议的最佳模型变量来重新拟合模型。
lm_step_backward <- lm(formula = salary ~ rank + discipline + yrs.since.phd + yrs.service,
data = Salaries)
summary(lm_step_backward)

测试假设 (Testing a Hypothesis)
In the next part, we will ask one question and will try to find out the answers by building a hypothesis.
在下一部分中,我们将提出一个问题,并通过建立假设来尝试找出答案。
Let’s see the trend of nine months salary over the service period. The best way to do that is by plotting a scatter plot + fitting a trend line.
让我们来看看服务期内九个月薪水的趋势。 最好的方法是绘制散点图+拟合趋势线 。
ggplot(Salaries_mod, aes(x = yrs.service, y = salary)) +
geom_point()+
stat_smooth()+
theme_classic()

Here, we can observe that up to 20 years of service the salary variable has an increasing trend. After that, the salary shows a decreasing trend.
在这里,我们可以观察到服务长达20年的工资变量呈上升趋势。 之后,薪水呈现下降趋势。
将其转换为分类 (Converting it to categorical)
So we can test one hypothesis that how much on average salary increases or decreases for those having service years of 20–40 years and 40–60 years when compared with 0–20 years (reference). For that purpose, we need to create three categories 0–20, 20–40, 40–60 years where we bin the continuous variable i.e., yrs.service values.
因此,我们可以检验一种假设,即服务年限为20–40年和40–60年的人与0–20年相比,平均工资会增加或减少多少(参考)。 为此,我们需要创建三个类别0–20、20–40、40–60年,在此我们对连续变量即yrs.service值进行分类。
Range estimation
范围估算
First, we need to identify the service years variable’s data range. To identify the range we can use the range( ) function. You can observe the range is between 0–60 years.
首先,我们需要确定服务年数变量的数据范围。 为了确定范围,我们可以使用range()函数。 您可以观察到范围在0到60年之间。
range(Salaries$yrs.service)

将yrs.service变量分为几类 (Binning yrs.service variable into categories)
Lets put the yrs.service variable into three category bins i.e., 0–20, 20–40, 40–60. To do that we can use the mutate( ) function for creating a new column service_time_cat using the case_when( ) function.
让我们将yrs.service变量放入三个类别分类中,即0–20、20–40、40–60 。 为此,我们可以使用case_when()函数使用mutate()函数创建新列service_time_cat 。
Salaries_mod <- Salaries %>%
mutate(service_time_cat = case_when(
between(yrs.service, 0, 20)~"upto20",
between(yrs.service, 20, 40)~"20_40",
between(yrs.service, 40, 60)~"40_60"))
Converting it to a factor variable
将其转换为因子变量
The next step is to convert the new variable into a categorical one. Now, if we check the levels we can observe that the levels are not in proper order.
下一步是将新变量转换为分类变量。 现在,如果我们检查级别,我们可以观察到级别不正确。
Salaries_mod$service_time_cat <- as.factor(Salaries_mod$service_time_cat)levels(Salaries_mod$service_time_cat)

Changing the order of the levels
更改关卡的顺序
You can set the level order during categorical conversion or just by using the relevel( ) function. Now the level is in proper order and the reference category is set to “upto20”.
您可以在分类转换期间设置级别顺序,也可以仅使用relevel()函数设置级别顺序。 现在,级别处于适当的顺序,并且参考类别设置为“ upto20 ”。
Salaries_mod$service_time_cat <- relevel(Salaries_mod$service_time_cat, ref = "upto20")levels(Salaries_mod$service_time_cat)

Generating a count table
生成计数表
Let’s see how many unique values each category holds. We can obtain the level count using the table( ) function.
让我们看看每个类别拥有多少个唯一值。 我们可以使用table()函数获得级别计数。
table(Salaries_mod$service_time_cat)

Model Fitting
模型拟合
Let’s refit the model with the newly created categorical variable (service_time_cat).
让我们用新创建的分类变量( service_time_cat )调整模型。
lm4 <- lm(formula = salary ~ rank + discipline + yrs.since.phd + sex + service_time_cat, data = Salaries_mod)summary(lm4)

Model interpretation
模型解释
The model coefficient table showed that as the service time increases the salary decreases (negative coefficients) when compared to the 0–20 years of service. Compared to 0–20 years of service years category, a person is in 20–40 years of service gets on average 8905.1$ less salary, similarly, a person is in 40–60 years service earns 16710.4$ less salary.
模型系数表显示,与0-20年的服务时间相比,随着服务时间的增加,薪水减少(负系数)。 与0–20年服务年限类别相比,在20–40年服务年限的人平均薪水少8905.1 $,类似地,在40–60年服务年限的人薪水少16710.4 $。
Performing a stepwise regression
执行逐步回归
Next, we can again try to improve the model by performing a backward stepwise regression.
接下来,我们可以再次尝试通过执行向后逐步回归来改进模型。
step(lm4, direction = "backward")

Fitting final model
拟合最终模型
Let's fit the final model with stepwise regression suggestion.
让我们用逐步回归建议拟合最终模型。
lm5 <- lm(formula = salary ~ rank + discipline + yrs.since.phd + service_time_cat,
data = Salaries_mod)summary(lm5)

Making result tidy
使结果整洁
Sometimes we need the model results in a tidy format so that we can perform certain manipulation on the estimate table. You can convert a model result table into a tidy format using the tidy( ) function from the broom package.
有时我们需要整洁的格式的模型结果,以便我们可以对估计表执行某些操作。 您可以使用broom软件包中的tidy()函数将模型结果表转换为整齐的格式。
tidy(lm5)

ANOVA analysis
方差分析
In scenarios where your categorical variables have more than two levels, the interpretation gets complicated. So one can better understand the relationship between independent and dependent variables by performing an anova analysis by supplying the trained model object into the anova( ) function.
在类别变量具有两个以上级别的情况下,解释会变得很复杂。 因此,通过将受过训练的模型对象提供给anova()函数来执行方差分析,可以更好地理解自变量和因变量之间的关系。
The anova analysis result revealed that rank, discipline and service_time_cat variables are significantly associated with the variation in salary (p-values<0.10)
方差分析结果表明, rank , 学科和service_time_cat变量与薪金差异显着相关(p值<0.10)
anova(lm5)

The Multiple Linear regression is still a vastly popular ML algorithm (for regression task) in the STEM research domain. It is still very easy to train and interpret, compared to many sophisticated and complex black-box models.
在STEM研究领域,多元线性回归仍然是一种非常流行的ML算法(用于回归任务)。 与许多复杂复杂的黑匣子模型相比,它仍然非常易于训练和解释。
I hope you learned something new. See you next time!
我希望你学到了一些新东西。 下次见!
If you learned something new and liked this article, follow me on onezero.blog (my personal blogging website), Twitter, LinkedIn, YouTube and Github.
如果您学到了新知识并喜欢本文,请在 onezero.blog ( 我的个人博客网站 ) , Twitter , LinkedIn , YouTube 上 关注 我 和 Github 。
[1] Fox J. and Weisberg, S. (2019) An R Companion to Applied Regression, Third Edition, Sage.
[1] Fox J. and Weisberg,S.(2019)《应用回归的R伴侣》,第三版,Sage。
Originally published at https://onezero.blog on August 10, 2020.
最初于 2020年8月10日 在 https://onezero.blog 上 发布 。
More Interesting Readings — I hope you’ve found this article useful! Below are some interesting readings hope you like them too —
更多有趣的读物 - 希望您对本文有所帮助! 以下是一些有趣的读物,希望您也喜欢 —
多元线性回归建模