Data Analysis Example 2 (English Report): Finding the Most Cost-Effective Gasoline Type under Different Driving Conditions, Linear Regression in R, Kaggle Data

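This article uses multiple linear regression in R to study the factors that affect the consumption of two gasolines, E10 and SP98, under different driving conditions. The study finds that E10 is more economical at low speed, while SP98 is more economical at high speed and in rain. After data transformation and model diagnostics, the final model shows how speed, temperature, and rain affect the consumption of the two fuels.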

Find Economical Gas by Using Multiple Linear Regression

AIWEN XING, YU ZHANG
 

1 Abstract

The authors applied multiple linear regression to gas consumption data to find the factors that significantly affect the consumption of two kinds of gas, E10 and SP98. After transforming the predictor variables and the response, the authors obtained a more reasonable model and ran model diagnostics. The final predictors for E10 are speed and temperature; those for SP98 are speed, temperature, and whether it is raining. The result is that E10 is more economical at low speed, while SP98 is more economical at high speed, and rain affects the consumption of both types.

2 Introduction

Ethanol is commonly added to gasoline for environmental protection. Super Plus 95 with 10% added ethanol is called SP95/E10 in the Eurozone and is abbreviated as E10 in the rest of this report. Both kinds of gas, E10 and Super Plus 98 (SP98), are available at gas stations. E10 is cheaper than SP98 per unit. However, E10 is said to reduce the fuel economy of vehicles, so under some conditions E10 may lead to higher fuel costs than SP98.
 
Based on this claim, this project tried to discover which factors affect gas consumption and how they influence it. The project also discusses which gas is more economical.

The authors are interested in one data set, car fuel consumption, published on Kaggle.com. The data set documents the gas consumption of two types of gas (E10 and SP98) for one car driven by one driver under different conditions.

3 Analysis Process

3.1 Preliminary Analysis

The original dataset contains 258 observations, one response, and six variables. The variables and their meanings are listed in Table 1.

Table 1 Variables and description in gas consumption dataset

The authors split the raw data into two subsets, named E10 and SP98, so that gas consumption can be analyzed separately for each gas type. E10 has 120 observations and SP98 has 138 observations.
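The split by gas type can be sketched as follows. The report's analysis was done in R; this illustration uses plain Python, and the field names `gas_type` and `consume` as well as the sample rows are assumptions made for the example, not values taken from the data set.

```python
from collections import defaultdict


def split_by(records, key):
    """Group a list of dict records by the value stored under `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    return dict(groups)


# Hypothetical rows; the real data set has 258 observations.
trips = [
    {"gas_type": "E10", "consume": 5.0},
    {"gas_type": "SP98", "consume": 4.2},
    {"gas_type": "E10", "consume": 5.5},
    {"gas_type": "SP98", "consume": 3.9},
    {"gas_type": "SP98", "consume": 4.7},
]

subsets = split_by(trips, "gas_type")
for gas, rows in subsets.items():
    print(gas, len(rows))
```

Each subset can then be explored and modeled on its own, which is what the rest of the analysis does.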

The scatterplot matrices of the response (consume) in Figure 1 and Figure 2 reveal that some variables may be correlated and that a data transformation might give better-behaved distributions.
Figure 1 Scatterplot matrix for gas E10 consumption

Figure 2 Scatterplot matrix for gas SP98 consumption
 
3.1.1 Replacement of Missing Values
There are nine missing values of inside temperature in the raw SP98 data. The missing values are replaced with the median inside temperature (21.5).
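A minimal sketch of median imputation, written in Python for illustration. The readings below are hypothetical and were chosen only so that their median matches the value of 21.5 reported above; `impute_median` is a helper name introduced for this example.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values], fill


# Hypothetical inside-temperature readings; None marks a missing value.
temp_inside = [21.5, 22.0, None, 21.0, 21.5, None, 22.5, 21.5]

filled, fill_value = impute_median(temp_inside)
print(fill_value)
print(filled)
```

The median is a reasonable choice here because the variable takes only a few distinct levels, so the imputed value is always one that plausibly occurs in the data.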

3.1.2 Analysis of Correlations of Predictors
Table 2 shows the correlation coefficients of the predictors. The correlation between speed and distance is 0.53, the largest value in the table, so the predictor SPEED is clearly correlated with the predictor DISTANCE. Speed is the displayed average speed of each one-way route, computed from the distance covered in a given period. Although some distance values in the data set are identical, the corresponding average speeds are not exactly the same. For example, driving downtown or near residential communities involves more stop signs and traffic lights, which increases travel time. Hence the average speed varies for the same distance.

The correlation between inside temperature (Temp_inside) and outside temperature (Temp_outside) is -0.18. However, this does not mean that Temp_inside is meaningfully related to Temp_outside, and correlation does not imply causation: the outside temperature of a vehicle does not determine the final inside temperature. Temp_inside takes only four levels (21, 21.5, 22, and 22.5), and its standard deviation (0.471605) is small. According to the authors, this inflates the computed correlation, so the value does not indicate a real relationship between Temp_inside and Temp_outside, and the two can be treated as independent variables. The remaining correlations are too small to be considered.

Table 2 Correlation coefficients of variables in dataset

              Distance   Speed   Temp_inside   Temp_outside
Distance       1          0.53    0.10          -0.005
Speed          0.53       1       0.11           0.03
Temp_inside    0.10       0.11    1             -0.18
Temp_outside  -0.005      0.03   -0.18           1
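
A correlation matrix like Table 2 can be computed with the Pearson formula. The sketch below uses plain Python rather than the R used for the report, and the six trips are hypothetical values, so the printed numbers will not match Table 2.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical trips (not taken from the Kaggle data set).
predictors = {
    "distance": [28.0, 12.0, 11.2, 12.9, 18.5, 8.3],
    "speed": [26, 30, 38, 36, 46, 50],
    "temp_inside": [21.5, 21.5, 22.0, 21.0, 22.5, 21.5],
    "temp_outside": [12, 13, 15, 14, 15, 10],
}

matrix = correlation_matrix(predictors)
for name, row in matrix.items():
    print(name.ljust(13), " ".join(f"{value:6.2f}" for value in row.values()))
```

The matrix is symmetric with ones on the diagonal, which is why Table 2 repeats each off-diagonal value twice.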

The authors applied multiple linear regression to find an appropriate model. Details of the regression are described in later paragraphs.
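
For reference, the general form of a multiple linear regression model is given below in standard textbook notation, which is an assumption of this note rather than the report's own notation. Here $y_i$ would be the consumption on trip $i$, $x_{ij}$ the value of predictor $j$ on that trip, and $\varepsilon_i$ the error term; the coefficients are estimated by least squares.

```latex
y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i,
\qquad i = 1, \dots, n

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n}
\left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2
```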

3.1.3 Data Transformation
Distance and Speed are continuous variables. Figure 3 shows that their distributions are neither symmetric nor normal. The authors centered the data and applied a logarithmic transformation. The distributions after transformation are shown in Figure 5.

Figure 3 Density plot of distance and speed in E10 and SP98 (panels 3a to 3d)

3.1.3 (a) Data Centering
To center the data, the authors subtracted from each variable a constant close to its mode. The subtracted values and the modes are listed in Table 3.

Table 3 Mode and subtracted value in E10 and SP98

Data Set $ Variable   Mode   Subtracted Value
E10$Distance          12.3   12.0
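
The two transformations can be sketched in Python as follows. The constant 12.0 is the subtracted value for E10 distance from Table 3; the distance values are hypothetical, and because this section does not show how centering and the logarithm are combined, the sketch applies each step to the raw values separately as an assumption.

```python
from math import log


def center(values, constant):
    """Subtract a fixed constant (chosen close to the mode) from every value."""
    return [v - constant for v in values]


def log_transform(values):
    """Natural logarithm of each value; defined only for positive inputs."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform requires strictly positive values")
    return [log(v) for v in values]


# Hypothetical E10 distances.
distance = [28.0, 12.0, 11.2, 12.9, 18.5, 8.3]

centered = center(distance, 12.0)  # 12.0 is the Table 3 value for E10$Distance
logged = log_transform(distance)

print(centered)
print([round(v, 3) for v in logged])
```

Centered values can be zero or negative, so a logarithm cannot be applied to them directly; this is why `log_transform` rejects non-positive inputs.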