Monthly Housing Rent Prediction

Latest recommended article published on 2023-03-15 16:16:35

BernadetteDi

Categories: machine learning, python. Tags: python, machine learning

Article link: https://blog.youkuaiyun.com/weixin_45004761/article/details/114866392

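This project builds a regression model on historical data to predict monthly housing rent, with the aim of reducing information asymmetry in the rental market. The data contains 18 features and has missing values. The workflow covers EDA, outlier cleaning, missing-value imputation, feature engineering, and model tuning, using methods such as XGBoost; the evaluation metric is root mean squared error.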

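Project Overview

Project Goal

Housing rent today is determined by many factors together: renovation condition, location, floor plan, transport convenience, market supply and demand, and so on. Renting is a fairly traditional industry, and severe information asymmetry has always been present. On one side, landlords do not know the true market price and end up leaving overpriced homes vacant; on the other, tenants cannot find good-value homes that meet their needs. The result is a large waste of rental resources.
This project targets that pain point and provides real rental-market data that has been desensitized (anonymized). Participants use the historical data labeled with monthly rent to build a model that predicts monthly rent from basic housing information, giving the city's rental market an objective benchmark.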

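Data Description

[Image: description of the data fields]

The last column, '月租金' (monthly rent), is the value to predict. This is a regression problem, so a regression model is needed.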

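Evaluation Metric

[Image: evaluation metric]

Model quality is measured by the root mean squared error (RMSE) between the predicted and the actual monthly rent. The smaller the RMSE, the better the regression model.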
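For reference, the standard definition of RMSE over n samples is the square root of the mean of the squared differences between predictions and true values. A minimal sketch of that computation with NumPy follows; the function name `rmse` and the sample numbers are illustrative assumptions and do not come from the competition data.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

# Made-up values, for illustration only
y_true = [5.0, 7.0, 9.0]
y_pred = [6.0, 7.0, 7.0]
score = rmse(y_true, y_pred)
print(score)
```

The same number can be obtained by taking the square root of `sklearn.metrics.mean_squared_error`, which the notebook imports.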

Loading the Dataset

Import packages and read the data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Jupyter magic: render plots inline (only valid inside a notebook)
%matplotlib inline
import seaborn as sns

# Use the SimHei font so Chinese labels render, and keep the minus sign displayable
plt.rcParams['font.sans-serif']=['SimHei']
plt.rcParams['axes.unicode_minus']=False

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

# Silence library warnings in the notebook output
import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv('data/train.csv')

Basic Information

columns

col_names = data.columns.tolist()  # list all column names

print("Column names:")
print(col_names)

Column names:
['时间', '小区名', '小区房屋出租数量', '楼层', '总楼层', '房屋面积', '房屋朝向', '居住状态', '卧室数量', '厅的数量', '卫的数量', '出租方式', '区', '位置', '地铁线路', '地铁站点', '距离', '装修情况', '月租金']

head()

data.head()

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	房屋朝向	居住状态	卧室数量	厅的数量	卫的数量	出租方式	区	位置	地铁线路	地铁站点	距离	装修情况	月租金
0	1	3072	0.128906	2	0.236364	0.008628	东南	NaN	1	1	1	NaN	11.0	118.0	2.0	40.0	0.764167	NaN	5.602716
1	1	3152	0.132812	1	0.381818	0.017046	东	NaN	1	0	0	NaN	10.0	100.0	4.0	58.0	0.709167	NaN	16.977929
2	1	5575	0.042969	0	0.290909	0.010593	东南	NaN	2	1	2	NaN	12.0	130.0	5.0	37.0	0.572500	NaN	8.998302
3	1	3103	0.085938	2	0.581818	0.019199	南	NaN	3	2	2	NaN	7.0	90.0	2.0	63.0	0.658333	NaN	5.602716
4	1	5182	0.214844	0	0.545455	0.010427	东北	NaN	2	1	1	NaN	3.0	31.0	NaN	NaN	NaN	NaN	7.300509

info()

data.info()  # check for missing values and data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196539 entries, 0 to 196538
Data columns (total 19 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   时间        196539 non-null  int64  
 1   小区名       196539 non-null  int64  
 2   小区房屋出租数量  195538 non-null  float64
 3   楼层        196539 non-null  int64  
 4   总楼层       196539 non-null  float64
 5   房屋面积      196539 non-null  float64
 6   房屋朝向      196539 non-null  object 
 7   居住状态      20138 non-null   float64
 8   卧室数量      196539 non-null  int64  
 9   厅的数量      196539 non-null  int64  
 10  卫的数量      196539 non-null  int64  
 11  出租方式      24230 non-null   float64
 12  区         196508 non-null  float64
 13  位置        196508 non-null  float64
 14  地铁线路      91778 non-null   float64
 15  地铁站点      91778 non-null   float64
 16  距离        91778 non-null   float64
 17  装修情况      18492 non-null   float64
 18  月租金       196539 non-null  float64
dtypes: float64(12), int64(6), object(1)
memory usage: 28.5+ MB

shape

data.shape

(196539, 19)

describe()

data.describe()
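# describe() returns summary statistics for each numeric column:
# count, mean, std, min, the 25%/50%/75% quantiles, and max.
# From count you can often derive the number and proportion of NA values.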

	时间	小区名	小区房屋出租数量	楼层	总楼层	房屋面积	居住状态	卧室数量	厅的数量	卫的数量	出租方式	区	位置	地铁线路	地铁站点	距离	装修情况	月租金
count	196539.000000	196539.000000	195538.000000	196539.000000	196539.000000	196539.000000	20138.000000	196539.000000	196539.000000	196539.000000	24230.000000	196508.000000	196508.000000	91778.000000	91778.000000	91778.000000	18492.000000	196539.000000
mean	2.115229	3224.116562	0.124151	0.955449	0.408711	0.013139	2.725196	2.236635	1.299625	1.223818	0.900289	7.905139	67.945982	3.284850	57.493735	0.551202	3.589228	7.949313
std	0.786980	2023.073726	0.133299	0.851511	0.183100	0.008104	0.667763	0.896961	0.613169	0.487234	0.299621	4.025696	43.522394	1.477147	35.191414	0.247268	1.996912	6.310609
min	1.000000	0.000000	0.007812	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	1.000000	0.001667	1.000000	0.000000
25%	1.000000	1388.000000	0.039062	0.000000	0.290909	0.009268	3.000000	2.000000	1.000000	1.000000	1.000000	4.000000	33.000000	2.000000	23.000000	0.356667	2.000000	4.923599
50%	2.000000	3086.000000	0.082031	1.000000	0.418182	0.012910	3.000000	2.000000	1.000000	1.000000	1.000000	9.000000	61.000000	4.000000	59.000000	0.554167	2.000000	6.621392
75%	3.000000	5199.000000	0.160156	2.000000	0.563636	0.014896	3.000000	3.000000	2.000000	1.000000	1.000000	11.000000	103.000000	5.000000	87.000000	0.745833	6.000000	8.998302
max	3.000000	6627.000000	1.000000	2.000000	1.000000	1.000000	3.000000	11.000000	8.000000	8.000000	1.000000	14.000000	152.000000	5.000000	119.000000	1.000000	6.000000	100.000000

1- data.describe().loc['count',:]/data.shape[0]  # proportion of missing values per numeric column
# What can we observe from this?

时间          0.000000
小区名         0.000000
小区房屋出租数量    0.005093
楼层          0.000000
总楼层         0.000000
房屋面积        0.000000
居住状态        0.897537
卧室数量        0.000000
厅的数量        0.000000
卫的数量        0.000000
出租方式        0.876717
区           0.000158
位置          0.000158
地铁线路        0.533029
地铁站点        0.533029
距离          0.533029
装修情况        0.905912
月租金         0.000000
Name: count, dtype: float64
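Note that `describe()` only covers numeric columns by default, so the object column 房屋朝向 does not appear in the ratios above. A sketch of an alternative using `isnull().mean()`, which covers every column regardless of dtype; the tiny DataFrame here is invented for illustration and is not the competition data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are invented)
toy = pd.DataFrame({
    '房屋朝向': ['东南', '东', None, '南'],
    '地铁线路': [2.0, np.nan, np.nan, 5.0],
    '月租金': [5.6, 17.0, 9.0, 5.6],
})

# Fraction of missing entries per column, including non-numeric columns
missing_ratio = toy.isnull().mean().sort_values(ascending=False)
print(missing_ratio)
```

On the real data the equivalent call would be `data.isnull().mean()`.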

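There are 196,539 records in total. Each has 18 features plus one target (月租金, monthly rent). The column types are float, int, and object, and some fields have missing values.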

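Exploratory Data Analysis (EDA)

How to Approach EDA

1) How to explore?

Plot the distribution of each feature on its own;
Plot the relationship between each feature and the target;

2) What to look for?

Outliers and missing values
Whether the distributions look reasonable: for example, in regression problems it generally helps if the target is approximately normally distributed (a strongly skewed target is often transformed first);
The relationship between the features and the target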
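One way to check the distribution point above is to measure the skewness of the target and see whether a log transform brings it closer to symmetric. The sketch below uses synthetic right-skewed data, not the competition data, and `np.log1p` is just one common choice of transform; nothing here is confirmed to be the author's method.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, right-skewed stand-in for a rent column (log-normal draws)
rent = pd.Series(rng.lognormal(mean=2.0, sigma=0.6, size=5000))

skew_raw = rent.skew()            # clearly positive for right-skewed data
skew_log = np.log1p(rent).skew()  # closer to zero after the transform

print(f"skew raw: {skew_raw:.2f}, skew after log1p: {skew_log:.2f}")
```

If a transform like this is applied before training, predictions have to be mapped back (for example with `np.expm1`) before computing the RMSE on the original scale.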

Exploration

Missing Data

missing_val = 1- data.describe().loc['count',:]/data.shape[0]

missing_val

时间          0.000000
小区名         0.000000
小区房屋出租数量    0.005093
楼层          0.000000
总楼层         0.000000
房屋面积        0.000000
居住状态        0.897537
卧室数量        0.000000
厅的数量        0.000000
卫的数量        0.000000
出租方式        0.876717
区           0.000158
位置          0.000158
地铁线路        0.533029
地铁站点        0.533029
距离          0.533029
装修情况        0.905912
月租金         0.000000
Name: count, dtype: float64

plt.figure(figsize=(8,4))
plt.tick_params(labelsize=12)
missing_val.plot(kind='bar', color='green')  # one bar per column, height = missing ratio
plt.show()

[Figure: bar chart of the missing-value ratio for each column]

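1) Fields with missing values: # missing_val[missing_val != 0.000000].index
- ['小区房屋出租数量', '居住状态', '出租方式',
  '区', '位置', '地铁线路', '地铁站点', '距离', '装修情况']
2) By share of missing values:
['小区房屋出租数量', '区', '位置'] have very few missing values (under 1%);
['地铁线路', '地铁站点', '距离'] have a moderate amount, roughly half (about 53%);
['居住状态', '出租方式', '装修情况'] are mostly missing, with more than 80% of values absent.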
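The three tiers above can also be reproduced programmatically by binning the missing ratios. The ratios below are copied from the output shown earlier; the bin edges (5% and 80%) and the tier names are assumptions chosen to match the grouping in the text.

```python
import pandas as pd

# Missing ratios copied from the output above (columns with missing values only)
missing_val = pd.Series({
    '小区房屋出租数量': 0.005093, '区': 0.000158, '位置': 0.000158,
    '地铁线路': 0.533029, '地铁站点': 0.533029, '距离': 0.533029,
    '居住状态': 0.897537, '出租方式': 0.876717, '装修情况': 0.905912,
})

# Assumed thresholds: up to 5% is low, 5% to 80% is medium, above 80% is high
tiers = pd.cut(missing_val, bins=[0, 0.05, 0.8, 1.0],
               labels=['low', 'medium', 'high'])

for tier in ['low', 'medium', 'high']:
    print(tier, list(tiers[tiers == tier].index))
```

Grouping columns this way makes it easier to pick a different treatment per tier later, such as simple imputation for the low tier.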

Time Attribute (时间)

label = '月租金'
column='时间'
print(len(data[column].unique()))   # number of distinct values
print(data[column].value_counts())  # number of records per value

fig = plt.figure(figsize=(10,4))
plt.subplot2grid((1,2), (0,0))  # 1-row, 2-column grid; this plot goes at row 0, column 0
sns.barplot(x=data[column].value_counts().index, y=data[column].value_counts().values)
plt.title(column)
plt.ylabel('数量')

plt.subplot2grid((1,2),(0,1))
sns.boxplot(x=column,y='月租金', data=data)
plt.show()

## What does this show?
# Months 2 and 3 have somewhat more records than month 1
# Overall, the housing market shows no obvious fluctuation across the three months

3
3    73490
2    72206
1    50843
Name: 时间, dtype: int64

[Figure: bar chart of record counts per month (left) and box plot of 月租金 by month (right)]
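The "no obvious fluctuation" reading of the box plot can also be checked numerically by comparing per-month summary statistics. A sketch with invented numbers; on the real data one would group `data` by 时间 in the same way.

```python
import pandas as pd

# Invented sample; the real data has roughly 200k rows
toy = pd.DataFrame({
    '时间': [1, 1, 2, 2, 3, 3],
    '月租金': [5.0, 9.0, 6.0, 8.0, 5.5, 8.5],
})

# Count, median and mean of rent for each month
monthly = toy.groupby('时间')['月租金'].agg(['count', 'median', 'mean'])
print(monthly)
```

Similar medians and means across the months would support the conclusion drawn from the plot.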

Community Name (小区名)

column = '小区名'
print(len(data[column].unique()))
# data[column].value_counts() 
print('最小值和最大值：',data[column].min(), data[column].max())  # min and max

fig = plt.figure(figsize=(10,4))

plt.subplot2grid((1,2), (0,0)) 
# histplot replaces the deprecated distplot
sns.histplot(data[column].dropna(), kde=True)
plt.xlabel(column)
plt.ylabel('数量')

plt.subplot2grid((1,2), (0,1)) 
# Pass x and y by keyword; newer seaborn versions reject positional data arguments
sns.scatterplot(x=data[column], y=data[label])
plt.show()

5547
最小值和最大值： 0 6627

[Figure: distribution of 小区名 (left) and scatter plot of 小区名 against 月租金 (right)]

# Wrap the two kinds of plots above into functions
def lisan_plot(column):    # discrete data
    fig = plt.figure(figsize=(10,4))
    plt.subplot2grid((1,2), (0,0))  # 1-row, 2-column grid; row 0, column 0
    sns.barplot(x=data[column].value_counts().index, y=data[column].value_counts().values)
    plt.title(column)
    plt.ylabel('数量')

    plt.subplot2grid((1,2),(0,1))
    sns.boxplot(x=column,y='月租金', data=data)
    plt.show()

def lianxu_plot(column):   # continuous data

    fig = plt.figure(figsize=(10,4))

    plt.subplot2grid((1,2), (0,0)) 
    # histplot replaces the deprecated distplot
    sns.histplot(data[column].dropna(), kde=True)
    plt.xlabel(column)
    plt.ylabel('数量')

    plt.subplot2grid((1,2), (0,1)) 
    # Pass x and y by keyword; newer seaborn versions reject positional data arguments
    sns.scatterplot(x=data[column], y=data[label])
    plt.show()
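Which of the two helpers to call depends on whether a column is discrete or continuous. One rough heuristic is to count distinct values. The function name `is_discrete` and its default threshold of 20 are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def is_discrete(series, max_unique=20):
    """Heuristic: treat a column as discrete if it has few distinct non-null values."""
    return series.dropna().nunique() <= max_unique

# Invented sample values for two of the columns
toy = pd.DataFrame({
    '时间': [1, 2, 3, 1, 2, 3],
    '房屋面积': [0.0086, 0.0170, 0.0106, 0.0192, 0.0104, 0.0131],
})

# A tiny threshold is used here only because the toy frame is tiny
kinds = {col: ('discrete' if is_discrete(toy[col], max_unique=3) else 'continuous')
         for col in toy.columns}
print(kinds)
```

On the real data, columns flagged as discrete would go to `lisan_plot` and the rest to `lianxu_plot`; ID-like columns such as 小区名 may still need a manual decision.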

Number of Rental Listings in the Community (小区房屋出租数量)

column = '小区房屋出租数量'
print(len(data[column].unique()))
