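The analysis below uses second-hand housing listings for Chengdu scraped from Lianjia (链家网), as of January 16, 2019. The goal is simple: as a prospective home buyer, I want to understand housing prices in my city.
Points to note:
- Only ordinary residential housing is included; apartments (公寓) and villas are excluded.
- Lianjia only lists second-hand homes in Chengdu's first and second rings of districts (一、二圈层); data for the third ring is missing. There are 52,548 records in total.
- The same business area (商圈) can appear under more than one district, so the scraped data must be deduplicated and each business area assigned to the correct district.
- Some business areas are assigned to the wrong district outright; for example, Xipu (犀浦) is listed under Xindu (新都).
- Some outlying counties are folded into a neighbouring district; for example, Xinjin (新津) is listed under Shuangliu (双流). Because these outlying counties have very few listings, separating them out adds little, so this grouping is accepted for now.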
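The deduplication and district-assignment notes above can be sketched on a toy table. This is only an illustration: the rows, the use of `link` as the duplicate key, and the `zone_fix` mapping are assumptions, not the scraper's actual logic.

```python
import pandas as pd

# Toy rows standing in for scraped listings; the values are made up.
# Listing 'a' was scraped twice because 犀浦 appears under two districts.
raw = pd.DataFrame({
    'link': ['a', 'a', 'b', 'c'],
    'zone': ['新都', '郫都', '新都', '双流'],
    'location': ['犀浦', '犀浦', '大丰', '新津'],
})

# 1) Find business areas that show up under more than one district.
zones_per_location = raw.groupby('location')['zone'].nunique()
ambiguous = sorted(zones_per_location[zones_per_location > 1].index)

# 2) Keep one row per listing URL (assumed to identify a listing uniquely).
listings = raw.drop_duplicates(subset='link').reset_index(drop=True)

# 3) Override the district for known misassignments (hypothetical mapping).
zone_fix = {'犀浦': '郫都'}
listings['zone'] = listings['location'].map(zone_fix).fillna(listings['zone'])
```

Listing the ambiguous business areas before dropping duplicates matters: once only the first copy of each listing is kept, the conflicting district labels are no longer visible.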
import pymongo
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False
pd.set_option('display.max_columns', 50)
sns.set_context('talk')
warnings.filterwarnings('ignore')
client = pymongo.MongoClient('localhost')
db = client.spider
data = pd.DataFrame(list(db.lianjia.find()))
del data['_id']
data = data[['title', 'link', 'building', 'layout', 'size', 'orientation', 'decoration', 'elevator', 'zone', 'location',
'floor', 'num_of_floor', 'year', 'type', 'follow', 'watch', 'how_long_since_release', 'tags', 'unit', 'total']]
1. Data cleaning
Start with the basic information about the data. Some variables have missing values, which are handled later.
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52548 entries, 0 to 52547
Data columns (total 20 columns):
title 52548 non-null object
link 52548 non-null object
building 52548 non-null object
layout 52548 non-null object
size 52548 non-null float64
orientation 52548 non-null object
decoration 52548 non-null object
elevator 44464 non-null object
zone 52548 non-null object
location 52548 non-null object
floor 52322 non-null object
num_of_floor 52548 non-null int64
year 43395 non-null object
type 49879 non-null object
follow 52548 non-null int64
watch 52548 non-null int64
how_long_since_release 52548 non-null object
tags 52548 non-null object
unit 52548 non-null int64
total 52548 non-null float64
dtypes: float64(2), int64(4), object(14)
memory usage: 8.0+ MB
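The non-null counts in the output above translate directly into missing-value counts and shares. The small sketch below does that arithmetic for the four incomplete columns, using only the numbers printed by `data.info()`; the variable names are illustrative.

```python
# Counts copied from the data.info() output above.
total_rows = 52548
non_null = {'elevator': 44464, 'floor': 52322, 'year': 43395, 'type': 49879}

# Missing count per column = total rows minus non-null rows.
missing = {col: total_rows - n for col, n in non_null.items()}

# Share of rows with a missing value, in percent.
missing_pct = {col: round(100 * m / total_rows, 2) for col, m in missing.items()}

for col in sorted(missing, key=missing.get, reverse=True):
    print(f"{col:10s} {missing[col]:6d} {missing_pct[col]:6.2f}%")
```

On the DataFrame itself, `data.isnull().sum()` gives the same per-column missing counts.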
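1.1 District
The district assignments in the scraped data are clearly wrong in places; for example, Xipu (犀浦) is listed under Xindu (新都). The code below moves Xipu to Pidu (郫都) and merges 天府新区南区 into 天府新区.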
#names = ['锦江', '青羊', '武侯', '高新', '成华', '金牛', '天府新区', '高新西', '双流', '温江', '郫都', '龙泉驿', '新都', '天府新区南区', '青白江', '都江堰']
#for name in names:
# print("***", name, '***')
# print(data[data['zone'] == name]['location'].value_counts())
# print('='*50)
data.loc[data['zone'] == '天府新区南区', ['zone']] = '天府新区'
data.loc[data['location'] == '犀浦', ['zone']] = '郫都'
data['bedroom_num'] = data['layout'].apply(lambda x: int(x.split('室')[0]))
data['parlour_num'] = data['layout'].apply(lambda x: int(x.split('室')[1].replace('厅', '')))
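The two `split`-based lines above rely on every `layout` value looking exactly like `3室2厅`. A slightly more defensive variant, sketched here with a regular expression, returns `None` values instead of raising when a string does not match. The helper name `parse_layout` and the assumed format are illustrative, not part of the original notebook.

```python
import re

# Assumed layout format: "<bedrooms>室<parlours>厅", e.g. "3室2厅".
LAYOUT_RE = re.compile(r'(\d+)室(\d+)厅')


def parse_layout(layout):
    """Return (bedrooms, parlours), or (None, None) if the text does not match."""
    match = LAYOUT_RE.search(str(layout))
    if match is None:
        return None, None
    return int(match.group(1)), int(match.group(2))


print(parse_layout('3室2厅'))
print(parse_layout('车位'))
```

Applied to the DataFrame, this could be used as `data['layout'].apply(parse_layout)`, with the resulting pairs unpacked into the two columns.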
1.2 House orientation
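# Collect every direction label that occurs in the data
# (each 'orientation' value is iterated as a list of labels).
directions = []
for i in data['orientation']:
    directions.extend(i)
directions = pd.unique(pd.Series(directions))
# One indicator column per direction, initialised to 0.
zero_matrix = np.zeros((len(data), len(directions)))
dummy = pd.DataFrame(zero_matrix, columns=sorted(directions))
# Set the indicator to 1 for each direction a listing faces.
for i, j in enumerate(data['orientation']):
    indices = dummy.columns.get_indexer(j)
    dummy.iloc[i, indices] = 1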
data = pd.concat([data, dummy.add_prefix('direct_')], axis=1)
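The loop above builds one indicator column per direction by hand. As a possible alternative, the same multi-label encoding can be sketched with pandas string methods. The toy Series below assumes, as the loop does, that each `orientation` value is a list of direction labels; the `|` separator is an arbitrary choice that must not occur in any label.

```python
import pandas as pd

# Toy stand-in for data['orientation']: each entry is a list of labels.
orientation = pd.Series([['南'], ['东', '南'], ['西北']])

# Join each list into one delimited string, then split it back into
# indicator columns, one per distinct label.
dummy = (
    orientation.str.join('|')
    .str.get_dummies(sep='|')
    .add_prefix('direct_')
)

# Number of listings facing each direction.
counts = dummy.sum()
print(counts)
```

This avoids the row-by-row `iloc` assignment, which can be slow on tens of thousands of listings, at the cost of a temporary string column.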
plt.bar(dummy.apply(np.sum, axis=0).index, dummy.apply(np.sum, axis=0).values)
for i, j in zip(range(8), dummy.apply(np