Second-Hand Housing Data Analysis and Visualization in Python

1. Running in JupyterLab or Jupyter Notebook

Project name: second-hand housing data visualization

I. Imports

In [51]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# suppress warnings
import warnings
warnings.filterwarnings('ignore')
plt.rcParams['font.family'] = ['SimHei']   # use the SimHei font so that Chinese labels render
plt.rcParams['axes.unicode_minus'] = False # keep the minus sign "-" from showing as a box in saved figures

In [52]:

%config InlineBackend.figure_format = 'svg'

II. Reading the data

In [53]:

house=pd.read_csv('E:/新建文件夹/实训/项目/二手房数据.csv',encoding='gbk')
house

Out[53]:

       市区       小区    户型  朝向  楼层 装修情况   电梯  面积(㎡)  价格(万元)    年份
0      朝阳    育慧里一区  1室0厅   西   7   精装  有电梯   52.0   343.0  2001
1      朝阳  大西洋新城A区  2室2厅  南北  10   精装  有电梯   86.0   835.0  1999
2      朝阳     团结湖路  2室1厅  东西   6   精装  无电梯   65.0   430.0  1980
3      朝阳  尚家楼48号院  2室1厅  南北  12   精装  有电梯   75.0   610.0  1998
4      朝阳   望京西园一区  3室2厅  南北   6   精装  无电梯  115.0   710.0  1997
...    ..      ...   ...  ..  ..  ...  ...    ...     ...   ...
23672  西城    真武庙六里  2室1厅  南北  18   精装  有电梯   78.0   888.0  1988
23673  西城   右安门内大街  1室1厅  西北   7   其他  无电梯   45.0   405.0  1991
23674  西城    玉桃园二区  2室1厅  南北   6   简装  无电梯   60.0   650.0  1997
23675  西城     红莲南里  2室1厅  南北   7   精装  无电梯   61.0   470.0  1992
23676  西城   白广路6号院  3室0厅   南   6   简装  NaN   84.0   635.0  1955

23677 rows × 10 columns

III. Data cleaning and preprocessing

In [54]:

# check for missing values
house.info()  # only the 电梯 (elevator) column has missing values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23677 entries, 0 to 23676
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   市区      23677 non-null  object 
 1   小区      23677 non-null  object 
 2   户型      23677 non-null  object 
 3   朝向      23677 non-null  object 
 4   楼层      23677 non-null  int64  
 5   装修情况    23677 non-null  object 
 6   电梯      15420 non-null  object 
 7   面积(㎡)   23677 non-null  float64
 8   价格(万元)  23677 non-null  float64
 9   年份      23677 non-null  int64  
dtypes: float64(2), int64(2), object(6)
memory usage: 1.8+ MB

In [55]:

house['电梯'] = house['电梯'].fillna('未知')  # fill the missing elevator values with '未知' (unknown)
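
The fill step can be tried on a small frame first. The sketch below assumes a toy DataFrame whose rows are made up for illustration, and it assigns the filled column back instead of relying on `inplace=True` on a selected column.

```python
import numpy as np
import pandas as pd

# toy rows, invented for illustration only
toy = pd.DataFrame({
    '楼层': [7, 10, 6],
    '电梯': ['有电梯', np.nan, '无电梯'],
})

missing_before = int(toy['电梯'].isnull().sum())   # gaps before filling
toy['电梯'] = toy['电梯'].fillna('未知')            # assign the filled column back
missing_after = int(toy['电梯'].isnull().sum())    # gaps after filling

print(missing_before, missing_after)
```

Counting the gaps before and after gives a quick check that the fill did what was intended.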

In [56]:

# alternative: drop the 电梯 column in place (left disabled)
# house.drop(labels='电梯',axis=1,inplace=True)
# house  # inspect the data

In [56]:

# check again for missing values
house.info()  # confirms that no missing values remain
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23677 entries, 0 to 23676
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   市区      23677 non-null  object 
 1   小区      23677 non-null  object 
 2   户型      23677 non-null  object 
 3   朝向      23677 non-null  object 
 4   楼层      23677 non-null  int64  
 5   装修情况    23677 non-null  object 
 6   电梯      23677 non-null  object 
 7   面积(㎡)   23677 non-null  float64
 8   价格(万元)  23677 non-null  float64
 9   年份      23677 non-null  int64  
dtypes: float64(2), int64(2), object(6)
memory usage: 1.8+ MB

In [57]:

house.isnull().any()  # no column has missing values

Out[57]:

市区        False
小区        False
户型        False
朝向        False
楼层        False
装修情况      False
电梯        False
面积(㎡)     False
价格(万元)    False
年份        False
dtype: bool

IV. Plotting with pyecharts

In [58]:

from pyecharts.globals import CurrentConfig, NotebookType

CurrentConfig.NOTEBOOK_TYPE = NotebookType.JUPYTER_LAB

In [59]:

import pandas as pd
from pyecharts.charts import Map
from pyecharts.charts import Bar
from pyecharts.charts import Line
from pyecharts.charts import Grid
from pyecharts.charts import Pie
from pyecharts.charts import Scatter
from pyecharts import options as opts
(1) Number of second-hand listings per district

In [60]:

g = house.groupby('市区')
df_region = g.count()['小区']
region = df_region.index.tolist()
count = df_region.values.tolist()
df_region

Out[60]:

市区
东城     1533
丰台     2952
大兴     2115
密云       12
平谷       41
延庆      469
怀柔       15
房山     1442
昌平     2811
朝阳     2973
海淀     2983
石景山     882
西城     2130
通州     1602
门头沟     496
顺义     1221
Name: 小区, dtype: int64

In [104]:

g = house.groupby('市区')
df_region = g.count()['小区']
region = df_region.index.tolist()
count = df_region.values.tolist()
new = [x + '区' for x in region]
m = (
        Map()
        .add('', [list(z) for z in zip(new, count)], '北京')
        .set_global_opts(
            title_opts=opts.TitleOpts(title='北京市二手房各区分布',is_show=True),
            visualmap_opts=opts.VisualMapOpts(max_=3000,is_show=True),
        )
    )
m.render('北京市二手房各区分布.html')

Out[104]:

'E:\\新建文件夹\\实训\\北京市二手房各区分布.html'
(2) Listing count and average price per district (bar chart)

In [62]:

house_price=house.groupby('市区')['价格(万元)'].mean()
house_price

Out[62]:

市区
东城     851.425245
丰台     525.103591
大兴     460.469693
密云     425.333333
平谷     308.658537
延庆     549.876333
怀柔     785.200000
房山     360.611859
昌平     469.230345
朝阳     757.320148
海淀     827.740194
石景山    468.926757
西城     828.909202
通州     455.107553
门头沟    388.054032
顺义     558.339885
Name: 价格(万元), dtype: float64

In [63]:

price=[round(x,2) for x in house_price.values.tolist()]
price

Out[63]:

[851.43,
 525.1,
 460.47,
 425.33,
 308.66,
 549.88,
 785.2,
 360.61,
 469.23,
 757.32,
 827.74,
 468.93,
 828.91,
 455.11,
 388.05,
 558.34]
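
The count and the mean price per district can also come from a single `groupby` pass. This is a minimal sketch on a toy frame (five rows taken from the table in section II); the names `summary`, `region_names`, `counts`, and `mean_prices` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# five rows taken from the listing table, for illustration
toy = pd.DataFrame({
    '市区': ['朝阳', '朝阳', '西城', '西城', '西城'],
    '价格(万元)': [343.0, 835.0, 888.0, 405.0, 650.0],
})

# one pass over the groups yields both statistics
summary = (
    toy.groupby('市区')['价格(万元)']
       .agg(['count', 'mean'])
       .round(2)
)

region_names = summary.index.tolist()   # x-axis categories
counts = summary['count'].tolist()      # bar series
mean_prices = summary['mean'].tolist()  # line series

print(summary)
```

Because both lists come from the same grouped frame, the categories and the two series cannot drift out of order.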

In [64]:

# bar chart of listing count with the average price per district as a line
bar = (
    Bar()
    .add_xaxis(region)
    .add_yaxis('数量', count,
              label_opts=opts.LabelOpts(is_show=True))
    .extend_axis(
        yaxis=opts.AxisOpts(
            name="价格(万元)",
            type_="value",
            min_=200,
            max_=900,
            interval=100,
            axislabel_opts=opts.LabelOpts(formatter="{value}"),
        )
    )
    .set_global_opts(
        tooltip_opts=opts.TooltipOpts(
            is_show=True, trigger="axis", axis_pointer_type="cross"
        ),
        xaxis_opts=opts.AxisOpts(
            type_="category",
            axispointer_opts=opts.AxisPointerOpts(is_show=True, type_="shadow"),
        ),
        yaxis_opts=opts.AxisOpts(name='数量',
            axistick_opts=opts.AxisTickOpts(is_show=True),
            splitline_opts=opts.SplitLineOpts(is_show=False),)
    )
)
line2 = (
    Line()
    .add_xaxis(xaxis_data=region)
    .add_yaxis( 
        series_name="价格",
        yaxis_index=1,
        y_axis=price,
        label_opts=opts.LabelOpts(is_show=True),
        z=10)
)
bar.overlap(line2)
bar.load_javascript()

Out[64]:

In [77]:

bar.render_notebook()

Out[77]:

  • Average price analysis
(3) Highest-priced second-hand listings (Top)

In [78]:

# the 15 highest-priced listings
top_price=house.sort_values(by='价格(万元)',ascending=False)[:15]
top_price

Out[78]:

       市区       小区     户型  朝向  楼层 装修情况   电梯  面积(㎡)  价格(万元)    年份
20390  西城      朱雀门   4室2厅  东南   5   其他  有电梯  376.0  6000.0  2008
22228  东城     贡院六号   4室2厅  南北  23   精装  有电梯  459.0  5500.0  2002
22907  东城   NAGA上院   6室2厅  东南  12   精装  有电梯  608.0  5000.0  2008
3219   顺义       丽宫   5室2厅  南北   3   精装   未知  685.0  5000.0  2007
22982  东城   当代MOMA   5室2厅  东南   7   精装   未知  384.0  4988.0  2006
20202  西城      耕天下   5室3厅  南北   7   其他  有电梯  330.0  4650.0  2003
6191   昌平     碧水庄园   5室3厅  南北   2   精装   未知  571.0  4600.0  2005
2391   顺义     丽嘉花园   4室2厅  东南   2   其他   未知  548.0  4500.0  2007
17285  朝阳     首府官邸  叠拼别墅  南北   5   精装   未知  523.4  4500.0  2007
15327  海淀      紫御府   4室2厅  南北  12   精装  有电梯  374.0  4368.0  2008
23240  东城     长安太和   4室1厅      24   精装  有电梯  314.0  4350.0  2012
21012  西城   西派国际公寓   5室2厅  东南  17   精装   未知  355.0  4270.0  2009
20240  西城     金融世家   4室2厅  西北  15   精装  有电梯  300.0  4250.0  2008
14531  海淀    西山壹号院   4室3厅  东南   6   毛坯  有电梯  561.0  4150.0  2011
21760  西城      丽豪园   4室2厅  西南   6   精装  有电梯  289.0  4055.0  1999

In [79]:

area=top_price['小区'].values.tolist()
count=top_price['价格(万元)'].values.tolist()
bar_1=(
    Bar()
    .add_xaxis(area)
    .add_yaxis('价格(万元)',count,category_gap='50%')  # the series holds prices, not counts
    .set_global_opts(
        yaxis_opts=opts.AxisOpts(name='价格(万元)'),
        xaxis_opts=opts.AxisOpts(name='小区'),  # the x-axis lists residential compounds
    )
)
bar_1.load_javascript()

Out[79]:

In [80]:

bar_1.render_notebook()

Out[80]:

In [82]:

# the 7 highest-priced listings
top_price=house.sort_values(by='价格(万元)',ascending=False)[:7]
top_price

Out[82]:

       市区      小区    户型  朝向  楼层 装修情况   电梯  面积(㎡)  价格(万元)    年份
20390  西城     朱雀门  4室2厅  东南   5   其他  有电梯  376.0  6000.0  2008
22228  东城    贡院六号  5室2厅  南北  23   精装  有电梯  459.0  5500.0  2002
22907  东城  NAGA上院  6室2厅  东南  12   精装  有电梯  608.0  5000.0  2008
3219   顺义      丽宫  5室2厅  南北   3   精装   未知  685.0  5000.0  2007
22982  东城  当代MOMA  5室2厅  东南   7   精装   未知  384.0  4988.0  2006
20202  西城     耕天下  5室3厅  南北   7   其他  有电梯  330.0  4650.0  2003
6191   昌平    碧水庄园  5室3厅  南北   2   精装   未知  571.0  4600.0  2005

In [83]:

area=top_price['小区'].values.tolist()
count=top_price['价格(万元)'].values.tolist()
bar_2=(
    Bar()
    .add_xaxis(area)
    .add_yaxis('价格(万元)',count,category_gap='50%')  # the series holds prices, not counts
    .set_global_opts(
        yaxis_opts=opts.AxisOpts(name='价格(万元)'),
        xaxis_opts=opts.AxisOpts(name='小区'),  # the x-axis lists residential compounds
    )
)
bar_2.load_javascript()

Out[83]:

In [84]:

bar_2.render_notebook()

Out[84]:

(4) Renovation status / elevator rose chart

In [85]:

house_fitment=house.groupby('装修情况')['小区'].count()
house_fitment

Out[85]:

装修情况
其他     3239
毛坯      583
简装     8499
精装    11356
Name: 小区, dtype: int64

In [86]:

house_direction=house.groupby('电梯')['小区'].count()
house_direction

Out[86]:

电梯
无电梯    6078
有电梯    9342
未知     8257
Name: 小区, dtype: int64

In [87]:

house_fitment

Out[87]:

装修情况
其他     3239
毛坯      583
简装     8499
精装    11356
Name: 小区, dtype: int64

In [88]:

fitment=house_fitment.index.tolist()
count1=house_fitment.values.tolist()
directions=house_direction.index.tolist()
count2=house_direction.values.tolist()
bar = (
    Bar()
    .add_xaxis(fitment)
    .add_yaxis('', count1, category_gap = '50%')
    .reversal_axis()
    .set_series_opts(label_opts=opts.LabelOpts(position='right'))
    .set_global_opts(
        yaxis_opts=opts.AxisOpts(name='装修情况'),
        xaxis_opts=opts.AxisOpts(name='数量'),
        title_opts=opts.TitleOpts(title='装修情况/有无电梯玫瑰图(组合图)',pos_left='33%',pos_top="5%"),
        legend_opts=opts.LegendOpts(type_="scroll", pos_left="90%",pos_top="58%",orient="vertical")
    )
)

c2 = (
    Pie(init_opts=opts.InitOpts(
            width='800px', height='600px',
            )
       )
        .add(
        '',
        [list(z) for z in zip(directions, count2)],
        radius=['10%', '30%'],
        center=['75%', '65%'],
        rosetype="radius",
        label_opts=opts.LabelOpts(is_show=True),
        )
        .set_global_opts(title_opts=opts.TitleOpts(title='有/无电梯',pos_left='33%',pos_top="5%"),
                        legend_opts=opts.LegendOpts(type_="scroll", pos_left="80%",pos_top="25%",orient="vertical")
                        )
        .set_series_opts(label_opts=opts.LabelOpts(formatter='{b}:{c} \n ({d}%)', position="outside"))  # position belongs to LabelOpts
    )
bar.overlap(c2)
bar.load_javascript()

Out[88]:

In [89]:

bar.render_notebook()

Out[89]:

In [90]:

df=house.groupby('装修情况')[['价格(万元)']].sum()
data1=df['价格(万元)'].values.tolist()

In [91]:

import pyecharts.options as opts
from pyecharts.charts import Grid, Boxplot, Scatter
x_data = fitment
y_data1= data1
s= Scatter()
# add the x-axis data
s.add_xaxis(xaxis_data=x_data)
# add the y-axis data (total price per renovation status)
s.add_yaxis(
    series_name='',
    y_axis=y_data1,
    label_opts=opts.LabelOpts(is_show=False),
)
s.set_global_opts(
        title_opts=opts.TitleOpts(title='装修与价格的散点图'),
        yaxis_opts=opts.AxisOpts(name='价格'),
        xaxis_opts=opts.AxisOpts(name='装修情况'),
    )
s.load_javascript()

Out[91]:

In [92]:

s.render_notebook()

Out[92]:

In [93]:

import pyecharts.options as opts
from pyecharts.charts import Grid, Boxplot, Scatter
df1=house.groupby('装修情况')[['面积(㎡)']].sum()
data2=df1['面积(㎡)'].values.tolist()
x_data = fitment
y_data2= data2
line=Line()
line.add_xaxis(xaxis_data=x_data)
line.add_yaxis(
    series_name='',
    y_axis=y_data2,
    label_opts=opts.LabelOpts(is_show=False),
)
line.set_global_opts(
        title_opts=opts.TitleOpts(title='装修与面积的折线图'),  # the y-axis is total area, not price
        yaxis_opts=opts.AxisOpts(name='面积'),
        xaxis_opts=opts.AxisOpts(name='装修情况'),
    )
line.load_javascript()

Out[93]:

In [94]:

line.render_notebook()

Out[94]:

(5) Total price vs. area scatter plot

In [95]:

s = (
    Scatter()
    .add_xaxis(house['面积(㎡)'].values.tolist())
    .add_yaxis('',house['价格(万元)'].values.tolist())
    .set_global_opts(xaxis_opts=opts.AxisOpts(name='面积(㎡)',type_='value'),
                    yaxis_opts=opts.AxisOpts(name='价格(万元)'),)
)
s.load_javascript()

Out[95]:

In [96]:

s.render_notebook()

Out[96]:

(6) Floor distribution bar chart

In [97]:

g1 =house.groupby('楼层')
house_floor = g1.count()['小区']
house_floor

Out[97]:

楼层
1        6
2       94
3      201
4      465
5     1070
6     7658
7      821
8      321
9      670
10     406
11     790
12     702
13     405
14     745
15     787
16    1033
17     373
18    1553
19     347
20     638
21     644
22     577
23     253
24     858
25     357
26     456
27     402
28     505
29     167
30     126
31      64
32      99
33      33
34      21
35      17
36       8
40       3
42       1
57       1
Name: 小区, dtype: int64

In [98]:

floor =house_floor.index.tolist()
count = house_floor.values.tolist()
bar = (
    Bar()
    .add_xaxis(floor)
    .add_yaxis('数量', count)
    .set_global_opts(
        title_opts=opts.TitleOpts(title='二手房楼层分布柱状缩放图'),
        yaxis_opts=opts.AxisOpts(name='数量'),
        xaxis_opts=opts.AxisOpts(name='楼层'),
        datazoom_opts=opts.DataZoomOpts(type_='slider')
    )
)
bar.load_javascript()

Out[98]:

In [99]:

bar.render_notebook()

Out[99]:

(7) Floor-area distribution bar chart

In [100]:

area_level = [0, 50, 100, 150, 200, 250, 300, 350, 400, 1500]    
label_level = ['小于50', '50-100', '100-150', '150-200', '200-250', '250-300', '300-350', '350-400', '大于400']    
jzmj_cut = pd.cut(house['面积(㎡)'], area_level, labels=label_level)        
df_area = jzmj_cut.value_counts()
df_area

Out[100]:

50-100     13653
100-150     5809
150-200     1677
小于50        1562
200-250      545
250-300      226
300-350       94
大于400         56
350-400       55
Name: 面积(㎡), dtype: int64
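
One detail of `pd.cut` is worth checking: by default the bins are closed on the right, so a value that sits exactly on an edge falls into the lower bin. The sketch below reuses the bin edges and labels from the cell above on four sample areas chosen for illustration.

```python
import pandas as pd

area_level = [0, 50, 100, 150, 200, 250, 300, 350, 400, 1500]
label_level = ['小于50', '50-100', '100-150', '150-200', '200-250',
               '250-300', '300-350', '350-400', '大于400']

# sample areas sitting on and just above two bin edges
samples = pd.Series([50.0, 50.1, 400.0, 400.1])

# default right=True means each bin is (lower, upper]
binned = pd.cut(samples, area_level, labels=label_level)
result = binned.astype(str).tolist()

print(result)
```

So a 50 ㎡ listing is counted under '小于50' even though the label reads "less than 50", and any area above 1500 or not greater than 0 would fall outside every bin and become NaN.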

In [101]:

area = df_area.index.tolist()
count = df_area.values.tolist()

bar = (
    Bar()
    .add_xaxis(area)
    .add_yaxis('数量', count)
    .reversal_axis()
    .set_series_opts(label_opts=opts.LabelOpts(position="right"))
    .set_global_opts(
        title_opts=opts.TitleOpts(title='房屋面积分布纵向柱状图'),
        yaxis_opts=opts.AxisOpts(name='面积(㎡)'),
        xaxis_opts=opts.AxisOpts(name='数量'),
    )
)
bar.load_javascript()

Out[101]:

In [102]:

bar.render_notebook()

Out[102]:

Model evaluation
Predict the renovation status (装修情况) of new samples from their area and price

In [88]:

from sklearn.neighbors import KNeighborsClassifier
X=house[['面积(㎡)', '价格(万元)']]
y=house['装修情况']  # a 1-D Series, the target shape scikit-learn expects
knn=KNeighborsClassifier(n_neighbors=5)
knn.fit(X,y)  # fit on the full data set
x_test=pd.DataFrame({'面积(㎡)':[55,67,89,34],'价格(万元)':[788.8,789,2343,88]})  # samples to predict
x_test

Out[88]:

   面积(㎡)  价格(万元)
0     55   788.8
1     67   789.0
2     89  2343.0
3     34    88.0

In [89]:

knn.predict(x_test)  # predicted labels

Out[89]:

array(['简装', '简装', '精装', '其他'], dtype=object)
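
KNN compares samples by distance, and here the price column (up to several thousand 万元) spans a much wider range than the area column, so price tends to dominate the distance. A common remedy is to standardize the features before fitting. The following is only a sketch under that assumption: it trains on eight rows copied from the tables in this post, which is far too few for a real model, and the pipeline is not part of the original notebook.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# eight rows copied from the listing table, for illustration only
toy = pd.DataFrame({
    '面积(㎡)': [52.0, 86.0, 65.0, 75.0, 115.0, 45.0, 60.0, 84.0],
    '价格(万元)': [343.0, 835.0, 430.0, 610.0, 710.0, 405.0, 650.0, 635.0],
    '装修情况': ['精装', '精装', '精装', '精装', '精装', '其他', '简装', '简装'],
})
X = toy[['面积(㎡)', '价格(万元)']]
y = toy['装修情况']

# scale each feature to zero mean and unit variance, then classify
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(X, y)

new_rows = pd.DataFrame({'面积(㎡)': [55, 89], '价格(万元)': [788.8, 2343]})
pred = model.predict(new_rows)
print(pred)
```

Putting the scaler inside the pipeline means new samples are scaled with the statistics learned from the training rows.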

In [90]:

house.columns

Out[90]:

Index(['市区', '小区', '户型', '朝向', '楼层', '装修情况', '电梯', '面积(㎡)', '价格(万元)', '年份'], dtype='object')
Encoding categorical data

In [106]:

house.rename(columns={'市区':'region','小区':'disrict','户型':'room','朝向':'direction','楼层':'floor',
                      '装修情况':'renovation','电梯':'elevator','面积(㎡)':'area','价格(万元)':'price','年份':'year'},inplace=True)
house_data=house[['region', 'disrict', 'room', 'direction', 'renovation','elevator', 'area', 'year', 'floor']].copy()  # copy, so the assignments below do not write into a view of house
# encode region as integers
squ=house_data['region'].unique()
m={}
# note: the loop runs over `region`, the sorted district list built in section (1), not over `squ`
for i,work in enumerate(region):
    m[work]=i
    
# apply the mapping
house_data['region']=house['region'].map(m)
for col in house_data.columns[1:6]:
    print(col)
disrict
room
direction
renovation
elevator

In [107]:

# encode the remaining categorical columns in bulk
for col in house_data.columns[1:6]:
    u=house_data[col].unique()
    def convert(x):
        return np.argwhere(u==x)[0,0]
    house_data[col]=house_data[col].map(convert)

In [108]:

house_data.head()

Out[108]:

   region  disrict  room  direction  renovation  elevator   area  year  floor
0       9        0     0          0           0         0   52.0  2001      7
1       9        1     1          1           0         0   86.0  1999     10
2       9        2     2          2           0         1   65.0  1980      6
3       9        3     2          1           0         0   75.0  1998     12
4       9        4     3          1           0         1  115.0  1997      6
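
The `np.argwhere` mapping above numbers each value by its position among the column's unique values, that is, in order of first appearance. pandas offers `pd.factorize`, which produces the same numbering in one call. A small sketch on a made-up column:

```python
import numpy as np
import pandas as pd

# made-up column for illustration
col = pd.Series(['精装', '精装', '简装', '其他', '简装'])

# approach used in the notebook: index of each value among the unique values
u = np.asarray(col.unique(), dtype=object)
codes_argwhere = col.map(lambda x: np.argwhere(u == x)[0, 0]).tolist()

# pd.factorize also numbers values in order of first appearance
codes_factorize, uniques = pd.factorize(col)

print(codes_argwhere, codes_factorize.tolist(), list(uniques))
```

Keeping `uniques` around makes it possible to translate the integer codes back into the original labels later.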

In [109]:

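house1=house[['region','elevator','area','price']].copy()  # copy, so the assignments below do not write into a view of house
# encode the categorical columns in bulk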
for col in house1.columns[0:2]:
    u=house1[col].unique()
    def convert(x):
        return np.argwhere(u==x)[0,0]
    house1[col]=house1[col].map(convert)

In [110]:

house1

Out[110]:

       region  elevator   area  price
0           0         0   52.0  343.0
1           0         0   86.0  835.0
2           0         1   65.0  430.0
3           0         0   75.0  610.0
4           0         1  115.0  710.0
...       ...       ...    ...    ...
23672      14         0   78.0  888.0
23673      14         1   45.0  405.0
23674      14         1   60.0  650.0
23675      14         1   61.0  470.0
23676      14         2   84.0  635.0

23677 rows × 4 columns

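KNN hyperparameter tuning
  • Across the runs below, n_neighbors=5 scores marginally best (0.744, against 0.743 for n_neighbors=3); every run draws a new random split, so the differences are small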

In [133]:

from sklearn.model_selection import train_test_split
X=house1  # features; house1 still contains the price column, so the target leaks into X
y=house1['price']  # target
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
knn=KNeighborsClassifier(n_neighbors=3,weights="distance")
knn.fit(x_train,y_train.astype("int"))
y_=knn.predict(x_test)
result=y_==y_test
result.mean()

Out[133]:

0.7430320945945946

In [148]:

X=house1  # features (still including the price column)
y=house1['price']  # target
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
knn=KNeighborsClassifier(n_neighbors=5,weights="distance")
knn.fit(x_train,y_train.astype("int"))
y_=knn.predict(x_test)
result=y_==y_test
result.mean()

Out[148]:

0.7440878378378378

In [187]:

x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
knn=KNeighborsClassifier(n_neighbors=7,weights="distance")
knn.fit(x_train,y_train.astype("int"))
y_=knn.predict(x_test)
result=y_==y_test
result.mean()

Out[187]:

0.7356418918918919

In [230]:

x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
knn=KNeighborsClassifier(n_neighbors=10,weights="distance")
knn.fit(x_train,y_train.astype("int"))
y_=knn.predict(x_test)
result=y_==y_test
result.mean()

Out[230]:

0.730152027027027

In [245]:

x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
knn=KNeighborsClassifier(n_neighbors=15,weights="distance")
knn.fit(x_train,y_train.astype("int"))
y_=knn.predict(x_test)
result=y_==y_test
result.mean()

Out[245]:

0.7164273648648649
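
Since price is a continuous quantity, one alternative to the runs above is a KNN regressor scored by cross-validation, with the target kept out of the feature matrix and every k evaluated on the same folds. The sketch below is not the notebook's method: it uses synthetic data generated on the spot, and the names `scores` and `best_k` are illustrative.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# synthetic data: price grows with area plus noise (assumption for illustration)
rng = np.random.default_rng(0)
area = rng.uniform(30, 300, size=200)
price = 7.0 * area + rng.normal(0, 50, size=200)

X = area.reshape(-1, 1)   # features only; the target is not included

scores = {}
for k in [3, 5, 7, 10, 15]:
    model = KNeighborsRegressor(n_neighbors=k, weights="distance")
    # mean R^2 over the same 5 folds for every k
    scores[k] = cross_val_score(model, X, price, cv=5).mean()

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Because every k sees identical folds, differences between the scores reflect the parameter rather than the luck of one random split.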
Decision tree model

In [246]:

from sklearn.model_selection import train_test_split
X=house_data  # features
y=house['price']  # target
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
# reset the indices of the training and test sets
for i in [x_train,x_test,y_train,y_test]:
    i.index = range(i.shape[0])
from sklearn.model_selection import cross_val_score
# initial model
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
clf = DecisionTreeRegressor(splitter='best',max_depth=12)
clf=clf.fit(x_train, y_train)
score_tr = clf.score(x_train,y_train)  # training-set score
score_te = clf.score(x_test,y_test)  # test-set score
score_tc= cross_val_score(clf,X,y,cv=10).mean()  # 10-fold cross-validation score
print(score_tr,score_te,score_tc)
0.9531080066732065 0.7745982538131777 0.6047073569939521

In [247]:

tr = []
te = []
tc = []
N = 10
for i in range(N):
    clf = DecisionTreeRegressor( random_state=25
                                ,max_depth=i+1  # fit trees of increasing maximum depth
                                ,criterion="friedman_mse"  # try a different split criterion
                               )
    clf = clf.fit(x_train,y_train)
    score_tr = clf.score(x_train, y_train)  # training-set score
    score_te = clf.score(x_test,y_test)  # test-set score
    score_tc = cross_val_score(clf,X, y, cv=10).mean()  # cross-validation score
    tr.append(score_tr)
    te.append(score_te)
    tc.append(score_tc)
print(max(tc))  # the best cross-validation score beats the initial model's
0.6383923356782749
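
The notebook imports `GridSearchCV` without using it; the manual loop over `max_depth` can be handed to it instead. This is a hedged sketch on synthetic data (the feature matrix and target below are generated for illustration and have nothing to do with the housing set).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# synthetic regression problem, for illustration only
rng = np.random.default_rng(25)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) * 100 + X[:, 1] * 20 + rng.normal(0, 5, size=300)

# try max_depth 1..10, scoring each candidate by 5-fold cross-validation
search = GridSearchCV(
    DecisionTreeRegressor(random_state=25),
    param_grid={"max_depth": list(range(1, 11))},
    cv=5,
)
search.fit(X, y)

best_depth = search.best_params_["max_depth"]
best_score = search.best_score_   # mean cross-validated R^2 of the best depth
print(best_depth, best_score)
```

`search.cv_results_` also keeps the score of every candidate, which can be plotted in the same way as the `tc` list.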

In [248]:

plt.plot ( range(1,N+1) ,tr,color="red" ,label="train")
plt.plot (range(1,N+1) ,te,color="blue" ,label="test")
plt.plot(range(1,N+1) ,tc,color="green",label="cross")
plt.xticks ( range (1,N+1))  # x-axis ticks at 1 to 10 only
plt.legend()
plt.xlabel("max_depth")
plt.ylabel("score")
plt.show()
Linear regression

In [249]:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X=house_data  # features
y=house['price']  # target
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
# reset the indices of the training and test sets
for i in [x_train,x_test,y_train,y_test]:
    i.index = range(i.shape[0])
# lr=LinearRegression(fit_intercept=False)
# lr.fit(x_train,y_train)
from sklearn.model_selection import cross_val_score
# initial model
from sklearn.model_selection import KFold
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
lr = LinearRegression(fit_intercept=False)
lr=lr.fit(x_train, y_train)
score_tr = lr.score(x_train,y_train)  # training-set score
score_te = lr.score(x_test,y_test)  # test-set score
score_tc= cross_val_score(lr,X,y,cv=10).mean()  # 10-fold cross-validation score
# note: this prints score_tr twice; score_tc is computed above but never shown
print(score_tr,score_te,score_tr)
0.6905104199409964 0.6829968035712277 0.6905104199409964
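
The `score` method of scikit-learn regressors returns the coefficient of determination R², defined as 1 minus the ratio of the residual sum of squares to the total sum of squares. The hand-written helper below (`r_squared` is an illustrative name, not a library function) shows the two reference points of that scale on five prices taken from the listing table.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))   # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)              # total sum of squares
    return 1.0 - ss_res / ss_tot

prices = [343.0, 835.0, 430.0, 610.0, 710.0]
mean_price = sum(prices) / len(prices)

perfect = r_squared(prices, prices)                        # predictions equal the truth
baseline = r_squared(prices, [mean_price] * len(prices))   # always predict the mean

print(perfect, baseline)
```

A score near 0.69 therefore means the model explains roughly 69% of the price variance that a constant mean prediction leaves unexplained.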

In [250]:

lr.predict(x_test)

Out[250]:

array([ 514.01980998,   16.55808561, 1177.51447509, ..., 1071.80962829,
        795.808886  ,  467.86695568])

In [251]:

house_data.columns

Out[251]:

Index(['region', 'disrict', 'room', 'direction', 'renovation', 'elevator',
       'area', 'year', 'floor'],
      dtype='object')

In [252]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
plt.style.use('ggplot')
from sklearn import tree
import sys
import os
import time
# suppress warnings
import warnings
warnings.filterwarnings('ignore')
plt.rcParams['font.family'] = ['SimHei']   # use the SimHei font so that Chinese labels render
plt.rcParams['axes.unicode_minus'] = False # keep the minus sign "-" from showing as a box in saved figures
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

In [112]:

data=pd.read_csv('E:/新建文件夹/实训/项目/二手房数据.csv',encoding='gbk')
data.head()

Out[112]:

   市区       小区    户型  朝向  楼层 装修情况   电梯  面积(㎡)  价格(万元)    年份
0  朝阳    育慧里一区  1室0厅   西   7   精装  有电梯   52.0   343.0  2001
1  朝阳  大西洋新城A区  2室2厅  南北  10   精装  有电梯   86.0   835.0  1999
2  朝阳     团结湖路  2室1厅  东西   6   精装  无电梯   65.0   430.0  1980
3  朝阳  尚家楼48号院  2室1厅  南北  12   精装  有电梯   75.0   610.0  1998
4  朝阳   望京西园一区  3室2厅  南北   6   精装  无电梯  115.0   710.0  1997

In [114]:

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23677 entries, 0 to 23676
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   市区      23677 non-null  object 
 1   小区      23677 non-null  object 
 2   户型      23677 non-null  object 
 3   朝向      23677 non-null  object 
 4   楼层      23677 non-null  int64  
 5   装修情况    23677 non-null  object 
 6   电梯      15420 non-null  object 
 7   面积(㎡)   23677 non-null  float64
 8   价格(万元)  23677 non-null  float64
 9   年份      23677 non-null  int64  
dtypes: float64(2), int64(2), object(6)
memory usage: 1.8+ MB

In [113]:

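# without parentheses this echoes the bound method (and with it the frame's repr); call data.describe() for summary statistics
data.describe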

Out[113]:

<bound method NDFrame.describe of        市区       小区    户型  朝向  楼层 装修情况   电梯  面积(㎡)  价格(万元)    年份
0      朝阳    育慧里一区  1室0厅   西   7   精装  有电梯   52.0   343.0  2001
1      朝阳  大西洋新城A区  2室2厅  南北  10   精装  有电梯   86.0   835.0  1999
2      朝阳     团结湖路  2室1厅  东西   6   精装  无电梯   65.0   430.0  1980
3      朝阳  尚家楼48号院  2室1厅  南北  12   精装  有电梯   75.0   610.0  1998
4      朝阳   望京西园一区  3室2厅  南北   6   精装  无电梯  115.0   710.0  1997
...    ..      ...   ...  ..  ..  ...  ...    ...     ...   ...
23672  西城    真武庙六里  2室1厅  南北  18   精装  有电梯   78.0   888.0  1988
23673  西城   右安门内大街  1室1厅  西北   7   其他  无电梯   45.0   405.0  1991
23674  西城    玉桃园二区  2室1厅  南北   6   简装  无电梯   60.0   650.0  1997
23675  西城     红莲南里  2室1厅  南北   7   精装  无电梯   61.0   470.0  1992
23676  西城   白广路6号院  3室0厅   南   6   简装  NaN   84.0   635.0  1955

[23677 rows x 10 columns]>

In [115]:

data.rename(columns={'市区':'region','小区':'disrict','户型':'room','朝向':'direction','楼层':'floor',
                      '装修情况':'renovation','电梯':'elevator','面积(㎡)':'area','价格(万元)':'price','年份':'year'},inplace=True)
data.head()

Out[115]:

  region  disrict  room direction  floor renovation elevator   area  price  year
0     朝阳    育慧里一区  1室0厅         西      7         精装      有电梯   52.0  343.0  2001
1     朝阳  大西洋新城A区  2室2厅        南北     10         精装      有电梯   86.0  835.0  1999
2     朝阳     团结湖路  2室1厅        东西      6         精装      无电梯   65.0  430.0  1980
3     朝阳  尚家楼48号院  2室1厅        南北     12         精装      有电梯   75.0  610.0  1998
4     朝阳   望京西园一区  3室2厅        南北      6         精装      无电梯  115.0  710.0  1997

In [116]:

data1=data[['year','region','disrict','room','direction','renovation','elevator','floor','area','price']].copy()
# fill missing elevator values by floor: above 6 assume an elevator, 6 or below assume none
data1.loc[(data['floor']>6)&(data["elevator"].isnull()),"elevator"]="有电梯"
data1.loc[(data['floor']<=6)&(data["elevator"].isnull()),"elevator"]="无电梯"
data0=data[(data["elevator"]=="有电梯")|(data["elevator"]=="无电梯")]  # rows whose elevator status was recorded
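
The floor-based rule can also be written as a single fill: build the guessed value for every row, then use it only where the original value is missing. A minimal sketch on a toy frame (rows invented for illustration; `guess` is an illustrative name):

```python
import numpy as np
import pandas as pd

# toy rows, invented for illustration only
toy = pd.DataFrame({
    'floor': [3, 6, 7, 18, 5],
    'elevator': [np.nan, np.nan, np.nan, '无电梯', '有电梯'],
})

# guessed status for every row: an elevator above floor 6, none otherwise
guess = pd.Series(np.where(toy['floor'] > 6, '有电梯', '无电梯'), index=toy.index)

# recorded values are kept; only the gaps take the guess
toy['elevator'] = toy['elevator'].fillna(guess)
print(toy)
```

Recorded values win over the rule, so the 18-floor row keeps its '无电梯' entry.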

In [117]:

data1["room"]=data1["room"].apply(lambda x:x.replace("房间","室"))
data1=data1[~data1["room"].str.contains("卫")].copy()  # keep only layouts that do not contain "卫"
data1["room_num"]=data1["room"].apply(lambda x:x[0])  # first character, e.g. "1" in "1室0厅"
data1[ "hall_num" ]=data1[ "room" ].apply(lambda x:x[2])  # third character, e.g. "0" in "1室0厅"
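
Picking characters by position assumes every layout looks like "1室0厅". The data also contains values such as "叠拼别墅" (see the Top 15 table), for which position 0 and position 2 are not digits. A regular expression is one way to parse the layout more defensively; the helper name `parse_layout` below is hypothetical.

```python
import re

# one or more digits before 室 (rooms) and before 厅 (halls)
LAYOUT = re.compile(r'(\d+)室(\d+)厅')

def parse_layout(text):
    """Return (room_num, hall_num) as integers, or (None, None) if the text does not match."""
    match = LAYOUT.search(text)
    if match is None:
        return None, None
    return int(match.group(1)), int(match.group(2))

print(parse_layout('3室2厅'))
print(parse_layout('叠拼别墅'))
print(parse_layout('10室2厅'))
```

The pattern also copes with room counts of two digits, which single-character indexing would cut short.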
Feature engineering

In [118]:

"""Add a new feature, select the columns to analyse, and reorder them"""
data1["perprice"]=round(data1["price"]/data1["area"],2)
data1=data1[["year" , "region", "disrict", "direction", "room", "room_num","hall_num", "floor", "elevator", "renovation", "perprice","area","price"]]

In [119]:

data1.head()  # preview the data

Out[119]:

   year region  disrict direction  room room_num hall_num  floor elevator renovation  perprice   area  price
0  2001     朝阳    育慧里一区         西  1室0厅        1        0      7      有电梯         精装      6.60   52.0  343.0
1  1999     朝阳  大西洋新城A区        南北  2室2厅        2        2     10      有电梯         精装      9.71   86.0  835.0
2  1980     朝阳     团结湖路        东西  2室1厅        2        1      6      无电梯         精装      6.62   65.0  430.0
3  1998     朝阳  尚家楼48号院        南北  2室1厅        2        1     12      有电梯         精装      8.13   75.0  610.0
4  1997     朝阳   望京西园一区        南北  3室2厅        3        2      6      无电梯         精装      6.17  115.0  710.0

In [127]:

figl=plt.figure(figsize=(15,15))  # create the figure
import seaborn as sns
sns.barplot( x='region',y='perprice',palette="Blues_d",data=data1)  # mean price per square metre (万元/㎡) for each Beijing district
plt.tick_params (axis='x' ,labelsize=20)
plt.tick_params ( axis='y' ,labelsize=20)
plt.xlabel('区域' ,fontsize=30)
plt.ylabel('每平米单价(均价)',fontsize=30)

Out[127]:

Text(0, 0.5, '每平米单价(均价)')

In [128]:

figl=plt.figure(figsize=(15,15))  # create the figure
sns.boxplot(x='region',y='price',data=data1)
plt.tick_params(axis='x' ,labelsize=20)
plt.tick_params(axis='y' ,labelsize=20)
plt.xlabel('区域',fontsize=30)
plt.ylabel( '二手房总价',fontsize=30)
plt.show()

In [131]:

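"""Area analysis"""
sns.histplot(data['area'],bins=20,color="skyblue",kde=True,stat="density")  # area distribution (histogram with density curve); histplot replaces the deprecated distplot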
plt.tick_params(axis='x' ,labelsize=15)
plt.tick_params(axis='y' ,labelsize=15)
plt.xlabel('area',fontsize=20)
plt.ylabel( '',fontsize=20)
plt.show()

In [134]:

fig3,[ax1,ax2]=plt.subplots(2,1)  # one figure with two stacked subplots; a separate plt.figure() call would only create an empty figure
df_layout=data.groupby("room")["area"].count().sort_values(ascending=False).to_frame ().reset_index()
sns.barplot(y="room",x="area",data=df_layout.head(20) , ax=ax1,orient="h")
ax1.set_xlabel("数量" ,fontsize=12)
ax1.set_ylabel("户型",fontsize=12)
sns.barplot( x='room',y='perprice',data=data1,ax=ax2)  # mean price per square metre for each layout
ax2.tick_params (axis='x' ,labelsize=6)
ax2.tick_params (axis='y' ,labelsize=6)
ax2.set_xlabel('户型',fontsize=10)
ax2.set_ylabel( '每平米单价(均价)',fontsize=10)
plt.show()

In [135]:

"""Year analysis"""
fig4=plt.figure()
ax1=plt.subplot2grid((2,1),(0,0))  # first subplot, at position (0, 0)
ax2=plt.subplot2grid((2,1),(1,0))  # second subplot, at position (1, 0)
sns.regplot( x="year" , y="price" ,data=data1,ax=ax1)
sns.barplot(x="year",y="price",data=data1,ax=ax2)
ax2.tick_params(axis='x' ,labelsize=4)
plt.show()

2. Source files

Link: https://pan.baidu.com/s/15N8ESHjAU58ZyL39VOgbpQ 
Extraction code: yyss

Source files in HTML format:

Link: https://pan.baidu.com/s/1LM2ZLQElIN7m2wc5UqMiOA 
Extraction code: yyss

Data set used:

Link: https://pan.baidu.com/s/1QSiS0Is57nZmRVUhiDwKtg 
Extraction code: yyss

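3. A note from the author

I suggest downloading the HTML version of the source and trying to run it yourself in JupyterLab. The figures produced by the runs above are not shown here: I am far too lazy to take screenshots.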
