housing price

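This post walks through data preparation for a house-price prediction problem: reading the data, feature engineering, handling missing values, visualization, and splitting the data. It uses StratifiedShuffleSplit for stratified sampling, and Scikit-Learn to standardize the numeric features and one-hot encode the categorical ones. It then builds linear regression and decision tree models and computes prediction errors such as RMSE and MAE. Finally, it evaluates the decision tree model with cross-validation and tunes its hyperparameters with GridSearchCV.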
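
The preparation and modelling steps named in the summary (missing-value handling, standardization, one-hot encoding, linear regression, decision tree, RMSE/MAE, cross-validation, GridSearchCV) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the post's own code: the data frame is synthetic rather than housing.csv, and the median imputation strategy, the helper names `build_pipeline` and `training_errors`, and the `max_depth` grid are all hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (made-up values, not housing.csv).
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "median_income": rng.uniform(0.5, 10.0, n),
    "total_rooms": rng.integers(100, 5000, n).astype(float),
    "ocean_proximity": rng.choice(["INLAND", "NEAR BAY", "NEAR OCEAN"], n),
})
X.loc[X.index % 25 == 0, "total_rooms"] = np.nan  # inject some missing values
y = 40000 * X["median_income"] + rng.normal(0, 20000, n)

num_cols = ["median_income", "total_rooms"]
cat_cols = ["ocean_proximity"]


def build_pipeline(model):
    """Impute and scale numeric columns, one-hot encode text columns, then apply `model`."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # assumed strategy
        ("scale", StandardScaler()),
    ])
    prep = ColumnTransformer([
        ("num", numeric, num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])
    return Pipeline([("prep", prep), ("model", model)])


def training_errors(pipe):
    """Fit on (X, y) and return (RMSE, MAE) measured on the training data itself."""
    pred = pipe.fit(X, y).predict(X)
    return np.sqrt(mean_squared_error(y, pred)), mean_absolute_error(y, pred)


lin_rmse, lin_mae = training_errors(build_pipeline(LinearRegression()))
tree_pipe = build_pipeline(DecisionTreeRegressor(random_state=42))
tree_rmse, tree_mae = training_errors(tree_pipe)

# Cross-validation gives a fairer estimate than the training error.
scores = cross_val_score(tree_pipe, X, y, scoring="neg_mean_squared_error", cv=5)
cv_rmse = np.sqrt(-scores)

# Small illustrative grid; the "model__" prefix routes the parameter to the tree step.
grid = GridSearchCV(tree_pipe, {"model__max_depth": [2, 4, 8]},
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X, y)

print(f"linear: RMSE={lin_rmse:.0f} MAE={lin_mae:.0f}")
print(f"tree (training): RMSE={tree_rmse:.0f}")
print(f"tree (5-fold CV): RMSE={cv_rmse.mean():.0f}")
print("best max_depth:", grid.best_params_["model__max_depth"])
```

On data like this an unconstrained tree usually reaches a training error close to zero, which is why the cross-validated error is the more informative number when comparing it with linear regression.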


import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

###################Get data#######################

house_data = pd.read_csv("./housing.csv")

# Divide by 1.5 to limit the number of income categories
house_data["income_cat"] = np.ceil(house_data["median_income"] / 1.5)
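# Cap the category at 5: anything that would be 5 or higher is labeled 5.
# Assign the result back; calling where(..., inplace=True) on a selected column
# is unreliable in recent pandas versions.
house_data["income_cat"] = house_data["income_cat"].where(house_data["income_cat"] < 5, 5.0)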

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(house_data, house_data["income_cat"]):
    strat_train_set = house_data.loc[train_index]
    strat_test_set = house_data.loc[test_index]

# from sklearn.model_selection import train_test_split
# train_set, test_set = train_test_split(house_data, test_size=0.2, random_state=16)

# The helper column is only needed for stratification, so drop it again
strat_train_set = strat_train_set.drop("income_cat", axis=1)
strat_test_set = strat_test_set.drop("income_cat", axis=1)
house_data = strat_train_set.copy()
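
One way to check what the stratified split achieves is to compare the income-category proportions in the test set with those in the full data set. The sketch below does this on synthetic incomes (a hypothetical log-normal sample, not housing.csv) and draws a purely random split with `train_test_split` for comparison. The names `df`, `proportions` and `comparison` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

# Synthetic incomes standing in for the median_income column (assumption).
rng = np.random.default_rng(0)
df = pd.DataFrame({"median_income": rng.lognormal(mean=1.2, sigma=0.5, size=5000)})

# Same binning idea as above: divide by 1.5, round up, cap at category 5.
df["income_cat"] = np.ceil(df["median_income"] / 1.5).clip(upper=5)

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, df["income_cat"]))
strat_test = df.iloc[test_idx]  # split() yields positional indices, hence iloc

# Purely random split of the same size, for comparison.
_, random_test = train_test_split(df, test_size=0.2, random_state=42)


def proportions(frame):
    """Share of rows in each income category, ordered by category."""
    return frame["income_cat"].value_counts(normalize=True).sort_index()


comparison = pd.DataFrame({
    "overall": proportions(df),
    "stratified": proportions(strat_test),
    "random": proportions(random_test),
})
comparison["strat_err"] = (comparison["stratified"] - comparison["overall"]).abs()
comparison["rand_err"] = (comparison["random"] - comparison["overall"]).abs()
print(comparison)
```

The stratified column should track the overall proportions almost exactly, while the random column is free to drift, particularly for the smaller categories.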


###############Visualize########################

house_data.hist(bins=50, figsize=(15, 10))
house_data.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
house_data.plot(kind="scatter", x="longitude", y="latitude", alpha=0.3,
                s=house_data["population"] / 100, label="population",
                c=house_data["median_house_value"], cmap=plt.get_cmap("jet"), colorbar=True,
                )
plt.legend()
#plt.show()

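# numeric_only=True skips non-numeric columns, which pandas 2.0+ would otherwise reject
corr_matrix = house_data.corr(numeric_only=True)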
a = corr_matrix["median_house_value"].sort_values(ascending=False)

house_data["rooms_per_households"] = house_data["total_rooms"] / house_data["households"]
house_data["bedrooms_per_room"] = house_data["total_bedrooms"] / house_data["total_rooms"]
house_data["population_per_households"] = house_data["population"] / house_data["households"]

#print(house_data.info())

# Recompute the correlations now that the ratio features exist
corr_matrix = house_data.corr(numeric_only=True)
b = corr_matrix["median_house_value"].sort_values(ascending=False)
#print(b)


###################Prepare data####################

house_data = strat_train_set.drop("median_house_value", axis=1)