Machine Learning: Getting Started

Load the data and choose features

import pandas as pd
file_path = ''
name_data = pd.read_csv(file_path)
name_data.describe() ## summarizes each numeric column: count (how many rows have non-missing values), mean, std, min, 25%, 50%, 75%, max

name_data.columns ## list the column names
name_data = name_data.dropna(axis=0) ## drop every row that contains a missing value (think of "na" as "not available")

## Selecting The Prediction Target
y = name_data.Price ## use the dot notation to select the column we want to predict, which is called the prediction target.

## Choosing "Features"
name_features = ['Rooms', 'Bathroom', 'Landsize', 'Lattitude', 'Longtitude']
X = name_data[name_features]

# check the data
X.describe()
X.head()
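
To see what describe() and dropna() do without needing the real CSV, here is a minimal sketch on a tiny made-up table. The column values and the names toy_data, clean_data, X_toy and y_toy are assumptions for illustration only; they are not part of the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV; the values are invented for illustration.
toy_data = pd.DataFrame({
    "Rooms": [2, 3, 4, 3],
    "Landsize": [150.0, np.nan, 400.0, 220.0],
    "Price": [500000.0, 720000.0, np.nan, 650000.0],
})

# describe() counts only the non-missing values in each column.
counts = toy_data.describe().loc["count"]
print(counts)

# dropna(axis=0) removes every row that has at least one missing value.
clean_data = toy_data.dropna(axis=0)

# Select the target with dot notation and the features with a list of column names.
y_toy = clean_data.Price
X_toy = clean_data[["Rooms", "Landsize"]]
print(X_toy.shape, y_toy.shape)
```

Two of the four rows contain a missing value, so only two rows survive dropna. On real data it is worth checking how many rows are lost this way before dropping them.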


Building your model

Define: choose the model type
Fit: capture patterns from the data
Predict
Evaluate: measure the model's accuracy

The code for building a model:

from sklearn.tree import DecisionTreeRegressor

# Define model. A fixed random_state seeds the random choices made while building the tree, so every run gives the same results.
name_model = DecisionTreeRegressor(random_state = 1)

# Fit model
name_model.fit(X,y)

# Predict (on the first rows of the training features X, just to see the output format)
print("Making predictions for the following 5 houses:")
print(X.head())
print("The predictions are")
print(name_model.predict(X.head()))
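
On the random_state question in the comment above: scikit-learn shuffles the order in which features are considered at each split, so ties between equally good splits can be broken differently from run to run. Fixing random_state makes those choices repeatable. The following is a small sketch on synthetic data (X_demo, y_demo, model_a and model_b are made-up names, not from the original notes):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, invented purely for this demonstration.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(size=200)

# Two separate models built with the same seed on the same data.
model_a = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)
model_b = DecisionTreeRegressor(random_state=1).fit(X_demo, y_demo)

# With an identical random_state the two trees make identical predictions.
same = np.array_equal(model_a.predict(X_demo), model_b.predict(X_demo))
print(same)
```

The specific number does not matter for model quality; it only matters that it stays fixed when you want reproducible results.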

Model Validation

Use the Mean Absolute Error (MAE) on validation data to evaluate the model. The error of each prediction is |predicted_value - actual_value|, and MAE is the sum of these absolute errors divided by the number of predictions.
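
As a quick check of the definition, this sketch computes MAE by hand on three invented numbers and compares it with scikit-learn's mean_absolute_error. The arrays actual and predicted are illustrative values only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented values for illustration.
actual = np.array([100.0, 150.0, 200.0])
predicted = np.array([110.0, 140.0, 230.0])

# Absolute errors are 10, 10 and 30, so the mean is 50 / 3.
manual_mae = np.mean(np.abs(predicted - actual))

# The library function takes the true values first, then the predictions.
library_mae = mean_absolute_error(actual, predicted)
print(manual_mae, library_mae)
```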

The scikit-learn library has a function train_test_split to split the data into training data and validation data.

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
# split the data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state = 0)
# define model
name_model = DecisionTreeRegressor(random_state = 1)
# fit model
name_model.fit(train_X, train_y)

# get predicted prices on validation data
val_predictions = name_model.predict(val_X)
print(mean_absolute_error(val_y, val_predictions))
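
To make the split itself less of a black box, here is a sketch on a made-up 100-row table (X_demo and y_demo are hypothetical). When neither test_size nor train_size is given, train_test_split holds out 25% of the rows, and a fixed random_state reproduces the same split.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row dataset, invented for illustration.
X_demo = pd.DataFrame({
    "Rooms": np.arange(100) % 5 + 1,
    "Landsize": np.arange(100) * 10.0,
})
y_demo = pd.Series(np.arange(100) * 1000.0, name="Price")

# Default split: 75% of the rows for training, 25% for validation.
tr_X, va_X, tr_y, va_y = train_test_split(X_demo, y_demo, random_state=0)
print(len(tr_X), len(va_X))

# Repeating the call with the same random_state selects the same rows.
tr_X2, va_X2, _, _ = train_test_split(X_demo, y_demo, random_state=0)
same_split = va_X.index.equals(va_X2.index)

# Features and targets are shuffled together, so their row labels still match.
aligned = tr_X.index.equals(tr_y.index)
print(same_split, aligned)
```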

Underfitting and Overfitting

Overfitting: the model matches the training data almost perfectly but does poorly on validation data and other new data. It captures spurious patterns that won’t recur in the future, which leads to less accurate predictions.
Underfitting: the model fails to capture important distinctions and patterns in the data, so it performs poorly even on the training data.
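
The two failure modes can be seen side by side on synthetic data. This is only a sketch: the sine-shaped data, the noise level and the two tree sizes are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data, invented for this demonstration.
rng = np.random.default_rng(0)
X_syn = rng.uniform(0, 10, size=(400, 1))
y_syn = np.sin(X_syn[:, 0]) + rng.normal(scale=0.3, size=400)
tr_X, va_X, tr_y, va_y = train_test_split(X_syn, y_syn, random_state=0)

# A tree with 2 leaves is too simple; a tree with no leaf limit can memorize the training rows.
results = {}
for label, leaves in [("tiny tree", 2), ("unrestricted tree", None)]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(tr_X, tr_y)
    train_mae = mean_absolute_error(tr_y, tree.predict(tr_X))
    val_mae = mean_absolute_error(va_y, tree.predict(va_X))
    results[label] = (train_mae, val_mae)
    print(label, "train MAE:", train_mae, "validation MAE:", val_mae)
```

The unrestricted tree reaches a training MAE of zero because it can give every training row its own leaf, yet its validation MAE stays well above zero. The tiny tree shows a clearly nonzero error even on the data it was trained on.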

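# max_leaf_nodes caps the number of leaves and so controls the tree size; node_num is a placeholder for the value you choose
model = DecisionTreeRegressor(max_leaf_nodes=node_num, random_state=0)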

# function which can calculate the MAE
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

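# candidate tree sizes to compare (example values; pick a range that suits your data)
candidate_max_leaf_nodes = [5, 50, 500, 5000]
# a loop to find the best tree size: the one with the lowest validation MAE
for max_leaf_nodes in candidate_max_leaf_nodes:
    print("max_leaf_nodes: ", max_leaf_nodes, "\t \t MAE: ", get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y))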

After finding the best tree size, you can retrain the model on all of the data; there is no longer any need to hold out the validation data.
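
One possible way to automate both steps, picking the size with the lowest validation MAE and then refitting on everything, is sketched here. The data is synthetic, and score, candidate_sizes, best_size and final_model are hypothetical names introduced for this example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, invented for illustration.
rng = np.random.default_rng(1)
X_all = rng.uniform(0, 10, size=(300, 2))
y_all = 3.0 * X_all[:, 0] + rng.normal(size=300)
tr_X, va_X, tr_y, va_y = train_test_split(X_all, y_all, random_state=0)


def score(leaves):
    """Return the validation MAE of a tree limited to `leaves` leaf nodes."""
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=0)
    tree.fit(tr_X, tr_y)
    return mean_absolute_error(va_y, tree.predict(va_X))


# Map each candidate size to its validation MAE, then take the smallest.
candidate_sizes = [5, 25, 50, 100]
scores = {leaves: score(leaves) for leaves in candidate_sizes}
best_size = min(scores, key=scores.get)
print("best max_leaf_nodes:", best_size)

# Refit with the chosen size on all rows, training and validation together.
final_model = DecisionTreeRegressor(max_leaf_nodes=best_size, random_state=0)
final_model.fit(X_all, y_all)
```

Once the validation rows are folded back in, there is no held-out data left to score final_model, so the validation MAE measured during the search is the estimate to report.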

Random Forests

A random forest uses many trees and makes a prediction by averaging the predictions of its component trees. One of the best features of random forest models is that they generally work reasonably well even without parameter tuning.
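
The averaging can be checked directly, because a fitted RandomForestRegressor exposes its individual trees through the estimators_ attribute. The sketch below uses synthetic data, and n_estimators=20 is an arbitrary choice to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, invented for illustration.
rng = np.random.default_rng(2)
X_rf = rng.normal(size=(150, 3))
y_rf = X_rf[:, 0] - 2.0 * X_rf[:, 1] + rng.normal(scale=0.1, size=150)

forest = RandomForestRegressor(n_estimators=20, random_state=1)
forest.fit(X_rf, y_rf)

# Ask every component tree for its own prediction on the first 5 rows.
per_tree = np.array([tree.predict(X_rf[:5]) for tree in forest.estimators_])

# Averaging across the trees reproduces the forest's prediction.
manual_average = per_tree.mean(axis=0)
matches = np.allclose(manual_average, forest.predict(X_rf[:5]))
print(per_tree.shape, matches)
```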

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

forest_model = RandomForestRegressor(random_state=1)
forest_model.fit(train_X, train_y)
melb_preds = forest_model.predict(val_X)
print(mean_absolute_error(val_y, melb_preds))

How to submit the model

test:

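# read the test data and predict; predictor_cols and my_model are the feature list and fitted model from your own training code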
test = pd.read_csv('../input/test.csv')
test_X = test[predictor_cols]
predicted_prices = my_model.predict(test_X)

get submission files

A submission should be a CSV file with two columns: an ID column and a prediction column.

my_submission = pd.DataFrame({'Id': test.Id, 'SalePrice': predicted_prices})
# any filename is okay.
my_submission.to_csv('submission.csv', index=False)
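
To confirm the file has the expected shape before uploading, you can write it and read it back. This sketch uses invented IDs and prices and writes to a temporary directory; ids, prices, out_path and round_trip are hypothetical names.

```python
import os
import tempfile

import pandas as pd

# Invented IDs and predictions, standing in for test.Id and predicted_prices.
ids = [1461, 1462, 1463]
prices = [120000.0, 155000.5, 180250.0]
submission = pd.DataFrame({"Id": ids, "SalePrice": prices})

# index=False keeps the DataFrame's row index out of the file,
# so the CSV contains only the two expected columns.
out_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
submission.to_csv(out_path, index=False)

# Read the file back to check the columns and the row count.
round_trip = pd.read_csv(out_path)
print(list(round_trip.columns), len(round_trip))
```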

make submission
