Basic Data Exploration (2)

1—Selecting Data for Modeling

Your dataset has too many variables to wrap your head around🤷🏼, or even to print out nicely🥀. How can you pare down this overwhelming amount of data to something you can understand🧐?

Let's start by selecting some variables using our intuition🤓.

To choose variables/columns, we’ll need to look at a list of all the columns🤯 in the dataset. This is done with the columns property of the DataFrame👇🏻:

#To choose variables/columns,
#we’ll need to look at a list of all the columns in the dataset.
import pandas as pd
melbourne_file_path = '/Users/mac/Desktop/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns
Out[1]: 
Index(['Suburb', 'Address', 'Rooms', 'Type', 'Price', 'Method', 'SellerG',
       'Date', 'Distance', 'Postcode', 'Bedroom2', 'Bathroom', 'Car',
       'Landsize', 'BuildingArea', 'YearBuilt', 'CouncilArea', 'Lattitude',
       'Longtitude', 'Regionname', 'Propertycount'],
      dtype='object')
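Note that 'Lattitude' and 'Longtitude' are spelled that way in the dataset itself, so any column you select has to match those spellings exactly. As a small sketch (toy_data and wanted are made-up names for illustration, not part of the Melbourne data), you could check a list of names against the columns property before using it:

```python
import pandas as pd

# Hypothetical empty DataFrame that only mimics a few of the column names
toy_data = pd.DataFrame(columns=['Rooms', 'Price', 'Lattitude', 'Longtitude'])

# Names we would like to use; 'Latitude' is deliberately spelled the usual way
wanted = ['Rooms', 'Latitude', 'Longtitude']

# Collect every requested name that is not an actual column label
missing = [name for name in wanted if name not in toy_data.columns]
print(missing)
```

An empty list would mean every requested name exists; here the check flags 'Latitude', because the dataset spells it 'Lattitude'.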

2—Selecting The Prediction Target

We can pull out a variable with dot-notation. This single column is stored in a Series, which is broadly like a DataFrame with only a single column of data.

Now, instead of choosing variables intuitively, we'll choose one based on what we're going to predict💁🏼‍♀️. The column we pull out is called the prediction target. By convention, the prediction target is called y.

#Selecting the prediction target: y = 'Price'
y = melbourne_data.Price
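To see what dot-notation actually returns, here is a minimal sketch on a tiny made-up DataFrame (toy_data and its values are assumptions for illustration, not rows from melb_data.csv). It also shows bracket-notation, which selects the same column:

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy_data = pd.DataFrame({
    'Rooms': [2, 3, 4],
    'Price': [1480000.0, 1035000.0, 1600000.0],
})

# Dot-notation and bracket-notation both pull out a single column
y_dot = toy_data.Price
y_bracket = toy_data['Price']

print(type(y_dot).__name__)     # the class name of the selected column
print(y_dot.equals(y_bracket))  # whether both selections hold the same data
```

Bracket-notation is the safer choice when a column name contains spaces or clashes with an existing DataFrame attribute, since dot-notation cannot reach those columns.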

(Interesting part😄)

3—Choosing "Features"

In the column list printed in section 1, the columns other than 'Price' can serve as “features”. By convention, the feature data is called X.

In this case, those columns are fed into our model and used to determine the home price. Sometimes you will use all columns except the target as features; other times you'll be better off with fewer features.

#For now, we'll build a model with only a few features
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 
                      'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
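Both ways of choosing features can be sketched on a tiny made-up DataFrame (toy_data, X_few and X_all are hypothetical names, and the values are invented). Selecting with a list of names keeps only those columns, while drop is one possible way to keep everything except the target:

```python
import pandas as pd

# Hypothetical toy data with three candidate features and the target
toy_data = pd.DataFrame({
    'Rooms': [2, 3, 4],
    'Bathroom': [1.0, 2.0, 1.0],
    'Landsize': [202.0, 134.0, 120.0],
    'Price': [1480000.0, 1035000.0, 1600000.0],
})

# A few hand-picked features: indexing with a list returns a DataFrame
toy_features = ['Rooms', 'Bathroom']
X_few = toy_data[toy_features]

# All columns except the target: drop returns a new DataFrame
# and leaves toy_data itself unchanged
X_all = toy_data.drop(columns=['Price'])

print(list(X_few.columns))
print(list(X_all.columns))
```

Note that the real dataset also contains text columns such as 'Suburb' and 'Address', so dropping only the target there would leave non-numeric columns in X.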

- Is this the end of it? -No😂

Let's quickly review the data we'll be using to predict house prices, using the describe method and the head method.

the describe method👀: 

#To choose variables/columns,
#we’ll need to look at a list of all the columns in the dataset.
import pandas as pd
melbourne_file_path = '/Users/mac/Desktop/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns
#Selecting the prediction target: y = 'Price'
y = melbourne_data.Price
#For now, we'll build a model with only a few features
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 
                      'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
#Summarize the selected features using the describe method
X.describe()
Out[2]: 
              Rooms      Bathroom       Landsize     Lattitude    Longtitude
count  13580.000000  13580.000000   13580.000000  13580.000000  13580.000000
mean       2.937997      1.534242     558.416127    -37.809203    144.995216
std        0.955748      0.691712    3990.669241      0.079260      0.103916
min        1.000000      0.000000       0.000000    -38.182550    144.431810
25%        2.000000      1.000000     177.000000    -37.856822    144.929600
50%        3.000000      1.000000     440.000000    -37.802355    145.000100
75%        3.000000      2.000000     651.000000    -37.756400    145.058305
max       10.000000      8.000000  433014.000000    -37.408530    145.526350
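The count row reports how many non-missing values each column has, so it doubles as a quick check for gaps. A minimal sketch on made-up data (toy_X and its values are assumptions, chosen to include one missing entry and one very large value):

```python
import numpy as np
import pandas as pd

# Hypothetical features with one missing Landsize value
toy_X = pd.DataFrame({
    'Rooms': [2, 3, 3, 4],
    'Landsize': [0.0, 150.0, np.nan, 433014.0],
})

summary = toy_X.describe()

# Rows minus non-missing count gives the number of missing values per column
missing_per_column = len(toy_X) - summary.loc['count']
print(missing_per_column)

# Comparing min, median and max helps spot suspicious extremes
print(summary.loc[['min', '50%', 'max'], 'Landsize'])
```

In the real output above, every count equals 13580, which suggests these five features have no missing values.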

the head method👀: 

#To choose variables/columns,
#we’ll need to look at a list of all the columns in the dataset.
import pandas as pd
melbourne_file_path = '/Users/mac/Desktop/melb_data.csv'
melbourne_data = pd.read_csv(melbourne_file_path) 
melbourne_data.columns
#Selecting the prediction target: y = 'Price'
y = melbourne_data.Price
#For now, we'll build a model with only a few features
melbourne_features = ['Rooms', 'Bathroom', 'Landsize', 
                      'Lattitude', 'Longtitude']
X = melbourne_data[melbourne_features]
#Show the first five rows using the head method
X.head()
Out[3]: 
   Rooms  Bathroom  Landsize  Lattitude  Longtitude
0      2       1.0     202.0   -37.7996    144.9984
1      2       1.0     156.0   -37.8079    144.9934
2      3       2.0     134.0   -37.8093    144.9944
3      3       2.0      94.0   -37.7969    144.9969
4      4       1.0     120.0   -37.8072    144.9941

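Visually checking your data with these commands is an important part of data work, and it often turns up surprises💡 that are worth checking out. For example, the describe output above shows a minimum Landsize of 0 and a maximum of 433014, far above the 75% value of 651.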
