Machine Learning - Data Pre-processing

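This article introduces the basic steps of data preprocessing, including handling missing values, encoding categorical data, splitting the data into a training set and a test set, and feature scaling. It provides concrete implementation examples in Python and R.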

Data Preprocessing

These are my own study notes.

Data preprocessing turns a raw dataset into one that the later steps of a machine learning workflow can use.

These notes cover only a few preprocessing steps:

  • Taking care of missing data
  • Encoding categorical data
      • Encoding the independent variable
      • Encoding the dependent variable
  • Splitting the dataset into the Training set and Test set
  • Feature scaling

I use Spyder for Python and RStudio for R.

Python

# Data Preprocessing

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
# iloc selects by integer position; loc selects by row or column label.
# X holds every column except the last one (the features).
# y holds the last column of the dataset (the label).
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

Results: the dataset, the value of X, and the value of y.
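
To make the difference between iloc and loc concrete, here is a minimal sketch. The rows are made up; only the column names (Country, Age, Salary, Purchased) follow the dataset used in these notes, and `y_by_position` and `y_by_label` are illustrative names.

```python
import pandas as pd

# Hypothetical rows that mimic the layout of Data.csv (the values are made up).
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

# iloc selects by integer position: every row, every column but the last.
X = dataset.iloc[:, :-1].values
# Position -1 is the last column, whatever its name is.
y_by_position = dataset.iloc[:, -1].values
# loc selects by label: the same column, addressed by name.
y_by_label = dataset.loc[:, "Purchased"].values

print(X.shape)
print(y_by_position)
print((y_by_position == y_by_label).all())
```

Selecting the label with position -1 rather than a fixed index keeps the code working if feature columns are added later.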

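# Taking care of missing data
# SimpleImputer is a class; it replaced the older sklearn.preprocessing.Imputer,
# which recent scikit-learn releases no longer provide.
# It imputes column by column, like the old axis = 0 setting.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])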

Result: the missing values are filled with the mean of their column.
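
What the mean strategy does can be sketched by hand with NumPy. This is only an illustration on made-up Age and Salary values, not the library's implementation; `values`, `column_means` and `filled` are names chosen for this example.

```python
import numpy as np

# Made-up Age and Salary columns with one missing value each.
values = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

# Column means computed while ignoring the NaN entries.
column_means = np.nanmean(values, axis=0)

# Replace each NaN with the mean of its own column
# (column_means is broadcast across the rows).
filled = np.where(np.isnan(values), column_means, values)

print(column_means)
print(filled)
```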

# The first and the last columns contain strings,
# so we need to encode them as numbers.
# Encoding categorical data
# Encoding the Independent Variable
# OneHotEncoder no longer accepts categorical_features in recent scikit-learn,
# so ColumnTransformer selects column 0; sparse_threshold = 0 keeps the output dense.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
ct = ColumnTransformer([('country', OneHotEncoder(), [0])],
                       remainder = 'passthrough', sparse_threshold = 0)
X = ct.fit_transform(X).astype(float)
# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

One tip: we can use dummy variables here. The Country column has three categories, “France”, “Spain” and “Germany”. Each category gets its own 0/1 column, so “France” can be written as 1 0 0, for example.
That is, three columns are used to represent one country.
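
The dummy-variable idea can be sketched with plain NumPy. This is an illustration under my own assumptions (a made-up list of countries, categories ordered alphabetically), not what scikit-learn does internally.

```python
import numpy as np

# Made-up Country column.
countries = np.array(["France", "Spain", "Germany", "Spain", "France"])

# Sorted unique categories, plus the index of each row's category.
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix has a single 1 in column i,
# so indexing it by the codes yields one dummy column per category.
dummies = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(dummies)
```

Each row contains exactly one 1, and its position tells you which country the row belongs to.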

Results of the code: the encoded values of X and y.

# We only have one dataset, so we need to split it into two parts.
# In old scikit-learn versions (before 0.18), train_test_split lives in
# sklearn.cross_validation instead of sklearn.model_selection.

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
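
# Feature Scaling
# If one feature is around 100000 and another around 43, the large one
# dominates distance-based computations and the small one is almost ignored.
# Scaling puts the variables on the same scale.
# Many algorithms also converge faster after scaling.
# Same basis: fit the scaler on X_train, then reuse it to transform X_test.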
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

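# Classification: DO NOT apply feature scaling to y (it holds class labels).
# Regression: scaling y can be needed; StandardScaler expects a 2-D array,
# so reshape y first. Left commented out because Purchased is a class label.
# sc_y = StandardScaler()
# y_train = sc_y.fit_transform(y_train.reshape(-1, 1)).ravel()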
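
The splitting step in the code above can be pictured as shuffling the row indices and cutting them in two. The sketch below is a simplified illustration on made-up data, not scikit-learn's actual implementation; `rng`, `indices`, `n_test`, `train_idx` and `test_idx` are names chosen for this example.

```python
import numpy as np

# Ten hypothetical samples with two features each, and one label per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10) % 2

rng = np.random.default_rng(0)       # fixed seed, so the split is repeatable
indices = rng.permutation(len(X))    # shuffled row indices

n_test = int(np.ceil(0.2 * len(X)))  # 20% of the rows go to the test set
test_idx, train_idx = indices[:n_test], indices[n_test:]

# The same indices are used for X and y, so rows and labels stay aligned.
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)
```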

One open question: do we need to apply feature scaling to the dummy variables? Opinions differ (a quick Google search turns up many answers), and the right strategy depends on the situation.
As for y: in a classification problem we DO NOT scale it. In a regression problem, scaling it can be needed.
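
As a rough sketch of what standardization does, each column is rescaled as z = (x - mean) / std. The example below uses made-up Age and Salary values and takes the mean and standard deviation from the training rows only, then reuses them for the test rows, mirroring the fit-on-train, transform-on-test order used with StandardScaler.

```python
import numpy as np

# Hypothetical Age and Salary columns on very different scales.
X_train = np.array([[27.0, 48000.0], [38.0, 61000.0],
                    [44.0, 72000.0], [50.0, 83000.0]])
X_test = np.array([[30.0, 54000.0], [48.0, 79000.0]])

# Statistics come from the training set only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0)

X_train_scaled = (X_train - mean) / std
# The test set reuses the training statistics, so both sets share one scale.
X_test_scaled = (X_test - mean) / std

print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))
```

After scaling, every training column has a mean of about 0 and a standard deviation of about 1. The test columns generally do not, because they were scaled with the training statistics.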

R

# Data Preprocessing

# Importing the dataset
dataset = read.csv('Data.csv')

# Taking care of missing data
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)
# Encoding categorical data
dataset$Country = factor(dataset$Country,
                         levels = c('France', 'Spain', 'Germany'),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

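# Feature Scaling
# Country and Purchased are factors, so only the numeric columns
# (Age and Salary, columns 2 and 3) can be scaled.
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])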

PS: I don’t know the correct way to insert R code using Markdown. If you know, please comment below. I would really appreciate it.

If you have any questions, please comment below and I will get back to you as soon as possible. You can also find the answers in the Udemy course “Machine Learning A-Z™: Hands-On Python & R In Data Science”.

All the code comes from that course.
