Day 1: Data Cleaning Challenge: Handling missing values | Kaggle
Day 2: Data Cleaning Challenge: Scale and Normalize Data | Kaggle
Day 3: Data Cleaning Challenge: Parsing Dates | Kaggle
Day 4: Data Cleaning Challenge: Character Encodings | Kaggle
Day 5: Data Cleaning Challenge: Inconsistent Data Entry | Kaggle
1. Handling missing values
Take a first look at the data
# modules we'll use
import pandas as pd
import numpy as np
# read in all our data
nfl_data = pd.read_csv("../input/nflplaybyplay2009to2016/NFL Play by Play 2009-2017 (v4).csv")
sf_permits = pd.read_csv("../input/building-permit-applications-data/Building_Permits.csv")
# set seed for reproducibility
np.random.seed(0)
# look at a few rows of the nfl_data file. I can see a handful of missing data already!
nfl_data.sample(5)
See how many missing values there are
# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()
# look at the # of missing points in the first ten columns
mi

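This post is part of Kaggle's 5-Day Data Cleaning Challenge and focuses on handling missing values. It walks through looking at the data, counting the missing values, working out why they are missing, and deciding whether to drop or fill them. It covers several ways of filling, such as filling with 0 or with the value from the next row, as well as more advanced approaches such as dropping columns and imputing with the mean. The approaches are evaluated by comparing the model scores they produce.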
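The drop and fill strategies named in the summary can be sketched on a small frame. This is only an illustration under stated assumptions: the toy DataFrame `df` and all variable names are hypothetical and are not taken from the NFL or building-permit datasets, and the pandas calls shown are one reasonable way to express each strategy, not necessarily the ones the original notebook uses.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT taken from the NFL or permits datasets
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [np.nan, 2.0, np.nan, 4.0],
    "c": [10.0, 20.0, 30.0, 40.0],
})

# Share of all cells that are missing, as a percentage
percent_missing = df.isnull().sum().sum() / df.size * 100

# Option 1: drop every column that contains at least one missing value
dropped_columns = df.dropna(axis=1)

# Option 2: replace every missing value with 0
filled_zero = df.fillna(0)

# Option 3: take the value from the next row in the same column,
# then fall back to 0 for anything still missing (e.g. a trailing NaN)
filled_next = df.bfill().fillna(0)

# Option 4: impute each column with its own mean
filled_mean = df.fillna(df.mean())
```

Dropping columns is the simplest option but discards whatever information the non-missing cells held, whereas filling keeps every column at the cost of introducing values that were never observed. Which trade-off is acceptable depends on why the values are missing in the first place.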



