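Data exploration is the first thing to do after getting a dataset. The goal is a rough picture of the data to be analyzed: the size of the dataset, the number of features and samples, the data types, the probability distributions of the values, and so on. Below I walk through these steps on the Mercedes-Benz car data, which also serves as a record of my own learning.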
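import numpy as np
import pandas as pd
# Load the training and test sets
train_df = pd.read_csv('train_b.csv')
test_df = pd.read_csv('test_b.csv')
# Print (rows, columns) for each set; print() is a function in Python 3
print(train_df.shape, test_df.shape)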
(4209, 378) (4209, 377)
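Besides the shape, the other checks mentioned at the start (data types, distributions) can be sketched in a few lines. The example below is only an illustration: it runs on a tiny made-up frame, `toy_df`, rather than on `train_b.csv`, and the column `X10` and all values in it are assumptions chosen to mimic the kinds of columns in the real data (an ID, a numeric target `y`, a categorical feature, a binary feature).

```python
import io
import pandas as pd

# Hypothetical stand-in for train_b.csv; every value here is made up.
csv_text = """ID,y,X0,X10
0,100.0,k,0
1,90.0,k,1
2,80.0,az,0
3,110.0,az,0
"""
toy_df = pd.read_csv(io.StringIO(csv_text))

# Dataset size: (number of samples, number of columns)
print(toy_df.shape)

# Data types: how many columns there are of each dtype
dtype_counts = toy_df.dtypes.value_counts()
print(dtype_counts)

# Missing values: total number of null cells in the frame
n_missing = int(toy_df.isnull().sum().sum())
print(n_missing)

# Distribution of the numeric target: count, mean, std, quartiles
y_summary = toy_df['y'].describe()
print(y_summary)

# Distribution of a categorical feature: frequency of each level
x0_counts = toy_df['X0'].value_counts()
print(x0_counts)
```

On the real data the same calls would be applied to `train_df` instead of `toy_df`.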
train_df.head()
 | ID | y | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X8 | … | X375 | X376 | X377 | X378 | X379 | X380 | X382 | X383 | X384 | X385 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 130.81 | k | v | at | a | d | u | j | o | … | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 6 | 88.53 | k | t | av | e | d | y | l | o | … | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 7 | 76.26 | az | w | n | c | d | x | j | x | … | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3 | 9 | 80.62 | az | t | n |