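Having more or less finished studying the basics of pandas, I am now working through a practical example to consolidate my understanding of how pandas is used.
Reference blog post: http://www.cnblogs.com/north-north/p/4353365.html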
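1. Reading the data

df = pd.read_csv('train.csv', header=0)

The pandas read_csv function makes it easy to read the contents of a CSV file. Once the file is loaded, a few methods give a quick overview of the data:
df.info() lists the non-null count and dtype of every column, and df.describe() gives summary statistics for the numeric columns. Comparing those counts with the total number of rows shows which columns have missing values.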
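As a quick illustration of that first look, the sketch below loads a tiny in-memory CSV and counts the missing values per column with isnull().sum(), which is a more direct check than reading the info() output. The three rows are made up and only stand in for train.csv; the column names follow the Titanic data.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for train.csv
csv_text = """PassengerId,Survived,Age,Cabin,Embarked
1,0,22,,S
2,1,,C85,C
3,1,26,,
"""

df = pd.read_csv(io.StringIO(csv_text), header=0)

# Number of missing values in each column
missing = df.isnull().sum()
print(missing)
```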
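2. Handling missing values
There are several ways to handle missing values; the blog post uses three:
Replace missing values with the mean or the mode:

# replace missing values with mode
df.loc[df.Embarked.isnull(), 'Embarked'] = df.Embarked.dropna().mode()[0]

Note that dropna() drops the NA values and mode() computes the mode. Because several values can tie for most frequent, mode() returns a Series, so the first entry is taken.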
Assign a value by hand:

# replace missing value with U0
df.loc[df.Cabin.isnull(), 'Cabin'] = 'U0'
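The two replacements above can also be written with fillna, which avoids assigning through a boolean mask. This is only a minimal sketch on a made-up frame (the values are invented for illustration), not the blog's code:

```python
import numpy as np
import pandas as pd

# Made-up rows with one gap in each strategy's column
df = pd.DataFrame({
    'Embarked': ['S', 'C', 'S', np.nan],
    'Cabin': ['C85', np.nan, np.nan, 'E46'],
    'Age': [22.0, np.nan, 26.0, 30.0],
})

# mode() returns a Series (there can be ties), so take its first entry
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# fixed placeholder for unknown cabins
df['Cabin'] = df['Cabin'].fillna('U0')
# mean() skips NaN by default
df['Age'] = df['Age'].fillna(df['Age'].mean())

print(df)
```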
Use a model to predict the missing attribute:

from sklearn.ensemble import RandomForestRegressor

# choose training data to predict age
age_df = df[['Age', 'Survived', 'Fare', 'Parch', 'SibSp', 'Pclass']]
age_df_notnull = age_df.loc[df.Age.notnull()]
age_df_isnull = age_df.loc[df.Age.isnull()]
X = age_df_notnull.values[:, 1:]
Y = age_df_notnull.values[:, 0]
# use RandomForestRegressor to train data
rfr = RandomForestRegressor(n_estimators=1000, n_jobs=-1)
rfr.fit(X, Y)
predictAges = rfr.predict(age_df_isnull.values[:, 1:])
df.loc[df.Age.isnull(), 'Age'] = predictAges

Note the loc indexer, which selects the rows you need. A DataFrame's values attribute returns the data as a NumPy array, which is how pandas connects to the scikit-learn machine-learning library.
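To see the model-based approach run end to end without train.csv, here is a small sketch on synthetic data. The frame toy, the linear rule that generates Age and the choice of 20 blanked rows are all assumptions made for the example. It selects predictors by column name rather than by slicing values, which is a little harder to get wrong when the column order changes.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 200

# Synthetic data: Age depends loosely on Pclass and Fare (invented rule)
toy = pd.DataFrame({
    'Pclass': rng.integers(1, 4, size=n),
    'Fare': rng.uniform(5, 100, size=n),
})
toy['Age'] = 50 - 8 * toy['Pclass'] + 0.1 * toy['Fare'] + rng.normal(0, 2, size=n)
original_age = toy['Age'].copy()

# Blank out 20 ages to simulate missing values
missing_idx = rng.choice(n, size=20, replace=False)
toy.loc[missing_idx, 'Age'] = np.nan

predictors = ['Pclass', 'Fare']
known = toy[toy['Age'].notnull()]
unknown = toy[toy['Age'].isnull()]

# Fit on rows with a known Age, then predict the missing ones
rfr = RandomForestRegressor(n_estimators=100, random_state=0)
rfr.fit(known[predictors], known['Age'])
toy.loc[toy['Age'].isnull(), 'Age'] = rfr.predict(unknown[predictors])

print(toy['Age'].isnull().sum(), "missing ages remain")
```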
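3. Data transformation
Qualitative transformation of the Embarked attribute:
Embarked takes three values, 'S', 'C' and 'Q', and has to be converted to numeric attributes before it can be used. The strategy in the blog post is to generate a 3-dimensional indicator vector for each value.

import pandas as pd

# create dummy variables from raw data
dummies_df = pd.get_dummies(df.Embarked)
# rename the columns to Embarked_S...
dummies_df = dummies_df.rename(columns=lambda x: 'Embarked_' + str(x))
df = pd.concat([df, dummies_df], axis=1)

The function used here is get_dummies(), which generates an indicator matrix. The rename method changes the column labels.
concat joins two DataFrames together; with axis=1 they are placed side by side as columns.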
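The rename step can be folded into get_dummies through its prefix parameter. A minimal sketch on a made-up four-row frame:

```python
import pandas as pd

# Made-up sample of the Embarked column
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# prefix='Embarked' yields Embarked_C, Embarked_Q, Embarked_S directly
dummies_df = pd.get_dummies(df['Embarked'], prefix='Embarked')
df = pd.concat([df, dummies_df], axis=1)

print(df.columns.tolist())
```

Each row has exactly one indicator set, the one matching its original Embarked value.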
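Quantitative transformation of the Age attribute:

from sklearn import preprocessing

if keep_scaled:
    scaler = preprocessing.StandardScaler()
    # fit_transform expects 2-D input, so pass a one-column DataFrame
    df['Age_Scaled'] = scaler.fit_transform(df[['Age']]).ravel()

This uses a class from the scikit-learn library to standardize the values to zero mean and unit variance.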
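What StandardScaler computes here is the ordinary z-score, using the population standard deviation (ddof=0). The sketch below checks that on four made-up ages:

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Made-up ages for illustration
df = pd.DataFrame({'Age': [22.0, 38.0, 26.0, 35.0]})

scaler = preprocessing.StandardScaler()
# fit_transform expects 2-D input, hence df[['Age']]
df['Age_Scaled'] = scaler.fit_transform(df[['Age']]).ravel()

# The same thing by hand: subtract the mean, divide by the population std
manual = (df['Age'] - df['Age'].mean()) / df['Age'].std(ddof=0)

print(np.allclose(df['Age_Scaled'], manual))
```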
Quantitative transformation of the Fare attribute:

def processFare():
    global df
    df.loc[df.Fare.isnull(), 'Fare'] = df.Fare.dropna().mean()
    #zero values divide -- laplace
    df.loc[df.Fare == 0, 'Fare'] = df.Fare[df.Fare != 0].min() / 10
    df['Fare_bin'] = pd.qcut(df.Fare, 4)

pd.qcut converts continuous values into a few discrete intervals with roughly equal numbers of rows, so each element of this column is an interval such as (1, 2].
factorize then maps each interval to an integer, and finally the result is standardized.

df['Fare_bin_id'] = pd.factorize(df.Fare_bin)[0] + 1
scaler = preprocessing.StandardScaler()
df['Fare_bin_id_scaled'] = scaler.fit_transform(df[['Fare_bin_id']]).ravel()
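One thing worth knowing about the qcut plus factorize combination: as far as I understand, factorize numbers the bins in order of first appearance, not in order of fare. If an ordered bin index is wanted, the categorical codes of the qcut result give that instead. A small sketch on made-up fares (the names ordered_id and appearance_id are just for this example):

```python
import pandas as pd

# Made-up fares, deliberately not sorted
fares = pd.Series([8, 1, 5, 3, 2, 7, 4, 6], name='Fare')

# Four quantile bins; the result is a categorical Series of intervals
fare_bin = pd.qcut(fares, 4)

# Bin index that follows the fare order (0 = cheapest quarter)
ordered_id = fare_bin.cat.codes

# Ids as produced by factorize, for comparison
appearance_id = pd.Series(pd.factorize(fare_bin)[0], index=fares.index)

print(pd.DataFrame({'Fare': fares, 'bin': fare_bin,
                    'ordered_id': ordered_id,
                    'appearance_id': appearance_id}))
```

For a tree-based model the difference rarely matters, but after standardization the ordered index keeps the "cheap to expensive" meaning while the factorize ids do not.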
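4. Feature engineering
For a specific data-mining task, features have to be constructed by hand. Feature engineering is the most critical step of a data-mining project.
The blog post briefly gives two ways to generate features: use the business logic of the data itself, and combine different features. For a product on Taobao, for example, the number of purchases divided by the number of clicks reflects the product's conversion rate, which is a very important feature of the product.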
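Both ideas fit in a few lines of pandas. The sketch below is only an illustration: the items frame and its columns purchases, clicks and price are hypothetical, not taken from the blog post or from any real data.

```python
from itertools import combinations

import pandas as pd

# Hypothetical product statistics
items = pd.DataFrame({
    'purchases': [5, 20, 1],
    'clicks': [100, 200, 50],
    'price': [10.0, 2.5, 99.0],
})

# Idea 1: a feature that comes from business logic
items['conversion_rate'] = items['purchases'] / items['clicks']

# Idea 2: combine existing features, here as pairwise products
base_cols = ['purchases', 'clicks', 'price']
for a, b in combinations(base_cols, 2):
    items[a + '*' + b] = items[a] * items[b]

print(items.columns.tolist())
```

combinations yields each unordered pair once, so no feature is generated twice.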
5. My program
import re

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def load_off_data():
    global df
    df = pd.read_csv('train.csv')  # using the first row as head
    fix_miss_value()
    convert_data()


def fix_miss_value():
    global df
    df.loc[df.Cabin.isnull(), 'Cabin'] = 'U0'
    df.loc[df.Embarked.isnull(), 'Embarked'] = df.Embarked.dropna().mode()[0]
    df.loc[df.Age.isnull(), 'Age'] = df.Age.dropna().mean()


def getCabinNumber(cabin):
    match = re.compile("([0-9]+)").search(cabin)
    if match:
        return match.group()
    else:
        return 0


def convert_data():
    global df
    # one-hot encode Embarked
    embark_df = pd.get_dummies(df.Embarked)
    embark_df = embark_df.rename(columns=lambda x: 'Embarked' + str(x))
    df = pd.concat([df, embark_df], axis=1)
    # split Cabin into a letter part and a number part
    df['cabin_letter'] = df['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
    df['cabin_letter'] = pd.factorize(df.cabin_letter)[0]
    df['CabinNumber'] = df['Cabin'].map(getCabinNumber).astype(int) + 1
    # bin Fare into quartiles and number the bins
    df['fare_bin'] = pd.qcut(df.Fare, 4)
    df['fare_id'] = pd.factorize(df.fare_bin)[0] + 1
    df['Sex'] = df['Sex'].map({'male': 1, 'female': 0})


def combine_fea():
    global df
    numerics = df.loc[:, ['Age', 'fare_id', 'Pclass', 'Sex', 'cabin_letter']]
    print("\nFeatures used for automated feature generation:\n", numerics.head(10))
    new_fields_count = 0
    # note: both ranges stop at size-1, so the last column (cabin_letter) is not combined
    for i in range(0, numerics.columns.size - 1):
        for j in range(0, numerics.columns.size - 1):
            if i <= j:
                name = str(numerics.columns.values[i]) + "*" + str(numerics.columns.values[j])
                df = pd.concat([df, pd.Series(numerics.iloc[:, i] * numerics.iloc[:, j], name=name)], axis=1)
                new_fields_count += 1
            if i < j:
                name = str(numerics.columns.values[i]) + "+" + str(numerics.columns.values[j])
                # the sum feature must add the columns (the original multiplied them again)
                df = pd.concat([df, pd.Series(numerics.iloc[:, i] + numerics.iloc[:, j], name=name)], axis=1)
                new_fields_count += 1
    print("\n", new_fields_count, "new features generated")


def off_feature_extraction():
    global df
    combine_fea()
    new_df = df.drop(['PassengerId', 'Name', 'Ticket', 'Fare', 'Embarked', 'fare_bin', 'Cabin'], axis=1)
    label = new_df['Survived'].values
    feature = new_df.drop(['Survived'], axis=1).values.astype(float)
    return label, feature


def on_feature_extraction():
    global df
    combine_fea()
    df = df.drop(['PassengerId', 'Name', 'Ticket', 'Fare', 'Embarked', 'fare_bin', 'Cabin'], axis=1)
    # rows from test.csv have no Survived value
    new_df = df[df['Survived'].isnull()]
    new_df = new_df.drop(['Survived'], axis=1)
    feature = new_df.values.astype(float)
    return feature


def off_model_building(feature, label):
    train_feature, test_feature, train_label, test_label = train_test_split(
        feature, label, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)
    clf = clf.fit(train_feature, train_label)
    predict_label = clf.predict(test_feature)
    # share of held-out rows predicted correctly
    accuracy = np.mean(predict_label == test_label)
    print("accuracy rate is: %f" % accuracy)


def off_test():
    global df
    df = pd.read_csv('train.csv')  # using the first row as head
    fix_miss_value()
    convert_data()
    label, feature = off_feature_extraction()
    off_model_building(feature, label)


def on_test():
    global df
    # load training data and build the model
    df = pd.read_csv('train.csv')  # using the first row as head
    fix_miss_value()
    convert_data()
    label, feature = off_feature_extraction()
    clf = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)
    clf = clf.fit(feature, label)
    # load test data and extract the feature
    df = pd.read_csv('train.csv')  # using the first row as head
    test_df = pd.read_csv('test.csv')  # using the first row as head
    # ignore_index avoids duplicate row labels in the combined frame
    df = pd.concat([df, test_df], ignore_index=True)
    fix_miss_value()
    convert_data()
    feature = on_feature_extraction()
    online_res(feature, clf)


def online_res(feature, clf):
    predict_label = clf.predict(feature)
    head = ['PassengerId', 'Survived']
    # PassengerId values in test.csv start at 892
    idx = np.arange(len(predict_label)) + 892
    res_data = list(zip(idx, predict_label))
    df = pd.DataFrame(data=res_data, columns=head)
    df.to_csv('res.csv', index=False, header=False)
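Following my own understanding, I completed the Titanic competition using pandas and scikit-learn. The program is above; other data-mining competitions can be tackled by following the same framework.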