[Data Mining in Practice] The Kaggle Titanic practice competition

messiran10

License: CC 4.0 BY-SA

Category: Python data mining

Article link: https://blog.youkuaiyun.com/messiran10/article/details/50704882

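This post shows how to process a dataset with pandas, covering data loading, missing-value handling and data conversion, then how to create new features through feature engineering, and finally how to train a model with scikit-learn.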

Having mostly finished learning the basics of pandas, I am now working through a practical example to consolidate what I learned.

Reference blog post: http://www.cnblogs.com/north-north/p/4353365.html

1. Reading the data

df = pd.read_csv('train.csv',header=0)

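The pandas read_csv function reads the contents of a csv file in one call. Once the file is loaded, a few methods give a quick overview of the data:

for example, df.info() and df.describe() show the non-null count of each column, which reveals where values are missing.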
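As a quick illustration of this kind of inspection, here is a minimal sketch on a toy DataFrame. The column names follow the Titanic data, but the values are made up, and the use of isnull().sum() is an addition that the original post does not mention.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train.csv; the values are invented for illustration.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
    'Fare': [7.25, 71.28, 7.92, 53.1],
})

# Number of missing entries in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# describe() reports the non-missing count of each numeric column
print(toy.describe().loc['count'])
```

Comparing the 'count' row with the number of rows shows directly which numeric columns have gaps.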

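2. Handling missing values

There are several ways to handle missing values; the reference blog uses three of them.

Replace missing values with the mean or the mode:

# replace missing values with the mode
df.loc[df.Embarked.isnull(), 'Embarked'] = df.Embarked.dropna().mode().values[0]

Note that dropna() removes the NA values and mode() returns the most frequent value.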

Assign a value by hand:

# replace missing values with 'U0'
df.loc[df.Cabin.isnull(), 'Cabin'] = 'U0'

Predict the missing attribute with a model:

from sklearn.ensemble import RandomForestRegressor

# choose the columns used to predict age (Age must stay first)
age_df = df[['Age', 'Survived', 'Fare', 'Parch', 'SibSp', 'Pclass']]
age_df_notnull = age_df.loc[df.Age.notnull()]
age_df_isnull = age_df.loc[df.Age.isnull()]
X = age_df_notnull.values[:, 1:]
Y = age_df_notnull.values[:, 0]
# train a RandomForestRegressor on the rows whose age is known
rfr = RandomForestRegressor(n_estimators=1000, n_jobs=-1)
rfr.fit(X, Y)
# predict the unknown ages and write them back
predictAges = rfr.predict(age_df_isnull.values[:, 1:])
df.loc[df.Age.isnull(), 'Age'] = predictAges

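Note the loc method, which selects the rows you need. The values attribute of a DataFrame returns the data as a numpy array, which is what connects pandas to the scikit-learn machine learning library.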
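The row selection and the hand-off to numpy can be seen in isolation with a small sketch. The toy values below are invented; only the loc and values calls mirror the snippet above.

```python
import numpy as np
import pandas as pd

# Toy data; 'Age' is deliberately the first column, as in age_df above.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Fare': [7.25, 71.28, 7.92, 53.1],
    'Pclass': [3, 1, 3, 1],
})

# Boolean masks passed to loc select rows
known = toy.loc[toy.Age.notnull()]
unknown = toy.loc[toy.Age.isnull()]

# values returns a numpy array: column 0 is the target, the rest are features
X = known.values[:, 1:]
y = known.values[:, 0]
print(type(X), X.shape, y.shape)
```

Arrays of this shape are what a scikit-learn estimator's fit(X, y) expects: one row per sample and one column per feature.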

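3. Data conversion

Converting the qualitative Embarked attribute:

Embarked takes one of three values, 'S', 'C' or 'Q', and has to be converted to a numeric attribute before it can be used. The reference blog does this by generating a 3-dimensional indicator (one-hot) vector for each value.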

import pandas as pd
# create dummy variables from the raw data
dummies_df = pd.get_dummies(df.Embarked)
# rename the columns to Embarked_S, ...
dummies_df = dummies_df.rename(columns=lambda x: 'Embarked_' + str(x))
df = pd.concat([df, dummies_df], axis=1)

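The function used here is get_dummies(), which builds the indicator matrix. The rename method changes the column labels.

concat joins two DataFrames together; with axis=1 they are placed side by side, column-wise.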
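To make the shape of the result concrete, here is a small sketch on four made-up rows. Passing dtype=int is a choice made for this example so the indicators print as 0/1; the original code does not pass it.

```python
import pandas as pd

# Four invented passengers with their ports of embarkation
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# One indicator column per distinct value, as 0/1 integers
dummies = pd.get_dummies(toy.Embarked, dtype=int)
dummies = dummies.rename(columns=lambda x: 'Embarked_' + str(x))

# Attach the indicator columns next to the original column
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

Each row has exactly one indicator set to 1, so the three new columns together carry the same information as the original string column.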

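Converting the quantitative Age attribute:

from sklearn import preprocessing
if keep_scaled:  # boolean flag defined outside this snippet
    scaler = preprocessing.StandardScaler()
    # StandardScaler expects 2-D input, hence the double brackets
    df['Age_Scaled'] = scaler.fit_transform(df[['Age']]).ravel()

This uses StandardScaler from the scikit-learn library to standardize the column to zero mean and unit variance.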
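The arithmetic behind this scaling is short enough to write out by hand. The sketch below uses plain numpy on invented ages. It is meant as an equivalent of what StandardScaler computes for a single column (subtract the mean, divide by the population standard deviation), not as a replacement for it.

```python
import numpy as np

# Invented ages for illustration
ages = np.array([22.0, 38.0, 26.0, 35.0])

# Subtract the mean, then divide by the population standard deviation
# (numpy's default ddof=0)
scaled = (ages - ages.mean()) / ages.std()
print(scaled)
print(scaled.mean(), scaled.std())
```

After scaling, the column has mean 0 and standard deviation 1 up to floating-point error, which puts features on a comparable scale.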

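Converting the quantitative Fare attribute:

def processFare():
    global df
    df.loc[df.Fare.isnull(), 'Fare'] = df.Fare.dropna().mean()
    # replace zero fares with one tenth of the smallest non-zero fare,
    # so that later divisions by Fare are safe
    df.loc[df.Fare == 0, 'Fare'] = df.loc[df.Fare != 0, 'Fare'].min() / 10
    df['Fare_bin'] = pd.qcut(df.Fare, 4)

pd.qcut maps a continuous value to one of several discrete, equally populated bins; each element of the new column is an interval of the form (a, b].

The factorize function then turns each interval into an integer code, and finally the codes are standardized.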

df['Fare_bin_id'] = pd.factorize(df.Fare_bin)[0] + 1
scaler = preprocessing.StandardScaler()
df['Fare_bin_id_scaled'] = scaler.fit_transform(df[['Fare_bin_id']]).ravel()
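The binning and coding steps are easier to follow on a handful of numbers. In this sketch the fares are invented and listed in ascending order, so the integer codes happen to follow the bin order. In general factorize numbers the bins in order of first appearance.

```python
import pandas as pd

# Eight invented fares, in ascending order
fares = pd.Series([5.0, 10.0, 20.0, 40.0, 80.0, 160.0, 320.0, 640.0])

# Split into 4 quantile-based bins with (roughly) equal counts
bins = pd.qcut(fares, 4)

# Turn each interval into an integer code starting at 1
codes = pd.factorize(bins)[0] + 1

print(bins.value_counts())
print(codes)
```

Each bin holds two of the eight fares, and the resulting codes can be fed to a model as an ordinary integer column.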

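4. Feature engineering

Every data mining task needs features that are built by hand for that specific problem. Feature engineering is the most critical step of a data mining project.

The reference blog briefly gives two ways to generate features: use the business logic of the data itself, and combine existing features. For an item on Taobao, for example, the number of purchases divided by the number of clicks reflects the item's conversion rate, which is a very important feature of the item.

5. My program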
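The second idea, combining existing features, can be sketched with pairwise products. The toy columns and values below are invented, and the use of itertools is a choice made for this example rather than something taken from the reference blog.

```python
from itertools import combinations_with_replacement

import pandas as pd

# Invented numeric features for two passengers
toy = pd.DataFrame({
    'Age': [22.0, 38.0],
    'Pclass': [3, 1],
    'fare_id': [1, 4],
})

# One product column per unordered pair of features, including squares
products = pd.DataFrame({
    a + '*' + b: toy[a] * toy[b]
    for a, b in combinations_with_replacement(list(toy.columns), 2)
})
print(products)
```

Three input columns give six product columns. Whether such generated features help is something to check with validation, since many of them may be redundant.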

import re

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def load_off_data():
    global df
    df = pd.read_csv('train.csv')  # the first row is used as the header
    fix_miss_value()
    convert_data()


def fix_miss_value():
    global df
    df.loc[df.Cabin.isnull(), 'Cabin'] = 'U0'
    df.loc[df.Embarked.isnull(), 'Embarked'] = df.Embarked.dropna().mode().values[0]
    df.loc[df.Age.isnull(), 'Age'] = df.Age.dropna().mean()


def getCabinNumber(cabin):
    # first run of digits in the cabin string, or 0 if there is none
    match = re.compile("([0-9]+)").search(cabin)
    if match:
        return match.group()
    else:
        return 0


def convert_data():
    global df
    # one-hot encode Embarked
    embark_df = pd.get_dummies(df.Embarked, dtype=int)
    embark_df = embark_df.rename(columns=lambda x: 'Embarked' + str(x))
    df = pd.concat([df, embark_df], axis=1)

    # split Cabin into a letter code and a number
    df['cabin_letter'] = df['Cabin'].map(lambda x: re.compile("([a-zA-Z]+)").search(x).group())
    df['cabin_letter'] = pd.factorize(df.cabin_letter)[0]
    df['CabinNumber'] = df['Cabin'].map(lambda x: getCabinNumber(x)).astype(int) + 1

    # bin Fare into quartiles and encode each bin as an integer
    df['fare_bin'] = pd.qcut(df.Fare, 4)
    df['fare_id'] = pd.factorize(df.fare_bin)[0] + 1

    df['Sex'] = df['Sex'].map({'male': 1, 'female': 0})


def combine_fea():
    global df
    numerics = df.loc[:, ['Age', 'fare_id', 'Pclass', 'Sex', 'cabin_letter']]
    print("\nFeatures used for automated feature generation:\n", numerics.head(10))
    new_fields_count = 0
    # note: the "- 1" bounds leave the last column (cabin_letter) out of the combinations
    for i in range(0, numerics.columns.size - 1):
        for j in range(0, numerics.columns.size - 1):
            if i <= j:
                name = str(numerics.columns.values[i]) + "*" + str(numerics.columns.values[j])
                df = pd.concat([df, pd.Series(numerics.iloc[:, i] * numerics.iloc[:, j], name=name)], axis=1)
                new_fields_count += 1
            if i < j:
                name = str(numerics.columns.values[i]) + "+" + str(numerics.columns.values[j])
                df = pd.concat([df, pd.Series(numerics.iloc[:, i] + numerics.iloc[:, j], name=name)], axis=1)
                new_fields_count += 1

    print("\n", new_fields_count, "new features generated")


def off_feature_extraction():
    global df
    combine_fea()
    new_df = df.drop(['fare_bin', 'Fare', 'PassengerId', 'Name', 'Ticket', 'Embarked', 'Cabin'], axis=1)
    new_df.columns = range(new_df.columns.size)
    # after the drop, Survived is the first column
    label = new_df[0].values
    feature = new_df.iloc[:, 1:].values
    return label, feature


def on_feature_extraction():
    global df
    combine_fea()
    df = df.drop(['fare_bin', 'Fare', 'PassengerId', 'Name', 'Ticket', 'Embarked', 'Cabin'], axis=1)
    # rows from test.csv have no Survived value
    new_df = df[df['Survived'].isnull()]
    new_df = new_df.drop(['Survived'], axis=1)
    new_df.columns = range(new_df.columns.size)

    feature = new_df.iloc[:, 0:].values
    return feature


def off_model_building(feature, label):
    train_feature, test_feature, train_label, test_label = train_test_split(
        feature, label, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)
    clf = clf.fit(train_feature, train_label)
    predict_label = clf.predict(test_feature)
    # fraction of held-out samples predicted correctly
    accuracy = np.mean(predict_label == test_label)
    print("accuracy rate is: %f" % accuracy)


def off_test():
    load_off_data()
    label, feature = off_feature_extraction()
    off_model_building(feature, label)


def on_test():
    global df
    # load the training data and build the model
    load_off_data()
    label, feature = off_feature_extraction()
    clf = RandomForestClassifier(random_state=1, n_estimators=150, min_samples_split=4, min_samples_leaf=2)
    clf = clf.fit(feature, label)

    # load train and test data together and extract the test features
    df = pd.read_csv('train.csv')
    test_df = pd.read_csv('test.csv')
    df = pd.concat([df, test_df], ignore_index=True)
    fix_miss_value()
    convert_data()
    feature = on_feature_extraction()
    online_res(feature, clf)


def online_res(feature, clf):
    predict_label = clf.predict(feature)
    head = ['PassengerId', 'Survived']
    # PassengerId values in test.csv start at 892
    idx = np.arange(len(predict_label)) + 892
    res_data = list(zip(idx, predict_label))
    res_df = pd.DataFrame(data=res_data, columns=head)
    # keep the header row in the submission file
    res_df.to_csv('res.csv', index=False)

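Working from my own understanding, I completed the Titanic competition with the pandas and scikit-learn libraries. The program is listed above, and other data mining competitions can be approached with the same program framework.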