Kaggle Handwritten Digit Recognition

0. Theory

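Algorithm: support vector machine (SVM)
Dimensionality reduction: principal component analysis (PCA)
Data from the Kaggle website: https://www.kaggle.com/competitions/digit-recognizer/data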
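
As a rough illustration of how the two pieces fit together, the sketch below chains scaling, PCA, and an SVM into one scikit-learn pipeline. It is not the code used later in this post: it assumes scikit-learn's small bundled 8x8 digits dataset as a stand-in for the 28x28 Kaggle images, and the names `demo_model`, `X_demo`, and `y_demo` are made up for the example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Assumption: the bundled 8x8 digits stand in for the 28x28 Kaggle images.
X_demo, y_demo = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Scale pixels to [0, 1], keep enough principal components to explain
# 90% of the variance, then classify with an RBF-kernel SVM.
demo_model = make_pipeline(MinMaxScaler(), PCA(n_components=0.9), SVC())
demo_model.fit(X_tr, y_tr)

demo_accuracy = demo_model.score(X_te, y_te)
n_kept = demo_model.named_steps["pca"].n_components_
print(f"components kept: {n_kept}, held-out accuracy: {demo_accuracy:.3f}")
```

PCA shrinks the feature space before the SVM is trained, which is the main reason to combine them: SVM training time grows quickly with both sample count and feature count.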

1. Python implementation

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/digit-recognizer/sample_submission.csv
/kaggle/input/digit-recognizer/train.csv
/kaggle/input/digit-recognizer/test.csv
from time import time
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

train = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')
test = pd.read_csv('/kaggle/input/digit-recognizer/test.csv')
print(train.shape)
print(test.shape)
(42000, 785)
(28000, 784)
train.head()
   label  pixel0  pixel1  pixel2  pixel3  pixel4  pixel5  pixel6  pixel7  pixel8  ...  pixel774  pixel775  pixel776  pixel777  pixel778  pixel779  pixel780  pixel781  pixel782  pixel783
0      1       0       0       0       0       0       0       0       0       0  ...         0         0         0         0         0         0         0         0         0         0
1      0       0       0       0       0       0       0       0       0       0  ...         0         0         0         0         0         0         0         0         0         0
2      1       0       0       0       0       0       0       0       0       0  ...         0         0         0         0         0         0         0         0         0         0
3      4       0       0       0       0       0       0       0       0       0  ...         0         0         0         0         0         0         0         0         0         0
4      0       0       0       0       0       0       0       0       0       0  ...         0         0         0         0         0         0         0         0         0         0

5 rows × 785 columns

# Separate the feature columns from the label column in the training set
X = train.iloc[:,1:]
y = train.iloc[:,0]

# Plot the first ten training digits
plt.figure(figsize = (10,5))

for num in range(0,10):
    plt.subplot(2,5,num+1)
    # Reshape the length-784 vector into a 28x28 matrix
    grid_data = X.iloc[num].values.reshape((28,28))
    # Show the image in grayscale
    plt.imshow(grid_data, cmap = 'Greys')

[Figure: the first ten training digits, each drawn as a 28x28 grayscale image]

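# Feature preprocessing: rescale every pixel to [0, 1].
# Fit the scaler on the training features only and reuse it for the test set,
# so that both sets go through the same transformation.
scaler = MinMaxScaler().fit(X)
X = scaler.transform(X)
test = scaler.transform(test)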

# Hold out 10% of the labeled data as a local test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 14)
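
One detail of the scaling step is worth checking by hand: a scaler fitted separately on the test data uses that set's own minimum and maximum, so the same raw pixel value can end up with a different scaled value than it had in training. The toy sketch below shows the difference. The arrays `train_demo` and `test_demo` are hypothetical one-feature examples, not the Kaggle data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical): one feature with different ranges in train and test.
train_demo = np.array([[0.0], [128.0], [255.0]])
test_demo = np.array([[0.0], [64.0], [128.0]])

# Fit on the training data only, then reuse the same min/max for the test data.
scaler = MinMaxScaler().fit(train_demo)
shared = scaler.transform(test_demo)

# Refitting on the test data rescales by the test set's own min/max instead.
refit = MinMaxScaler().fit_transform(test_demo)

print(shared.ravel())  # the raw value 128 maps to 128/255
print(refit.ravel())   # the raw value 128 maps to 1.0
```

With the shared scaler, a raw value of 128 means the same thing in both sets. With the refitted one, it is treated as the brightest possible pixel.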

def get_best_accuracy():
    """Find the n_components value that gives the highest accuracy and plot the results."""
    n_components = []
    accuracies = []
    for f in np.linspace(0.7, 0.9, num=20, endpoint=False):
        t0 = time()
        pca = PCA(n_components = f).fit(X_train)    
        X_train_pca = pca.transform(X_train)
        X_test_pca = pca.transform(X_test)
        # Train a support vector machine classifier
        clf = svm.SVC()
        clf.fit(X_train_pca, y_train)
        # Compute the accuracy on the held-out set
        accuracy = clf.score(X_test_pca, y_test)
        # Record the result
        n_components.append(f)
        accuracies.append(accuracy)
        t1 = time()
        print('n_components:{:.2f} , accuracy:{:.4f} , time elapsed:{:.2f}s'.format(f, accuracy, t1-t0))
    ans = n_components[accuracies.index(max(accuracies))]
    print('n_components with the highest accuracy: {}'.format(ans))  
    # Plot accuracy against n_components
    plt.plot(n_components, accuracies, '-o')
    plt.xlabel('n_components')
    plt.ylabel('accuracy')
    plt.show()
    
get_best_accuracy()
n_components:0.70 , accuracy:0.9729 , time elapsed:22.28s
n_components:0.71 , accuracy:0.9740 , time elapsed:22.11s
n_components:0.72 , accuracy:0.9748 , time elapsed:23.85s
n_components:0.73 , accuracy:0.9745 , time elapsed:23.42s
n_components:0.74 , accuracy:0.9762 , time elapsed:22.19s
n_components:0.75 , accuracy:0.9762 , time elapsed:21.25s
n_components:0.76 , accuracy:0.9767 , time elapsed:23.36s
n_components:0.77 , accuracy:0.9764 , time elapsed:23.79s
n_components:0.78 , accuracy:0.9781 , time elapsed:25.35s
n_components:0.79 , accuracy:0.9776 , time elapsed:25.49s
n_components:0.80 , accuracy:0.9779 , time elapsed:25.02s
n_components:0.81 , accuracy:0.9776 , time elapsed:26.29s
n_components:0.82 , accuracy:0.9779 , time elapsed:24.83s
n_components:0.83 , accuracy:0.9779 , time elapsed:25.49s
n_components:0.84 , accuracy:0.9788 , time elapsed:27.15s
n_components:0.85 , accuracy:0.9786 , time elapsed:29.09s
n_components:0.86 , accuracy:0.9788 , time elapsed:33.13s
n_components:0.87 , accuracy:0.9783 , time elapsed:30.18s
n_components:0.88 , accuracy:0.9776 , time elapsed:33.76s
n_components:0.89 , accuracy:0.9781 , time elapsed:38.95s
n_components with the highest accuracy: 0.84

[Figure: accuracy plotted against n_components]
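
A fractional n_components, as used in the search above, is not a component count. scikit-learn reads it as a target share of variance and keeps the smallest number of components whose cumulative explained variance ratio exceeds it. The sketch below reproduces that selection by hand. It is only an illustration: it assumes synthetic random data, not the digit images, and the names `data`, `target`, and `k_manual` are made up for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data (hypothetical): 200 samples, 10 features with decreasing spread.
data = rng.normal(size=(200, 10)) * np.arange(10, 0, -1)

target = 0.84

# Fit PCA with all components and accumulate the explained variance ratios.
full_pca = PCA().fit(data)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest k whose cumulative explained variance exceeds the target fraction.
k_manual = int(np.searchsorted(cumulative, target, side="right") + 1)

# Let scikit-learn make the same choice from the fractional argument.
k_sklearn = PCA(n_components=target).fit(data).n_components_

print(k_manual, k_sklearn)
```

Both numbers should agree. Plotting `cumulative` against the component index is another way to pick a threshold without training a classifier for every candidate.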

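# Use the best n_components found above
pca = PCA(n_components=0.84).fit(X)
# Print the number of principal components kept
print(pca.n_components_)
# Project the training and test sets onto the principal components
X = pca.transform(X)
test = pca.transform(test)
# Classify with a support vector machine, tuning hyperparameters by grid search
clf_svc = GridSearchCV(
    estimator=svm.SVC(),
    param_grid={'C': [1, 2, 4, 5], 'kernel': ['linear', 'rbf', 'sigmoid']},
    cv=5,
    verbose=2,
)
# Train the model
clf_svc.fit(X, y)
# Show the parameters that give the highest cross-validated accuracy
print(clf_svc.best_params_)
# Predict labels for the test set
preds = clf_svc.predict(test)
image_id = pd.Series(range(1,len(preds)+1))
# Column names follow sample_submission.csv (ImageId, Label)
result = pd.DataFrame({'ImageId': image_id,'Label':preds})
# Save as a CSV file
result.to_csv('result_svc.csv',index = False)
print('Done')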
55
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END .................................C=1, kernel=linear; total time=  21.9s
[CV] END .................................C=1, kernel=linear; total time=  21.1s
[CV] END .................................C=1, kernel=linear; total time=  20.7s
[CV] END .................................C=1, kernel=linear; total time=  21.9s
[CV] END .................................C=1, kernel=linear; total time=  21.6s
[CV] END ....................................C=1, kernel=rbf; total time=  20.8s
[CV] END ....................................C=1, kernel=rbf; total time=  20.6s
[CV] END ....................................C=1, kernel=rbf; total time=  20.9s
[CV] END ....................................C=1, kernel=rbf; total time=  20.3s
[CV] END ....................................C=1, kernel=rbf; total time=  20.5s
[CV] END ................................C=1, kernel=sigmoid; total time=  26.5s
[CV] END ................................C=1, kernel=sigmoid; total time=  26.8s
[CV] END ................................C=1, kernel=sigmoid; total time=  26.7s
[CV] END ................................C=1, kernel=sigmoid; total time=  27.3s
[CV] END ................................C=1, kernel=sigmoid; total time=  27.6s
[CV] END .................................C=2, kernel=linear; total time=  29.9s
[CV] END .................................C=2, kernel=linear; total time=  30.3s
[CV] END .................................C=2, kernel=linear; total time=  29.6s
[CV] END .................................C=2, kernel=linear; total time=  31.4s
[CV] END .................................C=2, kernel=linear; total time=  30.5s
[CV] END ....................................C=2, kernel=rbf; total time=  20.3s
[CV] END ....................................C=2, kernel=rbf; total time=  20.1s
[CV] END ....................................C=2, kernel=rbf; total time=  20.9s
[CV] END ....................................C=2, kernel=rbf; total time=  19.7s
[CV] END ....................................C=2, kernel=rbf; total time=  19.7s
[CV] END ................................C=2, kernel=sigmoid; total time=  23.5s
[CV] END ................................C=2, kernel=sigmoid; total time=  24.3s
[CV] END ................................C=2, kernel=sigmoid; total time=  24.1s
[CV] END ................................C=2, kernel=sigmoid; total time=  26.0s
[CV] END ................................C=2, kernel=sigmoid; total time=  25.7s
[CV] END .................................C=4, kernel=linear; total time=  45.7s
[CV] END .................................C=4, kernel=linear; total time=  46.2s
[CV] END .................................C=4, kernel=linear; total time=  44.5s
[CV] END .................................C=4, kernel=linear; total time=  48.6s
[CV] END .................................C=4, kernel=linear; total time=  44.6s
[CV] END ....................................C=4, kernel=rbf; total time=  20.7s
[CV] END ....................................C=4, kernel=rbf; total time=  20.9s
[CV] END ....................................C=4, kernel=rbf; total time=  20.4s
[CV] END ....................................C=4, kernel=rbf; total time=  19.5s
[CV] END ....................................C=4, kernel=rbf; total time=  20.4s
[CV] END ................................C=4, kernel=sigmoid; total time=  24.7s
[CV] END ................................C=4, kernel=sigmoid; total time=  22.9s
[CV] END ................................C=4, kernel=sigmoid; total time=  23.6s
[CV] END ................................C=4, kernel=sigmoid; total time=  24.1s
[CV] END ................................C=4, kernel=sigmoid; total time=  24.9s
[CV] END .................................C=5, kernel=linear; total time=  51.3s
[CV] END .................................C=5, kernel=linear; total time=  52.4s
[CV] END .................................C=5, kernel=linear; total time=  50.5s
[CV] END .................................C=5, kernel=linear; total time=  52.2s
[CV] END .................................C=5, kernel=linear; total time=  50.2s
[CV] END ....................................C=5, kernel=rbf; total time=  19.4s
[CV] END ....................................C=5, kernel=rbf; total time=  19.9s
[CV] END ....................................C=5, kernel=rbf; total time=  19.2s
[CV] END ....................................C=5, kernel=rbf; total time=  19.0s
[CV] END ....................................C=5, kernel=rbf; total time=  19.1s
[CV] END ................................C=5, kernel=sigmoid; total time=  21.7s
[CV] END ................................C=5, kernel=sigmoid; total time=  21.7s
[CV] END ................................C=5, kernel=sigmoid; total time=  21.9s
[CV] END ................................C=5, kernel=sigmoid; total time=  21.8s
[CV] END ................................C=5, kernel=sigmoid; total time=  21.4s
{'C': 4, 'kernel': 'rbf'}
Done
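
The log above reports only the winning combination. One possible variation, not the code used in this post, is to wrap scaling, PCA, and the SVM in a Pipeline so that each cross-validation fold refits the preprocessing on its own training part, and then to read the full ranking from `cv_results_`. The sketch assumes scikit-learn's small bundled digits dataset and a reduced grid so that it runs quickly. The step names `scale`, `pca`, and `svc` are arbitrary labels chosen for the example.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Assumption: the small bundled digits dataset keeps this sketch fast.
X_demo, y_demo = load_digits(return_X_y=True)

pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("pca", PCA(n_components=0.84)),
    ("svc", SVC()),
])

# Parameter names are prefixed with the step name and a double underscore.
grid = GridSearchCV(
    pipe,
    param_grid={"svc__C": [1, 4], "svc__kernel": ["linear", "rbf"]},
    cv=3,
)
grid.fit(X_demo, y_demo)

# Sort all candidates by mean cross-validated accuracy, best first.
results = sorted(
    zip(grid.cv_results_["mean_test_score"], grid.cv_results_["params"]),
    key=lambda pair: pair[0],
    reverse=True,
)
for score, params in results:
    print(f"{score:.4f}  {params}")
print(grid.best_params_)
```

Seeing every candidate's score shows how close the runners-up are, which helps when deciding whether a slower kernel is worth keeping in the grid.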