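0. Theory
Algorithm: support vector machine (SVM)
Dimensionality reduction: principal component analysis (PCA)
Data source: the Kaggle Digit Recognizer competition, https://www.kaggle.com/competitions/digit-recognizer/data
1. Python implementation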
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/digit-recognizer/sample_submission.csv
/kaggle/input/digit-recognizer/train.csv
/kaggle/input/digit-recognizer/test.csv
from time import time
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
train = pd.read_csv('/kaggle/input/digit-recognizer/train.csv')
test = pd.read_csv('/kaggle/input/digit-recognizer/test.csv')
print(train.shape)
print(test.shape)
(42000, 785)
(28000, 784)
train.head()
|  | label | pixel0 | pixel1 | pixel2 | pixel3 | pixel4 | pixel5 | pixel6 | pixel7 | pixel8 | ... | pixel774 | pixel775 | pixel776 | pixel777 | pixel778 | pixel779 | pixel780 | pixel781 | pixel782 | pixel783 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 785 columns
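# Separate the feature columns from the label column in the training set
X = train.iloc[:, 1:]
y = train.iloc[:, 0]
# Plot the first ten digits of the training set
plt.figure(figsize=(10, 5))
for num in range(0, 10):
    plt.subplot(2, 5, num + 1)
    # Reshape the length-784 vector into a 28x28 matrix
    grid_data = X.iloc[num].values.reshape((28, 28))
    # Show the image in greyscale
    plt.imshow(grid_data, cmap='Greys')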
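The reshape in the plotting loop relies on the pixel columns being stored row by row: the competition's data description places pixel x = i * 28 + j at row i, column j. A minimal sketch of that mapping follows; the array `row` and the helper `pixel_position` are hypothetical and used only for illustration.

```python
import numpy as np

# Hypothetical stand-in for one row of pixel columns: value k sits in "pixelk"
row = np.arange(784)

# Row-major reshape, the same call used in the plotting loop
image = row.reshape((28, 28))


def pixel_position(x):
    """Return (row, column) of pixel index x in the 28x28 image."""
    return divmod(x, 28)


print(image[1, 0])          # 28
print(pixel_position(783))  # (27, 27)
```

Because NumPy reshapes in row-major order by default, no transpose is needed before calling `plt.imshow`.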

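# Feature preprocessing: rescale every feature to the range [0, 1]
# Fit the scaler on the training features only and reuse it for the test set,
# so that both sets are scaled with the same per-pixel minimum and maximum
scaler = MinMaxScaler().fit(X)
X = scaler.transform(X)
test = scaler.transform(test)
# Split the labelled data into a training set and a hold-out set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=14)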
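MinMaxScaler learns a per-column minimum and maximum when it is fitted, so the array it is fitted on determines the mapping. The sketch below compares reusing one fitted scaler with fitting a second one. The arrays `train_pixels` and `test_pixels` are small made-up values, not the competition data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-ins for the pixel matrices (not the competition data)
train_pixels = np.array([[0.0, 100.0], [255.0, 200.0]])
test_pixels = np.array([[51.0, 150.0], [102.0, 250.0]])

# Option 1: fit on the training array, then reuse the same mapping
shared = MinMaxScaler().fit(train_pixels)
test_shared = shared.transform(test_pixels)

# Option 2: fit a second scaler on the test array itself
test_separate = MinMaxScaler().fit_transform(test_pixels)

print(test_shared)    # [[0.2 0.5] [0.4 1.5]]
print(test_separate)  # [[0. 0.] [1. 1.]]
```

With a shared scaler the same raw value always maps to the same scaled value, at the cost that test values outside the training range fall outside [0, 1] (the 1.5 above). With separate scalers the two arrays are no longer on a common scale, which matters when a model trained on one is applied to the other.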
def get_best_accuracy():
    """Find the n_components value that gives the highest accuracy and plot the results."""
    n_components = []
    accuracies = []
    for f in np.linspace(0.7, 0.9, num=20, endpoint=False):
        t0 = time()
        pca = PCA(n_components=f).fit(X_train)
        X_train_pca = pca.transform(X_train)
        X_test_pca = pca.transform(X_test)
        # Use a support vector machine classifier
        clf = svm.SVC()
        clf.fit(X_train_pca, y_train)
        # Compute the accuracy on the hold-out set
        accuracy = clf.score(X_test_pca, y_test)
        # Record the results
        n_components.append(f)
        accuracies.append(accuracy)
        t1 = time()
        print('n_components:{:.2f} , accuracy:{:.4f} , time elapsed:{:.2f}s'.format(f, accuracy, t1 - t0))
    ans = n_components[accuracies.index(max(accuracies))]
    print('n_components with the highest accuracy: {}'.format(ans))
    # Plot accuracy against n_components
    plt.plot(n_components, accuracies, '-o')
    plt.xlabel('n_components')
    plt.ylabel('accuracy')
    plt.show()
get_best_accuracy()
n_components:0.70 , accuracy:0.9729 , time elapsed:22.28s
n_components:0.71 , accuracy:0.9740 , time elapsed:22.11s
n_components:0.72 , accuracy:0.9748 , time elapsed:23.85s
n_components:0.73 , accuracy:0.9745 , time elapsed:23.42s
n_components:0.74 , accuracy:0.9762 , time elapsed:22.19s
n_components:0.75 , accuracy:0.9762 , time elapsed:21.25s
n_components:0.76 , accuracy:0.9767 , time elapsed:23.36s
n_components:0.77 , accuracy:0.9764 , time elapsed:23.79s
n_components:0.78 , accuracy:0.9781 , time elapsed:25.35s
n_components:0.79 , accuracy:0.9776 , time elapsed:25.49s
n_components:0.80 , accuracy:0.9779 , time elapsed:25.02s
n_components:0.81 , accuracy:0.9776 , time elapsed:26.29s
n_components:0.82 , accuracy:0.9779 , time elapsed:24.83s
n_components:0.83 , accuracy:0.9779 , time elapsed:25.49s
n_components:0.84 , accuracy:0.9788 , time elapsed:27.15s
n_components:0.85 , accuracy:0.9786 , time elapsed:29.09s
n_components:0.86 , accuracy:0.9788 , time elapsed:33.13s
n_components:0.87 , accuracy:0.9783 , time elapsed:30.18s
n_components:0.88 , accuracy:0.9776 , time elapsed:33.76s
n_components:0.89 , accuracy:0.9781 , time elapsed:38.95s
n_components with the highest accuracy: 0.84
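The values swept above are fractions, not component counts. According to the scikit-learn documentation, a float n_components between 0 and 1 makes PCA keep as many components as are needed for the explained variance to exceed that fraction. The NumPy sketch below illustrates the idea on synthetic data. It is an assumption-laden illustration, not scikit-learn's internal code, and the helper `components_for_variance` and the array `X_demo` are hypothetical.

```python
import numpy as np


def components_for_variance(X, target):
    """Smallest number of principal components whose cumulative
    explained-variance ratio exceeds `target` (0 < target < 1)."""
    centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variance captured by each principal component
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    cumulative = np.cumsum(variance / variance.sum())
    # First position where the cumulative ratio is strictly above target
    k = int(np.searchsorted(cumulative, target, side="right")) + 1
    return k, cumulative


rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 10 features with decreasing spread
X_demo = rng.normal(size=(200, 10)) * np.arange(10, 0, -1)

k, cumulative = components_for_variance(X_demo, 0.84)
print(k, cumulative[k - 1])
```

Raising the fraction keeps more components, which is consistent with the fit times in the log above growing as n_components approaches 0.89.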

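# Use the best n_components found above
pca = PCA(n_components=0.84).fit(X)
# Print the number of principal components kept
print(pca.n_components_)
# Project the training set and the test set onto the principal components
X = pca.transform(X)
test = pca.transform(test)
# Classify with a support vector machine, tuning hyperparameters by grid search
clf_svc = GridSearchCV(
    estimator=svm.SVC(),
    param_grid={'C': [1, 2, 4, 5], 'kernel': ['linear', 'rbf', 'sigmoid']},
    cv=5,
    verbose=2,
)
# Train the model
clf_svc.fit(X, y)
# Show the parameters with the highest cross-validated accuracy
print(clf_svc.best_params_)
# Predict the labels of the test set
preds = clf_svc.predict(test)
image_id = pd.Series(range(1, len(preds) + 1))
# The competition's sample_submission.csv names its columns ImageId and Label
result = pd.DataFrame({'ImageId': image_id, 'Label': preds})
# Save as a CSV file
result.to_csv('result_svc.csv', index=False)
print('Done')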
55
Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END .................................C=1, kernel=linear; total time= 21.9s
[CV] END .................................C=1, kernel=linear; total time= 21.1s
[CV] END .................................C=1, kernel=linear; total time= 20.7s
[CV] END .................................C=1, kernel=linear; total time= 21.9s
[CV] END .................................C=1, kernel=linear; total time= 21.6s
[CV] END ....................................C=1, kernel=rbf; total time= 20.8s
[CV] END ....................................C=1, kernel=rbf; total time= 20.6s
[CV] END ....................................C=1, kernel=rbf; total time= 20.9s
[CV] END ....................................C=1, kernel=rbf; total time= 20.3s
[CV] END ....................................C=1, kernel=rbf; total time= 20.5s
[CV] END ................................C=1, kernel=sigmoid; total time= 26.5s
[CV] END ................................C=1, kernel=sigmoid; total time= 26.8s
[CV] END ................................C=1, kernel=sigmoid; total time= 26.7s
[CV] END ................................C=1, kernel=sigmoid; total time= 27.3s
[CV] END ................................C=1, kernel=sigmoid; total time= 27.6s
[CV] END .................................C=2, kernel=linear; total time= 29.9s
[CV] END .................................C=2, kernel=linear; total time= 30.3s
[CV] END .................................C=2, kernel=linear; total time= 29.6s
[CV] END .................................C=2, kernel=linear; total time= 31.4s
[CV] END .................................C=2, kernel=linear; total time= 30.5s
[CV] END ....................................C=2, kernel=rbf; total time= 20.3s
[CV] END ....................................C=2, kernel=rbf; total time= 20.1s
[CV] END ....................................C=2, kernel=rbf; total time= 20.9s
[CV] END ....................................C=2, kernel=rbf; total time= 19.7s
[CV] END ....................................C=2, kernel=rbf; total time= 19.7s
[CV] END ................................C=2, kernel=sigmoid; total time= 23.5s
[CV] END ................................C=2, kernel=sigmoid; total time= 24.3s
[CV] END ................................C=2, kernel=sigmoid; total time= 24.1s
[CV] END ................................C=2, kernel=sigmoid; total time= 26.0s
[CV] END ................................C=2, kernel=sigmoid; total time= 25.7s
[CV] END .................................C=4, kernel=linear; total time= 45.7s
[CV] END .................................C=4, kernel=linear; total time= 46.2s
[CV] END .................................C=4, kernel=linear; total time= 44.5s
[CV] END .................................C=4, kernel=linear; total time= 48.6s
[CV] END .................................C=4, kernel=linear; total time= 44.6s
[CV] END ....................................C=4, kernel=rbf; total time= 20.7s
[CV] END ....................................C=4, kernel=rbf; total time= 20.9s
[CV] END ....................................C=4, kernel=rbf; total time= 20.4s
[CV] END ....................................C=4, kernel=rbf; total time= 19.5s
[CV] END ....................................C=4, kernel=rbf; total time= 20.4s
[CV] END ................................C=4, kernel=sigmoid; total time= 24.7s
[CV] END ................................C=4, kernel=sigmoid; total time= 22.9s
[CV] END ................................C=4, kernel=sigmoid; total time= 23.6s
[CV] END ................................C=4, kernel=sigmoid; total time= 24.1s
[CV] END ................................C=4, kernel=sigmoid; total time= 24.9s
[CV] END .................................C=5, kernel=linear; total time= 51.3s
[CV] END .................................C=5, kernel=linear; total time= 52.4s
[CV] END .................................C=5, kernel=linear; total time= 50.5s
[CV] END .................................C=5, kernel=linear; total time= 52.2s
[CV] END .................................C=5, kernel=linear; total time= 50.2s
[CV] END ....................................C=5, kernel=rbf; total time= 19.4s
[CV] END ....................................C=5, kernel=rbf; total time= 19.9s
[CV] END ....................................C=5, kernel=rbf; total time= 19.2s
[CV] END ....................................C=5, kernel=rbf; total time= 19.0s
[CV] END ....................................C=5, kernel=rbf; total time= 19.1s
[CV] END ................................C=5, kernel=sigmoid; total time= 21.7s
[CV] END ................................C=5, kernel=sigmoid; total time= 21.7s
[CV] END ................................C=5, kernel=sigmoid; total time= 21.9s
[CV] END ................................C=5, kernel=sigmoid; total time= 21.8s
[CV] END ................................C=5, kernel=sigmoid; total time= 21.4s
{'C': 4, 'kernel': 'rbf'}
Done
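In the run above, the scaler and PCA are fitted on all labelled rows before cross-validation starts, so every validation fold has already influenced the preprocessing. One possible variation, which the post did not run, is to wrap the steps in a scikit-learn Pipeline so that each fold refits them on its own training part. The sketch below uses the small 8x8 digits bundled with scikit-learn and a reduced grid to stay fast. The step names `scale`, `pca` and `svc` are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Small 8x8 digits bundled with scikit-learn, used only to keep the demo fast
X_demo, y_demo = load_digits(return_X_y=True)

# Scaling, PCA and the classifier are refitted together inside each CV fold
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("pca", PCA(n_components=0.84)),
    ("svc", SVC()),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
search = GridSearchCV(
    estimator=pipe,
    param_grid={"svc__C": [1, 4], "svc__kernel": ["linear", "rbf"]},
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 4))
```

The fitted `search` object can call `predict` on raw, unscaled pixels, because the pipeline applies the stored scaler and PCA before the classifier.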