Kaggle in Practice: The Simplest Digit Recognizer

License: CC 4.0 BY-SA

Tags: #data #Kaggle

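This post walks through the Digit Recognizer problem on Kaggle, covering the dataset description, data preprocessing, feature extraction, and model selection. For feature extraction, the author looks at two linear dimensionality-reduction methods, PCA and LDA: PCA lowers the dimension by keeping the principal components that retain most of the information, while LDA uses the class labels to maximize the separation between classes. In the end, the author settles on PCA for dimensionality reduction combined with an SVM model.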


Digit Recognizer from Kaggle

link: https://www.kaggle.com/c/digit-recognizer

Digit Recognizer is one of the most basic competitions on Kaggle.

Dataset description:

The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine.

Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

The training data set, (train.csv), has 785 columns. The first column, called “label”, is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image.

Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix, (indexing by zero).
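The index rule x = i * 28 + j can be checked with a few lines of plain Python. This is only a minimal sketch: the helper names `pixel_position` and `to_grid` are made up for illustration and are not part of the competition data or of the code later in this post.

```python
IMAGE_SIDE = 28  # images are 28 x 28 pixels, per the dataset description


def pixel_position(x):
    """Return (row, column) for the column named pixel<x>, using x = i * 28 + j."""
    if not 0 <= x < IMAGE_SIDE * IMAGE_SIDE:
        raise ValueError("pixel index must be between 0 and 783")
    return divmod(x, IMAGE_SIDE)  # quotient is the row i, remainder is the column j


def to_grid(flat_pixels):
    """Turn a flat list of 784 pixel values into 28 rows of 28 values each."""
    if len(flat_pixels) != IMAGE_SIDE * IMAGE_SIDE:
        raise ValueError("expected 784 pixel values")
    return [flat_pixels[r * IMAGE_SIDE:(r + 1) * IMAGE_SIDE]
            for r in range(IMAGE_SIDE)]


row, col = pixel_position(100)    # 100 = 3 * 28 + 16, so row 3, column 16
grid = to_grid(list(range(784)))  # dummy "image" whose values equal their own index
print(row, col, grid[row][col])
```

With NumPy arrays, the same row-major layout is what `reshape(28, 28)` produces for a single 784-value row.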

First, let's take a look at the dataset.

# coding: utf-8
%matplotlib inline
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

def opencsv():  # load both CSV files with pandas
    data = pd.read_csv('data/train.csv')
    data1 = pd.read_csv('data/test.csv')
    train_data = data.values[:, 1:]  # all training pixels (every column after the label)
    train_label = data.values[:, 0]  # the first column holds the labels
    test_data = data1.values[:, :]  # all test pixels
    print('Data Load Done!')
    return train_data, train_label, test_data
train_data, train_label, test_data = opencsv()
# train_data holds the 784 features of the training set, test_data holds the 784 features of the test set, and train_label holds the training labels
# This is clearly a typical supervised learning problem

Data Load Done!
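Before going further, it can help to confirm that the split into pixels and labels matches the dataset description (one label column plus 784 pixel columns, pixel values in 0 to 255). The sketch below is an assumption-laden illustration, not part of the original notebook: `split_and_check` is a hypothetical helper, and the table is random synthetic data standing in for train.csv.

```python
import numpy as np


def split_and_check(table):
    """Split an (n_samples, 785) array into pixels and labels, with sanity checks."""
    table = np.asarray(table)
    if table.ndim != 2 or table.shape[1] != 785:
        raise ValueError("expected 785 columns: 1 label + 784 pixels")
    labels = table[:, 0]   # first column: the digit that was drawn
    pixels = table[:, 1:]  # remaining 784 columns: pixel values
    if labels.min() < 0 or labels.max() > 9:
        raise ValueError("labels must be digits 0-9")
    if pixels.min() < 0 or pixels.max() > 255:
        raise ValueError("pixel values must lie in [0, 255]")
    return pixels, labels


# Synthetic stand-in for train.csv (random values, NOT real competition data).
rng = np.random.default_rng(0)
fake_labels = rng.integers(0, 10, size=(5, 1))
fake_pixels = rng.integers(0, 256, size=(5, 784))
fake_table = np.hstack([fake_labels, fake_pixels])

pixels, labels = split_and_check(fake_table)
print(pixels.shape, labels.shape)
```

On the real data, the same check could be run on `data.values` right after `pd.read_csv('data/train.csv')`.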

import matplotlib.pyplot as plt
from numpy import *
print(shape(train_data), shape(test_data))  # 42000 training samples and 28000 test samples
def showPic(data):
    plt.figure(figsize=(7,7))
    # show the first 70 images
    for digit_num in range(0,