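1. TensorFlow and MNIST

TensorFlow is a computing framework that Google open-sourced on November 9, 2015. It supports a wide range of machine learning algorithms, and its flexible architecture runs on many platforms, including CPU or GPU desktops, servers, and mobile devices.

MNIST is a classic introductory example in machine learning. The dataset consists of 60,000 training images and 10,000 test images. The goal is for the machine to recognize handwritten digits in 28x28-pixel grayscale images and decide which digit from 0 to 9 each image represents.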

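2. Creating a GPU Cloud Host

This article uses a DiDi Cloud P4 GPU server. The main configuration steps are as follows.

Select a GPU server and a CentOS 7.3 image that has the graphics driver preinstalled.

For this test, choose 8 CPU cores, 16 GB of memory, and an 80 GB SSD cloud disk as the system disk.

For detailed steps, see the tutorial on the DiDi Cloud website: https://help.didiyun.com/hc/kb/article/1146353/

After connecting to the cloud host over SSH, run sudo su to switch to the root user, then run nvidia-smi and check its output to confirm that the graphics driver is installed: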


[dc2-user@10-254-93-25 ~]$ sudo su
[root@10-254-93-25 dc2-user]# nvidia-smi
Wed Jan 16 17:15:47 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            On   | 00000000:00:06.0 Off |                    0 |
| N/A   35C    P8     6W /  75W |      0MiB /  7606MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
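
The same driver check can also be scripted. The sketch below is only an illustration: the helper names parse_gpu_line and query_gpus are hypothetical, and it assumes nvidia-smi is on the PATH and accepts the --query-gpu and --format=csv,noheader options. If the binary is missing or fails, it returns an empty list instead of raising.

```python
import subprocess


def parse_gpu_line(line):
    """Split one CSV line such as 'Tesla P4, 384.81, 7606 MiB' into a dict."""
    name, driver, memory = [field.strip() for field in line.split(',')]
    return {'name': name, 'driver': driver, 'memory': memory}


def query_gpus():
    """Return one dict per GPU, or an empty list if nvidia-smi cannot be run."""
    cmd = ['nvidia-smi',
           '--query-gpu=name,driver_version,memory.total',
           '--format=csv,noheader']
    try:
        output = subprocess.check_output(cmd).decode('utf-8')
    except (OSError, subprocess.CalledProcessError):
        return []  # binary not found, or the driver is not loaded
    return [parse_gpu_line(line) for line in output.splitlines() if line.strip()]


print(query_gpus())
```

An empty list here would suggest that the driver is not usable and is worth investigating before installing anything else.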

3. Installing TensorFlow

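The DiDi Cloud GPU host comes with Python 2.7 and pip preinstalled, so TensorFlow can be installed directly with pip. Note that for TensorFlow 1.12, the version installed below, the tensorflow package is the CPU-only build; the GPU-enabled build is published separately as tensorflow-gpu and also needs matching CUDA and cuDNN libraries.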


pip install -i http://pypi.mirrors.ustc.edu.cn/simple/ --trusted-host pypi.mirrors.ustc.edu.cn tensorflow

The following output indicates that TensorFlow was installed successfully:


Successfully installed absl-py-0.6.1 astor-0.7.1 backports.weakref-1.0.post1 enum34-1.1.6 funcsigs-1.0.2 futures-3.2.0 gast-0.2.0 grpcio-1.16.1 h5py-2.8.0 keras-applications-1.0.6 keras-preprocessing-1.0.5 markdown-3.0.1 mock-2.0.0 numpy-1.15.4 pbr-5.1.1 protobuf-3.6.1 six-1.11.0 tensorboard-1.12.0 tensorflow-1.12.0 termcolor-1.1.0 werkzeug-0.14.1 wheel-0.32.2

Test TensorFlow:


python
Python 2.7.5 (default, Nov  6 2016, 00:28:07) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> hello = tf.constant("Hello, TensorFlow!")
>>> sess = tf.Session()
2019-01-17 11:09:10.301186: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
>>> print sess.run(hello)
Hello, TensorFlow!

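4. Downloading the MNIST Dataset

The MNIST dataset is available at http://yann.lecun.com/exdb/mnist/ and consists of four files:

train-images-idx3-ubyte.gz: training images
train-labels-idx1-ubyte.gz: training labels
t10k-images-idx3-ubyte.gz: test images
t10k-labels-idx1-ubyte.gz: test labels

Save these files in the /tmp directory.

These files are not in a standard image format, so they cannot be viewed directly and need a separate program to read them. Taking train-images-idx3-ubyte.gz as an example, we use a Python program to convert its contents into ordinary BMP images.

First, decompress a copy of train-images-idx3-ubyte.gz in a different folder, because TensorFlow later trains on the original files, which do not need to be decompressed: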


[root@10-254-141-173 dc2-user]# cp /tmp/train-images-idx3-ubyte.gz /home/dc2-user/
[root@10-254-141-173 dc2-user]# cd /home/dc2-user/
[root@10-254-141-173 dc2-user]# gunzip train-images-idx3-ubyte.gz
[root@10-254-141-173 dc2-user]# ls
train-images-idx3-ubyte

Create a new folder named training under /home/dc2-user to hold the converted images. The following Python code performs the conversion:


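import os
import struct
import numpy as np
import PIL.Image
filename = '/home/dc2-user/train-images-idx3-ubyte'
out_dir = '/home/dc2-user/training'
# Create the output folder if it does not exist yet.
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)
# Read the whole IDX file into memory.
with open(filename, 'rb') as binfile:
    buf = binfile.read()
# Header: four big-endian unsigned 32-bit integers.
index = 0
magic, numImages, numRows, numColumns = struct.unpack_from('>IIII', buf, index)
index += struct.calcsize('>IIII')
# Each image is numRows * numColumns unsigned bytes (28 * 28 = 784 for MNIST).
fmt = '>%dB' % (numRows * numColumns)
for image in range(numImages):
    im = struct.unpack_from(fmt, buf, index)
    index += struct.calcsize(fmt)
    im = np.array(im, dtype='uint8').reshape(numRows, numColumns)
    im = PIL.Image.fromarray(im)
    im.save(os.path.join(out_dir, 'train_%s.bmp' % image), 'bmp')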

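After the conversion finishes, we can open a couple of the images to check the result.

Each image in the dataset is 28x28 pixels, that is, 784 pixels in total.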
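
To make the file layout more concrete, here is a minimal sketch of how the IDX headers can be read with the standard struct module. It runs on small synthetic buffers rather than the real files, and the helper names are hypothetical. It assumes the usual MNIST layout: image files start with the magic number 2051 followed by the image count, row count, and column count, while label files start with 2049 followed by the label count and then one unsigned byte per label.

```python
import struct

IMAGE_MAGIC = 2051  # 0x00000803: unsigned bytes, 3 dimensions
LABEL_MAGIC = 2049  # 0x00000801: unsigned bytes, 1 dimension


def parse_idx_image_header(buf):
    """Return (count, rows, cols) from the 16-byte header of an IDX3 buffer."""
    magic, count, rows, cols = struct.unpack_from('>IIII', buf, 0)
    if magic != IMAGE_MAGIC:
        raise ValueError('not an IDX image file: magic=%d' % magic)
    return count, rows, cols


def parse_idx_labels(buf):
    """Return the labels stored in an IDX1 buffer as a list of ints."""
    magic, count = struct.unpack_from('>II', buf, 0)
    if magic != LABEL_MAGIC:
        raise ValueError('not an IDX label file: magic=%d' % magic)
    return list(struct.unpack_from('>%dB' % count, buf, 8))


# Synthetic buffers standing in for the real files (assumption for the demo).
fake_images = struct.pack('>IIII', IMAGE_MAGIC, 2, 28, 28) + b'\x00' * (2 * 28 * 28)
fake_labels = struct.pack('>II', LABEL_MAGIC, 3) + struct.pack('>3B', 5, 0, 4)

header = parse_idx_image_header(fake_images)
labels = parse_idx_labels(fake_labels)
print(header)
print(labels)
```

With the real files, the buffers would come from reading the decompressed files in binary mode instead of from struct.pack.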

5. Training and Recognition

Next, run the training and recognition code in Python.

Load the dataset with one-hot encoding:


from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp",one_hot = True)
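
As a small illustration of what one-hot encoding means here, the sketch below turns a digit label into a 10-element vector with a single 1.0. The function name one_hot is only for this example and is not part of TensorFlow.

```python
def one_hot(label, num_classes=10):
    """Return a list of zeros with 1.0 at the position of the label."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec


encoded = one_hot(3)
print(encoded)
```

The digit 3 becomes a vector whose fourth entry is 1.0 and whose other entries are 0.0, which is the form the labels take when one_hot = True.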

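Import the TensorFlow library and create an InteractiveSession. This interactive session registers itself as the default session, so subsequent operations run in it.

Next, create placeholders, which can be thought of as the entry points for data. The first argument is the data type, float32. The second argument is the shape: None means the number of input images is not fixed, and 784 is the number of pixels in each MNIST image. x holds the images and y_ holds their labels. The data here are TensorFlow Tensors, which can be understood simply as multidimensional arrays.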


import tensorflow as tf
sess = tf.InteractiveSession()
x = tf.placeholder(tf.float32, [None, 784])
y_= tf.placeholder(tf.float32,[None,10])
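
The [None, 784] shape can be pictured as a list of flattened images. The following sketch uses plain Python lists and a hypothetical helper, flatten_image. It assumes pixel values are scaled from 0..255 to 0..1 before being fed in, which is what the MNIST loader is understood to do.

```python
def flatten_image(rows):
    """Flatten a 28x28 grid of 0..255 pixels into 784 floats in 0..1."""
    return [pixel / 255.0 for row in rows for pixel in row]


# A synthetic all-black image with one white pixel in the middle.
image = [[0] * 28 for _ in range(28)]
image[14][14] = 255

# A batch of two images; the batch size is the dimension left as None.
batch = [flatten_image(image), flatten_image(image)]
print(len(batch), len(batch[0]))
```

Any number of such rows can be fed to x, as long as each row has exactly 784 values.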

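Next, create Variable objects for the weights and biases of the Softmax regression model.

This article focuses on demonstrating MNIST training on a DiDi Cloud GPU server and does not discuss the Softmax regression model in depth. In short, W is a 784x10 matrix, because there are 784 input features and 10 output classes (0 to 9), and b is a vector of size 10, one entry per class: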


W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))

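Next, implement the Softmax regression algorithm, y = softmax(Wx + b). Because x holds one image per row, the code computes the product as tf.matmul(x, W):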


y = tf.nn.softmax(tf.matmul(x, W) + b)
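
For readers unfamiliar with softmax, here is a minimal pure-Python sketch of the function applied to one row of scores. It is not TensorFlow's implementation; subtracting the maximum before exponentiating is a common stability trick added here as an assumption.

```python
import math


def softmax(logits):
    """Map a list of scores to probabilities that sum to 1."""
    shift = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([1.0, 2.0, 3.0])
print(probs)

# With W and b initialized to zeros, every score is 0, so each of the
# 10 classes starts with the same probability.
initial = softmax([0.0] * 10)
print(initial)
```

Larger scores receive larger probabilities, and before any training the model assigns 0.1 to every digit.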

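To train the model, we need to define a loss function, which measures how far the model's prediction is from the correct answer on a single example. This example uses cross-entropy. Cross-entropy describes the distance between the actual output (probabilities) and the expected output (probabilities): the smaller the cross-entropy, the closer the two probability distributions are: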


cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices = [1]))
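
A small worked example may help show why a lower cross-entropy means a better prediction. The sketch below is plain Python, not the TensorFlow op. It skips terms where the true probability is 0, which is a simplification made for this example.

```python
import math


def cross_entropy(y_true, y_pred):
    """Cross-entropy of one example: -sum(t * log(p)) over classes with t > 0."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_pred) if t > 0)


def mean_cross_entropy(batch_true, batch_pred):
    """Average the per-example cross-entropy over a batch."""
    losses = [cross_entropy(t, p) for t, p in zip(batch_true, batch_pred)]
    return sum(losses) / len(losses)


y_true = [0.0, 1.0, 0.0]                        # the correct class is index 1
good = cross_entropy(y_true, [0.1, 0.8, 0.1])   # confident and correct
bad = cross_entropy(y_true, [0.6, 0.2, 0.2])    # mostly wrong
batch_loss = mean_cross_entropy([y_true, y_true],
                                [[0.1, 0.8, 0.1], [0.6, 0.2, 0.2]])
print(good, bad, batch_loss)
```

With one-hot labels the loss reduces to the negative log of the probability assigned to the correct class, so the confident correct prediction scores lower than the mostly wrong one.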

Use TensorFlow's built-in gradient descent optimizer with a learning rate (step size) of 0.5 to minimize the cross-entropy:


train_step =  tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)
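
To give a feel for what one optimizer step does, here is a toy sketch in plain Python with 2 input features and 3 classes instead of 784 and 10. It relies on the standard result that, for softmax with cross-entropy, the gradient with respect to each score is the predicted probability minus the true label. The names predict and sgd_step are hypothetical, and TensorFlow derives these gradients automatically rather than by hand.

```python
import math


def softmax(logits):
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(x, W, b):
    """Compute softmax(xW + b) for a single example x."""
    logits = [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
              for j in range(len(b))]
    return softmax(logits)


def sgd_step(x, y_true, W, b, lr):
    """Update W and b in place by one gradient descent step on one example."""
    y = predict(x, W, b)
    for j in range(len(b)):
        grad_logit = y[j] - y_true[j]  # d(loss)/d(logit_j)
        b[j] -= lr * grad_logit
        for i in range(len(x)):
            W[i][j] -= lr * x[i] * grad_logit


x = [1.0, 2.0]
y_true = [0.0, 1.0, 0.0]
W = [[0.0] * 3 for _ in x]
b = [0.0] * 3

loss_before = -math.log(predict(x, W, b)[1])
sgd_step(x, y_true, W, b, lr=0.5)
loss_after = -math.log(predict(x, W, b)[1])
print(loss_before, loss_after)
```

After a single step with the same 0.5 learning rate, the loss on this example drops, which is the behavior the training loop repeats over many batches.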

Initialize all global variables by calling the initializer's run method:


tf.global_variables_initializer().run()

Run the training operation train_step iteratively. Each iteration randomly draws 100 samples to form a batch, for a total of 10,000 iterations:


for i in range(10000):
    batch_xs, batch_ys = mnist.train.next_batch(100)
    train_step.run({x:batch_xs, y_:batch_ys})
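
The batching idea can be sketched without TensorFlow. The generator below, iterate_minibatches, is a hypothetical helper that shuffles example indices once per pass over the data and hands out consecutive slices. This is one common way to draw random mini-batches and is only an approximation of what next_batch does internally.

```python
import random


def iterate_minibatches(num_examples, batch_size, num_steps, seed=0):
    """Yield num_steps lists of example indices, reshuffling every epoch."""
    rng = random.Random(seed)
    order = list(range(num_examples))
    rng.shuffle(order)
    start = 0
    for _ in range(num_steps):
        if start + batch_size > num_examples:
            rng.shuffle(order)  # all examples used: start a new epoch
            start = 0
        yield order[start:start + batch_size]
        start += batch_size


batches = list(iterate_minibatches(num_examples=1000, batch_size=100, num_steps=25))
print(len(batches), len(batches[0]))
```

Sampling this way means every example is seen once per epoch, while the order differs from epoch to epoch.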

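After training, verify the accuracy of the model.

In the code below, tf.argmax(y,1) is the digit the trained model considers most likely for each image, tf.argmax(y_,1) is the true digit, and tf.equal checks whether the two match: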


correct_prediction = tf.equal(tf.argmax(y,1), tf.argmax(y_,1))
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
print(accuracy.eval({x:mnist.test.images, y_: mnist.test.labels}))
0.9253

The accuracy is 92.53%.
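
The accuracy computation can be mirrored in a few lines of plain Python. This is an illustrative sketch with made-up predictions for four examples and three classes, not the output of the trained model; the helper names are hypothetical.

```python
def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(predictions, labels):
    """Fraction of examples whose predicted class matches the one-hot label."""
    correct = [argmax(p) == argmax(t) for p, t in zip(predictions, labels)]
    return sum(correct) / float(len(correct))


# Made-up probabilities and one-hot labels for four examples.
predictions = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.5, 0.4, 0.1]]
labels = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]

result = accuracy(predictions, labels)
print(result)
```

Three of the four made-up predictions match their labels, so the sketch reports 0.75, computed the same way as the 0.9253 reported for the test set.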

Simple MNIST Handwritten Digit Recognition on a DiDi Cloud GPU
