Because of the needs of my job, I recently started studying deep learning in earnest, using Caffe as the framework. To get a quicker feel for the overall workflow of training a model with Caffe, I began with the MNIST example that ships with Caffe.
Preparing the Data
Because the Linux machine I use has no internet access, I could not write a script to download the dataset there; I had to download it elsewhere first and then copy it into my own directory on the server.
Download link: the MNIST dataset (http://yann.lecun.com/exdb/mnist/)
The downloaded files are the following four archives: train-images-idx3-ubyte.gz, train-labels-idx1-ubyte.gz, t10k-images-idx3-ubyte.gz and t10k-labels-idx1-ubyte.gz.
Files whose names start with train belong to the training set and those starting with t10k to the test set; names containing images hold the image data, and names containing labels hold the corresponding labels.
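For reference, on a machine that does have network access the same four archives can be fetched with wget; this is roughly what Caffe's own data/mnist/get_mnist.sh helper does:
#!/usr/bin/env sh
# Download the four MNIST archives (a sketch of what data/mnist/get_mnist.sh does).
for fname in train-images-idx3-ubyte train-labels-idx1-ubyte \
             t10k-images-idx3-ubyte t10k-labels-idx1-ubyte
do
    wget http://yann.lecun.com/exdb/mnist/${fname}.gz
done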
Once the files are downloaded they need to be decompressed. The Linux command is as follows:
$ gzip -d train-images-idx3-ubyte.gz
Decompress each of the four files in turn this way (or all at once, as shown below). After decompression you are left with four uncompressed files: train-images-idx3-ubyte, train-labels-idx1-ubyte, t10k-images-idx3-ubyte and t10k-labels-idx1-ubyte.
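All four archives can also be decompressed in a single command (assuming they sit in the current directory):
$ gzip -d train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz \
    t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz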
Next, the data has to be converted into LMDB, a data format Caffe can consume. The shell script for this is as follows:
#!/usr/bin/env sh
# this script converts the mnist data into lmdb or leveldb format,
# depending on the value assigned to $BACKEND.
EXAMPLE=examples/mnist
DATA=examples/mnist
BUILD=build/examples/mnist
BACKEND="lmdb"
echo "creating ${BACKEND}..."
rm -rf $EXAMPLE/mnist_train_${BACKEND}
rm -rf $EXAMPLE/mnist_test_${BACKEND}
$BUILD/convert_mnist_data.bin $DATA/train-images-idx3-ubyte \
  $DATA/train-labels-idx1-ubyte $EXAMPLE/mnist_train_${BACKEND} --backend=${BACKEND}
$BUILD/convert_mnist_data.bin $DATA/t10k-images-idx3-ubyte \
  $DATA/t10k-labels-idx1-ubyte $EXAMPLE/mnist_test_${BACKEND} --backend=${BACKEND}
echo "done"
EXAMPLE is the directory where the LMDB output will be written, DATA is where the raw files are stored, and BUILD is the directory holding convert_mnist_data.bin, the conversion tool that ships with Caffe; change these to the correct paths for your own setup.
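Assuming the script is saved as examples/mnist/create_mnist.sh (the name Caffe itself uses; any name works), it is run from the Caffe root directory, and each resulting LMDB directory should contain a data.mdb and a lock.mdb file:
$ cd <caffe-root>
$ sh examples/mnist/create_mnist.sh
$ ls examples/mnist/mnist_train_lmdb examples/mnist/mnist_test_lmdb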
With that, the data preparation is done.
Defining the Network Structure
Caffe already ships with a network definition for this task, the lenet_train_test.prototxt file. Let's look at its contents:
name: "LeNet"
layer{
name: "mnist"
type: "Data"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
scale: 0.00390625
}
data_param {
source: "examples/mnist/mnist_train_lmdb"
batch_size: 64
backend: LMDB
}
}
layer {
name: "mnist"
type: "Data"
top: "data"
top: "label"
include {
phase: TEST
}
transform_param {
scale: 0.00390625
}
data_param {
source: "examples/mnist/mnist_test_lmdb"
batch_size: 100
backend: LMDB
}
}
layer {
name: "conv1"
type: "Convolution"
bottom: "data"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 20
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler {
type: "xavier"
}
bias_filler {
type: "xavier"
}
}
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride:2
}
}
layer {
name: "ip1"
type: "InnerProduct"
bottom: "pool2"
top: "ip1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 500
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "ip1"
top: "ip1"
}
layer {
name: "ip2"
type: "InnerProduct"
bottom: "ip1"
top: "ip2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
inner_product_param {
num_output: 10
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip2"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip2"
bottom: "label"
top: "loss"
}
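In short, the data flows through conv1 → pool1 → conv2 → pool2 → ip1 → ReLU → ip2, with SoftmaxWithLoss producing the training loss and the Accuracy layer reporting test accuracy; the scale: 0.00390625 in the data layers is just 1/256, which normalizes the 0-255 pixel values to roughly [0, 1). If pydot and graphviz happen to be available, Caffe's bundled draw_net.py can render this structure as a picture, for example:
$ python python/draw_net.py examples/mnist/lenet_train_test.prototxt lenet.png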
Configuring the Solver Parameters
Caffe also ships with a solver file for this task, lenet_solver.prototxt. Here is what it contains:
# the train/test net protocol buffer definition
net: "examples/mnist/lenet_train_test.prototxt"
# test_iter specifies how many forward passes the test should carry out.
# in the case of MNIST, we have test batch size 100 and 100 test iterations,
# covering the full 10000 testing images.
test_iter: 100
# carry out testing every 500 training iterations.
test_interval: 500
# the base learning rate, momentum and the weight decay of the network.
base_lr: 0.01
momentum: 0.9
weight_decay: 0.0005
# the learning rate policy
lr_policy: "inv"
gamma: 0.0001
power: 0.75
# display every 100 iterations
display: 100
# the maximum number of iterations
max_iter: 10000
# snapshot intermediate results
snapshot: 5000
snapshot_prefix: "examples/mnist/lenet"
# solver mode: CPU or GPU
solver_mode: GPU
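The "inv" learning-rate policy decays the rate as base_lr * (1 + gamma * iter)^(-power). Purely as an illustration (not part of the Caffe workflow), a one-off awk command can print how it decays over the 10000 iterations; with these settings the rate drops from 0.01 to roughly 0.006 by the end:
$ awk 'BEGIN { base_lr = 0.01; gamma = 0.0001; power = 0.75;
               for (iter = 0; iter <= 10000; iter += 2500)
                   printf "iter %5d  lr %.6f\n", iter, base_lr * (1 + gamma * iter) ^ (-power) }'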
Training
Because I ran the code on a GPU server cluster, I launched the training script from the Caffe root directory with the following command:
$ srun -p K15G12 -J MNIST -c 4 --gres=gpu:1 sh examples/mnist/train_lenet.sh
The content of the train_lenet.sh script is as follows:
#!/usr/bin/env sh
./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt
Finally, the following files appear in the ./examples/mnist folder: lenet_iter_5000.caffemodel, lenet_iter_5000.solverstate, lenet_iter_10000.caffemodel and lenet_iter_10000.solverstate (a snapshot at iteration 5000 and the final one at iteration 10000).
The .caffemodel files are the trained models for handwritten digit recognition.
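Since the snapshots also include the full solver state (.solverstate), two follow-up commands are handy; both are standard invocations of the caffe binary, with paths assuming the default example locations used above:
# score the final model on the 10000 test images (100 test iterations x batch size 100)
$ ./build/tools/caffe test --model=examples/mnist/lenet_train_test.prototxt \
    --weights=examples/mnist/lenet_iter_10000.caffemodel --iterations=100 --gpu=0
# resume an interrupted run from the iteration-5000 snapshot
$ ./build/tools/caffe train --solver=examples/mnist/lenet_solver.prototxt \
    --snapshot=examples/mnist/lenet_iter_5000.solverstate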
Note
Because Caffe was already set up on the server, I simply copied someone else's Caffe directory, and it kept failing when I tried to run it. It turned out that ./build/tools/caffe was a symlink into that person's Caffe build, which I had no permission to use, so I needed to recompile and relink it myself.
So, in the Caffe root directory, I first ran make clean to delete the existing build directory, then ran make to recompile and relink; after that everything ran normally.
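For reference, the rebuild boils down to these commands in the Caffe root directory (the -j flag just parallelizes compilation and is optional; make test and make runtest can be run afterwards to verify the build):
$ make clean
$ make all -j8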
Afterword
Having only just started with Linux and Caffe, there is a lot I am still unclear about and a lot of knowledge left to learn. Hitting problem after problem along the way really made me want to explode, but it is precisely by working through those problems that I grow, step by step. A reminder to myself: do not run from problems and challenges; the road of growth is long and the load is heavy.