From the official Caffe tutorial we know that, during the forward propagation of data and the backward propagation of gradients, Caffe stores, communicates, and manipulates data in the form of Blobs. A Blob is a standard array and serves as the framework's unified memory interface.
For clarity, let me first paste in the relevant part of the official Caffe tutorial on Blobs.
Blob storage and communication
A Blob is a wrapper over the actual data being processed and passed along by Caffe, and also under the hood provides synchronization capability between the CPU and the GPU. Mathematically, a blob is an N-dimensional array stored in a C-contiguous fashion.
Caffe stores and communicates data using blobs. Blobs provide a unified memory interface holding data; e.g., batches of images, model parameters, and derivatives for optimization.
Blobs conceal the computational and mental overhead of mixed CPU/GPU operation by synchronizing from the CPU host to the GPU device as needed. Memory on the host and device is allocated on demand (lazily) for efficient memory usage.
The conventional blob dimensions for batches of image data are number N x channel K x height H x width W. Blob memory is row-major in layout, so the last / rightmost dimension changes fastest. For example, in a 4D blob, the value at index (n, k, h, w) is physically located at index ((n * K + k) * H + h) * W + w.
Number / N is the batch size of the data. Batch processing achieves better throughput for communication and device processing. For an ImageNet training batch of 256 images N = 256.
Channel / K is the feature dimension e.g. for RGB images K = 3.
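To make the row-major indexing concrete, here is a small illustrative helper (my own sketch, not Caffe's code, though Caffe's Blob class exposes the same computation as Blob::offset()):
#include <cstddef>
// Illustrative only: physical offset of element (n, k, h, w)
// in a row-major N x K x H x W blob, per the formula above.
inline std::size_t blob_offset(int n, int k, int h, int w,
                               int K, int H, int W) {
  return ((static_cast<std::size_t>(n) * K + k) * H + h) * W + w;
}
// e.g. in a 2 x 3 x 4 x 5 blob, element (1, 2, 3, 4) sits at
// blob_offset(1, 2, 3, 4, /*K=*/3, /*H=*/4, /*W=*/5) == 119,
// the last of the 120 elements.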
Note that although many blobs in Caffe examples are 4D with axes for image applications, it is totally valid to use blobs for non-image applications. For example, if you simply need fully-connected networks like the conventional multi-layer perceptron, use 2D blobs (shape (N, D)) and call the InnerProductLayer (which we will cover soon).
Parameter blob dimensions vary according to the type and configuration of the layer. For a convolution layer with 96 filters of 11 x 11 spatial dimension and 3 inputs the blob is 96 x 3 x 11 x 11. For an inner product / fully-connected layer with 1000 output channels and 1024 input channels the parameter blob is 1000 x 1024.
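As a quick sanity check on those shapes: the convolution parameter blob above holds 96 × 3 × 11 × 11 = 34,848 weights, and the fully-connected one holds 1000 × 1024 = 1,024,000.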
For custom data it may be necessary to hack your own input preparation tool or data layer. However, once your data is in, your job is done. The modularity of layers accomplishes the rest of the work for you.
Implementation Details
As we are often interested in the values as well as the gradients of the blob, a Blob stores two chunks of memory, data and diff. The former is the normal data that we pass along, and the latter is the gradient computed by the network.
Further, as the actual values could be stored either on the CPU or on the GPU, there are two different ways to access them: the const way, which does not change the values, and the mutable way, which changes the values:
const Dtype* cpu_data() const;
Dtype* mutable_cpu_data();
(similarly for gpu and diff).
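Spelled out, the accessor family declared in blob.hpp looks like this (abridged; the set_cpu_data()-style setters are omitted):
const Dtype* cpu_data() const;   // read-only host pointer
const Dtype* gpu_data() const;   // read-only device pointer
const Dtype* cpu_diff() const;   // read-only host gradient pointer
const Dtype* gpu_diff() const;   // read-only device gradient pointer
Dtype* mutable_cpu_data();       // writable host pointer (marks the CPU copy current)
Dtype* mutable_gpu_data();       // writable device pointer (marks the GPU copy current)
Dtype* mutable_cpu_diff();
Dtype* mutable_gpu_diff();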
The reason for such design is that, a Blob uses a SyncedMem class to synchronize values between the CPU and GPU in order to hide the synchronization details and to minimize data transfer. A rule of thumb is, always use the const call if you do not want to change the values, and never store the pointers in your own object. Every time you work on a blob, call the functions to get the pointers, as the SyncedMem will need this to figure out when to copy data.
In practice when GPUs are present, one loads data from the disk to a blob in CPU code, calls a device kernel to do GPU computation, and ferries the blob off to the next layer, ignoring low-level details while maintaining a high level of performance. As long as all layers have GPU implementations, all the intermediate data and gradients will remain in the GPU.
If you want to check out when a Blob will copy data, here is an illustrative example:
// Assuming that data are on the CPU initially, and we have a blob.
const Dtype* foo;
Dtype* bar;
foo = blob.gpu_data(); // data copied cpu->gpu.
foo = blob.cpu_data(); // no data copied since both have up-to-date contents.
bar = blob.mutable_gpu_data(); // no data copied.
// ... some operations ...
bar = blob.mutable_gpu_data(); // no data copied when we are still on GPU.
foo = blob.cpu_data(); // data copied gpu->cpu, since the gpu side has modified the data
foo = blob.gpu_data(); // no data copied since both have up-to-date contents
bar = blob.mutable_cpu_data(); // still no data copied.
bar = blob.mutable_gpu_data(); // data copied cpu->gpu.
bar = blob.mutable_cpu_data(); // data copied gpu->cpu.
The tutorial above gives us an overall picture of the Blob; next, let's study Caffe's Blob in more depth by reading the source code. The workflow, here and in later notes, is to first read the source with inline annotations and then summarize.
1. Source code
(I am not very familiar with C++, so I hope to learn through this process; corrections are welcome.)
blob.hpp
#ifndef CAFFE_BLOB_HPP_
#define CAFFE_BLOB_HPP_
#include <algorithm>
#include <string>
#include <vector>
#include "caffe/common.hpp"
#include "caffe/proto/caffe.pb.h" // 由caffe.proto生成
#include "caffe/syncedmem.hpp"
The above are the header files included by blob.hpp. Let's first take a quick look at what each of these headers (apart from the first three standard-library ones) does.
#include "caffe/common.hpp"
This header file starts with a batch of includes, followed by a batch of macro definitions, which we will skip for now. After those come two namespaces:
namespace cv { class Mat; }
This is a forward declaration of OpenCV's cv::Mat: it lets later headers mention the type without having to include the full OpenCV headers.
namespace caffe
This namespace contains a class Caffe, which is the main part of common.hpp.
So what does the Caffe class contain?
 public:
  ~Caffe();  // destructor
  // Thread local context for Caffe. Moved to common.cpp instead of
  // including boost/thread.hpp to avoid boost/NVCC issues (#1009, #1010)
  // on OSX. Also fails on Linux with CUDA 7.0.18.
  static Caffe& Get();  // returns a reference to the thread-local singleton,
                        // creating it on first access in each thread
  enum Brew { CPU, GPU };  // enumeration of the two compute modes
  // This random number generator facade hides boost and CUDA rng
  // implementation from one another (for cross-platform compatibility).
  class RNG;  // random number generator class
  // Getters for boost rng, curand, and cublas handles
  // (returns the singleton's boost RNG stream, created lazily on first use)
  inline static RNG& rng_stream();
  // The next few are straightforward.
  // Returns the mode: running on CPU or GPU.
  inline static Brew mode() { return Get().mode_; }
  // The setters for the variables
  // Sets the mode. It is recommended that you don't change the mode halfway
  // into the program since that may cause allocation of pinned memory being
  // freed in a non-pinned way, which may cause problems - I haven't verified
  // it personally but better to note it here in the header file.
  inline static void set_mode(Brew mode) { Get().mode_ = mode; }
  // Sets the random seed of both boost and curand
  static void set_random_seed(const unsigned int seed);
  // Sets the device. Since we have cublas and curand stuff, set device also
  // requires us to reset those values.
  static void SetDevice(const int device_id);
  // Prints the current GPU status.
  static void DeviceQuery();
  // Check if specified device is available
  static bool CheckDevice(const int device_id);
  // Search from start_id to the highest possible device ordinal,
  // return the ordinal of the first available device.
  static int FindDevice(const int start_id = 0);
  // Parallel training info
  inline static int solver_count() { return Get().solver_count_; }
  inline static void set_solver_count(int val) { Get().solver_count_ = val; }
  inline static bool root_solver() { return Get().root_solver_; }
  inline static void set_root_solver(bool val) { Get().root_solver_ = val; }

 protected:
#ifndef CPU_ONLY
  cublasHandle_t cublas_handle_;
  curandGenerator_t curand_generator_;
#endif
  shared_ptr<RNG> random_generator_;
  Brew mode_;
  int solver_count_;
  bool root_solver_;

 private:
  // The private constructor to avoid duplicate instantiation.
  Caffe();
  DISABLE_COPY_AND_ASSIGN(Caffe);
};
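To make the thread-local singleton mentioned in the Get() comment concrete, here is a minimal, self-contained sketch of the pattern (my own illustration, using C++11 thread_local in place of the boost::thread_specific_ptr that Caffe's common.cpp actually uses; Context stands in for the Caffe class):
#include <iostream>
#include <thread>

// Minimal sketch (not Caffe's actual code) of a thread-local singleton:
// each thread lazily creates its own instance and always gets back the
// same reference on subsequent calls.
class Context {
 public:
  static Context& Get() {
    static thread_local Context instance;  // one lazily-built instance per thread
    return instance;
  }
  int mode;

 private:
  Context() : mode(0) {}                        // private ctor: construction only via Get()
  Context(const Context&) = delete;             // roughly what DISABLE_COPY_AND_ASSIGN
  Context& operator=(const Context&) = delete;  // expands to
};

int main() {
  Context::Get().mode = 1;  // modifies this thread's instance
  std::thread t([] {
    // A new thread sees its own, freshly constructed instance.
    std::cout << "worker: " << Context::Get().mode << std::endl;  // prints 0
  });
  t.join();
  std::cout << "main: " << Context::Get().mode << std::endl;  // prints 1
  return 0;
}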
In summary, common.hpp mainly makes the Caffe class a (thread-local) singleton and wraps the boost and CUDA random number generators behind a unified interface.
As for #include "caffe/syncedmem.hpp", it mainly contains:
// allocate host memory (pinned memory via cudaMallocHost when a GPU is used)
inline void CaffeMallocHost(void** ptr, size_t size, bool* use_cuda);
// free host memory allocated by CaffeMallocHost
inline void CaffeFreeHost(void* ptr, bool use_cuda);
// a class that synchronizes a block of memory between the CPU and the GPU
class SyncedMemory;
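To see how this class supports the lazy copying described earlier, here is an abridged sketch of its interface (simplified from syncedmem.hpp; constructors, setters, and the raw pointers are omitted). The head() state tracks which copy is current and drives the cpu<->gpu transfers illustrated in the tutorial example above:
class SyncedMemory {
 public:
  // Which copy of the data is current.
  enum SyncedHead { UNINITIALIZED, HEAD_AT_CPU, HEAD_AT_GPU, SYNCED };
  const void* cpu_data();    // copies gpu->cpu if needed
  const void* gpu_data();    // copies cpu->gpu if needed
  void* mutable_cpu_data();  // head becomes HEAD_AT_CPU (GPU copy now stale)
  void* mutable_gpu_data();  // head becomes HEAD_AT_GPU (CPU copy now stale)
  SyncedHead head() { return head_; }
 private:
  SyncedHead head_;
  // ... pointers, sizes, and ownership flags omitted ...
};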
caffe.pb.h is generated from caffe.proto (see my earlier study notes on caffe.proto). Here are the Blob-related definitions in caffe.proto:
message BlobShape {
  repeated int64 dim = 1 [packed = true];
}
message BlobProto {
  optional BlobShape shape = 7;
  repeated float data = 5 [packed = true];
  repeated float diff = 6 [packed = true];
  repeated double double_data = 8 [packed = true];
  repeated double double_diff = 9 [packed = true];
  // 4D dimensions -- deprecated. Use "shape" instead.
  optional int32 num = 1 [default = 0];
  optional int32 channels = 2 [default = 0];
  optional int32 height = 3 [default = 0];
  optional int32 width = 4 [default = 0];
}
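As a quick illustration (a sketch assuming only the standard protobuf-generated accessors, e.g. mutable_shape(), add_dim(), add_data()), a BlobProto for a 1 x 3 x 2 x 2 blob could be built by hand like this:
#include "caffe/proto/caffe.pb.h"

// Sketch: hand-building a BlobProto using the standard protobuf-generated
// API (repeated fields get add_* accessors, messages get mutable_*).
int main() {
  caffe::BlobProto proto;
  caffe::BlobShape* shape = proto.mutable_shape();
  shape->add_dim(1);  // N
  shape->add_dim(3);  // K
  shape->add_dim(2);  // H
  shape->add_dim(2);  // W
  for (int i = 0; i < 1 * 3 * 2 * 2; ++i) {
    proto.add_data(0.0f);  // values in row-major (n, k, h, w) order
    proto.add_diff(0.0f);  // one gradient entry per data entry
  }
  return 0;
}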