Problems encountered when compiling Caffe on a cluster

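While compiling 3D-caffe I ran into a cuDNN version incompatibility; comparing against the BVLC-caffe source and updating the relevant files resolved the compile errors. On the cluster, compiling inside the caffe image requires the library directory to be mounted correctly, and building the MATLAB interface matcaffe requires editing .bashrc and Makefile.config. An error about hdf5 not being found during compilation was resolved by adding the library path. The image-rotation problem is also mentioned briefly.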


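1. 3D-caffe failed to compile, so I checked the error messages. Comparing it with BVLC-caffe, which compiles successfully, showed that the setConvolutionDesc() function in 3D-caffe/include/caffe/util/cudnn.hpp differs, so the failure is probably caused by the cuDNN version.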

BVLC-caffe: https://github.com/BVLC/caffe/blob/master/include/caffe/util/cudnn.hpp


Following the cudnn.hpp file in BVLC-caffe, I modified the setConvolutionDesc() function. After the change it reads:


#if CUDNN_VERSION_MIN(6, 0, 0)
  CUDNN_CHECK(cudnnSetConvolution2dDescriptor(*conv,
      pad_h, pad_w, stride_h, stride_w, 1, 1, CUDNN_CROSS_CORRELATION,
      dataType<Dtype>::type));
#else
  CUDNN_CHECK(cudnnSetConvolution2dDescriptor(*conv,
      pad_h, pad_w, stride_h, stride_w, 1, 1, CUDNN_CROSS_CORRELATION));
#endif

I also modified the cudnnGetErrorString() function:


#if CUDNN_VERSION_MIN(6, 0, 0)
    case CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING:
      return "CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING";
#endif
#if CUDNN_VERSION_MIN(7, 0, 0)
    case CUDNN_STATUS_RUNTIME_IN_PROGRESS:
      return "CUDNN_STATUS_RUNTIME_IN_PROGRESS";
    case CUDNN_STATUS_RUNTIME_FP_OVERFLOW:
      return "CUDNN_STATUS_RUNTIME_FP_OVERFLOW";
#endif
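
To make the version guards above easier to reason about, here is a small Python sketch of the comparison that Caffe's CUDNN_VERSION_MIN macro performs (CUDNN_VERSION >= major * 1000 + minor * 100 + patch). The function names and the three example version numbers are my own illustration, not part of Caffe.

```python
def cudnn_version_min(cudnn_version, major, minor, patch):
    """Mirror of the CUDNN_VERSION_MIN comparison for an integer CUDNN_VERSION."""
    return cudnn_version >= major * 1000 + minor * 100 + patch


def enabled_branches(cudnn_version):
    """List which of the guarded branches shown above would be compiled in."""
    branches = []
    if cudnn_version_min(cudnn_version, 6, 0, 0):
        branches.append("cudnnSetConvolution2dDescriptor with dataType argument")
        branches.append("CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING")
    else:
        branches.append("cudnnSetConvolution2dDescriptor without dataType argument")
    if cudnn_version_min(cudnn_version, 7, 0, 0):
        branches.append("CUDNN_STATUS_RUNTIME_IN_PROGRESS")
        branches.append("CUDNN_STATUS_RUNTIME_FP_OVERFLOW")
    return branches


# Example CUDNN_VERSION values (illustrative): 5.1.10, 6.0.21, 7.0.5
for version in (5110, 6021, 7005):
    print(version, enabled_branches(version))
```

With a cuDNN 6 header only the first guard is active, which matches the cuda8-cudnn6 image used later in these notes.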

After that, compilation succeeded, with only some warnings like the following:

/ghome/zhaojie/NeuronSeg/3D-Caffe/include/caffe/util/cudnn.hpp:22:10: warning: enumeration value 'CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING' not handled in switch [-Wswitch]
   switch (status) {
          ^
In file included from /ghome/zhaojie/NeuronSeg/3D-Caffe/include/caffe/util/device_alternate.hpp:40:0,
                 from /ghome/zhaojie/NeuronSeg/3D-Caffe/include/caffe/common.hpp:19,
                 from /ghome/zhaojie/NeuronSeg/3D-Caffe/include/caffe/blob.hpp:8,
                 from /ghome/zhaojie/NeuronSeg/3D-Caffe/include/caffe/layers/cudnn_relu_layer.hpp:6,
                 from /ghome/zhaojie/NeuronSeg/3D-Caffe/src/caffe/layers/cudnn_relu_layer.cpp:4:
/ghome/zhaojie/NeuronSeg/3D-Caffe/include/caffe/util/cudnn.hpp: In function 'const char* cudnnGetErrorString(cudnnStatus_t)':

Solution (from the post "Compatibility problems between Caffe and R-CNN when installing on Ubuntu"):

Some problems encountered when compiling faster-rcnn

1. If an error like the following is reported while installing faster-rcnn (it occurs when compiling Caffe):
In file included from ./include/caffe/util/device_alternate.hpp:40:0,
                 from ./include/caffe/common.hpp:19,
                 from ./include/caffe/blob.hpp:8,
                 from ./include/caffe/net.hpp:10,
                 from ./include/caffe/solver.hpp:7,
                 from ./include/caffe/sgd_solvers.hpp:7,
                 from src/caffe/solvers/adam_solver.cpp:3:
./include/caffe/util/cudnn.hpp: In function ‘const char* cudnnGetErrorString(cudnnStatus_t)’:
./include/caffe/util/cudnn.hpp:21:10: warning: enumeration value ‘CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING’ not handled in switch [-Wswitch]
   switch (status) {
          ^
./include/caffe/util/cudnn.hpp: In function ‘void caffe::cudnn::setConvolutionDesc(cudnnConvolutionStruct**, cudnnTensorDescriptor_t, cudnnFilterDescriptor_t, int, int, int, int)’:
./include/caffe/util/cudnn.hpp:108:70: error: too few arguments to function ‘cudnnStatus_t cudnnSetConvolution2dDescriptor(cudnnConvolutionDescriptor_t, int, int, int, int, int, int, cudnnConvolutionMode_t, cudnnDataType_t)’
       pad_h, pad_w, stride_h, stride_w, 1, 1, CUDNN_CROSS_CORRELATION));
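Fix:

    This happens because rbg open-sourced this object detection code against the cuDNN version that was current at the time. cuDNN has gone through several releases since then, so the fix is simply to replace all cuDNN-related files in the bundled Caffe with the ones from the current Caffe.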

rbg's faster-rcnn code: https://github.com/rbgirshick/py-faster-rcnn

Caffe framework code: https://github.com/BVLC/caffe

In the Caffe copy inside the faster-rcnn directory, replace src/caffe/util/cudnn.cpp and all files matching src/caffe/layers/cudnn*, as well as include/caffe/util/cudnn.hpp and all files matching include/caffe/layers/cudnn*.hpp, with the corresponding files from BVLC caffe. Then compile.

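In my experiments, replacing the files as described above produced many other compile errors. I recommend instead fixing the errors one by one, following the compiler messages and using the BVLC-caffe source as a reference.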
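
Before editing anything by hand, it can help to see which cuDNN-related files actually differ between the two trees. The following is a minimal sketch, assuming both source trees are checked out locally and that the headers live under include/caffe/ as in BVLC caffe; the function names are my own and nothing here is part of either repository.

```python
import filecmp
from pathlib import Path

# Glob patterns for the cuDNN-related files named in the text above.
CUDNN_PATTERNS = (
    "src/caffe/util/cudnn.cpp",
    "src/caffe/layers/cudnn*",
    "include/caffe/util/cudnn.hpp",
    "include/caffe/layers/cudnn*.hpp",
)


def cudnn_files(root):
    """Return the cuDNN-related files under root as sorted relative paths."""
    root = Path(root)
    found = set()
    for pattern in CUDNN_PATTERNS:
        found.update(p.relative_to(root).as_posix() for p in root.glob(pattern))
    return sorted(found)


def compare_trees(old_root, new_root):
    """Classify each cuDNN file of old_root as same, differs, or missing in new_root."""
    report = {}
    for rel in cudnn_files(old_root):
        new_file = Path(new_root) / rel
        if not new_file.is_file():
            report[rel] = "missing in new tree"
        elif filecmp.cmp(Path(old_root) / rel, new_file, shallow=False):
            report[rel] = "same"
        else:
            report[rel] = "differs"
    return report
```

Calling compare_trees with the old Caffe directory and a BVLC caffe checkout gives a list of files to review one at a time, which fits the error-by-error approach recommended above.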


2.  Problems when compiling inside the caffe image on the cluster


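Enter the container, cd to the directory that contains the script, and run it. Change the permissions of the .sh file if necessary.

When the container was started, the Caffe lib directory had not been mounted: the -u option was missing the mount, -u "-it -v <caffe_lib>:/opt/caffe/lib". Replace <caffe_lib> with the path to the Caffe library directory, xxx/build/lib.

The steps are: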

startdocker -u "-it -v /ghome/zhaojie/NeuronSeg2/3D-Caffe/build/lib:/opt/caffe/lib" -c /bin/bash bit:5000/cuda8-cudnn6-caffe-dev-ubuntu16.04 

cd /ghome/zhaojie/NeuronSeg2/3D-DSN/

sh start_train.sh

The job was submitted successfully.

Note: do not use all of the GPU cards on a debug node.
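
The mount argument is easy to get wrong because it is a single quoted string nested inside the startdocker call. As a rough sketch, the command above can be assembled like this; it uses only the -u and -c options that appear in these notes, and the helper name is hypothetical.

```python
import shlex


def startdocker_command(caffe_lib, image, shell="/bin/bash"):
    """Build the startdocker invocation shown above for a given build/lib path."""
    # Everything after -u is passed as one argument, so keep it as one string.
    docker_args = "-it -v {}:/opt/caffe/lib".format(caffe_lib)
    return ["startdocker", "-u", docker_args, "-c", shell, image]


cmd = startdocker_command(
    "/ghome/zhaojie/NeuronSeg2/3D-Caffe/build/lib",
    "bit:5000/cuda8-cudnn6-caffe-dev-ubuntu16.04",
)
# Print a copy-pasteable shell line with the -u argument quoted.
print(" ".join(shlex.quote(part) for part in cmd))
```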


3. Compiling Caffe's MATLAB interface on the cluster: matcaffe

    A. First edit the ./.bashrc file (locate the MATLAB folder on the cluster beforehand):

        a.  In the home directory, run vim ./.bashrc and add the MATLAB path. For example:

                # added by MATLAB
                export PATH="/home/rfdeng/tools/MATLAB/bin:$PATH"

        b.  Save and quit with :wq, then run source ./.bashrc to reload it

    B. Run matlab to check that it works

    C. Set the MATLAB path (MATLAB_DIR) in caffe/Makefile.config

    D. Compile Caffe: make -j8; make matcaffe
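
Steps B and C can be checked quickly before starting a long build. The sketch below reads the uncommented MATLAB_DIR line from Makefile.config text and looks for matlab on PATH. The sample config text is made up for illustration, and its MATLAB_DIR value is only inferred from the PATH example above.

```python
import re
import shutil


def matlab_dir_from_config(config_text):
    """Return the last uncommented MATLAB_DIR value in Makefile.config text, or None."""
    value = None
    for line in config_text.splitlines():
        # Commented lines start with '#', so they never match this pattern.
        match = re.match(r"\s*MATLAB_DIR\s*:?=\s*(\S+)", line)
        if match:
            value = match.group(1)
    return value


# Illustrative Makefile.config fragment, not copied from a real cluster.
sample = """\
# MATLAB_DIR := /usr/local
MATLAB_DIR := /home/rfdeng/tools/MATLAB
"""
print("MATLAB_DIR:", matlab_dir_from_config(sample))
print("matlab on PATH:", shutil.which("matlab"))
```

If shutil.which returns None, the .bashrc change from step A has not taken effect in the current shell.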


4. make matcaffe fails with an error:

The error says that the hdf5.h header file cannot be found.

Find where the library is installed on the cluster, then add that path in the .bashrc file.
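
Searching for the header can be scripted. This is a minimal sketch; the function name is my own and the search roots in the usage note are guesses, not paths taken from the cluster. These notes do not say which variable was set in .bashrc. CPATH, which GCC reads for extra include directories, is one plausible choice, but that is my assumption.

```python
import os


def find_header(name, roots):
    """Return every directory under the given roots that contains the header."""
    hits = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            if name in filenames:
                hits.append(dirpath)
    return sorted(hits)
```

For example, find_header("hdf5.h", ["/usr/include", "/usr/local/include"]) returns the candidate directories to add.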


########################################################

2.

python setup.py build_ext --inplace
Traceback (most recent call last):
  File "setup.py", line 58, in <module>
    CUDA = locate_cuda()
  File "setup.py", line 55, in locate_cuda
    raise EnvironmentError('The CUDA %s path could not be located in %s' % (k, v))
EnvironmentError: The CUDA lib64 path could not be located in /usr/lib64
Makefile:2: recipe for target 'all' failed
make: *** [all] Error 1

 To fix this, change the second lib64 on line 53 of setup.py to lib.
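
Instead of hard-coding either name, the lookup could try both. The following is a sketch of that idea as a standalone helper, not the code from setup.py; the function name and the candidate order are my own choices.

```python
import os
from os.path import join as pjoin


def cuda_lib_dir(home, candidates=("lib64", "lib")):
    """Return the first existing library directory under the CUDA home."""
    for name in candidates:
        path = pjoin(home, name)
        if os.path.isdir(path):
            return path
    raise EnvironmentError(
        "No CUDA library directory (%s) found under %s" % (", ".join(candidates), home)
    )
```

On a machine where only lib exists under the CUDA home, this returns that directory rather than raising the error shown in the traceback.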


3. Install OpenCV
sudo apt-get install python-opencv

4. 
File "/home/gxjun/Qunar/py-faster-rcnn/tools/train_faster_rcnn_alt_opt.py", line 67, in get_roidb
    roidb = get_training_roidb(imdb)
  File "/home/gxjun/Qunar/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 118, in get_training_roidb
    imdb.append_flipped_images()
  File "/home/gxjun/Qunar/py-faster-rcnn/tools/../lib/datasets/imdb.py", line 111, in append_flipped_images
    assert (boxes[:, 2] >= boxes[:, 0]).all()
This kind of problem is usually fixed by clearing the cache: delete all the files under the cache directory.
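
A small helper for clearing the cache might look like the sketch below. It deletes only the regular files directly inside the directory it is given. The location of the cache directory is not stated above; in my checkout it is data/cache under the repository root, which should be treated as an assumption.

```python
from pathlib import Path


def clear_cache(cache_dir):
    """Delete the regular files directly inside cache_dir and return their names."""
    removed = []
    for entry in sorted(Path(cache_dir).iterdir()):
        if entry.is_file():
            entry.unlink()
            removed.append(entry.name)
    return removed
```

Calling clear_cache("data/cache") from the repository root forces the roidb to be rebuilt from the annotations on the next training run.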


5. Problem: the following error appears when training faster rcnn:


File "/py-faster-rcnn/tools/../lib/datasets/imdb.py", line 108, in append_flipped_images
    assert (boxes[:, 2] >= boxes[:, 0]).all()
AssertionError
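Analysis:
Checking my own data showed that the top-left coordinate (x, y) of a box may be 0, or that the annotated region extends beyond the image.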
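
Annotations can be screened for both conditions before training. The sketch below assumes boxes are given as (xmin, ymin, xmax, ymax) in 1-based pixel coordinates, as in PASCAL VOC XML files; the function name and the example numbers are hypothetical.

```python
def find_bad_boxes(boxes, width, height):
    """Return indices of boxes that are degenerate or leave the image.

    boxes holds (xmin, ymin, xmax, ymax) tuples in 1-based pixel coordinates.
    """
    bad = []
    for i, (xmin, ymin, xmax, ymax) in enumerate(boxes):
        inside = 1 <= xmin <= xmax <= width and 1 <= ymin <= ymax <= height
        if not inside:
            bad.append(i)
    return bad


# Illustrative boxes for a 500x375 image: the last three are problematic.
example = [(1, 1, 10, 10), (0, 5, 10, 10), (5, 5, 600, 10), (20, 5, 10, 10)]
print(find_bad_boxes(example, width=500, height=375))
```

Boxes flagged this way can be clipped to the image or dropped. Remember to clear the cache afterwards so the corrected annotations are picked up.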
6.
snapshot_prefix: "vgg16_rpn"
average_loss: 100
I0421 11:53:05.251756 24051 solver.cpp:81] Creating training net from train_net file: models/pascal_voc/VGG16/faster_rcnn_alt_opt/stage1_rpn_train.pt
F0421 11:53:05.251797 24051 io.cpp:36] Check failed: fd != -1 (-1 vs. -1) File not found: models/pascal_voc/VGG16/faster_rcnn_alt_opt/stage1_rpn_train.pt
*** Check failure stack trace: ***
Fix: set the paths in the solver file (solver.txt) to absolute paths.
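
Rewriting the paths can be automated. Here is a minimal sketch that makes the net paths in solver text absolute under a given root. The list of fields and the root directory used in the usage note are assumptions for illustration.

```python
import os
import re

# Solver fields assumed to hold a path to a net definition.
PATH_FIELDS = ("train_net", "test_net", "net")


def absolutize_solver_paths(solver_text, root):
    """Rewrite relative net paths in solver text so they are absolute under root."""
    pattern = re.compile(
        r'^(\s*(?:%s)\s*:\s*)"([^"]+)"' % "|".join(PATH_FIELDS), re.M
    )

    def fix(match):
        path = match.group(2)
        if not os.path.isabs(path):
            path = os.path.join(root, path)
        return '%s"%s"' % (match.group(1), path)

    return pattern.sub(fix, solver_text)
```

For example, with root set to a hypothetical /opt/py-faster-rcnn, the train_net entry from the log above becomes /opt/py-faster-rcnn/models/pascal_voc/VGG16/faster_rcnn_alt_opt/stage1_rpn_train.pt, while snapshot_prefix is left untouched.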


7.
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/media/gxjun/78289D37289CF4FA/py-faster-rcnn/tools/train_faster_rcnn_alt_opt.py", line 208, in train_fast_rcnn
    max_iters=max_iters)
  File "/media/gxjun/78289D37289CF4FA/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 160, in train_net
    model_paths = sw.train_model(max_iters)
  File "/media/gxjun/78289D37289CF4FA/py-faster-rcnn/tools/../lib/fast_rcnn/train.py", line 101, in train_model
    self.solver.step(1)
  File "/media/gxjun/78289D37289CF4FA/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 144, in forward
    blobs = self._get_next_minibatch()
  File "/media/gxjun/78289D37289CF4FA/py-faster-rcnn/tools/../lib/roi_data_layer/layer.py", line 63, in _get_next_minibatch
    return get_minibatch(minibatch_db, self._num_classes)
  File "/media/gxjun/78289D37289CF4FA/py-faster-rcnn/tools/../lib/roi_data_layer/minibatch.py", line 55, in get_minibatch
    num_classes)
  File "/media/gxjun/78289D37289CF4FA/py-faster-rcnn/tools/../lib/roi_data_layer/minibatch.py", line 125, in _sample_rois
    roidb['bbox_targets'][keep_inds, :], num_classes)
  File "/media/gxjun/78289D37289CF4FA/py-faster-rcnn/tools/../lib/roi_data_layer/minibatch.py", line 176, in _get_bbox_regression_labels
    bbox_targets[ind, start:end] = bbox_target_data[ind, 1:]
ValueError: could not broadcast input array from shape (4) into shape (0)
This kind of problem usually means the model configuration is wrong; the parameters in the prototxt need to be set again.
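
The failing line in the traceback assigns four regression targets into the slice start:end of a row. My reading, which is an interpretation of the traceback rather than something the notes confirm, is that the row holds 4 * num_classes values and the slice starts at 4 * cls, so a class label that is not smaller than num_classes produces an empty slice. The sketch below reproduces that arithmetic with plain lists; the names are hypothetical.

```python
def regression_slice_width(cls, num_classes):
    """Width of the slice [4*cls : 4*cls + 4] in a row of 4*num_classes targets."""
    row = [0.0] * (4 * num_classes)
    start = 4 * cls
    end = start + 4
    return len(row[start:end])


def labels_out_of_range(labels, num_classes):
    """Return the distinct labels that do not fit in range(num_classes)."""
    return sorted({c for c in labels if not 0 <= c < num_classes})


# With 21 classes (20 object classes plus background), label 20 still fits
# but label 21 yields an empty slice, matching "shape (4) into shape (0)".
print(regression_slice_width(20, 21), regression_slice_width(21, 21))
```

Under this reading, the class counts in the prototxt and the number of classes in the dataset need to agree.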

There is in fact another possible problem: the images may have been rotated.


8.
645 net.cpp:408] rpn_cls_prob_reshape -> rpn_cls_prob_reshape
F0810 10:54:11.421221   645 reshape_layer.cpp:80] Check failed: 0 == bottom[0]->count() % explicit_count (0 vs. 58320) bottom count (408240) must be divisible by the product of the specified dimensions (87480)
*** Check failure stack trace: ***
For this kind of problem, check the parameters of the layer named in the message; here, for example, the rpn_cls_prob parameters are wrong.
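
The check that failed can be reproduced by hand. The sketch below follows the reshape_param convention in which a dim of 0 copies the bottom dimension at that position and -1 is inferred. The shapes in the example are hypothetical and chosen only to show a passing and a failing case.

```python
def reshape_check(bottom_shape, dims):
    """Return (bottom_count, explicit_count, remainder) for a Reshape layer.

    dims follows the reshape_param convention: 0 copies the bottom dimension
    at that position, -1 is inferred from the remaining elements.
    """
    bottom_count = 1
    for d in bottom_shape:
        bottom_count *= d
    explicit_count = 1
    for axis, d in enumerate(dims):
        if d == 0:
            explicit_count *= bottom_shape[axis]
        elif d != -1:
            explicit_count *= d
    return bottom_count, explicit_count, bottom_count % explicit_count


# Hypothetical shapes: a bottom blob produced with 9 anchors, then with 12,
# while the reshape dims still assume 9 anchors (2 * 9 = 18).
print(reshape_check((1, 2, 126, 14), [0, 18, -1, 0]))  # remainder 0: passes
print(reshape_check((1, 2, 168, 14), [0, 18, -1, 0]))  # non-zero: check fails
# The numbers from the log above: 408240 % 87480 leaves 58320.
print(408240 % 87480)
```

A non-zero remainder is exactly what the log reports, so the fixed dim in rpn_cls_prob_reshape has to be brought in line with the rest of the network.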