DeepLab for Image Segmentation and an Analysis of the Hole Algorithm

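DeepLab is an image segmentation method that combines a deep convolutional neural network with a fully connected conditional random field. It first uses a fully convolutional network (FCN) to produce a coarse score map, then refines object boundaries with a fully connected conditional random field (CRF) to improve segmentation accuracy.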

Please respect the original work; if you repost, credit the source: http://blog.youkuaiyun.com/tangwei2014

DeepLab was published at ICLR 2015. Paper: Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs.

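Overview of the DeepLab method
DeepLab works in two steps. The first step still uses an FCN to obtain a coarse score map, which is interpolated back to the size of the input image. The second step applies a fully connected CRF to refine the details of the segmentation produced by the FCN. (For an introduction to FCN, see my earlier post: http://blog.youkuaiyun.com/tangwei2014/article/details/46882257)
The figure below shows the overall structure clearly: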

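And this figure compares results before and after the CRF; the details clearly improve a lot once the CRF is applied:
DeepLab's more elegant handling of the FCN
In the first step, DeepLab still uses an FCN to obtain the score map, and it is likewise fine-tuned from the VGG network. The way it obtains the score map, however, is much more elegant than in the original FCN.
Remember how the CVPR 2015 FCN obtained a denser score map? For a 500x500 input image, it simply applied a large padding of 100 at the first convolutional layer, conv1_1, and in the end barely obtained a 16x16 score map at fc7. The approach is a little crude, but it was the first work to make CNN-based image segmentation end-to-end, and its performance was state of the art at the time, so this is understandable.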
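The 16x16 figure can be reproduced with a little arithmetic. The sketch below assumes the standard VGG-16 layout (3x3 convolutions with stride 1, five 2x2 max-pooling layers with stride 2 that round up as Caffe does, and fc6 applied as a 7x7 convolution without padding). The function names are illustrative only and are not taken from the FCN code.

```cpp
#include <cassert>

// Output size of a convolution: floor((in + 2*pad - kernel) / stride) + 1.
int conv_out(int in, int kernel, int pad, int stride) {
  assert(stride >= 1 && in + 2 * pad >= kernel);
  return (in + 2 * pad - kernel) / stride + 1;
}

// Output size of a 2x2, stride-2 pooling layer with Caffe-style rounding up:
// ceil((in - 2) / 2) + 1, written with integer arithmetic.
int pool_out(int in) {
  assert(in >= 2);
  return (in - 1) / 2 + 1;
}

// Spatial size of the fc7 score map for an FCN-style VGG-16 (assumed layout).
int fcn_score_map_size(int input, int conv1_1_pad) {
  int size = conv_out(input, 3, conv1_1_pad, 1);  // conv1_1 with the large pad
  for (int block = 0; block < 5; ++block) {
    // The other 3x3 convolutions use pad 1 and keep the size unchanged,
    // so only the five pooling layers need to be applied here.
    size = pool_out(size);
  }
  return conv_out(size, 7, 0, 1);  // fc6 as a 7x7 convolution; fc7 is 1x1
}
```

Under these assumptions, `fcn_score_map_size(500, 100)` returns 16, matching the number quoted above, while an ordinary padding of 1 at conv1_1 would leave only a 10x10 map.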
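DeepLab drops this approach and instead makes a small change to the VGG architecture: the stride of pool4 and pool5 is changed from 2 to 1. This one change reduces the total stride of the VGG network from 32 to 8, so that with a 514x514 input image and normal padding, fc7 yields a 67x67 score map, far denser than FCN's.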
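The change in total stride is just a product of per-layer strides. A minimal sketch, assuming that only the five pooling layers of VGG downsample (the convolutions all have stride 1); `total_stride` is a hypothetical helper, not part of any library:

```cpp
#include <cassert>
#include <vector>

// Total downsampling factor of a network = product of its layer strides.
int total_stride(const std::vector<int>& strides) {
  int total = 1;
  for (int s : strides) {
    assert(s >= 1);
    total *= s;
  }
  return total;
}
```

With the original pooling strides `{2, 2, 2, 2, 2}` this gives 32; with pool4 and pool5 set to 1, `{2, 2, 2, 1, 1}` gives 8.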
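But changing the network structure this way also introduces a problem: once the strides change, continuing to fine-tune from the VGG model means that the regions the later filters act on are different; in other words, the receptive field changes. The braces in panels (a) and (b) of the figure below illustrate this: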
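The Hole algorithm
So the authors came up with a trick that reconciles two seemingly conflicting goals:
fine-tuning from an already trained model, while also changing the network structure to get a denser score map.
The solution is the Hole algorithm. As panels (a) and (b) of the figure below show, in ordinary convolution or pooling, adjacent weights of a filter act on positions of the feature map that are physically contiguous. As panel (c) shows, to keep the receptive field unchanged after a layer's stride is changed from 2 to 1, the layers after it must use the hole algorithm: the contiguous connections become skip connections spaced according to the hole size (for ease of display, panel (c) draws them directly on the current layer). Don't be alarmed by the padding of 2 in (c): the 2 padding elements are never connected to the same filter at the same time.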
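A one-dimensional toy example makes the receptive-field argument concrete. In the sketch below, `x` stands for a feature map computed at stride 1, and keeping every second sample of `x` stands for the same layer at its original stride of 2; padding is ignored. All names are illustrative assumptions, and this is not the DeepLab implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Valid 1-D correlation with holes: out[j] = sum_k w[k] * x[j + k * hole].
std::vector<int> conv1d_hole(const std::vector<int>& x,
                             const std::vector<int>& w, int hole) {
  assert(hole >= 1 && !w.empty());
  const std::size_t k_eff = w.size() + (w.size() - 1) * (hole - 1);
  std::vector<int> out;
  for (std::size_t j = 0; j + k_eff <= x.size(); ++j) {
    int acc = 0;
    for (std::size_t k = 0; k < w.size(); ++k) acc += w[k] * x[j + k * hole];
    out.push_back(acc);
  }
  return out;
}

// Keep every second sample, standing in for the original stride-2 layer.
std::vector<int> subsample2(const std::vector<int>& x) {
  std::vector<int> out;
  for (std::size_t i = 0; i < x.size(); i += 2) out.push_back(x[i]);
  return out;
}

// True if the dense hole-2 output, read at even positions, reproduces the
// output of the original "stride 2, then ordinary filter" pipeline.
bool hole_matches_strided(const std::vector<int>& x,
                          const std::vector<int>& w) {
  const std::vector<int> coarse = conv1d_hole(subsample2(x), w, 1);
  const std::vector<int> dense = conv1d_hole(x, w, 2);
  for (std::size_t i = 0; i < coarse.size(); ++i) {
    if (2 * i >= dense.size() || dense[2 * i] != coarse[i]) return false;
  }
  return true;
}
```

For any input, `hole_matches_strided` should return true: each pretrained weight still sees the same samples it saw in the strided network, and the odd positions of the dense output are the extra responses that make the score map denser.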
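Since the stride of pool4 changes from 2 to 1, the following conv5_1, conv5_2 and conv5_3 use a hole size of 2. Since pool5 then also changes from 2 to 1, the fc6 that follows uses a hole size of 4.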
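The rule behind these numbers can be sketched as follows: a layer's hole size is the product of the stride reductions applied to the layers before it. The `Layer` struct and `hole_sizes` function are hypothetical names introduced for illustration, and only convolution-style layers are reported.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

struct Layer {
  std::string name;
  int original_stride;  // stride in the pretrained network
  int new_stride;       // stride after the modification
  bool is_conv;         // report a hole size only for conv-style layers
};

// Hole size needed by each conv-style layer: the accumulated ratio
// original_stride / new_stride over all layers that come before it.
std::vector<std::pair<std::string, int>> hole_sizes(
    const std::vector<Layer>& layers) {
  std::vector<std::pair<std::string, int>> result;
  int hole = 1;
  for (const Layer& layer : layers) {
    assert(layer.new_stride >= 1);
    assert(layer.original_stride % layer.new_stride == 0);
    if (layer.is_conv) result.emplace_back(layer.name, hole);
    hole *= layer.original_stride / layer.new_stride;
  }
  return result;
}
```

Feeding in pool4 (2 to 1), conv5_1, conv5_2, conv5_3, pool5 (2 to 1) and fc6 in that order yields a hole size of 2 for the three conv5 layers and 4 for fc6, in line with the description above.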
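Code

The changes are mainly in im2col (forward pass) and col2im (backward pass), which gain the parameters hole_w and hole_h. Only the CPU versions are shown here, for understanding: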

//forward
template <typename Dtype>
void im2col_cpu(const Dtype* data_im, 
    const int num, const int channels, const int height, const int width,
    const int kernel_h, const int kernel_w, const int pad_h, const int pad_w,
    const int stride_h, const int stride_w, const int hole_h, const int hole_w,
    Dtype* data_col) {
  // effective kernel if we expand the holes (trous)
  const int kernel_h_eff = kernel_h + (kernel_h - 1) * (hole_h - 1);
  const int kernel_w_eff = kernel_w + (kernel_w - 1) * (hole_w - 1);
  int height_col = (height + 2 * pad_h - kernel_h_eff) / stride_h + 1;
  int width_col = (width + 2 * pad_w - kernel_w_eff) / stride_w + 1;
  int channels_col = channels * kernel_h * kernel_w;
  for (int n = 0; n < num; ++n) {
    for (int c = 0; c < channels_col; ++c) {
      int w_offset = (c % kernel_w)  * hole_w;
      int h_offset = ((c / kernel_w) % kernel_h) * hole_h;
      int c_im = c / kernel_w / kernel_h;
      for (int h = 0; h < height_col; ++h) {
        const int h_im = h * stride_h + h_offset - pad_h;
        for (int w = 0; w < width_col; ++w) {
          const int w_im = w * stride_w + w_offset - pad_w;
          data_col[((n * channels_col + c) * height_col + h) * width_col + w] =
            (h_im >= 0 && h_im < height && w_im >= 0 && w_im < width) ?
            data_im[((n * channels + c_im) * height + h_im) * width + w_im] : 
            0.; // zero-pad
        } //width_col
      } //height_col
    } //channels_col
  } //num
}

//backward
template <typename Dtype>
void col2im_cpu(const Dtype* data_col,
    const int num, const int channels, const int height, const int width,
    const int kernel_h, const int kernel_w, const int pad_h, const int pad_w,
    const int stride_h, const int stride_w, const int hole_h, const int hole_w,
    Dtype* data_im) {
  caffe_set(num * channels * height * width, Dtype(0), data_im);
  const int kernel_h_eff = kernel_h + (kernel_h - 1) * (hole_h - 1);
  const int kernel_w_eff = kernel_w + (kernel_w - 1) * (hole_w - 1);
  int height_col = (height + 2 * pad_h - kernel_h_eff) / stride_h + 1;
  int width_col = (width + 2 * pad_w - kernel_w_eff) / stride_w + 1;
  int channels_col = channels * kernel_h * kernel_w;
  for (int n = 0; n < num; ++n) {
    for (int c = 0; c < channels_col; ++c) {
      int w_offset = (c % kernel_w)  * hole_w;
      int h_offset = ((c / kernel_w) % kernel_h) * hole_h;
      int c_im = c / kernel_w / kernel_h;
      for (int h = 0; h < height_col; ++h) {
        const int h_im = h * stride_h + h_offset - pad_h;
        for (int w = 0; w < width_col; ++w) {
          const int w_im = w * stride_w + w_offset - pad_w;
          if (h_im >= 0 && h_im < height && w_im >= 0 && w_im < width) {
            data_im[((n * channels + c_im) * height + h_im) * width + w_im] += 
              data_col[((n * channels_col + c) * height_col + h) * width_col + w];
          }
        }
      }
    }
  }
}
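
The size bookkeeping at the top of both functions can be isolated into two small helpers. This is a sketch that mirrors the formulas for kernel_h_eff and height_col in the code above; the helper names are made up for illustration and do not exist in Caffe.

```cpp
#include <cassert>

// Kernel extent once the holes are expanded, as in kernel_h_eff.
int effective_kernel(int kernel, int hole) {
  assert(kernel >= 1 && hole >= 1);
  return kernel + (kernel - 1) * (hole - 1);
}

// Output size along one dimension, as in height_col / width_col.
int col_size(int size, int kernel, int pad, int stride, int hole) {
  assert(stride >= 1);
  assert(size + 2 * pad >= effective_kernel(kernel, hole));
  return (size + 2 * pad - effective_kernel(kernel, hole)) / stride + 1;
}
```

With a hole size of 1 the formulas reduce to ordinary convolution. A 3x3 kernel with hole size 2 covers a 5x5 extent, so a padding of 2 at stride 1 keeps the feature map size unchanged, for example `col_size(40, 3, 2, 1, 2)` is 40.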