net_surgery: how to convert fully connected layers into convolutional layers


I've recently been reading the paper Fully Convolutional Networks for Semantic Segmentation, and quite a few parts of it weren't clear to me, so I had to dig through it bit by bit. The net_surgery example turns out to be an essential first step: it shows how to convert a classifier into a fully convolutional network.

 

Original input definition:

https://github.com/BVLC/caffe/blob/master/models/bvlc_reference_caffenet/deploy.prototxt

name: "Alex"            # let's call this net Alex
input_shape {
  dim: 10
  dim: 3
  dim: 227
  dim: 227
}
layer {
  name: "fc6"
  type: "InnerProduct"
  bottom: "pool5"
  top: "fc6"
  inner_product_param {
    num_output: 4096
  }
}

 

 

Modified input definition:

https://github.com/BVLC/caffe/blob/master/examples/net_surgery/bvlc_caffenet_full_conv.prototxt

name: "FCN"             # let's call this net FCN
input_shape {
  dim: 10
  dim: 3
  dim: 451
  dim: 451
}
layer {
  name: "fc6-conv"
  type: "Convolution"
  bottom: "pool5"
  top: "fc6-conv"
  convolution_param {
    num_output: 4096
    kernel_size: 6
  }
}

 

 

The key change is to the first fully connected layer after pool5. It took me some time to work through, so I'm writing it down before I forget it again.

 

Working layer by layer through the network's data (blobs) and parameters (params), we can derive the following.

With a 1x3x227x227 input, Alex's pool5 blob has shape 1x256x6x6. The next layer is the fully connected layer fc6 with num_output 4096, so the params of fc6 have shape 4096x9216 (256*6*6 = 9216).
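These shapes are easy to verify with pycaffe. A minimal sketch, assuming local copies of the CaffeNet deploy prototxt and the reference caffemodel (the file paths below are placeholders):

```python
import caffe

# Load the original classifier ("Alex"). The file names are assumed
# local copies of the BVLC reference CaffeNet deploy prototxt and weights.
net = caffe.Net('deploy.prototxt',
                'bvlc_reference_caffenet.caffemodel',
                caffe.TEST)

print(net.blobs['pool5'].data.shape)    # (10, 256, 6, 6): batch 10 in deploy
print(net.params['fc6'][0].data.shape)  # (4096, 9216), since 256*6*6 = 9216
```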

 

Now modify the network's input, enlarging it to 1x3x451x451 while keeping everything else unchanged. Recomputing the shapes, FCN's pool5 blob comes out as 1x256x13x13. The fully connected layer fc6 is then replaced by a convolutional layer fc6-conv with num_output 4096 and kernel_size 6, so the params of that layer have shape 4096x256x6x6, which works out to exactly 4096x9216 (256*6*6 = 9216).
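Where do the 6 and 13 come from? Chaining the standard output-size formula out = (in + 2*pad - kernel)/stride + 1 through CaffeNet's conv/pool stack gives exactly these numbers. A small sketch of the arithmetic, with the (kernel, stride, pad) values taken from the CaffeNet prototxt:

```python
def out_size(in_size, kernel, stride=1, pad=0):
    # Output-size formula for conv layers; Caffe pooling actually rounds
    # up rather than down, but every division here happens to be exact.
    return (in_size + 2 * pad - kernel) // stride + 1

def pool5_side(h):
    # (kernel, stride, pad) of CaffeNet's spatial layers, conv1..pool5,
    # as defined in the deploy prototxt (LRN layers don't change size).
    layers = [(11, 4, 0),  # conv1
              (3, 2, 0),   # pool1
              (5, 1, 2),   # conv2
              (3, 2, 0),   # pool2
              (3, 1, 1),   # conv3
              (3, 1, 1),   # conv4
              (3, 1, 1),   # conv5
              (3, 2, 0)]   # pool5
    for k, s, p in layers:
        h = out_size(h, k, s, p)
    return h

print(pool5_side(227))  # 6  -> pool5 blob 1x256x6x6
print(pool5_side(451))  # 13 -> pool5 blob 1x256x13x13
```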

 

At this point, the fc6 params of Alex under a 3x227x227 input (4096x9216) can be assigned directly to the fc6-conv layer of FCN under a 3x451x451 input (also 4096x9216).
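This transplant is what the net_surgery example performs in pycaffe. A sketch along the lines of that example (file names are assumed local copies; fc7 and fc8 transplant the same way, since their element counts also match their 1x1-conv counterparts):

```python
import caffe

# Original classifier and the full-convolution variant, both initialized
# from the same caffemodel; the *-conv layers start out uninitialized
# because their names match nothing in the weights file.
net = caffe.Net('deploy.prototxt',
                'bvlc_reference_caffenet.caffemodel', caffe.TEST)
net_full_conv = caffe.Net('bvlc_caffenet_full_conv.prototxt',
                          'bvlc_reference_caffenet.caffemodel', caffe.TEST)

params = ['fc6', 'fc7', 'fc8']
params_conv = ['fc6-conv', 'fc7-conv', 'fc8-conv']

for fc, conv in zip(params, params_conv):
    # Weights: 4096x9216 for fc6 vs 4096x256x6x6 for fc6-conv --
    # same number of elements, so copy through the flat view.
    net_full_conv.params[conv][0].data.flat = net.params[fc][0].data.flat
    # Biases have identical shapes; copy directly.
    net_full_conv.params[conv][1].data[...] = net.params[fc][1].data

net_full_conv.save('bvlc_caffenet_full_conv.caffemodel')
```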


At first I assumed the conversion was done with both nets at the same input size, so I analyzed everything at 451x451. That gives an fc layer after pool5 with params of shape 4096x(256*13*13), i.e. 4096x43264; but the params stored in models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel are 4096x9216, so there is no way to copy them over.
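The mismatch is plain from the element counts (a quick check, shapes as derived above):

```python
# An InnerProduct fc6 over a 13x13 pool5 would need 4096 x (256*13*13)
# weights, but the caffemodel stores 4096 x (256*6*6) -- different
# element counts, so no reshape or flat copy can bridge them.
print(4096 * 256 * 13 * 13)  # 177209344 weights needed at 451x451
print(4096 * 256 * 6 * 6)    # 37748736 weights stored in the caffemodel
```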


If, on the other hand, you leave the input size alone and just turn fc6 into a convolutional layer, the parameter sizes can still be made to match: pool5 is 1x256x6x6, and setting fc6-conv to num_output: 4096, kernel_size: 6 again yields params of 4096x256x6x6 = 4096x9216. But look at the data shape right after fc6-conv: the resulting feature map has side length (6-6)/1+1 = 1, leaving only 1x4096x1x1. Nothing is wrong with doing it this way, except that a 1x1 feature map can't show any heat-map effect.
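With the enlarged 451x451 input, by contrast, pool5 is 13x13, so fc6-conv produces an 8x8 map and the heat-map effect survives. The same formula, evaluated for both cases:

```python
# fc6-conv output side length: (in - kernel) / stride + 1, with
# kernel_size 6 and stride 1.
print((6 - 6) // 1 + 1)   # 1 -> 1x4096x1x1 at a 227x227 input: no spatial map
print((13 - 6) // 1 + 1)  # 8 -> 1x4096x8x8 at a 451x451 input: an 8x8 heat map
```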

