[paper] Co-CNN

The paper proposes a novel Contextualized Convolutional Neural Network (Co-CNN) architecture for the human parsing task. The architecture integrates cross-layer context, global image-level context and local super-pixel contexts: a hierarchical local-to-global-to-local structure captures global semantic information and local details across different convolutional layers, an auxiliary objective predicts global image-level labels, and within-super-pixel smoothing together with cross-super-pixel neighborhood voting enforces local label consistency.


(ICCV 2015) Human Parsing with Contextualized Convolutional Neural Network
(T-PAMI 2016) Human Parsing with Contextualized Convolutional Neural Network
Paper: http://www.cv-foundation.org/openaccess/content_iccv_2015/papers/Liang_Human_Parsing_With_ICCV_2015_paper.pdf
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7423822&queryText=human%20parsing%20with%20contextualized&newsearch=true
Project: http://hcp.sysu.edu.cn/deep-human-parsing/

The paper proposes the Contextualized Convolutional Neural Network (Co-CNN), which incorporates cross-layer context, global image-level context, within-super-pixel context and cross-super-pixel neighborhood context into the CNN.

cross-layer context: a local-to-global-to-local structure (feature maps from earlier layers are summed into later layers, and the convolved input image is added into the last feature maps) combines the global semantic information and the local details of different convolutional layers.

global image-level label prediction: all labels appearing in the segmentation are collapsed into one binary vector for multi-label classification.

within-super-pixel smoothing and cross-super-pixel neighborhood voting: smooth the local labels for consistency.

In this work, we address the human parsing task with a novel Contextualized Convolutional Neural Network (CoCNN) architecture, which well integrates the cross-layer context, global image-level context, within-super-pixel context and cross-super-pixel neighborhood context into a unified network.

  1. the cross-layer context is captured by our basic local-to-global-to-local structure, which hierarchically combines the global semantic information and the local fine details across different convolutional layers.

  2. the global image-level label prediction is used as an auxiliary objective in the intermediate layer of the Co-CNN.

  3. the within-super-pixel smoothing and cross-super-pixel neighborhood voting are formulated as natural sub-components of the Co-CNN to achieve local label consistency in both the training and testing processes.

Introduction

None of the previous methods has achieved excellent dense prediction over raw image pixels in a fully end-to-end way.

  1. diverse contextual information and mutual relationships among the key components of human parsing (i.e. semantic labels, spatial layouts and shape priors) should be well addressed when predicting the pixel-wise labels.

  2. the predicted label maps are desired to be detail-preserving and of high resolution, in order to recognize or highlight very small labels (e.g. sunglasses or belt).

In this paper, we present a novel Contextualized Convolutional Neural Network (Co-CNN) that successfully addresses the above mentioned issues.


Figure 1. Our Co-CNN integrates the cross-layer context, global image-level context and local super-pixel contexts into a unified network. It consists of cross-layer combination, global image-level label prediction, within-super-pixel smoothing and cross-super-pixel neighborhood voting.

  1. given an input 150 × 100 image, we extract the feature maps for four resolutions (i.e., 150 × 100, 75 × 50, 37 × 25 and 18 × 12). Then we gradually up-sample the feature maps and combine the corresponding early, fine layers (blue dash line) and deep, coarse layers (blue circle with plus) under the same resolutions to capture the cross-layer context.

  2. an auxiliary objective (shown as “Squared loss on image-level labels”) is appended after the down-sampling stream to predict global image-level labels. These predicted probabilities are then aggregated into the subsequent layers after the up-sampling (green line) and used to re-weight pixel-wise prediction (green circle with plus).

  3. the within-super-pixel smoothing and cross-super-pixel neighborhood voting are performed based on the predicted confidence maps (orange planes) and the generated super-pixel over-segmentation map to produce the final parsing result.

Only down-sampling, up-sampling, and prediction layers are shown; intermediate convolution layers are omitted. For better viewing of all figures in this paper, please see the original zoomed-in color PDF file.

Human Parsing

Semantic Segmentation with CNN

The Proposed Co-CNN Architecture

Local-to-global-to-local Hierarchy

Our basic local-to-global-to-local structure captures the cross-layer context. It simultaneously considers the local fine details and global structure information.

The early convolutional layers with high spatial resolutions (e.g., 150×100) often capture more local details, while the ones with low spatial resolutions (e.g., 18×12) can capture more structure information with high-level semantics.

The feature maps up-sampled from the low resolutions and those from the high resolutions are then aggregated with the element-wise summation, shown as the blue circle with plus in Figure 1.

To capture more detailed local boundaries, the input image is further filtered with the 5×5 convolutional filters and then aggregated into the later feature maps.
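As a concrete illustration, a minimal PyTorch sketch of this local-to-global-to-local combination is given below, assuming a three-resolution pyramid; the module names, layer counts and channel width `ch` are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerCombination(nn.Module):
    """Sketch: down-sample to coarser resolutions, then up-sample and
    element-wise sum with the earlier feature maps of the same resolution
    (the blue circle-plus in Figure 1)."""

    def __init__(self, ch=64):
        super().__init__()
        self.enc1 = nn.Conv2d(3, ch, 3, padding=1)             # fine, e.g. 150x100
        self.enc2 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)  # coarser, e.g. 75x50
        self.enc3 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)  # coarsest
        self.dec3 = nn.Conv2d(ch, ch, 3, padding=1)
        self.dec2 = nn.Conv2d(ch, ch, 3, padding=1)
        # 5x5 filters applied directly to the input image, aggregated
        # into the later feature maps to recover detailed local boundaries
        self.img_conv = nn.Conv2d(3, ch, 5, padding=2)

    def forward(self, x):
        f1 = F.relu(self.enc1(x))   # early layer: local details
        f2 = F.relu(self.enc2(f1))
        f3 = F.relu(self.enc3(f2))  # deep layer: global semantics
        u2 = F.interpolate(self.dec3(f3), size=f2.shape[2:]) + f2
        u1 = F.interpolate(self.dec2(u2), size=f1.shape[2:]) + f1
        return u1 + self.img_conv(x)  # add the filtered input image
```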

Global Image-level Context

An auxiliary objective for multi-label prediction is used after the intermediate layers with spatial resolution of 18×12 , as shown in the pentagon in Figure 1.

Following the fully-connected layer, a C-way softmax that produces a probability distribution over the C class labels is appended.

Squared loss is used during the global image label prediction.

Suppose for each image $I$ in the training set, $y = [y_1, y_2, \dots, y_C]$ is the ground-truth multi-label vector, where $y_c = 1$ $(c = 1, \dots, C)$ if the image is annotated with class $c$, and $y_c = 0$ otherwise.

The predicted label probabilities are fed back into the network in two ways: by concatenating them with the intermediate convolutional layers (image label concatenation in Figure 1) and by element-wise summation with the label confidence maps (element-wise summation in Figure 1).
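A hedged sketch of this auxiliary branch and the two feedback paths follows; the class count, the 18×12 feature size and the names `GlobalLabelBranch`/`reweight` are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLabelBranch(nn.Module):
    """Auxiliary multi-label prediction appended after the 18x12 layers."""

    def __init__(self, ch=64, n_classes=18):  # e.g. 18 labels on ATR
        super().__init__()
        self.fc = nn.Linear(ch * 18 * 12, n_classes)

    def forward(self, feat):             # feat: B x ch x 18 x 12
        logits = self.fc(feat.flatten(1))
        return F.softmax(logits, dim=1)  # C-way softmax probabilities

def squared_loss(probs, y):
    """Squared loss between predicted probabilities and the binary
    ground-truth multi-label vector y (y_c = 1 iff class c appears)."""
    return ((probs - y) ** 2).sum(dim=1).mean()

def reweight(conf_maps, probs):
    """Element-wise summation of the C pixel-wise confidence maps with the
    broadcast image-level probabilities (green circle-plus in Figure 1)."""
    # conf_maps: B x C x H x W, probs: B x C
    return conf_maps + probs[:, :, None, None]
```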

Local Super-pixel Context

Using super-pixel guidance only at this later stage is advantageous: it avoids making premature decisions and thus avoids learning unsatisfactory convolutional filters.

Within-super-pixel Smoothing

For each input image $I$, we first compute the over-segmentation of $I$ using the entropy rate based segmentation algorithm [17] and obtain 500 super-pixels per image. Given the $C$ confidence maps $\{x_c\}_{c=1}^{C}$ in the prediction layer, the within-super-pixel smoothing is performed on each map $x_c$. Denoting by $s_{ij}$ the super-pixel covering the pixel at location $(i,j)$, the smoothed confidence maps $\tilde{x}_c$ can be computed by

$$\tilde{x}_{i,j,c} = \frac{1}{|s_{ij}|} \sum_{(i',j') \in s_{ij}} x_{i',j',c}$$
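The smoothing amounts to averaging each confidence map over every super-pixel. A NumPy sketch follows, assuming a pre-computed integer super-pixel index map (the paper obtains it with the entropy-rate segmentation of [17]; any over-segmentation with this interface works for the sketch).

```python
import numpy as np

def within_superpixel_smoothing(conf, sp):
    """conf: H x W x C confidence maps; sp: H x W integer super-pixel ids.
    Replaces each pixel's confidence with the mean over its super-pixel."""
    H, W, C = conf.shape
    flat_sp = sp.ravel()
    flat = conf.reshape(-1, C)
    n_sp = int(flat_sp.max()) + 1
    # per-super-pixel sums and pixel counts
    sums = np.zeros((n_sp, C))
    np.add.at(sums, flat_sp, flat)
    counts = np.bincount(flat_sp, minlength=n_sp)[:, None]
    means = sums / np.maximum(counts, 1)
    return means[flat_sp].reshape(H, W, C)
```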

Cross-super-pixel Neighborhood Voting

We can take larger neighboring regions into account for better inference, and exploit more of the statistical structure and correlations between different super-pixels.

For each super-pixel $s$, we first compute a concatenation of bag-of-words features from RGB, Lab and HOG descriptors; the resulting feature of super-pixel $s$ is denoted $b_s$. The cross-neighborhood voted response $\bar{x}_s$ of super-pixel $s$ is calculated by

$$\bar{x}_s = (1-\alpha)\,\tilde{x}_s + \alpha \sum_{s' \in \mathcal{D}_s} \frac{\exp\!\left(-\lVert b_s - b_{s'} \rVert^2\right)}{\sum_{\hat{s} \in \mathcal{D}_s} \exp\!\left(-\lVert b_s - b_{\hat{s}} \rVert^2\right)}\, \tilde{x}_{s'}$$

where $\mathcal{D}_s$ denotes the set of super-pixels adjacent to $s$, and $\alpha$ balances a super-pixel's own smoothed response against its neighbors' votes.
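A NumPy sketch of this voting rule over per-super-pixel quantities follows; the `neighbors` adjacency structure and the default value of `alpha` are illustrative assumptions.

```python
import numpy as np

def cross_superpixel_voting(x_tilde, b, neighbors, alpha=0.5):
    """x_tilde: S x C smoothed responses per super-pixel;
    b: S x D bag-of-words features (RGB, Lab, HOG concatenated);
    neighbors[s]: array of super-pixel ids adjacent to s (the set D_s).
    Returns the voted responses x_bar, shape S x C."""
    S, _ = x_tilde.shape
    x_bar = np.empty_like(x_tilde)
    for s in range(S):
        nb = np.asarray(neighbors[s])
        if nb.size == 0:
            x_bar[s] = x_tilde[s]  # no neighbors: keep the smoothed response
            continue
        # similarity weights exp(-||b_s - b_s'||^2), normalized over D_s
        w = np.exp(-np.sum((b[nb] - b[s]) ** 2, axis=1))
        w /= w.sum()
        x_bar[s] = (1 - alpha) * x_tilde[s] + alpha * (w @ x_tilde[nb])
    return x_bar
```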

Our within-super-pixel smoothing and cross-super-pixel neighborhood voting can be seen as two types of pooling methods, which are performed on the local responses within the irregular regions depicted by super-pixels.

Parameter details of Co-CNN

Experiments

Experimental Settings

Dataset

the large ATR dataset [15] and the small Fashionista dataset [31]

Implementation Details

We augment the training images with horizontal reflections, which improves the F-1 score by about 4%.

Given a test image, we use the human detection algorithm [10] to detect the human body.

To evaluate the performance, we re-scale the output pixel-wise prediction back to the size of the original ground-truth labeling.

We train the networks for roughly 90 epochs, which takes 4 to 5 days.

Our Co-CNN can rapidly process one 150 × 100 image in about 0.0015 seconds. After incorporating the super-pixel extraction [17], we test one image in about 0.15 seconds.

Results and Comparisons

Discussion on Our Network

Local-to-Global-to-Local Hierarchy

Global Image-level Context

Local Super-pixel Contexts

Conclusions and Future Work

In this work, we proposed a novel Co-CNN architecture for the human parsing task, which integrates the cross-layer context, global image-level label context and local super-pixel contexts into a unified network.
