Crowd Counting: CP-CNN — Generating High-Quality Crowd Density Maps using Contextual Pyramid CNNs

**The goal of this paper:** generating high-quality crowd density maps with lower count error.
**
The reason for doing this work:

  1. Many existing works do not explicitly incorporate contextual information, which is essential for achieving further improvements.
  2. Though existing approaches regress on density maps, they focus more on reducing count error than on improving the quality of the density maps.
  3. Existing CNN-based approaches are trained using a pixel-wise Euclidean loss, which results in blurred density maps.

**Contributions:**

  1. CP-CNN, a Contextual Pyramid CNN for crowd counting
  2. high-quality density maps
  3. a combination of adversarial loss and Euclidean loss
  4. an analysis of the contribution of contextual information and adversarial loss
**What is CP-CNN?**

Answer: CP-CNN consists of four modules: a Global Context Estimator (GCE), a Local Context Estimator (LCE), a Density Map Estimator (DME), and a Fusion-CNN (F-CNN).

The overall architecture is shown below:

[Figure: CP-CNN architecture — GCE, LCE, DME and F-CNN]

**The function of each part:**

1. GCE is a VGG-16-based CNN that encodes global context; it is trained to classify input images into different density classes.

Detail: a VGG-16 network is fine-tuned on the crowd training data, and its last three fully connected layers are replaced with a different configuration of fully connected layers suited to the task of classification into five density categories, as shown below.

[Figure: GCE network configuration]
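Training this five-way classifier requires density-class labels. A minimal sketch of how ground-truth head counts could be binned into five classes follows; the thresholds here are illustrative assumptions, not values from the paper:

```python
def density_class(head_count, boundaries=(50, 200, 500, 1000)):
    """Map a ground-truth head count to one of five density classes (0-4).

    The boundary values are hypothetical; the paper partitions training
    images into five density levels, but the thresholds depend on the dataset.
    """
    for cls, upper in enumerate(boundaries):
        if head_count < upper:
            return cls
    return len(boundaries)  # class 4: the densest crowds

print(density_class(30))    # sparse image -> class 0
print(density_class(1500))  # very dense image -> class 4
```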

2. LCE is another CNN that encodes local context information; it is trained to perform patch-wise classification of input images into different density classes.

Detail: local contextual information can help produce better-quality maps. The network learns an image's local context by classifying its local patches into one of the five density classes, as shown below.

[Figure: LCE network configuration]
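Patch-wise classification starts from tiling the image into local patches. A minimal numpy sketch, assuming non-overlapping square patches (the patch size is an illustrative choice):

```python
import numpy as np

def extract_patches(image, patch_size):
    """Split an image into non-overlapping square patches.

    Each patch would then be fed to the LCE classifier; overlap and
    patch size are assumptions for illustration.
    """
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - patch_size + 1, patch_size):
        for left in range(0, w - patch_size + 1, patch_size):
            patches.append(image[top:top + patch_size, left:left + patch_size])
    return patches

img = np.zeros((128, 256))
print(len(extract_patches(img, 64)))  # 2 rows x 4 cols = 8 patches
```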

3. DME is a multi-column CNN that aims to generate high-dimensional feature maps.

Detail: the DME transforms the input image into a set of high-dimensional feature maps, which are then concatenated with the contextual information provided by GCE and LCE, as shown below.

[Figure: DME network configuration]
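The concatenation step can be sketched with numpy. All channel counts below are hypothetical stand-ins, since the exact dimensionalities depend on the chosen DME, GCE and LCE configurations:

```python
import numpy as np

# Hypothetical channel counts: DME feature maps plus global and local
# context maps, concatenated along the channel axis before the F-CNN.
h, w = 120, 160
dme_features = np.random.rand(30, h, w)  # DME output maps (count assumed)
global_ctx   = np.random.rand(5, h, w)   # e.g. one map per density class (assumed)
local_ctx    = np.random.rand(5, h, w)   # e.g. one map per density class (assumed)

fused_input = np.concatenate([dme_features, global_ctx, local_ctx], axis=0)
print(fused_input.shape)  # (40, 120, 160)
```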

4. F-CNN fuses the contextual information estimated by GCE and LCE with the feature maps produced by DME, and uses a set of convolutional and fractionally-strided convolutional layers to generate high-resolution, high-quality density maps.

Detail: CR(64,9)-CR(32,7)-TR(32)-CR(16,5)-TR(16)-C(1,1), where C is a convolutional layer, R is a ReLU layer, T is a fractionally-strided convolutional layer, and the first number in each parenthesis is the number of filters while the second is the filter size. Each fractionally-strided convolutional layer increases the input resolution by a factor of 2, thereby ensuring that the output resolution is the same as that of the input.
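Since only the fractionally-strided (T) layers change the spatial size, the output resolution of this configuration can be verified with a small tracker. This sketch assumes 'same'-padded stride-1 convolutions and a DME output at 1/4 of the input resolution (i.e. two pooling stages upstream):

```python
def output_resolution(input_hw, layer_spec):
    """Track spatial resolution through the F-CNN layer string.

    Convolutions are assumed to preserve resolution ('same' padding,
    stride 1); each fractionally-strided layer (T) doubles it.
    """
    h, w = input_hw
    for token in layer_spec.split("-"):
        if token.startswith("T"):
            h, w = h * 2, w * 2
    return h, w

# DME features for a 480x640 input, assumed to be at 1/4 resolution:
dme_hw = (480 // 4, 640 // 4)
fcnn = "CR(64,9)-CR(32,7)-TR(32)-CR(16,5)-TR(16)-C(1,1)"
print(output_resolution(dme_hw, fcnn))  # two T layers -> back to (480, 640)
```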

**How are these parts connected?**

Answer: the contextual information obtained by LCE and GCE is combined with the output of DME using the Fusion-CNN (F-CNN).

**How is the network trained and evaluated?**

The loss for training DME and F-CNN is defined as follows:

L_T = L_E + λa · L_A

L_E = (1 / (W·H)) · Σ_{w=1..W} Σ_{h=1..H} ‖φ(X)_{w,h} − Y_{w,h}‖₂

L_A = −log(φD(φ(X)))
Here L_T is the overall loss, L_E is the pixel-wise Euclidean loss between the estimated density map and its corresponding ground truth, λa is a weighting factor, L_A is the adversarial loss, X is the input image of dimensions W × H, Y is the ground-truth density map, φ is the network consisting of DME and F-CNN, and φD is the discriminator sub-network used to compute the adversarial loss. The discriminator sub-network has the structure CP(64)-CP(128)-M-CP(256)-M-CP(256)-CP(256)-M-C(1)-Sigmoid, where C is a convolutional layer, P is a PReLU layer and M is a max-pooling layer.
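The combined objective can be sketched in numpy. The λa value and discriminator score below are illustrative, and L_E is instantiated here as the mean squared pixel difference (one common form of the pixel-wise Euclidean loss):

```python
import numpy as np

def euclidean_loss(pred, gt):
    """Pixel-wise Euclidean loss L_E, averaged over the W x H map
    (instantiated as mean squared pixel error)."""
    return np.mean((pred - gt) ** 2)

def adversarial_loss(disc_score):
    """Generator-side adversarial term L_A = -log(phi_D(phi(X))),
    where disc_score is the discriminator's output in (0, 1]."""
    return -np.log(disc_score)

def total_loss(pred, gt, disc_score, lambda_a=1e-3):
    """Overall loss L_T = L_E + lambda_a * L_A (lambda_a is illustrative)."""
    return euclidean_loss(pred, gt) + lambda_a * adversarial_loss(disc_score)

pred = np.ones((4, 4)) * 0.5
gt   = np.ones((4, 4))
print(total_loss(pred, gt, disc_score=0.8))
```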

**Keep on fighting!**
