Crowd Counting: Cross-scene Crowd Counting via Deep Convolutional Neural Networks

This paper proposes a deep convolutional neural network (CNN) that performs cross-scene crowd counting through two related learning objectives: crowd density and crowd count. The model is trained with a switchable learning process in which the two different but related targets, the crowd density map and the crowd count, alternately assist each other to reach a better local optimum. To bridge the domain gap between different scenes, a nonparametric fine-tuning scheme is designed so that the pre-trained CNN model can adapt to unseen target scenes. In addition, a new dataset named WorldExpo'10 is introduced, which is currently the largest dataset for evaluating crowd counting algorithms.


**Goal:**
The paper proposes a deep CNN with two related learning objectives: crowd density and crowd count.

**Contribution:**

  1. Our CNN model is trained for crowd scenes by a switchable learning process with two learning objectives, crowd density maps and crowd counts. The two different but related objectives can alternately assist each other to obtain better local optima.
  2. The target scenes require no extra labels in our framework for cross-scene counting. The pre-trained CNN model is fine-tuned for each target scene to overcome the domain gap between different scenes. The fine-tuned model is specifically adapted to the new target scene.
  3. The framework does not rely on foreground segmentation results, because only appearance information is used. Whether or not the crowd is moving, its texture is captured by the CNN model and a reasonable count can be obtained.
  4. We also introduce a new dataset named WorldExpo’10 for evaluating cross-scene crowd counting methods. To the best of our knowledge, this is the largest dataset for evaluating crowd counting algorithms.

**Architecture:**
[Figure: overall architecture of the crowd CNN model]
The main objective for our crowd CNN model is to learn a mapping F : X → D, where X is the set of low-level features extracted from training images and D is the crowd density map of the image.
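To make the mapping concrete, below is a minimal sketch of how a ground-truth density map D can be built from point annotations of pedestrian heads. The function name `density_map`, the isotropic Gaussian kernel, and the sigma value are illustrative assumptions; the paper additionally uses a perspective-aware person model rather than a single Gaussian per head.

```python
# Minimal sketch of ground-truth density map generation (assumed setup:
# heads annotated as (x, y) pixel coordinates, one isotropic Gaussian per
# head normalized so that each person contributes exactly 1 to the map).
import numpy as np

def density_map(shape, head_points, sigma=4.0):
    """Return an H x W map whose sum equals the number of annotated heads."""
    h, w = shape
    dmap = np.zeros((h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]          # coordinate grid, reused for every head
    for (px, py) in head_points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
        g /= g.sum()                      # each person adds exactly 1 to the map
        dmap += g
    return dmap

# Example: two annotated heads -> a 72x72 density map summing to ~2
dm = density_map((72, 72), [(20, 30), (50, 40)])
print(round(float(dm.sum()), 3))          # ~2.0
```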

**Training:**

Training set:

Perspective normalization is necessary to estimate the pedestrian scales. Patches randomly selected from the training images are treated as training samples, and the density maps of corresponding patches are treated as the ground truth for the crowd CNN model.

The input consists of image patches cropped from the training images. In order to obtain pedestrians at similar scales, the size of each patch at different locations is chosen according to the perspective value of its center pixel.

Here we constrain each patch to cover a 3-meter by 3-meter square in the actual scene as shown in Figure 3. Then the patches are warped to 72 pixels by 72 pixels as the input of the Crowd CNN model.
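Below is a minimal sketch of this perspective-normalized cropping step, assuming a per-pixel perspective map `pmap` whose value at a pixel is the number of pixels per meter there; the names `pmap` and `crop_patch` are illustrative, not from the paper.

```python
# Sketch: crop a patch covering a fixed physical extent around a center
# pixel, then warp it to the fixed network input size (72 x 72).
import cv2

def crop_patch(image, pmap, cx, cy, meters=3.0, out_size=72):
    """Crop a square covering `meters` x `meters` around (cx, cy) and warp it."""
    side = int(round(meters * pmap[cy, cx]))     # physical extent -> pixel extent
    half = max(1, side // 2)
    h, w = image.shape[:2]                       # image: H x W x 3 numpy array
    # Clamp the crop window to the image borders.
    x0, x1 = max(0, cx - half), min(w, cx + half)
    y0, y1 = max(0, cy - half), min(h, cy + half)
    patch = image[y0:y1, x0:x1]
    return cv2.resize(patch, (out_size, out_size))
```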

Training target:

The two loss functions, for the density map and for the crowd count, are both Euclidean losses:

$$L_D(\Theta) = \frac{1}{2N}\sum_{i=1}^{N}\left\lVert F_d(X_i;\Theta) - D_i\right\rVert_2^2, \qquad L_Y(\Theta) = \frac{1}{2N}\sum_{i=1}^{N}\left\lVert F_y(X_i;\Theta) - Y_i\right\rVert_2^2$$

where $\Theta$ denotes the network parameters, $X_i$ is the $i$-th training patch, $D_i$ is its ground-truth density map, and $Y_i$ is its ground-truth count.
Training process:
[Figure: switchable training process — training alternates between the density-map objective and the count objective, each assisting the other toward a better local optimum]
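As a rough illustration of the switchable learning idea, the PyTorch-style sketch below alternates between the density-map loss and the count loss every epoch. The network layers, the switch period, and the optimizer settings are assumptions made for illustration, not the paper's exact configuration.

```python
# Switchable learning sketch: alternate between the density-map objective
# (L_D) and the count objective (L_Y) every epoch.
import torch
import torch.nn as nn

class CrowdCNN(nn.Module):
    """Illustrative stand-in for the crowd CNN, with a density head and a count head."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 7, padding=3), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(),
        )
        self.density_head = nn.Conv2d(32, 1, 1)      # per-pixel density map
        self.count_head = nn.Linear(32, 1)           # scalar crowd count

    def forward(self, x):
        f = self.features(x)
        density = self.density_head(f)
        count = self.count_head(f.mean(dim=(2, 3)))  # global average pooling
        return density, count

model = CrowdCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

def train(loader, epochs=10):
    for epoch in range(epochs):
        use_density = (epoch % 2 == 0)               # switch the objective each epoch
        for patches, gt_density, gt_count in loader:
            pred_density, pred_count = model(patches)
            if use_density:
                # gt_density is assumed resized to the network's output resolution
                loss = mse(pred_density, gt_density)  # density-map objective L_D
            else:
                loss = mse(pred_count, gt_count)      # count objective L_Y
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```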

**Cross-scene Crowd Counting:**

In order to bridge the distribution gap between the training and test scenes, we design a nonparametric fine-tuning scheme to adapt our pre-trained CNN model to unseen target scenes.

Given a target video from an unseen scene, samples with similar properties are retrieved from the training scenes and added to the training data to fine-tune the crowd CNN model. The retrieval task consists of two steps: candidate scene retrieval and local patch retrieval.

Two steps: (a) candidate scene retrieval, by matching the perspective maps of the training scenes against that of the test scene; (b) local patch retrieval, where patches similar to those in the test scene are retrieved from the candidate scenes.
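A hedged sketch of the two retrieval steps is given below. The similarity measures (L2 distance between perspective maps resized to a common resolution, L2 distance between patch descriptors) are illustrative assumptions; the paper's actual descriptors and thresholds are not reproduced here.

```python
# Sketch of the nonparametric fine-tuning data selection:
# (a) rank training scenes by perspective-map similarity,
# (b) within the candidate scenes, rank patches by descriptor similarity.
import numpy as np

def retrieve_candidate_scenes(target_pmap, train_pmaps, top_k=20):
    """Step (a): return indices of training scenes closest to the target scene."""
    # All perspective maps are assumed resized to a common resolution.
    dists = [np.linalg.norm(target_pmap - p) for p in train_pmaps]
    return np.argsort(dists)[:top_k]

def retrieve_local_patches(target_desc, candidate_descs, top_k=400):
    """Step (b): return indices of candidate patches closest to a target patch descriptor."""
    dists = np.linalg.norm(candidate_descs - target_desc[None, :], axis=1)
    return np.argsort(dists)[:top_k]
```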
