Crowd Counting: Cross-scene Crowd Counting via Deep Convolutional Neural Networks

This paper proposes a deep convolutional neural network (CNN) that performs cross-scene crowd counting through two related learning objectives: crowd density and crowd count. The model is trained with a switchable learning process in which the two different but related targets, the crowd density map and the crowd count, alternately assist each other to reach a better local optimum. To bridge the domain gap between different scenes, a nonparametric fine-tuning scheme is designed so that the pre-trained CNN model can adapt to unseen target scenes. In addition, a new dataset named WorldExpo'10 is introduced, which is currently the largest dataset for evaluating crowd counting algorithms.


**Goal:**
The paper proposes a deep CNN with two related learning objectives: crowd density and crowd count.

**Contribution:**

  1. Our CNN model is trained for crowd scenes by a switchable learning process with two learning objectives, crowd density maps and crowd counts. The two different but related objectives can alternately assist each other to obtain better local optima.
  2. The target scenes require no extra labels in our framework for cross-scene counting. The pre-trained CNN model is fine-tuned for each target scene to overcome the domain gap between different scenes. The fine-tuned model is specifically adapted to the new target scene.
  3. The framework does not rely on foreground segmentation results, because only appearance information is used. Whether or not the crowd is moving, its texture is captured by the CNN model and a reasonable count can be obtained.
  4. We also introduce a new dataset named WorldExpo’10 for evaluating cross-scene crowd counting methods. To the best of our knowledge, this is the largest dataset for evaluating crowd counting algorithms.

**Architecture:**
[Figure: overall architecture of the crowd CNN model]
The main objective for our crowd CNN model is to learn a mapping F : X → D, where X is the set of low-level features extracted from training images and D is the crowd density map of the image.
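To make the mapping concrete, below is a minimal sketch of how a ground-truth density map D can be built from point annotations of pedestrian heads. The function name `density_map`, the isotropic Gaussian kernel, and the sigma value are illustrative assumptions; the paper additionally uses a perspective-aware person model rather than a single Gaussian per head.

```python
# Minimal sketch of ground-truth density map generation (assumed setup:
# heads annotated as (x, y) pixel coordinates, one isotropic Gaussian per
# head normalized so that each person contributes exactly 1 to the map).
import numpy as np

def density_map(shape, head_points, sigma=4.0):
    """Return an H x W map whose sum equals the number of annotated heads."""
    h, w = shape
    dmap = np.zeros((h, w), dtype=np.float32)
    ys, xs = np.mgrid[0:h, 0:w]          # coordinate grid, reused for every head
    for (px, py) in head_points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
        g /= g.sum()                      # each person adds exactly 1 to the map
        dmap += g
    return dmap

# Example: two annotated heads -> a 72x72 density map summing to ~2
dm = density_map((72, 72), [(20, 30), (50, 40)])
print(round(float(dm.sum()), 3))          # ~2.0
```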

**Training:**

Training set:

Perspective normalization is necessary to estimate the pedestrian scales. Patches randomly selected from the training images are treated as training samples, and the density maps of corresponding patches are treated as the ground truth for the crowd CNN model.

The input consists of image patches cropped from the training images. In order to obtain pedestrians at similar scales, the size of each patch at different locations is chosen according to the perspective value of its center pixel.

Here we constrain each patch to cover a 3-meter by 3-meter square in the actual scene as shown in Figure 3. Then the patches are warped to 72 pixels by 72 pixels as the input of the Crowd CNN model.
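Below is a minimal sketch of this perspective-normalized cropping step, assuming a per-pixel perspective map `pmap` whose value at a pixel is the number of pixels per meter there; the names `pmap` and `crop_patch` are illustrative, not from the paper.

```python
# Sketch: crop a patch covering a fixed physical extent around a center
# pixel, then warp it to the fixed network input size (72 x 72).
import cv2

def crop_patch(image, pmap, cx, cy, meters=3.0, out_size=72):
    """Crop a square covering `meters` x `meters` around (cx, cy) and warp it."""
    side = int(round(meters * pmap[cy, cx]))     # physical extent -> pixel extent
    half = max(1, side // 2)
    h, w = image.shape[:2]                       # image: H x W x 3 numpy array
    # Clamp the crop window to the image borders.
    x0, x1 = max(0, cx - half), min(w, cx + half)
    y0, y1 = max(0, cy - half), min(h, cy + half)
    patch = image[y0:y1, x0:x1]
    return cv2.resize(patch, (out_size, out_size))
```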

Training target:

The two loss functions, for the density map and for the crowd count, are both Euclidean losses:

$$L_D(\Theta) = \frac{1}{2N}\sum_{i=1}^{N}\left\lVert F_d(X_i;\Theta) - D_i\right\rVert_2^2, \qquad L_Y(\Theta) = \frac{1}{2N}\sum_{i=1}^{N}\left\lVert F_y(X_i;\Theta) - Y_i\right\rVert_2^2$$

where $\Theta$ denotes the network parameters, $X_i$ is the $i$-th training patch, $D_i$ is its ground-truth density map, and $Y_i$ is its ground-truth count.
Training process:
[Figure: switchable training process — training alternates between the density-map objective and the count objective, each assisting the other toward a better local optimum]
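As a rough illustration of the switchable learning idea, the PyTorch-style sketch below alternates between the density-map loss and the count loss every epoch. The network layers, the switch period, and the optimizer settings are assumptions made for illustration, not the paper's exact configuration.

```python
# Switchable learning sketch: alternate between the density-map objective
# (L_D) and the count objective (L_Y) every epoch.
import torch
import torch.nn as nn

class CrowdCNN(nn.Module):
    """Illustrative stand-in for the crowd CNN, with a density head and a count head."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 7, padding=3), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(),
        )
        self.density_head = nn.Conv2d(32, 1, 1)      # per-pixel density map
        self.count_head = nn.Linear(32, 1)           # scalar crowd count

    def forward(self, x):
        f = self.features(x)
        density = self.density_head(f)
        count = self.count_head(f.mean(dim=(2, 3)))  # global average pooling
        return density, count

model = CrowdCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
mse = nn.MSELoss()

def train(loader, epochs=10):
    for epoch in range(epochs):
        use_density = (epoch % 2 == 0)               # switch the objective each epoch
        for patches, gt_density, gt_count in loader:
            pred_density, pred_count = model(patches)
            if use_density:
                # gt_density is assumed resized to the network's output resolution
                loss = mse(pred_density, gt_density)  # density-map objective L_D
            else:
                loss = mse(pred_count, gt_count)      # count objective L_Y
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```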

**Cross-scene Crowd Counting:**

In order to bridge the distribution gap between the training and test scenes, we design a nonparametric fine-tuning scheme to adapt our pre-trained CNN model to unseen target scenes.

Given a target video from an unseen scene, samples with similar properties are retrieved from the training scenes and added to the training data to fine-tune the crowd CNN model. The retrieval task consists of two steps: candidate scene retrieval and local patch retrieval.

Two steps: (a) candidate scene retrieval, by matching the perspective maps of the training scenes against that of the test scene; (b) local patch retrieval, where patches similar to those in the test scene are retrieved from the candidate scenes.
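A hedged sketch of the two retrieval steps is given below. The similarity measures (L2 distance between perspective maps resized to a common resolution, L2 distance between patch descriptors) are illustrative assumptions; the paper's actual descriptors and thresholds are not reproduced here.

```python
# Sketch of the nonparametric fine-tuning data selection:
# (a) rank training scenes by perspective-map similarity,
# (b) within the candidate scenes, rank patches by descriptor similarity.
import numpy as np

def retrieve_candidate_scenes(target_pmap, train_pmaps, top_k=20):
    """Step (a): return indices of training scenes closest to the target scene."""
    # All perspective maps are assumed resized to a common resolution.
    dists = [np.linalg.norm(target_pmap - p) for p in train_pmaps]
    return np.argsort(dists)[:top_k]

def retrieve_local_patches(target_desc, candidate_descs, top_k=400):
    """Step (b): return indices of candidate patches closest to a target patch descriptor."""
    dists = np.linalg.norm(candidate_descs - target_desc[None, :], axis=1)
    return np.argsort(dists)[:top_k]
```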
