READING NOTE: Face Detection with End-to-End Integration of a ConvNet and a 3D Model

Joshua_Li_

License: CC 4.0 BY-SA

Category: Computer Vision

Article link: https://blog.youkuaiyun.com/joshua_1988/article/details/52705384

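This paper presents a simple yet effective method that integrates a convolutional neural network (ConvNet) with a 3D model end to end and uses a multi-task loss for face detection. It addresses two limitations of adapting Faster R-CNN to face detection in the wild: it eliminates the heuristic design of anchor boxes by leveraging a 3D model, and it replaces the generic, predefined RoI pooling with configuration pooling, which exploits the underlying structural configuration of the object.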

TITLE: Face Detection with End-to-End Integration of a ConvNet and a 3D Model

AUTHORS: Yunzhu Li, Benyuan Sun, Tianfu Wu, Yizhou Wang

AFFILIATIONS: Peking University, North Carolina State University

FROM: arXiv:1606.00850

CONTRIBUTIONS

It presents a simple yet effective method that integrates a ConvNet and a 3D model in end-to-end learning with a multi-task loss, used for face detection in the wild.
It addresses two limitations in adapting the state-of-the-art Faster R-CNN to face detection: it eliminates the heuristic design of anchor boxes by leveraging a 3D model, and it replaces the generic, predefined RoI pooling with a configuration pooling that exploits the underlying structural configuration of the object.
It obtains very competitive state-of-the-art performance on the FDDB and AFW benchmarks.

METHOD

The main inference pipeline is shown in the following figure.

The input image is fed into a ConvNet (e.g. VGG) followed by an upsampling layer. The network then generates face proposals and scores each one by summing the log probabilities of its keypoints, whose locations are predicted from the predefined 3D face model.
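The scoring step can be illustrated with a minimal sketch. It assumes the network outputs one probability map per keypoint label and that the keypoint locations of a proposal are already given as integer pixel coordinates; the function name `proposal_score` and the `heatmaps[k][y][x]` layout are hypothetical and not taken from the paper.

```python
import math


def proposal_score(heatmaps, keypoints, eps=1e-12):
    """Score a face proposal as the sum of keypoint log-probabilities.

    heatmaps[k][y][x]: assumed probability that pixel (x, y) is keypoint k.
    keypoints: list of (x, y) pixel locations, one per keypoint label,
               e.g. obtained by projecting the 3D face model.
    """
    score = 0.0
    for k, (x, y) in enumerate(keypoints):
        # Clamp to eps so a zero probability does not raise a math error.
        score += math.log(max(heatmaps[k][y][x], eps))
    return score


# Toy example: two keypoint labels on a 2x2 map.
heatmaps = [
    [[0.5, 0.1], [0.1, 0.3]],
    [[0.2, 0.25], [0.05, 0.5]],
]
score = proposal_score(heatmaps, [(0, 0), (1, 0)])
```

Summing log probabilities is equivalent to multiplying the per-keypoint probabilities, so a proposal scores highly only when every projected keypoint lands on a confident response.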

some details

The loss of keypoint labels is defined as

$L_{cls}(\omega) = -{1 \over 2m} \sum_{i=1}^{2m} \log(p_{l_i}^{\mathbf{x}_i})$

where $\omega$ stands for the learnable weights of the ConvNet, $m$ is the number of keypoints, and $p_{l_i}^{\mathbf{x}_i}$ is the predicted probability that the point at location $\mathbf{x}_i$ belongs to label $l_i$. Both $\mathbf{x}_i$ and $l_i$ are obtained from the annotations.
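As a rough sketch of this formula, the loss is the mean negative log-probability of the annotated label over the sampled points. The function name `keypoint_cls_loss` and the `probs[i][c]` layout are assumptions made for illustration only.

```python
import math


def keypoint_cls_loss(probs, labels):
    """Mean negative log-likelihood of the annotated labels.

    probs[i][c]: predicted probability that sampled point i has label c.
    labels[i]:   annotated label l_i of sampled point i.
    The number of points n plays the role of 2m in the formula.
    """
    n = len(labels)
    return -sum(math.log(probs[i][labels[i]]) for i in range(n)) / n


# Toy example with two sampled points and two labels.
probs = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 1]
loss = keypoint_cls_loss(probs, labels)
```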
The loss of keypoint locations is defined as

$L_{loc}^{pt}(\omega) = {1 \over m^2} \sum_{i=1}^m \sum_{j=1}^m \sum_{t \in \{x,y\}} \mathrm{Smooth}(t_i - \hat{t}_{i,j})$

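where $\mathrm{Smooth}(\cdot)$ is the smooth $l_1$ loss, $t_i$ is a ground-truth coordinate of keypoint $i$, and $\hat{t}_{i,j}$ is the corresponding predicted coordinate. For each ground-truth keypoint, a set of predicted keypoints can be generated from the 3D face model and the 3D transformation parameters. If each face has $m$ keypoints, $m$ sets of predicted keypoints are generated, so $m$ locations are predicted for every keypoint; $\hat{t}_{i,j}$ can be read as the prediction for keypoint $i$ in the $j$-th set.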
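The double sum over keypoints and predicted sets can be sketched as follows. The smooth $l_1$ function uses the usual Fast R-CNN form ($0.5d^2$ for $|d| < 1$, otherwise $|d| - 0.5$); the function names and the `pred[j][i]` indexing (keypoint $i$ in the $j$-th predicted set) are assumptions chosen for illustration.

```python
def smooth_l1(d):
    """Smooth l1 loss: quadratic near zero, linear elsewhere."""
    a = abs(d)
    return 0.5 * d * d if a < 1 else a - 0.5


def keypoint_loc_loss(gt, pred):
    """Average smooth l1 distance between ground truth and all predictions.

    gt[i]:      (x, y) ground-truth location of keypoint i.
    pred[j][i]: (x, y) location of keypoint i in the j-th predicted set
                (assumed layout).
    """
    m = len(gt)
    total = 0.0
    for i in range(m):
        for j in range(m):
            for t in (0, 1):  # t = 0 is x, t = 1 is y
                total += smooth_l1(gt[i][t] - pred[j][i][t])
    return total / (m * m)


# Toy example with m = 2 keypoints, so two predicted sets.
gt = [(0, 0), (2, 2)]
pred = [
    [(0, 0), (2, 2)],    # set 0: perfect predictions
    [(0.5, 0), (2, 4)],  # set 1: small and large errors
]
loss = keypoint_loc_loss(gt, pred)
```

In the toy example the error of 0.5 falls in the quadratic regime and contributes 0.125, while the error of 2 falls in the linear regime and contributes 1.5.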
The Configuration Pooling Layer is similar to the RoI Pooling Layer in Faster R-CNN. The difference is that features are extracted according to the locations and relations of the keypoints, rather than from a predefined receptive field.
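One plausible reading of this idea, given here only as a sketch, is to gather the feature vector at each keypoint location and concatenate the vectors in a fixed keypoint order, so that the pooled descriptor follows the face configuration instead of a regular grid. The function name `configuration_pool` and the `feature_map[c][y][x]` layout are hypothetical, and the paper's actual layer may differ in its details.

```python
def configuration_pool(feature_map, keypoints):
    """Concatenate per-keypoint feature vectors in keypoint order.

    feature_map[c][y][x]: feature value of channel c at pixel (x, y).
    keypoints: (x, y) locations in a fixed label order.
    Returns a flat list of length len(keypoints) * number_of_channels.
    """
    pooled = []
    for x, y in keypoints:
        # Read every channel at this keypoint location.
        pooled.extend(channel[y][x] for channel in feature_map)
    return pooled


# Toy example: 2 channels on a 2x2 map, 2 keypoints.
feature_map = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
]
descriptor = configuration_pool(feature_map, [(1, 0), (0, 1)])
```

Because the output length depends only on the number of keypoints and channels, proposals of any size map to a fixed-length descriptor, which is the same role RoI pooling plays in Faster R-CNN.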