- paper
- project page
- code
- 1st place in the ImageNet Scene Parsing Challenge 2016
Parsing overview
- Datasets
- FCN is the baseline model for deep learning based parsing
- Research line 1: multi-scale feature ensembling
- Research line 2: structure prediction
- Some prior works use global image-level information for scene understanding
Intuition
- FCN-based methods suffer from mismatched relationships, confusion categories, and inconspicuous classes
- Global average pooling fuses different stuff into a single vector and may lose spatial relations. Global context information combined with sub-region context may be more helpful.
Model

- Pyramid pooling: bin sizes of 1x1, 2x2, 3x3, and 6x6; both max and average pooling are compared
- Auxiliary loss in ResNet-101
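The pyramid pooling module above can be sketched in PyTorch. This is a minimal sketch, not the authors' released code: the class name `PyramidPooling` and the per-branch channel count are my assumptions; the bin sizes (1, 2, 3, 6), average pooling, dimension reduction after pooling, and concatenation with the input feature map follow the notes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Sketch of a pyramid pooling module with bin sizes 1x1, 2x2, 3x3, 6x6."""

    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        # Reduce each pooled branch's dimension (helpful per the ablation study);
        # splitting channels evenly across branches is an illustrative choice.
        out_ch = in_ch // len(bins)
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),  # average pooling beat max in ablations
                nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        # Upsample every pooled branch back to the input resolution,
        # then concatenate with the original feature map.
        feats = [x] + [
            F.interpolate(stage(x), size=(h, w),
                          mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        return torch.cat(feats, dim=1)
```

With four bins and even splitting, the output has twice the input channels (the original map plus four reduced branches).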
Experiment
- Datasets: ImageNet scene parsing (ADE20K), PASCAL VOC 2012, Cityscapes
- Settings: poly learning rate, augmentation, momentum = 0.9 and weight decay = 0.0001.
- A large crop size gives better performance (consistent with our ResNet experiments)
- Batch size in batch normalization layer is important.
- Ablation study of settings
- Average pooling is better than max
- Pyramid is better than global pooling
  - Dimension reduction after pooling and before concatenation is helpful
- Ablation study of auxiliary loss
  - \alpha = 0.4 yields the best performance
- Ablation study for ResNet depth
  - The deeper, the better
  - All ResNet models are pre-trained on ImageNet
- Experiment on PASCAL VOC 2012
- 10,582, 1,449 and 1,456 images for training, validation and testing
  - Top accuracy on all classes without MS COCO pre-training; top on most classes with MS COCO pre-training
- Cityscapes
  - 2,975, 500, and 1,525 images for training, validation, and testing; 19 categories containing both stuff and objects
- 20,000 coarsely annotated images, can be used for training.
  - Outperforms other methods by a notable margin (see project page for statistics)
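The "poly" learning-rate schedule from the experiment settings decays the base rate as a power of remaining progress. A minimal sketch (the function name `poly_lr` is mine; power = 0.9 is the value commonly used with this schedule):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' schedule: lr = base_lr * (1 - cur_iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power
```

The rate starts at `base_lr`, decays smoothly, and reaches zero exactly at `max_iter`; in training this would be applied to the optimizer's learning rate each iteration.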
Summary: this paper presents a pyramid-pooling-based scene parsing method that achieves excellent results on multiple benchmark datasets. It uses pooling at different scales to capture global context information, and an auxiliary loss to further improve model performance. Experiments show it outperforms comparable methods on ADE20K, PASCAL VOC 2012, and Cityscapes.