- paper
- project page
- code
- 1st place in the ImageNet Scene Parsing Challenge 2016
Parsing overview
- Datasets
- FCN is the baseline model for deep learning based parsing
- Research line 1: multi-scale feature ensembling
- Research line 2: structure prediction
- Some prior works use global image-level information for scene understanding
Intuition
- FCN-based methods suffer from mismatched relationships, confusion categories, and inconspicuous classes
- Global average pooling fuses different stuff into a single vector and may lose spatial relations. Global context information combined with sub-region context may be more helpful.
Model

- Pyramid pooling: bin sizes of 1x1, 2x2, 3x3, and 6x6; both max and average pooling are compared
- Auxiliary loss in ResNet-101
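The pyramid pooling module above can be sketched in PyTorch. This is a minimal sketch, not the authors' released code: the class name `PyramidPooling` and the per-branch channel count are my assumptions; the bin sizes (1, 2, 3, 6), average pooling, dimension reduction after pooling, and concatenation with the input feature map follow the notes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Sketch of a pyramid pooling module with bin sizes 1x1, 2x2, 3x3, 6x6."""

    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        # Reduce each pooled branch's dimension (helpful per the ablation study);
        # splitting channels evenly across branches is an illustrative choice.
        out_ch = in_ch // len(bins)
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(b),  # average pooling beat max in ablations
                nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for b in bins
        ])

    def forward(self, x):
        h, w = x.shape[2:]
        # Upsample every pooled branch back to the input resolution,
        # then concatenate with the original feature map.
        feats = [x] + [
            F.interpolate(stage(x), size=(h, w),
                          mode="bilinear", align_corners=False)
            for stage in self.stages
        ]
        return torch.cat(feats, dim=1)
```

With four bins and even splitting, the output has twice the input channels (the original map plus four reduced branches).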
Experiment
- Datasets: ImageNet scene parsing (ADE20K), PASCAL VOC 2012, Cityscapes
- Settings: poly learning rate, augmentation, momentum = 0.9 and weight decay = 0.0001.
- A large crop size gives better performance (consistent with our ResNet experiments)
- Batch size in batch normalization layer is important.
- Ablation study of settings
- Average pooling is better than max
- Pyramid is better than global pooling
  - Dimension reduction after pooling and before concatenation is helpful
- Ablation study of auxiliary loss
  - \alpha = 0.4 yields the best performance
- Ablation study for ResNet depth
  - The deeper, the better
  - All ResNet models are pre-trained on ImageNet
- Experiment on PASCAL VOC 2012
- 10,582, 1,449 and 1,456 images for training, validation and testing
  - Top accuracy on all classes without MS COCO pre-training; top on most classes with MS COCO pre-training
- Cityscapes
  - 2,975, 500, and 1,525 images for training, validation, and testing; 19 categories containing both stuff and objects
- 20,000 coarsely annotated images, can be used for training.
  - Outperforms other methods by a notable margin (see project page for statistics)
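The "poly" learning-rate schedule from the experiment settings decays the base rate as a power of remaining progress. A minimal sketch (the function name `poly_lr` is mine; power = 0.9 is the value commonly used with this schedule):

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' schedule: lr = base_lr * (1 - cur_iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power
```

The rate starts at `base_lr`, decays smoothly, and reaches zero exactly at `max_iter`; in training this would be applied to the optimizer's learning rate each iteration.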
Summary: this paper presents a pyramid-pooling-based scene parsing method that achieves excellent results on multiple benchmark datasets. It uses pooling at different scales to capture global context information, and an auxiliary loss to further improve model performance. Experiments show it outperforms comparable methods on ADE20K, PASCAL VOC 2012, and Cityscapes.