This is one of a series of notes on reading visual question answering papers. The post is a bit long; please read it patiently, and it should prove rewarding. If anything is lacking, comments and discussion are always welcome.
1. Paper Abstract
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be finetuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0.
In brief, the paper proposes a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a single shared multi-layer Transformer network for both encoding and decoding, unlike many existing methods that implement the encoder and decoder as separate models. The unified VLP model is pre-trained on a large number of image-text pairs with the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ only in the context that the prediction conditions on, which is controlled by applying task-specific self-attention masks to the shared Transformer. The figure below shows the unified encoder-decoder model the authors propose for general vision-language pre-training.
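To make the role of the self-attention masks concrete, here is a minimal PyTorch sketch of how the two pre-training objectives could share one Transformer while differing only in the mask. This is my own illustration, not the authors' code; the function name `build_self_attention_mask`, the "1 = may attend" convention, and the region-tokens-before-word-tokens ordering are assumptions for the sketch.

```python
import torch

def build_self_attention_mask(num_regions: int, num_words: int,
                              mode: str = "seq2seq") -> torch.Tensor:
    """Return an (L, L) mask, L = num_regions + num_words, where entry (i, j) = 1
    means token i is allowed to attend to token j in the shared Transformer."""
    total = num_regions + num_words

    if mode == "bidirectional":
        # Bidirectional objective: every token (region or word) sees the full context.
        return torch.ones(total, total)

    # seq2seq objective: region tokens see only region tokens; word tokens see all
    # region tokens plus the words up to and including their own position (causal).
    mask = torch.zeros(total, total)
    mask[:num_regions, :num_regions] = 1                      # region -> region
    mask[num_regions:, :num_regions] = 1                      # word   -> region
    mask[num_regions:, num_regions:] = torch.tril(            # word   -> earlier words
        torch.ones(num_words, num_words)
    )
    return mask

# Example: 4 image regions followed by 3 sentence tokens.
print(build_self_attention_mask(4, 3, mode="seq2seq"))
```

With the seq2seq mask, the image regions act as a fully visible context while each sentence token can only attend to earlier words, which is what lets the same shared Transformer behave like an encoder-decoder during caption generation; the bidirectional mask gives the usual masked-LM style full-context prediction.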
2. Network Architecture
We denote the input image as $I$ and the associated/target sentence description (a sequence of words) as $S$. We use an off-the-shelf object detector to extract a fixed number of $N$ object regions from the image,