[Paper Notes, AAAI 2020] Overcoming Language Priors in VQA via Decomposed Linguistic Representations

License: CC 4.0 BY-SA

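This post discusses the language prior problem in Visual Question Answering (VQA) and introduces a method that mitigates it by decomposing the question. It walks through the model's four key modules (language attention, question identification, object referring, and visual verification) and shows how they let the model predict answers that are grounded more firmly in the image.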

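This post gives a rough walkthrough of the paper's work, touches on recent developments in VQA, and includes some of my own views. Where something is easy to misread, I try to spell it out in detail. For deeper study, I recommend reading the paper itself: https://jingchenchen.github.io/files/papers/2020/AAAI_Decom_VQA.pdf
I sincerely hope this post helps with your research.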

Preface

I'll start by introducing the language prior problem in Visual Question Answering (VQA from here on):

Most existing Visual Question Answering (VQA) models overly rely on superficial correlations between questions and answers.
For example, they may frequently answer “white” for questions about
color, “tennis” for questions about sports, no matter what images are given with the questions.

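In short, given the training questions and images, the model does not learn to answer from the image; it simply leans on the answer distribution. For example, if "white" accounted for 80% of the answers to "what color" questions, the model could answer "white" for any such question without looking at the image at all, and still reach a high accuracy.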
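To make the shortcut concrete, here is a minimal sketch of a "prior-only" baseline that never sees an image. The toy data, the question-type strings, and the function names are all hypothetical and chosen only for illustration; they do not come from the paper or from any VQA dataset.

```python
from collections import Counter, defaultdict

# Hypothetical toy training set of (question_type, answer) pairs.
train = [("what color", "white")] * 8 + [("what color", "red"), ("what color", "blue")]
train += [("what sport", "tennis")] * 6 + [("what sport", "soccer")] * 4


def fit_prior(pairs):
    """Map each question type to its most frequent answer."""
    counts = defaultdict(Counter)
    for qtype, answer in pairs:
        counts[qtype][answer] += 1
    return {qtype: c.most_common(1)[0][0] for qtype, c in counts.items()}


def prior_accuracy(prior, pairs):
    """Accuracy of always answering with the majority answer, ignoring images."""
    hits = sum(1 for qtype, answer in pairs if prior[qtype] == answer)
    return hits / len(pairs)


prior = fit_prior(train)
acc = prior_accuracy(prior, train)
print(prior, acc)
```

On this toy set the image-blind baseline answers 14 of 20 questions correctly, which is why a skewed answer distribution can hide the fact that a model is not looking at the image.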
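Related work: there is already a fair amount of work targeting language priors, for example the NeurIPS 2018 paper Overcoming Language Priors in VQA with Adversarial Regularization and the CVPR paper Don't Just Assume; Look and Answer: Overcoming Priors for VQA, among others. There is also our team's SIGIR 2019 work: Quantifying and Alleviating the Language Prior Problem in VQA. Have a look at these papers if you are interested.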

Paper Overview

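Solving the language prior problem has long been a major difficulty in VQA. This paper approaches it from the question side: building on the Don't Just Assume; Look and Answer: Overcoming Priors for VQA work, it decomposes the question representation to remove the language prior introduced by the question words, and then predicts the answer from the image information. Notably, it differs from the earlier Neural Module Networks, and it can clearly show the process by which the model predicts an answer. Below I split the model's workflow into two parts, Question decomposition and Answer prediction.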

Question decomposition:

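A question-answer pair usually carries three pieces of information: question type, referring object, and expected concept. The authors therefore decompose the question into these three parts.
Because a question may or may not contain the expected concept (in a non-yes/no question the expected concept sits in the answer rather than in the question), the authors handle two cases: yes/no questions and non-yes/no questions.
The resulting question representations are shown in the example figure below.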
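The decomposition can be pictured as three separate attentions over the words of the question, each pooling the word vectors into one representation. The sketch below is a simplified illustration under my own assumptions, not the authors' implementation: the 3-dimensional word vectors, the three query vectors, and the helper names `softmax` and `attend` are all hypothetical, and in the real model these quantities are learned.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(word_vecs, query):
    """Return (attention weights, attention-weighted sum of the word vectors)."""
    scores = [sum(w * q for w, q in zip(vec, query)) for vec in word_vecs]
    weights = softmax(scores)
    dim = len(word_vecs[0])
    pooled = [sum(wt * vec[d] for wt, vec in zip(weights, word_vecs)) for d in range(dim)]
    return weights, pooled


# Hypothetical embeddings for the question "is the shirt white".
words = ["is", "the", "shirt", "white"]
word_vecs = [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]

# Hypothetical queries, one per component (type, object, concept).
queries = {"type": [1.0, 0.0, 0.0], "obj": [0.0, 1.0, 0.0], "con": [0.0, 0.0, 1.0]}

reps, focus = {}, {}
for name, query in queries.items():
    weights, pooled = attend(word_vecs, query)
    reps[name] = pooled                              # stands in for q_type / q_obj / q_con
    focus[name] = words[weights.index(max(weights))] # word with the largest weight

print(focus)
```

With these hand-picked vectors, each attention concentrates on a different word, which is the intuition behind keeping q_type, q_obj, and q_con as separate representations.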

Answer prediction:
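Note that the question type is used only to determine the answer set, i.e. the set of all answers that occur under that type of question. It does not take part directly in the final answer prediction, which is why the language prior is reduced.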

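If the question is a yes/no question, its answer set is {yes, no} and all other answers are discarded. The model then uses q_obj and the image features with up-down attention to locate the relevant region, and finally combines the attended region with q_con for binary classification.
If the question is not a yes/no question, the model first uses q_type to predict the answer set, then applies soft attention between q_obj and the image to obtain the final image representation (the same as in the yes/no case). Finally, it computes a score for each answer in the answer set.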
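The two branches above can be sketched in a few lines. This is only an illustration under stated assumptions: the dot-product scoring, the zero threshold for yes/no, the 3-dimensional region features, and the answer embeddings are hypothetical stand-ins for the learned components, and `attend_regions` and `predict` are names I made up.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_regions(region_feats, q_obj):
    """Soft attention over image regions, guided by the object representation."""
    weights = softmax([dot(r, q_obj) for r in region_feats])
    dim = len(q_obj)
    return [sum(w * r[d] for w, r in zip(weights, region_feats)) for d in range(dim)]


def predict(is_yes_no, q_obj, region_feats, q_con=None, answer_set=None, answer_embs=None):
    v = attend_regions(region_feats, q_obj)
    if is_yes_no:
        # Yes/no branch: compare the attended region with the concept (assumed threshold 0).
        return "yes" if dot(v, q_con) > 0 else "no"
    # Other questions: score only the answers inside the predicted answer set.
    scores = {a: dot(v, answer_embs[a]) for a in answer_set}
    return max(scores, key=scores.get)


# Hypothetical region features: [shirt-ness, white-ness, red-ness].
regions = [[2.0, 1.0, -1.0], [0.0, -1.0, 1.0]]
q_obj = [3.0, 0.0, 0.0]  # "shirt"

ans_white = predict(True, q_obj, regions, q_con=[0.0, 1.0, -1.0])  # "is the shirt white"
ans_red = predict(True, q_obj, regions, q_con=[0.0, -1.0, 1.0])    # "is the shirt red"
ans_open = predict(
    False, q_obj, regions,
    answer_set=["white", "red"],
    answer_embs={"white": [0.0, 1.0, 0.0], "red": [0.0, 0.0, 1.0]},
)
print(ans_white, ans_red, ans_open)
```

Note that the question type never enters the scoring itself; it only decides which branch runs and which answers are eligible.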

Method

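The overview above describes the whole model rather loosely, so here I try to explain in more detail each module the authors designed and how each one is trained. If anything is unclear, take a close look at the figure above.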

First, here is the authors' original description:

The proposed method includes four modules:
(a) a language attention module parses a question into the type representation, the object representation, and the concept representation;
(b) a question identification module uses the type representation to identify the question type and possible answers;
(c) an object referring module uses the object representation to attend to the relevant region of an image;
(d) a visual verification module measures the relevance between the attended region and the concept representation to infer the answer.
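As a rough illustration of the role of module (b), the sketch below restricts scoring to the answers that are possible for a given question type. It is a simplification under my own assumptions: the answer sets here come from a lookup table built from toy training pairs, whereas the quoted description says the module uses the learned type representation. The data and the names `build_answer_sets` and `mask_scores` are hypothetical.

```python
from collections import defaultdict


def build_answer_sets(pairs):
    """Collect, for each question type, the answers seen with it in training."""
    sets = defaultdict(set)
    for qtype, answer in pairs:
        sets[qtype].add(answer)
    return dict(sets)


def mask_scores(scores, allowed):
    """Keep scores of allowed answers; push every other answer to -inf."""
    return {a: (s if a in allowed else float("-inf")) for a, s in scores.items()}


# Hypothetical training pairs and raw answer scores.
train_pairs = [
    ("what color", "white"), ("what color", "red"),
    ("what sport", "tennis"),
    ("is the", "yes"), ("is the", "no"),
]
answer_sets = build_answer_sets(train_pairs)

raw_scores = {"white": 0.2, "red": 0.9, "tennis": 1.5, "yes": 0.1, "no": 0.0}
masked = mask_scores(raw_scores, answer_sets["what color"])
best = max(masked, key=masked.get)
print(best)
```

Even though "tennis" has the highest raw score here, it is not a possible answer to a "what color" question, so the mask removes it and the choice among the remaining candidates is left to the visual evidence.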