Paper: https://arxiv.org/pdf/1905.09998v3.pdf
Code: https://github.com/jialinwu17/Self_Critical_VQA
Abstract
Visual Question Answering (VQA) deep-learning systems tend to capture superficial statistical correlations in the training data because of strong language priors and fail to generalize to test data with a significantly different question-answer (QA) distribution [1]. To address this issue, we introduce a self-critical training objective that ensures that visual explanations of correct answers match the most influential image regions more than other competitive answer candidates. The influential regions are either determined from human visual/textual explanations or automatically from just significant words in the question and answer. We evaluate our approach on the VQA generalization task using the VQA-CP dataset, achieving a new state-of-the-art, i.e., 49.5% using textual explanations and 48.5% using automatically annotated regions.
Current VQA systems rely on superficial statistical correlations in the training data. To address this, the authors introduce a self-critical training objective that ensures the visual explanation of the correct answer matches the most influential image regions better than that of other competitive candidate answers. The influential regions are determined either from human visual/textual explanations or automatically from the significant words in the question and answer.
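To give a concrete sense of what "influential image region" means here, one common way to score how much each region contributes to a given answer is a gradient-based sensitivity. The sketch below is only an illustration under that assumption: `vqa_model`, `image_feats`, `question`, and `answer_idx` are hypothetical names for a generic PyTorch-style VQA model, not the paper's exact formulation.

```python
import torch

def region_influence(vqa_model, image_feats, question, answer_idx):
    """Score how much each image region influences one answer's probability.

    A gradient-times-feature sensitivity, one common attribution choice;
    the paper may define its sensitivity slightly differently.
    """
    image_feats = image_feats.clone().requires_grad_(True)  # (num_regions, feat_dim)
    logits = vqa_model(image_feats, question)                # (num_answers,)
    prob = torch.softmax(logits, dim=-1)[answer_idx]
    # create_graph=True keeps the graph so the scores can be used inside a loss.
    grads, = torch.autograd.grad(prob, image_feats, create_graph=True)
    # Per-region influence: sum of gradient * feature over the feature dimension.
    return (grads * image_feats).sum(dim=-1)                 # (num_regions,)
```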
1 Contributions
The authors observe that, driven by strong language priors, current VQA models capture superficial statistical correlations in the training data and generalize poorly to test sets with a significantly different question-answer distribution.

This paper presents a new VQA approach: by introducing a self-critical training objective, the model is made to attend to the image regions most critical to the correct answer, avoiding over-reliance on language priors. Several ways of constructing the critical influential regions are studied, including human visual annotations, textual explanations, and the objects mentioned in the question/answer. Experiments show that this strategy achieves new state-of-the-art performance on the VQA-CP task.
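To make the "self-critical" idea concrete, a minimal sketch of the kind of hinge penalty the objective implies is given below: it compares the ground-truth answer's sensitivity to the most influential region against the sensitivities of competitive wrong answers to that same region. The function name and the exact weighting are assumptions for illustration, not the paper's reference implementation.

```python
import torch

def self_critical_penalty(sens_gt, sens_rivals):
    """Hinge-style penalty in the spirit of the self-critical objective.

    sens_gt:     sensitivity of the ground-truth answer to its most
                 influential image region (scalar tensor).
    sens_rivals: sensitivities of competitive wrong answers to that same
                 region, tensor of shape (num_rivals,).

    Penalize any rival answer whose sensitivity to the key region exceeds
    that of the correct answer; details such as answer weighting differ
    in the actual paper.
    """
    return torch.clamp(sens_rivals - sens_gt, min=0).sum()


# The correct answer dominates the key region (no penalty), then a rival does (positive penalty).
print(self_critical_penalty(torch.tensor(0.9), torch.tensor([0.3, 0.5])))  # tensor(0.)
print(self_critical_penalty(torch.tensor(0.2), torch.tensor([0.3, 0.5])))  # tensor(0.4000)
```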