Abstract
Dataset contribution:
- 24,282 QA pairs, based on The Big Bang Theory (TBBT); the questions are knowledge-based.
Model contribution:
- A video understanding model that combines the video's visual and textual content with show-specific knowledge.
Findings:
Our main findings are:
(i) the incorporation of knowledge produces outstanding improvements for VQA in video;
(ii) the performance on KnowIT VQA still lags well behind human accuracy, indicating its usefulness for studying current video modelling limitations.
1 Introduction
Limitations of prior work & the problem this paper addresses:
- Coherence: temporal continuity across the video.
- Knowledge: reasoning that requires information beyond what is shown.
VideoQA work has tried to address (1); KBVQA has addressed (2); but no prior work addresses both together.
Problem addressed by this paper: a general framework in which both video understanding and knowledge-based reasoning are required to answer questions.
Method overview
The problem is cast as a multi-choice challenge, with a two-piece model that
(i) acquires, processes, and maps specific knowledge into a continuous representation, inferring the motivation behind each question;
(ii) fuses video and language content together with the acquired knowledge in a multi-modal fashion to predict the answer.
2 Related Work
3 KnowIT VQA Dataset
QUESTION TYPES:
Four categories:
- visual-based (22%), in which the answer is found in the video frames;
- textual-based (12%), in which the answer is found in the subtitles;
- temporal-based (4%), in which the answer is predictable from the current video clip at a specific time;
- knowledge-based (62%), in which the answer is not in the current clip and requires reasoning across clips.
4 Human Evaluation
5 ROCK Model
The Knowledge Base is annotated by AMT workers, with one knowledge instance per QA pair. A cleaning step first removes duplicates among the ~24k instances, leaving roughly 19k.
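A minimal sketch of such a cleaning step, deduplicating knowledge instances by a normalized text key. The exact normalization used for KnowIT VQA is not specified in these notes, so the details below are assumptions.

```python
# Hypothetical dedup for the annotated knowledge instances.
# Normalization (lowercasing, punctuation stripping) is an assumption.
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def dedupe_knowledge(instances):
    """Keep the first instance for each normalized knowledge string."""
    seen, kept = set(), []
    for k in instances:
        key = normalize(k)
        if key not in seen:
            seen.add(key)
            kept.append(k)
    return kept

knowledge = [
    "Sheldon knocks three times.",
    "sheldon knocks three times",   # duplicate after normalization
    "Leonard is a physicist.",
]
print(len(dedupe_knowledge(knowledge)))  # 2 unique instances remain
```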
Code structure
Three main modules, in execution order:
- [Knowledge Base Construction]: creates a knowledge base using the samples from the dataset.
- [Knowledge Retrieval]: indexes the knowledge base and finds the best instance for a specific question and its candidate answers.
- [Video Reasoning]: the main part, which performs the actual VQA; it uses the information from the video, subtitles and retrieved knowledge to predict the correct answer.
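The retrieval step above can be sketched as an IDF-weighted word-overlap ranking between the question-plus-answers query and each knowledge instance. This is an illustrative assumption; the scoring function actually used in ROCK may differ.

```python
# Hypothetical Knowledge Retrieval sketch: score each knowledge instance
# against the question plus candidate answers by IDF-weighted word overlap.
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def retrieve(question, answers, knowledge_base):
    """Return the knowledge instance with the highest IDF-weighted overlap."""
    query = set(tokenize(question + " " + " ".join(answers)))
    n = len(knowledge_base)
    # document frequency of each word across the knowledge base
    df = Counter(w for k in knowledge_base for w in set(tokenize(k)))
    def score(k):
        return sum(math.log(1 + n / df[w])
                   for w in set(tokenize(k)) if w in query)
    return max(knowledge_base, key=score)

kb = [
    "sheldon always knocks three times on penny's door",
    "leonard plays the cello",
]
best = retrieve("why does sheldon knock three times",
                ["it is a compulsion", "he is polite"], kb)
print(best)
```

The ranking only needs to be good enough to surface one relevant instance, since the Video Reasoning module fuses the retrieved text with the other modalities afterwards.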
The video reasoning module offers 4 choices of visual input:
- image: every frame is fed into a ResNet50 with the final FC layer removed, yielding a 2048-dim feature per frame; an FC layer then maps the resulting $n_f \times 2048$ feature matrix …