Abstract
Dataset contribution:
- 24,282 QA pairs, based on The Big Bang Theory (TBBT); the questions are knowledge-based.
Model contribution:
- A video understanding model that combines the video's visual and textual content with show-specific knowledge.
Findings:
Our main findings are:
(i) the incorporation of knowledge produces outstanding improvements for VQA in video;
(ii) the performance on KnowIT VQA still lags well behind human accuracy, indicating its usefulness for studying current video modelling limitations.
1 Introduction
Limitations of prior work & the problem this paper addresses:
- Coherence: temporal continuity across the video.
- Knowledge: reasoning that requires information beyond what is shown.
VideoQA work has tried to address (1); KBVQA has addressed (2); but no prior work addresses both together.
Problem addressed by this paper: a general framework in which both video understanding and knowledge-based reasoning are required to answer questions.
Method overview
The problem is cast as a multi-choice challenge, with a two-piece model that
(i) acquires, processes, and maps specific knowledge into a continuous representation, inferring the motivation behind each question;
(ii) fuses video and language content together with the acquired knowledge in a multi-modal fashion to predict the answer.
2 Related Work
3 KnowIT VQA Dataset
QUESTION TYPES:
Four categories:
- visual-based (22%), in which the answer is found in the video frames;
- textual-based (12%), in which the answer is found in the subtitles;
- temporal-based (4%), in which the answer is predictable from the current video clip at a specific time;
- knowledge-based (62%), in which the answer is not in the current clip and requires reasoning across clips.
4 Human Evaluation
5 ROCK Model
The Knowledge Base is annotated by AMT workers, with one knowledge instance per QA pair. A cleaning step first removes duplicates among the ~24k instances, leaving roughly 19k.
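A minimal sketch of such a cleaning step, deduplicating knowledge instances by a normalized text key. The exact normalization used for KnowIT VQA is not specified in these notes, so the details below are assumptions.

```python
# Hypothetical dedup for the annotated knowledge instances.
# Normalization (lowercasing, punctuation stripping) is an assumption.
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def dedupe_knowledge(instances):
    """Keep the first instance for each normalized knowledge string."""
    seen, kept = set(), []
    for k in instances:
        key = normalize(k)
        if key not in seen:
            seen.add(key)
            kept.append(k)
    return kept

knowledge = [
    "Sheldon knocks three times.",
    "sheldon knocks three times",   # duplicate after normalization
    "Leonard is a physicist.",
]
print(len(dedupe_knowledge(knowledge)))  # 2 unique instances remain
```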
Code structure
Three main modules, in execution order:
- [Knowledge Base Construction]: creates a knowledge base using the samples from the dataset.
- [Knowledge Retrieval]: indexes the knowledge base and finds the best instance for a specific question and its candidate answers.
- [Video Reasoning]: the main part, which performs the actual VQA; it uses the information from the video, subtitles and retrieved knowledge to predict the correct answer.
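The retrieval step above can be sketched as an IDF-weighted word-overlap ranking between the question-plus-answers query and each knowledge instance. This is an illustrative assumption; the scoring function actually used in ROCK may differ.

```python
# Hypothetical Knowledge Retrieval sketch: score each knowledge instance
# against the question plus candidate answers by IDF-weighted word overlap.
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def retrieve(question, answers, knowledge_base):
    """Return the knowledge instance with the highest IDF-weighted overlap."""
    query = set(tokenize(question + " " + " ".join(answers)))
    n = len(knowledge_base)
    # document frequency of each word across the knowledge base
    df = Counter(w for k in knowledge_base for w in set(tokenize(k)))
    def score(k):
        return sum(math.log(1 + n / df[w])
                   for w in set(tokenize(k)) if w in query)
    return max(knowledge_base, key=score)

kb = [
    "sheldon always knocks three times on penny's door",
    "leonard plays the cello",
]
best = retrieve("why does sheldon knock three times",
                ["it is a compulsion", "he is polite"], kb)
print(best)
```

The ranking only needs to be good enough to surface one relevant instance, since the Video Reasoning module fuses the retrieved text with the other modalities afterwards.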
The video reasoning module offers 4 choices of visual input:
- image: every frame is fed into a ResNet50 with the final FC layer removed, yielding a 2048-dim feature per frame; an FC layer then maps the resulting $n_f \times 2048$ feature matrix …