AAAI KnowIT VQA Reading Notes

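This article introduces KnowIT VQA, a dataset of 24,282 video-based knowledge question-answer pairs intended to advance research on video understanding and knowledge reasoning. The proposed ROCK model predicts answers by combining video content, language information, and specific knowledge, and it performs particularly well on knowledge-based questions. Questions in the dataset are divided into visual, textual, temporal, and knowledge-based types, and 62% of them require knowledge from beyond the current clip. The model consists of three parts: knowledge acquisition, retrieval, and video reasoning.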


Abstract

  1. A new dataset

    1. 24,282 question-answer pairs based on TBBT (The Big Bang Theory). The questions are knowledge-based.
  2. A video understanding model that combines the visual and textual content of the video with specific knowledge

Findings:
Our main findings are:
(i) the incorporation of knowledge produces outstanding improvements for VQA in video
(ii) the performance on KnowIT VQA still lags well behind human accuracy, indicating its usefulness for studying current video modelling limitations.

1 Introduction

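Shortcomings of prior work & the problem this paper addresses

  1. The coherence problem: temporal connections across the video
  2. The knowledge problem
    For 1, VideoQA work has attempted a solution.
    For 2, KBVQA work has addressed it.
    No existing work addresses both at once.

The problem this paper addresses: a general framework in which both video understanding and knowledge-based reasoning are required to answer questions.

The paper proposes a dataset based on TBBT.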

Method overview
cast the problem as a multi-choice challenge,
and introduce a two-piece model that
(i) acquires, processes, and maps specific knowledge into a continuous representation inferring the motivation behind each question
(ii) fuses video and language content together with the acquired knowledge in a multi-modal fashion to predict the answer.
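
As a rough illustration of the multi-choice framing only, the sketch below turns one score per candidate answer into probabilities and picks the highest. The scores are made-up numbers and `predict_answer` is a hypothetical helper; how ROCK actually produces its scores is not covered by these notes.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_answer(candidate_scores):
    """Return the index of the best-scoring candidate and all probabilities."""
    probs = softmax(candidate_scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical scores, one per candidate answer (illustrative values only).
scores = [0.2, 1.5, -0.3, 0.9]
best_index, probs = predict_answer(scores)
print(best_index, [round(p, 3) for p in probs])
```

Framing the task this way means the model only has to rank a fixed set of candidates, so accuracy is simply the fraction of questions whose top-ranked candidate is the correct one.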

2 Related Work

3 KnowIT VQA Dataset

QUESTION TYPE:
Four types

  1. visual-based (22%), in which the answer is found in the video frames,
  2. textual-based (12%), in which the answer is found in the subtitles,
  3. temporal-based (4%), in which the answer is predictable from the current video clip at a specific time,
  4. knowledge-based (62%), in which the current clip does not contain the answer, so answering requires information across clips.
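
To get a feel for the scale of each category, the percentages can be converted into approximate counts. This is a back-of-the-envelope sketch: the shares are rounded figures from the notes, so the resulting counts are estimates and not official statistics of the dataset.

```python
TOTAL_QA_PAIRS = 24282

# Rounded shares as listed in the notes.
QUESTION_TYPE_SHARE = {
    "visual": 0.22,
    "textual": 0.12,
    "temporal": 0.04,
    "knowledge": 0.62,
}


def approximate_counts(total, shares):
    """Estimate the number of questions per type from rounded shares."""
    return {name: round(total * share) for name, share in shares.items()}


counts = approximate_counts(TOTAL_QA_PAIRS, QUESTION_TYPE_SHARE)
for name, count in counts.items():
    print(f"{name}: ~{count}")
```

By this estimate roughly 15,000 questions are knowledge-based, which is why a model that ignores external knowledge is at a disadvantage on most of the dataset.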

4 Human Evaluation

5 ROCK Model

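The knowledge base is annotated by AMT workers, with one knowledge instance per QA pair. The first step is cleaning, i.e. removing duplicates among the roughly 24k instances, which leaves about 19k.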
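
The notes do not say how duplicates are detected, so the following is only a minimal sketch of one possible cleaning step: exact matching after text normalization. The `normalize` and `deduplicate` helpers and the example sentences are assumptions for illustration, not the paper's procedure.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return " ".join(text.split())


def deduplicate(knowledge_instances):
    """Keep the first occurrence of each instance, comparing normalized text."""
    seen = set()
    unique = []
    for instance in knowledge_instances:
        key = normalize(instance)
        if key not in seen:
            seen.add(key)
            unique.append(instance)
    return unique


# Toy knowledge instances written for this example.
raw = [
    "Penny works at the Cheesecake Factory.",
    "penny works at the cheesecake factory",
    "Sheldon and Leonard are roommates.",
]
cleaned = deduplicate(raw)
print(len(raw), "->", len(cleaned))
```

Exact matching like this only catches trivially repeated annotations; near-duplicates that are worded differently would need a similarity measure and a threshold.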

Code structure

The code has three main blocks, run in this order:

  1. [Knowledge Base Construction]: builds the knowledge base. Creates a knowledge base using the samples from the dataset.

  2. [Knowledge Retrieval]: looks up the knowledge base. Accesses the knowledge base and finds the best instance for a specific question and answers.

  3. [Video Reasoning]: performs the VQA itself; this is the main part. Uses the information from the video, subtitles and retrieved knowledge to predict the correct answer.

Video reasoning offers 4 options for the visual representation:
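
To make the retrieval step concrete, here is a minimal sketch that ranks knowledge instances against a question and its candidate answers. It swaps in plain token overlap (Jaccard similarity) as the scoring function purely for illustration; the scoring used in the actual code base is not described in these notes, and `retrieve`, `overlap_score`, and the toy data are hypothetical.

```python
def tokenize(text):
    """Split text into a set of lowercase whitespace-separated tokens."""
    return set(text.lower().split())


def overlap_score(query_tokens, knowledge_text):
    """Jaccard similarity between the query tokens and a knowledge instance."""
    knowledge_tokens = tokenize(knowledge_text)
    union = query_tokens | knowledge_tokens
    if not union:
        return 0.0
    return len(query_tokens & knowledge_tokens) / len(union)


def retrieve(question, answers, knowledge_base, top_k=1):
    """Return the top_k knowledge instances for a question and its answers."""
    query_tokens = tokenize(" ".join([question, *answers]))
    ranked = sorted(
        knowledge_base,
        key=lambda instance: overlap_score(query_tokens, instance),
        reverse=True,
    )
    return ranked[:top_k]


# Toy knowledge base and question written for this example.
knowledge_base = [
    "penny works at the cheesecake factory",
    "sheldon and leonard are roommates",
]
question = "where does penny work"
answers = ["the cheesecake factory", "a bookstore"]
best = retrieve(question, answers, knowledge_base)
print(best)
```

The retrieved instance would then be passed, together with the video and subtitle features, to the video reasoning stage.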

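  • image: each frame is fed into a ResNet50 with its final FC layer removed, yielding a 2048-dimensional feature per frame. An FC layer is then applied to the resulting $n_f \times 2048$ …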