Multimodal Memory Modelling for Video Captioning

This paper proposes a multimodal memory model for automatically translating video clips into natural language sentences. The model stores and retrieves visual and textual content through an external memory, modelling long-range visual-textual dependencies and guiding global visual attention. Experiments show that the method performs strongly on the MSVD and MSR-VTT datasets.


Multimodal Memory Modelling for Video Captioning

Video captioning, which automatically translates video clips into natural language sentences, is a very important task in computer vision. By virtue of recent deep learning technologies, e.g., convolutional neural networks (CNNs) and recurrent neural networks (RNNs), video captioning has made great progress. However, learning an effective mapping from the visual sequence space to the language space is still a challenging problem. In this paper, we propose a Multimodal Memory Model (M3) to describe videos, which builds a visual and textual shared memory to model the long-term visual-textual dependency and further guide global visual attention on described targets. Specifically, the proposed M3 attaches an external memory to store and retrieve both visual and textual contents by interacting with the video and sentence through multiple read and write operations. First, the text representation in the Long Short-Term Memory (LSTM) based text decoder is written into the memory, and the memory contents are read out to guide an attention mechanism to select related visual targets. Then, the selected visual information is written into the memory, which is further read out to the text decoder. To evaluate the proposed model, we perform experiments on two public benchmark datasets: MSVD and MSR-VTT. The experimental results demonstrate that our method outperforms the state-of-the-art methods in terms of BLEU and METEOR.
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:1611.05592 [cs.CV]
  (or arXiv:1611.05592v1 [cs.CV] for this version)

Submission history

From: Junbo Wang
[v1] Thu, 17 Nov 2016 07:24:03 GMT (773kb,D)
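
The read-write cycle described in the abstract can be made concrete with a short sketch. The code below is a minimal illustration, not the authors' implementation: it collapses the paper's external memory with multiple read and write operations into a single gated memory vector, and all class, layer, and dimension names are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalMemoryStep(nn.Module):
    """One decoding step of a simplified multimodal-memory captioner:
    write text state -> read memory to guide attention -> write attended
    visual feature -> read memory into the LSTM text decoder."""

    def __init__(self, vis_dim, hid_dim, mem_dim, vocab_size):
        super().__init__()
        self.decoder = nn.LSTMCell(mem_dim, hid_dim)
        self.write_text = nn.Linear(hid_dim, mem_dim)  # text -> memory write
        self.attn = nn.Linear(vis_dim + mem_dim, 1)    # memory-guided attention
        self.write_vis = nn.Linear(vis_dim, mem_dim)   # visual -> memory write
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, feats, h, c, memory):
        # feats: (B, T, vis_dim) frame features; memory: (B, mem_dim)
        # 1) Write the current text representation into the memory.
        memory = memory + torch.tanh(self.write_text(h))
        # 2) Read the memory to guide attention over the video frames.
        mem_exp = memory.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = self.attn(torch.cat([feats, mem_exp], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)                  # (B, T)
        visual = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # (B, vis_dim)
        # 3) Write the selected visual content into the memory.
        memory = memory + torch.tanh(self.write_vis(visual))
        # 4) Read the memory back into the text decoder and predict a word.
        h, c = self.decoder(memory, (h, c))
        return self.out(h), h, c, memory

if __name__ == "__main__":
    B, T = 4, 26  # batch size and number of sampled frames (arbitrary)
    step = MultimodalMemoryStep(vis_dim=2048, hid_dim=512, mem_dim=512,
                                vocab_size=10000)
    feats = torch.randn(B, T, 2048)
    h = torch.zeros(B, 512)
    c = torch.zeros(B, 512)
    memory = torch.zeros(B, 512)
    logits, h, c, memory = step(feats, h, c, memory)
    print(logits.shape)  # torch.Size([4, 10000])
```

Running this step in a loop over time positions, feeding back `h`, `c`, and `memory`, yields the decoder's word logits at each position; the gated additive writes here stand in for the paper's more elaborate memory-update operations.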
### Multimodal Vlog Datasets in Machine Learning

Multimodal vlog datasets are an important resource for depression detection. These datasets typically combine video, audio, and text modalities to capture users' depression-related features [^1]. In particular, the contrastive-learning methods developed for neural machine translation can be extended to multimodal data analysis, improving a model's understanding of emotional states.

In addition, research on multimodal summarization shows that hierarchical cross-modal semantic correlation models can effectively extract deep relations between different media types [^2]. The same approach applies to analyzing emotional cues in vlog content, helping to identify potential mental-health issues.

As for practical data-partitioning strategies, prior work suggests allocating samples to training, validation, and test sets at fixed ratios, and considering auxiliary attributes such as gender for more fine-grained modelling [^3].

The following Python snippet shows how to load and split such data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assume the CSV file holds per-vlog metadata and features
data = pd.read_csv('multimodal_vlogs.csv')

# Inspect the first few rows
print(data.head())

# Split into train / validation / test (assuming a label column named 'label'):
# 30% is held out first, then split 1/3 : 2/3, giving a 70/10/20 partition.
train_data, temp_data, train_labels, temp_labels = train_test_split(
    data.drop(columns=['label']), data['label'],
    test_size=0.3, random_state=42)
val_data, test_data, val_labels, test_labels = train_test_split(
    temp_data, temp_labels, test_size=2/3, random_state=42)
```
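If gender is used as an auxiliary attribute as suggested above, one common extension (our illustration, not taken from the cited studies) is to stratify the first split on it so that each partition preserves the gender distribution. `stratify` is a standard `train_test_split` parameter; the `gender` column name is an assumption:

```python
# Keep the (assumed) 'gender' column balanced across the held-out split.
train_data, temp_data, train_labels, temp_labels = train_test_split(
    data.drop(columns=['label']), data['label'],
    test_size=0.3, random_state=42, stratify=data['gender'])
```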