Deep Reinforcement Learning-based Image Captioning with Embedding Reward

最新推荐文章于 2024-07-02 11:41:14 发布

原创最新推荐文章于 2024-07-02 11:41:14 发布 · 7.7k 阅读

0 ·

CC 4.0 BY-SA版权

paper reading 专栏收录该内容

85 篇文章

订阅专栏

本文提出一种结合政策网络与价值网络的新型强化学习框架，用于解决图像配文任务。该框架通过局部与全局指导机制生成更贴近真实描述的文本，并在COCO数据集上验证了其优越性。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

https://arxiv.org/abs/1704.03899

Image captioning is a challenging problem owing to the complexity in understanding the image content and diverse ways of describing it in natural language. Recent advances in deep neural networks have substantially improved the performance of this task. Most state-of-the-art approaches follow an encoder-decoder framework, which generates captions using a sequential recurrent prediction model. However, in this paper, we introduce a novel decision-making framework for image captioning. We utilize a "policy network" and a "value network" to collaboratively generate captions. The policy network serves as a local guidance by providing the confidence of predicting the next word according to the current state. Additionally, the value network serves as a global and lookahead guidance by evaluating all possible extensions of the current state. In essence, it adjusts the goal of predicting the correct words towards the goal of generating captions similar to the ground truth captions. We train both networks using an actor-critic reinforcement learning model, with a novel reward defined by visual-semantic embedding. Extensive experiments and analyses on the Microsoft COCO dataset show that the proposed framework outperforms state-of-the-art approaches across different evaluation metrics.