[arXiv 2021] ClipCap: CLIP Prefix for Image Captioning

An analysis of ClipCap, an image captioning method

Paper: [2111.09734] ClipCap: CLIP Prefix for Image Captioning

Code: GitHub - rmokady/CLIP_prefix_caption: Simple image captioning model

The English here is typed entirely by hand, summarizing and paraphrasing the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, feel free to point them out in the comments! This post leans toward personal notes, so read with that in mind.

Contents

1. Takeaways

2. Section-by-Section Reading of the Paper

2.1. Abstract

2.2. Introduction

2.3. Related Works

2.4. Method

2.4.1. Overview

2.4.2. Language model fine-tuning

2.4.3. Mapping Network Architecture

2.4.4. Inference

2.5. Results

2.6. Conclusion

1. Takeaways

(1) The GitHub repo looks well maintained, and it even seems to have some publicity, yet the paper apparently was never published at a venue? Such a pity.

(2) The design is indeed very simple, but since the whole point is to be fast, that seems fine.

(3) ⭐ This is the first time I have seen someone print out the GPT prefix (as words)!

2. Section-by-Section Reading of the Paper

2.1. Abstract

        ① They use the CLIP feature as a prefix for the text decoder

        ② Their approach is simpler, faster, and lighter than prior captioning pipelines

2.2. Introduction

        ① The image captioning task (example captions shown in a figure in the paper)

        ② Overall framework (figure in the paper)

2.3. Related Works

        ① Reviews CLIP

        ② Pre-trained models do not need to rely on additional annotations

2.4. Method

        ① \{x^i,c^i\}_{i=1}^N denotes the image and caption pairs

        ② Training objective:

\max_{\theta}\sum_{i=1}^{N}\log p_{\theta}(c_{1}^{i},\ldots,c_{\ell}^{i}|x^{i})

where c^{i}=c_{1}^{i},\ldots,c_{\ell}^{i} denotes the tokens of the caption and \theta denotes the trainable parameters

        ③ Autoregressive prediction objective (by the chain rule, this is equivalent to the objective above):

\max_\theta\sum_{i=1}^N\sum_{j=1}^\ell\log p_\theta(c_j^i|x^i,c_1^i,\ldots,c_{j-1}^i)

2.4.1. Overview

        ① Language model: GPT-2

        ② Image features are extracted by CLIP and mapped to prefix embeddings by a mapping network F:

p_1^i,\ldots,p_k^i=F(\mathrm{CLIP}(x^i))

where p^i_j has the same dimension as a word embedding

        ③ Concatenate the prefix embeddings with the caption tokens (at inference time c is obtained autoregressively, generated one token at a time starting from c_1, rather than being available as a whole sentence from the start):

Z^i=p_1^i,\ldots,p_k^i,c_1^i,\ldots,c_\ell^i.

        ④ Cross-entropy loss over the caption tokens (a training-step sketch follows the equation):

\mathcal{L}_{X}=-\sum_{i=1}^{N}\sum_{j=1}^{\ell}\log p_{\theta}(c_{j}^{i}|p_{1}^{i},\ldots,p_{k}^{i},c_{1}^{i},\ldots,c_{j-1}^{i})
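Putting 2.4. together, here is a minimal training-step sketch, assuming a Hugging Face GPT2LMHeadModel and a plain linear layer standing in for the mapping network F; `mapper`, `clipcap_loss`, `clip_feat`, and `caption_ids` are illustrative names, not the repo's API:

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
embed_dim = gpt2.transformer.wte.weight.shape[1]  # GPT-2 word-embedding size (768)
prefix_len = 10                                   # k, the number of prefix tokens

# Stand-in for the mapping network F (the paper uses an MLP or a transformer)
mapper = torch.nn.Linear(512, prefix_len * embed_dim)  # 512 = assumed CLIP feature size

def clipcap_loss(clip_feat, caption_ids):
    """clip_feat: [B, 512] CLIP image features; caption_ids: [B, L] GPT-2 token ids."""
    B, L = caption_ids.shape
    # p_1..p_k = F(CLIP(x)): map the CLIP feature to k prefix embeddings
    prefix = mapper(clip_feat).view(B, prefix_len, embed_dim)
    # Word embeddings of the ground-truth caption tokens c_1..c_l
    cap_emb = gpt2.transformer.wte(caption_ids)
    # Z = p_1..p_k, c_1..c_l is fed to GPT-2 as a sequence of input embeddings
    logits = gpt2(inputs_embeds=torch.cat([prefix, cap_emb], dim=1)).logits
    # Position j-1 predicts token j, so shift by one to score c_1..c_l
    pred = logits[:, prefix_len - 1 : prefix_len + L - 1]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), caption_ids.reshape(-1))
```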

2.4.2. Language model fine-tuning

        ① Fine-tuning the language model costs a lot of time, so the authors choose not to fine-tune it: GPT-2 stays frozen and only the mapping network is trained (a short sketch follows)
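Continuing the snippet above, a sketch of what "not fine-tuning" looks like in code; the learning rate is an illustrative value, not the paper's prescribed setting:

```python
# Freeze GPT-2 so that only the mapping network receives gradients
for param in gpt2.parameters():
    param.requires_grad = False
gpt2.eval()

optimizer = torch.optim.AdamW(mapper.parameters(), lr=2e-5)  # illustrative hyperparameter
```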

2.4.3. Mapping Network Architecture

        ① The CLIP-extracted feature, together with a set of learned constant inputs, is fed into the (transformer) mapping network; see the sketch below
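A minimal sketch of such a mapper, using PyTorch's built-in transformer encoder; the class name, layer count, and dimensions here are my assumptions for illustration, not the repo's exact implementation:

```python
import torch
import torch.nn as nn

class TransformerMapper(nn.Module):
    """Maps a CLIP feature plus learned constant tokens to k prefix embeddings."""

    def __init__(self, clip_dim=512, embed_dim=768, prefix_len=10, num_layers=8):
        super().__init__()
        self.proj = nn.Linear(clip_dim, embed_dim)                     # project the CLIP feature
        self.const = nn.Parameter(torch.randn(prefix_len, embed_dim))  # learned constant inputs
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, clip_feat):                               # clip_feat: [B, clip_dim]
        B = clip_feat.size(0)
        img = self.proj(clip_feat).unsqueeze(1)                 # [B, 1, embed_dim]
        const = self.const.unsqueeze(0).expand(B, -1, -1)       # [B, k, embed_dim]
        out = self.encoder(torch.cat([img, const], dim=1))      # attend over image + constants
        return out[:, 1:]                                       # outputs at the constant slots = prefix
```

Since the language model is frozen, all adaptation has to happen inside the mapper; the learned constants give it slots whose outputs can be shaped into prefixes the fixed GPT-2 understands.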

2.4.4. Inference

        ① At each step, the language model outputs a probability distribution over all vocabulary tokens, from which the next token is chosen either greedily or with beam search (a greedy sketch follows)
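A greedy-decoding sketch reusing the names from the training snippet above; 50256 is GPT-2's end-of-text id, and beam search would instead keep several candidate sequences per step:

```python
@torch.no_grad()
def generate_greedy(clip_feat, max_len=30, stop_token_id=50256):
    """clip_feat: [1, 512] CLIP feature for a single image; returns GPT-2 token ids."""
    inputs = mapper(clip_feat).view(1, prefix_len, embed_dim)  # start from the prefix only
    tokens = []
    for _ in range(max_len):
        logits = gpt2(inputs_embeds=inputs).logits[:, -1]      # distribution over the vocabulary
        next_id = logits.argmax(dim=-1)                        # greedy: take the most probable token
        if next_id.item() == stop_token_id:
            break
        tokens.append(next_id.item())
        next_emb = gpt2.transformer.wte(next_id).unsqueeze(1)  # embed the chosen token
        inputs = torch.cat([inputs, next_emb], dim=1)          # append it and continue
    return tokens  # decode with a GPT2Tokenizer to get the caption text
```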

2.5. Results

        ① Datasets: COCO-Captions, nocaps, and Conceptual Captions

        ② Data split: 120,000 COCO images with 5 captions each (80 categories) for training, nocaps for validation and testing (novel categories), and Conceptual Captions for validation (specific entities are replaced with general notions)

        ③ Performance comparison and ablation study (tables in the paper)

        ④ Captioning results on COCO (figure in the paper)

        ⑤ Captioning results on Conceptual Captions (figure in the paper)

        ⑥ Generalization to smartphone photos (figure in the paper)

        ⑦ What the prefix represents (can this thing really read as a sentence??):

obtained by taking, for each prefix embedding, the most relevant vocabulary word under cosine similarity; see the sketch below
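A sketch of that interpretation step, assuming a Hugging Face GPT2Tokenizer and the `gpt2` model from the snippets above; `interpret_prefix` is an illustrative name:

```python
import torch.nn.functional as F
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def interpret_prefix(prefix):
    """prefix: [k, embed_dim] prefix embeddings produced for one image."""
    vocab = gpt2.transformer.wte.weight                                   # [V, embed_dim] word embeddings
    sims = F.normalize(prefix, dim=-1) @ F.normalize(vocab, dim=-1).t()   # cosine similarity, [k, V]
    nearest = sims.argmax(dim=-1)                                         # most similar word per prefix slot
    return tokenizer.decode(nearest.tolist())                             # read the prefix as text
```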

        ⑧ Ablation on prefix length (figure in the paper)

2.6. Conclusion

        ~
