[arXiv 2021] ClipCap: CLIP Prefix for Image Captioning

An analysis of ClipCap, an image captioning method

Paper: [2111.09734] ClipCap: CLIP Prefix for Image Captioning

Code: GitHub - rmokady/CLIP_prefix_caption: Simple image captioning model

The English here is typed entirely by hand, summarizing and paraphrasing the original paper. Spelling and grammar mistakes are hard to avoid; if you spot any, feel free to point them out in the comments! This post leans toward personal notes, so read with that in mind.

Contents

1. Takeaways

2. Section-by-Section Reading of the Paper

2.1. Abstract

2.2. Introduction

2.3. Related Works

2.4. Method

2.4.1. Overview

2.4.2. Language model fine-tuning

2.4.3. Mapping Network Architecture

2.4.4. Inference

2.5. Results

2.6. Conclusion

1. Takeaways

(1) The GitHub repo looks well maintained, and it even seems to have some publicity, yet the paper apparently was never published at a venue? Such a pity.

(2) The design is indeed very simple, but since the whole point is to be fast, that seems fine.

(3) ⭐ This is the first time I have seen someone print out the GPT prefix (as words)!

2. Section-by-Section Reading of the Paper

2.1. Abstract

        ① They use the CLIP feature as a prefix for the text decoder

        ② Their approach is simpler, faster, and lighter than prior captioning pipelines

2.2. Introduction

        ① The image captioning task (example captions shown in a figure in the paper)

        ② Overall framework (figure in the paper)

2.3. Related Works

        ① Reviews CLIP

        ② Pre-trained models do not need to rely on additional annotations

2.4. Method

        ① \{x^i,c^i\}_{i=1}^N denotes the image and caption pairs

        ② Training objective:

\max_{\theta}\sum_{i=1}^{N}\log p_{\theta}(c_{1}^{i},\ldots,c_{\ell}^{i}|x^{i})

where c^{i}=c_{1}^{i},\ldots,c_{\ell}^{i} denotes the tokens of the caption and \theta denotes the trainable parameters

        ③ Autoregressive prediction objective (by the chain rule, this is equivalent to the objective above):

\max_\theta\sum_{i=1}^N\sum_{j=1}^\ell\log p_\theta(c_j^i|x^i,c_1^i,\ldots,c_{j-1}^i)

2.4.1. Overview

        ① Language model: GPT-2

        ② Image features are extracted by CLIP and mapped to prefix embeddings by a mapping network F:

p_1^i,\ldots,p_k^i=F(\mathrm{CLIP}(x^i))

where p^i_j has the same dimension as a word embedding

        ③ Concatenate the prefix embeddings with the caption tokens (at inference time c is obtained autoregressively, generated one token at a time starting from c_1, rather than being available as a whole sentence from the start):

Z^i=p_1^i,\ldots,p_k^i,c_1^i,\ldots,c_\ell^i.

        ④ Cross-entropy loss over the caption tokens (a training-step sketch follows the equation):

\mathcal{L}_{X}=-\sum_{i=1}^{N}\sum_{j=1}^{\ell}\log p_{\theta}(c_{j}^{i}|p_{1}^{i},\ldots,p_{k}^{i},c_{1}^{i},\ldots,c_{j-1}^{i})
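Putting 2.4. together, here is a minimal training-step sketch, assuming a Hugging Face GPT2LMHeadModel and a plain linear layer standing in for the mapping network F; `mapper`, `clipcap_loss`, `clip_feat`, and `caption_ids` are illustrative names, not the repo's API:

```python
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel

gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
embed_dim = gpt2.transformer.wte.weight.shape[1]  # GPT-2 word-embedding size (768)
prefix_len = 10                                   # k, the number of prefix tokens

# Stand-in for the mapping network F (the paper uses an MLP or a transformer)
mapper = torch.nn.Linear(512, prefix_len * embed_dim)  # 512 = assumed CLIP feature size

def clipcap_loss(clip_feat, caption_ids):
    """clip_feat: [B, 512] CLIP image features; caption_ids: [B, L] GPT-2 token ids."""
    B, L = caption_ids.shape
    # p_1..p_k = F(CLIP(x)): map the CLIP feature to k prefix embeddings
    prefix = mapper(clip_feat).view(B, prefix_len, embed_dim)
    # Word embeddings of the ground-truth caption tokens c_1..c_l
    cap_emb = gpt2.transformer.wte(caption_ids)
    # Z = p_1..p_k, c_1..c_l is fed to GPT-2 as a sequence of input embeddings
    logits = gpt2(inputs_embeds=torch.cat([prefix, cap_emb], dim=1)).logits
    # Position j-1 predicts token j, so shift by one to score c_1..c_l
    pred = logits[:, prefix_len - 1 : prefix_len + L - 1]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), caption_ids.reshape(-1))
```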

2.4.2. Language model fine-tuning

        ① Fine-tuning the language model costs a lot of time, so the authors choose not to fine-tune it: GPT-2 stays frozen and only the mapping network is trained (a short sketch follows)
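Continuing the snippet above, a sketch of what "not fine-tuning" looks like in code; the learning rate is an illustrative value, not the paper's prescribed setting:

```python
# Freeze GPT-2 so that only the mapping network receives gradients
for param in gpt2.parameters():
    param.requires_grad = False
gpt2.eval()

optimizer = torch.optim.AdamW(mapper.parameters(), lr=2e-5)  # illustrative hyperparameter
```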

2.4.3. Mapping Network Architecture

        ① The CLIP-extracted feature, together with a set of learned constant inputs, is fed into the (transformer) mapping network; see the sketch below
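A minimal sketch of such a mapper, using PyTorch's built-in transformer encoder; the class name, layer count, and dimensions here are my assumptions for illustration, not the repo's exact implementation:

```python
import torch
import torch.nn as nn

class TransformerMapper(nn.Module):
    """Maps a CLIP feature plus learned constant tokens to k prefix embeddings."""

    def __init__(self, clip_dim=512, embed_dim=768, prefix_len=10, num_layers=8):
        super().__init__()
        self.proj = nn.Linear(clip_dim, embed_dim)                     # project the CLIP feature
        self.const = nn.Parameter(torch.randn(prefix_len, embed_dim))  # learned constant inputs
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, clip_feat):                               # clip_feat: [B, clip_dim]
        B = clip_feat.size(0)
        img = self.proj(clip_feat).unsqueeze(1)                 # [B, 1, embed_dim]
        const = self.const.unsqueeze(0).expand(B, -1, -1)       # [B, k, embed_dim]
        out = self.encoder(torch.cat([img, const], dim=1))      # attend over image + constants
        return out[:, 1:]                                       # outputs at the constant slots = prefix
```

Since the language model is frozen, all adaptation has to happen inside the mapper; the learned constants give it slots whose outputs can be shaped into prefixes the fixed GPT-2 understands.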

2.4.4. Inference

        ① At each step, the language model outputs a probability distribution over all vocabulary tokens, from which the next token is chosen either greedily or with beam search (a greedy sketch follows)
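A greedy-decoding sketch reusing the names from the training snippet above; 50256 is GPT-2's end-of-text id, and beam search would instead keep several candidate sequences per step:

```python
@torch.no_grad()
def generate_greedy(clip_feat, max_len=30, stop_token_id=50256):
    """clip_feat: [1, 512] CLIP feature for a single image; returns GPT-2 token ids."""
    inputs = mapper(clip_feat).view(1, prefix_len, embed_dim)  # start from the prefix only
    tokens = []
    for _ in range(max_len):
        logits = gpt2(inputs_embeds=inputs).logits[:, -1]      # distribution over the vocabulary
        next_id = logits.argmax(dim=-1)                        # greedy: take the most probable token
        if next_id.item() == stop_token_id:
            break
        tokens.append(next_id.item())
        next_emb = gpt2.transformer.wte(next_id).unsqueeze(1)  # embed the chosen token
        inputs = torch.cat([inputs, next_emb], dim=1)          # append it and continue
    return tokens  # decode with a GPT2Tokenizer to get the caption text
```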

2.5. Results

        ① Datasets: COCO-Captions, nocaps, and Conceptual Captions

        ② Data split: 120,000 COCO images with 5 captions each (80 categories) for training, nocaps for validation and testing (novel categories), and Conceptual Captions for validation (specific entities are replaced with general notions)

        ③ Performance comparison and ablation study (tables in the paper)

        ④ Captioning results on COCO (figure in the paper)

        ⑤ Captioning results on Conceptual Captions (figure in the paper)

        ⑥ Generalization to smartphone photos (figure in the paper)

        ⑦ What the prefix represents (can this thing really read as a sentence??):

obtained by taking, for each prefix embedding, the most relevant vocabulary word under cosine similarity; see the sketch below
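A sketch of that interpretation step, assuming a Hugging Face GPT2Tokenizer and the `gpt2` model from the snippets above; `interpret_prefix` is an illustrative name:

```python
import torch.nn.functional as F
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def interpret_prefix(prefix):
    """prefix: [k, embed_dim] prefix embeddings produced for one image."""
    vocab = gpt2.transformer.wte.weight                                   # [V, embed_dim] word embeddings
    sims = F.normalize(prefix, dim=-1) @ F.normalize(vocab, dim=-1).t()   # cosine similarity, [k, V]
    nearest = sims.argmax(dim=-1)                                         # most similar word per prefix slot
    return tokenizer.decode(nearest.tolist())                             # read the prefix as text
```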

        ⑧ Ablation on prefix length (figure in the paper)

2.6. Conclusion

        ~
