Week 50 Study Notes

luputo

License: CC 4.0 BY-SA

Category: Study Notes

Article link: https://blog.youkuaiyun.com/luo3300612/article/details/96489042

Paper Reading Overview

SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text: This paper introduces SemStyle, a model that generates diverse image captions from paired factual data and an unpaired stylistic corpus. It uses a two-stage method: the first stage maps the image to a sequence of semantic terms, and the second feeds that sequence to a language model that generates the caption. To train the second stage, semantic terms are extracted from the unpaired stylistic corpus with an NLP package, which lets SemStyle incorporate stylistic data and achieve good performance.
Dense Captioning with Joint Inference and Visual Context: This paper proposes joint inference (generating region captions and regions simultaneously) and context fusion (providing contextual information for region captioning). These address two problems: highly overlapping target regions in the dataset, and the difficulty of recognizing each region from appearance alone. The model achieves SoTA on VG (Visual Genome).
Semantic Compositional Networks for Visual Captioning: This paper introduces SCN, a model that incorporates high-level semantic concepts into an image captioning system. It uses the softmax output of multi-label concept (object) detection to weight the LSTM weight matrices, analogous to an ensemble of LSTMs, and ranked in the top three at the time.
Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects: This paper introduces LSTM-C (Copy), a model that incorporates an external visual recognition dataset in order to caption novel objects. The next word is generated from a linear combination of the conventional decoder's next-word probability distribution and an object-detection-based distribution.
Captioning Images with Diverse Objects: This paper introduces a model that jointly targets object detection, next-word generation, and image captioning with a shared model and parameters, so that external visual detection datasets and text datasets can be incorporated.
Top-down Visual Saliency Guided by Captions: This paper investigates the internal mechanism of image captioning models. It replaces the context vector with a single region feature and compares the language model's output distributions in the two cases, in order to explain the dependency between a specific word and a specific region. The question is whether an encoder-decoder can adaptively find the connection between them, and the answer is 'yes'.
Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-in-the-Blank Image Captioning: This paper proposes bidirectional beam search for fill-in-the-blank image captioning, as an attempt to explore bidirectional decoding.
Beyond instance-level image retrieval: Leveraging captions to learn a global visual representation for semantic retrieval: This paper uses caption datasets to supervise a semantic image encoding model for semantic image retrieval and achieves better performance.
Areas of Attention for Image Captioning: This paper proposes a language model that generates captions by modeling the interplay of region features, the current input word, and the hidden state. It considers different kinds of region features (spatial, proposal, transform) in search of a better attention mechanism for image captioning.
An Empirical Study of Language CNN for Image Captioning: This paper introduces a Language CNN as the decoder for image captioning to handle long-term dependencies, which are a challenge for RNN-based decoders. The CNN applies layer-wise transformations with a fixed window size to model the structure and global information of all previously generated words, and is followed by an RNN cell that models local information.
Scene Graph Generation from Objects, Phrases and Region Captions (an amazing one): This paper first emphasizes the connection between three visual understanding tasks at different levels: object detection, scene graph generation, and region captioning. It then designs an end-to-end model trained on all three tasks simultaneously. The tasks' visual features are connected as nodes of a hierarchical dynamic tree and refine each other as mutual compensation, which boosts performance on all three tasks.
Improved Image Captioning via Policy Gradient optimization of SPIDEr: This paper introduces a robust policy gradient algorithm that optimizes an image captioning metric directly, to produce captions closer to human consensus. It also proposes a better optimization target, SPIDEr, which is a linear combination of SPICE and CIDEr.
Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training: This paper poses a new problem: generating a set of captions for a single image, which makes better use of the one-to-many nature of image captioning datasets. Trained adversarially, the model achieves both accuracy and diversity.
Paying Attention to Descriptions Generated by Image Captioning Models: This paper investigates the difference in saliency between humans and image captioning models, and shows that models whose saliency agrees more closely with that of humans achieve better performance.
Boosting Image Captioning with Attributes: Again, this paper emphasizes the importance of semantic-level information for caption generation. It detects a multi-label object distribution and feeds it to the decoder, achieving SoTA performance.
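
To make the LSTM-C idea above more concrete, here is a minimal sketch of linearly combining a decoder's next-word distribution with a detection-based distribution. This is only my illustration of the "linear combination" described in the summary, not the paper's actual implementation: the function name `mix_distributions`, the `copy_weight` parameter, the toy vocabulary, and all probability values are hypothetical.

```python
def mix_distributions(p_decoder, p_copy, copy_weight):
    """Linearly combine two next-word distributions over the same vocabulary.

    copy_weight is a hypothetical mixing coefficient in [0, 1];
    how LSTM-C actually sets it is not covered by these notes.
    """
    if not 0.0 <= copy_weight <= 1.0:
        raise ValueError("copy_weight must be in [0, 1]")
    if len(p_decoder) != len(p_copy):
        raise ValueError("distributions must cover the same vocabulary")
    return [(1.0 - copy_weight) * d + copy_weight * c
            for d, c in zip(p_decoder, p_copy)]


# Toy example (all numbers are made up for illustration).
vocab = ["a", "dog", "zebra"]
p_decoder = [0.6, 0.4, 0.0]  # decoder never saw "zebra" in the caption data
p_copy = [0.0, 0.1, 0.9]     # hypothetical detector scores, already normalized

mixed = mix_distributions(p_decoder, p_copy, copy_weight=0.5)
# Pick the most probable word under the mixed distribution.
next_word = vocab[max(range(len(vocab)), key=mixed.__getitem__)]
```

Because both inputs sum to 1 and the weights sum to 1, the mixture is still a valid distribution, and a word the decoder assigns zero probability can still be selected when the detector is confident about it.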

Code Run Results

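Thanks to ruotianluo, I ran some mainstream image captioning models. The results are shown in the figure (there are too many figures, so for now I am only posting the CIDEr one). Training used XE (cross-entropy) optimization before 350k iterations and CIDEr optimization afterwards, with spatial attention features and bottom-up attention features respectively.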
[Figure: CIDEr scores of the image captioning models]
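
The two-phase schedule mentioned above (XE first, then CIDEr optimization) might look roughly like the sketch below. Only the 350k switch point comes from these notes. Everything else is an assumption on my part: I assume the CIDEr phase is a REINFORCE-style policy gradient with a baseline, and the function names, the treatment of iteration 350,000 itself, and the example numbers are all hypothetical. The rewards are plain numbers here; no real CIDEr scorer is called.

```python
XE_ITERATIONS = 350_000  # switch point taken from the note above


def objective_for(iteration):
    """Return which objective is active at a given iteration.

    Treating iteration 350,000 itself as the first CIDEr step is an assumption.
    """
    return "xe" if iteration < XE_ITERATIONS else "cider"


def xe_loss(log_probs_of_ground_truth):
    """Cross-entropy: mean negative log-likelihood of the reference words."""
    return -sum(log_probs_of_ground_truth) / len(log_probs_of_ground_truth)


def policy_gradient_loss(sample_log_probs, sample_reward, baseline_reward):
    """REINFORCE-style surrogate loss with a baseline (assumed, not confirmed).

    sample_reward and baseline_reward stand in for CIDEr scores of a sampled
    caption and of a baseline caption.
    """
    advantage = sample_reward - baseline_reward
    return -advantage * sum(sample_log_probs)


# Hypothetical numbers, for illustration only.
phase_early = objective_for(100_000)
phase_late = objective_for(400_000)
loss_xe = xe_loss([-1.0, -3.0])
loss_pg = policy_gradient_loss([-1.0, -1.0], sample_reward=1.5, baseline_reward=1.0)
```

The point of the surrogate loss is that a sampled caption scoring above the baseline gets its log-probability pushed up, and one scoring below gets pushed down, so the metric itself drives training in the second phase.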

The results show