Week 50 Study Notes


Paper Reading Overview

  • SemStyle: Learning to Generate Stylised Image Captions using Unaligned Text: This paper introduces SemStyle, a model that generates stylised image captions from paired factual data and an unpaired stylistic corpus via a two-stage method: the image is first mapped to a sequence of semantic terms, which is then fed to a language model that generates the caption. By extracting semantic terms from the unpaired stylistic corpus with an NLP package to train the second-stage model, SemStyle incorporates stylistic data and achieves good performance.
  • Dense Captioning with Joint Inference and Visual Context: This paper proposes joint inference (simultaneously generating region captions and region locations) and context fusion (providing contextual information for region captioning) to address two problems: highly overlapping target regions in the dataset, and the difficulty of recognizing each region from its appearance alone. It achieves SoTA on Visual Genome.
  • Semantic Compositional Networks for Visual Captioning: This paper introduces the Semantic Compositional Network (SCN), which incorporates high-level semantic concepts into an image captioning system by using the output probabilities of a multi-label concept (object) detector to weight the LSTM's weight matrices, analogous to an ensemble of LSTMs; it was a top-3 system at the time.
  • Incorporating Copying Mechanism in Image Captioning for Learning Novel Objects: This paper introduces LSTM-C (C for Copy), which incorporates an external visual recognition dataset by linearly combining the decoder's next-word probability distribution with an object-detection-based distribution when generating the next word, enabling novel object captioning.
  • Captioning Images with Diverse Objects: This paper introduces a model that simultaneously targets object detection, next-word generation, and image captioning with a shared model and parameters, so that both external visual detection datasets and text corpora can be incorporated.
  • Top-down Visual Saliency Guided by Captions: This paper investigates the internal mechanism of image captioning models by replacing the context vector with a single region feature and comparing the language model's output distributions, exposing the dependency between specific words and specific regions and asking whether an encoder-decoder can adaptively find these connections; the answer is yes.
  • Bidirectional Beam Search: Forward-Backward Inference in Neural Sequence Models for Fill-in-the-Blank Image Captioning: This paper proposes bidirectional beam search for fill-in-the-blank image captioning as an attempt to explore bidirectional decoding.
  • Beyond Instance-level Image Retrieval: Leveraging Captions to Learn a Global Visual Representation for Semantic Retrieval: This paper exploits a caption dataset to supervise a semantic image encoder for semantic image retrieval, achieving better performance.
  • Areas of Attention for Image Captioning: This paper proposes a language model that generates captions by modeling the interplay of region features, the current input word, and the hidden state, comparing different region definitions (spatial grid, object proposals, spatial transformer) to find a better attention mechanism for image captioning.
  • An Empirical Study of Language CNN for Image Captioning: This paper introduces a language CNN as the decoder for image captioning to handle long-term dependencies, a known challenge for RNN-based decoders: layer-wise transformations with a fixed window size (followed by an RNN cell that models local information) capture the structure and global information of the entire previously generated sequence.
  • Scene Graph Generation from Objects, Phrases and Region Captions (an amazing one): This paper first emphasizes the connections between three levels of visual understanding (object detection, scene graph generation, and region captioning), then designs an end-to-end model trained simultaneously on all three tasks by refining each task's visual features as connected nodes of a dynamic hierarchical structure so that the tasks mutually compensate, boosting performance on all three.
  • Improved Image Captioning via Policy Gradient optimization of SPIDEr: This paper introduces a robust policy gradient algorithm that directly optimizes an image captioning metric for captions with more human consensus, together with a better optimization target, SPIDEr, a linear combination of SPICE and CIDEr.
  • Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training: This paper poses a new problem of generating a caption set from a single image to better exploit the one-to-many nature of captioning datasets via adversarial training, achieving both accuracy and diversity.
  • Paying Attention to Descriptions Generated by Image Captioning Models: This paper investigates the difference in saliency between humans and image captioning models and shows that models sharing more saliency consensus with humans achieve better performance.
  • Boosting Image Captioning with Attributes: Once again, this paper emphasizes the importance of semantic-level information for better caption generation, detecting a multi-label object (attribute) distribution and feeding it to the decoder to achieve SoTA performance.
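A few of the mechanisms summarised above are simple enough to sketch in code. SemStyle's first stage maps a caption to an ordered sequence of semantic terms; here is a toy illustration with a hypothetical stop-word filter standing in for the paper's full NLP pipeline (which uses POS tagging and lemmatization):

```python
# Hypothetical stop-word list; the real system uses an NLP package instead
STOPWORDS = {"a", "an", "the", "is", "are", "was", "were",
             "on", "in", "at", "of", "with", "and", "to"}

def extract_terms(caption: str) -> list:
    """Map a caption to an ordered sequence of content-word 'semantic terms'."""
    tokens = caption.lower().rstrip(".").split()
    return [t for t in tokens if t not in STOPWORDS]

print(extract_terms("A dog is running on the beach."))  # ['dog', 'running', 'beach']
```

The paper's term generator is richer than this filter; the sketch only shows the keep-content-words idea that lets unpaired text supervise the language model.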
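SCN's "weighting the LSTM's weights" can be written as an input-dependent weight matrix composed from three factors, with the concept-probability vector scaling the middle one. All shapes and values below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
K, d_in, d_out, r = 5, 8, 8, 4  # concepts, input dim, output dim, factor rank

# Three factor matrices replace one full weight matrix (shapes are assumptions)
Wa = rng.normal(size=(d_out, r))
Wb = rng.normal(size=(r, K))
Wc = rng.normal(size=(r, d_in))

def scn_weight(s):
    """Compose an input-dependent weight matrix from concept probabilities s."""
    return Wa @ np.diag(Wb @ s) @ Wc  # shape (d_out, d_in)

s = np.array([0.9, 0.1, 0.0, 0.7, 0.2])  # detector scores for the K concepts
W = scn_weight(s)
print(W.shape)  # (8, 8)
```

Because the concept vector enters linearly, each concept effectively contributes its own rank-one-mixed weight matrix, which is the "ensemble of LSTMs" reading.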
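LSTM-C's copying mechanism reduces, at each decoding step, to a linear blend of two next-word distributions. A minimal sketch with a made-up vocabulary and a hypothetical mixing weight lam:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

vocab = ["a", "dog", "zebra", "runs"]            # made-up toy vocabulary
p_lm = softmax(np.array([2.0, 1.5, -3.0, 1.0]))  # decoder's next-word distribution
p_det = np.array([0.0, 0.1, 0.9, 0.0])           # detection-based distribution
lam = 0.5                                        # hypothetical mixing weight

# LSTM-C style blend: the detector lifts "zebra", rare in caption training data
p_next = (1 - lam) * p_lm + lam * p_det
print(vocab[int(np.argmax(p_next))])  # zebra
```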
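SPIDEr itself is just an equal-weight combination of SPICE and CIDEr, and the policy gradient reinforces a sampled caption in proportion to its reward relative to a baseline. A sketch with invented scores (the paper estimates the baseline with Monte-Carlo rollouts):

```python
def spider(spice: float, cider: float) -> float:
    # SPIDEr reward: equal-weight linear combination of SPICE and CIDEr
    return 0.5 * spice + 0.5 * cider

def pg_weight(sample_reward: float, baseline_reward: float) -> float:
    # REINFORCE-style advantage: how strongly to reinforce a sampled caption
    return sample_reward - baseline_reward

print(round(spider(spice=0.20, cider=1.10), 2))  # 0.65
```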

Code Run Results

Thanks to ruotianluo, I ran several mainstream image captioning models; the results are shown below (there are too many plots, so only the CIDEr one is included for now). Training before 350k iterations uses cross-entropy (XE) optimization, followed by CIDEr optimization; both spatial attention features and bottom-up attention features were used.
(Figures: CIDEr training curves for each model)
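The schedule described above (XE until 350k iterations, then CIDEr optimization) can be sketched as a loss switch. This is a simplification of self-critical sequence training, and all numbers are hypothetical:

```python
XE_ITERS = 350_000  # switch point used in these runs

def caption_loss(iteration, xe_loss, sample_cider, greedy_cider, sample_logprob):
    """Stage 1: plain cross-entropy. Stage 2: REINFORCE with the greedy
    caption's CIDEr as a baseline, so sampling better than greedy is rewarded."""
    if iteration < XE_ITERS:
        return xe_loss
    advantage = sample_cider - greedy_cider
    return -advantage * sample_logprob  # minimizing this reinforces good samples

print(caption_loss(100_000, xe_loss=2.5, sample_cider=1.2, greedy_cider=1.0,
                   sample_logprob=-3.0))  # 2.5
```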
The results show:

  • For every model, the final results with bottom-up features are better than with spatial features
  • The transformer is the best model in this experiment (judged by CIDEr for now), followed by the top-down model
  • The one exception is the show-and-tell model, whose XE-trained result is better with spatial features than with bottom-up features
  • Bottom-up features can turn an underdog model into a front-runner; see the transformer

A few models are still missing; next week I will finish running them, put all the metrics together for a detailed analysis, and compare with the results reported in the papers.

Weekly Summary

Completion of Last Week's Tasks

  • Finish the 2017-2019 CVPR image captioning papers √
  • Organize recent SoTA image captioning models ~ (ongoing)
  • Organize all previously read papers ×
  • Organize paper-writing methodology ~
  • Organize important references ~
  • Study the details of CIDEr optimization and the top-down model ×
  • Run a CIDEr-optimization version of every model √
  • Run the models based on top-down features √

All the organizing tasks are still in progress, and their completion keeps being pushed back; they stay in the weekly plan as a reminder.

Next Week's Goals

  • Finish the ICCV 2017 and ECCV 2018 image captioning papers
  • Organize recent SoTA image captioning models
  • Organize previously read papers
  • Organize paper-writing methodology
  • Organize important references
  • Study the details of policy gradients, CIDEr optimization, and the top-down model
  • Finish running the basic models and analyze the results in detail against the original papers