Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge, paper and TensorFlow source code walkthrough (2)


License: CC 4.0 BY-SA

Tags: #source-code #tensorflow #inception #deep-learning


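This article looks at how to use a pretrained Inception V3 model for image feature extraction and combine it with an LSTM to generate image captions. It walks through the TensorFlow training code, including model construction, learning-rate setup, and the training ops.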


Source code


build_model

build_image_embeddings

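Once the image and caption inputs have been set up, this step converts each image into a fixed-size tensor. As the paper describes, it takes a deep network that has already been trained on a very large dataset and uses it directly for feature extraction without changing its parameters (in the code, the trainable=self.train_inception argument controls whether the Inception weights are updated).

First, the images are fed into the Inception v3 network to obtain its output. The code is: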

inception_output = image_embedding.inception_v3(
        self.images,
        trainable=self.train_inception,
        is_training=self.is_training())

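Let's first take a look at the Inception v3 model, which comes from the paper "Rethinking the Inception Architecture for Computer Vision". The inception_v3 function provided in the slim package directly returns the model described in that paper.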

Map inception output into embedding space.
Here the Inception output is used directly as the image feature and is passed through a fully connected layer to produce the embedding.

with tf.variable_scope("image_embedding") as scope:
  image_embeddings = tf.contrib.layers.fully_connected(
      inputs=inception_output,
      num_outputs=self.config.embedding_size,
      activation_fn=None,
      weights_initializer=self.initializer,
      biases_initializer=None,
      scope=scope)
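Because the layer above is created with activation_fn=None and biases_initializer=None, it amounts to a plain matrix multiplication of the Inception features by a weight matrix. The following is a minimal pure-Python sketch of that idea, not the TensorFlow implementation; the function name project and the toy sizes (3-dimensional features, an embedding size of 2) are assumptions made for illustration.

```python
def project(features, weights):
    """Compute features @ weights for a batch: no bias, no activation."""
    in_dim = len(weights)
    out_dim = len(weights[0])
    projected = []
    for row in features:
        if len(row) != in_dim:
            raise ValueError("feature size does not match the weight matrix")
        projected.append(
            [sum(row[i] * weights[i][j] for i in range(in_dim))
             for j in range(out_dim)]
        )
    return projected


# Hypothetical toy data: a batch of two 3-dimensional "Inception" outputs.
inception_output = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
# Hypothetical weight matrix of shape [3, embedding_size] with embedding_size = 2.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_embeddings = project(inception_output, W)
```

Each image ends up as a vector of length embedding_size, which is the same length used for the word embeddings, so both can be fed to the same LSTM.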

build_seq_embeddings

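With the image embeddings in place, the next step is to build the word embeddings, that is, to convert each word into a fixed-length vector.
Each word's embedding is looked up in the embedding_map matrix, using the word ids in self.input_seqs as indices.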

seq_embeddings = tf.nn.embedding_lookup(embedding_map, self.input_seqs)
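Conceptually, the lookup selects one row of embedding_map per word id. The sketch below mimics that behaviour in plain Python for a batch of id sequences; the toy vocabulary size (4) and embedding size (2) are made up for illustration, and the function is a stand-in for the idea rather than for tf.nn.embedding_lookup itself.

```python
def embedding_lookup(embedding_map, input_seqs):
    """Replace every word id with a copy of the matching row of embedding_map."""
    return [[list(embedding_map[word_id]) for word_id in seq]
            for seq in input_seqs]


# Hypothetical embedding matrix of shape [vocab_size=4, embedding_size=2].
embedding_map = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
# Hypothetical batch of two captions, each two word ids long.
input_seqs = [[1, 3], [2, 2]]
seq_embeddings = embedding_lookup(embedding_map, input_seqs)
```

The result has shape [batch_size, sequence_length, embedding_size]: one embedding vector per word position.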

build_model

First, build the lstm_cell:

lstm_cell = tf.contrib.rnn.BasicLSTMCell(
        num_units=self.config.num_lstm_units, state_is_tuple=True)

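The image embedding is fed in at time step -1 to initialize the LSTM. After that, the word embeddings are fed in one after another, producing an output sequence of length sequence_length, stored in lstm_outputs: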

lstm_outputs, _ = tf.nn.dynamic_rnn(cell=lstm_cell,
                                    inputs=self.seq_embeddings,
                                    sequence_length=sequence_length,
                                    initial_state=initial_state,
                                    dtype=tf.float32,
                                    scope=lstm_scope)
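The sketch below illustrates the "image first, then words" idea in plain Python for a single example. It assumes the initial state is obtained by running one LSTM step on the image embedding from an all-zero state, and it uses the standard LSTM gate equations with a forget bias of 1.0; the helper names lstm_step and run_caption_lstm and the weight layout (input gate, candidate, forget gate, output gate) are choices made for this illustration, and the sketch ignores the per-example sequence_length handling that dynamic_rnn provides.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, state, weights, bias, forget_bias=1.0):
    """One LSTM step for one example.

    x: input vector; state: (c, h), each of length num_units;
    weights: [len(x) + num_units][4 * num_units]; bias: [4 * num_units].
    """
    c, h = state
    num_units = len(c)
    concat = list(x) + list(h)
    gates = [
        sum(concat[k] * weights[k][col] for k in range(len(concat))) + bias[col]
        for col in range(4 * num_units)
    ]
    i_gate = gates[0:num_units]                   # input gate
    candidate = gates[num_units:2 * num_units]    # candidate cell values
    f_gate = gates[2 * num_units:3 * num_units]   # forget gate
    o_gate = gates[3 * num_units:]                # output gate
    new_c = [
        c[u] * sigmoid(f_gate[u] + forget_bias)
        + sigmoid(i_gate[u]) * math.tanh(candidate[u])
        for u in range(num_units)
    ]
    new_h = [math.tanh(new_c[u]) * sigmoid(o_gate[u]) for u in range(num_units)]
    return new_c, new_h


def run_caption_lstm(image_embedding, seq_embeddings, weights, bias, num_units):
    """Feed the image at step -1, then the words; return one output per word."""
    zero_state = ([0.0] * num_units, [0.0] * num_units)
    # Step -1: the image embedding only sets the state; its output is discarded.
    state = lstm_step(image_embedding, zero_state, weights, bias)
    outputs = []
    for word_embedding in seq_embeddings:
        state = lstm_step(word_embedding, state, weights, bias)
        outputs.append(state[1])  # the hidden state h is the step output
    return outputs
```

Note that the image embedding and the word embeddings must have the same length, because they pass through the same LSTM weights.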

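Next, add a fully connected layer whose outputs score how likely each word in the vocabulary is; these scores are then passed through a softmax: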

logits = tf.contrib.layers.fully_connected(
          inputs=lstm_outputs,
          num_outputs=self.config.vocab_size,
          activation_fn=None,
          weights_initializer=self.initializer,
          scope=logits_scope)
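To make the softmax step concrete, here is a small pure-Python sketch that turns one vector of logits into word probabilities and picks the most likely word id. It subtracts the maximum logit before exponentiating, a common trick for numerical stability; the helper names and the toy logits are assumptions for illustration and are not taken from the source code.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(v - largest) for v in logits]  # shifted for stability
    total = sum(exps)
    return [e / total for e in exps]


def most_likely_word(logits):
    """Return the id of the highest-scoring vocabulary entry."""
    return max(range(len(logits)), key=lambda word_id: logits[word_id])


# Hypothetical logits for a vocabulary of three words at one time step.
step_logits = [1.0, 3.0, 2.0]
step_probs = softmax(step_logits)
best_word_id = most_likely_word(step_logits)
```

Since softmax preserves the ordering of the scores, the most likely word can be read off the logits directly.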

Set up learning rate

tf.train.exponential_decay(
    learning_rate,
    global_step,
    decay_steps=decay_steps,
    decay_rate=training_config.learning_rate_decay_factor,
    staircase=True)
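The schedule computed by exponential decay is learning_rate * decay_rate ^ (global_step / decay_steps); with staircase=True the exponent is rounded down to an integer, so the rate drops in discrete steps. The following plain-Python sketch reproduces that formula; the numbers in the example are made up and are not the values used by the training configuration.

```python
def exponential_decay(learning_rate, global_step, decay_steps, decay_rate,
                      staircase=False):
    """Return the decayed learning rate at the given global step."""
    if staircase:
        exponent = global_step // decay_steps  # integer division: step-wise decay
    else:
        exponent = global_step / decay_steps   # smooth decay
    return learning_rate * decay_rate ** exponent


# Hypothetical settings: start at 2.0 and halve every 100 steps.
schedule = [exponential_decay(2.0, step, 100, 0.5, staircase=True)
            for step in (0, 99, 100, 250)]
```

With staircase=True the rate stays constant within each block of decay_steps steps and only changes at block boundaries.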

Set up the training ops.

train_op = tf.contrib.layers.optimize_loss(
    loss=model.total_loss,
    global_step=model.global_step,
    learning_rate=learning_rate,
    optimizer=training_config.optimizer,
    clip_gradients=training_config.clip_gradients,
    learning_rate_decay_fn=learning_rate_decay_fn)
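When clip_gradients is given as a float, the gradients are, as far as I understand optimize_loss, clipped by their global norm so that the combined norm does not exceed that value. The sketch below shows the global-norm clipping rule in plain Python; the function and the toy gradients are illustrative assumptions, not the library code.

```python
import math


def clip_by_global_norm(gradients, clip_norm):
    """Scale all gradients together so their combined L2 norm is <= clip_norm.

    gradients: a list of flat gradient vectors, one per variable.
    Returns (clipped_gradients, global_norm_before_clipping).
    """
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    # If the norm is already small enough, the scale is exactly 1.0.
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in gradients]
    return clipped, global_norm


# Hypothetical gradients for two variables; their global norm is 5.0.
toy_gradients = [[3.0], [4.0]]
clipped, norm_before = clip_by_global_norm(toy_gradients, clip_norm=1.0)
```

Because every gradient is multiplied by the same factor, the direction of the overall update is preserved and only its magnitude shrinks.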

Run training.

tf.contrib.slim.learning.train(
    train_op,
    train_dir,
    log_every_n_steps=FLAGS.log_every_n_steps,
    graph=g,
    global_step=model.global_step,
    number_of_steps=FLAGS.number_of_steps,
    init_fn=model.init_fn,
    saver=saver)