25. Image Captioning Project in Practice

pear55

License: CC 4.0 BY-SA

Category: Deep Learning in Practice: From Beginner to Mastery. Tags: image captioning, InceptionV3, Bahdanau attention

Article link: https://blog.youkuaiyun.com/pear55/article/details/151030520

1. Project Overview

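Suitable training and test data are essential when building an image captioning model. Several public datasets are available:
| Dataset | Number of images |
| ---- | ---- |
| Flickr8k | about 8,000 |
| Flickr30k | about 30,000 |
| MS COCO | about 180,000 |

Flickr8k is sufficient for learning purposes; each image in it comes with 5 captions.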

2. Create the Project

First, open a new Colab project and rename it ImageCaptioning, then import the required libraries:

import os
import time
import pickle
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from tensorflow.keras.applications import InceptionV3
from os import listdir
from tqdm import tqdm
from PIL import Image

3. Download the Data

This project needs two kinds of data: the images and their corresponding captions.
- Download the caption data:

!wget --no-check-certificate -r 'https://drive.google.com/uc?export=download&id=1c7yGTpizf5egVD9dc3Q2lrxS8wtOAV42' -O Flickr8k_text.zip
!mkdir captions images
!unzip 'Flickr8k_text.zip' -d '/content/captions'

- Download the image data:

!wget --no-check-certificate -r 'https://drive.google.com/uc?export=download&id=1126G_E2OpvULyvTm0Kz_oMhOzv8CkiW1' -O Flickr8k_Dataset.zip
!unzip 'Flickr8k_Dataset.zip' -d '/content/images'

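The following code checks how many images the dataset contains:

image_dir = '/content/images/Flicker8k_Dataset'
images = listdir(image_dir)
print("The number of jpg files in Flicker8k: {}".format(len(images)))

The output shows 8091 images in the dataset, which is enough for experimenting and learning.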

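4. Parse the Token File

Next, parse the token file to build lists of image names and their corresponding captions. To shorten training time, only one caption per image is used here.
- Load the data: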

# Load the whole document into memory
def load(filename):
    with open(filename, 'r') as file:
        return file.read()

filename = '/content/captions/Flickr8k.token.txt'
doc = load(filename)

- List the files in the image directory:

dirs = listdir('/content/images/Flicker8k_Dataset')
dirs[:5]

Sample output:

['3583065748_7d149a865c.jpg',
 '3358621566_12bac2e9d2.jpg',
 '509778093_21236bb64d.jpg',
 '2094323311_27d58b1513.jpg',
 '3314180199_2121e80368.jpg']

- Build the lists:

def load_small(doc):
    PATH = '/content/images/Flicker8k_Dataset/'
    img_path = []
    img_id = []
    img_cap = []
    for line in doc.split('\n'):
        tokens = line.split()
        if len(line) < 2:
            continue
        image_id , image_desc = tokens[0] , tokens[1:]
        image_id = image_id.split('.')[0]
        image_id = image_id + '.jpg'
        image_desc = ' '.join(image_desc)
        if image_id not in img_id:
            if len(img_id) <= 8000:
                img_id.append(image_id)
                image_path = PATH + image_id
                image_desc = '<start> ' + image_desc + ' <end>'
                if image_id in dirs:
                    img_path.append(image_path)
                    img_cap.append(image_desc)
            else:
                continue
    return img_path , img_cap

all_image_path , all_image_captions = load_small(doc)
print('Number of images: ', len(all_image_path))
all_image_path[:5]
print('Number of captions: ', len(all_image_captions))
all_image_captions[:5]

The output shows 8000 images and 8000 captions, and every caption is wrapped in <start> and <end> tags.
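To make the parsing step concrete, here is a minimal sketch on a few made-up lines. It assumes each line of the token file has the layout `<image name>#<caption index>`, a tab, then the caption; the file names, captions, and the helper `first_captions` are hypothetical. Unlike load_small, it splits on `#` and skips the 8000-image cap and the directory check.

```python
# Hypothetical sample lines in the assumed "<image>#<n><TAB><caption>" layout
sample_doc = (
    "img_a.jpg#0\tA child climbs a set of stairs .\n"
    "img_a.jpg#1\tA girl goes into a wooden building .\n"
    "img_b.jpg#0\tA black dog and a spotted dog are fighting .\n"
)

def first_captions(doc):
    """Return {image file name: first caption wrapped in <start>/<end>}."""
    captions = {}
    for line in doc.split("\n"):
        tokens = line.split()
        if len(tokens) < 2:
            continue  # skip blank or malformed lines
        image_id = tokens[0].split("#")[0]  # drop the "#n" caption index
        if image_id not in captions:        # keep only the first caption
            captions[image_id] = "<start> " + " ".join(tokens[1:]) + " <end>"
    return captions

parsed = first_captions(sample_doc)
for name, caption in parsed.items():
    print(name, "->", caption)
```

Using a dictionary keeps the "already seen" check fast, whereas load_small tests membership in a list.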
- Shuffle the training data:

train_captions, img_name_vector = shuffle(all_image_captions, all_image_path, random_state=1)

5. Load the InceptionV3 Model

The InceptionV3 model is used to extract image features. It is loaded as follows:

image_model = InceptionV3(include_top=False, weights='imagenet')

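This uses the pretrained ImageNet weights and drops the top layers used for image classification. Next, build a tf.keras model of our own for extracting image features: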

new_input = image_model.input
hidden_layer = image_model.layers[-1].output
image_features_extract_model = tf.keras.Model(new_input, hidden_layer)

For a 299x299 input, this output has shape 8x8x2048 and comes from the final convolutional block of InceptionV3.

6. Prepare the Dataset

InceptionV3 expects 299x299 images with pixel values normalized to the range -1 to 1.
- The function that loads and resizes an image:

def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path
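As a rough illustration of the normalization, the sketch below assumes that preprocess_input for InceptionV3 amounts to dividing by 127.5 and subtracting 1. The helper name `scale_to_unit_range` is made up for this example and is not part of TensorFlow.

```python
import numpy as np

def scale_to_unit_range(pixels):
    """Map pixel values in [0, 255] to floats in [-1, 1]."""
    return np.asarray(pixels, dtype=np.float32) / 127.5 - 1.0

pixels = np.array([0, 127.5, 255])   # darkest, middle and brightest values
scaled = scale_to_unit_range(pixels)
print(scaled)
```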

- Create the image dataset:

encode_train = sorted(set(img_name_vector))
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(16)

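7. Extract the Features

For each image in the dataset, call the image_features_extract_model created earlier to extract features, then reshape them and save them to a file on disk: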

for img, path in tqdm(image_dataset):
    batch_features = image_features_extract_model(img)
    batch_features = tf.reshape(batch_features, (batch_features.shape[0], -1, batch_features.shape[3]))
    for bf, p in zip(batch_features, path):
        path_of_feature = p.numpy().decode("utf-8")
        np.save(path_of_feature, bf.numpy())

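Keeping the features in memory would be faster but would consume a lot of RAM, so they are written to disk instead. On a GPU, the loop above takes about 2 minutes.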
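The reshape and the file naming can be checked without running the model. The sketch below uses an all-zero array as a stand-in for a batch of InceptionV3 outputs and a hypothetical file name in a temporary directory; it shows that np.save appends ".npy" to a name that does not already end with it, which is why the features are later loaded from the image path plus '.npy'.

```python
import os
import tempfile
import numpy as np

# Stand-in for one batch of InceptionV3 outputs: (batch, 8, 8, 2048)
batch_features = np.zeros((2, 8, 8, 2048), dtype=np.float32)

# Flatten the 8x8 grid into 64 locations, keeping the 2048 channels
flat = batch_features.reshape(batch_features.shape[0], -1, batch_features.shape[3])
print(flat.shape)

# np.save appends ".npy" when the target name does not already end with it
with tempfile.TemporaryDirectory() as tmp:
    image_path = os.path.join(tmp, "example.jpg")  # hypothetical file name
    np.save(image_path, flat[0])
    saved_files = sorted(os.listdir(tmp))
    restored = np.load(image_path + ".npy")

print(saved_files)
print(restored.shape)
```

Note that the saved files end up next to the images, so a later listdir on the image directory also returns the .npy files.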

8. Create the Vocabulary

Build a vocabulary of all unique words:

tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')
tokenizer.fit_on_texts(train_captions)
max_size = len(tokenizer.word_index)

9. Create the Input Sequences

The following code converts the captions to sequences of token ids:

train_seqs = tokenizer.texts_to_sequences(train_captions)
train_seqs[:5]

Sample output:

[[2, 1, 2339, 8, 155, 2340, 1198, 19, 2341, 1390, 24, 480, 554, 3],
 [2, 21, 1714, 7, 1199, 1715, 1, 108, 2342, 19, 5, 173, 3],
 [2, 1, 11, 4, 1, 28, 32, 506, 1, 507, 3],
 [2, 1, 101, 102, 12, 1, 26, 3],
 [2, 63, 34, 4, 1, 272, 3]]

Because these sequences differ in length, they need to be padded:

max_length = max(len(t) for t in train_seqs)
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')
cap_vector[:5]

After padding, all token sequences have the same length.
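For intuition, here is a pure-Python sketch of what the tokenizer and the padding do. It assumes the word index is ranked by frequency starting at 1, with 0 left free for padding; the two captions are made up, and this is not the Keras implementation.

```python
from collections import Counter

captions = [
    "<start> a dog runs <end>",
    "<start> a child climbs the stairs <end>",
]

# Rank words by frequency; index 0 stays free for padding
counts = Counter(word for c in captions for word in c.lower().split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

seqs = [[word_index[w] for w in c.lower().split()] for c in captions]
max_len = max(len(s) for s in seqs)

# 'post' padding: append zeros after the real tokens
padded = [s + [0] * (max_len - len(s)) for s in seqs]
print(padded)
```

Because padding uses id 0, the loss function later in the article can recognize and mask the padded positions.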

10. Create the Training Dataset

Declare a few variables used to build the dataset:

BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = max_size + 1
num_steps = len(img_name_vector) // BATCH_SIZE
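A quick arithmetic check of num_steps, assuming the 8000 images reported earlier: integer division counts only full batches, while the dataset also yields a final partial batch whenever the sizes do not divide evenly.

```python
import math

num_images = 8000      # number of images reported by load_small
BATCH_SIZE = 64

num_steps = num_images // BATCH_SIZE              # full batches only
num_batches = math.ceil(num_images / BATCH_SIZE)  # batches the dataset yields
print(num_steps, num_batches)
```

With these numbers the two values agree, so averaging the epoch loss over num_steps does not skew the result.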

Define the function that loads an image's saved feature array:

def map_func(img_name, cap):
    img_tensor = np.load(img_name.decode('utf-8')+'.npy')
    return img_tensor, cap

The function that creates the dataset:

def create_dataset(img_name_train,caption_train):
    dataset = tf.data.Dataset.from_tensor_slices((img_name_train, caption_train))
    dataset = dataset.map(lambda item1, item2: tf.numpy_function(map_func, [item1, item2], [tf.float32, tf.int32]), num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    return dataset

dataset = create_dataset(img_name_vector,cap_vector)

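11. Create the Model

We build a sequence-to-sequence model with Bahdanau attention and a gated recurrent unit (GRU). Like LSTM, GRU adds a gating mechanism to an RNN, but it has fewer parameters, so it typically trains faster and uses less memory; on long sequences it may be less accurate than LSTM.

Bahdanau attention works as follows:
1. Generate the encoder hidden states for the given input image.
2. Compute the alignment scores between each encoder hidden state and the previous decoder hidden state.
3. Apply Softmax to the alignment scores.
4. Compute the context vector.
5. Decode the output.
6. Repeat steps 2 to 5 until the end token is reached.

The following flowchart, in mermaid format, shows the Bahdanau attention workflow: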

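graph LR
    A[Input image] --> B[Generate encoder hidden states]
    B --> C[Compute alignment scores]
    C --> D[Softmax over alignment scores]
    D --> E[Compute context vector]
    E --> F[Decode output]
    F --> G{End token reached?}
    G -- No --> C
    G -- Yes --> H[Stop]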
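Steps 2 to 4 of the workflow can be sketched in NumPy for a single image. This is only an illustration: the projection matrices W1, W2 and V are randomly initialized stand-ins for learned weights, and the sizes (64 locations, 256 feature dimensions, 512 units) are the ones used in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
locations, feat_dim, units = 64, 256, 512

features = rng.normal(size=(locations, feat_dim))  # encoder outputs for one image
hidden = rng.normal(size=(units,))                 # previous decoder hidden state

# Randomly initialized stand-ins for the learned W1, W2 and V projections
W1 = rng.normal(size=(feat_dim, units)) * 0.01
W2 = rng.normal(size=(units, units)) * 0.01
V = rng.normal(size=(units, 1)) * 0.01

# Additive score for every location: V^T tanh(W1 f_i + W2 h)
score = np.tanh(features @ W1 + hidden @ W2) @ V   # shape (64, 1)

# Softmax over the location axis
weights = np.exp(score - score.max())
weights /= weights.sum(axis=0, keepdims=True)      # shape (64, 1)

# Context vector: attention-weighted sum of the features
context = (weights * features).sum(axis=0)         # shape (256,)

print(weights.shape, context.shape, float(weights.sum()))
```

The hidden state is shared by all 64 locations, which is why the decoder code adds a time axis to it before the sum.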

12. Create the Encoder

The encoder takes the extracted features as input and passes them through a fully connected layer:

class Inception_Encoder(tf.keras.Model):
    def __init__(self, embedding_dim):
        super(Inception_Encoder, self).__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x
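The shape change the encoder performs can be sketched in NumPy. The kernel and bias below are random stand-ins for the Dense layer's learned weights, shown for a single image with 64 locations of 2048 channels.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(64, 2048))          # one image: 64 locations
kernel = rng.normal(size=(2048, 256)) * 0.01    # stand-in for the Dense kernel
bias = np.zeros(256)

# Dense layer followed by ReLU, applied to every location independently
encoded = np.maximum(features @ kernel + bias, 0)
print(encoded.shape)
```

The same projection is applied at each of the 64 locations, so the spatial layout is preserved while the channel count drops to embedding_dim.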

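13. Create the Decoder

The decoder is the most important part of this application and uses the Bahdanau attention mechanism.

Two attention mechanisms are commonly contrasted:
1. Bahdanau attention: additive attention, which scores each input with a small feed-forward network
2. Luong attention: multiplicative attention, which scores each input with a dot product

Both are deterministic "soft" attention: they weight every input location rather than sampling a single one, as stochastic "hard" attention does. Bahdanau attention is used here. It computes an attention weight for each input vector (that is, each extracted image feature) and from them a context vector. The context vector is a dynamic representation of the parts of the image that are relevant at time t.

At each time step the decoder works from three pieces of information:
1. The context vector (the sum of the encoder outputs weighted by the attention weights)
2. The decoder output from the previous time step
3. The previous decoder hidden state

The decoder code is as follows: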

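class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units = units
        # Token embedding (assumed here; token ids must become float vectors
        # before they can be concatenated with the context vector)
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units, return_sequences=True, return_state=True, recurrent_initializer='glorot_uniform')
        # Defined in the listing but not applied in call()
        self.batchnormalization = tf.keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', moving_mean_initializer='zeros', moving_variance_initializer='ones', beta_regularizer=None, gamma_regularizer=None, beta_constraint=None, gamma_constraint=None)
        # Output layer used by call() (assumed: one logit per vocabulary word)
        self.fc = tf.keras.layers.Dense(vocab_size)
        # Layers of the additive attention score
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, x, features, hidden):
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        # Compute the attention scores
        score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis)))
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)
        x = self.embedding(x)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        output = tf.reshape(output, (-1, output.shape[2]))
        x = self.fc(output)
        return x, state, attention_weights

    def reset_state(self, batch_size):
        # Initial hidden state of all zeros, as used by the training and evaluation code
        return tf.zeros((batch_size, self.units))

# Instances used by the training and evaluation code (constructor arguments as assumed above)
encoder = Inception_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)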

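With the steps above, we have covered the pipeline from data preparation to model creation, which lays the groundwork for the training and prediction that follow.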

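14. The Decoder's call Method

The decoder takes three inputs:
1. The encoder output
2. The hidden state (initialized to 0)
3. The decoder input (that is, the start token)

The call method is as follows: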

def call(self, x, features, hidden):
    hidden_with_time_axis = tf.expand_dims(hidden, 1)
    # Attention score computation
    score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis)))
    attention_weights = tf.nn.softmax(score, axis = 1)
    context_vector = attention_weights * features
    context_vector = tf.reduce_sum(context_vector, axis = 1)
    # Embed the token ids (assumes an Embedding layer stored as self.embedding)
    x = self.embedding(x)
    x = tf.concat([tf.expand_dims(context_vector, 1), x], axis = -1)
    output, state = self.gru(x)
    output = tf.reshape(output, (-1, output.shape[2]))
    x = self.fc(output)
    return x, state, attention_weights

The call is explained step by step below:
|Step|Operation|Description|
| ---- | ---- | ---- |
|1| hidden_with_time_axis = tf.expand_dims(hidden, 1) |Add a time axis to the previous decoder hidden state so that it broadcasts against the features|
|2| score = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))) |Compute the attention scores|
|3| attention_weights = tf.nn.softmax(score, axis = 1) |Apply Softmax to the scores to obtain the attention weights|
|4| context_vector = attention_weights * features |Weight each feature location by its attention weight|
|5| context_vector = tf.reduce_sum(context_vector, axis = 1) |Sum over the locations to obtain the context vector|
|6| x = self.embedding(x) |Turn the input token ids into embedding vectors|
|7| x = tf.concat([tf.expand_dims(context_vector, 1), x], axis = -1) |Concatenate the context vector with the embedded input|
|8| output, state = self.gru(x) |Run the GRU layer to get the output and the state|
|9| output = tf.reshape(output, (-1, output.shape[2])) |Reshape the output|
|10| x = self.fc(output) |Pass through the fully connected layer to get the final output|

15. Train the Model

With the model defined, the next step is to train it.
- Define the optimizer and the loss function:

optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')

def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)
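To see what the mask does, here is a NumPy sketch of the same idea: cross-entropy from logits, zeroed wherever the target is the padding id 0. The function `masked_loss` is written for this illustration and is not the TensorFlow implementation.

```python
import numpy as np

def masked_loss(real, logits):
    """Mean cross-entropy over a batch, with padded targets (id 0) zeroed out."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_example = -log_probs[np.arange(len(real)), real]
    mask = (real != 0).astype(per_example.dtype)
    return (per_example * mask).mean()

real = np.array([3, 0])        # the second target is padding
logits = np.zeros((2, 4))      # uniform predictions over 4 tokens
loss = masked_loss(real, logits)
print(loss)
```

As in loss_function, the mean is taken over the whole batch, so padded positions contribute zero to the sum but still count in the denominator.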

- The training step:

@tf.function
def train_step(img_tensor, target):
    loss = 0
    hidden = decoder.reset_state(batch_size=target.shape[0])
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * target.shape[0], 1)
    with tf.GradientTape() as tape:
        features = encoder(img_tensor)
        for i in range(1, target.shape[1]):
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_function(target[:, i], predictions)
            dec_input = tf.expand_dims(target[:, i], 1)
    total_loss = (loss / int(target.shape[1]))
    trainable_variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, trainable_variables)
    optimizer.apply_gradients(zip(gradients, trainable_variables))
    return loss, total_loss

The training step is shown in the following flowchart:

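graph LR
    A[Input image tensor and target caption] --> B[Initialize loss and hidden state]
    B --> C[Get encoder features]
    C --> D{All target tokens processed?}
    D -- No --> E[Decoder prediction]
    E --> F[Compute loss]
    F --> G[Update decoder input]
    G --> D
    D -- Yes --> H[Compute total loss]
    H --> I[Compute gradients]
    I --> J[Apply gradients to update parameters]
    J --> K[Return loss and total loss]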
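The "update decoder input" step uses teacher forcing: the next input is the ground-truth token, not the model's own prediction. The pure-Python sketch below lists the input/target pairs the loop walks through for one padded caption; the helper `teacher_forcing_pairs` is made up for this example, and the token ids follow the sample sequences shown earlier (2 for `<start>`, 3 for `<end>`, 0 for padding).

```python
def teacher_forcing_pairs(target):
    """List the (decoder input, expected output) pair for each step."""
    pairs = []
    dec_input = target[0]            # the <start> token id
    for i in range(1, len(target)):
        pairs.append((dec_input, target[i]))
        dec_input = target[i]        # feed the ground truth, not the prediction
    return pairs

target = [2, 63, 34, 4, 1, 272, 3, 0]   # one caption with a padded position
pairs = teacher_forcing_pairs(target)
print(pairs)
```

The final pair has the padding id as its expected output, which is exactly the case the masked loss ignores.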

- The training loop:

EPOCHS = 20
for epoch in range(EPOCHS):
    start = time.time()
    total_loss = 0
    for (batch, (img_tensor, target)) in enumerate(dataset):
        batch_loss, t_loss = train_step(img_tensor, target)
        total_loss += t_loss
        if batch % 100 == 0:
            print(f'Epoch {epoch+1} Batch {batch} Loss {batch_loss.numpy():.4f}')
    print(f'Epoch {epoch+1} Loss {total_loss/num_steps:.4f}')
    print(f'Time taken for 1 epoch {time.time()-start:.2f} sec\n')

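16. Evaluate the Model

After training, the model needs to be evaluated. Here we assess it by generating captions for images and comparing them with the ground-truth captions.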

def evaluate(image):
    attention_plot = np.zeros((max_length, 64))
    hidden = decoder.reset_state(batch_size = 1)
    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))
    features = encoder(img_tensor_val)
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []
    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)
        attention_plot[i] = tf.reshape(attention_weights, (-1, )).numpy()
        predicted_id = tf.random.categorical(predictions, 1)[0][0].numpy()
        result.append(tokenizer.index_word[predicted_id])
        if tokenizer.index_word[predicted_id] == '<end>':
            return result, attention_plot
        dec_input = tf.expand_dims([predicted_id], 0)
    attention_plot = attention_plot[:len(result), :]
    return result, attention_plot

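The evaluation proceeds as follows:
1. Initialize the attention plot and the hidden state.
2. Extract the image features.
3. Pass them through the encoder to get the feature representation.
4. Starting from the start token, generate the caption one token at a time.
5. Stop generating when the end token is reached.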
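In step 4, evaluate draws each token with tf.random.categorical, which samples from the predicted distribution rather than always taking the most likely token, so captions can differ between runs. The NumPy sketch below contrasts sampling with a greedy argmax; the logits are made-up numbers.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 0.5, 0.1, -1.0])   # hypothetical scores for 4 tokens
probs = softmax(logits)

greedy_id = int(np.argmax(probs))                          # always the same token
sampled_ids = rng.choice(len(probs), size=1000, p=probs)   # follows the distribution

print(greedy_id)
print(np.bincount(sampled_ids, minlength=len(probs)) / 1000)
```

Greedy decoding gives repeatable captions, while sampling adds variety at the cost of occasionally picking unlikely words.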

17. Show the Results

We can pick a few images to test and display the generated captions.

# Pick an image to test
image_path = '/content/images/Flicker8k_Dataset/1000268201_693b08cb0e.jpg'
result, attention_plot = evaluate(image_path)
print('Prediction Caption:', ' '.join(result))
# Display the image
img = Image.open(image_path)
plt.imshow(img)
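The attention_plot returned by evaluate holds 64 weights per generated word. A small sketch of how one such row could be mapped back to the 8x8 feature grid follows; the weights here are invented, with one location deliberately emphasized.

```python
import numpy as np

# Stand-in for one row of attention_plot: 64 weights that sum to 1
weights = np.full(64, 0.5 / 63)
weights[27] = 0.5            # pretend the model focuses on location 27

grid = weights.reshape(8, 8)   # back to the 8x8 feature-map layout
row, col = np.unravel_index(np.argmax(grid), grid.shape)
print(row, col)
```

A grid like this could be upsampled and overlaid on the image to show where the model looked for each word.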

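18. Summary

In this image captioning project we went through the whole pipeline, from data preparation and model creation to training and evaluation. The steps were:
1. Data preparation: download and parse the image and caption data, and build the lists of image paths and captions.
2. Feature extraction: extract image features with the InceptionV3 model and save them to files.
3. Vocabulary and sequences: build the vocabulary, convert the captions to input sequences, and pad them.
4. Model creation: build a sequence-to-sequence model with Bahdanau attention and a GRU, consisting of an encoder and a decoder.
5. Model training: define the optimizer and loss function, and train the model.
6. Model evaluation: write the evaluation function, evaluate the model, and display the results.

Further tuning of the model parameters and the training strategy can improve performance and produce more accurate, more natural captions.

That covers the whole image captioning project. I hope it is helpful.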