Image Captioning Techniques Explained
1. Introduction
Image captioning is an important task at the intersection of computer vision and natural language processing, aiming to automatically generate descriptive text for images. This article walks through the relevant techniques in detail: the implementation of the Bahdanau attention model, the construction of the decoder, the definition of the optimizer and loss function, and the training and inference procedures.
2. Implementing the Bahdanau Attention Model
Bahdanau attention plays a key role in image captioning: it lets the model focus on different parts of the image, producing more accurate captions.
2.1 Score Computation
In pseudocode, the Bahdanau attention score is:
score = FC(tanh(FC(EO) + FC(H)))
The actual implementation is:
score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))
Here, the encoder output features (EO) and the previous decoder hidden state (H, expanded with a time axis as hidden_with_time_axis) serve as the two inputs to the score.
2.2 Attention Weight Computation
The attention weights are defined as:
[a_{t,s} = \frac{\exp(score(h_t, h_s))}{\sum_{s'=1}^{S}\exp(score(h_t, h_{s'}))}]
The implementation is:
attention_weights = tf.nn.softmax(self.V(score), axis=1)
Applying a softmax activation to the scores yields attention weights that sum to 1, reflecting the weight, or influence, of each input position.
2.3 Context Vector Computation
The context vector is defined as:
[c_t = \sum_{s}a_{t,s}h_s]
The implementation takes two steps. First, weight each input feature by its attention weight:
context_vector = attention_weights * features
Second, sum over all the weighted features:
context_vector = tf.reduce_sum(context_vector, axis=1)
The context vector is concatenated with the previous decoder output and fed into the decoder's recurrent neural network (RNN) to produce the next output.
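To make the three steps above concrete, here is a minimal, self-contained sketch (an illustration, not the article's original code) that packages them as a standalone Keras layer. The shapes in the comments assume 64 feature positions, consistent with the InceptionV3 reshape used later; in the actual decoder the same W1, W2, and V layers live directly inside RNN_Decoder:
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the encoder output EO
        self.W2 = tf.keras.layers.Dense(units)  # projects the decoder hidden state H
        self.V = tf.keras.layers.Dense(1)       # collapses to one score per position

    def call(self, features, hidden):
        # features: (batch, 64, embedding_dim); hidden: (batch, units)
        hidden_with_time_axis = tf.expand_dims(hidden, 1)                     # (batch, 1, units)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))
        attention_weights = tf.nn.softmax(self.V(score), axis=1)              # (batch, 64, 1), sums to 1
        context_vector = tf.reduce_sum(attention_weights * features, axis=1)  # (batch, embedding_dim)
        return context_vector, attention_weights

# Quick check with dummy tensors:
attn = BahdanauAttention(units=512)
ctx, w = attn(tf.random.normal((4, 64, 256)), tf.random.normal((4, 512)))
print(ctx.shape, w.shape)  # (4, 256) (4, 64, 1)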
3. Implementing the Decoder
The decoder's job is to combine the image features with the caption information and generate the final caption.
3.1 Vectorizing the Caption Indices
First, the caption indices are converted to vectors by an embedding layer:
x = self.embedding(x)
3.2 Merging the Context Vector with the Caption Vector
The context vector and the embedded caption vector are concatenated into a single tensor:
x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
3.3 The GRU Layer
The merged tensor is passed through a gated recurrent unit (GRU):
output, state = self.gru(x)
3.4 The Fully Connected Layer
The GRU output is passed through a fully connected layer:
x = self.fc1(output)
Then x is reshaped to (batch_size * max_length, hidden_size):
x = tf.reshape(x, (-1, x.shape[2]))
3.5 Dropout and BatchNorm Layers
Dropout and batch normalization are applied for regularization:
x = self.dropout(x)
x = self.batchnormalization(x)
3.6 Final Output
The output is passed through another fully connected layer, converting it to shape (64 x 8329), where 64 is the batch size and 8329 is the vocabulary size:
x = self.fc2(x)
Finally, the computed values are returned:
return x, state, attention_weights
In addition, a function is defined to reset the decoder's initial state:
def reset_state(self, batch_size):
    return tf.zeros((batch_size, self.units))
4. Instantiating the Encoder and Decoder
Create the encoder and decoder instances:
encoder = Inception_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)
The tf.keras.utils.plot_model function can be used to draw diagrams of the encoder and decoder, but these diagrams are of limited value here, since the bulk of the processing happens inside the models' call methods.
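A quick shape check with dummy tensors can be more informative than the plots. The sketch below is illustrative and assumes the hyperparameters defined later in the article (BATCH_SIZE = 64, embedding_dim = 256, units = 512, vocab_size = 8329), plus InceptionV3 features reshaped to 64 positions of 2048 channels:
import tensorflow as tf

img_features = tf.random.normal((64, 64, 2048))  # stands in for precomputed InceptionV3 features
features = encoder(img_features)                 # (64, 64, 256)
hidden = decoder.reset_state(batch_size=64)      # (64, 512)
dec_input = tf.ones((64, 1), dtype=tf.int32)     # one <start> index per sample

logits, state, attention_weights = decoder(dec_input, features, hidden)
print(logits.shape)             # (64, 8329): one score per vocabulary word
print(state.shape)              # (64, 512)
print(attention_weights.shape)  # (64, 64, 1)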
5. Defining the Optimizer and Loss Function
We use the Adam optimizer and SparseCategoricalCrossentropy as the loss:
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
The loss function is defined as:
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)
The following concrete example illustrates how the loss is computed. Suppose the ground-truth caption vector real is:
real(passed as a parameter) : tf.Tensor(
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 4 0 0 0 0 1760 0 367 0 0 4 0 0
0 0 0 0 0 0 0 9 453 0 0 0 0 0
0 0 0 0 0 0 132 0 0 0 0 0 0 0
0 0 0 4 0 0 0 5], shape=(64,), dtype=int32)
First, tf.math.equal converts the vector to booleans:
tf.math.equal(real, 0)
The output is:
[True True True True True True True True True True True True
True True True False True True True True False True False True
True False True True True True True True True True True False
False True True True True True True True True True True True
False True True True True True True True True True True False
True True True False], shape=(64,), dtype=bool)
Then tf.math.logical_not negates the result:
mask = tf.math.logical_not(tf.math.equal(real, 0))
The output is:
tf.Tensor(
[False False False False False False False False False False False False
False False False True False False False False True False True False
False True False False False False False False False False False True
True False False False False False False False False False False False
True False False False False False False False False False False True
False False False True], shape=(64,), dtype=bool)
Next, compute the loss tensor:
loss_ = loss_object(real, pred)
The output is:
tf.Tensor(
[13.458616 11.725777 13.339547 13.877813 13.6512375 13.609352
12.680449 13.963526 12.929108 12.504114 12.995626 13.473895
13.966334 13.3766165 13.607654 0.10513641 13.231352 13.313489
13.727711 14.456019 10.560667 13.632038 4.2983437 14.144966
14.331357 0.28515333 13.97144 13.087602 15.597718 13.351999
13.649492 12.489752 12.744471 12.558954 13.255367 1.8581532
3.1811125 13.873036 12.329573 12.222642 13.126439 14.233135
12.379726 11.951986 12.869691 13.468082 12.732171 12.240744
3.8898373 12.682398 13.192276 12.453615 15.758832 14.152502
13.160431 11.863881 12.530688 13.764532 13.640175 0.7283469
14.0648575 12.560375 14.25197 0.53315634], shape=(64,),
dtype=float32)
Cast the mask to match the loss dtype:
mask = tf.cast(mask, dtype=loss_.dtype)
The output is:
tf.Tensor(
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 1. 0. 1. 0.
0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1.], shape=(64,), dtype=float32)
Finally, multiply the loss by the mask element-wise:
loss_ *= mask
This yields the final masked loss tensor, which tf.reduce_mean then averages over all 64 positions (padded positions contribute zeros):
loss tf.Tensor(
[ 0. 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
0. 0. 0. 0.10513641 0. 0.
0. 0. 10.560667 0. 4.2983437 0.
0. 0.28515333 0. 0. 0. 0.
0. 0. 0. 0. 0. 1.8581532
3.1811125 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.
3.8898373 0. 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.7283469
0. 0. 0. 0.53315634], shape=(64,),
dtype=float32)
6. Creating Checkpoints
A separate folder is created for checkpoints, keeping at most five at a time:
checkpoint_path = "./checkpoints/train"
ckpt = tf.train.Checkpoint(encoder=encoder, decoder=decoder, optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)
A start_epoch variable is declared so that training can resume from the last known checkpoint:
start_epoch = 0
if ckpt_manager.latest_checkpoint:
    start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])
    ckpt.restore(ckpt_manager.latest_checkpoint)
else:
    ckpt.restore(tf.train.latest_checkpoint(checkpoint_path))
7. The Training-Step Function
Define the training-step function:
loss_plot = []

def train_step(img_tensor, target):
    loss = 0
    hidden = decoder.reset_state(batch_size=target.shape[0])
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * BATCH_SIZE, 1)
    with tf.GradientTape() as tape:
        features = encoder(img_tensor)
        for i in range(1, target.shape[1]):
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_function(target[:, i], predictions)
            dec_input = tf.expand_dims(target[:, i], 1)
    total_loss = (loss / int(target.shape[1]))
    trainable_variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, trainable_variables)
    optimizer.apply_gradients(zip(gradients, trainable_variables))
    return loss, total_loss
The function first initializes the decoder hidden state and seeds the input with the <start> token; inside a gradient tape it then steps through the target sequence, feeding the ground-truth token at every step as the next decoder input (teacher forcing), before computing and applying the gradients. Note that the dec_input line assumes every batch contains exactly BATCH_SIZE samples.
8. Training the Model
The model is trained by invoking the training-step function over many epochs:
for epoch in range(start_epoch, 20):
    start = time.time()
    total_loss_train = 0
    for (batch, (img_tensor, target)) in enumerate(dataset):
        batch_loss, t_loss = train_step(img_tensor, target)
        total_loss_train += t_loss
    if epoch % 5 == 0:
        ckpt_manager.save()
    print('Epoch {} Train-Loss {:.4f}'.format(epoch + 1, (total_loss_train/num_steps)))
    print('Time taken for this epoch {} sec\n'.format(time.time() - start))
9. Model Inference
To generate captions for unseen images, an evaluate function is defined:
def evaluate(image):
    hidden = decoder.reset_state(batch_size=1)
    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))
    features = encoder(img_tensor_val)
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []
    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)
        predicted_id = tf.random.categorical(predictions, 1)[0][0].numpy()
        result.append(tokenizer.index_word[predicted_id])
        if tokenizer.index_word[predicted_id] == '<end>':
            return result
        dec_input = tf.expand_dims([predicted_id], 0)
    return result
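Note that tf.random.categorical samples from the predicted distribution, so evaluating the same image twice can produce different captions. A deterministic alternative (a sketch, not part of the original code) is greedy decoding; swapping the helper below in for the tf.random.categorical line makes generation repeatable:
import tensorflow as tf

def greedy_id(predictions):
    # predictions: (1, vocab_size) logits from the decoder;
    # pick the highest-scoring word instead of sampling.
    return int(tf.argmax(predictions, axis=-1)[0].numpy())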
A predict function is defined that takes an image URL and a unique name:
def predict(image_url, random_name):
    image_extension = image_url[-4:]
    image_path = tf.keras.utils.get_file('image' + random_name + image_extension, origin=image_url)
    result = evaluate(image_path)
    print('Prediction Caption:', ' '.join(result))
    Image.open(image_path)
    return image_path
Finally, predict is called on a test image:
image_url = 'https://tensorflow.org/images/surf.jpg'
path = predict(image_url , 'surfee')
Image.open(path)
10. Complete Code
The complete image-captioning code follows:
import os
import time
import pickle
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from tensorflow.keras.applications import InceptionV3
from os import listdir
from tqdm import tqdm
from PIL import Image
!wget --no-check-certificate -r 'https://drive.google.com/uc?export=download&id=1c7yGTpizf5egVD9dc3Q2lrxS8wtOAV42' -O Flickr8k_text.zip
!mkdir captions images
!unzip 'Flickr8k_text.zip' -d '/content/captions'
!wget --no-check-certificate -r 'https://drive.google.com/uc?export=download&id=1126G_E2OpvULyvTm0Kz_oMhOzv8CkiW1' -O Flickr8k_Dataset.zip
!unzip 'Flickr8k_Dataset.zip' -d '/content/images'
image_dir = '/content/images/Flicker8k_Dataset'
images = listdir(image_dir)
print("The number of jpg flies in Flicker8k: {}".format(len(images)))
def load(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text
filename = '/content/captions/Flickr8k.token.txt'
doc = load(filename)
dirs = listdir('/content/images/Flicker8k_Dataset')
dirs[:5]
def load_small(doc):
    PATH = '/content/images/Flicker8k_Dataset/'
    img_path = []
    img_id = []
    img_cap = []
    for line in doc.split('\n'):
        tokens = line.split()
        if len(line) < 2:
            continue
        image_id, image_desc = tokens[0], tokens[1:]
        image_id = image_id.split('.')[0]
        image_id = image_id + '.jpg'
        image_desc = ' '.join(image_desc)
        if image_id not in img_id:
            if len(img_id) <= 8000:
                img_id.append(image_id)
                image_path = PATH + image_id
                image_desc = '<start> ' + image_desc + ' <end>'
                if image_id in dirs:
                    img_path.append(image_path)
                    img_cap.append(image_desc)
        else:
            continue
    return img_path, img_cap
all_image_path, all_image_captions = load_small(doc)
print('Number of images: ', len(all_image_path))
all_image_path[:5]
print('Number of captions: ', len(all_image_captions))
all_image_captions[:5]
train_captions, img_name_vector = shuffle(all_image_captions, all_image_path, random_state=1)
image_model = InceptionV3(include_top=False, weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output
image_features_extract_model = tf.keras.Model(new_input, hidden_layer)
def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path
encode_train = sorted(set(img_name_vector))
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(16)
for img, path in tqdm(image_dataset):
    batch_features = image_features_extract_model(img)
    batch_features = tf.reshape(batch_features, (batch_features.shape[0], -1, batch_features.shape[3]))
    for bf, p in zip(batch_features, path):
        path_of_feature = p.numpy().decode("utf-8")
        np.save(path_of_feature, bf.numpy())
tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')
tokenizer.fit_on_texts(train_captions)
max_size = len(tokenizer.word_index)
train_seqs = tokenizer.texts_to_sequences(train_captions)
train_seqs[:5]
max_length = max(len(t) for t in train_seqs)
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')
cap_vector[:5]
BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = max_size + 1
num_steps = len(img_name_vector) // BATCH_SIZE
def map_func(img_name, cap):
    img_tensor = np.load(img_name.decode('utf-8') + '.npy')
    return img_tensor, cap

def create_dataset(img_name_train, caption_train):
    dataset = tf.data.Dataset.from_tensor_slices((img_name_train, caption_train))
    dataset = dataset.map(lambda item1, item2: tf.numpy_function(map_func, [item1, item2], [tf.float32, tf.int32]), num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    return dataset

dataset = create_dataset(img_name_vector, cap_vector)
class Inception_Encoder(tf.keras.Model):
    def __init__(self, embedding_dim):
        super(Inception_Encoder, self).__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x

class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.units, return_sequences=True, return_state=True, recurrent_initializer='glorot_uniform')
        self.fc1 = tf.keras.layers.Dense(self.units)
        self.dropout = tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)
        self.batchnormalization = tf.keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', moving_mean_initializer='zeros', moving_variance_initializer='ones', beta_regularizer=None, gamma_regularizer=None, beta_constraint=None, gamma_constraint=None)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, x, features, hidden):
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))
        attention_weights = tf.nn.softmax(self.V(score), axis=1)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)
        x = self.embedding(x)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        x = self.fc1(output)
        x = tf.reshape(x, (-1, x.shape[2]))
        x = self.dropout(x)
        x = self.batchnormalization(x)
        x = self.fc2(x)
        return x, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))
encoder = Inception_Encoder(embedding_dim)
decoder = RNN_Decoder(embedding_dim, units, vocab_size)
tf.keras.utils.plot_model(encoder)
tf.keras.utils.plot_model(decoder)
optimizer = tf.keras.optimizers.Adam()
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction='none')
def loss_function(real, pred):
    mask = tf.math.logical_not(tf.math.equal(real, 0))
    loss_ = loss_object(real, pred)
    mask = tf.cast(mask, dtype=loss_.dtype)
    loss_ *= mask
    return tf.reduce_mean(loss_)
checkpoint_path = "./checkpoints/train"
ckpt = tf.train.Checkpoint(encoder=encoder, decoder=decoder, optimizer=optimizer)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=5)
start_epoch = 0
if ckpt_manager.latest_checkpoint:
    start_epoch = int(ckpt_manager.latest_checkpoint.split('-')[-1])
    ckpt.restore(ckpt_manager.latest_checkpoint)
else:
    ckpt.restore(tf.train.latest_checkpoint(checkpoint_path))
loss_plot = []
def train_step(img_tensor, target):
    loss = 0
    hidden = decoder.reset_state(batch_size=target.shape[0])
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']] * BATCH_SIZE, 1)
    with tf.GradientTape() as tape:
        features = encoder(img_tensor)
        for i in range(1, target.shape[1]):
            predictions, hidden, _ = decoder(dec_input, features, hidden)
            loss += loss_function(target[:, i], predictions)
            dec_input = tf.expand_dims(target[:, i], 1)
    total_loss = (loss / int(target.shape[1]))
    trainable_variables = encoder.trainable_variables + decoder.trainable_variables
    gradients = tape.gradient(loss, trainable_variables)
    optimizer.apply_gradients(zip(gradients, trainable_variables))
    return loss, total_loss
for epoch in range(start_epoch, 20):
    start = time.time()
    total_loss_train = 0
    for (batch, (img_tensor, target)) in enumerate(dataset):
        batch_loss, t_loss = train_step(img_tensor, target)
        total_loss_train += t_loss
    if epoch % 5 == 0:
        ckpt_manager.save()
    print('Epoch {} Train-Loss {:.4f}'.format(epoch + 1, (total_loss_train/num_steps)))
    print('Time taken for this epoch {} sec\n'.format(time.time() - start))
def evaluate(image):
    hidden = decoder.reset_state(batch_size=1)
    temp_input = tf.expand_dims(load_image(image)[0], 0)
    img_tensor_val = image_features_extract_model(temp_input)
    img_tensor_val = tf.reshape(img_tensor_val, (img_tensor_val.shape[0], -1, img_tensor_val.shape[3]))
    features = encoder(img_tensor_val)
    dec_input = tf.expand_dims([tokenizer.word_index['<start>']], 0)
    result = []
    for i in range(max_length):
        predictions, hidden, attention_weights = decoder(dec_input, features, hidden)
        predicted_id = tf.random.categorical(predictions, 1)[0][0].numpy()
        result.append(tokenizer.index_word[predicted_id])
        if tokenizer.index_word[predicted_id] == '<end>':
            return result
        dec_input = tf.expand_dims([predicted_id], 0)
    return result
def predict(image_url, random_name):
    image_extension = image_url[-4:]
    image_path = tf.keras.utils.get_file('image' + random_name + image_extension, origin=image_url)
    result = evaluate(image_path)
    print('Prediction Caption:', ' '.join(result))
    Image.open(image_path)
    return image_path
image_url = 'https://tensorflow.org/images/surf.jpg'
path = predict(image_url , 'surfee')
Image.open(path)
With the steps above, we have a complete image captioning system covering data preprocessing, model construction, training, and inference. Such a system can automatically generate descriptive captions for images and has broad application prospects in many fields.
11. Analysis of Key Code Modules
To better understand the whole system, the key modules of the code are analyzed in detail below.
11.1 Data Loading and Preprocessing
This code loads the image and caption data and performs the necessary preprocessing.
image_dir = '/content/images/Flicker8k_Dataset'
images = listdir(image_dir)
print("The number of jpg flies in Flicker8k: {}".format(len(images)))
def load(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text
filename = '/content/captions/Flickr8k.token.txt'
doc = load(filename)
dirs = listdir('/content/images/Flicker8k_Dataset')
dirs[:5]
def load_small(doc):
    PATH = '/content/images/Flicker8k_Dataset/'
    img_path = []
    img_id = []
    img_cap = []
    for line in doc.split('\n'):
        tokens = line.split()
        if len(line) < 2:
            continue
        image_id, image_desc = tokens[0], tokens[1:]
        image_id = image_id.split('.')[0]
        image_id = image_id + '.jpg'
        image_desc = ' '.join(image_desc)
        if image_id not in img_id:
            if len(img_id) <= 8000:
                img_id.append(image_id)
                image_path = PATH + image_id
                image_desc = '<start> ' + image_desc + ' <end>'
                if image_id in dirs:
                    img_path.append(image_path)
                    img_cap.append(image_desc)
        else:
            continue
    return img_path, img_cap
all_image_path, all_image_captions = load_small(doc)
print('Number of images: ', len(all_image_path))
all_image_path[:5]
print('Number of captions: ', len(all_image_captions))
all_image_captions[:5]
train_captions, img_name_vector = shuffle(all_image_captions, all_image_path, random_state=1)
The steps are:
1. Define the image directory and count the images.
2. Write a load function that reads a text file.
3. Load the text file containing the image captions.
4. Write a load_small function that selects a subset of the data and adds the <start> and <end> tags.
5. Shuffle the images and captions together.
11.2 Feature Extraction
Image features are extracted with a pretrained InceptionV3 model.
image_model = InceptionV3(include_top=False, weights='imagenet')
new_input = image_model.input
hidden_layer = image_model.layers[-1].output
image_features_extract_model = tf.keras.Model(new_input, hidden_layer)
def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    img = tf.image.resize(img, (299, 299))
    img = tf.keras.applications.inception_v3.preprocess_input(img)
    return img, image_path
encode_train = sorted(set(img_name_vector))
image_dataset = tf.data.Dataset.from_tensor_slices(encode_train)
image_dataset = image_dataset.map(load_image, num_parallel_calls=tf.data.experimental.AUTOTUNE).batch(16)
for img, path in tqdm(image_dataset):
    batch_features = image_features_extract_model(img)
    batch_features = tf.reshape(batch_features, (batch_features.shape[0], -1, batch_features.shape[3]))
    for bf, p in zip(batch_features, path):
        path_of_feature = p.numpy().decode("utf-8")
        np.save(path_of_feature, bf.numpy())
The steps are:
1. Load the pretrained InceptionV3 model without its top fully connected layer.
2. Write a load_image function that reads, decodes, resizes, and preprocesses each image.
3. Build an image dataset and process the images in parallel with map.
4. Extract the features and save them as numpy files. With a 299 x 299 input, InceptionV3 produces an 8 x 8 x 2048 feature map per image, which the reshape flattens into 64 positions of 2048 channels each, as sketched below.
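The following sketch illustrates the reshape in step 4 with a random stand-in tensor; the 16 matches the batch(16) call above:
import tensorflow as tf

batch_features = tf.random.normal((16, 8, 8, 2048))  # stand-in for the extractor's output
flat = tf.reshape(batch_features, (batch_features.shape[0], -1, batch_features.shape[3]))
print(flat.shape)  # (16, 64, 2048): 64 spatial positions, 2048 channels each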
11.3 Caption Processing
The captions are tokenized, padded, and otherwise prepared.
tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')
tokenizer.fit_on_texts(train_captions)
max_size = len(tokenizer.word_index)
train_seqs = tokenizer.texts_to_sequences(train_captions)
train_seqs[:5]
max_length = max(len(t) for t in train_seqs)
cap_vector = tf.keras.preprocessing.sequence.pad_sequences(train_seqs, padding='post')
cap_vector[:5]
The steps are:
1. Create a tokenizer and fit it on the training captions.
2. Convert the captions to integer sequences.
3. Compute the maximum sequence length.
4. Pad the sequences to a uniform length (see the toy example after this list).
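The toy example below (made-up captions; the exact indices depend on word frequency and insertion order) walks through all four steps:
import tensorflow as tf

# The custom filter list leaves '<' and '>' untouched, so the
# <start> and <end> tags survive tokenization.
toy_captions = ['<start> a dog runs <end>', '<start> a dog jumps high <end>']
tok = tf.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+.,-/:;=?@[\]^_`{|}~ ')
tok.fit_on_texts(toy_captions)
seqs = tok.texts_to_sequences(toy_captions)
max_len = max(len(s) for s in seqs)  # 6 for these two captions
padded = tf.keras.preprocessing.sequence.pad_sequences(seqs, padding='post')
print(tok.word_index)  # word -> index mapping
print(padded)          # shape (2, 6); the shorter caption is padded with 0 at the end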
11.4 Dataset Creation
Build the dataset used for training.
BATCH_SIZE = 64
BUFFER_SIZE = 1000
embedding_dim = 256
units = 512
vocab_size = max_size + 1
num_steps = len(img_name_vector) // BATCH_SIZE
def map_func(img_name, cap):
    img_tensor = np.load(img_name.decode('utf-8') + '.npy')
    return img_tensor, cap

def create_dataset(img_name_train, caption_train):
    dataset = tf.data.Dataset.from_tensor_slices((img_name_train, caption_train))
    dataset = dataset.map(lambda item1, item2: tf.numpy_function(map_func, [item1, item2], [tf.float32, tf.int32]), num_parallel_calls=tf.data.experimental.AUTOTUNE)
    dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE).prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
    return dataset

dataset = create_dataset(img_name_vector, cap_vector)
The steps are:
1. Define the batch size, buffer size, and other hyperparameters.
2. Write a map_func function that loads the saved image features.
3. Write a create_dataset function that builds the dataset, including the map, shuffle, batch, and prefetch operations. A quick sanity check on the result is sketched below.
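As mentioned in step 3, a simple sanity check (assuming the .npy feature files from Section 11.2 exist on disk) is to pull a single batch and inspect its shapes:
for img_tensor, cap in dataset.take(1):
    print(img_tensor.shape)  # (64, 64, 2048): a batch of precomputed InceptionV3 features
    print(cap.shape)         # (64, max_length): padded caption index vectors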
12. Model Architecture Analysis
The captioning model consists of an encoder and a decoder; their architectures are analyzed below.
12.1 The Encoder
class Inception_Encoder(tf.keras.Model):
    def __init__(self, embedding_dim):
        super(Inception_Encoder, self).__init__()
        self.fc = tf.keras.layers.Dense(embedding_dim)

    def call(self, x):
        x = self.fc(x)
        x = tf.nn.relu(x)
        return x
The encoder maps the image features into the chosen embedding dimension. The steps are:
1. Define a single fully connected layer.
2. In the call method, pass the input through that layer and apply a ReLU activation (see the shape sketch after this list).
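A shape sketch for the encoder (illustrative, assuming features reshaped to 64 positions of 2048 channels as in the extraction step):
import tensorflow as tf

enc = Inception_Encoder(embedding_dim=256)
dummy = tf.random.normal((4, 64, 2048))  # 4 images, 64 positions, 2048 channels
print(enc(dummy).shape)  # (4, 64, 256): each position mapped to the 256-dim embedding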
12.2 The Decoder
class RNN_Decoder(tf.keras.Model):
    def __init__(self, embedding_dim, units, vocab_size):
        super(RNN_Decoder, self).__init__()
        self.units = units
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(self.units, return_sequences=True, return_state=True, recurrent_initializer='glorot_uniform')
        self.fc1 = tf.keras.layers.Dense(self.units)
        self.dropout = tf.keras.layers.Dropout(0.5, noise_shape=None, seed=None)
        self.batchnormalization = tf.keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001, center=True, scale=True, beta_initializer='zeros', gamma_initializer='ones', moving_mean_initializer='zeros', moving_variance_initializer='ones', beta_regularizer=None, gamma_regularizer=None, beta_constraint=None, gamma_constraint=None)
        self.fc2 = tf.keras.layers.Dense(vocab_size)
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, x, features, hidden):
        hidden_with_time_axis = tf.expand_dims(hidden, 1)
        score = tf.nn.tanh(self.W1(features) + self.W2(hidden_with_time_axis))
        attention_weights = tf.nn.softmax(self.V(score), axis=1)
        context_vector = attention_weights * features
        context_vector = tf.reduce_sum(context_vector, axis=1)
        x = self.embedding(x)
        x = tf.concat([tf.expand_dims(context_vector, 1), x], axis=-1)
        output, state = self.gru(x)
        x = self.fc1(output)
        x = tf.reshape(x, (-1, x.shape[2]))
        x = self.dropout(x)
        x = self.batchnormalization(x)
        x = self.fc2(x)
        return x, state, attention_weights

    def reset_state(self, batch_size):
        return tf.zeros((batch_size, self.units))
The decoder works as follows:
1. Initialize the embedding layer, GRU layer, fully connected layers, Dropout layer, BatchNorm layer, and the attention-related layers.
2. In the call method, compute the attention scores, attention weights, and context vector.
3. Embed the input caption indices and concatenate them with the context vector.
4. Pass the result through the GRU and a fully connected layer.
5. Apply Dropout and BatchNorm for regularization.
6. Produce the final predictions through the last fully connected layer.
7. The reset_state method resets the decoder's initial state.
13. Training and Inference Flow
The training and inference flows of the whole system can be summarized as follows.
13.1 Training Flow
graph TD;
    A[Data loading and preprocessing] --> B[Feature extraction];
    B --> C[Caption processing];
    C --> D[Dataset creation];
    D --> E[Model initialization];
    E --> F[Training loop];
    F --> G[Compute loss];
    G --> H[Apply gradients];
    H --> I[Save checkpoint];
    I --> F;
The steps are:
1. Load the image and caption data and preprocess them.
2. Extract image features with the pretrained model.
3. Tokenize and pad the caption data.
4. Create the training dataset.
5. Initialize the encoder and decoder models.
6. Enter the training loop, computing the loss and applying gradients.
7. Save checkpoints periodically.
13.2 Inference Flow
graph TD;
    J[Input image] --> K[Image preprocessing];
    K --> L[Feature extraction];
    L --> M[Encoder];
    M --> N[Decoder initialization];
    N --> O[Step-by-step prediction];
    O --> P[Caption output];
The steps are:
1. Provide the image to caption.
2. Preprocess the image.
3. Extract the image features.
4. Pass the features through the encoder.
5. Initialize the decoder.
6. Predict words in a loop until the <end> tag appears or the maximum length is reached.
7. Return the generated caption.
14. Summary and Outlook
With the detailed walkthrough above, we have built a complete image captioning system that combines computer vision and natural language processing, using the Bahdanau attention model to improve caption accuracy.
Several directions could improve it further:
- Model complexity: try more powerful architectures, such as the Transformer, to push performance further.
- Data augmentation: apply more augmentation techniques to enlarge the training data and improve generalization.
- Evaluation metrics: adopt more comprehensive metrics, such as BLEU and ROUGE, to assess caption quality more accurately.
With continued optimization, image captioning will find broad application in areas such as assisting visually impaired users, image search, and intelligent surveillance.