Model acceleration: speeding up I/O with TFRecord and Keras Sequence

Article link: https://blog.youkuaiyun.com/joyce_peng/article/details/107466265

This article shows how to use Keras's Sequence and TensorFlow's TFRecord to speed up model training on large datasets by cutting data-loading time and raising GPU utilization.

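When training a model, especially on a large dataset, most of the time often goes to data I/O rather than to actual GPU computation. In other words, the GPU spends much of its time idle, waiting for data.
Keras addresses this with Sequence, and TensorFlow with TFRecord. Alternatively, saving the images locally in .npy format and reading them during training is also much faster.

Reading a large dataset into memory all at once uses a lot of memory. TensorFlow's queues and Keras's Sequence avoid this by supplying the data batch by batch.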
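
As a rough sketch of the .npy option mentioned above: assuming the images have already been decoded and resized to one common shape (the array contents, file name, and batch size below are made up for illustration), NumPy can write them to a single .npy file once and then memory-map it during training, so that only the slices actually requested are read from disk.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-in for a decoded, resized image set: 10 RGB images of 8x8 pixels.
images = np.random.randint(0, 256, size=(10, 8, 8, 3), dtype=np.uint8)

# Save once, ahead of training.
npy_path = os.path.join(tempfile.mkdtemp(), "train_images.npy")
np.save(npy_path, images)

# mmap_mode="r" maps the file instead of loading it whole into memory.
mapped = np.load(npy_path, mmap_mode="r")

# Only this slice is read and converted when a batch is requested.
batch_size = 4
first_batch = np.asarray(mapped[:batch_size], dtype=np.float32) / 255.0
print(first_batch.shape)
```

Storing uint8 and scaling to float at load time keeps the file four times smaller than storing float32 values directly.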

1. Keras Sequence
2. TensorFlow TFRecord

1. Keras Sequence

The official Keras documentation provides a reference example; here is a commonly used one.

import ast
import math
import os
import random

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.image import img_to_array, load_img

def load_image(image_path, size):
    # Load one image, resize it to (size, size) and scale the pixels to [0, 1]
    return img_to_array(load_img(image_path, target_size=(size, size))) / 255.

# For another shuffling implementation see https://www.kaggle.com/wrosinski/pretrained-cnn-albumentations
class KagglePlanetSequence(tf.keras.utils.Sequence):
    """
    Uses Sequence for reasonably efficient data I/O without reading the whole dataset into memory
    """

    def __init__(self, df, data_path, im_size, batch_size, mode='train'):
        """
        df: pandas dataframe that contains columns with image names and labels
        data_path: path that contains the training images
        im_size: image size
        batch_size: number of samples per batch
        mode: when in training mode, data will be shuffled between epochs
        """
        super().__init__()
        self.df = df
        self.batch_size = batch_size
        self.im_size = im_size
        self.mode = mode

        # Take labels and a list of image locations in memory
        # ast.literal_eval is like eval but safer: it only parses Python literals
        self.wlabels = self.df['weather_labels'].apply(lambda x: ast.literal_eval(x)).tolist()
        self.glabels = self.df['ground_labels'].apply(lambda x: ast.literal_eval(x)).tolist()
        self.image_list = self.df['image_name'].apply(lambda x: os.path.join(data_path, x + '.jpg')).tolist()

        # Build the initial sample order (shuffled in training mode)
        self.on_epoch_end()

    def __len__(self):
        # Number of batches per epoch, rounded up so the last partial batch is kept
        return int(math.ceil(len(self.df) / float(self.batch_size)))

    def on_epoch_end(self):
        # Reshuffle the sample order after every epoch
        self.indexes = list(range(len(self.image_list)))
        if self.mode == 'train':
            random.shuffle(self.indexes)

    def _batch_indexes(self, idx):
        # Sample indexes that belong to batch number idx
        return self.indexes[idx * self.batch_size: (idx + 1) * self.batch_size]

    def get_batch_labels(self, idx):
        # Labels of one batch, in the same order as the images
        batch = self._batch_indexes(idx)
        return [np.array([self.wlabels[i] for i in batch]),
                np.array([self.glabels[i] for i in batch])]

    def get_batch_features(self, idx):
        # Images of one batch
        batch = self._batch_indexes(idx)
        return np.array([load_image(self.image_list[i], self.im_size) for i in batch])

    def __getitem__(self, idx):
        batch_x = self.get_batch_features(idx)
        batch_y = self.get_batch_labels(idx)
        return batch_x, batch_y
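
To make the batching arithmetic concrete, here is a small pure-Python sketch (no Keras involved; the helper names and the sample count are hypothetical) of what __len__ computes and of how a shuffled index list can be sliced into batches. When the sample count is not a multiple of the batch size, the last batch is simply shorter.

```python
import math
import random


def num_batches(num_samples, batch_size):
    # Same formula as __len__ in the Sequence: round up so the tail is not dropped.
    return int(math.ceil(num_samples / float(batch_size)))


def batch_indexes(indexes, idx, batch_size):
    # Slice the (possibly shuffled) index list; slicing past the end just truncates.
    return indexes[idx * batch_size:(idx + 1) * batch_size]


num_samples, batch_size = 10, 4
indexes = list(range(num_samples))
random.shuffle(indexes)  # a new random order, as on_epoch_end produces in training mode

batches = [batch_indexes(indexes, i, batch_size)
           for i in range(num_batches(num_samples, batch_size))]
print([len(b) for b in batches])  # → [4, 4, 2]
```

Every sample index appears in exactly one batch, so one pass over all batches is one epoch.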

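This Sequence object can be passed to fit_generator in place of a custom generator to train the model. Note that the number of steps per epoch does not need to be given, because the __len__ method already supplies it. In addition, tf.keras supports all the usual Keras callbacks for enhancing the training loop, such as early stopping, learning-rate scheduling, and writing logs for TensorBoard visualization. For example, the ModelCheckpoint callback can save the model after every epoch, so that training can be resumed from a saved model at any time.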

# Usage
seq = KagglePlanetSequence(df_train,
                       './train-jpg/',
                       im_size=IM_SIZE,
                       batch_size=32)
another_model = tf.keras.models.load_model('./model.h5')
# Since TensorFlow 2.1, fit_generator is deprecated and fit(seq, ...) accepts a Sequence directly
another_model.fit_generator(generator=seq, verbose=1, epochs=1)

# Testing
test_seq = KagglePlanetSequence(df_train,
                       './train-jpg/',
                       im_size=IM_SIZE,
                       batch_size=32,
                       mode='test') # test mode disables shuffling

# Likewise, predict(test_seq, ...) replaces predict_generator since TensorFlow 2.1
predictions = another_model.predict_generator(generator=test_seq, verbose=1)

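2. TensorFlow TFRecord

tf.data in TensorFlow is a very powerful data-source interface that can feed data of many different forms into a model for training.
TFRecord is a binary file format. It is harder to inspect than other formats, but it makes better use of memory, is easier to copy and move, and needs no separate label file. In principle it can store all the information about a sample.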
For an explanation of how TFRecord works, see https://www.jianshu.com/p/b251e85ac582
For a walkthrough of TFRecord code, see https://www.cnblogs.com/wj-1314/p/11211333.html

2.1 tf.Example

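The core of a TFRecord file is a series of Examples. An Example is a message defined with Protocol Buffers (protobuf), a general-purpose serialization format that works with all mainstream programming languages. Its list-typed fields therefore map to lists in Python and to arrays in Java or C/C++.

An Example message contains a set of features stored as a map of key-value pairs. Each key is a string and each value is a Feature message. The data is thus represented as a message of the form {'string': value}, and TensorFlow commonly uses tf.Example to write and read TFRecord data.

tf.Example generally supports the following value types:

tf.train.BytesList: accepts string and byte
tf.train.FloatList: accepts float and double
tf.train.Int64List: accepts enum, bool, int32, uint32, int64
TFRecord supports writing three kinds of data: string, int64, and float32. Each is written as a list through tf.train.BytesList, tf.train.Int64List, or tf.train.FloatList respectively and then wrapped in a tf.train.Feature.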
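
The mapping from Python values to the three list types can be sketched in plain Python. The helper below is hypothetical and not part of TensorFlow; it only returns the name of the tf.train.Feature field that a value would go into, following the list above.

```python
def feature_field_for(value):
    """Return the tf.train.Feature field name suited to a Python value (hypothetical helper)."""
    if isinstance(value, (bytes, str)):
        return "bytes_list"   # a str has to be encoded to bytes before it is written
    if isinstance(value, (bool, int)):
        return "int64_list"   # bool and int are both stored as int64
    if isinstance(value, float):
        return "float_list"   # stored as 32-bit floats
    raise TypeError("unsupported type: %r" % type(value))


for v in (b"cat", "cat", True, 42, 0.5):
    print(repr(v), "->", feature_field_for(v))
```

An array such as an image does not fit any of the three directly, which is why it is usually converted to raw bytes and stored in a bytes_list.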

import tensorflow as tf

# Build a tf.train.Example
# Collect the data to be written in a dict, build tf.train.Features from it, then build tf.train.Example
def get_tfrecords_example(feature, label):
    tfrecords_features = {}
    feat_shape = feature.shape
    # ndarray.tobytes() replaces the deprecated tostring()
    tfrecords_features['feature'] = tf.train.Feature(bytes_list=
                                              tf.train.BytesList(value=[feature.tobytes()]))
    tfrecords_features['shape'] = tf.train.Feature(int64_list=
                                              tf.train.Int64List(value=list(feat_shape)))
    # label must be an iterable of floats
    tfrecords_features['label'] = tf.train.Feature(float_list=
                                              tf.train.FloatList(value=label))

    return tf.train.Example(features=tf.train.Features(feature=tfrecords_features))

# The helper functions involved are as follows
def _bytes_feature(value):
    """Returns a bytes_list from a string/byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    """Returns a float_list from a float/double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    """Returns an int64_list from a bool/enum/int/uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

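After serializing the tf.train.Example, write it to a TFRecord file with tf.io.TFRecordWriter (named tf.python_io.TFRecordWriter in TensorFlow 1.x), as follows:

# Create the TFRecord writer; the file name is xxx.tfrecord
tfrecord_wrt = tf.io.TFRecordWriter('xxx.tfrecord')
# Put one sample into an Example (feats, labels, and inx are defined elsewhere, e.g. in a loop over the samples)
exmp = get_tfrecords_example(feats[inx], labels[inx])
# Serialize the Example
exmp_serial = exmp.SerializeToString()
# Write it to the TFRecord file
tfrecord_wrt.write(exmp_serial)
# Close the writer when everything has been written
tfrecord_wrt.close()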

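2.2 Converting an image to TFRecord format

Compared with the writing step above, the only extra work is turning the image into a feature.

# Process the image
# Read the image file and decode it
image = tf.io.read_file(input_path)
image_data = tf.image.decode_jpeg(image)
# Convert the decoded image to raw bytes (.numpy() requires eager execution, the default in TF 2.x)
image_data = image_data.numpy().tobytes()

# Or the Keras way (which also resizes the image)
im = (img_to_array(load_img(im_list[i], target_size=(IM_SIZE, IM_SIZE))) / 255.).tobytes()

# Encode the label name
name = bytes('cat', encoding='utf-8')

The complete code: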

# -*- coding: utf-8 -*-
import tensorflow as tf


def write_test(input_path, output_path):
    # A TFRecordWriter is needed to write data into a TFRecord file
    # (TensorFlow 1.x equivalents: tf.python_io.TFRecordWriter, tf.read_file, and sess.run(image) inside a tf.Session)
    writer = tf.io.TFRecordWriter(output_path)

    # Read the image file and decode it; with eager execution (TF 2.x) .numpy() returns the pixel array
    image = tf.io.read_file(input_path)
    image = tf.image.decode_jpeg(image).numpy()
    shape = image.shape
    # Convert the image to raw bytes
    image_data = image.tobytes()
    print(type(image))
    print(len(image_data))
    name = bytes('cat', encoding='utf-8')
    print(type(name))
    # Create the Example object and fill in each Feature
    example = tf.train.Example(features=tf.train.Features(feature={
         'name': tf.train.Feature(bytes_list=tf.train.BytesList(value=[name])),
         # If all images have the same fixed size, the shape field can be omitted
         'shape': tf.train.Feature(int64_list=tf.train.Int64List(value=[shape[0], shape[1], shape[2]])),
         'data': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_data]))
    }
    ))
    # Serialize the Example to a byte string, then write it
    writer.write(example.SerializeToString())
    writer.close()


if __name__ == '__main__':
    input_photo = 'cat.jpg'
    output_file = 'cat.tfrecord'
    write_test(input_photo, output_file)
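
The reason for storing shape next to data is that tobytes() (like the older tostring()) flattens the array and discards its dimensions. A small NumPy-only sketch, with a made-up array standing in for a decoded image, shows the round trip: the bytes can only be turned back into an image when both the dtype and the shape are known.

```python
import numpy as np

# Made-up stand-in for a decoded JPEG: height 4, width 6, 3 channels, uint8.
image = np.arange(4 * 6 * 3, dtype=np.uint8).reshape(4, 6, 3)

shape = image.shape
image_data = image.tobytes()            # raw bytes, no shape or dtype information
print(len(image_data))                  # → 72, i.e. 4 * 6 * 3 one-byte values

# Reading back: reinterpret the bytes with the same dtype, then restore the shape.
flat = np.frombuffer(image_data, dtype=np.uint8)
restored = flat.reshape(shape)
print(np.array_equal(restored, image))  # → True
```

If the array had been scaled to float32 before writing, the reader would have to pass dtype=np.float32 instead, so the dtype is effectively part of the file's contract as well.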

2.3 Processing TFRecord

Read the TFRecord file written earlier and form batches for training.

# Read the TFRecord file with a dataset (tf.contrib.data.TFRecordDataset only existed in early TF 1.x releases)
dataset = tf.data.TFRecordDataset(input_file)

def _parse_record(example_photo):
    features = {
        'name': tf.io.FixedLenFeature((), tf.string),
        'shape': tf.io.FixedLenFeature([3], tf.int64),
        'data': tf.io.FixedLenFeature((), tf.string)
    }
    # Each record is a serialized tf.train.Example; parse it with tf.io.parse_single_example
    # (tf.parse_single_example and tf.FixedLenFeature in TensorFlow 1.x)
    parsed_features = tf.io.parse_single_example(example_photo, features=features)
    return parsed_features

# To parse every record in the file, apply the function with the dataset's map method
dataset = dataset.map(_parse_record)


# map accepts any function for processing the elements of the dataset.
# The repeat, shuffle, and batch methods repeat, shuffle, and batch the dataset;
# repeat duplicates the dataset so that it covers several epochs:
ds_train = dataset.repeat(epochs).shuffle(buffer_size).batch(batch_size)
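
How buffer_size affects shuffle is easier to see in a simulation. The generator below is a simplified pure-Python model of a buffered shuffle, not TensorFlow's implementation: it keeps at most buffer_size pending elements, emits a random one of them, and refills the freed slot from the input. With buffer_size=1 the order is unchanged, and only a buffer at least as large as the dataset gives a uniform shuffle.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Simplified model of shuffling with a bounded buffer (illustration only)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random pending element and put the new item in its slot.
        i = rng.randrange(buffer_size)
        yield buffer[i]
        buffer[i] = item
    # Input exhausted: drain what is left in random order.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


data = list(range(10))
print(list(buffered_shuffle(data, buffer_size=1)))          # same order as the input
print(list(buffered_shuffle(data, buffer_size=4, seed=0)))  # only locally shuffled
```

In this model the k-th output always comes from the first k + buffer_size inputs, which is why a small buffer on a sorted dataset still produces nearly sorted batches.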

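Iterate over ds_train:

# model.fit expects each dataset element to be a (features, labels) pair,
# so the map function should return such a pair rather than the raw parsed dict
history = model.fit(ds_train,
                    steps_per_epoch=100, # let's just take some steps
                    epochs=1)

# Or create an iterator by hand (TF 1.x graph mode, with sess being an open tf.Session)
iterator = ds_train.make_one_shot_iterator()
features = sess.run(iterator.get_next())
# With eager execution (TF 2.x) the dataset is directly iterable: features = next(iter(ds_train))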
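
Putting the pieces together, the following is a minimal end-to-end sketch for TensorFlow 2.x with eager execution. It is an illustration under stated assumptions rather than the article's original pipeline: the random arrays stand in for real images, the file name and the integer label feature are made up, and the parse function returns an (image, label) pair, which is the element structure model.fit expects.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf

path = os.path.join(tempfile.mkdtemp(), "toy.tfrecord")  # hypothetical file name

# Write 6 made-up 4x4 RGB "images" with integer labels.
rng = np.random.RandomState(0)
images = rng.randint(0, 256, size=(6, 4, 4, 3)).astype(np.uint8)
labels = [0, 1, 0, 1, 0, 1]
with tf.io.TFRecordWriter(path) as writer:
    for img, lab in zip(images, labels):
        example = tf.train.Example(features=tf.train.Features(feature={
            "shape": tf.train.Feature(int64_list=tf.train.Int64List(value=list(img.shape))),
            "data": tf.train.Feature(bytes_list=tf.train.BytesList(value=[img.tobytes()])),
            "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[lab])),
        }))
        writer.write(example.SerializeToString())


def parse_record(serialized):
    spec = {
        "shape": tf.io.FixedLenFeature([3], tf.int64),
        "data": tf.io.FixedLenFeature((), tf.string),
        "label": tf.io.FixedLenFeature((), tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    # Bytes -> flat uint8 tensor -> original shape, then scale to [0, 1].
    image = tf.reshape(tf.io.decode_raw(parsed["data"], tf.uint8), parsed["shape"])
    return tf.cast(image, tf.float32) / 255.0, parsed["label"]


dataset = tf.data.TFRecordDataset(path).map(parse_record).batch(4)
batch_shapes = [(x.shape, y.shape) for x, y in dataset]
print(batch_shapes)
```

Batching works here because every stored image has the same shape. With images of different sizes, the parse function would need to resize them before batch is applied.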