[AI Notes] Part 17: TensorFlow 2.0 Distributed Training on a Single Machine with Multiple GPUs

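This article shows how to run distributed training of a model across multiple GPUs on a single machine. tf.distribute.MirroredStrategy parallelizes the gradient computation, which speeds up training. The article also shows how to manually specify the device an operation runs on, and includes a complete code example.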

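For single-machine multi-GPU training, the key is to create and compile the model inside the scope of tf.distribute.MirroredStrategy. The rest of the Keras code does not need to change.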

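This form of distributed training keeps an identical copy of the model on each GPU. Each copy computes gradients on its own GPU, the gradients are combined and averaged, and the same update is then applied to all weights so that every copy stays in sync.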
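The averaging step described above can be illustrated without TensorFlow. The following is only a toy sketch under my own assumptions: a one-parameter model y = w * x with a mean-squared-error loss, two simulated "GPUs", and the hypothetical helper names grad_mse and sync_step. It is not how MirroredStrategy is implemented internally; it only shows why the copies never drift apart.

```python
# Toy simulation of synchronous data-parallel training (no TensorFlow needed).
# Model: y = w * x, loss: mean squared error. All names here are illustrative.

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w on one data shard."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def sync_step(replica_weights, shards, lr):
    """One synchronous step: per-replica gradients, average, identical update."""
    grads = [grad_mse(w, xs, ys) for w, (xs, ys) in zip(replica_weights, shards)]
    mean_grad = sum(grads) / len(grads)  # stands in for the cross-GPU reduction
    return [w - lr * mean_grad for w in replica_weights]

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]                      # generated with w = 2
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]  # one shard per simulated GPU

weights = [0.0, 0.0]  # both copies start from the same weight
for _ in range(50):
    weights = sync_step(weights, shards, lr=0.05)

print(weights)
```

Because every copy starts from the same weight and receives the same averaged gradient, the copies remain identical after each step. With equally sized shards, the averaged gradient also equals the gradient over the whole batch.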

You can also manually specify which device each operation runs on when building the computation, as follows:

import tensorflow as tf

# Log the device on which each operation runs
tf.debugging.set_log_device_placement(True)

# Run on the CPU
with tf.device('/CPU:0'):
    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Run on the first GPU
with tf.device('/GPU:0'):
    c1 = tf.matmul(a, b)
    print(c1)
# Run on the second GPU (requires at least two GPUs)
with tf.device('/GPU:1'):
    c2 = tf.matmul(a, b)
    print(c2)
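On a machine with fewer GPUs than the code expects, a hard-coded device string may fail. One possible workaround, sketched below, is to look up the logical devices first and fall back to the CPU. The helper pick_device is a hypothetical name of my own, not a TensorFlow API; the sketch uses tf.config.experimental.list_logical_devices, the same call used elsewhere in this article.

```python
import tensorflow as tf

def pick_device(index):
    """Return the name of logical GPU `index`, or '/CPU:0' if it does not exist.

    Hypothetical helper for illustration only.
    """
    gpus = tf.config.experimental.list_logical_devices('GPU')
    return gpus[index].name if index < len(gpus) else '/CPU:0'

a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# Use the second GPU when present, otherwise run on the CPU
with tf.device(pick_device(1)):
    c = tf.matmul(a, b)

print(c.device)  # the device the result tensor actually lives on
print(c)
```

The same script then runs unchanged on a laptop without a GPU and on a multi-GPU workstation, and the printed device name shows where the multiplication actually ran.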

The complete single-machine multi-GPU code follows. It uses two virtual GPUs for demonstration; virtual GPUs do not bring any performance gain:

import tensorflow as tf

# Log device placement for each operation (uncomment to enable)
# tf.debugging.set_log_device_placement(True)

# List all physical GPUs
gpus = tf.config.experimental.list_physical_devices('GPU')
print(len(gpus))
# Make the GPU visible. This machine has only one GPU, so gpus[0] is used;
# normally one physical GPU corresponds to one logical GPU.
tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
# Split the physical GPU into logical partitions: gpus[0] becomes 2 logical GPUs
# with a memory limit of 1024 MB each, so usage per logical GPU must stay below 1024 MB.
tf.config.experimental.set_virtual_device_configuration(
    gpus[0],
    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024),
     tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
# List all logical GPUs
logical_gpus = tf.config.experimental.list_logical_devices('GPU')
print(len(logical_gpus))

def main():
    mnist = tf.keras.datasets.mnist

    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0
    x_train, x_test = tf.expand_dims(x_train, axis=-1), tf.expand_dims(x_test, axis=-1)
    y_train, y_test = tf.one_hot(y_train, 10), tf.one_hot(y_test, 10)
    print('x_train, y_train', x_train.shape, y_train.shape)
    print('x_test, y_test', x_test.shape, y_test.shape)

    # Multi-GPU parallel strategy
    strategy = tf.distribute.MirroredStrategy()
    print('Number of devices: {}'.format(strategy.num_replicas_in_sync))
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
            tf.keras.layers.MaxPooling2D(),
            tf.keras.layers.Flatten(),
            tf.keras.layers.Dense(64, activation='relu'),
            tf.keras.layers.Dense(10)
        ])

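        # The labels are one-hot encoded above, so use CategoricalCrossentropy
        # (SparseCategoricalCrossentropy expects integer class indices).
        model.compile(loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                      optimizer=tf.keras.optimizers.Adam(),
                      metrics=['accuracy'])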

    model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))


if __name__ == '__main__':
    main()
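One detail worth keeping in mind with the code above: with Keras fit under MirroredStrategy, the batch size is the global batch, and each batch is divided among the replicas. A common practice is therefore to scale the global batch with the number of replicas. The sketch below illustrates this under my own assumptions: PER_REPLICA_BATCH is a hypothetical value, the data is random rather than MNIST so that no download is needed, and the model is reduced to a single Dense layer to keep it short.

```python
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# Hypothetical per-GPU batch size; the global batch grows with the replica count.
PER_REPLICA_BATCH = 64
global_batch = PER_REPLICA_BATCH * strategy.num_replicas_in_sync

# Small random stand-in for MNIST so the sketch runs without a download.
x = np.random.rand(256, 28, 28, 1).astype('float32')
y = np.random.randint(0, 10, size=(256,)).astype('int64')
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(global_batch)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(10)
    ])
    # Integer labels are used here, so the sparse loss is the matching choice.
    model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  optimizer=tf.keras.optimizers.Adam())

history = model.fit(dataset, epochs=1, verbose=0)
print(global_batch, history.history['loss'])
```

With two logical GPUs, global_batch would be 128 and each replica would process 64 samples per step. On a machine without a GPU the strategy falls back to a single replica and the numbers coincide.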