TensorFlow single-machine multi-GPU example: data parallelism

This post shows how to do single-machine multi-GPU data-parallel training in TensorFlow. It strips the official CIFAR-10 classification example down to the essentials, explains variable reuse and device placement, and compares running times for different network sizes.


This post is based on the official CIFAR-10 classification tutorial:
[url]https://www.tensorflow.org/tutorials/deep_cnn/[/url]

Multi-machine multi-GPU references (not verified):
[list]
[*][url]http://blog.youkuaiyun.com/cq361106306/article/details/52929468[/url]
[*][url]http://weibo.com/ttarticle/p/show?id=2309404005132982440427[/url]
[/list]

Only the essential code is kept here, which makes the example better suited to understanding the concepts.

In TensorFlow, variables are reused: a variable is uniquely identified by its full name.
The computation graph is also bound to devices. If a computation needs variable a but a does not live on that device, TensorFlow automatically generates the communication ops that copy a over. Variable placement therefore does not affect program correctness; it only changes the amount of communication overhead.
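
A minimal sketch of these two mechanisms (illustrative only, using the same TF 0.x-era variable API as the example below; not part of the original example):

import tensorflow as tf

# Requesting the same full name returns the same variable:
with tf.variable_scope('shared'):
    a = tf.get_variable('a', shape=[2], initializer=tf.constant_initializer(1.0))
with tf.variable_scope('shared', reuse=True):
    a2 = tf.get_variable('a')  # looked up by its full name 'shared/a'
assert a is a2

# An op pinned to a device can read a variable stored elsewhere;
# TF inserts the copy ops automatically:
with tf.device('/gpu:0'):
    b = a * 2.0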


Test result: for a fully-connected network, communication overhead dominates, so a single GPU is the fastest configuration.
Network size: input 2000×600; layers: 512, 128, 128, 1
Running time, in seconds:
[img]http://dl2.iteye.com/upload/attachment/0122/4125/fbe13a1d-cfd4-3e7d-a430-9c8e29a74f09.png[/img]


# coding=utf-8
'''
Created on Jan 4, 2017
@author: colinliang

A single-machine multi-GPU TensorFlow example.
Based on the TensorFlow example cifar10_multi_gpu_train.py
'''
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
import numpy as np

def _allocate_variable(name, shape, initializer, dtype=tf.float32):
    # Allocate a variable. TensorFlow handles cross-device communication
    # automatically, so the variable may live on either the GPU or the CPU.
    # - Single machine, single GPU: putting everything on the GPU is faster
    #   (no explicit device needed; let TF place it).
    # - Single machine, multiple GPUs: the CPU is slightly faster here;
    #   possibly because my two GPUs are connected via SLI, so GPU-to-GPU
    #   communication is still reasonably fast.
    with tf.device('/cpu:0'):  # force placement in host memory
    # with tf.device(None):    # default: place on the current device
        var = tf.get_variable(name, shape, initializer=initializer, dtype=dtype)
    print('%s: %s' % (var.op.name, var.device))
    return var

# Build the network y = xw + b
def tower(input_tensor, target_tensor, scope, dims=[]):
    for i, d in enumerate(dims):
        with tf.variable_scope('affine%d' % i):  # only determines the variables' full names, not their device
            w = _allocate_variable('w', shape=[input_tensor.get_shape()[1], d], initializer=tf.truncated_normal_initializer(0, 1))
            b = _allocate_variable('b', shape=[], initializer=tf.zeros_initializer)
            input_tensor = tf.matmul(input_tensor, w) + b
            input_tensor = tf.nn.relu(input_tensor)

    with tf.variable_scope('affine_last'):  # only determines the variables' full names, not their device
        # w = _allocate_variable('w', shape=[input_tensor.get_shape()[1], 1], initializer=tf.truncated_normal_initializer(0, 1))
        w = _allocate_variable('w', shape=[input_tensor.get_shape()[1], 1], initializer=tf.constant_initializer(value=1))
        b = _allocate_variable('b', shape=[], initializer=tf.zeros_initializer)

    y = tf.matmul(input_tensor, w) + b
    l = tf.reduce_mean(tf.square(y - target_tensor))
    tf.add_to_collection('losses', l)
    return y, l

# Average the gradients across all towers. For single-machine multi-GPU
# programs, this code is generic.
def average_tower_grads(tower_grads):
    print('towerGrads:')
    idx = 0
    for grads in tower_grads:  # grads is a list of (gradient, variable) tuples
        print('grads---tower_%d' % idx)
        for g_var in grads:
            print(g_var)
            print('\t%s\n\t%s' % (g_var[0].op.name, g_var[1].op.name))
        idx += 1

    if len(tower_grads) == 1:
        return tower_grads[0]
    avgGrad_var_s = []
    for grad_var_s in zip(*tower_grads):  # group the towers' gradients variable by variable
        grads = []
        v = None
        for g, v_ in grad_var_s:
            g = tf.expand_dims(g, 0)  # add a leading "tower" axis
            grads.append(g)
            v = v_
        all_g = tf.concat(0, grads)  # old-style tf.concat(axis, values) argument order
        avg_g = tf.reduce_mean(all_g, 0, keep_dims=False)  # mean over the tower axis
        avgGrad_var_s.append((avg_g, v))
    return avgGrad_var_s
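
# Illustration of the shapes involved (hypothetical gradient names): with two
# towers and variables w and b,
#   tower_grads == [[(gw0, w), (gb0, b)],    # tower 0
#                   [(gw1, w), (gb1, b)]]    # tower 1
# zip(*tower_grads) then yields ((gw0, w), (gw1, w)) and ((gb0, b), (gb1, b)),
# so each variable ends up paired with the mean of its per-tower gradients.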

# Scheme 1: each tower gets its own placeholder as input; not tested.
def generate_towers_v1(NUM_GPU=2):
    input_tensors = []
    target_tensors = []

    towerGrads = []
    lr = 1e-3
    opt = tf.train.AdamOptimizer(lr)

    for i in range(NUM_GPU):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % i) as scope:
                input_tensor = tf.placeholder(tf.float32, shape=[None, 1], name='input_%d' % i)
                input_tensors.append(input_tensor)
                target_tensor = tf.placeholder(tf.float32, shape=[None, 1], name='target_%d' % i)
                target_tensors.append(target_tensor)
                y, loss = tower(input_tensor=input_tensor, target_tensor=target_tensor, scope=scope)
                # Reuse variables for the next tower.
                tf.get_variable_scope().reuse_variables()
                grads = opt.compute_gradients(loss)
                towerGrads.append(grads)
    avgGrad_var_s = average_tower_grads(towerGrads)
    apply_gradient_op = opt.apply_gradients(avgGrad_var_s, global_step=None)
    loss = tf.Print(loss, data=tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES))
    return input_tensors, target_tensors, y, loss, apply_gradient_op

# Scheme 2: one pair of placeholders, split into NUM_GPU slices that are fed
# to the corresponding towers.
def generate_towers_v2(NUM_GPU=2, dim_in=1, dims=None, batch_size=None):
    if dims is None: dims = []

    input_tensor = tf.placeholder(tf.float32, shape=[batch_size, dim_in], name='input')
    target_tensor = tf.placeholder(tf.float32, shape=[batch_size, dim_in], name='target')
    input_tensors = tf.split(0, NUM_GPU, input_tensor)  # batch_size must be divisible by NUM_GPU
    target_tensors = tf.split(0, NUM_GPU, target_tensor)

    towerGrads = []
    lr = 1e-2
    opt = tf.train.AdamOptimizer(lr)  # unlike GradientDescentOptimizer, Adam allocates some extra intermediate variables
    opt = tf.train.GradientDescentOptimizer(lr)  # overrides the Adam optimizer above; plain SGD is used
    for i in range(NUM_GPU):
        with tf.device('/gpu:%d' % i):
            with tf.name_scope('tower_%d' % i) as scope:
                input_sub = input_tensors[i]
                print("device:%s" % input_sub.device)
                target_sub = target_tensors[i]
                y, loss = tower(input_tensor=input_sub, target_tensor=target_sub, scope=scope, dims=dims)
                # Reuse variables for the next tower.
                tf.get_variable_scope().reuse_variables()
                grads = opt.compute_gradients(loss)
                towerGrads.append(grads)
    avgGrad_var_s = average_tower_grads(towerGrads)
    loss = tf.Print(loss, data=tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES))

    apply_gradient_op = opt.apply_gradients(avgGrad_var_s, global_step=None)

    print('ALL variables:')
    for v in tf.all_variables():
        print('\t%s' % v.op.name)

    return input_tensor, target_tensor, y, loss, apply_gradient_op

if __name__ == '__main__':
    sess = tf.Session()
    # With only two GPUs, setting NUM_GPU=3 fails with:
    #   Could not satisfy explicit device specification '/device:GPU:2'
    NUM_GPU = 1
    dim_in = 600  # dimension of the input x
    dims = [512, 128, 128]  # hidden layer sizes; [] means the plain linear map y=xw+b,
                            # otherwise a multi-layer fully-connected network
    batch_size = 2000

    input_tensor, target_tensor, y, loss, apply_gradient_op = generate_towers_v2(NUM_GPU=NUM_GPU, dim_in=dim_in, dims=dims)
    sess.run(tf.initialize_all_variables())

    inputs = np.random.rand(batch_size, dim_in)
    targets = inputs * 2 + 1
    feed_dict = {input_tensor: inputs, target_tensor: targets}

    import time
    tstart = time.time()
    for i in range(10000):
        # _, l = sess.run([apply_gradient_op, loss], feed_dict=feed_dict)  # will print w, b
        # print(l)
        sess.run([apply_gradient_op], feed_dict=feed_dict)  # do not print w, b
    telapse = time.time() - tstart
    print(u'%d GPU(s) took: %.2fs' % (NUM_GPU, telapse))


Sample output:
[quote]affine0/w: /device:CPU:0
affine0/b: /device:CPU:0
affine1/w: /device:CPU:0
affine1/b: /device:CPU:0
affine2/w: /device:CPU:0
affine2/b: /device:CPU:0
affine_last/w: /device:CPU:0
affine_last/b: /device:CPU:0
towerGrads:
grads---tower_0
(<tf.Tensor 'tower_0/gradients/tower_0/MatMul_grad/tuple/control_dependency_1:0' shape=(600, 512) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7f8b6c7144d0>)
tower_0/gradients/tower_0/MatMul_grad/tuple/control_dependency_1
affine0/w
(<tf.Tensor 'tower_0/gradients/tower_0/add_grad/tuple/control_dependency_1:0' shape=() dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7f8b6c7140d0>)
tower_0/gradients/tower_0/add_grad/tuple/control_dependency_1
affine0/b
(<tf.Tensor 'tower_0/gradients/tower_0/MatMul_1_grad/tuple/control_dependency_1:0' shape=(512, 128) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7f8b6c7146d0>)
tower_0/gradients/tower_0/MatMul_1_grad/tuple/control_dependency_1
affine1/w
(<tf.Tensor 'tower_0/gradients/tower_0/add_1_grad/tuple/control_dependency_1:0' shape=() dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7f8b6c6cb850>)
tower_0/gradients/tower_0/add_1_grad/tuple/control_dependency_1
affine1/b
(<tf.Tensor 'tower_0/gradients/tower_0/MatMul_2_grad/tuple/control_dependency_1:0' shape=(128, 128) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7f8b6c6cb750>)
tower_0/gradients/tower_0/MatMul_2_grad/tuple/control_dependency_1
affine2/w
(<tf.Tensor 'tower_0/gradients/tower_0/add_2_grad/tuple/control_dependency_1:0' shape=() dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7f8b6c6f48d0>)
tower_0/gradients/tower_0/add_2_grad/tuple/control_dependency_1
affine2/b
(<tf.Tensor 'tower_0/gradients/tower_0/MatMul_3_grad/tuple/control_dependency_1:0' shape=(128, 1) dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7f8b6c6f47d0>)
tower_0/gradients/tower_0/MatMul_3_grad/tuple/control_dependency_1
affine_last/w
(<tf.Tensor 'tower_0/gradients/tower_0/add_3_grad/tuple/control_dependency_1:0' shape=() dtype=float32>, <tensorflow.python.ops.variables.Variable object at 0x7f8b6c69f950>)
tower_0/gradients/tower_0/add_3_grad/tuple/control_dependency_1
affine_last/b
ALL variables:
affine0/w
affine0/b
affine1/w
affine1/b
affine2/w
affine2/b
affine_last/w
affine_last/b
[/quote]
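
To double-check where each op actually runs, device placement can also be logged when the session is created. A minimal sketch (log_device_placement is a standard ConfigProto field; it is not used in the example above):

import tensorflow as tf

# Prints the device assignment of every op the first time the graph runs.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))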