This article walks through TensorFlow's distributed parallel training in practice.
TensorFlow supports several distribution modes. In-graph replication is model parallelism: different parts of the model's computation graph are placed on different machines.
Between-graph replication is data parallelism: every machine runs an identical computation graph but trains on a different batch of data.
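The program in this article uses between-graph replication. For contrast, here is a minimal in-graph replication sketch (not part of the program below): it assumes a cluster whose tf.train.Server processes are already running with a "worker" job of two tasks, and the addresses and tensor shapes are illustrative only.

import tensorflow as tf

# In-graph replication: one client builds a single graph and pins its pieces
# to different machines (assumed two-task "worker" job, servers already running).
with tf.device("/job:worker/task:0"):
  a = tf.truncated_normal([1000, 1000])
  first = tf.matmul(a, a)            # first part of the graph runs on worker 0
with tf.device("/job:worker/task:1"):
  result = tf.matmul(first, first)   # second part runs on worker 1

# Connect to any server in the cluster (address is an assumption for illustration).
with tf.Session("grpc://192.168.0.107:2222") as sess:
  print(sess.run(result))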
Besides the replication mode, training can be asynchronous or synchronous. In asynchronous parallelism each machine computes its gradients independently and pushes its update to the parameter server as soon as it finishes, without waiting for the other machines;
in synchronous parallelism the parameters are updated only after every machine has finished computing its gradients, which are then aggregated and applied in a single step.
Synchronous training usually lowers the loss faster per step and can reach a higher final accuracy,
but because of the bucket effect the cluster is only as fast as its slowest machine, so it works best when all devices run at a similar speed.
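At the code level the main difference between the two modes is how gradients reach the parameter server. The fragment below is only a preview of the optimizer setup used in the full program later in this article; cross_entropy, global_step, num_workers and FLAGS are all defined there.

opt = tf.train.AdamOptimizer(FLAGS.learning_rate)

# Asynchronous (default): every worker applies its own gradients immediately.
train_step = opt.minimize(cross_entropy, global_step=global_step)

# Synchronous: wrap the optimizer so gradients from all workers are
# aggregated before a single update is applied.
opt = tf.train.SyncReplicasOptimizer(opt,
                                     replicas_to_aggregate=num_workers,
                                     total_num_replicas=num_workers,
                                     name="mnist_sync_replicas")
train_step = opt.minimize(cross_entropy, global_step=global_step)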
Below we use TensorFlow to implement a distributed parallel training program with one parameter server and one worker.
# -*- coding: utf-8 -*-
import math
import tempfile
import time
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data
flags = tf.app.flags
flags.DEFINE_string("data_dir","/tmp/mnist_data","storing data")
flags.DEFINE_integer("hidden_units",100,"hidden layer")
flags.DEFINE_integer("train_steps",1000,"step")
flags.DEFINE_integer("batch_size",100,"batch size")
flags.DEFINE_float("learning_rate",0.01,"learning rate")
flags.DEFINE_boolean("sync_replicas",False,"Use the sync_replicas mode")
flags.DEFINE_integer("replicas_to_aggregate",None,"Number of replicas to aggregate before parameter "
"update is applied ,defalut:num_workers")
# None代表worker的数量,即所有worker都完成一个batch的训练后再更新模型参数
flags.DEFINE_string("ps_hosts","192.168.0.107:2222","comma-separated lst of hostname:port pairs")
flags.DEFINE_string("worker_hosts","10.211.55.14:2222","comma-separated lst of hostname:port pairs")
flags.DEFINE_string("job_name",None,"job name:worker or ps")
flags.DEFINE_integer("task_index",None,"Worker task index,should be >=0, task=0 is "
"the master worker task the performs the variable initialization")
FLAGS=flags.FLAGS
IMAGE_PIXELS=28
def main(unused_argv):
  mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)
  if FLAGS.job_name is None or FLAGS.job_name == "":
    raise ValueError("Must specify an explicit job_name")
  if FLAGS.task_index is None or FLAGS.task_index == "":
    raise ValueError("Must specify an explicit task_index")
  print("job name = %s" % FLAGS.job_name)
  print("task index = %d" % FLAGS.task_index)
First we split the host strings and count the workers, then build a cluster object with tf.train.ClusterSpec,
passing in the addresses of the ps and worker hosts.
Next we create this machine's server with tf.train.Server so that it joins the cluster.
If the current node is a parameter server it does nothing further: it calls server.join() and waits for the workers.
  ps_spec = FLAGS.ps_hosts.split(",")
  worker_spec = FLAGS.worker_hosts.split(",")
  num_workers = len(worker_spec)
  cluster = tf.train.ClusterSpec({"ps": ps_spec, "worker": worker_spec})
  server = tf.train.Server(
      cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
  if FLAGS.job_name == "ps":
    server.join()

  is_chief = (FLAGS.task_index == 0)
  worker_device = "/job:worker/task:%d/cpu:0" % FLAGS.task_index
worker_device is the compute resource and ps_device is the resource that stores the model parameters.
tf.train.replica_device_setter deploys the model parameters on the ps server and the training ops on the worker.
Finally we create global_step, the variable that records the global number of training steps.
  with tf.device(
      tf.train.replica_device_setter(
          worker_device=worker_device,
          ps_device="/job:ps/cpu:0",
          cluster=cluster)):
    global_step = tf.Variable(0, name="global_step", trainable=False)

    # Define a simple model: a single-hidden-layer softmax classifier
    hid_w = tf.Variable(
        tf.truncated_normal([IMAGE_PIXELS * IMAGE_PIXELS, FLAGS.hidden_units],
                            stddev=1.0 / IMAGE_PIXELS), name="hid_w")
    hid_b = tf.Variable(tf.zeros([FLAGS.hidden_units]), name="hid_b")
    sm_w = tf.Variable(
        tf.truncated_normal([FLAGS.hidden_units, 10],
                            stddev=1.0 / math.sqrt(FLAGS.hidden_units)), name="sm_w")
    sm_b = tf.Variable(tf.zeros([10]), name="sm_b")
    x = tf.placeholder(tf.float32, [None, IMAGE_PIXELS * IMAGE_PIXELS])
    y_ = tf.placeholder(tf.float32, [None, 10])
    hid_lin = tf.nn.xw_plus_b(x, hid_w, hid_b)
    hid = tf.nn.relu(hid_lin)
    y = tf.nn.softmax(tf.nn.xw_plus_b(hid, sm_w, sm_b))
    cross_entropy = -tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)))

    opt = tf.train.AdamOptimizer(FLAGS.learning_rate)
    if FLAGS.sync_replicas:
      if FLAGS.replicas_to_aggregate is None:
        replicas_to_aggregate = num_workers
      else:
        replicas_to_aggregate = FLAGS.replicas_to_aggregate
      opt = tf.train.SyncReplicasOptimizer(
          opt,
          replicas_to_aggregate=replicas_to_aggregate,
          total_num_replicas=num_workers,
          name="mnist_sync_replicas")
    train_step = opt.minimize(cross_entropy, global_step=global_step)
In synchronous mode, the chief node uses opt.get_chief_queue_runner to create the queue runner
and opt.get_init_tokens_op to create the op that initializes the sync tokens.
    if FLAGS.sync_replicas and is_chief:
      chief_queue_runner = opt.get_chief_queue_runner()
      init_tokens_op = opt.get_init_tokens_op()
Next we create the variable-initialization op init_op and a temporary training directory, and use tf.train.Supervisor
as the monitor of the distributed training, managing how our task takes part in it.
    init_op = tf.global_variables_initializer()
    train_dir = tempfile.mkdtemp()
    sv = tf.train.Supervisor(is_chief=is_chief,
                             logdir=train_dir,
                             init_op=init_op,
                             recovery_wait_secs=1,
                             global_step=global_step)
We then configure the session. allow_soft_placement=True means that when an op cannot be executed on its assigned device,
it is allowed to fall back to another device.
    sess_config = tf.ConfigProto(
        allow_soft_placement=True,
        log_device_placement=False,
        device_filters=["/job:ps",
                        "/job:worker/task:%d" % FLAGS.task_index])
If this is the chief node, it explicitly initializes the session; the other nodes wait for the chief to complete that initialization.
    if is_chief:
      print("Worker %d: Initializing session..." % FLAGS.task_index)
    else:
      print("Worker %d: Waiting for session to be initialized..." % FLAGS.task_index)
    sess = sv.prepare_or_wait_for_session(server.target, config=sess_config)
    print("Worker %d: Session initialization complete." % FLAGS.task_index)
Next, if we are in synchronous mode and this is the chief node, we call sv.start_queue_runners to start the chief queue runner
and run the sync-token initialization op.
    if FLAGS.sync_replicas and is_chief:
      print("Starting chief queue runner and running init_tokens_op")
      sv.start_queue_runners(sess, [chief_queue_runner])
      sess.run(init_tokens_op)

    time_begin = time.time()
    print("Training begins @ %f" % time_begin)
    local_step = 0
    while True:
      batch_xs, batch_ys = mnist.train.next_batch(FLAGS.batch_size)
      train_feed = {x: batch_xs, y_: batch_ys}
      _, step = sess.run([train_step, global_step], feed_dict=train_feed)
      local_step += 1
      now = time.time()
      print("%f: Worker %d: training step %d done (global step: %d)" %
            (now, FLAGS.task_index, local_step, step))
      if step >= FLAGS.train_steps:
        break

    time_end = time.time()
    print("Training ends @ %f" % time_end)
    training_time = time_end - time_begin
    print("Training elapsed time: %f s" % training_time)

    val_feed = {x: mnist.validation.images, y_: mnist.validation.labels}
    val_xent = sess.run(cross_entropy, feed_dict=val_feed)
    print("After %d training step(s), validation cross entropy = %g" %
          (FLAGS.train_steps, val_xent))


if __name__ == "__main__":
  tf.app.run()
Run the code above on the two machines: the first machine runs the first of the following commands (starting the parameter server) and the second machine runs the second (starting the worker).
python distribute.py --job_name=ps --task_index=0
python distribute.py --job_name=worker --task_index=0
To use synchronous mode, add --sync_replicas=True when starting the worker. In asynchronous mode the global step is the sum of the training steps performed by all workers; in synchronous mode it counts how many rounds of parallel training have been completed.
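For example, with the default ps_hosts/worker_hosts flags defined above (adjust the addresses to your own machines), a synchronous run might be launched as follows; only the worker actually consumes the flag, since the ps process simply joins the cluster and waits.
python distribute.py --job_name=ps --task_index=0
python distribute.py --job_name=worker --task_index=0 --sync_replicas=True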