19. Training and Deploying TensorFlow Models at Scale: TensorFlow Serving, REST and gRPC Requests, Docker, Google API Client Library, GPU
Training Models Across Multiple Devices
There are two main approaches to training a single model across multiple devices: model parallelism, where the model is split across the devices, and data parallelism, where the model is replicated across every device and each replica is trained on a subset of the data. Let's look at these two options closely before we train a model on multiple GPUs.
Model Parallelism

Figure 19-15. Splitting a fully connected neural network
So far we have trained each neural network on a single device. What if we want to train a single neural network across multiple devices? This requires chopping the model into separate chunks and running each chunk on a different device. Unfortunately, such model parallelism turns out to be pretty tricky, and it really depends on the architecture of your neural network. For fully connected networks, there is generally not much to be gained from this approach (see Figure 19-15). Intuitively, it may seem that an easy way to split the model is to place each layer on a different device, but this does not work because each layer needs to wait for the output of the previous layer before it can do anything. So perhaps you can slice it vertically, for example with the left half of each layer on one device and the right half on another device? This is slightly better, since both halves of each layer can indeed work in parallel, but the problem is that each half of the next layer requires the output of both halves, so there will be a lot of cross-device communication (represented by the dashed arrows in the figure). This is likely to completely cancel out the benefit of the parallel computation, since cross-device communication is slow (especially when the devices are located on different machines).
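To make this concrete, here is a minimal sketch (not code from this chapter) of naive model parallelism in Keras: each chunk of a small fully connected network is pinned to a different device with tf.device(). The device strings "/gpu:0" and "/gpu:1" are assumptions; substitute whatever devices your machine actually has.

import tensorflow as tf
from tensorflow import keras

inputs = keras.Input(shape=[28 * 28])
with tf.device("/gpu:0"):                                  # first chunk on GPU 0
    hidden1 = keras.layers.Dense(256, activation="relu")(inputs)
with tf.device("/gpu:1"):                                  # second chunk on GPU 1,
    hidden2 = keras.layers.Dense(128, activation="relu")(hidden1)  # but it must wait for hidden1
    outputs = keras.layers.Dense(10, activation="softmax")(hidden2)
model = keras.Model(inputs=[inputs], outputs=[outputs])

Since each chunk still needs the previous chunk's output, the two devices mostly take turns instead of working in parallel, which is exactly the problem described above.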

Figure 19-16. Splitting a partially connected neural network
Some neural network architectures, such as convolutional neural networks (see Cp14, Deep Computer Vision Using Convolutional Neural Networks), contain layers that are only partially connected to the lower layers, so it is much easier to distribute chunks across devices in an efficient way (Figure 19-16).
Figure 19-17. Splitting a deep recurrent neural network
Deep recurrent neural networks (see Cp15 on RNNs) can be split a bit more efficiently across multiple GPUs. If you split the network horizontally by placing each layer on a different device, and
- you feed the network with an input sequence to process,
- then at the first time step only one device will be active (working on the sequence’s first value),
- at the second step two will be active (the second layer will be handling the output of the first layer for the first value, while the first layer will be handling the second value),
- and by the time the signal propagates to the output layer, all devices will be active simultaneously (Figure 19-17).
- There is still a lot of cross-device communication going on, but since each cell may be fairly complex, the benefit of running multiple cells in parallel may (in theory) outweigh the communication penalty. However, in practice a regular stack of LSTM layers running on a single GPU actually runs much faster.
In short, model parallelism may speed up running or training some types of neural networks, but not all, and it requires special care and tuning, such as making sure that devices that need to communicate the most run on the same machine.(If you are interested in going further with model parallelism, check out Mesh TensorFlow.) Let’s look at a much simpler and generally more efficient option: data parallelism.
Data parallelism using the mirrored strategy
Arguably the simplest approach is to completely mirror all the model parameters across all the GPUs and always apply the exact same parameter updates on every GPU. This way, all replicas always remain perfectly identical. This is called the mirrored strategy, and it turns out to be quite efficient, especially when using a single machine (see Figure 19-18).
Figure 19-18. Data parallelism using the mirrored strategy
The tricky part when using this approach is to efficiently compute the mean of all the gradients from all the GPUs and distribute the result across all the GPUs. This can be done using an AllReduce algorithm, a class of algorithms where multiple nodes collaborate to efficiently perform a reduce operation (such as computing the mean, sum, and max), while ensuring that all nodes obtain the same final result. Fortunately, there are off-the-shelf implementations of such algorithms, as we will see.
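As a toy illustration of the AllReduce semantics (just a sketch of what the operation guarantees, not of an actual ring or hierarchical implementation such as NCCL's):

import numpy as np

# each "node" (e.g., GPU) starts with its own gradient vector
node_gradients = [np.array([1., 2.]), np.array([3., 4.]), np.array([5., 6.])]

# AllReduce = perform a reduction (here, the mean) and make sure that
# every node ends up with the same reduced result
reduced = np.mean(node_gradients, axis=0)                        # the "reduce" part
results_on_each_node = [reduced.copy() for _ in node_gradients]  # the "all" part

Real implementations compute the same result while spreading the communication evenly across the nodes (e.g., ring AllReduce) instead of funneling everything through a single node.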
Data parallelism with centralized parameters
Another approach is to store the model parameters outside of the GPU devices performing the computations (called workers), for example on the CPU (see Figure 19-19). In a distributed setup, you may place all the parameters on one or more CPU-only servers called parameter servers, whose only role is to host and update the parameters.
Figure 19-19. Data parallelism with centralized parameters
Whereas the mirrored strategy imposes synchronous weight updates across all GPUs, this centralized approach allows either synchronous or asynchronous updates. Let’s see the pros and cons of both options.
Synchronous updates. With synchronous updates, the aggregator waits until all gradients are available before it computes the average gradients and passes them to the optimizer, which will update the model parameters. Once a replica has finished computing its gradients, it must wait for the parameters to be updated before it can proceed to the next mini-batch. The downside is that some devices may be slower than others, so all other devices will have to wait for them at every step. Moreover, the parameters will be copied to every device almost at the same time (immediately after the gradients are applied), which may saturate the parameter servers’ bandwidth.
To reduce the waiting time at each step, you could ignore the gradients from the slowest few replicas (typically ~10%). For example, you could run 20 replicas, but only aggregate the gradients from the fastest 18 replicas at each step, and just ignore the gradients from the last 2. As soon as the parameters are updated, the first 18 replicas can start working again immediately, without having to wait for the 2 slowest replicas. This setup is generally described as having 18 replicas plus 2 spare replicas.(This name is slightly confusing because it sounds like some replicas are special, doing nothing. In reality, all replicas are equivalent: they all work hard to be among the fastest at each training step, and the losers vary at every step (unless some devices are really slower than others). However, it does mean that if a server crashes, training will continue just fine.)
Asynchronous updates. With asynchronous updates, whenever a replica has finished computing the gradients, it immediately uses them to update the model parameters. There is no aggregation (it removes the “mean” step in Figure 19-19) and no synchronization. Replicas work independently of the other replicas. Since there is no waiting for the other replicas, this approach runs more training steps per minute. Moreover, although the parameters still need to be copied to every device at every step, this happens at different times for each replica, so the risk of bandwidth saturation is reduced.
Data parallelism with asynchronous updates is an attractive choice because of its simplicity, the absence of synchronization delay, and a better use of the bandwidth. However, although it works reasonably well in practice, it is almost surprising that it works at all! Indeed, by the time a replica has finished computing the gradients based on some parameter values, these parameters will have been updated several times by other replicas (on average N – 1 times, if there are N replicas), and there is no guarantee that the computed gradients will still be pointing in the right direction (see Figure 19-20). When gradients are severely out of date, they are called stale gradients: they can slow down convergence, introducing noise and wobble effects (the learning curve may contain temporary oscillations), or they can even make the training algorithm diverge.
Figure 19-20. Stale gradients when using asynchronous updates
There are a few ways you can reduce the effect of stale gradients:
- • Reduce the learning rate.
- • Drop stale gradients or scale them down.
- • Adjust the mini-batch size.
- • Start the first few epochs using just one replica (this is called the warmup phase). Stale gradients tend to be more damaging at the beginning of training, when gradients are typically large and the parameters have not settled into a valley of the cost function yet, so different replicas may push the parameters in quite different directions.
A paper published by the Google Brain team in 2016(Jianmin Chen et al., “Revisiting Distributed Synchronous SGD,” arXiv preprint arXiv:1604.00981 (2016).) benchmarked various approaches and found that using synchronous updates with a few spare replicas was more efficient than using asynchronous updates, not only converging faster but also producing a better model. However, this is still an active area of research, so you should not rule out asynchronous updates just yet.
Bandwidth saturation
Whether you use synchronous or asynchronous updates, data parallelism with centralized parameters still requires communicating the model parameters from the parameter servers(whose only role is to host and update the parameters) to every (model) replica at the beginning of each training step, and the gradients in the other direction at the end of each training step. Similarly, when using the mirrored strategy, the gradients produced by each GPU will need to be shared with every other GPU. Unfortunately, there always comes a point where adding an extra GPU will not improve performance at all because the time spent moving the data into and out of GPU RAM (and across the network in a distributed setup) will outweigh the speedup obtained by splitting the computation load. At that point, adding more GPUs will just worsen the bandwidth saturation and actually slow down training.
For some models, typically relatively small and trained on a very large training set, you are often better off training the model on a single machine with a single powerful GPU with a large memory bandwidth.
Saturation is more severe for large dense models, since they have a lot of parameters and gradients to transfer. It is less severe for small models (but the parallelization gain is limited) and for large sparse models, where the gradients are typically mostly zeros and so can be communicated efficiently. Jeff Dean, initiator and lead of the Google Brain project, reported typical speedups of 25–40× when distributing computations across 50 GPUs for dense models, and a 300× speedup for sparser models trained across 500 GPUs. As you can see, sparse models really do scale better. Here are a few concrete examples:
- • Neural machine translation: 6× speedup on 8 GPUs
- • Inception/ImageNet: 32× speedup on 50 GPUs
- • RankBrain: 300× speedup on 500 GPUs
Beyond a few dozen GPUs for a dense model, or a few hundred GPUs for a sparse model, saturation kicks in and performance degrades. There is plenty of research going on to solve this problem (exploring peer-to-peer architectures rather than centralized parameter servers, using lossy model compression, optimizing when and what the replicas need to communicate, and so on), so there will likely be a lot of progress in parallelizing neural networks in the next few years.
In the meantime, to reduce the saturation problem, you probably want to use a few powerful GPUs rather than plenty of weak GPUs, and you should also group your GPUs on a few well-interconnected servers. You can also try dropping the float precision from 32 bits (tf.float32) to 16 bits (tf.bfloat16). This will cut in half the amount of data to transfer, often without much impact on the convergence rate or the model's performance. Lastly, if you are using centralized parameters, you can shard (split) the parameters across multiple parameter servers: adding more parameter servers will reduce the network load on each server and limit the risk of bandwidth saturation.
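For example, here is a minimal sketch using the Keras mixed precision API (available in recent TF 2.x releases; the exact API has moved across versions). "mixed_bfloat16" is mainly aimed at TPUs, while "mixed_float16" is the usual choice on GPUs:

import tensorflow as tf

# keep the variables in float32, but run most computations (and transfer
# activations/gradients) in 16-bit floats, roughly halving the data to move
tf.keras.mixed_precision.set_global_policy("mixed_float16")
# ...then build and compile the model as usual; new layers pick up the policy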
OK, now let’s train a model across multiple GPUs!
Training at Scale Using the Distribution Strategies API
Many models can be trained quite well on a single GPU, or even on a CPU. But
- If training is too slow, you can try distributing it across multiple GPUs on the same machine.
- If that’s still too slow, try using more powerful GPUs, or add more GPUs to the machine.
- If your model performs heavy computations (such as large matrix multiplications), then it will run much faster on powerful GPUs, and you could even try to use TPUs on Google Cloud AI Platform, which will usually run even faster for such models.
- But if you can’t fit any more GPUs on the same machine, and if TPUs aren’t for you (e.g., perhaps your model doesn’t benefit much from TPUs, or perhaps you want to use your own hardware infrastructure), then you can try training it across several servers, each with multiple GPUs (if this is still not enough, as a last resort you can try adding some model parallelism, but this requires a lot more effort). In this section we will see how to train models at scale, starting with multiple GPUs on the same machine (or TPUs) and then moving on to multiple GPUs across multiple machines.
without splitting a GPU into two virtual GPUs
from tensorflow import keras
import numpy as np
import tensorflow as tf

# split a GPU into two or more virtual GPUs
# physical_gpus = tf.config.experimental.list_physical_devices('GPU')
# tf.config.experimental.set_virtual_device_configuration(
#     physical_gpus[0],
#     [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=5120),
#      tf.config.experimental.VirtualDeviceConfiguration(memory_limit=5120)]
# )

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.mnist.load_data()
X_train_full = X_train_full[..., np.newaxis].astype(np.float32) / 255.
X_test = X_test[..., np.newaxis].astype(np.float32) / 255.
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_new = X_test[:3]

keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

def create_model():
    return keras.models.Sequential([
        keras.layers.Conv2D(filters=64, kernel_size=7, activation="relu",
                            padding="same", input_shape=[28, 28, 1]),  # (None, 28, 28, 64)
        keras.layers.MaxPooling2D(pool_size=2),                        # (None, 14, 14, 64)
        keras.layers.Conv2D(filters=128, kernel_size=3, activation="relu",
                            padding="same"),                           # (None, 14, 14, 128)
        keras.layers.Conv2D(filters=128, kernel_size=3, activation="relu",
                            padding="same"),
        keras.layers.MaxPooling2D(pool_size=2),                        # (None, 7, 7, 128)
        keras.layers.Flatten(),                                        # (None, 6272)
        keras.layers.Dense(units=64, activation="relu"),               # (None, 64)
        keras.layers.Dropout(0.5),
        keras.layers.Dense(units=10, activation="softmax"),            # (None, 10)
    ])

# model = create_model()
# model.build()
# model.summary()

batch_size = 100
model = create_model()
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-2),
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=10,
          validation_data=(X_valid, y_valid), batch_size=batch_size)

Luckily, TensorFlow comes with a very simple API that takes care of all the complexity for you: the Distribution Strategies API. To train a Keras model across all available GPUs (on a single machine, for now) using data parallelism with the mirrored strategy, create a MirroredStrategy object, call its scope() method to get a distribution context, and wrap the creation and compilation of your model inside that context. Then call the model’s fit() method normally:
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

distribution = tf.distribute.MirroredStrategy()

with distribution.scope():
    model = create_model()
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=keras.optimizers.SGD(learning_rate=1e-2),
                  metrics=["accuracy"])

Note that when using virtual GPUs, TensorFlow logs a warning: "NCCL is not supported when using virtual GPUs, falling back to reduction to one device." (The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and networking. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, as well as point-to-point send and receive, that are optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over NVIDIA Mellanox networks across nodes.)
with splitting a GPU into two virtual GPUs

By default, the MirroredStrategy uses all available GPUs. Now call the fit() method as usual:
batch_size = 100  # must be divisible by the number of replicas
model.fit(X_train, y_train, epochs=10,
          validation_data=(X_valid, y_valid), batch_size=batch_size)
Under the hood, tf.keras is distribution-aware, so in this MirroredStrategy context it knows that it must replicate all variables and operations across all available GPU devices. Note that the fit() method will automatically split each training batch across all the replicas, so it’s important that the batch size be divisible by the number of replicas. And that’s all! Training will generally be significantly faster than using a single device, and the code change was really minimal.
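If you prefer not to hardcode the batch size, you can derive it from the strategy itself; a quick sketch, reusing the distribution object created above:

# make the global batch size a multiple of the number of replicas
batch_size_per_replica = 50
global_batch_size = batch_size_per_replica * distribution.num_replicas_in_sync
# e.g., num_replicas_in_sync is 2 if the GPU was split into two virtual GPUs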

with splitting a GPU into two virtual GPUs
np.round(model.predict(X_new), 2)
Once you have finished training your model, you can use it to make predictions efficiently: call the predict() method, and it will automatically split the batch across all replicas, making predictions in parallel (again, the batch size must be divisible by the number of replicas). If you call the model’s save() method, it will be saved as a regular model, not as a mirrored model with multiple replicas. So when you load it, it will run like a regular model, on a single device (by default GPU 0, or the CPU if there are no GPUs). If you want to load a model and run it on all available devices, you must call keras.models.load_model() within a distribution context:
model.save("my_mnist_model.h5")
with distribution.scope():
    mirrored_model = keras.models.load_model("my_mnist_model.h5")
If you only want to use a subset of all the available GPU devices, you can pass the list of devices to the MirroredStrategy's constructor:
distribution = tf.distribute.MirroredStrategy(["/gpu:0", "/gpu:1"])
By default, the MirroredStrategy class uses the NVIDIA Collective Communications Library (NCCL) for the AllReduce mean operation, but you can change it by setting the cross_device_ops argument to an instance of the tf.distribute.HierarchicalCopyAllReduce class or an instance of the tf.distribute.ReductionToOneDevice class. The default NCCL option is based on the tf.distribute.NcclAllReduce class, which is usually faster, but this depends on the number and types of GPUs, so you may want to give the alternatives a try. (For more details on AllReduce algorithms, read this great post by Yuichiro Ueno, and this page on scaling with NCCL.)
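For example, a quick sketch of switching the AllReduce implementation (handy when NCCL is not available, e.g., with virtual GPUs):

distribution = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
# or:
# distribution = tf.distribute.MirroredStrategy(
#     cross_device_ops=tf.distribute.ReductionToOneDevice())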
If you want to try using data parallelism with centralized parameters, replace the MirroredStrategy with the CentralStorageStrategy:
distribution = tf.distribute.experimental.CentralStorageStrategy()
You can optionally set the compute_devices argument to specify the list of devices you want to use as workers (by default it will use all available GPUs), and you can optionally set the parameter_device argument to specify the device you want to store the parameters on (by default it will use the CPU, or the GPU if there is just one).
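For instance, a sketch with both arguments spelled out (the device names are placeholders for whatever your machine actually has):

distribution = tf.distribute.experimental.CentralStorageStrategy(
    compute_devices=["/gpu:0", "/gpu:1"],  # the workers
    parameter_device="/cpu:0")             # where the parameters are stored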
from tensorflow import keras
import numpy as np
import tensorflow as tf

physical_gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_virtual_device_configuration(
    physical_gpus[0],
    [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=5120),
     tf.config.experimental.VirtualDeviceConfiguration(memory_limit=5120)]
)

(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.mnist.load_data()
X_train_full = X_train_full[..., np.newaxis].astype(np.float32) / 255.
X_test = X_test[..., np.newaxis].astype(np.float32) / 255.
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_new = X_test[:3]

def create_model():
    return keras.models.Sequential([
        keras.layers.Conv2D(filters=64, kernel_size=7, activation="relu",
                            padding="same", input_shape=[28, 28, 1]),  # (None, 28, 28, 64)
        keras.layers.MaxPooling2D(pool_size=2),                        # (None, 14, 14, 64)
        keras.layers.Conv2D(filters=128, kernel_size=3, activation="relu",
                            padding="same"),                           # (None, 14, 14, 128)
        keras.layers.Conv2D(filters=128, kernel_size=3, activation="relu",
                            padding="same"),
        keras.layers.MaxPooling2D(pool_size=2),                        # (None, 7, 7, 128)
        keras.layers.Flatten(),                                        # (None, 6272)
        keras.layers.Dense(units=64, activation="relu"),               # (None, 64)
        keras.layers.Dropout(0.5),
        keras.layers.Dense(units=10, activation="softmax"),            # (None, 10)
    ])
# Why tf.distribute.ReduceOp.SUM?
(For background on the cross-entropy loss, see the cp15 post, Classifying Images with Deep Convolutional NN.)
Equation 4-22. Cross entropy cost function (average cross-entropy error):
$J(\boldsymbol{\Theta}) = -\dfrac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log\left(\hat{p}_k^{(i)}\right)$
where $y_k^{(i)}$ is equal to 1 if the target class for the ith instance is k; otherwise, it is equal to 0.
First, batch_size = 100 (the global batch size, m), and there are 2 replicas (one per virtual GPU), so each replica processes 50 instances per step. tf.keras.losses.sparse_categorical_crossentropy returns one loss value per instance, and K.sum() accumulates them on each replica. The average loss over both replicas is therefore (K.sum(losses_in_A)/50 + K.sum(losses_in_B)/50)/2 = K.sum(losses_in_A)/100 + K.sum(losses_in_B)/100, where 100 is the global batch size. So each replica divides its summed loss by the global batch size, and the per-replica losses then only need to be added together (ReduceOp.SUM) to recover the mean loss over the whole batch.
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

K = keras.backend

batch_size = 100  # global_batch_size
distribution = tf.distribute.MirroredStrategy()

with distribution.scope():
    model = create_model()
    optimizer = keras.optimizers.SGD()

with distribution.scope():
    dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train)).repeat().batch(batch_size)  # global_batch_size
    # Data from the given dataset will be distributed evenly across all the compute
    # replicas. We will assume that the input dataset is batched by the global batch
    # size. With this assumption, we will make a best effort to divide each batch
    # across all the replicas (one or more workers).
    # If this effort fails, an error will be thrown, and the user should instead use
    # `make_input_fn_iterator` which provides more control to the user, and does not
    # try to divide a batch across replicas.
    input_iterator = distribution.make_dataset_iterator(dataset)

@tf.function
def train_step():
    def step_fn(inputs):
        X, y = inputs
        with tf.GradientTape() as tape:
            Y_proba = model(X)
            # average cross-entropy error
            loss = K.sum(keras.losses.sparse_categorical_crossentropy(y, Y_proba)) / batch_size
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    per_replica_losses = distribution.experimental_run(step_fn, input_iterator)
    # https://www.tensorflow.org/api_docs/python/tf/distribute/Strategy
    # For example, if you have a global batch size of 8 and 2 replicas, values for
    # examples [0, 1, 2, 3] will be on replica 0 and [4, 5, 6, 7] will be on replica 1.
    # With axis=None, reduce will aggregate only across replicas, returning
    # [0+4, 1+5, 2+6, 3+7]. This is useful when each replica is computing a scalar
    # or some other value that doesn't have a "batch" dimension (like a gradient or loss).
    mean_loss = distribution.reduce(tf.distribute.ReduceOp.SUM,
                                    per_replica_losses, axis=None)
    return mean_loss

n_epochs = 10
with distribution.scope():
    input_iterator.initialize()  # input_iterator.initializer
    for epoch in range(n_epochs):
        print("Epoch {}/{}".format(epoch + 1, n_epochs))
        for iteration in range(len(X_train) // batch_size):
            print("\rLoss: {:.3f}".format(train_step().numpy()), end="")
        print()

Now let’s see how to train a model across a cluster of TensorFlow servers!
Training a Model on a TensorFlow Cluster
A TensorFlow cluster is a group of TensorFlow processes running in parallel, usually on different machines, and talking to each other to complete some work—for example, training or executing a neural network. Each TF process in the cluster is called a task, or a TF server. It has an
- IP address,
- a port, and
- a type (also called its role or its job).
The type can be either "worker", "chief", "ps" (parameter server), or "evaluator":
- • Each worker performs computations, usually on a machine with one or more GPUs.
- • The chief performs computations as well (it is a worker), but it also handles extra work such as writing TensorBoard logs or saving checkpoints. There is a single chief in a cluster. If no chief is specified, then the first worker is the chief (i.e., worker #0).
- • A parameter server only keeps track of the variable values (the model parameters); it is usually hosted on a CPU-only machine.
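To make this concrete, a cluster specification can be written as a dict mapping each job type ("worker", "ps", ...) to the list of task addresses; here is a minimal sketch in which the hostnames and ports are placeholders:

cluster_spec = {
    "worker": [
        "machine-a.example.com:2222",    # /job:worker/task:0 (also the chief, since none is specified)
        "machine-b.example.com:2222"     # /job:worker/task:1
    ],
    "ps": ["machine-a.example.com:2221"] # /job:ps/task:0
}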
