Notes on a Hugging Face blog post

Performance and Scalability: How To Fit a Bigger Model and Train It Faster (huggingface.co)

With 12 NVLink links interconnecting the GPUs, the bandwidth between any two GPUs = the bandwidth of a single NVLink link × 12.

Without communication-heavy schemes such as ZeRO, i.e. training a large model with plain DDP, there is only one communication per batch, so the training-speed difference with vs. without NVLink is only about 20%.

Three kinds of computation:

Transformers architecture includes 3 main groups of operations grouped below by compute-intensity.

  1. Tensor Contractions

    Linear layers and components of Multi-Head Attention all do batched matrix-matrix multiplications. These operations are the most compute-intensive part of training a transformer.

  2. Statistical Normalizations

    Softmax and layer normalization are less compute-intensive than tensor contractions, and involve one or more reduction operations, the result of which is then applied via a map.

  3. Element-wise Operators

    These are the remaining operators: biases, dropout, activations, and residual connections. These are the least compute-intensive operations.
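A toy PyTorch annotation of where those three groups show up inside a transformer block (my own sketch, not from the Hugging Face doc; layer sizes are arbitrary):

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyTransformerFFN(nn.Module):
    """Toy feed-forward block, annotated by compute-intensity group."""
    def __init__(self, hidden=768, ffn=3072, p=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(hidden)   # 2. statistical normalization
        self.fc1 = nn.Linear(hidden, ffn)  # 1. tensor contraction (batched matmul)
        self.fc2 = nn.Linear(ffn, hidden)  # 1. tensor contraction (batched matmul)
        self.drop = nn.Dropout(p)          # 3. element-wise

    def forward(self, x):
        h = self.norm(x)                   # 2. reduction + map
        h = F.gelu(self.fc1(h))            # 1. matmul, then 3. element-wise activation
        h = self.drop(self.fc2(h))         # 1. matmul, then 3. element-wise dropout
        return x + h                       # 3. element-wise residual connection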

Where GPU memory goes:

Model Weights
  • 4 bytes * number of parameters for fp32 training
  • 6 bytes * number of parameters for mixed precision training
Optimizer States
  • 8 bytes * number of parameters for normal AdamW (maintains 2 states)
  • 2 bytes * number of parameters for 8-bit AdamW optimizers like bitsandbytes
  • 4 bytes * number of parameters for optimizers like SGD (maintains only 1 state)
Gradients
  • 4 bytes * number of parameters for either fp32 or mixed precision training
Forward Activations
  • size depends on many factors, the key ones being sequence length, hidden size and batch size.

There are the input and output that are being passed and returned by the forward and the backward functions and the forward activations saved for gradient computation.
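A back-of-the-envelope calculator for the static part of this breakdown (weights + optimizer states + gradients; activations are left out because they depend on sequence length, hidden size and batch size). The per-parameter byte counts simply restate the list above:

def static_train_memory_gb(n_params, mixed_precision=True, optimizer="adamw"):
    # Model weights: 4 bytes/param in fp32, 6 bytes/param in mixed precision
    weights = 6 if mixed_precision else 4
    # Optimizer states: 8 (AdamW), 2 (8-bit AdamW), 4 (SGD with momentum)
    opt = {"adamw": 8, "adamw_8bit": 2, "sgd": 4}[optimizer]
    # Gradients: 4 bytes/param for both fp32 and mixed-precision training
    grads = 4
    return n_params * (weights + opt + grads) / 1024**3

# e.g. a 1.5B-parameter model with mixed precision + AdamW
print(f"{static_train_memory_gb(1.5e9):.1f} GB (+ activations)")  # ~25 GB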

Temporary Memory

Additionally there are all kinds of temporary variables which get released once the calculation is done, but in the moment these could require additional memory and could push to OOM. Therefore when coding it’s crucial to think strategically about such temporary variables and sometimes to explicitly free those as soon as they are no longer needed.
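In PyTorch, "explicitly free" typically means dropping the last reference and, if needed, returning cached blocks to the allocator (a minimal sketch):

import torch

x = torch.randn(8, 4096, 4096, device="cuda")
tmp = x @ x.transpose(-1, -2)   # large temporary result (~512 MB in fp32)
loss_proxy = tmp.mean()

del tmp                         # drop the reference as soon as it is no longer needed
torch.cuda.empty_cache()        # optionally return cached blocks to the driver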

Functionality-specific memory

Then your software could have special memory needs. For example, when generating text using beam search, the software needs to maintain multiple copies of inputs and outputs.
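For example, with the Transformers generate() API, num_beams is what multiplies those copies (illustrative snippet; "gpt2" is just a stand-in model):

from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()

inputs = tok("Memory usage grows with", return_tensors="pt").to("cuda")
# num_beams copies of the sequence state are kept alive during the search
out = model.generate(**inputs, num_beams=4, max_new_tokens=20)
print(tok.decode(out[0]))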

The FP16 mixed-precision training process:

If we look at what’s happening with FP16 training (mixed precision) we have:

  • the model has two copies in memory: one in half-precision for the forward/backward computations and one in full precision - no memory saved here
  • the forward activations saved for gradient computation are in half-precision - memory is saved here
  • the gradients are computed in half-precision but converted to full-precision for the update, no saving there
  • the optimizer states are in full precision as all the updates are done in full-precision

The gain for FP16 training is that in each of those cases, the training with the flag --fp16 is twice as fast, which does require every tensor to have every dimension be a multiple of 8 (examples pad the tensors to a sequence length that is a multiple of 8).

Summary: FP16 with apex or AMP will only give you some memory savings with a reasonably high batch size.

Additionally, under mixed precision when possible, it’s important that the batch size is a multiple of 8 to efficiently use tensor cores.

Note that in some situations the speed-up can be as big as 5x when using mixed precision, e.g. we have observed this while using Megatron-DeepSpeed.
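Outside of the Trainer's --fp16 flag, the same recipe in plain PyTorch looks roughly like this (a sketch using torch.cuda.amp; model, optimizer, loss_fn and dataloader are assumed to exist):

import torch

scaler = torch.cuda.amp.GradScaler()              # scales the loss to avoid fp16 underflow

for inputs, targets in dataloader:                # dataloader, model, optimizer, loss_fn assumed
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():               # forward pass runs ops in fp16 where safe
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()                 # backward on the scaled loss
    scaler.step(optimizer)                        # gradients unscaled, fp32 weights updated
    scaler.update()                               # adjust the scale factor for the next step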

At inference time you can run entirely in fp16 or bf16 (or INT8/INT4), because there is no optimizer updating the weights with gradients anymore; this both speeds things up and saves GPU memory.
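A minimal inference-only sketch (assumes an Ampere-or-newer GPU for bf16; "gpt2" is a placeholder model):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
# No optimizer states or gradients at inference time, so pure bf16 (or fp16) is fine
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16).cuda().eval()

with torch.no_grad():
    ids = tok("Inference in bf16", return_tensors="pt").to("cuda")
    out = model(**ids)   # weights and activations in bf16: ~half the memory of fp32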

Thanks to bf16's fp32-like dynamic range, loss scaling is no longer needed with bf16 mixed precision.

Also be aware that if you pre-trained a model in bf16, it's likely to have overflow issues if someone tries to fine-tune it in fp16 down the road. So once started on the bf16-mode path it's best to remain on it and not switch to fp16.

TF32: 1 + 8 + 10 = 19 bits; NVIDIA supports it on high-end GPUs and maps it onto Tensor Cores; it is transparent to the user: you write code as if using FP32, and the hardware automatically switches to TF32 when appropriate; the claimed throughput is about 3x that of FP32.
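In PyTorch, TF32 use on Ampere+ GPUs is controlled by two global flags (a sketch; the defaults differ between PyTorch versions):

import torch

# Let fp32 matmuls and cuDNN convolutions run on Tensor Cores in TF32
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True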

Batch size: multiples of 8 or 64 make more efficient use of the GPU and Tensor Cores, which speeds things up; the larger the batch size, the smaller the share of time spent on communication and the faster training runs.

DP VS. DDP

DistributedDataParallel (DDP) is typically faster than DataParallel (DP), but it is not always the case:

  • while DP is python threads-based, DDP is multiprocess-based - and as such it has no python threads limitations, such as GIL
  • on the other hand a slow inter-connectivity between the GPU cards could lead to an actual slower outcome with DDP

In DDP each process reads its own data from the start; the only collective step per batch is an AllReduce to sum the gradients, and every process then updates the weights on its own;

In DP a master thread reads the data alone and scatters it to the worker threads; the last layer's outputs are gathered to the master, which computes the loss and broadcasts it back to all workers; all gradients are sent to the master, and after the master updates the weights it sends them to every thread (the master is much busier; the workers are mostly idle).

The only communication DDP performs per batch is sending gradients, whereas DP does 5 different data exchanges per batch.

DP copies data within the process via python threads, whereas DDP copies data via torch.distributed.

 DP is ~10% slower than DDP w/ NVlink, but ~15% faster than DDP w/o NVlink
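A minimal DDP sketch as launched by torchrun (one process per GPU; the script name in the launch command is just an example):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")                  # one process per GPU, started by torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 1024).cuda()
    model = DDP(model, device_ids=[local_rank])      # gradients are AllReduced during backward()

    x = torch.randn(32, 1024).cuda()                 # each rank reads its own data shard
    model(x).sum().backward()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
    # launch: torchrun --nproc_per_node=8 train_ddp.py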

DataLoader

  • DataLoader(pin_memory=True, ...) which ensures that the data gets preloaded into the pinned memory on CPU and typically leads to much faster transfers from CPU to GPU memory.
  • DataLoader(num_workers=4, ...) - spawn several workers to pre-load data faster
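Putting both options together (sketch; dataset is assumed to be a map-style Dataset yielding tensors):

from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,            # any map-style Dataset, assumed to exist
    batch_size=64,
    pin_memory=True,    # page-locked host memory -> faster, async copies to the GPU
    num_workers=4,      # background worker processes pre-load the next batches
)
for batch in loader:
    # non_blocking pairs with pin_memory (assuming the default collate returns a tensor)
    batch = batch.to("cuda", non_blocking=True)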

Mixture of Experts (MoE)

Roughly 4-5x training speedup; inference can also be sped up;

GPU memory usage grows a lot, though!

The idea: replace the FFN layer with multiple expert FFNs, with a gating network in front that routes each token to one of them;
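A toy top-1 routing sketch of that structure (real MoE layers such as Switch Transformer add load-balancing losses and expert-capacity limits, which are omitted here):

import torch
import torch.nn as nn

class TopOneMoE(nn.Module):
    """Replace one FFN with several expert FFNs plus a gating router (top-1)."""
    def __init__(self, hidden=768, ffn=3072, n_experts=4):
        super().__init__()
        self.gate = nn.Linear(hidden, n_experts)          # router: token -> expert scores
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, ffn), nn.GELU(), nn.Linear(ffn, hidden))
            for _ in range(n_experts)
        ])

    def forward(self, x):                                 # x: (tokens, hidden)
        scores = self.gate(x).softmax(dim=-1)
        top_p, top_i = scores.max(dim=-1)                 # pick one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_i == e                             # tokens routed to expert e
            if mask.any():
                out[mask] = top_p[mask, None] * expert(x[mask])
        return out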
