NCCL's Communication Topologies

Small clusters: Ring. Drawback: as the number of GPUs N grows, latency becomes too high, since it scales linearly with N.

Large clusters: Double-Tree. Advantage: the hop count is lg(N), so latency drops.
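
To make the N versus lg(N) contrast concrete, here is a minimal sketch of a simplified latency model. It counts only sequential communication steps and ignores bandwidth, pipelining, and everything NCCL actually does when tuning. The function names are hypothetical and are not part of NCCL. The ring count assumes the usual reduce-scatter plus all-gather formulation, and the tree count assumes a single complete binary tree (reduce up to the root, then broadcast back down), not NCCL's actual double-tree layout.

```python
def ring_allreduce_steps(n: int) -> int:
    """Sequential steps of a ring all-reduce over n GPUs.

    Assumes (n - 1) reduce-scatter steps followed by (n - 1) all-gather steps.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return 2 * (n - 1)


def tree_allreduce_hops(n: int) -> int:
    """Hops of an all-reduce over a complete binary tree with n GPUs.

    Assumes a reduce up to the root followed by a broadcast back down,
    so the count is twice the tree depth, floor(log2(n)).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    depth = n.bit_length() - 1  # floor(log2(n)) for n >= 1
    return 2 * depth


if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(n, ring_allreduce_steps(n), tree_allreduce_hops(n))
```

Under this model, 1024 GPUs need 2046 sequential ring steps but only 20 tree hops, which is the latency gap the note above refers to.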

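Even larger clusters: two-level. Run an AllReduce inside each node first, then AllReduce the results across machines. Advantage: less data travels over the slower cross-machine links.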
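
The two-level idea can be sketched in plain Python. This is only an illustration of the data flow, assuming a sum reduction and modelling the cluster as a nested list (nodes, then GPUs, then a vector per GPU). The names `elementwise_sum` and `hierarchical_allreduce` are hypothetical and do not correspond to any NCCL API.

```python
from typing import List

Vector = List[float]


def elementwise_sum(vectors: List[Vector]) -> Vector:
    """Sum a list of equal-length vectors element by element."""
    return [sum(values) for values in zip(*vectors)]


def hierarchical_allreduce(cluster: List[List[Vector]]) -> List[List[Vector]]:
    """Two-level all-reduce (sum) over cluster[node][gpu] -> vector."""
    # Stage 1: reduce inside each node (fast intra-node links).
    node_sums = [elementwise_sum(gpus) for gpus in cluster]

    # Stage 2: reduce across nodes. Only one vector per node crosses
    # the slower inter-node links, instead of one vector per GPU.
    global_sum = elementwise_sum(node_sums)

    # Stage 3: every GPU ends up holding the global result.
    return [[list(global_sum) for _ in gpus] for gpus in cluster]


if __name__ == "__main__":
    cluster = [
        [[1.0, 2.0], [3.0, 4.0]],  # node 0, two GPUs
        [[5.0, 6.0], [7.0, 8.0]],  # node 1, two GPUs
    ]
    print(hierarchical_allreduce(cluster))
```

With two nodes of two GPUs each, every GPU finishes with `[16.0, 20.0]`, and only two vectors (one per node) take part in the cross-machine stage.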

Original text:

  • Ring Algorithm: This is NCCL's default communication pattern for small to medium-sized clusters. In this scheme, each GPU sends data to its neighbor and receives data from another neighbor, forming a ring-like structure. CUDA facilitates this by handling the GPU-to-GPU data transfers within the ring, using high-bandwidth connections like NVLink within a node or GPUDirect RDMA across nodes. This algorithm is bandwidth-efficient because each GPU participates in both sending and receiving, distributing the workload across the ring. However, as the number of GPUs in the ring increases, the communication latency becomes a limiting factor, necessitating more complex algorithms for larger clusters.

  • Tree and Hierarchical Algorithms: For larger clusters, NCCL employs tree-based algori
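
The ring scheme from the first bullet above, where each GPU sends to one neighbor and receives from the other, can be simulated in a few lines. This is a sketch under simplifying assumptions: a sum reduction, one buffer element per rank (so each "chunk" is a single number), and sends within a step modelled as simultaneous by snapshotting the outgoing values first. The function name `ring_allreduce` is hypothetical, and the code does not reflect how NCCL implements the algorithm internally.

```python
from typing import List


def ring_allreduce(buffers: List[List[float]]) -> List[List[float]]:
    """Simulate a ring all-reduce (sum) over n ranks with n-element buffers."""
    n = len(buffers)
    if any(len(b) != n for b in buffers):
        raise ValueError("this sketch expects one element per rank in every buffer")
    data = [list(b) for b in buffers]  # work on a copy

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) mod n
    # to its right neighbor, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        outgoing = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for src, chunk, value in outgoing:
            data[(src + 1) % n][chunk] += value
    # Now rank r holds the fully reduced chunk (r + 1) mod n.

    # Phase 2, all-gather: in step s, rank r forwards chunk (r + 1 - s) mod n
    # to its right neighbor, which overwrites its own copy.
    for step in range(n - 1):
        outgoing = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for src, chunk, value in outgoing:
            data[(src + 1) % n][chunk] = value

    return data


if __name__ == "__main__":
    ranks = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [100.0, 200.0, 300.0]]
    print(ring_allreduce(ranks))
```

Each rank sends and receives exactly one chunk per step, which is why the ring uses bandwidth evenly, while the 2(n - 1) sequential steps are the source of the latency growth mentioned in the bullet.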
