Small clusters: Ring; drawback: once there are many GPUs, latency grows too large, scaling linearly with N;
Large clusters: Double-Tree; advantage: hop count is lg(N), so latency drops (the two are compared in the sketch after this list);
Even larger clusters: two-level; first AllReduce within each node, then AllReduce the per-node results across machines; advantage: cuts the data volume sent over the slower cross-machine links;
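To make the latency contrast concrete, a quick step-count comparison in Python. This is a toy sketch: 2*(N-1) steps for a ring and on the order of 2*lg(N) for a binary tree are the standard textbook counts for AllReduce, not measurements of NCCL itself.

    import math

    for n in (4, 16, 64, 256, 1024):
        ring_steps = 2 * (n - 1)                   # reduce-scatter + all-gather
        tree_steps = 2 * math.ceil(math.log2(n))   # up the tree, then back down
        print(f"N={n:5d}  ring: {ring_steps:5d} steps  tree: ~{tree_steps:3d} steps")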
Original text:
Ring Algorithm: This is NCCL's default communication pattern for small to medium-sized clusters. In this scheme, each GPU sends data to one neighbor and receives data from the other, forming a ring. CUDA facilitates this by handling the GPU-to-GPU data transfers within the ring, using high-bandwidth connections like NVLink within a node or GPUDirect RDMA across nodes. The algorithm is bandwidth-efficient because every GPU both sends and receives at each step, spreading the workload evenly around the ring. However, the number of sequential steps grows linearly with the number of GPUs, so communication latency becomes the limiting factor on large rings, necessitating more complex algorithms for larger clusters.
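As an illustration of the pattern described above, here is a minimal pure-Python simulation of ring AllReduce (sum). The function name ring_allreduce and the list-of-chunks model are just for this sketch; real NCCL moves CUDA buffers over NVLink / GPUDirect RDMA, not Python lists.

    def ring_allreduce(chunks):
        """chunks[i][c] = GPU i's local value of chunk c; reduced in place."""
        n = len(chunks)
        # Phase 1: reduce-scatter. At step s, GPU i sends chunk (i - s) % n to
        # neighbor (i + 1) % n, which accumulates it. After n-1 steps, GPU i
        # holds the fully reduced chunk (i + 1) % n.
        for s in range(n - 1):
            sends = [(i, (i - s) % n, chunks[i][(i - s) % n]) for i in range(n)]
            for i, c, v in sends:
                chunks[(i + 1) % n][c] += v
        # Phase 2: all-gather. Circulate the reduced chunks around the ring so
        # every GPU ends up with every fully reduced chunk.
        for s in range(n - 1):
            sends = [(i, (i + 1 - s) % n, chunks[i][(i + 1 - s) % n]) for i in range(n)]
            for i, c, v in sends:
                chunks[(i + 1) % n][c] = v
        return chunks

    # 4 "GPUs", each starting with its own 4-chunk gradient vector.
    data = [[float(g * 10 + c) for c in range(4)] for g in range(4)]
    expected = [sum(data[g][c] for g in range(4)) for c in range(4)]
    assert all(row == expected for row in ring_allreduce([r[:] for r in data]))

Note that each GPU sends exactly one chunk per step, so the whole operation takes 2*(N-1) steps, which is the linear latency term the paragraph refers to.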
Tree and Hierarchical Algorithms: For larger clusters, NCCL employs tree-based algorithms such as the double binary tree, which cut the number of communication hops from linear in N to roughly lg(N) and thereby reduce latency. At even larger scales this can be combined with a hierarchical, two-level scheme: GPUs first run an AllReduce within each node over the fast intra-node links, and only the per-node results take part in a second AllReduce across machines, shrinking the amount of data sent over the slower inter-node network.
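A sketch of the two-level idea under the same toy model. The function hierarchical_allreduce is hypothetical, scalars stand in for gradient buffers, and NCCL's actual hierarchy is more involved; the point is only that one partial sum per node crosses the network instead of every GPU's data.

    def hierarchical_allreduce(values, nodes, gpus_per_node):
        """values[n][g] = scalar gradient on GPU g of node n."""
        # Level 1: intra-node AllReduce over fast links, modeled as a sum.
        node_sums = [sum(values[n]) for n in range(nodes)]
        # Level 2: inter-node AllReduce of one partial sum per node --
        # `nodes` values cross the network instead of nodes * gpus_per_node.
        total = sum(node_sums)
        # Level 1 again: broadcast the result back to every GPU in each node.
        return [[total] * gpus_per_node for _ in range(nodes)]

    out = hierarchical_allreduce([[1, 2], [3, 4]], nodes=2, gpus_per_node=2)
    assert out == [[10, 10], [10, 10]]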