How NCCL's Double Binary Tree Works

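This article looks at how the double binary tree algorithm in NCCL 2.4 improves the efficiency of large-scale GPU AllReduce operations by combining full bandwidth with low latency. Compared with a ring, the double binary tree has a clear latency advantage, especially beyond a few hundred GPUs. It also covers the performance gains the double-tree structure brings to deep learning training, and the new network error handling features.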


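Assume each node holds data of size S and the one-way bandwidth is B; the node-to-node latency of transferring 1 byte is L.

Naive single binary tree: with streaming summation and transfer, reducing to the root takes 2*S/B + lgN*L. Here 2*S is the amount of data each non-leaf node has to receive (S from each of its two children), and this is the bottleneck. The root then broadcasts to all nodes, which takes the same time again. The total AllReduce time is therefore 2*(2*S/B + lgN*L).

RingAllReduce: 2*S*(N-1)/B/N + 2*(N-1)*L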

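Why the double binary tree has low latency: the number of hops is lgN (for RingAllReduce it is N-1).

Why its throughput is high: every node keeps the data flowing. As soon as part of the data arrives from a child, the node sums that part and sends it on to its parent.

In the first tree, all nodes transfer and sum only the first half of the array; the second tree handles only the second half.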

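Assume again that each node holds data of size S and the one-way bandwidth is B. The latency is lgN times the single-hop latency L; latency is ignored for now.

In the first tree, each node spends (S/2)/B sending, each leaf spends nothing receiving, and each non-leaf node spends S/B receiving, so in S/B total the first half of the data is streamed up to the root. The second tree has the same numbers: in S/B the second half of the data is streamed up to its root. With both trees running at once, every node's send and receive bandwidth is saturated (each node is the parent of two other nodes and also the child of two other nodes), and in S/B the data is reduced to the two roots.

The broadcast phase is analogous: data flows from the top down, again with both trees running at once, and in S/B the result is broadcast to all nodes.

So the total AllReduce time for the double binary tree is 2*S/B.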

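Adding latency back in gives 2*S/B + 2*lgN*L (lgN*L for the reduce and lgN*L again for the broadcast, counted the same way as in the single-tree estimate).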

Compared with RingAllReduce's 2*S*(N-1)/B/N + 2*(N-1)*L, the L term grows with lgN instead of N.
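The three estimates above can be compared numerically. The following is a minimal sketch, not NCCL code: the function names are hypothetical, S, B, L and N follow the definitions above, lgN is taken as the tree depth ceil(log2 N), and the double-tree estimate counts lgN*L once for the reduce and once for the broadcast, as the single-tree estimate does.

```c
/* Cost-model sketch. S: bytes per node, B: one-way bandwidth (bytes/s),
   L: per-hop latency (s), N: number of nodes. All names are illustrative. */

/* lgN taken as the depth of a binary tree over N nodes: ceil(log2(N)). */
static int tree_depth(int N) {
    int d = 0;
    while ((1 << d) < N) d++;
    return d;
}

/* Ring: 2*(N-1) steps, each moving S/N bytes and paying one hop of latency. */
double ring_allreduce_time(double S, double B, double L, int N) {
    return 2.0 * S * (N - 1) / B / N + 2.0 * (N - 1) * L;
}

/* Single tree: reduce then broadcast, each limited by the 2*S a non-leaf handles. */
double single_tree_allreduce_time(double S, double B, double L, int N) {
    return 2.0 * (2.0 * S / B + tree_depth(N) * L);
}

/* Double tree: each tree carries S/2, so each phase takes S/B plus the hops. */
double double_tree_allreduce_time(double S, double B, double L, int N) {
    return 2.0 * S / B + 2.0 * tree_depth(N) * L;
}
```

With S = 0 and L = 1, ring_allreduce_time returns 2046 for N = 1024 while double_tree_allreduce_time returns 20, which is the gap in the L term described above.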

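In RingAllReduce, each node sends and receives equal amounts at every step, so once it has summed what it received, there is no bandwidth left to forward the result at the same time.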

Massively Scale Your Deep Learning Training with NCCL 2.4

By Sylvain Jeaugey

Imagine using tens of thousands of GPUs to train your neural network. Using multiple GPUs to train neural networks has become quite common with all deep learning frameworks, providing optimized, multi-GPU, and multi-machine training. Allreduce operations, used to sum gradients over multiple GPUs, have usually been implemented using rings [1] [2] to achieve full bandwidth. The downside of rings is that latency scales linearly with the number of GPUs, preventing scaling above hundreds of GPUs. Enter NCCL 2.4.

Many large scale experiments have replaced the flat ring by a hierarchical, 2D ring algorithm [3] [4] [5] to get reasonably good bandwidth while lowering latency.

NCCL 2.4 now adds double binary trees, which offer full bandwidth and a logarithmic latency even lower than 2D ring latency.

Double binary trees

Double binary trees were introduced in MPI in 2009 [6] and offer the advantage of combining both full bandwidth for broadcast and reduce operations (which can be combined into an allreduce performing a reduce, then a broadcast) and a logarithmic latency, enabling good performance on small and medium size operations.

In NCCL, we build binary trees using an easy-to-implement pattern which maximizes locality, as shown in figure 1.


Figure 1. Binary tree using a power-of-two pattern
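Since the figure itself is not reproduced here, the following shows one way such a power-of-two pattern can be computed. It is a sketch under stated assumptions: rank 0 is the root, and every other rank's position is determined by its lowest set bit. The `TreeLinks` type and `tree_links` function are hypothetical names used for illustration, not NCCL's actual implementation.

```c
/* Parent and children of a rank; -1 means "none". Illustrative type. */
typedef struct {
    int parent;
    int child0;
    int child1;
} TreeLinks;

/* Sketch of a power-of-two binary tree pattern over ranks 0..nranks-1. */
TreeLinks tree_links(int nranks, int rank) {
    TreeLinks t = { -1, -1, -1 };

    if (rank == 0) {
        /* Root: its only child is the largest power of two below nranks. */
        int bit = 1;
        while (bit < nranks) bit <<= 1;
        if (nranks > 1) t.child0 = bit >> 1;
        return t;
    }

    int bit = rank & -rank;              /* lowest set bit of rank */

    /* Parent: clear the lowest bit and set the next one up; if that rank
       does not exist, fall back to just clearing the lowest bit. */
    int up = (rank ^ bit) | (bit << 1);
    if (up >= nranks) up = rank ^ bit;
    t.parent = up;

    /* Children sit half a "bit" to either side; odd ranks have none. */
    int low = bit >> 1;
    if (low > 0) {
        t.child0 = rank - low;           /* always in range */
        while (low > 0 && rank + low >= nranks) low >>= 1;
        if (low > 0) t.child1 = rank + low;
    }
    return t;
}
```

With 8 ranks this yields rank 0 with child 4, rank 4 with children 2 and 6, rank 2 with children 1 and 3, and rank 6 with children 5 and 7, so every odd rank is a leaf.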

Double binary trees rely on the fact that half or fewer of the ranks in a binary tree are nodes and half (or more) of the ranks are leaves. Therefore, we can build a second tree using leaves as nodes and vice-versa for each binary tree. There might be one rank which is a leaf on both trees but no rank is a node on both trees.

Figure 2 shows how we can use the pattern above to build a double binary tree by flipping the tree to invert nodes and leaves.


Figure 2. Two complementary binary trees where each rank is at most a node in one tree and a leaf in the other.
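The "no rank is a node on both trees" property can be checked with a small sketch. It assumes a power-of-two pattern in which rank 0 is the root and odd ranks are leaves, and it models the flip as mirroring ranks (rank r plays the role of rank nranks-1-r in the second tree). That mapping and all function names are assumptions made for illustration; with an odd rank count the mirrored rank keeps the same parity, so a different mapping would be needed there.

```c
/* Number of children of `rank` in the assumed power-of-two pattern tree. */
int pattern_children(int nranks, int rank) {
    if (rank == 0) return nranks > 1 ? 1 : 0;  /* root has a single child */
    if (rank & 1) return 0;                    /* odd ranks are leaves */
    return (rank + 1 < nranks) ? 2 : 1;        /* right child may not exist */
}

/* Second tree: the same pattern applied to the mirrored rank. */
int mirrored_children(int nranks, int rank) {
    return pattern_children(nranks, nranks - 1 - rank);
}

/* Count the ranks that have children in both trees. */
int ranks_interior_in_both(int nranks) {
    int count = 0;
    for (int r = 0; r < nranks; r++) {
        if (pattern_children(nranks, r) > 0 && mirrored_children(nranks, r) > 0)
            count++;
    }
    return count;
}
```

For an even rank count such as 8 or 16, ranks_interior_in_both returns 0: every rank that forwards data in one tree is a leaf in the other.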

If you superimpose the two trees, all ranks have both two parents and two children except for the root ranks, which only have one parent and one child. If we use each of the two trees to process half of the data, each rank will at most receive half of the data twice and send half of the data twice, which is as optimal as rings in terms of data sent/received.

Performance at scale

We tested NCCL 2.4 on various large machines, including the Summit [7] supercomputer, up to 24,576 GPUs. As figure 3 shows, latency improves significantly using trees. The difference from ring increases with the scale, with up to 180x improvement at 24k GPUs.


Figure 3. NCCL latency on up to 24,576 GPUs

We confirmed that the system maintains full bandwidth with double binary trees. At scale, bandwidth degrades a bit when we cross L3 switches in the InfiniBand fabric, which we believe is due to inefficiencies between the NCCL communication pattern and InfiniBand routing algorithms.

While not perfect, this might be improved in the future. Even so, trees still show a clear advantage even when limited in bandwidth because of their small initial latency. However, NCCL automatically switches back to rings when that pattern results in greater bandwidth.


Figure 4. NCCL bus bandwidth on up to 24,576 GPUs

Effect on DL training

Figure 5 shows performance improvement on DL training is significant, and increases as we scale to larger numbers of GPUs.

We compared NCCL 2.3 and NCCL 2.4, as well as the 2D hierarchical rings using NCCL 2.3. The hierarchical ring is a 2D ring (intra-node/inter-node being the 2 dimensions) which performs a reduce-scatter operation inside the node, then multiple all-reduce operations between nodes, then an all-gather operation inside the node again.
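The three phases of the hierarchical ring can be illustrated with a host-side simulation. This is a sketch only: it models the data movement with plain arrays rather than NCCL calls, the sizes `NODES` and `GPUS` and the function name are hypothetical, and each GPU's buffer is assumed to be split into one single-element chunk per local GPU.

```c
#define NODES 2   /* number of nodes (illustrative) */
#define GPUS  4   /* GPUs per node, also the number of chunks per buffer */

/* data[m][k][c]: chunk c of the buffer held by GPU k on node m.
   On return, every GPU holds the global sum of each chunk. */
void hierarchical_allreduce(float data[NODES][GPUS][GPUS]) {
    float part[NODES][GPUS];  /* part[m][g]: chunk g as owned by GPU g of node m */

    /* 1. Reduce-scatter inside each node: GPU g ends up with chunk g
          summed over the GPUs of its own node. */
    for (int m = 0; m < NODES; m++) {
        for (int g = 0; g < GPUS; g++) {
            part[m][g] = 0.0f;
            for (int k = 0; k < GPUS; k++) part[m][g] += data[m][k][g];
        }
    }

    /* 2. One all-reduce between nodes per local GPU index g. */
    for (int g = 0; g < GPUS; g++) {
        float total = 0.0f;
        for (int m = 0; m < NODES; m++) total += part[m][g];
        for (int m = 0; m < NODES; m++) part[m][g] = total;
    }

    /* 3. All-gather inside each node: every GPU receives every chunk. */
    for (int m = 0; m < NODES; m++)
        for (int k = 0; k < GPUS; k++)
            for (int g = 0; g < GPUS; g++)
                data[m][k][g] = part[m][g];
}
```

Step 2 is where the "multiple all-reduce operations between nodes" come from: there is one independent inter-node all-reduce for each local GPU index.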


Figure 5. Performance comparison on ResNet50

While the hierarchical rings perform better than non-hierarchical rings, their advantage at scale remains constant. The tree algorithm, on the other hand, offers an increasing advantage as we scale.

Other features

Network error handling

NCCL operations behave as CUDA kernels. Once the operation launches on a CUDA stream, the user waits for its completion using stream semantics, e.g. cudaStreamQuery or cudaStreamSynchronize. It’s convenient to have the NCCL operation start as soon as the CUDA kernel producing the data completes, but it doesn’t let NCCL report errors during communication.

However, as we start using the network between nodes, network errors can occur and could prevent the NCCL operation from completing, causing a hang. This becomes increasingly important as we grow in size. NCCL 2.4 introduces two new verbs, ncclCommGetAsyncError and ncclCommAbort, to handle this.

Programs can call ncclCommGetAsyncError in a loop waiting for operations to complete. If an error happens, they can abort the application or try to only abort the communicator operation with ncclCommAbort, then recreate a new communicator with the remaining nodes.

An example of using those two functions can be found in the documentation. Here is a simplified example illustrating their usage:

#include <stdio.h>
#include <cuda_runtime.h>
#include <nccl.h>

// Wait for `stream` to finish while polling `comm` for asynchronous errors.
// Returns 0 on success, 1 on an abnormal error, 2 if the communicator was aborted.
int ncclStreamSynchronize(cudaStream_t stream, ncclComm_t comm) {
   while (1) {
      cudaError_t cudaErr = cudaStreamQuery(stream);
      ncclResult_t ncclAsyncErr, ncclErr;
      ncclErr = ncclCommGetAsyncError(comm, &ncclAsyncErr);

      if (cudaErr == cudaSuccess) return 0;
      if (cudaErr != cudaErrorNotReady || ncclErr != ncclSuccess) {
         printf("CUDA/NCCL Error : %d/%d\n", cudaErr, ncclErr);
         return 1; // Abnormal error
      }
      if (ncclAsyncErr != ncclSuccess) { // Async network error
         // Stop and destroy communicator
         ncclErr = ncclCommAbort(comm);
         if (ncclErr != ncclSuccess) {
            printf("NCCL Comm Abort error : %d\n", ncclErr);
            return 1; // Abnormal error
         }
         return 2; // Normal error : may recreate a new comm
      }
   }
}

This function can be generalized to include polling for other asynchronous operations, such as MPI, socket, or other I/O operations.

Support for more networks

NCCL 2.4 comes with native support for TCP/IP Sockets and InfiniBand Verbs. TCP/IP sockets should work on most networks but can also be bandwidth- and latency-limited due to limitations in the kernel driver. CPU affinity can also be complex to handle.

The InfiniBand verbs library enables an application to bypass the kernel and directly handle all network communication from user space. This is the preferred API to use on InfiniBand and RDMA over Converged Ethernet (RoCE) capable hardware.

Some other networking providers have different network APIs that provide better performance than TCP/IP sockets. Those vendors can get the best performance from NCCL by implementing an external network plugin to be used by NCCL when present. This can be provided in the form of a library named libnccl-net.so. NCCL includes an example in ext-net/dummy. Check out one example in the plugin for the libfabrics API.

Get NCCL 2.4 Today

You can get started scaling your applications to massive numbers of GPUs today. Pre-built NCCL packages can be obtained from the download page. The source code is also available on GitHub.

References


[1] Baidu Allreduce


[2] Horovod


[3] Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, Tiegang Chen, Guangxiao Hu, Shaohuai Shi, Xiaowen Chu; Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes


[4] Hiroaki Mikami, Hisahiro Suganuma, Pongsakorn U-chupala, Yoshiki Tanaka, Yuichi Kageyama; ImageNet/ResNet-50 Training in 224 Seconds


[5] Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, Youlong Cheng; Image Classification at Supercomputer Scale


[6] Peter Sanders; Jochen Speck, Jesper Larsson Träff (2009); Two-tree algorithms for full bandwidth broadcast, reduction and scan


[7] Summit Supercomputer
