NCCL源码解析⑥:Channel搜索

NCCL中channel的搜索过程解析
本文主要介绍NCCL中channel的搜索过程。channel表示通信路径,NCCL使用多channel以利用带宽和网卡。先分析单机八卡简化情况,介绍初始化、搜索算法及回溯过程;再阐述多机场景,如两机十六卡,最终基于机器拓扑搜索出一组channel用于数据通信并记录。

b47d799c4f110d502bd2060dcecc864e.png

作者|KIDGINBROOK
更新|潘丽晨

上节讲到已经计算出GPU和NIC节点到其他任意节点的最优路径了,本节看下NCCL中channel的搜索过程。

NCCL中channel的概念表示一个通信路径,为了更好地利用带宽和网卡,以及同一块数据可以通过多个channel并发通信,另外后续可以看到一个channel对应了一个GPU SM,所以基于这些原因,NCCL会使用多channel,搜索的过程就是搜索出来一组channel。

如上节所述,单机的情况下会在ncclTopoTrimSystem函数里删除网卡,因此我们先看下单机八卡这种简化的情况,最后再看下多机引入网卡之后的情况。

static float getMaxWidth(struct ncclTopoSystem* system, struct ncclTopoNode* gpu, int type) {
  float maxWidth = 0.0;
  for (int i=0; i<system->nodes[type].count; i++) {
    struct ncclTopoLinkList* path = gpu->paths[type]+i;
    float width = path->width;
    if (path->count == 0) continue;
    maxWidth = std::max(maxWidth, width);
  }
  return maxWidth;
}
 
 
ncclResult_t ncclTopoSearchInit(struct ncclTopoSystem* system) {
  system->maxWidth = 0.0;
  int inter = system->nodes[NET].count;
  if (inter == 0 && system->nodes[GPU].count == 1) {
    system->maxWidth = LOC_WIDTH;
    return ncclSuccess;
  }
  for (int g=0; g<system->nodes[GPU].count; g++) {
    struct ncclTopoNode* gpu = system->nodes[GPU].nodes+g;
    system->maxWidth = std::max(system->maxWidth, getMaxWidth(system, gpu, inter ? NET : GPU));
  }
  return ncclSuccess;
}

ncclTopoSearchInit就是初始化system->maxWidth,如果是单机单卡的情况,那么maxWidth设置为LOC_WIDTH,否则就遍历每个GPU节点,查看到其他所有GPU节点或者网卡最大带宽。

struct ncclTopoGraph ringGraph;
  ringGraph.id = 0;
  ringGraph.pattern = NCCL_TOPO_PATTERN_RING;
  ringGraph.crossNic = ncclParamCrossNic();
  ringGraph.collNet = 0;
  ringGraph.minChannels = 1;
  ringGraph.maxChannels = MAXCHANNELS/2;
  NCCLCHECK(ncclTopoCompute(comm->topo, &ringGraph));
  NCCLCHECK(ncclTopoPrintGraph(comm->topo, &ringGraph));

nccl执行集合通信时支持ring,tree和collnet三种算法,这里我们以ring来举例,后续专门介绍ring和tree。

struct ncclTopoGraph {
  // Input / output
  int id; // ring : 0, tree : 1, collnet : 2
  int pattern;
  int crossNic;
  int collNet;
  int minChannels;
  int maxChannels;
  // Output
  int nChannels;      // 搜索到的channel数量
  float speedIntra;   // 节点内单个channel带宽
  float speedInter;   // 节点间单个channel带宽
  int typeIntra;      // 节点内channel的路径类型
  int typeInter;      // 节点间channel的路径类型
  int sameChannels;   // channel是否一样
  int nHops;
  int intra[MAXCHANNELS*NCCL_TOPO_MAX_NODES];  // 节点内每个channel路径
  int inter[MAXCHANNELS*2];                    // 节点间每个channel路径
};

ncclTopoGraph记录了搜索到的结果,具体含义见注释。

然后看下ncclTopoCompute,这里就是实际搜索channel的过程,目标是搜索出来尽可能多,带宽尽可能大的一系列channel,本质就是暴力搜索,先设置一系列的条件搜答案,如果搜不出来则降低条件继续搜。

由于此时没有NET节点,所以crossNic为0,然后初始化graph,首先设置最高的条件,限制节点内部只能使用不超过PATH_NVL路径,节点间只能使用不超过PATH_PIX的路径,然后通过system-maxWidth设置speedIntra和speedInter,接着执行ncclTopoSearchRec搜索出一个答案存储到tmpGraph中。

如果此时就是最优的结果,channel数等于maxChannel,并且speedInter也等于maxWidth,则直接退出,否则就开始逐步降低条件,比如将sameChannel设置为0,允许channel之间不一样;调大typeIntra和typeInter;允许crossNic;调小speedInter和speedIntra。

ncclResult_t ncclTopoCompute(ncclTopoSystem* system, struct ncclTopoGraph* graph) {
  int ngpus = system->nodes[GPU].count;
  int crossNic = (system->nodes[NET].count > 1) && graph->crossNic ? 1 : 0;
  graph->speedIntra = graph->speedInter = 0;
  if (graph->crossNic == 2) graph->crossNic = 0;
  graph->typeIntra = ngpus == 1 ? PATH_LOC : PATH_NVL;
  graph->typeInter = PATH_PIX;
  graph->nChannels = 0;
  graph->sameChannels = 1;
 
 
  if (ngpus == 1) if (graph->pattern != NCCL_TOPO_PATTERN_RING) graph->pattern = NCCL_TOPO_PATTERN_TREE;
 
  struct ncclTopoGraph tmpGraph;
  memcpy(&tmpGraph, graph, sizeof(struct ncclTopoGraph));
 
  // First try crossnic, then decrease speed and finally increase speedIntra.
  tmpGraph.pattern = graph->pattern;
  int pass = 1;
  int speedIndex = 0;
  while (speedArray[speedIndex] > system->maxWidth && speedIndex < NSPEEDS-1) speedIndex++;
  tmpGraph.speedIntra = tmpGraph.speedInter = speedArray[speedIndex];
  int64_t globalTimeout = NCCL_SEARCH_GLOBAL_TIMEOUT;
 
search:
  int time = tmpGraph.sameChannels ? NCCL_SEARCH_TIMEOUT_SAMECHANNELS :
    tmpGraph.pattern == NCCL_TOPO_PATTERN_TREE ? NCCL_SEARCH_TIMEOUT_TREE : NCCL_SEARCH_TIMEOUT;
  tmpGraph.nChannels = 0;
  globalTimeout -= time;
 
  NCCLCHECK(ncclTopoSearchRec(system, &tmpGraph, graph, &time));
 
  // Optimal solution, stop here
  if (graph->nChannels == graph->maxChannels && graph->speedInter == system->maxWidth) goto done;
 
  if (pass == 1) {
    // First pass, we don't have a solution yet ; try other options
 
    // Try having different channels
    if (tmpGraph.sameChannels == 1) {
      tmpGraph.sameChannels = 0;
      goto search;
    }
    tmpGraph.sameChannels = 1;
 
    if (time != -1) globalTimeout += time;
    else globalTimeout = NCCL_SEARCH_GLOBAL_TIMEOUT;
    if (globalTimeout < 0) goto done;
 
    int maxTypeIntra = system->nodes[NET].count > 0 ? tmpGraph.typeInter : PATH_SYS;
    if (tmpGraph.typeIntra < maxTypeIntra && (graph->nChannels == 0 || tmpGraph.typeIntra < graph->typeIntra)) {
      tmpGraph.typeIntra += 1;
      goto search;
    }
    tmpGraph.typeIntra = ngpus == 1 ? PATH_LOC : PATH_NVL;
    if (system->nodes[NET].count > 0 && tmpGraph.typeInter < PATH_SYS && (graph->nChannels == 0 || tmpGraph.typeInter < graph->typeInter || tmpGraph.typeInter < PATH_PXB)) {
      tmpGraph.typeInter += 1;
      goto search;
    }
    tmpGraph.typeInter = PATH_PIX;
 
    // Try 
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值