DeepEP源码解析：从Buffer类到内核实现的技术细节-优快云博客

DeepEP源码解析：从Buffer类到内核实现的技术细节

【免费下载链接】DeepEP DeepEP: an efficient expert-parallel communication library 项目地址: https://gitcode.com/GitHub_Trending/de/DeepEP

引言

在分布式训练系统中，专家并行（Expert Parallelism, EP）已成为提升大规模模型性能的关键技术。然而，跨节点通信的延迟和吞吐量瓶颈一直是阻碍其广泛应用的主要障碍。DeepEP作为一个高效的专家并行通信库，通过精心设计的Buffer管理机制和优化的内核实现，显著提升了分布式训练的性能。本文将深入剖析DeepEP的源码实现，从Python层的Buffer类到底层CUDA内核，全面揭示其高效通信的技术细节。

读完本文后，您将能够：

理解DeepEP的Buffer管理机制及其在专家并行中的作用
掌握DeepEP底层通信内核的设计原理和实现细节
了解如何优化分布式系统中的跨节点通信性能
学会在实际项目中应用DeepEP进行高效的专家并行训练

DeepEP架构概览

DeepEP采用分层架构设计，从高层Python API到底层CUDA内核，形成了一个完整的专家并行通信解决方案。其架构如图1所示：

mermaid

图1: DeepEP架构图

DeepEP的核心组件包括：

Python API层：提供用户友好的接口，主要是Buffer类
C++运行时层：实现核心逻辑，管理通信资源
CUDA内核层：实现高性能的通信操作
通信层：封装底层通信技术，如NVLink和RDMA

下面我们将从Python层的Buffer类开始，逐步深入到DeepEP的底层实现。

Buffer类详解

Buffer类是DeepEP的核心组件，负责管理专家并行通信所需的缓冲区。它支持多种通信模式，包括高吞吐量的节点内通信、高吞吐量的节点间通信以及低延迟的全对全通信。

Buffer类的初始化

Buffer类的构造函数是理解其工作原理的关键。让我们来看一下其初始化过程：

def __init__(self, group: Optional[dist.ProcessGroup],
             num_nvl_bytes: int = 0, num_rdma_bytes: int = 0,
             low_latency_mode: bool = False, num_qps_per_rank: int = 24,
             allow_nvlink_for_low_latency_mode: bool = True,
             allow_mnnvl: bool = False,
             explicitly_destroy: bool = False,
             comm: Optional["mpi4py.MPI.Comm"] = None) -> None:
    check_nvlink_connections(group)
    
    # 初始化CPP运行时
    if group is not None:
        self.rank = group.rank()
        self.group = group
        self.group_size = group.size()
        
        def all_gather_object(obj):
            object_list = [None] * self.group_size
            dist.all_gather_object(object_list, obj, group)
            return object_list
    elif comm is not None:
        # 使用MPI通信器初始化
        # ...省略代码...
    else:
        raise ValueError("Either 'group' or 'comm' must be provided.")
    
    # 设置NVSHMEM环境变量
    if self.runtime.get_num_rdma_ranks() > 1 or low_latency_mode:
        os.environ['NVSHMEM_DISABLE_P2P'] = '0' if allow_nvlink_for_low_latency_mode else '1'
        os.environ['NVSHMEM_IB_ENABLE_IBGDA'] = '1'
        os.environ['NVSHMEM_IBGDA_NUM_RC_PER_PE'] = f'{num_qps_per_rank}'
        # ...设置其他环境变量...
        
    # 同步设备ID、IPC句柄和NVSHMEM唯一ID
    self.runtime.sync(device_ids, ipc_handles, root_unique_id)
    assert self.runtime.is_available()

Buffer类的初始化过程主要包括以下几个步骤：

检查NVLink连接，确保硬件支持
初始化通信组信息，包括rank和group_size
设置NVSHMEM环境变量，配置RDMA通信参数
同步设备ID、IPC句柄和NVSHMEM唯一ID，建立跨节点通信通道

Buffer类的核心方法

Buffer类提供了丰富的方法来支持专家并行通信，其中最核心的是dispatch和combine方法。

dispatch方法

dispatch方法负责将token分发到不同的rank，支持节点内和节点间通信：

def dispatch(self, x: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],
             handle: Optional[Tuple] = None,
             num_tokens_per_rank: Optional[torch.Tensor] = None, 
             # ...其他参数...
             ) -> Tuple[Union[Tuple[torch.Tensor, torch.Tensor], torch.Tensor], ...]:
    # 默认配置
    config = self.get_dispatch_config(self.group_size) if config is None else config
    
    # 节点间通信
    if self.runtime.get_num_rdma_ranks() > 1:
        return self.internode_dispatch(...)
    
    # 使用缓存或非缓存模式启动内核
    x, x_scales = x if isinstance(x, tuple) else (x, None)
    if handle is not None:
        # 使用缓存的布局信息
        # ...省略代码...
    else:
        # 计算新的布局信息
        # ...省略代码...

dispatch方法的工作流程：

检查是否需要节点间通信，如果是则调用internode_dispatch
根据是否有缓存的布局信息，选择不同的执行路径
调用C++运行时的intranode_dispatch或internode_dispatch方法执行实际的通信操作

combine方法

combine方法负责将来自不同rank的token合并：

def combine(self, x: torch.Tensor, handle: Tuple,
            topk_weights: Optional[torch.Tensor] = None,
            # ...其他参数...
            ) -> Tuple[torch.Tensor, Optional[torch.Tensor], EventOverlap]:
    # 默认配置
    config = self.get_combine_config(self.group_size) if config is None else config
    
    # 节点间通信
    if self.runtime.get_num_rdma_ranks() > 1:
        return self.internode_combine(...)
    
    # 解包handle和bias
    rank_prefix_matrix, _, channel_prefix_matrix, src_idx, is_recv_token_in_rank, send_head = handle
    bias_0, bias_1 = Buffer._unpack_bias(bias)
    
    # 启动内核
    recv_x, recv_topk_weights, event = self.runtime.intranode_combine(
        x, topk_weights, bias_0, bias_1,
        src_idx, rank_prefix_matrix, channel_prefix_matrix, send_head, config,
        getattr(previous_event, 'event', None), async_finish, allocate_on_comm_stream)
    return recv_x, recv_topk_weights, EventOverlap(event)

combine方法的工作流程与dispatch类似，但方向相反，它将分散在不同rank的token合并回原始rank。

Buffer类的内存管理

DeepEP采用了精心设计的内存管理策略，以确保高效的通信性能。Buffer类提供了get_local_buffer_tensor方法来获取缓冲区的PyTorch张量视图：

def get_local_buffer_tensor(self, dtype: torch.dtype, size: Optional[torch.Size] = None,
                            offset: int = 0, use_rdma_buffer: bool = False) -> torch.Tensor:
    tensor = self.runtime.get_local_buffer_tensor(dtype, offset, use_rdma_buffer)
    if size is None:
        return tensor
    
    assert tensor.numel() >= size.numel()
    return tensor[:size.numel()].view(size)

这个方法允许用户直接访问通信缓冲区，避免了不必要的数据拷贝，提高了性能。

C++运行时层

Python API之下是C++运行时层，它实现了DeepEP的核心逻辑。C++层的Buffer类定义在csrc/deep_ep.hpp中：

namespace deep_ep {

struct Buffer {
private:
    // 低延迟模式缓冲区
    int low_latency_buffer_idx = 0;
    bool low_latency_mode = false;

    // NVLink缓冲区
    int64_t num_nvl_bytes;
    void* buffer_ptrs[NUM_MAX_NVL_PEERS] = {nullptr};
    void** buffer_ptrs_gpu = nullptr;

    // NVSHMEM缓冲区
    int64_t num_rdma_bytes;
    void* rdma_buffer_ptr = nullptr;

    // 设备信息和通信
    int device_id;
    int num_device_sms;
    int rank, rdma_rank, nvl_rank;
    int num_ranks, num_rdma_ranks, num_nvl_ranks;
    cudaIpcMemHandle_t ipc_handles[NUM_MAX_NVL_PEERS];

    // 通信流
    at::cuda::CUDAStream comm_stream;

    // 同步后设置为true
    bool available = false;

    // 其他成员变量...

public:
    Buffer(int rank, int num_ranks, int64_t num_nvl_bytes, int64_t num_rdma_bytes, 
           bool low_latency_mode, bool explicitly_destroy);

    ~Buffer() noexcept(false);

    // 公共方法...
};

} // namespace deep_ep

C++层的Buffer类管理着实际的通信缓冲区和通信资源，如NVLink和RDMA缓冲区、通信流等。它提供了intranode_dispatch、internode_dispatch、intranode_combine和internode_combine等方法，这些方法会调用底层的CUDA内核来执行实际的通信操作。

CUDA内核层

DeepEP的性能优势很大程度上来自于其优化的CUDA内核。这些内核实现了高效的全对全通信，充分利用了NVLink和RDMA等硬件特性。

缓冲区模板类

在csrc/kernels/buffer.cuh中，定义了一系列缓冲区模板类，如Buffer、AsymBuffer和SymBuffer，用于管理GPU上的通信缓冲区：

template <typename dtype_t>
struct Buffer {
private:
    uint8_t* ptr;

public:
    int total_bytes;

    __device__ __forceinline__ Buffer() : ptr(nullptr), total_bytes(0) {}

    __device__ __forceinline__ Buffer(void* &gbl_ptr, int num_elems, int offset = 0) {
        total_bytes = num_elems * sizeof(dtype_t);
        ptr = reinterpret_cast<uint8_t*>(gbl_ptr) + offset * sizeof(dtype_t);
        gbl_ptr = reinterpret_cast<uint8_t*>(gbl_ptr) + total_bytes;
    }

    __device__ __forceinline__ dtype_t* buffer() {
        return reinterpret_cast<dtype_t*>(ptr);
    }

    __device__ __forceinline__ dtype_t& operator[](int idx) {
        return buffer()[idx];
    }
};

这些模板类提供了高效的缓冲区管理，支持不同类型的数据和通信模式。

内核启动配置

csrc/kernels/launch.cuh中定义了内核启动的宏，用于配置CUDA内核的启动参数：

#ifndef SETUP_LAUNCH_CONFIG
#ifndef DISABLE_SM90_FEATURES
#define SETUP_LAUNCH_CONFIG(num_sms, num_threads, stream) \
    cudaLaunchConfig_t cfg = {(num_sms), (num_threads), 0, stream, nullptr, 0}; \
    cudaLaunchAttribute attr[2]; \
    attr[0].id = cudaLaunchAttributeCooperative; \
    attr[0].val.cooperative = 1; \
    attr[1].id = cudaLaunchAttributeClusterDimension; \
    attr[1].val.clusterDim.x = (num_sms % 2 == 0 ? 2 : 1); \
    attr[1].val.clusterDim.y = 1; \
    attr[1].val.clusterDim.z = 1; \
    cfg.attrs = attr; \
    cfg.numAttrs = 2
#else
#define SETUP_LAUNCH_CONFIG(sms, threads, stream) \
    int __num_sms = (sms); \
    int __num_threads = (threads); \
    auto __stream = (stream)
#endif
#endif

这些宏根据GPU架构和内核需求，配置最优的启动参数，如线程块大小、共享内存大小等。

节点间通信内核

节点间通信内核实现了跨节点的专家并行通信，位于csrc/kernels/internode.cu中。以notify_dispatch内核为例：

template <bool kLowLatencyMode, int kNumRDMARanks>
__global__ void
notify_dispatch(const int* num_tokens_per_rank, int* moe_recv_counter_mapped, int num_ranks,
                const int* num_tokens_per_rdma_rank, int* moe_recv_rdma_counter_mapped,
                const int* num_tokens_per_expert, int* moe_recv_expert_counter_mapped, int num_experts,
                const bool* is_token_in_rank, int num_tokens, int num_channels, int expert_alignment,
                const int rdma_clean_offset, const int rdma_num_int_clean,
                const int nvl_clean_offset, const int nvl_num_int_clean,
                int* rdma_channel_prefix_matrix, int* recv_rdma_rank_prefix_sum,
                int* gbl_channel_prefix_matrix, int* recv_gbl_rank_prefix_sum,
                void* rdma_buffer_ptr,
                void** buffer_ptrs, int** barrier_signal_ptrs, int rank,
                const nvshmem_team_t rdma_team) {
    auto sm_id = static_cast<int>(blockIdx.x);
    auto thread_id = static_cast<int>(threadIdx.x), warp_id = thread_id / 32, lane_id = get_lane_id();
    auto num_threads = static_cast<int>(blockDim.x), num_warps = num_threads / 32;

    auto rdma_rank = rank / NUM_MAX_NVL_PEERS, nvl_rank = rank % NUM_MAX_NVL_PEERS;
    auto num_rdma_experts = num_experts / kNumRDMARanks, num_nvl_experts = num_rdma_experts / NUM_MAX_NVL_PEERS;

    if (sm_id == 0) {
        // 与其他节点通信
        // ...同步和通信代码...
    } else {
        // 计算元数据
        int dst_rdma_rank = sm_id - 1;
        for (int channel_id = warp_id; channel_id < num_channels; channel_id += num_warps) {
            int token_start_idx, token_end_idx;
            get_channel_task_range(num_tokens, num_channels, channel_id, token_start_idx, token_end_idx);

            // 遍历token
            int total_count = 0, per_nvl_rank_count[NUM_MAX_NVL_PEERS] = {0};
            for (int64_t i = token_start_idx + lane_id; i < token_end_idx; i += 32) {
                auto is_token_in_rank_uint64 = *reinterpret_cast<const uint64_t*>(is_token_in_rank + i * num_ranks + dst_rdma_rank * NUM_MAX_NVL_PEERS);
                auto is_token_in_rank_values = reinterpret_cast<const bool*>(&is_token_in_rank_uint64);
                #pragma unroll
                for (int j = 0; j < NUM_MAX_NVL_PEERS; ++ j)
                    per_nvl_rank_count[j] += is_token_in_rank_values[j];
                total_count += (is_token_in_rank_uint64 != 0);
            }

            // Warp归约和写入通道矩阵
            // ...代码...
        }
    }
}

notify_dispatch内核负责在节点间同步token分发信息，它使用了以下优化技术：

按SM和warp划分工作负载，充分利用GPU的并行计算能力
使用位运算优化token归属信息的处理
使用warp归约高效计算每个rank的token数量
通过共享内存和全局内存的合理使用，优化内存访问模式

IBGDA技术

DeepEP使用IBGDA（InfiniBand GPU Direct Access）技术来加速RDMA通信。ibgda_device.cuh中实现了IBGDA相关的设备端函数，如nvshmemi_ibgda_put_nbi_warp：

template <bool kAlwaysDoPostSend = false>
__device__ static __forceinline__ void
nvshmemi_ibgda_put_nbi_warp(uint64_t req_rptr, uint64_t req_lptr, size_t bytes, int dst_pe, int qp_id, int lane_id, int message_idx) {
    // 获取lkey和rkey，存储到lane中
    uint32_t num_wqes = 0;
    __be32 my_lkey = 0;
    uint64_t my_laddr = 0;
    __be32 my_rkey = 0;
    uint64_t my_raddr = 0;
    uint64_t my_chunk_size = 0;

    auto qp = ibgda_get_rc(dst_pe, qp_id);

    // 决定需要多少个消息
    auto remaining_bytes = bytes;
    while (remaining_bytes > 0) {
        if (lane_id == num_wqes) {
            my_chunk_size = min(remaining_bytes,
                                ibgda_get_lkey_and_rkey(my_laddr = req_lptr,
                                                        &my_lkey,
                                                        req_rptr,
                                                        dst_pe,
                                                        &my_raddr,
                                                        &my_rkey,
                                                        qp->dev_idx));
        }

        // 移动到下一个消息
        auto chunk_size = __shfl_sync(0xffffffff, my_chunk_size, static_cast<int>(num_wqes));
        remaining_bytes -= chunk_size;
        req_lptr += chunk_size;
        req_rptr += chunk_size;
        ++ num_wqes;
    }
    EP_DEVICE_ASSERT(num_wqes <= 32);

    // 处理WQE
    uint64_t base_wqe_idx = 0;
    if (lane_id == 0)
        base_wqe_idx = ibgda_reserve_wqe_slots(qp, num_wqes);
    base_wqe_idx = __shfl_sync(0xffffffff, base_wqe_idx, 0);
    if (lane_id < num_wqes) {
        auto wqe_idx = base_wqe_idx + lane_id;
        auto wqe_ptr = ibgda_get_wqe_ptr(qp, wqe_idx);
        ibgda_write_rdma_write_wqe(qp, my_laddr, my_lkey, my_raddr, my_rkey, my_chunk_size,
                                   wqe_idx, &wqe_ptr);
    }
    __syncwarp();

    // 提交请求
    if (lane_id == 0)
        ibgda_submit_requests<kAlwaysDoPostSend>(qp, base_wqe_idx, num_wqes, message_idx);
    __syncwarp();
}

这个函数实现了基于IBGDA的非阻塞RDMA put操作，使用warp级别的并行来高效地处理多个RDMA请求。

性能优化技术

DeepEP采用了多种性能优化技术，使其在专家并行通信中表现出色。

通信与计算重叠

DeepEP通过异步通信和事件同步，实现了通信与计算的重叠：

def get_dispatch_layout(self, topk_idx: torch.Tensor, num_experts: int,
                        previous_event: Optional[EventOverlap] = None, async_finish: bool = False,
                        allocate_on_comm_stream: bool = False) -> \
        Tuple[torch.Tensor, Optional[torch.Tensor], torch.Tensor, torch.Tensor, EventOverlap]:
    num_tokens_per_rank, num_tokens_per_rdma_rank, num_tokens_per_expert, is_token_in_rank, event = \
        self.runtime.get_dispatch_layout(topk_idx, num_experts, getattr(previous_event, 'event', None),
                                         async_finish, allocate_on_comm_stream)
    return num_tokens_per_rank, num_tokens_per_rdma_rank, num_tokens_per_expert, is_token_in_rank, EventOverlap(event)

get_dispatch_layout方法返回一个EventOverlap对象，调用者可以使用这个对象来等待布局计算完成，从而 overlap 通信和计算。

多层次缓冲区设计

DeepEP使用了多层次的缓冲区设计，包括NVLink缓冲区和RDMA缓冲区，以适应不同的通信需求：

// NVLink Buffer
int64_t num_nvl_bytes;
void* buffer_ptrs[NUM_MAX_NVL_PEERS] = {nullptr};
void** buffer_ptrs_gpu = nullptr;

// NVSHMEM Buffer
int64_t num_rdma_bytes;
void* rdma_buffer_ptr = nullptr;

这种设计允许DeepEP根据通信距离（节点内或节点间）选择最优的通信路径和缓冲区。

自适应配置

DeepEP提供了get_dispatch_config和get_combine_config方法，根据rank数量返回优化的通信配置：

@staticmethod
def get_dispatch_config(num_ranks: int) -> Config:
    # TODO: 自动调优
    config_map = {
        2: Config(Buffer.num_sms, 24, 256, 6, 128),
        4: Config(Buffer.num_sms, 6, 256, 6, 128),
        8: Config(Buffer.num_sms, 6, 256, 6, 128),
        16: Config(Buffer.num_sms, 36, 288, 20, 128),
        # ...其他配置...
    }
    assert num_ranks in config_map, f'Unsupported number of EP ranks: {num_ranks}'
    return config_map[num_ranks]

这些配置包括SM数量、线程块大小、通信通道数量等参数，通过调整这些参数可以优化不同规模下的通信性能。

使用示例

下面是一个使用DeepEP进行专家并行训练的简单示例：

import torch
import torch.distributed as dist
from deep_ep import Buffer

# 初始化分布式环境
dist.init_process_group(backend='nccl')
rank = dist.get_rank()
world_size = dist.get_world_size()

# 创建通信组
group = dist.new_group(list(range(world_size)))

# 初始化DeepEP Buffer
buffer = Buffer(group, num_nvl_bytes=1024*1024*100, num_rdma_bytes=1024*1024*1000)

# 生成随机输入数据
num_tokens = 1024
hidden_dim = 2048
x = torch.randn(num_tokens, hidden_dim, dtype=torch.bfloat16, device=f'cuda:{rank}')

# 生成随机的topk专家索引
num_experts = 16
num_topk = 2
topk_idx = torch.randint(0, num_experts, (num_tokens, num_topk), dtype=torch.int64, device=f'cuda:{rank}')

# 计算分发布局
num_tokens_per_rank, num_tokens_per_rdma_rank, num_tokens_per_expert, is_token_in_rank, event = \
    buffer.get_dispatch_layout(topk_idx, num_experts)

# 等待布局计算完成
event.wait()

# 获取分发配置
config = buffer.get_dispatch_config(world_size)

# 分发token
recv_x, recv_topk_idx, recv_topk_weights, num_recv_tokens_per_expert_list, handle, event = \
    buffer.dispatch(x, num_tokens_per_rank=num_tokens_per_rank, 
                    num_tokens_per_rdma_rank=num_tokens_per_rdma_rank,
                    is_token_in_rank=is_token_in_rank, num_tokens_per_expert=num_tokens_per_expert,
                    topk_idx=topk_idx, config=config)

# 等待分发完成
event.wait()

# 在专家上执行计算 (示例)
experts = torch.nn.ModuleList([torch.nn.Linear(hidden_dim, hidden_dim) for _ in range(num_experts//world_size)])
output = []
offset = 0
for i, expert in enumerate(experts):
    num_tokens = num_recv_tokens_per_expert_list[i]
    if num_tokens > 0:
        output.append(expert(recv_x[offset:offset+num_tokens]))
    offset += num_tokens
output = torch.cat(output, dim=0)

# 合并结果
combined_x, _, event = buffer.combine(output, handle)

# 等待合并完成
event.wait()

# 使用合并后的结果继续训练...

这个示例展示了DeepEP的基本使用流程：初始化Buffer、计算分发布局、分发token、在专家上执行计算、合并结果。

总结与展望

DeepEP通过精心设计的Buffer管理机制和优化的CUDA内核，为专家并行训练提供了高效的通信解决方案。其核心优势包括：

多层次缓冲区设计，充分利用NVLink和RDMA等硬件特性
优化的CUDA内核，实现高效的全对全通信
通信与计算重叠，提高整体利用率
自适应配置，适应不同规模的专家并行

未来，DeepEP可以在以下方面进一步改进：

自动调优机制，根据硬件和工作负载自动调整通信参数
支持更多的数据类型和通信模式
与深度学习框架的更深度集成
支持异构计算环境

通过本文的解析，我们深入了解了DeepEP的内部工作原理和优化技术。希望这些知识能帮助您更好地使用DeepEP，或在自己的项目中实现高效的分布式通信。

要开始使用DeepEP，请访问项目仓库：https://gitcode.com/GitHub_Trending/de/DeepEP，获取最新代码和详细文档。

如果您觉得本文对您有帮助，请点赞、收藏并关注，以获取更多关于分布式训练和高性能计算的技术分享。

【免费下载链接】DeepEP DeepEP: an efficient expert-parallel communication library 项目地址: https://gitcode.com/GitHub_Trending/de/DeepEP

创作声明：本文部分内容由AI辅助生成（AIGC），仅供参考