【Cuda 编程思想】案例分析--DeepSpeed量化cuda算子

最新推荐文章于 2025-04-05 23:46:25 发布

Mr.Lee jack

最新推荐文章于 2025-04-05 23:46:25 发布

阅读量525

点赞数 5

分类专栏： cuda 文章标签： c++ cuda

本文链接：https://blog.youkuaiyun.com/xzpdxz/article/details/146947179

版权

cuda 专栏收录该内容

4 篇文章

订阅专栏

量化过程详解：从浮点数到低位整数

量化是将高精度数值（如FP16/FP32）转换为低精度整数（如INT8/INT4）的过程，目的是减少模型大小和加速推理

量化的基本数学原理

对称量化（Symmetric）

对称量化使用一个缩放因子(scale)将浮点数映射到整数：

量化公式：q = round(x / scale)
反量化公式：x ≈ q * scale

其中：
- x 是原始浮点数
- q 是量化后的整数
- scale 是缩放因子，通常由数据范围决定：scale = max(abs(x)) / (2^(bits-1) - 1)

非对称量化（Asymmetric）

非对称量化使用缩放因子和零点偏移(zero_point)：

量化公式：q = round(x / scale) + zero_point
反量化公式：x ≈ (q - zero_point) * scale

实际量化步骤

假设我们进行4位对称量化，步骤如下：

确定缩放因子：
- 找出数据中的最大绝对值：abs_max = max(abs(data))
- 计算缩放因子：scale = abs_max / 7（4位有符号整数范围是-7到7）
应用量化：
- 对每个浮点值x：q = round(x / scale)
- 截断到合法范围：q = clamp(q, -7, 7)
打包结果：
- 将多个4位整数打包到8位整数中
- 例如：将q1和q2打包成一个INT8：packed = (q1 & 0xF) | ((q2 & 0xF) << 4)

具体例子

假设我们有8个FP16值需要量化为4位：

[0.5, -0.3, 0.8, -0.9, 0.2, -0.1, 0.7, -0.4]

步骤1：找出最大绝对值

abs_max = 0.9

步骤2：计算缩放因子

 scale = 0.9 / 7 = 0.1286

步骤3：量化每个值

q1 = round(0.5 / 0.1286) = 4
q2 = round(-0.3 / 0.1286) = -2
q3 = round(0.8 / 0.1286) = 6
q4 = round(-0.9 / 0.1286) = -7
q5 = round(0.2 / 0.1286) = 2
q6 = round(-0.1 / 0.1286) = -1
q7 = round(0.7 / 0.1286) = 5
q8 = round(-0.4 / 0.1286) = -3

步骤4：打包为INT8

packed1 = (q1 & 0xF) | ((q2 & 0xF) << 4) = 0x24  // 4和-2打包
packed2 = (q3 & 0xF) | ((q4 & 0xF) << 4) = 0x96  // 6和-7打包
packed3 = (q5 & 0xF) | ((q6 & 0xF) << 4) = 0x12  // 2和-1打包
packed4 = (q7 & 0xF) | ((q8 & 0xF) << 4) = 0x35  // 5和-3打包

最终输出的INT8数组：[0x24, 0x96, 0x12, 0x35]

quantize.cu 写法分析

// Copyright (c) Microsoft Corporation.
// SPDX-License-Identifier: Apache-2.0  // 版权声明，使用Apache 2.0许可证

// DeepSpeed Team  // 表明这是DeepSpeed团队开发的代码

#include "ds_kernel_utils.h"  // 包含DeepSpeed内核工具函数
#include "memory_access_utils.h"  // 包含内存访问工具函数
#include "quantization.h"  // 包含量化相关的定义和函数
#include "quantization_utils.h"  // 包含量化工具函数
#include "reduction_utils.h"  // 包含归约操作工具函数

namespace cg = cooperative_groups;  // 为cooperative_groups命名空间创建别名cg，用于GPU线程协作

/*
Pure quantization kernel with no fusion.  // 纯量化内核，没有与其他操作融合
*/
template <int q_bits,  // 量化位数（如4位或8位）
          quantize::Type quant_type,  // 量化类型（对称或非对称）
          int UNROLL,  // 外部循环展开因子
          int internal_unroll,  // 内部循环展开因子
          int threads_per_group,  // 每个量化组使用的线程数
          int max_threads>  // 每个线程块的最大线程数
__global__ void cached_quantization(int8_t* __restrict__ output_data,  // 输出量化数据的指针
                                    float* __restrict__ params,  // 量化参数（如缩放因子）的指针
                                    const __half* __restrict__ input_data,  // 输入半精度浮点数据的指针
                                    int groups,  // 要量化的组数
                                    int elems_per_group)  // 每组中的元素数量
{
    cg::thread_block tb = cg::this_thread_block();  // 获取当前线程块
    cg::thread_block_tile<hw_warp_size> warp = cg::tiled_partition<hw_warp_size>(tb);  // 获取当前线程块中的warp

    // Indexing offsets  // 计算索引偏移量
    const int block_offset =
        (tb.group_index().x * (max_threads / threads_per_group) * elems_per_group) +  // 块索引偏移
        (tb.thread_index().y * elems_per_group);  // y维度线程索引偏移
    const int elem_offset = tb.thread_index().x * quantize::h_per_load;  // x维度线程索引偏移
    const int base_offset = block_offset + elem_offset;  // 基础偏移量
    const int stride = tb.size() * quantize::h_per_load;  // 步长，用于迭代访问数据

    const __half* input_base = input_data + base_offset;  // 计算输入数据的基址

    __half2 local_buffer[UNROLL * internal_unroll * quantize::h2_per_load];  // 本地缓冲区，用于存储加载的数据

#pragma unroll  // 指示编译器展开下面的循环
    for (int i = 0; i < UNROLL; i++) {  // 外部循环，处理多个数据块
        // Convenience helper, should resolve to register indices and not realize.  // 便利的辅助指针，应该解析为寄存器索引而不是实际分配内存
        __half2* iteration_buffer = local_buffer + i * internal_unroll * quantize::h2_per_load;  // 当前迭代的缓冲区指针
#pragma unroll  // 指示编译器展开下面的循环
        for (int j = 0; j < internal_unroll; j++) {  // 内部循环，进一步处理数据
            const int iteration = i * internal_unroll + j;  // 计算总迭代次数
            mem_access::load_global<quantize::granularity>(  // 从全局内存加载数据到本地缓冲区
                iteration_buffer + j * quantize::h2_per_load,  // 目标缓冲区
                input_base + iteration * stride,  // 源数据指针
                elem_offset + iteration * stride < elems_per_group);  // 边界检查条件
        }
    }

    quantize::  // 调用量化命名空间中的函数
        local_array<quant_type, q_bits, UNROLL * internal_unroll, threads_per_group, max_threads>(  // 对本地数组进行量化
            local_buffer,  // 包含要量化的数据的本地缓冲区
            params,  // 量化参数
            output_data,  // 输出量化数据的目标位置
            elems_per_group,  // 每组的元素数量
            groups);  // 组数
}

/********* Launcher methods ***********/  // 启动器方法，用于配置和启动内核
#define LAUNCH_CACHED_QUANT_CALL(q_bits, quant_type) \  // 定义宏，用于简化内核启动代码
    cached_quantization<q_bits,                      \  // 量化位数
                        quant_type,                  \  // 量化类型
                        unroll_factor,               \  // 外部展开因子
                        internal_unroll_l,           \  // 内部展开因子
                        threads_per_group,           \  // 每组线程数
                        max_threads>                 \  // 最大线程数
        <<<grid, block, 0, stream>>>(output_data, params, input_data, groups, elems_per_group);  // 启动内核

#define LAUNCH_CACHED_QUANT(                                                        \  // 定义更高级的宏，处理不同量化位数和类型的情况
    q_bits, quant_type, unroll_factor_in, internal_unroll_in, threads_per_group_in) \
    const int unroll_factor = unroll_factor_in;                                     \  // 设置外部展开因子
    const int internal_unroll_l = internal_unroll_in;                               \  // 设置内部展开因子
    const int threads_per_group = threads_per_group_in;                             \  // 设置每组线程数
    if (q_bits == 4) {                                                              \  // 如果是4位量化
        if (quant_type == quantize::Type::Asymmetric) {                             \  // 如果是非对称量化
            LAUNCH_CACHED_QUANT_CALL(4, quantize::Type::Asymmetric)                 \  // 启动4位非对称量化内核
        } else {                                                                    \  // 否则是对称量化
            LAUNCH_CACHED_QUANT_CALL(4, quantize::Type::Symmetric)                  \  // 启动4位对称量化内核
        }                                                                           \
    } else {                                                                        \  // 否则是8位量化
        if (quant_type == quantize::Type::Asymmetric) {                             \  // 如果是非对称量化
            LAUNCH_CACHED_QUANT_CALL(8, quantize::Type::Asymmetric)                 \  // 启动8位非对称量化内核
        } else {                                                                    \  // 否则是对称量化
            LAUNCH_CACHED_QUANT_CALL(8, quantize::Type::Symmetric)                  \  // 启动8位对称量化内核
        }                                                                           \
    }

void launch_quant(int8_t* output_data,  // 输出量化数据的指针
                  float* params,  // 量化参数的指针
                  const __half* input_data,  // 输入半精度浮点数据的指针
                  const int groups,  // 要量化的组数
                  const int elems_per_group,  // 每组中的元素数量
                  const int num_bits,  // 量化位数
                  const quantize::Type quant_type,  // 量化类型
                  cudaStream_t stream)  // CUDA流，用于异步执行
{
    constexpr int max_threads = 256;  // 每个线程块的最大线程数

    constexpr int internal_unroll = 2;  // 内部循环展开因子

    const bool is_subblock_schedule = (elems_per_group <= 128) ? true : false;  // 判断是否使用子块调度
    const int h_per_step = is_subblock_schedule ? quantize::h_per_load  // 如果是子块调度，每步处理的半精度浮点数数量
                                                : quantize::h_per_load * internal_unroll;  // 否则，每步处理更多数据

    // Scheduling concern: may be slightly faster for some inputs to assign multiple stages of
    // warp-sized blocks rather than stepping up to 64/96 threads  // 调度考虑：对某些输入，使用多个warp大小的块可能比使用64/96线程更快
    const int one_step_threads = next_pow2((elems_per_group + h_per_step - 1) / h_per_step);  // 计算一步需要的线程数（向上取2的幂）
    const int threads_per_group = (one_step_threads < max_threads) ? one_step_threads : max_threads;  // 确保线程数不超过最大值

    const int groups_per_block =  // 每个块处理的组数
        is_subblock_schedule ? (max_threads + threads_per_group - 1) / threads_per_group : 1;  // 子块调度时可能一个块处理多个组
    const int groups_launch = (groups_per_block + groups - 1) / groups_per_block;  // 需要启动的组数

    dim3 block(threads_per_group, groups_per_block);  // 设置块维度
    dim3 grid(groups_launch);  // 设置网格维度

    const int elems_per_step = threads_per_group * h_per_step;  // 每步处理的元素数
    const int external_unroll = (elems_per_group + elems_per_step - 1) / elems_per_step;  // 计算外部展开因子

    if (is_subblock_schedule) {  // 如果是子块调度（每组元素数<=128）
        // <=128
        if (threads_per_group == 1) {  // 如果每组只需要1个线程
            LAUNCH_CACHED_QUANT(num_bits, quant_type, 1, 1, 1);  // 启动量化内核，使用1个线程
        } else if (threads_per_group == 2) {  // 如果每组需要2个线程
            LAUNCH_CACHED_QUANT(num_bits, quant_type, 1, 1, 2);  // 启动量化内核，使用2个线程
        } else if (threads_per_group == 4) {  // 如果每组需要4个线程
            LAUNCH_CACHED_QUANT(num_bits, quant_type, 1, 1, 4);  // 启动量化内核，使用4个线程
        } else if (threads_per_group == 8) {  // 如果每组需要8个线程
            LAUNCH_CACHED_QUANT(num_bits, quant_type, 1, 1, 8);  // 启动量化内核，使用8个线程
        } else if (threads_per_group == 16) {  // 如果每组需要16个线程
            LAUNCH_CACHED_QUANT(num_bits, quant_type, 1, 1, 16);  // 启动量化内核，使用16个线程
        }
    } else if (external_unroll == 1) {  // 如果外部展开因子为1（129-4096元素）
        // 129 - 4096 elems
        // (this can launch with 1-7 warps as well)  // 这也可以使用1-7个warp启动
        LAUNCH_CACHED_QUANT(num_bits, quant_type, 1, internal_unroll, max_threads);  // 启动量化内核，使用最大线程数
    } else if (external_unroll == 2) {  // 如果外部展开因子为2（4097-8192元素）
        // 4097 - 8192 elems
        LAUNCH_CACHED_QUANT(num_bits, quant_type, 2, internal_unroll, max_threads);  // 启动量化内核，外部展开因子为2
    } else if (external_unroll == 3) {  // 如果外部展开因子为3（8193-12288元素）
        // 8193 - 12288 elems
        LAUNCH_CACHED_QUANT(num_bits, quant_type, 3, internal_unroll, max_threads);  // 启动量化内核，外部展开因子为3
    } else if (external_unroll == 4) {  // 如果外部展开因子为4（12289-16384元素）
        // 12289 - 16384 elems
        LAUNCH_CACHED_QUANT(num_bits, quant_type, 4, internal_unroll, max_threads);  // 启动量化内核，外部展开因子为4
    }
}

备注

寄存器大小问题

#pragma unroll
for (int i = 0; i < UNROLL; i++) {
    __half2* iteration_buffer = local_buffer + i * internal_unroll * quantize::h2_per_load;
#pragma unroll
    for (int j = 0; j < internal_unroll; j++) {
        const int iteration = i * internal_unroll + j;
        mem_access::load_global<quantize::granularity>(
            iteration_buffer + j * quantize::h2_per_load,
            input_base + iteration * stride,
            elem_offset + iteration * stride < elems_per_group);
    }
}

总共加载的数据量是：

每个线程加载 UNROLL * internal_unroll * quantize::h2_per_load 个 __half2 元素
每个 __half2 包含 2 个 __half 值，共 4 字节总数据量 = UNROLL * internal_unroll *
quantize::h2_per_load * 4 字节

寄存器数量限制：

现代 NVIDIA GPU 每个线程通常有 255 个 32 位寄存器
每个 __half2 需要 1 个寄存器（4 字节）
最大可存储约 255 个 __half2 值

实际使用情况：

从代码中可以看到，local_buffer 被声明为：
这表明设计者已经考虑了寄存器容量问题

典型参数值：

在量化操作中，典型的参数值可能是：
UNROLL = 1-4
internal_unroll = 1-2
quantize::h2_per_load = 4-8
这意味着每个线程加载 4-64 个 __half2 元素，远低于寄存器限制

为什么要对一个块分多个组

const bool is_subblock_schedule = (elems_per_group <= 128) ? true : false;  // 判断是否使用子块调度
    const int h_per_step = is_subblock_schedule ? quantize::h_per_load  // 如果是子块调度，每步处理的半精度浮点数数量
                                                : quantize::h_per_load * internal_unroll;  // 否则，每步处理更多数据

    // Scheduling concern: may be slightly faster for some inputs to assign multiple stages of
    // warp-sized blocks rather than stepping up to 64/96 threads  // 调度考虑：对某些输入，使用多个warp大小的块可能比使用64/96线程更快
    const int one_step_threads = next_pow2((elems_per_group + h_per_step - 1) / h_per_step);  // 计算一步需要的线程数（向上取2的幂）
    const int threads_per_group = (one_step_threads < max_threads) ? one_step_threads : max_threads;  // 确保线程数不超过最大值

GPU调度特性

GPU以线程束(warp)为单位调度，每个warp通常有32个线程
- 如果一个组只需要少量线程（如16个），一个块一组会导致：
- 每个warp只有一半线程被使用
- 另一半线程闲置，造成50%的计算资源浪费
数据对比
- 假设有1000个组，每组只需16个线程处理：

调度方式	线程块数	每块线程数	有效线程利用率	总线程数
一块一组	1000	16	50%	16,000
一块多组	63	256	100%	16,128

- 结果：一块多组方式减少了94%的块数，同时将线程利用率从50%提升到接近100%。

实际影响
- 对于1000个小组：
- 一块一组：1000个块的启动开销
- 一块多组(16组/块)：只有约63个块的启动开销
- 结果：启动开销减少了约94%，对于有大量小组的应用尤为重要

输入规模	一块一组	一块多组	性能提升
小组(≤128元素)	基准	快2-5倍	100%-400%
中等组(129-4096)	基准	快1-2倍	0%-100%
大组(>4096)	基准	相似	0%

output_data 数据是如何加载到寄存器的

launch_quant是主函数，output_data 通过宏 LAUNCH_CACHED_QUANT LAUNCH_CACHED_QUANT_CALL 启动，表示launch_quant中的变量都是全局可见的，可以直接使用
所以 launch_quant 中的 output_data 在 LAUNCH_CACHED_QUANT_CALL 中可以直接使用