Intel Distiller工具包-量化实现2

cyz0202

已于 2022-07-05 11:12:48 修改

阅读量431

点赞数 1

分类专栏：技术问题 # 量化 # 深度学习文章标签：深度学习

于 2022-05-31 01:03:23 首次发布

本文链接：https://blog.youkuaiyun.com/cyz0202/article/details/125041577

版权

技术问题同时被 3 个专栏收录

56 篇文章

订阅专栏

深度学习

28 篇文章

订阅专栏

量化

8 篇文章

订阅专栏

本系列文章

Intel Distiller工具包-量化实现1

Intel Distiller工具包-量化实现2

回顾

上一篇文章中介绍了Distiller及Quantizer基类，基类定义了重要的变量，如replacement_factory（dict，用于记录待量化module对应的wrapper）；此外定义了量化流程，包括预处理（BN折叠，激活优化等）、量化模块替换、后处理等主要步骤；
本文介绍继承自Quantizer的子类量化器，包括
- PostTrainLinearQuantizer（本文）
- QuantAwareTrainRangeLinearQuantizer（后续）
- PACTQuantizer（后续）
- NCFQuantAwareTrainQuantizer（后续）
本文代码也挺多的，由于没法全部贴出来，有些地方说的不清楚的，还请读者去参考源码；

PostTrainLinearQuantizer

后训练量化器；对已训练好的模型进行量化，需要先使用少量输入收集模型内部输入、输出、权重的统计数据；
PostTrainLinearQuantizer的类定义如下：主要内容在构造函数里，下面介绍；其他就是加了预处理（BN折叠、激活层优化）等；
构造函数：重点在这

构造函数中，前面都是检查和默认设置（如量化模式检查、clip模式检查、是否有统计数据、默认量化设置等）；直到下面这段代码，才是PostTrainLinearQuantizer核心

        #### PART1-参数层的量化：使用固定的 RangeLinearQuantParamLayerWrapper ####
        self.replacement_factory[nn.Conv2d] = replace_param_layer
        self.replacement_factory[nn.Conv3d] = replace_param_layer
        self.replacement_factory[nn.Linear] = replace_param_layer

        #### PART2-非参数层的量化：使用相应的Wrapper，使用functools partial ####
        # concat：固定wrapper_type为RangeLinearQuantConcatWrapper
        factory_concat = partial(
            replace_non_param_layer, RangeLinearQuantConcatWrapper)

        # add：固定wrapper_type为RangeLinearQuantEltwiseAddWrapper
        factory_eltwiseadd = partial(
            replace_non_param_layer, RangeLinearQuantEltwiseAddWrapper)

        # dot-product：固定wrapper_type为RangeLinearQuantEltwiseMultWrapper
        factory_eltwisemult = partial(
            replace_non_param_layer, RangeLinearQuantEltwiseMultWrapper)

        # matrix-mul：固定wrapper_type为RangeLinearQuantMatmulWrapper
        factory_matmul = partial(
            replace_non_param_layer, RangeLinearQuantMatmulWrapper)

        update_wrapper(factory_concat, replace_non_param_layer)  # 从参数2的对象 取内置参数 覆盖 参数1的对象对应参数
        update_wrapper(factory_eltwiseadd, replace_non_param_layer)
        update_wrapper(factory_eltwisemult, replace_non_param_layer)
        update_wrapper(factory_matmul, replace_non_param_layer)

        self.replacement_factory[distiller.modules.Concat] = factory_concat  # factory_concat(Concat, ...)
        self.replacement_factory[distiller.modules.EltwiseAdd] = factory_eltwiseadd
        self.replacement_factory[distiller.modules.EltwiseMult] = factory_eltwisemult
        self.replacement_factory[distiller.modules.Matmul] = factory_matmul
        self.replacement_factory[distiller.modules.BatchMatmul] = factory_matmul
        # ===============================================

        #### PART3-embedding层的量化：####
        self.replacement_factory[nn.Embedding] = replace_embedding

正如代码注释所述，代码分别对可量化参数层、非参数层、embedding层进行量化设置

参数层：

以nn.Conv2d为例：self.replacement_factory[nn.Conv2d] = replace_param_layer
表示nn.Conv2d这个module会被replace_param_layer生成的量化版本module替换掉
我们来看看replace_param_layer是什么样的？
可以看到replace_param_layer在对module（此处指nn.Conv2d）封装时，其实返回的是RangeLinearQuantParamLayerWrapper对象（它是一个nn.Module，即新module替换旧module）；所以我们再来看看这个wrapper对象是如何实现的；
RangeLinearQuantParamLayerWrapper继承自RangeLinearQuantWrapper，需要先看看基类RangeLinearQuantWrapper的定义：
重点看一下forward函数：

RangeLinearQuantParamLayerWrapper需要根据基类RangeLinearQuantWrapper的定义和自身需求实现相关函数，主要是quantized_forward和get_accum_to_output_re_quantization_params；具体定义如下

# 定义了参数层量化的方式
class RangeLinearQuantParamLayerWrapper(RangeLinearQuantWrapper):
    """
    Linear range-based quantization wrappers for layers with weights and bias (namely torch.nn.ConvNd and
    torch.nn.Linear)

    Assume:

    x_q = round(scale_x * x_f) - zero_point_x

    Hence:

    x_f = 1/scale_x * x_q + zero_point_x  # 根据下面写法，应该是 1/scale_x * (x_q + zero_point_x)

    (And the same for y_q, w_q and b_q)

    So, we get: (use "zp" as abbreviation for zero_point)

    y_f = x_f * w_f + b_f
    # 以下除第一步外均省略round，注意中括号中，bias是作为re-quant-bias存在，也就是实现时是round()-0
    y_q = round(scale_y * y_f) - zp_y =  scale_y * (x_f * w_f + b_f) - zp_y =

                scale_y                                         scale_x * scale_w
        = ------------------- * [(x_q + zp_x) * (w_q + zp_w) + ------------------- * (b_q + zp_b)] - zp_y
           scale_x * scale_w                                         scale_b

    Args:
        wrapped_module (torch.nn.Module): Module to be wrapped
        num_bits_acts (int): Number of bits used for inputs and output quantization
        num_bits_params (int): Number of bits used for parameters (weights and bias) quantization
        num_bits_accum (int): Number of bits allocated for the accumulator of intermediate integer results
        mode (LinearQuantMode): Quantization mode to use (symmetric / asymmetric-signed/unsigned)
        clip_acts (ClipNode): See RangeLinearQuantWrapper
        per_channel_wts (bool): Enable quantization of weights using separate quantization parameters per
            output channel
        activation_stats (dict): See RangeLinearQuantWrapper
        clip_n_stds (int): See RangeLinearQuantWrapper
        clip_half_range (bool) : See RangeLinearQuantWrapper
        scale_approx_mult_bits (int): See RangeLinearQuantWrapper（是否使用整型处理scale）
    """
    def __init__(self, wrapped_module, num_bits_acts, num_bits_params, num_bits_accum=32,
                 mode=LinearQuantMode.SYMMETRIC, clip_acts=ClipMode.NONE, per_channel_wts=False, activation_stats=None,
                 clip_n_stds=None, clip_half_range=False, scale_approx_mult_bits=None):
        super(RangeLinearQuantParamLayerWrapper, self).__init__(wrapped_module, num_bits_acts, num_bits_accum, mode,
                                                                clip_acts, activation_stats, clip_n_stds,
                                                                clip_half_range, scale_approx_mult_bits)

        if not isinstance(wrapped_module, (nn.Conv2d, nn.Conv3d, nn.Linear)):
            raise ValueError(self.__class__.__name__ + ' can wrap only Conv2D, Conv3D and Linear modules')

        self.num_bits_params = num_bits_params  # 参数量化bits
        self.per_channel_wts = per_channel_wts  # 权重是否逐通道量化

        # 获取量化范围（根据量化模式）sign: [-128,127] unsign: [0, 255]
        self.params_min_q_val, self.params_max_q_val = get_quantized_range(
            num_bits_params, signed=mode != LinearQuantMode.ASYMMETRIC_UNSIGNED)

        # Quantize weights - overwrite FP32 weights（获取当前op权重weights的量化参数）
        w_scale, w_zero_point = _get_quant_params_from_tensor(wrapped_module.weight, num_bits_params, self.mode,
                                                              per_channel=per_channel_wts)

        # 添加一个伪属性fake-attr，可保存但不参与参数更新
        self.register_buffer('w_scale', w_scale)
        self.register_buffer('w_zero_point', w_zero_point)
        # 量化权重(weight.data)并且替换(inplace=True)，注意要clamp（另一种方式是提前将x、w都clamp到规定的范围内）
        linear_quantize_clamp(wrapped_module.weight.data, self.w_scale, self.w_zero_point, self.params_min_q_val,
                              self.params_max_q_val, inplace=True)

        self.has_bias = hasattr(wrapped_module, 'bias') and wrapped_module.bias is not None
        device = self.w_scale.device

        # 当前module是否有统计数据并且有input
        if self.preset_act_stats:
            self.in_0_scale = self.in_0_scale.to(device)  # 只使用第一个input的scale
            self.register_buffer('accum_scale', self.in_0_scale * self.w_scale)
            if self.per_channel_wts:  # TODO how?
                self.accum_scale = self.accum_scale.squeeze(dim=-1)
        else:
            self.accum_scale = 1

        # Quantize bias
        self.has_bias = hasattr(wrapped_module, 'bias') and wrapped_module.bias is not None
        if self.has_bias:
            if self.preset_act_stats:
                # 如果accu_scale可以根据统计数据得到，则令bias_scale==accu_scale，bias_zp==0，量化范围也来自accu
                # 注意这里的量化方式，是根据doc里的公式设计的；此外，bias根据accum的量化范围进行clamp
                linear_quantize_clamp(wrapped_module.bias.data, self.accum_scale.squeeze(), 0,
                                      self.accum_min_q_val, self.accum_max_q_val, inplace=True)
            else:
                # 否则使用bias自己的scale和zp和通用量化范围
                b_scale, b_zero_point = _get_quant_params_from_tensor(wrapped_module.bias, num_bits_params, self.mode)
                self.register_buffer('b_scale', b_scale)
                self.register_buffer('b_zero_point', b_zero_point)
                # TODO 注意inplace=False，dynamic requantize，why？
                base_b_q = linear_quantize_clamp(wrapped_module.bias.data, self.b_scale, self.b_zero_point,
                                                 self.params_min_q_val, self.params_max_q_val)
                # Dynamic ranges - save in auxiliary buffer, requantize each time based on dynamic input scale factor
                self.register_buffer('base_b_q', base_b_q)

        # A flag indicating that the simulated quantized weights are pre-shifted. for faster performance.
        # In the first forward pass - `w_zero_point` is added into the weights, to allow faster inference,
        # and all subsequent calls are done with these shifted weights.
        # Upon calling `self.state_dict()` - we restore the actual quantized weights.
        # i.e. is_simulated_quant_weight_shifted = False
        self.register_buffer('is_simulated_quant_weight_shifted', torch.tensor(0, dtype=torch.uint8, device=device))

    def state_dict(self, destination=None, prefix='', keep_vars=False):
        if self.is_simulated_quant_weight_shifted:
            # We want to return the weights to their integer representation:
            # 根据doc，weight要加上w_zp，利用is_simulated_quant_weight_shifted标记；保存的时候恢复到真实的weight
            self.wrapped_module.weight.data -= self.w_zero_point
            self.is_simulated_quant_weight_shifted.sub_(1) # i.e. is_simulated_quant_weight_shifted = False
        return super(RangeLinearQuantParamLayerWrapper, self).state_dict(destination, prefix, keep_vars)

    def get_inputs_quantization_params(self, input):
        if not self.preset_act_stats:  # 如果没有统计数据提供支持，则需要dynamic计算
            self.in_0_scale, self.in_0_zero_point = _get_quant_params_from_tensor(
                input, self.num_bits_acts, self.mode, clip=self.clip_acts,
                num_stds=self.clip_n_stds, scale_approx_mult_bits=self.scale_approx_mult_bits)
        return [self.in_0_scale], [self.in_0_zero_point]

    def quantized_forward(self, input_q):
        # See class documentation for quantized calculation details.

        if not self.preset_act_stats:  # 没有统计数据
            # 基类中 self.accum_scale = 1，这里重置
            self.accum_scale = self.in_0_scale * self.w_scale  # in_0_scale在get_inputs_quantization_params中补全定义了
            if self.per_channel_wts:
                self.accum_scale = self.accum_scale.squeeze(dim=-1)

            if self.has_bias:
                # Re-quantize bias to match x * w scale:
                # b_q' = (in_scale * w_scale / b_scale) * (b_q + b_zero_point)
                bias_requant_scale = self.accum_scale.squeeze() / self.b_scale
                if self.scale_approx_mult_bits is not None:
                    bias_requant_scale = approx_scale_as_mult_and_shift(bias_requant_scale, self.scale_approx_mult_bits)
                # 没有统计数据情况下，对bias根据accum scale进行*重新*量化（根据doc里的公式，但公式里原来是没有round的）
                # b_q' = round[(in_scale * w_scale / b_scale) * (base_b_q + b_zero_point)] - 0
                self.wrapped_module.bias.data = linear_quantize_clamp(self.base_b_q + self.b_zero_point,
                                                                      bias_requant_scale, 0,
                                                                      self.accum_min_q_val, self.accum_max_q_val)

        # Note the main terms within the summation is:
        #   (x_q + zp_x) * (w_q + zp_w)
        # In a performance-optimized solution, we would expand the parentheses and perform the computation similar
        # to what is described here:
        #   https://github.com/google/gemmlowp/blob/master/doc/low-precision.md#efficient-handling-of-offsets
        # However, for now we're more concerned with simplicity rather than speed. So we'll just add the zero points
        # to the input and weights and pass those to the wrapped model. Functionally, since at this point we're
        # dealing solely with integer values, the results are the same either way.

        if self.mode != LinearQuantMode.SYMMETRIC and not self.is_simulated_quant_weight_shifted:
            # We "store" the w_zero_point inside our wrapped module's weights to
            # improve performance on inference.
            self.wrapped_module.weight.data += self.w_zero_point  # 对称模式w_zp=0
            self.is_simulated_quant_weight_shifted.add_(1) # i.e. is_simulated_quant_weight_shifted = True

        input_q += self.in_0_zero_point
        # 执行doc中的 (x_q + zp_x) * (w_q + zp_w) +
        #            round[ (in_scale * w_scale / b_scale) * (b_q + b_zero_point) - 0 ]
        # 获得accum，注意只是个中间量化值，没有实际意义
        accum = self.wrapped_module.forward(input_q)
        # accum的量化范围是32bits
        clamp(accum.data, self.accum_min_q_val, self.accum_max_q_val, inplace=True)

        return accum

    def get_output_quantization_params(self, accumulator):
        if self.preset_act_stats:
            return self.output_scale, self.output_zero_point
        # TODO why? 没有历史统计数据的话，scale_y和zp_y都无从获得？
        y_f = accumulator / self.accum_scale
        return _get_quant_params_from_tensor(y_f, self.num_bits_acts, self.mode, clip=self.clip_acts,
                                             num_stds=self.clip_n_stds,
                                             scale_approx_mult_bits=self.scale_approx_mult_bits)

    def get_accum_to_output_re_quantization_params(self, output_scale, output_zero_point):
        requant_scale = output_scale / self.accum_scale
        if self.scale_approx_mult_bits is not None:
            requant_scale = approx_scale_as_mult_and_shift(requant_scale, self.scale_approx_mult_bits)
        return requant_scale, output_zero_point

整体思路是：
- 在doc里，解释了distiller是如何将原计算转换成 float -> int -> float形式的，并让原module执行int部分的计算，达到核心计算由int执行的目的（即实现量化计算）
- __init__中，事先计算weight、bias（如果有）的量化值；此处根据是否有统计数据会有些处理上的差异；
- quantized_forward函数：接收量化的输入，然后让原module执行量化计算，返回的是中间量化值（不是y_f对应的y_q）
- get_accum_to_output_re_quantization_params函数：根据doc解释，要得到最终的量化输出结果y_q，还需要对quantized_forward的输出值（accum）进行变换；观察doc解释，该变换又可以视为一种量化，因此该函数根据doc输出二次量化的scale和zp；
- 小结：
  - 通过上述module替换（封装），当做inference时，虽然nn.Conv2d不变，但是计算时就变成int类型计算了；
  - 值得一提的是，distiller工具包没有实现整型的gemm（矩阵乘/igemm），因此量化后还需要引入igemm包；

非参数层

非参数层和参数层做法大体类似，只不过因为没有参数，所以有些不同

举例说明：这是对加法操作（distiller现将其封装为module）进行的量化

# add：固定wrapper_type为RangeLinearQuantEltwiseAddWrapper
factory_eltwiseadd = partial(replace_non_param_layer, 
                             RangeLinearQuantEltwiseAddWrapper)

replace_non_param_layer类似replace_param_layer，但是多了一个参数wrapper_type，表示wrapper类型；该参数的设置是为了复用函数考虑的，因为非参module较多（如加法、逐元素乘法/点积、批乘法等），各个module需要使用不同的wrapper；定义如下：

看一下eltwiseadd的wrapper，实现思路和上面的带参wrapper是一样的，不一样的是quantized_forward代表的量化实现过程

class RangeLinearQuantEltwiseAddWrapper(RangeLinearQuantWrapper):
    """ add-0107-zyc
    y_f = in0_f + in1_f
    in0_q = round(in0_f * scale_in0) - zp_in0
    in1_q = round(in1_f * scale_in1) - zp_in1
    y_q = round(y_f * scale_y) - zp_y => 以下省略round

        = scale_y * ( in0_f + in1_f ) - zp_y

                scale_y
        = ------------------- * [ scale_in1*(in0_q + zp_in0) + (in1_q + zp_in1)*scale_in0 ] - zp_y
          scale_in0*scale_in1
    => 可以发现上式不能进行int计算，所以需要进一步处理：对in0_q和in1_q重新量化，scale统一为该节点output/accum的scale
    in0_re_q = scale_accum * [( in0_q + zp_in0 ) / scale_in0] - 0  # 此时新的scale为scale_accum，zp为0
    in1_re_q = scale_accum * [( in0_q + zp_in1 ) / scale_in1] - 0
    => 则
    y_q = round(y_f * scale_y) - zp_y => 以下省略round

        = scale_y * ( in0_f + in1_f ) - zp_y

                   scale_y
        = ------------------------- * [scale_re_in1*(in0_re_q+zp_re_in0)+(in1_re_q+zp_re_in1)*scale_re_in0] - zp_y
          scale_re_in0*scale_re_in1

        = 1 * [ (in0_re_q + zp_re_in0) + (in1_re_q + zp_re_in1) ] - zp_y

        = 1 * (in0_re_q + in1_re_q) - zp_y
    note：从推导的角度来看，上述重量化做法并没有问题

    """

    def __init__(self, wrapped_module, num_bits_acts, mode=LinearQuantMode.SYMMETRIC, clip_acts=ClipMode.NONE,
                 activation_stats=None, clip_n_stds=None, clip_half_range=False, scale_approx_mult_bits=None):
        if not isinstance(wrapped_module, distiller.modules.EltwiseAdd):
            raise ValueError(self.__class__.__name__ + ' can only wrap distiller.modules.EltwiseAdd modules')

        if not activation_stats:
            raise NoStatsError(self.__class__.__name__ +
                               ' must get activation stats, dynamic quantization not supported')

        super(RangeLinearQuantEltwiseAddWrapper, self).__init__(wrapped_module, num_bits_acts, mode=mode,
                                                                clip_acts=clip_acts, activation_stats=activation_stats,
                                                                clip_n_stds=clip_n_stds,
                                                                clip_half_range=clip_half_range,
                                                                scale_approx_mult_bits=scale_approx_mult_bits)

        if self.preset_act_stats:
            # For addition to make sense, all input scales must match. So we set a re-scale factor according
            # to the preset output scale
            requant_scales = [self.output_scale / in_scale for in_scale in self.inputs_scales()]
            if scale_approx_mult_bits is not None:
                requant_scales = [approx_scale_as_mult_and_shift(requant_scale, scale_approx_mult_bits)
                                  for requant_scale in requant_scales]
            for idx, requant_scale in enumerate(requant_scales):
                self.register_buffer('in_{0}_requant_scale'.format(idx), requant_scale)

    def inputs_requant_scales(self):
        if not self.preset_act_stats:
            raise RuntimeError('Input quantization parameter iterators only available when activation stats were given')
        for idx in range(self.num_inputs):
            name = 'in_{0}_requant_scale'.format(idx)
            yield getattr(self, name)

    def get_inputs_quantization_params(self, *inputs):
        return self.inputs_scales(), self.inputs_zero_points()

    def quantized_forward(self, *inputs_q):
        # Re-scale inputs to the accumulator range
        # 对已量化的输入重新量化，scale变为accum的（self.output_scale)，如此统一了scale_in
        inputs_re_q = [linear_quantize_clamp(input_q + zp, requant_scale, 0,
                                             self.accum_min_q_val, self.accum_max_q_val, inplace=False)
                       for input_q, requant_scale, zp in zip(inputs_q, self.inputs_requant_scales(),
                                                             self.inputs_zero_points())]
        accum = self.wrapped_module(*inputs_re_q)
        clamp(accum.data, self.accum_min_q_val, self.accum_max_q_val, inplace=True)

        return accum

    def get_output_quantization_params(self, accumulator):
        return self.output_scale, self.output_zero_point

    def get_accum_to_output_re_quantization_params(self, output_scale, output_zero_point):
        return 1., self.output_zero_point

Embedding层
- 该层的量化计算就比较简单了（其实不做量化也行吧，除非为了减少模型大小），以下为定义：
- 可以看到，没有复杂的quantized_forward和二次量化；
小结：以上介绍了后训练量化器是如何实现的，核心部分是各类wrapper 对原module的封装替代；
补充：scale（一般是float）的整型计算方法
- 直接上代码了；基本思想就是 floor(scale*2^N) / (2^N)，N其实能大就大