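This article mainly follows the course quantization-in-depth. A version with Chinese subtitles is available on Bilibili, and the official videos come with Jupyter notebooks.
The quantization-in-depth video series has three main parts:
1. An introduction to quantization concepts, implementing symmetric and asymmetric quantization, then per-tensor, per-channel, and per-group quantization on top of symmetric quantization
2. Building your own quantizer and quantizing open-source models
3. An introduction to weights packing and unpacking
This article covers the first part.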
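I. What is quantization
Quantization converts a model's high-precision weights and activations (e.g. float32) into a low-precision representation (e.g. int8).
1.1 Types of quantization
By whether the mapping function is linear: linear quantization and non-linear quantization. This article focuses on linear quantization.
By granularity: per tensor, per channel, per group.
By when quantization happens: post-training quantization (PTQ) and quantization-aware training (QAT).
1.2 Advantages of quantization
1. Smaller models and lower memory use
2. Faster inference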
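To make the memory saving concrete, here is a minimal sketch that compares the storage needed by a float32 weight matrix with that of an int8 tensor of the same shape. The 1024x1024 shape is an arbitrary assumption, and the int8 tensor is only a placeholder, not the result of a real quantization.

```python
import torch

# Hypothetical layer size, chosen only for illustration.
rows, cols = 1024, 1024

w_fp32 = torch.randn(rows, cols, dtype=torch.float32)
w_int8 = torch.zeros(rows, cols, dtype=torch.int8)  # placeholder for quantized weights


def tensor_bytes(t: torch.Tensor) -> int:
    """Storage size of the tensor's elements in bytes."""
    return t.numel() * t.element_size()


fp32_bytes = tensor_bytes(w_fp32)
int8_bytes = tensor_bytes(w_int8)
print(f"float32: {fp32_bytes / 2**20:.1f} MiB")
print(f"int8:    {int8_bytes / 2**20:.1f} MiB")
print(f"ratio:   {fp32_bytes / int8_bytes:.0f}x")
```

In practice the quantized model also has to store the scale (and possibly the zero_point), so the real saving is slightly below 4x.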
1.3 Challenges of quantization
1. Quantization error
2. Retraining (QAT)
3. Limited hardware support
4. Calibration dataset needed
5. Packing/unpacking
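1.4 How to evaluate quantization quality
A common approach is to dequantize the quantized data and then compute the mean squared error between the original data and the dequantized data: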
$$\text{quant\_err} = (\text{weights\_tensor} - \text{dequant\_weights\_tensor}).\text{square}().\text{mean}()$$
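As a small illustration of this metric, the sketch below wraps the expression in a helper function. The name `quantization_error` and the toy values are assumptions made for this example only.

```python
import torch


def quantization_error(original: torch.Tensor, dequantized: torch.Tensor) -> float:
    """Mean squared error between the original and the dequantized tensor."""
    return (original - dequantized.to(original.dtype)).square().mean().item()


# Toy values, made up for illustration.
weights_tensor = torch.tensor([1.0, -2.0, 3.0])
dequant_weights_tensor = torch.tensor([1.5, -2.0, 2.0])

quant_err = quantization_error(weights_tensor, dequant_weights_tensor)
print(f"quant_err: {quant_err:.4f}")  # (0.25 + 0 + 1) / 3
```

A smaller value means the dequantized tensor is closer to the original one.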
II. Linear quantization
2.1 Mapping
Consider the following general problem, as illustrated in the figure: we need to map $R \in [r_{min}, r_{max}]$ to $Q \in [q_{min}, q_{max}]$ such that 0 in R maps to z in Q.

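Now take any point r in R and suppose it maps to the point q in Q. Because the mapping is linear, the following relation holds:
$$\frac{r - 0}{q - z} = s$$
This gives the quantization formula:
$$
\begin{aligned}
q &= \frac{r}{s} + z \\
q &= round(q) \\
q &= clamp(q)
\end{aligned}
$$
The dequantization formula is:
$$r = s(q - z)$$
When z is 0, this special case is called symmetric quantization; when z is non-zero, it is asymmetric quantization. In quantization, s is conventionally called the scale and z the zero_point.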
2.2 A simple implementation of asymmetric quantization
simple_quant_01.py
import torch

def linear_dequant(tensor, scale, zero_point):
    # Cast the int8 tensor to float first; otherwise the subtraction can overflow
    r_tensor = (tensor.float() - zero_point) * scale
    return r_tensor

def linear_quant_with_zero_point(tensor, scale, zero_point, dtype=torch.int8):
    scaled_tensor = tensor / scale + zero_point
    round_tensor = torch.round(scaled_tensor)
    q_min = torch.iinfo(dtype).min
    q_max = torch.iinfo(dtype).max
    # Clamp to the representable range before casting to the integer dtype
    q_tensor = round_tensor.clamp(q_min, q_max).to(dtype)
    return q_tensor

r = torch.tensor([
    [191.6, -13.5, 728.6],
    [92.14, 295.5, -184],
    [0, 684.6, 245.5]
])
scale = 3.5
zero_point = -70

# Quantize
q = linear_quant_with_zero_point(r, scale, zero_point)
print(f"quantized tensor:{q}")

# Dequantize
r_dequant = linear_dequant(q, scale, zero_point)
print(f"dequantized tensor:{r_dequant}")

# Quantization error
quant_error = r - r_dequant
print(f"quant_error tensor:{quant_error}")
print(f"quant_error:{quant_error.square().mean()}")
Output:
quantized tensor:tensor([[ -15, -74, 127],
[ -44, 14, -123],
[ -70, 126, 0]], dtype=torch.int8)
dequantized tensor:tensor([[ 192.5000, -14.0000, 689.5000],
[ 91.0000, 294.0000, -185.5000],
[ 0.0000, 686.0000, 245.0000]])
quant_error tensor:tensor([[-0.9000, 0.5000, 39.1000],
[ 1.1400, 1.5000, 1.5000],
[ 0.0000, -1.4000, 0.5000]])
quant_error:170.87530517578125
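The quantization error is still large. Most of it comes from the value 728.6, which is clipped to 127, because scale and zero_point were picked arbitrarily here. Next we look at how to derive a suitable scale and zero_point from the data.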
2.3 scale and zero_point
Linear quantization maps $R \in [r_{min}, r_{max}]$ to $Q \in [q_{min}, q_{max}]$. Considering the endpoints of the two ranges, we have:
$$
\begin{aligned}
r_{min} &= s(q_{min} - z) \\
r_{max} &= s(q_{max} - z)
\end{aligned}
$$
Subtracting the first equation from the second and solving for s and z gives:
$$
\begin{aligned}
s &= \frac{r_{max} - r_{min}}{q_{max} - q_{min}} \\
z &= int(round(q_{min} - \frac{r_{min}}{s}))
\end{aligned}
$$
In particular, if $z \le q_{min}$, set $z = q_{min}$; if $z \ge q_{max}$, set $z = q_{max}$.
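The clamping rule for z can be illustrated with a short sketch. The helper `raw_and_clamped_zero_point` and the two toy tensors are hypothetical and exist only for this example; the helper returns the zero point both before and after clamping so that the effect is visible.

```python
import torch


def raw_and_clamped_zero_point(tensor, dtype=torch.int8):
    """Hypothetical helper: return scale, unclamped z, and clamped integer z."""
    q_min = torch.iinfo(dtype).min
    q_max = torch.iinfo(dtype).max
    r_min = tensor.min().item()
    r_max = tensor.max().item()
    scale = (r_max - r_min) / (q_max - q_min)
    raw_z = q_min - r_min / scale
    # Clamp into [q_min, q_max] first, then round to an integer
    clamped_z = int(round(min(max(raw_z, q_min), q_max)))
    return scale, raw_z, clamped_z


positive_only = torch.tensor([10.0, 15.0, 20.0])     # 0 lies below r_min
negative_only = torch.tensor([-20.0, -15.0, -10.0])  # 0 lies above r_max

_, raw_pos, z_pos = raw_and_clamped_zero_point(positive_only)
_, raw_neg, z_neg = raw_and_clamped_zero_point(negative_only)
print(f"positive only: raw z = {raw_pos:.1f}, clamped z = {z_pos}")
print(f"negative only: raw z = {raw_neg:.1f}, clamped z = {z_neg}")
```

With these formulas the clamp is only triggered when 0 lies outside $[r_{min}, r_{max}]$. In that situation the clamped zero point no longer maps $r_{min}$ to $q_{min}$, so part of the range saturates; the sketch only illustrates the clamping rule itself.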
We now quantize the tensor from simple_quant_01.py above with the new scale and zero_point:
$$
\begin{aligned}
& r_{max} = 728.6,\quad r_{min} = -184 \\
& q_{max} = 127,\quad q_{min} = -128 \\
& \to s = \frac{728.6 - (-184)}{127 - (-128)} = 3.58 \\
& \to z = int(round(-128 - \frac{-184}{3.58})) = -77
\end{aligned}
$$
We put this into code as simple_quant_02_asymmetric.py:
import torch

def linear_dequant(tensor, scale, zero_point):
    # Cast the int8 tensor to float first; otherwise the subtraction can overflow
    r_tensor = (tensor.float() - zero_point) * scale
    return r_tensor

def linear_quant_with_zero_point(tensor, scale, zero_point, dtype=torch.int8):
    scaled_tensor = tensor / scale + zero_point
    round_tensor = torch.round(scaled_tensor)
    q_min = torch.iinfo(dtype).min
    q_max = torch.iinfo(dtype).max
    q_tensor = round_tensor.clamp(q_min, q_max).to(dtype)
    return q_tensor

def get_scale_and_zero_point(tensor, dtype=torch.int8):
    q_max = torch.iinfo(dtype).max
    q_min = torch.iinfo(dtype).min
    r_max = tensor.max().item()
    r_min = tensor.min().item()
    scale = (r_max - r_min) / (q_max - q_min)
    zero_point = q_min - (r_min / scale)
    # Keep zero_point inside the representable integer range
    if zero_point < q_min:
        zero_point = q_min
    elif zero_point > q_max:
        zero_point = q_max
    else:
        zero_point = int(round(zero_point))
    return scale, zero_point

r = torch.tensor([
    [191.6, -13.5, 728.6],
    [92.14, 295.5, -184],
    [0, 684.6, 245.5]
])
s, z = get_scale_and_zero_point(r, torch.int8)
print(f"scale:{s} zero_point:{z}")
q = linear_quant_with_zero_point(r, s, z)
print(f"quantized tensor:{q}")
r_dequant = linear_dequant(q, s, z)
print(f"dequantized tensor:{r_dequant}")
quant_error = r - r_dequant
print(f"quant_error tensor:{quant_error}")
print(f"quant_error:{quant_error.square().mean()}")
Output:
scale:3.578823433670343 zero_point:-77
quantized tensor:tensor([[ -23, -81, 127],
[ -51, 6, -128],
[ -77, 114, -8]], dtype=torch.int8)
dequantized tensor:tensor([[ 193.2565, -14.3153, 730.0800],
[ 93.0494, 297.0423, -182.5200],
[ 0.0000, 683.5552, 246.9388]])
quant_error tensor:tensor([[-1.6564, 0.8153, -1.4800],
[-0.9094, -1.5423, -1.4800],
[ 0.0000, 1.0447, -1.4388]])
quant_error:1.5729731321334839
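The quantization error is now much smaller: about 1.57, compared with about 170.88 for the arbitrary scale and zero_point.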
2.4 Symmetric quantization
When zero_point is 0 we have symmetric quantization, and the formulas simplify to:
$$
\begin{aligned}
s &= \frac{r_{max}}{q_{max}} \\
q &= clamp(round(\frac{r}{s}))
\end{aligned}
$$
Let's implement symmetric quantization in simple_quant_03_symmetric.py:
import torch

def linear_dequant(tensor, scale, zero_point):
    # Cast the int8 tensor to float first; otherwise the subtraction can overflow
    r_tensor = (tensor.float() - zero_point) * scale
    return r_tensor

def linear_quant_with_zero_point(tensor, scale, zero_point, dtype=torch.int8):
    scaled_tensor = tensor / scale + zero_point
    round_tensor = torch.round(scaled_tensor)
    q_min = torch.iinfo(dtype).min
    q_max = torch.iinfo(dtype).max
    q_tensor = round_tensor.clamp(q_min, q_max).to(dtype)
    return q_tensor

def get_scale_symmetric(tensor, dtype=torch.int8):
    q_max = torch.iinfo(dtype).max
    # Use the largest absolute value as r_max so that large negative
    # values are not clipped (identical to tensor.max() for this example)
    r_max = tensor.abs().max().item()
    scale = r_max / q_max
    return scale

r = torch.tensor([
    [191.6, -13.5, 728.6],
    [92.14, 295.5, -184],
    [0, 684.6, 245.5]
])
s = get_scale_symmetric(r, torch.int8)
print(f"scale:{s}")
q = linear_quant_with_zero_point(r, s, zero_point=0)
print(f"quantized tensor:{q}")
r_dequant = linear_dequant(q, s, zero_point=0)
print(f"dequantized tensor:{r_dequant}")
quant_error = r - r_dequant
print(f"quant_error tensor:{quant_error}")
print(f"quant_error:{quant_error.square().mean()}")
Output:
scale:5.737007681779035
quantized tensor:tensor([[ 33, -2, 127],
[ 16, 52, -32],
[ 0, 119, 43]], dtype=torch.int8)
dequantized tensor:tensor([[ 189.3213, -11.4740, 728.6000],
[ 91.7921, 298.3244, -183.5842],
[ 0.0000, 682.7039, 246.6913]])
quant_error tensor:tensor([[ 2.2787, -2.0260, 0.0000],
[ 0.3479, -2.8244, -0.4158],
[ 0.0000, 1.8961, -1.1913]])
quant_error:2.5091912746429443
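The following sketch illustrates why the scale in symmetric quantization is usually derived from the largest absolute value rather than the largest signed value. The helper `symmetric_quant_dequant` and the toy tensor are assumptions for this example; the tensor is chosen so that its most negative entry has the largest magnitude.

```python
import torch


def symmetric_quant_dequant(tensor, r_max, dtype=torch.int8):
    """Hypothetical helper: quantize with scale = r_max / q_max, then dequantize."""
    q_min = torch.iinfo(dtype).min
    q_max = torch.iinfo(dtype).max
    scale = r_max / q_max
    q = torch.round(tensor / scale).clamp(q_min, q_max).to(dtype)
    return q.float() * scale


r = torch.tensor([-5.0, 1.0, 2.0])

# Scale from the largest signed value: -5.0 falls outside the range and is clipped
deq_signed = symmetric_quant_dequant(r, r.max().item())
# Scale from the largest absolute value: the whole tensor is covered
deq_abs = symmetric_quant_dequant(r, r.abs().max().item())

err_signed = (r - deq_signed).square().mean().item()
err_abs = (r - deq_abs).square().mean().item()
print(f"scale from max():       {deq_signed.tolist()}  error={err_signed:.4f}")
print(f"scale from abs().max(): {deq_abs.tolist()}  error={err_abs:.6f}")
```

For the 3x3 example tensor used in this article both choices give the same scale, because its largest value (728.6) is also its largest absolute value.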
III. Quantization granularity
By granularity, quantization is divided into per-tensor, per-channel, and per-group quantization.

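3.1 per tensor
Everything in Part II was in fact per-tensor quantization: the whole tensor shares a single scale and zero_point. Here is a simple per-tensor implementation based on symmetric quantization; it calls the functions defined in Part II.
def linear_quant_per_tensor(tensor, dtype=torch.int8):
    # One scale for the whole tensor (symmetric, so zero_point = 0)
    scale = get_scale_symmetric(tensor, dtype)
    q = linear_quant_with_zero_point(tensor, scale, zero_point=0, dtype=dtype)
    return q

input_tensor = torch.tensor(
    [[191.6, -13.5, 728.6],
     [92.14, 295.5, -184],
     [0, 684.6, 245.5]]
)
quantized_tensor = linear_quant_per_tensor(input_tensor)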
3.2 per channel
We again use symmetric quantization to implement per-channel quantization. Taking a 2-D tensor as an example:
dim = 0: quantize along rows (one scale per row)
dim = 1: quantize along columns (one scale per column)
def linear_quant_per_channel(tensor, dim, dtype=torch.int8):
    output_dim = tensor.shape[dim]   # size of the tensor along dim
    scale = torch.zeros(output_dim)  # one scale per slice along dim
    for i in range(output_dim):
        sub_tensor = tensor.select(dim, i)
        scale[i] = get_scale_symmetric(sub_tensor, dtype=dtype)
    # Reshape scale so that it broadcasts against tensor along dim
    scale_shape = [1] * tensor.dim()
    scale_shape[dim] = -1
    scale = scale.view(scale_shape)
    quantized_tensor = linear_quant_with_zero_point(tensor, scale=scale, zero_point=0, dtype=dtype)
    return quantized_tensor, scale
Output (dim = 0, same input tensor as above):
quantized tensor:tensor([[ 33, -2, 127],
[ 40, 127, -79],
[ 0, 127, 46]], dtype=torch.int8)
dequantized tensor:tensor([[ 189.3213, -11.4740, 728.6000],
[ 93.0709, 295.5000, -183.8150],
[ 0.0000, 684.6000, 247.9653]])
quant_error tensor:tensor([[ 2.2787, -2.0260, 0.0000],
[-0.9309, 0.0000, -0.1850],
[ 0.0000, 0.0000, -2.4653]])
quant_error:1.8084441423416138
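The per-slice loop in the per-channel code can also be written without an explicit Python loop. The sketch below is one possible vectorized variant, not the course's implementation: the function name `per_channel_symmetric_scale` is hypothetical, and it assumes the scale is taken from the largest absolute value of each slice.

```python
import torch


def per_channel_symmetric_scale(tensor, dim, dtype=torch.int8):
    """Hypothetical helper: one symmetric scale per slice along `dim`."""
    q_max = torch.iinfo(dtype).max
    # Reduce over every dimension except `dim`; keepdim makes the result broadcastable
    reduce_dims = tuple(d for d in range(tensor.dim()) if d != dim)
    max_abs = tensor.abs().amax(dim=reduce_dims, keepdim=True)
    return max_abs / q_max


r = torch.tensor([
    [191.6, -13.5, 728.6],
    [92.14, 295.5, -184.0],
    [0.0, 684.6, 245.5],
])

scale_rows = per_channel_symmetric_scale(r, dim=0)  # shape (3, 1): one scale per row
scale_cols = per_channel_symmetric_scale(r, dim=1)  # shape (1, 3): one scale per column

q_rows = torch.round(r / scale_rows).clamp(-128, 127).to(torch.int8)
print(q_rows)
```

Because `keepdim=True` leaves the reduced dimensions in place with size 1, the scale broadcasts against the tensor directly and no extra `view` is needed.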
3.3 per group
Per-group quantization first reshapes the tensor by group_size, so that each row of the new tensor holds one group of group_size elements, then applies per-channel quantization to it, and finally reshapes the result back.
def linear_quant_per_group(tensor, group_size, dtype=torch.int8):
    t_shape = tensor.shape
    assert tensor.dim() == 2
    assert t_shape[1] % group_size == 0
    # Each row of the reshaped tensor is one group
    tensor = tensor.view(-1, group_size)
    quantized_tensor, scale = linear_quant_per_channel(tensor, dim=0, dtype=dtype)
    quantized_tensor = quantized_tensor.view(t_shape)
    return quantized_tensor, scale

def linear_dequantization_per_group(quantized_tensor, scale, group_size):
    q_shape = quantized_tensor.shape
    quantized_tensor = quantized_tensor.view(-1, group_size)
    dequantized_tensor = linear_dequant(quantized_tensor, scale, 0)
    dequantized_tensor = dequantized_tensor.view(q_shape)
    return dequantized_tensor
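As a usage illustration, here is a compact, self-contained sketch of a per-group round trip. It is a simplified variant rather than the code above: the names `quantize_per_group` and `dequantize_per_group`, the random 4x6 tensor, and group_size = 3 are all assumptions made for this example.

```python
import torch


def quantize_per_group(tensor, group_size, dtype=torch.int8):
    """Hypothetical helper: symmetric quantization with one scale per group."""
    assert tensor.dim() == 2 and tensor.shape[1] % group_size == 0
    q_min = torch.iinfo(dtype).min
    q_max = torch.iinfo(dtype).max
    groups = tensor.reshape(-1, group_size)  # one group per row
    scale = groups.abs().amax(dim=1, keepdim=True) / q_max
    scale = scale.clamp(min=1e-12)  # guard against an all-zero group
    q = torch.round(groups / scale).clamp(q_min, q_max).to(dtype)
    return q.view(tensor.shape), scale


def dequantize_per_group(q, scale, group_size):
    """Undo quantize_per_group using the stored per-group scales."""
    return (q.view(-1, group_size).float() * scale).view(q.shape)


torch.manual_seed(0)
w = torch.randn(4, 6)  # toy weight matrix
q, scale = quantize_per_group(w, group_size=3)
w_hat = dequantize_per_group(q, scale, group_size=3)

print(f"number of groups: {scale.shape[0]}")  # 4 * 6 / 3 = 8
print(f"quant_error: {(w - w_hat).square().mean().item():.8f}")
```

Smaller groups generally lower the error but require more scales: with one float32 scale per group of g weights, the scales add 32/g bits of storage per weight.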
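IV. Inference with quantized data
In a neural network we can quantize not only the weights but also the activations. If only the weights are quantized (e.g. W8A32), the computation uses floating-point arithmetic, and the weights must be dequantized before the computation. If both weights and activations are quantized (e.g. W8A8), the computation uses integer arithmetic, but not all hardware supports it.
Below is how to run inference with W8A32. For simplicity, the linear layer has no bias.
def quantized_linear_W8A32_without_bias(input, q_w, s_w, z_w):
    assert input.dtype == torch.float32
    assert q_w.dtype == torch.int8
    # Dequantize the int8 weights back to float32 before the matmul
    dequantized_weight = (q_w.to(torch.float32) - z_w) * s_w
    output = torch.nn.functional.linear(input, dequantized_weight)
    return output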
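A minimal usage sketch of the W8A32 path: the weights of a toy linear layer are quantized symmetrically per tensor, and the quantized output is compared with the float32 reference. The shapes, the random seed, and the inline quantization step are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F


def quantized_linear_W8A32_without_bias(input, q_w, s_w, z_w):
    assert input.dtype == torch.float32
    assert q_w.dtype == torch.int8
    # Dequantize the int8 weights back to float32 before the matmul
    dequantized_weight = (q_w.to(torch.float32) - z_w) * s_w
    return F.linear(input, dequantized_weight)


torch.manual_seed(0)
x = torch.randn(2, 4)  # toy activations: batch of 2, 4 input features
w = torch.randn(3, 4)  # toy weights: 3 output features

# Symmetric per-tensor quantization of the weights (zero_point = 0)
s_w = w.abs().max().item() / 127
q_w = torch.round(w / s_w).clamp(-128, 127).to(torch.int8)

y_ref = F.linear(x, w)  # float32 reference
y_q = quantized_linear_W8A32_without_bias(x, q_w, s_w, 0)
max_diff = (y_ref - y_q).abs().max().item()
print(f"output shape: {tuple(y_q.shape)}")
print(f"max abs difference vs float32: {max_diff:.6f}")
```

Since each weight is off by at most half a scale step, the output difference is bounded by s_w / 2 times the sum of the absolute input values in a row.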