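AWQ quantization: accuracy drops by 6 points, and inference time falls from 0.447 s to 0.4 s (a reduction of roughly 10.5%).
In the llamafactory environment, install AutoAWQ: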
pip install autoawq
Quantization code:
def qu_awq():
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer
    import json

    model_path = "model_path"        # path to the original model
    quant_path = "awq_model_path"    # output path for the quantized model
    calib_data = "_quantize.json"    # calibration data file
    # 4-bit weights, zero-point quantization, group size 128, GEMM version
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    # Load the model and tokenizer
    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(
        model_path, trust_remote_code=True, device_map="auto", safetensors=True
    )

    # Format of the calibration data
    """ # Example
    msg=[
        {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
        {"role": "user", "content": "Tell me who you are."},
        {"role": "assistant", "content": "I am a large language mod
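To make the quant_config settings above more concrete, here is a minimal pure-Python sketch of group-wise zero-point quantization with w_bit=4 and q_group_size=128. It only illustrates the idea of one scale and one zero point per group of weights; it is not AutoAWQ's implementation and leaves out the activation-aware scaling that AWQ adds on top. The function names (quantize_group, dequantize_group, quantize_row) and the sample weights are hypothetical and exist only for this illustration.

```python
def quantize_group(weights, w_bit=4):
    """Quantize one group of floats to unsigned integers with a scale and zero point."""
    q_max = 2 ** w_bit - 1                              # 15 when w_bit is 4
    w_min, w_max = min(weights), max(weights)
    scale = max((w_max - w_min) / q_max, 1e-5)          # guard against constant groups
    zero_point = min(q_max, max(0, round(-w_min / scale)))
    q = [min(q_max, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize_group(q, scale, zero_point):
    """Map the integers back to approximate float weights."""
    return [(v - zero_point) * scale for v in q]


def quantize_row(row, q_group_size=128, w_bit=4):
    """Split a weight row into groups and quantize each group independently."""
    return [
        quantize_group(row[start:start + q_group_size], w_bit)
        for start in range(0, len(row), q_group_size)
    ]


# Hypothetical weights spread around zero, as an illustration only.
row = [((i * 37) % 101 - 50) / 1000 for i in range(256)]
groups = quantize_row(row, q_group_size=128, w_bit=4)

for index, (q, scale, zero_point) in enumerate(groups):
    original = row[index * 128:(index + 1) * 128]
    restored = dequantize_group(q, scale, zero_point)
    worst = max(abs(a - b) for a, b in zip(original, restored))
    print(f"group {index}: scale={scale:.6f} zero_point={zero_point} max_error={worst:.6f}")
```

Each group stores only 4-bit integers plus one scale and one zero point, which is where the memory saving comes from. A smaller q_group_size means more scales to store but a tighter fit per group, so the rounding error of this kind is one plausible source of the accuracy drop noted at the top.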




