Preface
Continuing this series. I've had some spare time these past couple of days, so I plan to take the chance to pull Qwen's token generator and encoder out into a standalone Python package and hand it over to Android.
After a few days of studying transformers, the more I read this code, the harder it is to put into words how out of my depth I feel.
I plan to write this code in Python first; if I find time later I'll port it, and if not, so be it.
Also, this round of code is transcribed from the apply_chat_template method in the tokenization_utils_base py file.
Note: the goal this time is only to produce a model that works, so parameters such as past_key_values and use_cache will be switched off to simplify the export as much as possible. Once things get fine-tuned later, the caches that should be there will be restored.
Input preprocessing
Here we need to take the input text, for example
messages = [
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.","user":"yinjun"},
]
and convert it into a format like this:
message = '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>assistant\n'
so it can be used downstream. I won't paste the full source here; below is the code I found to be the core part.
Roughly speaking, jinja2 renders the input content against the chat template. Note that the example prompt I wrote here may be off: it doesn't come out as multiple lines, and I feel it should.
import jinja2
from jinja2.exceptions import TemplateError
from jinja2.sandbox import ImmutableSandboxedEnvironment
# Load the tokenizer_config file
with open(config_path, 'r', encoding='utf-8') as f:
config = json.load(f)
chat_template = config['chat_template']
def raise_exception(message):
raise TemplateError(message)
jinja_env = ImmutableSandboxedEnvironment(trim_blocks=True, lstrip_blocks=True)
jinja_env.globals["raise_exception"] = raise_exception
compiled_template = jinja_env.from_string(chat_template)
rendered=[]
conversations = [messages]
template_kwargs = {
'eos_token' : config['eos_token'],
'pad_token' : config['pad_token'],
'additional_special_tokens':config['additional_special_tokens']
}
for chat in conversations:
if hasattr(chat, "messages"):
# Indicates it's a Conversation object
chat = chat.messages
rendered_chat = compiled_template.render(
messages=chat, add_generation_prompt=add_generation_prompt, **template_kwargs
)
rendered.append(rendered_chat)
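As a quick sanity check (a minimal sketch, assuming the single system message and add_generation_prompt=True from the example above), printing the rendered result should reproduce the target string exactly:
# Quick check: the rendered chat should match the target format shown earlier
print(rendered[0])
# Expected output:
# <|im_start|>system
# You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
# <|im_start|>assistant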
Verifying the pre/post-processing before model export
First, find the config files; mine are at the following path:
C:\Users\30585\.cache\huggingface\hub\models--Qwen--Qwen2.5-Coder-0.5B-Instruct\snapshots\1f3785e6a5098279993727eab5ca5c9aa6444c34
Note that the 1f37... part of yours may differ from mine, so check it yourself. That folder contains all sorts of config files, but this time we only need one of them: tokenizer.json.
Once you have it, build the tokenizer with the from_file method, then extract the inputs and outputs from the Qwen example code. I've already done the extraction, so no screenshots here.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers import AddedToken
from tokenizers.processors import TemplateProcessing
import numpy as np
# Load the tokenizer.json file
import json
tokenizer_json = './assets/tokenizer.json'
tokenizer = Tokenizer.from_file(tokenizer_json)
testtext = '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>assistant\n'
# Test encoding and decoding
encoding = tokenizer.encode_batch(
[testtext],
add_special_tokens=True,
is_pretokenized=False,
)
print(f"Tokens: {encoding[0].tokens}")
print(f"Token IDs: {encoding[0].ids}")
decoded_text = tokenizer.decode(encoding[0].ids)
print(f"Decoded Text: {decoded_text}")
out = [102645, 69249, 100027, 114714, 101158, 44063, 104757, 45181, 39973,
8863, 99390, 105359, 17714, 58364, 8863, 99390, 104339, 3837,
99652, 104193, 104757, 100629, 61149, 33071, 1773, 102645, 69249,
100027, 114714, 110322, 17714, 48443, 78045, 282, 1155, 8,
284, 1124, 1242, 15159, 77, 28, 15, 92, 61,
35702, 258, 36958, 92, 272, 1089, 384, 47822, 72,
69761, 9139, 1124, 2533, 90919, 17767, 272, 1089, 1124,
8, 54851, 110589, 3837, 44292, 308, 1124, 8, 54851,
107586, 1773, 100431, 101909, 105172, 30280, 46100, 19793, 26355,
3837, 37029, 63, 35083, 63, 44956, 36407, 101884, 102645,
69249, 100027, 114714, 3407, 73594, 12669, 198, 474, 8591,
438, 2595, 271, 2, 41479, 248, 64559, 46944, 104757,
198, 83, 284, 2595, 38712, 7, 15, 11, 220,
17, 353, 2595, 24259, 11, 220, 16, 15, 15,
15, 340, 26622, 284, 2595, 16318, 1155, 692, 2,
33424, 94, 69103, 102645, 69249, 100027, 114714, 198, 69,
284, 2595, 79899, 79899, 56782, 692, 2, 220, 46485,
102645, 69249, 100027, 114714, 9370, 110589, 198, 48638, 28142,
284, 282, 58, 15, 2533, 1350, 67018, 28142, 340,
13874, 19324, 104596, 19793, 26355, 15946, 3837, 97639, 101140,
91282, 104059, 102298, 20450, 44292, 259, 1124, 8, 9370,
69824, 90395, 50377, 104059, 102298, 36556, 106514, 104757, 9370,
69824, 44292, 8286, 1124, 8, 1773, 101889, 3837, 97639,
37029, 63, 6199, 79899, 79899, 63, 32804, 100768, 102645,
69249, 100027, 114714, 1773, 100161, 3837, 97639, 107439, 102645,
69249, 100027, 114714, 9370, 110589, 62926, 102703, 99898, 3407,
104001, 107083, 46100, 3837, 102762, 101051, 46944, 102298, 64952,
101454, 110589, 9370, 69824, 3837, 103991, 102268, 99661, 106168,
107586, 9370, 102645, 69249, 100027, 114714, 110589, 1773, 151645]
out = np.array(out)
d_text = tokenizer.decode(out,skip_special_tokens=True)
print(d_text)
Comparing the two runs, the results match.
[Screenshot: pipeline output]
[Screenshot: tokenizer output]
Model export and inference test
Preface
Qwen is built on PyTorch, and PyTorch supports exporting directly to ONNX, so here I'll just brute-force export the base_model and see what happens, via the code below.
Note: if you try to export the whole model directly, the export fails because the output structure of some of its returns isn't supported. On top of that, some operations inside the Qwen2ForCausalLM inference layer make me worry about inconsistent outputs across different opsets, so to be safe I export the original qwen2 model instead (absolutely not because I couldn't fix the "Here, received an input of unsupported type: BatchEncoding" error, absolutely not.jpg).
Also, the model has to be exported as float32 data. The original model is bf16, and some ONNX operators, such as Pow, don't support bf16.
Exporting the Qwen core model
1. Convert the model to float32
The conversion code:
import torch
from transformers import AutoModelForCausalLM

model_name = "Qwen/Qwen2.5-Coder-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto",
local_files_only=True
)
def convert_bf16_fp16_to_fp32(model):
for param in model.parameters():
if param.dtype == torch.bfloat16 or param.dtype == torch.float16:
param.data = param.data.to(dtype=torch.float32)
for buffer in model.buffers():
if buffer.dtype == torch.bfloat16 or buffer.dtype == torch.float16:
buffer.data = buffer.data.to(dtype=torch.float32)
return model
# The qwen2 base model to export
tmodel = model.model.base_model
# The lm_head to export (it lives on the CausalLM wrapper, not on the inner Qwen2Model)
llm_model = model.lm_head
tmodel = convert_bf16_fp16_to_fp32(tmodel)
llm_model = convert_bf16_fp16_to_fp32(llm_model)
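Before exporting, it's worth a quick dtype check to confirm the conversion really took effect (a minimal sketch I added, not part of the original flow):
# Sanity check: nothing should be left in bf16/fp16 after the conversion
leftover = [n for n, p in tmodel.named_parameters() if p.dtype in (torch.bfloat16, torch.float16)]
leftover += [n for n, b in tmodel.named_buffers() if b.dtype in (torch.bfloat16, torch.float16)]
print("leftover low-precision tensors:", leftover)  # expected: []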
2. Export the model
First, a small modification to the original model code is needed.
Find Qwen2Model under qwen2's modeling_qwen2 and change the default values of the forward arguments, turning them all off with False:
@add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = False,
output_attentions: Optional[bool] = False,
output_hidden_states: Optional[bool] = False,
return_dict: Optional[bool] = False,
) -> Union[Tuple, BaseModelOutputWithPast]:
Then grab the original input data and use torch.onnx to export the models that were converted to float32 above.
# Obtain the model input data
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# Get the three inputs required by the base model
input_ids = model_inputs.data['input_ids']
attention_mask = model_inputs.data['attention_mask']
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
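For intuition (a toy example I added, not from the original snippet): with left padding, the cumsum trick starts counting positions at the first real token and fills the padded slots with a harmless 1:
# Toy illustration of the position_ids construction above (separate names to avoid clobbering the real tensors)
toy_mask = torch.tensor([[0, 0, 1, 1, 1]])
toy_position_ids = toy_mask.long().cumsum(-1) - 1   # tensor([[-1, -1,  0,  1,  2]])
toy_position_ids.masked_fill_(toy_mask == 0, 1)     # tensor([[ 1,  1,  0,  1,  2]])
print(toy_position_ids)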
Finally, export the lm_head and the qwen2 model separately.
Export the qwen2 model:
# Input/output names matching the dynamic_axes below
input_names = ['input_ids', 'attention_mask', 'position_ids']
output_names = ['last_hidden_state']
torch.onnx.export(tmodel, (input_ids, attention_mask, position_ids),
                  input_names=input_names, output_names=output_names,
                  f="./onnx/model32.onnx",
                  dynamic_axes={'input_ids': [1], 'attention_mask': [1], 'position_ids': [1], 'last_hidden_state': [1]},
                  opset_version=20)
Export the lm_head:
# llm_head_input is a sample hidden-state tensor produced by the base model (its last_hidden_state output)
torch.onnx.export(llm_model, (llm_head_input,),
                  input_names=['input_0'], output_names=["logits"],
                  f="./onnx/llm_model.onnx",
                  dynamic_axes={'input_0': {1: "out_size"}})
Writing the model inference pipeline code
At this point both the fc layer (lm_head) and the model layer have had a first-pass export; next up is the pipeline itself.
The overall pipeline workflow was already walked through earlier, so it can be reused directly here.
Note! To get something working quickly I skip some code that "can wait for now", but much of that skipped code is there for error handling, so the pain will have to be paid back later.
The common part: 10 steps
1. Initialize the config file and hand over the initialization parameters manually; skipped (PS: not really skipped. In the original code these values are generated automatically by reading the config files; I'm being lazy and hard-coding them. Don't do this at home.)
The code is as follows:
pad_token_id = 151643
bos_token_id = 151643
eos_token_id = [151645,151643]
max_position_embeddings = 32768
# Whether a default max length exists
has_default_max_length = True
# Whether a default min length exists
has_default_min_length = True
# Maximum number of new tokens
max_new_tokens=512
# No minimum length restriction, so set it to 0
min_length = 0
temperature = 0.7
top_k = 20
top_p = 0.8
min_tokens_to_keep=1
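If you'd rather not hard-code these numbers, the same values can be read from the snapshot's generation_config.json (a sketch; the ./assets path and the presence of each field are my assumptions, with the hard-coded values above as fallbacks):
import json

# Assumed location: generation_config.json copied next to tokenizer.json under ./assets
with open("./assets/generation_config.json", 'r', encoding='utf-8') as f:
    gen_cfg = json.load(f)

pad_token_id = gen_cfg.get("pad_token_id", 151643)
eos_token_id = gen_cfg.get("eos_token_id", [151645, 151643])
temperature = gen_cfg.get("temperature", 0.7)
top_k = gen_cfg.get("top_k", 20)
top_p = gen_cfg.get("top_p", 0.8)
repetition_penalty = gen_cfg.get("repetition_penalty", 1.05)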
2. Create the corresponding logits_processor and stopping_criteria, along with eos_token_id and pad_token_id. This can't be skipped, so keep it; the code is copied out first:
# logits processors control the generation process
logits_processor = LogitsProcessorList()
# stopping criteria control when generation ends
stopping_criteria = StoppingCriteriaList()
# warpers reshape the distribution before sampling
warpers = LogitsProcessorList()
pad_token_id = 151643
bos_token_id = 151643
eos_token_id = [151645,151643]
max_position_embeddings = 32768
3. Input validation. Our input has no inputs_embeds, and _maybe_initialize_input_ids_for_generation only runs when the input is None, so it isn't called here and can be skipped for now.
4. Prepare the model's other input arguments. Before inference, this step decides which of use_cache, output_attentions, output_hidden_states and return_dict are True and which are False; we hard-coded them earlier, so it can be bypassed as well.
5. Prepare the "input_ids" used for autoregressive generation.
Since the qwen2 model is not an encoder-decoder model, this goes straight into the else branch, which is convenient; bypass it too.
6. Prepare "max_length" from the other stopping conditions, and verify that the input does not exceed the maximum input token length or fall below the minimum. After extraction, the usable code looks like this:
# Length of the current input
input_ids_length = input_ids.shape[1]
# Recompute the maximum length
max_len = input_ids_length + max_new_tokens
# No minimum length restriction, so set it to 0
min_length = 0
7. Confirm the run mode
The mode for this run is already determined and fixed to sample, so bypass this.
8. Prepare the distribution pre-processing samplers
Looking at Qwen's inference flow, only a single RepetitionPenaltyLogitsProcessor seems to be registered, and the logits_processor passed in by default is empty:
logits_processor = LogitsProcessorList()
repetition_penalty = 1.05
logits_processor.append(RepetitionPenaltyLogitsProcessor(repetition_penalty))
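For reference, what RepetitionPenaltyLogitsProcessor does is small enough to sketch in plain numpy (illustrative only, not the transformer_lite implementation): every token that already appears in the sequence gets its logit pushed down, so the model is less likely to repeat itself.
import numpy as np

def apply_repetition_penalty(input_ids, logits, penalty=1.05):
    # logits: float array of shape (1, vocab_size); input_ids: int array of shape (1, seq_len)
    for token_id in np.unique(input_ids):
        score = logits[0, token_id]
        # Negative logits are multiplied by the penalty, positive ones divided by it
        logits[0, token_id] = score * penalty if score < 0 else score / penalty
    return logits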
9. Prepare the stopping criteria
Going through the source, only these two criteria seem to be invoked, so for now I add just these two; if more show up later, I'll deal with them then.
stopping_criteria = StoppingCriteriaList()
eos_token_id = [151645,151643]
# Append the stopping criteria
stopping_criteria.append(MaxLengthCriteria(max_length=max_len,max_position_embeddings=max_position_embeddings))
stopping_criteria.append(EosTokenCriteria(eos_token_id))
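Conceptually these two criteria are tiny: MaxLengthCriteria fires once the sequence reaches max_len, and EosTokenCriteria fires once the last generated token is one of the EOS ids. A rough numpy equivalent, just for illustration:
import numpy as np

def should_stop(input_ids, max_len, eos_token_id):
    # input_ids: int array of shape (batch, seq_len)
    too_long = input_ids.shape[1] >= max_len
    hit_eos = np.isin(input_ids[:, -1], eos_token_id)
    return too_long | hit_eos   # per-sequence boolean, like the criteria lists return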
10. Start model inference; what follows are the three private steps.
The private part: 3 steps
11. Prepare the logits warpers
The naming threw me off a bit at first, but after reading the source underneath, what's left is roughly this:
warpers = LogitsProcessorList()
temperature = 0.7
top_k = 20
top_p = 0.8
min_tokens_to_keep=1
warpers.append(TemperatureLogitsWarper(temperature))
warpers.append(TopKLogitsWarper(top_k=top_k,min_tokens_to_keep=min_tokens_to_keep))
warpers.append(TopPLogitsWarper(top_p=top_p,min_tokens_to_keep=min_tokens_to_keep))
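The three warpers only reshape the distribution before sampling. A compact numpy sketch of the same logic (illustrative only; the pipeline below keeps using the transformer_lite classes):
import numpy as np

def warp_logits(logits, temperature=0.7, top_k=20, top_p=0.8):
    # logits: float array of shape (batch, vocab_size)
    # Temperature: sharpen (<1) or flatten (>1) the distribution
    logits = logits / temperature
    # Top-k: keep only the k largest logits per row
    kth_value = np.sort(logits, axis=-1)[:, -top_k][:, None]
    logits = np.where(logits < kth_value, -np.inf, logits)
    # Top-p: drop the low-probability tail once the cumulative mass of better tokens exceeds top_p
    sorted_idx = np.argsort(logits, axis=-1)[:, ::-1]
    sorted_logits = np.take_along_axis(logits, sorted_idx, axis=-1)
    probs = np.exp(sorted_logits - sorted_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    cumulative = np.cumsum(probs, axis=-1)
    remove = cumulative - probs > top_p          # the best token is always kept (min_tokens_to_keep=1)
    sorted_logits = np.where(remove, -np.inf, sorted_logits)
    # Scatter the filtered logits back into their original positions
    out = np.full_like(logits, -np.inf)
    np.put_along_axis(out, sorted_idx, sorted_logits, axis=-1)
    return out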
12. Process input_ids a second time
Here only this one line of code seems to actually matter; nothing else is used, so keeping just this line is enough:
expand_size=1
input_ids = input_ids.repeat_interleave(expand_size, dim=0)
13. Call the inference function and run the model
I've cut things down pretty hard here, basically dismantling the car into a bicycle; the whole thing was done with an "as long as it runs" mindset.
Breaking down the methods inside sample
This is where the model actually gets called: the input data is pre-processed with prepare_inputs_for_generation, the processed inputs are fed to the model, and the lm_head is then applied to get the final output (a toy call of the method is shown right after it).
# Pre-process the qwen2 input data
def prepare_inputs_for_generation(self, input_ids,position_ids=None, attention_mask=None, past_key_values=None, inputs_embeds=None,seen_tokens=0, **kwargs):
if attention_mask is None:
attention_mask = np.ones_like(input_ids,dtype=np.int64)
past_length = seen_tokens
# Keep only the unprocessed tokens:
# 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
# some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
# input)
if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
# 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
# input_ids based on the past_length.
elif past_length < input_ids.shape[1]:
input_ids = input_ids[:, past_length:]
if attention_mask is not None and position_ids is None:
# create position_ids on the fly for batch generation
position_ids = np.cumsum(attention_mask,axis=-1).astype(np.int64)-1
position_ids = np.where(attention_mask==0,1,position_ids)
if past_key_values:
position_ids = position_ids[:, -input_ids.shape[1] :]
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
if inputs_embeds is not None and past_key_values is None:
model_inputs = {"inputs_embeds": inputs_embeds}
else:
model_inputs = {"input_ids": input_ids}
model_inputs.update(
{
"position_ids": position_ids,
"past_key_values": past_key_values,
"use_cache": kwargs.get("use_cache"),
"attention_mask": attention_mask,
}
)
return model_inputs
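Treating the method above as a standalone function (self is unused when there is no cache), a toy call shows exactly what gets handed to the ONNX session on the first step:
import numpy as np

# Toy call: no cache and no explicit mask, so everything is derived from input_ids
dummy_ids = np.array([[151644, 8948, 198]], dtype=np.int64)
inputs = prepare_inputs_for_generation(None, dummy_ids)   # None stands in for the unused self
print(inputs["input_ids"])       # [[151644 8948 198]]
print(inputs["attention_mask"])  # [[1 1 1]]
print(inputs["position_ids"])    # [[0 1 2]]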
The code for the run part:
def runForCausalLM(self ,input_ids):
input_names = [input.name for input in self.model.get_inputs()]
inputs = self.prepare_inputs_for_generation(input_ids,None,None,None)
# onnxruntime only accepts feed names declared by the graph, so keep just those inputs
inputs = {name: inputs[name] for name in input_names}
output_names = [output.name for output in self.model.get_outputs()]
outputs=self.model.run(output_names,inputs)
hidden_states = outputs[0]
llm_inputs = {"input_0": hidden_states}
llm_output_names = [output.name for output in self.lm_head.get_outputs()]
logits = self.lm_head.run(llm_output_names,llm_inputs)
logits = logits[0].astype(np.float32)
return logits
The method above is then driven in a while loop, which terminates once the specified conditions are hit.
while(not this_peer_finished):
logits = self.runForCausalLM(input_ids)
input_ids = torch.from_numpy(input_ids)
logits = torch.from_numpy(logits)
next_token_logits = logits[:,-1,:]
next_token_scores = self.logits_processor(input_ids, next_token_logits)
next_token_scores = self.logits_warper(input_ids, next_token_scores)
next_token_scores = next_token_scores.numpy()
probs = self.softmax(next_token_scores,-1)
next_tokens = self.multinomial_numpy(probs,1)
next_tokens = torch.from_numpy(next_tokens)
if self.eos_token_id is not None:
next_tokens = next_tokens * unfinished_sequences + self.pad_token_id * (1 - unfinished_sequences)
# Update input_ids: concatenate the newly generated token onto the existing ids
ntoken = next_tokens[:, None]
input_ids = np.concatenate([input_ids, ntoken], axis=-1)
input_ids = torch.from_numpy(input_ids)
if not (scores is None):
scores = torch.from_numpy(scores)
unfinished_sequences = unfinished_sequences & ~self.stopping_criteria(input_ids, scores)
this_peer_finished = unfinished_sequences.max() == 0
input_ids = input_ids.numpy()
With the above in place, a full round of inference can be completed.
Complete pipeline code
Input handling part
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers import AddedToken
from tokenizers.processors import TemplateProcessing
import numpy as np
import jinja2
from jinja2.exceptions import TemplateError
from jinja2.sandbox import ImmutableSandboxedEnvironment
from testrun import QwenMoelRun
# Load the tokenizer_config.json file
import json
model = QwenMoelRun()
# Config items
tokenizer_json = './assets/tokenizer.json'
config_path = "./assets/tokenizer_config.json"
add_generation_prompt = True
# Test data
prompt= "请帮我写一个傅里叶变化公式,并使用python代码简单复现一下"
testtext = '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>assistant\n'
messages = [
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.","user":"yinjun"},
{"role": "user", "content": prompt,"user":"yinjun"}
]
# Load the tokenizer_config file
with open(config_path, 'r', encoding='utf-8') as f:
config = json.load(f)
chat_template = config['chat_template']
def raise_exception(message):
raise TemplateError(message)
jinja_env = ImmutableSandboxedEnvironment(trim_blocks=True, lstrip_blocks=True)
jinja_env.globals["raise_exception"] = raise_exception
compiled_template = jinja_env.from_string(chat_template)
rendered=[]
conversations = [messages]
template_kwargs = {
'eos_token' : config['eos_token'],
'pad_token' : config['pad_token'],
'additional_special_tokens':config['additional_special_tokens']
}
for chat in conversations:
if hasattr(chat, "messages"):
# Indicates it's a Conversation object
chat = chat.messages
rendered_chat = compiled_template.render(
messages=chat, add_generation_prompt=add_generation_prompt, **template_kwargs
)
rendered.append(rendered_chat)
# Pre-process the input
tokenizer = Tokenizer.from_file(tokenizer_json)
# Test encoding and decoding
encoding = tokenizer.encode_batch(
rendered,
add_special_tokens=True,
is_pretokenized=False,
)
print(f"Tokens: {encoding[0].tokens}")
print(f"Token IDs: {encoding[0].ids}")
input_ids = encoding[0].ids
input_ids = np.array(input_ids,np.int64)
output = model.generate(input_ids)
decoded_text = tokenizer.decode(encoding[0].ids)
print(f"Decoded Text: {decoded_text}")
out = output[0]
d_text = tokenizer.decode(out,skip_special_tokens=True)
print(d_text)
Model inference part
from transformer_lite.generation.logits_process import LogitsProcessorList,RepetitionPenaltyLogitsProcessor,TopKLogitsWarper,TopPLogitsWarper,TemperatureLogitsWarper
from transformer_lite.generation.stopping_criteria import StoppingCriteriaList,MaxLengthCriteria,EosTokenCriteria
import numpy as np
import onnxruntime as ort
import onnx
import torch
class QwenMoelRun():
def __init__(self):
# Load the model parameters
self.pad_token_id = 151643
self.bos_token_id = 151643
self.eos_token_id = [151645,151643]
self.max_position_embeddings=32768
# Whether a default max length exists
self.has_default_max_length = True
# Whether a default min length exists
self.has_default_min_length = True
# Maximum number of new tokens
self.max_new_tokens=512
# No minimum length restriction, so set it to 0
self.min_length = 0
# Paths to the model files
model = "qwen2-code-0.5b"
model_type="onnx"
self.model_path ="./" + model + "/" + model_type + "/" + "model32.onnx"
self.lm_model_path ="./" + model + "/" + model_type + "/" + "lm_model32.onnx"
session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
# session_options.enable_cuda_graph = True  # enable CUDA graph optimization if needed
# session_options.gpu_id = 0  # use GPU 0
# One linear layer (the lm_head) could not be exported along with the rest, so it is loaded and used separately here
self.lm_head = ort.InferenceSession(self.lm_model_path, sess_options=session_options, providers=['CUDAExecutionProvider'])
# Load the model itself
print("Model is valid and supported by the current ONNX Runtime.")
self.model = ort.InferenceSession(self.model_path, sess_options=session_options, providers=['CUDAExecutionProvider'])
# Pre-process the qwen2 input data
def prepare_inputs_for_generation(self, input_ids,position_ids=None, attention_mask=None, past_key_values=None, inputs_embeds=None,seen_tokens=0, **kwargs):
if attention_mask is None:
attention_mask = np.ones_like(input_ids,dtype=np.int64)
past_length = seen_tokens
# Keep only the unprocessed tokens:
# 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
# some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
# input)
if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
# 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
# input_ids based on the past_length.
elif past_length < input_ids.shape[1]:
input_ids = input_ids[:, past_length:]
if attention_mask is not None and position_ids is None:
# create position_ids on the fly for batch generation
position_ids = np.cumsum(attention_mask,axis=-1).astype(np.int64)-1
position_ids = np.where(attention_mask==0,1,position_ids)
if past_key_values:
position_ids = position_ids[:, -input_ids.shape[1] :]
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
if inputs_embeds is not None and past_key_values is None:
model_inputs = {"inputs_embeds": inputs_embeds}
else:
model_inputs = {"input_ids": input_ids}
model_inputs.update(
{
"position_ids": position_ids,
"past_key_values": past_key_values,
"use_cache": kwargs.get("use_cache"),
"attention_mask": attention_mask,
}
)
return model_inputs
def softmax(self,x,axis=-1):
# Take the maximum of the input, keeping its original dimensions (for numerical stability)
x_max = np.max(x, axis=axis, keepdims=True)
# Exponentiate the shifted values
e_x = np.exp(x - x_max)
# Sum of the exponentials, keeping the original dimensions
return e_x / np.sum(e_x, axis=axis, keepdims=True)
def multinomial_numpy(self,probs, num_samples=1):
# Get the batch size and vocabulary size
batch_size, vocab_size = probs.shape
# Initialize the result array
next_tokens = np.zeros(batch_size, dtype=np.int64)
# Sample each row separately
for i in range(batch_size):
# Draw from the probability distribution with numpy.random.choice
next_tokens[i] = np.random.choice(vocab_size, size=num_samples, p=probs[i], replace=True)
return next_tokens.squeeze() # with num_samples=1 this drops the extra dimension
# This mimics the ForCausalLM forward call
def runForCausalLM(self ,input_ids):
input_names = [input.name for input in self.model.get_inputs()]
inputs = self.prepare_inputs_for_generation(input_ids,None,None,None)
# onnxruntime only accepts feed names declared by the graph, so keep just those inputs
inputs = {name: inputs[name] for name in input_names}
output_names = [output.name for output in self.model.get_outputs()]
outputs=self.model.run(output_names,inputs)
hidden_states = outputs[0]
llm_inputs = {"input_0": hidden_states}
llm_output_names = [output.name for output in self.lm_head.get_outputs()]
logits = self.lm_head.run(llm_output_names,llm_inputs)
logits = logits[0].astype(np.float32)
return logits
def generate(self,input_ids):
input_ids = np.array(input_ids)
input_ids = np.reshape(input_ids,[1,input_ids.shape[0]])
# Length of the current input
input_ids_length = input_ids.shape[1]
# Recompute the maximum length
max_len = input_ids_length + self.max_new_tokens
# Check whether the input is already too long
if input_ids_length >= max_len:
input_ids_string = "input_ids"
raise ValueError(
f"Input length of {input_ids_string} is {input_ids_length}, but `max_length` is set to"
f" {max_len}. This can lead to unexpected behavior. You should consider"
" increasing `max_length` or, better yet, setting `max_new_tokens`."
)
# logits processors control the generation process
self.logits_processor = LogitsProcessorList()
# stopping criteria control when generation ends
self.stopping_criteria = StoppingCriteriaList()
# the warpers adjust the randomness of the generated text
self.logits_warper = LogitsProcessorList()
temperature = 0.7
top_k = 20
top_p = 0.8
min_tokens_to_keep=1
# Add RepetitionPenaltyLogitsProcessor
repetition_penalty = 1.05
# Append the generation processor
self.logits_processor.append(RepetitionPenaltyLogitsProcessor(repetition_penalty))
# Append the stopping criteria
self.stopping_criteria.append(MaxLengthCriteria(max_length=max_len,max_position_embeddings=self.max_position_embeddings))
self.stopping_criteria.append(EosTokenCriteria(self.eos_token_id))
# Append the logits warpers
self.logits_warper.append(TemperatureLogitsWarper(temperature))
self.logits_warper.append(TopKLogitsWarper(top_k=top_k,min_tokens_to_keep=min_tokens_to_keep))
self.logits_warper.append(TopPLogitsWarper(top_p=top_p,min_tokens_to_keep=min_tokens_to_keep))
expand_size=1
input_ids = np.repeat(input_ids,expand_size, 0)
batch_size, seq_length = input_ids.shape
unfinished_sequences = np.ones(batch_size, dtype=np.int64)
unfinished_sequences = torch.from_numpy(unfinished_sequences)
this_peer_finished = False
# This mimics the while loop in sample that is driven by the _has_unfinished_sequences method
scores = None
first = True
while(not this_peer_finished):
logits = self.runForCausalLM(input_ids)
input_ids = torch.from_numpy(input_ids)
logits = torch.from_numpy(logits)
next_token_logits = logits[:,-1,:]
next_token_scores = self.logits_processor(input_ids, next_token_logits)
next_token_scores = self.logits_warper(input_ids, next_token_scores)
next_token_scores = next_token_scores.numpy()
probs = self.softmax(next_token_scores,-1)
next_tokens = self.multinomial_numpy(probs,1)
next_tokens = torch.from_numpy(next_tokens)
if self.eos_token_id is not None:
next_tokens = next_tokens * unfinished_sequences + self.pad_token_id * (1 - unfinished_sequences)
# Update input_ids: concatenate the newly generated token onto the existing ids
ntoken = next_tokens[:, None]
input_ids = np.concatenate([input_ids, ntoken], axis=-1)
input_ids = torch.from_numpy(input_ids)
if not (scores is None):
scores = torch.from_numpy(scores)
unfinished_sequences = unfinished_sequences & ~self.stopping_criteria(input_ids, scores)
this_peer_finished = unfinished_sequences.max() == 0
input_ids = input_ids.numpy()
# Inference ends here; move on to the next step
return input_ids
if __name__ == "__main__":
qwen = QwenMoelRun()