Preface
Continuing this series. I've had some spare time these past couple of days, so I plan to take the chance to pull Qwen's token generator and encoder out into a standalone Python package and hand it over to Android.
After a few days of studying transformers, the more I read this code, the harder it is to put into words how out of my depth I feel.
I plan to write this code in Python first; if I find time later I'll port it, and if not, so be it.
Also, this round of code is transcribed from the apply_chat_template method in the tokenization_utils_base py file.
Note: the goal this time is only to produce a model that works, so parameters such as past_key_values and use_cache will be switched off to simplify the export as much as possible. Once things get fine-tuned later, the caches that should be there will be restored.
Input preprocessing
Here we need to take the input text, for example
messages = [
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.","user":"yinjun"},
]
and convert it into a format like this:
message = '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>assistant\n'
so it can be used downstream. I won't paste the full source here; below is the code I found to be the core part.
Roughly speaking, jinja2 renders the input content against the chat template. Note that the example prompt I wrote here may be off: it doesn't come out as multiple lines, and I feel it should.
import jinja2
from jinja2.exceptions import TemplateError
from jinja2.sandbox import ImmutableSandboxedEnvironment
# Load the tokenizer_config file
with open(config_path, 'r', encoding='utf-8') as f:
config = json.load(f)
chat_template = config['chat_template']
def raise_exception(message):
raise TemplateError(message)
jinja_env = ImmutableSandboxedEnvironment(trim_blocks=True, lstrip_blocks=True)
jinja_env.globals["raise_exception"] = raise_exception
compiled_template = jinja_env.from_string(chat_template)
rendered=[]
conversations = [messages]
template_kwargs = {
'eos_token' : config['eos_token'],
'pad_token' : config['pad_token'],
'additional_special_tokens':config['additional_special_tokens']
}
for chat in conversations:
if hasattr(chat, "messages"):
# Indicates it's a Conversation object
chat = chat.messages
rendered_chat = compiled_template.render(
messages=chat, add_generation_prompt=add_generation_prompt, **template_kwargs
)
rendered.append(rendered_chat)
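As a quick sanity check (a minimal sketch, assuming the single system message and add_generation_prompt=True from the example above), printing the rendered result should reproduce the target string exactly:
# Quick check: the rendered chat should match the target format shown earlier
print(rendered[0])
# Expected output:
# <|im_start|>system
# You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
# <|im_start|>assistant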
Verifying the pre/post-processing before model export
First, find the config files; mine are at the following path:
C:\Users\30585\.cache\huggingface\hub\models--Qwen--Qwen2.5-Coder-0.5B-Instruct\snapshots\1f3785e6a5098279993727eab5ca5c9aa6444c34
Note that the 1f37... part of yours may differ from mine, so check it yourself. That folder contains all sorts of config files, but this time we only need one of them: tokenizer.json.
Once you have it, build the tokenizer with the from_file method, then extract the inputs and outputs from the Qwen example code. I've already done the extraction, so no screenshots here.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers import AddedToken
from tokenizers.processors import TemplateProcessing
import numpy as np
# Load the tokenizer.json file
import json
tokenizer_json = './assets/tokenizer.json'
tokenizer = Tokenizer.from_file(tokenizer_json)
testtext = '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>assistant\n'
# Test encoding and decoding
encoding = tokenizer.encode_batch(
[testtext],
add_special_tokens=True,
is_pretokenized=False,
)
print(f"Tokens: {encoding[0].tokens}")
print(f"Token IDs: {encoding[0].ids}")
decoded_text = tokenizer.decode(encoding[0].ids)
print(f"Decoded Text: {decoded_text}")
out = [102645, 69249, 100027, 114714, 101158, 44063, 104757, 45181, 39973,
8863, 99390, 105359, 17714, 58364, 8863, 99390, 104339, 3837,
99652, 104193, 104757, 100629, 61149, 33071, 1773, 102645, 69249,
100027, 114714, 110322, 17714, 48443, 78045, 282, 1155, 8,
284, 1124, 1242, 15159, 77, 28, 15, 92, 61,
35702, 258, 36958, 92, 272, 1089, 384, 47822, 72,
69761, 9139, 1124, 2533, 90919, 17767, 272, 1089, 1124,
8, 54851, 110589, 3837, 44292, 308, 1124, 8, 54851,
107586, 1773, 100431, 101909, 105172, 30280, 46100, 19793, 26355,
3837, 37029, 63, 35083, 63, 44956, 36407, 101884, 102645,
69249, 100027, 114714, 3407, 73594, 12669, 198, 474, 8591,
438, 2595, 271, 2, 41479, 248, 64559, 46944, 104757,
198, 83, 284, 2595, 38712, 7, 15, 11, 220,
17, 353, 2595, 24259, 11, 220, 16, 15, 15,
15, 340, 26622, 284, 2595, 16318, 1155, 692, 2,
33424, 94, 69103, 102645, 69249, 100027, 114714, 198, 69,
284, 2595, 79899, 79899, 56782, 692, 2, 220, 46485,
102645, 69249, 100027, 114714, 9370, 110589, 198, 48638, 28142,
284, 282, 58, 15, 2533, 1350, 67018, 28142, 340,
13874, 19324, 104596, 19793, 26355, 15946, 3837, 97639, 101140,
91282, 104059, 102298, 20450, 44292, 259, 1124, 8, 9370,
69824, 90395, 50377, 104059, 102298, 36556, 106514, 104757, 9370,
69824, 44292, 8286, 1124, 8, 1773, 101889, 3837, 97639,
37029, 63, 6199, 79899, 79899, 63, 32804, 100768, 102645,
69249, 100027, 114714, 1773, 100161, 3837, 97639, 107439, 102645,
69249, 100027, 114714, 9370, 110589, 62926, 102703, 99898, 3407,
104001, 107083, 46100, 3837, 102762, 101051, 46944, 102298, 64952,
101454, 110589, 9370, 69824, 3837, 103991, 102268, 99661, 106168,
107586, 9370, 102645, 69249, 100027, 114714, 110589, 1773, 151645]
out = np.array(out)
d_text = tokenizer.decode(out,skip_special_tokens=True)
print(d_text)
Comparing the two runs, the results match.
[Screenshot: pipeline output]
[Screenshot: tokenizer output]
Model export and inference test
Preface
Qwen is built on PyTorch, and PyTorch supports exporting directly to ONNX, so here I'll just brute-force export the base_model and see what happens, via the code below.
Note: if you try to export the whole model directly, the export fails because the output structure of some of its returns isn't supported. On top of that, some operations inside the Qwen2ForCausalLM inference layer make me worry about inconsistent outputs across different opsets, so to be safe I export the original qwen2 model instead (absolutely not because I couldn't fix the "Here, received an input of unsupported type: BatchEncoding" error, absolutely not.jpg).
Also, the model has to be exported as float32 data. The original model is bf16, and some ONNX operators, such as Pow, don't support bf16.
Exporting the Qwen core model
1. Convert the model to float32
The conversion code:
import torch
from transformers import AutoModelForCausalLM

model_name = "Qwen/Qwen2.5-Coder-0.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto",
local_files_only=True
)
def convert_bf16_fp16_to_fp32(model):
for param in model.parameters():
if param.dtype == torch.bfloat16 or param.dtype == torch.float16:
param.data = param.data.to(dtype=torch.float32)
for buffer in model.buffers():
if buffer.dtype == torch.bfloat16 or buffer.dtype == torch.float16:
buffer.data = buffer.data.to(dtype=torch.float32)
return model
# The qwen2 base model to export
tmodel = model.model.base_model
# The lm_head to export (it lives on the CausalLM wrapper, not on the inner Qwen2Model)
llm_model = model.lm_head
tmodel = convert_bf16_fp16_to_fp32(tmodel)
llm_model = convert_bf16_fp16_to_fp32(llm_model)
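Before exporting, it's worth a quick dtype check to confirm the conversion really took effect (a minimal sketch I added, not part of the original flow):
# Sanity check: nothing should be left in bf16/fp16 after the conversion
leftover = [n for n, p in tmodel.named_parameters() if p.dtype in (torch.bfloat16, torch.float16)]
leftover += [n for n, b in tmodel.named_buffers() if b.dtype in (torch.bfloat16, torch.float16)]
print("leftover low-precision tensors:", leftover)  # expected: []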
2. Export the model
First, a small modification to the original model code is needed.
Find Qwen2Model under qwen2's modeling_qwen2 and change the default values of the forward arguments, turning them all off with False:
@add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = False,
output_attentions: Optional[bool] = False,
output_hidden_states: Optional[bool] = False,
return_dict: Optional[bool] = False,
) -> Union[Tuple, BaseModelOutputWithPast]:
Then grab the original input data and use torch.onnx to export the models that were converted to float32 above.
# Obtain the model input data
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# Get the three inputs required by the base model
input_ids = model_inputs.data['input_ids']
attention_mask = model_inputs.data['attention_mask']
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
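For intuition (a toy example I added, not from the original snippet): with left padding, the cumsum trick starts counting positions at the first real token and fills the padded slots with a harmless 1:
# Toy illustration of the position_ids construction above (separate names to avoid clobbering the real tensors)
toy_mask = torch.tensor([[0, 0, 1, 1, 1]])
toy_position_ids = toy_mask.long().cumsum(-1) - 1   # tensor([[-1, -1,  0,  1,  2]])
toy_position_ids.masked_fill_(toy_mask == 0, 1)     # tensor([[ 1,  1,  0,  1,  2]])
print(toy_position_ids)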
Finally, export the lm_head and the qwen2 model separately.
Export the qwen2 model:
# Input/output names matching the dynamic_axes below
input_names = ['input_ids', 'attention_mask', 'position_ids']
output_names = ['last_hidden_state']
torch.onnx.export(tmodel, (input_ids, attention_mask, position_ids),
                  input_names=input_names, output_names=output_names,
                  f="./onnx/model32.onnx",
                  dynamic_axes={'input_ids': [1], 'attention_mask': [1], 'position_ids': [1], 'last_hidden_state': [1]},
                  opset_version=20)
Export the lm_head:
# llm_head_input is a sample hidden-state tensor produced by the base model (its last_hidden_state output)
torch.onnx.export(llm_model, (llm_head_input,),
                  input_names=['input_0'], output_names=["logits"],
                  f="./onnx/llm_model.onnx",
                  dynamic_axes={'input_0': {1: "out_size"}})
Writing the model inference pipeline code
At this point both the fc layer (lm_head) and the model layer have had a first-pass export; next up is the pipeline itself.
The overall pipeline workflow was already walked through earlier, so it can be reused directly here.
Note! To get something working quickly I skip some code that "can wait for now", but much of that skipped code is there for error handling, so the pain will have to be paid back later.
The common part: 10 steps
1. Initialize the config file and hand over the initialization parameters manually; skipped (PS: not really skipped. In the original code these values are generated automatically by reading the config files; I'm being lazy and hard-coding them. Don't do this at home.)
The code is as follows:
pad_token_id = 151643
bos_token_id = 151643
eos_token_id = [151645,151643]
max_position_embeddings = 32768
# Whether a default max length exists
has_default_max_length = True
# Whether a default min length exists
has_default_min_length = True
# Maximum number of new tokens
max_new_tokens=512
# No minimum length restriction, so set it to 0
min_length = 0
temperature = 0.7
top_k = 20
top_p = 0.8
min_tokens_to_keep=1
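If you'd rather not hard-code these numbers, the same values can be read from the snapshot's generation_config.json (a sketch; the ./assets path and the presence of each field are my assumptions, with the hard-coded values above as fallbacks):
import json

# Assumed location: generation_config.json copied next to tokenizer.json under ./assets
with open("./assets/generation_config.json", 'r', encoding='utf-8') as f:
    gen_cfg = json.load(f)

pad_token_id = gen_cfg.get("pad_token_id", 151643)
eos_token_id = gen_cfg.get("eos_token_id", [151645, 151643])
temperature = gen_cfg.get("temperature", 0.7)
top_k = gen_cfg.get("top_k", 20)
top_p = gen_cfg.get("top_p", 0.8)
repetition_penalty = gen_cfg.get("repetition_penalty", 1.05)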
2. Create the corresponding logits_processor and stopping_criteria, along with eos_token_id and pad_token_id. This can't be skipped, so keep it; the code is copied out first:
# logits processors control the generation process
logits_processor = LogitsProcessorList()
# stopping criteria control when generation ends
stopping_criteria = StoppingCriteriaList()
# warpers reshape the distribution before sampling
warpers = LogitsProcessorList()
pad_token_id = 151643
bos_token_id = 151643
eos_token_id = [151645,151643]
max_position_embeddings = 32768
3. Input validation. Our input has no inputs_embeds, and _maybe_initialize_input_ids_for_generation only runs when the input is None, so it isn't called here and can be skipped for now.
4. Prepare the model's other input arguments. Before inference, this step decides which of use_cache, output_attentions, output_hidden_states and return_dict are True and which are False; we hard-coded them earlier, so it can be bypassed as well.
5. Prepare the "input_ids" used for autoregressive generation.
Since the qwen2 model is not an encoder-decoder model, this goes straight into the else branch, which is convenient; bypass it too.
6. Prepare "max_length" from the other stopping conditions, and verify that the input does not exceed the maximum input token length or fall below the minimum. After extraction, the usable code looks like this:
# Length of the current input
input_ids_length = input_ids.shape[1]
# Recompute the maximum length
max_len = input_ids_length + max_new_tokens
# No minimum length restriction, so set it to 0
min_length = 0
7. Confirm the run mode
The mode for this run is already determined and fixed to sample, so bypass this.
8. Prepare the distribution pre-processing samplers
Looking at Qwen's inference flow, only a single RepetitionPenaltyLogitsProcessor seems to be registered, and the logits_processor passed in by default is empty:
logits_processor = LogitsProcessorList()
repetition_penalty = 1.05
logits_processor.append(RepetitionPenaltyLogitsProcessor(repetition_penalty))
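For reference, what RepetitionPenaltyLogitsProcessor does is small enough to sketch in plain numpy (illustrative only, not the transformer_lite implementation): every token that already appears in the sequence gets its logit pushed down, so the model is less likely to repeat itself.
import numpy as np

def apply_repetition_penalty(input_ids, logits, penalty=1.05):
    # logits: float array of shape (1, vocab_size); input_ids: int array of shape (1, seq_len)
    for token_id in np.unique(input_ids):
        score = logits[0, token_id]
        # Negative logits are multiplied by the penalty, positive ones divided by it
        logits[0, token_id] = score * penalty if score < 0 else score / penalty
    return logits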
9. Prepare the stopping criteria
Going through the source, only these two criteria seem to be invoked, so for now I add just these two; if more show up later, I'll deal with them then.
stopping_criteria = StoppingCriteriaList()
eos_token_id = [151645,151643]
# Append the stopping criteria
stopping_criteria.append(MaxLengthCriteria(max_length=max_len,max_position_embeddings=max_position_embeddings))
stopping_criteria.append(EosTokenCriteria(eos_token_id))
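Conceptually these two criteria are tiny: MaxLengthCriteria fires once the sequence reaches max_len, and EosTokenCriteria fires once the last generated token is one of the EOS ids. A rough numpy equivalent, just for illustration:
import numpy as np

def should_stop(input_ids, max_len, eos_token_id):
    # input_ids: int array of shape (batch, seq_len)
    too_long = input_ids.shape[1] >= max_len
    hit_eos = np.isin(input_ids[:, -1], eos_token_id)
    return too_long | hit_eos   # per-sequence boolean, like the criteria lists return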
10. Start model inference; what follows are the three private steps.
The private part: 3 steps
11. Prepare the logits warpers
The naming threw me off a bit at first, but after reading the source underneath, what's left is roughly this:
warpers = LogitsProcessorList()
temperature = 0.7
top_k = 20
top_p = 0.8
min_tokens_to_keep=1
warpers.append(TemperatureLogitsWarper(temperature))
warpers.append(TopKLogitsWarper(top_k=top_k,min_tokens_to_keep=min_tokens_to_keep))
warpers.append(TopPLogitsWarper(top_p=top_p,min_tokens_to_keep=min_tokens_to_keep))
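The three warpers only reshape the distribution before sampling. A compact numpy sketch of the same logic (illustrative only; the pipeline below keeps using the transformer_lite classes):
import numpy as np

def warp_logits(logits, temperature=0.7, top_k=20, top_p=0.8):
    # logits: float array of shape (batch, vocab_size)
    # Temperature: sharpen (<1) or flatten (>1) the distribution
    logits = logits / temperature
    # Top-k: keep only the k largest logits per row
    kth_value = np.sort(logits, axis=-1)[:, -top_k][:, None]
    logits = np.where(logits < kth_value, -np.inf, logits)
    # Top-p: drop the low-probability tail once the cumulative mass of better tokens exceeds top_p
    sorted_idx = np.argsort(logits, axis=-1)[:, ::-1]
    sorted_logits = np.take_along_axis(logits, sorted_idx, axis=-1)
    probs = np.exp(sorted_logits - sorted_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    cumulative = np.cumsum(probs, axis=-1)
    remove = cumulative - probs > top_p          # the best token is always kept (min_tokens_to_keep=1)
    sorted_logits = np.where(remove, -np.inf, sorted_logits)
    # Scatter the filtered logits back into their original positions
    out = np.full_like(logits, -np.inf)
    np.put_along_axis(out, sorted_idx, sorted_logits, axis=-1)
    return out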
12. Process input_ids a second time
Here only this one line of code seems to actually matter; nothing else is used, so keeping just this line is enough:
expand_size=1
input_ids = input_ids.repeat_interleave(expand_size, dim=0)
13. Call the inference function and run the model
I've cut things down pretty hard here, basically dismantling the car into a bicycle; the whole thing was done with an "as long as it runs" mindset.
Breaking down the methods inside sample
This is where the model actually gets called: the input data is pre-processed with prepare_inputs_for_generation, the processed inputs are fed to the model, and the lm_head is then applied to get the final output (a toy call of the method is shown right after it).
# Pre-process the qwen2 input data
def prepare_inputs_for_generation(self, input_ids,position_ids=None, attention_mask=None, past_key_values=None, inputs_embeds=None,seen_tokens=0, **kwargs):
if attention_mask is None:
attention_mask = np.ones_like(input_ids,dtype=np.int64)
past_length = seen_tokens
# Keep only the unprocessed tokens:
# 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
# some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
# input)
if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
# 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
# input_ids based on the past_length.
elif past_length < input_ids.shape[1]:
input_ids = input_ids[:, past_length:]
if attention_mask is not None and position_ids is None:
# create position_ids on the fly for batch generation
position_ids = np.cumsum(attention_mask,axis=-1).astype(np.int64)-1
position_ids = np.where(attention_mask==0,1,position_ids)
if past_key_values:
position_ids = position_ids[:, -input_ids.shape[1] :]
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
if inputs_embeds is not None and past_key_values is None:
model_inputs = {"inputs_embeds": inputs_embeds}
else:
model_inputs = {"input_ids": input_ids}
model_inputs.update(
{
"position_ids": position_ids,
"past_key_values": past_key_values,
"use_cache": kwargs.get("use_cache"),
"attention_mask": attention_mask,
}
)
return model_inputs
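Treating the method above as a standalone function (self is unused when there is no cache), a toy call shows exactly what gets handed to the ONNX session on the first step:
import numpy as np

# Toy call: no cache and no explicit mask, so everything is derived from input_ids
dummy_ids = np.array([[151644, 8948, 198]], dtype=np.int64)
inputs = prepare_inputs_for_generation(None, dummy_ids)   # None stands in for the unused self
print(inputs["input_ids"])       # [[151644 8948 198]]
print(inputs["attention_mask"])  # [[1 1 1]]
print(inputs["position_ids"])    # [[0 1 2]]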
The code for the run part:
def runForCausalLM(self ,input_ids):
input_names = [input.name for input in self.model.get_inputs()]
inputs = self.prepare_inputs_for_generation(input_ids,None,None,None)
# onnxruntime only accepts feed names declared by the graph, so keep just those inputs
inputs = {name: inputs[name] for name in input_names}
output_names = [output.name for output in self.model.get_outputs()]
outputs=self.model.run(output_names,inputs)
hidden_states = outputs[0]
llm_inputs = {"input_0": hidden_states}
llm_output_names = [output.name for output in self.lm_head.get_outputs()]
logits = self.lm_head.run(llm_output_names,llm_inputs)
logits = logits[0].astype(np.float32)
return logits
The method above is then driven in a while loop, which terminates once the specified conditions are hit.
while(not this_peer_finished):
logits = self.runForCausalLM(input_ids)
input_ids = torch.from_numpy(input_ids)
logits = torch.from_numpy(logits)
next_token_logits = logits[:,-1,:]
next_token_scores = self.logits_processor(input_ids, next_token_logits)
next_token_scores = self.logits_warper(input_ids, next_token_scores)
next_token_scores = next_token_scores.numpy()
probs = self.softmax(next_token_scores,-1)
next_tokens = self.multinomial_numpy(probs,1)
next_tokens = torch.from_numpy(next_tokens)
if self.eos_token_id is not None:
next_tokens = next_tokens * unfinished_sequences + self.pad_token_id * (1 - unfinished_sequences)
# Update input_ids: concatenate the newly generated token onto the existing ids
ntoken = next_tokens[:, None]
input_ids = np.concatenate([input_ids, ntoken], axis=-1)
input_ids = torch.from_numpy(input_ids)
if not (scores is None):
scores = torch.from_numpy(scores)
unfinished_sequences = unfinished_sequences & ~self.stopping_criteria(input_ids, scores)
this_peer_finished = unfinished_sequences.max() == 0
input_ids = input_ids.numpy()
With the above in place, a full round of inference can be completed.
Complete pipeline code
Input handling part
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers import AddedToken
from tokenizers.processors import TemplateProcessing
import numpy as np
import jinja2
from jinja2.exceptions import TemplateError
from jinja2.sandbox import ImmutableSandboxedEnvironment
from testrun import QwenMoelRun
# Load the tokenizer_config.json file
import json
model = QwenMoelRun()
# Config items
tokenizer_json = './assets/tokenizer.json'
config_path = "./assets/tokenizer_config.json"
add_generation_prompt = True
# Test data
prompt= "请帮我写一个傅里叶变化公式,并使用python代码简单复现一下"
testtext = '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n<|im_start|>assistant\n'
messages = [
{"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.","user":"yinjun"},
{"role": "user", "content": prompt,"user":"yinjun"}
]
# Load the tokenizer_config file
with open(config_path, 'r', encoding='utf-8') as f:
config = json.load(f)
chat_template = config['chat_template']
def raise_exception(message):
raise TemplateError(message)
jinja_env = ImmutableSandboxedEnvironment(trim_blocks=True, lstrip_blocks=True)
jinja_env.globals["raise_exception"] = raise_exception
compiled_template = jinja_env.from_string(chat_template)
rendered=[]
conversations = [messages]
template_kwargs = {
'eos_token' : config['eos_token'],
'pad_token' : config['pad_token'],
'additional_special_tokens':config['additional_special_tokens']
}
for chat in conversations:
if hasattr(chat, "messages"):
# Indicates it's a Conversation object
chat = chat.messages
rendered_chat = compiled_template.render(
messages=chat, add_generation_prompt=add_generation_prompt, **template_kwargs
)
rendered.append(rendered_chat)
# Pre-process the input
tokenizer = Tokenizer.from_file(tokenizer_json)
# Test encoding and decoding
encoding = tokenizer.encode_batch(
rendered,
add_special_tokens=True,
is_pretokenized=False,
)
print(f"Tokens: {encoding[0].tokens}")
print(f"Token IDs: {encoding[0].ids}")
input_ids = encoding[0].ids
input_ids = np.array(input_ids,np.int64)
output = model.generate(input_ids)
decoded_text = tokenizer.decode(encoding[0].ids)
print(f"Decoded Text: {decoded_text}")
out = output[0]
d_text = tokenizer.decode(out,skip_special_tokens=True)
print(d_text)
Model inference part
from transformer_lite.generation.logits_process import LogitsProcessorList,RepetitionPenaltyLogitsProcessor,TopKLogitsWarper,TopPLogitsWarper,TemperatureLogitsWarper
from transformer_lite.generation.stopping_criteria import StoppingCriteriaList,MaxLengthCriteria,EosTokenCriteria
import numpy as np
import onnxruntime as ort
import onnx
import torch
class QwenMoelRun():
def __init__(self):
# Load the model parameters
self.pad_token_id = 151643
self.bos_token_id = 151643
self.eos_token_id = [151645,151643]
self.max_position_embeddings=32768
# Whether a default max length exists
self.has_default_max_length = True
# Whether a default min length exists
self.has_default_min_length = True
# Maximum number of new tokens
self.max_new_tokens=512
# No minimum length restriction, so set it to 0
self.min_length = 0
# Paths to the model files
model = "qwen2-code-0.5b"
model_type="onnx"
self.model_path ="./" + model + "/" + model_type + "/" + "model32.onnx"
self.lm_model_path ="./" + model + "/" + model_type + "/" + "lm_model32.onnx"
session_options = ort.SessionOptions()
session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL
# session_options.enable_cuda_graph = True  # enable CUDA graph optimization if needed
# session_options.gpu_id = 0  # use GPU 0
# One linear layer (the lm_head) could not be exported along with the rest, so it is loaded and used separately here
self.lm_head = ort.InferenceSession(self.lm_model_path, sess_options=session_options, providers=['CUDAExecutionProvider'])
# Load the model itself
print("Model is valid and supported by the current ONNX Runtime.")
self.model = ort.InferenceSession(self.model_path, sess_options=session_options, providers=['CUDAExecutionProvider'])
# Pre-process the qwen2 input data
def prepare_inputs_for_generation(self, input_ids,position_ids=None, attention_mask=None, past_key_values=None, inputs_embeds=None,seen_tokens=0, **kwargs):
if attention_mask is None:
attention_mask = np.ones_like(input_ids,dtype=np.int64)
past_length = seen_tokens
# Keep only the unprocessed tokens:
# 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
# some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
# input)
if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
# 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
# input_ids based on the past_length.
elif past_length < input_ids.shape[1]:
input_ids = input_ids[:, past_length:]
if attention_mask is not None and position_ids is None:
# create position_ids on the fly for batch generation
position_ids = np.cumsum(attention_mask,axis=-1).astype(np.int64)-1
position_ids = np.where(attention_mask==0,1,position_ids)
if past_key_values:
position_ids = position_ids[:, -input_ids.shape[1] :]
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
if inputs_embeds is not None and past_key_values is None:
model_inputs = {"inputs_embeds": inputs_embeds}
else:
model_inputs = {"input_ids": input_ids}
model_inputs.update(
{
"position_ids": position_ids,
"past_key_values": past_key_values,
"use_cache": kwargs.get("use_cache"),
"attention_mask": attention_mask,
}
)
return model_inputs
def softmax(self,x,axis=-1):
# Take the maximum of the input, keeping its original dimensions (for numerical stability)
x_max = np.max(x, axis=axis, keepdims=True)
# Exponentiate the shifted values
e_x = np.exp(x - x_max)
# Sum of the exponentials, keeping the original dimensions
return e_x / np.sum(e_x, axis=axis, keepdims=True)
def multinomial_numpy(self,probs, num_samples=1):
# Get the batch size and vocabulary size
batch_size, vocab_size = probs.shape
# Initialize the result array
next_tokens = np.zeros(batch_size, dtype=np.int64)
# Sample each row separately
for i in range(batch_size):
# Draw from the probability distribution with numpy.random.choice
next_tokens[i] = np.random.choice(vocab_size, size=num_samples, p=probs[i], replace=True)
return next_tokens.squeeze() # with num_samples=1 this drops the extra dimension
# This mimics the ForCausalLM forward call
def runForCausalLM(self ,input_ids):
input_names = [input.name for input in self.model.get_inputs()]
inputs = self.prepare_inputs_for_generation(input_ids,None,None,None)
# onnxruntime only accepts feed names declared by the graph, so keep just those inputs
inputs = {name: inputs[name] for name in input_names}
output_names = [output.name for output in self.model.get_outputs()]
outputs=self.model.run(output_names,inputs)
hidden_states = outputs[0]
llm_inputs = {"input_0": hidden_states}
llm_output_names = [output.name for output in self.lm_head.get_outputs()]
logits = self.lm_head.run(llm_output_names,llm_inputs)
logits = logits[0].astype(np.float32)
return logits
def generate(self,input_ids):
input_ids = np.array(input_ids)
input_ids = np.reshape(input_ids,[1,input_ids.shape[0]])
# Length of the current input
input_ids_length = input_ids.shape[1]
# Recompute the maximum length
max_len = input_ids_length + self.max_new_tokens
# Check whether the input is already too long
if input_ids_length >= max_len:
input_ids_string = "input_ids"
raise ValueError(
f"Input length of {input_ids_string} is {input_ids_length}, but `max_length` is set to"
f" {max_len}. This can lead to unexpected behavior. You should consider"
" increasing `max_length` or, better yet, setting `max_new_tokens`."
)
# logits processors control the generation process
self.logits_processor = LogitsProcessorList()
# stopping criteria control when generation ends
self.stopping_criteria = StoppingCriteriaList()
# the warpers adjust the randomness of the generated text
self.logits_warper = LogitsProcessorList()
temperature = 0.7
top_k = 20
top_p = 0.8
min_tokens_to_keep=1
# Add RepetitionPenaltyLogitsProcessor
repetition_penalty = 1.05
# Append the generation processor
self.logits_processor.append(RepetitionPenaltyLogitsProcessor(repetition_penalty))
# Append the stopping criteria
self.stopping_criteria.append(MaxLengthCriteria(max_length=max_len,max_position_embeddings=self.max_position_embeddings))
self.stopping_criteria.append(EosTokenCriteria(self.eos_token_id))
# Append the logits warpers
self.logits_warper.append(TemperatureLogitsWarper(temperature))
self.logits_warper.append(TopKLogitsWarper(top_k=top_k,min_tokens_to_keep=min_tokens_to_keep))
self.logits_warper.append(TopPLogitsWarper(top_p=top_p,min_tokens_to_keep=min_tokens_to_keep))
expand_size=1
input_ids = np.repeat(input_ids,expand_size, 0)
batch_size, seq_length = input_ids.shape
unfinished_sequences = np.ones(batch_size, dtype=np.int64)
unfinished_sequences = torch.from_numpy(unfinished_sequences)
this_peer_finished = False
# This mimics the while loop in sample that is driven by the _has_unfinished_sequences method
scores = None
first = True
while(not this_peer_finished):
logits = self.runForCausalLM(input_ids)
input_ids = torch.from_numpy(input_ids)
logits = torch.from_numpy(logits)
next_token_logits = logits[:,-1,:]
next_token_scores = self.logits_processor(input_ids, next_token_logits)
next_token_scores = self.logits_warper(input_ids, next_token_scores)
next_token_scores = next_token_scores.numpy()
probs = self.softmax(next_token_scores,-1)
next_tokens = self.multinomial_numpy(probs,1)
next_tokens = torch.from_numpy(next_tokens)
if self.eos_token_id is not None:
next_tokens = next_tokens * unfinished_sequences + self.pad_token_id * (1 - unfinished_sequences)
# Update input_ids: concatenate the newly generated token onto the existing ids
ntoken = next_tokens[:, None]
input_ids = np.concatenate([input_ids, ntoken], axis=-1)
input_ids = torch.from_numpy(input_ids)
if not (scores is None):
scores = torch.from_numpy(scores)
unfinished_sequences = unfinished_sequences & ~self.stopping_criteria(input_ids, scores)
this_peer_finished = unfinished_sequences.max() == 0
input_ids = input_ids.numpy()
# Inference ends here; move on to the next step
return input_ids
if __name__ == "__main__":
qwen = QwenMoelRun()