1. How to download an open-source model and run inference (using vicuna-7b as an example):
- [Method 1] Download the source files from the website (https://huggingface.co/lmsys/vicuna-7b-v1.5/tree/main). Note that you need to download all of the files; they normally include the LLM weights as well as the tokenizer-related files. Upload everything to the server and keep it in a single folder.
- [Method 2] Alternatively, run the code below locally to download the model and then upload it to the server. Note that on Windows this downloads the model and related files to C:\Users\<username>\.cache\huggingface\hub\models--lmsys--vicuna-7b-v1.5, which contains several subfolders; the model files are under C:\Users\<username>\.cache\huggingface\hub\models--lmsys--vicuna-7b-v1.5\snapshots\de56c35b1763eaae20f4d60efd64af0a9091ebe5. (A sketch using save_pretrained, which writes the standard Hugging Face layout, follows at the end of this item.)
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")
torch.save(model, "D:/Documemts/experiments/huggingface_download/downloaded_models/vicuna-7b-v1.5-model.pt")
torch.save(tokenizer, "D:/Documemts/experiments/huggingface_download/downloaded_models/vicuna-7b-v1.5-tokenizer.pt")  # save the tokenizer (not the model) to the tokenizer file
- Load the model on the server and run inference:
from transformers import pipeline
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model
device = "cuda:7"
model = AutoModelForCausalLM.from_pretrained("/checkpoints/models--lmsys--vicuna-7b-v1.5/snapshots/de56c35b1763eaae20f4d60efd64af0a9091ebe5")
model.half()
model.to(device)
print('Model loaded')

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("/checkpoints/models--lmsys--vicuna-7b-v1.5/snapshots/de56c35b1763eaae20f4d60efd64af0a9091ebe5", use_fast=False)
print('Tokenizer loaded')

# Create the pipeline
device_id = int(device.split(":")[-1])
gen = pipeline("text-generation", model=model, tokenizer=tokenizer, device=device_id)

# Set the generation parameters when generating text with the pipeline
prompt = "If a movie review is positive, you need to output '111'. If a movie review is negative, you need to output '222'. Movie review: the movie is lovely and excellent."
generated_texts = gen(prompt, max_length=50, temperature=0.4, top_k=50, top_p=0.95, repetition_penalty=1.2, length_penalty=1.0, num_return_sequences=1)

# Print the generated text
print("##############Model output###############")
for generated_text in generated_texts:
    print(generated_text["generated_text"])
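Regarding [Method 2] above: instead of torch.save, you can also write the files out in the standard Hugging Face folder layout, which from_pretrained on the server can load directly. A minimal sketch (the target directory save_dir is my own placeholder, not from the original notes):
from transformers import AutoTokenizer, AutoModelForCausalLM

save_dir = "D:/downloads/vicuna-7b-v1.5"  # hypothetical local folder; pick any path you like
tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")
# save_pretrained writes the config, weights, and tokenizer files into one folder;
# upload that folder to the server and load it with from_pretrained(save_dir)
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)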
2. If the model is too large to fit on a single GPU, consider using model.half() to cast the weights to fp16:
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the model, cast it to half precision, and move it to the target GPU
device="cuda:7"
model = AutoModelForCausalLM.from_pretrained("/path/to/your/model")
model.half()
model.to(device)
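An equivalent variant (my own suggestion, not from the original notes): pass torch_dtype=torch.float16 to from_pretrained so the weights are loaded in half precision from the start instead of being cast afterwards:
import torch
from transformers import AutoModelForCausalLM

device = "cuda:7"
# load the checkpoint directly in fp16 rather than calling .half() after loading
model = AutoModelForCausalLM.from_pretrained("/path/to/your/model", torch_dtype=torch.float16)
model.to(device)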
3. ValueError: Couldn’t instantiate the backend tokenizer from one of: (1) a tokenizers library serialization file, (2) a slow tokenizer instance to convert or (3) an equivalent slow tokenizer class to instantiate and convert. You need to have sentencepiece installed to convert a slow tokenizer to a fast one.
Error description: loading the tokenizer with the following code fails:
tokenizer = AutoTokenizer.from_pretrained("/checkpoints/models--lmsys--vicuna-7b-v1.5")
Fix: pass the use_fast=False argument to AutoTokenizer.from_pretrained (installing sentencepiece, as the error message suggests, also resolves it):
tokenizer = AutoTokenizer.from_pretrained("/checkpoints/models--lmsys--vicuna-7b-v1.5", use_fast=False)
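A quick sanity check (my own addition, reusing the tokenizer just loaded) that the slow tokenizer encodes and decodes text as expected:
# encode and decode a short sentence to confirm the slow tokenizer works
ids = tokenizer("the movie is lovely and excellent.")["input_ids"]
print(len(ids), tokenizer.decode(ids))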
4. Installation error: symbol cublasLtGetStatusString version libcublasLt.so.11 not defined
Error description: the error is raised when running from transformers import pipeline.
Fix: delete the following folder inside your virtual environment: my_env (your virtual environment's name)/lib/python3.10/site-packages/nvidia/cublas
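If you prefer to do the cleanup from Python, a minimal sketch (the path is hypothetical; adjust the environment location and Python version to your setup):
import shutil
# remove the bundled cuBLAS folder from the virtual environment (example path)
shutil.rmtree("/path/to/my_env/lib/python3.10/site-packages/nvidia/cublas", ignore_errors=True)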
5. A small trick for more controllable output
Sometimes the prompt is too complex for the model to understand, and it may produce no output. In that case you can use the min_length parameter to force the model to generate text of at least a certain length. Note that min_length refers to the total number of tokens in the input prompt plus the generated text, so to set a lower bound on the generated text alone you have to add the input length. In the code below, input_ids.shape[1] is the token length of the input text after tokenization; input_ids has shape [bsz, max_token_length].
[See the comments in the code below for what the other parameters do.]
# Tokenize the input prompts
encoded_inputs = tokenizer(prompts, padding=True, return_tensors="pt", truncation=True, max_length=512)
input_ids = encoded_inputs["input_ids"].to(device)  # shape: [bsz, max_token_length]
# Model inference
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=input_ids,
        max_length=input_ids.shape[1] + args.max_length,  # total token length of the prompt plus the generated text
        min_length=input_ids.shape[1] + args.min_length,
        temperature=0.5,          # higher values give more random output (sampling parameters only take effect when do_sample=True)
        top_k=1,                  # sample the next token from the k most likely candidates
        top_p=0.85,               # nucleus sampling: keep only the smallest set of tokens whose cumulative probability exceeds top_p
        repetition_penalty=1.2,   # larger values make repeated content less likely
        length_penalty=1.0,       # larger values make the end-of-sequence token less likely, favouring longer outputs (applies to beam search)
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id  # use EOS as the padding token (needed for batched generation when the tokenizer has no pad token)
    )
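To get readable text back from generated_ids, a short follow-up sketch (my own addition, reusing the variables above); the slicing drops the prompt tokens so only the newly generated continuation is printed:
# strip the prompt tokens and decode only the newly generated part
new_tokens = generated_ids[:, input_ids.shape[1]:]
for text in tokenizer.batch_decode(new_tokens, skip_special_tokens=True):
    print(text)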