LangChain usually calls large language models through an API interface, either against the vendor's official server or against a local server such as one set up with Ollama. For calling ModelScope's official service from LangChain, see ModelScope | 🦜️🔗 LangChain.
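As a side note, here is a minimal sketch of the API route (not used in the rest of this post): if the model sits behind an OpenAI-compatible endpoint, LangChain can reach it through langchain-openai. The base_url, api_key, and model name below are placeholders for your own service's values.
from langchain_openai import ChatOpenAI
# Point the client at any OpenAI-compatible endpoint (hosted service, vLLM, Ollama, ...).
remote_llm = ChatOpenAI(
    base_url="http://localhost:8000/v1",  # placeholder endpoint
    api_key="EMPTY",                      # placeholder key
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B",
)
print(remote_llm.invoke("Hello").content)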
In this post, however, we call a DeepSeek model downloaded from ModelScope and run it locally. Because the local GPU has limited compute, we use the 1.5B Qwen model distilled from DeepSeek-R1: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B.
snapshot_download(llm_model_name, cache_dir=cache_apath) downloads the model into the local cache directory and returns the model's local path.
We need to design our own class (with LLM as the base class) that implements at least three methods: __init__, _call, and _llm_type.
__init__: load the tokenizer and the LLM itself
_call: tokenize the message, feed it to the model to generate a response, and decode the generated tokens
_llm_type: return an identifying name
import os
from langchain_core.prompts import PromptTemplate
from langchain.llms.base import LLM  # on newer LangChain versions: from langchain_core.language_models.llms import LLM
from typing import Any, List, Optional
from modelscope import snapshot_download
from langchain.callbacks.manager import CallbackManagerForLLMRun
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch
llm_model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
cache_apath = os.path.join(os.getcwd(), 'cache')
# First run: download the model into the local cache and get its path back.
# llm_model_path = snapshot_download(llm_model_name, cache_dir=cache_apath)
# Later runs: rebuild the cached path directly (ModelScope replaces "." with "___" in directory names).
llm_model_path = os.path.join(cache_apath, *llm_model_name.replace(".", "___").split('/'))
class DeepSeekQwenLLM(LLM):
    tokenizer: AutoTokenizer = None
    model: AutoModelForCausalLM = None

    def __init__(self, model_path: str):
        super().__init__()
        print("Loading model from local path...")
        # Load the tokenizer and weights from the local ModelScope cache.
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
        self.model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16, device_map="auto")
        self.model.generation_config = GenerationConfig.from_pretrained(model_path)
        print("Model loading complete.")

    def _call(self, prompt: str, stop: Optional[List[str]] = None,
              run_manager: Optional[CallbackManagerForLLMRun] = None,
              **kwargs: Any) -> str:
        # Wrap the prompt with the model's chat template, then tokenize.
        messages = [{"role": "user", "content": prompt}]
        input_text = self.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        model_inputs = self.tokenizer([input_text], return_tensors="pt").to(self.model.device)
        output_ids = self.model.generate(model_inputs.input_ids, attention_mask=model_inputs.attention_mask,
                                         max_new_tokens=8192)
        # Strip the prompt tokens (by token count, not character count) so only the newly generated tokens are decoded.
        generated_ids = [ids[model_inputs.input_ids.shape[1]:] for ids in output_ids]
        response = self.tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
        return response

    @property
    def _llm_type(self) -> str:
        return "DeepSeekQwenLLM"
Then invoke it in the chain style (prompt | model):
llm_model = DeepSeekQwenLLM(llm_model_path)
template = '''
#背景信息#
你是一名知识丰富的导航助手,了解中国每一个地方的名胜古迹及旅游景点.
#问题#
游客:我是{本人的描述},我想去旅游,给我推荐{地方}十个值得玩的地方?
'''
prompt = PromptTemplate(
    input_variables=["本人的描述", "地方"],
    template=template
)
chain = prompt | llm_model
response = chain.invoke({"本人的描述": "外国的小学生,喜欢室外活动", "地方": "武汉"})
print(response)
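DeepSeek-R1 wraps its reasoning in <think>...</think> tags before the final answer, so the raw response can be split into a thinking part and an answer part: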
import re

def split_text(text):
    # Everything before </think> is the model's reasoning; everything after it is the answer.
    pattern = re.compile(r'(.*?)</think>(.*)', re.DOTALL)
    match = pattern.search(text)
    if match:
        think_content = match.group(1).strip()
        answer_content = match.group(2).strip()
    else:
        # No </think> tag found: treat the whole output as the answer.
        think_content = ""
        answer_content = text.strip()
    return think_content, answer_content
think, answer = split_text(response)
print(f"{' - '*20}思考{' - '*20}")
print(think)
print(f"{' - '*20}回答{' - '*20}")
print(answer)
One more note: Ollama already supports the major models very well, and LangChain ships a langchain-ollama package, so calling a model through Ollama is much less work than the manual approach above. LlamaIndex is also a good alternative.
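A rough sketch of that route, assuming the model has been pulled locally with "ollama pull deepseek-r1:1.5b" and the Ollama server is running (the model tag is an assumption; check your local ollama list):
from langchain_ollama import OllamaLLM
from langchain_core.prompts import PromptTemplate

# Ollama serves the model and applies its chat template; LangChain only sends the prompt.
ollama_llm = OllamaLLM(model="deepseek-r1:1.5b")  # assumed local model tag
chain = PromptTemplate.from_template("推荐{地方}十个值得玩的地方?") | ollama_llm
print(chain.invoke({"地方": "武汉"}))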