环境简介
Ubuntu20.04、
NVIDIA-SMI 545.29.06、
Cuda 11.4、
python3.10、
pytorch1.11.0
开始搭建
python环境设置
创建虚拟环境
conda create --name qewn python==3.10
预安装modelscope和transformers
pip install modelscope
pip install transformers
安装pytorch
conda install pytorch==1.11.0 torchvision==0.12.0 torchaudio==0.11.0 cudatoolkit=11.3
模型需要下载
创建一个python文件
gedit download.py
里面复制如下内容
from modelscope.hub.file_download import model_file_download
model_dir = model_file_download(model_id='qwen/Qwen1.5-0.5B-Chat-GGUF',file_path='qwen1_5-0_5b-chat-q5_k_m.gguf',revision='master',cache_dir='path/to/local/dir')
运行python文件进行下载
python download.py
下载llama.cpp
使⽤git命令克隆llama.cpp项⽬
git clone https://github.com/ggerganov/llama.cpp
克隆完成之后我们进入llama.cpp目录中,对项目进行编译
cd llama.cpp
make -j
模型下载
在魔搭社区中下载模型运行
https://www.modelscope.cn/models/qwen/Qwen1.5-0.5B-Chat-GGUF/files
本人下载的是qwen1_5-0_5b-chat-q5_k_m.gguf
终端运行,其中模型替换为自己的模型地址(官方给的-cml参数在help中没有找到,且影响运行,所以我删除掉了)
官方:
./main -m /path/to/local/dir/qwen/Qwen1.5-0.5B-Chat-GGUF/qwen1_5-0_5b-chat-q5_k_m.gguf -n 512 --color -i -cml -f prompts/chat-with-qwen.txt
我运行:
./main -m /path/to/local/dir/qwen/Qwen1.5-0.5B-Chat-GGUF/qwen1_5-0_5b-chat-q5_k_m.gguf -n 512 --color -i -f prompts/chat-with-qwen.txt
help内容:
usage: ./main [options]
general:
-h, --help, --usage print usage and exit
--version show version and build info
-v, --verbose print verbose information
--verbosity N set specific verbosity level (default: 0)
--verbose-prompt print a verbose prompt before generation (default: false)
--no-display-prompt don't print prompt at generation (default: false)
-co, --color colorise output to distinguish prompt and user input from generations (default: false)
-s, --seed SEED RNG seed (default: -1, use random seed for < 0)
-t, --threads N number of threads to use during generation (default: 8)
-tb, --threads-batch N number of threads to use during batch and prompt processing (default: same as --threads)
-td, --threads-draft N number of threads to use during generation (default: same as --threads)
-tbd, --threads-batch-draft N number of threads to use during batch and prompt processing (default: same as --threads-draft)
--draft N number of tokens to draft for speculative decoding (default: 5)
-ps, --p-split N speculative decoding split probability (default: 0.1)
-lcs, --lookup-cache-static FNAME
path to static lookup cache to use for lookup decoding (not updated by generation)
-lcd, --lookup-cache-dynamic FNAME
path to dynamic lookup cache to use for lookup decoding (updated by generation)
-c, --ctx-size N size of the prompt context (default: 0, 0 = loaded from model)
-n, --predict N number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
-b, --batch-size N logical maximum batch size (default: 2048)
-ub, --ubatch-size N physical maximum batch size (default: 512)
--keep N number of tokens to keep from the initial prompt (default: 0, -1 = all)
--chunks N max number of chunks to process (default: -1, -1 = all)
-fa, --flash-attn enable Flash Attention (default: disabled)
-p, --prompt PROMPT prompt to start generation with (default: '')
-f, --file FNAME a file containing the prompt (default: none)
--in-file FNAME an input file (repeat to specify multiple files)
-bf, --binary-file FNAME binary file containing the prompt (default: none)
-e, --escape process escapes sequences