BigDL-LLM is a library for running LLMs (large language models) on Intel XPUs with INT4/FP4/INT8/FP8 precision at very low latency (it works with any PyTorch model).
1. Using BigDL-LLM on an Intel CPU
Installation
pip install --pre --upgrade bigdl-llm[all]
Run a model
# Load a Hugging Face Transformers model with INT4 optimizations
from bigdl.llm.transformers import AutoModelForCausalLM
model_path = '/path/to/model/'  # local path or Hugging Face repo id
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)
# Run the optimized model on the CPU
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
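The INT4 conversion happens every time the model is loaded this way; the quantized weights can be saved once and reloaded quickly later. A minimal sketch, assuming the save_low_bit/load_low_bit helpers of bigdl.llm.transformers (check that your bigdl-llm version provides them):
# Persist the INT4-converted weights and reload them without re-quantizing
from bigdl.llm.transformers import AutoModelForCausalLM
model.save_low_bit('/path/to/int4-model/')                          # save once after conversion
model = AutoModelForCausalLM.load_low_bit('/path/to/int4-model/')   # fast reload on later runs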
2. Using BigDL-LLM on an Intel GPU
2.1. Set up the development environment
2.1.1. Download Visual Studio Community and install the following components:
- C# and Visual Basic Roslyn compilers
- MSBuild
- Windows 11 SDK (10.0.22000.0)
- MSVC v143 - VS 2022 C++ x64/x86 build tools
- C++ CMake tools for Windows
2.1.2. Download the oneAPI Base Toolkit and install the following components:
- Intel DPC++ Compatibility Tool
- Intel Distribution for GDB
- Intel oneAPI DPC++ Library
- Intel oneAPI Threading Building Blocks
- Intel oneAPI DPC++/C++ Compiler
- Intel oneAPI Math Kernel Library
2.1.3. Download Miniconda
https://docs.conda.io/projects/miniconda/en/latest/
Create a virtual environment with the conda command:
conda create -n llm python=3.9 libuv
conda activate llm
2.2. Install BigDL-LLM
First install Intel® Extension for PyTorch* (IPEX) together with the matching PyTorch build.
Install directly with pip:
pip install torch==2.1.0a0 torchvision==0.16.0a0 torchaudio==2.1.0a0 intel-extension-for-pytorch==2.1.10 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/
Alternatively, download the .whl packages locally:
https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torch-2.1.0a0%2Bcxx11.abi-cp39-cp39-win_amd64.whl
https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torchvision-0.16.0a0%2Bcxx11.abi-cp39-cp39-win_amd64.whl
https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/intel_extension_for_pytorch-2.1.10%2Bxpu-cp39-cp39-win_amd64.whl
Then install the downloaded .whl packages from the local files:
pip install torch-2.1.0a0+cxx11.abi-cp39-cp39-win_amd64.whl
pip install torchvision-0.16.0a0+cxx11.abi-cp39-cp39-win_amd64.whl
pip install intel_extension_for_pytorch-2.1.10+xpu-cp39-cp39-win_amd64.whl
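Before continuing, it can be worth confirming that PyTorch and IPEX import cleanly and report the expected versions (a quick sanity check, not an official install step):
python -c "import torch; import intel_extension_for_pytorch as ipex; print(torch.__version__, ipex.__version__)"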
Install BigDL-LLM:
pip install --pre --upgrade bigdl-llm[xpu] -i https://mirrors.aliyun.com/pypi/simple/
# or
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
Configure the environment variables:
conda activate llm
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
set SYCL_ENABLE_DEFAULT_CONTEXTS=1
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
If the system also has an integrated GPU, run the following command to make sure the Intel discrete GPU is the compute device that "xpu" refers to.
set ONEAPI_DEVICE_SELECTOR=level_zero:1
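To confirm which devices are visible (and which index the discrete GPU has), oneAPI's sycl-ls command lists all SYCL devices. From Python, a quick check along these lines should work once IPEX is installed (torch.xpu is added by IPEX; get_device_name is assumed to be available in your IPEX version):
# Quick device check: run after setvars.bat and the variables above are set
import torch
import intel_extension_for_pytorch as ipex  # registers the 'xpu' device with PyTorch
print(torch.xpu.is_available())      # expect True on a working setup
print(torch.xpu.device_count())      # number of visible XPU devices
print(torch.xpu.get_device_name(0))  # should name the Intel discrete GPU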
2.3. Run the model
# Load a Hugging Face Transformers model with INT4 optimizations
import torch
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, optimize_model=True, load_in_4bit=True)
# Run the optimized model on the Intel GPU
model = model.bfloat16().eval().to('xpu')
model.model.embed_tokens.to('cpu')  # keep the embedding layer on the CPU
# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids = input_ids.to('xpu')
# Run inference and generate tokens
output = model.generate(input_ids, num_beams=1, do_sample=False, max_new_tokens=32)
# Decode the generated tokens and display the result
output_str = tokenizer.decode(output[0], skip_special_tokens=True)
print(output_str)
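If you want tokens to appear as they are generated instead of only after the whole sequence finishes, the Hugging Face transformers TextStreamer can be passed to generate. A small sketch reusing the model, tokenizer, and input_ids from above (greedy decoding, since streaming does not support beam search):
from transformers import TextStreamer
# Prints each decoded token to stdout as soon as it is generated
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(input_ids, streamer=streamer, num_beams=1, do_sample=False, max_new_tokens=32)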
3. Using BigDL-LLM to optimize Baichuan2
3.1. CPU optimization
Install (inside the conda environment):
pip install bigdl-llm[all] # install bigdl-llm with 'all' option
pip install transformers_stream_generator
Run:
import torch
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
# Load the Baichuan2 model with INT4 optimizations
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True, use_cache=True)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Generate predicted tokens
with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(input_ids, max_new_tokens=args.n_predict)
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
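The snippets above reference model_path, prompt, and args.n_predict without defining them; in example scripts of this kind they typically come from the command line. A hypothetical argparse sketch (these flag names and defaults are illustrative, not the official ones):
import argparse
parser = argparse.ArgumentParser(description='Run Baichuan2 with BigDL-LLM INT4 optimizations')
parser.add_argument('--repo-id-or-model-path', type=str, default='baichuan-inc/Baichuan2-7B-Chat')  # assumed default
parser.add_argument('--prompt', type=str, default='What is AI?')
parser.add_argument('--n-predict', type=int, default=32, help='max number of new tokens to generate')
args = parser.parse_args()
model_path = args.repo_id_or_model_path
prompt = args.prompt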
3.2. GPU optimization
Install (inside the conda environment):
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers_stream_generator
Configure the oneAPI environment variables:
Windows:
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
Linux:
source /opt/intel/oneapi/setvars.sh
For the best performance on Arc GPUs, it is recommended to set a few additional environment variables:
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
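If you would rather keep everything inside the Python script, the same variables can also be set with os.environ, as long as that happens before torch and IPEX are imported (the SYCL runtime reads them when it initializes). A sketch under that assumption:
import os
# Must be set before torch / intel_extension_for_pytorch are imported,
# because the SYCL runtime reads them at initialization time.
os.environ['USE_XETLA'] = 'OFF'
os.environ['SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS'] = '1'
import torch
import intel_extension_for_pytorch as ipex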
Run:
import torch
import intel_extension_for_pytorch as ipex
from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
# Load the Baichuan2 model with INT4 optimizations and move it to the GPU
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True, use_cache=True)
model = model.to('xpu')
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Generate predicted tokens
with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    input_ids = input_ids.to('xpu')
    # The ipex model needs a warm-up run before inference timings are accurate
    output = model.generate(input_ids, max_new_tokens=args.n_predict)
    # Actual inference
    output = model.generate(input_ids, max_new_tokens=args.n_predict)
    torch.xpu.synchronize()
    output = output.cpu()
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
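To actually see the gap between the warm-up pass (which includes one-time kernel compilation) and steady-state generation, wrap each call with a synchronized timer. A minimal sketch based on the code above (the 32-token budget is just illustrative):
import time
def timed_generate(tag):
    # Synchronize so the timer measures completed GPU work, not just kernel launches
    torch.xpu.synchronize()
    start = time.time()
    out = model.generate(input_ids, max_new_tokens=32)
    torch.xpu.synchronize()
    print(f'{tag}: {time.time() - start:.2f} s')
    return out
timed_generate('warm-up')              # noticeably slower: includes kernel compilation
output = timed_generate('inference')   # steady-state latency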
This article introduced the BigDL-LLM library, which runs large language models (LLMs) at low latency with INT4/FP4/INT8/FP8 on Intel CPUs and GPUs and is optimized for PyTorch models, and walked through installation, environment configuration, and model-running examples.