Optimizing Large Language Models with BigDL-LLM

This article introduces BigDL-LLM, a library that runs large language models (LLMs) with low latency on Intel CPUs and GPUs using INT4/FP4/INT8/FP8. It focuses on optimizations for PyTorch models and covers installation, environment configuration, and model-running examples.

BigDL-LLM is a library for running LLMs (large language models) on Intel XPUs using INT4/FP4/INT8/FP8 with very low latency (it works with any PyTorch model).

1. Using BigDL-LLM on Intel CPUs

Installation

pip install --pre --upgrade bigdl-llm[all]
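
To confirm the installation succeeded, a quick import check can be run (a minimal sketch; it only verifies that the package and its transformers integration are importable):

# Sanity check: verify that bigdl-llm's transformers integration imports
try:
    from bigdl.llm.transformers import AutoModelForCausalLM  # noqa: F401
    print("bigdl-llm is installed correctly")
except ImportError as e:
    print(f"bigdl-llm import failed: {e}")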

Run the model

# Load a Hugging Face Transformers model with INT4 optimizations
from bigdl.llm.transformers import AutoModelForCausalLM
# model_path is a local model directory or a Hugging Face model id
model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True)

# Run the optimized model on the CPU
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path)
input_ids = tokenizer.encode(input_str, ...)
output_ids = model.generate(input_ids, ...)
output = tokenizer.batch_decode(output_ids)
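
Note that load_in_4bit=True converts the weights to INT4 at load time, which can take a while for large checkpoints. BigDL-LLM's save_low_bit/load_low_bit APIs let you persist the converted weights so that later runs skip the conversion; a minimal sketch ('./model-int4' is an example path):

from bigdl.llm.transformers import AutoModelForCausalLM

# Save the already-converted INT4 weights once
model.save_low_bit('./model-int4')

# Later runs load the pre-converted weights directly, skipping quantization
model = AutoModelForCausalLM.load_low_bit('./model-int4')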

2. Using BigDL-LLM on Intel GPUs

2.1 Setting up the development environment

2.1.1 Download Visual Studio Community

https://visualstudio.microsoft.com/zh-hans/vs/community/

During installation, check the following components:

  • C# and Visual Basic Roslyn compilers
  • MSBuild
  • Windows 11 SDK (10.0.22000.0)
  • MSVC v143 - VS 2022 C++ x64/x86 build tools
  • C++ CMake tools for Windows
2.1.2 Download the oneAPI Base Toolkit

https://www.intel.cn/content/www/cn/zh/developer/tools/oneapi/base-toolkit-download.html

Check the following components:

  • Intel DPC++ Compatibility Tool
  • Intel Distribution for GDB
  • Intel oneAPI DPC++ Library
  • Intel oneAPI Threading Building Blocks
  • Intel oneAPI DPC++/C++ Compiler
  • Intel oneAPI Math Kernel Library
2.1.3 Download Miniconda

https://docs.conda.io/projects/miniconda/en/latest/

Create and activate a virtual environment with conda:

conda create -n llm python=3.9 libuv
conda activate llm

2.2 Installing BigDL-LLM

First install PyTorch and Intel® Extension for PyTorch (IPEX); see the Intel® Extension for PyTorch* documentation for details.

Install directly with pip:

pip install torch==2.1.0a0 torchvision==0.16.0a0 torchaudio==2.1.0a0 intel-extension-for-pytorch==2.1.10 --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/

Alternatively, download the .whl packages locally:

https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torch-2.1.0a0%2Bcxx11.abi-cp39-cp39-win_amd64.whl

https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/torchvision-0.16.0a0%2Bcxx11.abi-cp39-cp39-win_amd64.whl

https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/intel_extension_for_pytorch-2.1.10%2Bxpu-cp39-cp39-win_amd64.whl

Then install the downloaded .whl packages from the local files:

pip install torch-2.1.0a0+cxx11.abi-cp39-cp39-win_amd64.whl

pip install torchvision-0.16.0a0+cxx11.abi-cp39-cp39-win_amd64.whl

pip install intel_extension_for_pytorch-2.1.10+xpu-cp39-cp39-win_amd64.whl
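
After the wheels are installed, a short script can verify that PyTorch and IPEX import correctly and that an XPU device is visible (a minimal sketch; the XPU check requires the oneAPI environment variables described below to be set first):

import torch
import intel_extension_for_pytorch as ipex

print(torch.__version__)         # expect 2.1.0a0+cxx11.abi
print(ipex.__version__)          # expect 2.1.10+xpu
# Requires the oneAPI environment (setvars) to be initialized first
print(torch.xpu.is_available())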

Install BigDL-LLM

BigDL-LLM Installation: GPU — BigDL latest documentation: https://bigdl.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html

pip install --pre --upgrade bigdl-llm[xpu] -i https://mirrors.aliyun.com/pypi/simple/
# or
pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu

Configure environment variables

conda activate llm
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
set SYCL_ENABLE_DEFAULT_CONTEXTS=1
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1

If the system also has an integrated GPU, run the following command to make sure the Intel discrete GPU is the compute device that 'xpu' refers to:

set ONEAPI_DEVICE_SELECTOR=level_zero:1
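
To check which devices 'xpu' can see after this setting, you can enumerate them (a sketch; run it in the same shell where the variables above were set):

import torch
import intel_extension_for_pytorch as ipex

# List the XPU devices visible to PyTorch; with ONEAPI_DEVICE_SELECTOR set,
# only the selected discrete GPU should appear
for i in range(torch.xpu.device_count()):
    print(i, torch.xpu.get_device_name(i))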

2.3 Run the model

# Load a Hugging Face Transformers model with INT4 optimizations
import torch
import intel_extension_for_pytorch as ipex

from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, optimize_model=True, load_in_4bit=True)

# Run the optimized model on the Intel GPU
model = model.bfloat16().eval().to('xpu')
# Keep the embedding table on the CPU to reduce GPU memory usage
model.model.embed_tokens.to('cpu')

# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors="pt")
input_ids = input_ids.to('xpu')
# Run inference and generate tokens
output = model.generate(input_ids, num_beams=1, do_sample=False, max_new_tokens=32)
# Decode the generated tokens
output_str = tokenizer.decode(output[0], skip_special_tokens=True)

3. Optimizing Baichuan2 with BigDL-LLM

3.1 CPU optimization

Install with pip (inside the conda environment):

pip install bigdl-llm[all] # install bigdl-llm with 'all' option
pip install transformers_stream_generator

Run

import torch

from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True, use_cache=True)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Generate predicted tokens
with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output = model.generate(input_ids, max_new_tokens=args.n_predict)
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
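
For an interactive experience, the generated tokens can also be streamed to the console as they are produced. The sketch below uses the standard TextStreamer from Hugging Face transformers (not a BigDL-specific API) together with the model and tokenizer loaded above:

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # generate() decodes and prints tokens incrementally via the streamer
    model.generate(input_ids, max_new_tokens=64, streamer=streamer)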

3.2 GPU optimization

Install with pip (inside the conda environment):

pip install --pre --upgrade bigdl-llm[xpu] -f https://developer.intel.com/ipex-whl-stable-xpu
pip install transformers_stream_generator

Configure the oneAPI environment variables

Windows:

call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"

Linux:

source /opt/intel/oneapi/setvars.sh

For the best performance on Arc GPUs, it is recommended to set the following environment variables:

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

Run

import torch
import intel_extension_for_pytorch as ipex

from bigdl.llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(model_path, load_in_4bit=True, trust_remote_code=True, use_cache=True)
model = model.to('xpu')

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    
# Generate predicted tokens
with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    input_ids = input_ids.to('xpu')
    # The IPEX model needs a warmup run before inference time can be measured accurately
    output = model.generate(input_ids, max_new_tokens=args.n_predict)

    # start inference
    output = model.generate(input_ids, max_new_tokens=args.n_predict)
    torch.xpu.synchronize()
    output = output.cpu()
    output_str = tokenizer.decode(output[0], skip_special_tokens=True)
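
To see why the warmup run matters, you can time both calls (a minimal sketch; torch.xpu.synchronize() is required before reading the clock because XPU execution is asynchronous):

import time

with torch.inference_mode():
    # Warmup: the first generate() call includes one-time kernel compilation
    model.generate(input_ids, max_new_tokens=32)
    torch.xpu.synchronize()

    start = time.perf_counter()
    model.generate(input_ids, max_new_tokens=32)
    torch.xpu.synchronize()
    print(f"inference time: {time.perf_counter() - start:.2f} s")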
