上一篇文章记录了本地部署Qwen2大模型之一:Ollama方式部署的过程,本文继续详细记录通过vLLM在本地部署该大模型的过程。
安装vLLM
按照 参考文章:https://blog.youkuaiyun.com/sexy19910923/article/details/140164232 中的说明,首先运行pip指令安装vLLM:
$ pip install vllm
但是指令执行非常慢,实在等不了。问豆包怎么才能加快pip安装,回答说用国内镜像源,于是果断中止安装过程,改为添加国内pypi镜像源。镜像源可以在命令行中临时指定,也可以通过配置文件永久指定:永久指定的方式是在当前用户主目录下创建~/.pip/pip.conf文件(如果目录不存在就先创建),在文件中写入以下内容后保存:
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
添加国内镜像源后再执行安装指令,果然速度飞快。
$ pip install -i https://pypi.tuna.tsinghua.edu.cn/simple vllm
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Collecting vllm
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/c8/f4/e108a902ccad131d8978a9376343a6e95d78d0e12f152a796794647073ec/vllm-0.6.5-cp38-abi3-manylinux1_x86_64.whl (201.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 201.1/201.1 MB 9.2 MB/s eta 0:00:00
Requirement already satisfied: pyyaml in /usr/lib/python3/dist-packages (from vllm) (5.4.1)
Requirement already satisfied: psutil in /usr/lib/python3/dist-packages (from vllm) (5.9.0)
Collecting tokenizers>=0.19.1
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/22/06/69d7ce374747edaf1695a4f61b83570d91cc8bbfc51ccfecf76f56ab4aac/tokenizers-0.21.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.0/3.0 MB 21.8 MB/s eta 0:00:00
Collecting torchvision==0.20.1
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a2/f6/7ff89a9f8703f623f5664afd66c8600e3f09fe188e1e0b7e6f9a8617f865/torchvision-0.20.1-cp310-cp310-manylinux1_x86_64.whl (7.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.2/7.2 MB 20.8 MB/s eta 0:00:00
Collecting gguf==0.10.0
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/1b/e4/c5f9bd71840ae9afb7e2b7c285ba209f2ef5e9cd83885f8c596c551d3026/gguf-0.10.0-py3-none-any.whl (71 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 71.6/71.6 KB 6.3 MB/s eta 0:00:00
Collecting einops
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/44/5a/f0b9ad6c0a9017e62d4735daaeb11ba3b6c009d69a26141b258cd37b5588/einops-0.8.0-py3-none-any.whl (43 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.2/43.2 KB 6.6 MB/s eta 0:00:00
Collecting openai>=1.45.0
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/8e/5a/d22cd07f1a99b9e8b3c92ee0c1959188db4318828a3d88c9daac120bdd69/openai-1.58.1-py3-none-any.whl (454 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 454.3/454.3 KB 12.4 MB/s eta 0:00:00
Collecting partial-json-parser
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/9c/93/791cbea58b8dc27dc76621438d7b030741a4ad3bb5c222363dd01057175a/partial_json_parser-0.2.1.1.post4-py3-none-any.whl (9.9 kB)
Requirement already satisfied: pillow in /usr/lib/python3/dist-packages (from vllm) (9.0.1)
Collecting tiktoken>=0.6.0
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/2e/28/cf3633018cbcc6deb7805b700ccd6085c9a5a7f72b38974ee0bffd56d311/tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 14.1 MB/s eta 0:00:00
Collecting torch==2.5.1
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/2a/ef/834af4a885b31a0b32fff2d80e1e40f771e1566ea8ded55347502440786a/torch-2.5.1-cp310-cp310-manylinux1_x86_64.whl (906.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 906.4/906.4 MB 3.3 MB/s eta 0:00:00
Collecting uvicorn[standard]
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/61/14/33a3a1352cfa71812a3a21e8c9bfb83f60b0011f5e36f2b1399d51928209/uvicorn-0.34.0-py3-none-any.whl (62 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.3/62.3 KB 8.4 MB/s eta 0:00:00
Collecting prometheus_client>=0.18.0
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/ff/c2/ab7d37426c179ceb9aeb109a85cda8948bb269b7561a0be870cc656eefe4/prometheus_client-0.21.1-py3-none-any.whl (54 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 54.7/54.7 KB 5.3 MB/s eta 0:00:00
Collecting fastapi!=0.113.*,!=0.114.0,>=0.107.0
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/52/b3/7e4df40e585df024fac2f80d1a2d579c854ac37109675db2b0cc22c0bb9e/fastapi-0.115.6-py3-none-any.whl (94 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 94.8/94.8 KB 6.7 MB/s eta 0:00:00
Collecting requests>=2.26.0
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/f9/9b/335f9764261e915ed497fcdeb11df5dfd6f7bf257d4a6a2a686d80da4d54/requests-2.32.3-py3-none-any.whl (64 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 64.9/64.9 KB 10.4 MB/s eta 0:00:00
Requirement already satisfied: protobuf in /usr/lib/python3/dist-packages (from vllm) (3.12.4)
Collecting sentencepiece
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/a6/27/33019685023221ca8ed98e8ceb7ae5e166032686fa3662c68f1f1edf334e/sentencepiece-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.3/1.3 MB 23.7 MB/s eta 0:00:00
Collecting aiohttp
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/2f/cc/3a3fc7a290eabc59839a7e15289cd48f33dd9337d06e301064e1e7fb26c5/aiohttp-3.11.11-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 13.7 MB/s eta 0:00:00
Collecting pydantic>=2.9
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/f3/26/3e1bbe954fde7ee22a6e7d31582c642aad9e84ffe4b5fb61e63b87cd326f/pydantic-2.10.4-py3-none-any.whl (431 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 431.8/431.8 KB 13.1 MB/s eta 0:00:00
Collecting compressed-tensors==0.8.1
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/8a/b2/fbac759d64f52d11ab1a142b410cab732e933074d1e33b0249d59f72addf/compressed_tensors-0.8.1-py3-none-any.whl (87 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 87.5/87.5 KB 13.3 MB/s eta 0:00:00
Collecting mistral_common[opencv]>=1.5.0
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/be/48/b2acdd507d13a3fd926934eaed0a0afcbcd7e85e90639c91242e7423f3c4/mistral_common-1.5.1-py3-none-any.whl (6.5 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.5/6.5 MB 20.1 MB/s eta 0:00:00
。。。。。。
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/02/cc/b7e31358aac6ed1ef2bb790a9746ac2c69bcb3c8588b41616914eb106eaf/exceptiongroup-1.2.2-py3-none-any.whl (16 kB)
Collecting httpcore==1.*
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/87/f5/72347bc88306acb359581ac4d52f23c0ef445b57157adedb9aee0cd689d2/httpcore-1.0.7-py3-none-any.whl (78 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.6/78.6 KB 12.4 MB/s eta 0:00:00
Collecting attrs>=17.3.0
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/89/aa/ab0f7891a01eeb2d2e338ae8fecbe57fcebea1a24dbb64d45801bfab481d/attrs-24.3.0-py3-none-any.whl (63 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 63.4/63.4 KB 4.2 MB/s eta 0:00:00
Collecting rpds-py>=0.7.1
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/4d/cf/96f1fd75512a017f8e07408b6d5dbeb492d9ed46bfe0555544294f3681b3/rpds_py-0.22.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (381 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 382.0/382.0 KB 15.4 MB/s eta 0:00:00
Collecting jsonschema-specifications>=2023.03.6
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/d1/0f/8910b19ac0670a0f80ce1008e5e751c4a57e14d2c4c13a482aa6079fa9d6/jsonschema_specifications-2024.10.1-py3-none-any.whl (18 kB)
Installing collected packages: sentencepiece, py-cpuinfo, nvidia-ml-py, mpmath, blake3, websockets, uvloop, typing_extensions, tqdm, sympy, sniffio, safetensors, rpds-py, regex, pyzmq, python-dotenv, pycountry, pybind11, protobuf, propcache, prometheus_client, pillow, partial-json-parser, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, numpy, networkx, nest_asyncio, msgspec, msgpack, jiter, interegular, httptools, h11, fsspec, frozenlist, filelock, exceptiongroup, einops, diskcache, dill, cloudpickle, charset-normalizer, attrs, async-timeout, astor, annotated-types, airportsdata, aiohappyeyeballs, uvicorn, triton, requests, referencing, pydantic-core, opencv-python-headless, nvidia-cusparse-cu12, nvidia-cudnn-cu12, multidict, httpcore, gguf, depyf, anyio, aiosignal, yarl, watchfiles, tiktoken, starlette, pydantic, nvidia-cusolver-cu12, jsonschema-specifications, huggingface-hub, httpx, torch, tokenizers, prometheus-fastapi-instrumentator, openai, lm-format-enforcer, jsonschema, fastapi, aiohttp, xformers, transformers, torchvision, ray, outlines_core, mistral_common, xgrammar, outlines, compressed-tensors, vllm
Successfully installed aiohappyeyeballs-2.4.4 aiohttp-3.11.11 aiosignal-1.3.2 airportsdata-20241001 annotated-types-0.7.0 anyio-4.7.0 astor-0.8.1 async-timeout-5.0.1 attrs-24.3.0 blake3-1.0.0 charset-normalizer-3.4.0 cloudpickle-3.1.0 compressed-tensors-0.8.1 depyf-0.18.0 dill-0.3.9 diskcache-5.6.3 einops-0.8.0 exceptiongroup-1.2.2 fastapi-0.115.6 filelock-3.16.1 frozenlist-1.5.0 fsspec-2024.12.0 gguf-0.10.0 h11-0.14.0 httpcore-1.0.7 httptools-0.6.4 httpx-0.28.1 huggingface-hub-0.27.0 interegular-0.3.3 jiter-0.8.2 jsonschema-4.23.0 jsonschema-specifications-2024.10.1 lm-format-enforcer-0.10.9 mistral_common-1.5.1 mpmath-1.3.0 msgpack-1.1.0 msgspec-0.18.6 multidict-6.1.0 nest_asyncio-1.6.0 networkx-3.4.2 numpy-1.26.4 nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cudnn-cu12-9.1.0.70 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-ml-py-12.560.30 nvidia-nccl-cu12-2.21.5 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.4.127 openai-1.58.1 opencv-python-headless-4.10.0.84 outlines-0.1.11 outlines_core-0.1.26 partial-json-parser-0.2.1.1.post4 pillow-10.4.0 prometheus-fastapi-instrumentator-7.0.0 prometheus_client-0.21.1 propcache-0.2.1 protobuf-5.29.2 py-cpuinfo-9.0.0 pybind11-2.13.6 pycountry-24.6.1 pydantic-2.10.4 pydantic-core-2.27.2 python-dotenv-1.0.1 pyzmq-26.2.0 ray-2.40.0 referencing-0.35.1 regex-2024.11.6 requests-2.32.3 rpds-py-0.22.3 safetensors-0.4.5 sentencepiece-0.2.0 sniffio-1.3.1 starlette-0.41.3 sympy-1.13.1 tiktoken-0.7.0 tokenizers-0.21.0 torch-2.5.1 torchvision-0.20.1 tqdm-4.67.1 transformers-4.47.1 triton-3.1.0 typing_extensions-4.12.2 uvicorn-0.34.0 uvloop-0.21.0 vllm-0.6.5 watchfiles-1.0.3 websockets-14.1 xformers-0.0.28.post3 xgrammar-0.1.7 yarl-1.18.3
要注意的是,参考文章中的安装指令是 pip install vLLM>=0.4.0,实际执行时会“卡住不动”:原因多半是 shell 把其中的 > 当成了输出重定向,版本约束没有生效,输出都被写进了名为 =0.4.0 的文件里,看上去就像卡住了;要带版本约束应加引号,写成 pip install "vllm>=0.4.0"。我到https://pypi.org/search/上查了,vLLM的最新版本是0.6.5,2024年12月18日刚更新,实际安装时不指定版本号,默认装的就是最新版本(在pypi的该页面上可以直接下载各版本的源码包,后面会用到)。
另外,注意到安装过程中显示了类似下面的告警:
WARNING: The script cpuinfo is installed in '/home/zhangsan/.local/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
因为pip install指令前面没有加sudo,是以普通用户身份安装的,可执行程序默认装到当前用户主目录下的.local/bin子目录(库则装到.local/lib/python3.10/site-packages)。安装完成后,可以看到该子目录下新安装的程序文件:
$ ll ~/.local/bin/
总计 144
drwxrwxr-x 2 allen allen 4096 12月 21 01:12 ./
drwx------ 5 allen allen 4096 12月 21 01:11 ../
-rwxrwxr-x 1 allen allen 252 12月 21 01:12 convert-caffe2-to-onnx*
-rwxrwxr-x 1 allen allen 252 12月 21 01:12 convert-onnx-to-caffe2*
-rwxrwxr-x 1 allen allen 206 12月 21 01:11 cpuinfo*
-rwxrwxr-x 1 allen allen 212 12月 21 01:11 dotenv*
-rwxrwxr-x 1 allen allen 216 12月 21 01:11 f2py*
-rwxrwxr-x 1 allen allen 210 12月 21 01:12 fastapi*
-rwxrwxr-x 1 allen allen 2457 12月 21 01:11 get_gprof*
-rwxrwxr-x 1 allen allen 1651 12月 21 01:11 get_objgraph*
-rwxrwxr-x 1 allen allen 258 12月 21 01:11 gguf-convert-endian*
-rwxrwxr-x 1 allen allen 238 12月 21 01:11 gguf-dump*
-rwxrwxr-x 1 allen allen 254 12月 21 01:11 gguf-new-metadata*
-rwxrwxr-x 1 allen allen 254 12月 21 01:11 gguf-set-metadata*
-rwxrwxr-x 1 allen allen 204 12月 21 01:11 httpx*
-rwxrwxr-x 1 allen allen 239 12月 21 01:11 huggingface-cli*
-rwxrwxr-x 1 allen allen 205 12月 21 01:11 isympy*
-rwxrwxr-x 1 allen allen 229 12月 21 01:11 json-playground*
-rwxrwxr-x 1 allen allen 213 12月 21 01:12 jsonschema*
-rwxrwxr-x 1 allen allen 233 12月 21 01:11 normalizer*
-rwxrwxr-x 1 allen allen 209 12月 21 01:12 openai*
-rwxrwxr-x 1 allen allen 221 12月 21 01:11 proton*
-rwxrwxr-x 1 allen allen 221 12月 21 01:11 proton-viewer*
-rwxrwxr-x 1 allen allen 216 12月 21 01:11 pybind11-config*
-rwxrwxr-x 1 allen allen 218 12月 21 01:12 ray*
-rwxrwxr-x 1 allen allen 214 12月 21 01:12 rllib*
-rwxrwxr-x 1 allen allen 214 12月 21 01:12 serve*
-rwxrwxr-x 1 allen allen 229 12月 21 01:12 torchfrtrace*
-rwxrwxr-x 1 allen allen 220 12月 21 01:12 torchrun*
-rwxrwxr-x 1 allen allen 207 12月 21 01:11 tqdm*
-rwxrwxr-x 1 allen allen 237 12月 21 01:12 transformers-cli*
-rwxrwxr-x 1 allen allen 217 12月 21 01:12 tune*
-rwxrwxr-x 1 allen allen 587 12月 21 01:11 undill*
-rwxrwxr-x 1 allen allen 211 12月 21 01:11 uvicorn*
-rwxrwxr-x 1 allen allen 211 12月 21 01:12 vllm*
-rwxrwxr-x 1 allen allen 211 12月 21 01:11 watchfiles*
为了能够直接执行这些程序文件,需要将~/.local/bin加入到环境变量PATH中,于是在.bashrc文件的最后加入一行(保存后执行 source ~/.bashrc 或重新打开终端使其生效):
export PATH=~/.local/bin:$PATH
按照参考文章的说明,需要继续安装ray,但是我在执行其安装指令时,提示已经安装好了:
$ pip install ray
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: ray in /home/zhangsan/.local/lib/python3.10/site-packages (2.40.0)
Requirement already satisfied: click>=7.0 in /usr/lib/python3/dist-packages (from ray) (8.0.3)
Requirement already satisfied: requests in /home/zhangsan/.local/lib/python3.10/site-packages (from ray) (2.32.3)
Requirement already satisfied: frozenlist in /home/zhangsan/.local/lib/python3.10/site-packages (from ray) (1.5.0)
Requirement already satisfied: filelock in /home/zhangsan/.local/lib/python3.10/site-packages (from ray) (3.16.1)
Requirement already satisfied: msgpack<2.0.0,>=1.0.0 in /home/zhangsan/.local/lib/python3.10/site-packages (from ray) (1.1.0)
Requirement already satisfied: protobuf!=3.19.5,>=3.15.3 in /home/zhangsan/.local/lib/python3.10/site-packages (from ray) (5.29.2)
Requirement already satisfied: packaging in /usr/lib/python3/dist-packages (from ray) (21.3)
Requirement already satisfied: pyyaml in /usr/lib/python3/dist-packages (from ray) (5.4.1)
Requirement already satisfied: jsonschema in /home/zhangsan/.local/lib/python3.10/site-packages (from ray) (4.23.0)
Requirement already satisfied: aiosignal in /home/zhangsan/.local/lib/python3.10/site-packages (from ray) (1.3.2)
Requirement already satisfied: attrs>=22.2.0 in /home/zhangsan/.local/lib/python3.10/site-packages (from jsonschema->ray) (24.3.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /home/zhangsan/.local/lib/python3.10/site-packages (from jsonschema->ray) (2024.10.1)
Requirement already satisfied: rpds-py>=0.7.1 in /home/zhangsan/.local/lib/python3.10/site-packages (from jsonschema->ray) (0.22.3)
Requirement already satisfied: referencing>=0.28.4 in /home/zhangsan/.local/lib/python3.10/site-packages (from jsonschema->ray) (0.35.1)
Requirement already satisfied: charset-normalizer<4,>=2 in /home/zhangsan/.local/lib/python3.10/site-packages (from requests->ray) (3.4.0)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/lib/python3/dist-packages (from requests->ray) (1.26.5)
Requirement already satisfied: idna<4,>=2.5 in /usr/lib/python3/dist-packages (from requests->ray) (3.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/lib/python3/dist-packages (from requests->ray) (2020.6.20)
完成上述步骤后,可以试运行一下vllm指令,比如查看帮助信息和版本号:
$ vllm -h
usage: vllm [-h] [-v] {serve,complete,chat} ...
vLLM CLI
positional arguments:
{serve,complete,chat}
serve Start the vLLM OpenAI Compatible API server
complete Generate text completions based on the given prompt via the running API server
chat Generate chat completions via the running API server
options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
$ vllm -v
0.6.5
应该说至此vLLM框架就安装完成了。但是,真正的考验才刚刚开始!接下来我按参考文章中的第一个例子,把代码拷贝下来并保存为test.py文件,然后执行python test.py指令,程序停顿了好一会后报错。
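参考文章里的示例代码这里不再原样照搬,下面给出一份大致等价的示意代码(按后面报错栈里的调用方式和Qwen2搭配vLLM的常见写法整理,并非参考文章原文),核心就是先用transformers加载分词器,再用vLLM的LLM类做离线推理:
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# 加载分词器:后面报错栈里 test.py 第5行指向的就是这一句
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")

# 采样参数:温度、top_p、最大生成长度
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)

# 初始化vLLM推理引擎,参数可以是模型名,也可以是本地模型目录
llm = LLM(model="Qwen/Qwen2-7B-Instruct")

# 用聊天模板把对话拼成模型的输入文本
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "请简单介绍一下大语言模型。"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# 生成并打印结果
outputs = llm.generate([text], sampling_params)
print(outputs[0].outputs[0].text)
报错信息(节选)如下: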
。。。。。。
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/work/llm/vllm/progs/test.py", line 5, in <module>
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-7B-Instruct")
File "/home/zhangsan/.local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 878, in from_pretrained
config = AutoConfig.from_pretrained(
File "/home/zhangsan/.local/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py", line 1021, in from_pretrained
config_dict, unused_kwargs = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/zhangsan/.local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 590, in get_config_dict
config_dict, kwargs = cls._get_config_dict(pretrained_model_name_or_path, **kwargs)
File "/home/zhangsan/.local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 649, in _get_config_dict
resolved_config_file = cached_file(
File "/home/zhangsan/.local/lib/python3.10/site-packages/transformers/utils/hub.py", line 446, in cached_file
raise EnvironmentError(
OSError: We couldn't connect to 'https://huggingface.co' to load this file, couldn't find it in the cached files and it looks like Qwen/Qwen2-7B-Instruct is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
看得出来是找不到Qwen/Qwen2-7B-Instruct大模型,没有下载到本地,也没有配置路径,在网络上(默认网站:https://huggingface.co)也没有找到。看来是必须先下载大模型文件啊,参考文章中并没有明确说明,那咱就自己来爬坑吧。
首先,国内访问https://huggingface.co网站根本打不开,好消息是有国内镜像,可以完美平替该网站,镜像地址为 https://hf-mirror.com。
其次,从https://hf-mirror.com下载大模型文件的方法在该网站首页就有清晰的介绍,简单来说:方法一是直接在网页上逐个下载文件,方法二是用huggingface-cli命令行工具下载(把环境变量HF_ENDPOINT指向镜像站)。
我们通常用方法二。按照方法二的步骤,先安装huggingface_hub工具(pip install -U huggingface_hub),再设置环境变量 export HF_ENDPOINT=https://hf-mirror.com,最后下载Qwen/Qwen2-7B-Instruct模型,执行指令:
huggingface-cli download Qwen/Qwen2-7B-Instruct --local-dir Qwen
--resume-download参数是过时的,用不着,现在默认都是断点续传,--local-dir要指向已存在且有读写权限的目录。指令执行时开始下载文件,实际就是将指定的大模型路径下的"Files and versions"都下载下来,对应的网络路径是 https://hf-mirror.com/Qwen/Qwen2-7B-Instruct/tree/main,页面截图如下:
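顺带补充一个示意:除了huggingface-cli,也可以用huggingface_hub的Python接口snapshot_download完成同样的下载(以下代码是笔者补充的示意,并非hf-mirror原文步骤,要点是HF_ENDPOINT必须指向镜像站):
import os

# HF_ENDPOINT 必须在导入 huggingface_hub 之前设置,否则仍会指向默认的 https://huggingface.co
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import snapshot_download

# 把整个模型仓库下载到本地的 Qwen 目录,中断后重跑会接着下载
snapshot_download(repo_id="Qwen/Qwen2-7B-Instruct", local_dir="Qwen")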
从这个页面上也可以直接下载每个文件,文件总大小接近15GB,我用huggingface-cli下载用时1小时05分。下载完成之后,再次运行测试程序,又报错:
。。。。。。
RuntimeError: Failed to infer device type
Exception ignored in: <function LLM.__del__ at 0x700a69ec13f0>
Traceback (most recent call last):
File "/home/zhangsan/.local/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 236, in __del__
if self.llm_engine and hasattr(self.llm_engine, "shutdown"):
AttributeError: 'LLM' object has no attribute 'llm_engine'
问豆包,回答提示跟GPU有关:
回答里面提到了GPU,CUDA和torch,在前面以Ollama方式部署时,已确认我的电脑是有AMD GPU的,只是GPU驱动的安装方式不同;迅速检查了下在vLLM安装过程中顺带安装的CUDA和torch依赖,看起来也正常啊,那应该是跟它们的配置有关?
$ pip show torch
Name: torch
Version: 2.5.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: /home/zhangsan/.local/lib/python3.10/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvjitlink-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: compressed-tensors, outlines, torchvision, vllm, xformers, xgrammar
$ pip list | grep cuda
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
安装的torch版本为2.5.1,简介中有"with strong GPU acceleration"字样,高强度GPU加速啊!问题是我的显卡GPU是AMD的,不是NVidia的,前面安装Ollama时选择的也是ROCM驱动包,不是CUDA。顺带查了下ROCM和CUDA到底有何区别,摘录如下:
ROCM与CUDA的区别
架构背景和所属公司
CUDA(Compute Unified Device Architecture):
是 NVIDIA 公司推出的一种并行计算平台和编程模型。它专门用于 NVIDIA 的 GPU(图形处理器),使开发人员能够利用 GPU 的强大计算能力进行通用计算,如深度学习、科学计算等领域。NVIDIA 在 GPU 市场占据重要地位,其推出的 CUDA 生态系统非常完善,有大量的软件库、工具和文档支持。
ROCM(Radeon Open Compute Platform):
是 AMD 公司为其 GPU 开发的一个开放的、异构的计算平台。它的目标是提供类似于 CUDA 的功能,让开发者能够在 AMD 的 GPU 上进行高性能计算。AMD 作为 NVIDIA 的竞争对手,推出 ROCM 是为了吸引更多的开发者使用 AMD 的 GPU 进行计算任务,并且促进开源社区的发展。
硬件支持范围
CUDA:
仅支持 NVIDIA 的 GPU 产品系列。从早期的 GeForce、Tesla 系列到现在的 Ampere、Hopper 架构的 GPU 等。这意味着如果要使用 CUDA 进行计算,必须使用 NVIDIA 的 GPU 硬件。
ROCM:
主要支持 AMD 的 GPU,包括 Radeon 系列等。例如,AMD 的 Radeon VII、MI 系列等都可以很好地支持 ROCM 进行高性能计算。不过,ROCM 也在尝试通过一些技术手段,如与其他硬件厂商合作或者软件抽象层,来扩展其硬件支持范围。
软件生态系统
CUDA:
具有庞大而成熟的软件生态。在深度学习领域,许多流行的深度学习框架(如 PyTorch、TensorFlow 等)对 CUDA 有很好的支持。这些框架提供了简单易用的接口,方便开发者利用 CUDA 加速模型训练和推理。此外,还有许多专门为 CUDA 开发的高性能计算库,如 cuBLAS(用于线性代数运算)、cuDNN(用于深度学习神经网络加速)等,这些库可以大大提高计算效率。
ROCM:
虽然也在不断发展其软件生态,但相对 CUDA 来说还不够成熟。一些主流的深度学习框架也开始逐渐增加对 ROCM 的支持,但在某些功能和性能方面可能还不如 CUDA 完善。不过,随着 AMD 的推动和开源社区的参与,ROCM 的软件生态正在逐步壮大,例如,MIOpen 是 ROCM 中的一个类似于 cuDNN 的深度学习库,用于加速神经网络的运算。
编程模型和 API(应用程序接口)
CUDA:
有自己独特的编程模型。开发者需要学习 CUDA C/C++ 语言扩展或者使用支持 CUDA 的高级编程语言(如 Python 结合 PyTorch 等框架)来编写代码。CUDA 的 API 设计相对比较成熟和稳定,经过多年的发展和优化,对于熟悉其编程模型的开发者来说,可以高效地开发出高性能的 GPU 计算程序。
ROCM:
其编程模型在一定程度上与 CUDA 类似,目的是降低开发者从 CUDA 转向 ROCM 的学习成本。例如,ROCM 也支持 HIP(Heterogeneous-Compute Interface for Portability)编程接口,HIP 可以看作是一种兼容 CUDA 的编程接口,它允许开发者使用类似于 CUDA 的代码风格来编写程序,然后通过一些工具将其转换为可以在 ROCM 平台上运行的代码,这对于已经熟悉 CUDA 编程的开发者来说是一个很方便的特性。
性能表现和优化策略
CUDA:
由于 NVIDIA 在 GPU 硬件和软件优化方面投入了大量资源,在很多计算场景下能够提供出色的性能。NVIDIA 的 GPU 架构设计和 CUDA 软件栈之间的配合紧密,能够实现高效的并行计算。在深度学习模型训练和推理过程中,尤其是对于大规模数据和复杂模型,CUDA 的性能优化工具和技术(如自动混合精度训练等)可以显著提高计算速度和效率。
ROCM:
AMD 的 GPU 与 ROCM 平台也能提供有竞争力的性能。ROCM 在某些特定的计算任务和应用场景下可能会展现出优势,例如在一些对双精度浮点运算要求较高的科学计算领域。不过,由于其软件生态和硬件适配的复杂性,在性能优化方面可能需要开发者花费更多的时间和精力来调整代码和配置,以达到最佳性能。
所以接下来的问题是,在vLLM方式下如何安装ROCM驱动包,如何确保vLLM、ROCM和torch软件包之间的兼容?找到ROCM的官网查询了一下 驱动包版本和显卡设备的兼容列表,结果发现ROCM加速包根本就不支持我的显卡!晕啊,费了老大劲了~
那就不用安装ROCM加速包了啊,所以问题就变成了如何配置vLLM在不用GPU加速的情况下运行?继续问豆包,回答如下:
简单说就是要安装CPU版本的PyTorch。我之前装的是2.5.1版本、支持CUDA 12的PyTorch,现在要换。还好 vLLM的安装指导文档 有提到安装不同版本PyTorch的方法,而 PyTorch官网首页 就能按不同软硬件配置给出对应的安装指令,很好,就按CPU版本来操作!
直接执行指令:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
因为我之前已经安装过PyTorch v2.5.1了,pip认为torch和torchvision的要求都已满足、不会自动替换成+cpu版本,所以这次执行得很快,省了很多下载。有什么区别呢?结果只是新增了一个torchaudio-2.5.1+cpu模块。
Installing collected packages: torchaudio
Successfully installed torchaudio-2.5.1+cpu
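这里其实可以顺手用几行Python确认一下当前生效的torch到底是CPU版还是CUDA版(以下是笔者补充的检查代码,非原文步骤):
import torch

print(torch.__version__)          # CPU版的版本号通常带有 +cpu 后缀,例如 2.5.1+cpu
print(torch.version.cuda)         # CPU版这里是 None,CUDA版会打印CUDA版本号,例如 12.4
print(torch.cuda.is_available())  # 本机没有可用的NVIDIA GPU/CUDA环境时为 False
如果打印出来的版本号没有+cpu后缀、torch.version.cuda也不是None,说明当前装的其实还是CUDA版的torch,这也为后面的波折埋下了伏笔。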
完成后再次运行测试程序,还是不行,仍然是LLM引擎初始化失败。继续问豆包:
如何指定vLLM使用CPU进行推理?
在代码中设置参数
使用--target-device参数:在 Python 代码中启动 vLLM 服务时,可以通过设置--target-device参数为cpu来指定使用 CPU 进行推理。例如:
from vllm import LLM
llm = LLM(model="your_model_name", target_device="cpu")
通过配置字典设置:也可以使用一个配置字典来设置参数。先创建一个包含target_device键值对的字典,然后将其传递给LLM类的构造函数。示例如下:
config = {"target_device": "cpu"}
llm = LLM(model="your_model_name", **config)
在命令行中指定
如果你是通过命令行来启动 vLLM 服务,同样可以使用--target-device cpu参数。例如:
python -m vllm.entrypoints.openai.api_server --model your_model_name --target-device cpu
注意事项
性能差异:要知道使用 CPU 进行推理与使用 GPU 相比,速度通常会慢很多。因为 CPU 的计算核心数量相对 GPU 较少,并且没有像 GPU 那样针对大规模并行计算(如深度学习模型中的矩阵乘法等运算)进行优化。所以在实际应用中,如果对推理速度有较高要求,并且硬件条件允许,还是建议使用 GPU。
内存占用:在 CPU 上运行大型语言模型进行推理时,可能会占用大量的系统内存。要确保系统有足够的内存来加载模型和处理数据,否则可能会出现内存不足的错误。同时,一些在 GPU 上可以通过特殊技术(如模型量化等)缓解内存压力的方法,在 CPU 上的效果可能会有所不同。
在代码中添加参数,在命令行添加参数,两种方法都试过了,都报错,提示不识别target_device参数:
Traceback (most recent call last):
File "/mnt/work/llm/vllm/progs/test2.py", line 18, in <module>
llm = LLM(model="Qwen/Qwen2-7B-Instruct", target_device="cpu", task="generate") # Name or path of your model
File "/home/zhangsan/.local/lib/python3.10/site-packages/vllm/utils.py", line 990, in inner
return fn(*args, **kwargs)
File "/home/zhangsan/.local/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 198, in __init__
engine_args = EngineArgs(
TypeError: EngineArgs.__init__() got an unexpected keyword argument 'target_device'
Exception ignored in: <function LLM.__del__ at 0x7f44717cd5a0>
Traceback (most recent call last):
File "/home/zhangsan/.local/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 236, in __del__
if self.llm_engine and hasattr(self.llm_engine, "shutdown"):
AttributeError: 'LLM' object has no attribute 'llm_engine'
继续问豆包,回答仍确认新的vLLM都支持这个参数;但从上面的报错看,vllm 0.6.5 的 EngineArgs 里并没有 target_device 这个参数(它用来指定运行设备的参数名其实是 device),这条路走不通,那怎么办?查询vLLM官网资料,其中有专门的 安装CPU版本的vLLM 的说明章节,但它只说明了Docker方式安装和编译源码两种方式,就没有更简便的方式?!
好吧,编译就编译吧,看起来也不是太复杂,只是绕了好大个圈!!
先把先前安装的vllm和torch都卸载了,然后再照着文档中的步骤做。但注意:文档在说明编译源码的步骤时,并没有说如何下载源码。源码是前提啊,可以到pypi上搜索vllm软件包,在找到的vllm模块页面上有 源码下载链接。下载、解压,进入到源码目录,然后启动编译:
$ tar -xvf vllm-0.6.5.tar.gz
$ cd vllm-0.6.5
$ pip install -v -r requirements-cpu.txt --extra-index-url https://download.pytorch.org/whl/cpu
$ VLLM_TARGET_DEVICE=cpu python setup.py install
只有下载了vllm的源代码,进入到源码目录vllm-0.6.5,才有了requirements-cpu.txt文件。但是上面最后一个指令执行失败:
$ VLLM_TARGET_DEVICE=cpu python setup.py install
/home/zhangsan/.local/lib/python3.10/site-packages/setuptools_scm/_integration/setuptools.py:31: RuntimeWarning:
ERROR: setuptools==59.6.0 is used in combination with setuptools_scm>=8.x
Your build configuration is incomplete and previously worked by accident!
setuptools_scm requires setuptools>=61
Suggested workaround if applicable:
- migrating from the deprecated setup_requires mechanism to pep517/518
and using a pyproject.toml to declare build dependencies
which are reliably pre-installed before running the build tools
warnings.warn(
running install
/usr/lib/python3/dist-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
/usr/lib/python3/dist-packages/setuptools/command/easy_install.py:158: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
error: can't create or remove files in install directory
The following error occurred while trying to add or remove files in the
installation directory:
[Errno 13] Permission denied: '/usr/local/lib/python3.10/dist-packages/test-easy-install-16606.write-test'
The installation directory you specified (via --install-dir, --prefix, or
the distutils default setting) was:
/usr/local/lib/python3.10/dist-packages/
Perhaps your account does not have write access to this directory? If the
installation directory is a system-owned directory, you may need to sign in
as the administrator or "root" account. If you do not have administrative
access to this machine, you may wish to choose a different installation
directory, preferably one that is listed in your PYTHONPATH environment
variable.
For information on other options, you may wish to consult the
documentation at:
https://setuptools.pypa.io/en/latest/deprecated/easy_install.html
Please make the appropriate changes for your system and try again.
首先,我现有的setuptools版本(59.6.0)不满足setuptools_scm>=8.x对setuptools>=61的要求,所以先升级setuptools:
$ pip show setuptools
Name: setuptools
Version: 59.6.0
Summary: Easily download, build, install, upgrade, and uninstall Python packages
Home-page: https://github.com/pypa/setuptools
Author: Python Packaging Authority
Author-email: distutils-sig@python.org
License: UNKNOWN
Location: /usr/lib/python3/dist-packages
Requires:
Required-by: bloom, catkin-pkg, catkin-pkg-modules, colcon-core, colcon-python-setup-py, rosdistro, rosdistro-modules, setuptools-scm
$ pip install --upgrade setuptools
Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple
Requirement already satisfied: setuptools in /usr/lib/python3/dist-packages (59.6.0)
Collecting setuptools
Downloading https://pypi.tuna.tsinghua.edu.cn/packages/55/21/47d163f615df1d30c094f6c8bbb353619274edccf0327b185cc2493c2c33/setuptools-75.6.0-py3-none-any.whl (1.2 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.2/1.2 MB 8.5 MB/s eta 0:00:00
Installing collected packages: setuptools
Successfully installed setuptools-75.6.0
再次运行安装指令,又报错,提示
!!
********************************************************************************
Please avoid running setup.py directly.
Instead, use pypa/build, pypa/installer or other
standards-based tools.
See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
********************************************************************************
!!
意思是不要直接调用setup.py文件来编译和安装,问了下豆包,推荐的新方式是运行python的build模块来编译,然后再用pip install指令安装,执行指令:
$ pip install build
该指令先安装python的build模块,它是PyPA(Python Packaging Authority)提供的标准构建前端。接着就开始编译,执行指令:
VLLM_TARGET_DEVICE=cpu python -m build
还是报错,提示安装的Caffe2包要用CUDA但找不到CUDA的路径。我已经不需要GPU加速包了啊,Caffe2又是什么鬼?
CMake Warning at /tmp/build-env-acb2mew3/lib/python3.10/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:31 (message):
Caffe2: CUDA cannot be found. Depending on whether you are building Caffe2
or a Caffe2 dependent library, the next warning / error will give you more
info.
Call Stack (most recent call first):
/tmp/build-env-acb2mew3/lib/python3.10/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:86 (include)
/tmp/build-env-acb2mew3/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:84 (find_package)
CMake Error at /tmp/build-env-acb2mew3/lib/python3.10/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:90 (message):
Your installed Caffe2 version uses CUDA but I cannot find the CUDA
libraries. Please set the proper CUDA prefixes and / or install CUDA.
Call Stack (most recent call first):
/tmp/build-env-acb2mew3/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:84 (find_package)
问豆包,回答是:这个错误表明你安装的 Caffe2(它是 PyTorch 的一部分,用于一些深度学习相关的操作)版本是配置为使用 CUDA 的,但 CMake 在构建过程中无法找到 CUDA 库。这可能是由于 CUDA 没有正确安装、CUDA 环境变量没有正确设置或者 CMake 没有正确指向 CUDA 库的位置等原因导致的。
怎么可能呢? 刚安装的PyTorch已经是不用GPU加速的版本,为啥它的组成部分Caffe2还要用CUDA呢?官方的包都不一致吗?(事后来看,原因多半是:前面pip并没有真的把torch替换成+cpu版,而python -m build又会在隔离的构建环境里按构建依赖重新安装torch==2.5.1,从默认源装到的仍是CUDA版,所以Caffe2的CMake配置还是会去找CUDA。)有点不好玩了,好痛苦啊!
反复琢磨,我就先把CUDA安装起来再说吧,有没有是一方面,用不用是另一方面,看能不能骗得过编译阶段? 执行指令:
sudo apt install nvidia-cuda-toolkit
一会就安装完成,再次执行vLLM的编译指令,哇,这次终于成功了!太不容易了。
$ VLLM_TARGET_DEVICE=cpu python -m build
* Creating isolated environment: venv+pip...
* Installing packages in isolated environment:
- cmake>=3.26
- jinja2
- ninja
- packaging
- setuptools-scm>=8.0
- setuptools>=61
- torch == 2.5.1
- wheel
* Getting build dependencies for sdist...
/tmp/build-env-s3spf0kn/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
cpu = _conversion_method_template(device=torch.device("cpu"))
No CUDA runtime is found, using CUDA_HOME='/usr'
running egg_info
writing vllm.egg-info/PKG-INFO
writing dependency_links to vllm.egg-info/dependency_links.txt
writing entry points to vllm.egg-info/entry_points.txt
writing requirements to vllm.egg-info/requires.txt
writing top-level names to vllm.egg-info/top_level.txt
ERROR setuptools_scm._file_finders.git listing git files failed - pretending there aren't any
reading manifest file 'vllm.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file 'vllm.egg-info/SOURCES.txt'
* Building sdist...
/tmp/build-env-s3spf0kn/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
cpu = _conversion_method_template(device=torch.device("cpu"))
No CUDA runtime is found, using CUDA_HOME='/usr'
running sdist
running egg_info
writing vllm.egg-info/PKG-INFO
writing dependency_links to vllm.egg-info/dependency_links.txt
writing entry points to vllm.egg-info/entry_points.txt
writing requirements to vllm.egg-info/requires.txt
writing top-level names to vllm.egg-info/top_level.txt
ERROR setuptools_scm._file_finders.git listing git files failed - pretending there aren't any
reading manifest file 'vllm.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file 'vllm.egg-info/SOURCES.txt'
running check
creating vllm-0.6.5+cpu
。。。。。。
Writing vllm-0.6.5+cpu/setup.cfg
Creating tar archive
removing 'vllm-0.6.5+cpu' (and everything under it)
* Building wheel from sdist
* Creating isolated environment: venv+pip...
* Installing packages in isolated environment:
- cmake>=3.26
- jinja2
- ninja
- packaging
- setuptools-scm>=8.0
- setuptools>=61
- torch == 2.5.1
- wheel
* Getting build dependencies for wheel...
/tmp/build-env-dmjyow5e/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
cpu = _conversion_method_template(device=torch.device("cpu"))
No CUDA runtime is found, using CUDA_HOME='/usr'
running egg_info
writing vllm.egg-info/PKG-INFO
writing dependency_links to vllm.egg-info/dependency_links.txt
writing entry points to vllm.egg-info/entry_points.txt
writing requirements to vllm.egg-info/requires.txt
writing top-level names to vllm.egg-info/top_level.txt
ERROR setuptools_scm._file_finders.git listing git files failed - pretending there aren't any
reading manifest file 'vllm.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
adding license file 'LICENSE'
writing manifest file 'vllm.egg-info/SOURCES.txt'
* Building wheel...
/tmp/build-env-dmjyow5e/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:295: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered internally at ../torch/csrc/utils/tensor_numpy.cpp:84.)
cpu = _conversion_method_template(device=torch.device("cpu"))
No CUDA runtime is found, using CUDA_HOME='/usr'
running bdist_wheel
running build
running build_py
creating build/lib.linux-x86_64-cpython-310/vllm
。。。。。。
-- Found Python: /tmp/build-env-dmjyow5e/bin/python (found version "3.10.12") found components: Interpreter
CMake Warning at /tmp/build-env-dmjyow5e/lib/python3.10/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:140 (message):
Failed to compute shorthash for libnvrtc.so
Call Stack (most recent call first):
/tmp/build-env-dmjyow5e/lib/python3.10/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:86 (include)
/tmp/build-env-dmjyow5e/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:84 (find_package)
CMake Warning (dev) at /tmp/build-env-dmjyow5e/lib/python3.10/site-packages/cmake/data/share/cmake-3.31/Modules/FindPackageHandleStandardArgs.cmake:441 (message):
The package name passed to `find_package_handle_standard_args` (nvtx3) does
not match the name of the calling package (Caffe2). This can lead to
problems in calling code that expects `find_package` result variables
(e.g., `_FOUND`) to follow a certain pattern.
Call Stack (most recent call first):
/tmp/build-env-dmjyow5e/lib/python3.10/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:174 (find_package_handle_standard_args)
/tmp/build-env-dmjyow5e/lib/python3.10/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:86 (include)
/tmp/build-env-dmjyow5e/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:84 (find_package)
This warning is for project developers. Use -Wno-dev to suppress it.
-- Could NOT find nvtx3 (missing: nvtx3_dir)
CMake Warning at /tmp/build-env-dmjyow5e/lib/python3.10/site-packages/torch/share/cmake/Caffe2/public/cuda.cmake:180 (message):
Cannot find NVTX3, find old NVTX instead
Call Stack (most recent call first):
/tmp/build-env-dmjyow5e/lib/python3.10/site-packages/torch/share/cmake/Caffe2/Caffe2Config.cmake:86 (include)
/tmp/build-env-dmjyow5e/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:68 (find_package)
CMakeLists.txt:84 (find_package)
-- USE_CUDNN is set to 0. Compiling without cuDNN support
-- USE_CUSPARSELT is set to 0. Compiling without cuSPARSELt support
-- USE_CUDSS is set to 0. Compiling without cuDSS support
-- USE_CUFILE is set to 0. Compiling without cuFile support
-- Automatic GPU detection failed. Building for common architectures.
-- Autodetected CUDA architecture(s): 3.5;5.0;8.0;8.6;8.6+PTX
-- Added CUDA NVCC flags for: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_86,code=compute_86
CMake Warning at /tmp/build-env-dmjyow5e/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:22 (message):
static library kineto_LIBRARY-NOTFOUND not found.
Call Stack (most recent call first):
/tmp/build-env-dmjyow5e/lib/python3.10/site-packages/torch/share/cmake/Torch/TorchConfig.cmake:120 (append_torchlib_if_found)
CMakeLists.txt:84 (find_package)
-- Found Torch: /tmp/build-env-dmjyow5e/lib/python3.10/site-packages/torch/lib/libtorch.so
CMake Warning at cmake/cpu_extension.cmake:85 (message):
vLLM CPU backend using AVX2 ISA
Call Stack (most recent call first):
CMakeLists.txt:92 (include)
-- CPU extension compile flags: -mf16c;-fopenmp;-DVLLM_CPU_EXTENSION;-mavx2
-- Enabling C extension.
-- Configuring done (13.1s)
-- Generating done (0.1s)
。。。。。。
removing build/bdist.linux-x86_64/wheel
Successfully built vllm-0.6.5+cpu.tar.gz and vllm-0.6.5+cpu.cpu-cp310-cp310-linux_x86_64.whl
编译完成之后,编译的结果放到了源码目录下的dist子目录中。最后通过pip进行安装:
$ ll dist/
总计 12276
drwxrwxr-x 2 allen allen 4096 12月 22 09:41 ./
drwx------ 14 allen allen 4096 12月 22 09:37 ../
-rw-rw-r-- 1 allen allen 7743235 12月 22 09:41 vllm-0.6.5+cpu.cpu-cp310-cp310-linux_x86_64.whl
-rw-rw-r-- 1 allen allen 4815390 12月 22 09:37 vllm-0.6.5+cpu.tar.gz
安装指令形如 pip install dist/vllm-0.6.5+cpu.cpu-cp310-cp310-linux_x86_64.whl。跟其他python包一样,由于未采用sudo方式安装,默认都是安装到当前用户主目录下的.local/bin(程序)和.local/lib/python3.10/site-packages(库)。前面我们已经将~/.local/bin加入到环境变量PATH中,因此可以直接执行vllm程序看看:
$ vllm -h
usage: vllm [-h] [-v] {serve,complete,chat} ...
vLLM CLI
positional arguments:
{serve,complete,chat}
serve Start the vLLM OpenAI Compatible API server
complete Generate text completions based on the given prompt via the running API server
chat Generate chat completions via the running API server
options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
然后,我们用先前的测试程序test.py再来检验一次。开始运行的时候,程序会自动下载Qwen/Qwen2-7B-Instruct大模型的文件,并保存到Qwen目录下的.cache目录中。这个表现跟首次运行时不一样,也就是说其实不用手动下载大模型文件。是不是跟设置了镜像有关?应该是,不过起作用的更可能是前面按方法二设置的HF_ENDPOINT环境变量(pypi镜像源只影响pip装包,与模型下载无关)。
$ find work/llm/vllm/Qwen/ -name model-00001-of-00004*
work/llm/vllm/Qwen/model-00001-of-00004.safetensors
work/llm/vllm/Qwen/.cache/huggingface/download/model-00001-of-00004.safetensors.metadata
work/llm/vllm/Qwen/.cache/huggingface/download/model-00001-of-00004.safetensors.lock
下载完成后,test.py并没有运行成功,而是再次出错,提示libc10_cuda.so库文件找不到。我在整个根文件系统查找了该文件,确实没有;有个/usr/lib/cuda/lib64目录,但里面是空的,一个库文件都没有,看来cuda驱动并没有安装成功啊。
$ python progs/test.py
INFO 12-22 11:38:29 config.py:478] This model supports multiple tasks: {'generate', 'reward', 'classify', 'embed', 'score'}. Defaulting to 'generate'.
WARNING 12-22 11:38:29 config.py:604] Async output processing is not supported on the current platform type cpu.
WARNING 12-22 11:38:29 cpu.py:56] CUDA graph is not supported on CPU, fallback to the eager mode.
WARNING 12-22 11:38:29 cpu.py:71] Environment variable VLLM_CPU_KVCACHE_SPACE (GB) for CPU backend is not set, using 4 by default.
INFO 12-22 11:38:29 llm_engine.py:249] Initializing an LLM engine (v0.6.5+cpu) with config: model='Qwen/Qwen2-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cpu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 12-22 11:38:31 cpu.py:33] Cannot use _Backend.FLASH_ATTN backend on CPU.
INFO 12-22 11:38:31 selector.py:141] Using Torch SDPA backend.
WARNING 12-22 11:38:31 _custom_ops.py:20] Failed to import from vllm._C with ImportError('libc10_cuda.so: cannot open shared object file: No such file or directory')
INFO 12-22 11:38:34 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:02<00:08, 2.90s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:07<00:07, 3.97s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:13<00:05, 5.05s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:19<00:00, 5.35s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:19<00:00, 4.94s/it]
INFO 12-22 11:38:54 cpu_executor.py:186] # CPU blocks: 4681
已杀死
哎,千回百转啊!干脆把之前安装的nvidia-cuda-toolkit卸载掉,重新找到 nvidia官网上cuda的安装方法,按其中指导的步骤操作:
CUDA Toolkit Installer
Installation Instructions:
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.6.3/local_installers/cuda-repo-ubuntu2204-12-6-local_12.6.3-560.35.05-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2204-12-6-local_12.6.3-560.35.05-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2204-12-6-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-6
安装完成之后,就可以找到cuda的库文件了:
$ sudo find / -name *cuda*.so 2>/dev/null
/usr/lib/x86_64-linux-gnu/libcuda.so
/usr/lib/x86_64-linux-gnu/libicudata.so
/usr/local/bin/lib/ollama/runners/cuda_v12_avx/libggml_cuda_v12.so
/usr/local/bin/lib/ollama/runners/cuda_v11_avx/libggml_cuda_v11.so
/usr/local/lib/ollama/runners/cuda_v11_avx/libggml_cuda_v11.so
/usr/local/cuda-12.6/targets/x86_64-linux/lib/stubs/libcuda.so
/usr/local/cuda-12.6/targets/x86_64-linux/lib/libcudart.so
/snap/gnome-42-2204/172/usr/lib/x86_64-linux-gnu/libicudata.so
/snap/gnome-42-2204/176/usr/lib/x86_64-linux-gnu/libicudata.so
/snap/gnome-3-38-2004/143/usr/lib/x86_64-linux-gnu/libicudata.so
/opt/nvidia/nsight-compute/2024.3.2/target/linux-desktop-t210-a64/libcuda-injection.so
/opt/nvidia/nsight-compute/2024.3.2/target/linux-desktop-glibc_2_11_3-x64/libcuda-injection.so
然后用新的库文件再把vllm编译一遍,进到vllm-0.6.5源码目录,还是运行如下指令:
VLLM_TARGET_DEVICE=cpu python -m build
不幸的是,编译失败了,提示不能判断出恰当的CUDA架构(实际设备上根本就没有NVIDIA显卡,判定不了也正常),所以此路还是不通,只能说太折腾了!
就先到这里吧,待续。。。