[vLLM Learning] CPU Offload

HyperAI超神经

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/HyperAI/article/details/147564831

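vLLM is a framework designed to accelerate large language model (LLM) inference. It keeps KV-cache memory waste close to zero, which removes a major memory-management bottleneck.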
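To illustrate why near-zero KV-cache waste is possible, here is a minimal sketch under a simplified model of vLLM's PagedAttention: each sequence's KV cache is stored in fixed-size blocks allocated on demand, instead of in a contiguous buffer reserved up front for the maximum sequence length. The block size of 16 tokens, the 2048-token maximum, and the sequence lengths are illustrative assumptions, and the functions are hypothetical helpers, not part of vLLM's API.

```python
import math

BLOCK_SIZE = 16      # tokens per KV-cache block (illustrative assumption)
MAX_SEQ_LEN = 2048   # per-sequence reservation in the naive scheme (assumption)


def blocks_needed(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks allocated on demand for num_tokens tokens."""
    return math.ceil(num_tokens / block_size)


def wasted_slots_paged(num_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Unused token slots under block allocation: only the tail of the last block."""
    return blocks_needed(num_tokens, block_size) * block_size - num_tokens


def wasted_slots_contiguous(num_tokens: int, max_len: int = MAX_SEQ_LEN) -> int:
    """Unused token slots when max_len slots are reserved up front."""
    return max_len - num_tokens


if __name__ == "__main__":
    for n in [37, 250, 1000]:  # hypothetical sequence lengths
        print(f"{n} tokens: paged waste={wasted_slots_paged(n)}, "
              f"contiguous waste={wasted_slots_contiguous(n)}")
```

Under block allocation the unused slots per sequence are always fewer than one block, whereas an up-front reservation wastes everything the sequence does not use of its maximum length.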

More vLLM documentation and tutorials in Chinese are available at https://vllm.hyper.ai/

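Source code: vllm-project/vllm

The example below runs offline inference with a 13B model and passes cpu_offload_gb=10, which offloads up to 10 GiB of the model weights per GPU to CPU memory.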

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM. cpu_offload_gb=10 offloads up to 10 GiB of the model
# weights (per GPU) to CPU memory, so the 13B model can run on a GPU
# that cannot hold all of its weights on its own.
llm = LLM(model="meta-llama/Llama-2-13b-chat-hf", cpu_offload_gb=10)

# Generate texts from the prompts. The output is a list of RequestOutput
# objects that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
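The memory arithmetic behind cpu_offload_gb can be sketched roughly. vLLM's engine-argument documentation describes cpu_offload_gb as the space in GiB to offload to CPU per GPU, and suggests thinking of it as virtually enlarging GPU memory: a 24 GB GPU with cpu_offload_gb=10 behaves like a 34 GB one for the purpose of holding weights. The cost is that the offloaded weights are copied from CPU to GPU memory during every forward pass, so a fast CPU-GPU interconnect matters. The helper functions below are hypothetical; they count weights only (assuming 2 bytes per parameter for FP16/BF16) and ignore the KV cache, activations, and gpu_memory_utilization, so the result is a lower bound on the memory actually needed.

```python
# Hypothetical, weights-only estimate of how cpu_offload_gb changes the budget.
# Ignores KV cache, activations, CUDA graphs, and gpu_memory_utilization.

GIB = 1024 ** 3


def weight_size_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight footprint in GiB (2 bytes/param assumes FP16/BF16)."""
    return num_params * bytes_per_param / GIB


def weights_fit(num_params: float, gpu_mem_gib: float, cpu_offload_gb: float) -> bool:
    """Treat cpu_offload_gb as extra 'virtual' GPU memory, for weights only."""
    virtual_gib = gpu_mem_gib + cpu_offload_gb
    return weight_size_gib(num_params) <= virtual_gib


if __name__ == "__main__":
    n = 13e9  # roughly 13B parameters (approximate, for illustration)
    print(f"weights: {weight_size_gib(n):.1f} GiB")
    print("24 GiB GPU, no offload:", weights_fit(n, 24, 0))
    print("24 GiB GPU, cpu_offload_gb=10:", weights_fit(n, 24, 10))
```

Under this simplified model, about 13 billion parameters in 16-bit precision take roughly 24.2 GiB, slightly more than a 24 GiB card; with 10 GiB offloaded the weights fit, leaving GPU memory for the KV cache, which a real deployment also needs.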