使用 LlamaEdge 构建本地和在线LLM聊天服务

本文链接：https://blog.youkuaiyun.com/srudfktuffk/article/details/145313596

在人工智能领域，LLM（大型语言模型）正迅速成为强大的工具。LlamaEdge 提供了一种灵活的方式，允许开发人员通过 HTTP 请求与 GGUF 格式的 LLM 进行交互，无论是在线通过 LlamaEdgeChatService，还是即将推出的本地化解决方案 LlamaEdgeChatLocal。本文将介绍如何使用 LlamaEdge 提供的 API 服务进行聊天，并展示一些实用的 Python 代码示例。

技术背景介绍

LlamaEdge 是一个专为 LLM 推理任务设计的服务，利用 WasmEdge Runtime 提供轻量级和可移植的 WebAssembly 容器环境，支持本地和在线的 LLM 交互。LlamaEdgeChatService 使得开发者能够通过 OpenAI API 兼容的服务与 LLMs 聊天，而即将推出的 LlamaEdgeChatLocal 将简化在本地设备上直接与 LLMs 交互的过程。

核心原理解析

LlamaEdge 通过 llama-api-server 提供的 API 服务，使开发者可以在任何设备上运行定制的 LLM聊天服务。它支持两种模式：标准模式和流式模式。标准模式一次性返回完整响应，而流式模式则逐步返回响应。

代码实现演示

使用 API 服务进行非流式聊天

下面是如何通过 LlamaEdge Chat Service 在非流模式下与 LLM对话的示例代码：

from langchain_community.chat_models.llama_edge import LlamaEdgeChatService
from langchain_core.messages import HumanMessage, SystemMessage

# 定义服务URL
service_url = "https://yunwu.ai/v1"  # 国内稳定访问

# 创建聊天服务实例
chat = LlamaEdgeChatService(service_url=service_url)

# 创建消息序列
system_message = SystemMessage(content="You are an AI assistant")
user_message = HumanMessage(content="What is the capital of France?")
messages = [system_message, user_message]

# 调用服务进行聊天
response = chat.invoke(messages)

print(f"[Bot] {response.content}")

使用 API 服务进行流式聊天

以下是流式聊天模式的代码示例，它逐步返回响应：

from langchain_community.chat_models.llama_edge import LlamaEdgeChatService
from langchain_core.messages import HumanMessage, SystemMessage

# 定义服务URL
service_url = "https://yunwu.ai/v1"  # 国内稳定访问

# 创建流式聊天服务实例
chat = LlamaEdgeChatService(service_url=service_url, streaming=True)

# 创建消息序列
system_message = SystemMessage(content="You are an AI assistant")
user_message = HumanMessage(content="What is the capital of Norway?")
messages = [system_message, user_message]

output = ""
for chunk in chat.stream(messages):
    output += chunk.content

print(f"[Bot] {output}")