From Local Chat to an Intelligent Service Interface: A Hands-On Guide to Wrapping GLM-4-9B-Chat-1M with FastAPI
Introduction: The Last-Mile Problem of Putting Large Models into Production
Have you hit these pain points: a locally deployed GLM-4 model that can only be called from Python scripts? Wanting to integrate it into business systems but getting stuck on complicated model-invocation logic? Needing to give your team a unified AI service interface but lacking an efficient way to build one? This article walks step by step through wrapping the GLM-4-9B-Chat-1M model as a high-performance API service with FastAPI, removing the main technical barriers to putting the model into real applications.
After reading this article you will know how to:
- Deploy and tune GLM-4-9B-Chat-1M locally
- Build a FastAPI service using its core features and best practices
- Implement long-document handling and streaming responses
- Performance-tune and load-test the model service
- Deploy and monitor the complete API service
1. A Closer Look at the GLM-4-9B-Chat-1M Model
1.1 Core Model Features
GLM-4-9B-Chat-1M is a new-generation open-source chat model released by THUDM, with three core strengths:
| Feature | Technical specs | Typical scenarios |
|---|---|---|
| Ultra-long context | Up to 1M tokens (roughly 2 million Chinese characters) | Document processing, legal analysis, code audits |
| Multilingual support | Native support for 26 languages | Cross-border customer service, multilingual content generation |
| Tool calling | Built-in function-calling mechanism | Intelligent Q&A systems, office automation |
1.2 Architecture Notes
A key building block is its rotary positional embedding (RoPE) implementation; the following snippet from the model code shows how it works:
def apply_rotary_pos_emb(x: torch.Tensor, rope_cache: torch.Tensor) -> torch.Tensor:
    # x: [b, np, sq, hn]
    b, np, sq, hn = x.size(0), x.size(1), x.size(2), x.size(3)
    rot_dim = rope_cache.shape[-2] * 2
    x, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    rope_cache = rope_cache[:, :sq]
    xshaped = x.reshape(b, np, sq, rot_dim // 2, 2)
    rope_cache = rope_cache.view(-1, 1, sq, xshaped.size(3), 2)
    x_out2 = torch.stack([
        xshaped[..., 0] * rope_cache[..., 0] - xshaped[..., 1] * rope_cache[..., 1],
        xshaped[..., 1] * rope_cache[..., 0] + xshaped[..., 0] * rope_cache[..., 1]
    ], -1)
    x_out2 = x_out2.flatten(3)
    return torch.cat((x_out2, x_pass), dim=-1)
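To make the tensor bookkeeping concrete, here is a quick shape check that exercises apply_rotary_pos_emb with dummy tensors; the rope_cache layout of [batch, seq, rot_dim // 2, 2] is inferred from the reshaping above, not taken from the model's documentation:
import torch

# Dummy inputs: batch=1, heads=2, seq=8, head_dim=64; the rotary part covers 32 of the 64 dims
x = torch.randn(1, 2, 8, 64)
rope_cache = torch.randn(1, 8, 16, 2)  # assumed layout: [batch, seq, rot_dim // 2, 2]

out = apply_rotary_pos_emb(x, rope_cache)
print(out.shape)  # torch.Size([1, 2, 8, 64]) — the first 32 dims are rotated, the rest pass through unchanged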
1.3 Local Deployment and Environment Setup
Base environment requirements
- Python 3.8+
- PyTorch 2.0+
- At least 24 GB of GPU memory (an A100 or comparable GPU is recommended)
- 32 GB+ of system RAM
Quick deployment steps
# Clone the repository
git clone https://gitcode.com/hf_mirrors/THUDM/glm-4-9b-chat-1m.git
cd glm-4-9b-chat-1m
# Create a virtual environment
conda create -n glm4 python=3.10
conda activate glm4
# Install dependencies
pip install torch==2.1.0 transformers==4.44.0 fastapi uvicorn pydantic-settings python-multipart
Basic invocation example
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("./", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "./",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(device).eval()

# Build the conversation
prompt = [{"role": "user", "content": "Please introduce the core strengths of GLM-4-9B-Chat-1M"}]
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True
).to(device)

# Generate a reply
gen_kwargs = {"max_length": 2048, "do_sample": True, "temperature": 0.8}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
2. FastAPI Service Fundamentals
2.1 Core Advantages of FastAPI
FastAPI is a modern, high-performance Python API framework that is particularly well suited to serving AI models. Its core advantages include:
- Automatically generated interactive API documentation (Swagger UI and ReDoc)
- Pydantic-based data validation that keeps request parameters correct (see the sketch below)
- Async support that significantly improves concurrency
- Type hints that improve code maintainability and IDE support
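To illustrate the validation point, a request model can declare constraints directly on its fields, and FastAPI rejects out-of-range input with a 422 response before your handler runs. The bounds below are illustrative choices, not values mandated by GLM-4:
from pydantic import BaseModel, Field
from typing import List, Dict, Optional

class ValidatedChatRequest(BaseModel):
    messages: List[Dict[str, str]]
    # Sampling temperature must fall in [0, 2]; anything else triggers an automatic 422
    temperature: Optional[float] = Field(default=0.7, ge=0.0, le=2.0)
    # Cap max_tokens to guard against runaway generation requests
    max_tokens: Optional[int] = Field(default=1024, ge=1, le=8192)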
2.2 Project Layout
glm4-api/
├── app/
│   ├── __init__.py
│   ├── main.py                  # FastAPI application entry point
│   ├── models/                  # data model definitions
│   │   ├── __init__.py
│   │   └── request.py           # request body models
│   ├── api/                     # API routes
│   │   ├── __init__.py
│   │   └── v1/
│   │       ├── __init__.py
│   │       ├── endpoints/
│   │       │   ├── __init__.py
│   │       │   └── chat.py      # chat API
│   │       └── router.py        # route aggregation
│   ├── core/                    # core configuration
│   │   ├── __init__.py
│   │   ├── config.py            # settings management
│   │   └── logger.py            # logging configuration
│   └── services/                # business logic
│       ├── __init__.py
│       └── glm_service.py       # model service wrapper
├── requirements.txt             # project dependencies
├── .env                         # environment variables
└── run.py                       # service startup script
2.3 Setting Up the Basic Service
Install the core dependencies
pip install fastapi uvicorn pydantic-settings python-multipart
A minimal API example
# app/main.py
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List, Dict, Optional

app = FastAPI(
    title="GLM-4-9B-Chat-1M API Service",
    description="A FastAPI service for GLM-4-9B-Chat-1M model",
    version="1.0.0"
)

# Request model
class ChatRequest(BaseModel):
    messages: List[Dict[str, str]]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 1024

# Response model
class ChatResponse(BaseModel):
    response: str
    request_id: str
    token_usage: Dict[str, int]

@app.post("/api/v1/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """Basic chat API"""
    # The real model call is wired in from section 3 onwards
    return {
        "response": "This is a sample response",
        "request_id": "test-12345",
        "token_usage": {"prompt_tokens": 20, "completion_tokens": 50, "total_tokens": 70}
    }

@app.get("/health")
async def health_check():
    """Service health check"""
    return {"status": "healthy", "model": "glm-4-9b-chat-1m"}
Starting the service
# run.py
import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "app.main:app",   # pass the app as an import string so reload works
        host="0.0.0.0",
        port=8000,
        workers=1,
        reload=True       # auto-reload is for development only; disable it in production
    )
Start the service: python run.py
Open the interactive API docs at: http://localhost:8000/docs
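With the service running, you can exercise both endpoints from any HTTP client. A small Python example using the requests library (an extra dependency, not part of the install commands above) might look like this:
import requests

BASE_URL = "http://localhost:8000"

# Health check
print(requests.get(f"{BASE_URL}/health").json())

# Basic chat call (returns the placeholder response until the model is wired in)
payload = {
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "temperature": 0.7,
    "max_tokens": 256
}
resp = requests.post(f"{BASE_URL}/api/v1/chat", json=payload, timeout=120)
print(resp.json()["response"])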
3. Wrapping and Optimizing the Model Service
3.1 Singleton Model Management
To avoid wasting resources by loading the model more than once, the model manager is implemented as a singleton:
# app/services/glm_service.py
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
from typing import List, Dict, Optional, Tuple

class GLMService:
    _instance = None
    _model = None
    _tokenizer = None
    _device = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def load_model(self, model_path: str = "./", device: Optional[str] = None):
        """Load the model and tokenizer."""
        start_time = time.time()
        self._device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Loading model from {model_path} to {self._device}...")
        self._tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        self._model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        ).to(self._device).eval()
        load_time = time.time() - start_time
        print(f"Model loaded successfully in {load_time:.2f} seconds")
        return self

    def generate(
        self,
        messages: List[Dict[str, str]],
        temperature: float = 0.7,
        max_tokens: int = 1024
    ) -> Tuple[str, Dict[str, int]]:
        """Generate a reply for a list of chat messages."""
        if self._model is None or self._tokenizer is None:
            raise RuntimeError("Model not loaded. Call load_model() first.")
        # Build the model input from the chat template
        inputs = self._tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=True,
            return_tensors="pt",
            return_dict=True
        ).to(self._device)
        prompt_tokens = inputs['input_ids'].shape[1]
        # Generation parameters: greedy decoding when temperature is 0
        gen_kwargs = {
            "max_length": prompt_tokens + max_tokens,
            "do_sample": temperature > 0,
        }
        if temperature > 0:
            gen_kwargs["temperature"] = temperature
            gen_kwargs["top_p"] = 0.9
        with torch.no_grad():
            outputs = self._model.generate(**inputs, **gen_kwargs)
        # Decode only the newly generated tokens
        response = self._tokenizer.decode(
            outputs[0][prompt_tokens:],
            skip_special_tokens=True
        )
        # Token accounting
        completion_tokens = outputs[0].shape[0] - prompt_tokens
        token_usage = {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens
        }
        return response, token_usage
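Because the singleton holds a single copy of the model, concurrent requests would otherwise run generate() on the same weights at the same time. If you do not add request batching, one simple option (a sketch of a possible hardening, not part of the original class) is to serialize access with a module-level lock:
import threading

generate_lock = threading.Lock()

def generate_safely(service: GLMService, messages, temperature: float = 0.7, max_tokens: int = 1024):
    """Serialize access to the shared model so concurrent requests don't interleave on one GPU."""
    with generate_lock:
        return service.generate(messages, temperature=temperature, max_tokens=max_tokens)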
3.2 Model Loading and Initialization
# app/core/config.py
from pydantic_settings import BaseSettings
from typing import Optional

class Settings(BaseSettings):
    model_path: str = "./"          # path to the model weights
    device: Optional[str] = None    # device; auto-detected when None
    api_prefix: str = "/api/v1"
    log_level: str = "INFO"

    class Config:
        env_file = ".env"

# A single shared settings instance that other modules can import
settings = Settings()

# app/main.py (updated)
from app.services.glm_service import GLMService
from app.core.config import settings

# Initialize the model service at startup
model_service = GLMService().load_model(
    model_path=settings.model_path,
    device=settings.device
)
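The matching .env file simply lists the fields of Settings; pydantic-settings matches environment variable names to field names case-insensitively. The values below are placeholders for your own paths:
# .env
MODEL_PATH=/path/to/glm-4-9b-chat-1m
DEVICE=cuda
API_PREFIX=/api/v1
LOG_LEVEL=INFO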
3.3 Synchronous and Asynchronous Invocation
FastAPI supports both synchronous and asynchronous handlers. Since the model call itself is synchronous and long-running, we off-load it to a thread pool so it does not block the event loop:
# app/api/v1/endpoints/chat.py
import asyncio
import time
import uuid
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel
from typing import List, Dict, Optional
from app.services.glm_service import GLMService

router = APIRouter()
model_service = GLMService()

# Request and response models (same as in the previous section)
class ChatRequest(BaseModel):
    messages: List[Dict[str, str]]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 1024

class ChatResponse(BaseModel):
    response: str
    request_id: str
    token_usage: Dict[str, int]
    took: float

@router.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """Chat API (blocking model call off-loaded to a thread pool)"""
    request_id = str(uuid.uuid4())
    start_time = time.time()
    try:
        # Run the synchronous model call in the default thread-pool executor
        loop = asyncio.get_running_loop()
        response, token_usage = await loop.run_in_executor(
            None,
            lambda: model_service.generate(
                messages=request.messages,
                temperature=request.temperature,
                max_tokens=request.max_tokens
            )
        )
        took = time.time() - start_time
        return {
            "response": response,
            "request_id": request_id,
            "token_usage": token_usage,
            "took": took
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
4. Advanced Features
4.1 Streaming Responses
For long-form generation, streaming responses noticeably improve the user experience:
# app/api/v1/endpoints/stream_chat.py
import json
import uuid
import asyncio
from fastapi import APIRouter
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from typing import List, Dict, Optional, AsyncGenerator
from app.services.glm_service import GLMService

router = APIRouter()
model_service = GLMService()

class StreamChatRequest(BaseModel):
    messages: List[Dict[str, str]]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 1024

@router.post("/stream_chat")
async def stream_chat(request: StreamChatRequest):
    """Streaming chat API (Server-Sent Events)"""
    request_id = str(uuid.uuid4())

    async def event_generator() -> AsyncGenerator[str, None]:
        try:
            # Simplified: generate the full reply first, then replay it in chunks.
            # A real implementation should use the model's token-level streaming
            # (see the TextIteratorStreamer sketch below).
            full_response = ""
            response, _ = model_service.generate(
                messages=request.messages,
                temperature=request.temperature,
                max_tokens=request.max_tokens
            )
            # Simulate streaming output
            for i in range(0, len(response), 5):
                chunk = response[i:i + 5]
                full_response += chunk
                payload = {"chunk": chunk, "request_id": request_id, "done": False}
                yield f"data: {json.dumps(payload, ensure_ascii=False)}\n\n"
                await asyncio.sleep(0.05)
            # Send the end-of-stream signal
            payload = {
                "chunk": "",
                "request_id": request_id,
                "done": True,
                "full_response": full_response
            }
            yield f"data: {json.dumps(payload, ensure_ascii=False)}\n\n"
        except Exception as e:
            yield f"data: {json.dumps({'error': str(e), 'request_id': request_id})}\n\n"

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream"
    )
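For real token-level streaming, transformers ships a TextIteratorStreamer that yields decoded text as generation proceeds. A minimal sketch of how this could be added to GLMService (the generate_stream method name is our own, not part of the class shown earlier):
# Sketch: token-level streaming as an extra GLMService method
from threading import Thread
from transformers import TextIteratorStreamer

def generate_stream(self, messages, temperature: float = 0.7, max_tokens: int = 1024):
    """Yield decoded text chunks as soon as the model produces them."""
    inputs = self._tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_tensors="pt", return_dict=True
    ).to(self._device)
    streamer = TextIteratorStreamer(
        self._tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    gen_kwargs = dict(**inputs, streamer=streamer,
                      max_new_tokens=max_tokens, do_sample=temperature > 0)
    if temperature > 0:
        gen_kwargs["temperature"] = temperature
    # Run generation in a background thread; iterating the streamer drains its queue
    Thread(target=self._model.generate, kwargs=gen_kwargs, daemon=True).start()
    for text in streamer:
        yield text
The event generator can then iterate model_service.generate_stream(...) and wrap each chunk in an SSE frame; because the iterator blocks, run each step in a thread pool (as in section 3.3) to keep the event loop responsive.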
Example front-end JavaScript call (EventSource only supports GET, so we POST with fetch and read the SSE stream from the response body):
const responseElement = document.getElementById('response');

async function streamChat() {
  const resp = await fetch('/api/v1/stream_chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      messages: [{ role: 'user', content: 'Explain how streaming output works in large language models' }],
      temperature: 0.7
    })
  });

  const reader = resp.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const frames = buffer.split('\n\n');
    buffer = frames.pop();  // keep any incomplete frame for the next read
    for (const frame of frames) {
      if (!frame.startsWith('data: ')) continue;
      const data = JSON.parse(frame.slice(6));
      if (data.error) {
        console.error('Error:', data.error);
        return;
      } else if (data.done) {
        return;
      } else {
        responseElement.innerHTML += data.chunk;
      }
    }
  }
}

streamChat();
4.2 Handling Long Documents
Building on GLM-4-9B-Chat-1M's ultra-long context, we expose document question answering and summarization:
# app/api/v1/endpoints/document.py
import uuid
from fastapi import APIRouter, HTTPException, UploadFile, File
from pydantic import BaseModel
from typing import Optional
from app.services.glm_service import GLMService

router = APIRouter()
model_service = GLMService()

class DocumentQARequest(BaseModel):
    document: str
    question: str
    max_tokens: Optional[int] = 512

class DocumentSummaryRequest(BaseModel):
    document: str
@router.post("/document/qa")
async def document_qa(request: DocumentQARequest):
"""文档问答API"""
# 构建提示词
prompt = f"""基于以下文档内容回答问题。如果文档中没有相关信息,请回答"无法从文档中找到答案"。
文档内容:
{request.document[:10000]} # 限制文档长度,实际应用可优化
问题: {request.question}
回答:"""
messages = [{"role": "user", "content": prompt}]
try:
response, token_usage = model_service.generate(
messages=messages,
temperature=0.3, # 降低随机性,提高答案准确性
max_tokens=request.max_tokens
)
return {
"question": request.question,
"answer": response,
"token_usage": token_usage
}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@router.post("/document/summary")
async def document_summary(request: DocumentQARequest):
"""文档摘要API"""
# 构建提示词
prompt = f"""请为以下文档生成摘要,要求:
1. 保留核心观点和关键数据
2. 结构清晰,分点说明
3. 长度不超过300字
文档内容:
{request.document[:20000]}
摘要:"""
messages = [{"role": "user", "content": prompt}]
try:
response, token_usage = model_service.generate(
messages=messages,
temperature=0.5,
max_tokens=600 # 摘要通常需要更多token
)
return {
"summary": response,
"token_usage": token_usage
}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@router.post("/document/upload")
async def upload_document(file: UploadFile = File(...)):
"""上传文档并处理"""
if file.content_type not in ["text/plain", "application/pdf", "text/markdown"]:
raise HTTPException(status_code=400, detail="不支持的文件类型")
# 读取文件内容(实际应用需处理不同格式)
content = await file.read()
try:
content = content.decode("utf-8")
except:
raise HTTPException(status_code=400, detail="文件解码失败")
# 生成文档ID
doc_id = str(uuid.uuid4())
# 这里可以添加文档存储逻辑
return {
"doc_id": doc_id,
"filename": file.filename,
"content_length": len(content),
"message": "文件上传成功"
}
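The endpoints above simply truncate very long inputs. When a document exceeds whatever context budget you allow, a common workaround is a map-reduce pass: summarize fixed-size chunks, then summarize the partial summaries. A rough sketch (the chunk_and_summarize helper and the 20,000-character chunk size are our own choices, not part of the original service):
def chunk_and_summarize(document: str, chunk_size: int = 20000) -> str:
    """Map-reduce summarization for documents longer than a single prompt budget."""
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    partial_summaries = []
    for chunk in chunks:
        prompt = f"Summarize the following text in a few sentences:\n\n{chunk}\n\nSummary:"
        summary, _ = model_service.generate(
            messages=[{"role": "user", "content": prompt}],
            temperature=0.5,
            max_tokens=300
        )
        partial_summaries.append(summary)
    # Reduce step: merge the partial summaries into one final summary
    merged = "\n".join(partial_summaries)
    final_prompt = f"Combine the following partial summaries into one coherent summary:\n\n{merged}\n\nFinal summary:"
    final_summary, _ = model_service.generate(
        messages=[{"role": "user", "content": final_prompt}],
        temperature=0.5,
        max_tokens=600
    )
    return final_summary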
4.3 Integrating Tool Calling
Using GLM-4's tool-calling ability, we integrate external tools such as a weather lookup and a calculator:
# app/services/tool_service.py
from typing import Dict, Any

class ToolService:
    """Registry and dispatcher for callable tools"""

    def __init__(self):
        self.tools = {
            "weather": self.get_weather,
            "calculator": self.calculate
        }

    def get_weather(self, location: str) -> Dict[str, Any]:
        """Get weather information (mocked)"""
        # Replace with a real weather API in production
        return {
            "location": location,
            "temperature": "25°C",
            "description": "sunny",
            "humidity": "45%",
            "wind": "light breeze"
        }

    def calculate(self, expression: str) -> Dict[str, Any]:
        """Evaluate a math expression"""
        try:
            # Note: eval() is unsafe on untrusted input; use a restricted evaluator
            # in production (see the safe_eval sketch below)
            result = eval(expression)
            return {"expression": expression, "result": result}
        except Exception as e:
            return {"error": str(e)}

    def call_tool(self, tool_name: str, parameters: Dict[str, Any]) -> Dict[str, Any]:
        """Dispatch a tool call by name"""
        if tool_name not in self.tools:
            return {"error": f"Tool {tool_name} not found"}
        try:
            return self.tools[tool_name](**parameters)
        except Exception as e:
            return {"error": str(e)}
# app/api/v1/endpoints/tool_chat.py
import json
import uuid
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel
from typing import List, Dict, Optional
from app.services.glm_service import GLMService
from app.services.tool_service import ToolService

router = APIRouter()
model_service = GLMService()
tool_service = ToolService()

class ToolChatRequest(BaseModel):
    messages: List[Dict[str, str]]
    temperature: Optional[float] = 0.7

@router.post("/tool_chat")
async def tool_chat(request: ToolChatRequest):
    """Tool-calling chat API (a simple prompt-based protocol; GLM-4 also has native function calling)"""
    request_id = str(uuid.uuid4())
    # Describe the available tools and the expected reply format to the model
    tools_desc = """You can use the following tools to answer questions:
1. Weather lookup
   - name: weather
   - parameters: location (string, city name)
   - description: get the weather for a given city
2. Calculator
   - name: calculator
   - parameters: expression (string, math expression)
   - description: evaluate a math expression
If a tool is needed, reply with ONLY a JSON object like {"tool": "<name>", "parameters": {...}}; otherwise answer directly."""
    messages = [{"role": "system", "content": tools_desc}] + request.messages
    try:
        response, token_usage = model_service.generate(messages=messages, temperature=request.temperature)
        tool_result = None
        try:
            call = json.loads(response)  # did the model ask for a tool?
            if isinstance(call, dict) and "tool" in call:
                tool_result = tool_service.call_tool(call["tool"], call.get("parameters", {}))
                messages += [{"role": "assistant", "content": response},
                             {"role": "user", "content": f"Tool result: {json.dumps(tool_result, ensure_ascii=False)}. Use it to answer the original question."}]
                response, token_usage = model_service.generate(messages=messages, temperature=request.temperature)
        except json.JSONDecodeError:
            pass  # plain-text answer, no tool call
        return {"response": response, "request_id": request_id, "tool_result": tool_result, "token_usage": token_usage}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))



