From Local Chat to an Intelligent Service Interface: A Hands-On Guide to Wrapping GLM-4-9B-Chat-1M with FastAPI
Introduction: The Last-Mile Problem of Putting Large Models into Production
Have you hit these pain points: a locally deployed GLM-4 model that can only be called from Python scripts? Wanting to integrate it into business systems but getting stuck on complicated model-invocation logic? Needing to give your team a unified AI service interface but lacking an efficient way to build one? This article walks step by step through wrapping the GLM-4-9B-Chat-1M model as a high-performance API service with FastAPI, removing the main technical barriers to putting the model into real applications.
After reading this article you will know how to:
- Deploy and tune GLM-4-9B-Chat-1M locally
- Build a FastAPI service using its core features and best practices
- Implement long-document handling and streaming responses
- Performance-tune and load-test the model service
- Deploy and monitor the complete API service
1. A Closer Look at the GLM-4-9B-Chat-1M Model
1.1 Core Model Features
GLM-4-9B-Chat-1M is a new-generation open-source chat model released by THUDM, with three core strengths:
| Feature | Technical specs | Typical scenarios |
|---|---|---|
| Ultra-long context | Up to 1M tokens (roughly 2 million Chinese characters) | Document processing, legal analysis, code audits |
| Multilingual support | Native support for 26 languages | Cross-border customer service, multilingual content generation |
| Tool calling | Built-in function-calling mechanism | Intelligent Q&A systems, office automation |
1.2 Architecture Notes
A key building block is its rotary positional embedding (RoPE) implementation; the following snippet from the model code shows how it works:
def apply_rotary_pos_emb(x: torch.Tensor, rope_cache: torch.Tensor) -> torch.Tensor:
    # x: [b, np, sq, hn]
    b, np, sq, hn = x.size(0), x.size(1), x.size(2), x.size(3)
    rot_dim = rope_cache.shape[-2] * 2
    x, x_pass = x[..., :rot_dim], x[..., rot_dim:]
    rope_cache = rope_cache[:, :sq]
    xshaped = x.reshape(b, np, sq, rot_dim // 2, 2)
    rope_cache = rope_cache.view(-1, 1, sq, xshaped.size(3), 2)
    x_out2 = torch.stack([
        xshaped[..., 0] * rope_cache[..., 0] - xshaped[..., 1] * rope_cache[..., 1],
        xshaped[..., 1] * rope_cache[..., 0] + xshaped[..., 0] * rope_cache[..., 1]
    ], -1)
    x_out2 = x_out2.flatten(3)
    return torch.cat((x_out2, x_pass), dim=-1)
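To make the tensor bookkeeping concrete, here is a quick shape check that exercises apply_rotary_pos_emb with dummy tensors; the rope_cache layout of [batch, seq, rot_dim // 2, 2] is inferred from the reshaping above, not taken from the model's documentation:
import torch

# Dummy inputs: batch=1, heads=2, seq=8, head_dim=64; the rotary part covers 32 of the 64 dims
x = torch.randn(1, 2, 8, 64)
rope_cache = torch.randn(1, 8, 16, 2)  # assumed layout: [batch, seq, rot_dim // 2, 2]

out = apply_rotary_pos_emb(x, rope_cache)
print(out.shape)  # torch.Size([1, 2, 8, 64]) — the first 32 dims are rotated, the rest pass through unchanged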
1.3 Local Deployment and Environment Setup
Base environment requirements
- Python 3.8+
- PyTorch 2.0+
- At least 24 GB of GPU memory (an A100 or comparable GPU is recommended)
- 32 GB+ of system RAM
Quick deployment steps
# Clone the repository
git clone https://gitcode.com/hf_mirrors/THUDM/glm-4-9b-chat-1m.git
cd glm-4-9b-chat-1m
# Create a virtual environment
conda create -n glm4 python=3.10
conda activate glm4
# Install dependencies
pip install torch==2.1.0 transformers==4.44.0 fastapi uvicorn pydantic-settings python-multipart
Basic invocation example
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("./", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "./",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(device).eval()

# Build the conversation
prompt = [{"role": "user", "content": "Please introduce the core strengths of GLM-4-9B-Chat-1M"}]
inputs = tokenizer.apply_chat_template(
    prompt,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True
).to(device)

# Generate a reply
gen_kwargs = {"max_length": 2048, "do_sample": True, "temperature": 0.8}
with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
2. FastAPI Service Fundamentals
2.1 Core Advantages of FastAPI
FastAPI is a modern, high-performance Python API framework that is particularly well suited to serving AI models. Its core advantages include:
- Automatically generated interactive API documentation (Swagger UI and ReDoc)
- Pydantic-based data validation that keeps request parameters correct (see the sketch below)
- Async support that significantly improves concurrency
- Type hints that improve code maintainability and IDE support
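To illustrate the validation point, a request model can declare constraints directly on its fields, and FastAPI rejects out-of-range input with a 422 response before your handler runs. The bounds below are illustrative choices, not values mandated by GLM-4:
from pydantic import BaseModel, Field
from typing import List, Dict, Optional

class ValidatedChatRequest(BaseModel):
    messages: List[Dict[str, str]]
    # Sampling temperature must fall in [0, 2]; anything else triggers an automatic 422
    temperature: Optional[float] = Field(default=0.7, ge=0.0, le=2.0)
    # Cap max_tokens to guard against runaway generation requests
    max_tokens: Optional[int] = Field(default=1024, ge=1, le=8192)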
2.2 Project Layout
glm4-api/
├── app/
│   ├── __init__.py
│   ├── main.py                  # FastAPI application entry point
│   ├── models/                  # data model definitions
│   │   ├── __init__.py
│   │   └── request.py           # request body models
│   ├── api/                     # API routes
│   │   ├── __init__.py
│   │   └── v1/
│   │       ├── __init__.py
│   │       ├── endpoints/
│   │       │   ├── __init__.py
│   │       │   └── chat.py      # chat API
│   │       └── router.py        # route aggregation
│   ├── core/                    # core configuration
│   │   ├── __init__.py
│   │   ├── config.py            # settings management
│   │   └── logger.py            # logging configuration
│   └── services/                # business logic
│       ├── __init__.py
│       └── glm_service.py       # model service wrapper
├── requirements.txt             # project dependencies
├── .env                         # environment variables
└── run.py                       # service startup script
2.3 Setting Up the Basic Service
Install the core dependencies
pip install fastapi uvicorn pydantic-settings python-multipart
A minimal API example
# app/main.py
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List, Dict, Optional

app = FastAPI(
    title="GLM-4-9B-Chat-1M API Service",
    description="A FastAPI service for GLM-4-9B-Chat-1M model",
    version="1.0.0"
)

# Request model
class ChatRequest(BaseModel):
    messages: List[Dict[str, str]]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 1024

# Response model
class ChatResponse(BaseModel):
    response: str
    request_id: str
    token_usage: Dict[str, int]

@app.post("/api/v1/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """Basic chat API"""
    # The real model call is wired in from section 3 onwards
    return {
        "response": "This is a sample response",
        "request_id": "test-12345",
        "token_usage": {"prompt_tokens": 20, "completion_tokens": 50, "total_tokens": 70}
    }

@app.get("/health")
async def health_check():
    """Service health check"""
    return {"status": "healthy", "model": "glm-4-9b-chat-1m"}
Starting the service
# run.py
import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "app.main:app",   # pass the app as an import string so reload works
        host="0.0.0.0",
        port=8000,
        workers=1,
        reload=True       # auto-reload is for development only; disable it in production
    )
Start the service: python run.py
Open the interactive API docs at: http://localhost:8000/docs
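With the service running, you can exercise both endpoints from any HTTP client. A small Python example using the requests library (an extra dependency, not part of the install commands above) might look like this:
import requests

BASE_URL = "http://localhost:8000"

# Health check
print(requests.get(f"{BASE_URL}/health").json())

# Basic chat call (returns the placeholder response until the model is wired in)
payload = {
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "temperature": 0.7,
    "max_tokens": 256
}
resp = requests.post(f"{BASE_URL}/api/v1/chat", json=payload, timeout=120)
print(resp.json()["response"])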
3. Wrapping and Optimizing the Model Service
3.1 Singleton Model Management
To avoid wasting resources by loading the model more than once, the model manager is implemented as a singleton:
# app/services/glm_service.py
import torch
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
from typing import List, Dict, Optional, Tuple

class GLMService:
    _instance = None
    _model = None
    _tokenizer = None
    _device = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def load_model(self, model_path: str = "./", device: Optional[str] = None):
        """Load the model and tokenizer."""
        start_time = time.time()
        self._device = device or ("cuda" if torch.cuda.is_available() else "cpu")
        print(f"Loading model from {model_path} to {self._device}...")
        self._tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        self._model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        ).to(self._device).eval()
        load_time = time.time() - start_time
        print(f"Model loaded successfully in {load_time:.2f} seconds")
        return self

    def generate(
        self,
        messages: List[Dict[str, str]],
        temperature: float = 0.7,
        max_tokens: int = 1024
    ) -> Tuple[str, Dict[str, int]]:
        """Generate a reply for a list of chat messages."""
        if self._model is None or self._tokenizer is None:
            raise RuntimeError("Model not loaded. Call load_model() first.")
        # Build the model input from the chat template
        inputs = self._tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            tokenize=True,
            return_tensors="pt",
            return_dict=True
        ).to(self._device)
        prompt_tokens = inputs['input_ids'].shape[1]
        # Generation parameters: greedy decoding when temperature is 0
        gen_kwargs = {
            "max_length": prompt_tokens + max_tokens,
            "do_sample": temperature > 0,
        }
        if temperature > 0:
            gen_kwargs["temperature"] = temperature
            gen_kwargs["top_p"] = 0.9
        with torch.no_grad():
            outputs = self._model.generate(**inputs, **gen_kwargs)
        # Decode only the newly generated tokens
        response = self._tokenizer.decode(
            outputs[0][prompt_tokens:],
            skip_special_tokens=True
        )
        # Token accounting
        completion_tokens = outputs[0].shape[0] - prompt_tokens
        token_usage = {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens
        }
        return response, token_usage
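Because the singleton holds a single copy of the model, concurrent requests would otherwise run generate() on the same weights at the same time. If you do not add request batching, one simple option (a sketch of a possible hardening, not part of the original class) is to serialize access with a module-level lock:
import threading

generate_lock = threading.Lock()

def generate_safely(service: GLMService, messages, temperature: float = 0.7, max_tokens: int = 1024):
    """Serialize access to the shared model so concurrent requests don't interleave on one GPU."""
    with generate_lock:
        return service.generate(messages, temperature=temperature, max_tokens=max_tokens)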
3.2 Model Loading and Initialization
# app/core/config.py
from pydantic_settings import BaseSettings
from typing import Optional

class Settings(BaseSettings):
    model_path: str = "./"          # path to the model weights
    device: Optional[str] = None    # device; auto-detected when None
    api_prefix: str = "/api/v1"
    log_level: str = "INFO"

    class Config:
        env_file = ".env"

# A single shared settings instance that other modules can import
settings = Settings()

# app/main.py (updated)
from app.services.glm_service import GLMService
from app.core.config import settings

# Initialize the model service at startup
model_service = GLMService().load_model(
    model_path=settings.model_path,
    device=settings.device
)
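The matching .env file simply lists the fields of Settings; pydantic-settings matches environment variable names to field names case-insensitively. The values below are placeholders for your own paths:
# .env
MODEL_PATH=/path/to/glm-4-9b-chat-1m
DEVICE=cuda
API_PREFIX=/api/v1
LOG_LEVEL=INFO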
3.3 Synchronous and Asynchronous Invocation
FastAPI supports both synchronous and asynchronous handlers. Since the model call itself is synchronous and long-running, we off-load it to a thread pool so it does not block the event loop:
# app/api/v1/endpoints/chat.py
import asyncio
import time
import uuid
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel
from typing import List, Dict, Optional
from app.services.glm_service import GLMService

router = APIRouter()
model_service = GLMService()

# Request and response models (same as in the previous section)
class ChatRequest(BaseModel):
    messages: List[Dict[str, str]]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 1024

class ChatResponse(BaseModel):
    response: str
    request_id: str
    token_usage: Dict[str, int]
    took: float

@router.post("/chat", response_model=ChatResponse)
async def chat(request: ChatRequest):
    """Chat API (blocking model call off-loaded to a thread pool)"""
    request_id = str(uuid.uuid4())
    start_time = time.time()
    try:
        # Run the synchronous model call in the default thread-pool executor
        loop = asyncio.get_running_loop()
        response, token_usage = await loop.run_in_executor(
            None,
            lambda: model_service.generate(
                messages=request.messages,
                temperature=request.temperature,
                max_tokens=request.max_tokens
            )
        )
        took = time.time() - start_time
        return {
            "response": response,
            "request_id": request_id,
            "token_usage": token_usage,
            "took": took
        }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
4. Advanced Features
4.1 Streaming Responses
For long-form generation, streaming responses noticeably improve the user experience:
# app/api/v1/endpoints/stream_chat.py
import json
import uuid
import asyncio
from fastapi import APIRouter
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from typing import List, Dict, Optional, AsyncGenerator
from app.services.glm_service import GLMService

router = APIRouter()
model_service = GLMService()

class StreamChatRequest(BaseModel):
    messages: List[Dict[str, str]]
    temperature: Optional[float] = 0.7
    max_tokens: Optional[int] = 1024

@router.post("/stream_chat")
async def stream_chat(request: StreamChatRequest):
    """Streaming chat API (Server-Sent Events)"""
    request_id = str(uuid.uuid4())

    async def event_generator() -> AsyncGenerator[str, None]:
        try:
            # Simplified: generate the full reply first, then replay it in chunks.
            # A real implementation should use the model's token-level streaming
            # (see the TextIteratorStreamer sketch below).
            full_response = ""
            response, _ = model_service.generate(
                messages=request.messages,
                temperature=request.temperature,
                max_tokens=request.max_tokens
            )
            # Simulate streaming output
            for i in range(0, len(response), 5):
                chunk = response[i:i + 5]
                full_response += chunk
                payload = {"chunk": chunk, "request_id": request_id, "done": False}
                yield f"data: {json.dumps(payload, ensure_ascii=False)}\n\n"
                await asyncio.sleep(0.05)
            # Send the end-of-stream signal
            payload = {
                "chunk": "",
                "request_id": request_id,
                "done": True,
                "full_response": full_response
            }
            yield f"data: {json.dumps(payload, ensure_ascii=False)}\n\n"
        except Exception as e:
            yield f"data: {json.dumps({'error': str(e), 'request_id': request_id})}\n\n"

    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream"
    )
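For real token-level streaming, transformers ships a TextIteratorStreamer that yields decoded text as generation proceeds. A minimal sketch of how this could be added to GLMService (the generate_stream method name is our own, not part of the class shown earlier):
# Sketch: token-level streaming as an extra GLMService method
from threading import Thread
from transformers import TextIteratorStreamer

def generate_stream(self, messages, temperature: float = 0.7, max_tokens: int = 1024):
    """Yield decoded text chunks as soon as the model produces them."""
    inputs = self._tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_tensors="pt", return_dict=True
    ).to(self._device)
    streamer = TextIteratorStreamer(
        self._tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    gen_kwargs = dict(**inputs, streamer=streamer,
                      max_new_tokens=max_tokens, do_sample=temperature > 0)
    if temperature > 0:
        gen_kwargs["temperature"] = temperature
    # Run generation in a background thread; iterating the streamer drains its queue
    Thread(target=self._model.generate, kwargs=gen_kwargs, daemon=True).start()
    for text in streamer:
        yield text
The event generator can then iterate model_service.generate_stream(...) and wrap each chunk in an SSE frame; because the iterator blocks, run each step in a thread pool (as in section 3.3) to keep the event loop responsive.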
Example front-end JavaScript call (EventSource only supports GET, so we POST with fetch and read the SSE stream from the response body):
const responseElement = document.getElementById('response');

async function streamChat() {
  const resp = await fetch('/api/v1/stream_chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      messages: [{ role: 'user', content: 'Explain how streaming output works in large language models' }],
      temperature: 0.7
    })
  });

  const reader = resp.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const frames = buffer.split('\n\n');
    buffer = frames.pop();  // keep any incomplete frame for the next read
    for (const frame of frames) {
      if (!frame.startsWith('data: ')) continue;
      const data = JSON.parse(frame.slice(6));
      if (data.error) {
        console.error('Error:', data.error);
        return;
      } else if (data.done) {
        return;
      } else {
        responseElement.innerHTML += data.chunk;
      }
    }
  }
}

streamChat();
4.2 Handling Long Documents
Building on GLM-4-9B-Chat-1M's ultra-long context, we expose document question answering and summarization:
# app/api/v1/endpoints/document.py
import uuid
from fastapi import APIRouter, HTTPException, UploadFile, File
from pydantic import BaseModel
from typing import Optional
from app.services.glm_service import GLMService

router = APIRouter()
model_service = GLMService()

class DocumentQARequest(BaseModel):
    document: str
    question: str
    max_tokens: Optional[int] = 512

class DocumentSummaryRequest(BaseModel):
    document: str
@router.post("/document/qa")
async def document_qa(request: DocumentQARequest):
"""文档问答API"""
# 构建提示词
prompt = f"""基于以下文档内容回答问题。如果文档中没有相关信息,请回答"无法从文档中找到答案"。
文档内容:
{request.document[:10000]} # 限制文档长度,实际应用可优化
问题: {request.question}
回答:"""
messages = [{"role": "user", "content": prompt}]
try:
response, token_usage = model_service.generate(
messages=messages,
temperature=0.3, # 降低随机性,提高答案准确性
max_tokens=request.max_tokens
)
return {
"question": request.question,
"answer": response,
"token_usage": token_usage
}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@router.post("/document/summary")
async def document_summary(request: DocumentQARequest):
"""文档摘要API"""
# 构建提示词
prompt = f"""请为以下文档生成摘要,要求:
1. 保留核心观点和关键数据
2. 结构清晰,分点说明
3. 长度不超过300字
文档内容:
{request.document[:20000]}
摘要:"""
messages = [{"role": "user", "content": prompt}]
try:
response, token_usage = model_service.generate(
messages=messages,
temperature=0.5,
max_tokens=600 # 摘要通常需要更多token
)
return {
"summary": response,
"token_usage": token_usage
}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@router.post("/document/upload")
async def upload_document(file: UploadFile = File(...)):
"""上传文档并处理"""
if file.content_type not in ["text/plain", "application/pdf", "text/markdown"]:
raise HTTPException(status_code=400, detail="不支持的文件类型")
# 读取文件内容(实际应用需处理不同格式)
content = await file.read()
try:
content = content.decode("utf-8")
except:
raise HTTPException(status_code=400, detail="文件解码失败")
# 生成文档ID
doc_id = str(uuid.uuid4())
# 这里可以添加文档存储逻辑
return {
"doc_id": doc_id,
"filename": file.filename,
"content_length": len(content),
"message": "文件上传成功"
}
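The endpoints above simply truncate very long inputs. When a document exceeds whatever context budget you allow, a common workaround is a map-reduce pass: summarize fixed-size chunks, then summarize the partial summaries. A rough sketch (the chunk_and_summarize helper and the 20,000-character chunk size are our own choices, not part of the original service):
def chunk_and_summarize(document: str, chunk_size: int = 20000) -> str:
    """Map-reduce summarization for documents longer than a single prompt budget."""
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    partial_summaries = []
    for chunk in chunks:
        prompt = f"Summarize the following text in a few sentences:\n\n{chunk}\n\nSummary:"
        summary, _ = model_service.generate(
            messages=[{"role": "user", "content": prompt}],
            temperature=0.5,
            max_tokens=300
        )
        partial_summaries.append(summary)
    # Reduce step: merge the partial summaries into one final summary
    merged = "\n".join(partial_summaries)
    final_prompt = f"Combine the following partial summaries into one coherent summary:\n\n{merged}\n\nFinal summary:"
    final_summary, _ = model_service.generate(
        messages=[{"role": "user", "content": final_prompt}],
        temperature=0.5,
        max_tokens=600
    )
    return final_summary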
4.3 Integrating Tool Calling
Using GLM-4's tool-calling ability, we integrate external tools such as a weather lookup and a calculator:
# app/services/tool_service.py
from typing import Dict, Any

class ToolService:
    """Registry and dispatcher for callable tools"""

    def __init__(self):
        self.tools = {
            "weather": self.get_weather,
            "calculator": self.calculate
        }

    def get_weather(self, location: str) -> Dict[str, Any]:
        """Get weather information (mocked)"""
        # Replace with a real weather API in production
        return {
            "location": location,
            "temperature": "25°C",
            "description": "sunny",
            "humidity": "45%",
            "wind": "light breeze"
        }

    def calculate(self, expression: str) -> Dict[str, Any]:
        """Evaluate a math expression"""
        try:
            # Note: eval() is unsafe on untrusted input; use a restricted evaluator
            # in production (see the safe_eval sketch below)
            result = eval(expression)
            return {"expression": expression, "result": result}
        except Exception as e:
            return {"error": str(e)}

    def call_tool(self, tool_name: str, parameters: Dict[str, Any]) -> Dict[str, Any]:
        """Dispatch a tool call by name"""
        if tool_name not in self.tools:
            return {"error": f"Tool {tool_name} not found"}
        try:
            return self.tools[tool_name](**parameters)
        except Exception as e:
            return {"error": str(e)}
# app/api/v1/endpoints/tool_chat.py
import json
import uuid
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel
from typing import List, Dict, Optional
from app.services.glm_service import GLMService
from app.services.tool_service import ToolService

router = APIRouter()
model_service = GLMService()
tool_service = ToolService()

class ToolChatRequest(BaseModel):
    messages: List[Dict[str, str]]
    temperature: Optional[float] = 0.7

@router.post("/tool_chat")
async def tool_chat(request: ToolChatRequest):
    """Tool-calling chat API (a simple prompt-based protocol; GLM-4 also has native function calling)"""
    request_id = str(uuid.uuid4())
    # Describe the available tools and the expected reply format to the model
    tools_desc = """You can use the following tools to answer questions:
1. Weather lookup
   - name: weather
   - parameters: location (string, city name)
   - description: get the weather for a given city
2. Calculator
   - name: calculator
   - parameters: expression (string, math expression)
   - description: evaluate a math expression
If a tool is needed, reply with ONLY a JSON object like {"tool": "<name>", "parameters": {...}}; otherwise answer directly."""
    messages = [{"role": "system", "content": tools_desc}] + request.messages
    try:
        response, token_usage = model_service.generate(messages=messages, temperature=request.temperature)
        tool_result = None
        try:
            call = json.loads(response)  # did the model ask for a tool?
            if isinstance(call, dict) and "tool" in call:
                tool_result = tool_service.call_tool(call["tool"], call.get("parameters", {}))
                messages += [{"role": "assistant", "content": response},
                             {"role": "user", "content": f"Tool result: {json.dumps(tool_result, ensure_ascii=False)}. Use it to answer the original question."}]
                response, token_usage = model_service.generate(messages=messages, temperature=request.temperature)
        except json.JSONDecodeError:
            pass  # plain-text answer, no tool call
        return {"response": response, "request_id": request_id, "tool_result": tool_result, "token_usage": token_usage}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))



