# Bisheng Metrics Collection: Custom Monitoring Metrics and Alerting Rules

[Free download] bisheng: BISHENG is an open-source LLM application development platform focused on enterprise scenarios. Project repository: https://gitcode.com/dataelem/bisheng
## Introduction

In enterprise LLM application development, system monitoring and metrics collection are key to keeping services stable and observable. As an open-source LLM application development platform aimed at enterprise scenarios, Bisheng ships with built-in metrics-collection capabilities and supports custom monitoring metrics as well as flexible alerting rules. This article walks through Bisheng's monitoring architecture and metrics-collection mechanism, and shows how to tailor metrics and alerting policies to your business needs.
## Bisheng Monitoring Architecture

Bisheng adopts a modern monitoring stack built around Prometheus and Grafana: application services expose metrics over an HTTP `/metrics` endpoint, Prometheus scrapes and stores them, Grafana visualizes them, and Alertmanager routes alert notifications.
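As a point of reference, the snippet below is a minimal sketch of how such a `/metrics` endpoint can be exposed with `prometheus_client`, assuming a FastAPI backend; it illustrates the general wiring rather than Bisheng's exact implementation.

```python
# Minimal sketch: expose a Prometheus /metrics endpoint from a FastAPI app.
# Illustrative wiring only, not Bisheng's actual implementation.
from fastapi import FastAPI
from prometheus_client import make_asgi_app

app = FastAPI()

# prometheus_client ships an ASGI app that renders the default registry
# (process_*, python_* and any custom metrics) in Prometheus text format.
app.mount("/metrics", make_asgi_app())
```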
## Built-in Monitoring Metrics

### 1. System-Level Metrics

The system-level metrics exposed by Bisheng by default include:
| Metric type | Metric name | Description | Labels |
|---|---|---|---|
| Counter | http_requests_total | Total number of HTTP requests | method, endpoint, status |
| Gauge | http_requests_in_flight | Requests currently being processed | method, endpoint |
| Histogram | http_request_duration_seconds | Distribution of HTTP request latency | method, endpoint |
| Counter | process_cpu_seconds_total | Cumulative process CPU time | - |
| Gauge | process_resident_memory_bytes | Process resident memory usage | - |
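Once scraped, these built-in metrics can be queried through Prometheus's HTTP API. A small sketch, assuming Prometheus is reachable at `localhost:9090`:

```python
# Query the built-in HTTP metrics via the Prometheus HTTP API.
# The Prometheus URL is an assumption; adjust it to your deployment.
import requests

PROMETHEUS_URL = "http://localhost:9090"

QUERIES = {
    "request_rate": 'sum(rate(http_requests_total[5m]))',
    "p95_latency_seconds": (
        'histogram_quantile(0.95, '
        'sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
    ),
}

for name, promql in QUERIES.items():
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    print(name, resp.json()["data"]["result"])
```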
### 2. Business-Level Metrics

Bisheng also provides a rich set of business-level metrics:
```python
# Example: business metric definitions in Bisheng
from prometheus_client import Counter, Histogram

# LLM call metrics
llm_requests_total = Counter(
    'bisheng_llm_requests_total',
    'Total LLM API requests',
    ['model', 'provider', 'status']
)

llm_request_duration = Histogram(
    'bisheng_llm_request_duration_seconds',
    'LLM request duration in seconds',
    ['model', 'provider']
)

# Workflow execution metrics
workflow_executions_total = Counter(
    'bisheng_workflow_executions_total',
    'Total workflow executions',
    ['workflow_id', 'status']
)

workflow_execution_duration = Histogram(
    'bisheng_workflow_execution_duration_seconds',
    'Workflow execution duration in seconds',
    ['workflow_id']
)

# Knowledge base operation metrics
knowledge_base_operations = Counter(
    'bisheng_knowledge_base_operations_total',
    'Knowledge base operations count',
    ['operation', 'status']
)
```
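At call sites these collectors are updated with `labels(...).inc()` and `observe(...)`; the label values below are purely illustrative:

```python
# Illustrative call-site updates for the collectors defined above.
import time

start = time.time()
# ... perform the LLM call here ...
llm_requests_total.labels(model="gpt-4", provider="openai", status="success").inc()
llm_request_duration.labels(model="gpt-4", provider="openai").observe(time.time() - start)

workflow_executions_total.labels(workflow_id="wf_demo", status="success").inc()
knowledge_base_operations.labels(operation="upload", status="success").inc()
```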
## Implementing Custom Monitoring Metrics

### 1. Creating a Custom Metrics Class

In Bisheng, business-specific metrics collection can be implemented through a custom monitoring utility class:
```python
# bisheng/utils/monitoring.py
import re

from prometheus_client import Counter, Histogram


class BishengMetrics:
    """Custom monitoring metrics for Bisheng."""

    def __init__(self):
        self.metrics_registry = {}

    @staticmethod
    def _sanitize(name: str) -> str:
        # Prometheus metric names only allow [a-zA-Z0-9_:], so normalize
        # model/provider names such as "gpt-4" before embedding them.
        return re.sub(r'[^a-zA-Z0-9_:]', '_', name)

    def register_llm_metric(self, model_name: str, provider: str):
        """Register metrics for one LLM model/provider pair."""
        metric_prefix = f"bisheng_llm_{self._sanitize(model_name)}_{self._sanitize(provider)}"
        metrics = {
            'requests_total': Counter(
                f'{metric_prefix}_requests_total',
                f'Total requests for {model_name} from {provider}',
                ['status']
            ),
            'tokens_total': Counter(
                f'{metric_prefix}_tokens_total',
                f'Total tokens processed for {model_name}',
                ['type']  # input/output
            ),
            'latency_seconds': Histogram(
                f'{metric_prefix}_latency_seconds',
                f'Request latency for {model_name}',
                buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
            )
        }
        self.metrics_registry[f'llm_{model_name}_{provider}'] = metrics
        return metrics

    def register_workflow_metric(self, workflow_id: str):
        """Register execution metrics for one workflow."""
        metric_prefix = f"bisheng_workflow_{self._sanitize(workflow_id)}"
        metrics = {
            'executions_total': Counter(
                f'{metric_prefix}_executions_total',
                f'Total executions for workflow {workflow_id}',
                ['status']
            ),
            'duration_seconds': Histogram(
                f'{metric_prefix}_duration_seconds',
                f'Execution duration for workflow {workflow_id}',
                buckets=[1.0, 5.0, 10.0, 30.0, 60.0, 300.0]
            ),
            'node_executions': Counter(
                f'{metric_prefix}_node_executions_total',
                f'Node executions in workflow {workflow_id}',
                ['node_type', 'status']
            )
        }
        self.metrics_registry[f'workflow_{workflow_id}'] = metrics
        return metrics


# Global metrics instance
bisheng_metrics = BishengMetrics()
```
### 2. Integrating Metrics into Business Logic

Hook the metrics into the actual request-handling code:
```python
# bisheng/api/services/llm_service.py
import time
from contextlib import contextmanager

from bisheng.utils.monitoring import bisheng_metrics


class LLMService:
    def __init__(self):
        self.metrics_cache = {}

    def get_metrics_for_model(self, model: str, provider: str):
        """Fetch or lazily create the metrics for a model/provider pair."""
        key = f"{model}_{provider}"
        if key not in self.metrics_cache:
            self.metrics_cache[key] = bisheng_metrics.register_llm_metric(model, provider)
        return self.metrics_cache[key]

    @contextmanager
    def track_llm_request(self, model: str, provider: str):
        """Context manager that records the outcome and latency of one LLM request."""
        metrics = self.get_metrics_for_model(model, provider)
        start_time = time.time()
        status = "success"
        try:
            yield
        except Exception:
            status = "error"
            raise
        finally:
            duration = time.time() - start_time
            metrics['requests_total'].labels(status=status).inc()
            metrics['latency_seconds'].observe(duration)

    def track_tokens(self, model: str, provider: str, token_type: str, count: int):
        """Record token usage (token_type is 'input' or 'output')."""
        metrics = self.get_metrics_for_model(model, provider)
        metrics['tokens_total'].labels(type=token_type).inc(count)


# Usage example
llm_service = LLMService()

def call_llm_api(model: str, provider: str, prompt: str):
    with llm_service.track_llm_request(model, provider):
        # Actual LLM call (make_llm_call is a placeholder for the real client call)
        response = make_llm_call(model, provider, prompt)
        # Track token usage (word counts stand in for real token counts here)
        llm_service.track_tokens(model, provider, 'input', len(prompt.split()))
        llm_service.track_tokens(model, provider, 'output', len(response.split()))
        return response
```
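The workflow side can be instrumented in the same way. The sketch below is a hypothetical `WorkflowService` built on `register_workflow_metric` from the custom metrics class above; it is illustrative, not Bisheng's actual service code.

```python
# Hypothetical workflow-side instrumentation using BishengMetrics above.
import time
from contextlib import contextmanager

from bisheng.utils.monitoring import bisheng_metrics


class WorkflowService:
    def __init__(self):
        self.metrics_cache = {}

    def _metrics_for(self, workflow_id: str):
        if workflow_id not in self.metrics_cache:
            self.metrics_cache[workflow_id] = bisheng_metrics.register_workflow_metric(workflow_id)
        return self.metrics_cache[workflow_id]

    @contextmanager
    def track_execution(self, workflow_id: str):
        """Record one workflow execution: outcome counter plus duration histogram."""
        metrics = self._metrics_for(workflow_id)
        start = time.time()
        status = "success"
        try:
            yield
        except Exception:
            status = "failed"
            raise
        finally:
            metrics["executions_total"].labels(status=status).inc()
            metrics["duration_seconds"].observe(time.time() - start)
```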
## Prometheus Configuration and Data Collection

### 1. Prometheus Scrape Configuration

Create a Prometheus configuration file to scrape Bisheng's metrics. The rule and Alertmanager settings referenced later in this article are wired in here as well, and MySQL/Redis are scraped through exporters since they do not speak the Prometheus exposition format natively:
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alert rule files (see the alerting section below); paths are relative to this file.
rule_files:
  - 'rules/*.yml'

# Send firing alerts to Alertmanager (deployed in the compose file below).
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'bisheng-backend'
    static_configs:
      - targets: ['bisheng-backend:7860']
    metrics_path: '/metrics'
    scrape_interval: 15s
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'bisheng-backend'

  - job_name: 'bisheng-worker'
    static_configs:
      - targets: ['bisheng-backend-worker:7860']
    metrics_path: '/metrics'
    scrape_interval: 15s

  # MySQL and Redis do not expose Prometheus metrics natively; scrape them through
  # mysqld_exporter / redis_exporter sidecars (service names here are assumptions).
  # Milvus exposes its own metrics endpoint on port 9091.
  - job_name: 'database-services'
    static_configs:
      - targets: ['mysqld-exporter:9104', 'redis-exporter:9121', 'milvus:9091']
    scrape_interval: 30s
```
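Before relying on the scrape jobs, it is worth confirming that the backend actually serves the exposition format. A small sanity check, where host and port are assumptions matching the scrape config above:

```python
# Quick sanity check that the backend exposes Bisheng metrics.
# Host and port are assumptions matching the scrape config above.
import requests

resp = requests.get("http://bisheng-backend:7860/metrics", timeout=5)
resp.raise_for_status()

bisheng_lines = [line for line in resp.text.splitlines() if line.startswith("bisheng_")]
print(f"Found {len(bisheng_lines)} bisheng_* samples")
```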
### 2. Docker Compose Integration

Add Prometheus, Grafana, and Alertmanager to the Docker deployment:
```yaml
# docker-compose-monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules   # alert rule files (see below)
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin   # change for production deployments
    restart: unless-stopped
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
```
## Alerting Rule Configuration

### 1. Prometheus Alerting Rules

Create alerting rules tailored to Bisheng, placed under the `rules/` directory referenced by `rule_files` above:
```yaml
# rules/bisheng-alerts.yml
groups:
  - name: bisheng-alerts
    rules:
      # LLM service alerts
      - alert: LLMHighErrorRate
        expr: sum by (model, provider) (rate(bisheng_llm_requests_total{status="error"}[5m])) / sum by (model, provider) (rate(bisheng_llm_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High LLM error rate"
          description: "LLM error rate has exceeded 10%; current value: {{ $value }}"

      - alert: LLMHighLatency
        expr: histogram_quantile(0.95, sum by (le, model, provider) (rate(bisheng_llm_request_duration_seconds_bucket[5m]))) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High LLM latency"
          description: "LLM p95 latency has exceeded 5 seconds; current value: {{ $value }}s"

      # Workflow alerts
      - alert: WorkflowHighFailureRate
        expr: sum by (workflow_id) (rate(bisheng_workflow_executions_total{status="failed"}[10m])) / sum by (workflow_id) (rate(bisheng_workflow_executions_total[10m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High workflow failure rate"
          description: "Workflow failure rate has exceeded 5%; current value: {{ $value }}"

      - alert: WorkflowSlowExecution
        expr: histogram_quantile(0.90, sum by (le, workflow_id) (rate(bisheng_workflow_execution_duration_seconds_bucket[10m]))) > 120
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow workflow execution"
          description: "Workflow p90 execution time has exceeded 120 seconds; current value: {{ $value }}s"

      # System resource alerts
      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes / (1024 * 1024) > 4096
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Process memory usage has exceeded 4GB; current value: {{ $value }}MB"

      - alert: HighCPUUsage
        expr: rate(process_cpu_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Process CPU usage has exceeded 80%; current value: {{ $value }}"
```
### 2. Alertmanager Configuration

Configure the notification channels:
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'sms-notifications'
      repeat_interval: 30m
    - match:
        severity: warning
      receiver: 'email-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#bisheng-alerts'
        send_resolved: true
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'email-notifications'
    email_configs:
      - to: 'devops@company.com'
        from: 'alertmanager@bisheng.com'
        smarthost: 'smtp.company.com:587'
        auth_username: 'alertmanager'
        auth_password: 'password'
        send_resolved: true

  - name: 'sms-notifications'
    webhook_configs:
      - url: 'http://sms-gateway/send'
        send_resolved: true
```
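To check which alerts are currently firing without opening the UI, the Alertmanager v2 API can be queried directly; the URL below is an assumption matching the compose file above:

```python
# List currently firing alerts from the Alertmanager v2 API.
# The Alertmanager URL is an assumption matching the compose file above.
import requests

resp = requests.get("http://localhost:9093/api/v2/alerts", params={"active": "true"}, timeout=5)
resp.raise_for_status()

for alert in resp.json():
    labels = alert["labels"]
    print(labels.get("alertname"), labels.get("severity"), alert["status"]["state"])
```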
## Grafana Monitoring Dashboards

### 1. Core Dashboard Configuration

Create a comprehensive Bisheng dashboard (excerpt of the dashboard JSON):
```json
{
  "dashboard": {
    "title": "Bisheng Monitoring Dashboard",
    "panels": [
      {
        "title": "LLM Request Overview",
        "type": "stat",
        "targets": [
          {
            "expr": "rate(bisheng_llm_requests_total[5m])",
            "legendFormat": "{{model}} - {{provider}}"
          }
        ]
      },
      {
        "title": "Workflow Execution Status",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum by(status)(bisheng_workflow_executions_total)",
            "legendFormat": "{{status}}"
          }
        ]
      }
    ]
  }
}
```
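Dashboards can be provisioned from files (as mounted in the compose file above) or pushed through Grafana's HTTP API. A sketch of the API route, assuming the JSON above is saved to a local file and an API token with editor rights is available:

```python
# Push the dashboard JSON to Grafana via its HTTP API.
# GRAFANA_URL, the token, and the file name are assumptions for this sketch.
import json

import requests

GRAFANA_URL = "http://localhost:3000"
API_TOKEN = "<grafana-api-token>"

# The file is assumed to contain the {"dashboard": {...}} excerpt shown above.
with open("bisheng-dashboard.json") as f:
    dashboard = json.load(f)["dashboard"]

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"dashboard": dashboard, "overwrite": True},
)
resp.raise_for_status()
print(resp.json())
```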
### 2. Key Performance Indicator Dashboard

A second dashboard can focus on the key indicators used by the alerting rules above: LLM error rate and p95 latency, workflow failure rate and execution duration, and process CPU/memory usage.
## Best Practices and Optimization Tips

### 1. Metric Naming Conventions

Follow a consistent naming convention:

- Use the `bisheng_` prefix to distinguish Bisheng-specific metrics
- Use the `_total` suffix for counter metrics
- Use the `_seconds` suffix for duration metrics
- Name labels in lowercase snake_case
### 2. Performance Optimization Tips

```python
# Avoid constructing metric objects on hot request paths: prometheus_client
# refuses to register the same metric name twice, and object creation is
# comparatively expensive. Create each collector once and cache it.
import threading

from prometheus_client import Counter


class OptimizedMetrics:
    """Thread-safe, lazily created metric cache keyed by metric name."""

    def __init__(self):
        self._metrics = {}
        self._lock = threading.Lock()

    def get_metric(self, name: str, documentation: str, labelnames=()):
        """Return the cached Counter, creating it on first use (double-checked locking)."""
        if name not in self._metrics:
            with self._lock:
                if name not in self._metrics:
                    self._metrics[name] = Counter(name, documentation, labelnames)
        return self._metrics[name]
```
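Usage then reduces to a cheap dictionary lookup on the hot path (the metric name below is a hypothetical example):

```python
# Hot-path usage: the Counter is created once, then reused.
metrics = OptimizedMetrics()

events = metrics.get_metric(
    "bisheng_custom_events_total", "Custom business events", ["kind"]
)
events.labels(kind="document_upload").inc()
```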
### 3. Monitoring Strategy Tips

- Layered monitoring: cover the full stack, from infrastructure up to business logic
- Golden signals: focus on latency, traffic, errors, and saturation
- SLO management: define service level objectives based on business goals
- Capacity planning: forecast resource needs from historical data
## Summary

Bisheng provides strong metrics-collection capabilities, and combined with Prometheus and Grafana it supports a complete monitoring and alerting setup. This article covered:

- Bisheng's monitoring architecture and built-in metrics
- How to implement custom monitoring metrics
- Prometheus configuration and data collection
- Configuring and managing alerting rules
- Building and tuning Grafana dashboards
- Best practices and performance optimization tips

With a well-designed monitoring setup, the Bisheng platform can run stably in enterprise environments, surface potential problems early, and provide a reliable technical foundation for the business.
## Roadmap

Planned directions for Bisheng's monitoring capabilities include:

- AI-driven anomaly detection: use machine learning to recognize abnormal patterns automatically
- Root cause analysis: automatically correlate related metrics to locate the source of a problem quickly
- Cost optimization: intelligent resource-scheduling suggestions based on usage patterns
- User experience monitoring: measure application performance from the end user's point of view

Through continuous improvements to its monitoring stack, Bisheng aims to offer enterprise users an even more stable and efficient LLM application development experience.
Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.



