Bisheng Metrics Collection: Custom Monitoring Metrics and Alerting Rules

Project: bisheng (BISHENG 毕昇), an open-source LLM application development platform focused on enterprise scenarios. Repository: https://gitcode.com/dataelem/bisheng

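Introduction

In enterprise LLM application development, system monitoring and metrics collection are key to keeping services stable and observable. Bisheng, an open-source LLM application development platform aimed at enterprise scenarios, includes built-in metrics collection and supports custom metrics and flexible alerting rules. This article walks through Bisheng's monitoring architecture, its metrics collection mechanism, and how to tailor metrics and alerting policies to business needs.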

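Bisheng Monitoring Architecture

Bisheng's monitoring stack is built on Prometheus and Grafana: the backend services expose metrics, Prometheus scrapes them and evaluates alerting rules, Alertmanager delivers notifications, and Grafana visualizes the results.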

Built-in Metrics in Detail

1. System-level base metrics

The system-level metrics Bisheng exposes by default include:

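| Type | Metric name | Description | Labels |
|---|---|---|---|
| Counter | http_requests_total | Total number of HTTP requests | method, endpoint, status |
| Gauge | http_requests_in_flight | Requests currently being processed | method, endpoint |
| Histogram | http_request_duration_seconds | Distribution of HTTP request latency | method, endpoint |
| Counter | process_cpu_seconds_total | Total CPU time consumed by the process, in seconds | - |
| Gauge | process_resident_memory_bytes | Resident memory of the process, in bytes | - |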
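
The metric type matters for querying: a Counter only ever grows (and drops to zero when the process restarts), so it is read through a rate rather than as a raw value. The sketch below shows the idea in plain Python. It is a simplification for illustration, not Prometheus's actual implementation: PromQL's rate() additionally extrapolates to the edges of the time window, which is omitted here, and the function names are made up for this example.

```python
def counter_increase(samples: list) -> float:
    """Total increase across consecutive samples of a counter.

    A drop between two samples is treated as a counter reset (process
    restart), so the new value counts as growth from zero.
    """
    increase = 0.0
    for prev, cur in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    return increase


def simple_rate(samples: list, window_seconds: float) -> float:
    """Average per-second increase over the window (no extrapolation)."""
    return counter_increase(samples) / window_seconds


# Five scrapes of http_requests_total; the process restarted before the 4th.
scrapes = [100, 130, 160, 5, 25]
requests_per_second = simple_rate(scrapes, window_seconds=60)
```

A Gauge such as process_resident_memory_bytes, by contrast, is read directly, because its current value is already the quantity of interest.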

2. Key business-level metrics

Bisheng also provides a set of business-level metrics:

# Example: business metric definitions in Bisheng
from prometheus_client import Counter, Gauge, Histogram

# LLM call metrics
llm_requests_total = Counter(
    'bisheng_llm_requests_total',
    'Total LLM API requests',
    ['model', 'provider', 'status']
)

llm_request_duration = Histogram(
    'bisheng_llm_request_duration_seconds',
    'LLM request duration in seconds',
    ['model', 'provider']
)

# Workflow execution metrics
workflow_executions_total = Counter(
    'bisheng_workflow_executions_total',
    'Total workflow executions',
    ['workflow_id', 'status']
)

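workflow_execution_duration = Histogram(
    'bisheng_workflow_execution_duration_seconds',
    'Workflow execution duration in seconds',
    ['workflow_id'],
    # The client's default buckets stop at 10s. Workflows can run for
    # minutes, so wider buckets are needed to measure quantiles above 10s.
    buckets=[1.0, 5.0, 10.0, 30.0, 60.0, 120.0, 300.0]
)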

# Knowledge base operation metrics
knowledge_base_operations = Counter(
    'bisheng_knowledge_base_operations_total',
    'Knowledge base operations count',
    ['operation', 'status']
)

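Implementing Custom Metrics

1. Creating a custom metrics class

In Bisheng, business-specific metrics can be collected through a custom monitoring helper class. Note that this class puts the model name or workflow ID into the metric name itself, so its series are separate from the label-based metrics shown earlier, such as bisheng_llm_requests_total.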

# bisheng/utils/monitoring.py
import re
from prometheus_client import Counter, Histogram


def _sanitize(part: str) -> str:
    """Replace characters that are not valid in a Prometheus metric name."""
    return re.sub(r'[^a-zA-Z0-9_]', '_', part)


class BishengMetrics:
    """Custom monitoring metrics for Bisheng."""

    def __init__(self):
        self.metrics_registry = {}

    def register_llm_metric(self, model_name: str, provider: str):
        """Register the metrics for one LLM model/provider pair (idempotent)."""
        key = f'llm_{model_name}_{provider}'
        if key in self.metrics_registry:
            # Registering the same metric name twice raises ValueError
            return self.metrics_registry[key]

        # Names such as "gpt-4" contain characters that metric names do not allow
        metric_prefix = f"bisheng_llm_{_sanitize(model_name)}_{_sanitize(provider)}"

        metrics = {
            'requests_total': Counter(
                f'{metric_prefix}_requests_total',
                f'Total requests for {model_name} from {provider}',
                ['status']
            ),
            'tokens_total': Counter(
                f'{metric_prefix}_tokens_total',
                f'Total tokens processed for {model_name}',
                ['type']  # input/output
            ),
            'latency_seconds': Histogram(
                f'{metric_prefix}_latency_seconds',
                f'Request latency for {model_name}',
                buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
            )
        }

        self.metrics_registry[key] = metrics
        return metrics

    def register_workflow_metric(self, workflow_id: str):
        """Register the execution metrics for one workflow (idempotent)."""
        key = f'workflow_{workflow_id}'
        if key in self.metrics_registry:
            return self.metrics_registry[key]

        metric_prefix = f"bisheng_workflow_{_sanitize(workflow_id)}"

        metrics = {
            'executions_total': Counter(
                f'{metric_prefix}_executions_total',
                f'Total executions for workflow {workflow_id}',
                ['status']
            ),
            'duration_seconds': Histogram(
                f'{metric_prefix}_duration_seconds',
                f'Execution duration for workflow {workflow_id}',
                buckets=[1.0, 5.0, 10.0, 30.0, 60.0, 300.0]
            ),
            'node_executions': Counter(
                f'{metric_prefix}_node_executions_total',
                f'Node executions in workflow {workflow_id}',
                ['node_type', 'status']
            )
        }

        self.metrics_registry[key] = metrics
        return metrics

# Global metrics instance
bisheng_metrics = BishengMetrics()

2. Integrating with business logic

Wire the metrics into the relevant business logic:

# bisheng/api/services/llm_service.py
import time
from contextlib import contextmanager
from bisheng.utils.monitoring import bisheng_metrics

class LLMService:
    def __init__(self):
        self.metrics_cache = {}

    def get_metrics_for_model(self, model: str, provider: str):
        """Get or create the metrics for a model/provider pair."""
        key = f"{model}_{provider}"
        if key not in self.metrics_cache:
            self.metrics_cache[key] = bisheng_metrics.register_llm_metric(model, provider)
        return self.metrics_cache[key]

    @contextmanager
    def track_llm_request(self, model: str, provider: str):
        """Context manager that records the outcome and latency of an LLM request."""
        metrics = self.get_metrics_for_model(model, provider)
        # perf_counter is monotonic, so it is unaffected by system clock changes
        start_time = time.perf_counter()
        status = "success"

        try:
            yield
        except Exception:
            status = "error"
            raise
        finally:
            duration = time.perf_counter() - start_time
            metrics['requests_total'].labels(status=status).inc()
            metrics['latency_seconds'].observe(duration)

    def track_tokens(self, model: str, provider: str, token_type: str, count: int):
        """Record token usage ('input' or 'output')."""
        metrics = self.get_metrics_for_model(model, provider)
        metrics['tokens_total'].labels(type=token_type).inc(count)

# Usage example
llm_service = LLMService()

def call_llm_api(model: str, provider: str, prompt: str):
    with llm_service.track_llm_request(model, provider):
        # Actual LLM call; make_llm_call is a placeholder for the real client
        response = make_llm_call(model, provider, prompt)

        # Track token usage. Splitting on whitespace counts words, which only
        # approximates tokens; prefer the counts reported by the provider.
        llm_service.track_tokens(model, provider, 'input', len(prompt.split()))
        llm_service.track_tokens(model, provider, 'output', len(response.split()))

        return response

Prometheus Configuration and Data Collection

1. Prometheus scrape configuration

Create a Prometheus configuration file to scrape Bisheng metrics:

# prometheus.yml
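global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Load the alerting rules and point Prometheus at Alertmanager. Without these
# two sections, alerting rules are never evaluated or delivered.
# The rules path assumes the rules directory is mounted at /etc/prometheus/rules.
rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']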

scrape_configs:
  - job_name: 'bisheng-backend'
    static_configs:
      - targets: ['bisheng-backend:7860']
    metrics_path: '/metrics'
    scrape_interval: 15s
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'bisheng-backend'

  - job_name: 'bisheng-worker'
    static_configs:
      - targets: ['bisheng-backend-worker:7860']
    metrics_path: '/metrics'
    scrape_interval: 15s

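  # MySQL and Redis do not serve Prometheus metrics on their client ports
  # (3306/6379), so exporters are scraped instead. The exporter host names
  # below are assumptions and must match your deployment.
  - job_name: 'database-services'
    static_configs:
      - targets: ['mysqld-exporter:9104', 'redis-exporter:9121', 'milvus:9091']
    scrape_interval: 30s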
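
Before relying on a scrape job, it can help to confirm that the target really exposes the metric names that dashboards and alerts will query. The following is a minimal sketch, assuming the target returns the standard Prometheus text exposition format. The helper names and the sample payload are illustrative and not part of Bisheng.

```python
import re

# A sample line starts with the sample name, optionally followed by {labels}.
_SAMPLE_NAME_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)')


def exposed_sample_names(exposition_text: str) -> set:
    """Return the sample names found in text-format /metrics output.

    Histograms show up under their _bucket, _sum and _count sample names.
    """
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and "# HELP" / "# TYPE" comments
        match = _SAMPLE_NAME_RE.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition_text: str, expected: list) -> list:
    """List the expected sample names that the target does not expose."""
    found = exposed_sample_names(exposition_text)
    return [name for name in expected if name not in found]


# Hypothetical scrape result used for illustration only.
SAMPLE = """\
# HELP bisheng_llm_requests_total Total LLM API requests
# TYPE bisheng_llm_requests_total counter
bisheng_llm_requests_total{model="m1",provider="p1",status="success"} 42.0
process_resident_memory_bytes 1.5e+08
"""

absent = missing_metrics(SAMPLE, [
    'bisheng_llm_requests_total',
    'bisheng_workflow_executions_total',
])
```

In a real check, the text would come from an HTTP GET against the target's metrics_path (for example with urllib.request), using whatever host and port the scrape job points at.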

2. Docker Compose integration

Add Prometheus, Grafana, and Alertmanager to the Docker deployment:

# docker-compose-monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # Alerting rule files, e.g. rules/bisheng-alerts.yml
      - ./rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    restart: unless-stopped
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Alerting Rule Configuration

1. Prometheus alerting rules

Create alerting rules for Bisheng:

# rules/bisheng-alerts.yml
groups:
- name: bisheng-alerts
  rules:
  
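  # LLM service alerts
  # Both sides are aggregated with sum by (...): dividing the raw series would
  # only match status="error" against itself and always yield a ratio of 1.
  - alert: LLMHighErrorRate
    expr: sum by (model, provider) (rate(bisheng_llm_requests_total{status="error"}[5m])) / sum by (model, provider) (rate(bisheng_llm_requests_total[5m])) > 0.1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "LLM service error rate is too high"
      description: "LLM error rate is above 10%, current value: {{ $value }}"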
  
  - alert: LLMHighLatency
    expr: histogram_quantile(0.95, sum by (le, model, provider) (rate(bisheng_llm_request_duration_seconds_bucket[5m]))) > 5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "LLM service latency is too high"
      description: "LLM p95 latency is above 5 seconds, current value: {{ $value }}s"
  
  # Workflow alerts
  - alert: WorkflowHighFailureRate
    expr: sum(rate(bisheng_workflow_executions_total{status="failed"}[10m])) / sum(rate(bisheng_workflow_executions_total[10m])) > 0.05
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Workflow failure rate is too high"
      description: "Workflow failure rate is above 5%, current value: {{ $value }}"
  
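  # Requires histogram buckets above 120s; with buckets that stop at 10s the
  # estimated quantile can never exceed 10 and this alert cannot fire.
  - alert: WorkflowSlowExecution
    expr: histogram_quantile(0.90, sum by (le) (rate(bisheng_workflow_execution_duration_seconds_bucket[10m]))) > 120
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Workflow execution is slow"
      description: "Workflow p90 execution time is above 120 seconds, current value: {{ $value }}s"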
  
  # System resource alerts
  - alert: HighMemoryUsage
    expr: process_resident_memory_bytes / (1024 * 1024) > 4096
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Memory usage is too high"
      description: "Process memory usage is above 4 GB, current value: {{ $value }}MB"
  
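  # rate() of CPU seconds is measured in cores: 0.8 means 80% of one core
  - alert: HighCPUUsage
    expr: rate(process_cpu_seconds_total[5m]) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage is too high"
      description: "Process CPU usage is above 0.8 cores (80% of one core), current value: {{ $value }}"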
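
To make the latency thresholds more concrete, the sketch below reproduces in plain Python the linear interpolation that histogram_quantile applies to classic (bucketed) histograms. It is a simplified illustration rather than Prometheus's code, and the bucket counts are invented for the example.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    The bucket with the largest bound must be the +Inf bucket, and q must
    lie in (0, 1]. Values are assumed to be spread evenly inside a bucket.
    """
    if not 0 < q <= 1:
        raise ValueError('q must be in (0, 1]')
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket, so the
                # best available answer is that bucket's upper bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative counts for LLM request durations (seconds).
BUCKETS = [(1.0, 50), (2.0, 80), (5.0, 94), (10.0, 99), (math.inf, 100)]

p95 = histogram_quantile(0.95, BUCKETS)      # interpolated inside (5s, 10s]
capped = histogram_quantile(0.995, BUCKETS)  # falls in +Inf, capped at 10s
```

With these counts the p95 estimate lands between 5 and 10 seconds, so a "> 5" condition would hold. The capped case shows why bucket boundaries have to extend past the alert threshold: a quantile that falls into the +Inf bucket is reported as the highest finite bound, never more.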

2. Alertmanager configuration

Configure alert notification channels:

# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  
  routes:
  - match:
      severity: critical
    receiver: 'sms-notifications'
    repeat_interval: 30m
  
  - match:
      severity: warning
    receiver: 'email-notifications'

receivers:
- name: 'slack-notifications'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/XXX'
    channel: '#bisheng-alerts'
    send_resolved: true
    title: '{{ .CommonAnnotations.summary }}'
    text: '{{ .CommonAnnotations.description }}'

- name: 'email-notifications'
  email_configs:
  - to: 'devops@company.com'
    from: 'alertmanager@bisheng.com'
    smarthost: 'smtp.company.com:587'
    auth_username: 'alertmanager'
    auth_password: 'password'
    send_resolved: true

- name: 'sms-notifications'
  webhook_configs:
  - url: 'http://sms-gateway/send'
    send_resolved: true
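
The routing tree above decides which receiver gets each alert. As a rough mental model, the sketch below mimics the matching for this specific configuration: child routes are tried in order, the first match wins, and anything unmatched falls back to the root receiver. It deliberately ignores grouping, timing, and the continue option, so treat it as an illustration rather than a reimplementation of Alertmanager.

```python
# Child routes from the configuration above, in declaration order.
ROUTES = [
    ({'severity': 'critical'}, 'sms-notifications'),
    ({'severity': 'warning'}, 'email-notifications'),
]
# Receiver of the root route, used when no child route matches.
DEFAULT_RECEIVER = 'slack-notifications'


def pick_receiver(alert_labels: dict) -> str:
    """Return the receiver for an alert, using first-match-wins routing."""
    for match, receiver in ROUTES:
        if all(alert_labels.get(key) == value for key, value in match.items()):
            return receiver
    return DEFAULT_RECEIVER


critical = pick_receiver({'alertname': 'LLMHighErrorRate', 'severity': 'critical'})
warning = pick_receiver({'alertname': 'HighCPUUsage', 'severity': 'warning'})
other = pick_receiver({'alertname': 'Custom', 'severity': 'info'})
```

One consequence worth noticing: with this tree, Slack only receives alerts whose severity is neither critical nor warning. If critical alerts should also reach Slack, Alertmanager's continue: true setting on a route lets matching proceed to later sibling routes.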

Grafana Dashboards

1. Core monitoring dashboard configuration

Create a Bisheng monitoring dashboard:

{
  "dashboard": {
    "title": "Bisheng Monitoring Dashboard",
    "panels": [
      {
        "title": "LLM Request Overview",
        "type": "stat",
        "targets": [
          {
            "expr": "rate(bisheng_llm_requests_total[5m])",
            "legendFormat": "{{model}} - {{provider}}"
          }
        ]
      },
      {
        "title": "Workflow Execution Status",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum by(status)(bisheng_workflow_executions_total)",
            "legendFormat": "{{status}}"
          }
        ]
      }
    ]
  }
}

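2. Key performance indicator dashboard

[Figure: key performance indicator dashboard layout. The original Mermaid diagram is not available in this copy.]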

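Best Practices and Optimization Tips

1. Metric naming conventions

Follow a consistent naming convention:

  • Prefix Bisheng-specific metrics with bisheng_
  • Use the _total suffix for counters
  • Use the _seconds suffix for durations
  • Use lowercase snake_case for label names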
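
These conventions are easy to check mechanically, for example in a unit test or a pre-commit hook. The sketch below is one possible checker. The function name and the exact messages are assumptions for illustration, and the metric-name pattern is the classic Prometheus character set.

```python
import re

# Classic Prometheus metric-name character set.
METRIC_NAME_RE = re.compile(r'^[a-zA-Z_:][a-zA-Z0-9_:]*$')
# Convention from the list above: lowercase snake_case label names.
LABEL_NAME_RE = re.compile(r'^[a-z][a-z0-9_]*$')


def check_metric(name: str, metric_type: str, labels=()) -> list:
    """Return a list of convention violations; an empty list means it passes."""
    problems = []
    if not METRIC_NAME_RE.match(name):
        problems.append('invalid characters in metric name')
    if not name.startswith('bisheng_'):
        problems.append('missing bisheng_ prefix')
    if metric_type == 'counter' and not name.endswith('_total'):
        problems.append('counter should end with _total')
    for label in labels:
        if not LABEL_NAME_RE.match(label):
            problems.append(f'label {label!r} is not lower snake_case')
    return problems


ok = check_metric('bisheng_llm_requests_total', 'counter',
                  ['model', 'provider', 'status'])
bad = check_metric('bisheng_llm_gpt-4_requests', 'counter', ['Model'])
```

The second call illustrates why raw model names such as "gpt-4" should not be pasted into metric names: the hyphen is outside the allowed character set.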

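2. Performance optimization

# Avoid creating metric objects on hot paths
import threading
from prometheus_client import Counter


class OptimizedMetrics:
    def __init__(self):
        self._metrics = {}
        self._lock = threading.Lock()

    def _create_metric(self, name, label_names):
        """Build the metric once. A Counter is assumed here for illustration."""
        return Counter(name, f'Lazily created metric {name}', label_names)

    def get_metric(self, name, labels=None):
        """Thread-safe metric lookup with lazy creation (double-checked locking)."""
        # Cache per metric name: registering the same name twice raises ValueError
        if name not in self._metrics:
            with self._lock:
                if name not in self._metrics:
                    # Create the metric object lazily, on first use
                    self._metrics[name] = self._create_metric(name, sorted(labels or {}))

        metric = self._metrics[name]
        # Return the labelled child when label values are given
        return metric.labels(**labels) if labels else metric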

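3. Monitoring strategy

  1. Layered monitoring: cover the full stack, from infrastructure to business logic
  2. Golden signals: track latency, traffic, errors, and saturation
  3. SLO management: define service level objectives from business goals
  4. Capacity planning: forecast resource needs from historical data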
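
For the SLO item, the arithmetic behind an error budget is short enough to show directly. The sketch below assumes an availability-style SLO expressed as a fraction of successful requests. The function names and the example numbers are illustrative only.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the budget is being consumed.

    1.0 means the budget lasts exactly the whole window; higher values
    exhaust it proportionally sooner.
    """
    return error_ratio / (1.0 - slo)


# A 99.9% SLO over 30 days leaves roughly 43 minutes of budget.
budget = error_budget_minutes(0.999)

# An observed error ratio of 0.5% against that SLO burns the budget
# about five times faster than the SLO allows.
rate = burn_rate(0.005, 0.999)
```

The error_ratio input corresponds to the kind of ratio the error-rate alerts compute (failed requests divided by all requests), so a burn-rate threshold can replace a fixed percentage when alerts should be tied to the SLO.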

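Summary

With Prometheus and Grafana integrated, Bisheng's metrics collection can be extended into a complete monitoring and alerting setup. This article covered:

  • Bisheng's monitoring architecture and built-in metrics
  • How to implement custom metrics
  • Prometheus configuration and data collection
  • Configuring and managing alerting rules
  • Creating Grafana dashboards
  • Best practices and performance tips

A well-configured monitoring setup helps keep the Bisheng platform stable in enterprise environments and surfaces potential problems early, before they affect the business.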

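Roadmap

Future directions for Bisheng's monitoring include:

  1. AI-driven anomaly detection: using machine learning to identify anomalous patterns automatically
  2. Root cause analysis: correlating related metrics automatically to locate the source of a problem quickly
  3. Cost optimization: resource scheduling recommendations based on usage patterns
  4. User experience monitoring: measuring application performance from the end user's perspective

Continued work on the monitoring stack is intended to give enterprise users a more stable and efficient LLM application development experience.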

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.
