# Bisheng Metrics Collection: Custom Monitoring Metrics and Alerting Rules

[Free download] bisheng: BISHENG is an open-source LLM application development platform focused on enterprise scenarios. Project repository: https://gitcode.com/dataelem/bisheng
## Introduction

In enterprise LLM application development, system monitoring and metrics collection are key to keeping services stable and observable. As an open-source LLM application development platform aimed at enterprise scenarios, Bisheng ships with built-in metrics-collection capabilities and supports custom monitoring metrics as well as flexible alerting rules. This article walks through Bisheng's monitoring architecture and metrics-collection mechanism, and shows how to tailor metrics and alerting policies to your business needs.
## Bisheng Monitoring Architecture

Bisheng adopts a modern monitoring stack built around Prometheus and Grafana: application services expose metrics over an HTTP `/metrics` endpoint, Prometheus scrapes and stores them, Grafana visualizes them, and Alertmanager routes alert notifications.
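As a point of reference, the snippet below is a minimal sketch of how such a `/metrics` endpoint can be exposed with `prometheus_client`, assuming a FastAPI backend; it illustrates the general wiring rather than Bisheng's exact implementation.

```python
# Minimal sketch: expose a Prometheus /metrics endpoint from a FastAPI app.
# Illustrative wiring only, not Bisheng's actual implementation.
from fastapi import FastAPI
from prometheus_client import make_asgi_app

app = FastAPI()

# prometheus_client ships an ASGI app that renders the default registry
# (process_*, python_* and any custom metrics) in Prometheus text format.
app.mount("/metrics", make_asgi_app())
```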
## Built-in Monitoring Metrics

### 1. System-Level Metrics

The system-level metrics exposed by Bisheng by default include:
| Metric type | Metric name | Description | Labels |
|---|---|---|---|
| Counter | http_requests_total | Total number of HTTP requests | method, endpoint, status |
| Gauge | http_requests_in_flight | Requests currently being processed | method, endpoint |
| Histogram | http_request_duration_seconds | Distribution of HTTP request latency | method, endpoint |
| Counter | process_cpu_seconds_total | Cumulative process CPU time | - |
| Gauge | process_resident_memory_bytes | Process resident memory usage | - |
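Once scraped, these built-in metrics can be queried through Prometheus's HTTP API. A small sketch, assuming Prometheus is reachable at `localhost:9090`:

```python
# Query the built-in HTTP metrics via the Prometheus HTTP API.
# The Prometheus URL is an assumption; adjust it to your deployment.
import requests

PROMETHEUS_URL = "http://localhost:9090"

QUERIES = {
    "request_rate": 'sum(rate(http_requests_total[5m]))',
    "p95_latency_seconds": (
        'histogram_quantile(0.95, '
        'sum by (le) (rate(http_request_duration_seconds_bucket[5m])))'
    ),
}

for name, promql in QUERIES.items():
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    print(name, resp.json()["data"]["result"])
```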
### 2. Business-Level Metrics

Bisheng also provides a rich set of business-level metrics:
```python
# Example: business metric definitions in Bisheng
from prometheus_client import Counter, Histogram

# LLM call metrics
llm_requests_total = Counter(
    'bisheng_llm_requests_total',
    'Total LLM API requests',
    ['model', 'provider', 'status']
)

llm_request_duration = Histogram(
    'bisheng_llm_request_duration_seconds',
    'LLM request duration in seconds',
    ['model', 'provider']
)

# Workflow execution metrics
workflow_executions_total = Counter(
    'bisheng_workflow_executions_total',
    'Total workflow executions',
    ['workflow_id', 'status']
)

workflow_execution_duration = Histogram(
    'bisheng_workflow_execution_duration_seconds',
    'Workflow execution duration in seconds',
    ['workflow_id']
)

# Knowledge base operation metrics
knowledge_base_operations = Counter(
    'bisheng_knowledge_base_operations_total',
    'Knowledge base operations count',
    ['operation', 'status']
)
```
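At call sites these collectors are updated with `labels(...).inc()` and `observe(...)`; the label values below are purely illustrative:

```python
# Illustrative call-site updates for the collectors defined above.
import time

start = time.time()
# ... perform the LLM call here ...
llm_requests_total.labels(model="gpt-4", provider="openai", status="success").inc()
llm_request_duration.labels(model="gpt-4", provider="openai").observe(time.time() - start)

workflow_executions_total.labels(workflow_id="wf_demo", status="success").inc()
knowledge_base_operations.labels(operation="upload", status="success").inc()
```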
## Implementing Custom Monitoring Metrics

### 1. Creating a Custom Metrics Class

In Bisheng, business-specific metrics collection can be implemented through a custom monitoring utility class:
```python
# bisheng/utils/monitoring.py
import re

from prometheus_client import Counter, Histogram


class BishengMetrics:
    """Custom monitoring metrics for Bisheng."""

    def __init__(self):
        self.metrics_registry = {}

    @staticmethod
    def _sanitize(name: str) -> str:
        # Prometheus metric names only allow [a-zA-Z0-9_:], so normalize
        # model/provider names such as "gpt-4" before embedding them.
        return re.sub(r'[^a-zA-Z0-9_:]', '_', name)

    def register_llm_metric(self, model_name: str, provider: str):
        """Register metrics for one LLM model/provider pair."""
        metric_prefix = f"bisheng_llm_{self._sanitize(model_name)}_{self._sanitize(provider)}"
        metrics = {
            'requests_total': Counter(
                f'{metric_prefix}_requests_total',
                f'Total requests for {model_name} from {provider}',
                ['status']
            ),
            'tokens_total': Counter(
                f'{metric_prefix}_tokens_total',
                f'Total tokens processed for {model_name}',
                ['type']  # input/output
            ),
            'latency_seconds': Histogram(
                f'{metric_prefix}_latency_seconds',
                f'Request latency for {model_name}',
                buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
            )
        }
        self.metrics_registry[f'llm_{model_name}_{provider}'] = metrics
        return metrics

    def register_workflow_metric(self, workflow_id: str):
        """Register execution metrics for one workflow."""
        metric_prefix = f"bisheng_workflow_{self._sanitize(workflow_id)}"
        metrics = {
            'executions_total': Counter(
                f'{metric_prefix}_executions_total',
                f'Total executions for workflow {workflow_id}',
                ['status']
            ),
            'duration_seconds': Histogram(
                f'{metric_prefix}_duration_seconds',
                f'Execution duration for workflow {workflow_id}',
                buckets=[1.0, 5.0, 10.0, 30.0, 60.0, 300.0]
            ),
            'node_executions': Counter(
                f'{metric_prefix}_node_executions_total',
                f'Node executions in workflow {workflow_id}',
                ['node_type', 'status']
            )
        }
        self.metrics_registry[f'workflow_{workflow_id}'] = metrics
        return metrics


# Global metrics instance
bisheng_metrics = BishengMetrics()
```
### 2. Integrating Metrics into Business Logic

Hook the metrics into the actual request-handling code:
```python
# bisheng/api/services/llm_service.py
import time
from contextlib import contextmanager

from bisheng.utils.monitoring import bisheng_metrics


class LLMService:
    def __init__(self):
        self.metrics_cache = {}

    def get_metrics_for_model(self, model: str, provider: str):
        """Fetch or lazily create the metrics for a model/provider pair."""
        key = f"{model}_{provider}"
        if key not in self.metrics_cache:
            self.metrics_cache[key] = bisheng_metrics.register_llm_metric(model, provider)
        return self.metrics_cache[key]

    @contextmanager
    def track_llm_request(self, model: str, provider: str):
        """Context manager that records the outcome and latency of one LLM request."""
        metrics = self.get_metrics_for_model(model, provider)
        start_time = time.time()
        status = "success"
        try:
            yield
        except Exception:
            status = "error"
            raise
        finally:
            duration = time.time() - start_time
            metrics['requests_total'].labels(status=status).inc()
            metrics['latency_seconds'].observe(duration)

    def track_tokens(self, model: str, provider: str, token_type: str, count: int):
        """Record token usage (token_type is 'input' or 'output')."""
        metrics = self.get_metrics_for_model(model, provider)
        metrics['tokens_total'].labels(type=token_type).inc(count)


# Usage example
llm_service = LLMService()

def call_llm_api(model: str, provider: str, prompt: str):
    with llm_service.track_llm_request(model, provider):
        # Actual LLM call (make_llm_call is a placeholder for the real client call)
        response = make_llm_call(model, provider, prompt)
        # Track token usage (word counts stand in for real token counts here)
        llm_service.track_tokens(model, provider, 'input', len(prompt.split()))
        llm_service.track_tokens(model, provider, 'output', len(response.split()))
        return response
```
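The workflow side can be instrumented in the same way. The sketch below is a hypothetical `WorkflowService` built on `register_workflow_metric` from the custom metrics class above; it is illustrative, not Bisheng's actual service code.

```python
# Hypothetical workflow-side instrumentation using BishengMetrics above.
import time
from contextlib import contextmanager

from bisheng.utils.monitoring import bisheng_metrics


class WorkflowService:
    def __init__(self):
        self.metrics_cache = {}

    def _metrics_for(self, workflow_id: str):
        if workflow_id not in self.metrics_cache:
            self.metrics_cache[workflow_id] = bisheng_metrics.register_workflow_metric(workflow_id)
        return self.metrics_cache[workflow_id]

    @contextmanager
    def track_execution(self, workflow_id: str):
        """Record one workflow execution: outcome counter plus duration histogram."""
        metrics = self._metrics_for(workflow_id)
        start = time.time()
        status = "success"
        try:
            yield
        except Exception:
            status = "failed"
            raise
        finally:
            metrics["executions_total"].labels(status=status).inc()
            metrics["duration_seconds"].observe(time.time() - start)
```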
## Prometheus Configuration and Data Collection

### 1. Prometheus Scrape Configuration

Create a Prometheus configuration file to scrape Bisheng's metrics. The rule and Alertmanager settings referenced later in this article are wired in here as well, and MySQL/Redis are scraped through exporters since they do not speak the Prometheus exposition format natively:
```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alert rule files (see the alerting section below); paths are relative to this file.
rule_files:
  - 'rules/*.yml'

# Send firing alerts to Alertmanager (deployed in the compose file below).
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  - job_name: 'bisheng-backend'
    static_configs:
      - targets: ['bisheng-backend:7860']
    metrics_path: '/metrics'
    scrape_interval: 15s
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        replacement: 'bisheng-backend'

  - job_name: 'bisheng-worker'
    static_configs:
      - targets: ['bisheng-backend-worker:7860']
    metrics_path: '/metrics'
    scrape_interval: 15s

  # MySQL and Redis do not expose Prometheus metrics natively; scrape them through
  # mysqld_exporter / redis_exporter sidecars (service names here are assumptions).
  # Milvus exposes its own metrics endpoint on port 9091.
  - job_name: 'database-services'
    static_configs:
      - targets: ['mysqld-exporter:9104', 'redis-exporter:9121', 'milvus:9091']
    scrape_interval: 30s
```
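Before relying on the scrape jobs, it is worth confirming that the backend actually serves the exposition format. A small sanity check, where host and port are assumptions matching the scrape config above:

```python
# Quick sanity check that the backend exposes Bisheng metrics.
# Host and port are assumptions matching the scrape config above.
import requests

resp = requests.get("http://bisheng-backend:7860/metrics", timeout=5)
resp.raise_for_status()

bisheng_lines = [line for line in resp.text.splitlines() if line.startswith("bisheng_")]
print(f"Found {len(bisheng_lines)} bisheng_* samples")
```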
### 2. Docker Compose Integration

Add Prometheus, Grafana, and Alertmanager to the Docker deployment:
```yaml
# docker-compose-monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules   # alert rule files (see below)
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin   # change for production deployments
    restart: unless-stopped
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
```
## Alerting Rule Configuration

### 1. Prometheus Alerting Rules

Create alerting rules tailored to Bisheng, placed under the `rules/` directory referenced by `rule_files` above:
```yaml
# rules/bisheng-alerts.yml
groups:
  - name: bisheng-alerts
    rules:
      # LLM service alerts
      - alert: LLMHighErrorRate
        expr: sum by (model, provider) (rate(bisheng_llm_requests_total{status="error"}[5m])) / sum by (model, provider) (rate(bisheng_llm_requests_total[5m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High LLM error rate"
          description: "LLM error rate has exceeded 10%; current value: {{ $value }}"

      - alert: LLMHighLatency
        expr: histogram_quantile(0.95, sum by (le, model, provider) (rate(bisheng_llm_request_duration_seconds_bucket[5m]))) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High LLM latency"
          description: "LLM p95 latency has exceeded 5 seconds; current value: {{ $value }}s"

      # Workflow alerts
      - alert: WorkflowHighFailureRate
        expr: sum by (workflow_id) (rate(bisheng_workflow_executions_total{status="failed"}[10m])) / sum by (workflow_id) (rate(bisheng_workflow_executions_total[10m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High workflow failure rate"
          description: "Workflow failure rate has exceeded 5%; current value: {{ $value }}"

      - alert: WorkflowSlowExecution
        expr: histogram_quantile(0.90, sum by (le, workflow_id) (rate(bisheng_workflow_execution_duration_seconds_bucket[10m]))) > 120
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow workflow execution"
          description: "Workflow p90 execution time has exceeded 120 seconds; current value: {{ $value }}s"

      # System resource alerts
      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes / (1024 * 1024) > 4096
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Process memory usage has exceeded 4GB; current value: {{ $value }}MB"

      - alert: HighCPUUsage
        expr: rate(process_cpu_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "Process CPU usage has exceeded 80%; current value: {{ $value }}"
```
### 2. Alertmanager Configuration

Configure the notification channels:
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'sms-notifications'
      repeat_interval: 30m
    - match:
        severity: warning
      receiver: 'email-notifications'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#bisheng-alerts'
        send_resolved: true
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'email-notifications'
    email_configs:
      - to: 'devops@company.com'
        from: 'alertmanager@bisheng.com'
        smarthost: 'smtp.company.com:587'
        auth_username: 'alertmanager'
        auth_password: 'password'
        send_resolved: true

  - name: 'sms-notifications'
    webhook_configs:
      - url: 'http://sms-gateway/send'
        send_resolved: true
```
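To check which alerts are currently firing without opening the UI, the Alertmanager v2 API can be queried directly; the URL below is an assumption matching the compose file above:

```python
# List currently firing alerts from the Alertmanager v2 API.
# The Alertmanager URL is an assumption matching the compose file above.
import requests

resp = requests.get("http://localhost:9093/api/v2/alerts", params={"active": "true"}, timeout=5)
resp.raise_for_status()

for alert in resp.json():
    labels = alert["labels"]
    print(labels.get("alertname"), labels.get("severity"), alert["status"]["state"])
```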
## Grafana Monitoring Dashboards

### 1. Core Dashboard Configuration

Create a comprehensive Bisheng dashboard (excerpt of the dashboard JSON):
```json
{
  "dashboard": {
    "title": "Bisheng Monitoring Dashboard",
    "panels": [
      {
        "title": "LLM Request Overview",
        "type": "stat",
        "targets": [
          {
            "expr": "rate(bisheng_llm_requests_total[5m])",
            "legendFormat": "{{model}} - {{provider}}"
          }
        ]
      },
      {
        "title": "Workflow Execution Status",
        "type": "piechart",
        "targets": [
          {
            "expr": "sum by(status)(bisheng_workflow_executions_total)",
            "legendFormat": "{{status}}"
          }
        ]
      }
    ]
  }
}
```
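Dashboards can be provisioned from files (as mounted in the compose file above) or pushed through Grafana's HTTP API. A sketch of the API route, assuming the JSON above is saved to a local file and an API token with editor rights is available:

```python
# Push the dashboard JSON to Grafana via its HTTP API.
# GRAFANA_URL, the token, and the file name are assumptions for this sketch.
import json

import requests

GRAFANA_URL = "http://localhost:3000"
API_TOKEN = "<grafana-api-token>"

# The file is assumed to contain the {"dashboard": {...}} excerpt shown above.
with open("bisheng-dashboard.json") as f:
    dashboard = json.load(f)["dashboard"]

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={"dashboard": dashboard, "overwrite": True},
)
resp.raise_for_status()
print(resp.json())
```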
### 2. Key Performance Indicator Dashboard

A second dashboard can focus on the key indicators used by the alerting rules above: LLM error rate and p95 latency, workflow failure rate and execution duration, and process CPU/memory usage.
## Best Practices and Optimization Tips

### 1. Metric Naming Conventions

Follow a consistent naming convention:

- Use the `bisheng_` prefix to distinguish Bisheng-specific metrics
- Use the `_total` suffix for counter metrics
- Use the `_seconds` suffix for duration metrics
- Name labels in lowercase snake_case
### 2. Performance Optimization Tips

```python
# Avoid constructing metric objects on hot request paths: prometheus_client
# refuses to register the same metric name twice, and object creation is
# comparatively expensive. Create each collector once and cache it.
import threading

from prometheus_client import Counter


class OptimizedMetrics:
    """Thread-safe, lazily created metric cache keyed by metric name."""

    def __init__(self):
        self._metrics = {}
        self._lock = threading.Lock()

    def get_metric(self, name: str, documentation: str, labelnames=()):
        """Return the cached Counter, creating it on first use (double-checked locking)."""
        if name not in self._metrics:
            with self._lock:
                if name not in self._metrics:
                    self._metrics[name] = Counter(name, documentation, labelnames)
        return self._metrics[name]
```
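Usage then reduces to a cheap dictionary lookup on the hot path (the metric name below is a hypothetical example):

```python
# Hot-path usage: the Counter is created once, then reused.
metrics = OptimizedMetrics()

events = metrics.get_metric(
    "bisheng_custom_events_total", "Custom business events", ["kind"]
)
events.labels(kind="document_upload").inc()
```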
### 3. Monitoring Strategy Tips

- Layered monitoring: cover the full stack, from infrastructure up to business logic
- Golden signals: focus on latency, traffic, errors, and saturation
- SLO management: define service level objectives based on business goals
- Capacity planning: forecast resource needs from historical data
## Summary

Bisheng provides strong metrics-collection capabilities, and combined with Prometheus and Grafana it supports a complete monitoring and alerting setup. This article covered:

- Bisheng's monitoring architecture and built-in metrics
- How to implement custom monitoring metrics
- Prometheus configuration and data collection
- Configuring and managing alerting rules
- Building and tuning Grafana dashboards
- Best practices and performance optimization tips

With a well-designed monitoring setup, the Bisheng platform can run stably in enterprise environments, surface potential problems early, and provide a reliable technical foundation for the business.
## Roadmap

Planned directions for Bisheng's monitoring capabilities include:

- AI-driven anomaly detection: use machine learning to recognize abnormal patterns automatically
- Root cause analysis: automatically correlate related metrics to locate the source of a problem quickly
- Cost optimization: intelligent resource-scheduling suggestions based on usage patterns
- User experience monitoring: measure application performance from the end user's point of view

Through continuous improvements to its monitoring stack, Bisheng aims to offer enterprise users an even more stable and efficient LLM application development experience.
Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.



