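Building a Kratos Monitoring Stack: Prometheus Metrics and Grafana Visualization
1. Pain Points and the Solution
Still struggling with monitoring in a microservice architecture? Latency climbs, error rates spike, and you cannot pin down the root cause quickly? This article walks through building a complete monitoring stack on the Kratos framework: Prometheus collects the core metrics, and Grafana turns them into dashboards so the state of your services is always visible.
After reading this article you will have:
- A complete workflow for setting up monitoring infrastructure for Kratos services from scratch
- An approach to collecting both core business metrics and system metrics
- Design guidelines and best practices for a highly available monitoring architecture
- Tips for configuring and tuning production-grade Grafana dashboards
2. Monitoring Architecture Design
2.1 Overall Architecture
2.2 Core Components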
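| Component | Responsibility | Technology |
|---|---|---|
| Instrumentation layer | Instruments services and exposes runtime metrics | Kratos Metrics middleware |
| Collection layer | Periodically scrapes and stores metric data | Prometheus 2.45+ |
| Visualization layer | Displays metrics on dashboards | Grafana 10.0+ |
| Alerting layer | Detects abnormal metrics and sends notifications | Prometheus Alertmanager |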
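To make the pull model in the table concrete, here is a minimal sketch of what the instrumentation side boils down to: an HTTP handler that renders a counter in the Prometheus text exposition format, which Prometheus then scrapes. It uses only the Go standard library. The metric name `demo_requests_total` and the helper `scrape` are made up for illustration; a real Kratos service would rely on the metrics middleware and an exporter rather than hand-written output.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// requestsTotal is a hypothetical counter standing in for a real metrics library.
var requestsTotal atomic.Int64

// metricsHandler renders the counter in the Prometheus text exposition format.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintln(w, "# HELP demo_requests_total Total requests handled.")
	fmt.Fprintln(w, "# TYPE demo_requests_total counter")
	fmt.Fprintf(w, "demo_requests_total %d\n", requestsTotal.Load())
}

// scrape performs one in-process request against the handler,
// which is what Prometheus does over HTTP on every scrape interval.
func scrape() string {
	rec := httptest.NewRecorder()
	metricsHandler(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	return rec.Body.String()
}

func main() {
	requestsTotal.Add(3) // pretend three requests were served
	fmt.Print(scrape())
}
```

The service only keeps the current counter value; history, rates, and alerting are all derived later on the Prometheus side from successive scrapes.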
3. Environment Setup and Deployment
3.1 Installing Prometheus
# Download and unpack Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
cd prometheus-2.45.0.linux-amd64

# Create the configuration file prometheus.yml
cat > prometheus.yml << EOF
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kratos-services'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:8000', 'localhost:8001']  # Kratos service addresses
EOF

# Start Prometheus in the background
./prometheus --config.file=prometheus.yml &
3.2 Installing Grafana
# Install Grafana and its dependencies
sudo apt-get install -y adduser libfontconfig1
wget https://dl.grafana.com/enterprise/release/grafana-enterprise_10.0.3_amd64.deb
sudo dpkg -i grafana-enterprise_10.0.3_amd64.deb

# Start the Grafana service and enable it at boot
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
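4. Instrumenting Kratos Services
4.1 Integrating the Metrics Middleware
Kratos provides ready-to-use metrics collection through its middleware/metrics package, with middleware for both the server side and the client side: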
package main

import (
	"github.com/go-kratos/kratos/v2"
	"github.com/go-kratos/kratos/v2/middleware/metrics"
	"github.com/go-kratos/kratos/v2/transport/grpc"
	"github.com/go-kratos/kratos/v2/transport/http"
	"github.com/prometheus/client_golang/prometheus/promhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/prometheus"
	metricsdk "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// Create the Prometheus exporter (it registers with the default Prometheus registry)
	exporter, err := prometheus.New()
	if err != nil {
		panic(err)
	}

	// Set up the MeterProvider with the default latency bucket view
	provider := metricsdk.NewMeterProvider(
		metricsdk.WithReader(exporter),
		metricsdk.WithView(metrics.DefaultSecondsHistogramView(metrics.DefaultServerSecondsHistogramName)),
	)
	otel.SetMeterProvider(provider)
	meter := otel.GetMeterProvider().Meter("kratos-demo")

	// Create the default instruments
	serverRequests, err := metrics.DefaultRequestsCounter(meter, metrics.DefaultServerRequestsCounterName)
	if err != nil {
		panic(err)
	}
	serverSeconds, err := metrics.DefaultSecondsHistogram(meter, metrics.DefaultServerSecondsHistogramName)
	if err != nil {
		panic(err)
	}

	// Create the HTTP server
	httpSrv := http.NewServer(
		http.Address(":8000"),
		http.Middleware(
			metrics.Server(
				metrics.WithRequests(serverRequests),
				metrics.WithSeconds(serverSeconds),
			),
		),
	)
	// Expose the collected metrics on the path Prometheus scrapes
	httpSrv.Handle("/metrics", promhttp.Handler())

	// Create the gRPC server
	grpcSrv := grpc.NewServer(
		grpc.Address(":9000"),
		grpc.Middleware(
			metrics.Server(
				metrics.WithRequests(serverRequests),
				metrics.WithSeconds(serverSeconds),
			),
		),
	)

	// Register service implementations...

	// Start the application
	app := kratos.New(
		kratos.Name("kratos-demo"),
		kratos.Server(
			httpSrv,
			grpcSrv,
		),
	)
	if err := app.Run(); err != nil {
		panic(err)
	}
}
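4.2 Core Metrics
By default, the Kratos Metrics middleware collects the following core metrics:
| Metric name | Type | Labels | Description |
|---|---|---|---|
| server_requests_code_total | Counter | kind, operation, code, reason | Total requests, broken down by status code |
| server_requests_seconds_bucket | Histogram | kind, operation | Request latency distribution |
The default latency bucket boundaries are 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, and 1 (in seconds). This range suits typical low-latency RPC traffic, but every request slower than 1 s falls into the +Inf bucket, so widen the boundaries if your services routinely exceed that. Before writing queries, check the series names and the code label values your service actually exposes at /metrics, since the exporter may append unit or type suffixes to the names.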
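The way observations land in those buckets can be sketched in a few lines of standard-library Go. Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and a final +Inf bucket counts everything. The function name `cumulativeCounts` and the sample latencies are illustrative only.

```go
package main

import "fmt"

// Default bucket boundaries from the text above, in seconds.
var boundaries = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1}

// cumulativeCounts returns one count per boundary plus a final +Inf bucket.
// Each bucket counts the observations less than or equal to its upper bound.
func cumulativeCounts(observations []float64) []uint64 {
	counts := make([]uint64, len(boundaries)+1)
	for _, v := range observations {
		for i, le := range boundaries {
			if v <= le {
				counts[i]++
			}
		}
		counts[len(boundaries)]++ // the +Inf bucket counts every observation
	}
	return counts
}

func main() {
	// Four made-up request latencies in seconds; the last one exceeds 1 s.
	latencies := []float64{0.003, 0.04, 0.3, 2.5}
	fmt.Println(cumulativeCounts(latencies)) // prints [1 1 1 2 2 2 3 3 4]
}
```

The 2.5 s request is visible only in the +Inf bucket, which is why latency quantiles cannot be resolved beyond the largest finite boundary.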
5. Implementing Custom Business Metrics
5.1 Defining Business Metrics
package metrics

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

var (
	meter = otel.GetMeterProvider().Meter("order-service")

	// Order-related instruments
	orderTotalCounter metric.Int64Counter
	paymentAmountSum  metric.Float64ObservableCounter
)

func init() {
	var err error

	// Counter for the number of orders created
	orderTotalCounter, err = meter.Int64Counter(
		"order_total",
		metric.WithDescription("Total number of orders created"),
		metric.WithUnit("{order}"),
	)
	if err != nil {
		panic(err)
	}

	// Observable counter for the total payment amount.
	// The callback runs on every collection, so keep the query cheap or cache its result.
	paymentAmountSum, err = meter.Float64ObservableCounter(
		"payment_amount_total",
		metric.WithFloat64Callback(func(_ context.Context, o metric.Float64Observer) error {
			// Query the actual total payment amount from the database
			o.Observe(queryTotalPaymentAmount())
			return nil
		}),
		metric.WithDescription("Total payment amount"),
		metric.WithUnit("CNY"),
	)
	if err != nil {
		panic(err)
	}
}

// queryTotalPaymentAmount is a placeholder; replace it with the real database query.
func queryTotalPaymentAmount() float64 {
	return 0
}

// RecordNewOrder records the creation of a new order.
func RecordNewOrder(ctx context.Context, orderType string) {
	orderTotalCounter.Add(ctx, 1,
		metric.WithAttributes(
			attribute.String("order_type", orderType),
			attribute.String("status", "created"),
		),
	)
}
5.2 Using the Metrics in Business Logic
package service

import (
	"context"

	"github.com/go-kratos/kratos/v2/log"

	"kratos-demo/internal/biz"
	"kratos-demo/internal/metrics"
	"kratos-demo/internal/model"
)

type OrderService struct {
	orderRepo biz.OrderRepo
	log       *log.Helper
}

func NewOrderService(orderRepo biz.OrderRepo, logger log.Logger) *OrderService {
	return &OrderService{
		orderRepo: orderRepo,
		log:       log.NewHelper(logger),
	}
}

func (s *OrderService) CreateOrder(ctx context.Context, req *model.CreateOrderRequest) (*model.Order, error) {
	// Business logic...
	order, err := s.orderRepo.Create(ctx, req)
	if err != nil {
		return nil, err
	}

	// Record the business metric only after the order was created successfully
	metrics.RecordNewOrder(ctx, req.OrderType)
	return order, nil
}
6. Prometheus Configuration and Tuning
6.1 Basic Configuration
# prometheus.yml core configuration
global:
  scrape_interval: 15s      # default scrape interval
  evaluation_interval: 15s  # rule evaluation interval
  scrape_timeout: 10s       # scrape timeout

rule_files:
  - "alert_rules.yml"       # alerting rules file

scrape_configs:
  - job_name: 'kratos-services'
    metrics_path: '/metrics'
    scrape_interval: 5s     # scrape interval for Kratos services
    honor_labels: true      # keep labels from the scraped data on conflict
    static_configs:
      - targets: ['service-a:8000', 'service-b:8000']
        labels:
          group: 'business-services'
      - targets: ['service-c:8000', 'service-d:8000']
        labels:
          group: 'infrastructure-services'

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
6.2 Advanced Configuration: Service Discovery
In Kubernetes environments where services scale in and out dynamically, service discovery is the recommended way to find targets:
  - job_name: 'k8s-services'
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app]
        regex: kratos-service
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: http-metrics
        action: keep
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: service_name
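The two `keep` rules act as successive filters: a discovered target is scraped only if every rule matches. Prometheus anchors relabel regexes at both ends, so `kratos-service` matches that exact value and not `kratos-service-v2`. A minimal sketch of that filtering logic in plain Go, with hypothetical type and function names (`rule`, `keep`, `keepTarget`) and made-up target labels:

```go
package main

import (
	"fmt"
	"regexp"
)

// rule is a simplified `keep` relabel rule with a single source label.
type rule struct {
	sourceLabel string
	pattern     string
}

// keep reports whether the label value fully matches the rule's pattern.
// The pattern is wrapped in ^(?:...)$ to mimic the anchoring Prometheus applies.
func keep(labels map[string]string, r rule) bool {
	re := regexp.MustCompile("^(?:" + r.pattern + ")$")
	return re.MatchString(labels[r.sourceLabel])
}

// keepTarget applies the rules in order; the target is dropped at the first mismatch.
func keepTarget(labels map[string]string, rules []rule) bool {
	for _, r := range rules {
		if !keep(labels, r) {
			return false
		}
	}
	return true
}

func main() {
	rules := []rule{
		{"__meta_kubernetes_service_label_app", "kratos-service"},
		{"__meta_kubernetes_endpoint_port_name", "http-metrics"},
	}
	targets := []map[string]string{
		{"__meta_kubernetes_service_label_app": "kratos-service", "__meta_kubernetes_endpoint_port_name": "http-metrics"},
		{"__meta_kubernetes_service_label_app": "kratos-service", "__meta_kubernetes_endpoint_port_name": "grpc"},
		{"__meta_kubernetes_service_label_app": "kratos-service-v2", "__meta_kubernetes_endpoint_port_name": "http-metrics"},
	}
	for _, t := range targets {
		fmt.Println(t["__meta_kubernetes_service_label_app"], t["__meta_kubernetes_endpoint_port_name"], keepTarget(t, rules))
	}
}
```

Only the first target survives; the second fails the port-name rule and the third fails the anchored app-label match.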
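6.3 Storage Tuning
# Tuned Prometheus startup flags:
#   --storage.tsdb.retention.time=15d   keep data for 15 days
#   --storage.tsdb.wal-compression      enable WAL compression
#   --web.enable-lifecycle              enable the HTTP reload and shutdown endpoints
#   --query.max-concurrency=20          limit the number of concurrent queries
#   --query.timeout=2m                  query timeout
./prometheus \
  --config.file=prometheus.yml \
  --storage.tsdb.path=data/ \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.wal-compression \
  --web.enable-lifecycle \
  --query.max-concurrency=20 \
  --query.timeout=2m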
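7. Configuring Grafana Dashboards
7.1 Adding the Prometheus Data Source
- Log in to Grafana and go to Configuration > Data Sources (in newer Grafana menus this may appear under Connections > Data sources)
- Click Add data source and select Prometheus
- Set the Prometheus server URL (for example http://prometheus:9090)
- Leave the other settings at their defaults and click Save & Test
7.2 Importing Dashboards
- Go to Create > Import (in newer Grafana menus, Dashboards > New > Import)
- Enter the dashboard ID: 1860 (Node Exporter Full) and 12856 (Prometheus 2.0 Stats), importing one at a time
- Select the Prometheus data source configured above
- Click Import to finish
7.3 Custom Business Dashboard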
7.3 自定义业务监控面板
{
"annotations": {
"list": [
{
"builtIn": 1,
"datasource": "-- Grafana --",
"enable": true,
"hide": true,
"iconColor": "rgba(0, 211, 255, 1)",
"name": "Annotations & Alerts",
"type": "dashboard"
}
]
},
"editable": true,
"gnetId": null,
"graphTooltip": 0,
"id": 1,
"iteration": 1685467234567,
"links": [],
"panels": [
{
"aliasColors": {},
"bars": false,
"dashLength": 10,
"dashes": false,
"datasource": "Prometheus",
"fieldConfig": {
"defaults": {
"links": []
},
"overrides": []
},
"fill": 1,
"fillGradient": 0,
"gridPos": {
"h": 8,
"w": 12,
"x": 0,
"y": 0
},
"hiddenSeries": false,
"id": 2,
"legend": {
"avg": false,
"current": false,
"max": false,
"min": false,
"show": true,
"total": false,
"values": false
},
"lines": true,
"linewidth": 1,
"nullPointMode": "null",
"options": {
"alertThreshold": true
},
"percentage": false,
"pluginVersion": "10.0.3",
"pointradius": 2,
"points": false,
"renderer": "flot",
"seriesOverrides": [],
"spaceLength": 10,
"stack": false,
"steppedLine": false,
"targets": [
{
"expr": "sum(rate(server_requests_code_total{code=~\"2..\"}[5m])) / sum(rate(server_requests_code_total[5m]))",
"interval": "",
"legendFormat": "Success rate",
"refId": "A"
}
],
"thresholds": [],
"timeFrom": null,
"timeRegions": [],
"timeShift": null,
"title": "Request success rate",
"tooltip": {
"shared": true,
"sort": 0,
"value_type": "individual"
},
"type": "graph",
"xaxis": {
"buckets": null,
"mode": "time",
"name": null,
"show": true,
"values": []
},
"yaxes": [
{
"format": "percentunit",
"label": null,
"logBase": 1,
"max": "1",
"min": "0.9",
"show": true
},
{
"format": "short",
"label": null,
"logBase": 1,
"max": null,
"min": null,
"show": true
}
],
"yaxis": {
"align": false,
"alignLevel": null
}
}
// more panel definitions...
],
"refresh": "5s",
"schemaVersion": 37,
"style": "dark",
"tags": [],
"templating": {
"list": []
},
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": [
"5s",
"10s",
"30s",
"1m",
"5m",
"15m",
"30m",
"1h",
"2h",
"1d"
]
},
"timezone": "",
"title": "Kratos Business Monitoring Dashboard",
"uid": "kratos-business",
"version": 1
}
8. Alerting Rules and Best Practices
8.1 Basic Alerting Rules
# alert_rules.yml
groups:
  - name: service_alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(server_requests_code_total{code=~"5.."}[5m])) / sum(rate(server_requests_code_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
          service: kratos
        annotations:
          summary: "Service error rate is too high"
          description: "Error rate {{ $value | humanizePercentage }} has exceeded the 5% threshold for at least 2 minutes"
          value: "{{ $value | humanizePercentage }}"
          dashboard_url: "http://grafana:3000/d/kratos-business"

      - alert: SlowResponseTime
        # Aggregate by job so that {{ $labels.job }} is populated in the annotation
        expr: histogram_quantile(0.95, sum(rate(server_requests_seconds_bucket[5m])) by (le, job)) > 0.5
        for: 5m
        labels:
          severity: warning
          service: kratos
        annotations:
          summary: "Service response latency is too high"
          description: "{{ $labels.job }} P95 response time {{ $value | humanizeDuration }} exceeds 500ms"
8.2 Alert Routing and Inhibition Rules
# Alertmanager routing and inhibition rules
# (the 'webhook' and 'sms' receivers must be defined under `receivers:`)
route:
  group_by: ['alertname', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 4h
  receiver: 'webhook'
  routes:
    - match:
        severity: critical
      receiver: 'sms'
      continue: true

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['service', 'instance']
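The inhibition rule reads as: while a critical alert is firing, mute any warning alert that carries the same `service` and `instance` values. A simplified sketch of that matching logic in standard-library Go follows; the types and function names are hypothetical and Alertmanager's real implementation handles more cases. As in Alertmanager, a label missing on both alerts counts as equal, which Go's empty-string default for map lookups reproduces here.

```go
package main

import "fmt"

// alert is a label set, as Alertmanager sees it.
type alert map[string]string

// inhibitRule mirrors the fields used in the configuration above.
type inhibitRule struct {
	sourceMatch map[string]string
	targetMatch map[string]string
	equal       []string
}

// matches reports whether the alert carries every label/value pair in want.
func matches(a alert, want map[string]string) bool {
	for k, v := range want {
		if a[k] != v {
			return false
		}
	}
	return true
}

// inhibited reports whether target is muted by any currently firing source alert.
func inhibited(target alert, firing []alert, r inhibitRule) bool {
	if !matches(target, r.targetMatch) {
		return false
	}
	for _, src := range firing {
		if !matches(src, r.sourceMatch) {
			continue
		}
		same := true
		for _, name := range r.equal {
			if src[name] != target[name] { // missing labels compare as ""
				same = false
				break
			}
		}
		if same {
			return true
		}
	}
	return false
}

func main() {
	r := inhibitRule{
		sourceMatch: map[string]string{"severity": "critical"},
		targetMatch: map[string]string{"severity": "warning"},
		equal:       []string{"service", "instance"},
	}
	firing := []alert{{"alertname": "HighErrorRate", "severity": "critical", "service": "kratos"}}

	slow := alert{"alertname": "SlowResponseTime", "severity": "warning", "service": "kratos"}
	other := alert{"alertname": "SlowResponseTime", "severity": "warning", "service": "billing"}
	fmt.Println(inhibited(slow, firing, r), inhibited(other, firing, r)) // prints true false
}
```

The latency warning for the same service is muted while the error-rate alert fires, whereas the warning for a different service still goes out.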
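9. Monitoring Best Practices and Performance Tuning
9.1 Metric Design Principles
- Relevance: collect only metrics that relate to a business goal
- Simplicity: avoid over-designed label dimensions and keep label cardinality under control
- Aggregatability: make sure metrics can be aggregated and analyzed along different dimensions
- Consistency: follow a single naming convention and a single set of units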
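Label cardinality is the principle that most often goes wrong in practice, because every distinct combination of label values becomes its own time series. One common safeguard, sketched here in standard-library Go, is to map raw values onto a fixed allow-list before using them as labels. The allowed order types and the helper names are assumptions for illustration and are not part of Kratos or Prometheus.

```go
package main

import "fmt"

// allowedOrderTypes is a hypothetical fixed set of label values.
var allowedOrderTypes = map[string]bool{
	"normal":  true,
	"presale": true,
	"group":   true,
}

// normalizeOrderType maps any unexpected value to "other" so that
// user-controlled input cannot create an unbounded number of series.
func normalizeOrderType(raw string) string {
	if allowedOrderTypes[raw] {
		return raw
	}
	return "other"
}

// seriesUpperBound multiplies the number of distinct values of each label,
// giving the worst-case number of series for one metric.
func seriesUpperBound(valuesPerLabel ...int) int {
	n := 1
	for _, v := range valuesPerLabel {
		n *= v
	}
	return n
}

func main() {
	fmt.Println(normalizeOrderType("presale"))    // prints presale
	fmt.Println(normalizeOrderType("user-12345")) // prints other
	// order_type has 4 possible values (3 allowed + "other"), status has 2.
	fmt.Println(seriesUpperBound(4, 2)) // prints 8
}
```

Applying the same check to a label such as a user ID makes the problem obvious: the upper bound becomes the number of users, which is why identifiers belong in logs or traces rather than in metric labels.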
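9.2 Performance Tuning Strategies
- Metric sampling
  - Sample high-frequency, low-value metrics instead of recording every event (for example, at a 10% sampling rate)
  - Prefer histograms to summaries so quantiles can be aggregated across instances on the server, and keep the number of buckets small to limit storage overhead
- Storage tuning
    # Prometheus storage flags
    --storage.tsdb.retention.time=15d      # keep data for 15 days
    --storage.tsdb.max-block-duration=2h   # maximum block duration
    --storage.tsdb.min-block-duration=2h   # minimum block duration
    --storage.tsdb.wal-compression         # enable WAL compression
- Query tuning
  - Avoid rate() over very short windows; the range should span several scrape intervals so that every window contains enough samples
  - Use downsampled data for queries over large time ranges (Prometheus does not downsample on its own, so this typically relies on recording rules or a long-term storage layer)
  - Precompute complex queries with recording rules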
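The advice about short rate() windows follows from how a per-second rate is derived: at least two samples must fall inside the window, otherwise there is nothing to subtract. The following standard-library Go sketch shows a deliberately simplified rate calculation with counter-reset handling. Unlike Prometheus it does not extrapolate to the window edges, and the `sample` type, `simpleRate` function, and data points are illustrative assumptions.

```go
package main

import "fmt"

// sample is one scraped counter value at time t (seconds).
type sample struct {
	t float64
	v float64
}

// simpleRate returns the per-second increase of a counter over [start, end].
// It reports false when fewer than two samples fall inside the window.
func simpleRate(samples []sample, start, end float64) (float64, bool) {
	var in []sample
	for _, s := range samples {
		if s.t >= start && s.t <= end {
			in = append(in, s)
		}
	}
	if len(in) < 2 {
		return 0, false
	}
	increase := 0.0
	for i := 1; i < len(in); i++ {
		delta := in[i].v - in[i-1].v
		if delta < 0 {
			// The counter was reset (for example by a restart): count from zero.
			delta = in[i].v
		}
		increase += delta
	}
	return increase / (in[len(in)-1].t - in[0].t), true
}

func main() {
	// A counter scraped every 15 s, growing by 30 per scrape.
	samples := []sample{{0, 0}, {15, 30}, {30, 60}, {45, 90}, {60, 120}}

	r, ok := simpleRate(samples, 0, 60)
	fmt.Println(r, ok) // prints 2 true

	_, ok = simpleRate(samples, 50, 60) // a 10 s window holds only one sample
	fmt.Println(ok)                     // prints false
}
```

With a 15 s scrape interval, a 10 s window yields no result at all, which is the failure mode the guideline above is meant to prevent.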
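9.3 High-Availability Deployment
10. Summary and Outlook
This article covered a complete approach to building a Prometheus + Grafana monitoring stack on the Kratos framework, from environment setup, instrumentation, and data collection through to visualization and alerting. With this setup you can monitor a microservice architecture end to end and catch potential problems before they escalate.
Directions in which the monitoring stack may evolve:
- Zero-instrumentation performance monitoring based on eBPF
- Anomaly detection and root cause analysis assisted by machine learning
- Correlated analysis of metrics with logs and traces
- Traffic monitoring in a service mesh
I hope this article helps you build more stable and more reliable microservice systems. If you have questions or suggestions, feel free to leave a comment.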
Author's note: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.



