Pinpoint Cluster Certificate Expiry Monitoring: Prometheus Alerting

Project repository: https://gitcode.com/gh_mirrors/pin/pinpoint

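Background

In a distributed system, an expired SSL/TLS certificate breaks encrypted connections and interrupts service, which in turn disrupts Pinpoint's end-to-end tracing. Checking certificates by hand is slow and easy to forget. Exporting each certificate's remaining lifetime to Prometheus and alerting on it gives advance warning and keeps the cluster stable.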

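How It Works

Certificate monitoring for a Pinpoint cluster follows four steps:

  1. Metric exposure: the Java Agent or Collector registers certificate expiry metrics through Micrometer
  2. Data collection: Prometheus periodically scrapes the exposed metrics
  3. Alert rules: Prometheus evaluates predefined rules to decide whether a certificate is about to expire
  4. Notification: Alertmanager sends the alert to email, Slack, or other channels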

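Environment Setup

Required Components

  • Pinpoint Collector 2.5.0+ (with Micrometer integrated)
  • Prometheus 2.30.0+
  • Alertmanager 0.23.0+

Key Dependency

The project's pom.xml already includes the Prometheus metrics exporter dependency: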

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

collector/pom.xml

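Exposing Certificate Metrics

Implementation

Add certificate monitoring code to the Collector module. It reads the certificate through the Java certificate API (java.security.cert) and registers the metrics: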

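import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.cert.CertificateFactory;
import java.security.cert.X509Certificate;

public class SslCertificateMonitor {
    public SslCertificateMonitor(MeterRegistry registry) throws Exception {
        X509Certificate cert = loadCertificate(); // load the server certificate
        // Function-backed gauges are re-evaluated on every scrape; a boxed constant would be computed once and could be garbage-collected.
        Gauge.builder("ssl_certificate_expiry_seconds", cert, SslCertificateMonitor::secondsLeft).strongReference(true).register(registry);
        Gauge.builder("ssl_certificate_expiry_days", cert, c -> secondsLeft(c) / 86400.0).strongReference(true).register(registry);
    }

    private static double secondsLeft(X509Certificate cert) {
        return (cert.getNotAfter().getTime() - System.currentTimeMillis()) / 1000.0;
    }

    private static X509Certificate loadCertificate() throws Exception {
        // Assumed location and PEM format; adjust to where the deployment keeps its certificate.
        try (InputStream in = Files.newInputStream(Paths.get("/opt/pinpoint/cert/cert.pem"))) {
            return (X509Certificate) CertificateFactory.getInstance("X.509").generateCertificate(in);
        }
    }
}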

collector/src/main/java/com/navercorp/pinpoint/collector/monitor/SslCertificateMonitor.java
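
The value behind both gauges is plain date arithmetic: the certificate's notAfter timestamp minus the current time. The sketch below isolates that calculation using only the JDK so it can be checked without a running Collector. The class name CertExpiry and its methods are hypothetical and not part of Pinpoint or Micrometer.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: the arithmetic behind the two expiry gauges. */
final class CertExpiry {
    private CertExpiry() {}

    /** Seconds from now until notAfter; negative once the certificate has expired. */
    static long secondsUntil(Instant notAfter, Instant now) {
        return Duration.between(now, notAfter).getSeconds();
    }

    /** Fractional days, so 36 hours reads as 1.5 instead of truncating to 1. */
    static double daysUntil(Instant notAfter, Instant now) {
        return secondsUntil(notAfter, now) / 86400.0;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-01T00:00:00Z");
        Instant notAfter = now.plus(Duration.ofDays(30));
        System.out.println(secondsUntil(notAfter, now)); // 2592000
        System.out.println(daysUntil(notAfter, now));    // 30.0
    }
}
```

Dividing by 86400.0 rather than 86400 keeps the day count fractional, which avoids a certificate with 14.9 days left being reported as 14.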

Verifying the Metrics

After starting the Collector, query its Prometheus endpoint:

curl http://collector-host:9997/actuator/prometheus | grep ssl_certificate

Expected output (for a certificate with 30 days remaining):

ssl_certificate_expiry_seconds 2592000.0
ssl_certificate_expiry_days 30.0
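
The same check can be scripted, for example in a smoke test. The following is a minimal sketch that filters a Prometheus text-format response for the certificate metrics, much like the grep above. MetricScan is a hypothetical helper, and it assumes each sample line has the form "name value" with no trailing timestamp, as in the output shown above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: picks samples out of a Prometheus text-format body. */
final class MetricScan {
    private MetricScan() {}

    /**
     * Returns sample name (including any label set) mapped to its value,
     * for every non-comment line whose name starts with the given prefix.
     * Assumes "name value" lines without a trailing timestamp.
     */
    static Map<String, Double> samples(String body, String prefix) {
        Map<String, Double> out = new LinkedHashMap<>();
        for (String raw : body.split("\n")) {
            String line = raw.trim();
            // Skip blanks, "# HELP" / "# TYPE" comments, and unrelated metrics.
            if (line.isEmpty() || line.startsWith("#") || !line.startsWith(prefix)) {
                continue;
            }
            int sp = line.lastIndexOf(' ');
            if (sp < 0) {
                continue; // malformed line: no value field
            }
            out.put(line.substring(0, sp).trim(), Double.parseDouble(line.substring(sp + 1)));
        }
        return out;
    }
}
```

An empty map from samples(body, "ssl_certificate") indicates that the gauges were never registered, which is the situation the troubleshooting steps later in this article address.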

Prometheus Configuration

Scrape Configuration

Edit prometheus.yml to add the Pinpoint targets:

scrape_configs:
  - job_name: 'pinpoint'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['collector-1:9997', 'collector-2:9997']

prometheus/prometheus.yml

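Alert Rules

Create the certificate expiry alert rules, and list the rule file under rule_files in prometheus.yml so that Prometheus loads it: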

groups:
- name: certificate_alerts
  rules:
  - alert: SslCertificateExpiringSoon
    expr: ssl_certificate_expiry_days < 15
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate expiring soon"
      description: "Certificate on {{ $labels.instance }} will expire in {{ $value }} days"
      
  - alert: SslCertificateExpired
    expr: ssl_certificate_expiry_days <= 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "SSL certificate expired"
      description: "Certificate on {{ $labels.instance }} has expired"

prometheus/rules/certificate.rules.yml
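
To make the threshold logic of the two rules concrete, here is a small sketch that mirrors their expressions for a single gauge value. It ignores the for: durations and is only a model of the comparisons, not of how Prometheus evaluates rules. The class name CertAlertRules is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the two rule expressions for one sample value. */
final class CertAlertRules {
    static final double WARNING_DAYS = 15;

    private CertAlertRules() {}

    /** Names of the alerts whose expr would match the given remaining days. */
    static List<String> matching(double days) {
        List<String> alerts = new ArrayList<>();
        if (days < WARNING_DAYS) {
            alerts.add("SslCertificateExpiringSoon"); // expr: ... < 15
        }
        if (days <= 0) {
            alerts.add("SslCertificateExpired");      // expr: ... <= 0
        }
        return alerts;
    }
}
```

Note that an expired certificate matches both expressions, so both alerts fire together. If only the critical alert is wanted in that case, one option is an Alertmanager inhibit_rules entry that lets severity critical suppress severity warning for the same instance.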

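Dashboard

Create a certificate monitoring dashboard in Grafana with the following queries:

  • Remaining-days trend: ssl_certificate_expiry_days
  • Number of unexpired certificates: count(ssl_certificate_expiry_days > 0)

Figure: infrastructure monitoring dashboard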

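Alert Notification

Alertmanager Configuration

Configure Alertmanager to send email notifications. Email delivery also requires SMTP settings such as smtp_smarthost and smtp_from in the global section, which are omitted here: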

route:
  receiver: 'email_notifications'
receivers:
- name: 'email_notifications'
  email_configs:
  - to: 'admin@example.com'
    send_resolved: true

alertmanager/config.yml

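Alert Handling Procedure

  1. On a warning-level alert, renew the certificate promptly
  2. Run the certificate rotation script: ./scripts/update-certificate.sh
  3. Verify that the new certificate is being served, checking the expiry date in the verbose TLS output: curl -vI https://pinpoint-collector:443
  4. Confirm that the alert has resolved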
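
The verification in step 3 can also be done programmatically. The following is a sketch under stated assumptions: it connects with the JVM's default trust store (so the handshake fails for an untrusted or already expired certificate), reads the leaf certificate the server presents, and checks that its remaining lifetime clears the 15-day warning threshold. CertProbe and its method names are hypothetical.

```java
import java.io.IOException;
import java.security.cert.X509Certificate;
import java.time.Duration;
import java.time.Instant;
import javax.net.ssl.SSLSocket;
import javax.net.ssl.SSLSocketFactory;

/** Hypothetical post-rotation check: which certificate is the server presenting? */
final class CertProbe {
    private CertProbe() {}

    /** Performs a TLS handshake and returns the first (leaf) certificate of the peer chain. */
    static X509Certificate fetchLeaf(String host, int port) throws IOException {
        SSLSocketFactory factory = (SSLSocketFactory) SSLSocketFactory.getDefault();
        try (SSLSocket socket = (SSLSocket) factory.createSocket(host, port)) {
            socket.startHandshake();
            return (X509Certificate) socket.getSession().getPeerCertificates()[0];
        }
    }

    /** True when at least minDays whole days remain between now and notAfter. */
    static boolean clearsThreshold(Instant notAfter, Instant now, long minDays) {
        return Duration.between(now, notAfter).toDays() >= minDays;
    }
}
```

For example, clearsThreshold(fetchLeaf("pinpoint-collector", 443).getNotAfter().toInstant(), Instant.now(), 15) returning true suggests the rotated certificate is live and the warning alert should resolve on a later rule evaluation.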

Best Practices

Automated Certificate Renewal

Use Certbot to renew Let's Encrypt certificates automatically:

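# Install Certbot
apt-get install certbot

# Create the renewal script (no file extension: run-parts skips names containing a dot)
cat > /etc/cron.daily/renew-cert << 'EOF'
#!/bin/bash
certbot renew --quiet
cp /etc/letsencrypt/live/pinpoint.example.com/* /opt/pinpoint/cert/
systemctl restart pinpoint-collector
EOF

# cron.daily only runs executable files
chmod +x /etc/cron.daily/renew-cert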

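Per-Environment Monitoring

Use different alert thresholds for the development, test, and production environments:

# Relaxed threshold for the development environment
# (assumes the scraped targets carry an environment label)
- alert: SslCertificateExpiringSoon
  expr: ssl_certificate_expiry_days{environment="dev"} < 7
  labels:
    environment: dev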

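Troubleshooting

Metrics Are Not Collected

  1. Check the Collector configuration: management.endpoints.web.exposure.include: prometheus
  2. Check the certificate file permissions: ls -l /opt/pinpoint/cert/
  3. Check the application log: grep "SslCertificateMonitor" /var/log/pinpoint/collector.log

Delayed Alerts

Shorten the Prometheus scrape interval in the global section of prometheus.yml (or per scrape job):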

scrape_interval: 30s
scrape_timeout: 10s

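Summary

Monitoring certificate expiry across the Pinpoint cluster with Prometheus, combined with automated alerting and a defined handling procedure, helps prevent outages caused by expired certificates. Rehearse the certificate renewal procedure regularly so that the monitoring setup stays reliable.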

Author's note: parts of this article were generated with AI assistance (AIGC) and are for reference only.
