Real-Time Monitoring and Alerting: Integrating Apache DolphinScheduler with Prometheus and Grafana

dolphinscheduler: Apache DolphinScheduler is the modern data orchestration platform. Agile to create high performance workflow with low-code. Project address: https://gitcode.com/gh_mirrors/do/dolphinscheduler

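1. Pain Points and Solution

Still struggling to monitor the runtime state of Apache DolphinScheduler (a workflow scheduling system)? Do task failures, resource exhaustion, or system anomalies go unnoticed because no notification arrives in time? This article walks through building a real-time monitoring and alerting system for DolphinScheduler with Prometheus (metrics collection) and Grafana (visualization), addressing the following pain points:

  • Monitoring blind spots: no real-time view of Master/Worker node health
  • Delayed alerts: task failures are not acted on right away
  • Performance bottlenecks: system resource bottlenecks are hard to locate
  • Trend analysis: no historical data to support capacity planning

After reading this article, you will have:

  • A complete metrics collection setup
  • 3 ready-to-use alert rules
  • A dashboard skeleton plus the key panel queries
  • A deployment layout with sizing recommendations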

2. Monitoring Architecture Design

2.1 Overall architecture

[Diagram: overall architecture (Mermaid source not preserved)]

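2.2 Component responsibilities

| Component | Role | Key technical points |
| --- | --- | --- |
| DolphinScheduler | Business system | Exposes metrics (via Micrometer) |
| Prometheus | Time-series storage | Pull-based scraping, PromQL queries |
| Grafana | Visualization platform | Custom dashboards, alert configuration |
| Alertmanager | Alert management | Grouping, inhibition, and silencing rules |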

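3. Environment and Prerequisites

3.1 Software version compatibility

| Component | Minimum version | Recommended version |
| --- | --- | --- |
| Apache DolphinScheduler | 3.0.0 | 3.2.0+ |
| Prometheus | 2.30.0 | 2.45.0 |
| Grafana | 8.0.0 | 10.2.0 |
| JDK | 1.8 | 11 |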

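3.2 Network port plan

| Component | Port | Purpose | Access control |
| --- | --- | --- | --- |
| DolphinScheduler Metrics | 9100 | Metrics exposure | Prometheus only |
| Prometheus | 9090 | Service port | Internal network only |
| Grafana | 3000 | Web UI | Authentication configurable |
| Alertmanager | 9093 | Alerting port | Prometheus only |

Note: 9100 is also the default port of node_exporter. If node_exporter runs on the same hosts (the node_* metrics in section 6.1 come from it), move one of the two to a different port.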

4. DolphinScheduler Configuration Changes

4.1 Enabling metrics collection

Edit dolphinscheduler-common/src/main/resources/application.properties:

# Enable Micrometer metrics with the Prometheus registry
# (Spring Boot 2.x property name; Spring Boot 3.x uses management.prometheus.metrics.export.enabled)
management.metrics.export.prometheus.enabled=true
management.endpoints.web.exposure.include=health,info,metrics,prometheus
management.metrics.tags.application=dolphinscheduler
management.metrics.tags.cluster=${cluster.name:default}
management.server.port=9100

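4.2 Key metrics

Metric names differ between DolphinScheduler versions, so confirm them against the actual /actuator/prometheus output before building rules on them.

| Category | Core metric | Unit | Alert threshold |
| --- | --- | --- | --- |
| Node health | jvm_threads_live_threads | - | > 500 |
| Task scheduling | dolphinscheduler_task_success_total | - | No increase (= 0) within 15 minutes |
| Resource usage | process_cpu_usage | % | > 80 (the raw metric is a 0-1 ratio, so > 0.8 in PromQL) |
| Database connections | hikaricp_connections_active | - | > 200 |
| Network I/O | dolphinscheduler_net_io_bytes_total | bytes | Burst growth > 200 MB/s |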
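
The thresholds in this table can be read as simple predicates over scraped values. The following is a minimal Python sketch of that idea, not part of DolphinScheduler or Prometheus: `THRESHOLDS`, `breached`, and the sample readings are hypothetical names and made-up numbers. `process_cpu_usage` is compared against 0.8 because Micrometer reports it as a ratio between 0 and 1.

```python
# Illustrative only: thresholds from the table expressed as predicates.
# THRESHOLDS and breached() are hypothetical names, not a real API.
THRESHOLDS = {
    "jvm_threads_live_threads": lambda v: v > 500,
    "process_cpu_usage": lambda v: v > 0.8,  # 80%; the metric is a 0-1 ratio
    "hikaricp_connections_active": lambda v: v > 200,
}


def breached(samples):
    """Return the sorted names of metrics whose sample violates its threshold."""
    return sorted(
        name
        for name, value in samples.items()
        if name in THRESHOLDS and THRESHOLDS[name](value)
    )


# Made-up readings for demonstration.
sample_readings = {
    "jvm_threads_live_threads": 620,
    "process_cpu_usage": 0.42,
    "hikaricp_connections_active": 35,
}
print(breached(sample_readings))
```

In production the same comparisons belong in Prometheus alert rules; a script like this is mainly useful for sanity-checking threshold choices against a few captured values.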

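4.3 Verifying the cluster configuration

Run the following command to check that the metrics are exposed (set MASTER_NODE_IP to the address of a master node first):

curl -s "http://${MASTER_NODE_IP}:9100/actuator/prometheus" | grep dolphinscheduler_task

Expected output: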

dolphinscheduler_task_success_total{application="dolphinscheduler",cluster="default",} 1256.0
dolphinscheduler_task_failed_total{application="dolphinscheduler",cluster="default",} 8.0
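
To turn this output into a number that can be checked in a script, the text exposition format can be parsed line by line. The sketch below is a simplified parser written for this article: `parse_samples` and `success_ratio` are hypothetical helpers, and the regular expression assumes label values contain no `}` character.

```python
import re

# Matches lines such as: name{label="v",} 12.0   (the label block is optional)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)')


def parse_samples(text):
    """Sum sample values per metric name, skipping comments and blank lines."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match:
            name, value = match.group(1), float(match.group(2))
            totals[name] = totals.get(name, 0.0) + value
    return totals


def success_ratio(totals):
    """Share of successful tasks among finished tasks (None when there is no data)."""
    ok = totals.get("dolphinscheduler_task_success_total", 0.0)
    failed = totals.get("dolphinscheduler_task_failed_total", 0.0)
    finished = ok + failed
    return ok / finished if finished else None


sample_output = '''dolphinscheduler_task_success_total{application="dolphinscheduler",cluster="default",} 1256.0
dolphinscheduler_task_failed_total{application="dolphinscheduler",cluster="default",} 8.0
'''
totals = parse_samples(sample_output)
print(totals, success_ratio(totals))
```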

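5. Prometheus Deployment and Configuration

5.1 Installing Prometheus

# Download the release tarball
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz

# Unpack and install under /opt/prometheus (the path the service unit below expects)
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
sudo mv prometheus-2.45.0.linux-amd64 /opt/prometheus

# Create the dedicated user the service runs as and give it the install directory
sudo useradd --system --no-create-home --shell /bin/false prometheus
sudo mkdir -p /opt/prometheus/data
sudo chown -R prometheus:prometheus /opt/prometheus

# Create the systemd service (quoted EOF keeps the backslashes in the unit file)
sudo tee /etc/systemd/system/prometheus.service > /dev/null << 'EOF'
[Unit]
Description=Prometheus Server
After=network.target

[Service]
User=prometheus
Group=prometheus
ExecStart=/opt/prometheus/prometheus \
  --config.file=/opt/prometheus/prometheus.yml \
  --storage.tsdb.path=/opt/prometheus/data \
  --web.listen-address=:9090 \
  --web.enable-lifecycle

[Install]
WantedBy=multi-user.target
EOF

# Load the new unit and start Prometheus now and on boot
sudo systemctl daemon-reload
sudo systemctl enable --now prometheus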

5.2 The prometheus.yml configuration file

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'dolphinscheduler-master'
    metrics_path: '/actuator/prometheus'   # Actuator path; the Prometheus default is /metrics
    static_configs:
      - targets: ['master01:9100', 'master02:9100']
        labels:
          instance_type: 'master'

  - job_name: 'dolphinscheduler-worker'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['worker01:9100', 'worker02:9100', 'worker03:9100']
        labels:
          instance_type: 'worker'

  - job_name: 'dolphinscheduler-api'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['api01:9100']
        labels:
          instance_type: 'api'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
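
When the node list changes often, maintaining the static target lists by hand becomes error-prone. One option, sketched here as an assumption rather than something the configuration above requires, is to generate a target file for Prometheus' file-based service discovery (`file_sd_configs`). The `ROLES` inventory and `build_file_sd` helper are hypothetical.

```python
import json

# Hypothetical inventory mirroring the static targets in prometheus.yml.
ROLES = {
    "master": ["master01", "master02"],
    "worker": ["worker01", "worker02", "worker03"],
    "api": ["api01"],
}


def build_file_sd(roles, port=9100):
    """Build target groups in the JSON layout read by file_sd_configs."""
    return [
        {
            "targets": [f"{host}:{port}" for host in hosts],
            "labels": {"instance_type": role},
        }
        for role, hosts in roles.items()
    ]


groups = build_file_sd(ROLES)
print(json.dumps(groups, indent=2))
```

The printed JSON would be written to a file that a job lists under `file_sd_configs`; Prometheus picks up changes to such files without a reload.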

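5.3 Hot-reloading the configuration

The reload endpoint is available only when Prometheus runs with --web.enable-lifecycle, as in the service unit from 5.1: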

curl -X POST http://localhost:9090/-/reload

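6. Key Metrics in Detail

6.1 System-level metrics

These node_* metrics are exported by node_exporter, which has to run on each host and be scraped as its own job; the DolphinScheduler endpoint does not provide them.

| Metric | Description | PromQL example |
| --- | --- | --- |
| node_cpu_seconds_total | CPU time | rate(node_cpu_seconds_total{mode!="idle"}[5m]) |
| node_memory_MemAvailable_bytes | Available memory | node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 |
| node_disk_io_time_seconds_total | Disk I/O time | rate(node_disk_io_time_seconds_total[5m]) |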
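
Two of these queries rely on `rate()`, which turns an ever-growing counter into a per-second value. The sketch below shows the underlying arithmetic on made-up samples. It is deliberately simplified: `simple_rate` is a hypothetical helper that handles counter resets but skips the window-boundary extrapolation Prometheus performs.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    Simplified relative to Prometheus: a drop in value is treated as a
    counter reset, but no extrapolation to the window boundaries is done.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so count `cur` itself.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Made-up non-idle CPU seconds for one core, scraped every 15 seconds.
cpu_seconds = [(0, 100.0), (15, 106.0), (30, 112.0), (45, 118.0), (60, 124.0)]
print(simple_rate(cpu_seconds))
```

Here the counter grows by 24 CPU-seconds over 60 seconds, so the core was busy about 40% of the time.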

6.2 Application-level metrics

[Diagram: application-level metrics (Mermaid source not preserved)]

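7. Grafana Dashboard Configuration

7.1 Configuring the data source

  1. Log in to Grafana and go to Configuration > Data Sources
  2. Click Add data source and select Prometheus
  3. Set the URL to http://prometheus:9090
  4. Click Save & Test to verify the connection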
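
The same data source can also be registered without the UI through Grafana's HTTP API (`POST /api/datasources`). The sketch below only builds the request and does not send it. The Grafana address, the token, and the `datasource_request` helper are placeholders introduced for illustration.

```python
import json
import urllib.request


def datasource_request(grafana_url, token, prometheus_url="http://prometheus:9090"):
    """Build (but do not send) a request that registers Prometheus in Grafana."""
    payload = {
        "name": "Prometheus",
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend proxies the queries
        "isDefault": True,
    }
    return urllib.request.Request(
        url=grafana_url.rstrip("/") + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder host and token; send with urllib.request.urlopen(req) when ready.
req = datasource_request("http://grafana:3000/", "REPLACE_WITH_TOKEN")
print(req.get_method(), req.full_url)
```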

7.2 Key dashboard templates

7.2.1 Cluster overview dashboard

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "datasource",
          "uid": "grafana"
        },
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": 1,
  "iteration": 1694567890000,
  "links": [],
  "panels": [],
  "refresh": "10s",
  "schemaVersion": 38,
  "style": "dark",
  "tags": ["dolphinscheduler"],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {},
  "timezone": "",
  "title": "DolphinScheduler Cluster Monitoring",
  "uid": "dolphinscheduler-overview",
  "version": 1
}
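
To load a dashboard model like this through Grafana's HTTP API (`POST /api/dashboards/db`) rather than the import dialog, it has to be wrapped in a small envelope. The following sketch assumes that route is used; `import_payload` is a hypothetical helper, and the model shown is a trimmed stand-in for the JSON above.

```python
import json


def import_payload(dashboard, overwrite=True):
    """Wrap a dashboard model in the body expected by POST /api/dashboards/db."""
    model = dict(dashboard)
    model["id"] = None  # let Grafana assign the numeric id; the uid identifies the dashboard
    return {"dashboard": model, "overwrite": overwrite}


# Trimmed stand-in for the dashboard JSON shown above.
skeleton = {"id": 1, "uid": "dolphinscheduler-overview", "panels": [], "schemaVersion": 38}
payload = import_payload(skeleton)
print(json.dumps(payload, indent=2))
```

Clearing `id` avoids tying the model to a numeric id from another Grafana instance, while the stable `uid` lets repeated imports update the same dashboard.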
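
7.2.2 Key panels of the task monitoring dashboard

| Panel | Chart type | PromQL expression | Description |
| --- | --- | --- | --- |
| Task success rate | Stat | sum(dolphinscheduler_task_instance_success) / (sum(dolphinscheduler_task_instance_success) + sum(dolphinscheduler_task_instance_failed)) * 100 | Core business health indicator |
| Task type distribution | Pie chart | sum(dolphinscheduler_task_instance_total) by (task_type) | Shows the shape of the task load |
| Process instance duration | Histogram | histogram_quantile(0.95, sum(rate(dolphinscheduler_process_instance_duration_seconds_bucket[5m])) by (le)) | Distribution of process run time (95th percentile) |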
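
These panels can be generated rather than clicked together, which keeps the queries under version control. The sketch below builds minimal panel objects for the `panels` array of a dashboard model. `PANELS` and `build_panels` are hypothetical, the panel fields shown are a small subset of what Grafana stores, and the metric names are taken as-is from the table.

```python
import json

# (title, Grafana panel type id, PromQL) taken from the panel table.
PANELS = [
    ("Task success rate", "stat",
     "sum(dolphinscheduler_task_instance_success) / "
     "(sum(dolphinscheduler_task_instance_success) + "
     "sum(dolphinscheduler_task_instance_failed)) * 100"),
    ("Task type distribution", "piechart",
     "sum(dolphinscheduler_task_instance_total) by (task_type)"),
    ("Process instance duration (p95)", "histogram",
     "histogram_quantile(0.95, sum(rate("
     "dolphinscheduler_process_instance_duration_seconds_bucket[5m])) by (le))"),
]


def build_panels(rows, width=8, height=8):
    """Lay panels out left to right on Grafana's 24-column grid, wrapping rows."""
    panels = []
    for index, (title, panel_type, expr) in enumerate(rows):
        offset = index * width
        panels.append({
            "id": index + 1,
            "title": title,
            "type": panel_type,
            "gridPos": {"h": height, "w": width,
                        "x": offset % 24, "y": (offset // 24) * height},
            "targets": [{"refId": "A", "expr": expr}],
        })
    return panels


panels = build_panels(PANELS)
print(json.dumps(panels[0], indent=2))
```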

8. Alert Rule Configuration

8.1 Configuring Alertmanager

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'email-notifications'
  
receivers:
- name: 'email-notifications'
  email_configs:
  - to: 'admin@example.com'
    send_resolved: true
    smarthost: 'smtp.example.com:587'
    auth_username: 'alert@example.com'
    auth_password: 'secret'
    from: 'DolphinScheduler Alert <alert@example.com>'
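
The `group_by` setting decides which alerts end up in the same notification. The sketch below imitates that bucketing on made-up alerts; `group_alerts` is a hypothetical helper, not Alertmanager code. Alerts that lack a `cluster` or `service` label get an empty value for it, so they are effectively grouped by `alertname` alone.

```python
from collections import defaultdict

GROUP_BY = ("alertname", "cluster", "service")


def group_alerts(alerts, group_by=GROUP_BY):
    """Bucket alerts by the values of the group_by labels (missing label -> '')."""
    groups = defaultdict(list)
    for labels in alerts:
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(labels)
    return dict(groups)


# Made-up firing alerts; none of them carries a cluster or service label.
alerts = [
    {"alertname": "MasterNodeDown", "instance": "master01:9100"},
    {"alertname": "MasterNodeDown", "instance": "master02:9100"},
    {"alertname": "HighCpuUsage", "instance": "worker01:9100"},
]
for key, members in group_alerts(alerts).items():
    print(key, len(members))
```

Both master alerts land in one group and therefore in one e-mail, sent after `group_wait` has elapsed.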

8.2 Core alert rules

Create dolphinscheduler_alerts.yml:

groups:
- name: dolphinscheduler_alerts
  rules:
  - alert: HighCpuUsage
    expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.8
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage detected"
      description: "CPU usage is above 80% for 5 minutes (current value: {{ $value }})"
      
  - alert: TaskFailureRate
    expr: sum(dolphinscheduler_task_instance_failed{status="FAILURE"}) / sum(dolphinscheduler_task_instance_total) > 0.05
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "High task failure rate"
      description: "Task failure rate is above 5% for 10 minutes (current value: {{ $value }})"
      
  - alert: MasterNodeDown
    expr: up{job="dolphinscheduler-master"} == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "Master node is down"
      description: "Master node {{ $labels.instance }} has been down for 3 minutes"

Reference the file in the Prometheus configuration:

rule_files:
  - "dolphinscheduler_alerts.yml"
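
Each rule's `for` field delays firing until the condition has held without interruption. The sketch below replays a series of rule evaluations to show the inactive, pending, and firing states. `alert_states` is a hypothetical helper that models only this timing behaviour; with the 15s `evaluation_interval` used earlier, `for: 3m` corresponds to 12 intervals.

```python
def alert_states(condition_by_eval, for_evals):
    """Replay rule evaluations and return the alert state after each one.

    condition_by_eval: booleans, one per evaluation (True = expression matched).
    for_evals: number of evaluation intervals covered by the `for` duration.
    """
    states, active_for = [], None
    for condition in condition_by_eval:
        if not condition:
            active_for = None  # any gap resets the pending timer
            states.append("inactive")
            continue
        active_for = 0 if active_for is None else active_for + 1
        states.append("firing" if active_for >= for_evals else "pending")
    return states


# A short outage that recovers, followed by a sustained one.
history = [True, True, False, True, True, True, True]
print(alert_states(history, for_evals=3))
```

The first outage never fires because it recovers before three intervals have passed, which is how `for` suppresses short blips.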

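9. High-Availability Deployment

9.1 Basic deployment architecture

[Diagram: basic deployment architecture (Mermaid source not preserved)]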

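9.2 Resource sizing recommendations

| Environment size | Prometheus CPU | Prometheus memory | Storage capacity | Data retention |
| --- | --- | --- | --- | --- |
| Small (< 50 nodes) | 2 cores | 4 GB | 100 GB | 15 days |
| Medium (50-200 nodes) | 4 cores | 8 GB | 500 GB | 30 days |
| Large (> 200 nodes) | 8 cores | 16 GB | 1 TB | 60 days |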
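
Storage figures like these can be cross-checked with the rule of thumb from the Prometheus storage documentation: disk space is roughly retention time in seconds times ingested samples per second times bytes per sample, with 1 to 2 bytes per sample being typical. In the sketch below, the series count per target is an assumption that should be replaced with a measured value, and `estimate_disk_gib` is a hypothetical helper.

```python
def estimate_disk_gib(targets, series_per_target, scrape_interval_s,
                      retention_days, bytes_per_sample=2.0):
    """Rough TSDB size: retention seconds * samples per second * bytes per sample."""
    samples_per_second = targets * series_per_target / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample / 1024 ** 3


# Assumed: 50 targets exposing 2000 series each, scraped every 15 s, kept 15 days.
small = estimate_disk_gib(targets=50, series_per_target=2000,
                          scrape_interval_s=15, retention_days=15)
print(f"{small:.1f} GiB")
```

Under these assumptions the estimate is about 16 GiB, so the 100 GB suggested for a small environment leaves generous headroom for growth and compaction overhead.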

10. Best Practices and Tuning

10.1 Metrics collection tuning

  1. Metric filtering: drop non-critical metrics with metric_relabel_configs

    metric_relabel_configs:
    - source_labels: [__name__]
      regex: 'go_.*'
      action: drop
    
  2. Scrape frequency: set the scrape interval per job, according to how time-sensitive its metrics are

    scrape_configs:
    - job_name: 'dolphinscheduler-master'
      scrape_interval: 10s
      scrape_timeout: 5s
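
Before rolling out a `drop` rule it helps to check which metric names it would remove. The sketch below imitates the filtering rule on a made-up list of names. `apply_drop_rule` is a hypothetical helper; it uses `fullmatch` because Prometheus anchors relabeling regexes at both ends, and it relies on this simple pattern behaving the same in Python's `re` as in Prometheus' RE2.

```python
import re


def apply_drop_rule(metric_names, regex):
    """Keep the names that a `drop` rule on __name__ would not remove.

    Prometheus anchors relabel regexes at both ends, hence fullmatch().
    """
    pattern = re.compile(regex)
    return [name for name in metric_names if not pattern.fullmatch(name)]


# Made-up scrape result.
scraped = [
    "go_goroutines",
    "go_gc_duration_seconds",
    "jvm_threads_live_threads",
    "cargo_load",  # contains "go_" but does not start with it, so it is kept
    "dolphinscheduler_task_success_total",
]
print(apply_drop_rule(scraped, "go_.*"))
```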
    

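10.2 Troubleshooting common problems

| Symptom | Possible cause | Fix |
| --- | --- | --- |
| Prometheus scrapes time out | Network latency to the node | Increase scrape_timeout to 10s (it must not exceed scrape_interval) |
| Grafana charts show no data | Metric labels do not match | Check the label filters in the PromQL query |
| Alert storm | Alert thresholds are too low | Lengthen the for duration or raise the threshold |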
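
When raising `scrape_timeout`, keep in mind that Prometheus rejects a configuration in which a job's timeout is longer than its scrape interval. The sketch below checks that constraint up front. `to_seconds` and `timeout_is_valid` are hypothetical helpers that accept only single-unit durations such as `10s` or `1m`.

```python
import re

UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600}


def to_seconds(duration):
    """Convert a single-unit duration such as '10s' or '1m' to seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h)", duration)
    if not match:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * UNITS[match.group(2)]


def timeout_is_valid(scrape_interval, scrape_timeout):
    """True when the timeout does not exceed the interval."""
    return to_seconds(scrape_timeout) <= to_seconds(scrape_interval)


print(timeout_is_valid("10s", "10s"))  # timeout equal to the interval is allowed
print(timeout_is_valid("10s", "15s"))  # timeout longer than the interval is not
```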

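11. Summary and Outlook

This article covered integrating DolphinScheduler with Prometheus and Grafana, from architecture design and environment setup to alert tuning, resulting in a complete monitoring system. With it, operations teams can track system state in real time, developers can locate problems quickly, and business teams get a more stable service.

The monitoring system can evolve in the following directions:

  • Smarter alerting: anomaly detection based on machine learning
  • Full observability: integrating logging and tracing into a single observability platform
  • Autoscaling: elastic scaling driven by monitoring metrics

Audit the monitoring system regularly, including:

  1. A quarterly review of metric usefulness
  2. A monthly analysis of alert storms
  3. A biweekly round of dashboard improvements

With continuous tuning, the monitoring system becomes a true "guardian" of DolphinScheduler's stable operation.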

Disclosure: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.