keep Incident Management: The Full Pipeline from Alert to Incident

keep: The open-source alerts management and automation platform. Project repository: https://gitcode.com/GitHub_Trending/kee/keep

Pain Points and Solution

Are you overwhelmed by scattered alerts? Is a chaotic incident response process delaying recovery? keep provides end-to-end automation from alert detection to incident resolution, helping teams collaborate efficiently and shorten mean time to recovery (MTTR). This article shows how to build a complete incident management pipeline with keep, including hands-on examples and best practices.

By the end of this article you will know how to:

  • Configure alert rules and integrate multiple data sources
  • Create incidents automatically and correlate related alerts intelligently
  • Escalate and respond to incidents with workflows
  • Manage the incident lifecycle and run post-incident reviews
  • Apply complete configuration examples for 5 core scenarios

Core Concepts and Architecture

Key Terminology

| Term | Definition |
|------|------------|
| Alert | An anomaly signal detected by a monitoring system; it carries the raw event data |
| Incident | A failure entity aggregated from one or more related alerts; requires human involvement |
| Workflow | Automation logic used for alert handling, incident escalation, and similar scenarios |
| Enrichment | The process of adding context (such as business impact or owner) to an alert or incident |
| Correlation | The mechanism (rule-based or AI-driven) that groups related alerts into one incident |

End-to-End Architecture

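The original diagram was not preserved in this copy. A minimal mermaid sketch of the flow described in this article (a reconstruction, not keep's official architecture diagram):

```mermaid
flowchart LR
    DS["Data sources<br/>(VictoriaMetrics, MySQL, Sentry, ...)"] --> AL[Alerts]
    AL --> CO["Correlation<br/>(rules / AI / manual)"]
    CO --> IN[Incidents]
    IN --> WF["Workflows<br/>(enrichment, escalation, response)"]
    WF --> RS[Resolution and review]
```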

Alert Configuration and Detection

Multi-Source Data Ingestion

keep can create alerts from any data source. The following example queries CPU usage from VictoriaMetrics:

```yaml
# Example: query CPU usage from VictoriaMetrics
workflow:
  id: vm-cpu-alert
  name: Server CPU usage monitor
  triggers:
    - type: interval
      value: 60  # run every minute
  steps:
    - name: query-vm
      provider:
        type: victoriametrics
        config: "{{ providers.victoria-metrics }}"
        with:
          query: avg(rate(process_cpu_seconds_total[5m])) by (instance)
          queryType: query_range
  actions:
    - name: create-cpu-alert
      if: "{{ value.1 }} > 0.8"  # fire when CPU usage exceeds 80%
      provider:
        type: keep
        with:
          alert:
            name: "High CPU usage"
            severity: "{{ 'critical' if value.1 > 0.9 else 'warning' }}"
            labels:
              instance: "{{ value.0 }}"
              environment: "production"
```

Alert State Management

keep's alert engine manages the full alert state lifecycle:

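The original state diagram was not preserved. The sketch below assumes the pending/firing/resolved lifecycle implied by the `for` and `status` fields in the example that follows (the exact state names in keep may differ):

```mermaid
stateDiagram-v2
    [*] --> pending: condition first met
    pending --> firing: condition persists for the configured duration
    pending --> resolved: condition clears
    firing --> resolved: condition clears
    resolved --> firing: condition recurs
```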

Configuration example:

```yaml
# An alert rule with state management
workflow:
  id: stateful-alert-example
  triggers:
    - type: interval
      value: 30
  steps:
    - name: check-db-connections
      provider:
        type: mysql
        config: "{{ providers.mysql-prod }}"
        with:
          query: "SELECT count(*) as connections FROM information_schema.processlist"
  actions:
    - name: create-connection-alert
      provider:
        type: keep
        with:
          for: 120  # must hold for 2 minutes before the alert enters the firing state
          if: "{{ steps.check-db-connections.results.connections }} > 1000"
          alert:
            name: "Database connection count too high"
            status: "{{ 'firing' if steps.check-db-connections.results.connections > 1000 else 'resolved' }}"
```

Incident Creation and Correlation

Automatic Correlation

keep supports three correlation strategies; a rule-based example and a manual-correlation sketch follow the table:

| Correlation type | Typical use case |
|------------------|------------------|
| Rule-based | Known correlation patterns (e.g., same service + same environment) |
| AI-assisted | Complex or unknown correlation patterns |
| Manual | Ad-hoc, one-off situations |

Rule-based correlation example:

```yaml
# Correlate alerts into incidents based on labels
workflow:
  id: alert-to-incident-correlator
  triggers:
    - type: alert
      events:
        - created
  actions:
    - name: create-or-update-incident
      provider:
        type: keep
        with:
          incident:
            name: "Service {{ alert.service }} degraded"
            severity: "{{ alert.severity }}"
            correlation_key: "{{ alert.service }}-{{ alert.environment }}"  # alerts with the same service + environment attach to the same incident
            alerts:
              - "{{ alert.id }}"
```

Incident Creation Workflow

```yaml
# Create an incident from a Sentry alert
workflow:
  id: sentry-incident-creator
  name: Sentry incident creator
  triggers:
    - type: alert
      filters:
        - key: source
          value: sentry
        - key: severity
          value: critical
  actions:
    - name: create-incident
      provider:
        type: keep
        with:
          incident:
            name: "[{{ alert.project }}] {{ alert.title }}"
            description: "{{ alert.message }}"
            severity: "{{ alert.severity }}"
            service: "{{ alert.project }}"
            environment: "{{ alert.environment }}"
            alerts:
              - "{{ alert.id }}"
            assignee: "{{ 'team-lead' if alert.user_count > 100 else 'oncall' }}"
```

Incident Enrichment and Escalation

Metadata Enrichment

Use a workflow to attach key context to an incident. keep executes steps before actions, so the HTTP lookup result is available when the enrichment runs (the example lists steps first accordingly):

```yaml
# Automatic incident metadata enrichment
workflow:
  id: incident-metadata-enricher
  triggers:
    - type: incident
      events:
        - created
  steps:
    - name: get-user-count
      provider:
        type: http
        with:
          url: "https://api.example.com/users?service={{ incident.service }}"
          method: GET
  actions:
    - name: enrich-incident
      provider:
        type: keep
        with:
          enrich_incident:
            - key: business_impact
              value: "{{ 'high' if incident.service in ['payment', 'checkout'] else 'medium' }}"
            - key: runbook_url
              value: "https://wiki.example.com/runbooks/{{ incident.service }}"
            - key: sla_minutes
              value: "{{ 30 if incident.business_impact == 'high' else 60 }}"
            - key: affected_users
              value: "{{ steps.get-user-count.results.total }}"
```

Multi-Tier Escalation

Escalate automatically based on severity and elapsed response time:

```yaml
# Tiered incident escalation workflow
workflow:
  id: incident-tier-escalation
  triggers:
    - type: incident
      events:
        - updated
  actions:
    - name: notify-tier1
      if: "{{ incident.severity == 'critical' and incident.status == 'open' and incident.age_minutes < 10 }}"
      provider:
        type: slack
        with:
          channel: "#oncall-tier1"
          message: "Critical incident: {{ incident.name }} needs immediate attention! <@{{ incident.assignee }}>"

    - name: escalate-to-tier2
      if: "{{ incident.severity == 'critical' and incident.status == 'open' and incident.age_minutes >= 10 and incident.age_minutes < 20 }}"
      provider:
        type: slack
        with:
          channel: "#oncall-tier2"
          message: "Incident escalated: {{ incident.name }}, no Tier 1 response <@tier2-lead>"
        enrich_incident:
          - key: assignee
            value: "tier2-lead"

    - name: escalate-to-management
      if: "{{ incident.severity == 'critical' and incident.status == 'open' and incident.age_minutes >= 20 }}"
      provider:
        type: pagerduty
        with:
          service_key: "{{ providers.pagerduty.management-key }}"
          event_action: trigger
          payload:
            summary: "Critical incident escalated to management: {{ incident.name }}"
            severity: critical
```

Incident Response and Handling

Automated Response Actions

Run remediation automatically based on the incident type:

```yaml
# Automatically restart an unhealthy service
workflow:
  id: auto-restart-service
  triggers:
    - type: incident
      filters:
        - key: service
          value: "api-gateway"
        - key: severity
          value: "high"
        - key: symptoms
          value: "unhealthy"
  steps:
    - name: check-service-health
      provider:
        type: http
        with:
          url: "https://{{ incident.environment }}-api-gateway.example.com/health"
          method: GET

    - name: restart-service
      if: "{{ steps.check-service-health.results.status != 'ok' }}"
      provider:
        type: kubectl
        with:
          command: "rollout restart deployment/api-gateway -n {{ incident.environment }}"
          cluster: "{{ incident.environment }}"

    - name: verify-restart
      provider:
        type: http
        with:
          url: "https://{{ incident.environment }}-api-gateway.example.com/health"
          method: GET
          retry: 3
          delay: 30

    - name: update-incident-status
      if: "{{ steps.verify-restart.results.status == 'ok' }}"
      provider:
        type: keep
        with:
          enrich_incident:
            - key: status
              value: "resolved"
            - key: resolution
              value: "Recovered by automatic service restart"
```

Cross-Platform Collaborative Response

Integrate Jira and Slack for team collaboration:

```yaml
# Incident collaboration workflow
workflow:
  id: incident-collaboration
  triggers:
    - type: incident
      events:
        - created
  actions:
    - name: create-jira-ticket
      provider:
        type: jira
        with:
          project_key: "INC"
          issuetype: "Incident"
          summary: "{{ incident.name }}"
          description: |
            *Description:* {{ incident.description }}
            *Business impact:* {{ incident.business_impact }}
            *Start time:* {{ incident.start_time }}
            *Related alerts:* {{ incident.alerts | length }}
          priority: "{{ 'Highest' if incident.severity == 'critical' else 'High' }}"
          custom_fields:
            customfield_10000: "{{ incident.id }}"  # link back to the incident ID
            customfield_10001: "{{ incident.sla_minutes }}"  # SLA budget in minutes
      enrich_incident:
        - key: jira_ticket_id
          value: "{{ results.issue.key }}"
        - key: jira_ticket_url
          value: "{{ results.issue.self }}"

    - name: create-slack-channel
      provider:
        type: slack
        with:
          channel_name: "inc-{{ incident.id }}"
          purpose: "Incident war room: {{ incident.name }}"
          members: "{{ incident.assignee }},tech-lead,sre-oncall"
      enrich_incident:
        - key: slack_channel
          value: "{{ results.channel.id }}"
```

Incident Lifecycle Management

State Transitions and Monitoring

The incident state machine and its monitoring:

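The original diagram was not preserved. A minimal sketch, assuming the open/acknowledged/resolved/closed flow used by the workflows in this article (keep's built-in incident statuses may be named differently):

```mermaid
stateDiagram-v2
    [*] --> open: incident created
    open --> acknowledged: responder acknowledges (MTTA clock stops)
    acknowledged --> resolved: fix applied or auto-remediation succeeds
    resolved --> closed: recovery confirmed (e.g., auto-close after 30 min)
    resolved --> open: symptoms recur
    closed --> [*]
```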

Automatic Closure and Cleanup

Close incidents automatically based on monitoring data:

```yaml
# Incident auto-close workflow
workflow:
  id: incident-auto-closer
  triggers:
    - type: interval
      value: 1800  # check every 30 minutes (interval values are in seconds)
  steps:
    - name: get-resolved-incidents
      provider:
        type: keep
        with:
          version: 2
          filter: "status == 'resolved' and last_updated > 'now-1h'"
  actions:
    - name: close-incident
      foreach: "{{ steps.get-resolved-incidents.results }}"
      if: "{{ keep.datetime_compare(foreach.value.resolved_time, 'now-30m') < 0 }}"  # resolved for more than 30 minutes
      provider:
        type: keep
        with:
          update_incident:
            id: "{{ foreach.value.id }}"
            status: "closed"
            resolution: "Recovery confirmed automatically"
```

Real-World Scenarios and Best Practices

Scenario 1: E-Commerce Payment System Failure

```yaml
# End-to-end handling of a payment service incident
workflow:
  id: payment-service-incident-handler
  name: Payment service incident handler
  triggers:
    - type: incident
      filters:
        - key: service
          value: payment
        - key: severity
          value: critical
  steps:
    - name: check-transaction-status
      provider:
        type: sql
        with:
          query: "SELECT COUNT(*) as failed_tx FROM transactions WHERE status='failed' AND created_at > NOW() - INTERVAL 5 MINUTE"
          db_type: postgresql
          connection: "{{ providers.payment-db }}"

    - name: enable-maintenance-mode
      if: "{{ steps.check-transaction-status.results.failed_tx > 10 }}"
      provider:
        type: http
        with:
          url: "https://api.example.com/maintenance"
          method: POST
          body: '{"service": "payment", "mode": "on", "message": "Under maintenance; payments are temporarily unavailable"}'

    - name: notify-customer-support
      provider:
        type: zendesk
        with:
          ticket:
            subject: "Payment system outage notification"
            comment: "The payment service is currently unavailable; the engineering team is working on it"
            priority: "urgent"
            tags: ["payment", "system-down"]
```

Scenario 2: Automated Cloud Resource Scaling

```yaml
# Incident-driven auto-scaling workflow
workflow:
  id: auto-scale-based-on-incident
  name: Incident-driven auto-scaling
  triggers:
    - type: incident
      filters:
        - key: symptoms
          value: "high-cpu"
        - key: status
          value: "open"
  steps:
    - name: get-current-pods
      provider:
        type: kubectl
        with:
          command: "get pods -n {{ incident.environment }} -l app={{ incident.service }} --no-headers | wc -l"
          cluster: "{{ incident.environment }}"

    - name: scale-up
      if: "{{ steps.get-current-pods.results < 10 }}"  # cap scaling at 10 pods
      provider:
        type: kubectl
        with:
          command: "scale deployment/{{ incident.service }} --replicas={{ steps.get-current-pods.results + 2 }} -n {{ incident.environment }}"
          cluster: "{{ incident.environment }}"
      enrich_incident:
        - key: auto_scaled
          value: "true"
        - key: scaling_history
          value: "Scaled from {{ steps.get-current-pods.results }} to {{ steps.get-current-pods.results + 2 }} replicas"
```

Monitoring and Continuous Improvement

Incident Management Metrics

Key performance indicators (KPIs) to track (a reporting sketch follows the table):

| Metric | Definition | Target | Data source |
|--------|------------|--------|-------------|
| MTTA | Mean time to acknowledge | < 5 minutes | Time from incident creation to acknowledgement |
| MTTR | Mean time to resolve | < 60 minutes | Time from incident creation to closure |
| Auto-close rate | Share of incidents resolved automatically | > 30% | Incident closure method statistics |
| False-positive rate | Share of incidents that are false alarms | < 5% | Incidents manually flagged as false positives |
| Escalation rate | Share of incidents requiring escalation | < 20% | Incidents that triggered an escalation workflow |
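As a sketch of how these numbers could be surfaced, the workflow below reuses the keep provider's `version: 2` filter syntax shown earlier to pull last week's closed incidents and post a summary to Slack. The filter expression and the `created_at` field name are assumptions for illustration:

```yaml
# Sketch: weekly KPI summary. The filter expression and field names are assumptions.
workflow:
  id: weekly-kpi-report
  triggers:
    - type: interval
      value: 604800  # run once a week (seconds)
  steps:
    - name: get-closed-incidents
      provider:
        type: keep
        with:
          version: 2
          filter: "status == 'closed' and created_at > 'now-7d'"
  actions:
    - name: post-kpi-summary
      provider:
        type: slack
        with:
          channel: "#incident-metrics"
          message: |
            Weekly incident report:
            Incidents closed in the last 7 days: {{ steps.get-closed-incidents.results | length }}
```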

Rule Optimization Workflow

```yaml
# Automated alert-rule tuning suggestions
workflow:
  id: alert-rule-optimizer
  triggers:
    - type: interval
      value: 86400  # run once a day (interval values are in seconds)
  steps:
    - name: get-frequent-alerts
      provider:
        type: keep
        with:
          version: 2
          filter: "status == 'firing' and count(last_7d) > 100 and severity == 'info'"
  actions:
    - name: suggest-threshold-adjustment
      foreach: "{{ steps.get-frequent-alerts.results }}"
      provider:
        type: slack
        with:
          channel: "#alerting-rules"
          message: |
            Alert tuning suggestion:
            Alert name: {{ foreach.value.name }}
            Fires in last 7 days: {{ foreach.value.count }}
            Current threshold: {{ foreach.value.condition }}
            Suggested action: raise the threshold or lower the severity
            Historical data: {{ foreach.value.stats.p95 }}
```

Summary and Next Steps

With the keep incident management pipeline described in this article, you can build a complete closed loop from alert detection to incident resolution. Key takeaways:

  1. Multi-source integration: ingest any data source through flexible workflow configuration
  2. Intelligent correlation: aggregate related alerts into incidents using rules and AI
  3. Automated response: reduce manual intervention and speed up recovery
  4. End-to-end observability: make incident state, handling progress, and performance metrics transparent

Where to Go Next

  1. Learn CEL expressions for writing complex conditions
  2. Configure AI-assisted incident classification and root-cause analysis
  3. Build custom incident dashboards and SLA monitoring
  4. Set up cross-team incident response collaboration

Further Resources

  • keep official documentation: detailed API and configuration reference
  • Incident management best-practice guides: blending ITIL and SRE approaches
  • Workflow template library: 100+ predefined automation flows


Authoring note: parts of this article were produced with AI assistance (AIGC) and are for reference only.
