# keep Incident Management: From Alert to Incident, End to End

## Pain Points and Solution

Are you still drowning in scattered alert notifications? Is a chaotic incident response process delaying recovery? The keep incident management platform automates the entire path from alert detection to incident resolution, helping teams collaborate efficiently and shorten mean time to recovery (MTTR). This article walks through building an end-to-end incident management system with keep, with hands-on examples and best practices.

After reading this article you will know how to:

- Configure alert rules and integrate data from multiple sources
- Create incidents automatically and correlate them intelligently
- Build workflow-based incident escalation and response mechanisms
- Manage the incident lifecycle and optimize it through post-incident reviews
- Apply complete configuration examples for 5 core scenarios
## Core Concepts and Architecture

### Key Terminology

| Term | Definition |
|---|---|
| Alert | An abnormal signal detected by a monitoring system, carrying the raw event data |
| Incident | A failure entity aggregated from one or more related alerts that requires human involvement |
| Workflow | Automated processing logic used for alert handling, incident escalation, and similar scenarios |
| Enrichment | The process of adding context (such as business impact or owner) to an alert or incident |
| Correlation | The mechanism (rule-based or AI-driven) that groups related alerts into the same incident |
### End-to-End Architecture

At a high level, alerts flow from data-source detection into correlation, where related alerts are grouped into incidents; incidents are then enriched with context, escalated when needed, handled through automated or collaborative response, and finally closed and reviewed.
## Alert Configuration and Detection

### Multi-Source Data Integration

keep can create alerts from virtually any data source. Below is a common integration pattern, followed by a sketch of the same approach against a plain HTTP endpoint:
```yaml
# Example: query CPU usage from VictoriaMetrics
workflow:
  id: vm-cpu-alert
  name: Server CPU usage monitoring
  triggers:
    - type: interval
      value: 60  # run every minute
  steps:
    - name: query-vm
      provider:
        type: victoriametrics
        config: "{{ providers.victoria-metrics }}"
        with:
          query: avg(rate(process_cpu_seconds_total[5m])) by (instance)
          queryType: query_range
  actions:
    - name: create-cpu-alert
      if: "{{ value.1 }} > 0.8"  # fire when CPU usage exceeds 80%
      provider:
        type: keep
        with:
          alert:
            name: "High CPU usage"
            severity: "{{ 'critical' if value.1 > 0.9 else 'warning' }}"
            labels:
              instance: "{{ value.0 }}"
              environment: "production"
```
### Alert Status Management

The keep alert engine manages the full alert status lifecycle: an alert can be held back for a configurable duration (`for`) before it starts firing, and switches back to resolved once the triggering condition clears.

Configuration example:
```yaml
# Alert rule with status management
workflow:
  id: stateful-alert-example
  triggers:
    - type: interval
      value: 30
  steps:
    - name: check-db-connections
      provider:
        type: mysql
        config: "{{ providers.mysql-prod }}"
        with:
          query: "SELECT count(*) as connections FROM information_schema.processlist"
  actions:
    - name: create-connection-alert
      provider:
        type: keep
        with:
          for: 120  # condition must hold for 2 minutes before the alert becomes firing
          if: "{{ steps.check-db-connections.results.connections }} > 1000"
          alert:
            name: "Database connection count too high"
            status: "{{ 'firing' if steps.check-db-connections.results.connections > 1000 else 'resolved' }}"
```
## Incident Creation and Correlation

### Automatic Correlation Mechanisms

keep supports three incident correlation strategies:

| Correlation type | Typical scenario | Configuration effort |
|---|---|---|
| Rule-based | Known correlation patterns (e.g., same service + same environment) | Low |
| AI-assisted | Complex or unknown correlation patterns | Medium |
| Manual | Ad-hoc, one-off situations | Low |

Rule-based correlation example:
```yaml
# Automatically correlate alerts into incidents by label
workflow:
  id: alert-to-incident-correlator
  triggers:
    - type: alert
      events:
        - created
  actions:
    - name: create-or-update-incident
      provider:
        type: keep
        with:
          incident:
            name: "Service {{ alert.service }} failure"
            severity: "{{ alert.severity }}"
            correlation_key: "{{ alert.service }}-{{ alert.environment }}"  # alerts with the same service + environment join the same incident
            alerts:
              - "{{ alert.id }}"
```
### Incident Creation Workflow
```yaml
# Example: create an incident from a Sentry alert
workflow:
  id: sentry-incident-creator
  name: Sentry exception incident creator
  triggers:
    - type: alert
      filters:
        - key: source
          value: sentry
        - key: severity
          value: critical
  actions:
    - name: create-incident
      provider:
        type: keep
        with:
          incident:
            name: "[{{ alert.project }}] {{ alert.title }}"
            description: "{{ alert.message }}"
            severity: "{{ alert.severity }}"
            service: "{{ alert.project }}"
            environment: "{{ alert.environment }}"
            alerts:
              - "{{ alert.id }}"
            assignee: "{{ 'team-lead' if alert.user_count > 100 else 'oncall' }}"
```
## Incident Enrichment and Escalation

### Metadata Enrichment

Use a workflow to attach key context to an incident:
```yaml
# Automatic incident metadata enrichment
workflow:
  id: incident-metadata-enricher
  triggers:
    - type: incident
      events:
        - created
  steps:
    - name: get-user-count
      provider:
        type: http
        with:
          url: "https://api.example.com/users?service={{ incident.service }}"
          method: GET
  actions:
    - name: enrich-incident
      provider:
        type: keep
        with:
          enrich_incident:
            - key: business_impact
              value: "{{ 'high' if incident.service in ['payment', 'checkout'] else 'medium' }}"
            - key: runbook_url
              value: "https://wiki.example.com/runbooks/{{ incident.service }}"
            - key: sla_minutes
              value: "{{ 30 if incident.business_impact == 'high' else 60 }}"
            - key: affected_users
              value: "{{ steps.get-user-count.results.total }}"
```
### Multi-Stage Escalation Strategy

Automatic escalation based on severity and response time:
```yaml
# Tiered incident escalation workflow
workflow:
  id: incident-tier-escalation
  triggers:
    - type: incident
      events:
        - updated
  actions:
    - name: notify-tier1
      if: "{{ incident.severity == 'critical' and incident.status == 'open' and incident.age_minutes < 10 }}"
      provider:
        type: slack
        with:
          channel: "#oncall-tier1"
          message: "Critical incident: {{ incident.name }} needs immediate attention! <@{{ incident.assignee }}>"
    - name: escalate-to-tier2
      if: "{{ incident.severity == 'critical' and incident.status == 'open' and incident.age_minutes >= 10 and incident.age_minutes < 20 }}"
      provider:
        type: slack
        with:
          channel: "#oncall-tier2"
          message: "Incident escalation: {{ incident.name }}, no response from Tier 1 <@tier2-lead>"
      enrich_incident:
        - key: assignee
          value: "tier2-lead"
    - name: escalate-to-management
      if: "{{ incident.severity == 'critical' and incident.status == 'open' and incident.age_minutes >= 20 }}"
      provider:
        type: pagerduty
        with:
          service_key: "{{ providers.pagerduty.management-key }}"
          event_action: trigger
          payload:
            summary: "Critical incident escalated to management: {{ incident.name }}"
            severity: critical
```
## Incident Response and Handling

### Automated Response Actions

Run remediation automatically based on the incident type:
```yaml
# Workflow: automatically restart an unhealthy service
workflow:
  id: auto-restart-service
  triggers:
    - type: incident
      filters:
        - key: service
          value: "api-gateway"
        - key: severity
          value: "high"
        - key: symptoms
          value: "unhealthy"
  steps:
    - name: check-service-health
      provider:
        type: http
        with:
          url: "https://{{ incident.environment }}-api-gateway.example.com/health"
          method: GET
    - name: restart-service
      if: "{{ steps.check-service-health.results.status != 'ok' }}"
      provider:
        type: kubectl
        with:
          command: "rollout restart deployment/api-gateway -n {{ incident.environment }}"
          cluster: "{{ incident.environment }}"
    - name: verify-restart
      provider:
        type: http
        with:
          url: "https://{{ incident.environment }}-api-gateway.example.com/health"
          method: GET
          retry: 3
          delay: 30
    - name: update-incident-status
      if: "{{ steps.verify-restart.results.status == 'ok' }}"
      provider:
        type: keep
        with:
          enrich_incident:
            - key: status
              value: "resolved"
            - key: resolution
              value: "Recovered by automatic service restart"
```
### Cross-Platform Collaborative Response

Integrate Jira and Slack for team collaboration:
```yaml
# Incident response collaboration workflow
workflow:
  id: incident-collaboration
  triggers:
    - type: incident
      events:
        - created
  actions:
    - name: create-jira-ticket
      provider:
        type: jira
        with:
          project_key: "INC"
          issuetype: "Incident"
          summary: "{{ incident.name }}"
          description: |
            *Description:* {{ incident.description }}
            *Business impact:* {{ incident.business_impact }}
            *Start time:* {{ incident.start_time }}
            *Related alerts:* {{ incident.alerts | length }}
          priority: "{{ 'Highest' if incident.severity == 'critical' else 'High' }}"
          custom_fields:
            customfield_10000: "{{ incident.id }}"  # link back to the incident ID
            customfield_10001: "{{ incident.sla_minutes }}"  # SLA in minutes
      enrich_incident:
        - key: jira_ticket_id
          value: "{{ results.issue.key }}"
        - key: jira_ticket_url
          value: "{{ results.issue.self }}"
    - name: create-slack-channel
      provider:
        type: slack
        with:
          channel_name: "inc-{{ incident.id }}"
          purpose: "Incident war room: {{ incident.name }}"
          members: "{{ incident.assignee }},tech-lead,sre-oncall"
      enrich_incident:
        - key: slack_channel
          value: "{{ results.channel.id }}"
```
## Incident Lifecycle Management

### Status Transitions and Monitoring

An incident moves through a simple state machine: it is created as open, becomes resolved once remediation succeeds, and is closed after recovery has been confirmed. Each transition can itself be watched by a workflow.
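As one example of monitoring these transitions, the sketch below reuses only constructs already shown in this article (interval trigger, keep incident query, Slack notification) to flag incidents that stay open too long. The `created_at` filter field, the 2-hour threshold, and the #incident-hygiene channel are illustrative assumptions, not fixed keep names.

```yaml
# Sketch: flag incidents that have stayed in the open state too long (field names and thresholds are illustrative)
workflow:
  id: stale-open-incident-monitor
  triggers:
    - type: interval
      value: 1800  # check every 30 minutes
  steps:
    - name: get-stale-open-incidents
      provider:
        type: keep
        with:
          version: 2
          filter: "status == 'open' and created_at < 'now-2h'"  # assumed field name for incident creation time
  actions:
    - name: notify-stale-incident
      foreach: "{{ steps.get-stale-open-incidents.results }}"
      provider:
        type: slack
        with:
          channel: "#incident-hygiene"  # illustrative channel
          message: "Incident {{ foreach.value.name }} has been open for more than 2 hours without resolution"
```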
### Automatic Closing and Cleanup

Auto-close policy driven by monitoring data:
```yaml
# Incident auto-close workflow
workflow:
  id: incident-auto-closer
  triggers:
    - type: interval
      value: 1800  # check every 30 minutes
  steps:
    - name: get-resolved-incidents
      provider:
        type: keep
        with:
          version: 2
          filter: "status == 'resolved' and last_updated > 'now-1h'"
  actions:
    - name: close-incident
      foreach: "{{ steps.get-resolved-incidents.results }}"
      if: "{{ keep.datetime_compare(foreach.value.resolved_time, 'now-30m') < 0 }}"  # resolved more than 30 minutes ago
      provider:
        type: keep
        with:
          update_incident:
            id: "{{ foreach.value.id }}"
            status: "closed"
            resolution: "Recovery confirmed automatically"
```
## Hands-On Scenarios and Best Practices

### Scenario 1: Handling a Payment System Incident on an E-Commerce Platform
```yaml
# End-to-end handling for payment service incidents
workflow:
  id: payment-service-incident-handler
  name: Payment service incident handling flow
  triggers:
    - type: incident
      filters:
        - key: service
          value: payment
        - key: severity
          value: critical
  steps:
    - name: check-transaction-status
      provider:
        type: sql
        with:
          query: "SELECT COUNT(*) as failed_tx FROM transactions WHERE status='failed' AND created_at > NOW() - INTERVAL '5 minutes'"
          db_type: postgresql
          connection: "{{ providers.payment-db }}"
    - name: enable-maintenance-mode
      if: "{{ steps.check-transaction-status.results.failed_tx > 10 }}"
      provider:
        type: http
        with:
          url: "https://api.example.com/maintenance"
          method: POST
          body: '{"service": "payment", "mode": "on", "message": "Maintenance in progress, payments are temporarily unavailable"}'
    - name: notify-customer-support
      provider:
        type: zendesk
        with:
          ticket:
            subject: "Payment system incident notification"
            comment: "The payment service is currently unavailable; the engineering team is working on it"
            priority: "urgent"
            tags: ["payment", "system-down"]
```
### Scenario 2: Automated Cloud Resource Scaling
```yaml
# Incident-driven auto-scaling workflow
workflow:
  id: auto-scale-based-on-incident
  name: Incident-driven auto-scaling
  triggers:
    - type: incident
      filters:
        - key: symptoms
          value: "high-cpu"
        - key: status
          value: "open"
  steps:
    - name: get-current-pods
      provider:
        type: kubectl
        with:
          command: "get pods -n {{ incident.environment }} -l app={{ incident.service }} --no-headers | wc -l"
          cluster: "{{ incident.environment }}"
    - name: scale-up
      if: "{{ steps.get-current-pods.results < 10 }}"  # cap at 10 pods
      provider:
        type: kubectl
        with:
          command: "scale deployment/{{ incident.service }} --replicas={{ steps.get-current-pods.results + 2 }} -n {{ incident.environment }}"
          cluster: "{{ incident.environment }}"
      enrich_incident:
        - key: auto_scaled
          value: "true"
        - key: scaling_history
          value: "Scaled from {{ steps.get-current-pods.results }} to {{ steps.get-current-pods.results + 2 }} replicas"
```
## Monitoring and Continuous Improvement

### Incident Management Metrics

Key performance indicators (KPIs) to monitor:

| Metric | Definition | Target | Data source |
|---|---|---|---|
| MTTA | Mean time to acknowledge | < 5 minutes | Time from incident creation to acknowledgement |
| MTTR | Mean time to resolve | < 60 minutes | Time from incident creation to closure |
| Auto-close rate | Share of incidents resolved automatically | > 30% | Incident closure method statistics |
| False positive rate | Share of incidents that were false positives | < 5% | Incidents manually marked as false positives |
| Escalation rate | Share of incidents requiring escalation | < 20% | Incidents that triggered an escalation workflow |
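The workflows in this article do not compute these KPIs themselves. One lightweight option, sketched below, is a daily workflow that pulls pre-aggregated numbers from a reporting service and warns the team when MTTR drifts past its target; the `https://api.example.com/incident-kpis` endpoint and the `mttr_minutes` field are hypothetical and stand in for whatever reporting you already have in place.

```yaml
# Sketch: daily KPI check (assumes a hypothetical internal endpoint that returns aggregated metrics)
workflow:
  id: incident-kpi-watchdog
  triggers:
    - type: interval
      value: 86400  # run once a day
  steps:
    - name: fetch-kpis
      provider:
        type: http
        with:
          url: "https://api.example.com/incident-kpis?window=7d"  # hypothetical reporting endpoint
          method: GET
  actions:
    - name: warn-on-mttr-breach
      if: "{{ steps.fetch-kpis.results.mttr_minutes }} > 60"  # assumed response field; 60-minute target from the table above
      provider:
        type: slack
        with:
          channel: "#incident-metrics"  # illustrative channel
          message: "MTTR over the last 7 days is {{ steps.fetch-kpis.results.mttr_minutes }} minutes, above the 60-minute target"
```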
### Rule Optimization Workflow
```yaml
# Automated alert-rule tuning suggestions
workflow:
  id: alert-rule-optimizer
  triggers:
    - type: interval
      value: 86400  # run once a day
  steps:
    - name: get-frequent-alerts
      provider:
        type: keep
        with:
          version: 2
          filter: "status == 'firing' and count(last_7d) > 100 and severity == 'info'"
  actions:
    - name: suggest-threshold-adjustment
      foreach: "{{ steps.get-frequent-alerts.results }}"
      provider:
        type: slack
        with:
          channel: "#alerting-rules"
          message: |
            Alert tuning suggestion:
            Alert name: {{ foreach.value.name }}
            Firings in the last 7 days: {{ foreach.value.count }}
            Current threshold: {{ foreach.value.condition }}
            Suggested action: raise the threshold or lower the severity
            Historical p95: {{ foreach.value.stats.p95 }}
```
## Summary and Next Steps

With the keep incident management flow described in this article, you can build a complete loop from alert detection to incident resolution. Key takeaways:

- Multi-source integration: connect any data source through flexible workflow configuration
- Intelligent correlation: group related alerts into incidents using rules and AI
- Automated response: reduce manual toil and speed up recovery
- End-to-end observability: make incident status, handling progress, and performance metrics transparent

### Further Learning

- Dive deeper into CEL expressions for writing complex conditions
- Configure AI-assisted incident classification and root cause analysis
- Build custom incident dashboards and SLA monitoring
- Set up cross-team incident response collaboration

### Additional Resources

- keep official documentation: detailed API and configuration reference
- Incident management best practice guides: blending ITIL and SRE approaches
- Workflow template library: 100+ predefined automation flows
Disclosure: parts of this article were produced with AI assistance (AIGC) and are provided for reference only.



