Prometheus & Alertmanager: Scrape, Evaluation, and Alerting Configuration Options

Latest recommended article published on 2025-04-03 19:20:13

BUG弄潮儿

Article tags: prometheus

Article link: https://blog.youkuaiyun.com/huangjinjin520/article/details/144065495

Prometheus

global configuration

scrape_interval
How often targets are scraped for metrics by default.

# How frequently to scrape targets by default.
[ scrape_interval: <duration> | default = 1m ]

scrape_timeout

# How long until a scrape request times out.
[ scrape_timeout: <duration> | default = 10s ]

evaluation_interval
How often rules are evaluated.

# How frequently to evaluate rules.
[ evaluation_interval: <duration> | default = 1m ]
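As a minimal sketch, the three global settings above sit together at the top of `prometheus.yml`. The values shown are simply the documented defaults written out explicitly; they are not a tuning recommendation.

```yaml
# prometheus.yml (fragment): global defaults, written out explicitly
global:
  scrape_interval: 1m       # how often every job is scraped unless it overrides this
  scrape_timeout: 10s       # how long a single scrape may take before it is abandoned
  evaluation_interval: 1m   # how often recording and alerting rules are evaluated
```

Omitting the `global` block entirely should give the same behavior, since these are the defaults quoted above.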

scrape_config settings under scrape_configs

scrape_interval

# How frequently to scrape targets from this job.
[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]

scrape_timeout

# Per-scrape timeout when scraping this job.
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]
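A per-job override might look like the sketch below. The job name `example-app` and the target address are hypothetical placeholders, not values from this article. Note that Prometheus rejects a configuration in which a job's `scrape_timeout` is greater than its `scrape_interval`.

```yaml
# prometheus.yml (fragment): one job overriding the global scrape settings
scrape_configs:
  - job_name: example-app          # hypothetical job name
    scrape_interval: 15s           # overrides global.scrape_interval for this job only
    scrape_timeout: 5s             # must not be greater than this job's scrape_interval
    static_configs:
      - targets: ["localhost:9100"]   # hypothetical target address
```

Jobs that leave out either setting fall back to the corresponding value in `global`.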

alertmanager_config settings under alerting.alertmanagers

timeout

# Per-target Alertmanager timeout when pushing alerts.
[ timeout: <duration> | default = 10s ]
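For context, here is a minimal sketch of where this `timeout` lives in `prometheus.yml`. The Alertmanager address `localhost:9093` is an assumed placeholder.

```yaml
# prometheus.yml (fragment): where Prometheus pushes alerts
alerting:
  alertmanagers:
    - timeout: 10s                    # per-target timeout when pushing alerts
      static_configs:
        - targets: ["localhost:9093"]   # assumed Alertmanager address
```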

Alertmanager

global configuration

resolve_timeout

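resolve_timeout is the default Alertmanager uses when an alert does not include EndsAt: if the alert has not been updated by the time this period has passed, Alertmanager declares it resolved.
This parameter has no effect on alerts from Prometheus, because they always include EndsAt.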


# ResolveTimeout is the default value used by alertmanager if the alert does
# not include EndsAt, after this time passes it can declare the alert as resolved if it has not been updated.
# This has no impact on alerts from Prometheus, as they always include EndsAt.
[ resolve_timeout: <duration> | default = 5m ]
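In `alertmanager.yml` this setting goes under the top-level `global` key. The fragment below just writes out the default.

```yaml
# alertmanager.yml (fragment)
global:
  resolve_timeout: 5m   # only applies to alerts that arrive without EndsAt
```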

route configuration

group_wait
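How long to wait before sending the first notification for a group of alerts. The wait leaves time for an inhibiting alert to arrive, or for more initial alerts in the same group to be collected and sent together. (Usually 0s to a few minutes.)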

# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
# If omitted, child routes inherit the group_wait of the parent route.
[ group_wait: <duration> | default = 30s ]

group_interval

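How long to wait before sending a notification about new alerts added to a group for which an initial notification has already been sent. (Usually 5 minutes or more.)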

# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.) If omitted, child routes
# inherit the group_interval of the parent route.
[ group_interval: <duration> | default = 5m ]

repeat_interval

How long to wait before sending a notification again for an alert that has already been sent successfully. (Usually 3 hours or more.)

# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more). If omitted,
# child routes inherit the repeat_interval of the parent route.
# Note that this parameter is implicitly bound by Alertmanager's
# `--data.retention` configuration flag. Notifications will be resent after either
# repeat_interval or the data retention period have passed, whichever
# occurs first. `repeat_interval` should be a multiple of `group_interval`.
[ repeat_interval: <duration> | default = 4h ]

Note: group_wait, group_interval, and repeat_interval can also be set on the child routes under route.routes, where they override the same three settings on the parent route.

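group_wait: when Alertmanager receives a new alert (one appearing for the first time), how long it waits before sending that alert to the receiver.

group_interval: for an alert that has already appeared, Alertmanager checks it once every group_interval.

repeat_interval: for an alert that has already appeared, the notification is resent to the receiver every repeat_interval.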
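The three timers and the child-route override described above can be sketched as follows. The receiver names, the `group_by` label, and the `severity="critical"` matcher are hypothetical, chosen only for illustration.

```yaml
# alertmanager.yml (fragment): parent route with one overriding child route
route:
  receiver: default-receiver       # hypothetical receiver name
  group_by: ["alertname"]
  group_wait: 30s                  # wait before the first notification for a new group
  group_interval: 5m               # wait before notifying about new alerts in a known group
  repeat_interval: 4h              # wait before re-sending an unchanged notification
  routes:
    - matchers:
        - severity="critical"      # hypothetical matcher
      receiver: oncall-receiver    # hypothetical receiver name
      group_wait: 10s              # overrides the parent's 30s
      repeat_interval: 1h          # overrides the parent's 4h
      # group_interval is omitted here, so it is inherited from the parent (5m)

receivers:
  - name: default-receiver
  - name: oncall-receiver
```

In this sketch, critical alerts get their first notification sooner and repeat more often, while everything else keeps the parent route's timing.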

https://kkgithub.com/prometheus/alertmanager/issues/2647
https://blog.youkuaiyun.com/acdsxdas/article/details/143477501
https://blog.youkuaiyun.com/qq_37843943/article/details/120665690

Analysis

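1. Prometheus: no matter how long the alerting rule's for (duration) is set, once the first alert has been produced the rule is only re-evaluated once every evaluation_interval. evaluation_interval defaults to 1m.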

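2. Alertmanager: after receiving a new alert, Alertmanager waits for group_wait, during which the alert is grouped, updated, and checked against silences. Once group_wait has elapsed for the first alert, Alertmanager checks the alert every group_interval to decide whether any action is needed. Only when, after n such checks, n * group_interval first exceeds repeat_interval does Alertmanager send the alert to its receiver again.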

group_wait: 10s
group_interval: 20s
repeat_interval: 30s

Time from the first threshold breach until the notification reaches ops-alert:

60s (evaluation_interval) + 2 * 20s(group_interval) = 1m40s

Time from Alertmanager to ops-alert:

2 * 20s(group_interval) = 40s
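To place the example values in context, here is a sketch of an `alertmanager.yml` fragment that uses them. It assumes ops-alert is a receiver name, and the webhook URL is a hypothetical placeholder; neither detail is confirmed by the configuration shown in this article.

```yaml
# alertmanager.yml (fragment): the example timing values on a single route
route:
  receiver: ops-alert              # assumed to be a receiver name
  group_wait: 10s
  group_interval: 20s
  repeat_interval: 30s             # not a multiple of group_interval (20s)
  # Following the analysis above: repeat_interval (30s) is first exceeded
  # at the second group_interval check, i.e. 2 * 20s = 40s.

receivers:
  - name: ops-alert
    webhook_configs:
      - url: "http://localhost:8080/alert"   # hypothetical endpoint
```

Because 30s falls between two group_interval checks, the effective repeat period in this analysis is 40s. That is consistent with the documentation's advice that `repeat_interval` should be a multiple of `group_interval`.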