Prometheus
global configuration
- scrape_interval
The default frequency at which metrics are scraped from targets.
# How frequently to scrape targets by default.
[ scrape_interval: <duration> | default = 1m ]
- scrape_timeout
# How long until a scrape request times out.
[ scrape_timeout: <duration> | default = 10s ]
- evaluation_interval
How often rules are evaluated.
# How frequently to evaluate rules.
[ evaluation_interval: <duration> | default = 1m ]
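As a minimal sketch, the three global settings above might appear in prometheus.yml as follows. The values are simply the documented defaults written out explicitly; nothing here is required, since omitting a key yields the same default.

```yaml
# prometheus.yml (sketch): global timing settings set to their documented defaults
global:
  scrape_interval: 1m       # how often targets are scraped by default
  scrape_timeout: 10s       # how long a single scrape may take before timing out
  evaluation_interval: 1m   # how often rules are evaluated
```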
scrape_config settings (under scrape_configs)
- scrape_interval
# How frequently to scrape targets from this job.
[ scrape_interval: <duration> | default = <global_config.scrape_interval> ]
- scrape_timeout
# Per-scrape timeout when scraping this job.
[ scrape_timeout: <duration> | default = <global_config.scrape_timeout> ]
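A per-job override could look like the sketch below. The job names and targets are hypothetical and chosen only for illustration. Note that Prometheus rejects a configuration in which a job's scrape_timeout is greater than its scrape_interval.

```yaml
global:
  scrape_interval: 1m
  scrape_timeout: 10s

scrape_configs:
  - job_name: node                      # hypothetical job name
    scrape_interval: 15s                # overrides the global 1m for this job
    scrape_timeout: 5s                  # must not exceed this job's scrape_interval
    static_configs:
      - targets: ["localhost:9100"]     # hypothetical target
  - job_name: slow-exporter             # hypothetical; no overrides, so it inherits 1m / 10s
    static_configs:
      - targets: ["localhost:9200"]     # hypothetical target
```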
alertmanager_config settings (under alerting.alertmanagers)
- timeout
# Per-target Alertmanager timeout when pushing alerts.
[ timeout: <duration> | default = 10s ]
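For illustration, the timeout sits on each entry of alerting.alertmanagers in prometheus.yml. The target address below is an assumption, not something taken from these notes.

```yaml
alerting:
  alertmanagers:
    - timeout: 10s                                        # per-target timeout when pushing alerts
      static_configs:
        - targets: ["alertmanager.example.internal:9093"] # hypothetical Alertmanager address
```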
Alertmanager
Global configuration
- resolve_timeout
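resolve_timeout is the default that Alertmanager applies to alerts arriving without an EndsAt: if such an alert is not updated within this time, Alertmanager declares it resolved.
The setting has no effect on alerts from Prometheus, because those always include EndsAt.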
# ResolveTimeout is the default value used by alertmanager if the alert does
# not include EndsAt, after this time passes it can declare the alert as resolved if it has not been updated.
# This has no impact on alerts from Prometheus, as they always include EndsAt.
[ resolve_timeout: <duration> | default = 5m ]
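In alertmanager.yml the setting lives in the top-level global block. A minimal sketch, using the documented default:

```yaml
# alertmanager.yml (sketch)
global:
  resolve_timeout: 5m   # only applies to alerts that arrive without EndsAt
```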
route configuration
- group_wait
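How long to wait before the first notification for a group of alerts is sent. The wait allows an inhibiting alert to arrive, or more initial alerts for the same group to be collected and sent together. (Usually 0s to a few minutes.)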
# How long to initially wait to send a notification for a group
# of alerts. Allows to wait for an inhibiting alert to arrive or collect
# more initial alerts for the same group. (Usually ~0s to few minutes.)
# If omitted, child routes inherit the group_wait of the parent route.
[ group_wait: <duration> | default = 30s ]
- group_interval
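How long to wait before sending another notification when new alerts join a group whose initial notification has already been sent. (Usually 5 minutes or more.)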
# How long to wait before sending a notification about new alerts that
# are added to a group of alerts for which an initial notification has
# already been sent. (Usually ~5m or more.) If omitted, child routes
# inherit the group_interval of the parent route.
[ group_interval: <duration> | default = 5m ]
- repeat_interval
How long to wait before notifying again for an alert whose notification was already sent successfully. (Usually 3 hours or more.)
# How long to wait before sending a notification again if it has already
# been sent successfully for an alert. (Usually ~3h or more). If omitted,
# child routes inherit the repeat_interval of the parent route.
# Note that this parameter is implicitly bound by Alertmanager's
# `--data.retention` configuration flag. Notifications will be resent after either
# repeat_interval or the data retention period have passed, whichever
# occurs first. `repeat_interval` should be a multiple of `group_interval`.
[ repeat_interval: <duration> | default = 4h ]
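Note: child routes under route.routes can also set group_wait, group_interval, and repeat_interval; values set there override the same three settings inherited from the parent route.
group_wait: how long Alertmanager waits after receiving a new alert (one appearing for the first time) before sending it to the receiver.
group_interval: for an alert that has already appeared, Alertmanager checks it once every group_interval.
repeat_interval: for an alert that has already appeared, Alertmanager re-sends it to the receiver every repeat_interval.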
https://kkgithub.com/prometheus/alertmanager/issues/2647
https://blog.youkuaiyun.com/acdsxdas/article/details/143477501
https://blog.youkuaiyun.com/qq_37843943/article/details/120665690
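The override behavior for child routes can be sketched as below. The receiver names and the severity label are hypothetical, and the `matchers` syntax assumes a reasonably recent Alertmanager (0.22 or later); older versions use `match` / `match_re` instead.

```yaml
route:
  receiver: default-receiver        # hypothetical receiver name
  group_by: ["alertname"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"       # hypothetical label used for routing
      receiver: oncall-receiver     # hypothetical receiver name
      group_wait: 10s               # overrides the parent's 30s
      repeat_interval: 1h           # overrides the parent's 4h
      # group_interval is omitted here, so this child route inherits 5m

receivers:
  - name: default-receiver          # notification integrations omitted
  - name: oncall-receiver
```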
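Analysis
1. Prometheus: regardless of how long an alerting rule's `for` duration is, once the first alert has been produced the rule is evaluated again only once per evaluation_interval. evaluation_interval defaults to 1m.
2. Alertmanager: after receiving a new alert, it waits for group_wait, during which the alert is grouped, updated, or silenced as needed. Once group_wait has elapsed for the first alert, Alertmanager checks the alert every group_interval to decide whether any action is needed. Only at the first check n where n * group_interval exceeds repeat_interval does Alertmanager send the alert to its receiver again. Take the following settings as an example: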
group_wait: 10s
group_interval: 20s
repeat_interval: 30s
Time from the first threshold breach to the notification reaching ops-alert:
60s (evaluation_interval) + 2 * 20s (group_interval) = 1m40s
Time from Alertmanager to ops-alert:
2 * 20s (group_interval) = 40s
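For reference, the example values could be placed in an alertmanager.yml route like the sketch below. The receiver name ops-alert comes from the calculation above; the empty receiver definition is a placeholder, since these notes do not say which notification integration it uses.

```yaml
route:
  receiver: ops-alert
  group_wait: 10s
  group_interval: 20s
  # 30s is not a multiple of 20s, so by the analysis above the repeat goes out
  # at the second group_interval check: 2 * 20s = 40s.
  repeat_interval: 30s

receivers:
  - name: ops-alert   # placeholder: notification integrations omitted
```

Choosing a repeat_interval that is a multiple of group_interval, as the quoted documentation recommends, avoids the gap between the configured 30s and the effective 40s.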