Alarming in version 9.0.0 has a bug ("Can't split service id into 2 parts"); it was fixed upstream in 9.1.0.
Configuration file reference
Default contents of alarm-settings.yml
rules:
  # Rule unique name, must be ended with `_rule`.
  # 1.0
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  # 2.0
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  # 3.0
  service_resp_time_percentile_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  # 4.0
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  # 5.0
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  # 6.0
  endpoint_relation_resp_time_rule:
    metrics-name: endpoint_relation_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
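The comment in the default file says every rule name must end with `_rule`. As a rough sanity check before restarting the OAP server, a sketch like the following could be run over the parsed rules. It works on a plain dictionary so it needs no YAML library; the `REQUIRED_KEYS` set and the `validate_rules` helper are assumptions made for illustration, not part of SkyWalking.

```python
# Keys that every rule in the default file carries (an assumption for this
# sketch; silence-period is left out because it defaults to period).
REQUIRED_KEYS = {"metrics-name", "op", "threshold", "period", "count", "message"}


def validate_rules(rules):
    """Return a list of human-readable problems found in the rules mapping."""
    problems = []
    for name, rule in rules.items():
        # Rule names must end with `_rule`, per the comment in the default file.
        if not name.endswith("_rule"):
            problems.append(f"{name}: rule name must end with '_rule'")
        missing = REQUIRED_KEYS - rule.keys()
        if missing:
            problems.append(f"{name}: missing keys {sorted(missing)}")
    return problems


example_rules = {
    "service_resp_time_rule": {
        "metrics-name": "service_resp_time",
        "op": ">",
        "threshold": 1000,
        "period": 10,
        "count": 3,
        "silence-period": 5,
        "message": "Response time of service {name} is more than 1000ms",
    },
    # Hypothetical broken entry: wrong suffix and no message.
    "service_sla": {
        "metrics-name": "service_sla",
        "op": "<",
        "threshold": 8000,
        "period": 10,
        "count": 2,
    },
}

for problem in validate_rules(example_rules):
    print(problem)
```

An empty list means the checks in this sketch found nothing to complain about; it does not guarantee that the server will accept the file.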
1. service_resp_time_rule (service response time)
service_resp_time_rule:
  metrics-name: service_resp_time
  op: ">"
  threshold: 1000
  period: 10
  count: 3
  silence-period: 5
  message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
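silence-period: the number of checks after an alarm fires during which the alarm is not triggered again; it defaults to the value of period. Here it is 5.
message: the alarm message template; it states that the service's response time exceeds 1000 ms.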
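To make the interplay of period, count and silence-period concrete, here is a deliberately simplified simulation. It assumes one metric value and one check per minute, fires when at least `count` values in the last `period` minutes exceed the threshold, and then stays silent for `silence_period` checks. The `simulate` function is a hypothetical helper; the real evaluation inside the SkyWalking OAP server may differ in its details.

```python
from collections import deque


def simulate(values, threshold=1000, period=10, count=3, silence_period=5):
    """Return the minute indexes at which an alarm would fire (simplified model)."""
    window = deque(maxlen=period)  # values of the last `period` minutes
    silence_left = 0               # checks remaining in the silence period
    fired = []
    for minute, value in enumerate(values):
        window.append(value)
        matches = sum(1 for v in window if v > threshold)
        if matches >= count and silence_left == 0:
            fired.append(minute)
            silence_left = silence_period
        elif silence_left > 0:
            silence_left -= 1      # muted check, alarm is not repeated
    return fired


# A service that is slow for 20 minutes in a row.
print(simulate([1200] * 20))
```

In this model the first alarm fires on the third slow minute, the following five checks are muted, and the alarm fires again on the next check if the condition still holds.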
2. service_sla_rule (service SLA success rate)
service_sla_rule:
  metrics-name: service_sla
  op: "<"
  threshold: 8000
  period: 10
  count: 2
  silence-period: 3
  message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
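metrics-name: service_sla, the service's SLA (service level agreement) success rate.
op: <, meaning the alarm triggers when the success rate falls below the threshold.
threshold: 8000, meaning the alarm triggers when the success rate drops below 80% (the metric is stored as the percentage multiplied by 100, so 10000 corresponds to 100%).
period: a 10-minute time window.
count: the alarm triggers once the success rate has been below 80% 2 times within the window.
silence-period: after the alarm fires, it is not triggered again for the next 3 checks.
message: the alarm message template; it reports that the service's SLA success rate is below 80%.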
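The threshold of 8000 is easy to misread as 8000%. A small sketch of the conversion, assuming the service_sla metric is the success percentage multiplied by 100; `sla_percent` and `sla_breached` are hypothetical helper names used only for this illustration.

```python
def sla_percent(raw_value):
    """Convert a raw service_sla value (percentage * 100) to a percentage."""
    return raw_value / 100


def sla_breached(raw_value, threshold=8000):
    """Apply the rule's `op: "<"` comparison to one raw metric value."""
    return raw_value < threshold


for raw in (9950, 8000, 7999):
    print(raw, f"{sla_percent(raw):.2f}%", "breach" if sla_breached(raw) else "ok")
```

Because the operator is a strict `<`, a value of exactly 8000 (80%) does not count as a match in this sketch.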
3. service_resp_time_percentile_rule (service percentile response time)
service_resp_time_percentile_rule:
  metrics-name: service_percentile
  op: ">"
  threshold: 1000,1000,1000,1000,1000
  period: 10
  count: 3
  silence-period: 5
  message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
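The percentile rule differs from the others in that its threshold is a comma-separated list, one entry per percentile (p50, p75, p90, p95, p99, in the order given in the message). The sketch below shows one way to read that list and find which percentiles exceed their threshold. Treating a `-` entry as "skip this percentile" and treating any single match as sufficient are assumptions of this sketch, and `parse_thresholds` and `percentile_matches` are hypothetical helpers.

```python
PERCENTILES = ("p50", "p75", "p90", "p95", "p99")


def parse_thresholds(text):
    """Parse '1000,1000,-,1000,1000' into a list; '-' becomes None (skipped)."""
    return [None if part.strip() == "-" else int(part) for part in text.split(",")]


def percentile_matches(values, thresholds):
    """Return the labels of the percentiles whose value exceeds its threshold."""
    return [
        label
        for label, value, limit in zip(PERCENTILES, values, thresholds)
        if limit is not None and value > limit
    ]


thresholds = parse_thresholds("1000,1000,1000,1000,1000")
observed = [200, 400, 900, 1500, 3000]  # made-up p50..p99 values in ms
print(percentile_matches(observed, thresholds))
```

With these made-up values only the two highest percentiles match, which under the stated assumptions would be enough for the check to count as a match.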
service_instance_resp_time_rule:
  metrics-name: service_instance_resp_time
  op: ">"
  threshold: 1000
  period: 10
  count: 2
  silence-period: 5
  message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes

database_access_resp_time_rule:
  metrics-name: database_access_resp_time
  threshold: 1000
  op: ">"
  period: 10
  count: 2
  message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes

endpoint_relation_resp_time_rule:
  metrics-name: endpoint_relation_resp_time
  threshold: 1000
  op: ">"
  period: 10
  count: 2
  message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes