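SkyWalking Distributed Tracing: Alarms

1. Overview

SkyWalking raises alarms by periodically polling the tracing data collected by skywalking-oap and evaluating it against the configured alarm rules (for example service response time or service response time percentile). When a threshold is reached, the corresponding alarm message is sent. Alarm messages are delivered asynchronously from a thread pool by calling a webhook endpoint. The webhook endpoint is defined by the user, so any notification channel, such as DingTalk or email, can be implemented behind it. Alarm messages can also be viewed in the RocketBot UI.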
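
As a rough illustration of the webhook side, here is a minimal sketch of a receiver built only on the Python standard library. It assumes the OAP server POSTs a JSON array of alarm objects with the fields scope, name, ruleName, alarmMessage and startTime (in milliseconds), as in the sample payload used later in this article. The names format_alarm, AlarmHandler and serve, and port 5000, are illustrative choices and not part of SkyWalking.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def format_alarm(alarm):
    """Turn one alarm object into a human-readable text message."""
    # startTime is assumed to be a Unix timestamp in milliseconds.
    started = time.strftime(
        "%Y-%m-%d %H:%M:%S", time.localtime(int(alarm["startTime"]) / 1000)
    )
    return "service: {}\nrule: {}\ntime: {}\nmessage: {}".format(
        alarm["name"], alarm["ruleName"], started, alarm["alarmMessage"]
    )


class AlarmHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON body; an empty body is treated as an empty alarm list.
        length = int(self.headers.get("Content-Length", 0))
        alarms = json.loads(self.rfile.read(length) or b"[]")
        for alarm in alarms:
            # Replace print() with DingTalk, email or any other channel.
            print(format_alarm(alarm))
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")


def serve(port=5000):
    """Start the receiver; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), AlarmHandler).serve_forever()
```

Calling serve() starts the listener; the URL of such an endpoint is what goes under webhooks: in alarm-settings.yml.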

The 8.5.0 release deployed in my earlier articles supports the following alarm hooks:

  • Plain webhook
  • gRPCHook
  • Slack Chat Hook
  • WeChat Hook (WeChat alarms)
  • Dingtalk Hook (DingTalk alarms)
  • Feishu Hook (Feishu alarms)

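2. Alarm Rules

2.1 Default Alarm Rules

/apps/apache-skywalking-apm-bin/config/oal/core.oal defines the metrics and how they are computed from the collected data.

/apps/apache-skywalking-apm-bin/config/alarm-settings.yml defines the alarm rules.

An alarm is raised after the data within a given time window has been evaluated.

In SkyWalking an alarm rule is called a rule. A default SkyWalking OAP server installation ships with an alarm rule configuration file, alarm-settings.yml, in the config folder under the installation directory. The same applies when the server runs in a container: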

bash-5.0# pwd
/skywalking
bash-5.0# cat config/alarm-settings.yml

The default alarm rule configuration file looks like this:

rules:
  # Rule unique name, must be ended with `_rule`.
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
  service_sla_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    # The length of time to evaluate the metrics
    period: 10
    # How many times after the metrics match the condition, will trigger alarm
    count: 2
    # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
    silence-period: 3
    message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
  service_resp_time_percentile_rule:
    # Metrics value need to be long, double or int
    metrics-name: service_percentile
    op: ">"
    threshold: 1000,1000,1000,1000,1000
    period: 10
    count: 3
    silence-period: 5
    message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
  service_instance_resp_time_rule:
    metrics-name: service_instance_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 2
    silence-period: 5
    message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
  database_access_resp_time_rule:
    metrics-name: database_access_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
  endpoint_relation_resp_time_rule:
    metrics-name: endpoint_relation_resp_time
    threshold: 1000
    op: ">"
    period: 10
    count: 2
    message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
#  Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
#  Because the number of endpoint is much more than service and instance.
#
#  endpoint_avg_rule:
#    metrics-name: endpoint_avg
#    op: ">"
#    threshold: 1000
#    period: 10
#    count: 2
#    silence-period: 5
#    message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes

# A webhook is added to alarm-settings.yml to call the dispatcher script that sends the DingTalk alarms
# (deployed on server 10.60.4.57; start it with: cd /data/skywalking-alter; nohup python skywalking-alter.py &)
webhooks:
#  - http://127.0.0.1/notify/
#  - http://127.0.0.1/go-wechat/

The alarm-settings.yml currently in use:

rules:
   # Rule unique name, must be ended with `_rule`.
   service_resp_time_rule:
     metrics-name: service_resp_time
     op: ">"
     threshold: 10000
     period: 10
     count: 2
     silence-period: 10
     message: Average response time of service {name} exceeded 10s in 2 minutes of the last 10 minutes
   service_sla_l500_rule:
     # Metrics value need to be long, double or int
     metrics-name: service_sla_l500
     op: "<"
     threshold: 8500
     # The length of time to evaluate the metrics
     period: 10
     # How many times after the metrics match the condition, will trigger alarm
     count: 3
     # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
     silence-period: 10
     message: Request success rate of service {name} was below 85% in 3 minutes of the last 10 minutes
   instance_jvm_old_gc_count_rule:
     metrics-name: instance_jvm_old_gc_count
     op: ">"
     threshold: 100
     period: 3
     count: 1
     silence-period: 3
     message: Old GC count of instance {name} was above 100 in 1 minute of the last 3 minutes
#   endpoint_avg_rule:
#     metrics-name: endpoint_avg
#     op: ">"
#     threshold: 8000
#     period: 10
#     count: 1
#     silence-period: 10
#     message: Response time of endpoint {name} exceeded 8s in the last 10 minutes
   service_cpm_rule:
     # Metrics value need to be long, double or int
     metrics-name: service_cpm
     op: ">"
     threshold: 100000
     # The length of time to evaluate the metrics
     period: 1
     # How many times after the metrics match the condition, will trigger alarm
     count: 1
     # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
     silence-period: 1
     message: Request count of service {name} exceeded 100000 calls per minute in the last minute
#wechatHooks:
#   textTemplate: |-
#     {
#       "msgtype": "text",
#       "text": {
#         "content": "SkyWalking tracing alarm: \n %s."
#       }
#     }
#   webhooks:
#     #- http://10.10.4.62/dingding-webhook/robot/send?access_token=xxxxxxxxxxxxxxxxxxxx
webhooks:
     - http://10.10.10.10:5000/send_alter

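2.2 Alarm Rules Explained

The comment at the top of the file states that every rule name must be unique and must end with _rule. The first rule here is service_resp_time_rule (service response time).

  • metrics-name: the metric to alarm on; the metric value must be of type long, double or int
  • op: how the metric value is compared with the threshold; here it is "greater than"
  • threshold: the threshold; here 1000, in milliseconds
  • period: the length of time over which the metric is evaluated, that is, the alarm check window, in minutes
  • count: how many times the condition must be met within the window before the alarm fires
  • silence-period: the period during which identical alarms are suppressed; it defaults to the value of period. Put simply, if an alarm fires at time N, the rule stays silent until N + silence-period (N + period by default) and does not fire again in between, similar to alert inhibition in Alertmanager
  • message: the alarm message body; variables in it are substituted automatically when the message is sent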
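
To make the interplay of period, count and silence-period concrete, the following is a simplified model in Python. It is a sketch, not SkyWalking's actual implementation: it assumes one metric value arrives per minute, that the rule is checked once per minute against the last period values, and that silence-period counts the checks skipped after a firing. The function name simulate_rule is hypothetical.

```python
from collections import deque
import operator

# Only the two operators used in this article are modelled.
OPS = {">": operator.gt, "<": operator.lt}


def simulate_rule(values, op, threshold, period, count, silence_period=None):
    """Return the minute indexes at which the rule would fire.

    values: one metric value per minute, oldest first.
    """
    if silence_period is None:
        silence_period = period  # default is the same as period
    compare = OPS[op]
    window = deque(maxlen=period)  # keeps only the last `period` minutes
    silence_left = 0
    fired = []
    for minute, value in enumerate(values):
        window.append(compare(value, threshold))
        if silence_left > 0:
            silence_left -= 1  # still silent: skip this check
            continue
        if sum(window) >= count:
            fired.append(minute)
            silence_left = silence_period
    return fired
```

Under this model, simulate_rule([500, 1200, 1300] + [1500] * 7, ">", 1000, period=10, count=3, silence_period=5) fires at minute 3, when the third slow minute is seen, stays silent for the next five checks, and fires again at minute 9 because the service is still slow.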

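Beyond these, a rule also accepts optional (advanced) settings, for example include-names and exclude-names, which restrict the entities a rule applies to.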

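With that, the meaning of every default alarm rule listed above can be worked out. In order:

1 The average service response time exceeded 1 s in 3 of the last 10 minutes

2 The service success rate was below 80% in 2 of the last 10 minutes

3 The service response time percentiles (p50, p75, p90, p95, p99) exceeded 1 s in 3 of the last 10 minutes

4 The average response time of a service instance exceeded 1 s in 2 of the last 10 minutes

5 The average response time of database access exceeded 1 s in 2 of the last 10 minutes

6 The average response time of an endpoint relation exceeded 1 s in 2 of the last 10 minutes

7 The average endpoint response time exceeded 1 s in 2 of the last 10 minutes

  Rule 7 (endpoint_avg_rule) is commented out by default, with the note that because there are far more endpoints than services and instances, alarms on active endpoint metrics consume more memory than alarms on service and service instance metrics.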

After configuring the alarm rules, restart the SkyWalking OAP server so that they take effect, then verify the result.

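To add a custom alarm, first add a metric to the OAL file.

Here we add a custom alarm: whenever an endpoint returns one of the status codes 404, 500, 502, 503 or 504, an alarm is sent.

Under root@nx-eks-ctl:/usr/local/src/zhdya/skywalking/skywalking-kubernetes/chart/skywalking/files/conf.d there is a README.md file that explains how to customize the configuration: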

Files under `oap/*` will override the counterparts under the Docker image's `/skywalking/config/*`, with the directory structure retained, here are some examples:

| File under `files/config.d/oap` directory | Overrides the file under Docker image's `/skywalking/config/` |
| ---- | -------- |
| `files/config.d/oap/application.yml` | `/skywalking/config/application.yml` |
| `files/config.d/oap/log4j2.xml` | `/skywalking/config/log4j2.xml` |
| `files/config.d/oap/alarm-settings.yml` | `/skywalking/config/alarm-settings.yml` |
| `files/config.d/oap/endpoint-name-grouping.yml` | `/skywalking/config/endpoint-name-grouping.yml` |
| `files/config.d/oap/oal/core.oal` | `/skywalking/config/oal/core.oal` |
| `files/config.d/oap/oal/browser.oal` | `/skywalking/config/oal/browser.oal` |
| `files/config.d/oap/oc-rules/oap.yaml` | `/skywalking/config/oc-rules/oap.yaml` |
| `...` | `...` |

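For example, to change core.oal, copy all of its existing content and then append the following (a new oal directory has to be created first):

// devops 20220704
endpoint_abnormal = from(Endpoint.*).filter(responseCode in [404, 500, 502, 503, 504]).count();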
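
To illustrate what this metric counts, here is a small Python sketch of the same filter-and-count logic applied to a list of request records, grouped per endpoint and per minute. The record layout (endpoint, minute, response_code) and the function name are assumptions made for the example; the real aggregation is performed inside the OAP server.

```python
from collections import Counter

# Status codes treated as abnormal, matching the OAL filter above.
ABNORMAL_CODES = {404, 500, 502, 503, 504}


def endpoint_abnormal(requests_seen):
    """Count abnormal responses per (endpoint, minute) bucket."""
    counts = Counter()
    for req in requests_seen:
        if req["response_code"] in ABNORMAL_CODES:
            counts[(req["endpoint"], req["minute"])] += 1
    return counts
```

An alarm rule can then refer to the new metric through metrics-name: endpoint_abnormal in alarm-settings.yml.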

3. Dispatcher Script

Start it with:

cd /data/skywalking-alter

nohup python skywalking-alter.py &

Code of skywalking-alter.py:

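#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Note: this is a Python 2 script (reload(sys), urlparse, long() and urllib.quote_plus are Python 2 only).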
import sys
reload(sys)
sys.setdefaultencoding('utf8')
import re
import requests
from flask import Flask, request, json
import smtplib
from email.mime.text import MIMEText
import time
import json
import hmac
import hashlib
import base64
import urlparse
import urllib
app = Flask(__name__)
@app.route("/send_alter", methods=["POST"])
def send_alter():
    datas = request.json
    headers = {'Content-Type': 'application/json'}
    webhook = 'http://10.10.4.62/dingding-webhook/robot/send?access_token=xxxxxxxxxxxxxxxxxxxxx'
    webhook_zhulong = 'http://10.10.4.62/dingding-webhook/robot/send?access_token=xxxxxxxxxxxxxxxxxxx'
    fd0 = open('/data/skywalking-alter/project/all_group')
    fd1 = open('/data/skywalking-alter/project/group_cuihan')
    fd2 = open('/data/skywalking-alter/project/group_guoqilin')
    fd3 = open('/data/skywalking-alter/project/group_lilianxin')
    fd4 = open('/data/skywalking-alter/project/group_quhuijiang')
    fd5 = open('/data/skywalking-alter/project/group_ranchengcheng')
    fd6 = open('/data/skywalking-alter/project/group_zhulong')
    fd7 = open('/data/skywalking-alter/project/group_zhangbin')
    fd8 = open('/data/skywalking-alter/project/group_wangyu')
    fd9 = open('/data/skywalking-alter/project/group_yangdaxin')
    fd10 = open('/data/skywalking-alter/project/group_duanshuwen')
    fd11 = open('/data/skywalking-alter/project/group_zhaolongji')
    fd12 = open('/data/skywalking-alter/project/group_tangpingping')
    fd13 = open('/data/skywalking-alter/project/group_null')
    fd14 = open('/data/skywalking-alter/project/group_null')
    fd15 = open('/data/skywalking-alter/project/group_null')
    lines0 = fd0.readlines()
    lines1 = fd1.readlines()
    lines2 = fd2.readlines()
    lines3 = fd3.readlines()
    lines4 = fd4.readlines()
    lines5 = fd5.readlines()
    lines6 = fd6.readlines()
    lines7 = fd7.readlines()
    lines8 = fd8.readlines()
    lines9 = fd9.readlines()
    lines10 = fd10.readlines()
    lines11 = fd11.readlines()
    lines12 = fd12.readlines()
    lines13 = fd13.readlines()
    lines14 = fd14.readlines()
    lines15 = fd15.readlines()
    for data in datas:
        msg = {
            "scope": data['scope'],
            "name": data['name'],
            "rule_name": data['ruleName'],
            "alarm_message": data['alarmMessage'],
            "start_time": time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(int(data['startTime']) / 1000)),
        }
        for read in lines0:
            read = read.replace('\n', '')
            if read in data['name']:
                send_msg_tpl = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            #"15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)
        for read in lines1:
            read = read.replace('\n', '')
            if read in data['name']:
                send_msg_tpl = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            "15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)
        for read in lines2:
            read = read.replace('\n', '')
            if read in data['name']:
                send_msg_tpl = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            "15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)
        for read in lines3:
            read = read.replace('\n', '')
            if read in data['name']:
                send_msg_tpl = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            "15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)
        for read in lines4:
            read = read.replace('\n', '')
            if read in data['name']:
                send_msg_tpl = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            #"15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)
        for read in lines5:
            read = read.replace('\n', '')
            if read in data['name']:
                send_msg_tpl = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            "15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)
        for read in lines6:
            read = read.replace('\n', '')
            if read in data['name']:
                send_msg_tpl = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            "15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)
                timestamp = long(round(time.time() * 1000))
                secret = 'xxxxxxxxxxxxxxxxx'
                secret_enc = bytes(secret).encode('utf-8')
                string_to_sign = '{}\n{}'.format(timestamp, secret)
                string_to_sign_enc = bytes(string_to_sign).encode('utf-8')
                hmac_code = hmac.new(secret_enc, string_to_sign_enc, digestmod=hashlib.sha256).digest()
                sign = urllib.quote_plus(base64.b64encode(hmac_code))
                url = "http://10.10.4.62/dingding-webhook/robot/send?access_token=xxxxxxxxxxxxxxxxxxxxxxxxxxx&timestamp="+str(timestamp)+"&sign="+str(sign)
                headers1 = {'Content-Type': 'application/json;charset=utf-8'}
                send_msg_tpl_zhulong = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            #"15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(url, data=json.dumps(send_msg_tpl_zhulong), headers=headers1)
        for read in lines7:
            read = read.replace('\n', '')
            if read in data['name']:
                send_msg_tpl = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            "15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)
        for read in lines8:
            read = read.replace('\n', '')
            if read in data['name']:
                send_msg_tpl = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            "15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)
        for read in lines9:
            read = read.replace('\n', '')
            if read in data['name']:
                send_msg_tpl = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            "15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)
        for read in lines10:
            read = read.replace('\n', '')
            if read in data['name']:
                send_msg_tpl = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            "15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)
        for read in lines11:
            read = read.replace('\n', '')
            if read in data['name']:
                send_msg_tpl = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            "15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)
        for read in lines12:
            read = read.replace('\n', '')
            if read in data['name']:
                send_msg_tpl = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            "15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)
        for read in lines13:
            read = read.replace('\n', '')
            if read in data['name']:
                send_msg_tpl = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            #"15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)
        for read in lines14:
            read = read.replace('\n', '')
            if read in data['name']:
                send_msg_tpl = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            #"15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)
        for read in lines15:
            read = read.replace('\n', '')
            if read in data['name']:
                send_msg_tpl = {
                    "msgtype": "text",
                    "text": {
                        "content": "服务名称:{name} \n触发规则:{rule_name} \n告警时间:{start_time} \n告警内容:{alarm_message} ".format(
                            **msg)
                    },
                    "at": {
                        "atMobiles": [
                            #"15885740000"
                        ],
                        "atUserIds": [
                            "manager5345"
                        ],
                        "isAtAll": False
                    }
                }
                requests.post(webhook, data=json.dumps(send_msg_tpl), headers=headers)

    return "成功"

if __name__ == "__main__":
    app.run("0.0.0.0", 5000, debug=True)
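
The signing step in the group_zhulong branch relies on functions that exist only in Python 2. As a sketch of the same calculation in Python 3 (HMAC-SHA256 over "timestamp\nsecret" keyed with the secret, followed by Base64 and URL encoding), assuming the signed URL carries timestamp and sign query parameters: the function name signed_webhook_url is hypothetical.

```python
import base64
import hashlib
import hmac
import time
import urllib.parse


def signed_webhook_url(base_url, secret, timestamp_ms=None):
    """Append timestamp and sign parameters to a DingTalk robot URL.

    base_url is expected to already contain ?access_token=...
    """
    if timestamp_ms is None:
        timestamp_ms = int(round(time.time() * 1000))
    # The string to sign is "<timestamp>\n<secret>", keyed with the secret.
    string_to_sign = "{}\n{}".format(timestamp_ms, secret).encode("utf-8")
    digest = hmac.new(
        secret.encode("utf-8"), string_to_sign, digestmod=hashlib.sha256
    ).digest()
    # Base64-encode the digest, then URL-encode it for use in a query string.
    sign = urllib.parse.quote_plus(base64.b64encode(digest).decode("ascii"))
    return "{}&timestamp={}&sign={}".format(base_url, timestamp_ms, sign)
```

The result can be passed directly to requests.post in place of the hand-built url variable.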

 

The following test script posts a sample alarm to the dispatcher:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests

url = 'http://10.60.4.57:5000/send_alter'
d = [{
    "scopeId": 1,
    "scope": "SERVICE",
    "name": "gzlles-dcb-app.affairs-service",
    "id0": 12,
    "id1": 0,
    "ruleName": "service_resp_time_rule",
    "alarmMessage": "alarmMessage xxxxxxxxxxx 测试一下",
    "startTime": 15884741400
}]

r = requests.post(url, json=d)
print(r.text)

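Project group directory: 10.10.10.10:/data/skywalking-alter/project

Each file is named group_ followed by the pinyin name of the project owner, and contains entries of the form service name.namespace. The all_group file is the overall set for all projects (note: an entry in all_group must not also appear in a project group file, otherwise the DingTalk alarm is sent twice).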
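
The sixteen near-identical loops in skywalking-alter.py all follow one pattern: read a group file and, if an entry is a substring of the alarm's name, notify that group. Below is a compact sketch of that routing step in Python 3. The helper names load_groups and match_groups are hypothetical, and the sketch only decides which groups match; the DingTalk call itself is left out.

```python
import os


def load_groups(project_dir):
    """Map each group file name to its non-empty lines."""
    groups = {}
    for file_name in sorted(os.listdir(project_dir)):
        path = os.path.join(project_dir, file_name)
        if not os.path.isfile(path):
            continue
        with open(path, encoding="utf-8") as fh:
            # Blank lines are dropped: an empty entry would match every name.
            entries = [line.strip() for line in fh if line.strip()]
        groups[file_name] = entries
    return groups


def match_groups(alarm_name, groups):
    """Return the groups that have an entry contained in alarm_name."""
    return [
        group
        for group, entries in groups.items()
        if any(entry in alarm_name for entry in entries)
    ]
```

Unlike the original loops, this sketch reports each matching group once even when several of its entries match, and it ignores blank lines in the group files.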
