Monitoring URL Availability with Prometheus and Alerting with Alertmanager 🍕
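The previous article deployed Prometheus and its components. To monitor whether the URLs served by the nodes are alive, Prometheus needs the blackbox_exporter component, which is deployed here on the Prometheus control node.
blackbox_exporter is an exporter in the Prometheus monitoring ecosystem that checks the availability and performance of network services. It probes network endpoints over protocols such as HTTP, HTTPS, DNS, TCP, and ICMP, and collects metrics from each probe.
Its main features and uses are listed below.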
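Main features
- Multi-protocol support: blackbox_exporter supports HTTP, HTTPS, DNS, TCP, and ICMP, so it can monitor many kinds of services.
- Custom probes: users can define probes for specific checks, such as the HTTP response status code, response time, or SSL certificate expiry.
- Modular configuration: a module-based configuration file lets users define different probe settings for different targets.
- Metrics exposure: blackbox_exporter exposes probe results as Prometheus metrics, which the Prometheus server scrapes and stores.
- Security: probes can use TLS-encrypted connections, which protects data in transit.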
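Uses
- Website availability monitoring: check whether a site responds successfully and whether its response time is within a reasonable range.
- SSL certificate monitoring: track certificate expiry so that certificates are renewed before they lapse.
- Network latency monitoring: measure latency with ICMP probes (such as ping).
- Port monitoring: use TCP probes to check whether a service port is open and accepting connections.
- DNS monitoring: check whether a DNS server resolves domain names correctly.
Configuring blackbox_exporter usually involves two parts: the exporter's own configuration and the Prometheus scrape configuration.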
1. Deploying blackbox_exporter
① Download and install
[root@localhost ~]# wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.25.0/blackbox_exporter-0.25.0.linux-amd64.tar.gz
[root@localhost ~]# tar -xvf blackbox_exporter-0.25.0.linux-amd64.tar.gz -C /usr/local/
[root@localhost ~]# cd /usr/local/
[root@localhost local]# mv blackbox_exporter-0.25.0.linux-amd64 blackbox_exporter
[root@localhost local]# cd blackbox_exporter
② Edit the configuration file
[root@localhost blackbox_exporter]# vim blackbox.yml
modules:
  http_2xx:
    prober: http
    http:
      preferred_ip_protocol: "ip4"
  http_post_2xx:
    prober: http
    http:
      method: POST
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
        - expect: "^SSH-2.0-"
        - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
        - send: "NICK prober"
        - send: "USER prober prober prober :prober"
        - expect: "PING :([^ ]+)"
          send: "PONG ${1}"
        - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
  icmp_ttl5:
    prober: icmp
    timeout: 5s
    icmp:
      ttl: 5
③ Start blackbox_exporter
[root@localhost ~]# vim /usr/lib/systemd/system/blackbox_exporter.service
[Unit]
Description=Prometheus Blackbox Exporter
After=network.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/blackbox_exporter/blackbox_exporter --config.file=/usr/local/blackbox_exporter/blackbox.yml
[Install]
WantedBy=multi-user.target
[root@localhost ~]# systemctl daemon-reload
[root@localhost ~]# systemctl enable --now blackbox_exporter
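Once the service is running, it can be worth probing a URL by hand before wiring up Prometheus. The following is a minimal Python sketch, assuming the exporter listens on its default port 9115 on localhost. The names `parse_metrics` and `run_probe` and the sample text are illustrative and not part of blackbox_exporter.

```python
from urllib.request import urlopen


def parse_metrics(text: str) -> dict:
    """Parse un-labelled 'name value' lines of the text exposition format."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and labelled series such as name{...}
        if not line or line.startswith("#") or "{" in line:
            continue
        name, _, value = line.partition(" ")
        metrics[name] = float(value)
    return metrics


def run_probe(url: str) -> dict:
    """Fetch a /probe URL from a running blackbox_exporter and parse it."""
    with urlopen(url, timeout=10) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Hand-written sample in the exposition format, for illustration only.
SAMPLE = """\
# HELP probe_success Displays whether or not the probe was a success
# TYPE probe_success gauge
probe_success 1
probe_http_status_code 200
"""

if __name__ == "__main__":
    print(parse_metrics(SAMPLE))
```

Against a live exporter, `run_probe("http://localhost:9115/probe?module=http_2xx&target=http://60.60.60.60:50")` would return the same kind of dictionary, with `probe_success` equal to 1.0 when the probe succeeded.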
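A matching Grafana dashboard for blackbox_exporter can be imported with ID 9965.
2. Configuring Prometheus to scrape blackbox_exporter
The Prometheus configuration file needs a scrape job for blackbox_exporter.
Add an "http_status" job to the scrape_configs block: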
[root@localhost prometheus]# vim /usr/local/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ["localhost:9090"]
  - job_name: "node"
    static_configs:
      - targets: ["localhost:9100"]
  - job_name: "http_status"
    file_sd_configs:
      - files:
          - /usr/local/prometheus/file_sd_config/*.yml
    metrics_path: /probe
    params:
      module: [http_2xx]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115 # replace with the address of blackbox_exporter
  - job_name: "alertmanager"
    static_configs:
      - targets: ["localhost:9093"]
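The three relabel steps in the http_status job are easier to follow when traced on a single target. The sketch below is a simplified Python model of that rewriting, not Prometheus code. The function names `relabel_for_blackbox` and `scrape_url` are made up for illustration, and the model ignores every other relabeling feature.

```python
from urllib.parse import urlencode


def relabel_for_blackbox(address: str, exporter: str = "localhost:9115") -> dict:
    """Apply the three relabel steps from the http_status job to one target."""
    labels = {"__address__": address}
    # Step 1: copy the original target into the `target` URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the probed URL as the `instance` label.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the scrape at blackbox_exporter itself.
    labels["__address__"] = exporter
    return labels


def scrape_url(labels: dict, module: str = "http_2xx") -> str:
    """Build the URL that would be requested after relabeling."""
    query = urlencode({"module": module, "target": labels["__param_target"]})
    return f"http://{labels['__address__']}/probe?{query}"


if __name__ == "__main__":
    result = relabel_for_blackbox("http://60.60.60.60:50")
    print(result["instance"])
    print(scrape_url(result))
```

The point of the design is that Prometheus never contacts the monitored URL directly: it always scrapes the exporter, and the monitored URL travels as a query parameter while staying visible as the `instance` label.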
Configure the URLs to monitor
[root@localhost prometheus]# vim /usr/local/prometheus/file_sd_config/slb_jilin.yml
##################################### Jilin services ############################################
- targets:
    - "http://60.60.60.60:50" # URL to monitor
  labels:
    environment: "Prod"
    region: "阿里云-华北2-北京"
    job: "http_status"
    AlertReceivers: "吉林"
    project: "吉林project"
    service: "考生端"
    ecs: "192.168.1.1"
- targets:
    - "http://23.23.23.23:88" # URL to monitor
  labels:
    environment: "Prod"
    region: "阿里云-华北2-北京"
    job: "http_status"
    AlertReceivers: "吉林"
    project: "吉林project"
    service: "管理端"
    ecs: "192.168.1.2"
Configure the alerting rule
[root@localhost prometheus]# vim /usr/local/prometheus/rules/alert_http_status_code.yml
groups:
  - name: http_status_code_rules
    rules:
      - alert: HTTP_Status_Not_200
        expr: probe_http_status_code{job="http_status"} != 200
        for: 1m
        labels:
          severity: critical
          component: web-service
          environment: Prod # environment label
          service: web-service # service name label; note that rule labels overwrite same-named target labels
          team: devops # owning team label
        annotations:
          summary: "HTTP Status Code Not 200"
          description: "{{ $labels.instance }} is unreachable!"
          details: "HTTP status code is {{ $value }}; please check the service." # extra detail
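The interaction between `expr` and `for: 1m` can be traced with a small simulation. This is a simplified Python sketch of the pending and firing states, assuming one evaluation every 15 seconds as set by `evaluation_interval`. The function `alert_states` is hypothetical and models only this one rule.

```python
def alert_states(samples, hold_seconds=60, expected=200):
    """Return the alert state after each (timestamp, status_code) sample.

    Models `expr: probe_http_status_code != 200` with `for: 1m`: the alert
    is pending while the condition holds, and fires once the condition has
    held for at least `hold_seconds`.
    """
    states = []
    active_since = None
    for ts, code in samples:
        if code == expected:
            active_since = None  # condition cleared, the alert resets
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts  # first failing evaluation
        held = ts - active_since
        states.append("firing" if held >= hold_seconds else "pending")
    return states


if __name__ == "__main__":
    # Status code 0 stands for a probe that received no HTTP response.
    samples = [(0, 200), (15, 0), (30, 0), (45, 0), (60, 0), (75, 0), (90, 200)]
    for (ts, code), state in zip(samples, alert_states(samples)):
        print(ts, code, state)
```

In this model a single failed probe never sends a notification: the alert stays pending and only fires if the failure persists for a full minute, then resets as soon as a 200 is seen again.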
The Prometheus web UI now shows the targets of the http_status job.
3. Configuring alerts in Alertmanager
Edit the alertmanager.yml configuration file.
Main components
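- routes: a list of routing entries that defines how alerts are routed to different receivers based on their labels.
- match: each route has a match condition that is compared against the alert's labels. In this example, every match condition is based on the AlertReceivers label.
- receiver: the receiver an alert is sent to when it matches the route. A receiver can be email, Slack, a webhook, and so on.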
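The routing behaviour described by these three fields can be modelled in a few lines. This is a simplified Python sketch, assuming flat child routes without `continue: true`, so the first matching route wins and unmatched alerts fall back to the default receiver. `ROUTES`, `DEFAULT_RECEIVER`, and `pick_receiver` are illustrative names, with values taken from this article's alertmanager.yml.

```python
DEFAULT_RECEIVER = "email"

# (match labels, receiver) pairs in the order they appear under `routes`.
ROUTES = [
    ({"AlertReceivers": "广东"}, "guangdong-receivers"),
    ({"AlertReceivers": "贵州"}, "guizhou-receivers"),
    ({"AlertReceivers": "湖北"}, "hubei-receivers"),
    ({"AlertReceivers": "吉林"}, "jilin-receivers"),
]


def pick_receiver(alert_labels: dict) -> str:
    """Return the receiver of the first route whose match labels all agree."""
    for match, receiver in ROUTES:
        if all(alert_labels.get(k) == v for k, v in match.items()):
            return receiver
    # No child route matched: fall back to the top-level receiver.
    return DEFAULT_RECEIVER


if __name__ == "__main__":
    print(pick_receiver({"job": "http_status", "AlertReceivers": "吉林"}))
    print(pick_receiver({"job": "node"}))
```

This is why each target in the file_sd configuration carries an AlertReceivers label: that label alone decides which mailbox receives the alert.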
[root@localhost alertmanager]# vim alertmanager.yml
global:
  resolve_timeout: 5m
  ## SMTP server address. This example uses 126 Mail (smtp.126.com:465). For QQ Mail the official address is smtp.qq.com on port 465 or 587. POP3/SMTP service must be enabled on the mailbox.
  smtp_smarthost: 'smtp.126.com:465'
  # smtp_from: ' "告警机器人" <Noleaf@126.com>'
  # The value below is the RFC 2047 encoded form of the display name 告警机器人
  smtp_from: '=?UTF-8?B?5ZGK6K2m5py65Zmo5Lq6?= <Noleaf@126.com>'
  # smtp_from: 'Noleaf@126.com'
  smtp_auth_username: 'Noleaf@126.com'
  # Authorization code, not the login password; the mail provider shows it when POP3/SMTP service is enabled
  smtp_auth_password: 'LLLLLLLLLLLLLLJ'
  smtp_require_tls: false
# 1. Templates
templates:
  - '/usr/local/alertmanager/templates/*.tmpl'
#################################### Routing ##################################################
# 2. Routes
route:
  group_by: ['job', 'project', 'service']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 4h
  # Default receiver (required)
  receiver: 'email'
  routes:
    - match:
        AlertReceivers: '广东'
      receiver: 'guangdong-receivers'
    - match:
        AlertReceivers: '贵州'
      receiver: 'guizhou-receivers'
    - match:
        AlertReceivers: '湖北'
      receiver: 'hubei-receivers'
    - match:
        AlertReceivers: '吉林'
      receiver: 'jilin-receivers'
#################################### Receivers ##########################################
# 3. Receivers
receivers:
  - name: 'email'
    email_configs:
      - to: '1515151515@qq.com'
        send_resolved: true
        html: '{{ template "email.alert.recovery.html" . }}'
        headers: { Subject: "Prometheus [Warning] Alert Email" }
  # Separate receiver for resolved notifications
  #- name: 'restore-email'
  #  email_configs:
  #    - to: '15151515@qq.com'
  #      html: '{{ template "email.recovery.html" . }}'
  #      headers:
  #        Subject: "Alert Resolved Notification"
  # Guangdong services
  - name: 'guangdong-receivers'
    email_configs:
      - to: '1515151515@qq.com'
        send_resolved: true
        html: '{{ template "email.alert.recovery.html" . }}'
        headers: { Subject: "Guangdong [Warning] Alert Email" }
  # Guizhou services
  - name: 'guizhou-receivers'
    email_configs:
      - to: '151151551@qq.com'
        send_resolved: true
        html: '{{ template "email.alert.recovery.html" . }}'
        headers: { Subject: "Guizhou [Warning] Alert Email" }
  # Hubei services
  - name: 'hubei-receivers'
    email_configs:
      - to: '1515151511@qq.com'
        send_resolved: true
        html: '{{ template "email.alert.recovery.html" . }}'
        headers: { Subject: "Hubei [Warning] Alert Email" }
  # Jilin services
  - name: 'jilin-receivers'
    email_configs:
      - to: '1515151515@qq.com'
        send_resolved: true
        html: '{{ template "email.alert.recovery.html" . }}'
        headers: { Subject: "Jilin [Warning] Alert Email" }
####################################### Inhibition ####################################################
# Inhibit rules
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    # Inhibition applies only when these labels have the same values in both alerts, i.e. the critical and the warning alert must agree on all three labels.
    equal: ['alertname', 'job', 'instance']
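The inhibit rule at the end of this configuration can be read as a predicate over pairs of alerts. The following Python sketch is a simplified model of that predicate and not Alertmanager's implementation. `is_inhibited` and `RULE` are hypothetical names, and the sketch assumes that a label missing from an alert is treated like an empty value when comparing the `equal` labels.

```python
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "job", "instance"],
}


def is_inhibited(target: dict, firing: list, rule: dict) -> bool:
    """Return True if `target` is muted by any firing alert under `rule`."""
    # The rule only applies to alerts that match target_match.
    if any(target.get(k) != v for k, v in rule["target_match"].items()):
        return False
    for source in firing:
        # Only alerts matching source_match can mute others.
        if any(source.get(k) != v for k, v in rule["source_match"].items()):
            continue
        # Assumption: a missing label compares like an empty value.
        if all(source.get(name, "") == target.get(name, "") for name in rule["equal"]):
            return True
    return False


if __name__ == "__main__":
    critical = {
        "alertname": "HTTP_Status_Not_200",
        "job": "http_status",
        "instance": "http://60.60.60.60:50",
        "severity": "critical",
    }
    warning = dict(critical, severity="warning")
    print(is_inhibited(warning, [critical], RULE))
```

With only the single critical rule defined in this article, the inhibit rule has nothing to mute yet; it starts to matter once a warning-level rule with the same alertname is added for the same targets.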
Alert email template
[root@localhost alertmanager]# cat templates/email_alert_recovery_html.tmpl
{{ define "email.alert.recovery.html" }}
{{ if gt (len .Alerts.Firing) 0 }}
{{ range $index, $alert := .Alerts.Firing }}
========= <span style="color:red;font-size:36px;font-weight:bold;"> Alert Notification </span>=========
<br>
<span style="font-size:20px;font-weight:bold;"> Alert source:</span> Alertmanager <br>
<span style="font-size:20px;font-weight:bold;"> Alert type:</span> {{ $alert.Labels.alertname }} <br>
<span style="font-size:20px;font-weight:bold;"> Severity:</span> {{ $alert.Labels.severity }} <br>
<span style="font-size:20px;font-weight:bold;"> Status:</span> {{ .Status }} <br>
<span style="font-size:20px;font-weight:bold;"> Affected host:</span> {{ $alert.Labels.instance }} {{ $alert.Labels.device }} <br>
<span style="font-size:20px;font-weight:bold;"> Summary:</span> {{ .Annotations.summary }} <br>
<span style="font-size:20px;font-weight:bold;"> Details:</span> {{ $alert.Annotations.message }}{{ $alert.Annotations.description }} <br>
{{/* Note: the range statement iterates over every label pair in .Labels.SortedPairs */}}
{{/*<span style="font-size:20px;font-weight:bold;"> Host labels:</span> {{ range .Labels.SortedPairs }} <br> [{{ .Name }}: {{ .Value | html }} ]{{ end }}<br>*/}}
<span style="font-size:20px;font-weight:bold;"> Host labels:</span><br>
<ul>
<li>environment: {{ with $alert.Labels.environment }}{{ . }}{{ else }}N/A{{ end }}</li>
<li>region: {{ with $alert.Labels.region }}{{ . }}{{ else }}N/A{{ end }}</li>
<li>project: {{ with $alert.Labels.project }}{{ . }}{{ else }}N/A{{ end }}</li>
<li>service: {{ with $alert.Labels.service }}{{ . }}{{ else }}N/A{{ end }}</li>
<li>ecs: {{ with $alert.Labels.ecs }}{{ . }}{{ else }}N/A{{ end }}</li>
</ul>
<span style="font-size:20px;font-weight:bold;"> Start time:</span> {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
============== = end = ==============<br>
<br>
<br>
<br>
<span style="font-size:18px;font-weight:normal;">If the server was taken offline or is under maintenance as planned, please ignore this email.</span>
<br>
{{ end }}
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
{{ range $index, $alert := .Alerts.Resolved }}
========= <span style="color:#00FF00;font-size:36px;font-weight:bold;"> Alert Resolved </span>=========
<br>
<span style="font-size:20px;font-weight:bold;"> Alert source:</span> Alertmanager <br>
<span style="font-size:20px;font-weight:bold;"> Summary:</span> {{ $alert.Annotations.summary }}<br>
<span style="font-size:20px;font-weight:bold;"> Host:</span> {{ .Labels.instance }} <br>
<span style="font-size:20px;font-weight:bold;"> Alert type:</span> {{ .Labels.alertname }}<br>
<span style="font-size:20px;font-weight:bold;"> Severity:</span> {{ $alert.Labels.severity }} <br>
<span style="font-size:20px;font-weight:bold;"> Status:</span> {{ .Status }}<br>
<span style="font-size:20px;font-weight:bold;"> Details:</span> {{ if eq .Status "resolved" }} The service is reachable again.{{ else }} {{ $alert.Annotations.message }}{{ $alert.Annotations.description }} {{ end }}<br>
{{/*<span style="font-size:20px;font-weight:bold;"> Host labels:</span> {{ range .Labels.SortedPairs }} <br> [{{ .Name }}: {{ .Value | html }} ]{{ end }}<br>*/}}
<span style="font-size:20px;font-weight:bold;"> Host labels:</span><br>
<ul>
<li>environment: {{ with $alert.Labels.environment }}{{ . }}{{ else }}N/A{{ end }}</li>
<li>region: {{ with $alert.Labels.region }}{{ . }}{{ else }}N/A{{ end }}</li>
<li>project: {{ with $alert.Labels.project }}{{ . }}{{ else }}N/A{{ end }}</li>
<li>service: {{ with $alert.Labels.service }}{{ . }}{{ else }}N/A{{ end }}</li>
<li>ecs: {{ with $alert.Labels.ecs }}{{ . }}{{ else }}N/A{{ end }}</li>
</ul>
<span style="font-size:20px;font-weight:bold;"> Start time:</span> {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
<span style="font-size:20px;font-weight:bold;"> Resolved time:</span> {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}<br>
============= = end = ==============
<br>
<br>
<br>
<br>
{{ end }}
{{ end }}
{{ end }}
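The expression `.Add 28800e9` in the template shifts the alert timestamps, which Alertmanager reports in UTC, by 28800e9 nanoseconds, that is, 8 hours, so the email shows UTC+8 local time. The Python sketch below reproduces the same arithmetic and formatting; `to_local_string` is an illustrative name, and the strftime pattern is the equivalent of Go's reference layout "2006-01-02 15:04:05".

```python
from datetime import datetime, timedelta

# 28800e9 nanoseconds, the value passed to .Add in the template, is 8 hours.
OFFSET = timedelta(seconds=28800e9 / 1e9)


def to_local_string(utc_time: datetime) -> str:
    """Shift a UTC timestamp by 8 hours and format it like the template."""
    return (utc_time + OFFSET).strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    print(to_local_string(datetime(2024, 1, 1, 20, 30, 0)))
```

A fixed offset is simple but ignores daylight saving time, which is harmless for UTC+8 regions that do not observe it.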
Resulting email: