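I previously covered installing and configuring Prometheus + Grafana: https://blog.youkuaiyun.com/weixin_50801368/article/details/115668747?spm=1001.2014.3001.5502
Prometheus only evaluates alerting rules; to actually send notifications you must also install Alertmanager.
Install Alertmanager: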
wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
tar -xvf alertmanager-0.21.0.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/
mv alertmanager-0.21.0.linux-amd64 alertmanager
systemd unit file (start/stop script)
vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
[Service]
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data --web.listen-address=:9093 --data.retention=120h
Restart=on-failure
[Install]
WantedBy=multi-user.target
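Alertmanager configuration file (/usr/local/alertmanager/alertmanager.yml)
global:
  resolve_timeout: 1m # an alert that receives no update for this long is marked as resolved
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_corp_id: '' # corp ID of your WeCom (企业微信) organization
  wechat_api_secret: '' # Secret of the WeCom application
templates:
  - '/usr/local/alertmanager/template/*.tmpl'
route:
  receiver: 'wechat'
  group_by: ['env','instance','type','group','job','alertname']
  group_wait: 10s # how long to wait before sending the first notification for a new group
  group_interval: 10s # how long to wait before notifying about new alerts added to a group that was already notified
  repeat_interval: 3m # how long to wait before re-sending a notification for alerts that are still firing
receivers:
  - name: 'wechat'
    wechat_configs:
      - send_resolved: true
        message: '{{ template "wechat.default.message" . }}'
        to_party: '27' # ID of the WeCom department that receives alerts (here, the "告警机器人" alert-bot department)
        agent_id: '1000010' # ID of the application created in WeCom
        api_secret: '' # Secret of the WeCom application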
This configuration sends alerts through WeCom (企业微信, WeChat Work).
Reference: https://www.cnblogs.com/miaocbin/p/13706164.html
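Before starting the service it can help to validate the configuration file. The sketch below uses amtool, which ships in the same tarball as alertmanager; the AM_HOME path is an assumption taken from the install steps above, and the function name is only illustrative.

```shell
#!/bin/sh
# Assumed install prefix, taken from the tar/mv steps earlier in this post.
AM_HOME="/usr/local/alertmanager"

# Validate alertmanager.yml (and the template files it references) with amtool.
check_alertmanager_config() {
    if [ ! -x "$AM_HOME/amtool" ]; then
        echo "amtool not found in $AM_HOME" >&2
        return 127
    fi
    "$AM_HOME/amtool" check-config "$AM_HOME/alertmanager.yml"
}

# Report a failed or skipped check without aborting the shell session.
check_alertmanager_config || echo "config check did not pass" >&2
```

If the check reports an error, fix alertmanager.yml first; a service started with a broken configuration will exit immediately and be restarted in a loop by systemd.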
systemctl daemon-reload
systemctl start alertmanager
systemctl enable alertmanager
Open http://<ip>:9093 in a browser to reach the Alertmanager web UI.
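The same check can be done from the command line. This is a minimal sketch that assumes Alertmanager is running on the local machine on port 9093 and that curl is installed; /-/healthy and /-/ready are Alertmanager's management endpoints, while the helper function names are made up for this example.

```shell
#!/bin/sh
# Assumption: Alertmanager listens locally on the port set by --web.listen-address=:9093.
AM_URL="http://localhost:9093"

# Build the full URL for a given path on the Alertmanager server.
am_url() {
    printf '%s%s' "$AM_URL" "$1"
}

# Succeeds only when the endpoint answers without an HTTP error within 5 seconds.
am_endpoint_ok() {
    curl -fs --max-time 5 "$(am_url "$1")" >/dev/null 2>&1
}

for path in /-/healthy /-/ready; do
    if am_endpoint_ok "$path"; then
        echo "$path: ok"
    else
        echo "$path: not reachable"
    fi
done
```

If either endpoint is not reachable, `systemctl status alertmanager` and `journalctl -u alertmanager` are the first places to look.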
Alert template
[root@localhost alertmanager]# ls
alertmanager alertmanager1.yml alertmanager.yml amtool data LICENSE NOTICE template
[root@localhost alertmanager]# vim template/wechat.tmpl
{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
{{- if eq $index 0 }}
========= Monitoring Alert =========
Status: {{ .Status }}
Severity: {{ .Labels.severity }}
Alert name: {{ $alert.Labels.alertname }}
Affected host: {{ $alert.Labels.instance }}
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};
Trigger value: {{ .Annotations.value }}
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
========= = end = =========
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
{{- if eq $index 0 }}
========= Alert Resolved =========
Alert name: {{ .Labels.alertname }}
Status: {{ .Status }}
Affected host: {{ $alert.Labels.instance }}
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description}};
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved at: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{ $alert.Labels.instance }}
{{- end }}
========= = end = =========
{{- end }}
{{- end }}
{{- end }}
{{- end }}
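The literal 28800e9 in the template is a duration in nanoseconds, the unit of Go's time.Duration. Assuming the alert timestamps arrive in UTC, adding it shifts them to UTC+8 (Beijing time). The arithmetic can be sketched as follows; tz_offset_literal is a hypothetical helper for working out the value for another time zone east of UTC.

```shell
#!/bin/sh
# 28800e9 means 28800 * 10^9 nanoseconds.
offset_ns=28800000000000
offset_seconds=$(( offset_ns / 1000000000 ))
offset_hours=$(( offset_seconds / 3600 ))
echo "offset: ${offset_seconds}s = ${offset_hours}h"   # prints "offset: 28800s = 8h"

# Print the literal to use after ".Add" for a zone N whole hours east of UTC.
tz_offset_literal() {
    echo "$(( $1 * 3600 ))e9"
}

tz_offset_literal 8   # prints 28800e9
tz_offset_literal 9   # prints 32400e9
```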
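Edit the Prometheus configuration file (prometheus.yml)
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 192.168.1.163:9093
rule_files:
  - "rules/*_rules.yml"
  - "rules/*_alerts.yml"
Also add the following job under scrape_configs so that Prometheus scrapes Alertmanager's own metrics:
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['192.168.1.163:9093']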
Add alerting rules
[root@localhost prometheus]# ls
console_libraries consoles LICENSE NOTICE prometheus prometheus.yml promtool rules
[root@localhost prometheus]# vim rules/node_alerts.yml
groups:
  - name: Instance liveness alert rules
    rules:
      - alert: InstanceDown
        expr: up{job="CM"} == 0 or up{job="alibaba"} == 0 or up{job="K8S"} == 0 or up{job="CICD"} == 0 or up{job="test-service"} == 0 or up{job="dev-service"} == 0 or up{job="service"} == 0 or up{job="ELK"} == 0
        for: 1m
        labels:
          user: prometheus
          severity: Disaster
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "Instance {{ $labels.instance }} of hostname {{ $labels.nodename }} has been down for more than 1 minute."
          value: "{{ $value }}"
  - name: Memory alert rules
    rules:
      - alert: "HighMemoryUsage"
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          user: prometheus
          severity: warning
        annotations:
          summary: "Server {{ $labels.instance }} memory alert"
          description: "{{ $labels.instance }} memory utilization is above 80% (current value: {{ $value }}%)"
          value: "{{ $value }}"
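After editing prometheus.yml and the rule files, it is worth validating them before reloading Prometheus. The following is a rough sketch rather than the exact procedure used here: PROM_HOME is an assumed install path (the post only shows a directory named prometheus), and the reload relies on Prometheus re-reading its configuration when it receives SIGHUP.

```shell
#!/bin/sh
# Assumed install prefix; adjust to wherever the prometheus directory lives.
PROM_HOME="/usr/local/prometheus"

# Check the main config (including referenced rule files), then the rule files themselves.
validate_prometheus() {
    if [ ! -x "$PROM_HOME/promtool" ]; then
        echo "promtool not found in $PROM_HOME" >&2
        return 127
    fi
    "$PROM_HOME/promtool" check config "$PROM_HOME/prometheus.yml" &&
        "$PROM_HOME/promtool" check rules "$PROM_HOME"/rules/*.yml
}

if validate_prometheus; then
    # SIGHUP asks a running Prometheus to reload its configuration and rules.
    pkill -HUP -x prometheus || echo "no running prometheus process found" >&2
else
    echo "validation failed or skipped; Prometheus was not reloaded" >&2
fi
```

Once the reload succeeds, the new rules should appear on the Alerts page of the Prometheus web UI, and firing alerts show up in Alertmanager at port 9093.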