Prometheus alerting with Alertmanager and WeChat Work

Article link: https://blog.youkuaiyun.com/weixin_50801368/article/details/115672097

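This article describes how to install and configure Alertmanager, the Prometheus alerting component, and walks through setting up WeChat Work (企业微信) alerts: the Alertmanager systemd unit, its configuration file, the alert template, and the matching Prometheus configuration changes, so that anomalies are reported promptly through WeChat Work.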


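An earlier post covered installing and configuring Prometheus and Grafana: https://blog.youkuaiyun.com/weixin_50801368/article/details/115668747?spm=1001.2014.3001.5502

Prometheus evaluates the alert rules, but sending notifications requires Alertmanager, so it must be installed first.

Install Alertmanager: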

wget https://github.com/prometheus/alertmanager/releases/download/v0.21.0/alertmanager-0.21.0.linux-amd64.tar.gz
tar -xvf alertmanager-0.21.0.linux-amd64.tar.gz -C /usr/local/
cd /usr/local/
mv alertmanager-0.21.0.linux-amd64 alertmanager

systemd unit for starting and stopping Alertmanager

vim /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager


[Service]
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data --web.listen-address=:9093 --data.retention=120h
Restart=on-failure

[Install]
WantedBy=multi-user.target

Alertmanager configuration file

global:
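  resolve_timeout: 1m   # an alert without an end time is treated as resolved if it is not updated within 1 minute
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/'
  wechat_api_corp_id: ''      # corp ID of the WeChat Work organization
  wechat_api_secret: ''      # Secret of the WeChat Work application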

templates:
  - '/usr/local/alertmanager/template/*.tmpl'

route:
  receiver: 'wechat'
  group_by: ['env','instance','type','group','job','alertname']
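  group_wait: 10s       # delay before the first notification for a new alert group
  group_interval: 10s   # wait before notifying about new alerts added to a group that was already notified
  repeat_interval: 3m   # wait before re-sending a notification for alerts that are still firing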

receivers:
- name: 'wechat'
  wechat_configs:
  - send_resolved: true
    message: '{{ template "wechat.default.message" . }}'
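    to_party: '27'         # ID of the WeChat Work department that receives alerts (here, the "alert bot" department)
    agent_id: '1000010'     # ID of the application created in WeChat Work
    api_secret: '' # Secret of the WeChat Work application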
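
Before starting Alertmanager, it can help to confirm that WeChat Work accepts the corp ID and application Secret. The sketch below assumes the standard `gettoken` endpoint under the same base URL as `wechat_api_url`; the function names and the placeholder credentials are illustrative and not part of the original setup.

```python
import json
import urllib.parse
import urllib.request

WECHAT_API_URL = "https://qyapi.weixin.qq.com/cgi-bin/"  # same value as wechat_api_url


def build_gettoken_url(corp_id: str, secret: str, api_url: str = WECHAT_API_URL) -> str:
    """Build the access-token URL from the corp ID and the application Secret."""
    query = urllib.parse.urlencode({"corpid": corp_id, "corpsecret": secret})
    return urllib.parse.urljoin(api_url, "gettoken") + "?" + query


def credentials_ok(corp_id: str, secret: str) -> bool:
    """Return True if WeChat Work answers with errcode 0 (needs network access)."""
    with urllib.request.urlopen(build_gettoken_url(corp_id, secret), timeout=10) as resp:
        body = json.load(resp)
    return body.get("errcode") == 0


if __name__ == "__main__":
    # Placeholder values (assumptions); substitute the real corp ID and Secret.
    print(build_gettoken_url("my-corp-id", "my-secret"))
```

Calling `credentials_ok(...)` with the real values from the configuration gives a quick yes/no answer without waiting for an alert to fire.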

The configuration above sends alerts to WeChat Work.

Reference: https://www.cnblogs.com/miaocbin/p/13706164.html

systemctl start alertmanager

systemctl enable alertmanager

Open http://<ip>:9093 in a browser to reach the Alertmanager web UI.
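
Once Alertmanager is running, the WeChat Work route can be exercised by hand without waiting for Prometheus. The following is a minimal sketch that posts one alert to Alertmanager's `/api/v2/alerts` endpoint; the base URL, the alert name, and the instance label are assumptions chosen for illustration.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone

ALERTMANAGER_URL = "http://localhost:9093"  # assumption: replace with <ip>:9093


def build_test_alert(now: datetime, minutes: int = 5) -> list:
    """Build a one-alert payload for POST /api/v2/alerts."""
    return [{
        "labels": {
            "alertname": "ManualTestAlert",  # hypothetical alert name
            "instance": "test-host:9100",    # hypothetical instance
            "severity": "warning",
        },
        "annotations": {
            "summary": "Manual test alert",
            "description": "Sent by hand to check the WeChat Work route.",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
    }]


def send_alert(payload: list, base_url: str = ALERTMANAGER_URL) -> int:
    """POST the payload to Alertmanager and return the HTTP status (needs network)."""
    req = urllib.request.Request(
        base_url + "/api/v2/alerts",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status


if __name__ == "__main__":
    # Only prints the payload; call send_alert(...) to actually deliver it.
    print(json.dumps(build_test_alert(datetime.now(timezone.utc)), indent=2))
```

Because the payload carries an `endsAt` a few minutes ahead, the test alert should resolve on its own, which also exercises `send_resolved: true`.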

Alert template

[root@localhost alertmanager]# ls
alertmanager  alertmanager1.yml  alertmanager.yml  amtool  data  LICENSE  NOTICE  template
[root@localhost alertmanager]# vim template/wechat.tmpl
{{ define "wechat.default.message" }}
{{- if gt (len .Alerts.Firing) 0 -}}
{{- range $index, $alert := .Alerts.Firing -}}
{{- if eq $index 0 }}
========= Monitoring Alert =========
Status: {{ $alert.Status }}
Severity: {{ $alert.Labels.severity }}
Alert name: {{ $alert.Labels.alertname }}
Host: {{ $alert.Labels.instance }}
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }};
Trigger value: {{ $alert.Annotations.value }}
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
========= = end =  =========
{{- end }}
{{- end }}
{{- end }}
{{- if gt (len .Alerts.Resolved) 0 -}}
{{- range $index, $alert := .Alerts.Resolved -}}
{{- if eq $index 0 }}
========= Recovered =========
Alert name: {{ $alert.Labels.alertname }}
Status: {{ $alert.Status }}
Host: {{ $alert.Labels.instance }}
Summary: {{ $alert.Annotations.summary }}
Details: {{ $alert.Annotations.message }}{{ $alert.Annotations.description }};
Started at: {{ ($alert.StartsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
Resolved at: {{ ($alert.EndsAt.Add 28800e9).Format "2006-01-02 15:04:05" }}
{{- if gt (len $alert.Labels.instance) 0 }}
Instance: {{ $alert.Labels.instance }}
{{- end }}
========= = end =  =========
{{- end }}
{{- end }}
{{- end }}
{{- end }}
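
The template shifts each timestamp with `.Add 28800e9` before formatting it. The value is a duration in nanoseconds, so it adds 8 hours to the UTC time that Alertmanager reports (UTC+8 local time). As a small illustration of that arithmetic, here is the same conversion in Python; the function name and the sample timestamp are made up for the example.

```python
from datetime import datetime, timedelta, timezone

OFFSET_NS = 28800e9  # the constant passed to .Add in the template, in nanoseconds


def to_local_string(starts_at_utc: datetime, offset_ns: float = OFFSET_NS) -> str:
    """Shift a UTC timestamp by offset_ns and format it like the Go layout
    "2006-01-02 15:04:05"."""
    shifted = starts_at_utc + timedelta(seconds=offset_ns / 1e9)
    return shifted.strftime("%Y-%m-%d %H:%M:%S")


if __name__ == "__main__":
    print(timedelta(seconds=OFFSET_NS / 1e9))  # 8:00:00
    # Hypothetical alert start time at 18:30 UTC
    print(to_local_string(datetime(2021, 4, 13, 18, 30, 0, tzinfo=timezone.utc)))
    # 2021-04-14 02:30:00
```

For a different time zone, the constant in the template would change accordingly, for example `32400e9` for UTC+9.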

Update the Prometheus configuration file

alerting:
  alertmanagers:
  - static_configs:
    - targets:
        - 192.168.1.163:9093
rule_files:
  - "rules/*_rules.yml"
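  - "rules/*_alerts.yml"
# Add the alerting and rule_files sections above to prometheus.yml.

# Under scrape_configs, add a job that scrapes Alertmanager itself: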
  - job_name: 'alertmanager'
    static_configs:
    - targets: ['192.168.1.163:9093']
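
After restarting or reloading Prometheus, its HTTP API can confirm that the Alertmanager target was picked up. The sketch below reads `/api/v1/alertmanagers` and lists the active endpoints. The Prometheus address is an assumption, and the sample response is hand-written to show the expected shape, not captured output.

```python
import json
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: default Prometheus listen address


def active_alertmanager_urls(api_response: dict) -> list:
    """Extract the URLs of active Alertmanagers from an /api/v1/alertmanagers response."""
    data = api_response.get("data", {})
    return [entry["url"] for entry in data.get("activeAlertmanagers", [])]


def fetch_alertmanagers(base_url: str = PROMETHEUS_URL) -> dict:
    """Query Prometheus for its known Alertmanagers (needs network access)."""
    with urllib.request.urlopen(base_url + "/api/v1/alertmanagers", timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hand-written sample in the shape the API returns (illustrative only).
    sample = {
        "status": "success",
        "data": {
            "activeAlertmanagers": [{"url": "http://192.168.1.163:9093/api/v2/alerts"}],
            "droppedAlertmanagers": [],
        },
    }
    print(active_alertmanager_urls(sample))
```

An empty list from `active_alertmanager_urls(fetch_alertmanagers())` would suggest that the `alerting` section was not loaded or the target is unreachable.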

Add alert rules

[root@localhost prometheus]# ls
console_libraries  consoles  LICENSE  NOTICE  prometheus  prometheus.yml  promtool  rules
[root@localhost prometheus]# vim rules/node_alerts.yml
groups:
- name: instance liveness alert rules
  rules:
  - alert: InstanceDown
    expr: up{job=~"CM|alibaba|K8S|CICD|test-service|dev-service|service|ELK"} == 0
    for: 1m
    labels:
      user: prometheus
      severity: Disaster
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "Instance {{ $labels.instance }} of hostname {{ $labels.nodename }} has been down for more than 1 minute."
      value: "{{ $value }}"

- name: memory alert rules
  rules:
  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes )) / node_memory_MemTotal_bytes * 100 > 80
    for: 1m
    labels:
      user: prometheus
      severity: warning
    annotations:
      summary: "Server {{ $labels.instance }} memory alert"
      description: "{{ $labels.instance }} memory utilization is above 80% (current value: {{ $value }}%)"
      value: "{{ $value }}"
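
To make the memory rule concrete, the expression computes used memory as total minus free, buffers, and cached, then divides by total. The following sketch reproduces that arithmetic with made-up byte counts (the sample numbers are assumptions, not measurements) and applies the same 80% threshold.

```python
GIB = 1024 ** 3
THRESHOLD_PERCENT = 80  # same threshold as the alert rule


def memory_usage_percent(total: int, free: int, buffers: int, cached: int) -> float:
    """Mirror the rule: (MemTotal - (MemFree + Buffers + Cached)) / MemTotal * 100."""
    return (total - (free + buffers + cached)) / total * 100


def would_fire(total: int, free: int, buffers: int, cached: int) -> bool:
    """True when the usage exceeds the threshold, as in the rule's '> 80'."""
    return memory_usage_percent(total, free, buffers, cached) > THRESHOLD_PERCENT


if __name__ == "__main__":
    # Hypothetical 16 GiB host with 1 GiB free, 1 GiB buffers, 2 GiB cached
    print(memory_usage_percent(16 * GIB, 1 * GIB, 1 * GIB, 2 * GIB))  # 75.0
    print(would_fire(16 * GIB, 1 * GIB, 1 * GIB, 2 * GIB))            # False
    # Same host with 1 GiB free, no buffers, 1 GiB cached
    print(memory_usage_percent(16 * GIB, 1 * GIB, 0, 1 * GIB))        # 87.5
    print(would_fire(16 * GIB, 1 * GIB, 0, 1 * GIB))                  # True
```

The rule additionally requires the condition to hold for 1 minute (`for: 1m`) before the alert moves from pending to firing.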