Prometheus Monitoring: Day 5

Article link: https://blog.youkuaiyun.com/qq_26825559/article/details/141532176

7.10 A Comprehensive Prometheus-Based Monitoring Platform: Which Alerting Rules Does an Enterprise Need?

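1. Introduction
Alerting rules in Prometheus let you define alert conditions as PromQL expressions. The Prometheus server evaluates these rules periodically; once a condition is met (and has held for the configured `for` duration, if any), the alert fires and is sent on for notification.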

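In an enterprise, Prometheus alerting rules are essential to keeping the business stable and reliable. The following dimensions should be considered:
Business dimension: different lines of business have different metrics and alerting rules. For a consumer-facing (ToC) platform, for example, you need to monitor order volume, inventory, and payment success rate to make sure the business is running normally.
Environment dimension: an enterprise usually runs several environments, such as development, testing, pre-production, and production. Because each environment behaves differently, each needs its own alerting rules.
Application dimension: different applications have different metrics and alerting rules. When monitoring a web application, for example, you need to watch the HTTP request failure rate, response time, and memory usage.
Infrastructure dimension: enterprise infrastructure includes servers, network devices, and storage devices. When monitoring infrastructure, you need to watch CPU usage, disk space, and network bandwidth.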
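To make the environment dimension concrete, here is a minimal sketch of one check split into two rules with different thresholds per environment. It assumes every target carries an `env` label with values such as `prod` and `test`; that label, the group name, the alert names, and the thresholds are illustrative assumptions rather than part of the original setup.

```yaml
groups:
  - name: env-threshold-example.rules   # hypothetical group name
    rules:
      # Production: lower threshold, shorter wait, higher severity.
      - alert: NodeFilesystemUsageProd
        expr: |
          100 - (node_filesystem_avail_bytes{env="prod"} / node_filesystem_size_bytes{env="prod"}) * 100 > 85
        for: 5m
        labels:
          severity: critical
      # Test: tolerate fuller disks and wait longer before alerting.
      - alert: NodeFilesystemUsageTest
        expr: |
          100 - (node_filesystem_avail_bytes{env="test"} / node_filesystem_size_bytes{env="test"}) * 100 > 95
        for: 30m
        labels:
          severity: warning
```

Keeping the environment in a label rather than in separate Prometheus servers is only one possible design; it lets a single rule file cover all environments while still routing on `severity`.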
2. Defining Alerting Rules
A typical alerting rule looks like this:

    groups:
      - name: general.rules
        rules:
          - alert: InstanceDown
            expr: |
              up{job=~"other-ECS|k8s-nodes|prometheus"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Instance {{ $labels.instance }} is down"
              description: "{{ $labels.instance }} (hostname: {{ $labels.hostname }}) has been down for more than 1 minute."

In a rule file, a set of related rules can be defined under one group.
Each group can contain multiple alerting rules. An alerting rule consists mainly of the following parts:

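alert: the name of the alerting rule.
expr: the alert condition, written as a PromQL expression. Every time series the expression returns becomes an alert.
for: an optional waiting period. The alert fires only after the condition has held continuously for this long; during the waiting period a newly triggered alert is in the pending state.
labels: custom labels, an additional set of labels to attach to the alert.
annotations: a set of additional information, such as text describing the alert in detail. When the alert fires, the annotations are sent to Alertmanager together with it.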
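For a rule file to take effect, Prometheus has to be told to load it. The following is a minimal sketch of the relevant part of `prometheus.yml`. The `rules/` directory and the Alertmanager address `localhost:9093` are assumptions made for illustration and do not come from the original setup.

```yaml
global:
  # How often rules are evaluated (1m is also the Prometheus default).
  evaluation_interval: 1m

# Rule files to load; glob patterns are allowed.
# The rules/ directory is a hypothetical location.
rule_files:
  - "rules/*.yml"

# Where firing alerts are sent. The address is an assumption.
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]
```

Before reloading Prometheus, a rule file can be validated with `promtool check rules <file>`, which reports YAML and PromQL syntax errors.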

3. Alerting Rules in the Enterprise
Adapt the following reference to your company's business scenarios:

https://samber.github.io/awesome-prometheus-alerts/rules#kubernetes

3.1 Node.rules

groups:
  - name: node.rules
    rules:
      - alert: NodeFilesystemUsage
        expr: |
          100 - (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: usage of partition {{ $labels.mountpoint }} is too high"
          description: "{{ $labels.instance }} (hostname: {{ $labels.hostname }}): usage of partition {{ $labels.mountpoint }} is above 85% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: |
          100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: memory usage is too high"
          description: "{{ $labels.instance }} (hostname: {{ $labels.hostname }}): memory usage is above 85% (current value: {{ $value }})"
      # Note: labels removed by an aggregation (here everything except instance)
      # are not available in templates unless they are added to the by (...) clause.
      - alert: NodeCPUUsage
        expr: |
          100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 85
        for: 10m
        labels:
          hostname: '{{ $labels.hostname }}'
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: CPU usage is too high"
          description: "{{ $labels.instance }} (hostname: {{ $labels.hostname }}): CPU usage is above 85% (current value: {{ $value }})"
      - alert: TCP_Estab
        expr: |
          node_netstat_Tcp_CurrEstab > 5500
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: too many established TCP connections"
          description: "{{ $labels.instance }} (hostname: {{ $labels.hostname }}): too many established TCP connections (current value: {{ $value }})"
      - alert: TCP_TIME_WAIT
        expr: |
          node_sockstat_TCP_tw > 3000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: too many TCP connections in TIME_WAIT"
          description: "{{ $labels.instance }} (hostname: {{ $labels.hostname }}): too many TCP connections in TIME_WAIT (current value: {{ $value }})"
      - alert: TCP_Sockets
        expr: |
          node_sockstat_sockets_used > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }}: too many sockets in use"
          description: "{{ $labels.instance }} (hostname: {{ $labels.hostname }}): too many sockets in use (current value: {{ $value }})"
      - alert: KubeNodeNotReady
        expr: |
          kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          description: '{{ $labels.node }} has been NotReady for 1 minute.'
      - alert: KubernetesMemoryPressure
        expr: kube_node_status_condition{condition="MemoryPressure",status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Kubernetes memory pressure (instance {{ $labels.instance }})"
          description: "{{ $labels.node }} has MemoryPressure condition VALUE = {{ $value }}"
      - alert: KubernetesDiskPressure
        expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Kubernetes disk pressure (instance {{ $labels.instance }})"
          description: "{{ $labels.node }} has DiskPressure condition."
      - alert: KubernetesContainerOomKiller
        expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Kubernetes container oom killer (instance {{ $labels.instance }})"
          description: "{{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes."
      - alert: KubernetesJobFailed
        expr: kube_job_status_failed > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Kubernetes Job failed (instance {{ $labels.instance }})"
          description: "{{ $labels.namespace }}/{{ $labels.job_name }} failed to complete."
      - alert: UnusualDiskReadRate
        expr: |
          sum by (job,instance) (irate(node_disk_read_bytes_total[5m])) / 1024 / 1024 > 140
        for: 5m
        labels:
          severity: critical
          hostname: '{{ $labels.hostname }}'
        annotations:
          description: '{{ $labels.instance }} (hostname: {{ $labels.hostname }}): disk read throughput above 140 MB/s for 5 minutes (current value: {{ $value }}). Alibaba Cloud ESSD PL0 max throughput is 180 MB/s, PL1 max is 350 MB/s'
      - alert: UnusualDiskWriteRate
        expr: |
          sum by (job,instance) (irate(node_disk_written_bytes_total[5m])) / 1024 / 1024 > 140
        for: 5m
        labels:
          severity: critical
          hostname: '{{ $labels.hostname }}'
        annotations:
          description: '{{ $labels.instance }} (hostname: {{ $labels.hostname }}): disk write throughput above 140 MB/s for 5 minutes (current value: {{ $value }}). Alibaba Cloud ESSD PL0 max throughput is 180 MB/s, PL1 max is 350 MB/s'
      - alert: UnusualNetworkThroughputIn
        expr: |
          sum by (job,instance) (irate(node_network_receive_bytes_total{job=~"aws-hk-monitor|k8s-nodes"}[5m])) / 1024 / 1024 > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          description: '{{ $labels.instance }} (hostname: {{ $labels.hostname }}): inbound network throughput above 80 MB/s for 5 minutes (current value: {{ $value }})'
      - alert: UnusualNetworkThroughputOut
        expr: |
          sum by (job,instance) (irate(node_network_transmit_bytes_total{job=~"aws-hk-monitor|k8s-nodes"}[5m])) / 1024 / 1024 > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          description: '{{ $labels.instance }} (hostname: {{ $labels.hostname }}): outbound network throughput above 80 MB/s for 5 minutes (current value: {{ $value }})'
      - alert: SystemdServiceCrashed
        expr: |
          node_systemd_unit_state{state="failed"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          description: '{{ $labels.instance }} (hostname: {{ $labels.hostname }}): service {{ $labels.name }} has been in a failed state for 5 minutes; please handle it promptly'
      - alert: HostDiskWillFillIn24Hours
        expr: (node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Host disk will fill in 24 hours (instance {{ $labels.instance }})"
          description: "{{ $labels.instance }} (hostname: {{ $labels.hostname }}): at the current write rate, the filesystem is predicted to run out of space within the next 24 hours!"
      - alert: HostOutOfInodes
        expr: node_filesystem_files_free / node_filesystem_files * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0
        for: 2m
        labels: