Monitoring Ingress-Nginx with Kube-Prometheus

1. Modify the Ingress-Nginx YAML files

1. Modify the Ingress-Nginx Service

apiVersion: v1
kind: Service
metadata:
 annotations:
   prometheus.io/scrape: "true"
   prometheus.io/port: "10254"
..
spec:
  ports:
  - name: prometheus
    port: 10254
    targetPort: 10254
..

The complete Service after the change:
apiVersion: v1
kind: Service
metadata:
  annotations:
    prometheus.io/port: "10254"
    prometheus.io/scrape: "true"
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
    app.kubernetes.io/version: 1.3.1
  name: ingress-nginx-controller
  namespace: ingress-nginx
spec:
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - appProtocol: http
    name: http
    port: 80
    protocol: TCP
    targetPort: http
  - appProtocol: https
    name: https
    port: 443
    protocol: TCP
    targetPort: https
  - name: prometheus
    port: 10254
    targetPort: 10254
  selector:
    app.kubernetes.io/component: controller
    app.kubernetes.io/instance: ingress-nginx
    app.kubernetes.io/name: ingress-nginx
  type: NodePort
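Instead of editing the manifest by hand, the same port can be appended with a JSON patch. This is a sketch under the assumption that the Service shown above already exists with that name and namespace:

```shell
# Append the prometheus port to the existing Service's port list.
kubectl -n ingress-nginx patch svc ingress-nginx-controller --type=json \
  -p='[{"op":"add","path":"/spec/ports/-","value":{"name":"prometheus","port":10254,"targetPort":10254}}]'
```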

2. Modify the Ingress-Nginx Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
..
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "10254"
    spec:
      containers:
      - name: controller
        ports:
        - name: prometheus
          containerPort: 10254
        ..
      
Re-apply the manifest so the changes take effect:
kubectl apply -f ingress-deploy.yml 

3. Verify the changes

kubectl get po,svc -n ingress-nginx
NAME                                       READY   STATUS      RESTARTS   AGE
pod/ingress-nginx-admission-create-jf247   0/1     Completed   0          4d
pod/ingress-nginx-admission-patch-tn57z    0/1     Completed   0          4d
pod/ingress-nginx-controller-4hv44         1/1     Running     0          21m
pod/ingress-nginx-controller-94z5v         1/1     Running     0          21m
pod/ingress-nginx-controller-vsplw         1/1     Running     0          22m

NAME                                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                      AGE
service/ingress-nginx-controller             NodePort    10.99.244.219   <none>        80:30993/TCP,443:31984/TCP,10254:31795/TCP   4d
service/ingress-nginx-controller-admission   ClusterIP   10.110.65.196   <none>        443/TCP                                      4d

curl 127.0.0.1:10254/metrics		# run on a node where an Ingress controller pod is scheduled, to confirm the metrics endpoint responds
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 2.4853e-05
go_gc_duration_seconds{quantile="0.25"} 5.7798e-05
go_gc_duration_seconds{quantile="0.5"} 7.5043e-05
go_gc_duration_seconds{quantile="0.75"} 9.8753e-05
go_gc_duration_seconds{quantile="1"} 0.001074475
go_gc_duration_seconds_sum 0.010298983
go_gc_duration_seconds_count 104
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 100
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.18.2"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 1.2434328e+07
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 8.21745944e+08
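Beyond the Go runtime series shown above, the controller exposes its own `nginx_ingress_controller_*` metrics, which the alert rules in section 3 build on. Once the target is scraped, a query like this (a sketch; label names assume a recent ingress-nginx release) shows per-host traffic:

```promql
# Requests per second, broken down by host and HTTP status code
sum(rate(nginx_ingress_controller_requests[5m])) by (host, status)
```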

2. Add an Ingress ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ingress-nginx
  namespace: monitoring
spec:
  endpoints:
  - interval: 15s
    port: prometheus
  namespaceSelector:
    matchNames:
    - ingress-nginx
  selector:
    matchLabels:
      app.kubernetes.io/component: controller
      app.kubernetes.io/instance: ingress-nginx
      app.kubernetes.io/name: ingress-nginx
      app.kubernetes.io/part-of: ingress-nginx
      app.kubernetes.io/version: 1.3.1
---
# Create a Role in the namespace being monitored
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: prometheus-k8s
  namespace: ingress-nginx
rules:
- apiGroups:
  - ""
  resources:
  - services
  - endpoints
  - pods
  verbs:
  - get
  - list
  - watch
---
# Bind the prometheus-k8s ServiceAccount to the Role above
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: prometheus-k8s
  namespace: ingress-nginx
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: prometheus-k8s
subjects:
- kind: ServiceAccount
  name: prometheus-k8s # ServiceAccount used by the Prometheus pods; kube-prometheus uses prometheus-k8s by default
  namespace: monitoring
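After applying the ServiceMonitor and RBAC objects, the new target should show up in Prometheus within a scrape interval or two. One way to check from the command line (a sketch; the `prometheus-k8s` Service and local port 9090 are kube-prometheus defaults):

```shell
kubectl -n monitoring port-forward svc/prometheus-k8s 9090:9090 &
curl -s http://127.0.0.1:9090/api/v1/targets | grep ingress-nginx
```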

3. Add alerting rules

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: nginx-ingress-rules
  namespace: monitoring
spec:
  groups:
  - name: nginx-ingress-rules
    rules:
    - alert: NginxFailedtoLoadConfiguration
      expr: nginx_ingress_controller_config_last_reload_successful == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Nginx Ingress Controller failed to load its configuration"
        description: "The Nginx Ingress Controller could not reload its configuration; check the configuration for errors."
    - alert: NginxHighHttp4xxErrorRate
      expr: sum(rate(nginx_ingress_controller_requests{status=~"^4.."}[5m])) by (host, exported_namespace) / sum(rate(nginx_ingress_controller_requests[5m])) by (host, exported_namespace) * 100 > 1
      for: 1m
      labels:
        severity: warning
      annotations:
        description: Nginx high HTTP 4xx error rate ( namespaces {{ $labels.exported_namespace }} host {{ $labels.host }} )
        summary: "Too many HTTP requests with status 4xx (> 1%)"
    - alert: NginxHighHttp5xxErrorRate
      expr: sum(rate(nginx_ingress_controller_requests{status=~"^5.."}[5m])) by (host, exported_namespace) / sum(rate(nginx_ingress_controller_requests[5m])) by (host, exported_namespace) * 100 > 1
      for: 1m
      labels:
        severity: warning
      annotations:
        description: Nginx high HTTP 5xx error rate ( namespaces {{ $labels.exported_namespace }} host {{ $labels.host }} )
        summary: "Too many HTTP requests with status 5xx (> 1%)"
    - alert: NginxLatencyHigh
      expr: histogram_quantile(0.99, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[2m])) by (host, node)) > 3
      for: 2m
      labels:
        severity: warning
      annotations:
        description: Nginx latency high ( namespaces {{ $labels.exported_namespace }} host {{ $labels.host }} )
        summary: "Nginx p99 latency is higher than 3 seconds"
    - alert: NginxHighRequestRate
      expr: rate(nginx_ingress_controller_nginx_process_requests_total[5m]) > 1000
      for: 1m
      labels:
        severity: warning
      annotations:
        description: Nginx ingress controller high request rate ( instance {{ $labels.instance }} namespaces {{ $labels.namespaces }} pod {{$labels.pod}})
        summary: "Nginx ingress controller high request rate (> 1000 requests per second)"
    - alert: SSLCertificateExpiration15day
      expr: nginx_ingress_controller_ssl_expire_time_seconds - time() < 1296000 # 15 days in seconds
      for: 30m
      labels:
        severity: warning
      annotations:
        summary: SSL/TLS certificate for {{ $labels.host }} (secret {{ $labels.secret_name }}) is about to expire
        description: The SSL/TLS certificate for {{ $labels.host }} (secret {{ $labels.secret_name }}) will expire in less than 15 days.
    - alert: SSLCertificateExpiration7day
      expr: nginx_ingress_controller_ssl_expire_time_seconds - time() < 604800 # 7 days in seconds
      for: 30m
      labels:
        severity: critical
      annotations:
        summary: SSL/TLS certificate for {{ $labels.host }} (secret {{ $labels.secret_name }}) is about to expire
        description: The SSL/TLS certificate for {{ $labels.host }} (secret {{ $labels.secret_name }}) will expire in less than 7 days.
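The two certificate-expiry thresholds are plain seconds: 1296000 is exactly 15 days and 604800 is exactly 7 days. A quick sanity check:

```python
# Verify the expiry thresholds used in the two SSL alerts above.
fifteen_days = 15 * 24 * 3600
seven_days = 7 * 24 * 3600
print(fifteen_days, seven_days)  # 1296000 604800
```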

4. Import Grafana dashboards

Ingress-nginx dashboard IDs: 9614 and 14314
