# Monitoring ingress-nginx: A Hands-On Guide with Prometheus and Grafana

## Overview

In modern Kubernetes clusters, ingress-nginx is one of the most popular ingress controllers and carries critical traffic-management responsibilities. A solid monitoring setup not only gives operations teams real-time visibility into system state, it also speeds up locating and resolving problems. This article walks through building a monitoring stack for ingress-nginx, using Prometheus and Grafana for end-to-end metrics collection and visualization.
## Monitoring Architecture

The ingress-nginx monitoring stack follows the standard cloud-native pattern: the controller exposes Prometheus metrics on port 10254, Prometheus discovers and scrapes the controller pods, and Grafana visualizes the stored time series.
### Core Metric Categories

| Metric category | Key metric | Purpose |
|---|---|---|
| Request traffic | nginx_ingress_controller_requests | Track request throughput and performance |
| Connections | nginx_ingress_controller_nginx_process_connections | Track concurrent connections |
| Response status | HTTP status code distribution | Spot errors and anomalies |
| Resource usage | CPU / memory usage | Capacity planning |
| Configuration | Config reload status | Verify configuration changes apply cleanly |
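To make the first table row concrete, here is a small Python sketch of what the raw `nginx_ingress_controller_requests` counter looks like in Prometheus exposition format, and how responses can be tallied by status class. The scrape output below is invented for illustration; real label sets and values will differ.

```python
import re
from collections import Counter

# Invented sample of /metrics output; real label sets and values will differ.
SAMPLE = """\
nginx_ingress_controller_requests{ingress="web",status="200"} 1520
nginx_ingress_controller_requests{ingress="web",status="404"} 12
nginx_ingress_controller_requests{ingress="api",status="200"} 980
nginx_ingress_controller_requests{ingress="api",status="502"} 7
"""

LINE = re.compile(r'^nginx_ingress_controller_requests\{(.*)\} (\d+)$')

def status_classes(metrics_text):
    """Tally request counts by status class (2xx/4xx/5xx)."""
    totals = Counter()
    for line in metrics_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        labels = {}
        for kv in m.group(1).split(","):
            k, v = kv.split("=", 1)
            labels[k] = v.strip('"')
        totals[labels["status"][0] + "xx"] += int(m.group(2))
    return dict(totals)

print(status_classes(SAMPLE))  # {'2xx': 2500, '4xx': 12, '5xx': 7}
```

In practice PromQL does this grouping for you (`by (status)`), but seeing the raw counter makes the later queries easier to read.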
## Environment Setup and Deployment

### 1. Install the ingress-nginx Controller

First, deploy the ingress-nginx controller into the Kubernetes cluster:
```yaml
# deploy.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: ingress-nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
  labels:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
      app.kubernetes.io/part-of: ingress-nginx
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ingress-nginx
        app.kubernetes.io/part-of: ingress-nginx
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "10254"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: nginx-ingress-controller
          image: registry.k8s.io/ingress-nginx/controller:v1.13.2
          args:
            - /nginx-ingress-controller
            - --election-id=ingress-controller-leader
            - --ingress-class=nginx
            - --configmap=$(POD_NAMESPACE)/ingress-nginx-controller
            - --report-node-internal-ip-address
            - --metrics-per-host=false
          env:
            # POD_NAMESPACE is expanded in the args above via $(POD_NAMESPACE)
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          ports:
            - name: http
              containerPort: 80
            - name: https
              containerPort: 443
            - name: metrics
              containerPort: 10254
          livenessProbe:
            httpGet:
              path: /healthz
              port: 10254
              scheme: HTTP
          readinessProbe:
            httpGet:
              path: /healthz
              port: 10254
              scheme: HTTP
```
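The annotation-based scrape configuration used in the next step discovers pods directly, so a Service is not strictly required for scraping. Still, a small Service for the metrics port makes ad-hoc checks with `kubectl port-forward` easier; the names below are illustrative:

```yaml
# metrics-service.yaml (optional; names are illustrative)
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-metrics
  namespace: ingress-nginx
spec:
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: metrics
      port: 10254
      targetPort: 10254
```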
### 2. Deploy Prometheus

Create a Prometheus configuration that scrapes the ingress-nginx metrics:
```yaml
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'ingress-nginx'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - ingress-nginx
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
```
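The last relabel rule is the subtle one: Prometheus joins the two source labels with `;` and anchors the regex to the full string, so any port already present in `__address__` is replaced by the port from the `prometheus.io/port` annotation. A small Python sketch of that rewrite:

```python
import re

# Same regex as the relabel rule; Prometheus anchors it to the full string.
RELABEL_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def rewrite_address(address, port_annotation):
    """Mimic the __address__ relabel: keep the host, swap in the annotated port."""
    joined = f"{address};{port_annotation}"  # source_labels joined by ";"
    m = RELABEL_RE.fullmatch(joined)
    return f"{m.group(1)}:{m.group(2)}" if m else address

print(rewrite_address("10.0.1.5:80", "10254"))  # 10.0.1.5:10254
print(rewrite_address("10.0.1.5", "10254"))     # 10.0.1.5:10254
```

Either way, the scrape target ends up pointing at the metrics port (10254) rather than the pod's serving port.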
### 3. Deploy Grafana

Create the Grafana deployment and import the official dashboard:
```yaml
# grafana-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:latest  # pin a specific version in production
          ports:
            - containerPort: 3000
          volumeMounts:
            - mountPath: /var/lib/grafana
              name: grafana-storage
            - mountPath: /etc/grafana/provisioning/dashboards
              name: grafana-dashboards
      volumes:
        - name: grafana-storage
          emptyDir: {}
        - name: grafana-dashboards
          configMap:
            name: grafana-dashboards
```
## Key Metrics Explained

### Request Metrics
```promql
# Total request rate
sum(rate(nginx_ingress_controller_requests[2m]))

# Success rate (responses that are not 4xx/5xx)
sum(rate(nginx_ingress_controller_requests{status!~"[4-5].*"}[2m])) /
sum(rate(nginx_ingress_controller_requests[2m]))

# Request rate per Ingress
sum(rate(nginx_ingress_controller_requests[2m])) by (ingress)

# 5xx error rate
sum(rate(nginx_ingress_controller_requests{status=~"5.."}[2m])) /
sum(rate(nginx_ingress_controller_requests[2m]))
```
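The success-rate and error-rate queries above are plain ratio arithmetic over per-status rates. A Python sketch with made-up rate values:

```python
# Hypothetical per-second request rates by status code, like the
# per-series results of rate(nginx_ingress_controller_requests[2m]).
rates = {"200": 48.0, "301": 1.5, "404": 0.4, "502": 0.1}

total = sum(rates.values())
# status!~"[4-5].*"  ->  everything that is not 4xx/5xx
success = sum(v for s, v in rates.items() if s[0] not in "45")
# status=~"5.."      ->  5xx only
errors = sum(v for s, v in rates.items() if s.startswith("5"))

success_rate = success / total  # 49.5 / 50.0 = 0.99
error_rate = errors / total     # 0.1 / 50.0 = 0.002
print(success_rate, error_rate)
```

Note that with these definitions "success" includes 3xx redirects; adjust the status regex if your service treats them differently.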
### Performance and Resource Metrics

```promql
# Active NGINX connections
nginx_ingress_controller_nginx_process_connections{state="active"}

# 95th percentile request latency
histogram_quantile(0.95,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[2m])) by (le))

# Memory usage
nginx_ingress_controller_nginx_process_resident_memory_bytes

# CPU usage
rate(nginx_ingress_controller_nginx_process_cpu_seconds_total[2m])
```
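`histogram_quantile` can look opaque. Under stated assumptions (cumulative buckets sorted by `le`, linear interpolation inside the target bucket, which is roughly what Prometheus does) it reduces to a few lines:

```python
import math

def histogram_quantile(q, buckets):
    """Approximate Prometheus histogram_quantile.

    buckets: list of (upper_bound, cumulative_count) sorted by bound,
    with the last bound == +inf (the le="+Inf" bucket)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                return prev_bound  # quantile falls in the open-ended bucket
            # linear interpolation within this bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Invented latency buckets (seconds): the p95 lands in the 0.5s-1.0s bucket.
buckets = [(0.1, 500), (0.5, 900), (1.0, 990), (math.inf, 1000)]
print(round(histogram_quantile(0.95, buckets), 3))  # 0.778
```

This also explains why the quantile is only as precise as the bucket boundaries: the 0.778 above is an interpolated estimate, not an observed latency.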
### Configuration Metrics

```promql
# Config reload status (1 = last reload succeeded)
nginx_ingress_controller_config_last_reload_successful

# Number of config reloads in the last hour
changes(nginx_ingress_controller_config_last_reload_successful_timestamp_seconds[1h])
```
## Building the Grafana Dashboard

### Core Dashboard Panels

The official ingress-nginx Grafana dashboard ships with a rich set of panels, including:

1. **Controller overview**
   - Real-time request volume
   - Connection counts
   - Success rate
   - Config reload status
2. **Per-Ingress view**
   - Request traffic per Ingress
   - Success rate per Ingress
   - Upstream service health
3. **Resource usage**
   - CPU and memory trends
   - Network I/O pressure
   - Disk usage
### Custom Alerting Rules

Create a Prometheus alerting rules file:

```yaml
# alert-rules.yaml
groups:
  - name: ingress-nginx-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress) /
          sum(rate(nginx_ingress_controller_requests[5m])) by (ingress) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.ingress }}"
          description: "Error rate is above 5% for more than 10 minutes"
      - alert: ConfigReloadFailure
        expr: nginx_ingress_controller_config_last_reload_successful == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "NGINX configuration reload failed"
          description: "NGINX configuration has failed to reload for 5 minutes"
      - alert: HighRequestLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le)) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High request latency detected"
          description: "95th percentile request latency is above 1 second"
```
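The `HighErrorRate` expression divides per-ingress 5xx rates by per-ingress totals and compares against 5%. A Python sketch of that check (sample rates invented; the `for: 10m` hold duration is not modeled):

```python
def firing_ingresses(rates_5xx, rates_total, threshold=0.05):
    """Return the ingresses whose 5xx ratio exceeds the threshold."""
    return sorted(
        ingress
        for ingress, total in rates_total.items()
        if total > 0 and rates_5xx.get(ingress, 0.0) / total > threshold
    )

rates_5xx = {"web": 0.3, "api": 6.0}      # req/s of 5xx responses
rates_total = {"web": 30.0, "api": 50.0}  # req/s of all responses
print(firing_ingresses(rates_5xx, rates_total))  # ['api']
```

Here `web` is at 1% errors and stays quiet, while `api` is at 12% and would fire once the condition has held for 10 minutes.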
## Step-by-Step Deployment

### Step 1: Create the monitoring namespace

```shell
kubectl create namespace monitoring
```
### Step 2: Deploy Prometheus

```shell
# Create the Prometheus configuration
kubectl apply -f prometheus-config.yaml -n monitoring

# Deploy the Prometheus server
kubectl apply -f prometheus-deployment.yaml -n monitoring
```
### Step 3: Deploy Grafana

```shell
# Create the dashboard ConfigMap
kubectl create configmap grafana-dashboards \
  --from-file=nginx.json=./nginx-dashboard.json \
  -n monitoring

# Deploy Grafana
kubectl apply -f grafana-deployment.yaml -n monitoring
```
### Step 4: Configure the data source and dashboard

Configure the Prometheus data source in the Grafana UI:

- URL: http://prometheus:9090
- Access: Server (default)

Then import the official ingress-nginx dashboard, ID 9614.
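Alternatively, the data source can be provisioned from a file instead of clicking through the UI. A minimal sketch (the file path matches Grafana's default provisioning directory; the `url` assumes the in-cluster Prometheus Service from this guide):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy  # "Server" access mode in the UI
    url: http://prometheus:9090
    isDefault: true
```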
### Step 5: Verify the monitoring stack

```shell
# Check Prometheus target status
kubectl port-forward svc/prometheus 9090:9090 -n monitoring
# then open http://localhost:9090/targets

# Check the Grafana dashboard
kubectl port-forward svc/grafana 3000:3000 -n monitoring
# then open http://localhost:3000
```
## Advanced Monitoring Techniques

### 1. Multi-Dimensional Analysis

Use labels to slice metrics along several dimensions:

```promql
# By namespace
sum(rate(nginx_ingress_controller_requests[2m])) by (exported_namespace)

# By backend service
sum(rate(nginx_ingress_controller_requests[2m])) by (service)

# By status code
sum(rate(nginx_ingress_controller_requests[2m])) by (status)
```
### 2. Performance Tuning Metrics

```promql
# Cache hit ratio (assumes a cache-status label added via custom configuration;
# the default nginx_ingress_controller_requests metric does not expose one)
sum(rate(nginx_ingress_controller_requests{cache="HIT"}[2m])) /
sum(rate(nginx_ingress_controller_requests[2m]))

# 95th percentile upstream response time
histogram_quantile(0.95,
  sum(rate(nginx_ingress_controller_response_duration_seconds_bucket[2m])) by (le))
```
### 3. Capacity Planning Metrics

```promql
# Projected memory usage one hour from now
predict_linear(nginx_ingress_controller_nginx_process_resident_memory_bytes[1h], 3600)

# Ratio of active to idle (waiting) connections
nginx_ingress_controller_nginx_process_connections{state="active"} /
nginx_ingress_controller_nginx_process_connections{state="waiting"}
```
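`predict_linear` is just a least-squares line fitted over the range vector, extrapolated forward. A Python sketch (samples invented; real evaluation extrapolates from the query time, approximated here by the last sample's timestamp):

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares linear extrapolation over (timestamp, value) samples."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (
        sum((t - mean_t) * (v - mean_v) for t, v in samples)
        / sum((t - mean_t) ** 2 for t, _ in samples)
    )
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept

# Memory samples (t in seconds, bytes): growing about 10 MB per minute.
samples = [(0, 100e6), (60, 110e6), (120, 120e6)]
print(predict_linear(samples, 3600))  # ~7.2e8 bytes (about 720 MB) one hour out
```

A linear fit is only as good as the trend it is fed; for bursty memory usage, pair the projection with an absolute threshold alert.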
## Troubleshooting and Optimization

### Common Issues

| Symptom | Likely cause | Remedy |
|---|---|---|
| High error rate | Unhealthy upstream services | Check backend service health |
| High latency | Network or resource bottleneck | Tune the NGINX configuration; add resources |
| Config reload failures | Configuration syntax errors | Check the Ingress resource configuration |
| Steadily growing memory | Memory leak | Tune the worker-processes setting; restart or upgrade the controller |
### Performance Tuning Tips

The snippets below are Helm values; `controller.config` maps onto the controller's ConfigMap options.

1. Adjust the number of worker processes:

   ```yaml
   controller:
     config:
       worker-processes: "4"
   ```

2. Tune connection handling:

   ```yaml
   controller:
     config:
       max-worker-connections: "10240"
       keep-alive: "75"
   ```

3. Tune proxy buffering:

   ```yaml
   controller:
     config:
       proxy-buffer-size: "16k"
       proxy-buffers-number: "4"
   ```
## Summary

A solid ingress-nginx monitoring stack is key to keeping a Kubernetes cluster running reliably. With Prometheus and Grafana combined, you get:

- ✅ Real-time monitoring of request traffic and performance
- ✅ Multi-dimensional analysis and fault localization
- ✅ Automated alerting and capacity planning
- ✅ Historical data retention and trend analysis

Following this guide, you can quickly stand up a professional ingress-nginx monitoring platform and give your services a solid operational safety net. Remember to review your metrics and alerting rules periodically and keep evolving the setup as the business grows.
Suggested next steps:

- Deploy the monitoring stack described in this article
- Configure alerting rules for your key business metrics
- Establish a regular review cadence for monitoring data
- Customize dashboards for the characteristics of your workload

By continuously refining the monitoring setup, you will gain a much firmer grip on ingress-nginx's runtime behavior and on the stability and reliability of your traffic.
Disclosure: parts of this article were produced with AI assistance (AIGC) and are provided for reference only.