ingress-nginx Alert Configuration with Prometheus and Alertmanager
Overview
In modern Kubernetes environments, ingress-nginx is one of the most widely used Ingress controllers and carries the cluster's inbound traffic. To keep services highly available and stable, a solid monitoring and alerting setup is essential. This article walks through configuring Prometheus and Alertmanager alerting for ingress-nginx so you can build a reliable alerting pipeline.
Core Monitoring Metrics
ingress-nginx exposes a rich set of Prometheus metrics. The most important ones fall into the following groups:
Request metrics
nginx_ingress_controller_requests
nginx_ingress_controller_request_duration_seconds
nginx_ingress_controller_request_size
nginx_ingress_controller_response_size
Connection metrics
nginx_ingress_controller_nginx_process_connections_total
nginx_ingress_controller_nginx_process_requests_total
Upstream metrics
nginx_ingress_controller_upstream_latency_seconds
nginx_ingress_controller_upstream_requests_total
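As a quick illustration of how these metrics are typically consumed, the sketch below defines two recording rules on top of them; the rule names (ingress:requests:rate5m, ingress:request_duration_seconds:p95) are illustrative, not a standard:

groups:
  - name: ingress-nginx-recording
    rules:
      # Per-Ingress request rate over the last 5 minutes
      - record: ingress:requests:rate5m
        expr: sum(rate(nginx_ingress_controller_requests[5m])) by (ingress)
      # 95th percentile request latency per Ingress
      - record: ingress:request_duration_seconds:p95
        expr: |
          histogram_quantile(0.95,
            sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) by (le, ingress))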
Prometheus Configuration
Basic monitoring deployment
First, deploy the Prometheus scrape configuration:
# deploy/prometheus/prometheus.yaml
global:
  scrape_interval: 10s
scrape_configs:
  - job_name: 'ingress-nginx-endpoints'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - ingress-nginx
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
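This relabel configuration only keeps pods that carry the prometheus.io/scrape annotation, so the controller pods have to be annotated for scraping to happen. A minimal sketch of the pod-template annotations, assuming the controller's default metrics port 10254:

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "10254"
    prometheus.io/path: "/metrics"   # optional; /metrics is the default path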
ServiceMonitor configuration
If you run the Prometheus Operator, a ServiceMonitor can discover the ingress-nginx metrics endpoint automatically:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
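The ServiceMonitor only has something to select if a Service with the matching label exposes a port named metrics. A minimal sketch of such a Service follows; the name and port here are assumptions based on the controller's default metrics port, and the Helm chart shown later can create an equivalent Service for you:

apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller-metrics   # illustrative name
  namespace: ingress-nginx
  labels:
    app.kubernetes.io/name: ingress-nginx  # must match the ServiceMonitor selector
spec:
  selector:
    app.kubernetes.io/name: ingress-nginx
  ports:
    - name: metrics       # the port name referenced by the ServiceMonitor
      port: 10254
      targetPort: 10254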
Alerting Rules
Key alert rules
Create a PrometheusRule resource (also a Prometheus Operator CRD) that defines the key alerts:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ingress-nginx-alerts
  namespace: ingress-nginx
spec:
  groups:
    - name: ingress-nginx
      rules:
        - alert: NginxIngressDown
          expr: up{job=~".*ingress-nginx.*"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Nginx Ingress controller down"
            description: "Nginx Ingress controller has been down for more than 5 minutes"
        - alert: HighErrorRate
          expr: |
            sum(rate(nginx_ingress_controller_requests{status=~"5.."}[5m])) by (ingress)
              / sum(rate(nginx_ingress_controller_requests[5m])) by (ingress) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on ingress-nginx"
            description: "5xx error rate is above 5% for the last 10 minutes"
        - alert: HighRequestLatency
          expr: histogram_quantile(0.95, rate(nginx_ingress_controller_request_duration_seconds_bucket[5m])) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High request latency on ingress-nginx"
            description: "95th percentile request latency is above 3 seconds"
        - alert: HighUpstreamLatency
          expr: histogram_quantile(0.95, rate(nginx_ingress_controller_upstream_latency_seconds_bucket[5m])) > 2
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High upstream latency"
            description: "95th percentile upstream latency is above 2 seconds"
Capacity planning alerts
The following rules can be appended to the rules list of the same group:
        - alert: NginxConnectionsHigh
          expr: nginx_ingress_controller_nginx_process_connections{state="active"} > 10000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High number of nginx connections"
            description: "Active Nginx connections exceed 10000"
        - alert: CertificateExpiringSoon
          expr: (nginx_ingress_controller_ssl_expire_time_seconds - time()) < 86400 * 7
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: "SSL certificate expiring soon"
            description: "An SSL certificate served by ingress-nginx will expire in less than 7 days"
Alertmanager Configuration
Basic alert routing
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-nginx-pager'
  routes:
    - match:
        severity: critical
      receiver: 'team-nginx-pager'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'team-nginx-slack'
receivers:
  - name: 'team-nginx-pager'
    email_configs:
      - to: 'nginx-team@example.com'
        send_resolved: true
  - name: 'team-nginx-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/TOKEN'
        channel: '#nginx-alerts'
        send_resolved: true
Inhibition rules
Suppress warning-level alerts while a critical alert with the same alertname, cluster, and service is already firing:
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']
Deployment Best Practices
Helm chart configuration
Enable the metrics and monitoring features when deploying with Helm:
controller:
  metrics:
    enabled: true
    service:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "10254"
    serviceMonitor:
      enabled: true
      additionalLabels:
        release: prometheus
    prometheusRule:
      enabled: true
      rules: []
Custom alert rules
Define custom alert rules through values.yaml:
controller:
  metrics:
    prometheusRule:
      enabled: true
      rules:
        - alert: CustomHighTraffic
          expr: sum(rate(nginx_ingress_controller_requests[5m])) > 1000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High traffic volume detected"
            description: "Total request rate exceeds 1000 requests per second"
Troubleshooting and Optimization
Common troubleshooting
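If alerts never fire, a good first step is to confirm in the Prometheus UI that the controller target is up and that the core metrics are actually being ingested. A few ad-hoc PromQL checks (the exact job label depends on your scrape configuration):

# Is the controller target being scraped successfully? (1 = up)
up{job=~".*ingress-nginx.*"}

# Are request metrics arriving at all?
sum(rate(nginx_ingress_controller_requests[5m])) by (ingress)

# Which alerts are currently pending or firing?
ALERTS{alertstate=~"pending|firing"}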
Performance tuning suggestions
- Tune the scrape frequency: adjust scrape_interval to the size of your cluster
- Trim metric collection: only keep the metrics you actually need, to reduce resource consumption (see the sketch after this list)
- Set sensible thresholds: base alert thresholds on historical data rather than guesses
- Review alert rules regularly: make sure the rules stay relevant and accurate
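As one way of trimming collection, the sketch below drops histogram buckets that no alert or dashboard uses via metric_relabel_configs. It belongs under the ingress-nginx scrape job, at the same level as relabel_configs; which series to drop is entirely workload-dependent and the regex here is only an example:

    metric_relabel_configs:
      # Drop request/response size buckets if nothing consumes them
      - source_labels: [__name__]
        regex: 'nginx_ingress_controller_(request_size|response_size)_bucket'
        action: drop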
Summary
With the configuration covered in this article you can build a complete monitoring and alerting setup for ingress-nginx. The key points:
- Configure Prometheus endpoint discovery for the controller correctly
- Define alert rules for the key business and technical metrics
- Route alerts through Alertmanager to the right notification channels
- Review and tune the alerting policy regularly
A solid alerting setup helps you detect and handle ingress-nginx problems early and keeps your services available and stable. It is also worth running regular alert drills to verify that the pipeline works end to end.
Author's note: parts of this article were produced with AI assistance (AIGC) and are provided for reference only.



