A Practical Guide to Monitoring Azure Kubernetes Service in Cloud-Native Architectures

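Overview

In a cloud-native architecture, monitoring is a core pillar of system stability and observability. Azure Kubernetes Service (AKS), a key part of Microsoft's cloud-native ecosystem, ships with a full monitoring stack. This article walks through building an effective monitoring setup on AKS, from basic metric collection to advanced alert configuration.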

Monitoring Architecture

[Diagram: monitoring architecture (Mermaid source not included)]

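Core Monitoring Components

1. Azure Monitor for Containers

Azure Monitor for Containers (now called Container insights) is the core AKS monitoring service. It provides:

Automatic discovery and monitoring: detects all containers and workloads in the AKS cluster
Performance metric collection: CPU, memory, disk, and network metrics
Log aggregation: collects and analyzes container logs in one place
Health monitoring: tracks cluster and node health

2. Prometheus Integration

AKS integrates with Prometheus. With the Azure Monitor managed service for Prometheus, metrics can be collected without running your own Prometheus server. If you run Prometheus yourself instead, the scrape configuration is supplied through a ConfigMap such as the following: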

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
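    - job_name: 'kubernetes-apiservers'
      kubernetes_sd_configs:
      - role: endpoints
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      # Keep only the API server endpoints; without this filter the job
      # would try to scrape every endpoint in the cluster over HTTPS.
      relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https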

3. Application Performance Monitoring (APM)

Application Insights provides monitoring at the application layer:

// Integrate Application Insights into a .NET application
public void ConfigureServices(IServiceCollection services)
{
    services.AddApplicationInsightsTelemetry(Configuration);
    services.AddApplicationInsightsKubernetesEnricher();
}

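// Custom telemetry
using System.Threading.Tasks;
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.DataContracts;

public class OrderService
{
    private readonly TelemetryClient _telemetryClient;

    // TelemetryClient is registered by AddApplicationInsightsTelemetry and injected here
    public OrderService(TelemetryClient telemetryClient)
    {
        _telemetryClient = telemetryClient;
    }

    public async Task ProcessOrder(Order order)
    {
        // The operation is recorded as a request and closed when the using block ends
        using (_telemetryClient.StartOperation<RequestTelemetry>("ProcessOrder"))
        {
            _telemetryClient.TrackEvent("OrderProcessingStarted");
            // Business logic (awaited calls) goes here
            // TrackMetric takes a double; the cast assumes TotalAmount is numeric, e.g. decimal
            _telemetryClient.TrackMetric("OrderValue", (double)order.TotalAmount);
        }
    }
}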

Monitoring Configuration in Practice

1. Enabling Azure Monitor

# Enable monitoring when creating an AKS cluster
az aks create \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --enable-addons monitoring \
  --generate-ssh-keys

# Enable monitoring on an existing cluster
az aks enable-addons \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --addons monitoring

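2. DaemonSet Configuration

The monitoring add-on deploys its agent as a DaemonSet so that one agent Pod runs on every node. The add-on creates and manages this object itself (it is named omsagent in older releases and ama-logs in newer ones); the simplified manifest below only illustrates its shape: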

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: omsagent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: omsagent
  template:
    metadata:
      labels:
        app: omsagent
    spec:
      containers:
      - name: omsagent
        image: "mcr.microsoft.com/azuremonitor/containerinsights/ciprod:ciprod11012023"
        env:
        - name: WSID
          valueFrom:
            secretKeyRef:
              name: omsagent-secret
              key: WSID
        - name: KEY
          valueFrom:
            secretKeyRef:
              name: omsagent-secret
              key: KEY

Metrics

1. Infrastructure Metrics

Category	Metric	Alert threshold	Collection interval
Node CPU	cpuUsagePercentage	>80%	15 s
Node memory	memoryWorkingSetPercentage	>85%	15 s
Disk space	diskUsedPercentage	>90%	60 s
Network traffic	networkBytesTotal	Custom	30 s
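To make the threshold column concrete, here is a minimal Python sketch that checks one set of node samples against the percentage thresholds in the table. The dictionary layout and the function name `breached` are illustrative assumptions rather than part of any Azure SDK, and the memory entry uses the percentage form of the metric name so that it can be compared with a percentage threshold.

```python
# Thresholds copied from the table; the layout and names are illustrative
# assumptions, not an Azure API.
THRESHOLDS = {
    "cpuUsagePercentage": 80.0,
    "memoryWorkingSetPercentage": 85.0,
    "diskUsedPercentage": 90.0,
}


def breached(samples: dict[str, float]) -> list[str]:
    """Return the metric names whose sample is strictly above its threshold."""
    return sorted(
        name
        for name, value in samples.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    )


if __name__ == "__main__":
    # Made-up sample values for a single node
    print(breached({"cpuUsagePercentage": 91.5, "memoryWorkingSetPercentage": 62.0}))
```

Metrics without a fixed threshold, such as network traffic, are simply ignored by this sketch.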

2. Application Performance Metrics

[Diagram: application performance metrics (Mermaid source not included)]

Log Management

1. Log Collection Configuration

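apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  schema-version: v1
  config-version: ver1
  # The agent reads TOML under log-data-collection-settings. Namespaces that
  # are not excluded (for example default and production) are collected.
  log-data-collection-settings: |-
    [log_collection_settings]
      [log_collection_settings.stdout]
        enabled = true
        exclude_namespaces = ["kube-system"]
      [log_collection_settings.stderr]
        enabled = true
        exclude_namespaces = ["kube-system"]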

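2. Kusto Query Examples

// Error logs from one container in a given namespace over the last hour.
// Assumes the ContainerLogV2 schema is enabled; the legacy ContainerLog
// table has no ContainerName or PodName columns.
ContainerLogV2
| where TimeGenerated > ago(1h)
| where PodNamespace == "default"
| where ContainerName == "myapp"
| where tostring(LogMessage) contains "error"
| project TimeGenerated, PodName, LogMessage
| order by TimeGenerated desc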

// Aggregate container CPU usage into one-minute averages
Perf
| where ObjectName == "K8SContainer"
| where CounterName == "cpuUsageNanoCores"
| summarize avg(CounterValue) by bin(TimeGenerated, 1m), InstanceName
| render timechart
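For readers less familiar with Kusto, the aggregation step `summarize avg(CounterValue) by bin(TimeGenerated, 1m), InstanceName` can be pictured with a small Python sketch. It is only an illustration of the bucketing logic under assumed input (a list of `(timestamp, instance, value)` tuples); the function name `avg_by_minute` is hypothetical, and this is not how Log Analytics executes the query.

```python
from collections import defaultdict
from datetime import datetime


def avg_by_minute(rows):
    """Average value per (minute bucket, instance), like bin(TimeGenerated, 1m)."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, sample count]
    for ts, instance, value in rows:
        bucket = ts.replace(second=0, microsecond=0)  # floor to the minute
        acc = sums[(bucket, instance)]
        acc[0] += value
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


if __name__ == "__main__":
    # Made-up samples for one hypothetical container instance
    rows = [
        (datetime(2024, 1, 1, 12, 0, 5), "myapp", 100.0),
        (datetime(2024, 1, 1, 12, 0, 35), "myapp", 300.0),
        (datetime(2024, 1, 1, 12, 1, 10), "myapp", 50.0),
    ]
    for key, avg in sorted(avg_by_minute(rows).items()):
        print(key, avg)
```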

Alerting and Notification

1. Alert Rule Configuration

{
  "location": "global",
  "properties": {
    "description": "High CPU usage on AKS nodes",
    "severity": 2,
    "enabled": true,
    "scopes": [
      "/subscriptions/{sub-id}/resourceGroups/{rg-name}/providers/Microsoft.ContainerService/managedClusters/{aks-name}"
    ],
    "evaluationFrequency": "PT1M",
    "windowSize": "PT5M",
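    "criteria": {
      "odata.type": "Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria",
      "allOf": [
        {
          "criterionType": "StaticThresholdCriterion",
          "threshold": 80,
          "name": "CPUUsage",
          "metricName": "node_cpu_usage_percentage",
          "operator": "GreaterThan",
          "timeAggregation": "Average",
          "metricNamespace": "Microsoft.ContainerService/managedClusters"
        }
      ]
    },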
    "actions": [
      {
        "actionGroupId": "/subscriptions/{sub-id}/resourceGroups/{rg-name}/providers/microsoft.insights/actionGroups/{ag-name}"
      }
    ]
  }
}
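The rule above averages the metric over a five-minute window (`windowSize`) and compares the result with the threshold of 80. The following Python sketch mimics that evaluation for a list of `(timestamp, value)` samples. It is a simplified model under stated assumptions: the function name `should_fire` is hypothetical, and the choice to stay silent when the window has no data is this sketch's own, not documented Azure Monitor behavior.

```python
from datetime import datetime, timedelta


def should_fire(samples, now, window=timedelta(minutes=5), threshold=80.0):
    """True when the mean of the samples in (now - window, now] exceeds threshold."""
    recent = [value for ts, value in samples if now - window < ts <= now]
    if not recent:
        return False  # no data in the window: this sketch stays silent
    return sum(recent) / len(recent) > threshold


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    # Made-up CPU percentages, one sample per minute
    samples = [(now - timedelta(minutes=m), v) for m, v in [(1, 90.0), (2, 85.0), (3, 70.0), (4, 80.0)]]
    print(should_fire(samples, now))
```

Averaging over a window means a single short spike does not trigger the alert, which is the reason to prefer `Average` over `Maximum` for sustained-load rules.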

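2. Tiered Alerting Policy

Level	Trigger	Notification	Required response time
P0 Critical	Service unavailable	Phone call + SMS	<5 minutes
P1 Major	Severe performance degradation	SMS + email	<15 minutes
P2 Warning	Resource usage warning	Email	<1 hour
P3 Info	Configuration change notice	Email	<4 hours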
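One way to keep such a policy consistent across tools is to encode it as data. The sketch below transcribes the table into a Python lookup; the `Policy` class, the channel labels, and the `route` function are illustrative assumptions, and wiring them to real action groups is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    channels: tuple[str, ...]
    respond_within_minutes: int


# Values transcribed from the table; channel names are illustrative labels
POLICIES = {
    "P0": Policy(("phone", "sms"), 5),
    "P1": Policy(("sms", "email"), 15),
    "P2": Policy(("email",), 60),
    "P3": Policy(("email",), 240),
}


def route(level: str) -> Policy:
    """Look up the notification policy for an alert level such as "P1"."""
    try:
        return POLICIES[level.upper()]
    except KeyError:
        raise ValueError(f"unknown alert level: {level!r}") from None


if __name__ == "__main__":
    print(route("P1"))
```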

Dashboards and Visualization

1. Custom Monitoring Dashboard

{
  "lenses": {
    "0": {
      "order": 0,
      "parts": {
        "0": {
          "position": {
            "x": 0,
            "y": 0,
            "rowSpan": 4,
            "colSpan": 6
          },
          "metadata": {
            "inputs": [],
            "type": "Extension/Microsoft_Azure_Monitoring/PartType/LogsChartPart",
            "settings": {
              "content": {
                "Query": "Perf | where ObjectName == 'K8SNode' | where CounterName == 'cpuUsageNanoCores' | summarize avg(CounterValue) by bin(TimeGenerated, 1m), InstanceName | render timechart",
                "Title": "Node CPU usage (nanocores)"
              }
            }
          }
        }
      }
    }
  }
}

2. Key Performance Indicator Board

[Diagram: key performance indicator board (Mermaid source not included)]

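Best Practices and Optimization

1. Monitoring Data Retention

Data type	Retention	Storage tier	Access frequency
Real-time metrics	30 days	Hot	High
Historical logs	90 days	Warm	Medium
Archived data	1 year	Cold	Low
Compliance data	7 years	Archive	Rare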
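As a small illustration of applying these retention periods, the following Python sketch decides whether a record has outlived its retention window. The dictionary keys and the function name `is_expired` are hypothetical labels rather than Azure setting names, and seven years is approximated as 7 × 365 days.

```python
from datetime import date

# Retention periods from the table, in days. "7 years" is approximated as
# 7 * 365 days; the keys are illustrative labels, not Azure setting names.
RETENTION_DAYS = {
    "realtime_metrics": 30,
    "historical_logs": 90,
    "archive": 365,
    "compliance": 7 * 365,
}


def is_expired(data_type: str, created: date, today: date) -> bool:
    """True when a record is older than the retention period for its type."""
    return (today - created).days > RETENTION_DAYS[data_type]


if __name__ == "__main__":
    print(is_expired("historical_logs", date(2024, 1, 1), date(2024, 6, 1)))
```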

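2. Cost Optimization

Sampling: collect non-critical metrics less frequently
Log filtering: avoid collecting debug-level logs
Compression: enable log compression
Storage tiering: choose the storage tier based on access frequency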
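The first two strategies can be sketched in a few lines of Python. The function below drops records under a minimum level and then keeps only a percentage of the remaining non-error records. The record shape (`level` and `message` keys), the level table, and the name `keep` are assumptions made for illustration; in a real cluster this filtering would normally be done in the agent or the application's logging configuration.

```python
import zlib

LEVELS = {"DEBUG": 10, "INFO": 20, "WARNING": 30, "ERROR": 40}


def keep(record: dict, min_level: str = "INFO", sample_percent: int = 100) -> bool:
    """Decide whether a log record should be shipped to the workspace."""
    level = LEVELS.get(record["level"], 0)
    if level < LEVELS[min_level]:
        return False  # drop debug-level (or unknown-level) records
    if level >= LEVELS["ERROR"]:
        return True  # never sample away errors
    # Deterministic sampling: the same message always gets the same decision
    bucket = zlib.crc32(record["message"].encode("utf-8")) % 100
    return bucket < sample_percent


if __name__ == "__main__":
    print(keep({"level": "DEBUG", "message": "cache miss"}))
    print(keep({"level": "ERROR", "message": "db timeout"}, sample_percent=10))
```

Hashing the message instead of calling a random generator keeps the decision reproducible, which makes the filter easier to test.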

3. Security Monitoring Considerations

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: security-alerts
  namespace: monitoring
spec:
  groups:
  - name: security.rules
    rules:
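    - alert: UnauthorizedAccessAttempt
      # apiserver_request_total is the API server's request counter; 401/403
      # responses on secrets indicate rejected access attempts.
      expr: sum(increase(apiserver_request_total{resource="secrets",code=~"401|403"}[5m])) > 0
      for: 2m
      labels:
        severity: critical
      annotations:
        description: Unauthorized attempts to access Secrets were detected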

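Troubleshooting and Diagnostics

1. Diagnosing Common Problems

[Diagram: common problem diagnosis flow (Mermaid source not included)]

2. Automated Remediation Script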

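#!/bin/bash
# Diagnose an AKS node and drain it when memory usage is too high
set -euo pipefail

NODE="${1:?usage: $0 <node-name>}"

echo "Checking status of node $NODE..."
kubectl describe node "$NODE"

# kubectl top node columns: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
# Column 5 is the memory percentage; "+ 0" strips the trailing % sign.
MEM_PCT=$(kubectl top node "$NODE" --no-headers | awk '{ print $5 + 0 }')

if [ "${MEM_PCT:-0}" -gt 90 ]; then
    echo "Node memory usage is too high; cordoning the node and evicting its Pods"
    kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
else
    echo "Node memory usage is normal"
fi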
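The memory check relies on the column layout of `kubectl top node`, where the memory percentage is the fifth column. A small Python sketch of the same parsing step makes that easy to verify offline. The function name `memory_percent` and the sample text (including the node names) are made up for illustration; the sample only follows the usual column layout.

```python
def memory_percent(top_output: str, node: str) -> int:
    """Extract MEMORY% for the given node from `kubectl top node` text output."""
    for line in top_output.splitlines():
        fields = line.split()
        # Columns: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
        if len(fields) >= 5 and fields[0] == node:
            return int(fields[4].rstrip("%"))
    raise LookupError(f"node {node!r} not found in output")


# Made-up sample in the usual column layout; node names are hypothetical
SAMPLE = """\
NAME                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-nodepool1-0000   250m         13%    4100Mi          92%
aks-nodepool1-0001   120m         6%     1800Mi          40%
"""

if __name__ == "__main__":
    print(memory_percent(SAMPLE, "aks-nodepool1-0000"))
```

Reading the third column instead would return the CPU percentage, so a memory check built on it would act on the wrong signal.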

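Summary

A complete AKS monitoring setup has to cover several dimensions: infrastructure monitoring, application performance monitoring, log management, and alert notification. Combining Azure Monitor for Containers, Prometheus integration, and Application Insights gives end-to-end coverage.

Key success factors:

Automated configuration: manage monitoring configuration through Infrastructure as Code
Multi-dimensional monitoring: cover the infrastructure, application, and business layers
Smart alerting: use machine-learning-based techniques to reduce false positives and improve alert accuracy
Continuous improvement: review the monitoring strategy regularly and adapt it as the business changes

With the practices described in this guide, you can build an AKS monitoring setup that meets current needs and can grow with them, giving cloud-native applications a solid foundation for stable operation.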

Authoring note: parts of this article were generated with AI assistance (AIGC) and are provided for reference only.