Up and Running in 5 Minutes! A Complete Guide to Containerized Apache SkyWalking Deployment: Building a Monitoring Platform from Docker to K8s

Project: skywalking (APM, Application Performance Monitoring System). Repository mirror: https://gitcode.com/gh_mirrors/sky/skywalking

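Struggling with distributed-system monitoring? Is trace configuration too complex, or are alarm rules hard to customize? This article starts from scratch: it deploys Apache SkyWalking (an APM, or application performance monitoring, system) with Docker Compose and then moves on to a Kubernetes cluster. The core steps take roughly five minutes and address the most common microservice monitoring pain points.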

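After reading this article you will know how to:

  • Start the complete monitoring stack (storage, OAP server, UI) with a single Docker Compose command
  • Switch between the two storage options (ElasticSearch and BanyanDB)
  • Inject the agent into containerized services and let them register with the OAP server
  • Persist monitoring data in a K8s environment
  • Configure production-grade alarm rules and visualize the results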

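Quick Start: One-Command Deployment with Docker Compose

Apache SkyWalking ships a ready-to-use Docker Compose configuration that contains the full monitoring infrastructure. The commands below start a complete environment with storage, the OAP server, and the UI in about three minutes: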

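# Clone the repository
git clone https://gitcode.com/gh_mirrors/sky/skywalking
cd skywalking/docker   # the compose file is docker/docker-compose.yml

# Start the stack using the ElasticSearch storage profile
docker compose --profile elasticsearch up -d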

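The core configuration file defines three key service components and their dependencies:

  • Storage: ElasticSearch or BanyanDB, which persists the monitoring data
  • OAP server: receives and analyzes monitoring data and serves queries
  • Web UI: dashboards and alarm display

docker-compose.yml Walkthrough

The docker/docker-compose.yml file uses Compose profiles so that the storage backend can be switched quickly: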

version: '3.8'
services:
  # ElasticSearch storage option
  elasticsearch:
    profiles: ["elasticsearch"]
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:7.4.2
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    healthcheck:
      test: ["CMD-SHELL", "curl --silent --fail localhost:9200/_cluster/health || exit 1"]
      interval: 30s
      retries: 3

  # BanyanDB storage option (lightweight time-series database)
  banyandb:
    profiles: ["banyandb"]
    image: ghcr.io/apache/skywalking-banyandb:a091ac0c3efa7305288ae9fb8853bffb2186583a
    command: standalone --stream-root-path /tmp/stream-data --measure-root-path /tmp/measure-data

  # Base OAP server settings, reused below through YAML anchors
  oap-base: &oap-base
    profiles: ["none"]  # template only, so it is never started and never claims the ports itself
    image: ghcr.io/apache/skywalking/oap:latest
    ports:
      - "11800:11800"  # gRPC port that receives agent data
      - "12800:12800"  # HTTP query port
    environment: &oap-env  # anchor referenced by `<<: *oap-env` below
      SW_HEALTH_CHECKER: default
      SW_TELEMETRY: prometheus
      JAVA_OPTS: "-Xms2048m -Xmx2048m"

  # OAP service backed by ElasticSearch
  oap-es:
    <<: *oap-base
    profiles: ["elasticsearch"]
    container_name: oap  # assumption: gives the UI a stable hostname for http://oap:12800
    depends_on:
      elasticsearch:
        condition: service_healthy
    environment:
      <<: *oap-env
      SW_STORAGE: elasticsearch
      SW_STORAGE_ES_CLUSTER_NODES: elasticsearch:9200

  # An OAP service for the banyandb profile would follow the same pattern (not shown in this excerpt)

  # Web UI service
  ui:
    image: ghcr.io/apache/skywalking/ui:latest
    ports:
      - "8080:8080"
    environment:
      SW_OAP_ADDRESS: http://oap:12800  # address of the OAP server

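Switching the Storage Backend

Choose the storage backend according to the scale of what you monitor:

  • ElasticSearch: suited to large distributed systems; supports complex queries and aggregations
  • BanyanDB: SkyWalking's native time-series database, with low resource usage and strong write performance

To switch, change only the profile argument of the startup command: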

# Use BanyanDB storage (lightweight deployment)
docker compose --profile banyandb up -d

# Check service status
docker compose ps
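
`docker compose ps` shows container state, but a container can be "running" before its port accepts connections. The following is a minimal sketch of a readiness check, assuming the stack runs on localhost and exposes the ports listed in the compose file above (11800, 12800, 8080). The helper names `wait_for_port` and `check_stack` are hypothetical and not part of SkyWalking; an open TCP port also does not prove that the service is fully healthy.

```python
import socket
import time

# Ports taken from the compose file above.
SKYWALKING_PORTS = {"oap-grpc": 11800, "oap-http": 12800, "ui": 8080}


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError while the port is closed or unreachable
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


def check_stack(host: str = "localhost", timeout: float = 60.0) -> dict:
    """Map each component name to whether its port became reachable in time."""
    return {
        name: wait_for_port(host, port, timeout)
        for name, port in SKYWALKING_PORTS.items()
    }
```

Calling `check_stack()` after `docker compose up -d` returns a dictionary such as `{"oap-grpc": True, ...}`; any `False` entry points at the component to inspect with `docker compose logs`.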

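Containerized Agent Configuration and Service Discovery

In a container environment, the SkyWalking agent can be configured through environment variables, so the application code itself does not change. A typical microservice image looks like this: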

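FROM openjdk:11-jre-slim
WORKDIR /app
COPY target/*.jar app.jar

# Assumption: the unpacked SkyWalking Java agent sits in ./skywalking-agent
# next to this Dockerfile; copy it to the path used by -javaagent below.
COPY skywalking-agent/ /skywalking/agent/

# Agent settings (injected through environment variables)
ENV SW_AGENT_NAME=payment-service \
    SW_AGENT_COLLECTOR_BACKEND_SERVICES=oap:11800 \
    SW_AGENT_SPAN_LIMIT=200

# Start command (attaches the agent)
ENTRYPOINT ["java", "-javaagent:/skywalking/agent/skywalking-agent.jar", "-jar", "app.jar"]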

Automatic Agent Injection

In a K8s environment, a MutatingWebhook can inject the agent automatically, which avoids editing the Dockerfile of every service by hand:

# Example injection settings (illustrative; the exact keys depend on the injector in use)
apiVersion: v1
kind: ConfigMap
metadata:
  name: skywalking-agent-config
data:
  agent.service_name: "default"
  collector.backend_service: "oap.skywalking.svc:11800"

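The official documentation provides a complete guide to configuring service agents, including integration options for agents in different languages.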

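Storage Backends Compared in Depth

ElasticSearch vs BanyanDB

| Feature | ElasticSearch | BanyanDB |
| Deployment complexity | Medium (cluster configuration required) | Low (standalone mode) |
| Resource usage | High (4 GB+ memory recommended) | Low (runs with 512 MB memory) |
| Query performance | Excellent (full-text search support) | Excellent (optimized for time-series data) |
| Data retention | Automatic cleanup through TTL | Native time-series data lifecycle |
| Best suited for | Large distributed systems | Small to medium services or edge environments |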

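The storage backend can also be switched by changing the OAP environment variables in docker-compose.yml and recreating the OAP container:

# Switch to BanyanDB storage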
services:
  oap:
    environment:
      SW_STORAGE: banyandb
      SW_STORAGE_BANYANDB_TARGETS: banyandb:17912

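Production Deployment on K8s

Deployment Architecture

In a Kubernetes environment, the following deployment architecture is recommended for high availability:

[Mermaid architecture diagram not preserved in this copy]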

Persistent Storage Configuration

Deploy ElasticSearch as a StatefulSet so that its data is persisted:

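apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
spec:
  serviceName: elasticsearch  # expects a headless Service with this name
  replicas: 3
  selector:                   # required by apps/v1; must match the pod labels
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch-oss:7.4.2
        env:
        # Elasticsearch 7.x discovery settings (discovery.zen.* belongs to 6.x);
        # all three replicas are listed
        - name: discovery.seed_hosts
          value: "elasticsearch-0.elasticsearch,elasticsearch-1.elasticsearch,elasticsearch-2.elasticsearch"
        - name: cluster.initial_master_nodes
          value: "elasticsearch-0,elasticsearch-1,elasticsearch-2"
        volumeMounts:           # mount the claimed volume on the data directory
        - name: data
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi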

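Visualizing Monitoring Data

SkyWalking can be combined with Prometheus and Grafana, and a Grafana dashboard template is a quick way to build custom panels. The dashboard JSON below is an example; the metric names in its `expr` query depend on how the metrics are exposed to Prometheus in your setup: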

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "gnetId": null,
  "graphTooltip": 0,
  "id": 1,
  "iteration": 1620256625746,
  "links": [],
  "panels": [
    {
      "aliasColors": {},
      "bars": false,
      "dashLength": 10,
      "dashes": false,
      "datasource": "Prometheus",
      "fieldConfig": {
        "defaults": {
          "links": []
        },
        "overrides": []
      },
      "fill": 1,
      "fillGradient": 0,
      "gridPos": {
        "h": 9,
        "w": 24,
        "x": 0,
        "y": 0
      },
      "hiddenSeries": false,
      "id": 2,
      "legend": {
        "avg": false,
        "current": false,
        "max": false,
        "min": false,
        "show": true,
        "total": false,
        "values": false
      },
      "lines": true,
      "linewidth": 1,
      "nullPointMode": "null",
      "options": {
        "alertThreshold": true
      },
      "percentage": false,
      "pluginVersion": "7.5.5",
      "pointradius": 2,
      "points": false,
      "renderer": "flot",
      "seriesOverrides": [],
      "spaceLength": 10,
      "stack": false,
      "steppedLine": false,
      "targets": [
        {
          "expr": "sum(rate(skywalking_service_resp_time_sum[5m])) / sum(rate(skywalking_service_resp_time_count[5m]))",
          "interval": "",
          "legendFormat": "Average Response Time",
          "refId": "A"
        }
      ],
      "thresholds": [],
      "timeFrom": null,
      "timeRegions": [],
      "timeShift": null,
      "title": "Service Average Response Time",
      "tooltip": {
        "shared": true,
        "sort": 0,
        "value_type": "individual"
      },
      "type": "graph",
      "xaxis": {
        "buckets": null,
        "mode": "time",
        "name": null,
        "show": true,
        "values": []
      },
      "yaxes": [
        {
          "format": "ms",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": "0",
          "show": true
        },
        {
          "format": "short",
          "label": null,
          "logBase": 1,
          "max": null,
          "min": null,
          "show": true
        }
      ],
      "yaxis": {
        "align": false,
        "alignLevel": null
      }
    }
  ],
  "refresh": "5s",
  "schemaVersion": 27,
  "style": "dark",
  "tags": [],
  "templating": {
    "list": []
  },
  "time": {
    "from": "now-6h",
    "to": "now"
  },
  "timepicker": {
    "refresh_intervals": [
      "5s",
      "10s",
      "30s",
      "1m",
      "5m",
      "15m",
      "30m",
      "1h",
      "2h",
      "1d"
    ]
  },
  "timezone": "",
  "title": "SkyWalking Service Monitoring",
  "uid": "skywalking-service",
  "version": 1
}

Alarm Rule Configuration

Alarm rules are defined in dist-material/alarm-settings.yml (shipped as config/alarm-settings.yml in the OAP distribution):

rules:
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 5
    message: "Service {name} response time is more than 1000ms in 3 minutes of last 10 minutes."
  service_sla_rule:
    metrics-name: service_sla
    op: "<"
    threshold: 8000
    period: 10
    count: 2
    silence-period: 5
    message: "Service {name} success rate is lower than 80% in 2 minutes of last 10 minutes."

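Alarm rules can be delivered through notification hooks such as WebHook and Slack; see the official alarm documentation for details. Note that `service_sla` is stored as a percentage multiplied by 100, so a threshold of 8000 means 80%. Newer SkyWalking releases write the condition as a single `expression` field instead of `metrics-name`/`op`/`threshold`/`count`, so check the alarm documentation for the version you deploy.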
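
To make the interplay of `period`, `count`, and `silence-period` concrete, here is a small sketch that replays per-minute metric values against a rule shaped like the ones above. It is a simplified interpretation for illustration, not the OAP server's implementation; the `Rule` class and `evaluate` function are hypothetical names.

```python
import operator
from dataclasses import dataclass
from typing import List

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le, "==": operator.eq}


@dataclass
class Rule:
    op: str
    threshold: float
    period: int          # window length in minutes
    count: int           # matching minutes within the window needed to fire
    silence_period: int  # minutes to stay quiet after firing


def evaluate(rule: Rule, per_minute_values: List[float]) -> List[int]:
    """Return the minute indexes at which the rule would fire (simplified model)."""
    compare = OPS[rule.op]
    fired = []
    silent_until = -1
    for minute in range(len(per_minute_values)):
        # Look at the last `period` minutes, including the current one.
        window = per_minute_values[max(0, minute - rule.period + 1): minute + 1]
        matches = sum(1 for value in window if compare(value, rule.threshold))
        if matches >= rule.count and minute > silent_until:
            fired.append(minute)
            silent_until = minute + rule.silence_period
    return fired


# Same numbers as service_resp_time_rule above.
resp_time_rule = Rule(op=">", threshold=1000, period=10, count=3, silence_period=5)
latencies_ms = [500, 1200, 1300, 900, 1500, 1600, 800, 800, 800, 800, 800, 800]
print(evaluate(resp_time_rule, latencies_ms))  # [4, 10]
```

In this run the third slow minute arrives at index 4, so the rule fires there. It stays silent for the next five minutes and fires again at index 10, because the 10-minute window still contains at least three slow minutes.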

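Common Problems and Tuning

Data Volume Grows Too Fast

When the volume of monitoring data exceeds expectations, the following measures help:

  1. Adjust the sampling rate: lower the trace sampling rate in the agent configuration
    agent.sample_n_per_3_secs=10
    
  2. Configure data TTL: set the data retention time in the storage configuration
  3. Enable data aggregation: define metric aggregation rules in OAL scripts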
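
The effect of `agent.sample_n_per_3_secs` from step 1 can be illustrated with a short sketch: at most N traces are kept in each 3-second window, and a zero or negative value turns the limit off. This is a simplified model with fixed time windows, not the agent's actual code; `NPer3SecSampler` is a hypothetical name.

```python
class NPer3SecSampler:
    """Keep at most n traces per 3-second window; n <= 0 disables the limit."""

    WINDOW_SECONDS = 3

    def __init__(self, n: int):
        self.n = n
        self._window = None  # index of the current 3-second window
        self._taken = 0      # traces already sampled in that window

    def try_sample(self, now: float) -> bool:
        """Return True if a trace starting at time `now` (seconds) should be kept."""
        if self.n <= 0:
            return True  # sampling limit is off, keep everything
        window = int(now // self.WINDOW_SECONDS)
        if window != self._window:
            # A new window started, so the budget is reset.
            self._window = window
            self._taken = 0
        if self._taken < self.n:
            self._taken += 1
            return True
        return False


sampler = NPer3SecSampler(10)
# 30 traces arriving 100 ms apart all fall into the first 3-second window.
kept = sum(sampler.try_sample(0.1 * i) for i in range(30))
print(kept)  # 10
```

With 30 requests in the first window only 10 are traced, which is where the storage saving comes from. Lowering N trades trace coverage for data volume.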

High-Availability Deployment

For production, run several OAP server replicas behind a load balancer:

# Example OAP Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-oap
spec:
  replicas: 3
  selector:
    matchLabels:
      app: oap
  template:
    metadata:
      labels:
        app: oap
    spec:
      containers:
      - name: oap
        image: apache/skywalking-oap-server:9.7.0
        env:
        - name: SW_CLUSTER
          value: kubernetes
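        - name: SW_CLUSTER_K8S_NAMESPACE
          value: skywalking
        # Assumption: the label selector OAP uses to find its peers should match the pod labels above
        - name: SW_CLUSTER_K8S_LABEL
          value: app=oap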

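Summary and Next Steps

The containerized deployment options described here cover the path from development and testing to production. Docker Compose gives you a quick environment for functional verification, while the K8s deployment meets production requirements for availability and scalability.

Suggestions for further study:

  1. Custom OAL metrics: define business-specific metrics with OAL syntax
  2. Distributed tracing in depth: study the trace data protocol to implement cross-language tracing
  3. Performance tuning: consult the backend performance tuning guide

Complete deployment documentation and best practices are available in the official documentation, and the community provides a rich set of example configurations to help users get started.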

Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.
