Zero-Downtime Deployment: A Practical Guide to Container Monitoring with Apache SkyWalking on OpenShift

Project repository: skywalking (APM, Application Performance Monitoring System): https://gitcode.com/gh_mirrors/sky/skywalking

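Have you run into microservice performance black boxes in your OpenShift cluster? Apache SkyWalking, a widely used open-source APM and distributed tracing system, can give you call-chain visualization, performance metrics, and automated alerting. This article walks through three core steps, from environment preparation to advanced configuration, and covers more than ten practical configuration items and two production-grade optimization tips.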

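Deployment Architecture Overview

On OpenShift, SkyWalking uses the classic three-tier "agent, backend, UI" architecture, deployed as containers so that it integrates with the Kubernetes ecosystem: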

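Core components:

  • Agent (probe): attached to the application process; reports traces and metrics to the OAP Server over gRPC (port 11800)
  • OAP Server: the backend that receives, analyzes, and aggregates data; serves gRPC on 11800 and HTTP on 12800
  • UI: the web console (port 8080), which queries the OAP Server over HTTP (port 12800)
  • Storage: Elasticsearch or BanyanDB, where the OAP Server persists data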

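Prerequisites and Environment Preparation

Before deploying, make sure the environment meets the following requirements:

  1. OpenShift cluster: version 4.6 or later, with at least 3 worker nodes
  2. Storage: at least 10Gi of PV (persistent volume) capacity for Elasticsearch
  3. Permissions: the cluster-admin role, or permission to create CRDs
  4. Network policy: Pod-to-Pod traffic allowed on ports 11800/tcp and 12800/tcp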
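
If the project enforces a default-deny policy, the network requirement in item 4 could be met with a NetworkPolicy along these lines. This is a minimal sketch rather than a definitive policy: the policy name is hypothetical, and the app: skywalking-oap label is assumed to match the OAP Deployment used later in this guide.

```yaml
# Sketch only: names and labels are assumptions that mirror the manifests in this guide.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-skywalking-oap   # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: skywalking-oap      # the OAP Pods that receive traffic
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector: {}          # any Pod in the same namespace (agents and the UI)
    ports:
    - protocol: TCP
      port: 11800              # gRPC, used by agents
    - protocol: TCP
      port: 12800              # HTTP, used by the UI
```

Agents running in other namespaces would additionally need a namespaceSelector entry in the from list.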

Check the cluster state with the following commands:

oc get nodes
oc get storageclass

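Deployment Steps

1. Deploying the Storage Backend

SkyWalking supports several storage backends, including Elasticsearch and BanyanDB; this guide uses Elasticsearch as the example: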

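# elasticsearch-deploy.yaml (key settings)
# Assumes a headless Service named "elasticsearch" exists (referenced by serviceName)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
spec:
  serviceName: elasticsearch
  replicas: 1
  selector:                # required by apps/v1; must match the Pod template labels
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec: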
      containers:
      - name: elasticsearch
        image: docker.elastic.co/elasticsearch/elasticsearch-oss:7.4.2
        env:
        - name: discovery.type
          value: single-node
        - name: ES_JAVA_OPTS
          value: "-Xms512m -Xmx512m"
        ports:
        - containerPort: 9200
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        volumeMounts:
        - name: data
          mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi

Run the deployment commands:

oc apply -f elasticsearch-deploy.yaml
oc rollout status statefulset/elasticsearch

For production, a three-node Elasticsearch cluster is recommended. Storage tuning options are set in the storage section of the OAP Server's application.yml.
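
The StatefulSet refers to a governing Service through serviceName, and the OAP Server later connects to elasticsearch:9200, so a Service with that name has to exist. A minimal sketch of a headless Service follows; it assumes the Elasticsearch Pod template carries the label app: elasticsearch, and the port name is illustrative.

```yaml
# Sketch only: assumes the Elasticsearch Pods carry the label app: elasticsearch.
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch        # must match serviceName in the StatefulSet
spec:
  clusterIP: None            # headless: DNS resolves directly to the Pod IPs
  selector:
    app: elasticsearch
  ports:
  - name: http
    port: 9200
    targetPort: 9200
```

It can be appended to elasticsearch-deploy.yaml after a --- separator so that both objects are created by the same oc apply.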

2. Deploying the OAP Server

Create a Deployment for the OAP Server. The key settings are:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-oap
spec:
  replicas: 2  # at least 2 replicas recommended for production
  selector:
    matchLabels:
      app: skywalking-oap
  template:
    metadata:
      labels:
        app: skywalking-oap
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "1234"
    spec:
      containers:
      - name: oap
        image: ghcr.io/apache/skywalking/oap:latest
        ports:
        - containerPort: 11800  # gRPC port
        - containerPort: 12800  # HTTP port
        env:
        - name: SW_STORAGE
          value: elasticsearch
        - name: SW_STORAGE_ES_CLUSTER_NODES
          value: elasticsearch:9200
        - name: JAVA_OPTS
          value: "-Xms2G -Xmx2G"  # adjust to the node's resources
        resources:
          limits:
            cpu: "1"
            memory: "4Gi"
          requests:
            cpu: "500m"
            memory: "2Gi"

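Deployment commands:

oc apply -f oap-deploy.yaml
# Creates a Service that publishes only the HTTP port 12800; agents that report to
# skywalking-oap:11800 also need port 11800 published on the Service
oc expose deployment skywalking-oap --port=12800

The OAP Server's Dockerfile defines the base image and startup flow; see docker/oap/Dockerfile, where line 48 exposes three key ports: 12800 (HTTP), 11800 (gRPC), and 1234 (Prometheus telemetry metrics).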
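
Because agents report over gRPC on 11800 while the UI queries over HTTP on 12800, the OAP Service needs to publish both ports. One way to do that is a declarative Service instead of oc expose. The following is a sketch under the assumption that the Pods are labeled app: skywalking-oap as in the Deployment above; the port names are illustrative.

```yaml
# Sketch only: an alternative to `oc expose`; port names are illustrative.
apiVersion: v1
kind: Service
metadata:
  name: skywalking-oap
spec:
  selector:
    app: skywalking-oap
  ports:
  - name: grpc
    port: 11800
    targetPort: 11800
  - name: http
    port: 12800
    targetPort: 12800
```

Use it in place of the oc expose command, since both would create a Service named skywalking-oap.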

3. Deploying the UI and Configuring the Route

Deploying the UI is comparatively simple; the key point is configuring the OAP Server address:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-ui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skywalking-ui
  template:
    metadata:
      labels:
        app: skywalking-ui
    spec:
      containers:
      - name: ui
        image: ghcr.io/apache/skywalking/ui:latest
        ports:
        - containerPort: 8080
        env:
        - name: SW_OAP_ADDRESS
          value: http://skywalking-oap:12800
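---
# Service that the Route below targets
apiVersion: v1
kind: Service
metadata:
  name: skywalking-ui
spec:
  selector:
    app: skywalking-ui
  ports:
  - port: 8080
    targetPort: 8080
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: skywalking-ui
spec:
  to:
    kind: Service
    name: skywalking-ui
  tls:
    termination: edge  # lowercase: edge, passthrough, or reencrypt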

After the Route is created, get the access hostname with:

oc get route skywalking-ui -o jsonpath='{.spec.host}'

The UI startup script, docker/ui/docker-entrypoint.sh, handles environment variable injection (such as SW_OAP_ADDRESS) and the port settings used to reach the OAP Server.

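Application Onboarding and Data Collection

Attaching the Agent to Java Applications

Use an init container to copy the agent into a volume shared with the application container: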

spec:
  initContainers:
  - name: agent-init
    image: ghcr.io/apache/skywalking/agent:latest
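    # Copy the directory's contents (note the trailing "/."): /agent already exists as a
    # mount point, so "cp -r /skywalking/agent /agent" would create /agent/agent instead
    command: ['sh', '-c', 'cp -r /skywalking/agent/. /agent/']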
    volumeMounts:
    - name: agent-volume
      mountPath: /agent
  containers:
  - name: app
    image: your-app-image:latest
    volumeMounts:
    - name: agent-volume
      mountPath: /skywalking/agent
    env:
    - name: JAVA_TOOL_OPTIONS
      value: "-javaagent:/skywalking/agent/skywalking-agent.jar"
    - name: SW_AGENT_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.labels['app']
    - name: SW_AGENT_COLLECTOR_BACKEND_SERVICES
      value: "skywalking-oap:11800"
  volumes:
  - name: agent-volume
    emptyDir: {}

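Metrics Configuration

Enable the OAP Server's Prometheus telemetry endpoint by setting these environment variables on the OAP container:

env:
- name: SW_TELEMETRY
  value: prometheus
- name: SW_TELEMETRY_PROMETHEUS_HOST
  value: "0.0.0.0"
- name: SW_TELEMETRY_PROMETHEUS_PORT
  value: "1234"

The full set of options can be traced through the environment variable handling in docker/oap/docker-entrypoint.sh.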
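
OpenShift's built-in monitoring stack discovers scrape targets through ServiceMonitor and PodMonitor objects rather than prometheus.io annotations, so the telemetry port usually needs one of those to be scraped. The sketch below assumes that monitoring for user-defined projects is enabled and that the OAP Pods are labeled app: skywalking-oap; the object names and the metrics port name are hypothetical.

```yaml
# Sketch only: assumes OpenShift monitoring for user-defined projects is enabled.
apiVersion: v1
kind: Service
metadata:
  name: skywalking-oap-metrics   # hypothetical name
  labels:
    app: skywalking-oap-metrics
spec:
  selector:
    app: skywalking-oap
  ports:
  - name: metrics
    port: 1234
    targetPort: 1234
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: skywalking-oap           # hypothetical name
spec:
  selector:
    matchLabels:
      app: skywalking-oap-metrics
  endpoints:
  - port: metrics                # refers to the Service port name above
    interval: 30s
```

A separate Service keeps the metrics port out of the Service that agents and the UI use.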

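Advanced Configuration and Optimization

Resource Limits and Autoscaling

To keep the SkyWalking cluster stable, configure resource limits and a HorizontalPodAutoscaler (HPA):

# Example resource settings for the OAP Server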
resources:
  limits:
    cpu: "2"
    memory: "4Gi"
  requests:
    cpu: "1"
    memory: "2Gi"
---
apiVersion: autoscaling/v2  # GA since Kubernetes 1.23 (OpenShift 4.10); use autoscaling/v2beta2 on older clusters
kind: HorizontalPodAutoscaler
metadata:
  name: skywalking-oap
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: skywalking-oap
  minReplicas: 2
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
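
Autoscaling alone does not protect against voluntary disruptions such as node drains during cluster upgrades. For the multi-replica setup in this guide, a PodDisruptionBudget is one way to keep at least one OAP replica serving. This is a minimal sketch: the name is hypothetical, and it assumes the Pods are labeled app: skywalking-oap.

```yaml
# Sketch only: the name is hypothetical; policy/v1 needs Kubernetes 1.21 or later.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: skywalking-oap
spec:
  minAvailable: 1            # never evict the last running OAP Pod
  selector:
    matchLabels:
      app: skywalking-oap
```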

Alarm Rule Configuration

Mount custom alarm rules through a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: skywalking-alarm-rules
data:
  alarm-settings.yml: |-
    rules:
      service_resp_time_rule:
        metrics-name: service_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 3
        silence-period: 5
        message: "Response time of service {name} is more than 1 second"
---
# Mount it in the OAP Deployment
volumeMounts:
- name: alarm-rules
  mountPath: /skywalking/config/alarm-settings.yml
  subPath: alarm-settings.yml
volumes:
- name: alarm-rules
  configMap:
    name: skywalking-alarm-rules

A complete set of example alarm rules is available in dist-material/alarm-settings.yml, which includes more than 20 preset rules.
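
The rule in the ConfigMap uses the older metrics-name, op, threshold, and count fields. More recent OAP releases describe the trigger condition with a single expression field instead, which matters when running the latest image tag. The following sketch shows what the same rule might look like in that syntax; treat the expression as an assumption and check it against the alarm documentation for the OAP version you deploy.

```yaml
# Sketch only: assumes an OAP release whose alarm rules use the expression field.
apiVersion: v1
kind: ConfigMap
metadata:
  name: skywalking-alarm-rules
data:
  alarm-settings.yml: |-
    rules:
      service_resp_time_rule:
        # Fire when response time exceeded 1000 ms in at least 3 of the last 10 minutes
        expression: sum(service_resp_time > 1000) >= 3
        period: 10
        silence-period: 5
        message: "Response time of service {name} is more than 1 second"
```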

Deployment Verification and Troubleshooting

Status Check Commands

# Check Pod status
oc get pods -l app=skywalking-oap
oc get pods -l app=skywalking-ui

# View logs
oc logs -f deployment/skywalking-oap
oc logs -f deployment/skywalking-ui

# Test the OAP Server's health endpoint (requires curl in the OAP image)
oc exec -it deployment/skywalking-oap -- curl http://localhost:12800/internal/l7check
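
The same health endpoint can back the Pod's probes, so that during a rolling update traffic only reaches OAP replicas that are ready. This is a sketch of probe settings for the oap container; the delays and periods are illustrative guesses and should be tuned to the real startup time.

```yaml
# Sketch only: goes under the oap container in the OAP Deployment; timings are illustrative.
readinessProbe:
  httpGet:
    path: /internal/l7check   # the endpoint used in the check above
    port: 12800
  initialDelaySeconds: 30
  periodSeconds: 10
livenessProbe:
  tcpSocket:
    port: 12800
  initialDelaySeconds: 60
  periodSeconds: 20
```

The liveness probe uses a plain TCP check so that a slow health response does not trigger restarts.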

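Common Problems

  1. OAP fails to start: check connectivity to Elasticsearch and confirm that the SW_STORAGE_ES_CLUSTER_NODES environment variable is set correctly
  2. UI cannot reach OAP: verify that SW_OAP_ADDRESS points to the OAP Service, including the scheme and port (for example http://skywalking-oap:12800), and that the Service publishes port 12800
  3. Data persistence problems: make sure the PVC is bound; oc describe pvc shows the storage state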

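Production Best Practices

Storage Optimization

For large-scale deployments, consider BanyanDB as the storage backend; it is SkyWalking's native database and is designed to use fewer resources than Elasticsearch: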

env:
- name: SW_STORAGE
  value: banyandb
- name: SW_STORAGE_BANYANDB_TARGETS
  value: "banyandb:17912"

For BanyanDB's Docker Compose configuration, see lines 43-60 of docker/docker-compose.yml, which include a health check and data persistence settings.

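Security Configuration

  1. TLS encryption: use OpenShift Service Mesh to enable mTLS for service-to-service traffic
  2. RBAC: create a dedicated Service Account and restrict its permissions
  3. Sensitive data: manage storage passwords and other secrets through Secret objects

# Example security context settings
# Pod level (spec.template.spec.securityContext)
securityContext:
  runAsNonRoot: true
  # runAsUser and fsGroup are omitted: under the default restricted SCC, OpenShift
  # assigns them from the namespace's range and rejects fixed values such as 1000
---
# Container level (spec.template.spec.containers[].securityContext)
securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop: ["ALL"]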
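
For item 3, storage credentials can be kept in a Secret and referenced from the OAP container instead of being written into the Deployment. The sketch below assumes an Elasticsearch cluster with authentication enabled (the single-node OSS image used earlier has none); the Secret name, keys, and values are hypothetical placeholders.

```yaml
# Sketch only: the Secret name, keys, and values are hypothetical.
apiVersion: v1
kind: Secret
metadata:
  name: skywalking-es-credentials
type: Opaque
stringData:
  username: elastic
  password: change-me          # placeholder; never commit real credentials
---
# Fragment for the oap container's env list in the OAP Deployment
env:
- name: SW_ES_USER
  valueFrom:
    secretKeyRef:
      name: skywalking-es-credentials
      key: username
- name: SW_ES_PASSWORD
  valueFrom:
    secretKeyRef:
      name: skywalking-es-credentials
      key: password
```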

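Summary and Next Steps

This article covered the core workflow for deploying SkyWalking on OpenShift, including:

  • How the OAP image is defined (docker/oap/Dockerfile)
  • Configuring a multi-replica, highly available deployment
  • Onboarding applications without code changes
  • Configuring alarms and monitoring

Topics to explore next:

  • Integrating OpenTelemetry to collect data from applications in other languages
  • Using SkyWalking Rover to monitor infrastructure
  • Analyzing performance bottlenecks from trace data

The complete documentation is in the project's docs/ directory, which covers architecture design, API documentation, and advanced configuration guides.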

Creation notice: parts of this article were generated with AI assistance (AIGC) and are for reference only.
