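Zero-Downtime Deployment! A Practical Guide to Container Monitoring with Apache SkyWalking on OpenShift
Have you run into microservice performance black boxes on your OpenShift cluster? Apache SkyWalking, a widely used distributed tracing solution for container platforms, can help you visualize service call chains, monitor performance metrics, and automate alerting. This article walks through three core steps, from environment preparation to advanced configuration, and includes 10+ practical configuration items and 2 production-grade optimization tips.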
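Deployment Architecture Overview
On OpenShift, SkyWalking uses the classic three-tier "agent, backend, UI" architecture, deployed as containers so that it integrates with the Kubernetes ecosystem.
Core components:
- OAP Server: receives and analyzes tracing data; corresponds to the image built from docker/oap/Dockerfile
- UI service: provides the visualization interface; its startup script is docker/ui/docker-entrypoint.sh
- Storage backend: supports Elasticsearch or BanyanDB; see docker/docker-compose.yml for a configuration example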
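Prerequisites and Environment Preparation
Before deploying, make sure the following requirements are met:
- OpenShift cluster: version 4.6+, with at least 3 worker nodes
- Storage: at least 10Gi of PV (persistent volume) for Elasticsearch
- Permissions: the cluster-admin role, or permission to create CRDs
- Network policy: pod-to-pod communication allowed on ports 11800/tcp and 12800/tcp
Check the cluster status with the following commands: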
oc get nodes
oc get storageclass
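If the namespace restricts traffic by default, the network-policy prerequisite above could be met with a policy along the following lines. This is a minimal sketch, not a definitive policy: the policy name is hypothetical, and it assumes the OAP pods carry the label app: skywalking-oap (the label used by the OAP Deployment in this guide) and that agents run in the same namespace.

```yaml
# Sketch only: the name is an assumption, and the same-namespace scope is an assumption.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-skywalking-oap      # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: skywalking-oap         # selects the OAP Server pods
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}         # any pod in the same namespace
      ports:
        - protocol: TCP
          port: 11800             # gRPC, used by agents
        - protocol: TCP
          port: 12800             # HTTP, used by the UI
```

Agents in other namespaces would additionally need a namespaceSelector entry under from.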
Deployment Steps in Detail
1. Deploying the Storage Backend
SkyWalking supports two storage options, Elasticsearch and BanyanDB. This guide uses Elasticsearch as the example:
# elasticsearch-deploy.yaml (key configuration)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: elasticsearch
spec:
  serviceName: elasticsearch
  replicas: 1
  selector:                       # required by apps/v1; must match the pod template labels
    matchLabels:
      app: elasticsearch
  template:
    metadata:
      labels:
        app: elasticsearch
    spec:
      containers:
        - name: elasticsearch
          image: docker.elastic.co/elasticsearch/elasticsearch-oss:7.4.2
          env:
            - name: discovery.type
              value: single-node
            - name: ES_JAVA_OPTS
              value: "-Xms512m -Xmx512m"
          ports:
            - containerPort: 9200
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          volumeMounts:
            - name: data
              mountPath: /usr/share/elasticsearch/data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        resources:
          requests:
            storage: 10Gi
Run the deployment commands:
oc apply -f elasticsearch-deploy.yaml
oc rollout status statefulset/elasticsearch
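For production, a 3-node Elasticsearch cluster is recommended. Storage tuning options are set in the storage section of the OAP Server's application.yml; the dist-material/config-examples/alarm-settings.yml file covers alarm rules only.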
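The StatefulSet sets serviceName: elasticsearch, and the OAP Server later connects to elasticsearch:9200, so both rely on a Service with that name existing. A minimal sketch of such a headless Service follows; it assumes the Elasticsearch pod template carries the label app: elasticsearch, which is an illustrative label rather than something prescribed by SkyWalking.

```yaml
# Sketch only: assumes the Elasticsearch pods are labeled app: elasticsearch.
apiVersion: v1
kind: Service
metadata:
  name: elasticsearch             # must match serviceName in the StatefulSet
spec:
  clusterIP: None                 # headless Service, the usual pairing for a StatefulSet
  selector:
    app: elasticsearch            # assumed pod label
  ports:
    - name: http
      port: 9200
      targetPort: 9200
```

With a headless Service, the name elasticsearch resolves directly to the pod IPs, which is enough for the single-node setup shown here.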
2. Deploying the OAP Server
Create a Deployment for the OAP Server. The key configuration is:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-oap
spec:
  replicas: 2                     # at least 2 replicas recommended for production
  selector:
    matchLabels:
      app: skywalking-oap
  template:
    metadata:
      labels:
        app: skywalking-oap
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "1234"
    spec:
      containers:
        - name: oap
          image: ghcr.io/apache/skywalking/oap:latest   # pin a release tag in production
          ports:
            - containerPort: 11800   # gRPC port
            - containerPort: 12800   # HTTP port
          env:
            - name: SW_STORAGE
              value: elasticsearch
            - name: SW_STORAGE_ES_CLUSTER_NODES
              value: elasticsearch:9200
            - name: JAVA_OPTS
              value: "-Xms2G -Xmx2G"   # adjust to the node's resources
          resources:
            limits:
              cpu: "1"
              memory: "4Gi"
            requests:
              cpu: "500m"
              memory: "2Gi"
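Deployment commands:
oc apply -f oap-deploy.yaml
# Note: this exposes only the HTTP port 12800; agents also need the gRPC port 11800 on the Service
oc expose deployment skywalking-oap --port=12800
The OAP Server's Dockerfile defines the base image and the startup flow; see docker/oap/Dockerfile, where line 48 exposes three key ports: 12800 (HTTP), 11800 (gRPC), and 1234 (Prometheus telemetry).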
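Because the agent configuration later in this guide points at skywalking-oap:11800 while the oc expose command publishes only port 12800, it may be simpler to define the Service explicitly with both ports. The following is a minimal sketch to use in place of oc expose; the port names grpc and http are illustrative choices.

```yaml
# Sketch only: an explicit Service for the OAP Deployment, exposing both ports.
apiVersion: v1
kind: Service
metadata:
  name: skywalking-oap
spec:
  selector:
    app: skywalking-oap           # matches the Deployment's pod labels
  ports:
    - name: grpc                  # illustrative port name
      port: 11800
      targetPort: 11800
    - name: http                  # illustrative port name
      port: 12800
      targetPort: 12800
```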
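3. Deploying the UI Service and Configuring the Route
Deploying the UI service is relatively simple. The key point is setting the OAP Server address: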
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-ui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skywalking-ui
  template:
    metadata:
      labels:
        app: skywalking-ui
    spec:
      containers:
        - name: ui
          image: ghcr.io/apache/skywalking/ui:latest
          ports:
            - containerPort: 8080
          env:
            - name: SW_OAP_ADDRESS
              value: http://skywalking-oap:12800
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: skywalking-ui
spec:
  to:
    kind: Service
    name: skywalking-ui
  tls:
    termination: edge             # termination types are lowercase: edge, passthrough, reencrypt
After creating the Route, get the access address with:
oc get route skywalking-ui -o jsonpath='{.spec.host}'
The UI startup script, docker/ui/docker-entrypoint.sh, handles environment-variable injection at startup, including the OAP Server address set through SW_OAP_ADDRESS.
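The Route above targets a Service named skywalking-ui, which the manifests in this step do not create. A minimal sketch of that Service, assuming the UI listens on container port 8080 as declared in the Deployment:

```yaml
# Sketch only: the Service that the skywalking-ui Route points to.
apiVersion: v1
kind: Service
metadata:
  name: skywalking-ui             # must match spec.to.name in the Route
spec:
  selector:
    app: skywalking-ui            # matches the UI Deployment's pod labels
  ports:
    - name: http                  # illustrative port name
      port: 8080
      targetPort: 8080
```

Apply it before or together with the Route so that the Route has endpoints to forward to.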
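Application Onboarding and Data Collection
Automatic Injection for Java Applications
Use an Init Container to copy the Agent into a volume shared with the application container: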
spec:
  initContainers:
    - name: agent-init
      image: ghcr.io/apache/skywalking/agent:latest
      # Copy the contents of the agent directory (note the trailing "/.") into the shared
      # volume, so the jar ends up at /skywalking/agent/skywalking-agent.jar in the app container
      command: ['sh', '-c', 'cp -r /skywalking/agent/. /agent/']
      volumeMounts:
        - name: agent-volume
          mountPath: /agent
  containers:
    - name: app
      image: your-app-image:latest
      volumeMounts:
        - name: agent-volume
          mountPath: /skywalking/agent
      env:
        - name: JAVA_TOOL_OPTIONS
          value: "-javaagent:/skywalking/agent/skywalking-agent.jar"
        - name: SW_AGENT_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.labels['app']
        - name: SW_AGENT_COLLECTOR_BACKEND_SERVICES
          value: "skywalking-oap:11800"
  volumes:
    - name: agent-volume
      emptyDir: {}
Monitoring Metrics Configuration
Modify the OAP Server configuration to expose its own metrics in Prometheus format:
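env:
  - name: SW_TELEMETRY
    value: prometheus
  # Host and port of the telemetry endpoint, named after the telemetry.prometheus
  # section of application.yml; verify the variable names against your OAP version
  - name: SW_TELEMETRY_PROMETHEUS_HOST
    value: "0.0.0.0"
  - name: SW_TELEMETRY_PROMETHEUS_PORT
    value: "1234"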
For the full set of options, see the environment-variable handling logic in docker/oap/docker-entrypoint.sh.
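On clusters where Prometheus is managed by the Prometheus Operator, as in OpenShift's monitoring stack, scrape targets are usually declared with a ServiceMonitor rather than pod annotations. The following is a rough sketch under that assumption; the resource names and the skywalking-telemetry label are hypothetical, and it assumes telemetry is served on port 1234 as configured above.

```yaml
# Sketch only: resource names and the skywalking-telemetry label are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: skywalking-oap-telemetry  # hypothetical name
  labels:
    skywalking-telemetry: "true"  # hypothetical label, matched by the ServiceMonitor
spec:
  selector:
    app: skywalking-oap
  ports:
    - name: telemetry
      port: 1234
      targetPort: 1234
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: skywalking-oap            # hypothetical name
spec:
  selector:
    matchLabels:
      skywalking-telemetry: "true"
  endpoints:
    - port: telemetry             # refers to the named Service port above
      interval: 30s
```

Whether user-defined ServiceMonitors are picked up depends on how monitoring for user workloads is configured on the cluster.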
Advanced Configuration and Optimization
Resource Limits and Autoscaling
To keep the SkyWalking cluster stable, configure resource limits and an HPA:
# Example resource configuration for the OAP Server
resources:
  limits:
    cpu: "2"
    memory: "4Gi"
  requests:
    cpu: "1"
    memory: "2Gi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: skywalking-oap
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: skywalking-oap
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
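Scaling and rolling updates only avoid downtime if traffic is sent to pods that are actually ready. One way to approach this is a rolling-update strategy combined with a readiness probe, sketched below as a fragment to merge into the skywalking-oap Deployment. The TCP check on port 12800 and the timing values are assumptions to tune, not settings taken from the SkyWalking project.

```yaml
# Sketch only: fragment to merge into the skywalking-oap Deployment; timings are assumptions.
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0           # never drop below the desired replica count
      maxSurge: 1                 # start one new pod before stopping an old one
  template:
    spec:
      containers:
        - name: oap
          readinessProbe:
            tcpSocket:
              port: 12800         # pod receives traffic once the HTTP port accepts connections
            initialDelaySeconds: 15
            periodSeconds: 20
```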
Alarm Rule Configuration
Mount custom alarm rules through a ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: skywalking-alarm-rules
data:
  alarm-settings.yml: |-
    rules:
      service_resp_time_rule:
        metrics-name: service_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 3
        silence-period: 5
        message: "Response time of service {name} exceeds 1 second"
---
# Mount it in the OAP Deployment
volumeMounts:
  - name: alarm-rules
    mountPath: /skywalking/config/alarm-settings.yml
    subPath: alarm-settings.yml
volumes:
  - name: alarm-rules
    configMap:
      name: skywalking-alarm-rules
For a complete example of alarm rule configuration, see dist-material/alarm-settings.yml, which ships with a set of preset rules.
Deployment Verification and Troubleshooting
Status Check Commands
# Check Pod status
oc get pods -l app=skywalking-oap
oc get pods -l app=skywalking-ui
# View logs
oc logs -f deployment/skywalking-oap
oc logs -f deployment/skywalking-ui
# Test the OAP Server health status
oc exec -it deployment/skywalking-oap -- curl http://localhost:12800/internal/l7check
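Common Problems and Fixes
- OAP fails to start: check the Elasticsearch connection and confirm that the SW_STORAGE_ES_CLUSTER_NODES environment variable is set correctly
- UI cannot reach OAP: verify that SW_OAP_ADDRESS points to the OAP Service address and port, for example http://skywalking-oap:12800
- Data persistence problems: make sure the PVC is bound; use oc describe pvc to inspect the storage status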
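Production Best Practices
Storage Optimization
For large-scale deployments, consider BanyanDB as the storage backend; it is designed to use fewer resources than Elasticsearch: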
env:
  - name: SW_STORAGE
    value: banyandb
  - name: SW_STORAGE_BANYANDB_TARGETS
    value: "banyandb:17912"
For the BanyanDB Docker Compose configuration, see lines 43-60 of docker/docker-compose.yml, which include the health check and data persistence settings.
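Security Configuration
- TLS encryption: use OpenShift Service Mesh to enable mTLS for service-to-service communication
- RBAC control: create a dedicated Service Account and restrict its permissions
- Sensitive data: manage storage passwords and other sensitive settings through Secrets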
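# Example security context configuration
spec:
  securityContext:                # pod-level fields
    runAsNonRoot: true
    runAsUser: 1000               # under the restricted SCC, omit this so OpenShift assigns a UID
    fsGroup: 1000
  containers:
    - name: oap
      securityContext:            # container-level fields
        allowPrivilegeEscalation: false
        capabilities:
          drop: ["ALL"]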
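The advice above about keeping sensitive settings in Secrets could look like the following for Elasticsearch credentials. This is a sketch under assumptions: the Secret name and placeholder values are hypothetical, the SW_ES_USER and SW_ES_PASSWORD variable names should be checked against the storage section of application.yml for your OAP version, and credentials only apply when the Elasticsearch cluster has authentication enabled.

```yaml
# Sketch only: the Secret name and values are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: skywalking-es-credentials # hypothetical name
type: Opaque
stringData:
  username: elastic               # placeholder
  password: change-me             # placeholder
---
# Fragment for the OAP container's env list in the Deployment
env:
  - name: SW_ES_USER
    valueFrom:
      secretKeyRef:
        name: skywalking-es-credentials
        key: username
  - name: SW_ES_PASSWORD
    valueFrom:
      secretKeyRef:
        name: skywalking-es-credentials
        key: password
```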
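Summary and Further Learning
This article covered the core workflow for deploying SkyWalking on OpenShift, including:
- Deploying the OAP Server from the image built by docker/oap/Dockerfile
- Configuring a multi-replica, highly available deployment
- Onboarding applications without code changes
- Configuring alerting and monitoring
Next steps to explore:
- Integrating OpenTelemetry to collect data from applications in multiple languages
- Using SkyWalking Rover to monitor infrastructure
- Analyzing performance bottlenecks based on trace data
For the complete documentation, see the project's docs/ directory, which covers architecture design, API documentation, and advanced configuration guides.
Disclosure: parts of this article were generated with AI assistance (AIGC) and are for reference only.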



