The following is an in-depth optimization guide for installing a production-grade Kubernetes cluster with Kubespray, covering high-availability architecture, security hardening, performance tuning, and disaster recovery to meet enterprise requirements.
I. Production Environment Architecture Design
Recommended Topology
Key Components
| Component | Production-grade configuration | Notes |
|---|---|---|
| Master nodes | 3 nodes (spread across racks/availability zones) | Avoids a single point of failure |
| etcd cluster | 3 dedicated nodes (SSD disks, low-latency network) | Separated from Masters to avoid resource contention |
| Load balancer | HAProxy + Keepalived (Active-Standby) | Provides a virtual IP (VIP) |
| Worker nodes | Pooled by workload type (CPU-intensive / GPU / memory-optimized) | Resource isolation |
| Network plugin | Calico in IPIP/BGP mode | Supports NetworkPolicy |
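The 3-node sizing for Masters and etcd in the table falls out of Raft quorum arithmetic: a cluster of n members needs floor(n/2)+1 votes to commit, so even member counts add cost without adding failure tolerance. A quick sketch:

```shell
# Raft quorum: a cluster of n members needs floor(n/2)+1 votes,
# so it survives the loss of n - quorum members
for n in 1 3 5; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerated_failures=$(( n - quorum ))"
done
```

A 3-node cluster tolerates one failure while a 4-node cluster still only tolerates one, which is why odd sizes are recommended.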
II. Production Deployment Workflow
1. Node Preparation (all nodes)
```shell
# Disable swap and tune kernel parameters
sudo swapoff -a
sudo sed -i '/swap/s/^/#/' /etc/fstab

# Load the br_netfilter module required by the bridge sysctls below
sudo modprobe br_netfilter
echo br_netfilter | sudo tee /etc/modules-load.d/k8s.conf

# Set sysctl parameters
cat <<EOF | sudo tee /etc/sysctl.d/99-k8s.conf
net.ipv4.ip_forward = 1
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
vm.swappiness = 0
vm.max_map_count = 262144
EOF
sudo sysctl -p /etc/sysctl.d/99-k8s.conf

# Install base tools
sudo apt-get update && sudo apt-get install -y \
  apt-transport-https ca-certificates curl \
  ipvsadm ipset conntrack ntp
```
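The `sed` one-liner above comments out every `/etc/fstab` line containing `swap`; the pattern can be dry-run against a scratch copy before touching the real file (the sample entries below are illustrative):

```shell
# Build a scratch fstab and apply the same pattern used above
cat > /tmp/fstab.test <<'EOF'
UUID=abcd-1234 / ext4 defaults 0 1
UUID=ef56-7890 none swap sw 0 0
EOF
sed -i '/swap/s/^/#/' /tmp/fstab.test
cat /tmp/fstab.test   # only the swap line gains a leading '#'
```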
2. Load Balancer Configuration (HAProxy + Keepalived)
/etc/haproxy/haproxy.cfg:
```
frontend k8s-api
    bind *:6443
    mode tcp
    default_backend k8s-masters

backend k8s-masters
    balance roundrobin
    mode tcp
    server master1 192.168.1.10:6443 check
    server master2 192.168.1.11:6443 check
    server master3 192.168.1.12:6443 check
```
/etc/keepalived/keepalived.conf:
```
vrrp_script chk_haproxy {
    script "killall -0 haproxy"
    interval 2
}

vrrp_instance VI_1 {
    interface eth0
    state MASTER          # set to BACKUP on the standby node
    virtual_router_id 51
    priority 100          # set a lower value on the standby node
    virtual_ipaddress {
        192.168.1.100/24
    }
    track_script {
        chk_haproxy
    }
}
```
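The `chk_haproxy` script works because sending signal 0 delivers nothing to the target: it merely reports whether a process with that name exists. The same semantics, demonstrated with the shell builtin `kill -0` and a stand-in process:

```shell
# Signal 0 probes for process existence without affecting it
sleep 30 &
pid=$!
kill -0 "$pid" && echo "process alive"
kill "$pid"
wait "$pid" 2>/dev/null || true
kill -0 "$pid" 2>/dev/null || echo "process gone"
```

When the check fails, Keepalived marks the node faulty and the VIP fails over to the standby.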
III. Production-Grade Kubespray Configuration
1. Key Parameter Tuning (inventory/mycluster/group_vars/)
all.yml:
```yaml
# Container runtime
container_manager: containerd

# Network plugin (Calico is recommended for production)
kube_network_plugin: calico
calico_ipip_mode: "CrossSubnet"   # IPIP across subnets, BGP within a subnet
calico_vxlan_mode: "Never"

# Image registry (private registry credentials)
gcr_image_repo: "registry.example.com/google_containers"
docker_registry_auths:
  "registry.example.com":
    username: "user"
    password: "pass"
```
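`CrossSubnet` means Calico only encapsulates traffic between nodes on different subnets and uses plain BGP routing otherwise. A toy illustration of that decision for /24-masked node addresses (the helper function and addresses are made up for illustration):

```shell
# Compare the /24 network portion of two node IPs
same_subnet() {
  if [ "${1%.*}" = "${2%.*}" ]; then
    echo "same subnet -> native BGP routing"
  else
    echo "different subnet -> IPIP encapsulation"
  fi
}
same_subnet 192.168.1.10 192.168.1.11
same_subnet 192.168.1.10 192.168.2.10
```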
k8s-cluster.yml:
```yaml
# High availability: the API server is reached through the external LB
apiserver_loadbalancer_domain_name: "k8s-api.example.com"
loadbalancer_apiserver:
  address: 192.168.1.100   # VIP address
  port: 6443

# Resource limits
kubelet_max_pods: 250
kubelet_pods_per_core: 10

# Security hardening
kube_encrypt_secret_data: true   # encrypt Secrets at rest
# Note: PodSecurityPolicy was removed in Kubernetes v1.25; on newer clusters
# use Pod Security Admission (namespace labels) instead of this legacy option.
# enable_pod_security_policies: true
```
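Because PodSecurityPolicy was removed in Kubernetes v1.25, the equivalent hardening on recent versions is Pod Security Admission, applied per namespace via labels. A minimal example (the namespace name is illustrative):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: prod-apps
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/warn: restricted
```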
etcd.yml:
```yaml
# Dedicated etcd cluster
etcd_deployment_type: host
etcd_data_dir: "/var/lib/etcd"
etcd_disk_priority: high
etcd_compaction_retention: "2"   # auto-compaction retention, in hours
```
IV. Security Hardening
1. Certificate Management
```shell
# Custom CA root certificate
openssl genrsa -out ca.key 2048
openssl req -x509 -new -nodes -key ca.key -days 3650 \
  -subj "/CN=kubernetes-ca" -out ca.crt
```
Kubespray configuration:
```yaml
kube_certificates_custom_ca: true
kube_certificates_ca_crt: "{{ lookup('file', '/path/to/ca.crt') }}"
kube_certificates_ca_key: "{{ lookup('file', '/path/to/ca.key') }}"
```
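The CA generation above can be rehearsed end-to-end on any machine with OpenSSL before wiring the files into Kubespray (the /tmp paths are for illustration only):

```shell
# Generate a throwaway CA and confirm its subject and validity window
openssl genrsa -out /tmp/ca.key 2048
openssl req -x509 -new -nodes -key /tmp/ca.key -days 3650 \
  -subj "/CN=kubernetes-ca" -out /tmp/ca.crt
openssl x509 -in /tmp/ca.crt -noout -subject -enddate
```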
2. RBAC and Policies
```yaml
# Enable OPA Gatekeeper
gatekeeper_enabled: true
gatekeeper_version: v3.12.0
```
Example network policy:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
```
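A blanket default-deny like the one above also blocks DNS, so it is usually paired with a policy that re-allows egress to the cluster DNS service; a common companion (the selectors assume the standard kube-dns labels on your CoreDNS pods):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```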
V. Monitoring and Logging
1. Monitoring Stack Deployment
```yaml
# Enable Prometheus + Alertmanager
prometheus_enabled: true
alertmanager_enabled: true
grafana_enabled: true

# Key alerting rule
prometheus_alert_rules:
  - name: KubeAPIHighLatency
    expr: histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket[5m])) by (le)) > 1
```
2. Log Collection
```yaml
# EFK logging stack
efk_enabled: true
elasticsearch_data_storage_class: "ssd"
fluentd_logrotate_enabled: true
kibana_ingress_enabled: true
```
VI. Disaster Recovery and Upgrades
1. Cluster Backup
```shell
# etcd snapshot backup
ETCDCTL_API=3 etcdctl snapshot save snapshot.db \
  --endpoints=https://etcd1:2379 \
  --cacert=/etc/ssl/etcd/ca.crt \
  --cert=/etc/ssl/etcd/server.crt \
  --key=/etc/ssl/etcd/server.key

# Velero cloud-native backup
velero install \
  --provider aws \
  --bucket k8s-backup \
  --secret-file ./credentials
```
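For unattended backups, the snapshot command above is typically wrapped in a small cron-driven script that produces timestamped file names; a minimal sketch (the backup directory and the `backup_name` helper are illustrative, and the `etcdctl` call is commented out because it needs a live endpoint):

```shell
# Generate a collision-free snapshot name
backup_name() {
  echo "etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"
}
# ETCDCTL_API=3 etcdctl snapshot save "/backup/$(backup_name)" --endpoints=https://etcd1:2379 ...
backup_name
```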
2. Rolling Upgrade Strategy
```shell
# Staged upgrade: control plane first, then workers
# (kube-apiserver must be upgraded before the kubelets that talk to it)
ansible-playbook upgrade-cluster.yml \
  --limit=kube_control_plane \
  -e kube_version=v1.28.3
ansible-playbook upgrade-cluster.yml \
  --limit=kube_node \
  -e kube_version=v1.28.3
```
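Upgrading the control plane before workers keeps the cluster inside the Kubernetes version skew policy: kubelets may lag the API server by several minor versions but must never lead it. A sanity check one can script before each stage (the version strings and `minor` helper are examples):

```shell
# Extract the minor version and verify kubelets do not lead the API server
minor() { echo "${1#v}" | cut -d. -f2; }
api_minor=$(minor v1.28.3)
node_minor=$(minor v1.27.9)
if [ "$node_minor" -le "$api_minor" ]; then
  echo "skew OK (api=$api_minor, node=$node_minor)"
else
  echo "kubelet ahead of API server: upgrade the control plane first"
fi
```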
VII. Production Validation Checklist
- High-availability tests
  - Simulate a Master node outage:
    ```shell
    systemctl stop kubelet
    ```
  - Verify API service continuity:
    ```shell
    curl -k https://VIP:6443/healthz
    ```
- Failure recovery
  ```shell
  # Quickly restore an etcd member from a snapshot
  ETCDCTL_API=3 etcdctl snapshot restore snapshot.db \
    --data-dir /var/lib/etcd-new
  ```
- Performance load testing
  ```shell
  kubectl apply -f https://k8s.io/examples/application/deployment.yaml
  # assumes script.js is mounted into the pod (e.g. via a ConfigMap)
  kubectl run stress --image=loadimpact/k6 --restart=Never -- run /scripts/script.js
  ```
VIII. Key Production Recommendations
- Network isolation
  - Use Calico `NetworkSet`s to isolate sensitive Pods
  - Enable `egressGateway` to control egress traffic
- Resource quotas
  ```yaml
  apiVersion: v1
  kind: ResourceQuota
  metadata:
    name: prod-quota
  spec:
    hard:
      requests.cpu: "100"
      requests.memory: 200Gi
      limits.cpu: "200"
      limits.memory: 400Gi
  ```
- Audit logging
  ```yaml
  # Enable Kubernetes auditing
  audit_enabled: true
  audit_log_maxbackup: 10
  audit_policy_path: "/etc/kubernetes/audit-policy.yaml"
  ```
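The `audit_policy_path` setting expects a policy file on the control plane nodes; a minimal starting point that records metadata for Secret/ConfigMap access and full bodies for mutating requests (tune the rules to your own compliance needs):

```yaml
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  # Record who touched Secrets/ConfigMaps, but not their contents
  - level: Metadata
    resources:
      - group: ""
        resources: ["secrets", "configmaps"]
  # Record request and response bodies for mutating calls
  - level: RequestResponse
    verbs: ["create", "update", "patch", "delete"]
```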
Note: once the cluster is in production, run `kubespray-check` periodically to scan cluster health, and manage cluster configuration changes through a continuous-integration pipeline.
With the configuration above, you can build a financial-grade production Kubernetes environment that satisfies compliance standards such as China's MLPS 2.0 and PCI-DSS.