Kubernetes Cluster Autoscaling
Cluster autoscaling here means automatically adjusting the number of nodes in a cluster according to its resource usage, with the goal of using cluster resources efficiently and saving cost. It mainly applies to two scenarios (a quick way to observe both is shown after this list):
- The cluster runs out of resources and pods fail to schedule, so new nodes are provisioned automatically
- Some nodes stay underutilized for a long time and their pods can fit on other existing nodes, so those nodes are removed
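Both behaviors can be watched at runtime. As a quick check, cluster-autoscaler summarizes its per-node-group scale state in a status ConfigMap when started with --write-status-configmap=true, which is the case in the deployment shown later in this walkthrough; a sketch:

```bash
# Print the autoscaler's health/scale summary (assumes it runs in
# kube-system with --write-status-configmap=true, as in this walkthrough).
kubectl -n kube-system get configmap cluster-autoscaler-status \
  -o jsonpath='{.data.status}'
```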
The mainstream approach today is cluster-autoscaler, which implements cluster autoscaling for cloud providers including:
AliCloud
Azure
AWS
BaiduCloud
CloudStack
HuaweiCloud
Packet
IonosCloud
OVHcloud
Below is a configuration example for Azure. cluster-autoscaler can be deployed into the cluster as an addon; the following is the cluster-autoscaler section of an aks-engine deployment template.
- aks-engine template configuration
"addons": [
{
"name": "cluster-autoscaler",
"enabled": true,
"pools": [
{
"name": "prdconapl",
"config": {
"min-nodes": "3",
"max-nodes": "500"
}
},
{
"name": "prdconeny",
"config": {
"min-nodes": "3",
"max-nodes": "3"
}
}
],
"config": {
"scan-interval": "1m"
}
}
The cluster is configured with two agent pools: prdconapl has a minimum of 3 nodes and a maximum of 500, while prdconeny has both minimum and maximum set to 1, which means prdconeny will never be scaled.
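Each pool entry surfaces as one `--nodes=<min>:<max>:<node-group>` argument on the cluster-autoscaler binary; the pod spec shown later in this walkthrough carries exactly these flags:

```bash
--nodes=3:500:k8s-prdconapl-81692357-vmss   # from pool "prdconapl"
--nodes=1:1:k8s-prdconeny-81692357-vmss     # from pool "prdconeny"
```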
- Deploy the cluster
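The deploy step itself is not shown here; a minimal sketch with aks-engine, where the api-model path, location, and credentials are placeholders rather than values from this cluster:

```bash
# All values below are placeholders; substitute your own api-model and
# service-principal credentials. --azure-env matches the AzureChinaCloud
# environment this cluster reports in its pod spec later on.
aks-engine deploy \
  --azure-env AzureChinaCloud \
  --api-model ./kubernetes.json \
  --location chinanorth \
  --subscription-id "$ARM_SUBSCRIPTION_ID" \
  --client-id "$ARM_CLIENT_ID" \
  --client-secret "$ARM_CLIENT_SECRET"
```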
- Check the cluster-autoscaler
```
vmadmin@reh-connectivity-jumpbox:~$ kubectl get po -n kube-system | grep cluster-autoscaler
cluster-autoscaler-86744d8775-d7n8x 1/1 Running 0 17h
vmadmin@reh-connectivity-jumpbox:~$ kubectl describe po cluster-autoscaler -n kube-system
Name: cluster-autoscaler-86744d8775-d7n8x
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: k8s-master-81692357-0/172.16.2.235
Start Time: Tue, 05 Jan 2021 09:41:34 +0000
Labels: app=cluster-autoscaler
pod-template-hash=86744d8775
Annotations: kubernetes.io/psp: privileged
Status: Running
IP: 172.16.2.235
IPs:
IP: 172.16.2.235
Controlled By: ReplicaSet/cluster-autoscaler-86744d8775
Containers:
cluster-autoscaler:
Container ID: docker://fa45e88574d42655ac84002437b84044502e31a19a47a266649e64db0a53a952
Image: mcr.microsoft.com/oss/kubernetes/autoscaler/cluster-autoscaler:v1.17.3
Image ID: docker-pullable://mcr.microsoft.com/oss/kubernetes/autoscaler/cluster-autoscaler@sha256:288952aa6e7eba7b9a4f2bdac6fd0e96c0b58051b3539f1062444b8b8283b1c3
Port: <none>
Host Port: <none>
Command:
./cluster-autoscaler
--logtostderr=true
--cloud-provider=azure
--skip-nodes-with-local-storage=false
--scan-interval=1m
--expendable-pods-priority-cutoff=-10
--ignore-daemonsets-utilization=false
--ignore-mirror-pods-utilization=false
--max-autoprovisioned-node-group-count=15
--max-empty-bulk-delete=10
--max-failing-time=15m0s
--max-graceful-termination-sec=600
--max-inactivity=10m0s
--max-node-provision-time=15m0s
--max-nodes-total=0
--max-total-unready-percentage=45
--memory-total=0:6400000
--min-replica-count=0
--namespace=kube-system
--new-pod-scale-up-delay=0s
--node-autoprovisioning-enabled=false
--ok-total-unready-count=3
--scale-down-candidates-pool-min-count=50
--scale-down-candidates-pool-ratio=0.1
--scale-down-delay-after-add=10m0s
--scale-down-delay-after-delete=1m
--scale-down-delay-after-failure=3m0s
--scale-down-enabled=true
--scale-down-non-empty-candidates-count=30
--scale-down-unneeded-time=10m0s
--scale-down-unready-time=20m0s
--scale-down-utilization-threshold=0.5
--skip-nodes-with-local-storage=false
--skip-nodes-with-system-pods=true
--stderrthreshold=2
--unremovable-node-recheck-timeout=5m0s
--v=3
--write-status-configmap=true
--balance-similar-node-groups=true
--nodes=3:500:k8s-prdconapl-81692357-vmss
--nodes=1:1:k8s-prdconeny-81692357-vmss
State: Running
Started: Tue, 05 Jan 2021 09:41:54 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 100m
memory: 300Mi
Requests:
cpu: 100m
memory: 300Mi
Environment:
ARM_CLOUD: AzureChinaCloud
ARM_SUBSCRIPTION_ID: <set to the key 'SubscriptionID' in secret 'cluster-autoscaler-azure'> Optional: false
ARM_RESOURCE_GROUP: <set to the key 'ResourceGroup' in secret 'cluster-autoscaler-azure'> Optional: false
ARM_TENANT_ID: <set to the key 'TenantID' in secret 'cluster-autoscaler-azure'> Optional: false
ARM_CLIENT_ID: <set to the key 'ClientID' in secret 'cluster-autoscaler-azure'> Optional: false
ARM_CLIENT_SECRET: <set to the key 'ClientSecret' in secret 'cluster-autoscaler-azure'> Optional: false
ARM_VM_TYPE: <set to the key 'VMType' in secret 'cluster-autoscaler-azure'> Optional: false
ARM_USE_MANAGED_IDENTITY_EXTENSION: true
Mounts:
/etc/ssl/certs/ca-certificates.crt from ssl-certs (ro)
/var/lib/waagent/ from waagent (ro)
/var/run/secrets/kubernetes.io/serviceaccount from cluster-autoscaler-token-vbv8b (ro)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
ssl-certs:
Type: HostPath (bare host directory volume)
Path: /etc/ssl/certs/ca-certificates.crt
HostPathType:
waagent:
Type: HostPath (bare host directory volume)
Path: /var/lib/waagent/
HostPathType:
cluster-autoscaler-token-vbv8b:
Type: Secret (a volume populated by a Secret)
SecretName: cluster-autoscaler-token-vbv8b
Optional: false
QoS Class: Guaranteed
Node-Selectors: kubernetes.azure.com/role=master
kubernetes.io/os=linux
Tolerations: node-role.kubernetes.io/master=true:NoSchedule
node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events: <none>
```
As the describe output shows, the following two parameters define the node-count range for each VMSS:

```bash
--nodes=3:500:k8s-prdconapl-81692357-vmss
--nodes=1:1:k8s-prdconeny-81692357-vmss
```
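To confirm the ranges on a running cluster, one option is to grep the deployment spec (deployment name and namespace as used in this walkthrough):

```bash
kubectl -n kube-system get deployment cluster-autoscaler -o yaml | grep -- '--nodes'
```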
Scaling test
- Check the current cluster
```
vmadmin@reh-connectivity-jumpbox:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master-81692357-0 Ready master 17h v1.17.11
k8s-master-81692357-1 Ready master 17h v1.17.11
k8s-master-81692357-2 Ready master 17h v1.17.11
k8s-prdconapl-81692357-vmss000000 Ready agent 17h v1.17.11
k8s-prdconapl-81692357-vmss000001 Ready agent 17h v1.17.11
k8s-prdconapl-81692357-vmss000002 Ready agent 17h v1.17.11
k8s-prdconeny-81692357-vmss000000 Ready agent 17h v1.17.11
```
- Deploy a test application, e.g. nginx, and set its resource requests and limits (a declarative equivalent follows the command output)
```
vmadmin@reh-connectivity-jumpbox:~$ kubectl create deployment nginx --image nginx
deployment.apps/nginx created
vmadmin@reh-connectivity-jumpbox:~$ kubectl set resources deployment nginx --limits=cpu=3000m,memory=5120Mi
deployment.apps/nginx resource requirements updated
vmadmin@reh-connectivity-jumpbox:~$ kubectl set resources deployment nginx --requests=cpu=2000m,memory=4096Mi
deployment.apps/nginx resource requirements updated
vmadmin@reh-connectivity-jumpbox:~$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-5f6bdd864f-sg8wl 1/1 Running 0 82s
```
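For reference, a declarative sketch equivalent to the three imperative commands above, using the same image and resource values:

```bash
# Create the same nginx deployment in a single apply.
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx
        resources:
          requests:
            cpu: 2000m
            memory: 4096Mi
          limits:
            cpu: 3000m
            memory: 5120Mi
EOF
```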
- Check the current cluster state
```
vmadmin@reh-connectivity-jumpbox:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master-81692357-0 Ready master 18h v1.17.11
k8s-master-81692357-1 Ready master 18h v1.17.11
k8s-master-81692357-2 Ready master 18h v1.17.11
k8s-prdconapl-81692357-vmss000000 Ready agent 18h v1.17.11
k8s-prdconapl-81692357-vmss000001 Ready agent 18h v1.17.11
k8s-prdconapl-81692357-vmss000002 Ready agent 18h v1.17.11
k8s-prdconeny-81692357-vmss000000 Ready agent 18h v1.17.11
```
- Scale the nginx deployment to 50 replicas
```
vmadmin@reh-connectivity-jumpbox:~$ kubectl scale deployment/nginx --replicas=50
deployment.apps/nginx scaled
vmadmin@reh-connectivity-jumpbox:~$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-5f6bdd864f-25pl6 0/1 Pending 0 32s
nginx-5f6bdd864f-2w2fl 0/1 ContainerCreating 0 33s
nginx-5f6bdd864f-4kw5l 1/1 Running 0 33s
nginx-5f6bdd864f-54qc2 0/1 Pending 0 32s
nginx-5f6bdd864f-5gqh8 0/1 Pending 0 32s
nginx-5f6bdd864f-62zz7 1/1 Running 0 33s
nginx-5f6bdd864f-6d9f2 1/1 Running 0 33s
nginx-5f6bdd864f-6wwp9 1/1 Running 0 33s
nginx-5f6bdd864f-72zzf 1/1 Running 0 33s
nginx-5f6bdd864f-78ztw 1/1 Running 0 33s
nginx-5f6bdd864f-7mnlk 1/1 Running 0 33s
nginx-5f6bdd864f-7qq7c 1/1 Running 0 33s
nginx-5f6bdd864f-9f86r 0/1 ContainerCreating 0 33s
nginx-5f6bdd864f-9smrz 1/1 Running 0 33s
nginx-5f6bdd864f-fwqkc 0/1 Pending 0 32s
nginx-5f6bdd864f-fzph7 1/1 Running 0 33s
nginx-5f6bdd864f-gcwpw 1/1 Running 0 33s
nginx-5f6bdd864f-gp78l 0/1 Pending 0 32s
vmadmin@reh-connectivity-jumpbox:~$ kubectl describe po nginx-5f6bdd864f-54qc2
Name: nginx-5f6bdd864f-54qc2
Namespace: default
Priority: 0
Node: <none>
Labels: app=nginx
pod-template-hash=5f6bdd864f
Annotations: kubernetes.io/psp: privileged
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/nginx-5f6bdd864f
Containers:
nginx:
Image: nginx
Port: <none>
Host Port: <none>
Limits:
cpu: 3
memory: 5Gi
Requests:
cpu: 2
memory: 4Gi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-mdkhc (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-token-mdkhc:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-mdkhc
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal TriggeredScaleUp 73s cluster-autoscaler pod triggered scale-up: [{k8s-prdconapl-81692357-vmss 3->7 (max: 500)}]
Warning FailedScheduling 5s (x3 over 76s) default-scheduler 0/7 nodes are available: 3 node(s) had taints that the pod didn't tolerate, 4 Insufficient cpu.
```
As shown, some pods are Pending because of insufficient resources, and this has triggered a node scale-up: prdconapl is being expanded from 3 to 7 VMs.
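The same scale-up decision can be observed without describing an individual pod, for example:

```bash
# Scale-up events in the current namespace:
kubectl get events --field-selector reason=TriggeredScaleUp
# Or follow the autoscaler's own log (the addon labels its pod app=cluster-autoscaler):
kubectl -n kube-system logs -l app=cluster-autoscaler --tail=20
```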
Check the nodes again:
```
vmadmin@reh-connectivity-jumpbox:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master-81692357-0 Ready master 18h v1.17.11
k8s-master-81692357-1 Ready master 18h v1.17.11
k8s-master-81692357-2 Ready master 18h v1.17.11
k8s-prdconapl-81692357-vmss000000 Ready agent 18h v1.17.11
k8s-prdconapl-81692357-vmss000001 Ready agent 18h v1.17.11
k8s-prdconapl-81692357-vmss000002 Ready agent 18h v1.17.11
k8s-prdconapl-81692357-vmss000004 Ready <none> 12s v1.17.11
k8s-prdconapl-81692357-vmss000005 NotReady <none> 3s v1.17.11
k8s-prdconapl-81692357-vmss000006 Ready <none> 33s v1.17.11
k8s-prdconeny-81692357-vmss000000 Ready agent 18h v1.17.11
```
The prdconapl pool is adding nodes. Once the scale-up completes, check the application again:
```
vmadmin@reh-connectivity-jumpbox:~$ kubectl get pods | grep Pending
```
No pods are left in the Pending state.
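A field selector performs the same check without grep:

```bash
# Lists nothing once every replica has been scheduled.
kubectl get pods --field-selector=status.phase=Pending
```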
- Scale nginx back to 1 replica
```
vmadmin@reh-connectivity-jumpbox:~$ kubectl scale deployment/nginx --replicas=1
deployment.apps/nginx scaled
vmadmin@reh-connectivity-jumpbox:~$ kubectl get pods
NAME READY STATUS RESTARTS AGE
nginx-5f6bdd864f-2w2fl 0/1 Terminating 0 6m20s
nginx-5f6bdd864f-5gqh8 1/1 Running 0 6m19s
nginx-5f6bdd864f-h7fg5 0/1 Terminating 0 6m20s
nginx-5f6bdd864f-hz4k4 0/1 Terminating 0 6m20s
nginx-5f6bdd864f-jv7rt 0/1 Terminating 0 6m20s
nginx-5f6bdd864f-sgvr7 0/1 Terminating 0 6m20s
nginx-5f6bdd864f-t292t 0/1 Terminating 0 6m20s
nginx-5f6bdd864f-xnz5j 0/1 Terminating 0 6m19s
```
- With the default configuration, scale-down of unneeded nodes triggers after 10 minutes (--scale-down-unneeded-time=10m0s above), as the autoscaler log below shows
```
I0106 03:58:48.247318 1 scale_down.go:431] Scale-down calculation: ignoring 2 nodes unremovable in the last 5m0s
I0106 03:58:48.247482 1 cluster.go:93] Fast evaluation: k8s-prdconapl-81692357-vmss000000 for removal
I0106 03:58:48.247541 1 cluster.go:124] Fast evaluation: node k8s-prdconapl-81692357-vmss000000 may be removed
I0106 03:58:48.247548 1 cluster.go:93] Fast evaluation: k8s-prdconapl-81692357-vmss000003 for removal
I0106 03:58:48.247572 1 cluster.go:124] Fast evaluation: node k8s-prdconapl-81692357-vmss000003 may be removed
I0106 03:58:48.247692 1 scale_down.go:716] k8s-prdconapl-81692357-vmss000000 was unneeded for 11m2.097579977s
I0106 03:58:48.247740 1 scale_down.go:716] k8s-prdconapl-81692357-vmss000003 was unneeded for 9m1.6866918s
I0106 03:58:48.247873 1 cluster.go:93] Detailed evaluation: k8s-prdconapl-81692357-vmss000000 for removal
I0106 03:58:48.247933 1 cluster.go:124] Detailed evaluation: node k8s-prdconapl-81692357-vmss000000 may be removed
I0106 03:58:48.247953 1 scale_down.go:827] Scale-down: removing node k8s-prdconapl-81692357-vmss000000, utilization: {0.04375 0.010228299454857455 0 cpu 0.04375}, pods to reschedule: kubernetes-dashboard/dashboard-metrics-scraper-95856bb87-lrxkp
```
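These messages come from the autoscaler's scale-down loop (scale_down.go and cluster.go); they can be followed live with, for example:

```bash
# Stream scale-down evaluation from the running autoscaler.
kubectl -n kube-system logs -f deployment/cluster-autoscaler | grep -E 'scale_down|cluster\.go'
```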
```
vmadmin@reh-connectivity-jumpbox:~$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
k8s-master-81692357-0 Ready master 18h v1.17.11
k8s-master-81692357-1 Ready master 18h v1.17.11
k8s-master-81692357-2 Ready master 18h v1.17.11
k8s-prdconapl-81692357-vmss000000 NotReady agent 18h v1.17.11
k8s-prdconapl-81692357-vmss000001 Ready agent 18h v1.17.11
k8s-prdconapl-81692357-vmss000002 Ready agent 18h v1.17.11
k8s-prdconapl-81692357-vmss000003 Ready agent 17m v1.17.11
k8s-prdconeny-81692357-vmss000000 Ready agent 18h v1.17.11
```
The prdconapl pool is scaling back down from 7 nodes to 3 (the NotReady node above is in the process of being removed).
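A related knob worth knowing: cluster-autoscaler skips nodes that carry the scale-down-disabled annotation, which is useful when a particular node must survive scale-down (the node name below is just one from this cluster):

```bash
# Exclude a node from scale-down consideration.
kubectl annotate node k8s-prdconapl-81692357-vmss000001 \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true
```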