References
- https://docs.amazonaws.cn/eks/latest/userguide/autoscaling.html
- https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md
Deployment process
Prerequisites
- Create an EKS cluster
- Associate an IAM OIDC provider with the cluster
- Add the auto-discovery tags to the ASG:
  k8s.io/cluster-autoscaler/<cluster-name>: owned
  k8s.io/cluster-autoscaler/enabled: true
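The auto-discovery tags above can be added with the AWS CLI; a sketch, assuming the node group's ASG is named my-asg and the cluster is testca (substitute your own names):

```shell
# Tag the ASG so cluster-autoscaler auto-discovery can find it.
# ResourceId and the cluster name in the tag key are placeholders.
aws autoscaling create-or-update-tags --tags \
  "ResourceId=my-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/testca,Value=owned,PropagateAtLaunch=true" \
  "ResourceId=my-asg,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true"
```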
Create the required IAM policy. For convenience during testing, the condition keys were removed to broaden the policy's scope:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup"
            ],
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeAutoScalingGroups",
                "ec2:DescribeLaunchTemplateVersions",
                "autoscaling:DescribeTags",
                "autoscaling:DescribeLaunchConfigurations",
                "ec2:DescribeInstanceTypes"
            ],
            "Resource": "*"
        }
    ]
}
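With the JSON above saved to a file, the policy referenced by the eksctl command below can be created via the AWS CLI; a sketch (the file name is arbitrary, the policy name matches the ARN used later):

```shell
# Create the IAM policy for the cluster-autoscaler service account role
aws iam create-policy \
  --policy-name AmazonEKSClusterAutoscalerPolicy \
  --policy-document file://cluster-autoscaler-policy.json
```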
Create the service account and associate the IAM role:
eksctl create iamserviceaccount \
  --cluster=testca \
  --namespace=kube-system \
  --name=cluster-autoscaler \
  --attach-policy-arn=arn:aws-cn:iam::037047667284:policy/AmazonEKSClusterAutoscalerPolicy \
  --override-existing-serviceaccounts \
  --approve
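To confirm the IRSA association worked, check that the service account carries the role annotation; a quick check:

```shell
# The eks.amazonaws.com/role-arn annotation should point at the created role
kubectl -n kube-system get sa cluster-autoscaler \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'
```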
Deploy the autoscaler:
curl -O https://raw.githubusercontent.com/kubernetes/autoscaler/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-autodiscover.yaml
kubectl apply -f cluster-autoscaler-autodiscover.yaml
Manually import images that cannot be pulled directly:
# Export the image
# sudo nerdctl -n=k8s.io save -o temp.tar quay.io/prometheus/node-exporter:v1.5.0
# sudo ctr -n=k8s.io image export --platform=linux/amd64 temp.tar quay.io/prometheus/node-exporter:v1.5.0
export imagename=registry.k8s.io/autoscaling/cluster-autoscaler:v1.22.2
# docker pull $imagename
docker save -o temp.tar $imagename && aws s3 cp temp.tar s3://zhaojiew-test
# Import the image
aws s3 cp s3://zhaojiew-test/temp.tar . && sudo ctr -n=k8s.io image import temp.tar
# nerdctl -n=k8s.io load -i temp.tar
# docker load -i temp.tar
# List the images
ctr -n=k8s.io image ls
Submit a test workload:
$ cat echo-dep.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: echo-dep
spec:
  selector:
    matchLabels:
      app: http-echo
  replicas: 10
  template:
    metadata:
      labels:
        app: http-echo
    spec:
      containers:
      - name: http-echo
        image: hashicorp/http-echo:0.2.3
        args:
        - "-text=foo"
        ports:
        - containerPort: 5678
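One way to watch the resulting scale-up is to apply the deployment and follow the pending pods, the node list, and the autoscaler's decisions; a sketch:

```shell
kubectl apply -f echo-dep.yaml
# Pending pods trigger a scale-up once no existing node can fit them
kubectl get pods -l app=http-echo -o wide
# Watch new nodes join the cluster
kubectl get nodes -w
# Follow the autoscaler's decisions in its logs
kubectl -n kube-system logs -f deployment/cluster-autoscaler
```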
Because the current number of nodes is insufficient, new nodes are launched to satisfy the scheduling requirements. The decision to launch a new node is driven mainly by the pending pods: if a pod carries a nodeSelector, affinity rules, or taint tolerations, the autoscaler will only launch nodes that satisfy those constraints.
The autoscaler only changes the ASG's desired capacity; it never modifies the ASG's minimum or maximum.
Scale-down: after a node has been unneeded for the default duration (about 5 minutes by observation), it is terminated:
filter_out_schedulable.go:82] No schedulable pods
static_autoscaler.go:420] No unschedulable pods
static_autoscaler.go:467] Calculating unneeded nodes
scale_down.go:448] Node ip-192-168-20-239.cn-north-1.compute.internal - cpu utilization 0.122798
static_autoscaler.go:510] ip-192-168-20-239.cn-north-1.compute.internal is unneeded since 2023-02-20 06:26:06.892099794 +0000 UTC m=+1034.537371483 duration 0s
static_autoscaler.go:534] Starting scale down
scale_down.go:829] ip-192-168-20-239.cn-north-1.compute.internal was unneeded for 0s
scale_down.go:918] No candidates for scale down
delete.go:103] Successfully added DeletionCandidateTaint on node ip-192-168-20-239.cn-north-1.compute.internal
The full list of configurable parameters:
https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca
Nodes that will not be scaled down
https://blog.youkuaiyun.com/hello2mao/article/details/80418625
- Nodes running pods restricted by a PodDisruptionBudget.
- Nodes running pods in the kube-system namespace.
- Nodes running pods that were not created by a controller (Deployment, ReplicaSet, Job, StatefulSet, etc.).
- Nodes running pods that use local storage.
- Nodes whose pods would have nowhere to go after eviction, i.e. no other node can schedule them.
- Nodes annotated with "cluster-autoscaler.kubernetes.io/scale-down-disabled": "true".
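As an illustration of the first item, a minimal PodDisruptionBudget for the test deployment above (the name is hypothetical): with replicas: 10 and minAvailable: 8, draining a node whose eviction would drop the pod count below 8 is blocked, so the autoscaler will not remove that node.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: http-echo-pdb
spec:
  minAvailable: 8          # evictions that would drop below 8 pods are refused
  selector:
    matchLabels:
      app: http-echo
```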
Scaling from 0
When scaling up from 0 nodes, the cluster autoscaler reads the ASG tags to learn the node specification, i.e. its labels and taints.
Since no node has started yet, the labels and taints cannot be read from inside the cluster, so they must come from the ASG tags. The autoscaler only uses these tags for its scheduling simulation; the actual labels and taints are applied to the node by the user data.
Key: k8s.io/cluster-autoscaler/node-template/resources/$RESOURCE_NAME
Value: 5
Key: k8s.io/cluster-autoscaler/node-template/label/$LABEL_KEY
Value: $LABEL_VALUE
Key: k8s.io/cluster-autoscaler/node-template/taint/$TAINT_KEY
Value: NoSchedule
Starting with 1.24 clusters, these tags no longer have to be added to the ASG manually; the process has been simplified, but the autoscaler role needs the DescribeNodegroup permission. Also, when an ASG tag value conflicts with the node group itself, the ASG tag value takes precedence.
In addition, ASG tags can override the autoscaler's global settings for that specific ASG:
k8s.io/cluster-autoscaler/node-template/autoscaling-options/scaledownutilizationthreshold: 0.5
(overrides --scale-down-utilization-threshold for that specific ASG)
k8s.io/cluster-autoscaler/node-template/autoscaling-options/scaledowngpuutilizationthreshold: 0.5
(overrides --scale-down-gpu-utilization-threshold for that specific ASG)
k8s.io/cluster-autoscaler/node-template/autoscaling-options/scaledownunneededtime: 10m0s
(overrides --scale-down-unneeded-time for that specific ASG)
k8s.io/cluster-autoscaler/node-template/autoscaling-options/scaledownunreadytime: 20m0s
(overrides --scale-down-unready-time for that specific ASG)
To speed up testing, set --scale-down-unneeded-time to 1m.
Add --skip-nodes-with-system-pods=false so that when the node group scales down to 0 the pods can be fully evicted; DaemonSet pods are not taken into account.
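Both flags can be appended to the deployed manifest in place; a sketch using a JSON patch against the command list of the cluster-autoscaler container (deployment and container layout as in the upstream example manifest):

```shell
# Append the test-friendly flags to the cluster-autoscaler command list
kubectl -n kube-system patch deployment cluster-autoscaler --type=json -p='[
  {"op":"add","path":"/spec/template/spec/containers/0/command/-","value":"--scale-down-unneeded-time=1m"},
  {"op":"add","path":"/spec/template/spec/containers/0/command/-","value":"--skip-nodes-with-system-pods=false"}
]'
```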
Errors and solutions
Starting with the 1.24 image, one more permission is required (a pitfall):
aws_cloud_provider.go:386] Failed to generate AWS EC2 Instance Types: UnauthorizedOperation: You are not authorized to perform this operation
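The recommended policy in the upstream cloudprovider/aws README gained extra actions around 1.24; a statement along these lines covers them (verify the exact action list against the README for your version):

```json
{
    "Effect": "Allow",
    "Action": [
        "ec2:DescribeImages",
        "ec2:GetInstanceTypesFromInstanceRequirements",
        "eks:DescribeNodegroup"
    ],
    "Resource": "*"
}
```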
Note: be sure to modify the --node-group-auto-discovery field, otherwise the autoscaler reports that the node group config cannot be found.
containers:
- command:
  - ./cluster-autoscaler
  - --v=4
  - --stderrthreshold=info
  - --cloud-provider=aws
  - --skip-nodes-with-local-storage=false
  - --expander=least-waste
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/my-cluster
  - --balance-similar-node-groups
  - --skip-nodes-with-system-pods=false