This article describes how to autoscale GPU workloads on AWS EKS using Karpenter and KEDA. We will create a GPU node pool and deploy a GPU-backed AI application that scales automatically based on the number of messages in an SQS queue.
Prerequisites
- An AWS EKS cluster with Karpenter and KEDA installed
- The AWS CLI configured with appropriate permissions
- An Amazon ECR repository for the AI application's Docker image
Preparation
First, export the following environment variables:
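(The original post does not show the values; the exports below are illustrative, inferred from the resource names used later in this post — substitute your own.)
export ENV=test                             # Environment name, e.g. test or pro
export SERVER_NAME=light                    # Service name, used to compose resource names
export CLUSTER_NAME=<your-cluster-name>     # Your EKS cluster name
export AWS_REGION=us-east-1                 # AWS region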
Step 1: Create a GPU node pool
We need to create a GPU node pool so that Karpenter can launch GPU instances on demand. Create a file named ${ENV}-${SERVER_NAME}-ai-nodepool.yaml with the following content:
cat <<EOF> ${ENV}-${SERVER_NAME}-ai-nodepool.yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: ${ENV}-${SERVER_NAME}-ai-gpu
spec:
  template:
    metadata:
      labels: # Replace with the labels you want applied to the node in Kubernetes
        node-type: gpu
        app: ${ENV}-${SERVER_NAME}-ai
    spec:
      taints: # Replace with the taints you want added to the node; delete if not needed
        - key: nvidia.com/${ENV}-${SERVER_NAME}-ai-gpu
          effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"] # Allow and prefer Spot instances; remove "spot" if you do not want Spot
        - key: node.kubernetes.io/instance-type # There are few GPU instance types, so they can be pinned; alternatively use karpenter.k8s.aws/instance-family to let Karpenter pick the instance type
          operator: In
          values:
            - g5.xlarge
            - g5.2xlarge
      nodeClassRef:
        name: ${ENV}-${SERVER_NAME}-ai-gpu # Must match the EC2NodeClass below
  limits:
    nvidia.com/gpu: 100 # Caps the maximum number of instances, counted in GPU cards
  disruption:
    consolidationPolicy: WhenEmpty # Only reclaim a node once it is empty
    consolidateAfter: 30s # Reclaim the node 30 seconds after Pods scale in
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: ${ENV}-${SERVER_NAME}-ai-gpu
spec:
  amiFamily: Bottlerocket # Use the Bottlerocket OS
  role: "KarpenterNodeRole-${CLUSTER_NAME}"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
        kubernetes.io/role/internal-elb: "1" # Select private subnets
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: "${CLUSTER_NAME}"
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 10Gi
        volumeType: gp3
        deleteOnTermination: true
        iops: 3000
        throughput: 125
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 80Gi
        volumeType: gp3
        deleteOnTermination: true
        iops: 3000
        throughput: 125
        #snapshotID: snap-0ec10130db1e9d9fb
EOF
This configuration defines a GPU node pool that uses the g5.xlarge and g5.2xlarge instance types. It also configures labels, taints, and volumes; adjust these to your needs.
Apply this configuration:
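kubectl apply -f ${ENV}-${SERVER_NAME}-ai-nodepool.yaml
A single kubectl apply creates both the NodePool and the EC2NodeClass.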
Step 2: Deploy the AI application
Create an IAM role for the containers to use:
cat >iam-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "*"
    },
    {
      "Sid": "VisualEditor1",
      "Effect": "Allow",
      "Action": [
        "sqs:DeleteMessage",
        "sqs:ReceiveMessage",
        "sqs:SendMessage"
      ],
      "Resource": [
        "arn:aws:sqs:us-east-1:930700710668:test-light-ai-input",
        "arn:aws:sqs:us-east-1:930700710668:test-light-ai-ouput"
      ]
    }
  ]
}
EOF
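The eksctl command below needs this policy's ARN, so create the policy first (the policy name here is only an example):
aws iam create-policy --policy-name ${ENV}-${SERVER_NAME}-ai-app-policy --policy-document file://iam-policy.json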
export AWS_REGION=us-east-1
eksctl create iamserviceaccount --cluster=<clusterName> --name=<serviceAccountName> --namespace=<serviceAccountNamespace> --attach-policy-arn=<policyARN> --override-existing-serviceaccounts --approve
Next, we will deploy an AI application that uses the GPU. Create a file named ${ENV}-${SERVER_NAME}-ai.yaml with the following content:
cat <<EOF> ${ENV}-${SERVER_NAME}-ai.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: ${ENV}-${SERVER_NAME}-ai
  name: ${ENV}-${SERVER_NAME}-ai
  namespace: default
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: ${ENV}-${SERVER_NAME}-ai
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 0
    type: RollingUpdate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: ${ENV}-${SERVER_NAME}-ai
    spec:
      containers:
        - image: 930700710668.dkr.ecr.us-east-1.amazonaws.com/ai/light:test # Replace with your image URI
          imagePullPolicy: IfNotPresent
          name: ${ENV}-${SERVER_NAME}-ai
          command: ["python3", "run_app.py"] # Replace with your service's start command
          resources:
            limits:
              nvidia.com/gpu: "1"
            requests:
              cpu: "4"
              memory: 16Gi
              nvidia.com/gpu: "1"
      serviceAccount: imageaioneks # Replace with your service account
      nodeSelector:
        app: ${ENV}-${SERVER_NAME}-ai
      tolerations:
        - effect: NoSchedule
          key: nvidia.com/${ENV}-${SERVER_NAME}-ai-gpu
          operator: Exists
EOF
This configuration defines a Deployment that uses the AI application image we previously pushed to Amazon ECR. It requests 1 GPU, 4 CPUs, and 16 GiB of memory. The nodeSelector and tolerations ensure the Pods run only on the GPU nodes.
Apply this configuration:
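kubectl apply -f ${ENV}-${SERVER_NAME}-ai.yaml
Once the Pods are pending, Karpenter should notice the unschedulable GPU requests and launch a matching g5 instance.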
Step 3: Configure KEDA scaling
Create an SQS scaling role so that messages in SQS can trigger Pod scaling. Save the following policy document as iam-policy.json:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "sqs:DeleteMessage",
        "sqs:ReceiveMessage",
        "sqs:SendMessage",
        "sqs:GetQueueAttributes"
      ],
      "Resource": [
        "arn:aws:sqs:us-east-1:930700710668:test-light-ai-input",
        "arn:aws:sqs:us-east-1:930700710668:test-light-ai-ouput"
      ]
    }
  ]
}
aws iam create-policy --policy-name keda-sqs-policy --policy-document file://iam-policy.json
export AWS_REGION=us-east-1
eksctl create iamserviceaccount --cluster=pro-image-ai --name=keda-sqs-role --namespace=default --attach-policy-arn=arn:aws:iam::930700710668:policy/keda-sqs-policy --approve
Finally, we will configure KEDA so that the AI application scales automatically based on the number of messages in the SQS queue. Create a file named ${ENV}-${SERVER_NAME}-ai-scaled-object-sqs.yml with the following content:
cat <<EOF> ${ENV}-${SERVER_NAME}-ai-scaled-object-sqs.yml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ${ENV}-${SERVER_NAME}-ai-scaled-object-deployment
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${ENV}-${SERVER_NAME}-ai
  pollingInterval: 1
  cooldownPeriod: 60
  minReplicaCount: 2
  maxReplicaCount: 20
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: keda-sqs-role # Replace with your role
      metadata:
        awsRegion: us-east-1
        identityOwner: operator
        queueLength: "1"
        queueURL: https://sqs.us-east-1.amazonaws.com/930700710668/test-light-ai-input # Replace with your queue URL
EOF
This configuration defines a ScaledObject that watches an SQS queue and scales the AI application's replica count based on the number of messages in the queue. Adjust minReplicaCount, maxReplicaCount, and queueURL to your needs.
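Note that authenticationRef must point at a KEDA TriggerAuthentication object rather than at a service account directly, and with identityOwner: operator the queue is polled using the KEDA operator's own IAM identity. The original post does not show this object; below is a minimal sketch assuming IRSA (KEDA's aws-eks pod identity provider) and the names used above:
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: keda-sqs-role
  namespace: default
spec:
  podIdentity:
    provider: aws-eks # Assumption: credentials come from an IRSA-annotated service account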
Apply this configuration:
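kubectl apply -f ${ENV}-${SERVER_NAME}-ai-scaled-object-sqs.yml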
That's it! Your AI application will now scale automatically based on the number of messages in the SQS queue, and Karpenter will launch or terminate GPU instances as demand changes.
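To verify the setup, you can send a few test messages to the input queue and watch new Pods (and, once existing capacity is used up, new GPU nodes) appear; the queue URL is the one from the ScaledObject above:
aws sqs send-message --queue-url https://sqs.us-east-1.amazonaws.com/930700710668/test-light-ai-input --message-body "test"
kubectl get pods -w   # Watch KEDA scale the Deployment out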