In past AI projects I used GPUs all the time, mostly with traditional bare-metal deployments that left the cards underutilized, and GPUs are not cheap. Calling GPUs from inside a K8S cluster allows much finer-grained allocation of GPU compute and improves utilization.
Some earlier notes related to GPUs:
K8S notes (enabling GPU support)
Ubuntu 16.04: compiling and installing FFmpeg with GPU acceleration
Ubuntu Server LTS: building a GPU-enabled FFmpeg from source
Building a GPU-enabled FFmpeg Docker image for hardware-accelerated video stream processing
This post pulls those together into a single summary.
1. Enable GPU support in Docker
# Edit the Docker daemon config; this assumes nvidia-container-runtime is already installed on the host (it ships with the nvidia-docker2 / nvidia-container-toolkit packages)
cat /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "data-root": "/var/lib/docker",
  "exec-opts": ["native.cgroupdriver=systemd"],
  "registry-mirrors": [
    "https://docker.mirrors.ustc.edu.cn",
    "http://hub-mirror.c.163.com"
  ],
  "insecure-registries": ["127.0.0.1/8"],
  "max-concurrent-downloads": 10,
  "live-restore": true,
  "log-driver": "json-file",
  "log-level": "warn",
  "log-opts": {
    "max-size": "50m",
    "max-file": "1"
  },
  "storage-driver": "overlay2"
}
# Restart the Docker service
systemctl daemon-reload
systemctl restart docker
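With "default-runtime": "nvidia" set, containers get the NVIDIA runtime without any extra flags. A quick sanity check (a minimal sketch; the CUDA image tag is just an example, pick one that matches your host driver):

# Confirm Docker picked up the nvidia runtime as the default
docker info | grep -i 'default runtime'
# nvidia-smi run inside a container should list the host GPUs
docker run --rm nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi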
2. Label the GPU nodes in K8S
The nvidia-device-plugin chart uses node affinity that matches this label, so label every node that actually has a GPU:
kubectl label nodes 192.168.10.156 nvidia.com/gpu.present=true
# Verify the label
kubectl get nodes -L nvidia.com/gpu.present
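The -L flag adds the label as a column, so the output should look roughly like this (node age and version here are illustrative, not from the original):

NAME             STATUS   ROLES    AGE   VERSION   GPU.PRESENT
192.168.10.156   Ready    <none>   30d   v1.22.4   true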
3. Install the GPU device plugin with Helm
# Install Helm itself (Debian/Ubuntu)
curl https://baltocdn.com/helm/signing.asc | sudo apt-key add -
sudo apt-get install apt-transport-https --yes
echo "deb https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update && sudo apt-get install helm
# Add the official NVIDIA device plugin chart repository, then install the plugin
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install \
  --version=0.10.0 \
  --generate-name \
  nvdp/nvidia-device-plugin
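If you prefer a fixed release name and namespace over --generate-name, something like the following also works (the release name and the kube-system namespace are my own choices, not part of the original setup):

helm upgrade -i nvidia-device-plugin nvdp/nvidia-device-plugin \
  --version=0.10.0 \
  --namespace kube-system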
This step is essential: without the device plugin, K8S never sees an nvidia.com/gpu resource and cannot schedule GPU workloads.
# Verify that the plugin loaded and registered the GPU resource
kubectl describe node 192.168.10.156 | grep nvidia
nvidia.com/gpu.present=true
nvidia.com/gpu: 1
nvidia.com/gpu: 1
kube-system nvidia-device-plugin-1937727239-fcc2x 0 (0%) 0 (0%) 0 (0%) 0 (0%) 30s
nvidia.com/gpu 0 0
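Another quick check is to read the allocatable GPU count straight off the node object (a sketch; note that the dots in the resource name must be escaped in the jsonpath expression):

# Should print the number of schedulable GPUs on the node, e.g. 1
kubectl get node 192.168.10.156 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'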
4. Create a test Pod
# Get the image onto the GPU node; the save/load pair is for transferring it to nodes without internet access
docker pull registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu
docker save -o tensorflow-gpu.tar registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu
docker load -i tensorflow-gpu.tar
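If the GPU node cannot pull directly, copy the saved tarball over and load it there (the login user and /tmp path are placeholders):

# Transfer the image archive to the GPU node and load it
scp tensorflow-gpu.tar root@192.168.10.156:/tmp/
ssh root@192.168.10.156 'docker load -i /tmp/tensorflow-gpu.tar'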
# The Pod manifest
cat gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu
  labels:
    test-gpu: "true"
spec:
  containers:
  - name: training
    image: registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu
    command:
    - python
    - tensorflow-sample-code/tfjob/docker/mnist/main.py
    - --max_steps=300
    - --data_dir=tensorflow-sample-code/data
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - effect: NoSchedule
    operator: Exists
kubectl apply -f gpu-test.yaml
# Watch the logs; if the MNIST training runs without CUDA errors, GPU scheduling works
kubectl logs test-gpu
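Beyond the training logs, you can also confirm the container actually sees the card (assuming the image ships nvidia-smi, which CUDA devel images normally do):

# List the GPUs visible inside the running Pod
kubectl exec test-gpu -- nvidia-smi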