Using GPUs in a Kubernetes cluster and installing kubeflow 1.0.RC: step-by-step
Install the GPU driver
Install CUDA
sudo yum-config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
sudo yum clean all
sudo yum -y install nvidia-driver-latest-dkms cuda
sudo yum -y install cuda-drivers
If the gcc dependencies are missing, run the following command
yum install kernel-devel kernel-doc kernel-headers gcc\* glibc\* glibc-\*
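A DKMS driver build generally needs a kernel-devel package whose version matches the running kernel, and the wildcard install above does not guarantee that. The following is a minimal sketch of such a check, not part of the original steps; the function name `has_matching_kernel_devel` is hypothetical, and it assumes the default `rpm -q` output format (name-version-release.arch).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original article).
# Succeeds only if a kernel-devel package matching the given kernel release
# appears in the package list read from stdin.
#   $1    = kernel release, e.g. the output of `uname -r`
#   stdin = one package per line, e.g. the output of `rpm -q kernel-devel`
has_matching_kernel_devel() {
    local release="$1"
    # -F: fixed string, -x: match the whole line, -q: no output
    grep -Fxq "kernel-devel-${release}"
}

# Typical use on the node itself (shown as a comment, not executed here):
#   rpm -q kernel-devel | has_matching_kernel_devel "$(uname -r)" \
#       || sudo yum install -y "kernel-devel-$(uname -r)" "kernel-headers-$(uname -r)"
```

If the check fails because the running kernel is older than the newest kernel-devel in the repository, updating the kernel and rebooting before the driver install is usually the simpler route.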
Install the NVIDIA driver
rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
rpm -Uvh http://www.elrepo.org/elrepo-release-7.0-3.el7.elrepo.noarch.rpm
yum install -y kmod-nvidia
Disable nouveau
### Add the rdblacklist=nouveau entry to GRUB_CMDLINE_LINUX
echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist.conf
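The GRUB step is only described in a comment, so here is one possible way to script it. This is a sketch under assumptions: the function name `add_grub_cmdline_arg` is made up for illustration, the file is assumed to be /etc/default/grub with a double-quoted GRUB_CMDLINE_LINUX value, and the argument is assumed to contain no regex metacharacters or slashes.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original article).
# Appends a kernel argument to GRUB_CMDLINE_LINUX unless it is already there.
#   $1 = path to the grub defaults file (normally /etc/default/grub)
#   $2 = argument to add, e.g. rdblacklist=nouveau
add_grub_cmdline_arg() {
    local file="$1" arg="$2"
    # Nothing to do if the argument is already on the GRUB_CMDLINE_LINUX line.
    if grep -Eq "^GRUB_CMDLINE_LINUX=\".*${arg}" "$file"; then
        return 0
    fi
    # Insert the argument just before the closing quote of the value,
    # writing to a temporary file first and then replacing the original.
    sed "s/^\(GRUB_CMDLINE_LINUX=\".*\)\"/\1 ${arg}\"/" "$file" > "${file}.tmp" \
        && mv "${file}.tmp" "$file"
}

# Typical use as root (shown as comments, not executed here):
#   add_grub_cmdline_arg /etc/default/grub rdblacklist=nouveau
#   grub2-mkconfig -o /boot/grub2/grub.cfg   # BIOS boot; UEFI systems use a different path
#   dracut --force                           # rebuild the initramfs for the running kernel
```

The function is idempotent, so running it twice leaves a single copy of the argument in the file.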
Reboot, then check whether nouveau was disabled successfully
lsmod|grep nouv
No output means nouveau has been disabled
View the server's GPU information
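For use in a provisioning script, the same check can be wrapped so that it returns an exit status instead of relying on reading the output by eye. A minimal sketch, with a hypothetical function name `nouveau_status`; it matches the module name exactly rather than the substring "nouv".

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original article).
# Reads `lsmod` output on stdin; the first column is the module name.
# Returns 0 if nouveau is absent, 1 if it is still loaded.
nouveau_status() {
    if awk '$1 == "nouveau" { found = 1 } END { exit !found }'; then
        echo "nouveau is still loaded"
        return 1
    fi
    echo "nouveau is disabled"
}

# Typical use on the node itself (shown as a comment, not executed here):
#   lsmod | nouveau_status
```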
[root@master ~]# nvidia-smi
Tue Jan 14 03:46:41 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:18:00.0 Off |                    0 |
| N/A   29C    P8    10W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            Off  | 00000000:86:00.0 Off |                    0 |
| N/A   25C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|============================================================================
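The table above is meant for humans. For a scripted check, for example to confirm that every node reports the expected number of GPUs, nvidia-smi also has a CSV query mode. The sketch below is an illustration only; the function name `count_gpus` is hypothetical, and it simply counts the non-empty lines of the query output.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original article).
# Counts GPUs from the CSV query output of nvidia-smi read on stdin,
# one GPU per non-empty line, e.g. "0, Tesla T4, 15109 MiB".
count_gpus() {
    awk 'NF { n++ } END { print n + 0 }'
}

# Typical use on a GPU node (shown as a comment, not executed here):
#   nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader | count_gpus
```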

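This article describes in detail how to install and configure GPUs in a Kubernetes cluster, including the setup of key components such as the driver, CUDA, and NVIDIA-DOCKER. It also covers installing and configuring KubeFlow 1.0.RC: creating the various services, controllers, storage classes, and PV/PVC objects, and using NFS for file storage.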