Overall goals
- Everyone's environment is isolated from everyone else's
- Each person gets 10 ports, the first of which is the SSH port (LAN access); see the sketch after this list
- GPUs must be usable for deep learning experiments
- Different CUDA versions should be usable (not implemented yet; will write it up once it is)
- SSH access (currently works on the LAN)
- SSH access from outside the LAN (frpc tunneling works, but everyone's config is different, so there is no unified setup yet)
- Ideally a graphical management interface (couldn't find one on GitHub; may write one myself when I get time), something like this
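A minimal sketch of how one user's container might be created under this port scheme. The image tag, container name, and volume path are placeholders, and the --privileged flag plus /sbin/init relate to the sshd setup described further down:
# One container per user; host ports 10100-10109 belong to this user,
# and 10100 maps to the container's SSH port 22.
sudo docker run -d --gpus all --privileged \
  --name lab-user1 \
  -v /data/lab-user1:/workspace \
  -p 10100:22 \
  -p 10101-10109:10101-10109 \
  nvidia/cuda:10.1-cudnn7-devel-ubuntu16.04 \
  /sbin/init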
Exploration notes
After installing nvidia-docker per its documentation, run
sudo docker run --rm --gpus all nvidia/cuda nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"process_linux.go:432: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: requirement error: unsatisfied condition: cuda>=11.0, please update your driver to a newer version, or use an earlier cuda container\\\\n\\\"\"": unknown.
This only works if the host has CUDA 11 (i.e. a new enough driver) installed; so far I haven't found a way to keep CUDA off the host entirely and rely only on the CUDA installed inside Docker.
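The error message itself points at a workaround: either upgrade the host driver or pin an older CUDA image that the current driver does satisfy. The tag below is just an example; pick one matching your driver:
sudo docker run --rm --gpus all nvidia/cuda:10.1-base nvidia-smi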
root@e498175d1a92:/# systemctl start sshd
Failed to connect to bus: No such file or directory
Fixing this requires running /sbin/init as the container's default command, plus the --privileged flag.
Reference: https://blog.youkuaiyun.com/cheney__chen/article/details/81639203
Even after changing the password, SSH still reported a wrong password; the SSH config has to be changed to allow root login.
Reference: https://blog.youkuaiyun.com/zilaike/article/details/78922524
vim /etc/ssh/sshd_config
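The settings that matter inside sshd_config are roughly these (assuming root login with a password):
PermitRootLogin yes
PasswordAuthentication yes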
service sshd restart
Gave a labmate the IP but they couldn't connect; it turned out I had given them the wrong IP.
Switching to mirrors
Docker: Aliyun mirror
Go to https://cr.console.aliyun.com/cn-hangzhou/instances/mirrors
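That console page hands out a personal accelerator address; on the host it goes into /etc/docker/daemon.json, roughly like this (the mirror URL below is a placeholder for your own):
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "registry-mirrors": ["https://your-id.mirror.aliyuncs.com"]
}
EOF
sudo systemctl daemon-reload
sudo systemctl restart docker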
apt: Aliyun mirror
Reference: https://blog.youkuaiyun.com/hang916/article/details/79465458
deb http://mirrors.aliyun.com/ubuntu/ xenial main
deb-src http://mirrors.aliyun.com/ubuntu/ xenial main
deb http://mirrors.aliyun.com/ubuntu/ xenial-updates main
deb-src http://mirrors.aliyun.com/ubuntu/ xenial-updates main
deb http://mirrors.aliyun.com/ubuntu/ xenial universe
deb-src http://mirrors.aliyun.com/ubuntu/ xenial universe
deb http://mirrors.aliyun.com/ubuntu/ xenial-updates universe
deb-src http://mirrors.aliyun.com/ubuntu/ xenial-updates universe
deb http://mirrors.aliyun.com/ubuntu/ xenial-security main
deb-src http://mirrors.aliyun.com/ubuntu/ xenial-security main
deb http://mirrors.aliyun.com/ubuntu/ xenial-security universe
deb-src http://mirrors.aliyun.com/ubuntu/ xenial-security universe
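These lines replace /etc/apt/sources.list inside the container (they are for Ubuntu 16.04 / xenial images; other releases need their own codename). Roughly:
cp /etc/apt/sources.list /etc/apt/sources.list.bak   # keep a backup
vim /etc/apt/sources.list                            # paste the mirror lines above
apt update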
pip: Douban mirror
Reference: https://blog.youkuaiyun.com/qq_32768743/article/details/78916808
[global]
timeout = 6000
index-url = http://pypi.douban.com/simple/
[install]
use-mirrors = true
mirrors = http://pypi.douban.com/simple/
trusted-host = pypi.douban.com
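This is the content of pip's config file, usually ~/.pip/pip.conf on Linux:
mkdir -p ~/.pip
vim ~/.pip/pip.conf    # paste the [global]/[install] sections above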
Testing GPU availability
Check CUDA and cuDNN versions
Reference: https://www.jianshu.com/p/9c0dee9bb2b7
nvcc -V
cat /usr/local/cuda/version.txt
cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
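With cuDNN 8 and later the version macros moved out of cudnn.h, so if the command above prints nothing, try:
cat /usr/include/cudnn_version.h | grep CUDNN_MAJOR -A 2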
TensorFlow
Reference: https://blog.youkuaiyun.com/lidichengfo0412/article/details/102637824
import tensorflow as tf
tf.test.is_gpu_available()
PyTorch
Reference: https://blog.youkuaiyun.com/weixin_35576881/article/details/89709116
import torch
torch.cuda.is_available()
Some errors encountered
- CUDA, cuDNN, and framework versions have to match
https://github.com/tensorflow/tensorflow/issues/4349 - mapping between TensorFlow releases and the CUDA/cuDNN versions they require
https://tensorflow.google.cn/install/source
Common Docker operations
Remove all containers
sudo docker rm `sudo docker ps -a -q`
Stop all containers, then remove them all
sudo docker stop `sudo docker ps -q` && sudo docker rm `sudo docker ps -a -q`
Name a container
--name xxx
Mount a directory
-v <host dir>:<dir inside the container>
Map ports (single port or a range)
-p 10100:22
-p 10101-10109:10101-10109
Other issues during day-to-day use
- This warning shows up:
perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
LANGUAGE = (unset),
LC_ALL = (unset),
LC_TERMINAL_VERSION = "3.3.6",
LC_TERMINAL = "iTerm2",
LANG = "zh_CN.UTF-8"
are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C")
Fix: add the following to .bashrc
export LC_ALL=C
- apt update hangs at 0% [Working]
Try apt clean first
In my case the real fix was removing some of the NVIDIA sources, as sketched below
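A rough sketch of what that looks like; the exact file names vary by image, so list the directory first (the names below are examples):
ls /etc/apt/sources.list.d/
mv /etc/apt/sources.list.d/cuda.list /etc/apt/sources.list.d/cuda.list.bak
mv /etc/apt/sources.list.d/nvidia-ml.list /etc/apt/sources.list.d/nvidia-ml.list.bak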
Resources
- [Docker intro] Learn Docker in 10 minutes: https://www.bilibili.com/video/BV1R4411F7t9
- Nvidia-docker: setting up a deep-learning GPU server (cuda+cudnn+anaconda+python): https://www.bilibili.com/video/BV1bk4y1B7T5
- Setting up a shared GPU server for the lab: https://abcdabcd987.com/setup-shared-gpu-server-for-labs/
- Building a shared GPU server with Docker: https://gitchat.youkuaiyun.com/columnTopic/5a13c07375462408e0da8e72