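Setting up an nnUNet training environment with Docker
Before following these steps, make sure the host runs Linux and has NVIDIA Docker configured.
Starting Docker
With --rm, the container is destroyed every time it exits.
The project directory on the D drive is mapped to /workspace inside the container.
# with --rm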
sudo docker run --gpus all -it --ipc=host --rm -v /media/sky/D/ruifeng/CoTr-main/:/workspace nvcr.io/nvidia/pytorch:20.11-py3
# without --rm (this is the recommended command)
sudo docker run --gpus all -it --ipc=host -v /media/sky/F/ruifeng/CoTr-main/:/workspace nvcr.io/nvidia/pytorch:20.11-py3
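The two commands above differ only in the --rm flag and the host path. The following is a minimal sketch that assembles the command in one place. The helper name build_run_cmd is hypothetical and not part of the original setup. It only prints the command, so nothing is started until you copy and run the output. It assumes the host path contains no spaces.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original notes): print the docker run
# command for a given host directory. Pass "rm" as the second argument to add
# --rm so that the container is destroyed when it exits.
build_run_cmd() {
    local host_dir="$1" mode="${2:-keep}"
    local image="nvcr.io/nvidia/pytorch:20.11-py3"
    local rm_flag=""
    if [ "$mode" = "rm" ]; then
        rm_flag="--rm "
    fi
    printf 'sudo docker run --gpus all -it --ipc=host %s-v %s:/workspace %s\n' \
        "$rm_flag" "$host_dir" "$image"
}

# Print the recommended (persistent) variant for the path used above.
build_run_cmd /media/sky/F/ruifeng/CoTr-main/
```

Calling build_run_cmd with rm as the second argument prints the --rm variant instead.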
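After the container starts, all commands below run inside the container until you leave it with exit.
Set the container's time zone so that the dates on saved files are easy to read. Choose the time zone when prompted.
# inside the container; the default user is root
apt install ntp
# choose 6 (Asia), then 70 (Shanghai)
Configuring the pip index inside the container
# Huawei Cloud mirror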
pip config set global.index-url https://repo.huaweicloud.com/repository/pypi/simple
Configuring paths (these are paths inside the container)
Write the following settings into ~/.bashrc:
vim ~/.bashrc
export nnUNet_raw_data_base="/workspace/nnUNet_raw"
export nnUNet_preprocessed="/workspace/nnUNet_preprocessed"
export RESULTS_FOLDER="/workspace/nnUNet_trained_models"
Save and quit with :wq.
Then load the settings:
source ~/.bashrc
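If you prefer not to edit the file in vim, the same three lines can be appended from the shell. This is a minimal sketch: the helper name add_export is hypothetical, and the demo writes to a temporary file so that it is safe to try anywhere.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append one export line to an rc file only if that exact
# line is not already present, so repeated runs do not create duplicates.
add_export() {
    local rc_file="$1" name="$2" value="$3"
    local line="export ${name}=\"${value}\""
    touch "$rc_file"
    grep -qxF -- "$line" "$rc_file" || echo "$line" >> "$rc_file"
}

# Demo on a temporary file so that nothing real is modified.
demo_rc="$(mktemp)"
add_export "$demo_rc" nnUNet_raw_data_base "/workspace/nnUNet_raw"
add_export "$demo_rc" nnUNet_preprocessed  "/workspace/nnUNet_preprocessed"
add_export "$demo_rc" RESULTS_FOLDER       "/workspace/nnUNet_trained_models"
add_export "$demo_rc" RESULTS_FOLDER       "/workspace/nnUNet_trained_models"  # no-op
cat "$demo_rc"
rm -f "$demo_rc"
```

Inside the container, pass "$HOME/.bashrc" instead of the temporary file, then reload the file with source as shown above.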
Installation
Use cd … to switch directories as needed.
cd nnUNet
pip install -e .
cd CoTr_package
pip install -e .
Training
# batch size 2: 8 GB of GPU memory
python run_training.py -gpu='0' -outpath='CoTr' -p nnUNetPlansv2.1_ps48_192_192_bs2
# batch size 4: 15 GB of GPU memory
python run_training.py -gpu='0' -outpath='CoTr' -p nnUNetPlansv2.1_ps48_192_192_bs4
# batch size 6: about 20 GB of GPU memory
python run_training.py -gpu='0' -outpath='CoTr' -p nnUNetPlansv2.1_ps48_192_192_bs6
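The choice between the three plans depends only on how much GPU memory is available. The following is a rough sketch of that choice. The helper name pick_plan is hypothetical, and the thresholds are simply the approximate figures quoted above converted to MiB, so adjust them for your card. The nvidia-smi query is skipped when the tool is not installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map GPU memory (in MiB) to one of the three plans
# listed above. Thresholds are the rough 8/15/20 GB figures from the notes.
pick_plan() {
    local mem_mib="$1"
    if   [ "$mem_mib" -ge 20480 ]; then echo "nnUNetPlansv2.1_ps48_192_192_bs6"
    elif [ "$mem_mib" -ge 15360 ]; then echo "nnUNetPlansv2.1_ps48_192_192_bs4"
    elif [ "$mem_mib" -ge 8192  ]; then echo "nnUNetPlansv2.1_ps48_192_192_bs2"
    else
        echo "less than 8 GB of GPU memory: no listed plan fits" >&2
        return 1
    fi
}

# Query GPU 0 only when nvidia-smi is present, then print the training command.
if command -v nvidia-smi >/dev/null 2>&1; then
    mem="$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits -i 0)"
    if plan="$(pick_plan "$mem")"; then
        echo "python run_training.py -gpu='0' -outpath='CoTr' -p $plan"
    fi
fi
```

The sketch uses total memory. If other jobs share the GPU, querying memory.free instead is probably the safer choice.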
Stopping training
Stop the Docker session directly: typing exit leaves the container.
exit
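Starting the container again
First run docker ps -a (or docker ps -l for the most recently created container)
and find the ID of the stopped container: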
(base) sky@sky:/media/sky/D/ruifeng/nnUNet-master$ sudo docker ps -l
sudo: unable to resolve host sky
[sudo] password for sky:
CONTAINER ID   IMAGE                              COMMAND                  CREATED       STATUS                         PORTS   NAMES
46e8cae289d6   nvcr.io/nvidia/pytorch:20.11-py3   "/usr/local/bin/nvid…"   2 hours ago   Exited (0) About an hour ago           strange_euclid
Then run
sudo docker start -ia 46e
# docker start -ia <container_id>
(base) sky@sky:/media/sky/D/ruifeng/nnUNet-master$ sudo docker start -ia 46e
sudo: unable to resolve host sky
=============
== PyTorch ==
=============
NVIDIA Release 20.11 (build 17345815)
PyTorch Version 1.8.0a0+17f8c32
Container image Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
Copyright (c) 2014-2020 Facebook Inc.
Copyright (c) 2011-2014 Idiap Research Institute (Ronan Collobert)
Copyright (c) 2012-2014 Deepmind Technologies (Koray Kavukcuoglu)
Copyright (c) 2011-2012 NEC Laboratories America (Koray Kavukcuoglu)
Copyright (c) 2011-2013 NYU (Clement Farabet)
Copyright (c) 2006-2010 NEC Laboratories America (Ronan Collobert, Leon Bottou, Iain Melvin, Jason Weston)
Copyright (c) 2006 Idiap Research Institute (Samy Bengio)
Copyright (c) 2001-2004 Idiap Research Institute (Ronan Collobert, Samy Bengio, Johnny Mariethoz)
Copyright (c) 2015 Google Inc.
Copyright (c) 2015 Yangqing Jia
Copyright (c) 2013-2016 The Caffe contributors
All rights reserved.
NVIDIA Deep Learning Profiler (dlprof) Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.
root@46e8cae289d6:/workspace# ll
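Looking up the container ID by eye can be replaced with a small filter. This is a sketch under assumptions: the helper name first_id_for_image is hypothetical, and it relies on docker ps listing the most recently created container first. It only prints the start command. Prefix docker with sudo if your setup requires it, as in the session above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "ID IMAGE" lines on stdin and print the first ID
# whose image name matches the argument exactly.
first_id_for_image() {
    awk -v img="$1" '$2 == img { print $1; exit }'
}

IMAGE="nvcr.io/nvidia/pytorch:20.11-py3"
if command -v docker >/dev/null 2>&1; then
    # List stopped containers as "ID IMAGE" and keep the newest match.
    cid="$(docker ps -a --filter status=exited \
            --format '{{.ID}} {{.Image}}' 2>/dev/null | first_id_for_image "$IMAGE")"
    if [ -n "$cid" ]; then
        echo "docker start -ia $cid"
    fi
fi
```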
Enjoy it!
The rest of these notes is optional -----------------------------------
To stop training by killing the process: 25710 is the PID, which you can find in the nvidia-smi output.
kill -s 9 25710
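Signal 9 (SIGKILL) gives the process no chance to clean up. A gentler variant, sketched below, sends SIGTERM first and falls back to SIGKILL only after a timeout. The helper name stop_pid and the 10-second default are assumptions for illustration. The nvidia-smi query at the end lists the PIDs of processes currently using the GPU and is skipped if the tool is missing.

```shell
#!/usr/bin/env bash
# Hypothetical helper: send SIGTERM first so the process can clean up, and
# fall back to SIGKILL (the kill -s 9 shown above) only if the process is
# still alive after the timeout (in seconds).
stop_pid() {
    local pid="$1" timeout="${2:-10}" waited=0
    # If the signal cannot be delivered, the process is gone or not ours.
    kill -s TERM "$pid" 2>/dev/null || return 0
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            kill -s KILL "$pid" 2>/dev/null
            break
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# List the PIDs of processes currently using the GPU, if nvidia-smi is present.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader
fi
```

For the example above, the call would be stop_pid 25710.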
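Saving changes made to a container
After you change a container (by running a command inside it), you can save the changes so that next time the container starts from the saved state. Docker calls this process committing: it stores the difference between the old and new state and produces a new version.
The container must be stopped first.
First use docker ps -l to get the ID of the container in which the ping command was installed, then save it as the image learn/ping.
docker ps -l
Tips:
- Run docker commit without arguments to see its parameter list.
- You need to specify the ID of the container to commit. (Translator's note: obtain it with docker ps -l.)
- You do not need to copy the full ID; the first three or four characters are usually enough to identify it. (Translator's note: this is very similar to version hashes in git.)
The correct command:
docker commit 698 learn/ping
After docker commit finishes, it returns the ID of the new image version.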