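When training a model with python train.py params, the training program occasionally crashes. After a crash, training has to be restarted with the resume ckpt argument so that it continues from the last checkpoint. The script below monitors the training process automatically: if it detects that training has crashed, it loads the latest ckpt and resumes training.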
#!/bin/bash
# Directory that holds the checkpoints, relative to the current directory
ckpt_dir="$(pwd)/checkpoints"
# Extra arguments for train.py: empty on the first run, --resume after a crash
resume_args=()
# Stop the training process as well if this script is interrupted
trap 'kill "$pid" 2>/dev/null; exit 130' INT TERM
# Loop until training finishes normally
while true
do
    # Run python train.py in the background and get its process id
    python train.py "${resume_args[@]}" &
    pid=$!
    # Wait for the process to finish; wait returns its exit code
    if wait "$pid"
    then
        # If the exit code is zero, the process finished normally
        echo "Process $pid finished normally"
        break
    fi
    # If the exit code is not zero, the process crashed
    echo "Process $pid crashed"
    # Wait for 10 seconds before restarting
    sleep 10
    # Find the most recently modified ckpt file, if there is one
    latest_ckpt=$(ls -t "$ckpt_dir"/epoch_*.pth 2>/dev/null | head -n 1)
    if [ -n "$latest_ckpt" ] && [ -f "$latest_ckpt" ]
    then
        # If the file exists, the next iteration resumes training with it
        echo "Resuming training with $latest_ckpt"
        resume_args=(--resume "$latest_ckpt")
    else
        # If no file exists, the next iteration starts training from scratch
        echo "No checkpoint file found, starting training from scratch"
        resume_args=()
    fi
done
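The script picks the newest checkpoint by modification time (ls -t). If checkpoint files are ever copied or touched, that order may no longer match the training order. As a hedged alternative, the sketch below picks the file with the highest epoch number instead. The helper name latest_ckpt_by_epoch is hypothetical, and it assumes the files are named epoch_<N>.pth with an integer N.

```shell
# Hypothetical helper: print the checkpoint with the highest epoch number.
# Assumes files are named epoch_<N>.pth inside the directory given as $1.
# Prints nothing if no matching file exists.
latest_ckpt_by_epoch() {
    local dir="$1"
    for f in "$dir"/epoch_*.pth; do
        # Skip the unexpanded pattern when nothing matches
        [ -f "$f" ] || continue
        name=${f##*/}         # drop the directory part
        epoch=${name#epoch_}  # drop the "epoch_" prefix
        epoch=${epoch%.pth}   # drop the ".pth" suffix
        # Ignore names whose middle part is not a plain integer
        case "$epoch" in
            ''|*[!0-9]*) continue ;;
        esac
        printf '%s %s\n' "$epoch" "$f"
    done | sort -n -k1,1 | tail -n 1 | cut -d' ' -f2-
}
```

In the script, this could stand in for the ls -t line, for example latest_ckpt=$(latest_ckpt_by_epoch "$(pwd)/checkpoints"), followed by the same existence check.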
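To use the auto-restart script:
- Step 1: Save the script to a file, for example run.sh, and place it in the working directory.
- Step 2: Make the script executable with the chmod command, for example:
    chmod +x run.sh
- Step 3: Run the script directly with the ./ prefix, for example:
    ./run.sh
- Step 4: Wait for the script to finish, or press Ctrl+C to interrupt it. The script runs python train.py in the background and checks whether it crashed. If it did, the script finds the most recent ckpt file and continues training with the --resume argument.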
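Because the script restarts training without limit, a crash that recurs immediately (for example a bug hit at startup) would keep it looping forever. One possible safeguard is to cap the number of restarts. The following is only a sketch under stated assumptions: run_with_restarts and the RETRY_DELAY variable are hypothetical names, and the helper reruns the same command each time rather than adding --resume.

```shell
# Hypothetical helper: run a command and restart it after a failure,
# giving up once it has been restarted max_restarts times.
# RETRY_DELAY is an assumed knob: seconds to wait between attempts (default 10).
run_with_restarts() {
    local max_restarts="$1" restarts=0
    shift
    while true; do
        # Exit code zero means the command finished normally
        if "$@"; then
            return 0
        fi
        if [ "$restarts" -ge "$max_restarts" ]; then
            echo "Giving up after $restarts restarts" >&2
            return 1
        fi
        restarts=$((restarts + 1))
        echo "Command failed, restart $restarts of $max_restarts" >&2
        sleep "${RETRY_DELAY:-10}"
    done
}
```

For example, run_with_restarts 5 python train.py would allow at most five restarts. The same counter could be added to the loop in run.sh so that the checkpoint lookup still happens before each restart.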