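After I modified a model's Python script, training kept crashing at irregular intervals: sometimes it ran through several epochs, and sometimes it crashed before finishing even one. The error looked like this: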
Traceback (most recent call last):
File "tools/train.py", line 311, in <module>
raise e
File "tools/train.py", line 305, in <module>
args.level,
File "/usr/local/lib/python3.6/site-packages/at/engine/ddptrainer.py", line 234, in launch
args,
File "/root/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/root/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/root/.local/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 136, in join
signal_name=name
torch.multiprocessing.spawn.ProcessExitedException: proce

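This post examines a SIGSEGV error that shows up during PyTorch multi-process training, walks through how the problem was tracked down, and finds that the root cause was an out-of-bounds access on a NumPy array. It also discusses the unusual behavior of out-of-bounds access in a Python environment.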
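The offending code is not shown in this excerpt, so the following is only a minimal sketch of how one might guard against this class of bug, not the author's actual fix. Ordinary NumPy indexing raises `IndexError` on a bad index, but an index that reaches native code without a bounds check can read invalid memory, which may return garbage on some runs and segfault on others. The helper names `install_fault_handler` and `checked_take` are hypothetical; `faulthandler` is the standard-library module that prints a Python traceback when the process receives a fatal signal such as SIGSEGV.

```python
import faulthandler

import numpy as np


def install_fault_handler():
    """Hypothetical helper: call once at the start of each worker process.

    With the handler enabled, a fatal signal such as SIGSEGV makes Python
    print the traceback of every thread to stderr before the process dies.
    """
    faulthandler.enable(all_threads=True)


def checked_take(arr, indices):
    """Hypothetical helper: return arr[indices] after validating every index.

    Indices are checked against axis 0, so a bad value is reported as an
    IndexError here instead of being passed on to unchecked code.
    """
    arr = np.asarray(arr)
    idx = np.asarray(indices, dtype=np.int64)
    n = arr.shape[0]
    # Negative indices down to -n are valid in NumPy; anything else is not.
    bad = (idx < -n) | (idx >= n)
    if bad.any():
        raise IndexError(
            f"indices out of range for axis 0 with size {n}: {idx[bad].tolist()}"
        )
    return arr[idx]
```

Calling `install_fault_handler()` at the top of the function passed to `torch.multiprocessing.spawn` is one way to get a Python-level traceback from the crashing worker, assuming its stderr is visible in the logs.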