Traceback (most recent call last):
File "train.py", line 461, in <module>
train(hyp, opt, device, tb_writer)
File "train.py", line 271, in train
pred = model(imgs) # forward
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py", line 154, in forward
replicas = self.replicate(self.module, self.device_ids[:len(inputs)])
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/data_parallel.py", line 159, in replicate
return replicate(module, device_ids, not torch.is_grad_enabled())
File "/usr/local/lib/python3.8/dist-packages/torch/nn/parallel/replicate.py", line 113, in replicate
replica = module._replicate_for_data_parallel()
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1378, in _replicate_for_data_parallel
replica.__dict__ = self.__dict__.copy()
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 775, in __setattr__
def remove_from(*dicts_or_sets):
File "/usr/local/lib/python3.8/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 25795) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
Pitfalls
1. Some people claim that simply setting dataloader.num_workers = 0 solves it. I have tried this myself and it did not fix the problem.
2. Reducing batch_size does get the program running, but look at the GPU utilization: a single card's memory is nowhere near full and GPU utilization stays very low. (Both workarounds are sketched right after this list.)
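For reference, a minimal sketch of what those two workarounds look like in code. The dummy TensorDataset and the batch size of 8 are placeholders for illustration only, not values from the original train.py:

# Illustration of the two suggested workarounds; they may silence the error,
# but they throttle data loading, so the GPU sits idle waiting for batches.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset so the snippet runs on its own; replace with your real dataset.
dataset = TensorDataset(torch.randn(1024, 3, 64, 64),
                        torch.zeros(1024, dtype=torch.long))

loader = DataLoader(
    dataset,
    batch_size=8,    # workaround 2: smaller batches than the GPU could handle
    num_workers=0,   # workaround 1: load data in the main process only
)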
Cause of the error:
Docker containers limit shm (shared memory) by default, while PyTorch's DataLoader uses shm to hand loaded batches from its worker processes back to the main process. When running with multiple workers, the DataLoader exceeds that limit and its workers get killed.
The dataloader looks in RAM for the batch needed in the current iteration and uses it if it is already there. If it is not, the num_workers worker processes keep loading batches into memory until the dataloader finds the target batch in RAM.
Setting num_workers high means batches are found quickly, because the batch for the next iteration has most likely already been loaded during an earlier iteration.
The downside is a larger memory footprint and a heavier CPU load (the workers that copy data into RAM are CPU processes).
A common rule of thumb is to set num_workers to the number of CPU cores on your machine or server; if the CPU is strong and RAM is plentiful, you can go higher. A sketch of that rule follows below.
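A minimal sketch of that rule of thumb, assuming a placeholder dataset and an arbitrary batch size of 64 (neither comes from the original post):

import os
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset so the snippet is self-contained; replace with your real dataset.
dataset = TensorDataset(torch.randn(1024, 3, 64, 64),
                        torch.zeros(1024, dtype=torch.long))

# Rule of thumb from the text: one worker per CPU core, more if CPU/RAM allow it.
num_workers = os.cpu_count() or 1

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=num_workers,
    pin_memory=True,  # optional: faster host-to-GPU copies at the cost of pinned RAM
)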
Solution:
When starting the Docker container, set --shm-size:
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=2,3 --shm-size 8G -it --rm dev:v1 /bin/bash
[PS] Note that the --shm-size 8G value can be adjusted to whatever suits your machine's configuration.
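A quick way to confirm inside the container that the new limit took effect is to check the size of /dev/shm (the shared-memory mount that --shm-size controls) before training starts. A hedged sketch; the 2 GiB threshold is an arbitrary example, not a requirement from the original post:

import shutil

# shutil.disk_usage reports total/used/free bytes for the given mount point.
total, used, free = shutil.disk_usage("/dev/shm")
print(f"/dev/shm total: {total / 2**30:.1f} GiB, free: {free / 2**30:.1f} GiB")

# Arbitrary sanity threshold for illustration; pick what fits your batches and workers.
if total < 2 * 2**30:
    print("Warning: shared memory looks small; DataLoader workers may hit Bus error.")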