Install PyTorch in the conda environment. The version must not be too new, otherwise you get an error like the following:
:0:rocdevice.cpp :2673: 1263352162 us: 21870: [tid:0x7f64e3eff640] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29
pip install torch==1.13.1+rocm5.2 torchvision==0.14.1+rocm5.2 torchaudio==0.13.1+rocm5.2 -f https://download.pytorch.org/whl/rocm5.2/torch_stable.html
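To check that the pinned wheel really ended up in the conda environment and can see the GPU, something like the following could be run. It is only a sketch: describe_torch is a helper name made up here, and it assumes that ROCm builds of PyTorch expose the GPU through the torch.cuda API and report the HIP version in torch.version.hip.

```python
import importlib


def describe_torch():
    """Return a small dict describing the installed PyTorch build (hypothetical helper)."""
    try:
        torch = importlib.import_module("torch")
    except (ImportError, OSError):
        # torch is missing or its shared libraries could not be loaded
        return {"installed": False}
    return {
        "installed": True,
        "version": torch.__version__,
        # Assumption: this attribute is None on builds without ROCm/HIP
        "hip": getattr(torch.version, "hip", None),
        # ROCm builds reuse the torch.cuda namespace for AMD GPUs
        "gpu_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    print(describe_torch())
```

If the version printed is not the 1.13.1+rocm5.2 build, pip probably pulled a different wheel than the one pinned above.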
Install OpenCL. The gatekeeper ("gate") uses it; training itself does not need it.
sudo pacman -S rocm-opencl-runtime
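Finally, set HSA_OVERRIDE_GFX_VERSION in front of the python command (the full command follows the log below). Without it, rocBLAS aborts with:
rocBLAS error: Cannot read /opt/rocm/lib/rocblas/library/TensileLibrary.dat: Illegal seek for GPU arch : gfx1010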
List of available TensileLibrary Files :
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx900.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1100.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1030.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1102.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx941.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx90a.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx940.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx908.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx1101.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx906.dat"
"/opt/rocm/lib/rocblas/library/TensileLibrary_lazy_gfx942.dat"
./train.sh: line 87: 20822 Aborted (core dumped) python3
HSA_OVERRIDE_GFX_VERSION=10.3.0 python xxx.py
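The same override can also be applied from Python when launching the training script, instead of typing the prefix every time. This is a rough sketch, not the only way to do it: env_with_gfx_override and run_with_override are names made up here, and 10.3.0 is simply the value from the command above. As far as I understand, it makes the ROCm runtime treat the card as gfx1030, which does appear in the TensileLibrary list while gfx1010 does not.

```python
import os
import subprocess
import sys


def env_with_gfx_override(version="10.3.0", base_env=None):
    """Copy an environment mapping and add HSA_OVERRIDE_GFX_VERSION to the copy."""
    env = dict(os.environ if base_env is None else base_env)
    env["HSA_OVERRIDE_GFX_VERSION"] = version
    return env


def run_with_override(argv, version="10.3.0", **kwargs):
    """Run argv with the override set; extra kwargs go straight to subprocess.run."""
    return subprocess.run(argv, env=env_with_gfx_override(version), **kwargs)


if __name__ == "__main__":
    # The child process only echoes the variable, to show it was passed down
    child = [sys.executable, "-c",
             "import os; print(os.environ['HSA_OVERRIDE_GFX_VERSION'])"]
    result = run_with_override(child, capture_output=True, text=True)
    print(result.stdout.strip())
```

For real training the argv would be something like [sys.executable, "xxx.py"], with xxx.py standing in for the training script as in the command above.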
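With that, training starts. I am currently using b4c32 with batch_size=8 and the defaults for everything else, to see whether a few days of training can produce a playable model.
Earlier I trained on a 4090 for 7 hours, which got through 12 generations, and that model kept playing on the first and second lines at the edge of the board.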