NCCL installation problem: nvcc fatal : Value 'gnu++0x' is not defined for option 'std'

This post describes an error encountered while installing NCCL on CentOS 7.2: nvcc fatal : Value 'gnu++0x' is not defined for option 'std'. Resetting the CXX environment variable resolved it.

I ran into the problem below while installing NCCL today. This post records the fix.

Environment:
CentOS Linux release 7.2.1511
CUDA 7.5
gcc version 4.8.5 20150623

Error:

nvcc fatal : Value 'gnu++0x' is not defined for option 'std'

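The same installation succeeded on Ubuntu 14.04. Debugging the Makefile showed that the built-in variable $CXX differs between Ubuntu 14.04 and CentOS 7.
The command used for debugging was make --just-print, which prints the commands make would run without executing them.
On Ubuntu 14.04: CXX=g++
On CentOS 7: CXX=g++ std=gnu++0x
That extra std=gnu++0x is the culprit: nvcc rejects this value for its std option. What a trap!
The fix: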

export CXX=g++

Done. Rerun the build in the same shell and the error is gone.
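The check and the fix above can be sketched as one small shell snippet. This is a minimal illustration, not part of NCCL: the helper name strip_cxx_flags is hypothetical, and it assumes the compiler name is always the first word of CXX, with any extra flags following it.

```shell
#!/bin/sh
# Hypothetical helper (not part of NCCL or CUDA): reduce a compiler setting
# such as "g++ std=gnu++0x" to the bare compiler name by keeping only the
# first whitespace-separated word. Falls back to g++ when the input is empty.
strip_cxx_flags() {
    # Word splitting on the unquoted argument is intentional here.
    set -- $1
    printf '%s\n' "${1:-g++}"
}

# Show what the current shell environment provides for CXX.
echo "CXX before: ${CXX:-<unset>}"

# Override CXX for this shell session so the extra flag never reaches nvcc.
CXX=$(strip_cxx_flags "${CXX:-g++}")
export CXX
echo "CXX after:  $CXX"
```

Run it in the same shell session that will run make (for example by sourcing the file), since an exported variable only affects that shell and its child processes. As an alternative to exporting, GNU make also accepts the variable on the command line, as in make CXX=g++, which applies to that single invocation only.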
