Connection reset, Connection reset by peer, Software caused connection abort: socket write error, recv ...

This article analyzes in detail the various exceptions encountered in Java socket programming, in particular the many causes of the Connection reset by peer error, including too many concurrent connections, abnormal client disconnects, and firewall settings, and gives concrete ways to troubleshoot and resolve them.



java.net.SocketException: Connection reset by peer: socket write error
java.net.SocketException: Connection reset
java.net.SocketException: Software caused connection abort: socket write error

java.net.SocketException: Software caused connection abort: recv failed

Please tell me how, when reading from a socket connection, an IOException with a "Connection reset" message can occur.
I am basically connecting to a Windows server program.
Right now I am using this exception to detect an invalid user login,
so I would like to know in what other ways this exception can occur.

Please help me.
Thanks and regards,
haix

 http://forum.java.sun.com/thread.jspa?threadID=560591&messageID=2755358

http://forum.java.sun.com/thread.jspa?threadID=430179&messageID=4429682

http://forum.java.sun.com/thread.jspa?threadID=609696&messageID=3341613

I wrote a Java program that sends and receives SMS messages over China Unicom's SGIP protocol.
During receiving, however, Unicom waits about 16 seconds after sending the deliver command before it sends the unbind command, and during that gap my program (acting as the server side) throws
java.net.SocketException: Connection reset!
Since inputStream.read(byte[]) is a blocking call that should simply sit there until input arrives, yet the error message says the socket has already been dropped, this puzzles me. I have been stuck on it for a week; please help!
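
For reference, here is a minimal sketch of what such a blocking receive loop could look like in Java, with the exception handling this situation calls for; the port number, timeout and buffer size are arbitrary placeholders, and the SGIP decoding itself is elided:

import java.io.IOException;
import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketException;
import java.net.SocketTimeoutException;

public class SgipReceiverSketch {
    public static void main(String[] args) throws IOException {
        try (ServerSocket server = new ServerSocket(8801)) {       // placeholder listen port
            try (Socket socket = server.accept()) {
                // Without a read timeout, read() blocks until data arrives, the peer
                // closes cleanly (returns -1), or the peer resets the connection
                // (throws SocketException: Connection reset).
                socket.setSoTimeout(20_000);                       // survive the ~16 s idle gap
                InputStream in = socket.getInputStream();
                byte[] buf = new byte[4096];
                while (true) {
                    try {
                        int n = in.read(buf);
                        if (n < 0) {                               // orderly close by the peer
                            break;
                        }
                        // ... decode the SGIP deliver/unbind PDU from buf[0..n) ...
                    } catch (SocketTimeoutException idle) {
                        // nothing arrived within the timeout; keep waiting
                    } catch (SocketException reset) {
                        // the peer sent a TCP RST; the blocked read() is woken up with
                        // "Connection reset" even though we never wrote anything
                        break;
                    }
                }
            }
        }
    }
}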

 

http://topic.youkuaiyun.com/u/20080319/10/0285f5b9-5035-4022-8c3a-5ddc18637777.html

Connection reset by peer
On the server side of a web application, the connection to the database was reset by a "peer". Loosely speaking, "reset by peer" here means that another party with equal privileges (an administrator or another program) forcibly took over or reset the connection, so the current connection looks as if it had been broken. The connection may not actually be gone; it has been "reset", and the programs on either side are not smart enough to find the connection they used to have.
When you run into "connection reset by peer" in this situation, turning off the firewall is often all it takes: the old connection temporarily cannot be found, whether or not it has really been dropped.

 

10053: Software in your host machine abandoned an established connection.

An established connection was aborted by software on the user's host, possibly because of a data-transfer timeout or a protocol error. Also, do not send messages from inside the connect event handler.

 

To summarize:
1. There is no space left in your socket queue.
2. The receiver never acknowledges data sent on a datastream socket.
3. A connection will time out if the local system doesn't receive an acknowledgement (ACK) for data it sent.

Connection reset by peer
This exception can also be thrown when the client aborts the connection: if the client terminates the connection abruptly while it is still in use, the server side throws Connection reset by peer.
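
A minimal illustration (not taken from the original post) of how a server-side read loop can tell an orderly client close, which read() reports as -1, from such an abrupt abort, which surfaces as a SocketException:

import java.io.IOException;
import java.io.InputStream;
import java.net.Socket;
import java.net.SocketException;

// Sketch: handle one already-accepted client connection.
final class ClientHandler implements Runnable {
    private final Socket client;

    ClientHandler(Socket client) {
        this.client = client;
    }

    @Override
    public void run() {
        byte[] buf = new byte[1024];
        try (Socket s = client; InputStream in = s.getInputStream()) {
            int n;
            while ((n = in.read(buf)) != -1) {
                // ... process n bytes ...
            }
            // -1: the client closed the connection gracefully (FIN)
        } catch (SocketException e) {
            // "Connection reset" / "Connection reset by peer": the client aborted
            // (crashed, was killed, pressed Stop), so a TCP RST arrived instead of a FIN
        } catch (IOException e) {
            // other I/O failures
        }
    }
}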

http://topic.youkuaiyun.com/u/20080402/16/7fe0a9c2-cef5-4756-8c45-157555cd0097.html


 

http://topic.youkuaiyun.com/u/20080328/10/e08d894a-319a-4985-8407-50e103305e6c.html

 

Here are some notes I have on network exceptions; I am posting them so everyone can learn from them:
Exception 1 is java.net.BindException: Address already in use: JVM_Bind. It occurs on the server side when calling new ServerSocket(port) (port being an integer between 0 and 65535). The cause is that another process has already bound the same port and is listening on it. Running netstat -an at that point shows the port in a LISTENING state. Picking a port that is not in use solves the problem (a code sketch for this appears after this list).


Exception 2 is java.net.ConnectException: Connection refused: connect. It occurs on the client side when calling new Socket(ip, port). The cause is either that the machine with that IP address cannot be reached (there is no route from the current machine to that IP), or that the IP is reachable but nothing is listening on the given port. When this happens, first check that the client's ip and port are correct; if they are, ping the server from the client to see whether it is reachable (if the server blocks ping, other means are needed); if it is reachable, check whether the program that should be listening on that port on the server has actually been started. That will resolve the problem.


Exception 3 is java.net.SocketException: Socket is closed. It can occur on either the client or the server. The cause is reading from or writing to a connection after having closed it locally (by calling Socket's close method).


Exception 4 is java.net.SocketException: Connection reset (or Connection reset by peer: Socket write error). It can occur on either the client or the server, and has two causes. First, if one end's socket is closed (either deliberately or because the process exited abnormally) and the other end still sends data, the first packet it sends triggers the exception (Connection reset by peer). Second, one end exits without closing the connection, and the other end is reading from the connection, which throws the exception (Connection reset). In short, it is caused by reading or writing after the connection has been torn down.


Exception 5 is java.net.SocketException: Broken pipe. It can occur on either the client or the server. In the first case of exception 4 (that is, after SocketException: Connection reset by peer: Socket write error has already been thrown), continuing to write data throws this exception. The way to deal with these last two exceptions is, first, to make sure the program closes all its network connections before it exits, and second, to detect the peer's close and, once detected, close the local end of the connection as well.
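
To illustrate exception 1 above, here is a hedged sketch that simply retries the next port when BindException is thrown; the port range used is an arbitrary placeholder:

import java.io.IOException;
import java.net.BindException;
import java.net.ServerSocket;

public class BindRetrySketch {
    /** Try ports [start, start + attempts) and return the first one that binds. */
    static ServerSocket bindFirstFree(int start, int attempts) throws IOException {
        for (int port = start; port < start + attempts; port++) {
            try {
                return new ServerSocket(port);
            } catch (BindException alreadyInUse) {
                // "Address already in use: JVM_Bind" - another process (or an earlier
                // instance of this one) is already listening here; try the next port
            }
        }
        throw new IOException("no free port in range " + start + ".." + (start + attempts - 1));
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocket server = bindFirstFree(9000, 10)) {   // placeholder range
            System.out.println("Listening on port " + server.getLocalPort());
        }
    }
}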

Client-side error code 10053: Software caused connection abort.

 

This also touches on the distinction between blocking and non-blocking calls, and between blocking and non-blocking sockets.

A blocking call does not let the program call anything else until the requested task has completed; on Windows it also blocks message dispatching on the calling thread. A non-blocking call starts the operation and returns immediately: if a result is already available it returns the result, otherwise it returns an error indicating that the result is still pending, without waiting for the task to finish.
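
In Java terms this maps onto the difference between a blocking stream read on a plain Socket and a SocketChannel switched into non-blocking mode. A minimal sketch, with a placeholder host and port:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class NonBlockingReadSketch {
    public static void main(String[] args) throws IOException {
        try (SocketChannel ch = SocketChannel.open(new InetSocketAddress("example.com", 80))) {
            ch.configureBlocking(false);            // switch to non-blocking mode
            ByteBuffer buf = ByteBuffer.allocate(4096);
            int n = ch.read(buf);                   // returns immediately
            if (n > 0) {
                // data was already available
            } else if (n == 0) {
                // nothing to read right now; a blocking stream read() would have
                // parked the thread here instead of returning
            } else {
                // -1: the peer has closed the connection
            }
        }
    }
}

In practice a non-blocking channel is driven by a Selector (or by a framework such as MINA, mentioned below) rather than polled like this.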

http://www.aka.org.cn/Lectures/002/Lecture-2.1.8/Lecture-2.1.8/new_page_15.htm

http://www.cppblog.com/kenlistian/archive/2007/12/27/39746.html

http://hi.baidu.com/evenque/blog/item/1ccfc63ffc3527c17d1e7188.html

http://www.cic.tsinghua.edu.cn/jdx/lunwen/WinSockx.htm

 

Causes of Connection reset by peer:
The frequently seen Connection reset by peer can have many causes, but the more common ones are:
1. The number of concurrent connections exceeds what the server can carry, so the server drops some of them.
2. The user closed the browser while the server was still sending data to the client.
3. The user pressed Stop in the browser.
Many people say the problem is caused by the client side and cannot be controlled, which makes it rather frustrating (a server-side sketch of cases 2 and 3 follows below).
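
As a rough sketch of cases 2 and 3 (not from the original text): the server keeps streaming a response after the browser has gone away, one of the writes eventually fails with the reset, and all the server can do is stop sending and clean up:

import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.net.SocketException;

final class ResponseWriterSketch {
    /** Stream a (hypothetical) large payload to an already-accepted client. */
    static void sendLargeResponse(Socket client, byte[] payload) {
        try (OutputStream out = client.getOutputStream()) {
            int chunk = 8192;
            for (int off = 0; off < payload.length; off += chunk) {
                int len = Math.min(chunk, payload.length - off);
                out.write(payload, off, len);   // if the client pressed Stop or closed the
                out.flush();                    // browser, one of these writes will fail
            }
        } catch (SocketException e) {
            // "Connection reset by peer: socket write error" - nothing to do except
            // stop sending and release the connection's resources
        } catch (IOException e) {
            // other I/O problems
        }
    }
}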

 

The cause of this problem is that the server end of the connection had already been reset while the client kept receiving and sending data through it. Searching the web for the error shows that it is most often caused by a firewall. Nice, learned another trick.

 

Notes on socket, NIO socket, and the NIO socket framework MINA

Windows Sockets Error Codes

http://msdn2.microsoft.com/en-us/library/ms740668.aspx


Socket communication follows rules of its own. If you want to keep a long-lived connection, you should define a communication protocol; writing the terminating \0 is part of that protocol, as is waiting for the next file after one has been transferred. If you do not need a long-lived connection, you could use a web service instead, which makes the protocol more readable and easier to package into a product.

Judging from the way your program expects read() to finish, it does not look like you intend to keep a long-lived connection.
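
One common way to turn that advice into code (an illustrative sketch, not the protocol discussed in the thread) is to frame every file or message with a length prefix, so the reader always knows where one transfer ends and the next begins:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

final class LengthPrefixedFraming {
    /** Sender side: a 4-byte length prefix followed by the payload. */
    static void writeFrame(OutputStream raw, byte[] payload) throws IOException {
        DataOutputStream out = new DataOutputStream(raw);
        out.writeInt(payload.length);
        out.write(payload);
        out.flush();
    }

    /** Receiver side: returns the next payload, or null on a clean end of stream. */
    static byte[] readFrame(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        int length;
        try {
            length = in.readInt();      // blocks until the next frame (or EOF) arrives
        } catch (EOFException endOfStream) {
            return null;                // the peer closed between frames: normal termination
        }
        byte[] payload = new byte[length];
        in.readFully(payload);          // read exactly 'length' bytes, however many reads it takes
        return payload;
    }
}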

 


 


This is caused by the network connection being dropped. It usually happens when the traffic passes through a firewall: firewalls generally have an idle-timeout mechanism, and when a TCP session carries no data for a long time the firewall tears the session down, which then produces the Connection reset by peer error.
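
If the connection really has to sit idle across such a firewall, two common mitigations are TCP keep-alive and an application-level heartbeat. The sketch below shows both; note that the keep-alive probe interval is decided by the operating system (often two hours by default) and may be longer than the firewall's timeout, and the 60-second heartbeat period and heartbeat byte are arbitrary placeholders that the peer's protocol would have to tolerate:

import java.io.IOException;
import java.io.OutputStream;
import java.net.Socket;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class KeepAliveSketch {
    static void keepConnectionWarm(Socket socket) throws IOException {
        // TCP-level keep-alive probes (interval chosen by the OS, often 2 hours).
        socket.setKeepAlive(true);

        // Application-level heartbeat: write a small no-op byte every 60 seconds so
        // the firewall keeps seeing traffic on the session.
        OutputStream out = socket.getOutputStream();
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        timer.scheduleAtFixedRate(() -> {
            try {
                out.write(0x00);
                out.flush();
            } catch (IOException dead) {
                timer.shutdown();       // the connection is gone; stop the heartbeat
            }
        }, 60, 60, TimeUnit.SECONDS);
    }
}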

 


http://topic.youkuaiyun.com/t/20060915/12/5024325.html


Bridging non-blocking IO and blocking IO - input streams

Bridging non-blocking IO and blocking IO - output streams

Latest findings on this problem:
1. Stepping through with the MyEclipse debugger, I noticed something when inspecting the inputStream variable:
the SocketInputStream's channel is null, and I did not know why.
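
That null is in fact expected: Socket.getChannel() only returns a channel when the socket was created through NIO (SocketChannel.open or ServerSocketChannel.accept); a socket constructed with new Socket(...) always reports a null channel. A quick sketch with a placeholder host and port:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.channels.SocketChannel;

public class ChannelNullSketch {
    public static void main(String[] args) throws IOException {
        // Classic blocking socket: no channel is ever attached to it.
        try (Socket plain = new Socket("example.com", 80)) {
            System.out.println(plain.getChannel());               // prints: null
        }

        // NIO socket: the Socket view is backed by the channel that created it.
        try (SocketChannel ch = SocketChannel.open(new InetSocketAddress("example.com", 80))) {
            System.out.println(ch.socket().getChannel() != null); // prints: true
        }
    }
}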

A few more remarks found online, pasted here:

"No buffer space available, recv failed"

Thanks to sandyen for the answer; I found the same suggestion online, but that was not the cause.
The problem has since been solved, and it was indeed not a problem in the program.
netstat -an showed a huge number of ports in use, with connections to ports 139 and 445 on many machines.
The machine had been infected by the Sasser worm; after downloading the patch, installing it, and rebooting, the problem was gone.
The cause of the exception was that a large number of the system's socket resources were tied up,
leaving too few resources to receive the data reported or returned by the front end.

http://topic.youkuaiyun.com/t/20060315/11/4615627.html



Reposted from: https://www.cnblogs.com/kaixin110/archive/2008/04/11/1148671.html
