A problem encountered when shuffling data in Spark: java.io.IOException: Connection reset by peer

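This article looks at a problem Spark can hit when shuffling a large dataset: an IOException raised while data is transferred over Netty. It covers a commonly suggested fix, switching the block transfer service from Netty to NIO through configuration, which did not help in this case, and then traces the error to its actual cause.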

java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
        at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
        at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
        at sun.nio.ch.IOUtil.read(IOUtil.java:192)
        at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
        at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
        at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
        at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
        at java.lang.Thread.run(Thread.java:745)

2016-03-10,14:39:26,362 ERROR org.apache.spark.network.server.TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=19949029161, chunkIndex=13}, buffer=FileSegmentManagedBuffer{file=/home/work/hdd9/yarn/c3prc-hadoop/nodemanager/usercache/h_sns/appcache/application_1447144693824_327984/blockmgr-fa013657-df4b-402c-84d4-8fc022853d88/35/shuffle_2_1290_0.data, offset=0, length=1039492}} to /10.114.2.44:61221; closing connection

In Spark 1.2, data is transferred between nodes over Netty during a shuffle, and with large datasets this sometimes fails. The suggested fix is:

val conf = new SparkConf().set("spark.shuffle.blockTransferService", "nio")

That is, use nio in place of netty.

However, I tried this and it did not help.

Next, I looked at the error messages for the failed stage in the Spark UI and found that every error was reported by the same node.

This suggests that the node may have been assigned too much data during the shuffle.
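
One way to test this guess is to count the records per key before the join. The following is a minimal sketch, assuming `a` is an `RDD[(String, V)]`; `heaviestKeys` is a hypothetical helper name and not part of Spark.

```scala
import org.apache.spark.SparkContext._ // pair-RDD implicits; required on Spark 1.2
import org.apache.spark.rdd.RDD

// Hypothetical helper: returns the n most frequent keys of `a` with their record counts.
def heaviestKeys[V](a: RDD[(String, V)], n: Int = 10): Array[(String, Long)] =
  a.map { case (key, _) => (key, 1L) }               // one count per record
    .reduceByKey(_ + _)                              // total number of records per key
    .top(n)(Ordering.by[(String, Long), Long](_._2)) // keys with the largest counts first
```

If one key accounts for a large share of the records, all of those records are sent to the same reduce task, which would match the single failing node seen in the Spark UI.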

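Checking the code, I found that the large dataset a (whose records have the form (key, value)) contained a large number of records whose key was the empty string "". As a result, a.leftOuterJoin(b) produced a large number of results with an empty-string key. All of these results were assigned to the same node, which caused that node to crash.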
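
The effect can be illustrated without a cluster. The sketch below is plain Scala, not Spark code: it assumes the join uses hash partitioning (partition = non-negative modulo of the key's hash code), and the object name, sample data, and partition count are all made up for illustration.

```scala
object ShuffleSkewDemo {
  // Non-negative modulo of the key's hash code, the rule a hash partitioner applies.
  def partitionFor(key: String, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Counts how many records land in each partition.
  def recordsPerPartition(records: Seq[(String, Int)], numPartitions: Int): Map[Int, Int] =
    records
      .groupBy { case (key, _) => partitionFor(key, numPartitions) }
      .map { case (partition, recs) => partition -> recs.size }

  // Hypothetical sample: 1000 records with key "" plus 1000 records with distinct keys.
  val sample: Seq[(String, Int)] =
    Seq.fill(1000)(("", 0)) ++ (1 to 1000).map(i => (s"user$i", i))

  def main(args: Array[String]): Unit =
    recordsPerPartition(sample, numPartitions = 8).toSeq.sortBy(_._1).foreach {
      case (partition, count) => println(s"partition $partition: $count records")
    }
}
```

Because "".hashCode is 0, every empty-key record maps to partition 0 regardless of the partition count, while the distinct keys spread across the remaining partitions. Handling the empty keys separately before the join (for example, filtering them out if they carry no meaning) is one possible way to avoid the hot partition, though whether that is acceptable depends on the data.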

Source: http://m.blog.youkuaiyun.com/article/details?id=50848392