I have spent the last few days digging into the data path between the HDFS datanode and its clients, and found that the datanode currently uses NIO for its low-level communication. The readBlock method of the datanode's DataXceiver is what sends the data of a block to a client or to another datanode. Let's look at a few pieces of code.
    try {
      try {
        blockSender = new BlockSender(block, startOffset, length,
            true, true, false, datanode, clientTraceFmt);
      } catch (IOException e) {
        out.writeShort(DataTransferProtocol.OP_STATUS_ERROR);
        throw e;
      }
      out.writeShort(DataTransferProtocol.OP_STATUS_SUCCESS); // send op status
      long read = blockSender.sendBlock(out, baseStream, null); // send data
After reading the client's request header, readBlock starts sending the data. Two streams are involved: out is an ordinary socket stream, while baseStream is the one used for NIO. Inside sendBlock, after some checksum computation and offset adjustment, sendChunks is called to send the data the client asked for. If transferToAllowed is set in the configuration, the sendfile mechanism is used to speed up the transfer; otherwise the data is sent directly through the socket stream.
    if (transferToAllowed && !verifyChecksum &&
        baseStream instanceof SocketOutputStream &&
        blockIn instanceof FileInputStream) {
      FileChannel fileChannel = ((FileInputStream)blockIn).getChannel();
      // blockInPosition also indicates sendChunks() uses transferTo.
      blockInPosition = fileChannel.position();
      streamForSendChunks = baseStream;
    ......
    while (endOffset > offset) {
      long len = sendChunks(pktBuf, maxChunksPerPacket,
                            streamForSendChunks);
      offset += len;
      totalRead += len + ((len + bytesPerChecksum - 1)/bytesPerChecksum*
                          checksumSize);
      seqno++;
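As background, here is a minimal standalone sketch of the zero-copy pattern the transferToAllowed branch relies on: FileChannel.transferTo() maps to sendfile(2) on Linux, so the block bytes move from the page cache to the socket without a round trip through user space. The class and method names below are illustrative, not Hadoop's.

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

// Illustrative sketch (not Hadoop code): stream a region of a file to a
// socket using the zero-copy transferTo() call. A blocking SocketChannel
// is assumed, so transferTo() waits rather than returning 0 in a tight loop.
public class ZeroCopySend {
  static void send(FileInputStream blockFile, SocketChannel sock,
                   long position, long count) throws IOException {
    FileChannel fc = blockFile.getChannel();
    while (count > 0) {
      // transferTo may send fewer bytes than requested; loop until done.
      long n = fc.transferTo(position, count, sock);
      position += n;
      count -= n;
    }
  }
}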
In the sendChunks method we find that when the data is sent over NIO with transferTo(), a wait has to be performed first; the comment in the code explains that this works around a JRE bug.
      // first write the packet
      sockOut.write(buf, 0, dataOff);
      // no need to flush. since we know out is not a buffered stream.
      sockOut.transferToFully(fileChannel, blockInPosition, len);
    }
---------------------------------------------
  public void transferToFully(FileChannel fileCh, long position, int count)
                              throws IOException {
    while (count > 0) {
      /*
       * Ideally we should wait after transferTo returns 0. But because of
       * a bug in JRE on Linux (http://bugs.sun.com/view_bug.do?bug_id=5103988),
       * which throws an exception instead of returning 0, we wait for the
       * channel to be writable before writing to it. If you ever see
       * IOException with message "Resource temporarily unavailable"
       * thrown here, please let us know.
       *
       * Once we move to JAVA SE 7, wait should be moved to correct place.
       */
      waitForWritable();
      int nTransfered = (int) fileCh.transferTo(position, count, getChannel());
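      // (rest of the loop elided: position and count are advanced by
      // nTransfered until the requested range has been fully sent)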
It is precisely this waitForWritable() call where, when we dumped stacks with jstack, we found many threads blocked. The reason is that before the file's transferTo can write, it must wait until the channel is writable. Underneath, a register-then-select sequence does the actual waiting, with a timeout check.
  /**
   * Waits on the channel with the given timeout using one of the
   * cached selectors. It also removes any cached selectors that are
   * idle for a few seconds.
   *
   * @param channel
   * @param ops
   * @param timeout
   * @return
   * @throws IOException
   */
  int select(SelectableChannel channel, int ops, long timeout)
      throws IOException {
    SelectorInfo info = get(channel);
    SelectionKey key = null;
    int ret = 0;
    try {
      while (true) {
        long start = (timeout == 0) ? 0 : System.currentTimeMillis();
        key = channel.register(info.selector, ops); // Step1
        ret = info.selector.select(timeout);        // Step2
        if (ret != 0) {
          return ret;
        }
        /* Sometimes select() returns 0 much before timeout for
         * unknown reasons. So select again if required.
         */
        if (timeout > 0) {
          timeout -= System.currentTimeMillis() - start;
          if (timeout <= 0) {
            return 0;
          }
        }
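        // (remainder elided: key cleanup and returning the selector
        // to the cache follow in the rest of the method)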
Strangely, the time spent between Step1 and Step2 is extremely uneven; tracing it with btrace shows a very scattered distribution.
timecost (microseconds) ------- Distribution ------------- count
128 | 0
256 | 51895
512 |@@ 243717
1024 |@@@@@@@@@@@@@ 1410085
2048 |@@@@@@@@@@@@@@@@@@@@ 2135480
4096 |@@ 283689
8192 | 8247
16384 | 6856
32768 | 6741
65536 | 8390
131072 | 2019
262144 | 171
524288 | 21
1048576 | 3
2097152 | 0
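For reference, a distribution like the one above can be gathered with a BTrace script along the following lines. This is only a sketch: the original tracing script is not shown in this post, and the clazz value assumes the select() method above lives in the SelectorPool inner class of org.apache.hadoop.net.SocketIOWithTimeout, so adjust it to your Hadoop version.

import com.sun.btrace.annotations.BTrace;
import com.sun.btrace.annotations.Duration;
import com.sun.btrace.annotations.Kind;
import com.sun.btrace.annotations.Location;
import com.sun.btrace.annotations.OnMethod;
import static com.sun.btrace.BTraceUtils.*;

// Hedged sketch: print how long each select() call took, in microseconds.
@BTrace
public class SelectTiming {
  @OnMethod(clazz = "org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool",
            method = "select",
            location = @Location(Kind.RETURN))
  public static void onSelectReturn(@Duration long nanos) {
    // @Duration is the wall-clock time spent inside select(), in nanoseconds
    println(strcat("select() took (us): ", str(nanos / 1000)));
  }
}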
A large number of long-tail spikes show up. The timeout HDFS currently passes to this select is far too long: it defaults to 8 minutes. As an experiment, it can be shortened with the parameter below (the unit is milliseconds, so the value shown means 1 second):
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>1000</value>
</property>
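Keep in mind that dfs.datanode.socket.write.timeout is the datanode's overall socket write timeout, which is the same value ultimately handed to the select() above; once it expires, the pending write fails with a timeout exception. So 1 second is an experimental setting, and very slow clients may now hit write timeouts that the 8-minute default would have absorbed.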
The histogram below shows that the spikes are eliminated.
value (microseconds) ------------- Distribution ------------- count
128 | 0
256 | 221
512 | 6896
1024 |@@@@@@@@@@@@@ 98509
2048 |@@@@@@@@@@@@@@@@@@@@@ 154922
4096 |@@@ 23579
8192 | 1490
16384 | 438
32768 | 85
65536 | 5
131072 | 0
---------------------------------------------