Hadoop 4.1.0 CDH4: source code analysis of reading a file

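This article walks through how a Hadoop client reads a file: how the client talks to the NameNode to obtain block information, how it chooses a suitable DataNode to read from, and how a BlockReader reads the data packet by packet and verifies it.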

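The previous article analyzed how Hadoop writes a file. With the write path understood, the read path is easier to follow.

As before, this article focuses on the client-side implementation, and as before I recommend reading http://www.cnblogs.com/duguguiyu/archive/2009/02/22/1396034.html first. That post describes the overall read flow as follows:

Whether a file is being read or written, the master server (NameNode) acts as a broker. The client submits its request to the master server, which picks suitable data servers and introduces them to the client; the client and the data servers then talk directly and read or write as they please. This strategy resembles DMA: it lowers the load on the master server and improves efficiency.
Consequently, the most important communication during file reads and writes takes place between the client and the data servers. The protocol they run is ClientDatanodeProtocol. You will not find any read or write interfaces in that protocol, because in Hadoop reads and writes do not go through the RPC mechanism; they use a separate communication framework of their own. On the data server side, the DataNode class holds an instance of DataXceiverServer, which waits for requests in a dedicated thread and starts a DataXceiver thread for each request it receives. One thread per request keeps the logic on the data server simple. DataXceiver currently supports six request types; for the exact request and reply packet formats, see the links in the original post. The Hadoop implementation does not wrap these requests in classes but writes the fields to the stream in order, which makes the code harder to read and to maintain; the reason for this choice is unclear.
Compared with writing, reading a file is a simple process. The client, DFSClient, contains a class DFSClient.DFSInputStream, and an instance is created whenever a file needs to be read. It first calls the getBlockLocations interface defined by ClientProtocol, passing the NameNode the file path, read position and read length, and receives a LocatedBlocks object. That object holds a list of LocatedBlock entries, which describe every block in the requested range together with the locations of all data servers holding each block. When reading starts, DFSInputStream picks one server from the set holding the current block and connects to it. The selection algorithm in the current implementation is very simple: it picks the first data server that is not dead, without considering where the client sits relative to the data servers. Once the read request reaches the data server, a DataXceiver handles it and the data is sent back to the client packet by packet. When the whole block has been read, the connection is closed and the client connects to a data server for the next block. This repeats until everything requested has been read.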

As with writing, the main read logic lives in the DFSInputStream class. Start with the constructor:

```java
DFSInputStream(DFSClient dfsClient, String src, int buffersize, boolean verifyChecksum
               ) throws IOException, UnresolvedLinkException {
  this.dfsClient = dfsClient;
  this.verifyChecksum = verifyChecksum;
  this.buffersize = buffersize;
  this.src = src;
  this.socketCache = dfsClient.socketCache;
  prefetchSize = dfsClient.getConf().prefetchSize;
  timeWindow = dfsClient.getConf().timeWindow;
  nCachedConnRetry = dfsClient.getConf().nCachedConnRetry;
  openInfo();
}
```

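The constructor ends by calling openInfo, the setup method for reading, which asks the NameNode for the block locations covering the first prefetchSize bytes of the opened file.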

```java
// Excerpt of the block-fetching logic reached from openInfo. It returns the
// length of a last block that is still being written, or -1 if that block
// has no known locations yet.
LocatedBlocks newInfo = dfsClient.getLocatedBlocks(src, 0, prefetchSize);
if (DFSClient.LOG.isDebugEnabled()) {
  DFSClient.LOG.debug("newInfo = " + newInfo);
}
if (newInfo == null) {
  throw new IOException("Cannot open filename " + src);
}

// If a block list is already cached, it must agree with the new one.
if (locatedBlocks != null) {
  Iterator<LocatedBlock> oldIter = locatedBlocks.getLocatedBlocks().iterator();
  Iterator<LocatedBlock> newIter = newInfo.getLocatedBlocks().iterator();
  while (oldIter.hasNext() && newIter.hasNext()) {
    if (! oldIter.next().getBlock().equals(newIter.next().getBlock())) {
      throw new IOException("Blocklist for " + src + " has changed!");
    }
  }
}
locatedBlocks = newInfo;
long lastBlockBeingWrittenLength = 0;
if (!locatedBlocks.isLastBlockComplete()) {
  final LocatedBlock last = locatedBlocks.getLastLocatedBlock();
  if (last != null) {
    if (last.getLocations().length == 0) {
      return -1;
    }
    final long len = readBlockLength(last);
    last.getBlock().setNumBytes(len);
    lastBlockBeingWrittenLength = len;
  }
}

currentNode = null;
return lastBlockBeingWrittenLength;
```

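1. dfsClient.getLocatedBlocks ends up calling namenode.getBlockLocations and returns the located blocks for the byte range starting at offset 0 with length prefetchSize.

2. If a block list is already cached in locatedBlocks, the new list is compared with it block by block, and an IOException is thrown if any block differs. The cache is then replaced with the new list.

3. It checks isLastBlockComplete. In Hadoop, file data is written as blocks to DataNodes, and the NameNode learns which blocks make up a file from the DataNodes' periodic reports. When a file is read while its last block is still being written, the NameNode may not yet know that block's length, so the client asks a DataNode holding the block for its current length via readBlockLength and records it. If the last block has no known locations yet, the code returns -1.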
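The consistency check in step 2 can be looked at in isolation. The following is a simplified sketch, assuming blocks are identified by plain long IDs (the real code compares the objects returned by getBlock()); the class and method names are hypothetical.

```java
import java.util.Iterator;
import java.util.List;

final class BlockListCheck {
    /** Returns true if the two lists agree on every position that both of them have. */
    static boolean sameCommonPrefix(List<Long> cached, List<Long> fetched) {
        Iterator<Long> oldIter = cached.iterator();
        Iterator<Long> newIter = fetched.iterator();
        // Stop as soon as either list runs out, exactly like the loop in the excerpt.
        while (oldIter.hasNext() && newIter.hasNext()) {
            if (!oldIter.next().equals(newIter.next())) {
                return false;
            }
        }
        return true;
    }
}
```

Because only the common prefix is compared, a file that has gained blocks since the previous call still passes the check; only a changed block at an existing position is reported.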

Now for the read method.

```java
public synchronized int read(final byte buf[], int off, int len) throws IOException {
  ReaderStrategy byteArrayReader = new ByteArrayStrategy(buf);

  return readWithStrategy(byteArrayReader, off, len);
}
```

It first creates a ByteArrayStrategy reader, which copies block data into the buf array. Hadoop also provides a ByteBufferStrategy to support reads into an NIO ByteBuffer.
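The idea behind the two strategies is that the read loop stays the same while the destination of the bytes varies. Here is a minimal sketch of that pattern; CopyStrategy, ByteArrayCopy and ByteBufferCopy are hypothetical names for illustration and are not the Hadoop classes, whose signatures may differ.

```java
import java.nio.ByteBuffer;

/** Hypothetical simplified strategy: where the bytes land is the strategy's concern. */
interface CopyStrategy {
    int doRead(ByteBuffer source, int off, int len);
}

/** Copies into a caller-supplied byte array at the given offset. */
final class ByteArrayCopy implements CopyStrategy {
    private final byte[] buf;

    ByteArrayCopy(byte[] buf) {
        this.buf = buf;
    }

    @Override
    public int doRead(ByteBuffer source, int off, int len) {
        int n = Math.min(len, source.remaining());
        source.get(buf, off, n);
        return n;
    }
}

/** Copies into a ByteBuffer; off is ignored because the target tracks its own position. */
final class ByteBufferCopy implements CopyStrategy {
    private final ByteBuffer target;

    ByteBufferCopy(ByteBuffer target) {
        this.target = target;
    }

    @Override
    public int doRead(ByteBuffer source, int off, int len) {
        int n = Math.min(Math.min(len, source.remaining()), target.remaining());
        ByteBuffer view = source.duplicate(); // shares content, independent position and limit
        view.limit(view.position() + n);
        target.put(view);
        source.position(source.position() + n);
        return n;
    }
}
```

A caller holding a CopyStrategy does not need to know which destination is in use, which is what lets one readWithStrategy method serve both kinds of read.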

It then calls readWithStrategy.

```java
// Excerpt from readWithStrategy; the enclosing retry loop and the catch
// clauses are not shown.
try {
  // currentNode can be left as null if previous read had a checksum
  // error on the same block. See HDFS-3067
  if (pos > blockEnd || currentNode == null) {
    currentNode = blockSeekTo(pos);
  }
  // Never read past the end of the current block in one call.
  int realLen = (int) Math.min(len, (blockEnd - pos + 1L));
  int result = readBuffer(strategy, off, realLen, corruptedBlockMap);

  if (result >= 0) {
    pos += result;
  } else {
    // got a EOS from reader though we expect more data on it.
    throw new IOException("Unexpected EOS from the reader");
  }
  if (dfsClient.stats != null && result != -1) {
    dfsClient.stats.incrementBytesRead(result);
  }
  return result;
```

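1. pos is the current offset in the file. The block to read is first looked up from pos.

2. A DataNode is chosen from the location list of that block, which the NameNode returned together with the block.

3. A connection from the client to the DataNode is established and a BlockReader is created.

4. readBuffer calls the doRead method of the given ReaderStrategy to read the requested data from the block.

Each of these four steps is explained below: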

1. Getting the block object

```java
public int findBlock(long offset) {
  // create fake block of size 0 as a key
  LocatedBlock key = new LocatedBlock(
      new ExtendedBlock(), new DatanodeInfo[0], 0L, false);
  key.setStartOffset(offset);
  key.getBlock().setNumBytes(1);
  Comparator<LocatedBlock> comp =
    new Comparator<LocatedBlock>() {
      // Returns 0 iff a is inside b or b is inside a
      @Override
      public int compare(LocatedBlock a, LocatedBlock b) {
        long aBeg = a.getStartOffset();
        long bBeg = b.getStartOffset();
        long aEnd = aBeg + a.getBlockSize();
        long bEnd = bBeg + b.getBlockSize();
        if(aBeg <= bBeg && bEnd <= aEnd
            || bBeg <= aBeg && aEnd <= bEnd)
          return 0; // one of the blocks is inside the other
        if(aBeg < bBeg)
          return -1; // a's left bound is to the left of the b's
        return 1;
      }
    };
  return Collections.binarySearch(blocks, key, comp);
}
```

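The core of obtaining the block object is the findBlock method. It runs a binary search over the block list, comparing each block's byte range within the file against the current offset to find the block that contains it.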
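To see the containment comparator at work without the Hadoop types, here is a small self-contained sketch. BlockRange is a hypothetical stand-in for LocatedBlock that keeps only a start offset and a size; the lookup itself follows the same approach as findBlock.

```java
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for LocatedBlock: the byte range [start, start + size) in the file. */
final class BlockRange {
    final long start;
    final long size;

    BlockRange(long start, long size) {
        this.start = start;
        this.size = size;
    }
}

final class BlockIndex {
    // Two ranges compare as equal when one lies inside the other.
    private static final Comparator<BlockRange> CONTAINMENT = (a, b) -> {
        long aEnd = a.start + a.size;
        long bEnd = b.start + b.size;
        if ((a.start <= b.start && bEnd <= aEnd) || (b.start <= a.start && aEnd <= bEnd)) {
            return 0;
        }
        return a.start < b.start ? -1 : 1;
    };

    /** Returns the index of the block containing offset, or a negative value if none does. */
    static int findBlock(List<BlockRange> blocks, long offset) {
        BlockRange key = new BlockRange(offset, 1); // one-byte probe range
        return Collections.binarySearch(blocks, key, CONTAINMENT);
    }
}
```

The list must be sorted by start offset and the ranges must not overlap, otherwise the binary search is not well defined. A negative result means the offset lies outside every known block, which is the cue to fetch more block locations.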

2. Getting the DataNode

```java
static DatanodeInfo bestNode(DatanodeInfo nodes[],
                             AbstractMap<DatanodeInfo, DatanodeInfo> deadNodes)
                             throws IOException {
  if (nodes != null) {
    for (int i = 0; i < nodes.length; i++) {
      if (!deadNodes.containsKey(nodes[i])) {
        return nodes[i];
      }
    }
  }
  throw new IOException("No live nodes contain current block");
}
```

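Choosing the best DataNode is simple: iterate over all DataNodes holding the block and take the first one that is not in deadNodes. The NameNode already returns the DataNode list in priority order, so the first live entry is also the preferred one.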

```java
private DNAddrPair chooseDataNode(LocatedBlock block)
  throws IOException {
  while (true) {
    DatanodeInfo[] nodes = block.getLocations();
    try {
      DatanodeInfo chosenNode = bestNode(nodes, deadNodes);
      final String dnAddr =
          chosenNode.getXferAddr(dfsClient.connectToDnViaHostname());
      if (DFSClient.LOG.isDebugEnabled()) {
        DFSClient.LOG.debug("Connecting to datanode " + dnAddr);
      }
      InetSocketAddress targetAddr = NetUtils.createSocketAddr(dnAddr);
      return new DNAddrPair(chosenNode, targetAddr);
    } catch (IOException ie) {
      String blockInfo = block.getBlock() + " file=" + src;
      if (failures >= dfsClient.getMaxBlockAcquireFailures()) {
        throw new BlockMissingException(src, "Could not obtain block: " + blockInfo,
                                        block.getStartOffset());
      }

      if (nodes == null || nodes.length == 0) {
        DFSClient.LOG.info("No node available for block: " + blockInfo);
      }
      DFSClient.LOG.info("Could not obtain block " + block.getBlock()
          + " from any node: " + ie
          + ". Will get new block locations from namenode and retry...");
      try {
        // Introducing a random factor to the wait time before another retry.
        // The wait time is dependent on # of failures and a random factor.
        // At the first time of getting a BlockMissingException, the wait time
        // is a random number between 0..3000 ms. If the first retry
        // still fails, we will wait 3000 ms grace period before the 2nd retry.
        // Also at the second retry, the waiting window is expanded to 6000 ms
        // alleviating the request rate from the server. Similarly the 3rd retry
        // will wait 6000ms grace period before retry and the waiting window is
        // expanded to 9000ms.
        double waitTime = timeWindow * failures +       // grace period for the last round of attempt
          timeWindow * (failures + 1) * DFSUtil.getRandom().nextDouble(); // expanding time window for each failure
        DFSClient.LOG.warn("DFS chooseDataNode: got # " + (failures + 1) + " IOException, will wait for " + waitTime + " msec.");
        Thread.sleep((long)waitTime);
      } catch (InterruptedException iex) {
      }
      deadNodes.clear(); //2nd option is to remove only nodes[blockId]
      openInfo();
      block = getBlockAt(block.getStartOffset(), false);
      failures++;
      continue;
    }
  }
}
```

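The interesting part here is the retry when no usable DataNode can be found for the block. It uses a graceful sleep that is worth studying. At first glance the actual sleep seems to differ from what the comment describes, but working through the formula with a timeWindow of 3000 ms gives 0 to 3000 ms for the first wait, 3000 to 9000 ms for the second and 6000 to 15000 ms for the third: a fixed grace period plus a random share of an expanding window, as the comment says.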
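The wait-time formula is easy to check by evaluating it at the two extremes of the random factor. The sketch below reproduces only the arithmetic from chooseDataNode; the class name is hypothetical, and the 3000 ms time window is an assumption taken from the figures in the source comment.

```java
import java.util.Random;

final class RetryBackoff {
    /**
     * Wait time before the next retry: a fixed grace period of
     * timeWindow * failures plus a random share of an expanding
     * window of timeWindow * (failures + 1).
     */
    static double waitTimeMs(int timeWindowMs, int failures, double random01) {
        return timeWindowMs * (double) failures + timeWindowMs * (failures + 1) * random01;
    }

    public static void main(String[] args) {
        Random random = new Random();
        int timeWindowMs = 3000; // assumed value, matching the 0..3000 ms in the source comment
        for (int failures = 0; failures < 3; failures++) {
            double lower = waitTimeMs(timeWindowMs, failures, 0.0);
            double upper = waitTimeMs(timeWindowMs, failures, 1.0);
            double sample = waitTimeMs(timeWindowMs, failures, random.nextDouble());
            System.out.printf("failures=%d range=[%.0f, %.0f) sample=%.0f%n",
                failures, lower, upper, sample);
        }
    }
}
```

The random component spreads out clients that failed at the same moment, so they do not all hit the NameNode again at once, while the growing grace period lowers the request rate on repeated failures.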

3. Creating the BlockReader

```java
// Can't local read a block under construction, see HDFS-2757
if (dfsClient.shouldTryShortCircuitRead(dnAddr) &&
    !blockUnderConstruction()) {
  return DFSClient.getLocalBlockReader(dfsClient.conf, src, block,
      blockToken, chosenNode, dfsClient.hdfsTimeout, startOffset,
      dfsClient.connectToDnViaHostname());
}
```

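Short-circuit read is a Hadoop optimization. When the client and the DataNode run on the same machine, the client reads the block file directly from the local disk instead of fetching the block from the DataNode over a socket.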
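A precondition for short-circuit reads is deciding whether the DataNode address belongs to the local machine. One way to make that decision with the JDK alone is sketched below; this is an illustration under that assumption, and the check that shouldTryShortCircuitRead actually performs may differ. The class name is hypothetical.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;

final class LocalAddressCheck {
    /** Returns true if addr is a wildcard or loopback address, or is bound to a local interface. */
    static boolean isLocal(InetAddress addr) {
        if (addr.isAnyLocalAddress() || addr.isLoopbackAddress()) {
            return true;
        }
        try {
            // Non-null means some network interface on this host owns the address.
            return NetworkInterface.getByInetAddress(addr) != null;
        } catch (SocketException e) {
            return false; // interfaces could not be queried; treat the address as remote
        }
    }
}
```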

```java
// Allow retry since there is no way of knowing whether the cached socket
// is good until we actually use it.
for (int retries = 0; retries <= nCachedConnRetry && fromCache; ++retries) {
  SocketAndStreams sockAndStreams = null;
  // Don't use the cache on the last attempt - it's possible that there
  // are arbitrarily many unusable sockets in the cache, but we don't
  // want to fail the read.
  if (retries < nCachedConnRetry) {
    sockAndStreams = socketCache.get(dnAddr);
  }
  Socket sock;
  if (sockAndStreams == null) {
    fromCache = false;

    sock = dfsClient.socketFactory.createSocket();

    // TCP_NODELAY is crucial here because of bad interactions between
    // Nagle's Algorithm and Delayed ACKs. With connection keepalive
    // between the client and DN, the conversation looks like:
    //   1. Client -> DN: Read block X
    //   2. DN -> Client: data for block X
    //   3. Client -> DN: Status OK (successful read)
    //   4. Client -> DN: Read block Y
    // The fact that step #3 and #4 are both in the client->DN direction
    // triggers Nagling. If the DN is using delayed ACKs, this results
    // in a delay of 40ms or more.
    //
    // TCP_NODELAY disables nagling and thus avoids this performance
    // disaster.
    sock.setTcpNoDelay(true);

    NetUtils.connect(sock, dnAddr,
        dfsClient.getRandomLocalInterfaceAddr(),
        dfsClient.getConf().socketTimeout);
    sock.setSoTimeout(dfsClient.getConf().socketTimeout);
  } else {
    sock = sockAndStreams.sock;
  }

  try {
    // The OP_READ_BLOCK request is sent as we make the BlockReader
    BlockReader reader =
        BlockReaderFactory.newBlockReader(dfsClient.getConf(),
                                   sock, file, block,
                                   blockToken,
                                   startOffset, len,
                                   bufferSize, verifyChecksum,
                                   clientName,
                                   dfsClient.getDataEncryptionKey(),
                                   sockAndStreams == null ? null : sockAndStreams.ioStreams);
    return reader;
  } catch (IOException ex) {
    // Our socket is no good.
    DFSClient.LOG.debug("Error making BlockReader. Closing stale " + sock, ex);
    if (sockAndStreams != null) {
      sockAndStreams.close();
    } else {
      sock.close();
    }
    err = ex;
  }
  // (excerpt ends here, inside the retry loop)
```

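The socket is initialized first, with TCP_NODELAY set. The client sends two small messages to the DataNode back to back (the status for one read, then the next read request), which triggers Nagle's algorithm; combined with delayed ACKs on the DataNode this adds a delay of 40 ms or more. Without TCP_NODELAY the client would be very slow.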
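For reference, disabling Nagle's algorithm with the plain JDK socket API looks like the sketch below. It only shows the option being set on an unconnected socket before use; the class name is hypothetical, and the connection and timeout handling from the excerpt is left out.

```java
import java.io.IOException;
import java.net.Socket;

final class NoDelaySocket {
    /** Creates an unconnected socket with Nagle's algorithm disabled. */
    static Socket create() throws IOException {
        Socket sock = new Socket();
        // Send small writes immediately instead of waiting to coalesce them.
        sock.setTcpNoDelay(true);
        return sock;
    }
}
```

The option is set before connecting, so it is already in force for the very first request written to the DataNode.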

BlockReaderFactory.newBlockReader creates the BlockReader object, which is then used to read the contents of the block sequentially.

4. Reading the block with BlockReader

```java
public synchronized int read(byte[] buf, int off, int len)
                             throws IOException {

  // Fetch the next packet when there is no current slice, or when the
  // current slice is drained and more data is still expected.
  if (curDataSlice == null ||
      (curDataSlice.remaining() == 0 && bytesNeededToFinish > 0)) {
    readNextPacket();
  }
  if (curDataSlice.remaining() == 0) {
    // we're at EOF now
    return -1;
  }

  int nRead = Math.min(curDataSlice.remaining(), len);
  curDataSlice.get(buf, off, nRead);

  return nRead;
}
```

```java
// Excerpt: body of readNextPacket.
//Read packet headers.
packetReceiver.receiveNextPacket(in);

PacketHeader curHeader = packetReceiver.getHeader();
curDataSlice = packetReceiver.getDataSlice();
assert curDataSlice.capacity() == curHeader.getDataLen();

if (LOG.isTraceEnabled()) {
  LOG.trace("DFSClient readNextPacket got header " + curHeader);
}

// Sanity check the lengths
if (!curHeader.sanityCheck(lastSeqNo)) {
  throw new IOException("BlockReader: error in packet header " +
                        curHeader);
}

if (curHeader.getDataLen() > 0) {
  int chunks = 1 + (curHeader.getDataLen() - 1) / bytesPerChecksum;
  int checksumsLen = chunks * checksumSize;

  assert packetReceiver.getChecksumSlice().capacity() == checksumsLen :
    "checksum slice capacity=" + packetReceiver.getChecksumSlice().capacity() +
      " checksumsLen=" + checksumsLen;

  lastSeqNo = curHeader.getSeqno();
  if (verifyChecksum && curDataSlice.remaining() > 0) {
    // N.B.: the checksum error offset reported here is actually
    // relative to the start of the block, not the start of the file.
    // This is slightly misleading, but preserves the behavior from
    // the older BlockReader.
    checksum.verifyChunkedSums(curDataSlice,
        packetReceiver.getChecksumSlice(),
        filename, curHeader.getOffsetInBlock());
  }
  bytesNeededToFinish -= curHeader.getDataLen();
}

// First packet will include some data prior to the first byte
// the user requested. Skip it.
if (curHeader.getOffsetInBlock() < startOffset) {
  int newPos = (int) (startOffset - curHeader.getOffsetInBlock());
  curDataSlice.position(newPos);
}

// If we've now satisfied the whole client read, read one last packet
// header, which should be empty
if (bytesNeededToFinish <= 0) {
  readTrailingEmptyPacket();
  if (verifyChecksum) {
    sendReadResult(Status.CHECKSUM_OK);
  } else {
    sendReadResult(Status.SUCCESS);
  }
}
```

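This is the read method of RemoteBlockReader2, and its logic is not trivial. The DataNode sends the block as a series of packets. For each packet the client checks the packet header and, if verifyChecksum is set, verifies the data against the checksums, then exposes the payload through curDataSlice. Judging from the code above, the client does not acknowledge each packet: once the whole requested range has been received, it reads the trailing empty packet and sends a single status (CHECKSUM_OK or SUCCESS) back to the DataNode.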
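The chunk arithmetic in readNextPacket and the idea of one checksum per chunk can be tried out with the JDK's CRC32. This is a sketch under assumptions: the class and method names are hypothetical, CRC32 stands in for whatever checksum type the stream is configured with, and the sums are kept in a long array instead of a checksum slice.

```java
import java.util.zip.CRC32;

final class ChunkedChecksum {
    /** Number of checksum chunks covering dataLen bytes, using the formula from readNextPacket. */
    static int chunkCount(int dataLen, int bytesPerChecksum) {
        return dataLen == 0 ? 0 : 1 + (dataLen - 1) / bytesPerChecksum;
    }

    /** Computes one CRC32 value per chunk; the last chunk may be shorter than the rest. */
    static long[] computeSums(byte[] data, int bytesPerChecksum) {
        long[] sums = new long[chunkCount(data.length, bytesPerChecksum)];
        for (int i = 0; i < sums.length; i++) {
            int start = i * bytesPerChecksum;
            int len = Math.min(bytesPerChecksum, data.length - start);
            CRC32 crc = new CRC32();
            crc.update(data, start, len);
            sums[i] = crc.getValue();
        }
        return sums;
    }

    /** Returns the index of the first chunk whose checksum does not match, or -1 if all match. */
    static int firstBadChunk(byte[] data, long[] expected, int bytesPerChecksum) {
        long[] actual = computeSums(data, bytesPerChecksum);
        for (int i = 0; i < actual.length; i++) {
            if (i >= expected.length || actual[i] != expected[i]) {
                return i;
            }
        }
        return -1;
    }
}
```

Checksumming per chunk instead of per packet means a corrupted byte can be located to within bytesPerChecksum bytes, which is how an error offset inside the block can be reported.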

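One thing looked odd on first reading: if the caller asks for more bytes than remain in the block, the excess is not read. On closer inspection this is by design. readWithStrategy in DFSInputStream caps each call at the end of the current block (realLen = min(len, blockEnd - pos + 1)) and returns the number of bytes actually read as an int, so a single call never crosses a block boundary and the caller calls read again for the rest.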
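The effect of capping each read at the block boundary is easiest to see in a toy model. In the sketch below, a byte array stands in for the file and a fixed blockSize for the HDFS block size; both the class and its fields are hypothetical, and only the realLen calculation mirrors readWithStrategy.

```java
final class BlockBoundedReader {
    private final byte[] file;   // stands in for the whole file
    private final int blockSize; // hypothetical fixed block size
    private long pos = 0;

    BlockBoundedReader(byte[] file, int blockSize) {
        this.file = file;
        this.blockSize = blockSize;
    }

    /** Reads at most up to the end of the block containing pos; returns -1 at end of file. */
    int read(byte[] buf, int off, int len) {
        if (pos >= file.length) {
            return -1;
        }
        // Offset of the last byte of the current block (or of the file, if shorter).
        long blockEnd = Math.min((pos / blockSize + 1) * blockSize, file.length) - 1;
        int realLen = (int) Math.min(len, blockEnd - pos + 1L);
        System.arraycopy(file, (int) pos, buf, off, realLen);
        pos += realLen;
        return realLen;
    }

    /** Caller-side loop: keeps calling read until len bytes have arrived or the file ends. */
    static int readFully(BlockBoundedReader in, byte[] buf, int off, int len) {
        int total = 0;
        while (total < len) {
            int n = in.read(buf, off + total, len - total);
            if (n < 0) {
                break;
            }
            total += n;
        }
        return total;
    }
}
```

A short read is therefore normal behaviour, not an error. Code that needs an exact number of bytes has to loop as readFully does here.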

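To summarize:

1. The client reads synchronously in a strict read-then-acknowledge order and does not use a separate thread to receive data. This keeps reads correct at a small cost in efficiency.

2. The sockets use NIO, which improves efficiency; this is probably the biggest improvement over earlier versions.

3. Short-circuit reads were reportedly not implemented in version 0.92, but they are present in the CDH4 code. I have not yet worked out how to test whether they can actually be used.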