I have spent the last few days digging into the data path between the HDFS datanode and its clients, and found that the datanode currently uses NIO for its low-level communication. The readBlock method of the datanode's DataXceiver is what sends the data of a block to a client or to another datanode. Let's look at a few pieces of code.
    try {
      try {
        blockSender = new BlockSender(block, startOffset, length,
            true, true, false, datanode, clientTraceFmt);
      } catch (IOException e) {
        out.writeShort(DataTransferProtocol.OP_STATUS_ERROR);
        throw e;
      }
      out.writeShort(DataTransferProtocol.OP_STATUS_SUCCESS); // send op status
      long read = blockSender.sendBlock(out, baseStream, null); // send data
After reading the client's request header, readBlock starts sending the data. Two streams are involved: out is an ordinary socket stream, while baseStream is the one used for NIO. Inside sendBlock, after some checksum computation and offset adjustment, sendChunks is called to send the data the client asked for. If transferToAllowed is set in the configuration, the sendfile mechanism is used to speed up the transfer; otherwise the data is sent directly through the socket stream.
    if (transferToAllowed && !verifyChecksum &&
        baseStream instanceof SocketOutputStream &&
        blockIn instanceof FileInputStream) {
      FileChannel fileChannel = ((FileInputStream)blockIn).getChannel();
      // blockInPosition also indicates sendChunks() uses transferTo.
      blockInPosition = fileChannel.position();
      streamForSendChunks = baseStream;
    ......
    while (endOffset > offset) {
      long len = sendChunks(pktBuf, maxChunksPerPacket,
                            streamForSendChunks);
      offset += len;
      totalRead += len + ((len + bytesPerChecksum - 1)/bytesPerChecksum*
                          checksumSize);
      seqno++;
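As background, here is a minimal standalone sketch of the zero-copy pattern the transferToAllowed branch relies on: FileChannel.transferTo() maps to sendfile(2) on Linux, so the block bytes move from the page cache to the socket without a round trip through user space. The class and method names below are illustrative, not Hadoop's.

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

// Illustrative sketch (not Hadoop code): stream a region of a file to a
// socket using the zero-copy transferTo() call. A blocking SocketChannel
// is assumed, so transferTo() waits rather than returning 0 in a tight loop.
public class ZeroCopySend {
  static void send(FileInputStream blockFile, SocketChannel sock,
                   long position, long count) throws IOException {
    FileChannel fc = blockFile.getChannel();
    while (count > 0) {
      // transferTo may send fewer bytes than requested; loop until done.
      long n = fc.transferTo(position, count, sock);
      position += n;
      count -= n;
    }
  }
}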
In the sendChunks method we find that when the data is sent over NIO with transferTo(), a wait has to be performed first; the comment in the code explains that this works around a JRE bug.
      // first write the packet
      sockOut.write(buf, 0, dataOff);
      // no need to flush. since we know out is not a buffered stream.
      sockOut.transferToFully(fileChannel, blockInPosition, len);
    }
---------------------------------------------
  public void transferToFully(FileChannel fileCh, long position, int count)
                              throws IOException {
    while (count > 0) {
      /*
       * Ideally we should wait after transferTo returns 0. But because of
       * a bug in JRE on Linux (http://bugs.sun.com/view_bug.do?bug_id=5103988),
       * which throws an exception instead of returning 0, we wait for the
       * channel to be writable before writing to it. If you ever see
       * IOException with message "Resource temporarily unavailable"
       * thrown here, please let us know.
       *
       * Once we move to JAVA SE 7, wait should be moved to correct place.
       */
      waitForWritable();
      int nTransfered = (int) fileCh.transferTo(position, count, getChannel());
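      // (rest of the loop elided: position and count are advanced by
      // nTransfered until the requested range has been fully sent)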
It is precisely this waitForWritable() call where, when we dumped stacks with jstack, we found many threads blocked. The reason is that before the file's transferTo can write, it must wait until the channel is writable. Underneath, a register-then-select sequence does the actual waiting, with a timeout check.
  /**
   * Waits on the channel with the given timeout using one of the
   * cached selectors. It also removes any cached selectors that are
   * idle for a few seconds.
   *
   * @param channel
   * @param ops
   * @param timeout
   * @return
   * @throws IOException
   */
  int select(SelectableChannel channel, int ops, long timeout)
      throws IOException {
    SelectorInfo info = get(channel);
    SelectionKey key = null;
    int ret = 0;
    try {
      while (true) {
        long start = (timeout == 0) ? 0 : System.currentTimeMillis();
        key = channel.register(info.selector, ops); // Step1
        ret = info.selector.select(timeout);        // Step2
        if (ret != 0) {
          return ret;
        }
        /* Sometimes select() returns 0 much before timeout for
         * unknown reasons. So select again if required.
         */
        if (timeout > 0) {
          timeout -= System.currentTimeMillis() - start;
          if (timeout <= 0) {
            return 0;
          }
        }
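        // (remainder elided: key cleanup and returning the selector
        // to the cache follow in the rest of the method)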
Strangely, the time spent between Step1 and Step2 is extremely uneven; tracing it with btrace shows a very scattered distribution.
timecost (microseconds) ------- Distribution ------------- count
128 | 0
256 | 51895
512 |@@ 243717
1024 |@@@@@@@@@@@@@ 1410085
2048 |@@@@@@@@@@@@@@@@@@@@ 2135480
4096 |@@ 283689
8192 | 8247
16384 | 6856
32768 | 6741
65536 | 8390
131072 | 2019
262144 | 171
524288 | 21
1048576 | 3
2097152 | 0
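For reference, a distribution like the one above can be gathered with a BTrace script along the following lines. This is only a sketch: the original tracing script is not shown in this post, and the clazz value assumes the select() method above lives in the SelectorPool inner class of org.apache.hadoop.net.SocketIOWithTimeout, so adjust it to your Hadoop version.

import com.sun.btrace.annotations.BTrace;
import com.sun.btrace.annotations.Duration;
import com.sun.btrace.annotations.Kind;
import com.sun.btrace.annotations.Location;
import com.sun.btrace.annotations.OnMethod;
import static com.sun.btrace.BTraceUtils.*;

// Hedged sketch: print how long each select() call took, in microseconds.
@BTrace
public class SelectTiming {
  @OnMethod(clazz = "org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool",
            method = "select",
            location = @Location(Kind.RETURN))
  public static void onSelectReturn(@Duration long nanos) {
    // @Duration is the wall-clock time spent inside select(), in nanoseconds
    println(strcat("select() took (us): ", str(nanos / 1000)));
  }
}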
A large number of long-tail spikes show up. The timeout HDFS currently passes to this select is far too long: it defaults to 8 minutes. As an experiment, it can be shortened with the parameter below (the unit is milliseconds, so the value shown means 1 second):
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>1000</value>
</property>
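Keep in mind that dfs.datanode.socket.write.timeout is the datanode's overall socket write timeout, which is the same value ultimately handed to the select() above; once it expires, the pending write fails with a timeout exception. So 1 second is an experimental setting, and very slow clients may now hit write timeouts that the 8-minute default would have absorbed.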
The histogram below shows that the spikes are eliminated.
value (microseconds) ------------- Distribution ------------- count
128 | 0
256 | 221
512 | 6896
1024 |@@@@@@@@@@@@@ 98509
2048 |@@@@@@@@@@@@@@@@@@@@@ 154922
4096 |@@@ 23579
8192 | 1490
16384 | 438
32768 | 85
65536 | 5
131072 | 0
---------------------------------------------