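Introduction
In the previous article we analyzed the two core pieces of EventLoop: NioEventLoopGroup and NioEventLoop. NioEventLoopGroup maintains an array of NioEventLoop instances in its parent class MultithreadEventExecutorGroup, and each NioEventLoop holds its own thread, so NioEventLoopGroup is effectively a thread pool. We also saw that NioEventLoop is more than a single thread: it is the one that runs the I/O event loop, namely the logic inside its own run() method. Today we dig into that event loop to see how Netty does it.
The NioEventLoop event loop
Recall from the previous article how a NioEventLoop is started: when a Channel is registered with the selector, execute() is called, and execute() triggers startThread().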
@Override
public void execute(Runnable task) {
if (task == null) {
throw new NullPointerException("task");
}
boolean inEventLoop = inEventLoop();
if (inEventLoop) {
addTask(task);
} else {
startThread();
addTask(task);
if (isShutdown() && removeTask(task)) {
reject();
}
}
if (!addTaskWakesUp && wakesUpForTask(task)) {
wakeup(inEventLoop);
}
}
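startThread() first uses a CAS on the state field so that only one caller gets through to doStartThread(). doStartThread() then submits a task to the executor, which creates a thread through the threadFactory, and that task directly calls NioEventLoop's run() method.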
private void startThread() {
if (state == ST_NOT_STARTED) {
if (STATE_UPDATER.compareAndSet(this, ST_NOT_STARTED, ST_STARTED)) {
doStartThread();
}
}
}
private void doStartThread() {
assert thread == null;
executor.execute(new Runnable() {
@Override
public void run() {
thread = Thread.currentThread();
if (interrupted) {
thread.interrupt();
}
boolean success = false;
try {
SingleThreadEventExecutor.this.run();
success = true;
} catch (Throwable t) {
logger.warn("Unexpected exception from an event executor: ", t);
}
// unrelated code omitted (shutdown handling)
}
});
}
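To make the "start at most once" guarantee concrete, here is a minimal sketch of the same CAS pattern outside Netty. The class name LazyStartSketch, the counter, and the raceStart helper are made up for illustration; only the state constants and the compareAndSet idea come from the excerpt above.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

final class LazyStartSketch {
    private static final int ST_NOT_STARTED = 1;
    private static final int ST_STARTED = 2;
    private static final AtomicIntegerFieldUpdater<LazyStartSketch> STATE_UPDATER =
            AtomicIntegerFieldUpdater.newUpdater(LazyStartSketch.class, "state");

    private volatile int state = ST_NOT_STARTED;
    // Hypothetical counter standing in for the work done by doStartThread().
    private final AtomicInteger starts = new AtomicInteger();

    /** Returns true only for the single caller that wins the CAS. */
    boolean startThread() {
        if (state == ST_NOT_STARTED
                && STATE_UPDATER.compareAndSet(this, ST_NOT_STARTED, ST_STARTED)) {
            starts.incrementAndGet();
            return true;
        }
        return false;
    }

    int startCount() {
        return starts.get();
    }

    /** Calls startThread() from several threads and reports how many calls really started. */
    static int raceStart(int callers) {
        LazyStartSketch loop = new LazyStartSketch();
        Thread[] threads = new Thread[callers];
        for (int i = 0; i < callers; i++) {
            threads[i] = new Thread(loop::startThread);
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return loop.startCount();
    }

    public static void main(String[] args) {
        System.out.println("start ran " + raceStart(8) + " time(s)");
    }
}
```

However many threads call execute() concurrently, only the one whose compareAndSet succeeds moves on to create the thread.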
This run() method is the core of the event loop, so let's look at it closely.
protected void run() {
for (;;) {
try {
switch (selectStrategy.calculateStrategy(selectNowSupplier, hasTasks())) {
case SelectStrategy.CONTINUE:
continue;
case SelectStrategy.SELECT:
select(wakenUp.getAndSet(false));
if (wakenUp.get()) {
selector.wakeup();
}
default:
// fallthrough
}
cancelledKeys = 0;
needsToSelectAgain = false;
final int ioRatio = this.ioRatio;
if (ioRatio == 100) {
try {
processSelectedKeys();
} finally {
// Ensure we always run tasks.
runAllTasks();
}
} else {
final long ioStartTime = System.nanoTime();
try {
processSelectedKeys();
} finally {
// Ensure we always run tasks.
final long ioTime = System.nanoTime() - ioStartTime;
runAllTasks(ioTime * (100 - ioRatio) / ioRatio);
}
}
} catch (Throwable t) {
handleLoopException(t);
}
// Always handle shutdown even if the loop processing threw an exception.
try {
if (isShuttingDown()) {
closeAll();
if (confirmShutdown()) {
return;
}
}
} catch (Throwable t) {
handleLoopException(t);
}
}
}
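As you can see, run() goes straight into a for (;;) loop. This endless loop keeps the thread running the event loop, so that it can keep handling the I/O events reported by the selector bound to it. Next comes a switch. Let's look at selectStrategy.calculateStrategy(selectNowSupplier, hasTasks()). selectStrategy is an interface; the implementation is DefaultSelectStrategy, a default instance that is created on the subclass side and passed up through the constructors. Here is the method: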
private final IntSupplier selectNowSupplier = new IntSupplier() {
@Override
public int get() throws Exception {
return selectNow();
}
};
@Override
public int calculateStrategy(IntSupplier selectSupplier, boolean hasTasks) throws Exception {
return hasTasks ? selectSupplier.get() : SelectStrategy.SELECT;
}
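The decision made by calculateStrategy can be tried out in isolation. The sketch below is not Netty code: it swaps Netty's own IntSupplier for java.util.function.IntSupplier, and the class name and describe helper are invented for illustration. The constants SELECT = -1 and CONTINUE = -2 are meant to mirror SelectStrategy; treat the exact values as an assumption.

```java
import java.util.function.IntSupplier;

final class SelectStrategySketch {
    // Assumed to match io.netty.channel.SelectStrategy; negative so they never
    // collide with a real "number of ready keys" result.
    static final int SELECT = -1;
    static final int CONTINUE = -2;

    /** With pending tasks, do a non-blocking select; otherwise ask for a blocking one. */
    static int calculateStrategy(IntSupplier selectNow, boolean hasTasks) {
        return hasTasks ? selectNow.getAsInt() : SELECT;
    }

    /** Describes what the switch in run() would do with the returned value. */
    static String describe(int strategy) {
        switch (strategy) {
            case CONTINUE:
                return "continue";
            case SELECT:
                return "blocking select";
            default:
                return "process " + strategy + " ready key(s), then run tasks";
        }
    }

    public static void main(String[] args) {
        IntSupplier threeKeysReady = () -> 3; // stands in for selectNow()
        System.out.println(describe(calculateStrategy(threeKeysReady, true)));
        System.out.println(describe(calculateStrategy(threeKeysReady, false)));
    }
}
```

Because selectNow() returns a count of zero or more, the result falls into the default branch of the switch, and the loop continues straight to processing keys and running tasks.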
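This method takes an IntSupplier and a boolean hasTasks, and it is simple: if the task queue has tasks, it returns the result of selectSupplier.get(); otherwise it returns SelectStrategy.SELECT. The IntSupplier is also listed in the code above: selectSupplier.get() simply calls selectNow(), which does not block; it selects once and returns immediately. Why call selectNow() when the queue has tasks, instead of a blocking select? Because Netty's EventLoop is not only an I/O thread that handles I/O events; it is also a task thread that runs tasks submitted by users or by Netty itself. If the thread blocked in select while tasks were queued, those tasks would not run until the select returned. So a check is made here: if there are queued tasks, do one quick non-blocking select and then run them. If the queue is empty, SelectStrategy.SELECT is returned and the switch reaches select(wakenUp.getAndSet(false)). This call first sets wakenUp to false, meaning "I am about to block the current thread", and then calls the select method: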
private void select(boolean oldWakenUp) throws IOException {
Selector selector = this.selector;
try {
int selectCnt = 0;
long currentTimeNanos = System.nanoTime();
long selectDeadLineNanos = currentTimeNanos + delayNanos(currentTimeNanos);
for (;;) {
long timeoutMillis = (selectDeadLineNanos - currentTimeNanos + 500000L) / 1000000L;
if (timeoutMillis <= 0) {
if (selectCnt == 0) {
selector.selectNow();
selectCnt = 1;
}
break;
}
if (hasTasks() && wakenUp.compareAndSet(false, true)) {
selector.selectNow();
selectCnt = 1;
break;
}
int selectedKeys = selector.select(timeoutMillis);
selectCnt ++;
if (selectedKeys != 0 || oldWakenUp || wakenUp.get() || hasTasks() || hasScheduledTasks()) {
// - Selected something,
// - waken up by user, or
// - the task queue has a pending task.
// - a scheduled task is ready for processing
break;
}
if (Thread.interrupted()) {
if (logger.isDebugEnabled()) {
logger.debug("Selector.select() returned prematurely because " +
"Thread.currentThread().interrupt() was called. Use " +
"NioEventLoop.shutdownGracefully() to shutdown the NioEventLoop.");
}
selectCnt = 1;
break;
}
long time = System.nanoTime();
if (time - TimeUnit.MILLISECONDS.toNanos(timeoutMillis) >= currentTimeNanos) {
// timeoutMillis elapsed without anything selected.
selectCnt = 1;
} else if (SELECTOR_AUTO_REBUILD_THRESHOLD > 0 &&
selectCnt >= SELECTOR_AUTO_REBUILD_THRESHOLD) {
// The selector returned prematurely many times in a row.
// Rebuild the selector to work around the problem.
logger.warn(
"Selector.select() returned prematurely {} times in a row; rebuilding Selector {}.",
selectCnt, selector);
rebuildSelector();
selector = this.selector;
// Select again to populate selectedKeys.
selector.selectNow();
selectCnt = 1;
break;
}
currentTimeNanos = time;
}
if (selectCnt > MIN_PREMATURE_SELECTOR_RETURNS) {
if (logger.isDebugEnabled()) {
logger.debug("Selector.select() returned prematurely {} times in a row for Selector {}.",
selectCnt - 1, selector);
}
}
} catch (CancelledKeyException e) {
if (logger.isDebugEnabled()) {
logger.debug(CancelledKeyException.class.getSimpleName() + " raised by a Selector {} - JDK bug?",
selector, e);
}
// Harmless exception - log anyway
}
}
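The select method first works out when the blocking should end, which is the selectDeadLineNanos variable. How is it computed? It takes the current time and adds the delay until the next scheduled task is due (or SCHEDULE_PURGE_INTERVAL if there is none), which gives how long select may block before it has to wake up. This is a neat design. Once again: this thread is not only an I/O thread but also a task thread, and everyone has to be looked after.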
long currentTimeNanos = System.nanoTime();
long selectDeadLineNanos = currentTimeNanos + delayNanos(currentTimeNanos);
protected long delayNanos(long currentTimeNanos) {
ScheduledFutureTask<?> scheduledTask = peekScheduledTask();
if (scheduledTask == null) {
return SCHEDULE_PURGE_INTERVAL;
}
return scheduledTask.delayNanos(currentTimeNanos);
}
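Then, until selectDeadLineNanos is reached, an endless loop keeps selecting. Let's read on. At the top of each iteration the time remaining until the deadline is converted to milliseconds; the extra 500000L nanoseconds (0.5 ms) rounds the result to the nearest millisecond, it is not a 500 ms timeout. If the deadline has been reached (timeoutMillis <= 0) and selectCnt is still 0, the code does one selectNow() and exits. Next, right before blocking in select, it checks once more whether a task has entered the queue in the meantime; if so, it does a selectNow() and exits so the task gets its thread time. Only then is selector.select(timeoutMillis) called, and the thread resumes when an event fires, the timeout expires, or the selector is woken up. As soon as there is an event, wakenUp has been set to true, or the task queue has tasks, the loop exits. One more piece of logic in select deserves a mention: Netty's workaround for the JDK's notorious epoll empty-polling bug that drives the CPU to 100%.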
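Going back to the timeout computation at the top of the loop, the effect of adding 500000L before dividing by 1000000L is easy to check with a few numbers. The class and method names below are made up; the expression itself is copied from the select method.

```java
final class SelectTimeoutSketch {
    /** Same expression as in select(): nanoseconds to milliseconds, rounded to nearest. */
    static long timeoutMillis(long selectDeadLineNanos, long currentTimeNanos) {
        return (selectDeadLineNanos - currentTimeNanos + 500000L) / 1000000L;
    }

    public static void main(String[] args) {
        long now = 0L;
        System.out.println(timeoutMillis(now + 400_000L, now));       // 0: under 0.5 ms left, do not block
        System.out.println(timeoutMillis(now + 500_000L, now));       // 1
        System.out.println(timeoutMillis(now + 1_499_999L, now));     // 1
        System.out.println(timeoutMillis(now + 1_500_000L, now));     // 2
        System.out.println(timeoutMillis(now + 1_000_000_000L, now)); // 1000: one second away
    }
}
```

A result of 0 is what triggers the `timeoutMillis <= 0` branch, so a deadline less than half a millisecond away is treated as already reached.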
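The epoll bug
On Linux, selector.select() in Java NIO normally blocks the current thread when no event is ready. In some cases, however, select returns even though nothing has fired, with 0 selected keys, and the surrounding loop then spins on empty polls and drives the CPU to 100%. Netty works around this bug with simple logic: it counts premature returns, with a default threshold of 512. When 512 empty polls in a row are seen, it rebuilds the selector: it opens a new selector, migrates the channels registered on the old selector to the new one, replaces the old selector reference with the new one, and finally closes the old selector. See the code below; the rebuild is triggered through rebuildSelector(), and the logic shown lives in rebuildSelector0().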
else if (SELECTOR_AUTO_REBUILD_THRESHOLD > 0 &&
selectCnt >= SELECTOR_AUTO_REBUILD_THRESHOLD) {
// The selector returned prematurely many times in a row.
// Rebuild the selector to work around the problem.
logger.warn(
"Selector.select() returned prematurely {} times in a row; rebuilding Selector {}.",
selectCnt, selector);
rebuildSelector();
selector = this.selector;
// Select again to populate selectedKeys.
selector.selectNow();
selectCnt = 1;
break;
}
private void rebuildSelector0() {
final Selector oldSelector = selector;
final SelectorTuple newSelectorTuple;
if (oldSelector == null) {
return;
}
try {
newSelectorTuple = openSelector();
} catch (Exception e) {
logger.warn("Failed to create a new Selector.", e);
return;
}
// Register all channels to the new Selector.
int nChannels = 0;
for (SelectionKey key: oldSelector.keys()) {
Object a = key.attachment();
try {
if (!key.isValid() || key.channel().keyFor(newSelectorTuple.unwrappedSelector) != null) {
continue;
}
int interestOps = key.interestOps();
key.cancel();
SelectionKey newKey = key.channel().register(newSelectorTuple.unwrappedSelector, interestOps, a);
if (a instanceof AbstractNioChannel) {
// Update SelectionKey
((AbstractNioChannel) a).selectionKey = newKey;
}
nChannels ++;
} catch (Exception e) {
logger.warn("Failed to re-register a Channel to the new Selector.", e);
if (a instanceof AbstractNioChannel) {
AbstractNioChannel ch = (AbstractNioChannel) a;
ch.unsafe().close(ch.unsafe().voidPromise());
} else {
@SuppressWarnings("unchecked")
NioTask<SelectableChannel> task = (NioTask<SelectableChannel>) a;
invokeChannelUnregistered(task, key, e);
}
}
}
selector = newSelectorTuple.selector;
unwrappedSelector = newSelectorTuple.unwrappedSelector;
try {
// time to close the old selector as everything else is registered to the new one
oldSelector.close();
} catch (Throwable t) {
if (logger.isWarnEnabled()) {
logger.warn("Failed to close the old Selector.", t);
}
}
logger.info("Migrated " + nChannels + " channel(s) to the new Selector.");
}
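The migration step can be reproduced with plain JDK NIO. The following is a simplified sketch of the same idea, not Netty's implementation: it has no error handling per key and no AbstractNioChannel bookkeeping, and the class name, the Pipe used as a stand-in channel, and the demo helper are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

final class SelectorRebuildSketch {
    /** Moves every valid registration to a fresh Selector, then closes the old one. */
    static Selector rebuild(Selector oldSelector) throws IOException {
        Selector newSelector = Selector.open();
        for (SelectionKey key : oldSelector.keys()) {
            if (!key.isValid() || key.channel().keyFor(newSelector) != null) {
                continue; // already gone or already migrated
            }
            int interestOps = key.interestOps();
            Object attachment = key.attachment();
            key.cancel(); // detach from the old selector
            key.channel().register(newSelector, interestOps, attachment);
        }
        oldSelector.close();
        return newSelector;
    }

    /** Registers one channel, rebuilds, and checks that the registration survived. */
    static boolean demo() {
        try {
            Pipe pipe = Pipe.open(); // a Pipe avoids needing a network socket
            Selector old = Selector.open();
            try {
                pipe.source().configureBlocking(false);
                pipe.source().register(old, SelectionKey.OP_READ, "attachment");
                Selector fresh = rebuild(old);
                try {
                    SelectionKey moved = pipe.source().keyFor(fresh);
                    return !old.isOpen()
                            && moved != null
                            && moved.interestOps() == SelectionKey.OP_READ
                            && "attachment".equals(moved.attachment());
                } finally {
                    fresh.close();
                }
            } finally {
                pipe.sink().close();
                pipe.source().close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("registration migrated: " + demo());
    }
}
```

The interest set and the attachment are carried over unchanged, which is why the channels do not notice that the selector underneath them was swapped.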
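With select covered, let's go back to the rest of the run() method. From our Java NIO experience, once select returns we have to process the selected keys: read data, write data, run business logic on the data, and so on. Let's see how Netty handles this: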
final int ioRatio = this.ioRatio;
if (ioRatio == 100) {
try {
processSelectedKeys();
} finally {
// Ensure we always run tasks.
runAllTasks();
}
} else {
final long ioStartTime = System.nanoTime();
try {
processSelectedKeys();
} finally {
// Ensure we always run tasks.
final long ioTime = System.nanoTime() - ioStartTime;
runAllTasks(ioTime * (100 - ioRatio) / ioRatio);
}
}
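I've pulled out the rest of run() again, since the listing above is long and easy to lose track of. NioEventLoop defines a variable ioRatio, 50 by default; it is the knob behind what we said earlier: the EventLoop is both an I/O thread that handles I/O events and a task thread that runs the tasks in its queue. The code first checks whether ioRatio is 100. If it is, processSelectedKeys() runs first, and once the I/O events have been processed runAllTasks() runs the queued tasks with no time limit. If it is not 100, I/O events are still processed first, and then the time the queued tasks may take is computed from the time the I/O took. The formula is easy to derive: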
ioTime / ioRatio = taskTime / taskRatio
taskRatio=100-ioRatio
taskTime=ioTime * (100 - ioRatio) / ioRatio
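A few concrete numbers make the formula easier to read. The helper below is a standalone illustration; the class and method names are invented, and only the arithmetic matches the run() excerpt.

```java
final class IoRatioSketch {
    /** Time budget for tasks, given how long I/O took and the configured ioRatio (1..99). */
    static long taskBudgetNanos(long ioTimeNanos, int ioRatio) {
        if (ioRatio <= 0 || ioRatio >= 100) {
            // ioRatio == 100 takes the other branch in run() and uses no budget at all.
            throw new IllegalArgumentException("ioRatio must be in 1..99: " + ioRatio);
        }
        return ioTimeNanos * (100 - ioRatio) / ioRatio;
    }

    public static void main(String[] args) {
        long ioTime = 2_000_000L; // I/O took 2 ms
        System.out.println(taskBudgetNanos(ioTime, 50)); // 2000000: tasks get the same 2 ms
        System.out.println(taskBudgetNanos(ioTime, 80)); // 500000: tasks get 0.5 ms
        System.out.println(taskBudgetNanos(ioTime, 20)); // 8000000: tasks get 8 ms
    }
}
```

The budget is relative to the I/O time actually measured in this iteration, so an iteration with little I/O also leaves little time for tasks.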
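In other words, with an ioRatio of 50, handling I/O events and running queued tasks each get half of the thread's time. We won't expand on how the task queue is drained; the code isn't hard, and interested readers can go through the source themselves. Let's focus on processSelectedKeys(), specifically its processSelectedKeysOptimized() variant: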
private void processSelectedKeysOptimized() {
for (int i = 0; i < selectedKeys.size; ++i) {
final SelectionKey k = selectedKeys.keys[i];
// null out entry in the array to allow to have it GC'ed once the Channel close
// See https://github.com/netty/netty/issues/2363
selectedKeys.keys[i] = null;
final Object a = k.attachment();
if (a instanceof AbstractNioChannel) {
processSelectedKey(k, (AbstractNioChannel) a);
} else {
@SuppressWarnings("unchecked")
NioTask<SelectableChannel> task = (NioTask<SelectableChannel>) a;
processSelectedKey(k, task);
}
if (needsToSelectAgain) {
// null out entries in the array to allow to have it GC'ed once the Channel close
// See https://github.com/netty/netty/issues/2363
selectedKeys.reset(i + 1);
selectAgain();
i = -1;
}
}
}
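processSelectedKeysOptimized() walks the selectedKeys produced by select. With plain NIO you would take each SelectionKey out and then remove it from the set; in Netty this becomes nulling out the array slot. Calling attachment() on the key returns the AbstractNioChannel that was set at registration time. Remember? When the Channel was registered with the Selector through its unsafe, this was passed in as the attachment.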
@Override
protected void doRegister() throws Exception {
boolean selected = false;
for (;;) {
try {
selectionKey = javaChannel().register(eventLoop().unwrappedSelector(), 0, this);
return;
} catch (CancelledKeyException e) {
// unrelated code omitted
}
}
}
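The attachment round trip works the same way in plain JDK NIO. In this sketch, FakeChannel is a made-up stand-in for AbstractNioChannel and a Pipe replaces the socket; as in doRegister(), the channel is registered with interest ops 0 and the real interest is set afterwards.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

final class AttachmentSketch {
    /** Hypothetical stand-in for the object Netty attaches (the channel itself). */
    static final class FakeChannel {
        final String name;

        FakeChannel(String name) {
            this.name = name;
        }
    }

    /** Registers with an attachment, makes the key ready, and reads the attachment back. */
    static String roundTrip(String name) {
        try (Selector selector = Selector.open()) {
            Pipe pipe = Pipe.open();
            try {
                pipe.source().configureBlocking(false);
                // Register with ops 0 first; the interest is added later.
                SelectionKey key = pipe.source().register(selector, 0, new FakeChannel(name));
                key.interestOps(SelectionKey.OP_READ);

                pipe.sink().write(ByteBuffer.wrap(new byte[] {1})); // make the source readable
                selector.select(1000);

                for (SelectionKey k : selector.selectedKeys()) {
                    Object a = k.attachment();
                    if (a instanceof FakeChannel) {
                        return ((FakeChannel) a).name;
                    }
                }
                return null;
            } finally {
                pipe.sink().close();
                pipe.source().close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("attachment seen by the loop: " + roundTrip("demo-channel"));
    }
}
```

Attaching the owning object to the key saves a lookup: the loop goes from a ready key to the object that knows how to handle it in one call.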
So the object returned by attachment() is naturally an AbstractNioChannel, and we arrive at processSelectedKey(k, (AbstractNioChannel) a):
private void processSelectedKey(SelectionKey k, AbstractNioChannel ch) {
final AbstractNioChannel.NioUnsafe unsafe = ch.unsafe();
if (!k.isValid()) {
// unrelated code omitted (handling of an invalid key)
return;
}
try {
int readyOps = k.readyOps();
// We first need to call finishConnect() before try to trigger a read(...) or write(...) as otherwise
// the NIO JDK channel implementation may throw a NotYetConnectedException.
if ((readyOps & SelectionKey.OP_CONNECT) != 0) {
int ops = k.interestOps();
ops &= ~SelectionKey.OP_CONNECT;
k.interestOps(ops);
unsafe.finishConnect();
}
// Process OP_WRITE first as we may be able to write some queued buffers and so free memory.
if ((readyOps & SelectionKey.OP_WRITE) != 0) {
// Call forceFlush which will also take care of clear the OP_WRITE once there is nothing left to write
ch.unsafe().forceFlush();
}
if ((readyOps & (SelectionKey.OP_READ | SelectionKey.OP_ACCEPT)) != 0 || readyOps == 0) {
unsafe.read();
}
} catch (CancelledKeyException ignored) {
unsafe.close(unsafe.voidPromise());
}
}
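The bit tests in the method above can be exercised on their own. This sketch uses only the real SelectionKey constants; the class name, the dispatch helper, and the action strings are made up to show which branches a given readyOps value would take and in what order.

```java
import java.nio.channels.SelectionKey;
import java.util.ArrayList;
import java.util.List;

final class ReadyOpsSketch {
    /** Removes OP_CONNECT from an interest set, leaving the other bits untouched. */
    static int clearConnect(int interestOps) {
        return interestOps & ~SelectionKey.OP_CONNECT;
    }

    /** Lists the actions processSelectedKey would take for a readyOps value, in order. */
    static List<String> dispatch(int readyOps) {
        List<String> actions = new ArrayList<>();
        if ((readyOps & SelectionKey.OP_CONNECT) != 0) {
            actions.add("finishConnect");
        }
        if ((readyOps & SelectionKey.OP_WRITE) != 0) {
            actions.add("forceFlush");
        }
        if ((readyOps & (SelectionKey.OP_READ | SelectionKey.OP_ACCEPT)) != 0 || readyOps == 0) {
            actions.add("read");
        }
        return actions;
    }

    public static void main(String[] args) {
        System.out.println(dispatch(SelectionKey.OP_WRITE | SelectionKey.OP_READ)); // [forceFlush, read]
        System.out.println(dispatch(SelectionKey.OP_CONNECT));                      // [finishConnect]
        System.out.println(dispatch(0));                                            // [read]
        System.out.println(clearConnect(SelectionKey.OP_CONNECT | SelectionKey.OP_READ)
                == SelectionKey.OP_READ);                                           // true
    }
}
```

Note that several bits can be set at once, so one key may trigger more than one branch in a single pass, and a readyOps of 0 still goes through read().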
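This is where I/O events are actually handled. If readyOps contains OP_CONNECT, the OP_CONNECT bit is first removed from the key's interestOps(); otherwise every select would keep returning this stale event. Then unsafe.finishConnect() completes the connection and propagates the event that the channel is now active. If readyOps contains OP_WRITE, forceFlush() is called directly: Netty's outbound buffer is converted to ByteBuffers and written through the JDK's native methods into the kernel buffer, from which the data is eventually sent. If readyOps contains OP_READ or OP_ACCEPT, unsafe.read() is called. The read() shown below is the variant that reads messages through doReadMessages(), which is the path an accept takes: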
public void read() {
assert eventLoop().inEventLoop();
final ChannelConfig config = config();
final ChannelPipeline pipeline = pipeline();
final RecvByteBufAllocator.Handle allocHandle = unsafe().recvBufAllocHandle();
allocHandle.reset(config);
boolean closed = false;
Throwable exception = null;
try {
try {
do {
int localRead = doReadMessages(readBuf);
if (localRead == 0) {
break;
}
if (localRead < 0) {
closed = true;
break;
}
allocHandle.incMessagesRead(localRead);
} while (allocHandle.continueReading());
} catch (Throwable t) {
exception = t;
}
int size = readBuf.size();
for (int i = 0; i < size; i ++) {
readPending = false;
pipeline.fireChannelRead(readBuf.get(i));
}
readBuf.clear();
allocHandle.readComplete();
pipeline.fireChannelReadComplete();
} finally {
// unrelated code omitted
}
}
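This method mainly does three things: 1. allocates space; 2. performs the accept to obtain the channel; 3. calls pipeline.fireChannelRead() to pass the channelRead event down the pipeline for business processing.
Summary
We have now walked through the logic of EventLoop. In short, EventLoopGroup plays the role of a thread pool in Netty. Each thread in the group is held by a NioEventLoop, which is an I/O thread that handles I/O events: internally it implements the event loop with an endless loop plus select. It is also a task thread that runs the tasks in its task queue, and those tasks can be scheduled tasks as well.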
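To tie the pieces together, here is a deliberately tiny event loop built on plain JDK NIO. It is a sketch of the overall shape only (one thread, one selector, one task queue, wakeup on submit) and leaves out everything else Netty does: ioRatio, scheduled tasks, selector rebuilding, and real key processing. All names are made up for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.Selector;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class MiniEventLoop implements Runnable, AutoCloseable {
    private final Selector selector;
    private final Queue<Runnable> tasks = new ConcurrentLinkedQueue<>();
    private final Thread thread = new Thread(this, "mini-event-loop");
    private volatile boolean shuttingDown;

    MiniEventLoop() {
        try {
            selector = Selector.open();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        thread.start();
    }

    /** Queues a task and wakes the selector so a blocking select() returns promptly. */
    void execute(Runnable task) {
        tasks.add(task);
        selector.wakeup();
    }

    @Override
    public void run() {
        try {
            while (!shuttingDown) {
                if (tasks.isEmpty()) {
                    selector.select(1000); // nothing to run: block until I/O, wakeup, or timeout
                } else {
                    selector.selectNow();  // tasks pending: poll without blocking
                }
                selector.selectedKeys().clear(); // a real loop would process the ready keys here
                Runnable task;
                while ((task = tasks.poll()) != null) {
                    task.run();
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stops the loop thread and closes the selector; tasks still queued are dropped. */
    @Override
    public void close() {
        shuttingDown = true;
        selector.wakeup();
        try {
            thread.join();
            selector.close();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Submits one task and waits for the loop thread to run it. */
    static boolean runsTask() {
        try (MiniEventLoop loop = new MiniEventLoop()) {
            CountDownLatch done = new CountDownLatch(1);
            loop.execute(done::countDown);
            return done.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("task ran on the loop thread: " + runsTask());
    }
}
```

The wakeup() in execute() plays the same role as the wakeup call in Netty's execute(): without it, a task submitted while the thread is blocked in select would wait for the timeout.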