Spring Cloud Gateway: A Diary of Pitfalls

This post records several problems we hit while running Spring Cloud Gateway, along with their fixes: inaccurate request-latency measurements, reactor-netty thread issues, route synchronization, Ribbon lazy loading, off-heap memory leaks, and QPS limits. By reading the source and tuning accordingly, we resolved them and improved the gateway's performance.


1. Background
Our team has fully embraced the Spring Cloud ecosystem, but for historical reasons (an old version of Tencent Cloud TSF, plus an in-house utility package layered on top) every project is pinned to Spring Cloud 2.1.2.RELEASE, so our Spring Cloud Gateway (SCG below) is also 2.1.2.RELEASE. SCG is a dedicated gateway built on Spring WebFlux. WebFlux, like Spring MVC, is built on Spring Web, and it exists as a separate product because "Reactor-izing" Spring MVC itself would have been too costly to implement and maintain. Spring Web has supported reactive streams since 2017, as its Gradle file shows:

dependencyManagement {
	imports {
		mavenBom "io.projectreactor:reactor-bom:${reactorVersion}"
		mavenBom "io.netty:netty-bom:${nettyVersion}"
		mavenBom "org.eclipse.jetty:jetty-bom:${jettyVersion}"
	}
}


Note the reactor-bom entry. Honestly, when we chose a gateway we dropped Zuul in favor of SCG, not expecting to be stuck on SCG 2.1.2.RELEASE, which was released in June 2019. As a rising star of the Spring Cloud family it has been iterating and upgrading rapidly, so although less than two years have passed, SCG's modules have already been reshuffled substantially (in recent versions spring-cloud-gateway-core no longer exists; it became spring-cloud-gateway-server). This post walks through several of the pits we fell into (all since fixed in newer versions).

2. The Mysterious Timeouts
First, a note on where our self-built SCG-based gateway sits in the call chain (it pays to know your own position):

Pitfall 1: request latency measured in an SCG GlobalFilter is inaccurate
Measuring SCG's request-processing latency is not as simple as it looks. Our first approach used a GlobalFilter: we created a LogGlobalFilter and gave it a low Order value (e.g. below 0) so it would run early in the chain, recorded one timestamp when the method was invoked and another when the response came back. Code excerpt:

public class LogGlobalFilter extends AbstractGlobalFilter {
 
    private ModifyResponseBodyGatewayFilterFactory factory = new ModifyResponseBodyGatewayFilterFactory();
    private GatewayFilter modifyResponseBodyGatewayFilter;
 
    @PostConstruct
    public void init() {
        ModifyResponseBodyGatewayFilterFactory.Config config = new ModifyResponseBodyGatewayFilterFactory.Config();
        config.setInClass(String.class);
        config.setOutClass(String.class);
        config.setRewriteFunction(new GatewayResponseRewriteFunction());
 
        modifyResponseBodyGatewayFilter = factory.apply(config);
    }
 
    @Override
    public Mono<Void> filter(ServerWebExchange exchange, GatewayFilterChain chain) {
 
        exchange.getAttributes().put(REQUEST_START_NANO_TIME_ATTRIBUTE, System.nanoTime());
 
        return modifyResponseBodyGatewayFilter.filter(exchange, chain).doOnSuccess(
            (Void v) -> {
                Long reqStartNanoTime = exchange.getAttribute(REQUEST_START_NANO_TIME_ATTRIBUTE);
                StringBuilder logStr = new StringBuilder("call succeeded")
                        .append(", gateway response: ").append((String) exchange.getAttribute(RESPONSE_BODY_ATTRIBUTE))
                        .append(", elapsed (ms): ").append(reqStartNanoTime == null ?
                                "unknown" : (System.nanoTime() - reqStartNanoTime.longValue()) / 1000000);
                log.info(logStr.toString());
            }
        );
    }
 
    private static class GatewayResponseRewriteFunction implements RewriteFunction<String, String> {
        @Override
        public Publisher<String> apply(ServerWebExchange exchange, String body) {
            exchange.getAttributes().put(RESPONSE_BODY_ATTRIBUTE, body);
            return Mono.just(body);
        }
    }
}


On the surface this looks fine: LogGlobalFilter has an Order below 0, so it runs with fairly high priority; we record the local time at the start of filter, record it again in doOnSuccess, and subtract the two to get a rough processing latency.

The client side had a timeout of a few seconds configured, and QA occasionally saw timeouts while testing the app. Tracing by traceId, we found the gateway's recorded entry time lagged the client's request time by one to two seconds. At first we suspected an unstable public network, but that theory did not hold up, and there was another oddity: SkyWalking showed the request arriving at the gateway one to two seconds earlier than LogGlobalFilter did. In other words, SkyWalking's arrival time was the one that matched expectations (within tens of milliseconds of the client's send time).

Strange: how does SkyWalking manage to record the request entry time more accurately (earlier) than LogGlobalFilter? With that question in mind, I took a look at SkyWalking's code:

It turns out SkyWalking records that timestamp by instrumenting DispatcherHandler, the core dispatch handler of Spring WebFlux, via bytecode enhancement (think of it as an AOP-like effect). So we changed our strategy and recorded the request entry time by applying AOP to DispatcherHandler ourselves. Code excerpt:

@Component
public class DispatcherHandlerMethodInterceptor implements MethodInterceptor {
 
    @Override
    public Object invoke(MethodInvocation methodInvocation) throws Throwable {
 
        if ("handle".equals(methodInvocation.getMethod().getName()) &&
                methodInvocation.getArguments().length == 1 &&
                methodInvocation.getArguments()[0] instanceof ServerWebExchange) {
 
            ServerWebExchange exchange = (ServerWebExchange) methodInvocation.getArguments()[0];
            // record the request start time
            exchange.getAttributes().put(REQUEST_START_NANO_TIME_ATTRIBUTE, System.nanoTime());
          
 
            log.info("Gateway receive request, path:{}, header:{}, params:{}",
                    exchange.getRequest().getPath(), exchange.getRequest().getHeaders(),
                    exchange.getRequest().getQueryParams());
 
        }
 
        return methodInvocation.proceed();
    }
}
 
 
@Import({DispatcherHandlerMethodInterceptor.class})
@Configuration
public class ConfigurableAdvisorConfig {
 
    private static final String DISPATCHER_HANDLER_POINTCUT =
            "execution(public * org.springframework.web.reactive.DispatcherHandler.handle(..))";
 
    @Autowired
    private DispatcherHandlerMethodInterceptor dispatcherHandlerMethodInterceptor;
 
 
    @Bean
    public AspectJExpressionPointcutAdvisor buildDispatcherHandlerPointcutAdvisor() {
        AspectJExpressionPointcutAdvisor advisor = new AspectJExpressionPointcutAdvisor();
        advisor.setExpression(DISPATCHER_HANDLER_POINTCUT);
        advisor.setAdvice(dispatcherHandlerMethodInterceptor);
        return advisor;
    }
}

Pitfall 2: reactor-netty's epoll & kqueue transports
Most of us know Netty as an excellent I/O library, but may never have heard of reactor-netty (https://github.com/reactor/reactor-netty). In short: "Reactor Netty offers non-blocking and backpressure-ready TCP/HTTP/UDP clients & servers based on Netty framework." SCG depends heavily on reactor-netty (more precisely, Spring WebFlux does). SCG 2.1.2.RELEASE depends on reactor-netty 0.8.9.RELEASE, which in turn depends on reactor-core (https://github.com/reactor/reactor-core) 3.2.10.RELEASE; both are fairly old. The release notes show several reactor-netty performance bugs fixed in later versions, but we could not jump too far ahead, since Spring Web itself depends on this library. In the end we upgraded reactor-netty from 0.8.9.RELEASE to 0.8.23.RELEASE and reactor-core from 3.2.10.RELEASE to 3.2.22.RELEASE.

Yet the client-side timeouts persisted. Was an I/O thread blocking somewhere? The timeouts lasted only a few seconds, so it was hard to catch one live with jstack; instead we ran jstack just to see what threads existed, given that we had only allocated 6 cores and 8 GB to the gateway. To our surprise, the dump showed 32 threads named "reactor-http-epoll-x":

"reactor-http-epoll-32" #204 daemon prio=5 os_prio=0 tid=0x00007f020c1e6320 nid=0xdd runnable [0x00007f0200bf7000]
   java.lang.Thread.State: RUNNABLE
	at io.netty.channel.epoll.Native.epollWait0(Native Method)
	at io.netty.channel.epoll.Native.epollWait(Native.java:114)
	at io.netty.channel.epoll.EpollEventLoop.epollWait(EpollEventLoop.java:256)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:281)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
 
   Locked ownable synchronizers:
	- None
 
"reactor-http-epoll-31" #203 daemon prio=5 os_prio=0 tid=0x00007f020c068df0 nid=0xdc runnable [0x00007f0200cf8000]
   java.lang.Thread.State: RUNNABLE
	at io.netty.channel.epoll.Native.epollWait0(Native Method)
	at io.netty.channel.epoll.Native.epollWait(Native.java:114)
	at io.netty.channel.epoll.EpollEventLoop.epollWait(EpollEventLoop.java:256)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:281)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
 
   Locked ownable synchronizers:
	- None
 
"reactor-http-epoll-30" #202 daemon prio=5 os_prio=0 tid=0x00007f020c0678e0 nid=0xdb runnable [0x00007f0200df9000]
   java.lang.Thread.State: RUNNABLE
	at io.netty.channel.epoll.Native.epollWait0(Native Method)
	at io.netty.channel.epoll.Native.epollWait(Native.java:114)
	at io.netty.channel.epoll.EpollEventLoop.epollWait(EpollEventLoop.java:256)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:281)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
 
   Locked ownable synchronizers:
	- None


That rang a bell: a known JDK 8 issue (unfixed in our build) where the Java runtime ignores the container's CPU limit and reports the host machine's core count, which here is exactly 32. Could the excess epoll threads be burning extra CPU under load and causing the timeouts? To verify, we needed to cap the epoll thread count at 6 to match the CPUs this Docker instance could actually use. How? Searching the web turned up nothing, so it was back to the reactor-netty source.

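The decisive bit lives in reactor-netty's LoopResources, where the default event-loop size is derived from the JVM's reported processor count unless a system property overrides it. Below is a minimal, self-contained sketch of that logic (the "reactor.netty.ioWorkerCount" property name is what ReactorNetty.IO_WORKER_COUNT resolves to, and the max(cores, 4) fallback mirrors the 0.8.x source; treat the exact details as approximate for your version):

```java
// Sketch of how reactor-netty 0.8.x sizes its default event loops.
public class WorkerCountDemo {

    // Mirrors the default in reactor.netty.resources.LoopResources:
    // read "reactor.netty.ioWorkerCount", falling back to
    // max(availableProcessors, 4). availableProcessors() is the
    // HOST's core count on old JDK 8 builds in containers -- which is
    // exactly why a 6-core container on a 32-core host gets 32 threads.
    static int defaultIoWorkerCount() {
        return Integer.parseInt(System.getProperty(
                "reactor.netty.ioWorkerCount",
                "" + Math.max(Runtime.getRuntime().availableProcessors(), 4)));
    }

    public static void main(String[] args) {
        System.out.println(defaultIoWorkerCount());
    }
}
```

Since the fallback is only consulted when the property is absent, setting the property before any event loops are created is enough to take control of the thread count.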


The answer was now obvious: we could forcibly override the defaults via system properties. For simplicity we set them directly in the gateway's main method (the same values could also be passed on the JVM command line, e.g. -Dreactor.netty.ioWorkerCount=6):

@SpringBootApplication
public class GatewayServerApplication {
 
    public static void main(String[] args) {
        System.setProperty(ReactorNetty.IO_WORKER_COUNT, "6");
        System.setProperty(ReactorNetty.IO_SELECT_COUNT, "6");
 
        SpringApplication.run(GatewayServerApplication.class, args);
    }
}


After the change and a deploy, a quick local JMeter load test showed that under identical load, 6 epoll threads behaved much more steadily than 32, with far fewer timed-out requests. Then another jstack brought a surprise: the 6 threads were no longer epoll threads but 6 "reactor-http-nio-x" threads:

"reactor-http-nio-5" #241 daemon prio=5 os_prio=0 tid=0x00007f348001a2a0 nid=0x105 runnable [0x00007f33f03f1000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
	- locked <0x000000069b9e6eb0> (a io.netty.channel.nio.SelectedSelectionKeySet)
	- locked <0x000000069b9e6f28> (a java.util.Collections$UnmodifiableSet)
	- locked <0x000000069ba9e2f0> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
	at io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
	at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:791)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:439)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
 
   Locked ownable synchronizers:
	- None
 
"reactor-http-nio-4" #240 daemon prio=5 os_prio=0 tid=0x00007f3480024850 nid=0x104 runnable [0x00007f33f04f2000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
	at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
	at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
	- locked <0x000000069b9e7330> (a io.netty.channel.nio.SelectedSelectionKeySet)
	- locked <0x000000069b9e7078> (a java.util.Collections$UnmodifiableSet)
	- locked <0x000000069ba9e380> (a sun.nio.ch.EPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
	at io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
	at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:791)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:439)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
 
   Locked ownable synchronizers:
	- None


Odd, but we had other work to do and let it go. After several more iterations and deployments, QA reported client-to-gateway timeouts again a few days later. With network causes ruled out, we turned back to the gateway and ran jstack once more: the threads had reverted to 6 epoll threads. Very strange. We wanted to force them back to NIO (for some reason I was quite convinced that this version's NIO implementation behaved better than epoll), so it was back to the reactor-netty source again:

final class HttpServerBind extends HttpServer
		implements Function<ServerBootstrap, ServerBootstrap> {
 
	@Override
	public ServerBootstrap apply(ServerBootstrap b) {
		HttpServerConfiguration conf = HttpServerConfiguration.getAndClean(b);
 
 
		if (b.config()
		     .group() == null) {
			LoopResources loops = HttpResources.get();
 
			// note: LoopResources.DEFAULT_NATIVE decides which EventLoopGroup flavor is chosen
			EventLoopGroup selector = loops.onServerSelect(LoopResources.DEFAULT_NATIVE);
			EventLoopGroup elg = loops.onServer(LoopResources.DEFAULT_NATIVE);
			// ...
		}
		// ...
	}
}
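Judging from the LoopResources.DEFAULT_NATIVE flag above, the choice between the native transport (epoll/kqueue) and plain NIO is again driven by a system property. In the 0.8.x line that flag is parsed from "reactor.netty.native" (defaulting to true), so a plausible fix, assuming that property name applies to your version, is to disable the native transport before any event loops are created:

```java
// Sketch: force reactor-netty onto NIO event loops by disabling the
// native transport up front. The "reactor.netty.native" property name
// mirrors ReactorNetty.NATIVE in the 0.8.x line; treat it as an
// assumption if your version differs.
public class ForceNioTransport {

    public static void main(String[] args) {
        // Must run before any LoopResources are created:
        // LoopResources.DEFAULT_NATIVE then parses to false and
        // loops.onServer(...) falls back to NioEventLoopGroup,
        // yielding "reactor-http-nio-x" threads instead of epoll ones.
        System.setProperty("reactor.netty.native", "false");
        System.out.println(System.getProperty("reactor.netty.native"));
    }
}
```

Like the worker-count override, this belongs at the very top of the gateway's main method, before SpringApplication.run, so the property is visible when the event-loop resources are first initialized.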