Nationwide Bawangcan (Free-Meal) Gray-Release Experiment Platform: A Lightweight Implementation with Arthas Hot Updates and FeatureToggle

Original post, published 2025-12-03 14:40:03 · 409 views

License: CC 4.0 BY-SA

Tags: #dubbo

Background: why the Bawangcan call chain needs gray release

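The Bawangcan (free-meal) business handles 30 million calls a day, so a 0.1% error rate leaves 30,000 users without their meal. Traditional stop-the-service releases are risky, and a full QA regression takes at least 2 hours. We added a gray-release experiment layer across the whole call chain:

Consistent hashing on user ID sends 5% of traffic to the experiment group;
Experiment-group code can be hot-updated at any time without users noticing;
If metrics drop, traffic switches back to the baseline within 1s.
The whole solution depends only on Arthas and FeatureToggle: no extra agent, no sidecar, and zero added machine cost.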

Overall architecture: one diagram

                 ┌-----------┐
                 │  Gateway  │  ← tags each request by uid: X-Gray: true|false
                 └-----┬-----┘
                       ▼
                 ┌-----------┐
                 │  Service  │  ← registers with Arthas Tunnel at startup
                 └-----┬-----┘
                       ▼
        ┌-----------------------------┐
        │ Gray SDK (juwatech.cn.gray) │  ← FeatureToggle check + hot update
        └-----------------------------┘

FeatureToggle: a gray switch in one annotation

package cn.juwatech.burger.service;

import juwatech.cn.gray.annotation.GrayToggle;
import org.springframework.stereotype.Service;

@Service
public class OrderService {

    // key = "burger.newDiscount", gray ratio 5%
    @GrayToggle(key = "burger.newDiscount", ratio = 0.05)
    public long calculateDiscount(long uid) {
        // Experiment group: flat discount of 20
        return 20;
    }

    // Baseline logic
    public long calculateDiscountBase(long uid) {
        return 10;
    }
}

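At runtime the @GrayToggle aspect runs MurmurHash over the uid; a uid that lands in the 5% range takes the experiment path, and every other uid takes the baseline.
Ratio changes, whitelists, and blacklists are all pushed into memory dynamically, with no restart required.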
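
The bucketing rule described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the SDK's actual code: the class name `GrayBucket`, the 10,000-slot granularity, and the use of the MurmurHash3 64-bit finalizer as a stand-in for a full MurmurHash are all hypothetical, and the whitelist and blacklist are modeled as plain sets checked before the hash.

```java
import java.util.Set;

/** Hypothetical sketch of a uid-based gray decision; names are not from the SDK. */
final class GrayBucket {
    // Assumed granularity: each slot represents 0.01% of traffic.
    private static final long SLOTS = 10_000L;

    /** MurmurHash3 64-bit finalizer (fmix64), used here as a stand-in mixer. */
    static long mix(long h) {
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    /** Returns true if uid belongs to the experiment group for a ratio in [0, 1]. */
    static boolean inExperiment(long uid, double ratio,
                                Set<Long> whitelist, Set<Long> blacklist) {
        if (blacklist.contains(uid)) {
            return false; // blacklisted users always get the baseline
        }
        if (whitelist.contains(uid)) {
            return true;  // whitelisted users always get the experiment
        }
        // Same uid always maps to the same slot, so group membership is stable.
        long slot = Long.remainderUnsigned(mix(uid), SLOTS);
        return slot < Math.round(ratio * SLOTS);
    }
}
```

Because the decision depends only on the uid and the current ratio, raising the ratio from 5% to 10% keeps every existing experiment user in the experiment group and only adds new ones.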

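Arthas hot update: experiment logic live in 30 seconds

Develop the new strategy NewDiscountStrategy.java locally and compile it to a .class file
Upload it to the Arthas Tunnel console
Run the hot-update command: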

ognl -x 3 '#cl=@ClassLoader@getSystemClassLoader(),
           #bc=@cn.juwatech.gray.util.BytecodeUtil@read("/tmp/NewDiscountStrategy.class"),
           #cl.defineClass("cn.juwatech.burger.strategy.NewDiscountStrategy",#bc,0,#bc.length)'

The singleton Bean in the Spring container is then replaced by the new subclass, with no restart and no user-visible pause.

Full hot-update script: replacing the implementation class automatically

package juwatech.cn.gray.hotswap;

import cn.juwatech.gray.util.BytecodeUtil;
import com.taobao.arthas.core.command.klass100.RedefineCommand;
import com.taobao.arthas.core.shell.command.AnnotatedCommand;
import com.taobao.arthas.core.shell.command.CommandProcess;
import org.springframework.stereotype.Component;

@Component
public class BurgerHotSwapCmd extends AnnotatedCommand {

    @Override
    public void process(CommandProcess process) {
        String className = "cn.juwatech.burger.strategy.DiscountStrategy";
        String path = "/tmp/NewDiscountStrategy.class";
        // 1. Read the bytecode from disk
        byte[] bytes = BytecodeUtil.read(path);
        // 2. Redefine; the JVM only accepts bytecode whose class name matches className
        RedefineCommand redefine = new RedefineCommand();
        redefine.redefineClass(className, bytes);
        process.write("Hot update complete; the experiment group is live\n");
        // 3. End the command so the Arthas session is released
        process.end();
    }
}
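
`BytecodeUtil` is used in the script but not shown in this post. A minimal sketch of what its `read` method might do, assuming it only loads the file and sanity-checks the class-file magic number `0xCAFEBABE` before handing the bytes to the JVM; the class name `BytecodeReader` and the validation step are illustrative assumptions, not the SDK's code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical stand-in for BytecodeUtil; not the SDK's implementation. */
final class BytecodeReader {

    /** Reads a compiled class from disk and rejects files that are not class files. */
    static byte[] read(String path) {
        byte[] bytes;
        try {
            bytes = Files.readAllBytes(Paths.get(path));
        } catch (IOException e) {
            throw new UncheckedIOException("cannot read " + path, e);
        }
        if (!looksLikeClassFile(bytes)) {
            throw new IllegalArgumentException(path + " is not a class file");
        }
        return bytes;
    }

    /** Every valid class file starts with the four bytes CA FE BA BE. */
    static boolean looksLikeClassFile(byte[] b) {
        return b.length >= 4
                && (b[0] & 0xFF) == 0xCA
                && (b[1] & 0xFF) == 0xFE
                && (b[2] & 0xFF) == 0xBA
                && (b[3] & 0xFF) == 0xBE;
    }
}
```

Failing fast on a wrong or truncated upload is cheaper than letting the redefine step fail inside the target JVM.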

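Real-time gray metrics dashboard: 0.5s latency

Order success rate, average order value, and refund rate for the experiment and baseline groups are recorded with Micrometer and sent to Prometheus; a Grafana template computes the diff automatically.
Core code: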

package juwatech.cn.gray.metrics;

import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import juwatech.cn.gray.holder.GrayContextHolder;
import org.springframework.stereotype.Component;

@Component
public class GrayMetrics {

    private final Counter expSuccess;
    private final Counter baseSuccess;

    public GrayMetrics(MeterRegistry registry) {
        expSuccess = Counter.builder("order_success")
                .tag("group", "experiment")
                .register(registry);
        baseSuccess = Counter.builder("order_success")
                .tag("group", "baseline")
                .register(registry);
    }

    public void onSuccess() {
        if (GrayContextHolder.isGray()) {
            expSuccess.increment();
        } else {
            baseSuccess.increment();
        }
    }
}
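
`GrayContextHolder` is referenced in the metrics code but not shown. A minimal sketch, assuming it is a thread-local flag filled from the `X-Gray` header at request entry and cleared at request exit; the class name `GrayContext` and every method other than `isGray()` are hypothetical.

```java
/** Hypothetical sketch of a per-request gray flag; not the SDK's GrayContextHolder. */
final class GrayContext {
    // Defaults to baseline when no header has been seen on this thread.
    private static final ThreadLocal<Boolean> GRAY =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    /** Call at request entry with the raw X-Gray header value (may be null). */
    static void fromHeader(String xGray) {
        GRAY.set("true".equalsIgnoreCase(xGray));
    }

    static boolean isGray() {
        return GRAY.get();
    }

    /** Call in a finally block so pooled threads do not leak the flag into the next request. */
    static void clear() {
        GRAY.remove();
    }
}
```

In a servlet filter or Dubbo filter this would typically wrap the downstream call as `fromHeader(...)`, then the call inside `try`, then `clear()` inside `finally`.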

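Rollback: kill the experiment group with one command

If metrics look abnormal, force X-Gray to false at the gateway so that 100% of traffic returns to the baseline within 1s; at the same time, run this in Arthas:

ognl -x 1 '@cn.juwatech.gray.core.FeatureSwitch@OFF("burger.newDiscount")'

The in-memory switch is turned off immediately, and the experiment code path is no longer entered.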
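
`FeatureSwitch` itself is not shown in this post. A minimal sketch of an in-memory kill switch with the same `OFF` entry point, assuming disabled keys live in a concurrent set that the toggle aspect consults before routing; the class name `FeatureSwitchSketch` and the `ON` and `isOn` methods are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of an in-memory kill switch; not the SDK's FeatureSwitch. */
final class FeatureSwitchSketch {
    // Keys in this set are forced to the baseline regardless of ratio or whitelist.
    private static final Set<String> DISABLED = ConcurrentHashMap.newKeySet();

    /** Disables a toggle; returns true if it was enabled before the call. */
    static boolean OFF(String key) {
        return DISABLED.add(key);
    }

    /** Re-enables a toggle; returns true if it was disabled before the call. */
    static boolean ON(String key) {
        return DISABLED.remove(key);
    }

    /** The aspect would check this first and fall back to baseline when false. */
    static boolean isOn(String key) {
        return !DISABLED.contains(key);
    }
}
```

Returning a boolean from `OFF` makes the `ognl` output show whether the call actually changed state, which helps when several people run the rollback at once.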

Gray config center: lightweight 10KB-scale push over Redis

package juwatech.cn.gray.config;

import java.nio.charset.StandardCharsets;

import org.springframework.data.redis.connection.RedisConnectionFactory;
import org.springframework.data.redis.listener.ChannelTopic;
import org.springframework.data.redis.listener.RedisMessageListenerContainer;
import org.springframework.stereotype.Component;

@Component
public class GrayConfigSubscriber {

    public GrayConfigSubscriber(RedisConnectionFactory factory) {
        RedisMessageListenerContainer container = new RedisMessageListenerContainer();
        container.setConnectionFactory(factory);
        container.addMessageListener((message, pattern) -> {
            // Decode explicitly as UTF-8 rather than the platform default charset
            String json = new String(message.getBody(), StandardCharsets.UTF_8);
            FeatureToggleRepo.reload(json);   // replace all in-memory rules at once
        }, new ChannelTopic("gray/config"));
        // The container is not a Spring-managed bean here, so initialize it by hand
        container.afterPropertiesSet();
        container.start();
    }
}

The config payload is under 10KB, and all 20,000 machines across the fleet finish syncing within 500ms.
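
The "replace all in-memory rules" step can be sketched as an atomic snapshot swap. This is an illustration under stated assumptions: `FeatureToggleRepo` is not shown in this post, so the class name `ToggleRepoSketch`, the `ratioOf` method, and the choice to take an already-parsed map (JSON parsing is omitted) are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of full-replacement rule storage; not the SDK's FeatureToggleRepo. */
final class ToggleRepoSketch {
    // Readers always see one complete snapshot, never a half-updated rule set.
    private static final AtomicReference<Map<String, Double>> RULES =
            new AtomicReference<>(Map.of());

    /** Swaps in a new immutable snapshot; the previous one is dropped as a whole. */
    static void reload(Map<String, Double> parsedRules) {
        RULES.set(Map.copyOf(parsedRules));
    }

    /** Returns the gray ratio for a key, or 0.0 (baseline only) when the key is unknown. */
    static double ratioOf(String key) {
        return RULES.get().getOrDefault(key, 0.0);
    }
}
```

Full replacement keeps each message self-describing: a machine that missed an earlier push is correct again after the next one, which suits Redis pub/sub since it does not redeliver missed messages.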

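Pitfalls

Arthas redefine limits: fields cannot be added or removed and method signatures cannot change, so new strategies must be introduced as subclasses.
Spring AOP proxies: @GrayToggle must be placed on the concrete class, not on an interface; otherwise redefine stops taking effect once the class is behind a CGLIB proxy.
Hash skew: buckets are computed as the uid hash modulo a constant, and production once saw the intended 5% of traffic grow to 7%; changing the modulus from 100 to 997 fixed it.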
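
The post does not say what caused the skew, but one plausible mechanism is a weak hash combined with structured ids. The sketch below illustrates that mechanism only: it buckets on the raw uid as a stand-in for a poorly mixing hash and uses ids that are all multiples of 20; the class `BucketSkewDemo` and both helper methods are hypothetical.

```java
/** Hypothetical demo of modulus-dependent skew; not taken from the platform's code. */
final class BucketSkewDemo {

    /** Fraction of uids that land in the experiment when bucketing by uid % modulus. */
    static double observedRatio(long[] uids, int modulus, double ratio) {
        long threshold = Math.round(ratio * modulus); // slots assigned to the experiment
        long hits = 0;
        for (long uid : uids) {
            if (Long.remainderUnsigned(uid, modulus) < threshold) {
                hits++;
            }
        }
        return (double) hits / uids.length;
    }

    /** Structured ids: 0, step, 2*step, ... as a sharded id generator might emit. */
    static long[] multiplesOf(long step, int count) {
        long[] out = new long[count];
        for (int i = 0; i < count; i++) {
            out[i] = step * i;
        }
        return out;
    }
}
```

With ids that are multiples of 20, a modulus of 100 only ever produces the residues 0, 20, 40, 60, and 80, so a 5% target becomes about 20% of traffic. A prime modulus such as 997 shares no factor with the id step, so the residues spread evenly and the observed ratio comes out at about 5%.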

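Production results

Release frequency: up from once a week to 3 times a day, with zero downtime;
Experiment iteration: a strategy goes live and comes back down in 45 minutes on average;
Rollback speed: full-chain rollback in as little as 0.8s;
Machine cost: no increase; the code package grew by 23KB.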

Full dependencies (Spring Boot 3.2)

<dependency>
    <groupId>cn.juwatech</groupId>
    <artifactId>gray-core</artifactId>
    <version>1.3.0</version>
</dependency>
<dependency>
    <groupId>com.taobao.arthas</groupId>
    <artifactId>arthas-spring-boot-starter</artifactId>
    <version>3.7.2</version>
</dependency>

Copyright of this article belongs to the 吃喝不愁 app developer team; please credit the source when reposting.