Fault-Tolerance Mechanisms in SOFARPC

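In SOFARPC, cluster fault tolerance and fault isolation are the two mechanisms that keep a system highly available and stable. This article walks through how each one is implemented.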

Cluster Fault Tolerance

1. Failover

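Failover is a common cluster fault-tolerance strategy: when a call to one service provider fails, the client automatically retries against another provider. In SOFARPC, the FailoverCluster class implements this strategy.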

@Extension("failover")
public class FailoverCluster extends AbstractCluster {
    @Override
    public SofaResponse doInvoke(SofaRequest request) throws SofaRpcException {
        String methodName = request.getMethodName();
        int retries = consumerConfig.getMethodRetries(methodName);
        int time = 0;
        SofaRpcException throwable = null;
        List<ProviderInfo> invokedProviderInfos = new ArrayList<ProviderInfo>(retries + 1);
        do {
            ProviderInfo providerInfo = null;
            try {
                providerInfo = select(request, invokedProviderInfos);
                SofaResponse response = filterChain(providerInfo, request);
                if (response != null) {
                    if (throwable != null) {
                        if (LOGGER.isWarnEnabled(consumerConfig.getAppName())) {
                            LOGGER.warnWithApp(consumerConfig.getAppName(),
                                LogCodes.getLog(LogCodes.WARN_SUCCESS_BY_RETRY,
                                    throwable.getClass() + ":" + throwable.getMessage(),
                                    invokedProviderInfos));
                        }
                    }
                    return response;
                } else {
                    throwable = new SofaRpcException(RpcErrorType.CLIENT_UNDECLARED_ERROR,
                        "Failed to call " + request.getInterfaceName() + "." + methodName
                            + " on remote server " + providerInfo + ", return null");
                    time++;
                }
            } catch (SofaRpcException e) { 
                if (e.getErrorType() == RpcErrorType.SERVER_BUSY
                    || e.getErrorType() == RpcErrorType.CLIENT_TIMEOUT) {
                    throwable = e;
                    time++;
                } else {
                    if (throwable != null) {
                        throw throwable;
                    } else {
                        throw e;
                    }
                }
            } catch (Exception e) { 
                throw new SofaRpcException(RpcErrorType.CLIENT_UNDECLARED_ERROR,
                    "Failed to call " + request.getInterfaceName() + "." + request.getMethodName()
                        + " on remote server: " + providerInfo + ", cause by unknown exception: "
                        + e.getClass().getName() + ", message is: " + e.getMessage(), e);
            } finally {
                if (RpcInternalContext.isAttachmentEnable()) {
                    RpcInternalContext.getContext().setAttachment(RpcConstants.INTERNAL_KEY_INVOKE_TIMES,
                        time + 1); 
                }
            }
            if (providerInfo != null) {
                invokedProviderInfos.add(providerInfo);
            }
        } while (time <= retries);

        throw throwable;
    }
}

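Analysis

  • doInvoke is the core logic. It makes one initial attempt and then up to retries further attempts, where retries is the retry count configured for the method.
  • On every attempt, select picks a service provider.
  • A retry happens only when the call fails with server busy (RpcErrorType.SERVER_BUSY) or client timeout (RpcErrorType.CLIENT_TIMEOUT), or when the response is null. Any other exception ends the loop at once.
  • Providers that have already been called are recorded in invokedProviderInfos and passed to select, so that a retry can avoid calling the same provider again.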
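
The retry loop described above can be sketched without any SOFARPC types. This is a minimal illustration, not the library's implementation: RetryableException, FailoverSketch, and the string provider names are all hypothetical, the provider list is assumed to be non-empty, and retries is assumed to be zero or more.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical stand-in for a retryable RPC error (timeout or server busy). */
class RetryableException extends RuntimeException {
    RetryableException(String message) {
        super(message);
    }
}

/** Minimal failover loop: one initial attempt plus up to `retries` retries. */
class FailoverSketch {
    /** Providers attempted by the most recent call, in order. */
    final List<String> attempted = new ArrayList<>();

    String invoke(List<String> providers, int retries, Function<String, String> call) {
        attempted.clear();
        RetryableException last = null;
        for (int time = 0; time <= retries; time++) {
            String provider = select(providers, attempted);
            attempted.add(provider);
            try {
                return call.apply(provider); // success ends the loop
            } catch (RetryableException e) {
                last = e; // remember the failure and try another provider
            }
            // Any other exception propagates immediately, without a retry.
        }
        throw last; // every attempt failed with a retryable error
    }

    /** Prefer a provider that has not been tried yet; otherwise reuse the first. */
    private static String select(List<String> providers, List<String> tried) {
        for (String p : providers) {
            if (!tried.contains(p)) {
                return p;
            }
        }
        return providers.get(0);
    }
}
```

With providers a, b, c and retries set to 2, a call that times out on a succeeds on b, and attempted ends up as [a, b].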

2. Fail-fast (Failfast)

The fail-fast strategy throws an exception as soon as a call fails and never retries. The FailFastCluster class implements this strategy.

@Extension("failfast")
public class FailFastCluster extends AbstractCluster {
    @Override
    public SofaResponse doInvoke(SofaRequest request) throws SofaRpcException {
        ProviderInfo providerInfo = select(request);
        try {
            SofaResponse response = filterChain(providerInfo, request);
            if (response != null) {
                return response;
            } else {
                throw new SofaRpcException(RpcErrorType.CLIENT_UNDECLARED_ERROR,
                    "Failed to call " + request.getInterfaceName() + "." + request.getMethodName()
                        + " on remote server " + providerInfo + ", return null");
            }
        } catch (Exception e) {
            throw new SofaRpcException(RpcErrorType.CLIENT_UNDECLARED_ERROR,
                "Failed to call " + request.getInterfaceName() + "." + request.getMethodName()
                    + " on remote server: " + providerInfo + ", cause by: "
                    + e.getClass().getName() + ", message is: " + e.getMessage(), e);
        }
    }
}

Analysis

  • doInvoke first selects a single service provider.
  • It then calls the service; if the call returns null or throws, a SofaRpcException is thrown immediately.
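
As a rough, library-free sketch of the same single-attempt behaviour, the code below makes exactly one call and turns both failure cases into one exception type. CallFailedException and FailFastSketch are hypothetical names introduced only for this illustration.

```java
import java.util.function.Supplier;

/** Hypothetical wrapper exception, standing in for SofaRpcException. */
class CallFailedException extends RuntimeException {
    CallFailedException(String message, Throwable cause) {
        super(message, cause);
    }
}

class FailFastSketch {
    /** Makes exactly one attempt; a null result or any exception fails the call. */
    static <T> T invokeOnce(String provider, Supplier<T> call) {
        T response;
        try {
            response = call.get();
        } catch (Exception e) {
            // Wrap whatever went wrong and keep the original as the cause.
            throw new CallFailedException("Failed to call " + provider
                + ", caused by: " + e.getClass().getName(), e);
        }
        if (response == null) {
            throw new CallFailedException("Failed to call " + provider + ", return null", null);
        }
        return response;
    }
}
```

One difference is deliberate: in the listing above the null check sits inside the try block, so the exception it throws is itself caught by catch (Exception e) and wrapped a second time. The sketch keeps the two cases apart so that each failure is wrapped once.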

Fault Isolation

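Fault isolation is implemented mainly through the Adaptive Fault Tolerance (AFT) mechanism. Its core idea is to adjust each service provider's weight dynamically according to its exception rate, degrading providers whose exception rate is high.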

1. The FaultToleranceConfig class

This class defines the fault-tolerance parameters, such as the time window, the minimum call count, and the exception-rate multiple.

public class FaultToleranceConfig {
    private long    timeWindow                       = 10L;
    private long    leastCallCount                   = 100L;
    private long    leastWindowCount                 = 10L;
    private double  leastWindowExceptionRateMultiple = 6D;
    private boolean regulationEffective              = false;
    private double  weightDegradeRate                = 0.05D;
    private boolean degradeEffective                 = false;
    private int     degradeLeastWeight               = 1;
    private double  weightRecoverRate                = 2;
    private int     degradeMaxIpCount                = 2;
    // getters and setters omitted
}

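Analysis

  • timeWindow: the size of the time window over which call statistics are collected.
  • leastCallCount: the minimum number of calls that must have been made before regulation applies.
  • leastWindowExceptionRateMultiple: how many times the average exception rate a machine's own exception rate must reach before that machine is degraded.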
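
To make the multiple concrete, here is a small worked check. It is only a sketch under stated assumptions: the comparison rule (rate at least multiple times the average) is inferred from the parameter description, and AbnormalCheckSketch and the minCalls parameter are hypothetical names, not SOFARPC API.

```java
class AbnormalCheckSketch {
    /**
     * Assumed rule: a provider counts as abnormal when it has at least
     * `minCalls` calls in the window and its exception rate is at least
     * `multiple` times the average exception rate.
     */
    static boolean isAbnormal(long calls, long exceptions, double averageRate,
                              long minCalls, double multiple) {
        if (calls == 0 || calls < minCalls) {
            return false; // too few samples to judge
        }
        double rate = (double) exceptions / calls;
        return averageRate > 0 && rate >= averageRate * multiple;
    }
}
```

With an average exception rate of 4% and a multiple of 6, the cut-off is 24%: a provider with 30 exceptions in 100 calls is flagged, while one with 20 exceptions in 100 calls is not.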

2. Configuration management

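The FaultToleranceConfigManager class manages the AFT configuration. It holds a default configuration plus an application-specific configuration per application. Parameters such as the time window, the minimum call count, and the exception-rate multiple feed into the fault-isolation decision.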

// Manages the AFT configuration: a default config plus one config per application
public class FaultToleranceConfigManager {
    // Application-specific configs, keyed by application name
    private static final ConcurrentMap<String, FaultToleranceConfig> APP_CONFIGS = new ConcurrentHashMap<>();
    // Default config
    private static final FaultToleranceConfig DEFAULT_CFG = new FaultToleranceConfig();
    // Whether adaptive fault tolerance is enabled
    private static volatile boolean aftEnable = false;

    /**
     * Adds or removes the config of the given application.
     * @param appName application name
     * @param value config object; if null, the application's config is removed
     */
    public static void putAppConfig(String appName, FaultToleranceConfig value) {
        if (appName == null) {
            if (LOGGER.isWarnEnabled()) {
                LOGGER.warn("App name is null when put fault-tolerance config");
            }
            return;
        }
        if (value != null) {
            APP_CONFIGS.put(appName, value);
            if (LOGGER.isInfoEnabled(appName)) {
                LOGGER.infoWithApp(appName, "Get a new resource, value[" + value + "]");
            }
        } else {
            APP_CONFIGS.remove(appName);
            if (LOGGER.isInfoEnabled(appName)) {
                LOGGER.infoWithApp(appName, "Remove a resource, key[" + appName + "]");
            }
        }
        // Recalculate whether adaptive fault tolerance is enabled
        calcEnable();
    }

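    /**
     * Calculates whether adaptive fault tolerance is enabled:
     * it is on as soon as any application config has regulation switched on.
     */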
    static void calcEnable() {
        for (FaultToleranceConfig config : APP_CONFIGS.values()) {
            if (config.isRegulationEffective()) {
                aftEnable = true;
                return;
            }
        }
        aftEnable = false;
    }

    // other config getters omitted
}
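
The listing leaves out the getters. A plausible shape for them is a lookup in the per-application map that falls back to the default config; the sketch below assumes exactly that, and it is an assumption rather than the library's code. AppConfigSketch and its trimmed-down Config class are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class AppConfigSketch {
    /** Hypothetical, trimmed-down config with just two fields. */
    static final class Config {
        final long timeWindow;
        final boolean regulationEffective;

        Config(long timeWindow, boolean regulationEffective) {
            this.timeWindow = timeWindow;
            this.regulationEffective = regulationEffective;
        }
    }

    private static final ConcurrentMap<String, Config> APP_CONFIGS = new ConcurrentHashMap<>();
    private static final Config DEFAULT_CFG = new Config(10L, false);
    private static volatile boolean enabled = false;

    /** Adds the config, or removes it when value is null, then recomputes the switch. */
    static void put(String appName, Config value) {
        if (appName == null) {
            return;
        }
        if (value != null) {
            APP_CONFIGS.put(appName, value);
        } else {
            APP_CONFIGS.remove(appName);
        }
        boolean any = false;
        for (Config c : APP_CONFIGS.values()) {
            any |= c.regulationEffective;
        }
        enabled = any;
    }

    /** Assumed getter: app-specific value if present, otherwise the default. */
    static long getTimeWindow(String appName) {
        Config c = appName == null ? null : APP_CONFIGS.get(appName);
        return (c == null ? DEFAULT_CFG : c).timeWindow;
    }

    static boolean isEnable() {
        return enabled;
    }
}
```

As in calcEnable above, only the per-application configs are consulted when computing the switch, so removing the last config that enables regulation turns the feature off again.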

3. Data collection

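The FaultToleranceSubscriber class subscribes to client receive events. When a synchronous or asynchronous call result returns, or when provider information is updated, it updates the call statistics.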

// Subscribes to client receive events and updates the call statistics
public class FaultToleranceSubscriber extends Subscriber {
    @Override
    public void onEvent(Event originEvent) {
        Class<?> eventClass = originEvent.getClass();

        if (eventClass == ClientSyncReceiveEvent.class) {
            if (!FaultToleranceConfigManager.isEnable()) {
                return;
            }
            // Handle the client synchronous receive event
            ClientSyncReceiveEvent event = (ClientSyncReceiveEvent) originEvent;
            // Consumer config of the call
            ConsumerConfig consumerConfig = event.getConsumerConfig();
            // Provider that served the call
            ProviderInfo providerInfo = event.getProviderInfo();
            // Call statistics for this consumer/provider pair
            InvocationStat result = InvocationStatFactory.getInvocationStat(consumerConfig, providerInfo);
            if (result != null) {
                // Record one call
                result.invoke();
                // Exception raised by the call, if any
                Throwable t = event.getThrowable();
                if (t != null) {
                    // Record the exception
                    result.catchException(t);
                }
            }
        }
        // handling of other events omitted
    }
}
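
The statistics object that the subscriber updates can be pictured as a pair of thread-safe counters. The sketch below is an assumption about its general shape, not the real InvocationStat: InvocationStatSketch is a hypothetical class, and it counts every exception, whereas the library may count only certain error types.

```java
import java.util.concurrent.atomic.AtomicLong;

class InvocationStatSketch {
    private final AtomicLong invokeCount = new AtomicLong();
    private final AtomicLong exceptionCount = new AtomicLong();

    /** Records one call. */
    void invoke() {
        invokeCount.incrementAndGet();
    }

    /** Records one failed call (this sketch counts every exception type). */
    void catchException(Throwable t) {
        exceptionCount.incrementAndGet();
    }

    /** Mirrors the subscriber: count every call, and the exception if there is one. */
    void record(Throwable error) {
        invoke();
        if (error != null) {
            catchException(error);
        }
    }

    /** Exceptions divided by calls, or 0 when nothing has been recorded yet. */
    double exceptionRate() {
        long calls = invokeCount.get();
        return calls == 0 ? 0.0 : (double) exceptionCount.get() / calls;
    }
}
```

Atomic counters keep the hot path lock-free, which matters because the subscriber runs on every call result.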

4. Measurement and calculation

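The TimeWindowRegulator class regulates by time window and owns two executors: a measurement scheduler and a regulation thread pool. The scheduler runs MeasureRunnable at a fixed rate (the 1, 1, TimeUnit.SECONDS arguments); whenever an application's time window has elapsed, it measures that application's measure model and submits the result to the regulation pool, which then carries out the regulation.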

// Regulates by time window; owns a measurement scheduler and a regulation thread pool.
// Fields such as measureCounter, listener and the *StrategyAlias names are not shown in this excerpt.
@Extension("timeWindow")
public class TimeWindowRegulator implements Regulator {
    // Measurement scheduler: periodically measures the measure models
    private final ScheduledService measureScheduler = new ScheduledService("AFT-MEASURE",
            ScheduledService.MODE_FIXEDRATE,
            new MeasureRunnable(), 1, 1,
            TimeUnit.SECONDS);
    // Regulation thread pool: regulates according to the measurement results
    private final ExecutorService regulationExecutor = ThreadPoolUtils.newFixedThreadPool(2,
            new LinkedBlockingQueue<Runnable>(16),
            new NamedThreadFactory("AFT-REGULATION"));
    // All registered measure models
    private final CopyOnWriteArrayList<MeasureModel> measureModels = new CopyOnWriteArrayList<>();
    // Measure strategy
    private MeasureStrategy measureStrategy;
    // Regulation strategy
    private RegulationStrategy regulationStrategy;
    // Degrade strategy
    private DegradeStrategy degradeStrategy;
    // Recover strategy
    private RecoverStrategy recoverStrategy;

    /**
     * Initializes the strategies.
     */
    @Override
    public void init() {
        // Load the measure strategy
        measureStrategy = ExtensionLoaderFactory.getExtensionLoader(MeasureStrategy.class).getExtension(
                measureStrategyAlias);
        // Load the regulation strategy
        regulationStrategy = ExtensionLoaderFactory.getExtensionLoader(RegulationStrategy.class).getExtension(
                regulationStrategyAlias);
        // Load the degrade strategy
        degradeStrategy = ExtensionLoaderFactory.getExtensionLoader(DegradeStrategy.class).getExtension(
                degradeStrategyAlias);
        // Load the recover strategy
        recoverStrategy = ExtensionLoaderFactory.getExtensionLoader(RecoverStrategy.class).getExtension(
                recoverStrategyAlias);

        // Register the listener
        InvocationStatFactory.addListener(listener);
    }

    // Measurement task
    private class MeasureRunnable implements Runnable {
        @Override
        public void run() {
            // Increment the measure counter
            measureCounter.incrementAndGet();
            for (MeasureModel measureModel : measureModels) {
                try {
                    // Has this model's time window elapsed?
                    if (isArriveTimeWindow(measureModel)) {
                        // Run the measurement
                        MeasureResult measureResult = measureStrategy.measure(measureModel);
                        // Submit a regulation task
                        regulationExecutor.submit(new RegulationRunnable(measureResult));
                    }
                } catch (Exception e) {
                    LOGGER.errorWithApp(measureModel.getAppName(),
                            LogCodes.getLog(LogCodes.ERROR_WHEN_DO_MEASURE, e.getMessage()), e);
                }
            }
        }

        /**
         * Checks whether the time window has elapsed.
         * @param measureModel measure model
         * @return true if the time window has elapsed
         */
        private boolean isArriveTimeWindow(MeasureModel measureModel) {
            long timeWindow = FaultToleranceConfigManager.getTimeWindow(measureModel.getAppName());
            return measureCounter.get() % timeWindow == 0;
        }
    }

    // Regulation task
    private class RegulationRunnable implements Runnable {
        private final MeasureResult measureResult;

        RegulationRunnable(MeasureResult measureResult) {
            this.measureResult = measureResult;
        }

        @Override
        public void run() {
            // All measurement result details
            List<MeasureResultDetail> measureResultDetails = measureResult.getAllMeasureResultDetails();
            for (MeasureResultDetail measureResultDetail : measureResultDetails) {
                try {
                    // Regulate this provider
                    doRegulate(measureResultDetail);
                } catch (Exception e) {
                    LOGGER.errorWithApp(measureResult.getMeasureModel().getAppName(),
                            LogCodes.getLog(LogCodes.ERROR_WHEN_DO_REGULATE, e.getMessage()), e);
                }
            }
        }

        /**
         * Regulates one provider.
         * @param measureResultDetail measurement result detail
         */
        void doRegulate(MeasureResultDetail measureResultDetail) {
            // Measured state of the provider
            MeasureState measureState = measureResultDetail.getMeasureState();
            // Statistics dimension of the call
            InvocationStatDimension statDimension = measureResultDetail.getInvocationStatDimension();

            // Is degrading switched on for this application?
            boolean isDegradeEffective = regulationStrategy.isDegradeEffective(measureResultDetail);
            if (isDegradeEffective) {
                measureResultDetail.setLogOnly(false);
                if (measureState.equals(MeasureState.ABNORMAL)) {
                    // Has the maximum number of degraded IPs been reached?
                    boolean isReachMaxDegradeIpCount = regulationStrategy.isReachMaxDegradeIpCount(measureResultDetail);
                    if (!isReachMaxDegradeIpCount) {
                        // Degrade the provider
                        degradeStrategy.degrade(measureResultDetail);
                    }
                }
            }
        }
    }
}
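
The window check in isArriveTimeWindow is a modulo test against one shared tick counter, so applications with different window sizes can share a single scheduler. A minimal sketch of that idea follows; WindowTickSketch, register, and tick are hypothetical names, and every window is assumed to be greater than zero.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

class WindowTickSketch {
    private final AtomicLong measureCounter = new AtomicLong();
    /** Application name mapped to its window length in ticks. */
    private final Map<String, Long> timeWindows = new LinkedHashMap<>();

    void register(String appName, long timeWindow) {
        timeWindows.put(appName, timeWindow);
    }

    /** Advances the shared counter and returns the apps whose window just closed. */
    List<String> tick() {
        long count = measureCounter.incrementAndGet();
        List<String> due = new ArrayList<>();
        for (Map.Entry<String, Long> e : timeWindows.entrySet()) {
            if (count % e.getValue() == 0) {
                due.add(e.getKey());
            }
        }
        return due;
    }
}
```

With windows of 2 and 3 ticks, the first application comes due on ticks 2, 4, and 6, the second on ticks 3 and 6. If tick is driven once per second, the window length is effectively in seconds.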
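
5. Regulation strategies
  • Degrade strategy: when a provider's exception rate exceeds the configured threshold, the degrade strategy is triggered and lowers that provider's weight, so that fewer calls are routed to it.
  • Recover strategy: when a provider returns to a normal state, the recover strategy is triggered and raises its weight again step by step.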
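
The two bullets can be illustrated with a little arithmetic on the weight. The formulas below are assumptions chosen to match the names of the config fields weightDegradeRate, degradeLeastWeight, and weightRecoverRate; they are not taken from the SOFARPC source, and WeightSketch is a hypothetical class.

```java
class WeightSketch {
    /** Assumed degrade rule: scale the weight down, but never below the floor. */
    static int degrade(int currentWeight, double weightDegradeRate, int degradeLeastWeight) {
        int degraded = (int) (currentWeight * weightDegradeRate);
        return Math.max(degraded, degradeLeastWeight);
    }

    /** Assumed recover rule: scale the weight up, but never above the original weight. */
    static int recover(int currentWeight, double weightRecoverRate, int originWeight) {
        int recovered = (int) (currentWeight * weightRecoverRate);
        return Math.min(recovered, originWeight);
    }
}
```

Under these assumptions and the default values in FaultToleranceConfig, a weight of 100 drops to 5 after one degrade and to the floor of 1 after a second one, while recovery doubles the weight each round (1, 2, 4, and so on) until it is capped at the original 100.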

// Service-horizontal regulation strategy
@Extension("serviceHorizontal")
public class ServiceHorizontalRegulationStrategy implements RegulationStrategy {
    @Override
    public boolean isDegradeEffective(MeasureResultDetail measureResultDetail) {
        // Statistics dimension of the call
        InvocationStatDimension statDimension = measureResultDetail.getInvocationStatDimension();
        // Is degrading switched on for this application?
        return FaultToleranceConfigManager.isDegradeEffective(statDimension.getAppName());
    }

    @Override
    public boolean isReachMaxDegradeIpCount(MeasureResultDetail measureResultDetail) {
        // Statistics dimension of the call
        InvocationStatDimension statDimension = measureResultDetail.getInvocationStatDimension();
        // Set of provider IPs that are already degraded
        ConcurrentHashSet<String> ips = getDegradeProviders(statDimension.getDimensionKey());

        // IP of the current provider
        String ip = statDimension.getIp();
        if (ips.contains(ip)) {
            // Already degraded, so it does not count against the limit again
            return false;
        } else {
            // Maximum number of degraded IPs
            int degradeMaxIpCount = FaultToleranceConfigManager.getDegradeMaxIpCount(statDimension.getAppName());
            ipsLock.lock();
            try {
                if (ips.size() < degradeMaxIpCount) {
                    // Add the IP to the degraded set
                    ips.add(ip);
                    return false;
                } else {
                    return true;
                }
            } finally {
                ipsLock.unlock();
            }
        }
    }

    // other methods omitted
}
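
The cap on degraded IPs is a bounded set guarded by a lock: the size check and the add must happen together, otherwise two threads could both see a free slot and push the set past the limit. A self-contained sketch of that pattern follows. DegradeCapSketch, tryAdmit, and release are hypothetical names, and the return value is inverted relative to isReachMaxDegradeIpCount (true here means the provider may be degraded).

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

class DegradeCapSketch {
    private final Set<String> degradedIps = new HashSet<>();
    private final ReentrantLock lock = new ReentrantLock();
    private final int maxIpCount;

    DegradeCapSketch(int maxIpCount) {
        this.maxIpCount = maxIpCount;
    }

    /** Returns true if `ip` may be degraded, reserving a slot for it when needed. */
    boolean tryAdmit(String ip) {
        lock.lock();
        try {
            if (degradedIps.contains(ip)) {
                return true; // already holds a slot
            }
            if (degradedIps.size() < maxIpCount) {
                degradedIps.add(ip); // check and add happen under one lock
                return true;
            }
            return false; // cap reached
        } finally {
            lock.unlock();
        }
    }

    /** Frees the slot held by `ip`, for example once the provider has recovered. */
    void release(String ip) {
        lock.lock();
        try {
            degradedIps.remove(ip);
        } finally {
            lock.unlock();
        }
    }
}
```

With a cap of 2, the default value of degradeMaxIpCount, a third distinct IP is refused until one of the first two is released, which keeps a burst of errors from degrading every provider of a service at once.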