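In SOFARPC, cluster fault tolerance and fault isolation are key mechanisms for keeping a system highly available and stable. The sections below walk through how each is implemented.
Cluster Fault Tolerance
1. Failover
Failover is a common cluster fault-tolerance strategy: when a call to one service provider fails, the client automatically retries against another provider. In SOFARPC, the FailoverCluster class implements this strategy.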
@Extension("failover")
public class FailoverCluster extends AbstractCluster {
    @Override
    public SofaResponse doInvoke(SofaRequest request) throws SofaRpcException {
        String methodName = request.getMethodName();
        int retries = consumerConfig.getMethodRetries(methodName);
        int time = 0;
        // Last retryable failure; rethrown if every attempt fails
        SofaRpcException throwable = null;
        List<ProviderInfo> invokedProviderInfos = new ArrayList<ProviderInfo>(retries + 1);
        do {
            ProviderInfo providerInfo = null;
            try {
                // Pick a provider, passing in the ones already tried
                providerInfo = select(request, invokedProviderInfos);
                SofaResponse response = filterChain(providerInfo, request);
                if (response != null) {
                    if (throwable != null) {
                        // Succeeded after at least one failed attempt: log a warning
                        if (LOGGER.isWarnEnabled(consumerConfig.getAppName())) {
                            LOGGER.warnWithApp(consumerConfig.getAppName(),
                                LogCodes.getLog(LogCodes.WARN_SUCCESS_BY_RETRY,
                                    throwable.getClass() + ":" + throwable.getMessage(),
                                    invokedProviderInfos));
                        }
                    }
                    return response;
                } else {
                    // A null response counts as a failed attempt
                    throwable = new SofaRpcException(RpcErrorType.CLIENT_UNDECLARED_ERROR,
                        "Failed to call " + request.getInterfaceName() + "." + methodName
                            + " on remote server " + providerInfo + ", return null");
                    time++;
                }
            } catch (SofaRpcException e) {
                if (e.getErrorType() == RpcErrorType.SERVER_BUSY
                    || e.getErrorType() == RpcErrorType.CLIENT_TIMEOUT) {
                    // Retryable: remember the failure and try again
                    throwable = e;
                    time++;
                } else {
                    // Not retryable: throw immediately
                    if (throwable != null) {
                        throw throwable;
                    } else {
                        throw e;
                    }
                }
            } catch (Exception e) {
                throw new SofaRpcException(RpcErrorType.CLIENT_UNDECLARED_ERROR,
                    "Failed to call " + request.getInterfaceName() + "." + request.getMethodName()
                        + " on remote server: " + providerInfo + ", cause by unknown exception: "
                        + e.getClass().getName() + ", message is: " + e.getMessage(), e);
            } finally {
                if (RpcInternalContext.isAttachmentEnable()) {
                    RpcInternalContext.getContext().setAttachment(RpcConstants.INTERNAL_KEY_INVOKE_TIMES,
                        time + 1);
                }
            }
            if (providerInfo != null) {
                invokedProviderInfos.add(providerInfo);
            }
        } while (time <= retries);
        throw throwable;
    }
}
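Analysis:
- doInvoke is the core logic; it retries up to the number of retries (retries) configured for the method.
- On each attempt, select picks a service provider.
- A retry happens only for a server-busy error (RpcErrorType.SERVER_BUSY), a client timeout (RpcErrorType.CLIENT_TIMEOUT), or a null response; any other exception is thrown immediately.
- Providers that have already been invoked are recorded in invokedProviderInfos and passed to select, so that a retry can avoid calling the same provider again.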
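The retry loop can be reduced to a small, library-free sketch. The names below (FailoverSketch, RetryableException, the Function-based providers) are hypothetical stand-ins rather than SOFARPC types. The sketch keeps only the control flow: pick a provider that has not been tried yet, retry on a retryable failure, and rethrow the last failure once the retry budget is spent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

final class FailoverSketch {
    /** Hypothetical marker for failures worth retrying (stands in for SERVER_BUSY / CLIENT_TIMEOUT). */
    static final class RetryableException extends RuntimeException {
        RetryableException(String message) {
            super(message);
        }
    }

    /** Provider indexes tried by the most recent call, kept only so the example can be inspected. */
    static final List<Integer> lastTried = new ArrayList<>();

    /** Makes one initial attempt plus at most `retries` retries; other exceptions propagate at once. */
    static String invoke(List<Function<String, String>> providers, String request, int retries) {
        lastTried.clear();
        RetryableException last = null;
        int attempts = 0;
        do {
            int index = select(providers.size(), lastTried);
            lastTried.add(index);
            try {
                return providers.get(index).apply(request);
            } catch (RetryableException e) {
                last = e;      // remember the failure and move on to another provider
                attempts++;
            }
        } while (attempts <= retries);
        throw last;
    }

    /** Picks the first provider not yet tried; falls back to index 0 when all were tried. */
    static int select(int size, List<Integer> tried) {
        for (int i = 0; i < size; i++) {
            if (!tried.contains(i)) {
                return i;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        List<Function<String, String>> providers = new ArrayList<>();
        providers.add(req -> { throw new RetryableException("busy"); });
        providers.add(req -> "ok:" + req);
        System.out.println(invoke(providers, "ping", 2));
    }
}
```

In this sketch the first provider always reports a retryable failure, so the call is served by the second one.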
2. Failfast
The failfast strategy throws an exception as soon as a call fails, without retrying. The FailFastCluster class implements it.
@Extension("failfast")
public class FailFastCluster extends AbstractCluster {
    @Override
    public SofaResponse doInvoke(SofaRequest request) throws SofaRpcException {
        ProviderInfo providerInfo = select(request);
        try {
            SofaResponse response = filterChain(providerInfo, request);
            if (response != null) {
                return response;
            } else {
                throw new SofaRpcException(RpcErrorType.CLIENT_UNDECLARED_ERROR,
                    "Failed to call " + request.getInterfaceName() + "." + request.getMethodName()
                        + " on remote server " + providerInfo + ", return null");
            }
        } catch (Exception e) {
            throw new SofaRpcException(RpcErrorType.CLIENT_UNDECLARED_ERROR,
                "Failed to call " + request.getInterfaceName() + "." + request.getMethodName()
                    + " on remote server: " + providerInfo + ", cause by: "
                    + e.getClass().getName() + ", message is: " + e.getMessage(), e);
        }
    }
}
Analysis:
- doInvoke first selects a service provider.
- It then invokes the service; if the call returns null or throws, a SofaRpcException is thrown immediately.
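Fault Isolation
Fault isolation is implemented mainly through the Adaptive Fault Tolerance (AFT) mechanism. Its core idea is to adjust each service provider's weight dynamically according to its exception rate, degrading providers whose exception rate is high.
1. The FaultToleranceConfig class
This class defines the fault-tolerance parameters, such as the time window, the minimum call count, and the exception-rate multiple.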
public class FaultToleranceConfig {
    private long timeWindow = 10L;
    private long leastCallCount = 100L;
    private long leastWindowCount = 10L;
    private double leastWindowExceptionRateMultiple = 6D;
    private boolean regulationEffective = false;
    private double weightDegradeRate = 0.05D;
    private boolean degradeEffective = false;
    private int degradeLeastWeight = 1;
    private double weightRecoverRate = 2;
    private int degradeMaxIpCount = 2;
    // Getters and setters omitted
}
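Analysis:
- timeWindow: the size of the time window over which call statistics are aggregated.
- leastCallCount: the minimum number of calls that must have been made before regulation applies.
- leastWindowExceptionRateMultiple: how many times the average exception rate a machine's own rate must reach before it is degraded.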
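To make the role of these parameters concrete, here is a minimal sketch of how an exception-rate multiple could mark one provider as abnormal. The class and method names are hypothetical, and the rule itself (compare one provider's rate with the average rate across all providers in the window, and skip providers with fewer than leastWindowCount calls) is an assumption drawn from the parameter names and descriptions, not a copy of SOFARPC's measure strategy.

```java
final class ExceptionRateCheck {
    /** Exception rate of one provider in the current window; 0 when it was never called. */
    static double rate(long calls, long exceptions) {
        return calls == 0 ? 0.0 : (double) exceptions / calls;
    }

    /**
     * Returns true when provider `target` looks abnormal: it has at least
     * leastWindowCount calls in the window and its exception rate is at least
     * `multiple` times the average rate over all providers.
     */
    static boolean isAbnormal(long[] calls, long[] exceptions, int target,
                              long leastWindowCount, double multiple) {
        if (calls[target] < leastWindowCount) {
            return false; // too few samples to judge this provider
        }
        long totalCalls = 0;
        long totalExceptions = 0;
        for (int i = 0; i < calls.length; i++) {
            totalCalls += calls[i];
            totalExceptions += exceptions[i];
        }
        double average = rate(totalCalls, totalExceptions);
        return average > 0 && rate(calls[target], exceptions[target]) >= multiple * average;
    }

    public static void main(String[] args) {
        long[] calls = {100, 100, 100, 100, 100, 100, 100, 100, 100, 100};
        long[] exceptions = {1, 1, 1, 1, 1, 1, 1, 1, 1, 90};
        // Average rate is 99/1000 = 0.099, so the threshold with multiple 6 is 0.594.
        System.out.println(isAbnormal(calls, exceptions, 9, 10L, 6D)); // rate 0.90
        System.out.println(isAbnormal(calls, exceptions, 0, 10L, 6D)); // rate 0.01
    }
}
```

Comparing against the cluster average rather than a fixed threshold means a provider is singled out only when it behaves much worse than its peers, which avoids degrading everyone during a cluster-wide incident.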
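2. Configuration management
The FaultToleranceConfigManager class manages the adaptive fault tolerance configuration, holding both a default configuration and per-application configurations. Parameters such as the time window, the minimum call count, and the exception-rate multiple feed into the fault-isolation decision.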
// Manages the adaptive fault tolerance configuration: a default plus per-application configs
public class FaultToleranceConfigManager {
    // Per-application configs, keyed by application name
    private static final ConcurrentMap<String, FaultToleranceConfig> APP_CONFIGS = new ConcurrentHashMap<>();
    // Default config
    private static final FaultToleranceConfig DEFAULT_CFG = new FaultToleranceConfig();
    // Whether adaptive fault tolerance is enabled
    private static volatile boolean aftEnable = false;

    /**
     * Adds or removes the config of the given application.
     * @param appName application name
     * @param value config object; null removes the application's config
     */
    public static void putAppConfig(String appName, FaultToleranceConfig value) {
        if (appName == null) {
            if (LOGGER.isWarnEnabled()) {
                LOGGER.warn("App name is null when put fault-tolerance config");
            }
            return;
        }
        if (value != null) {
            APP_CONFIGS.put(appName, value);
            if (LOGGER.isInfoEnabled(appName)) {
                LOGGER.infoWithApp(appName, "Get a new resource, value[" + value + "]");
            }
        } else {
            APP_CONFIGS.remove(appName);
            if (LOGGER.isInfoEnabled(appName)) {
                LOGGER.infoWithApp(appName, "Remove a resource, key[" + appName + "]");
            }
        }
        // Recompute whether adaptive fault tolerance is enabled
        calcEnable();
    }

    /**
     * Computes whether adaptive fault tolerance is enabled:
     * true as soon as any application config has regulation turned on.
     */
    static void calcEnable() {
        for (FaultToleranceConfig config : APP_CONFIGS.values()) {
            if (config.isRegulationEffective()) {
                aftEnable = true;
                return;
            }
        }
        aftEnable = false;
    }
    // Other config getters
}
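The behavior of putAppConfig and calcEnable can be tried out with a trimmed-down, self-contained stand-in. AftConfigRegistry and its nested Config are hypothetical names, and the fallback from a missing per-application config to the default is an assumption about the getters that the listing omits.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class AftConfigRegistry {
    /** Hypothetical, trimmed-down stand-in for FaultToleranceConfig. */
    static final class Config {
        final boolean regulationEffective;
        final long timeWindow;

        Config(boolean regulationEffective, long timeWindow) {
            this.regulationEffective = regulationEffective;
            this.timeWindow = timeWindow;
        }
    }

    private static final ConcurrentMap<String, Config> APP_CONFIGS = new ConcurrentHashMap<>();
    private static final Config DEFAULT = new Config(false, 10L);
    private static volatile boolean enabled = false;

    /** Adds the config for an application, or removes it when value is null. */
    static void put(String appName, Config value) {
        if (appName == null) {
            return;
        }
        if (value != null) {
            APP_CONFIGS.put(appName, value);
        } else {
            APP_CONFIGS.remove(appName);
        }
        recalc();
    }

    /** Enabled as soon as any application has regulation turned on. */
    private static void recalc() {
        boolean any = false;
        for (Config config : APP_CONFIGS.values()) {
            if (config.regulationEffective) {
                any = true;
                break;
            }
        }
        enabled = any;
    }

    static boolean isEnabled() {
        return enabled;
    }

    /** Assumed lookup rule: per-application value first, default otherwise. */
    static long timeWindow(String appName) {
        Config config = appName == null ? null : APP_CONFIGS.get(appName);
        return (config != null ? config : DEFAULT).timeWindow;
    }

    public static void main(String[] args) {
        put("app-a", new Config(true, 20L));
        System.out.println(isEnabled() + " " + timeWindow("app-a") + " " + timeWindow("app-b"));
        put("app-a", null);
        System.out.println(isEnabled());
    }
}
```

Recomputing the flag on every put keeps the hot path cheap: callers such as the event subscriber only read one volatile boolean instead of scanning the map.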
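2. Data collection
The FaultToleranceSubscriber class subscribes to client receive events. When a synchronous or asynchronous call result comes back, or provider information is updated, it updates the call statistics.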
// Subscribes to client receive events and updates the call statistics
public class FaultToleranceSubscriber extends Subscriber {
    @Override
    public void onEvent(Event originEvent) {
        Class eventClass = originEvent.getClass();
        if (eventClass == ClientSyncReceiveEvent.class) {
            if (!FaultToleranceConfigManager.isEnable()) {
                return;
            }
            // Handle a client synchronous receive event
            ClientSyncReceiveEvent event = (ClientSyncReceiveEvent) originEvent;
            // Consumer config
            ConsumerConfig consumerConfig = event.getConsumerConfig();
            // Provider info
            ProviderInfo providerInfo = event.getProviderInfo();
            // Call statistics for this consumer/provider pair
            InvocationStat result = InvocationStatFactory.getInvocationStat(consumerConfig, providerInfo);
            if (result != null) {
                // Record one call
                result.invoke();
                // Exception carried by the event, if any
                Throwable t = event.getThrowable();
                if (t != null) {
                    // Record the exception
                    result.catchException(t);
                }
            }
        }
        // Handle other events
    }
}
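The two calls made on the statistics object, invoke() and catchException(t), amount to a pair of counters. The sketch below is a hypothetical stand-in for InvocationStat (CallStat and onResult are names chosen for illustration) and assumes every throwable is counted; the real class may filter by exception type.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CallStat {
    private final AtomicLong invokeCount = new AtomicLong();
    private final AtomicLong exceptionCount = new AtomicLong();

    /** Records one call. */
    void invoke() {
        invokeCount.incrementAndGet();
    }

    /** Records one failed call; this sketch counts every non-null throwable. */
    void catchException(Throwable t) {
        if (t != null) {
            exceptionCount.incrementAndGet();
        }
    }

    /** Share of recorded calls that ended in an exception; 0 when nothing was recorded. */
    double exceptionRate() {
        long calls = invokeCount.get();
        return calls == 0 ? 0.0 : (double) exceptionCount.get() / calls;
    }

    /** Mirrors the subscriber: count the call, then count the exception if there is one. */
    void onResult(Throwable error) {
        invoke();
        if (error != null) {
            catchException(error);
        }
    }

    public static void main(String[] args) {
        CallStat stat = new CallStat();
        stat.onResult(null);
        stat.onResult(null);
        stat.onResult(null);
        stat.onResult(new RuntimeException("timeout"));
        System.out.println(stat.exceptionRate());
    }
}
```

Atomic counters are used because receive events can arrive on many threads at once, while the measurement task reads the totals from its own thread.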
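3. Measurement and computation
The TimeWindowRegulator class regulates by time window. It owns a measurement scheduler and a regulation thread pool: the scheduler periodically measures each measure model, and the regulation pool acts on the measurement results.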
// Regulates by time window; owns a measurement scheduler and a regulation thread pool
@Extension("timeWindow")
public class TimeWindowRegulator implements Regulator {
    // Measurement scheduler: measures the measure models once per second
    private final ScheduledService measureScheduler = new ScheduledService("AFT-MEASURE",
        ScheduledService.MODE_FIXEDRATE,
        new MeasureRunnable(), 1, 1,
        TimeUnit.SECONDS);
    // Regulation thread pool: regulates according to the measurement results
    private final ExecutorService regulationExecutor = ThreadPoolUtils.newFixedThreadPool(2,
        new LinkedBlockingQueue<Runnable>(16),
        new NamedThreadFactory("AFT-REGULATION"));
    // All measure models
    private final CopyOnWriteArrayList<MeasureModel> measureModels = new CopyOnWriteArrayList<>();
    // Measure strategy
    private MeasureStrategy measureStrategy;
    // Regulation strategy
    private RegulationStrategy regulationStrategy;
    // Degrade strategy
    private DegradeStrategy degradeStrategy;
    // Recover strategy
    private RecoverStrategy recoverStrategy;

    /**
     * Initializes the strategies.
     */
    @Override
    public void init() {
        // Load the measure strategy
        measureStrategy = ExtensionLoaderFactory.getExtensionLoader(MeasureStrategy.class).getExtension(
            measureStrategyAlias);
        // Load the regulation strategy
        regulationStrategy = ExtensionLoaderFactory.getExtensionLoader(RegulationStrategy.class).getExtension(
            regulationStrategyAlias);
        // Load the degrade strategy
        degradeStrategy = ExtensionLoaderFactory.getExtensionLoader(DegradeStrategy.class).getExtension(
            degradeStrategyAlias);
        // Load the recover strategy
        recoverStrategy = ExtensionLoaderFactory.getExtensionLoader(RecoverStrategy.class).getExtension(
            recoverStrategyAlias);
        // Register the listener
        InvocationStatFactory.addListener(listener);
    }

    // Measurement task
    private class MeasureRunnable implements Runnable {
        @Override
        public void run() {
            // Increment the measurement counter
            measureCounter.incrementAndGet();
            for (MeasureModel measureModel : measureModels) {
                try {
                    // Check whether the time window has elapsed
                    if (isArriveTimeWindow(measureModel)) {
                        // Measure
                        MeasureResult measureResult = measureStrategy.measure(measureModel);
                        // Submit a regulation task
                        regulationExecutor.submit(new RegulationRunnable(measureResult));
                    }
                } catch (Exception e) {
                    LOGGER.errorWithApp(measureModel.getAppName(),
                        LogCodes.getLog(LogCodes.ERROR_WHEN_DO_MEASURE, e.getMessage()), e);
                }
            }
        }

        /**
         * Checks whether the time window has elapsed.
         * @param measureModel measure model
         * @return true if the time window has elapsed
         */
        private boolean isArriveTimeWindow(MeasureModel measureModel) {
            long timeWindow = FaultToleranceConfigManager.getTimeWindow(measureModel.getAppName());
            return measureCounter.get() % timeWindow == 0;
        }
    }

    // Regulation task
    private class RegulationRunnable implements Runnable {
        private final MeasureResult measureResult;

        RegulationRunnable(MeasureResult measureResult) {
            this.measureResult = measureResult;
        }

        @Override
        public void run() {
            // Get all measurement result details
            List<MeasureResultDetail> measureResultDetails = measureResult.getAllMeasureResultDetails();
            for (MeasureResultDetail measureResultDetail : measureResultDetails) {
                try {
                    // Regulate
                    doRegulate(measureResultDetail);
                } catch (Exception e) {
                    LOGGER.errorWithApp(measureResult.getMeasureModel().getAppName(),
                        LogCodes.getLog(LogCodes.ERROR_WHEN_DO_REGULATE, e.getMessage()), e);
                }
            }
        }

        /**
         * Regulates one provider.
         * @param measureResultDetail measurement result detail
         */
        void doRegulate(MeasureResultDetail measureResultDetail) {
            // Measured state
            MeasureState measureState = measureResultDetail.getMeasureState();
            // Call statistics dimension
            InvocationStatDimension statDimension = measureResultDetail.getInvocationStatDimension();
            // Check whether degrading is enabled
            boolean isDegradeEffective = regulationStrategy.isDegradeEffective(measureResultDetail);
            if (isDegradeEffective) {
                measureResultDetail.setLogOnly(false);
                if (measureState.equals(MeasureState.ABNORMAL)) {
                    // Check whether the maximum number of degraded IPs has been reached
                    boolean isReachMaxDegradeIpCount = regulationStrategy.isReachMaxDegradeIpCount(measureResultDetail);
                    if (!isReachMaxDegradeIpCount) {
                        // Degrade
                        degradeStrategy.degrade(measureResultDetail);
                    }
                }
            }
        }
    }
}
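The window check in isArriveTimeWindow is plain modular arithmetic on a counter that advances once per scheduler tick. The following sketch replays that logic without any threads; WindowTicker and firingTicks are hypothetical names, and the window size is assumed to be positive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

final class WindowTicker {
    private final AtomicLong measureCounter = new AtomicLong();

    /**
     * Simulates one scheduler tick and reports whether a full window of
     * `timeWindow` ticks has just elapsed. timeWindow must be positive.
     */
    boolean tick(long timeWindow) {
        return measureCounter.incrementAndGet() % timeWindow == 0;
    }

    /** Runs totalTicks ticks and returns the tick numbers at which a measurement would fire. */
    static List<Integer> firingTicks(int totalTicks, long timeWindow) {
        WindowTicker ticker = new WindowTicker();
        List<Integer> fired = new ArrayList<>();
        for (int t = 1; t <= totalTicks; t++) {
            if (ticker.tick(timeWindow)) {
                fired.add(t);
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        // With one tick per second and a window of 10, measurement fires every 10 seconds.
        System.out.println(firingTicks(25, 10L));
    }
}
```

Sharing one counter across all measure models lets applications with different window sizes be served by a single one-second schedule.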
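4. Regulation strategy
- Degrade strategy: when a provider's exception rate exceeds the configured threshold, the degrade strategy fires and lowers that provider's weight, so fewer calls are routed to it.
- Recover strategy: when a provider returns to a healthy state, the recover strategy fires and raises its weight step by step.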
// Service-horizontal regulation strategy
@Extension("serviceHorizontal")
public class ServiceHorizontalRegulationStrategy implements RegulationStrategy {
    @Override
    public boolean isDegradeEffective(MeasureResultDetail measureResultDetail) {
        // Call statistics dimension
        InvocationStatDimension statDimension = measureResultDetail.getInvocationStatDimension();
        // Check whether degrading is enabled for this application
        return FaultToleranceConfigManager.isDegradeEffective(statDimension.getAppName());
    }

    @Override
    public boolean isReachMaxDegradeIpCount(MeasureResultDetail measureResultDetail) {
        // Call statistics dimension
        InvocationStatDimension statDimension = measureResultDetail.getInvocationStatDimension();
        // Set of degraded provider IPs
        ConcurrentHashSet<String> ips = getDegradeProviders(statDimension.getDimensionKey());
        // Current IP
        String ip = statDimension.getIp();
        if (ips.contains(ip)) {
            // Already degraded, so it does not count against the limit again
            return false;
        } else {
            // Maximum number of degraded IPs
            int degradeMaxIpCount = FaultToleranceConfigManager.getDegradeMaxIpCount(statDimension.getAppName());
            ipsLock.lock();
            try {
                if (ips.size() < degradeMaxIpCount) {
                    // Add the IP to the degraded set
                    ips.add(ip);
                    return false;
                } else {
                    return true;
                }
            } finally {
                ipsLock.unlock();
            }
        }
    }
    // Other methods
}
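The degrade and recover bullets can be illustrated with the weight parameters from FaultToleranceConfig (weightDegradeRate, degradeLeastWeight, weightRecoverRate). The arithmetic below is an assumption about how those parameters combine, written as a hypothetical WeightAdjuster class; it is a sketch, not SOFARPC's degrade or recover strategy.

```java
final class WeightAdjuster {
    /** Multiplies the weight by degradeRate, but never goes below leastWeight. */
    static int degrade(int currentWeight, double degradeRate, int leastWeight) {
        int degraded = (int) (currentWeight * degradeRate);
        return Math.max(degraded, leastWeight);
    }

    /** Multiplies the weight by recoverRate, but never exceeds the provider's original weight. */
    static int recover(int currentWeight, double recoverRate, int originWeight) {
        int recovered = (int) (Math.max(currentWeight, 1) * recoverRate);
        return Math.min(recovered, originWeight);
    }

    public static void main(String[] args) {
        int weight = 100;
        weight = degrade(weight, 0.05D, 1); // 100 -> 5
        weight = degrade(weight, 0.05D, 1); // 5 -> 1 (floored at degradeLeastWeight)
        System.out.println("degraded weight: " + weight);
        int steps = 0;
        while (weight < 100) {
            weight = recover(weight, 2D, 100); // 1, 2, 4, ... capped at 100
            steps++;
        }
        System.out.println("recovered in " + steps + " steps");
    }
}
```

With these defaults, degradation is steep (two windows take a weight of 100 down to 1) while recovery doubles the weight each window, so a provider that has just become healthy regains traffic gradually.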