1. 容错机制概述
1.1 什么是容错
容错(Fault Tolerance)是指系统在部分组件发生故障时,仍然能够继续正常运作的能力。在分布式系统中,容错不是避免故障,而是优雅地处理故障,保证系统的可用性和稳定性。
1.2 容错的业务价值
想象一个现代化的电商系统架构:

当某个组件故障时,容错机制能够:
- 维持核心功能:即使非关键服务不可用,核心业务流程仍可继续
- 快速自动恢复:系统能够自动检测故障并尝试恢复
- 防止雪崩效应:避免单个服务的故障扩散到整个系统
2. 容错设计原则
2.1 核心设计理念
2.1.1 故障隔离
@Component
public class BulkheadIsolation {
// 使用舱壁模式隔离不同服务的资源
@Autowired
private BulkheadRegistry bulkheadRegistry;
public <T> T executeWithIsolation(String serviceName,
Supplier<T> operation) {
Bulkhead bulkhead = bulkheadRegistry.bulkhead(serviceName);
return bulkhead.executeSupplier(operation);
}
}
2.1.2 快速失败
当故障发生时,立即返回错误而不是长时间等待,避免资源耗尽。
2.1.3 优雅降级
提供有损但可用的服务,而不是完全不可用。
2.2 容错模式对比
|
模式 |
适用场景 |
实现方式 |
优缺点 |
|
重试 |
临时性故障 |
自动重新执行 |
✅ 简单有效 ❌ 可能加重负载 |
|
熔断 |
持续性故障 |
快速失败+自动恢复 |
✅ 防止雪崩 ❌ 用户体验受影响 |
|
降级 |
资源不足 |
功能裁剪 |
✅ 保障核心业务 ❌ 功能完整性受损 |
|
超时 |
慢响应 |
限制等待时间 |
✅ 释放资源 ❌ 可能中断正常请求 |
|
舱壁 |
资源隔离 |
资源池隔离 |
✅ 故障隔离 ❌ 资源利用率降低 |
3. Resilience4j 容错实战
3.1 环境准备与配置
Maven依赖
<dependencies>
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-spring-boot2</artifactId>
<version>2.0.2</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-aop</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
</dependencies>
完整容错配置
resilience4j:
# 熔断器配置
circuitbreaker:
instances:
orderService:
failureRateThreshold: 50
waitDurationInOpenState: 10s
permittedNumberOfCallsInHalfOpenState: 3
slidingWindowSize: 10
minimumNumberOfCalls: 5
recordExceptions:
- java.io.IOException
- java.util.concurrent.TimeoutException
ignoreExceptions:
- com.example.BusinessException
# 重试配置
retry:
instances:
paymentService:
maxAttempts: 3
waitDuration: 500ms
retryExceptions:
- java.io.IOException
- org.springframework.web.client.ResourceAccessException
# 超时配置
timelimiter:
instances:
externalApi:
timeoutDuration: 5s
cancelRunningFuture: true
# 舱壁配置
bulkhead:
instances:
inventoryService:
maxConcurrentCalls: 20
maxWaitDuration: 100ms
# 速率限制器
ratelimiter:
instances:
authService:
limitForPeriod: 10
limitRefreshPeriod: 1s
timeoutDuration: 0
management:
endpoints:
web:
exposure:
include: health,metrics,circuitbreakers
endpoint:
health:
show-details: always
3.2 综合容错策略实现
订单服务完整容错示例
@Service
@Slf4j
public class OrderServiceWithFaultTolerance {
@Autowired
private PaymentServiceClient paymentService;
@Autowired
private InventoryServiceClient inventoryService;
@Autowired
private NotificationServiceClient notificationService;
/**
* 创建订单 - 综合容错策略
*/
@Retry(name = "orderService", fallbackMethod = "createOrderFallback")
@CircuitBreaker(name = "orderService", fallbackMethod = "createOrderFallback")
@TimeLimiter(name = "externalApi", fallbackMethod = "createOrderFallback")
@Bulkhead(name = "orderService", fallbackMethod = "createOrderFallback")
@RateLimiter(name = "orderService", fallbackMethod = "createOrderFallback")
public CompletableFuture<OrderResult> createOrder(OrderRequest request) {
return CompletableFuture.supplyAsync(() -> {
log.info("开始处理订单: {}", request.getOrderId());
// 1. 参数校验(同步,必须成功)
validateOrderRequest(request);
// 2. 库存检查(核心业务,重试+熔断)
checkInventoryWithRetry(request);
// 3. 支付处理(核心业务,重试+熔断+超时)
processPaymentWithFaultTolerance(request);
// 4. 创建订单记录
Order order = saveOrder(request);
// 5. 发送通知(非核心,可降级)
sendNotificationAsync(order);
log.info("订单处理完成: {}", order.getOrderId());
return OrderResult.success(order);
});
}
/**
* 库存检查 - 带重试机制
*/
@Retry(name = "inventoryService", fallbackMethod = "checkInventoryFallback")
private void checkInventoryWithRetry(OrderRequest request) {
InventoryCheckResult result = inventoryService.checkInventory(
request.getProductId(),
request.getQuantity()
);
if (!result.isAvailable()) {
throw new InventoryException("库存不足,商品ID: " + request.getProductId());
}
}
/**
* 支付处理 - 综合容错
*/
@CircuitBreaker(name = "paymentService", fallbackMethod = "processPaymentFallback")
@TimeLimiter(name = "paymentService")
@Retry(name = "paymentService")
private PaymentResult processPaymentWithFaultTolerance(OrderRequest request) {
PaymentRequest paymentRequest = buildPaymentRequest(request);
return paymentService.processPayment(paymentRequest);
}
/**
* 异步发送通知 - 可降级
*/
@Async
@CircuitBreaker(name = "notificationService", fallbackMethod = "notificationFallback")
public void sendNotificationAsync(Order order) {
try {
notificationService.sendOrderCreatedNotification(order);
log.debug("订单通知发送成功: {}", order.getOrderId());
} catch (Exception e) {
log.warn("订单通知发送失败,但不会影响主流程: {}", e.getMessage());
// 非核心服务异常,只记录日志,不抛出异常
}
}
/**
* 综合降级方法
*/
private CompletableFuture<OrderResult> createOrderFallback(OrderRequest request, Exception e) {
log.error("订单创建容错降级, 订单ID: {}, 异常: {}", request.getOrderId(), e.getMessage());
FaultToleranceContext context = analyzeFaultContext(e);
switch (context.getSeverity()) {
case CRITICAL:
return CompletableFuture.completedFuture(handleCriticalFailure(request, context));
case HIGH:
return CompletableFuture.completedFuture(handleHighSeverityFailure(request, context));
case MEDIUM:
return CompletableFuture.completedFuture(handleMediumSeverityFailure(request, context));
default:
return CompletableFuture.completedFuture(handleLowSeverityFailure(request, context));
}
}
/**
* 库存检查降级
*/
private void checkInventoryFallback(OrderRequest request, Exception e) {
log.warn("库存检查降级, 使用本地缓存数据");
// 使用本地缓存或最近的成功结果
LocalInventoryCache cache = inventoryService.getLocalCache(request.getProductId());
if (cache != null && cache.isValid()) {
if (cache.getAvailableQuantity() < request.getQuantity()) {
throw new InventoryException("本地缓存显示库存不足");
}
// 继续执行
} else {
throw new InventoryException("无法获取库存信息,请稍后重试");
}
}
/**
* 支付处理降级
*/
private PaymentResult processPaymentFallback(OrderRequest request, Exception e) {
log.warn("支付服务降级, 创建待支付订单");
// 创建待支付订单,引导用户后续完成支付
return PaymentResult.builder()
.paymentId("PENDING-" + System.currentTimeMillis())
.status(PaymentStatus.PENDING)
.message("支付系统繁忙,订单已创建,请稍后完成支付")
.build();
}
/**
* 通知服务降级
*/
private void notificationFallback(Order order, Exception e) {
log.info("通知服务降级, 订单ID: {}", order.getOrderId());
// 将通知任务放入重试队列
notificationRetryQueue.add(order);
}
}
3.3 容错上下文与决策引擎
容错上下文管理
@Component
@Slf4j
public class FaultToleranceContextManager {
private final ThreadLocal<FaultToleranceContext> contextHolder = new ThreadLocal<>();
/**
* 开始容错上下文
*/
public void beginContext(String operationName) {
FaultToleranceContext context = FaultToleranceContext.builder()
.operationName(operationName)
.startTime(System.currentTimeMillis())
.retryCount(0)
.fallbackTriggered(false)
.build();
contextHolder.set(context);
}
/**
* 记录重试事件
*/
public void recordRetry(Exception exception) {
FaultToleranceContext context = getCurrentContext();
if (context != null) {
context.incrementRetryCount();
context.addException(exception);
log.debug("记录重试事件: {}, 重试次数: {}",
context.getOperationName(), context.getRetryCount());
}
}
/**
* 记录降级事件
*/
public void recordFallback(String fallbackMethod, Exception triggerException) {
FaultToleranceContext context = getCurrentContext();
if (context != null) {
context.setFallbackTriggered(true);
context.setFallbackMethod(fallbackMethod);
context.setTriggerException(triggerException);
context.setEndTime(System.currentTimeMillis());
log.warn("降级触发: 操作={}, 降级方法={}, 触发异常={}",
context.getOperationName(), fallbackMethod, triggerException.getMessage());
}
}
/**
* 获取当前上下文
*/
public FaultToleranceContext getCurrentContext() {
return contextHolder.get();
}
/**
* 结束上下文
*/
public void endContext() {
FaultToleranceContext context = contextHolder.get();
if (context != null) {
// 记录指标
recordMetrics(context);
contextHolder.remove();
}
}
/**
* 分析故障上下文
*/
public FaultSeverity analyzeFaultSeverity(FaultToleranceContext context) {
if (context == null) {
return FaultSeverity.UNKNOWN;
}
List<Exception> exceptions = context.getExceptions();
if (exceptions.isEmpty()) {
return FaultSeverity.NONE;
}
// 基于异常类型和频率判断严重程度
long timeoutCount = exceptions.stream()
.filter(e -> e instanceof TimeoutException)
.count();
long circuitBreakerCount = exceptions.stream()
.filter(e -> e instanceof CallNotPermittedException)
.count();
if (circuitBreakerCount > 0) {
return FaultSeverity.CRITICAL;
} else if (timeoutCount >= 3) {
return FaultSeverity.HIGH;
} else if (context.getRetryCount() >= 2) {
return FaultSeverity.MEDIUM;
} else {
return FaultSeverity.LOW;
}
}
private void recordMetrics(FaultToleranceContext context) {
// 记录到监控系统
Metrics.counter("fault_tolerance_operation_total",
"operation", context.getOperationName(),
"fallback_triggered", String.valueOf(context.isFallbackTriggered()))
.increment();
if (context.isFallbackTriggered()) {
Metrics.counter("fault_tolerance_fallback_total",
"operation", context.getOperationName(),
"fallback_method", context.getFallbackMethod())
.increment();
}
}
}
智能容错决策引擎
@Component
@Slf4j
public class SmartFaultToleranceEngine {
@Autowired
private SystemMetricsCollector metricsCollector;
@Autowired
private CircuitBreakerRegistry circuitBreakerRegistry;
@Autowired
private RetryRegistry retryRegistry;
/**
* 动态调整容错策略
*/
public void adjustFaultToleranceStrategies() {
SystemMetrics metrics = metricsCollector.getCurrentMetrics();
// 基于系统负载调整策略
if (metrics.getCpuUsage() > 80) {
tightenCircuitBreakerSettings();
reduceRetryAttempts();
} else if (metrics.getCpuUsage() < 30) {
loosenCircuitBreakerSettings();
increaseRetryAttempts();
}
// 基于时间特征调整
if (isPeakBusinessHours()) {
enableAggressiveDegradation();
} else {
disableAggressiveDegradation();
}
}
/**
* 收紧熔断器设置
*/
private void tightenCircuitBreakerSettings() {
circuitBreakerRegistry.getAllCircuitBreakers().forEach((name, circuitBreaker) -> {
CircuitBreakerConfig newConfig = CircuitBreakerConfig.from(circuitBreaker.getCircuitBreakerConfig())
.failureRateThreshold(40) // 降低阈值
.waitDurationInOpenState(Duration.ofSeconds(15)) // 延长等待时间
.build();
circuitBreaker.getEventPublisher().onStateTransition(event -> {
log.info("熔断器 {} 状态变更: {} -> {}",
name, event.getStateTransition().getFromState(),
event.getStateTransition().getToState());
});
});
}
/**
* 智能重试策略
*/
public RetryConfig buildSmartRetryConfig(String serviceName, Exception lastException) {
RetryConfig.Builder builder = RetryConfig.custom();
// 基于异常类型设置重试策略
if (lastException instanceof TimeoutException) {
builder.maxAttempts(2)
.waitDuration(Duration.ofMillis(200));
} else if (lastException instanceof SocketException) {
builder.maxAttempts(3)
.waitDuration(Duration.ofMillis(500));
} else if (lastException instanceof ServerErrorException) {
builder.maxAttempts(1) // 服务器错误不重试
.waitDuration(Duration.ofMillis(100));
} else {
builder.maxAttempts(3)
.waitDuration(Duration.ofMillis(300));
}
// 基于服务重要性调整
if (isCriticalService(serviceName)) {
builder.maxAttempts(builder.build().getMaxAttempts() + 1);
}
return builder.build();
}
private boolean isPeakBusinessHours() {
LocalTime now = LocalTime.now();
return (now.isAfter(LocalTime.of(9, 0)) && now.isBefore(LocalTime.of(12, 0))) ||
(now.isAfter(LocalTime.of(14, 0)) && now.isBefore(LocalTime.of(18, 0)));
}
private boolean isCriticalService(String serviceName) {
return serviceName.contains("payment") ||
serviceName.contains("order") ||
serviceName.contains("inventory");
}
}
4. 容错监控与管理
4.1 综合监控面板
@RestController
@RequestMapping("/fault-tolerance")
@Slf4j
public class FaultToleranceMonitorController {
@Autowired
private CircuitBreakerRegistry circuitBreakerRegistry;
@Autowired
private RetryRegistry retryRegistry;
@Autowired
private BulkheadRegistry bulkheadRegistry;
@Autowired
private TimeLimiterRegistry timeLimiterRegistry;
/**
* 获取所有容错组件状态
*/
@GetMapping("/status")
public Map<String, Object> getFaultToleranceStatus() {
Map<String, Object> status = new HashMap<>();
status.put("circuitBreakers", getCircuitBreakerStatus());
status.put("retries", getRetryStatus());
status.put("bulkheads", getBulkheadStatus());
status.put("timeLimiters", getTimeLimiterStatus());
status.put("systemMetrics", getSystemMetrics());
return status;
}
/**
* 熔断器状态详情
*/
@GetMapping("/circuit-breakers/{name}")
public Map<String, Object> getCircuitBreakerDetail(@PathVariable String name) {
CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker(name);
Map<String, Object> detail = new HashMap<>();
detail.put("name", name);
detail.put("state", circuitBreaker.getState().name());
detail.put("metrics", buildCircuitBreakerMetrics(circuitBreaker.getMetrics()));
detail.put("config", buildCircuitBreakerConfig(circuitBreaker.getCircuitBreakerConfig()));
return detail;
}
/**
* 手动操作熔断器状态
*/
@PostMapping("/circuit-breakers/{name}/state")
public ResponseEntity<String> changeCircuitBreakerState(
@PathVariable String name,
@RequestParam String state) {
CircuitBreaker circuitBreaker = circuitBreakerRegistry.circuitBreaker(name);
try {
switch (state.toUpperCase()) {
case "CLOSED":
circuitBreaker.transitionToClosedState();
break;
case "OPEN":
circuitBreaker.transitionToOpenState();
break;
case "HALF_OPEN":
circuitBreaker.transitionToHalfOpenState();
break;
case "FORCED_OPEN":
circuitBreaker.transitionToForcedOpenState();
break;
case "DISABLED":
circuitBreaker.transitionToDisabledState();
break;
default:
return ResponseEntity.badRequest().body("无效的状态: " + state);
}
log.info("手动变更熔断器状态: {} -> {}", name, state);
return ResponseEntity.ok("状态变更成功");
} catch (Exception e) {
log.error("变更熔断器状态失败", e);
return ResponseEntity.status(500).body("状态变更失败: " + e.getMessage());
}
}
/**
* 重置容错组件
*/
@PostMapping("/reset/{componentType}/{name}")
public ResponseEntity<String> resetComponent(
@PathVariable String componentType,
@PathVariable String name) {
try {
switch (componentType.toLowerCase()) {
case "circuitbreaker":
circuitBreakerRegistry.circuitBreaker(name).reset();
break;
case "retry":
retryRegistry.retry(name).reset();
break;
case "bulkhead":
bulkheadRegistry.bulkhead(name).reset();
break;
default:
return ResponseEntity.badRequest().body("不支持的组件类型: " + componentType);
}
log.info("重置容错组件: {}/{}", componentType, name);
return ResponseEntity.ok("重置成功");
} catch (Exception e) {
log.error("重置容错组件失败", e);
return ResponseEntity.status(500).body("重置失败: " + e.getMessage());
}
}
private Map<String, Object> getCircuitBreakerStatus() {
Map<String, Object> breakers = new HashMap<>();
circuitBreakerRegistry.getAllCircuitBreakers().forEach((name, cb) -> {
Map<String, Object> info = new HashMap<>();
info.put("state", cb.getState().name());
info.put("metrics", buildCircuitBreakerMetrics(cb.getMetrics()));
breakers.put(name, info);
});
return breakers;
}
private Map<String, Object> buildCircuitBreakerMetrics(CircuitBreaker.Metrics metrics) {
Map<String, Object> metricsInfo = new HashMap<>();
metricsInfo.put("failureRate", metrics.getFailureRate());
metricsInfo.put("bufferedCalls", metrics.getNumberOfBufferedCalls());
metricsInfo.put("failedCalls", metrics.getNumberOfFailedCalls());
metricsInfo.put("successfulCalls", metrics.getNumberOfSuccessfulCalls());
metricsInfo.put("notPermittedCalls", metrics.getNumberOfNotPermittedCalls());
metricsInfo.put("slowCalls", metrics.getNumberOfSlowCalls());
metricsInfo.put("slowCallRate", metrics.getSlowCallRate());
return metricsInfo;
}
}
4.2 容错事件监听与告警
@Component
@Slf4j
public class FaultToleranceEventListener {
@Autowired
private AlertService alertService;
@Autowired
private MetricsService metricsService;
/**
* 监听熔断器事件
*/
@EventListener
public void onCircuitBreakerEvent(CircuitBreakerEvent event) {
metricsService.recordCircuitBreakerEvent(event);
switch (event.getEventType()) {
case STATE_TRANSITION:
handleStateTransition(event);
break;
case ERROR:
handleError(event);
break;
case CALL_NOT_PERMITTED:
handleCallNotPermitted(event);
break;
case SLOW_CALL_RATE_EXCEEDED:
handleSlowCallRateExceeded(event);
break;
case FAILURE_RATE_EXCEEDED:
handleFailureRateExceeded(event);
break;
}
}
/**
* 监听重试事件
*/
@EventListener
public void onRetryEvent(RetryEvent event) {
metricsService.recordRetryEvent(event);
if (event.getEventType() == RetryEvent.Type.RETRY) {
log.warn("重试事件: {}, 异常: {}, 重试次数: {}",
event.getName(), event.getLastThrowable().getMessage(),
event.getNumberOfRetryAttempts());
// 重试次数过多告警
if (event.getNumberOfRetryAttempts() >= 3) {
alertService.sendRetryAlert(event.getName(),
event.getNumberOfRetryAttempts(),
event.getLastThrowable());
}
}
}
/**
* 处理熔断器状态转换
*/
private void handleStateTransition(CircuitBreakerEvent event) {
CircuitBreakerOnStateTransitionEvent transitionEvent =
(CircuitBreakerOnStateTransitionEvent) event;
String message = String.format("熔断器状态变更: %s [%s -> %s]",
event.getCircuitBreakerName(),
transitionEvent.getStateTransition().getFromState(),
transitionEvent.getStateTransition().getToState());
log.warn(message);
// 状态转为OPEN时发送告警
if (transitionEvent.getStateTransition().getToState() == CircuitBreaker.State.OPEN) {
alertService.sendCircuitBreakerOpenAlert(event.getCircuitBreakerName(), message);
}
// 状态转为CLOSED时发送恢复通知
if (transitionEvent.getStateTransition().getToState() == CircuitBreaker.State.CLOSED &&
transitionEvent.getStateTransition().getFromState() == CircuitBreaker.State.OPEN) {
alertService.sendCircuitBreakerRecoveredAlert(event.getCircuitBreakerName());
}
}
/**
* 处理调用被拒绝事件
*/
private void handleCallNotPermitted(CircuitBreakerEvent event) {
log.warn("调用被熔断器拒绝: {}", event.getCircuitBreakerName());
// 记录被拒绝的调用,用于容量规划
metricsService.recordRejectedCall(event.getCircuitBreakerName());
// 发送业务影响告警
if (isCriticalService(event.getCircuitBreakerName())) {
alertService.sendBusinessImpactAlert(event.getCircuitBreakerName(),
"熔断器打开,影响业务功能");
}
}
private boolean isCriticalService(String serviceName) {
return serviceName.contains("payment") ||
serviceName.contains("order") ||
serviceName.contains("auth");
}
}
5. 高级容错模式
5.1 自适应容错策略
@Component
@Slf4j
public class AdaptiveFaultToleranceStrategy {
@Autowired
private HistoricalDataAnalyzer historicalDataAnalyzer;
@Autowired
private MachineLearningPredictor mlPredictor;
/**
* 基于历史数据调整容错参数
*/
public void adaptFaultToleranceParameters() {
Map<String, ServiceBehavior> serviceBehaviors =
historicalDataAnalyzer.analyzeServiceBehavior(Period.ofDays(7));
serviceBehaviors.forEach((serviceName, behavior) -> {
AdaptiveConfig config = calculateAdaptiveConfig(behavior);
applyAdaptiveConfig(serviceName, config);
});
}
/**
* 预测性容错
*/
public PredictiveFaultToleranceResult predictFaultToleranceNeeds(String serviceName) {
ServiceMetrics currentMetrics = metricsService.getCurrentMetrics(serviceName);
SeasonalPattern pattern = historicalDataAnalyzer.getSeasonalPattern(serviceName);
// 使用机器学习预测故障概率
double faultProbability = mlPredictor.predictFaultProbability(
currentMetrics, pattern);
return PredictiveFaultToleranceResult.builder()
.serviceName(serviceName)
.faultProbability(faultProbability)
.recommendedAction(getRecommendedAction(faultProbability))
.confidenceLevel(mlPredictor.getConfidenceLevel())
.build();
}
private AdaptiveConfig calculateAdaptiveConfig(ServiceBehavior behavior) {
return AdaptiveConfig.builder()
.circuitBreakerThreshold(calculateOptimalThreshold(behavior))
.retryAttempts(calculateOptimalRetryAttempts(behavior))
.timeoutDuration(calculateOptimalTimeout(behavior))
.bulkheadSize(calculateOptimalBulkheadSize(behavior))
.build();
}
private double calculateOptimalThreshold(ServiceBehavior behavior) {
// 基于历史故障率计算最优阈值
double historicalFailureRate = behavior.getAverageFailureRate();
double variability = behavior.getFailureRateVariability();
// 故障率波动大时使用更保守的阈值
if (variability > 0.3) {
return Math.max(30, historicalFailureRate * 0.8);
} else {
return Math.max(40, historicalFailureRate * 1.2);
}
}
}
5.2 容错演练框架
@Service
@Slf4j
public class FaultToleranceDrillService {
@Autowired
private ChaosEngineeringService chaosService;
@Autowired
private FaultToleranceConfigManager configManager;
/**
* 执行容错演练
*/
public DrillResult executeFaultToleranceDrill(DrillPlan drillPlan) {
log.info("开始容错演练: {}", drillPlan.getName());
DrillResult result = new DrillResult(drillPlan);
try {
// 1. 备份当前配置
Map<String, Object> backupConfigs = configManager.backupAllConfigs();
// 2. 应用演练配置
applyDrillConfigs(drillPlan);
// 3. 注入故障
injectFaults(drillPlan.getFaultInjectionScenarios());
// 4. 执行测试用例
executeTestScenarios(drillPlan, result);
// 5. 评估容错效果
evaluateFaultToleranceEffectiveness(result);
// 6. 恢复环境
restoreEnvironment(backupConfigs);
result.setStatus(DrillStatus.SUCCESS);
} catch (Exception e) {
log.error("容错演练执行失败", e);
result.setStatus(DrillStatus.FAILED);
result.setErrorMessage(e.getMessage());
}
// 7. 生成演练报告
generateDrillReport(result);
return result;
}
/**
* 定期容错演练计划
*/
@Scheduled(cron = "0 0 2 * * SUN") // 每周日凌晨2点执行
public void scheduleWeeklyDrill() {
DrillPlan weeklyPlan = buildWeeklyDrillPlan();
executeFaultToleranceDrill(weeklyPlan);
}
/**
* 大促前专项演练
*/
public void executePrePromotionDrill() {
DrillPlan promotionPlan = buildPromotionDrillPlan();
DrillResult result = executeFaultToleranceDrill(promotionPlan);
// 基于演练结果优化配置
if (result.getStatus() == DrillStatus.SUCCESS) {
optimizeConfigurationsBasedOnDrill(result);
} else {
log.error("大促前演练失败,需要人工干预");
alertService.sendDrillFailureAlert(result);
}
}
private void injectFaults(List<FaultInjectionScenario> scenarios) {
for (FaultInjectionScenario scenario : scenarios) {
try {
switch (scenario.getFaultType()) {
case NETWORK_LATENCY:
chaosService.injectNetworkLatency(scenario.getTargetService(),
scenario.getDuration(), scenario.getParameters());
break;
case SERVICE_UNAVAILABLE:
chaosService.makeServiceUnavailable(scenario.getTargetService(),
scenario.getDuration());
break;
case HIGH_CPU:
chaosService.injectHighCpuLoad(scenario.getTargetService(),
scenario.getDuration());
break;
case MEMORY_LEAK:
chaosService.injectMemoryLeak(scenario.getTargetService(),
scenario.getDuration());
break;
}
log.info("故障注入成功: {}", scenario.getDescription());
} catch (Exception e) {
log.error("故障注入失败: {}", scenario.getDescription(), e);
}
}
}
}
6. 生产环境最佳实践
6.1 容错配置模板
# 核心服务容错配置模板
fault-tolerance:
templates:
critical-service:
circuit-breaker:
failure-rate-threshold: 40
wait-duration-in-open-state: 30s
sliding-window-size: 20
minimum-number-of-calls: 10
retry:
max-attempts: 3
wait-duration: 1s
time-limiter:
timeout-duration: 5s
bulkhead:
max-concurrent-calls: 50
max-wait-duration: 50ms
normal-service:
circuit-breaker:
failure-rate-threshold: 50
wait-duration-in-open-state: 10s
sliding-window-size: 10
minimum-number-of-calls: 5
retry:
max-attempts: 2
wait-duration: 500ms
time-limiter:
timeout-duration: 3s
bulkhead:
max-concurrent-calls: 100
max-wait-duration: 100ms
# 服务具体配置
resilience4j:
circuitbreaker:
configs:
critical:
baseConfig: critical-service
normal:
baseConfig: normal-service
circuitbreaker:
instances:
paymentService:
baseConfig: critical
orderService:
baseConfig: critical
productService:
baseConfig: normal
recommendationService:
baseConfig: normal
6.2 容错检查清单
@Component
@Slf4j
public class FaultToleranceChecklist {
/**
* 系统启动时容错配置检查
*/
@EventListener(ApplicationReadyEvent.class)
public void checkFaultToleranceConfigurations() {
log.info("开始容错配置检查...");
List<CheckItem> checkItems = Arrays.asList(
checkCircuitBreakerConfigs(),
checkRetryConfigs(),
checkBulkheadConfigs(),
checkTimeLimiterConfigs(),
checkFallbackMethods(),
checkMonitoringConfig()
);
boolean allPassed = checkItems.stream().allMatch(CheckItem::isPassed);
if (allPassed) {
log.info("容错配置检查通过");
} else {
log.error("容错配置检查发现以下问题:");
checkItems.stream()
.filter(item -> !item.isPassed())
.forEach(item -> log.error(" - {}: {}", item.getName(), item.getMessage()));
// 发送告警
alertService.sendConfigurationCheckAlert(checkItems);
}
}
/**
* 定期健康检查
*/
@Scheduled(fixedRate = 300000) // 5分钟执行一次
public void performHealthCheck() {
FaultToleranceHealth health = checkFaultToleranceHealth();
if (health.getStatus() == HealthStatus.DEGRADED) {
log.warn("容错系统健康度下降: {}", health.getMessage());
alertService.sendHealthDegradationAlert(health);
}
if (health.getStatus() == HealthStatus.CRITICAL) {
log.error("容错系统健康度严重下降: {}", health.getMessage());
alertService.sendCriticalHealthAlert(health);
// 尝试自动修复
attemptAutoRecovery(health);
}
}
private CheckItem checkCircuitBreakerConfigs() {
try {
circuitBreakerRegistry.getAllCircuitBreakers().forEach((name, cb) -> {
CircuitBreakerConfig config = cb.getCircuitBreakerConfig();
// 检查配置合理性
if (config.getFailureRateThreshold() < 10) {
throw new ConfigurationException("熔断器失败率阈值过低: " + name);
}
if (config.getWaitDurationInOpenState().getSeconds() < 5) {
throw new ConfigurationException("熔断器等待时间过短: " + name);
}
});
return CheckItem.passed("熔断器配置检查");
} catch (Exception e) {
return CheckItem.failed("熔断器配置检查", e.getMessage());
}
}
}
7. 总结
7.1 容错机制核心价值
通过本文的深入学习,你应该掌握容错机制在分布式系统中的核心价值:
- 系统韧性:在故障发生时保持系统部分可用
- 故障隔离:防止单个故障扩散到整个系统
- 快速恢复:自动检测和恢复故障组件
- 用户体验:提供优雅的降级体验而非完全不可用
7.2 关键实践要点
✅ 分层容错设计:不同层级采用不同的容错策略 ✅ 智能决策:基于历史数据和实时指标动态调整 ✅ 全面监控:实时跟踪容错组件状态和效果 ✅ 定期演练:通过混沌工程验证容错有效性 ✅ 持续优化:基于实际运行数据不断改进配置
7.3 演进路径建议
对于不同成熟度的团队,容错机制的演进路径:
初级阶段:
- 实现基本的熔断和重试
- 配置合理的超时时间
- 实现简单的降级策略
中级阶段:
- 引入舱壁隔离
- 实现多级降级
- 建立监控告警
高级阶段:
- 自适应容错策略
- 预测性容错
- 自动化容错演练
容错不是一蹴而就的,而是需要根据业务特点和技术架构不断演进的过程。合理的容错设计能够显著提升系统的稳定性和用户体验。

被折叠的 条评论
为什么被折叠?



