系统架构设计的关键决策

在进行系统架构设计时,通常需要围绕以下几类关键决策进行权衡:

  1. 技术选型: 编程语言、数据库、中间件、Web 框架、部署平台(云/本地)等。

  2. 数据管理策略: 数据存储方案(关系型、NoSQL、文件系统)、数据流、数据一致性机制、缓存策略。

  3. 并发与分布式处理: 如何处理并行任务、负载均衡、容错机制(如冗余、故障转移)。

  4. 安全架构: 认证、授权、加密、审计、安全通信等策略。

  5. 部署架构: 系统的物理或逻辑部署结构(如服务器配置、网络拓扑、容器编排)。

我们选择构建一个智能家居系统作为示例,该系统需要处理设备控制、用户管理、自动化场景等。我们将围绕这个系统来阐述关键决策。

一、技术选型

1.1 编程语言

  • Java: 用于后端微服务,因为其成熟的生态、强类型、高性能和丰富的框架(如Spring Boot)。

  • Python: 用于数据分析、机器学习(如能源优化、用户行为分析)和设备适配层(因为其丰富的硬件库)。

  • C++: 用于高性能要求的设备驱动和实时处理(如视频流处理)。

  • C#: 用于Windows环境下的客户端应用(如桌面控制端)。

  • JavaScript/TypeScript: 用于Web前端和移动端(React Native)。

1.2 数据库

  • 关系型数据库(PostgreSQL): 存储用户信息、设备元数据、场景配置等结构化数据,保证ACID。

  • 时序数据库(InfluxDB): 存储设备产生的时序数据(如温度、功耗)。

  • 文档数据库(MongoDB): 存储非结构化或半结构化数据,如设备日志、用户活动日志。

  • 缓存数据库(Redis): 用于会话存储、设备状态缓存、分布式锁。

1.3 中间件

  • 消息队列(Apache Kafka): 用于事件流处理,实现组件之间的解耦和异步通信。

  • API网关(Kong): 处理请求路由、认证、限流等。

  • 服务网格(Istio): 用于微服务之间的通信、监控和策略实施。

1.4 Web框架

  • Spring Boot(Java): 构建微服务。

  • FastAPI(Python): 构建设备适配层的API服务。

  • ASP.NET Core(C#): 构建Windows服务。

1.5 部署平台

  • 云平台(AWS/Azure): 利用云服务实现弹性伸缩和高可用性。

  • 容器化(Docker): 统一运行环境,便于部署。

  • 容器编排(Kubernetes): 管理容器化应用的部署、扩展和运维。

二、数据管理策略

2.1 数据存储方案

  • 关系型数据(用户、设备元数据): 使用PostgreSQL,利用其事务支持和复杂查询能力。

  • 时序数据(传感器数据): 使用InfluxDB,优化时间范围查询和数据聚合。

  • 非结构化数据(日志): 使用MongoDB,灵活存储和查询。

  • 缓存数据(会话、状态): 使用Redis,提供低延迟访问。

2.2 数据流

  • 设备数据通过MQTT或HTTP发送到设备适配层。

  • 适配层将数据写入InfluxDB,同时发送事件到Kafka(写入示意见本节末尾)。

  • 微服务消费Kafka事件,进行业务处理。

  • 前端通过API网关查询数据,数据可能来自缓存或数据库。
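
下面是适配层"写入时序库并发布事件"这一步的最小示意,采用 InfluxDB 2.x Java 客户端与原生 Kafka Producer;其中地址、token、bucket 与主题名均为假设,仅用于说明数据流走向,并非完整实现。

// TelemetryIngestor.java(示意:设备读数写入InfluxDB,并向Kafka发布事件;连接参数均为假设)
import com.influxdb.client.InfluxDBClient;
import com.influxdb.client.InfluxDBClientFactory;
import com.influxdb.client.domain.WritePrecision;
import com.influxdb.client.write.Point;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Instant;
import java.util.Properties;

public class TelemetryIngestor {

    public static void main(String[] args) {
        // 1. 写入时序库(假设bucket为telemetry)
        try (InfluxDBClient influx = InfluxDBClientFactory.create(
                "http://localhost:8086", "my-token".toCharArray(), "smart-home", "telemetry")) {
            Point point = Point.measurement("temperature")
                    .addTag("device_id", "dev-001")
                    .addField("value", 22.5)
                    .time(Instant.now(), WritePrecision.MS);
            influx.getWriteApiBlocking().writePoint(point);
        }

        // 2. 发布事件到Kafka,供业务微服务消费(主题名为示例)
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("device.telemetry", "dev-001",
                    "{\"device_id\":\"dev-001\",\"value\":22.5}"));
        }
    }
}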

2.3 数据一致性机制

  • 对于核心业务数据(如用户账户),使用Saga等柔性事务模式保证最终一致性。

  • 对于设备状态,采用异步事件更新,通过版本号或时间戳解决冲突(合并逻辑示意见本节末尾)。
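
下面给出一个"最后写入者胜出"的状态合并示意,对应上面基于时间戳解决冲突的思路;其中 DeviceStatus 的字段均为假设,仅用于说明合并逻辑。

// DeviceStateReconciler.java(示意:基于时间戳的状态合并,类型与字段均为假设)
public class DeviceStateReconciler {

    // 极简的设备状态表示:state为业务状态,timestamp为事件产生时间(毫秒)
    public record DeviceStatus(String deviceId, String state, long timestamp) {}

    // 仅当新事件时间戳更大时才覆盖,避免乱序到达的旧事件把状态"改回去"
    public DeviceStatus merge(DeviceStatus current, DeviceStatus incoming) {
        if (current == null || incoming.timestamp() > current.timestamp()) {
            return incoming;
        }
        return current;
    }
}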

2.4 缓存策略

  • 缓存穿透: 使用布隆过滤器或缓存空值。

  • 缓存击穿: 使用互斥锁(如基于Redis SETNX的分布式锁)避免热点键失效时大量请求同时访问数据库(示意代码见本节末尾)。

  • 缓存雪崩: 设置不同的过期时间,使用高可用缓存集群。

  • 缓存更新: 根据数据一致性要求,选择直写(Write-Through)或回写(Write-Back)等更新策略。
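
针对上面提到的缓存穿透与缓存击穿,下面给出一个基于 Spring Data Redis 的极简示意;键名、过期时间与 loadFromDb 回调均为假设,只用来说明"缓存空值 + SETNX互斥重建"的思路,并非完整实现。

// CacheGuard.java(示意:缓存空值防穿透 + SETNX互斥锁防击穿;键名与TTL均为假设)
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Component;

import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

@Component
public class CacheGuard {

    private static final String NULL_MARKER = "__NULL__"; // 空值占位,防止缓存穿透

    private final StringRedisTemplate redis;

    public CacheGuard(StringRedisTemplate redis) {
        this.redis = redis;
    }

    public String getWithProtection(String key, Supplier<String> loadFromDb) {
        String cached = redis.opsForValue().get(key);
        if (cached != null) {
            return NULL_MARKER.equals(cached) ? null : cached; // 命中缓存(含空值占位)
        }

        String lockKey = "lock:" + key;
        // SETNX互斥锁:只让一个请求去重建缓存,防止热点键失效时击穿数据库
        Boolean locked = redis.opsForValue().setIfAbsent(lockKey, "1", Duration.ofSeconds(10));
        if (Boolean.TRUE.equals(locked)) {
            try {
                String value = loadFromDb.get();
                // 过期时间加随机抖动,降低大量键同时失效引发雪崩的风险
                long ttlSeconds = 300 + ThreadLocalRandom.current().nextLong(60);
                redis.opsForValue().set(key, value != null ? value : NULL_MARKER,
                        Duration.ofSeconds(ttlSeconds));
                return value;
            } finally {
                redis.delete(lockKey);
            }
        }

        // 未抢到锁:短暂等待后重试,让拿到锁的请求先把缓存填好
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return getWithProtection(key, loadFromDb);
    }
}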

三、并发与分布式处理

3.1 并行任务处理

  • 使用线程池(Java的ExecutorService)处理CPU密集型任务(用法示意见本节末尾)。

  • 使用异步非阻塞IO(如Netty、Python的asyncio)处理高并发IO操作。

  • 使用Kafka分区实现并行处理事件流。
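
下面是线程池配合 CompletableFuture 并行处理计算型任务的一个最小示意;任务内容与数值均为假设,仅用于说明并行提交与结果汇总的写法。

// ParallelTaskDemo.java(示意:线程池 + CompletableFuture并行计算,任务内容为假设)
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class ParallelTaskDemo {

    public static void main(String[] args) {
        // CPU密集型任务:线程数接近可用核心数即可
        ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        List<CompletableFuture<Integer>> futures = List.of("dev-1", "dev-2", "dev-3").stream()
                .map(id -> CompletableFuture.supplyAsync(() -> analyze(id), pool))
                .collect(Collectors.toList());

        // 等待并汇总所有结果
        int total = futures.stream()
                .map(CompletableFuture::join)
                .mapToInt(Integer::intValue)
                .sum();
        System.out.println("total = " + total);

        pool.shutdown();
    }

    // 模拟一个计算型任务(如对某设备历史数据做聚合)
    private static int analyze(String deviceId) {
        return deviceId.hashCode() & 0xff;
    }
}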

3.2 负载均衡

  • 在Kubernetes中使用Service和Ingress实现负载均衡。

  • 使用API网关(Kong)进行请求分发。

  • 微服务内部使用客户端负载均衡(如Spring Cloud LoadBalancer,配置示意见本节末尾)。
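
下面是客户端负载均衡的一个配置示意,假设系统已接入服务注册中心,服务名 device-service 仅为示例。

// LoadBalancedClientConfig.java(示意:客户端负载均衡配置,服务名为假设)
import org.springframework.cloud.client.loadbalancer.LoadBalanced;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.client.RestTemplate;

@Configuration
public class LoadBalancedClientConfig {

    // @LoadBalanced 使RestTemplate把"http://服务名/..."解析为具体实例,并在实例间做负载均衡
    @Bean
    @LoadBalanced
    public RestTemplate restTemplate() {
        return new RestTemplate();
    }
}

// 调用方示意:用服务名而不是主机名访问设备服务
// String body = restTemplate.getForObject("http://device-service/devices/" + id, String.class);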

3.3 容错机制

  • 冗余: 关键服务部署多个实例,避免单点故障。

  • 故障转移: 使用Kubernetes的存活探针和就绪探针实现自动重启和流量切换。

  • 熔断和降级: 使用Resilience4j实现熔断与降级,防止故障扩散(Hystrix已进入维护模式,新项目建议优先选用Resilience4j;示意代码见本节末尾)。

  • 限流: 在API网关层对请求进行限流,保护后端服务。
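
下面用 Resilience4j 给出熔断与降级的一个最小示意;阈值、服务名以及 callDeviceService() 这个远程调用均为假设。

// CircuitBreakerExample.java(示意:Resilience4j熔断 + 降级,参数与远程调用均为假设)
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.vavr.control.Try;

import java.time.Duration;
import java.util.function.Supplier;

public class CircuitBreakerExample {

    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // 滑动窗口内失败率超过50%则打开熔断器
                .waitDurationInOpenState(Duration.ofSeconds(30)) // 打开30秒后进入半开状态试探
                .slidingWindowSize(20)
                .build();

        CircuitBreaker breaker = CircuitBreakerRegistry.of(config).circuitBreaker("deviceService");

        Supplier<String> decorated = CircuitBreaker.decorateSupplier(breaker,
                CircuitBreakerExample::callDeviceService);

        // 熔断打开或调用失败时返回降级值(如缓存的最后已知状态)
        String result = Try.ofSupplier(decorated)
                .recover(ex -> "last-known-state")
                .get();
        System.out.println(result);
    }

    // 假设的远程调用,仅作演示
    private static String callDeviceService() {
        return "device-online";
    }
}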

四、安全架构

4.1 认证

  • 使用OAuth 2.0和JWT进行用户认证,令牌存放在HttpOnly Cookie中或移动端的安全存储中(签发与校验示意见本节末尾)。

  • 设备认证使用证书或预共享密钥。
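
下面以 jjwt(0.11.x)为例,给出签发与校验 JWT 的一个示意;密钥与过期时间均为假设,实际密钥应托管在 KMS/Vault 中而不是硬编码。

// JwtTokenService.java(示意:使用jjwt签发与校验访问令牌,密钥与有效期均为假设)
import io.jsonwebtoken.Claims;
import io.jsonwebtoken.Jwts;
import io.jsonwebtoken.security.Keys;

import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;
import java.util.Date;

public class JwtTokenService {

    // 演示用密钥(HS256要求至少32字节);生产环境应从KMS/Vault加载
    private final SecretKey key = Keys.hmacShaKeyFor(
            "change-me-to-a-32-byte-minimum-secret!!".getBytes(StandardCharsets.UTF_8));

    public String issue(String userId) {
        return Jwts.builder()
                .setSubject(userId)
                .setExpiration(new Date(System.currentTimeMillis() + 3600_000)) // 1小时有效
                .signWith(key)
                .compact();
    }

    public String verify(String token) {
        Claims claims = Jwts.parserBuilder()
                .setSigningKey(key)
                .build()
                .parseClaimsJws(token)   // 签名无效或已过期会抛出JwtException
                .getBody();
        return claims.getSubject();
    }
}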

4.2 授权

  • 使用RBAC(基于角色的访问控制)管理用户权限(权限校验示意见本节末尾)。

  • 微服务之间使用服务账户进行相互认证和授权。
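
下面是 RBAC 权限校验的一个极简示意;角色与权限映射均为假设,实际系统中通常持久化在数据库并缓存。

// RbacChecker.java(示意:最小化的基于角色的访问控制,角色与权限映射均为假设)
import java.util.Map;
import java.util.Set;

public class RbacChecker {

    // 角色 -> 权限集合;权限采用"资源:操作"的命名约定
    private static final Map<String, Set<String>> ROLE_PERMISSIONS = Map.of(
            "OWNER",  Set.of("device:read", "device:control", "scene:edit", "user:manage"),
            "MEMBER", Set.of("device:read", "device:control"),
            "GUEST",  Set.of("device:read")
    );

    public boolean isAllowed(Set<String> userRoles, String permission) {
        return userRoles.stream()
                .map(r -> ROLE_PERMISSIONS.getOrDefault(r, Set.of()))
                .anyMatch(perms -> perms.contains(permission));
    }

    public static void main(String[] args) {
        RbacChecker checker = new RbacChecker();
        System.out.println(checker.isAllowed(Set.of("MEMBER"), "device:control")); // true
        System.out.println(checker.isAllowed(Set.of("GUEST"),  "scene:edit"));     // false
    }
}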

4.3 加密

  • 数据传输使用TLS 1.3。

  • 敏感数据(如密码)使用强哈希算法(如bcrypt)存储(示意见本节末尾)。

  • 密钥管理使用云平台的KMS(密钥管理服务)或HashiCorp Vault。
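
密码哈希可参考下面基于 Spring Security BCryptPasswordEncoder 的示意,强度因子取值为示例。

// PasswordHashing.java(示意:使用BCryptPasswordEncoder存储与校验密码,强度因子为示例值)
import org.springframework.security.crypto.bcrypt.BCryptPasswordEncoder;

public class PasswordHashing {

    public static void main(String[] args) {
        // 强度因子12:在安全性与单次哈希耗时之间折中,可按硬件能力调整
        BCryptPasswordEncoder encoder = new BCryptPasswordEncoder(12);

        String hash = encoder.encode("user-password");   // 每次结果不同(内置随机盐)
        System.out.println(hash);

        System.out.println(encoder.matches("user-password", hash));  // true
        System.out.println(encoder.matches("wrong-password", hash)); // false
    }
}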

4.4 审计

  • 记录所有关键操作(如登录、设备控制)到审计日志,日志集中存储(如ELK Stack)并设置保留策略(写入示意见下)。
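
审计日志写入可参考下面的示意:以单行 JSON 输出结构化事件,再由日志采集器(如 Filebeat/Logstash)送入 ELK;字段与 Logger 名称均为假设。

// AuditLogger.java(示意:结构化审计日志输出,字段与Logger名称均为假设)
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.time.Instant;

public class AuditLogger {

    private static final Logger AUDIT = LoggerFactory.getLogger("AUDIT");

    public void record(String userId, String action, String resource, boolean success) {
        // 单行JSON,便于Filebeat/Logstash直接解析后写入Elasticsearch
        AUDIT.info("{\"ts\":\"{}\",\"user\":\"{}\",\"action\":\"{}\",\"resource\":\"{}\",\"success\":{}}",
                Instant.now(), userId, action, resource, success);
    }
}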

4.5 安全通信

  • 微服务之间使用mTLS(双向TLS)进行通信(自行配置mTLS的示意见本节末尾)。

  • 设备与服务器通信使用DTLS(用于UDP)或TLS(用于TCP)。
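
在服务网格(Istio)内,mTLS 通常由Sidecar透明注入;对于不经过网格、需要自行配置双向TLS的场景,下面是基于 JSSE 构建 SSLContext 的一个示意,密钥库路径与口令均为假设。

// MutualTlsClient.java(示意:基于JSSE构建双向TLS的SSLContext,密钥库路径与口令均为假设)
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;
import java.io.FileInputStream;
import java.security.KeyStore;

public class MutualTlsClient {

    public static SSLContext buildContext() throws Exception {
        // 客户端证书与私钥(向服务端证明自身身份)
        KeyStore keyStore = KeyStore.getInstance("PKCS12");
        try (FileInputStream in = new FileInputStream("client-keystore.p12")) {
            keyStore.load(in, "changeit".toCharArray());
        }
        KeyManagerFactory kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        kmf.init(keyStore, "changeit".toCharArray());

        // 信任库(校验服务端证书)
        KeyStore trustStore = KeyStore.getInstance("PKCS12");
        try (FileInputStream in = new FileInputStream("truststore.p12")) {
            trustStore.load(in, "changeit".toCharArray());
        }
        TrustManagerFactory tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);

        SSLContext context = SSLContext.getInstance("TLSv1.3");
        context.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
        return context;
    }
}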

五、部署架构

5.1 物理/逻辑部署结构

  • 前端层: 静态资源托管在CDN,动态请求通过API网关。

  • 网关层: 部署Kong集群,作为所有流量的入口。

  • 微服务层: 每个微服务部署在Kubernetes集群中,跨多个可用区。

  • 数据层: 数据库集群同样跨可用区部署,主从复制,定期备份。

  • 设备适配层: 部署在边缘节点,靠近设备,减少延迟。

5.2 网络拓扑

  • 使用VPC(虚拟私有云)隔离环境。

  • 微服务在Kubernetes集群内通过Service通信,外部通过Ingress暴露。

  • 数据库不直接暴露在公网,只能通过内网访问。

5.3 容器编排

  • 使用Kubernetes进行容器编排,定义Deployment、Service、ConfigMap、Secret等资源。

  • 使用Helm进行应用打包和部署。

  • 使用GitOps(如ArgoCD)实现持续部署。

示例:设备状态更新流程

UML序列图

@startuml
actor 用户 as User
participant "移动App" as App
participant "API网关" as Gateway
participant "设备服务" as DeviceService
participant "缓存(Redis)" as Cache
participant "消息队列(Kafka)" as MQ
participant "设备适配器" as Adapter
participant "设备" as Device

User -> App: 控制设备
App -> Gateway: HTTP请求(带JWT)
Gateway -> Gateway: 验证JWT,限流
Gateway -> DeviceService: 转发请求
DeviceService -> Cache: 获取设备状态(缓存)
Cache --> DeviceService: 返回状态
DeviceService -> MQ: 发布控制事件
MQ --> Adapter: 消费事件
Adapter -> Device: 发送控制命令(MQTT)
Device --> Adapter: 返回结果
Adapter -> MQ: 发布状态更新事件
MQ --> DeviceService: 消费状态更新事件
DeviceService -> Cache: 更新设备状态
DeviceService -> Gateway: 返回响应
Gateway -> App: 返回HTTP响应
App -> User: 显示结果
@enduml

Java代码示例(设备服务)

// DeviceController.java
@RestController
@RequestMapping("/devices")
public class DeviceController {
    
    @Autowired
    private DeviceService deviceService;
    
    @PostMapping("/{deviceId}/commands")
    public ResponseEntity<CommandResult> executeCommand(
            @PathVariable String deviceId,
            @RequestBody DeviceCommand command) {
        
        CommandResult result = deviceService.executeCommand(deviceId, command);
        return ResponseEntity.ok(result);
    }
}

// DeviceService.java
@Service
public class DeviceService {
    
    @Autowired
    private RedisTemplate<String, DeviceStatus> redisTemplate;
    
    @Autowired
    private KafkaTemplate<String, DeviceEvent> kafkaTemplate;
    
    public CommandResult executeCommand(String deviceId, DeviceCommand command) {
        // 1. 检查设备状态(缓存)
        DeviceStatus status = redisTemplate.opsForValue().get(deviceId);
        if (status == null) {
            // 从数据库加载
            status = loadDeviceStatusFromDatabase(deviceId);
        }
        
        // 2. 验证命令是否可执行
        if (!status.canExecute(command)) {
            throw new IllegalStateException("Device cannot execute command in current state");
        }
        
        // 3. 发布控制事件到Kafka
        DeviceEvent event = new DeviceEvent(deviceId, command, System.currentTimeMillis());
        kafkaTemplate.send("device.commands", deviceId, event);
        
        // 4. 更新缓存状态为"执行中"
        status.setPendingCommand(command);
        redisTemplate.opsForValue().set(deviceId, status, Duration.ofMinutes(5));
        
        return new CommandResult("Command accepted", true);
    }
    
    @KafkaListener(topics = "device.status.updates")
    public void handleStatusUpdate(DeviceStatusUpdate update) {
        // 更新缓存
        redisTemplate.opsForValue().set(
            update.getDeviceId(), 
            update.getStatus(), 
            Duration.ofMinutes(5)
        );
        
        // 可以进一步处理,如通知用户等
    }
}

Python代码示例(设备适配器)

# device_adapter.py
import asyncio
import json
import logging
import time
from aiokafka import AIOKafkaConsumer, AIOKafkaProducer
import paho.mqtt.client as mqtt

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class DeviceAdapter:
    def __init__(self, kafka_bootstrap_servers, mqtt_broker):
        self.kafka_bootstrap_servers = kafka_bootstrap_servers
        self.mqtt_broker = mqtt_broker
        
        self.mqtt_client = mqtt.Client()
        self.mqtt_client.on_connect = self.on_mqtt_connect
        self.mqtt_client.on_message = self.on_mqtt_message
        
        self.kafka_consumer = None
        self.kafka_producer = None
        self.loop = None  # 主事件循环引用,供MQTT回调线程使用
        
    async def start(self):
        # 记录当前事件循环,供MQTT回调线程安全地提交协程
        self.loop = asyncio.get_running_loop()
        
        # 连接MQTT
        self.mqtt_client.connect(self.mqtt_broker, 1883, 60)
        self.mqtt_client.loop_start()
        
        # 连接Kafka
        self.kafka_consumer = AIOKafkaConsumer(
            'device.commands',
            bootstrap_servers=self.kafka_bootstrap_servers,
            group_id="device_adapter_group"
        )
        await self.kafka_consumer.start()
        
        self.kafka_producer = AIOKafkaProducer(
            bootstrap_servers=self.kafka_bootstrap_servers
        )
        await self.kafka_producer.start()
        
        # 开始消费Kafka消息
        asyncio.create_task(self.consume_kafka_messages())
        
    async def consume_kafka_messages(self):
        async for msg in self.kafka_consumer:
            try:
                event = json.loads(msg.value.decode())
                await self.handle_device_command(event)
            except Exception as e:
                logger.error(f"Error processing Kafka message: {e}")
                
    async def handle_device_command(self, event):
        device_id = event['device_id']
        command = event['command']
        
        # 通过MQTT发送命令到设备
        topic = f"devices/{device_id}/command"
        payload = json.dumps(command)
        self.mqtt_client.publish(topic, payload)
        
        logger.info(f"Sent command to device {device_id}: {command}")
        
    def on_mqtt_connect(self, client, userdata, flags, rc):
        logger.info(f"Connected to MQTT broker with result code {rc}")
        # 订阅设备状态主题
        client.subscribe("devices/+/status")
        
    def on_mqtt_message(self, client, userdata, msg):
        # 收到设备状态更新,发送到Kafka
        # 该回调运行在paho的网络线程中,线程内没有事件循环,
        # 需用run_coroutine_threadsafe把协程提交回主事件循环
        asyncio.run_coroutine_threadsafe(self.publish_status_update(msg), self.loop)
        
    async def publish_status_update(self, msg):
        try:
            topic = msg.topic
            payload = msg.payload.decode()
            
            # 解析设备ID
            device_id = topic.split('/')[1]
            status = json.loads(payload)
            
            # 构建Kafka消息
            update = {
                'device_id': device_id,
                'status': status,
                'timestamp': time.time()  # Unix时间戳(秒)
            }
            
            await self.kafka_producer.send(
                'device.status.updates',
                json.dumps(update).encode()
            )
            
            logger.info(f"Published status update for device {device_id}")
        except Exception as e:
            logger.error(f"Error processing MQTT message: {e}")
            
async def main():
    adapter = DeviceAdapter(
        kafka_bootstrap_servers='localhost:9092',
        mqtt_broker='localhost'
    )
    await adapter.start()
    
    # 保持运行
    while True:
        await asyncio.sleep(1)

if __name__ == '__main__':
    asyncio.run(main())

部署文件示例(Kubernetes Deployment)

# device-service-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: device-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: device-service
  template:
    metadata:
      labels:
        app: device-service
    spec:
      containers:
      - name: device-service
        image: smart-home/device-service:latest
        ports:
        - containerPort: 8080
        env:
        - name: SPRING_PROFILES_ACTIVE
          value: "prod"
        - name: REDIS_HOST
          value: "redis-master"
        - name: KAFKA_BOOTSTRAP_SERVERS
          value: "kafka-broker:9092"
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /actuator/health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /actuator/health
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: device-service
spec:
  selector:
    app: device-service
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP

总结

通过上述技术选型、数据管理、并发处理、安全架构和部署架构的决策,我们构建了一个可扩展、高可用、安全的智能家居系统。每个组件都选择了适合其职责的技术,并通过合理的架构模式进行集成。实际项目中,还需要根据具体需求和约束进行调整。

智能家居系统关键决策:技术选型与架构设计

下面通过构建一个企业级智能家居云平台的完整案例,详细阐述各项关键决策的技术选型、实现原理与最佳实践。

一、技术选型决策框架

1.1 技术选型决策矩阵

技术选型决策矩阵:
  编程语言:
    服务端:
      - 核心业务: Java (Spring Boot) - 企业级生态、强类型、高并发
      - 实时通信: Go (Gin/Gorilla) - 高并发、低延迟、轻量级
      - 数据分析: Python (FastAPI/Pandas) - 数据科学生态丰富
      - 设备适配: C++ (嵌入式/高性能) - 系统级编程、高性能
      - 桌面应用: C# (.NET Core/WPF) - Windows生态、生产力工具
      
    客户端:
      - Web前端: TypeScript + React - 类型安全、组件化
      - 移动端: Swift/Kotlin + Flutter - 原生性能、跨平台
      - 桌面端: C# + WPF/Electron - 原生体验、跨平台
      
  数据库:
    - 关系型: PostgreSQL - 完整SQL支持、JSONB、地理空间
    - 时序数据库: TimescaleDB - 基于PostgreSQL的时序扩展
    - 文档数据库: MongoDB - 灵活Schema、水平扩展
    - 缓存数据库: Redis - 内存存储、数据结构丰富
    - 图数据库: Neo4j - 关系查询、推荐系统
    
  中间件:
    - 消息队列: Apache Kafka - 高吞吐、持久化、流处理
    - API网关: Kong - 云原生、插件生态丰富
    - 服务网格: Istio - 流量管理、可观测性
    - 任务调度: Apache Airflow - 工作流编排、监控
    
  部署平台:
    - 公有云: AWS (EKS, RDS, S3) - 成熟稳定、服务齐全
    - 私有云: Kubernetes (on-premises) - 数据主权、定制化
    - 边缘计算: K3s (轻量K8s) - 资源受限环境

1.2 技术栈详细对比

# tech_stack_evaluator.py
from dataclasses import dataclass
from typing import Dict, List, Tuple
from enum import Enum

class TechCategory(Enum):
    LANGUAGE = "language"
    DATABASE = "database"
    MIDDLEWARE = "middleware"
    FRAMEWORK = "framework"
    PLATFORM = "platform"

@dataclass
class TechOption:
    name: str
    category: TechCategory
    pros: List[str]
    cons: List[str]
    suitability_score: Dict[str, int]  # 维度: 分数(1-10)
    
class TechStackEvaluator:
    """技术栈评估器"""
    
    DIMENSIONS = {
        "performance": "性能",
        "scalability": "可扩展性",
        "maintainability": "可维护性",
        "ecosystem": "生态系统",
        "team_expertise": "团队熟悉度",
        "cost": "成本",
        "security": "安全性"
    }
    
    def __init__(self, project_requirements: Dict):
        self.requirements = project_requirements
        self.options = self._initialize_options()
        
    def _initialize_options(self) -> Dict[TechCategory, List[TechOption]]:
        """初始化技术选项"""
        return {
            TechCategory.LANGUAGE: [
                TechOption(
                    name="Java",
                    category=TechCategory.LANGUAGE,
                    pros=["企业级生态", "强类型安全", "丰富框架", "高并发"],
                    cons=["内存占用大", "启动慢", "语法冗长"],
                    suitability_score={
                        "performance": 8, "scalability": 9,
                        "maintainability": 9, "ecosystem": 10,
                        "team_expertise": 8, "cost": 6, "security": 9
                    }
                ),
                TechOption(
                    name="Go",
                    category=TechCategory.LANGUAGE,
                    pros=["高并发", "编译快", "内存高效", "部署简单"],
                    cons=["生态相对较小", "错误处理繁琐", "泛型支持晚"],
                    suitability_score={
                        "performance": 10, "scalability": 10,
                        "maintainability": 8, "ecosystem": 7,
                        "team_expertise": 6, "cost": 8, "security": 8
                    }
                ),
                TechOption(
                    name="Python",
                    category=TechCategory.LANGUAGE,
                    pros=["开发快速", "生态丰富", "数据科学", "易学"],
                    cons=["性能较低", "GIL限制", "类型系统弱"],
                    suitability_score={
                        "performance": 6, "scalability": 7,
                        "maintainability": 7, "ecosystem": 9,
                        "team_expertise": 9, "cost": 8, "security": 7
                    }
                ),
                TechOption(
                    name="C++",
                    category=TechCategory.LANGUAGE,
                    pros=["极致性能", "系统级控制", "零成本抽象"],
                    cons=["开发复杂", "内存管理难", "编译慢"],
                    suitability_score={
                        "performance": 10, "scalability": 8,
                        "maintainability": 5, "ecosystem": 8,
                        "team_expertise": 5, "cost": 5, "security": 8
                    }
                ),
                TechOption(
                    name="C#",
                    category=TechCategory.LANGUAGE,
                    pros=["Windows生态", ".NET Core跨平台", "生产力高"],
                    cons=["Linux生态相对弱", "商业工具费用"],
                    suitability_score={
                        "performance": 9, "scalability": 9,
                        "maintainability": 9, "ecosystem": 8,
                        "team_expertise": 7, "cost": 7, "security": 9
                    }
                )
            ],
            TechCategory.DATABASE: [
                TechOption(
                    name="PostgreSQL",
                    category=TechCategory.DATABASE,
                    pros=["ACID支持", "JSONB", "地理空间", "扩展丰富"],
                    cons=["大规模分片复杂", "运维要求高"],
                    suitability_score={
                        "performance": 8, "scalability": 7,
                        "maintainability": 8, "ecosystem": 9,
                        "team_expertise": 8, "cost": 8, "security": 9
                    }
                ),
                TechOption(
                    name="TimescaleDB",
                    category=TechCategory.DATABASE,
                    pros=["时序优化", "PostgreSQL兼容", "自动分片"],
                    cons=["相对较新", "社区较小"],
                    suitability_score={
                        "performance": 9, "scalability": 9,
                        "maintainability": 8, "ecosystem": 7,
                        "team_expertise": 6, "cost": 8, "security": 8
                    }
                ),
                TechOption(
                    name="MongoDB",
                    category=TechCategory.DATABASE,
                    pros=["灵活Schema", "水平扩展", "文档模型"],
                    cons=["事务支持晚", "内存占用大"],
                    suitability_score={
                        "performance": 8, "scalability": 10,
                        "maintainability": 7, "ecosystem": 9,
                        "team_expertise": 8, "cost": 7, "security": 7
                    }
                ),
                TechOption(
                    name="Redis",
                    category=TechCategory.DATABASE,
                    pros=["内存速度", "数据结构丰富", "发布订阅"],
                    cons=["持久化复杂", "内存成本高"],
                    suitability_score={
                        "performance": 10, "scalability": 9,
                        "maintainability": 8, "ecosystem": 9,
                        "team_expertise": 9, "cost": 6, "security": 8
                    }
                )
            ]
        }
    
    def evaluate(self, weights: Dict[str, float]) -> Dict[TechCategory, List[Tuple]]:
        """根据权重评估技术选项"""
        results = {}
        
        for category, options in self.options.items():
            scores = []
            for option in options:
                total_score = sum(
                    option.suitability_score[dim] * weights.get(dim, 1.0)
                    for dim in self.DIMENSIONS.keys()
                )
                scores.append((option.name, total_score, option))
            
            # 按分数排序
            scores.sort(key=lambda x: x[1], reverse=True)
            results[category] = scores
        
        return results
    
    def recommend_tech_stack(self, project_type: str = "iot_platform") -> Dict:
        """根据项目类型推荐技术栈"""
        recommendation_templates = {
            "iot_platform": {
                "description": "物联网平台 - 高并发、实时、海量数据",
                "weights": {
                    "performance": 1.2,  # 强调性能
                    "scalability": 1.3,   # 强调可扩展性
                    "maintainability": 1.0,
                    "ecosystem": 1.1,
                    "team_expertise": 0.9,
                    "cost": 0.8,
                    "security": 1.2
                },
                "stack": {
                    "backend": ["Go", "Java"],
                    "database": ["TimescaleDB", "PostgreSQL", "Redis"],
                    "message_queue": ["Kafka"],
                    "container_orchestration": ["Kubernetes"]
                }
            },
            "enterprise_app": {
                "description": "企业应用 - 稳定、安全、可维护",
                "weights": {
                    "performance": 1.0,
                    "scalability": 1.1,
                    "maintainability": 1.3,  # 强调可维护性
                    "ecosystem": 1.2,
                    "team_expertise": 1.1,
                    "cost": 1.0,
                    "security": 1.3  # 强调安全性
                }
            }
        }
        
        return recommendation_templates.get(project_type, {})

# 使用示例
if __name__ == "__main__":
    evaluator = TechStackEvaluator({})
    
    # 为物联网平台评估
    weights = evaluator.recommend_tech_stack("iot_platform")["weights"]
    results = evaluator.evaluate(weights)
    
    print("技术选型评估结果:")
    for category, options in results.items():
        print(f"\n{category.value}:")
        for name, score, option in options[:3]:  # 显示前三名
            print(f"  {name}: {score:.2f}")
            print(f"    优点: {', '.join(option.pros[:2])}")
            print(f"    缺点: {', '.join(option.cons[:2])}")

二、数据管理策略

2.1 多模数据存储架构

// MultiModalDataStrategy.java
@Component
public class MultiModalDataStrategy {
    
    @Autowired
    private PostgresRepository postgresRepo;
    
    @Autowired
    private TimescaleRepository timescaleRepo;
    
    @Autowired
    private MongoRepository mongoRepo;
    
    @Autowired
    private RedisCacheManager redisCache;
    
    // 数据存储决策器
    public DataStorageDecision decideStorage(DataEntity data) {
        DataType type = data.getType();
        AccessPattern pattern = data.getAccessPattern();
        RetentionPolicy retention = data.getRetentionPolicy();
        
        DataStorageDecision decision = new DataStorageDecision();
        
        switch (type) {
            case DEVICE_METADATA:
                // 设备元数据 - PostgreSQL (关系型)
                decision.setPrimaryStorage(StorageType.POSTGRESQL);
                decision.setSchema("devices");
                decision.setIndexes(List.of("device_id_idx", "user_id_idx"));
                break;
                
            case TIME_SERIES:
                // 时序数据 - TimescaleDB (时序优化)
                decision.setPrimaryStorage(StorageType.TIMESCALEDB);
                decision.setHypertable("sensor_readings");
                decision.setCompressionPolicy("7 days");
                decision.setRetentionPolicy("90 days");
                break;
                
            case EVENT_LOG:
                // 事件日志 - MongoDB (文档型)
                decision.setPrimaryStorage(StorageType.MONGODB);
                decision.setCollection("device_events");
                decision.setTtlIndex("created_at", 30); // 30天自动删除
                decision.setShardKey("device_id");
                break;
                
            case SESSION_DATA:
                // 会话数据 - Redis (缓存)
                decision.setPrimaryStorage(StorageType.REDIS);
                decision.setTtl(3600); // 1小时过期
                decision.setReplication(true);
                break;
                
            case FILE_BLOB:
                // 文件数据 - 对象存储
                decision.setPrimaryStorage(StorageType.S3);
                decision.setBucket("device-firmware");
                decision.setCdnEnabled(true);
                break;
        }
        
        // 添加缓存层
        if (pattern == AccessPattern.READ_HEAVY) {
            decision.setCacheLayer(StorageType.REDIS);
            decision.setCacheTtl(300); // 5分钟缓存
            decision.setCacheStrategy(CacheStrategy.WRITE_THROUGH);
        }
        
        // 添加备份策略
        decision.setBackupStrategy(new BackupStrategy(
            BackupFrequency.DAILY,
            RetentionPeriod.MONTHS_6,
            List.of(BackupType.FULL, BackupType.INCREMENTAL)
        ));
        
        return decision;
    }
    
    // 数据一致性管理器
    @Component
    public class DataConsistencyManager {
        
        private final SagaManager sagaManager;
        private final DistributedLock lock;
        private final EventPublisher eventPublisher;
        
        public void updateDeviceWithConsistency(Device device, UpdateRequest request) {
            // 1. 获取分布式锁
            LockResult lockResult = lock.tryLock(
                "device_update:" + device.getId(),
                Duration.ofSeconds(10)
            );
            
            if (!lockResult.isAcquired()) {
                throw new ConcurrentModificationException("设备正在被其他操作修改");
            }
            
            // 2. 使用Saga模式保证多数据源一致性(在try外创建,便于catch中执行补偿)
            SagaInstance saga = sagaManager.create(new DeviceUpdateSagaData(device.getId()));
            
            try {
                
                // Step 1: 更新主数据库
                sagaManager.step(saga.getId(), "update_postgres", () -> {
                    postgresRepo.updateDevice(device.getId(), request);
                    return StepOutcome.SUCCESS;
                });
                
                // Step 2: 更新缓存
                sagaManager.step(saga.getId(), "update_cache", () -> {
                    redisCache.evict("device:" + device.getId());
                    return StepOutcome.SUCCESS;
                });
                
                // Step 3: 发布事件
                sagaManager.step(saga.getId(), "publish_event", () -> {
                    eventPublisher.publish(new DeviceUpdatedEvent(device.getId(), request));
                    return StepOutcome.SUCCESS;
                });
                
                // Step 4: 更新搜索引擎
                sagaManager.step(saga.getId(), "update_search", () -> {
                    elasticsearchService.indexDevice(device);
                    return StepOutcome.SUCCESS;
                });
                
            } catch (Exception e) {
                // 执行补偿操作
                sagaManager.compensate(saga.getId());
                throw e;
            } finally {
                lock.unlock(lockResult.getLockId());
            }
        }
        
        // 最终一致性事件处理器
        @EventListener
        @Transactional(propagation = Propagation.REQUIRES_NEW)
        public void handleDeviceUpdatedEvent(DeviceUpdatedEvent event) {
            // 更新读模型(CQRS模式)
            DeviceReadModel readModel = deviceReadModelRepository.findByDeviceId(event.getDeviceId());
            if (readModel != null) {
                readModel.applyUpdate(event.getUpdate());
                deviceReadModelRepository.save(readModel);
            }
            
            // 更新缓存
            redisCache.put(
                "device_read_model:" + event.getDeviceId(),
                readModel,
                Duration.ofMinutes(5)
            );
        }
    }
    
    // 缓存策略实现
    @Component
    public class SmartCacheManager {
        
        @Autowired
        private RedisTemplate<String, Object> redisTemplate;
        
        @Autowired
        private CaffeineCacheManager localCacheManager;
        
        // 二级缓存:本地(L1) + Redis(L2)
        public <T> T getWithMultiLevelCache(String key, Class<T> type, 
                                           Supplier<T> loader, Duration ttl) {
            // 1. 检查本地缓存 (Caffeine)
            Cache localCache = localCacheManager.getCache("local_cache");
            T value = localCache.get(key, type);
            
            if (value != null) {
                // 命中本地缓存,更新访问时间
                localCache.put(key, value);
                return value;
            }
            
            // 2. 检查Redis缓存
            String redisKey = "cache:" + key;
            value = (T) redisTemplate.opsForValue().get(redisKey);
            
            if (value != null) {
                // 命中Redis缓存,回填本地缓存
                localCache.put(key, value);
                return value;
            }
            
            // 3. 缓存未命中,加载数据
            synchronized (this) {
                // 双检锁,防止缓存击穿
                value = (T) redisTemplate.opsForValue().get(redisKey);
                if (value != null) {
                    localCache.put(key, value);
                    return value;
                }
                
                // 从数据源加载
                value = loader.get();
                
                if (value != null) {
                    // 写入两级缓存
                    redisTemplate.opsForValue().set(redisKey, value, ttl);
                    localCache.put(key, value);
                }
                
                return value;
            }
        }
        
        // 缓存预热策略
        @Scheduled(cron = "0 0 2 * * ?") // 每天凌晨2点
        public void warmUpCache() {
            // 预加载热门设备数据
            List<String> popularDevices = getPopularDevices(100);
            
            for (String deviceId : popularDevices) {
                String key = "device:" + deviceId;
                Device device = deviceService.getDevice(deviceId);
                
                if (device != null) {
                    redisTemplate.opsForValue().set(
                        key, 
                        device, 
                        Duration.ofHours(6)
                    );
                }
            }
        }
        
        // 缓存淘汰策略
        public void evictCachePattern(String pattern) {
            Set<String> keys = redisTemplate.keys(pattern);
            if (keys != null && !keys.isEmpty()) {
                redisTemplate.delete(keys);
            }
            
            // 清理本地缓存
            localCacheManager.getCache("local_cache").clear();
        }
    }
}

2.2 时序数据存储优化(TimescaleDB)

-- timescale_schema.sql
-- 创建超表(hypertable)存储设备传感器数据
CREATE TABLE sensor_readings (
    time TIMESTAMPTZ NOT NULL,
    device_id VARCHAR(50) NOT NULL,
    sensor_type VARCHAR(20) NOT NULL,
    value DOUBLE PRECISION NOT NULL,
    unit VARCHAR(10),
    quality INTEGER DEFAULT 100,
    metadata JSONB
);

-- 转换为超表,按时间分区
SELECT create_hypertable(
    'sensor_readings',
    'time',
    chunk_time_interval => INTERVAL '1 day',
    if_not_exists => TRUE
);

-- 添加空间分区(按设备ID)
SELECT add_dimension(
    'sensor_readings',
    'device_id',
    number_partitions => 16
);

-- 创建索引优化查询
CREATE INDEX idx_sensor_readings_time_device ON sensor_readings (time DESC, device_id);
CREATE INDEX idx_sensor_readings_device_time ON sensor_readings (device_id, time DESC);
CREATE INDEX idx_sensor_readings_sensor_type ON sensor_readings (sensor_type, time DESC);

-- 创建连续聚合(Continuous Aggregate)优化聚合查询
CREATE MATERIALIZED VIEW sensor_hourly_avg
WITH (timescaledb.continuous) AS
SELECT
    device_id,
    sensor_type,
    time_bucket(INTERVAL '1 hour', time) AS bucket,
    AVG(value) AS avg_value,
    MIN(value) AS min_value,
    MAX(value) AS max_value,
    COUNT(*) AS reading_count
FROM sensor_readings
GROUP BY device_id, sensor_type, bucket;

-- 自动刷新策略
SELECT add_continuous_aggregate_policy(
    'sensor_hourly_avg',
    start_offset => INTERVAL '3 hours',
    end_offset => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour'
);

-- 数据保留策略(自动删除90天前的数据)
SELECT add_retention_policy(
    'sensor_readings',
    INTERVAL '90 days'
);

-- 压缩策略(7天前的数据自动压缩)
ALTER TABLE sensor_readings SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby = 'time DESC'
);

SELECT add_compression_policy(
    'sensor_readings',
    INTERVAL '7 days'
);

// TimescaleRepository.java
@Repository
public class TimescaleRepository {
    
    @Autowired
    private JdbcTemplate jdbcTemplate;
    
    @Autowired
    private NamedParameterJdbcTemplate namedJdbcTemplate;
    
    // 批量插入优化(使用COPY命令)
    public void batchInsertReadings(List<SensorReading> readings) {
        String sql = """
            COPY sensor_readings (time, device_id, sensor_type, value, unit, quality, metadata)
            FROM STDIN WITH (FORMAT CSV)
            """;
        
        try (Connection conn = jdbcTemplate.getDataSource().getConnection()) {
            // 解包出底层的PG连接,COPY协议需要直接访问它
            PGConnection pgConn = conn.unwrap(PGConnection.class);
            
            try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                    new PGCopyOutputStream(pgConn, sql), StandardCharsets.UTF_8))) {
                
                for (SensorReading reading : readings) {
                    String line = String.format("%s,%s,%s,%.4f,%s,%d,%s",
                        reading.getTime().format(DateTimeFormatter.ISO_OFFSET_DATE_TIME),
                        escapeCsv(reading.getDeviceId()),
                        escapeCsv(reading.getSensorType()),
                        reading.getValue(),
                        escapeCsv(reading.getUnit()),
                        reading.getQuality(),
                        escapeJson(reading.getMetadata())
                    );
                    writer.write(line);
                    writer.newLine();
                }
            }
        } catch (SQLException | IOException e) {
            throw new DataAccessResourceFailureException("Failed to copy data", e);
        }
    }
    
    // 时间范围查询优化
    public List<SensorReading> getReadingsInRange(String deviceId, 
                                                 Instant startTime, 
                                                 Instant endTime,
                                                 String sensorType) {
        String sql = """
            SELECT time, device_id, sensor_type, value, unit, quality, metadata
            FROM sensor_readings
            WHERE device_id = :deviceId
              AND time >= :startTime
              AND time < :endTime
              AND (:sensorType IS NULL OR sensor_type = :sensorType)
            ORDER BY time DESC
            LIMIT 10000
            """;
        
        MapSqlParameterSource params = new MapSqlParameterSource()
            .addValue("deviceId", deviceId)
            .addValue("startTime", startTime)
            .addValue("endTime", endTime)
            .addValue("sensorType", sensorType);
        
        return namedJdbcTemplate.query(sql, params, new SensorReadingRowMapper());
    }
    
    // 聚合查询使用连续聚合视图
    public List<HourlyAggregate> getHourlyAggregates(String deviceId, 
                                                    LocalDate date) {
        String sql = """
            SELECT 
                device_id,
                sensor_type,
                bucket,
                avg_value,
                min_value,
                max_value,
                reading_count
            FROM sensor_hourly_avg
            WHERE device_id = :deviceId
              AND bucket >= :startOfDay
              AND bucket < :endOfDay
            ORDER BY bucket
            """;
        
        Instant startOfDay = date.atStartOfDay(ZoneOffset.UTC).toInstant();
        Instant endOfDay = date.plusDays(1).atStartOfDay(ZoneOffset.UTC).toInstant();
        
        MapSqlParameterSource params = new MapSqlParameterSource()
            .addValue("deviceId", deviceId)
            .addValue("startOfDay", startOfDay)
            .addValue("endOfDay", endOfDay);
        
        return namedJdbcTemplate.query(sql, params, new HourlyAggregateRowMapper());
    }
    
    // 下采样查询(降低数据分辨率)
    public List<DownsampledReading> getDownsampledReadings(String deviceId,
                                                          Instant startTime,
                                                          Instant endTime,
                                                          Duration interval) {
        String sql = """
            SELECT
                time_bucket(:interval, time) AS bucket,
                sensor_type,
                AVG(value) AS avg_value,
                COUNT(*) AS sample_count
            FROM sensor_readings
            WHERE device_id = :deviceId
              AND time >= :startTime
              AND time < :endTime
            GROUP BY bucket, sensor_type
            ORDER BY bucket
            """;
        
        MapSqlParameterSource params = new MapSqlParameterSource()
            .addValue("deviceId", deviceId)
            .addValue("startTime", startTime)
            .addValue("endTime", endTime)
            .addValue("interval", interval.toString());
        
        return namedJdbcTemplate.query(sql, params, new DownsampledReadingRowMapper());
    }
    
    // 数据质量检查(按传感器类型逐行返回统计结果)
    public List<DataQualityReport> checkDataQuality(String deviceId,
                                                    Instant startTime,
                                                    Instant endTime) {
        String sql = """
            WITH stats AS (
                SELECT
                    sensor_type,
                    COUNT(*) AS total_readings,
                    SUM(CASE WHEN quality >= 90 THEN 1 ELSE 0 END) AS good_readings,
                    MIN(value) AS min_value,
                    MAX(value) AS max_value,
                    AVG(value) AS avg_value,
                    STDDEV(value) AS stddev_value
                FROM sensor_readings
                WHERE device_id = :deviceId
                  AND time >= :startTime
                  AND time < :endTime
                GROUP BY sensor_type
            ),
            gaps AS (
                SELECT
                    types.sensor_type,
                    EXTRACT(EPOCH FROM (CAST(:endTime AS timestamptz) - CAST(:startTime AS timestamptz))) / 3600 AS hours_covered,
                    COUNT(DISTINCT ts.minute) AS expected_readings,
                    COUNT(DISTINCT ts.minute) FILTER (WHERE sr.time IS NOT NULL) AS actual_readings
                FROM generate_series(:startTime, :endTime, INTERVAL '1 minute') AS ts(minute)
                CROSS JOIN (SELECT DISTINCT sensor_type FROM sensor_readings WHERE device_id = :deviceId) types
                LEFT JOIN sensor_readings sr ON sr.time >= ts.minute
                    AND sr.time < ts.minute + INTERVAL '1 minute'
                    AND sr.sensor_type = types.sensor_type
                    AND sr.device_id = :deviceId
                GROUP BY types.sensor_type
            )
            SELECT
                s.sensor_type,
                s.total_readings,
                s.good_readings,
                s.good_readings::FLOAT / NULLIF(s.total_readings, 0) * 100 AS quality_percentage,
                s.min_value,
                s.max_value,
                s.avg_value,
                s.stddev_value,
                g.hours_covered,
                g.expected_readings,
                g.actual_readings,
                g.actual_readings::FLOAT / NULLIF(g.expected_readings, 0) * 100 AS coverage_percentage
            FROM stats s
            JOIN gaps g ON s.sensor_type = g.sensor_type
            """;
        
        MapSqlParameterSource params = new MapSqlParameterSource()
            .addValue("deviceId", deviceId)
            .addValue("startTime", startTime)
            .addValue("endTime", endTime);
        
        return namedJdbcTemplate.query(sql, params, new DataQualityReportRowMapper());
    }
}

三、并发与分布式处理

3.1 Go实现高并发设备连接管理

// device_connection_manager.go
package main

import (
	"context"
	"errors"
	"fmt"
	"hash/fnv"
	"sort"
	"sync"
	"time"
	
	"github.com/go-redis/redis/v8"
	"go.uber.org/zap"
	"golang.org/x/sync/errgroup"
)

type DeviceConnection struct {
	DeviceID    string
	Protocol    string
	LastSeen    time.Time
	MessageChan chan DeviceMessage
	CancelFunc  context.CancelFunc
}

type DeviceConnectionManager struct {
	connections sync.Map // map[string]*DeviceConnection
	redisClient *redis.Client
	logger      *zap.Logger
	config      *Config
	
	// 连接池
	connPool     map[string]*ConnectionPool
	poolMutex    sync.RWMutex
	
	// 统计
	stats        *ConnectionStats
	statsMutex   sync.RWMutex
}

type ConnectionPool struct {
	pool      chan *DeviceConnection
	maxSize   int
	created   int
	mutex     sync.Mutex
}

type ConnectionStats struct {
	TotalConnections    int64
	ActiveConnections   int64
	MessagesProcessed   int64
	Errors              int64
	AvgResponseTime     time.Duration
}

func NewDeviceConnectionManager(config *Config) *DeviceConnectionManager {
	return &DeviceConnectionManager{
		redisClient: redis.NewClient(&redis.Options{
			Addr:     config.RedisAddr,
			Password: config.RedisPassword,
			DB:       config.RedisDB,
		}),
		logger:     zap.NewExample(),
		config:     config,
		connPool:   make(map[string]*ConnectionPool),
		stats:      &ConnectionStats{},
	}
}

func (m *DeviceConnectionManager) Start() error {
	// 启动连接健康检查
	go m.healthCheckLoop()
	
	// 启动统计报告
	go m.statsReportLoop()
	
	// 启动连接池维护
	go m.poolMaintenanceLoop()
	
	return nil
}

func (m *DeviceConnectionManager) ConnectDevice(deviceID string, protocol string) (*DeviceConnection, error) {
	// 检查是否已达最大连接数
	if !m.checkConnectionLimit() {
		return nil, errors.New("connection limit reached")
	}
	
	// 从连接池获取或创建连接
	conn, err := m.getOrCreateConnection(deviceID, protocol)
	if err != nil {
		return nil, err
	}
	
	// 更新连接状态
	m.connections.Store(deviceID, conn)
	
	// 更新统计
	m.updateStats(func(stats *ConnectionStats) {
		stats.TotalConnections++
		stats.ActiveConnections++
	})
	
	// 启动消息处理协程
	ctx, cancel := context.WithCancel(context.Background())
	conn.CancelFunc = cancel
	
	go m.handleDeviceMessages(ctx, conn)
	
	m.logger.Info("Device connected",
		zap.String("device_id", deviceID),
		zap.String("protocol", protocol))
	
	return conn, nil
}

func (m *DeviceConnectionManager) getOrCreateConnection(deviceID, protocol string) (*DeviceConnection, error) {
	// 先从连接池获取
	if pool, ok := m.getConnectionPool(protocol); ok {
		select {
		case conn := <-pool.pool:
			conn.DeviceID = deviceID
			conn.LastSeen = time.Now()
			return conn, nil
		default:
			// 连接池为空,创建新连接
		}
	}
	
	// 创建新连接
	conn := &DeviceConnection{
		DeviceID:    deviceID,
		Protocol:    protocol,
		LastSeen:    time.Now(),
		MessageChan: make(chan DeviceMessage, m.config.MessageBufferSize),
	}
	
	// 建立实际连接
	if err := m.establishPhysicalConnection(conn); err != nil {
		return nil, err
	}
	
	return conn, nil
}

func (m *DeviceConnectionManager) handleDeviceMessages(ctx context.Context, conn *DeviceConnection) {
	defer func() {
		// 清理资源
		close(conn.MessageChan)
		m.connections.Delete(conn.DeviceID)
		
		// 将连接放回连接池
		m.returnToPool(conn)
		
		// 更新统计
		m.updateStats(func(stats *ConnectionStats) {
			stats.ActiveConnections--
		})
		
		m.logger.Info("Device disconnected",
			zap.String("device_id", conn.DeviceID))
	}()
	
	// 使用错误组管理多个处理协程
	var g errgroup.Group
	
	// 启动消息接收协程
	g.Go(func() error {
		return m.receiveMessages(ctx, conn)
	})
	
	// 启动消息发送协程
	g.Go(func() error {
		return m.sendMessages(ctx, conn)
	})
	
	// 启动心跳协程
	g.Go(func() error {
		return m.heartbeatLoop(ctx, conn)
	})
	
	// 等待所有协程完成
	if err := g.Wait(); err != nil {
		m.logger.Error("Error in device connection",
			zap.String("device_id", conn.DeviceID),
			zap.Error(err))
	}
}

func (m *DeviceConnectionManager) receiveMessages(ctx context.Context, conn *DeviceConnection) error {
	ticker := time.NewTicker(m.config.PollInterval)
	defer ticker.Stop()
	
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
			
		case <-ticker.C:
			// 接收设备消息
			messages, err := m.pollDeviceMessages(conn)
			if err != nil {
				return err
			}
			
			// 批量处理消息
			if len(messages) > 0 {
				if err := m.processMessageBatch(conn, messages); err != nil {
					m.logger.Warn("Failed to process message batch",
						zap.String("device_id", conn.DeviceID),
						zap.Error(err))
				}
				
				// 更新统计
				m.updateStats(func(stats *ConnectionStats) {
					stats.MessagesProcessed += int64(len(messages))
				})
			}
		}
	}
}

func (m *DeviceConnectionManager) sendMessages(ctx context.Context, conn *DeviceConnection) error {
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
			
		case msg := <-conn.MessageChan:
			start := time.Now()
			
			// 发送消息到设备
			if err := m.sendToDevice(conn, msg); err != nil {
				m.logger.Error("Failed to send message to device",
					zap.String("device_id", conn.DeviceID),
					zap.Error(err))
				
				// 重试逻辑
				if m.shouldRetry(msg, err) {
					// 将消息放回队列,稍后重试
					go func() {
						time.Sleep(m.config.RetryDelay)
						select {
						case conn.MessageChan <- msg:
						default:
							m.logger.Error("Failed to retry message, channel full",
								zap.String("device_id", conn.DeviceID))
						}
					}()
				}
				
				m.updateStats(func(stats *ConnectionStats) {
					stats.Errors++
				})
			}
			
			// 更新平均响应时间
			elapsed := time.Since(start)
			m.updateStats(func(stats *ConnectionStats) {
				// 指数加权移动平均
				if stats.AvgResponseTime == 0 {
					stats.AvgResponseTime = elapsed
				} else {
					alpha := 0.1
					stats.AvgResponseTime = time.Duration(
						float64(stats.AvgResponseTime)*(1-alpha) + float64(elapsed)*alpha)
				}
			})
		}
	}
}

func (m *DeviceConnectionManager) processMessageBatch(conn *DeviceConnection, messages []DeviceMessage) error {
	// 使用工作池并发处理消息
	var wg sync.WaitGroup
	errChan := make(chan error, len(messages))
	semaphore := make(chan struct{}, m.config.MaxConcurrentProcesses)
	
	for _, msg := range messages {
		wg.Add(1)
		
		go func(msg DeviceMessage) {
			defer wg.Done()
			
			// 限制并发数
			semaphore <- struct{}{}
			defer func() { <-semaphore }()
			
			// 处理消息
			if err := m.processSingleMessage(conn, msg); err != nil {
				select {
				case errChan <- err:
				default:
					// 错误通道已满,只记录日志
					m.logger.Error("Failed to send error to channel",
						zap.String("device_id", conn.DeviceID),
						zap.Error(err))
				}
			}
		}(msg)
	}
	
	// 等待所有处理完成
	wg.Wait()
	close(errChan)
	
	// 收集错误
	var errors []error
	for err := range errChan {
		errors = append(errors, err)
	}
	
	if len(errors) > 0 {
		return fmt.Errorf("batch processing failed with %d errors", len(errors))
	}
	
	return nil
}

func (m *DeviceConnectionManager) healthCheckLoop() {
	ticker := time.NewTicker(m.config.HealthCheckInterval)
	defer ticker.Stop()
	
	for range ticker.C {
		// 检查所有连接的健康状态
		m.connections.Range(func(key, value interface{}) bool {
			deviceID := key.(string)
			conn := value.(*DeviceConnection)
			
			// 检查连接是否超时
			if time.Since(conn.LastSeen) > m.config.ConnectionTimeout {
				m.logger.Warn("Connection timeout",
					zap.String("device_id", deviceID),
					zap.Duration("idle_time", time.Since(conn.LastSeen)))
				
				// 关闭超时连接
				if conn.CancelFunc != nil {
					conn.CancelFunc()
				}
			}
			
			return true
		})
	}
}

func (m *DeviceConnectionManager) loadBalanceConnection(protocol string) (string, error) {
	// 基于一致性哈希选择后端服务器
	servers := m.getAvailableServers(protocol)
	if len(servers) == 0 {
		return "", errors.New("no available servers")
	}
	
	// 使用一致性哈希算法
	hasher := NewConsistentHasher(servers, m.config.VirtualNodes)
	return hasher.GetNode(protocol), nil
}

// 一致性哈希实现
type ConsistentHasher struct {
	nodes       map[uint32]string
	sortedKeys  []uint32
	virtualNodes int
}

func NewConsistentHasher(nodes []string, virtualNodes int) *ConsistentHasher {
	ch := &ConsistentHasher{
		nodes:       make(map[uint32]string),
		virtualNodes: virtualNodes,
	}
	
	for _, node := range nodes {
		ch.addNode(node)
	}
	
	sort.Slice(ch.sortedKeys, func(i, j int) bool {
		return ch.sortedKeys[i] < ch.sortedKeys[j]
	})
	
	return ch
}

func (ch *ConsistentHasher) addNode(node string) {
	for i := 0; i < ch.virtualNodes; i++ {
		key := ch.hash(fmt.Sprintf("%s#%d", node, i))
		ch.nodes[key] = node
		ch.sortedKeys = append(ch.sortedKeys, key)
	}
}

func (ch *ConsistentHasher) GetNode(key string) string {
	if len(ch.nodes) == 0 {
		return ""
	}
	
	hash := ch.hash(key)
	idx := sort.Search(len(ch.sortedKeys), func(i int) bool {
		return ch.sortedKeys[i] >= hash
	})
	
	if idx == len(ch.sortedKeys) {
		idx = 0
	}
	
	return ch.nodes[ch.sortedKeys[idx]]
}

func (ch *ConsistentHasher) hash(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32()
}

// 容错机制
func (m *DeviceConnectionManager) handleConnectionFailure(conn *DeviceConnection, err error) {
	// 记录故障
	m.logger.Error("Connection failure",
		zap.String("device_id", conn.DeviceID),
		zap.Error(err))
	
	// 故障转移
	if m.config.EnableFailover {
		// 尝试切换到备用服务器
		if backup, err := m.getBackupServer(conn.Protocol); err == nil {
			m.logger.Info("Failing over to backup server",
				zap.String("device_id", conn.DeviceID),
				zap.String("backup", backup))
			
			// 重新建立连接
			if err := m.reestablishConnection(conn, backup); err != nil {
				m.logger.Error("Failed to failover",
					zap.String("device_id", conn.DeviceID),
					zap.Error(err))
			}
		}
	}
	
	// 断路器模式
	if m.shouldTripCircuit(conn.DeviceID, err) {
		m.tripCircuitBreaker(conn.DeviceID)
	}
}

// 断路器实现
type CircuitBreaker struct {
	failures     int
	lastFailure time.Time
	state       CircuitState
	mutex       sync.RWMutex
	threshold   int
	timeout     time.Duration
}

type CircuitState int

const (
	Closed CircuitState = iota
	Open
	HalfOpen
)

func (cb *CircuitBreaker) AllowRequest() bool {
	// Open状态下可能修改state,因此必须持有写锁而不是读锁
	cb.mutex.Lock()
	defer cb.mutex.Unlock()
	
	switch cb.state {
	case Closed:
		return true
	case Open:
		// 超时后进入半开状态,放行一个试探请求
		if time.Since(cb.lastFailure) > cb.timeout {
			cb.state = HalfOpen
			return true
		}
		return false
	case HalfOpen:
		return true
	default:
		return false
	}
}

func (cb *CircuitBreaker) RecordSuccess() {
	cb.mutex.Lock()
	defer cb.mutex.Unlock()
	
	switch cb.state {
	case HalfOpen:
		// 成功请求,关闭断路器
		cb.state = Closed
		cb.failures = 0
	case Closed:
		// 维持关闭状态
		cb.failures = 0
	}
}

func (cb *CircuitBreaker) RecordFailure() {
	cb.mutex.Lock()
	defer cb.mutex.Unlock()
	
	cb.failures++
	cb.lastFailure = time.Now()
	
	if cb.state == HalfOpen {
		// 半开状态下失败,重新打开
		cb.state = Open
	} else if cb.failures >= cb.threshold {
		// 达到阈值,打开断路器
		cb.state = Open
	}
}

3.2 Java实现分布式任务调度

// DistributedTaskScheduler.java
@Component
public class DistributedTaskScheduler {
    
    @Autowired
    private TaskQueueRepository taskQueueRepo;
    
    @Autowired
    private WorkerRegistry workerRegistry;
    
    @Autowired
    private DistributedLock lock;
    
    @Autowired
    private TaskExecutor taskExecutor;
    
    private final ScheduledExecutorService scheduler = 
        Executors.newScheduledThreadPool(Runtime.getRuntime().availableProcessors() * 2);
    
    // 任务分发策略
    public enum DistributionStrategy {
        ROUND_ROBIN,      // 轮询
        LEAST_LOADED,     // 最少负载
        AFFINITY,         // 亲和性(基于数据位置)
        RANDOM,           // 随机
        CONSISTENT_HASH   // 一致性哈希
    }
    
    // 提交任务
    public CompletableFuture<TaskResult> submitTask(Task task, DistributionStrategy strategy) {
        CompletableFuture<TaskResult> future = new CompletableFuture<>();
        
        // 1. 持久化任务
        TaskEntity taskEntity = saveTask(task);
        
        // 2. 根据策略选择worker
        WorkerNode worker = selectWorker(task, strategy);
        
        // 3. 分发任务
        distributeTask(taskEntity, worker).whenComplete((result, error) -> {
            if (error != null) {
                future.completeExceptionally(error);
                // 任务失败,重试或放入死信队列
                handleTaskFailure(taskEntity, error);
            } else {
                future.complete(result);
                // 更新任务状态
                updateTaskStatus(taskEntity.getId(), TaskStatus.COMPLETED);
            }
        });
        
        return future;
    }
    
    // 批量提交任务
    public List<CompletableFuture<TaskResult>> submitBatch(List<Task> tasks, 
                                                          DistributionStrategy strategy) {
        // 分组:根据策略将任务分组到不同的worker
        Map<WorkerNode, List<Task>> groupedTasks = groupTasksByWorker(tasks, strategy);
        
        List<CompletableFuture<TaskResult>> futures = new ArrayList<>();
        
        groupedTasks.forEach((worker, taskList) -> {
            // 批量提交到同一个worker
            BatchTask batch = new BatchTask(taskList);
            CompletableFuture<List<TaskResult>> batchFuture = submitBatchToWorker(batch, worker);
            
            batchFuture.whenComplete((results, error) -> {
                if (error != null) {
                    // 批量失败,拆分为单个任务重试
                    taskList.forEach(task -> {
                        futures.add(submitTask(task, strategy));
                    });
                } else {
                    // 批量成功
                    for (int i = 0; i < results.size(); i++) {
                        futures.add(CompletableFuture.completedFuture(results.get(i)));
                    }
                }
            });
        });
        
        return futures;
    }
    
    // 任务调度器
    @Scheduled(fixedDelay = 1000)
    public void scheduleTasks() {
        // 获取待处理任务
        List<TaskEntity> pendingTasks = taskQueueRepo.findPendingTasks(100);
        
        for (TaskEntity task : pendingTasks) {
            // 使用分布式锁确保任务只被调度一次
            String lockKey = "task_schedule:" + task.getId();
            LockResult lockResult = lock.tryLock(lockKey, Duration.ofSeconds(5));
            
            if (lockResult.isAcquired()) {
                try {
                    // 调度任务
                    scheduleSingleTask(task);
                } finally {
                    lock.unlock(lockResult.getLockId());
                }
            }
        }
    }
    
    private void scheduleSingleTask(TaskEntity task) {
        // 根据任务类型选择执行器
        TaskExecutor executor = selectExecutor(task.getType());
        
        // 检查依赖任务是否完成
        if (!checkDependencies(task)) {
            return;
        }
        
        // 更新任务状态为调度中
        task.setStatus(TaskStatus.SCHEDULED);
        taskQueueRepo.save(task);
        
        // 提交到执行器
        CompletableFuture<TaskResult> future = executor.execute(task);
        
        future.whenComplete((result, error) -> {
            if (error != null) {
                // 任务失败
                handleTaskFailure(task, error);
            } else {
                // 任务成功
                task.setStatus(TaskStatus.COMPLETED);
                task.setResult(result);
                taskQueueRepo.save(task);
                
                // 触发后续任务
                triggerDependentTasks(task);
            }
        });
    }
    
    // 负载均衡:选择worker节点
    private WorkerNode selectWorker(Task task, DistributionStrategy strategy) {
        List<WorkerNode> availableWorkers = workerRegistry.getAvailableWorkers();
        
        if (availableWorkers.isEmpty()) {
            throw new NoAvailableWorkerException("No workers available");
        }
        
        switch (strategy) {
            case ROUND_ROBIN:
                return roundRobinSelect(availableWorkers);
                
            case LEAST_LOADED:
                return leastLoadedSelect(availableWorkers);
                
            case AFFINITY:
                return affinitySelect(availableWorkers, task);
                
            case CONSISTENT_HASH:
                return consistentHashSelect(availableWorkers, task);
                
            case RANDOM:
            default:
                return randomSelect(availableWorkers);
        }
    }
    
    // 最少负载选择
    private WorkerNode leastLoadedSelect(List<WorkerNode> workers) {
        return workers.stream()
            .min(Comparator.comparingInt(WorkerNode::getCurrentLoad))
            .orElseThrow(() -> new IllegalStateException("No workers"));
    }
    
    // 亲和性选择(基于数据位置)
    private WorkerNode affinitySelect(List<WorkerNode> workers, Task task) {
        // 获取任务需要的数据位置
        List<String> dataLocations = task.getDataLocations();
        
        // 寻找有数据缓存的worker
        Optional<WorkerNode> cachedWorker = workers.stream()
            .filter(worker -> worker.hasCachedData(dataLocations))
            .findFirst();
        
        return cachedWorker.orElseGet(() -> randomSelect(workers));
    }
    
    // 一致性哈希选择
    private WorkerNode consistentHashSelect(List<WorkerNode> workers, Task task) {
        ConsistentHasher hasher = new ConsistentHasher(workers, 100); // 100个虚拟节点
        return hasher.getNode(task.getId());
    }
    
    // 任务依赖管理
    private boolean checkDependencies(TaskEntity task) {
        List<String> dependencies = task.getDependencies();
        
        if (dependencies == null || dependencies.isEmpty()) {
            return true;
        }
        
        // 检查所有依赖任务是否完成
        return dependencies.stream()
            .allMatch(depId -> {
                TaskEntity depTask = taskQueueRepo.findById(depId).orElse(null);
                return depTask != null && depTask.getStatus() == TaskStatus.COMPLETED;
            });
    }
    
    // 触发依赖任务
    private void triggerDependentTasks(TaskEntity task) {
        List<TaskEntity> dependentTasks = taskQueueRepo.findByDependenciesContaining(task.getId());
        
        dependentTasks.forEach(depTask -> {
            // 检查是否所有依赖都已完成
            if (checkDependencies(depTask)) {
                depTask.setStatus(TaskStatus.PENDING);
                taskQueueRepo.save(depTask);
            }
        });
    }
    
    // 任务失败处理
    private void handleTaskFailure(TaskEntity task, Throwable error) {
        int retryCount = task.getRetryCount() != null ? task.getRetryCount() : 0;
        
        if (retryCount < task.getMaxRetries()) {
            // 重试任务
            task.setRetryCount(retryCount + 1);
            task.setStatus(TaskStatus.PENDING);
            task.setNextRetryTime(calculateNextRetryTime(retryCount));
            taskQueueRepo.save(task);
            
            log.warn("Task {} failed, will retry (attempt {}/{})", 
                task.getId(), retryCount + 1, task.getMaxRetries(), error);
        } else {
            // 达到最大重试次数,标记为失败
            task.setStatus(TaskStatus.FAILED);
            task.setError(error.getMessage());
            taskQueueRepo.save(task);
            
            // 放入死信队列
            sendToDeadLetterQueue(task, error);
            
            log.error("Task {} failed after {} retries", task.getId(), task.getMaxRetries(), error);
        }
    }
    
    // 计算下一次重试时间(指数退避)
    private Instant calculateNextRetryTime(int retryCount) {
        long delay = (long) Math.pow(2, retryCount) * 1000; // 2^retryCount秒
        delay = Math.min(delay, 300000); // 最大5分钟
        return Instant.now().plusMillis(delay);
    }
    
    // 死信队列处理
    private void sendToDeadLetterQueue(TaskEntity task, Throwable error) {
        DeadLetterMessage dlqMessage = new DeadLetterMessage(
            task.getId(),
            task.getType(),
            error.getMessage(),
            task.getPayload(),
            Instant.now()
        );
        
        // 发送到死信队列(Kafka/RabbitMQ)
        kafkaTemplate.send("task-dead-letter", task.getId(), dlqMessage);
    }
    
    // 任务优先级队列
    @Component
    public class PriorityTaskQueue {
        
        private final PriorityBlockingQueue<TaskEntity> queue = 
            new PriorityBlockingQueue<>(1000, 
                Comparator.comparingInt(TaskEntity::getPriority).reversed()
                    .thenComparing(TaskEntity::getCreatedAt));
        
        private final Map<String, TaskEntity> taskMap = new ConcurrentHashMap<>();
        
        public void addTask(TaskEntity task) {
            queue.offer(task);
            taskMap.put(task.getId(), task);
        }
        
        public TaskEntity pollTask() {
            return queue.poll();
        }
        
        public void updateTaskPriority(String taskId, int newPriority) {
            TaskEntity task = taskMap.get(taskId);
            if (task != null) {
                // 移除并重新添加以更新优先级
                queue.remove(task);
                task.setPriority(newPriority);
                queue.offer(task);
            }
        }
        
        public List<TaskEntity> getHighPriorityTasks(int limit) {
            List<TaskEntity> highPriorityTasks = new ArrayList<>();
            Iterator<TaskEntity> iterator = queue.iterator();
            
            while (iterator.hasNext() && highPriorityTasks.size() < limit) {
                TaskEntity task = iterator.next();
                if (task.getPriority() >= 8) { // 高优先级阈值
                    highPriorityTasks.add(task);
                }
            }
            
            return highPriorityTasks;
        }
    }
    
    // 任务执行监控
    @Component
    public class TaskExecutionMonitor {
        
        private final MeterRegistry meterRegistry;
        private final Map<String, TaskMetrics> taskMetrics = new ConcurrentHashMap<>();
        
        // 通过构造函数注入MeterRegistry,否则final字段无法完成初始化
        public TaskExecutionMonitor(MeterRegistry meterRegistry) {
            this.meterRegistry = meterRegistry;
        }
        
        @EventListener
        public void handleTaskStarted(TaskStartedEvent event) {
            // 记录任务开始时间
            taskMetrics.put(event.getTaskId(), new TaskMetrics(event.getTaskId()));
            
            // 更新指标
            meterRegistry.counter("tasks.started", "type", event.getTaskType()).increment();
        }
        
        @EventListener
        public void handleTaskCompleted(TaskCompletedEvent event) {
            TaskMetrics metrics = taskMetrics.remove(event.getTaskId());
            if (metrics != null) {
                long duration = System.currentTimeMillis() - metrics.getStartTime();
                
                // 记录执行时间
                meterRegistry.timer("tasks.duration", "type", event.getTaskType())
                    .record(duration, TimeUnit.MILLISECONDS);
                
                // 更新成功计数
                meterRegistry.counter("tasks.completed", "type", event.getTaskType()).increment();
            }
        }
        
        @EventListener
        public void handleTaskFailed(TaskFailedEvent event) {
            taskMetrics.remove(event.getTaskId());
            
            // 更新失败计数
            meterRegistry.counter("tasks.failed", "type", event.getTaskType()).increment();
        }
        
        // 实时监控
        public Map<String, Object> getExecutionStats() {
            Map<String, Object> stats = new HashMap<>();
            
            stats.put("activeTasks", taskMetrics.size());
            stats.put("queueSize", taskQueueRepo.countByStatus(TaskStatus.PENDING));
            
            // 按类型统计
            Map<String, Long> tasksByType = taskQueueRepo.countByStatusGroupByType(TaskStatus.PENDING);
            stats.put("tasksByType", tasksByType);
            
            // 性能指标
            stats.put("avgExecutionTime", calculateAverageExecutionTime());
            stats.put("successRate", calculateSuccessRate());
            
            return stats;
        }
    }
}
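
上面的 consistentHashSelect 方法直接使用了 ConsistentHasher,但前文并未给出该类的定义。下面补充一个基于 TreeMap 虚拟节点环的最小示意实现(假设 WorkerNode 提供 getId() 方法返回唯一节点标识,仅作参考,实际实现以项目代码为准):

// ConsistentHasher.java(示意实现)
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ConsistentHasher {
    
    // 哈希环:虚拟节点哈希值 -> 物理节点
    private final TreeMap<Long, WorkerNode> ring = new TreeMap<>();
    private final int virtualNodes;
    
    public ConsistentHasher(List<WorkerNode> nodes, int virtualNodes) {
        this.virtualNodes = virtualNodes;
        nodes.forEach(this::addNode);
    }
    
    // 每个物理节点映射为多个虚拟节点,使任务在节点间分布更均匀
    public void addNode(WorkerNode node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(node.getId() + "#" + i), node);
        }
    }
    
    public void removeNode(WorkerNode node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node.getId() + "#" + i));
        }
    }
    
    // 顺时针找到第一个哈希值不小于key哈希的虚拟节点;越过环尾则回到环首
    public WorkerNode getNode(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("No workers in hash ring");
        }
        Map.Entry<Long, WorkerNode> entry = ring.ceilingEntry(hash(key));
        return entry != null ? entry.getValue() : ring.firstEntry().getValue();
    }
    
    // 取MD5摘要的前8个字节作为64位哈希,比String.hashCode分布更均匀
    private long hash(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}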

// 分布式锁实现(Redis RedLock算法)
@Component
public class RedLockDistributedLock implements DistributedLock {
    
    @Autowired
    private List<RedisTemplate<String, String>> redisTemplates; // 多个Redis实例
    
    private final ThreadLocal<Map<String, LockInfo>> threadLocalLocks = new ThreadLocal<>();
    
    @Override
    public LockResult tryLock(String lockKey, Duration timeout) {
        String lockValue = UUID.randomUUID().toString();
        long endTime = System.currentTimeMillis() + timeout.toMillis();
        
        Map<String, Boolean> lockResults = new HashMap<>();
        
        // 尝试在多数Redis实例上获取锁
        int successCount = 0;
        for (int i = 0; i < redisTemplates.size(); i++) {
            RedisTemplate<String, String> redis = redisTemplates.get(i);
            
            if (tryAcquireLock(redis, lockKey, lockValue, timeout)) {
                successCount++;
                lockResults.put("redis-" + i, true);
            } else {
                lockResults.put("redis-" + i, false);
            }
            
            // 检查是否已超时
            if (System.currentTimeMillis() > endTime) {
                break;
            }
        }
        
        // 检查是否在多数实例上获得了锁
        boolean acquired = successCount > (redisTemplates.size() / 2);
        
        if (acquired) {
            // 记录锁信息
            LockInfo lockInfo = new LockInfo(lockKey, lockValue, System.currentTimeMillis());
            getThreadLocalLocks().put(lockKey, lockInfo);
            
            // 启动锁续期任务
            startLockRenewal(lockKey, lockValue);
            
            return LockResult.success(lockValue, lockResults);
        } else {
            // 释放已获得的锁
            releaseAcquiredLocks(lockKey, lockValue, lockResults);
            return LockResult.failed(lockResults);
        }
    }
    
    private boolean tryAcquireLock(RedisTemplate<String, String> redis, 
                                   String lockKey, String lockValue, 
                                   Duration timeout) {
        // 使用SET命令的NX和PX选项:仅当key不存在时写入,并同时设置过期时间
        Boolean result = redis.execute((RedisCallback<Boolean>) connection ->
            connection.set(
                lockKey.getBytes(),
                lockValue.getBytes(),
                Expiration.milliseconds(timeout.toMillis()),
                RedisStringCommands.SetOption.SET_IF_ABSENT
            ));
        
        // 显式判空,避免自动拆箱导致NPE
        return Boolean.TRUE.equals(result);
    }
    
    private void startLockRenewal(String lockKey, String lockValue) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        
        scheduler.scheduleAtFixedRate(() -> {
            // 续期锁
            renewLock(lockKey, lockValue);
        }, 10, 10, TimeUnit.SECONDS); // 每10秒续期一次
        
        // 记录调度器以便清理
        getThreadLocalLocks().get(lockKey).setRenewalScheduler(scheduler);
    }
    
    private void renewLock(String lockKey, String lockValue) {
        int successCount = 0;
        
        for (RedisTemplate<String, String> redis : redisTemplates) {
            // 检查锁是否仍然属于当前客户端
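            // 注意:这里先GET再EXPIRE并非原子操作,生产环境可参考下方unlock的做法,用Lua脚本完成"校验+续期"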
            String currentValue = redis.opsForValue().get(lockKey);
            if (lockValue.equals(currentValue)) {
                // 续期
                redis.expire(lockKey, 30, TimeUnit.SECONDS);
                successCount++;
            }
        }
        
        // 如果续期失败,清理锁
        if (successCount <= redisTemplates.size() / 2) {
            unlock(lockKey, lockValue);
        }
    }
    
    @Override
    public boolean unlock(String lockKey, String lockValue) {
        Map<String, Boolean> unlockResults = new HashMap<>();
        
        // 在Lua脚本中检查并删除锁
        String luaScript = """
            if redis.call("get", KEYS[1]) == ARGV[1] then
                return redis.call("del", KEYS[1])
            else
                return 0
            end
            """;
        
        int successCount = 0;
        for (int i = 0; i < redisTemplates.size(); i++) {
            RedisTemplate<String, String> redis = redisTemplates.get(i);
            
            Long result = redis.execute((RedisCallback<Long>) connection -> {
                return connection.eval(
                    luaScript.getBytes(),
                    ReturnType.INTEGER,
                    1,
                    lockKey.getBytes(),
                    lockValue.getBytes()
                );
            });
            
            boolean success = result != null && result == 1;
            unlockResults.put("redis-" + i, success);
            
            if (success) {
                successCount++;
            }
        }
        
        // 清理线程本地存储
        LockInfo lockInfo = getThreadLocalLocks().remove(lockKey);
        if (lockInfo != null && lockInfo.getRenewalScheduler() != null) {
            lockInfo.getRenewalScheduler().shutdown();
        }
        
        return successCount > 0;
    }
    
    private Map<String, LockInfo> getThreadLocalLocks() {
        Map<String, LockInfo> locks = threadLocalLocks.get();
        if (locks == null) {
            locks = new HashMap<>();
            threadLocalLocks.set(locks);
        }
        return locks;
    }
    
    @Data
    private static class LockInfo {
        private final String lockKey;
        private final String lockValue;
        private final long lockTime;
        private ScheduledExecutorService renewalScheduler;
    }
}
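
下面给出一个使用上述分布式锁的示意片段:在执行自动化场景前先抢占锁,避免多个服务实例重复执行同一场景。其中 isSuccess()、getLockValue() 以及 runScene() 均为示例写法,具体以项目中 LockResult 的实际定义为准:

// 分布式锁使用示意(SceneExecutionService片段)
import java.time.Duration;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class SceneExecutionService {
    
    private static final Logger log = LoggerFactory.getLogger(SceneExecutionService.class);
    
    @Autowired
    private DistributedLock distributedLock;
    
    public void executeSceneExclusively(String sceneId) {
        String lockKey = "scene:execute:" + sceneId;
        
        // 锁的超时时间应大于场景执行的最长耗时,或依赖上文的自动续期机制
        LockResult result = distributedLock.tryLock(lockKey, Duration.ofSeconds(30));
        if (!result.isSuccess()) {
            log.info("Scene {} is being executed by another instance, skip", sceneId);
            return;
        }
        
        try {
            // 临界区:真正执行场景中的设备控制逻辑
            runScene(sceneId);
        } finally {
            // 无论成功与否都释放锁,lockValue用于校验锁的归属,防止误删他人持有的锁
            distributedLock.unlock(lockKey, result.getLockValue());
        }
    }
    
    private void runScene(String sceneId) {
        // 省略:按场景配置依次下发设备命令
    }
}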

四、安全架构

4.1 C#实现完整安全框架

// SecurityFramework.cs
using System;
using System.Collections.Generic;
using System.IdentityModel.Tokens.Jwt;
using System.IO;
using System.Linq;
using System.Security;
using System.Security.Claims;
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Authentication.JwtBearer;
using Microsoft.AspNetCore.Authorization;
using Microsoft.AspNetCore.Cryptography.KeyDerivation;
using Microsoft.AspNetCore.DataProtection;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Caching.Distributed;
using Microsoft.Extensions.Logging;
using Microsoft.IdentityModel.Tokens;

namespace SmartHome.Security
{
    // 认证服务
    public interface IAuthenticationService
    {
        Task<AuthenticationResult> AuthenticateAsync(string username, string password);
        Task<AuthenticationResult> AuthenticateWithMfaAsync(string username, string password, string mfaCode);
        Task<AuthenticationResult> RefreshTokenAsync(string refreshToken);
        Task RevokeTokenAsync(string token);
    }
    
    public class AuthenticationService : IAuthenticationService
    {
        private readonly IUserRepository _userRepository;
        private readonly ITokenService _tokenService;
        private readonly IPasswordHasher _passwordHasher;
        private readonly IMfaService _mfaService;
        private readonly ILogger<AuthenticationService> _logger;
        private readonly IDistributedCache _cache;
        
        public AuthenticationService(
            IUserRepository userRepository,
            ITokenService tokenService,
            IPasswordHasher passwordHasher,
            IMfaService mfaService,
            ILogger<AuthenticationService> logger,
            IDistributedCache cache)
        {
            _userRepository = userRepository;
            _tokenService = tokenService;
            _passwordHasher = passwordHasher;
            _mfaService = mfaService;
            _logger = logger;
            _cache = cache;
        }
        
        public async Task<AuthenticationResult> AuthenticateAsync(string username, string password)
        {
            // 1. 检查速率限制
            if (!await CheckRateLimitAsync(username))
            {
                return AuthenticationResult.Failed("Too many attempts. Please try again later.");
            }
            
            // 2. 查找用户
            var user = await _userRepository.FindByUsernameAsync(username);
            if (user == null)
            {
                // 防止用户枚举攻击:无论用户是否存在,都使用相同的时间
                await Task.Delay(200); // 恒定时间延迟
                return AuthenticationResult.Failed("Invalid credentials");
            }
            
            // 3. 验证账户状态
            if (!user.IsActive)
            {
                return AuthenticationResult.Failed("Account is deactivated");
            }
            
            // 4. 验证密码
            var passwordHash = await _passwordHasher.HashPasswordAsync(password, user.PasswordSalt);
            if (!ConstantTimeEquals(passwordHash, user.PasswordHash))
            {
                await RecordFailedAttemptAsync(username);
                return AuthenticationResult.Failed("Invalid credentials");
            }
            
            // 5. 检查是否需要MFA
            if (user.RequiresMfa)
            {
                return AuthenticationResult.RequiresMfa(user.Id);
            }
            
            // 6. 生成令牌
            var tokens = await GenerateTokensAsync(user);
            
            // 7. 记录成功登录
            await RecordSuccessfulLoginAsync(user);
            
            return AuthenticationResult.Success(tokens.AccessToken, tokens.RefreshToken, user);
        }
        
        public async Task<AuthenticationResult> AuthenticateWithMfaAsync(string username, string password, string mfaCode)
        {
            // 验证基础凭证
            var authResult = await AuthenticateAsync(username, password);
            if (!authResult.RequiresMfa)
            {
                return authResult;
            }
            
            // 验证MFA代码
            var user = await _userRepository.FindByIdAsync(authResult.UserId);
            if (!await _mfaService.VerifyCodeAsync(user.MfaSecret, mfaCode))
            {
                return AuthenticationResult.Failed("Invalid MFA code");
            }
            
            // 生成令牌
            var tokens = await GenerateTokensAsync(user);
            
            return AuthenticationResult.Success(tokens.AccessToken, tokens.RefreshToken, user);
        }
        
        private async Task<TokenPair> GenerateTokensAsync(User user)
        {
            // 生成访问令牌
            var accessToken = await _tokenService.GenerateAccessTokenAsync(user);
            
            // 生成刷新令牌
            var refreshToken = await _tokenService.GenerateRefreshTokenAsync(user.Id);
            
            // 将刷新令牌存储到缓存
            var cacheKey = $"refresh_token:{refreshToken}";
            var cacheValue = new RefreshTokenCache
            {
                UserId = user.Id,
                CreatedAt = DateTime.UtcNow,
                IsRevoked = false
            };
            
            // IDistributedCache 只接受字节/字符串,先序列化为JSON再写入
            await _cache.SetStringAsync(cacheKey, JsonSerializer.Serialize(cacheValue),
                new DistributedCacheEntryOptions
                {
                    AbsoluteExpirationRelativeToNow = TimeSpan.FromDays(30) // 刷新令牌有效期30天
                });
            
            return new TokenPair(accessToken, refreshToken);
        }
        
        // 恒定时间比较,防止时序攻击
        private bool ConstantTimeEquals(byte[] a, byte[] b)
        {
            if (a.Length != b.Length)
                return false;
                
            var result = 0;
            for (var i = 0; i < a.Length; i++)
            {
                result |= a[i] ^ b[i];
            }
            
            return result == 0;
        }
    }
    
    // 令牌服务
    public interface ITokenService
    {
        Task<string> GenerateAccessTokenAsync(User user);
        Task<string> GenerateRefreshTokenAsync(string userId);
        Task<ClaimsPrincipal> ValidateTokenAsync(string token);
        Task<bool> ValidateRefreshTokenAsync(string refreshToken, string userId);
    }
    
    public class TokenService : ITokenService
    {
        private readonly JwtSettings _jwtSettings;
        private readonly IDataProtector _dataProtector;
        private readonly IDistributedCache _cache;
        private readonly ILogger<TokenService> _logger;
        
        public TokenService(
            JwtSettings jwtSettings,
            IDataProtectionProvider dataProtectionProvider,
            IDistributedCache cache,
            ILogger<TokenService> logger)
        {
            _jwtSettings = jwtSettings;
            _dataProtector = dataProtectionProvider.CreateProtector("RefreshTokens");
            _cache = cache;
            _logger = logger;
        }
        
        public async Task<string> GenerateAccessTokenAsync(User user)
        {
            var tokenHandler = new JwtSecurityTokenHandler();
            var key = Encoding.ASCII.GetBytes(_jwtSettings.Secret);
            
            var claims = new List<Claim>
            {
                new Claim(JwtRegisteredClaimNames.Sub, user.Id),
                new Claim(JwtRegisteredClaimNames.Jti, Guid.NewGuid().ToString()),
                new Claim(JwtRegisteredClaimNames.Iat, 
                    new DateTimeOffset(DateTime.UtcNow).ToUnixTimeSeconds().ToString()),
                new Claim(ClaimTypes.Name, user.Username),
                new Claim(ClaimTypes.Email, user.Email),
                new Claim("user_id", user.Id),
                new Claim("tenant_id", user.TenantId)
            };
            
            // 添加用户角色
            foreach (var role in user.Roles)
            {
                claims.Add(new Claim(ClaimTypes.Role, role));
            }
            
            // 添加权限声明
            foreach (var permission in user.Permissions)
            {
                claims.Add(new Claim("permission", permission));
            }
            
            // 添加设备访问权限
            foreach (var deviceAccess in user.DeviceAccessList)
            {
                claims.Add(new Claim("device_access", 
                    $"{deviceAccess.DeviceId}:{deviceAccess.AccessLevel}"));
            }
            
            var tokenDescriptor = new SecurityTokenDescriptor
            {
                Subject = new ClaimsIdentity(claims),
                Expires = DateTime.UtcNow.AddMinutes(_jwtSettings.AccessTokenExpiryMinutes),
                Issuer = _jwtSettings.Issuer,
                Audience = _jwtSettings.Audience,
                SigningCredentials = new SigningCredentials(
                    new SymmetricSecurityKey(key),
                    SecurityAlgorithms.HmacSha256Signature)
            };
            
            // 添加自定义加密(可选)
            if (_jwtSettings.EncryptTokens)
            {
                var encryptionKey = Encoding.ASCII.GetBytes(_jwtSettings.EncryptionKey);
                tokenDescriptor.EncryptingCredentials = new EncryptingCredentials(
                    new SymmetricSecurityKey(encryptionKey),
                    SecurityAlgorithms.Aes256KW,
                    SecurityAlgorithms.Aes256CbcHmacSha512);
            }
            
            var token = tokenHandler.CreateToken(tokenDescriptor);
            return tokenHandler.WriteToken(token);
        }
        
        public async Task<string> GenerateRefreshTokenAsync(string userId)
        {
            // 生成安全的随机令牌
            var randomNumber = new byte[32];
            using var rng = RandomNumberGenerator.Create();
            rng.GetBytes(randomNumber);
            
            var refreshToken = Convert.ToBase64String(randomNumber);
            
            // 加密存储
            var protectedToken = _dataProtector.Protect(refreshToken);
            
            // 存储到数据库
            await StoreRefreshTokenAsync(userId, protectedToken);
            
            return refreshToken;
        }
        
        public async Task<ClaimsPrincipal> ValidateTokenAsync(string token)
        {
            if (string.IsNullOrEmpty(token))
                return null;
            
            try
            {
                // 检查令牌是否在撤销列表中
                var tokenId = GetTokenId(token);
                if (await IsTokenRevokedAsync(tokenId))
                {
                    _logger.LogWarning("Token has been revoked: {TokenId}", tokenId);
                    return null;
                }
                
                var tokenHandler = new JwtSecurityTokenHandler();
                var key = Encoding.ASCII.GetBytes(_jwtSettings.Secret);
                
                var validationParameters = new TokenValidationParameters
                {
                    ValidateIssuerSigningKey = true,
                    IssuerSigningKey = new SymmetricSecurityKey(key),
                    ValidateIssuer = true,
                    ValidIssuer = _jwtSettings.Issuer,
                    ValidateAudience = true,
                    ValidAudience = _jwtSettings.Audience,
                    ValidateLifetime = true,
                    ClockSkew = TimeSpan.Zero, // 严格时间验证
                    
                    // 安全增强
                    RequireExpirationTime = true,
                    RequireSignedTokens = true,
                    SaveSigninToken = false
                };
                
                if (_jwtSettings.EncryptTokens)
                {
                    var encryptionKey = Encoding.ASCII.GetBytes(_jwtSettings.EncryptionKey);
                    validationParameters.TokenDecryptionKey = 
                        new SymmetricSecurityKey(encryptionKey);
                }
                
                var principal = tokenHandler.ValidateToken(token, validationParameters, 
                    out SecurityToken validatedToken);
                
                // 检查令牌是否在黑名单中(基于jti)
                var jti = principal.FindFirst(JwtRegisteredClaimNames.Jti)?.Value;
                if (!string.IsNullOrEmpty(jti) && await IsTokenBlacklistedAsync(jti))
                {
                    _logger.LogWarning("Token is blacklisted: {Jti}", jti);
                    return null;
                }
                
                return principal;
            }
            catch (SecurityTokenException ex)
            {
                _logger.LogWarning(ex, "Token validation failed");
                return null;
            }
        }
        
        private async Task<bool> IsTokenRevokedAsync(string tokenId)
        {
            var cacheKey = $"revoked_token:{tokenId}";
            var revoked = await _cache.GetStringAsync(cacheKey);
            return revoked != null;
        }
        
        private string GetTokenId(string token)
        {
            try
            {
                var handler = new JwtSecurityTokenHandler();
                var jwtToken = handler.ReadJwtToken(token);
                return jwtToken.Id;
            }
            catch
            {
                // 如果不是JWT令牌,返回哈希值
                using var sha256 = SHA256.Create();
                var hash = sha256.ComputeHash(Encoding.UTF8.GetBytes(token));
                return Convert.ToBase64String(hash);
            }
        }
    }
    
    // 授权策略
    public static class AuthorizationPolicies
    {
        public static void Configure(AuthorizationOptions options)
        {
            // 基于角色的策略
            options.AddPolicy("AdminOnly", policy =>
                policy.RequireRole("admin"));
                
            options.AddPolicy("DeviceManager", policy =>
                policy.RequireRole("admin", "device_manager"));
                
            // 基于声明的策略
            options.AddPolicy("DeviceAccess", policy =>
                policy.RequireClaim("device_access"));
                
            // 自定义需求
            options.AddPolicy("DeviceControl", policy =>
                policy.Requirements.Add(new DeviceControlRequirement()));
                
            options.AddPolicy("TimeRestricted", policy =>
                policy.Requirements.Add(new TimeRestrictedRequirement(
                    TimeSpan.FromHours(9), // 9:00
                    TimeSpan.FromHours(17) // 17:00
                )));
                
            // 组合策略
            options.AddPolicy("SecureDeviceAccess", policy =>
            {
                policy.RequireAuthenticatedUser();
                policy.RequireRole("admin", "operator");
                policy.RequireClaim("mfa_verified", "true");
                policy.Requirements.Add(new IpWhitelistRequirement());
            });
        }
    }
    
    // 自定义授权需求
    public class DeviceControlRequirement : IAuthorizationRequirement
    {
        public string RequiredPermission { get; } = "device.control";
    }
    
    public class DeviceControlHandler : AuthorizationHandler<DeviceControlRequirement>
    {
        private readonly IDevicePermissionService _permissionService;
        
        public DeviceControlHandler(IDevicePermissionService permissionService)
        {
            _permissionService = permissionService;
        }
        
        protected override async Task HandleRequirementAsync(
            AuthorizationHandlerContext context, 
            DeviceControlRequirement requirement)
        {
            if (!context.User.HasClaim(c => c.Type == "user_id"))
                return;
                
            var userId = context.User.FindFirst("user_id")?.Value;
            var deviceId = GetDeviceIdFromResource(context.Resource);
            
            if (string.IsNullOrEmpty(deviceId))
                return;
            
            // 检查用户是否有设备控制权限
            var hasPermission = await _permissionService.CanControlDeviceAsync(userId, deviceId);
            
            if (hasPermission)
            {
                context.Succeed(requirement);
            }
        }
        
        private string GetDeviceIdFromResource(object resource)
        {
            // 从资源中提取设备ID
            if (resource is string str)
                return str;
                
            if (resource is Device device)
                return device.Id;
                
            if (resource is DeviceCommand command)
                return command.DeviceId;
                
            return null;
        }
    }
    
    // 数据加密服务
    public interface IEncryptionService
    {
        Task<string> EncryptAsync(string plaintext, string keyId);
        Task<string> DecryptAsync(string ciphertext, string keyId);
        Task<byte[]> GenerateKeyAsync();
        Task RotateKeyAsync(string keyId);
    }
    
    public class AesEncryptionService : IEncryptionService
    {
        private readonly IKeyVaultService _keyVault;
        private readonly ILogger<AesEncryptionService> _logger;
        
        public AesEncryptionService(IKeyVaultService keyVault, ILogger<AesEncryptionService> logger)
        {
            _keyVault = keyVault;
            _logger = logger;
        }
        
        public async Task<string> EncryptAsync(string plaintext, string keyId)
        {
            if (string.IsNullOrEmpty(plaintext))
                return plaintext;
            
            try
            {
                // 从密钥库获取密钥
                var key = await _keyVault.GetKeyAsync(keyId);
                
                using var aes = Aes.Create();
                aes.Key = key;
                aes.GenerateIV();
                
                using var encryptor = aes.CreateEncryptor(aes.Key, aes.IV);
                using var ms = new MemoryStream();
                
                // 写入IV(不需要保密,但必须是随机的)
                ms.Write(aes.IV, 0, aes.IV.Length);
                
                // 加密数据
                using (var cs = new CryptoStream(ms, encryptor, CryptoStreamMode.Write))
                using (var sw = new StreamWriter(cs))
                {
                    sw.Write(plaintext);
                }
                
                var encrypted = ms.ToArray();
                return Convert.ToBase64String(encrypted);
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Encryption failed for key {KeyId}", keyId);
                throw new SecurityException("Encryption failed", ex);
            }
        }
        
        public async Task<string> DecryptAsync(string ciphertext, string keyId)
        {
            if (string.IsNullOrEmpty(ciphertext))
                return ciphertext;
            
            try
            {
                // 从密钥库获取密钥
                var key = await _keyVault.GetKeyAsync(keyId);
                
                var encrypted = Convert.FromBase64String(ciphertext);
                
                using var aes = Aes.Create();
                aes.Key = key;
                
                // 提取IV(前16字节)
                var iv = new byte[aes.IV.Length];
                Array.Copy(encrypted, 0, iv, 0, iv.Length);
                aes.IV = iv;
                
                // 提取密文
                var cipherData = new byte[encrypted.Length - iv.Length];
                Array.Copy(encrypted, iv.Length, cipherData, 0, cipherData.Length);
                
                using var decryptor = aes.CreateDecryptor(aes.Key, aes.IV);
                using var ms = new MemoryStream(cipherData);
                using var cs = new CryptoStream(ms, decryptor, CryptoStreamMode.Read);
                using var sr = new StreamReader(cs);
                
                return await sr.ReadToEndAsync();
            }
            catch (Exception ex)
            {
                _logger.LogError(ex, "Decryption failed for key {KeyId}", keyId);
                throw new SecurityException("Decryption failed", ex);
            }
        }
        
        public async Task<byte[]> GenerateKeyAsync()
        {
            using var aes = Aes.Create();
            aes.KeySize = 256; // AES-256
            aes.GenerateKey();
            
            return aes.Key;
        }
        
        public async Task RotateKeyAsync(string keyId)
        {
            // 1. 生成新密钥
            var newKey = await GenerateKeyAsync();
            
            // 2. 使用新旧密钥重新加密数据
            await ReEncryptDataAsync(keyId, newKey);
            
            // 3. 更新密钥库
            await _keyVault.UpdateKeyAsync(keyId, newKey);
            
            // 4. 使旧密钥失效
            await _keyVault.DisableKeyAsync($"{keyId}_old");
            
            _logger.LogInformation("Key rotation completed for {KeyId}", keyId);
        }
        
        private async Task ReEncryptDataAsync(string keyId, byte[] newKey)
        {
            // 获取所有使用该密钥加密的数据
            var encryptedData = await GetEncryptedDataByKeyAsync(keyId);
            
            foreach (var data in encryptedData)
            {
                try
                {
                    // 使用旧密钥解密
                    var oldKey = await _keyVault.GetKeyAsync(keyId);
                    var decrypted = await DecryptWithKeyAsync(data.Ciphertext, oldKey);
                    
                    // 使用新密钥加密
                    var reEncrypted = await EncryptWithKeyAsync(decrypted, newKey);
                    
                    // 更新数据库
                    await UpdateEncryptedDataAsync(data.Id, reEncrypted);
                }
                catch (Exception ex)
                {
                    _logger.LogError(ex, "Failed to re-encrypt data {DataId}", data.Id);
                    // 记录错误但继续处理其他数据
                }
            }
        }
    }
    
    // 安全审计
    public interface IAuditService
    {
        Task LogSecurityEventAsync(SecurityEvent securityEvent);
        Task<IEnumerable<SecurityEvent>> GetSecurityEventsAsync(string userId, DateTime? from, DateTime? to);
        Task GenerateSecurityReportAsync(DateTime from, DateTime to);
    }
    
    public class AuditService : IAuditService
    {
        private readonly IAuditRepository _auditRepository;
        private readonly ILogger<AuditService> _logger;
        
        public AuditService(IAuditRepository auditRepository, ILogger<AuditService> logger)
        {
            _auditRepository = auditRepository;
            _logger = logger;
        }
        
        public async Task LogSecurityEventAsync(SecurityEvent securityEvent)
        {
            try
            {
                // 添加审计信息
                securityEvent.Id = Guid.NewGuid().ToString();
                securityEvent.Timestamp = DateTime.UtcNow;
                securityEvent.SourceIp = GetClientIp();
                securityEvent.UserAgent = GetUserAgent();
                
                // 记录到数据库
                await _auditRepository.AddAsync(securityEvent);
                
                // 记录到系统日志
                _logger.LogInformation(
                    "Security event: {EventType} by {UserId} from {SourceIp}",
                    securityEvent.EventType, securityEvent.UserId, securityEvent.SourceIp);
                
                // 检查可疑活动
                if (IsSuspiciousActivity(securityEvent))
                {
                    await TriggerSecurityAlertAsync(securityEvent);
                }
            }
            catch (Exception ex)
            {
                // 审计系统不应影响主业务流程
                _logger.LogError(ex, "Failed to log security event");
            }
        }
        
        private bool IsSuspiciousActivity(SecurityEvent securityEvent)
        {
            // 检测可疑活动模式
            switch (securityEvent.EventType)
            {
                case SecurityEventType.FailedLogin:
                    // 检查失败登录次数
                    return CheckFailedLoginRate(securityEvent.UserId);
                    
                case SecurityEventType.PasswordChange:
                    // 检查异常密码更改
                    return CheckPasswordChangePattern(securityEvent.UserId);
                    
                case SecurityEventType.PrivilegeEscalation:
                    // 特权提升总是可疑的
                    return true;
                    
                case SecurityEventType.DataAccess:
                    // 检查异常数据访问模式
                    return CheckDataAccessPattern(securityEvent);
                    
                default:
                    return false;
            }
        }
        
        private async Task TriggerSecurityAlertAsync(SecurityEvent securityEvent)
        {
            // 发送安全警报
            var alert = new SecurityAlert
            {
                Id = Guid.NewGuid().ToString(),
                EventId = securityEvent.Id,
                Severity = SecuritySeverity.High,
                Description = $"Suspicious activity detected: {securityEvent.EventType}",
                Timestamp = DateTime.UtcNow,
                ActionRequired = true
            };
            
            await _auditRepository.AddAlertAsync(alert);
            
            // 通知安全团队
            await NotifySecurityTeamAsync(alert);
            
            // 如果需要,暂时锁定账户
            if (ShouldLockAccount(securityEvent))
            {
                await LockUserAccountAsync(securityEvent.UserId);
            }
        }
    }
    
    // 安全中间件
    public class SecurityHeadersMiddleware
    {
        private readonly RequestDelegate _next;
        
        public SecurityHeadersMiddleware(RequestDelegate next)
        {
            _next = next;
        }
        
        public async Task InvokeAsync(HttpContext context)
        {
            // 添加安全HTTP头
            context.Response.Headers.Add("X-Content-Type-Options", "nosniff");
            context.Response.Headers.Add("X-Frame-Options", "DENY");
            context.Response.Headers.Add("X-XSS-Protection", "1; mode=block");
            context.Response.Headers.Add("Referrer-Policy", "strict-origin-when-cross-origin");
            context.Response.Headers.Add("Permissions-Policy", 
                "camera=(), microphone=(), geolocation=()");
            
            // 内容安全策略
            var csp = "default-src 'self'; " +
                     "script-src 'self' 'unsafe-inline' 'unsafe-eval' https://cdn.example.com; " +
                     "style-src 'self' 'unsafe-inline'; " +
                     "img-src 'self' data: https://*.example.com; " +
                     "font-src 'self'; " +
                     "connect-src 'self' https://api.example.com; " +
                     "frame-ancestors 'none'; " +
                     "base-uri 'self'; " +
                     "form-action 'self';";
            
            context.Response.Headers.Add("Content-Security-Policy", csp);
            
            // HSTS(仅HTTPS)
            if (context.Request.IsHttps)
            {
                context.Response.Headers.Add("Strict-Transport-Security", 
                    "max-age=31536000; includeSubDomains; preload");
            }
            
            await _next(context);
        }
    }
    
    // 输入验证
    public static class InputValidator
    {
        public static ValidationResult ValidateDeviceCommand(DeviceCommand command)
        {
            var result = new ValidationResult();
            
            // 1. 基础验证
            if (string.IsNullOrWhiteSpace(command.DeviceId))
                result.AddError("DeviceId", "Device ID is required");
            
            if (string.IsNullOrWhiteSpace(command.Command))
                result.AddError("Command", "Command is required");
            
            // 2. 设备ID格式验证
            if (!IsValidDeviceId(command.DeviceId))
                result.AddError("DeviceId", "Invalid device ID format");
            
            // 3. 命令白名单验证
            if (!IsAllowedCommand(command.Command))
                result.AddError("Command", $"Command '{command.Command}' is not allowed");
            
            // 4. 参数验证
            if (command.Parameters != null)
            {
                foreach (var param in command.Parameters)
                {
                    if (!IsValidParameter(param.Key, param.Value))
                    {
                        result.AddError($"Parameters.{param.Key}", 
                            $"Invalid value for parameter '{param.Key}'");
                    }
                }
            }
            
            // 5. 频率限制检查
            if (IsRateLimited(command))
            {
                result.AddError("RateLimit", "Too many requests. Please try again later.");
            }
            
            return result;
        }
        
        private static bool IsValidDeviceId(string deviceId)
        {
            // 设备ID格式:类型-序列号
            var pattern = @"^[A-Z]{3}-\d{8}-[A-Z0-9]{4}$";
            return Regex.IsMatch(deviceId, pattern);
        }
        
        private static bool IsAllowedCommand(string command)
        {
            var allowedCommands = new[]
            {
                "turn_on", "turn_off", "set_brightness", 
                "set_color", "set_temperature", "lock", "unlock"
            };
            
            return allowedCommands.Contains(command.ToLower());
        }
        
        private static bool IsValidParameter(string key, string value)
        {
            // 参数验证逻辑
            switch (key.ToLower())
            {
                case "brightness":
                    if (!int.TryParse(value, out int brightness))
                        return false;
                    return brightness >= 0 && brightness <= 100;
                    
                case "color":
                    var colorPattern = @"^#[0-9A-F]{6}$";
                    return Regex.IsMatch(value, colorPattern, RegexOptions.IgnoreCase);
                    
                case "temperature":
                    if (!int.TryParse(value, out int temp))
                        return false;
                    return temp >= 10 && temp <= 30;
                    
                default:
                    // 通用验证:防止注入攻击
                    return !value.Contains("<script>") && 
                           !value.Contains("--") && 
                           !value.Contains(";") &&
                           value.Length <= 100;
            }
        }
    }
}

五、部署架构

5.1 Kubernetes部署架构(YAML配置)

# k8s-cluster-architecture.yaml
---
# 命名空间定义
apiVersion: v1
kind: Namespace
metadata:
  name: smart-home
  labels:
    name: smart-home
    environment: production

---
# 配置映射
apiVersion: v1
kind: ConfigMap
metadata:
  name: smart-home-config
  namespace: smart-home
data:
  # 应用配置
  application.properties: |
    # 数据库配置
    spring.datasource.url=jdbc:postgresql://postgres-primary.smart-home.svc.cluster.local:5432/smarthome
    spring.datasource.username=${DB_USERNAME}
    spring.datasource.password=${DB_PASSWORD}
    
    # Redis配置
    spring.redis.host=redis-master.smart-home.svc.cluster.local
    spring.redis.port=6379
    
    # Kafka配置
    spring.kafka.bootstrap-servers=kafka-broker.smart-home.svc.cluster.local:9092
    
    # 安全配置
    security.jwt.secret=${JWT_SECRET}
    security.jwt.expiration=86400000
    
    # 监控配置
    management.endpoints.web.exposure.include=health,info,metrics,prometheus
    management.metrics.export.prometheus.enabled=true
    
  # 日志配置
  logback-spring.xml: |
    <?xml version="1.0" encoding="UTF-8"?>
    <configuration>
        <include resource="org/springframework/boot/logging/logback/base.xml"/>
        
        <springProperty scope="context" name="APP_NAME" source="spring.application.name"/>
        <springProperty scope="context" name="LOG_LEVEL" source="logging.level.root" defaultValue="INFO"/>
        
        <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
            <encoder class="net.logstash.logback.encoder.LogstashEncoder">
                <customFields>{"app":"${APP_NAME}","env":"production"}</customFields>
            </encoder>
        </appender>
        
        <root level="${LOG_LEVEL}">
            <appender-ref ref="JSON"/>
        </root>
        
        <logger name="com.smarthome" level="DEBUG" additivity="false">
            <appender-ref ref="JSON"/>
        </logger>
    </configuration>

---
# 密钥管理
apiVersion: v1
kind: Secret
metadata:
  name: smart-home-secrets
  namespace: smart-home
type: Opaque
data:
  # Base64编码的敏感数据
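  # 注意:以下均为演示用的占位值;生产环境应通过KMS/Vault等外部密钥管理注入,不要把真实密钥及其明文注释提交到代码仓库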
  db-username: YWRtaW4=  # admin
  db-password: cGFzc3dvcmQxMjM=  # password123
  jwt-secret: c3VwZXItc2VjcmV0LWtleS1mb3Itand0  # super-secret-key-for-jwt
  encryption-key: dmVyeS1zZWNyZXQtZW5jcnlwdGlvbi1rZXk=  # very-secret-encryption-key

---
# 服务账户和RBAC
apiVersion: v1
kind: ServiceAccount
metadata:
  name: smart-home-sa
  namespace: smart-home

---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: smart-home
  name: pod-reader
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: smart-home
subjects:
- kind: ServiceAccount
  name: smart-home-sa
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io

---
# PostgreSQL StatefulSet(有状态应用)
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-primary
  namespace: smart-home
spec:
  serviceName: postgres-primary
  replicas: 1
  selector:
    matchLabels:
      app: postgres
      role: primary
  template:
    metadata:
      labels:
        app: postgres
        role: primary
    spec:
      serviceAccountName: smart-home-sa
      containers:
      - name: postgres
        image: postgres:14-alpine
        ports:
        - containerPort: 5432
          name: postgres
        env:
        - name: POSTGRES_DB
          value: "smarthome"
        - name: POSTGRES_USER
          valueFrom:
            secretKeyRef:
              name: smart-home-secrets
              key: db-username
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: smart-home-secrets
              key: db-password
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        volumeMounts:
        - name: postgres-data
          mountPath: /var/lib/postgresql/data
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        livenessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - pg_isready -U postgres
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
        readinessProbe:
          exec:
            command:
            - /bin/sh
            - -c
            - pg_isready -U postgres
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 1
        securityContext:
          allowPrivilegeEscalation: false
          runAsNonRoot: true
          runAsUser: 1000
  volumeClaimTemplates:
  - metadata:
      name: postgres-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "fast-ssd"
      resources:
        requests:
          storage: 50Gi

---
# PostgreSQL读副本
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-replica
  namespace: smart-home
spec:
  serviceName: postgres-replica
  replicas: 2
  selector:
    matchLabels:
      app: postgres
      role: replica
  template:
    metadata:
      labels:
        app: postgres
        role: replica
    spec:
      containers:
      - name: postgres
        image: postgres:14-alpine
        ports:
        - containerPort: 5432
          name: postgres
        env:
        - name: POSTGRES_DB
          value: "smarthome"
        - name: POSTGRES_USER
          valueFrom:
            secretKeyRef:
              name: smart-home-secrets
              key: db-username
        - name: POSTGRES_PASSWORD
          valueFrom:
            secretKeyRef:
              name: smart-home-secrets
              key: db-password
        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata
        - name: REPLICATION_USER
          value: "replicator"
        - name: REPLICATION_PASSWORD
          valueFrom:
            secretKeyRef:
              name: smart-home-secrets
              key: db-password
        command:
        - bash
        - "-c"
        - |
          set -e
          # 等待主数据库启动
          until pg_isready -h postgres-primary -p 5432; do
            echo "Waiting for primary database..."
            sleep 2
          done
          
          # 配置复制
          echo "host replication all 0.0.0.0/0 md5" >> "$PGDATA/pg_hba.conf"
          
          # 启动从库
          gosu postgres postgres \
            -c listen_addresses='*' \
            -c wal_level=replica \
            -c max_wal_senders=10 \
            -c max_replication_slots=10 \
            -c hot_standby=on \
            -c primary_conninfo='host=postgres-primary port=5432 user=$(REPLICATION_USER) password=$(REPLICATION_PASSWORD)'
        volumeMounts:
        - name: postgres-data
          mountPath: /var/lib/postgresql/data
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
  volumeClaimTemplates:
  - metadata:
      name: postgres-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "fast-ssd"
      resources:
        requests:
          storage: 50Gi

---
# 设备服务部署
apiVersion: apps/v1
kind: Deployment
metadata:
  name: device-service
  namespace: smart-home
  labels:
    app: device-service
    version: v1.0.0
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  selector:
    matchLabels:
      app: device-service
  template:
    metadata:
      labels:
        app: device-service
        version: v1.0.0
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/actuator/prometheus"
    spec:
      serviceAccountName: smart-home-sa
      containers:
      - name: device-service
        image: smart-home/device-service:1.0.0
        imagePullPolicy: Always
        ports:
        - containerPort: 8080
          name: http
        - containerPort: 8081
          name: management
        env:
        - name: SPRING_PROFILES_ACTIVE
          value: "production"
        - name: JAVA_OPTS
          value: "-Xmx1g -Xms512m -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+UseStringDeduplication"
        envFrom:
        - configMapRef:
            name: smart-home-config
        - secretRef:
            name: smart-home-secrets
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 60
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 30
          periodSeconds: 5
          timeoutSeconds: 1
          failureThreshold: 3
        startupProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8081
            scheme: HTTP
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 1
          failureThreshold: 30
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 30"]
        securityContext:
          allowPrivilegeEscalation: false
          runAsNonRoot: true
          runAsUser: 1000
          capabilities:
            drop:
            - ALL
          readOnlyRootFilesystem: true
      # Pod反亲和性:确保Pod分布在不同的节点上
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - device-service
              topologyKey: kubernetes.io/hostname
      # 节点选择器:选择特定标签的节点
      nodeSelector:
        node-type: compute-optimized
      tolerations:
      - key: "dedicated"
        operator: "Equal"
        value: "smart-home"
        effect: "NoSchedule"
      # 优先级
      priorityClassName: high-priority

---
# 水平Pod自动扩展器
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: device-service-hpa
  namespace: smart-home
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: device-service
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: 100
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
      - type: Percent
        value: 50
        periodSeconds: 60
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 180
      selectPolicy: Min

---
# 服务网格配置(Istio)
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: device-service-vs
  namespace: smart-home
spec:
  hosts:
  - device-service.smart-home.svc.cluster.local
  - api.smarthome.example.com
  gateways:
  - smart-home-gateway
  http:
  - match:
    - uri:
        prefix: /api/devices
    route:
    - destination:
        host: device-service.smart-home.svc.cluster.local
        port:
          number: 8080
      weight: 100
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: gateway-error,connect-failure,refused-stream
    timeout: 10s
    corsPolicy:
      allowOrigins:
      - exact: https://app.smarthome.example.com
      allowMethods:
      - GET
      - POST
      - PUT
      - DELETE
      allowHeaders:
      - authorization
      - content-type
      maxAge: 24h

---
# 服务网格目标规则
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: device-service-dr
  namespace: smart-home
spec:
  host: device-service.smart-home.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: LEAST_CONN
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 30ms
      http:
        http1MaxPendingRequests: 10
        http2MaxRequests: 50
        maxRequestsPerConnection: 10
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
  subsets:
  - name: v1
    labels:
      version: v1.0.0
    trafficPolicy:
      loadBalancer:
        simple: ROUND_ROBIN

---
# 网络策略
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: device-service-network-policy
  namespace: smart-home
spec:
  podSelector:
    matchLabels:
      app: device-service
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: istio-system
    - podSelector:
        matchLabels:
          app: api-gateway
    ports:
    - protocol: TCP
      port: 8080
    - protocol: TCP
      port: 8081
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: smart-home
    ports:
    - protocol: TCP
      port: 5432
    - protocol: TCP
      port: 6379
    - protocol: TCP
      port: 9092
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 10.0.0.0/8
        - 172.16.0.0/12
        - 192.168.0.0/16
    ports:
    - protocol: TCP
      port: 443
    - protocol: TCP
      port: 80

---
# PodDisruptionBudget(确保服务可用性)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: device-service-pdb
  namespace: smart-home
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: device-service

---
# 服务
apiVersion: v1
kind: Service
metadata:
  name: device-service
  namespace: smart-home
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
spec:
  selector:
    app: device-service
  ports:
  - name: http
    port: 80
    targetPort: 8080
    protocol: TCP
  - name: management
    port: 8081
    targetPort: 8081
    protocol: TCP
  type: ClusterIP

---
# Ingress(API网关)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: smart-home-ingress
  namespace: smart-home
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  tls:
  - hosts:
    - api.smarthome.example.com
    secretName: smarthome-tls-secret
  rules:
  - host: api.smarthome.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-gateway
            port:
              number: 80

---
# 监控配置(Prometheus)
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: device-service-monitor
  namespace: smart-home
  labels:
    release: prometheus-stack
spec:
  selector:
    matchLabels:
      app: device-service
  endpoints:
  - port: http
    interval: 30s
    path: /actuator/prometheus
    scrapeTimeout: 10s
  - port: management
    interval: 30s
    path: /actuator/metrics
    scrapeTimeout: 10s
  namespaceSelector:
    matchNames:
    - smart-home

---
# 告警规则
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: device-service-alerts
  namespace: smart-home
  labels:
    release: prometheus-stack
spec:
  groups:
  - name: device-service
    rules:
    - alert: HighErrorRate
      expr: |
        rate(http_server_requests_seconds_count{status=~"5.."}[5m]) 
        / 
        rate(http_server_requests_seconds_count[5m]) > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High error rate on device service"
        description: "Error rate is {{ $value | humanizePercentage }}"
        
    - alert: HighLatency
      expr: |
        histogram_quantile(0.95, 
          rate(http_server_requests_seconds_bucket[5m])
        ) > 1
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High latency on device service"
        description: "95th percentile latency is {{ $value }}s"
        
    - alert: ServiceDown
      expr: |
        up{job="device-service"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Device service is down"
        description: "Service has been down for more than 1 minute"

---
# Backup (Velero): a recurring daily backup needs a Schedule rather than a one-shot Backup
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: smart-home-daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"  # every day at 02:00
  template:
    includedNamespaces:
    - smart-home
    excludedResources:
    - events
    - events.events.k8s.io
    storageLocation: default
    ttl: 720h # 30 days
    hooks:
      resources:
      - name: postgres-backup-hook
        includedNamespaces:
        - smart-home
        labelSelector:
          matchLabels:
            app: postgres
        pre:
        - exec:
            command:
            - /bin/bash
            - -c
            - |
              # Note: /tmp is usually ephemeral; the dump should land on a volume that
              # Velero actually backs up so it is captured between the pre and post hooks.
              pg_dump -h localhost -U "$POSTGRES_USER" "$POSTGRES_DB" > /tmp/backup.sql
              gzip -f /tmp/backup.sql
            container: postgres
            timeout: 5m
        post:
        - exec:
            command:
            - /bin/rm
            - -f
            - /tmp/backup.sql.gz
            container: postgres

---
# GitOps (ArgoCD Application)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: smart-home-app
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/smart-home/kubernetes-manifests.git
    targetRevision: HEAD
    path: production
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: smart-home
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    - CreateNamespace=true
    - ApplyOutOfSyncOnly=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

5.2 Network Topology Diagram

@startuml SmartHomeNetworkTopology
title 智能家居系统网络拓扑架构

cloud "公有云 (AWS/Azure)" {
    
    rectangle "VPC: 10.0.0.0/16" {
        
        rectangle "公有子网 (Public Subnet)" {
            node "Internet Gateway" as IGW
            node "NAT Gateway" as NAT
            
            rectangle "API网关层" {
                [ELB/ALB] as ELB
                [API Gateway] as APIGW
                [WAF] as WAF
                [CDN Edge] as CDN
            }
        }
        
        rectangle "私有子网 (Private Subnet A)" {
            rectangle "应用服务层" {
                node "K8s Node Group 1" as K8S1
                node "K8s Node Group 2" as K8S2
                
                K8S1 : device-service\nuser-service\nscene-service
                K8S2 : energy-service\nsecurity-service\nnotification-service
            }
        }
        
        rectangle "私有子网 (Private Subnet B)" {
            rectangle "数据层" {
                database "PostgreSQL Primary" as PGP
                database "PostgreSQL Replica 1" as PGR1
                database "PostgreSQL Replica 2" as PGR2
                database "Redis Cluster" as REDIS
                database "Kafka Cluster" as KAFKA
                database "MongoDB Shards" as MONGO
            }
        }
        
        rectangle "管理子网 (Management Subnet)" {
            rectangle "运维监控层" {
                node "Prometheus" as PROM
                node "Grafana" as GRAF
                node "Elasticsearch" as ES
                node "Kibana" as KIB
                node "Jaeger" as JAEG
            }
            
            rectangle "CI/CD层" {
                node "Jenkins" as JENK
                node "ArgoCD" as ARGO
                node "GitLab" as GIT
            }
        }
    }
}

rectangle "边缘计算层" {
        node "边缘网关 1" as EDGE1
        node "边缘网关 2" as EDGE2
        node "边缘网关 3" as EDGE3
        
        EDGE1 : K3s集群\n设备适配器\n本地缓存
        EDGE2 : K3s集群\n设备适配器\n本地缓存
        EDGE3 : K3s集群\n设备适配器\n本地缓存
    }
    
    rectangle "设备层" {
        folder "Zigbee设备" as ZIGBEE
        folder "WiFi设备" as WIFI
        folder "蓝牙设备" as BLE
        folder "Modbus设备" as MODBUS
    }
    
    rectangle "客户端层" {
        node "Web浏览器" as WEB
        node "移动App" as MOBILE
        node "桌面应用" as DESKTOP
        node "语音助手" as VOICE
    }

' 网络连接
WEB -> CDN : HTTPS
MOBILE -> CDN : HTTPS
DESKTOP -> CDN : HTTPS
VOICE -> CDN : HTTPS

CDN -> WAF : 流量清洗
WAF -> ELB : 负载均衡
ELB -> APIGW : 请求路由

APIGW -> K8S1 : 服务发现
APIGW -> K8S2 : 服务发现

K8S1 -> PGP : 主数据库
K8S1 -> PGR1 : 读副本
K8S1 -> REDIS : 缓存
K8S1 -> KAFKA : 消息队列

PGR1 .up.> PGP : 数据同步
PGR2 .up.> PGP : 数据同步

EDGE1 -> KAFKA : 设备事件
EDGE2 -> KAFKA : 设备事件
EDGE3 -> KAFKA : 设备事件

ZIGBEE -> EDGE1 : Zigbee协议
WIFI -> EDGE1 : MQTT/HTTP
BLE -> EDGE2 : 蓝牙协议
MODBUS -> EDGE3 : Modbus TCP

K8S1 -> PROM : 监控指标
K8S2 -> PROM : 监控指标
PGP -> PROM : 数据库指标

PROM -> GRAF : 数据源
ES -> GRAF : 日志数据

JENK -> K8S1 : 部署
ARGO -> K8S2 : GitOps

note right of VPC
  网络安全组规则:
  - 应用层: 仅允许来自API网关的流量
  - 数据层: 仅允许来自应用层的流量
  - 管理层: VPN/跳板机访问
  - 边缘层: 双向TLS认证
end note

@enduml
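
As a minimal sketch of the "data layer only accepts traffic from the application layer" rule in the note above: assuming the data stores are also deployed in the smart-home namespace of the cluster and carry a hypothetical tier: data label, with application pods labelled tier: app (neither label is defined elsewhere in this document; managed cloud databases would use security groups instead), the rule could be expressed as a NetworkPolicy:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: data-tier-allow-app-only   # hypothetical name
  namespace: smart-home
spec:
  podSelector:
    matchLabels:
      tier: data        # assumed label on PostgreSQL/Redis/Kafka/MongoDB pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          tier: app     # assumed label on the application-layer services
    ports:
    - protocol: TCP
      port: 5432        # PostgreSQL
    - protocol: TCP
      port: 6379        # Redis
    - protocol: TCP
      port: 9092        # Kafka
    - protocol: TCP
      port: 27017       # MongoDB

The management-subnet and edge-layer rules in the note would follow the same pattern with their own selectors.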

六、Summary: Key Decision Framework

6.1 Decision Checklists

Technology selection checklist:
  Programming language:
    - [ ] Team skill match above 70%
    - [ ] Maturity of the community ecosystem
    - [ ] Long-term maintainability
    - [ ] Performance meets requirements
    - [ ] Supply of candidates in the hiring market

  Database:
    - [ ] Fit between the data model and the database model
    - [ ] Scalability requirements assessed
    - [ ] Consistency requirements
    - [ ] Operational complexity
    - [ ] Cost budget

  Middleware:
    - [ ] Compatibility with the existing technology stack
    - [ ] Community support
    - [ ] Enterprise-grade features
    - [ ] Monitoring and management tooling

  Deployment platform:
    - [ ] Operations team skills
    - [ ] Compliance requirements
    - [ ] Cost-benefit analysis
    - [ ] Disaster-recovery capability

Data management checklist:
  - [ ] Data classification and lifecycle management
  - [ ] Backup and recovery strategy
  - [ ] Data consistency model
  - [ ] Caching strategy and invalidation mechanism
  - [ ] Data security and privacy protection

Concurrency checklist:
  - [ ] Concurrency model (threads / coroutines / events)
  - [ ] Load-balancing strategy
  - [ ] Failover and recovery mechanisms
  - [ ] Rate limiting and circuit breaking (see the KongPlugin sketch after these checklists)
  - [ ] Monitoring and alerting

Security architecture checklist:
  - [ ] Authentication and authorization mechanisms
  - [ ] Data encryption (in transit and at rest)
  - [ ] Input validation and output encoding
  - [ ] Security auditing and logging
  - [ ] Vulnerability management and patching strategy

Deployment architecture checklist:
  - [ ] High-availability design
  - [ ] Scalability design (see the HPA sketch after these checklists)
  - [ ] Network and security isolation
  - [ ] Monitoring and operations toolchain
  - [ ] Disaster recovery plan
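
For the "scalability design" item, a hedged sketch of horizontal autoscaling for device-service, assuming it runs as a Deployment of the same name (the Deployment manifest is not shown in this document) and using placeholder thresholds:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: device-service-hpa        # hypothetical name
  namespace: smart-home
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: device-service          # assumed Deployment name, matching the Service selector above
  minReplicas: 3                  # stays above the PodDisruptionBudget's minAvailable: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70    # placeholder scaling threshold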

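For the "rate limiting and circuit breaking" item, a sketch of gateway-level rate limiting, assuming Kong is run through the Kong Ingress Controller so its KongPlugin CRD is available; the limit values are placeholders:

apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: device-api-rate-limit     # hypothetical name
  namespace: smart-home
plugin: rate-limiting
config:
  minute: 600      # placeholder: requests per client per minute
  policy: local    # per-node counters; "redis" would give cluster-wide limits

The plugin would then be attached to an Ingress or Service via the konghq.com/plugins: device-api-rate-limit annotation; circuit breaking itself stays inside the services (e.g. Resilience4j, as noted earlier in the document).
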
6.2 Recommended Architecture Patterns

Given the characteristics of a smart-home system, the following combination of architecture patterns is recommended:

  1. Microservices + event-driven: handles device events and scene triggers (see the topic sketch after this list)
  2. CQRS + event sourcing: keeps data consistent and auditable
  3. Service mesh + API gateway: unified service governance and security policy
  4. Multi-level caching + CDN: optimizes read performance and user experience
  5. Kubernetes + GitOps: declarative deployment and automated operations
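
For pattern 1, the device-event stream would normally be backed by a dedicated Kafka topic. A minimal sketch, assuming Kafka is managed by the Strimzi operator (an assumption, not something this document specifies) with a hypothetical cluster name smart-home-kafka; sizing values are placeholders:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: device-events                     # hypothetical topic name
  namespace: smart-home
  labels:
    strimzi.io/cluster: smart-home-kafka  # hypothetical Strimzi cluster name
spec:
  partitions: 12            # upper bound on parallel consumers (placeholder)
  replicas: 3               # tolerates the loss of one broker
  config:
    retention.ms: 604800000 # keep raw device events for 7 days
    cleanup.policy: delete

The partition count bounds how many consumer instances can process device events in parallel, which ties back to the Kafka-partition parallelism discussed earlier.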

These decisions should be adapted to the concrete business requirements, team capabilities, budget constraints, and compliance obligations. A progressive architecture-evolution strategy is recommended: start small to validate the approach, then extend and refine it step by step.
