Orchestrator Failover Source Code Analysis, Part I

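This article walks through the orchestrator source code that handles a master failure. It simulates a failure of the 3307 cluster, examines the logs produced after the master is shut down, and shows how orchestrator detects that the master is unreachable and begins recovery. Tracing the source, it follows the path from the call to `executeCheckAndRecoverFunction`, through the execution of `CheckAndRecover`, to the logic that distinguishes the `UnreachableMaster` and `DeadMaster` states, and explains how orchestrator decides when to perform an actual failover.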

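Simulating the failure

In a test environment, simulate a failure of the 3307 cluster.

Role     IP             Port  Hostname
Master   172.16.120.10  3307  centos-1
Replica  172.16.120.11  3307  centos-2
Replica  172.16.120.12  3307  centos-3

Shut down the 3307 master, 172.16.120.10:3307: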

[2022-04-25 13:10:56][root@centos-1 13:10:56 ~]
[2022-04-25 13:11:22]#systemctl stop mysql3307


MySQL log:
2022-04-25T13:11:35.959667+08:00 0 [Note] /usr/local/mysql5732/bin/mysqld: Shutdown complete

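Source code analysis

My approach is to use the logs to find the entry point.
In what follows:

  • the master is referred to as centos-1
  • the two replicas are referred to as:
    • centos-2
    • centos-3

orchestrator log: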
[mysql] 2022/04/25 13:11:27 packets.go:37: unexpected EOF
2022-04-25 13:11:27 ERROR invalid connection
2022-04-25 13:11:27 ERROR ReadTopologyInstance(172.16.120.10:3307) show global status like 'Uptime': Error 1053: Server shutdown in progress
[mysql] 2022/04/25 13:11:27 packets.go:37: unexpected EOF
2022-04-25 13:11:27 ERROR invalid connection
[mysql] 2022/04/25 13:11:27 packets.go:37: unexpected EOF
2022-04-25 13:11:27 ERROR invalid connection
2022-04-25 13:11:27 ERROR dial tcp 172.16.120.10:3307: connect: connection refused
2022-04-25 13:11:28 ERROR dial tcp 172.16.120.10:3307: connect: connection refused
2022-04-25 13:11:28 DEBUG writeInstance: will not update database_instance due to error: invalid connection
2022-04-25 13:11:32 WARNING DiscoverInstance(172.16.120.10:3307) instance is nil in 0.104s (Backend: 0.001s, Instance: 0.103s), error=dial tcp 172.16.120.10:3307: connect: connection refused
2022-04-25 13:11:33 DEBUG analysis: ClusterName: 172.16.120.10:3307, IsMaster: true, LastCheckValid: false, LastCheckPartialSuccess: false, CountReplicas: 2, CountValidReplicas: 2, CountValidReplicatingReplicas: 2, CountLaggingReplicas: 0, CountDelayedReplicas: 0, CountReplicasFailingToConnectToMaster: 0
2022-04-25 13:11:33 INFO executeCheckAndRecoverFunction: proceeding with UnreachableMaster detection on 172.16.120.10:3307; isActionable?: false; skipProcesses: false
2022-04-25 13:11:33 INFO topology_recovery: detected UnreachableMaster failure on 172.16.120.10:3307
2022-04-25 13:11:33 INFO topology_recovery: Running 1 OnFailureDetectionProcesses hooks

After centos-1 is shut down, the log shows that:

  • several of orchestrator's probes against centos-1 failed
  • orchestrator logged executeCheckAndRecoverFunction: proceeding with UnreachableMaster
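The second message gives us the "entry point".

A global search for executeCheckAndRecoverFunction: proceeding with leads to the function executeCheckAndRecoverFunction.

The function is fairly long, so set it aside for now and first find out who calls executeCheckAndRecoverFunction.

Searching for executeCheckAndRecoverFunction( turns up CheckAndRecover.

Searching for CheckAndRecover in turn shows that it is called by ContinuousDiscovery.

ContinuousDiscovery lives in the logic package. It is called by app.standardHttp, which is called by app.Http, which is called when orchestrator starts: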

go/cmd/orchestrator/main.go

// Excerpt

    switch {
    case helpTopic != "":
        app.HelpCommand(helpTopic)
    case len(flag.Args()) == 0 || flag.Arg(0) == "cli":
        app.CliWrapper(*command, *strict, *instance, *destination, *owner, *reason, *duration, *pattern, *clusterAlias, *pool, *hostnameFlag)
    case flag.Arg(0) == "http":
        app.Http(*discovery)
    default:
        fmt.Fprintln(os.Stderr, `Usage:
  orchestrator --options... [cli|http]
See complete list of commands:
  orchestrator -c help
Full blown documentation:
  orchestrator`)
        os.Exit(1)
    }
}

In other words, once orchestrator is started with the following command:

orchestrator -config orchestrator.conf.json -debug http

the call chain is app.Http -> app.standardHttp -> go logic.ContinuousDiscovery()
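
The shape of that startup chain can be sketched with stand-in functions. This is only a toy model: callChain and the closures inside it are hypothetical, the check on the discovery flag is an assumption, and the wait on the started channel exists only to make the trace deterministic. The point it illustrates is that the last hop is launched with the go keyword, so discovery runs concurrently with whatever the HTTP side does next.

```go
package main

import (
	"fmt"
	"strings"
)

// callChain replays the startup path with stand-in functions and returns
// the order in which they were entered.
func callChain(discovery bool) []string {
	trace := make(chan string, 3)
	started := make(chan struct{})

	continuousDiscovery := func() { // stand-in for logic.ContinuousDiscovery
		trace <- "ContinuousDiscovery"
		close(started)
	}
	standardHTTP := func(discovery bool) { // stand-in for standardHttp
		trace <- "standardHttp"
		if discovery { // assumption: discovery starts only when the flag is set
			go continuousDiscovery() // runs in its own goroutine
			<-started                // sketch only: wait so the trace is ordered
		}
		// a real server would go on to serve HTTP requests here
	}
	httpEntry := func(discovery bool) { // stand-in for Http
		trace <- "Http"
		standardHTTP(discovery)
	}

	httpEntry(discovery)
	close(trace)
	var order []string
	for name := range trace {
		order = append(order, name)
	}
	return order
}

func main() {
	fmt.Println(strings.Join(callChain(true), " -> "))
	// prints: Http -> standardHttp -> ContinuousDiscovery
}
```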

What ContinuousDiscovery does

Continuous discovery

// ContinuousDiscovery starts an asynchronuous infinite discovery process where instances are
// periodically investigated and their status captured, and long since unseen instances are  
// purged and forgotten.
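In other words, ContinuousDiscovery launches an asynchronous discovery process that never stops: it periodically probes each instance and records its state, and it purges instances that have not been seen for a long time.

Note that the comment misspells asynchronous as asynchronuous.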
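
A common Go shape for such a never-ending periodic process is a for/select loop over tick channels. The sketch below is an assumption about the general pattern, not a copy of ContinuousDiscovery: runLoop, simulate, and the two callbacks are hypothetical names, and the split into a "discovery" tick and a "recovery" tick (where something like CheckAndRecover would run) is a guess based on the call chain traced earlier.

```go
package main

import (
	"fmt"
	"time"
)

// runLoop reacts to whichever tick fires next until stop is closed.
func runLoop(discoveryTick, recoveryTick <-chan time.Time, stop <-chan struct{},
	discover, checkAndRecover func()) {
	for {
		select {
		case <-discoveryTick:
			discover() // e.g. queue instances for probing
		case <-recoveryTick:
			checkAndRecover() // e.g. analyse the topology and react
		case <-stop:
			return
		}
	}
}

// simulate feeds the loop a fixed number of hand-made ticks, so that the
// result is deterministic, and reports how often each callback ran.
func simulate(discoveryTicks, recoveryTicks int) (discoveries, recoveries int) {
	discoveryTick := make(chan time.Time)
	recoveryTick := make(chan time.Time)
	stop := make(chan struct{})
	finished := make(chan struct{})

	go func() {
		defer close(finished)
		runLoop(discoveryTick, recoveryTick, stop,
			func() { discoveries++ },
			func() { recoveries++ })
	}()

	for i := 0; i < discoveryTicks; i++ {
		discoveryTick <- time.Now()
	}
	for i := 0; i < recoveryTicks; i++ {
		recoveryTick <- time.Now()
	}
	close(stop)
	<-finished // the counters are safe to read once the loop has exited
	return discoveries, recoveries
}

func main() {
	d, r := simulate(3, 1)
	fmt.Printf("discoveries=%d recoveries=%d\n", d, r)
	// prints: discoveries=3 recoveries=1
}
```

In a long-running process the two channels would typically come from time.NewTicker(interval).C rather than being fed by hand.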

ContinuousDiscovery first starts a goroutine:

func ContinuousDiscovery() {
   ...
   go handleDiscoveryRequests()

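handleDiscoveryRequests was covered in Orchestrator Discover Source Code Analysis.
handleDiscoveryRequests iterates over the discoveryQueue channel and calls DiscoverInstance on each entry. DiscoverInstance in turn calls ReadTopologyInstanceBufferable, which actually connects to the MySQL instance, collects its metrics and parameters, and finally writes the result to database_instance.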
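
The consumer side of that queue can be sketched as follows. This is a minimal model with a plain Go channel and a single worker; the names instanceKey, discoverAll, and the callback are hypothetical, and the real discoveryQueue and DiscoverInstance are more involved (the latter connects to MySQL and writes to database_instance, which the callback here only imitates by recording the key).

```go
package main

import "fmt"

// instanceKey identifies a MySQL instance; a hypothetical stand-in type.
type instanceKey struct {
	Hostname string
	Port     int
}

// handleDiscoveryRequests drains the queue and hands every entry to
// discover, which stands in for DiscoverInstance.
func handleDiscoveryRequests(queue <-chan instanceKey, discover func(instanceKey)) {
	for key := range queue {
		discover(key)
	}
}

// discoverAll pushes keys onto a queue, lets one worker process them, and
// returns the instances in the order they were "discovered".
func discoverAll(keys []instanceKey) []string {
	queue := make(chan instanceKey, len(keys))
	done := make(chan struct{})
	var discovered []string

	go func() {
		defer close(done)
		handleDiscoveryRequests(queue, func(k instanceKey) {
			// A real implementation would probe the instance here.
			discovered = append(discovered, fmt.Sprintf("%s:%d", k.Hostname, k.Port))
		})
	}()

	for _, k := range keys {
		queue <- k
	}
	close(queue) // no more requests: the worker's range loop ends
	<-done
	return discovered
}

func main() {
	cluster := []instanceKey{
		{"172.16.120.10", 3307},
		{"172.16.120.11", 3307},
		{"172.16.120.12", 3307},
	}
	for _, addr := range discoverAll(cluster) {
		fmt.Println("discovered", addr)
	}
}
```

With a single worker the entries are handled in the order they were queued; running several workers on the same channel would trade that ordering for throughput.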

So who puts the entries into discoveryQueue? There are two places:

  • Manual triggering from the command line or the web UI