Orchestrator Failover Process: Source Code Analysis (Part I)
Simulating the Failure
Using the test environment, we simulate a failure of the 3307 cluster.
| Role | IP | Port | Hostname |
|---|---|---|---|
| Master | 172.16.120.10 | 3307 | centos-1 |
| Replica | 172.16.120.11 | 3307 | centos-2 |
| Replica | 172.16.120.12 | 3307 | centos-3 |
Shut down the 3307 master, 172.16.120.10:3307:
[2022-04-25 13:10:56][root@centos-1 13:10:56 ~]
[2022-04-25 13:11:22]#systemctl stop mysql3307
MySQL log:
2022-04-25T13:11:35.959667+08:00 0 [Note] /usr/local/mysql5732/bin/mysqld: Shutdown complete
Source Code Analysis
My approach is to use the logs to find the entry point.
In what follows:
- the master is referred to as centos-1
- the two replicas are referred to as:
  - centos-2
  - centos-3
The orchestrator log after the master was shut down:
[mysql] 2022/04/25 13:11:27 packets.go:37: unexpected EOF
2022-04-25 13:11:27 ERROR invalid connection
2022-04-25 13:11:27 ERROR ReadTopologyInstance(172.16.120.10:3307) show global status like 'Uptime': Error 1053: Server shutdown in progress
[mysql] 2022/04/25 13:11:27 packets.go:37: unexpected EOF
2022-04-25 13:11:27 ERROR invalid connection
[mysql] 2022/04/25 13:11:27 packets.go:37: unexpected EOF
2022-04-25 13:11:27 ERROR invalid connection
2022-04-25 13:11:27 ERROR dial tcp 172.16.120.10:3307: connect: connection refused
2022-04-25 13:11:28 ERROR dial tcp 172.16.120.10:3307: connect: connection refused
2022-04-25 13:11:28 DEBUG writeInstance: will not update database_instance due to error: invalid connection
2022-04-25 13:11:32 WARNING DiscoverInstance(172.16.120.10:3307) instance is nil in 0.104s (Backend: 0.001s, Instance: 0.103s), error=dial tcp 172.16.120.10:3307: connect: connection refused
2022-04-25 13:11:33 DEBUG analysis: ClusterName: 172.16.120.10:3307, IsMaster: true, LastCheckValid: false, LastCheckPartialSuccess: false, CountReplicas: 2, CountValidReplicas: 2, CountValidReplicatingReplicas: 2, CountLaggingReplicas: 0, CountDelayedReplicas: 0, CountReplicasFailingToConnectToMaster: 0
2022-04-25 13:11:33 INFO executeCheckAndRecoverFunction: proceeding with UnreachableMaster detection on 172.16.120.10:3307; isActionable?: false; skipProcesses: false
2022-04-25 13:11:33 INFO topology_recovery: detected UnreachableMaster failure on 172.16.120.10:3307
2022-04-25 13:11:33 INFO topology_recovery: Running 1 OnFailureDetectionProcesses hooks
After shutting down centos-1, two things stand out in this log:
- several of orchestrator's probes against centos-1 failed (unexpected EOF, connection refused)
- executeCheckAndRecoverFunction: proceeding with UnreachableMaster detection
Let's use the second message to find the "entry point".
Do a global search for executeCheckAndRecoverFunction: proceeding with.
It leads to the function executeCheckAndRecoverFunction.
This function is fairly long, so set it aside for now and first look at who calls executeCheckAndRecoverFunction.
Searching for executeCheckAndRecoverFunction( turns up CheckAndRecover.
Searching for CheckAndRecover in turn shows that it is called by ContinuousDiscovery.
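Before climbing further up the call chain, it helps to have a rough mental model of how the two relate: roughly speaking, CheckAndRecover asks the analysis layer (inst.GetReplicationAnalysis) which problems currently exist and then hands each analysis entry to executeCheckAndRecoverFunction. The snippet below is only a self-contained toy sketch of that shape, not orchestrator's actual code; every type and helper in it is a hypothetical stand-in.

```go
package main

import "fmt"

// replicationAnalysis is a hypothetical stand-in for inst.ReplicationAnalysis.
type replicationAnalysis struct {
	ClusterName string
	Analysis    string // e.g. "UnreachableMaster", "DeadMaster"
}

// getReplicationAnalysis stands in for the analysis layer (inst.GetReplicationAnalysis
// in the real code): it reports which problems exist right now.
func getReplicationAnalysis() []replicationAnalysis {
	return []replicationAnalysis{
		{ClusterName: "172.16.120.10:3307", Analysis: "UnreachableMaster"},
	}
}

// executeCheckAndRecover stands in for executeCheckAndRecoverFunction:
// one recovery decision per detected problem.
func executeCheckAndRecover(entry replicationAnalysis) {
	fmt.Printf("deciding whether/how to recover %s on %s\n", entry.Analysis, entry.ClusterName)
}

// checkAndRecover stands in for CheckAndRecover: analyse once, then dispatch.
func checkAndRecover() {
	for _, entry := range getReplicationAnalysis() {
		executeCheckAndRecover(entry)
	}
}

func main() {
	checkAndRecover()
}
```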
ContinuousDiscovery lives in the logic package. It is called by standardHttp in the app package (go/app/http.go), which in turn is called by app.Http, and app.Http is what runs when orchestrator is started in http mode:
go/cmd/orchestrator/main.go (excerpt):
```go
switch {
case helpTopic != "":
	app.HelpCommand(helpTopic)
case len(flag.Args()) == 0 || flag.Arg(0) == "cli":
	app.CliWrapper(*command, *strict, *instance, *destination, *owner, *reason, *duration, *pattern, *clusterAlias, *pool, *hostnameFlag)
case flag.Arg(0) == "http":
	app.Http(*discovery)
default:
	fmt.Fprintln(os.Stderr, `Usage:
  orchestrator --options... [cli|http]
See complete list of commands:
  orchestrator -c help
Full blown documentation:
  orchestrator`)
	os.Exit(1)
}
}
```
In other words, once we start orchestrator with the following command:
orchestrator -config orchestrator.conf.json -debug http
the call chain is app.Http -> standardHttp -> go logic.ContinuousDiscovery().
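For reference, the hand-off happens inside go/app/http.go and looks roughly like this (heavily trimmed and paraphrased, so treat it as approximate rather than a verbatim quote):

```go
// go/app/http.go (trimmed, approximate)
func Http(continuousDiscovery bool) {
	...
	standardHttp(continuousDiscovery)
}

func standardHttp(continuousDiscovery bool) {
	...
	if continuousDiscovery {
		log.Info("Starting Discovery")
		go logic.ContinuousDiscovery()
	}
	...
}
```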
What does ContinuousDiscovery actually do?
The name literally means "continuous discovery"; the comment on the function reads:
// ContinuousDiscovery starts an asynchronuous infinite discovery process where instances are
// periodically investigated and their status captured, and long since unseen instances are
// purged and forgotten.
In other words, ContinuousDiscovery starts a never-ending asynchronous "discovery" process: instances are periodically investigated and their status captured, while instances that have gone unseen for a long time are purged and forgotten.
(As an aside, "asynchronuous" in that comment is a typo for "asynchronous".)
ContinuousDiscovery first starts a goroutine:
```go
func ContinuousDiscovery() {
	...
	go handleDiscoveryRequests()
```
handleDiscoveryRequests was covered in the earlier Orchestrator Discover source-code analysis post.
handleDiscoveryRequests iterates over the discoveryQueue channel and calls DiscoverInstance on each entry; DiscoverInstance in turn calls ReadTopologyInstanceBufferable, which actually connects to the MySQL instance, collects its various metrics and parameters, and finally writes the results into the database_instance table.
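To make the shape of that consumer side concrete, here is a self-contained toy sketch (not orchestrator's actual code): a fixed pool of workers drains one queue and probes each instance it receives. The channel, worker count and probe function are hypothetical stand-ins for discoveryQueue, the discovery concurrency setting, and DiscoverInstance.

```go
package main

import (
	"fmt"
	"sync"
)

// instanceKey is a hypothetical stand-in for inst.InstanceKey (host + port of one MySQL instance).
type instanceKey struct {
	Hostname string
	Port     int
}

// discoverInstance stands in for DiscoverInstance / ReadTopologyInstanceBufferable:
// connect to the instance, read its status, persist the result to the backend.
func discoverInstance(key instanceKey) {
	fmt.Printf("probing %s:%d and writing the result to database_instance\n", key.Hostname, key.Port)
}

// runDiscoveryWorkers plays the role of handleDiscoveryRequests:
// a fixed pool of workers draining one shared queue.
func runDiscoveryWorkers(queue <-chan instanceKey, workers int, wg *sync.WaitGroup) {
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for key := range queue {
				discoverInstance(key)
			}
		}()
	}
}

func main() {
	queue := make(chan instanceKey, 100)
	var wg sync.WaitGroup
	runDiscoveryWorkers(queue, 4, &wg)

	// hand two replicas to the workers, then shut the queue down
	queue <- instanceKey{Hostname: "172.16.120.11", Port: 3307}
	queue <- instanceKey{Hostname: "172.16.120.12", Port: 3307}
	close(queue)
	wg.Wait()
}
```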
So who pushes entries into discoveryQueue in the first place? There are two places:
- manual triggering via the command line or the web UI
- the periodic poll loop inside ContinuousDiscovery itself, which keeps re-enqueuing already-known instances so their status stays fresh (both paths are sketched below)
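Sticking with the same toy model, the producer side could look like the sketch below: one path is an explicit on-demand push (the CLI/web case from the first bullet), the other is a periodic tick that keeps re-enqueuing already-known instances. Again, every name here is a hypothetical stand-in, not orchestrator's real code.

```go
package main

import (
	"fmt"
	"time"
)

// instanceKey is a hypothetical stand-in for inst.InstanceKey.
type instanceKey struct {
	Hostname string
	Port     int
}

// enqueueForDiscovery stands in for discoveryQueue.Push in the real code.
func enqueueForDiscovery(queue chan<- instanceKey, key instanceKey) {
	queue <- key
}

func main() {
	queue := make(chan instanceKey, 100)

	// Path 1: an explicit one-off request, e.g. a discover command issued
	// from the CLI or the web UI.
	enqueueForDiscovery(queue, instanceKey{Hostname: "172.16.120.10", Port: 3307})

	// Path 2: a periodic tick (the poll loop inside ContinuousDiscovery in the
	// real code) that re-enqueues already-known instances so their status stays fresh.
	known := []instanceKey{
		{Hostname: "172.16.120.10", Port: 3307},
		{Hostname: "172.16.120.11", Port: 3307},
		{Hostname: "172.16.120.12", Port: 3307},
	}
	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()
	for i := 0; i < 2; i++ { // just two ticks for the demo
		<-ticker.C
		for _, key := range known {
			enqueueForDiscovery(queue, key)
		}
	}

	// Drain the queue to show what a discovery worker would see.
	close(queue)
	for key := range queue {
		fmt.Printf("queued for discovery: %s:%d\n", key.Hostname, key.Port)
	}
}
```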
