[Woqu Technology] MySQL High-Availability Tool Orchestrator, Part 3: Probing Mechanism

woqutechteam

Published 2019-11-20 10:22:23


Article link: https://blog.youkuaiyun.com/woqutechteam/article/details/103157375



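As a leading Chinese provider of database cloud platform solutions, Woqu Technology has long focused on developing enterprise-grade database cloud platform products. It offers high-performance, highly available and scalable database cloud environments, as well as database platforms for different business scenarios. These products meet customers' needs for extreme performance, data security, disaster recovery and backup, and business continuity. With a professional team, high-quality products, cutting-edge technology and attentive service, Woqu Technology has earned the trust and respect of its customers and the recognition of the market.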

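The previous article described how orchestrator discovers the replication topology. In this article we continue our journey with orchestrator and look at its probing mechanism.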

Failure Detection

orch uses a holistic approach to decide whether a master or an intermediate master is healthy.

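A naive approach is for a monitoring tool to raise an alert as soon as it cannot connect to or query the master. This approach is prone to false positives caused by network glitches. To reduce them, the check is run n times with an interval of t between attempts. In some cases this lowers the chance of a false positive, but it also lengthens the response time to a real failure.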
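To make this trade-off concrete, here is a minimal sketch of the naive retry approach. It is not orchestrator code. `probe` is a hypothetical callable that tries to connect to or query the master, and `sleep` is injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable


def naive_master_alarm(probe: Callable[[], bool], n: int, t: float,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True (raise an alarm) only if n consecutive probes fail.

    probe is a hypothetical callable that connects to / queries the master
    and returns True on success.
    """
    for attempt in range(n):
        if probe():
            return False      # master answered: earlier failures were glitches
        if attempt < n - 1:
            sleep(t)          # wait t seconds before the next attempt
    return True               # n failures in a row: raise the alarm


def worst_case_extra_delay(n: int, t: float) -> float:
    """Waiting added before a real outage is reported (probe time ignored)."""
    return (n - 1) * t
```

With n = 3 and t = 10 seconds, a real master failure is reported only after at least 20 seconds of extra waiting. Larger n or t filters more glitches but makes real failures slower to detect.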

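orchestrator instead relies on the replication topology. orch monitors not only the master but also its replicas. For example, to conclude that the master is dead, orch requires both of the following conditions: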

It cannot reach the master.
It can reach the master's replicas, and those replicas cannot reach the master either.

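Rather than aggregating errors over time, orch aggregates them across the servers of the replication topology themselves (the so-called multiple observers). When none of the replicas can reach the master, the replication topology is effectively broken, which justifies a failover.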
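The two conditions above can be sketched as a small decision function. This is a simplified illustration, not orchestrator's actual analysis code. The labels `"ok"`, `"master_dead"` and `"unconfirmed"` are hypothetical. The `sees_master` flag is an assumption standing in for what a replica reports about its own connection to the master (for example, as derived from `show slave status`).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReplicaView:
    reachable: bool     # can orchestrator connect to this replica?
    sees_master: bool   # does the replica report a working connection to the master?


def analyze_master(master_reachable: bool, replicas: List[ReplicaView]) -> str:
    """Simplified holistic check; the returned labels are hypothetical."""
    if master_reachable:
        return "ok"
    observers = [r for r in replicas if r.reachable]
    if not observers:
        # No replica can vouch either way: this may be our own network problem.
        return "unconfirmed"
    if all(not r.sees_master for r in observers):
        # Condition 1 (master unreachable) and condition 2 (reachable replicas
        # cannot reach it either) both hold: the topology is broken.
        return "master_dead"
    # Some replica still reaches the master: only our own path to it failed.
    return "unconfirmed"
```

The function never waits or retries. The confirmation comes from the other observers rather than from repeated attempts over time.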

In production, this holistic failure detection has proven very reliable.

Probing Mechanism

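Every InstancePollSeconds (5 seconds by default), orch pulls the status of each monitored instance. It stores the results in the orchestrator.database_instance table of its backend (metadata) database. Also every InstancePollSeconds, orch reads each instance's status back from the backend database and displays it in the web UI.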
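Below is a minimal sketch of one polling round, assuming an in-memory dict as a stand-in for the database_instance table. The column names last_attempted_check, last_checked and last_seen come from the INSERT statement shown later. The freshness rule in `is_last_check_valid` is an assumption for illustration: a failed poll stamps only `last_attempted_check`, so a lagging `last_checked` marks the instance as failing. `read_status` is a hypothetical callable.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Optional, Tuple

INSTANCE_POLL_SECONDS = 5  # mirrors orchestrator's default InstancePollSeconds

InstanceKey = Tuple[str, int]  # (hostname, port)


@dataclass
class InstanceRow:
    """Hypothetical in-memory stand-in for one database_instance row."""
    last_attempted_check: float = 0.0
    last_checked: Optional[float] = None
    last_seen: Optional[float] = None
    status: dict = field(default_factory=dict)

    def is_last_check_valid(self) -> bool:
        # Assumed rule: the latest attempt succeeded only if last_checked
        # has caught up with last_attempted_check.
        return (self.last_checked is not None
                and self.last_checked >= self.last_attempted_check)


def poll_once(table: Dict[InstanceKey, InstanceRow],
              instances: Iterable[InstanceKey],
              read_status: Callable[[InstanceKey], dict],
              now: float) -> None:
    """One polling round: try every instance and record the outcome."""
    for key in instances:
        row = table.setdefault(key, InstanceRow())
        row.last_attempted_check = now   # stamped before the attempt
        try:
            status = read_status(key)    # e.g. run the status queries in this article
        except OSError:                  # ConnectionError/TimeoutError are OSErrors
            continue                     # last_checked stays behind: row reads as failing
        row.last_checked = now
        row.last_seen = now
        row.status = status


def poll_forever(table, instances, read_status) -> None:
    """Run a polling round every INSTANCE_POLL_SECONDS (never returns)."""
    while True:
        poll_once(table, instances, read_status, time.time())
        time.sleep(INSTANCE_POLL_SECONDS)
```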

The statements used to pull instance status are:

show variables like 'maxscale%'
show global status like 'Uptime'
select @@global.hostname, ifnull(@@global.report_host, ''), @@global.server_id, @@global.version, @@global.version_comment, @@global.read_only, @@global.binlog_format, @@global.log_bin, @@global.log_slave_updates
show master status
show global status like 'rpl_semi_sync_%_status'
select @@global.gtid_mode, @@global.server_uuid, @@global.gtid_executed, @@global.gtid_purged, @@global.master_info_repository = 'TABLE', @@global.binlog_row_image
show slave status
select count(*) > 0 and MAX(User_name) != '' from mysql.slave_master_info
show slave hosts
select substring_index(host, ':', 1) as slave_hostname from information_schema.processlist where command IN ('Binlog Dump', 'Binlog Dump GTID')
SELECT SUBSTRING_INDEX(@@hostname, '.', 1)
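Two of these queries use MySQL's SUBSTRING_INDEX. One strips the client port from processlist host values, which have the form host:port, to get replica hostnames. The other shortens @@hostname to its first label. The helper below is a Python equivalent for a non-empty delimiter, written to show what these queries return. The sample hostnames are made up.

```python
def substring_index(s: str, delim: str, count: int) -> str:
    """Python equivalent of MySQL SUBSTRING_INDEX(str, delim, count), non-empty delim only."""
    if not delim:
        raise ValueError("this sketch only handles a non-empty delimiter")
    if count == 0:
        return ""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything left of the count-th delimiter
    return delim.join(parts[count:])       # negative count: counted from the right


# processlist.host for a replica's Binlog Dump thread looks like "ip:port"
print(substring_index("10.10.30.6:51234", ":", 1))  # 10.10.30.6
print(substring_index("db1.example.com", ".", 1))   # db1
```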

After the instance status has been pulled, orch stores it in its backend database with the following statement:

Note: the values after VALUES are the instance status values pulled by the queries above.

INSERT INTO database_instance
                (hostname, port, last_checked, last_attempted_check, last_check_partial_success, uptime, server_id, server_uuid, version, major_version, version_comment, binlog_server, read_only, binlog_format, binlog_row_image, log_bin, log_slave_updates, binary_log_file, binary_log_pos, master_host, master_port, slave_sql_running, slave_io_running, replication_sql_thread_state, replication_io_thread_state, has_replication_filters, supports_oracle_gtid, oracle_gtid, master_uuid, ancestry_uuid, executed_gtid_set, gtid_mode, gtid_purged, gtid_errant, mariadb_gtid, pseudo_gtid, master_log_file, read_master_log_pos, relay_master_log_file, exec_master_log_pos, relay_log_file, relay_log_pos, last_sql_error, last_io_error, seconds_behind_master, slave_lag_seconds, sql_delay, num_slave_hosts, slave_hosts, cluster_name, suggested_cluster_alias, data_center, region, physical_environment, replication_depth, is_co_master, replication_credentials_available, has_replication_credentials, allow_tls, semi_sync_enforced, semi_sync_master_enabled, semi_sync_replica_enabled, instance_alias, last_discovery_latency, last_seen)
        VALUES
                ('10.10.30.5', 3306, NOW(), NOW(), 1, 322504, 1521, 'e2685a0f-d8f8
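The INSERT above is cut off in the source. The sketch below illustrates the write step only. It keeps a small, hypothetical subset of the database_instance columns in an in-memory SQLite database. It assumes each row is keyed by (hostname, port), so every polling round updates the existing row instead of adding a new one. SQLite's INSERT ... ON CONFLICT DO UPDATE (SQLite 3.24+) stands in for the MySQL backend, where INSERT ... ON DUPLICATE KEY UPDATE would have the same effect. The sample hostname, port, uptime and server_id come from the VALUES line above. server_uuid is left as NULL because it is truncated there.

```python
import sqlite3

# Hypothetical subset of the database_instance columns listed above.
SCHEMA = """
CREATE TABLE database_instance (
    hostname TEXT NOT NULL,
    port INTEGER NOT NULL,
    last_checked TEXT,
    last_attempted_check TEXT,
    last_seen TEXT,
    uptime INTEGER,
    server_id INTEGER,
    server_uuid TEXT,
    PRIMARY KEY (hostname, port)
)
"""

# Insert a new row, or refresh the existing (hostname, port) row.
UPSERT = """
INSERT INTO database_instance
    (hostname, port, last_checked, last_attempted_check, last_seen,
     uptime, server_id, server_uuid)
VALUES (:hostname, :port, datetime('now'), datetime('now'), datetime('now'),
        :uptime, :server_id, :server_uuid)
ON CONFLICT(hostname, port) DO UPDATE SET
    last_checked = excluded.last_checked,
    last_attempted_check = excluded.last_attempted_check,
    last_seen = excluded.last_seen,
    uptime = excluded.uptime,
    server_id = excluded.server_id,
    server_uuid = excluded.server_uuid
"""


def store_instance_state(conn: sqlite3.Connection, state: dict) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, state)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
state = {"hostname": "10.10.30.5", "port": 3306, "uptime": 322504,
         "server_id": 1521, "server_uuid": None}
store_instance_state(conn, state)                        # first poll: insert
store_instance_state(conn, {**state, "uptime": 322509})  # later poll: update
print(conn.execute("SELECT count(*), max(uptime) FROM database_instance").fetchone())
# (1, 322509)
```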
