mysql mgr节点无法连接,MySQL MGR成员故障处理-优快云博客

1. 主观造成主从节点间binlog的丢失

MGR故障模拟1

如下是搭建完成后的MGR集群，目前集群处于完全正常的状态中。

#执行如下操作，主观造成主从节点间binlog的丢失

use shooter;

SET SQL_LOG_BIN=0; #设置下面操作不记录到binlog日志中

INSERT INTO book VALUES (4, 'shooter');

SET SQL_LOG_BIN=1; #开启记录

select @@server_uuid AS this_server_uuid,t.* from performance_schema.global_status t where t.variable_name = 'group_replication_primary_member'

#执行如下操作，主观造成主从节点间binlog的丢失

useshooter;

SETSQL_LOG_BIN=0;#设置下面操作不记录到binlog日志中

INSERTINTObookVALUES(4,'shooter');

SETSQL_LOG_BIN=1;#开启记录

select@@server_uuidASthis_server_uuid,t.*fromperformance_schema.global_statustwheret.variable_name='group_replication_primary_member'

然后我们在主节点上删除刚才添加的这条数据。

delete from book where c1=4;

deletefrombookwherec1=4;

在主节点上对于对于从节点丢失的数据操作，GTID无法找到对应的数据，组复制会立马熄火

另外2个非写入节点出现错误

122

123

查看错误日志：

查看当前存货的节点node1状况

select * from performance_schema.replication_group_member_stats\G

show global variables like '%gtid%';

select*fromperformance_schema.replication_group_member_stats\G

showglobalvariableslike'%gtid%';

node1

上面我们通过制造数据不一致，binlog过期造成node2和node3一直处于RECOVERING状态无法同步直到放弃同步

通过跳过GTID方式解决

修改gtid_purged值

在坏掉的节点上执行如下操作

先重置,清除node2和node3节点的gtid_purged值

reset master;

show global variables like '%gtid%';

resetmaster;

showglobalvariableslike'%gtid%';

gtid_purged已清除

官方文档描述gtid_purged值是基于最旧binlog日志文件的Previous_gtids_log_event值，即上面node1的:

bd4ff52e-2868-11ea-91f0-0050568b6568:1-8:1000003:2000003-2000004

官方文档描述gtid_purged值是基于最旧binlog日志文件的Previous_gtids_log_event值，即上面node1的:

bd4ff52e-2868-11ea-91f0-0050568b6568:1-8:1000003:2000003-2000004

根据node1的gtid_purged ,更改node2和node3的gtid_purged,使其从该gtid_purged开始复制,即不再复制主库之前执行过已经丢失binlog的gtid，这种跳过事务同步的方法存在数据不一致的隐患。

下来在node2和node3上面操作

stop group_replication;

#因为当前是1-6,要跳过当前到7

set global gtid_purged='bd4ff52e-2868-11ea-91f0-0050568b6568:1-8:1000003:2000003-2000004';

#重新加入组

CHANGE MASTER TO MASTER_USER='rpl_user',MASTER_PASSWORD='123456' FOR CHANNEL 'group_replication_recovery';

#查看当前的gtid_purged

show global variables like '%gtid%';

#启动

START GROUP_REPLICATION;

stopgroup_replication;

#因为当前是1-6,要跳过当前到7

setglobalgtid_purged='bd4ff52e-2868-11ea-91f0-0050568b6568:1-8:1000003:2000003-2000004';

#重新加入组

CHANGEMASTERTOMASTER_USER='rpl_user',MASTER_PASSWORD='123456'FORCHANNEL'group_replication_recovery';

#查看当前的gtid_purged

showglobalvariableslike'%gtid%';

#启动

STARTGROUP_REPLICATION;

查看节点当前状态

SELECT * FROM performance_schema.replication_group_members;

SELECT*FROMperformance_schema.replication_group_members;

虽然节点启动了，但是当前机器的数据与主节点是不一致的。

数据不一致的解决

先从其他机器拷贝一份数据同步到这台宕机的数据库(否则你启动了，虽然连接到组了但是你和当前主primary数据是不一致的，因为你宕机的时候主primary数据并没有停止继续写入，当用户执行删除，你当前跳过的这台机器上，没有primary那条数据，后面用户轮循删除到这台机器，就出错的，组就会自动退出)，日志会报如下错误

#日志错误如下

[ERROR]Slave SQL for channel 'group_replication_applier': Could not execute Delete_rows event on table shooter.book; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 153, Error_code: 1032

[Warning] Slave: Can't find record in 't1' Error_code: 1032

[ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'FIRST' position 180

[ERROR] Plugin group_replication reported: 'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.'

[ERROR] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'

[ERROR] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'

[Note] Plugin group_replication reported: 'The group replication applier thread was killed'

[Note] Plugin group_replication reported: 'Group membership changed: This member has left the group.'

#日志错误如下

[ERROR]SlaveSQLforchannel'group_replication_applier':CouldnotexecuteDelete_rowseventontableshooter.book;Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event'smasterlogFIRST,end_log_pos153,Error_code:1032

[Warning]Slave:Can't find record in 't1'Error_code:1032

[ERROR]Errorrunningquery,slaveSQLthreadaborted.Fixtheproblem,andrestarttheslaveSQLthreadwith"SLAVE START".Westoppedatlog'FIRST'position180

[ERROR]Plugingroup_replicationreported:'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.'

[ERROR]Plugingroup_replicationreported:'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'

[ERROR]Plugingroup_replicationreported:'The server was automatically set into read only mode after an error was detected.'

[Note]Plugingroup_replicationreported:'The group replication applier thread was killed'

[Note]Plugingroup_replicationreported:'Group membership changed: This member has left the group.'

为了避免数据不一致的错误，在恢复宕机节点前，先把主机器数据写入暂停一下(可能要在半夜搞，因为和这个时候没有用户访问)，然后把主数据全部同步到当前宕机这台。使之保持一致。然后再恢复宕机这台的复制组。

然后在宕机恢复后的机器执行如下操作。

#查看当前状态

reset master;

show global variables like '%gtid%' ;

set global gtid_purged='bd4ff52e-2868-11ea-91f0-0050568b6568:1-26:1000015';(从主机器或者其他机器获取)

CHANGE MASTER TO MASTER_USER='rpl_user',MASTER_PASSWORD='123456' FOR CHANNEL 'group_replication_recovery';

SET SESSION binlog_format = 'ROW';

SET GLOBAL binlog_format = 'ROW';

START GROUP_REPLICATION;

#查看当前状态

resetmaster;

showglobalvariableslike'%gtid%';

setglobalgtid_purged='bd4ff52e-2868-11ea-91f0-0050568b6568:1-26:1000015';(从主机器或者其他机器获取)

CHANGEMASTERTOMASTER_USER='rpl_user',MASTER_PASSWORD='123456'FORCHANNEL'group_replication_recovery';

SETSESSIONbinlog_format='ROW';

SETGLOBALbinlog_format='ROW';

STARTGROUP_REPLICATION;

然后查看当前节点状态

2.模拟主宕机(多主模式)

关闭node1mysql服务

查看另外2台node2，node3节点

node2

现在在node3，node2随便一台写入数据(我这里写入node3)

node3

查看node2是否同步过去了

node2

现在启动node1(模拟node1恢复了)

登入启动MGR组

START GROUP_REPLICATION;

登入启动MGR组

STARTGROUP_REPLICATION;

只有他自己在线，这是为什么呢，因为他宕机后他就不再是主master，他就成了从。所以要把他加入复制组

reset master;

set global gtid_purged='bd4ff52e-2868-11ea-91f0-0050568b6568:1-8:1000003:2000003-2000004';

CHANGE MASTER TO MASTER_USER='rpl_user',MASTER_PASSWORD='123456' FOR CHANNEL 'group_replication_recovery';

SET SESSION binlog_format = 'ROW';

SET GLOBAL binlog_format = 'ROW';

SET GLOBAL group_replication_bootstrap_group=ON;

START GROUP_REPLICATION;

SET GLOBAL group_replication_bootstrap_group=OFF;

resetmaster;

setglobalgtid_purged='bd4ff52e-2868-11ea-91f0-0050568b6568:1-8:1000003:2000003-2000004';

CHANGEMASTERTOMASTER_USER='rpl_user',MASTER_PASSWORD='123456'FORCHANNEL'group_replication_recovery';

SETSESSIONbinlog_format='ROW';

SETGLOBALbinlog_format='ROW';

SETGLOBALgroup_replication_bootstrap_group=ON;

STARTGROUP_REPLICATION;

SETGLOBALgroup_replication_bootstrap_group=OFF;

启动后发现数据和其他机器不一致怎么办

从其他机器备份一份数据过来，然后导入这台机器的数据库，导入数据前你要先清空当前节点的gtid_purged,然后用souce xxxx.sql导入你的备份

reset master;

set global gtid_purged='bd4ff52e-2868-11ea-91f0-0050568b6568:1-8';

CHANGE MASTER TO MASTER_USER='rpl_user',MASTER_PASSWORD='123456' FOR CHANNEL 'group_replication_recovery';

SET SESSION binlog_format = 'ROW';

SET GLOBAL binlog_format = 'ROW';

SET GLOBAL group_replication_bootstrap_group=ON;

START GROUP_REPLICATION;

SET GLOBAL group_replication_bootstrap_group=OFF;

从其他机器备份一份数据过来，然后导入这台机器的数据库，导入数据前你要先清空当前节点的gtid_purged,然后用soucexxxx.sql导入你的备份

resetmaster;

setglobalgtid_purged='bd4ff52e-2868-11ea-91f0-0050568b6568:1-8';

CHANGEMASTERTOMASTER_USER='rpl_user',MASTER_PASSWORD='123456'FORCHANNEL'group_replication_recovery';

SETSESSIONbinlog_format='ROW';

SETGLOBALbinlog_format='ROW';

SETGLOBALgroup_replication_bootstrap_group=ON;

STARTGROUP_REPLICATION;

SETGLOBALgroup_replication_bootstrap_group=OFF;

查看节点

1.模拟主宕机(单主模式 )

1.查看当前单主集群状态

SELECT MEMBER_ID,MEMBER_HOST,MEMBER_PORT,MEMBER_STATE,IF(global_status.VARIABLE_NAME IS NOT NULL,'PRIMARY','SECONDARY') AS MEMBER_ROLE FROM performance_schema.replication_group_members LEFT JOIN performance_schema.global_status ON global_status.VARIABLE_NAME = 'group_replication_primary_member' AND global_status.VARIABLE_VALUE = replication_group_members.MEMBER_ID;

SELECTMEMBER_ID,MEMBER_HOST,MEMBER_PORT,MEMBER_STATE,IF(global_status.VARIABLE_NAMEISNOTNULL,'PRIMARY','SECONDARY')ASMEMBER_ROLEFROMperformance_schema.replication_group_membersLEFTJOINperformance_schema.global_statusONglobal_status.VARIABLE_NAME='group_replication_primary_member'ANDglobal_status.VARIABLE_VALUE=replication_group_members.MEMBER_ID;