问题现象:
主从同步告警,报错如下:
Slave_SQL_Running: No
Last_Errno: 13146
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '326bf8c8-b999-11ee-aac4-fa163eaf634b:1100064' at source log mysql-bin.000006, end_log_pos 597. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
继续查看
mysql> select * from performance_schema.replication_applier_status_by_worker\G;
LAST_ERROR_NUMBER: 13146
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '326bf8c8-b999-11ee-aac4-fa163eaf634b:1100064' at source log mysql-bin.000006, end_log_pos 597; Column 2 of table 'cjc.t1' cannot be converted from type 'varchar(40(bytes))' to type 'bigint'
环境说明:
MySQL 8.0.X 双主架构单写架构。
双主单写是指在节点1上有一个VIP地址,应用是通过VIP地址连接节点1进行单写,只有节点1故障,VIP才会漂到节点2上,也就是同一时间点对数据库是单写操作。下面将有VIP地址的节点称为主库,没有VIP地址的成为从库。
问题分析:
报错:
Column 2 of table 'cjc.t1' cannot be converted from type 'varchar(40(bytes))' to type 'bigint'
的产生的原因可能是:主或从上表结构不同。
但是手动检查两个节点的cjc.t1表结构,完全相同。
解析对应binlog日志:
mysql> show binary logs;
如果数据太多,可以输出到文件里
mysql> show binlog events in 'mysql-bin.000006';
mysqlbinlog --no-defaults --base64-output=decode-rows --verbose mysql-bin.000006 > 6.log
继续查看两个节点SQL执行历史,发现问题期间,有人人为在从节点执行了cjc.t1表添加列的操作,同一时间点,主库上的应用在主库上执行了对表cjc.t1的update操作。
为什么这种场景下会触发主从中断的错误?该问题可以稳定重现,重现过程如下:
问题重现:
故障模拟:
主:192.168.1.101
从:192.168.1.102
101:创建测试表
create database cjc;
use cjc;
create table t1(id bigint not null,aid bigint,name varchar(10));
insert into t1 values(1,1,'a');
insert into t1 values(2,2,'x');
102:停止同步,模拟同步延时或主从同时执行的场景
stop slave;
101:更新表数据
update t1 set name='y' where id=1;
102:添加列
alter table t1 add column age bigint(0) after id;
102:启动slave,同步报错:
start slave;
mysql> show slave status\G;
Slave_SQL_Running: No
Last_Errno: 13146
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '326bf8c8-b999-11ee-aac4-fa163eaf634b:1100064' at source log mysql-bin.000006, end_log_pos 597. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
mysql> select * from performance_schema.replication_applier_status_by_worker\G;
LAST_ERROR_NUMBER: 13146
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '326bf8c8-b999-11ee-aac4-fa163eaf634b:1100064' at source log mysql-bin.000006, end_log_pos 597; Column 2 of table 'cjc.t1' cannot be converted from type 'varchar(40(bytes))' to type 'bigint'
报错:
Column 2 of table 'cjc.t1' cannot be converted from type 'varchar(40(bytes))' to type 'bigint'
报错说明:
从cjc.t1表的第3列(从0列开始计算,所以Column 2实际指的是表的第3列),在binlog里记录的是varchar(40(bytes))类型,但在从库应用时,发现当前这个列类型变成了bigint类型,所有同步报错。
其中表的字符集是utf8mb4,每个字符占用4bytes,所以varchar(40(bytes))对应的类型是varchar(10)。
查看表结构、列位置信息:
mysql> select TABLE_SCHEMA,TABLE_NAME,COLUMN_NAME,ORDINAL_POSITION,COLUMN_TYPE from information_schema.columns where TABLE_SCHEMA='cjc';
添加age列之前:
+--------------+------------+-------------+------------------+------------------+
| TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME | ORDINAL_POSITION | COLUMN_TYPE |
+--------------+------------+-------------+------------------+------------------+
| cjc | t1 | id | 1 | bigint |
| cjc | t1 | aid | 2 | bigint |
| cjc | t1 | name | 3 | varchar(10) |
+--------------+------------+-------------+------------------+------------------+
3 rows in set (0.00 sec)
添加age列之后:
+--------------+------------+-------------+------------------+------------------+
| TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME | ORDINAL_POSITION | COLUMN_TYPE |
+--------------+------------+-------------+------------------+------------------+
| cjc | t1 | id | 1 | bigint |
| cjc | t1 | age | 2 | bigint |
| cjc | t1 | aid | 3 | bigint |
| cjc | t1 | name | 4 | varchar(10) |
+--------------+------------+-------------+------------------+------------------+
4 rows in set (0.00 sec)
通过解析binlog日志,可以看到,binlog里记录的update语句如下:
### UPDATE `cjc`.`t1`
### WHERE
### @1=1
### @2=1
### @3='a'
### SET
### @1=1
### @2=1
### @3='y'
因为binlog_format=ROW,update语句尽管只更新了1列,在binlog里也会记录所有列的更新前和更新后的值,但是没有记录具体列名,而是用列位置代替,也就是binlog里记录的列3是name varchar(10),应用从库时发现从库的列3变成了:aid bigint。
102 添加 age bigint(0) 列之前:
列1:id bigint
列2:aid bigint
列3:name varchar(10)
102 添加 age bigint(0) 列之后:
列1:id bigint
列2:age bigint
列3:aid bigint
列4:name varchar(10)
解决方案:
理想的解决方案:
最后思考一下,问题出现的根本原因是什么,是因为MySQL双主架构下,同一时间对同一表执行了双写操作吗?
实际是因为双写时导致了两个节点SQL执行顺序不同导致的问题。
主库执行:
update t1 set name='y' where id=1;
同一时间点,从库执行了:
alter table t1 add column age bigint(0) after id;
那么主库的SQL执行顺序是:
1. update t1 set name='y' where id=1;
2. alter table t1 add column age bigint(0) after id;
而从库的执行顺序是:
1. alter table t1 add column age bigint(0) after id;
2. update t1 set name='y' where id=1;
知道了问题根本原因,那么解决方案也就知道了,手动调整从库的SQL执行顺序,改成和主库相同的执行顺序:
102从:
stop slave;
set sql_binlog_bin=0;
alter table t1 drop column age;
set sql_binlog_bin=1;
start slave;
先删除列,删除后从库的表结构和binlog里记录的表结构一致了,会自动同步主库的update操作。
同步正常后,再手动将列加回来。
stop slave;
set sql_binlog_bin=0;
alter table t1 add column age bigint(0) after id;
set sql_binlog_bin=1;
start slave;
主从同步正常:
show slave status\G;
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
实际解决方案:
实际上,从库上人为手动执行了大量的SQL语句,对多个表执行了添加列、修改列类型等操作,很难完全理清所有SQL的执行顺序,稍有差池会导致数据不一致,最稳妥的方案是通过主库全备重建从库。
如何避免类似问题的发生:
严禁在从库执行写操作。
参考
MySQL Replication Breaks With Error 1677: Column … of Table ‘…’ Cannot Be Converted (Doc ID 2037712.1)
欢迎关注我的公众号《IT小Chen》