MySQL数据库主从同步中断问题分析:Column x of table cjc.t1 cannot be converted from type xx to type xx-优快云博客

问题现象:

主从同步告警，报错如下:

Slave_SQL_Running: No
Last_Errno: 13146
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '326bf8c8-b999-11ee-aac4-fa163eaf634b:1100064' at source log mysql-bin.000006, end_log_pos 597. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

继续查看

mysql> select * from  performance_schema.replication_applier_status_by_worker\G;
LAST_ERROR_NUMBER: 13146
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '326bf8c8-b999-11ee-aac4-fa163eaf634b:1100064' at source log mysql-bin.000006, end_log_pos 597; Column 2 of table 'cjc.t1' cannot be converted from type 'varchar(40(bytes))' to type 'bigint'

环境说明:

MySQL 8.0.X 双主架构单写架构。
双主单写是指在节点1上有一个VIP地址，应用是通过VIP地址连接节点1进行单写，只有节点1故障，VIP才会漂到节点2上，也就是同一时间点对数据库是单写操作。下面将有VIP地址的节点称为主库，没有VIP地址的成为从库。

问题分析:

报错:

Column 2 of table 'cjc.t1' cannot be converted from type 'varchar(40(bytes))' to type 'bigint'

的产生的原因可能是:主或从上表结构不同。
但是手动检查两个节点的cjc.t1表结构，完全相同。
解析对应binlog日志:

mysql> show binary logs;

如果数据太多，可以输出到文件里

mysql> show binlog events in 'mysql-bin.000006';

mysqlbinlog --no-defaults --base64-output=decode-rows --verbose mysql-bin.000006 > 6.log

继续查看两个节点SQL执行历史，发现问题期间，有人人为在从节点执行了cjc.t1表添加列的操作，同一时间点，主库上的应用在主库上执行了对表cjc.t1的update操作。
为什么这种场景下会触发主从中断的错误？该问题可以稳定重现，重现过程如下：

问题重现:

故障模拟:

主:192.168.1.101
从:192.168.1.102

101:创建测试表

create database cjc;
use cjc;
create table t1(id bigint not null,aid bigint,name varchar(10));
insert into t1 values(1,1,'a');
insert into t1 values(2,2,'x');

102:停止同步，模拟同步延时或主从同时执行的场景

stop slave;

101:更新表数据

update t1 set name='y' where id=1;

102:添加列

alter table t1 add column age bigint(0) after id;

102:启动slave，同步报错:

start slave;
mysql> show slave status\G;
Slave_SQL_Running: No
Last_Errno: 13146
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '326bf8c8-b999-11ee-aac4-fa163eaf634b:1100064' at source log mysql-bin.000006, end_log_pos 597. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
mysql> select * from  performance_schema.replication_applier_status_by_worker\G;
LAST_ERROR_NUMBER: 13146
LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '326bf8c8-b999-11ee-aac4-fa163eaf634b:1100064' at source log mysql-bin.000006, end_log_pos 597; Column 2 of table 'cjc.t1' cannot be converted from type 'varchar(40(bytes))' to type 'bigint'

报错:

Column 2 of table 'cjc.t1' cannot be converted from type 'varchar(40(bytes))' to type 'bigint'

报错说明:
从cjc.t1表的第3列(从0列开始计算，所以Column 2实际指的是表的第3列)，在binlog里记录的是varchar(40(bytes))类型，但在从库应用时，发现当前这个列类型变成了bigint类型，所有同步报错。
其中表的字符集是utf8mb4，每个字符占用4bytes，所以varchar(40(bytes))对应的类型是varchar(10)。
查看表结构、列位置信息:

mysql> select TABLE_SCHEMA,TABLE_NAME,COLUMN_NAME,ORDINAL_POSITION,COLUMN_TYPE from information_schema.columns where TABLE_SCHEMA='cjc';

添加age列之前:

+--------------+------------+-------------+------------------+------------------+
| TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME | ORDINAL_POSITION | COLUMN_TYPE      |
+--------------+------------+-------------+------------------+------------------+
| cjc      | t1         | id          |                1 |  bigint          |
| cjc      | t1         | aid         |                2 |  bigint          |
| cjc      | t1         | name        |                3 |  varchar(10)     |
+--------------+------------+-------------+------------------+------------------+
3 rows in set (0.00 sec)

添加age列之后:

+--------------+------------+-------------+------------------+------------------+
| TABLE_SCHEMA | TABLE_NAME | COLUMN_NAME | ORDINAL_POSITION | COLUMN_TYPE      |
+--------------+------------+-------------+------------------+------------------+
| cjc      | t1         | id          |                1 |  bigint          |
| cjc      | t1         | age         |                2 |  bigint          |
| cjc      | t1         | aid         |                3 |  bigint          |
| cjc      | t1         | name        |                4 |  varchar(10)     |
+--------------+------------+-------------+------------------+------------------+
4 rows in set (0.00 sec)

通过解析binlog日志，可以看到，binlog里记录的update语句如下:

### UPDATE `cjc`.`t1`
### WHERE
###   @1=1
###   @2=1
###   @3='a'
### SET
###   @1=1
###   @2=1
###   @3='y'

因为binlog_format=ROW，update语句尽管只更新了1列，在binlog里也会记录所有列的更新前和更新后的值，但是没有记录具体列名，而是用列位置代替，也就是binlog里记录的列3是name varchar(10)，应用从库时发现从库的列3变成了:aid bigint。
102 添加 age bigint(0) 列之前:

列1:id bigint
列2:aid bigint
列3:name varchar(10)

102 添加 age bigint(0) 列之后:

列1:id bigint
列2:age bigint
列3:aid bigint
列4:name varchar(10)

解决方案:

理想的解决方案:

最后思考一下，问题出现的根本原因是什么，是因为MySQL双主架构下，同一时间对同一表执行了双写操作吗？
实际是因为双写时导致了两个节点SQL执行顺序不同导致的问题。
主库执行:

update t1 set name='y' where id=1;

同一时间点，从库执行了:

alter table t1 add column age bigint(0) after id;

那么主库的SQL执行顺序是:

1. update t1 set name='y' where id=1;
2. alter table t1 add column age bigint(0) after id;

而从库的执行顺序是:

1. alter table t1 add column age bigint(0) after id;
2. update t1 set name='y' where id=1;

知道了问题根本原因，那么解决方案也就知道了，手动调整从库的SQL执行顺序，改成和主库相同的执行顺序:
102从:

stop slave;
set sql_binlog_bin=0;
alter table t1 drop column age;
set sql_binlog_bin=1;
start slave;

先删除列，删除后从库的表结构和binlog里记录的表结构一致了，会自动同步主库的update操作。
同步正常后，再手动将列加回来。

stop slave;
set sql_binlog_bin=0;
alter table t1 add column age bigint(0) after id;
set sql_binlog_bin=1;
start slave;

主从同步正常:

show slave status\G;
Slave_IO_Running: Yes
Slave_SQL_Running: Yes

实际解决方案:

实际上，从库上人为手动执行了大量的SQL语句，对多个表执行了添加列、修改列类型等操作，很难完全理清所有SQL的执行顺序，稍有差池会导致数据不一致，最稳妥的方案是通过主库全备重建从库。

如何避免类似问题的发生:

严禁在从库执行写操作。

参考

MySQL Replication Breaks With Error 1677: Column … of Table ‘…’ Cannot Be Converted (Doc ID 2037712.1)

欢迎关注我的公众号《IT小Chen》