说明:
不是达梦数据库的问题,是XXX人员没按规范操作导致的,本次案例主要介绍达梦数据库守护集群发生自动切换后如何排查切换原因,从本次案例可以看到,达梦守护集群的故障自动发现、故障自动切换功能还是比较完善的。
问题现象:
一套非生产达梦数据库,守护集群架构,无监控,某天检查参数时突然发现原主库变成备库,原备库变成了主库,也就是发生了主、备切换。此类问题可大可小,如果是因为数据库BUG导致的自动切换,就是非常严重的问题,需要尽快定位原因并解决问题,最终发现,并不是数据库的问题。
问题分析:
····
第一步:通过确认监控器确定自动切换时间等信息;
第二步:检查主库实例日志、dmwatcher日志、core日志、OS日志等分析自动切换原因;
第三步:确认影响及解决方案。
····
检查监控器日志:
达梦数据库守护集群,通常只有在主库故障时才会自动切换,由于自动切换是由确认监控器发起的,所以第一步先检查确认监控器日志:
通过日志可以看到:
时间:2024-06-03 13:28:02
监控器发现 CJC01 实例故障,WSTATUS 为 ERROR,RECEIVE TIMEOUT CJC01,dwmon_tcp_recv failed。
时间:2024-06-03 13:28:08
[!!! 实例CJC01[PRIMARY, OPEN, ISTAT_SAME:TRUE]故障,实例CJC02[STANDBY, OPEN, ISTAT_SAME:TRUE]符合自动接管条件 !!!]
时间:2024-06-03 13:28:12
使用实例CJC02接管成功,主、备数据库自动切换完成,新主库变为CJC02。
详细监控器日志如下:
····
dmdba@cjc-db-01:/dm8/mal/log$vi dmmonitor_20240525105444.log
[monitor] 2024-06-03 13:28:02: dwmon_check_watch_error_low, receive dmwatcher(CJC01) msg timeout(far_err_time:10), close port(ip:192.168.0.11, port:55339, vio:4, n_fixed:1)!
[monitor] 2024-06-03 13:28:02: dwmon tcp port vio(4) close, inst_name:CJC01, ip:192.168.0.11, port:55339, n_fixed:1.
[monitor] 2024-06-03 13:28:02:
[monitor] 2024-06-03 13:28:02: dwmon_tcp_recv failed, close port, vio:0, mid:1991408620, errno:115, code:-6007
[monitor] 2024-06-03 13:28:02: dwmon tcp port vio(0) close, inst_name:, ip:192.168.0.11, port:55339, n_fixed:1.
[monitor] 2024-06-03 13:28:02: 接收守护进程(CJC01)消息超时
WTIME WSTATUS INST_OK INAME ISTATUS IMODE RSTAT N_OPEN FLSN CLSN
2024-06-03 13:27:14 ERROR OK CJC01 OPEN PRIMARY VALID 10 11554054 11554054
[monitor] 2024-06-03 13:28:02: </RECEIVE TIMEOUT CJC01>
[monitor] 2024-06-03 13:28:02: dwmon_inst_can_takeover_by_prim_info, check_pstat = 1, n_pinfo = 1, but ok_cnt = 0, return FALSE!
[monitor] 2024-06-03 13:28:02: [!!! 实例CJC01[PRIMARY, OPEN, ISTAT_SAME:TRUE]仍然处于活动状态,或者模式/状态或LSN值不符合被其他实例自动接管条件 !!!]
[CJC02] 2024-06-03 13:27:33.500 [INFO] dmwatcher P0000007887 T0000000000000007891 没有收到远程守护进程(CJC01)消息,原状态为(OPEN),距上次接收消息时间间隔(11)s, 设置远程守护进程为ERROR状态
[monitor] 2024-06-03 13:28:08: dwmon_inst_can_takeover_by_prim_info, primary name: CJC01, instance name: CJC02, n_pinfo = 1, ok_cnt = 1, return TRUE!
[monitor] 2024-06-03 13:28:08: [!!! 实例CJC01[PRIMARY, OPEN, ISTAT_SAME:TRUE]故障,实例CJC02[STANDBY, OPEN, ISTAT_SAME:TRUE]符合自动接管条件 !!!]
[monitor] 2024-06-03 13:28:08: 检测到PRIMARY实例故障,开始对组(GRPCJC)执行自动接管
[monitor] 2024-06-03 13:28:08:
[monitor] 2024-06-03 13:28:08: 通知组(GRPCJC)当前活动的守护进程设置MID
[monitor] 2024-06-03 13:28:08: Begin to wait site(CJC02) complete…
[monitor] 2024-06-03 13:28:08: [!!! dwmon_tcp_msg_send, get master tcp port failed, send cmd msg(cmd:5, name_sendto:GRPCJC) to dmwatcher() !!!]
[monitor] 2024-06-03 13:28:08: Wait site(CJC02) finished, code=0!
[monitor] 2024-06-03 13:28:08: 通知组(GRPCJC)当前活动的守护进程设置MID成功
[monitor] 2024-06-03 13:28:08: 开始使用实例CJC02接管
[monitor] 2024-06-03 13:28:08: 通知守护进程CJC02切换TAKEOVER状态
[monitor] 2024-06-03 13:28:09: 守护进程(CJC02)状态切换 [OPEN–>TAKEOVER]
[monitor] 2024-06-03 13:28:10: 切换守护进程CJC02为TAKEOVER状态成功
[monitor] 2024-06-03 13:28:10: 实例CJC02开始执行SP_SET_GLOBAL_DW_STATUS(0, 7)语句
[monitor] 2024-06-03 13:28:10: 实例CJC02执行SP_SET_GLOBAL_DW_STATUS(0, 7)语句成功
[monitor] 2024-06-03 13:28:10: 实例CJC02开始执行SP_APPLY_KEEP_PKG()语句
[monitor] 2024-06-03 13:28:10: 实例CJC02执行SP_APPLY_KEEP_PKG()语句成功
[monitor] 2024-06-03 13:28:10: 实例CJC02开始执行ALTER DATABASE MOUNT语句
[monitor] 2024-06-03 13:28:10: Begin to wait site(CJC02) complete…
[monitor] 2024-06-03 13:28:10: Wait site(CJC02) finished, code=0!
[monitor] 2024-06-03 13:28:10: 实例CJC02执行ALTER DATABASE MOUNT语句成功
[monitor] 2024-06-03 13:28:10: 实例CJC02开始执行ALTER DATABASE PRIMARY语句
[monitor] 2024-06-03 13:28:10: 实例CJC02执行ALTER DATABASE PRIMARY语句成功
[monitor] 2024-06-03 13:28:10: 通知实例CJC02修改所有归档状态无效
[monitor] 2024-06-03 13:28:10: 修改所有实例归档为无效状态成功
[monitor] 2024-06-03 13:28:10: 实例CJC02开始执行ALTER DATABASE OPEN FORCE语句
[monitor] 2024-06-03 13:28:10: ohis_inst_info_copy_low, inst(CJC02) apply info changed, old info[p_db_magic:694012907, n_apply_ep:1], new info to set[p_db_magic:1894906507, n_apply_ep:0]!
[monitor] 2024-06-03 13:28:10: 实例CJC02执行ALTER DATABASE OPEN FORCE语句成功
[monitor] 2024-06-03 13:28:10: 实例CJC02开始执行SP_SET_GLOBAL_DW_STATUS(7, 0)语句
[monitor] 2024-06-03 13:28:11: 实例CJC02执行SP_SET_GLOBAL_DW_STATUS(7, 0)语句成功
[monitor] 2024-06-03 13:28:11: 通知守护进程CJC02切换OPEN状态
[monitor] 2024-06-03 13:28:12: 守护进程(CJC02)状态切换 [TAKEOVER–>OPEN]
[monitor] 2024-06-03 13:28:12: 切换守护进程CJC02为OPEN状态成功
[monitor] 2024-06-03 13:28:12: 通知组(GRPCJC)的守护进程执行清理操作
[monitor] 2024-06-03 13:28:12: Notify instance(CJC02) to clear monitor info and wait complete!
[monitor] 2024-06-03 13:28:12: dwmon_cmd_msg_send_low failed, get tcp_port failed(inst_name:, ip:192.168.0.11, port:55339)!
[monitor] 2024-06-03 13:28:12: [!!! dwmon_tcp_msg_send to master tcp_port failed, code:-6010, (inst_name:, ip:192.168.0.11, port:55339, vio:0) !!!]
[monitor] 2024-06-03 13:28:12: Begin to wait site(CJC02) complete…
[monitor] 2024-06-03 13:28:12: 清理守护进程(CJC02)请求成功
[monitor] 2024-06-03 13:28:12: Wait site(CJC02) finished, code=0!
[monitor] 2024-06-03 13:28:12: 使用实例CJC02接管成功
[monitor] 2024-06-03 13:28:12: </AUTO TAKEOVER CJC02>
[monitor] 2024-06-03 13:28:12: 组(GRPCJC)使用实例CJC02自动接管成功
[monitor] 2024-06-03 13:28:49:
GROUP OGUID MON_CONFIRM MODE MPP_FLAG
GRPCJC 666666 TRUE AUTO FALSE
<>
DW_IP MAL_DW_PORT WTIME WTYPE WCTLSTAT WSTATUS INAME INST_OK N_EP N_OK ISTATUS IMODE DSC_STATUS RTYPE RSTAT
192.168.0.12 55339 2024-06-03 13:28:18 GLOBAL VALID OPEN CJC02 OK 1 1 OPEN PRIMARY DSC_OPEN REALTIME VALID
EP INFO:
INST_IP INST_PORT INST_OK INAME ISTATUS IMODE DSC_SEQNO DSC_CTL_NODE RTYPE RSTAT FSEQ FLSN CSEQ CLSN DW_STAT_FLAG
192.168.0.12 15239 OK CJC02 OPEN PRIMARY 0 0 REALTIME VALID 688606 11554511 688606 11554511 NONE
ERROR INSTANCES:
<<DATABASE GLOBAL INFO:>>
DW_IP MAL_DW_PORT WTIME WTYPE WCTLSTAT WSTATUS INAME INST_OK N_EP N_OK ISTATUS IMODE DSC_STATUS RTYPE RSTAT
192.168.0.11 55339 2024-06-03 13:27:14 GLOBAL VALID ERROR CJC01 OK 1 1 OPEN PRIMARY DSC_OPEN REALTIME VALID
EP INFO:
INST_IP INST_PORT INST_OK INAME ISTATUS IMODE DSC_SEQNO DSC_CTL_NODE RTYPE RSTAT FSEQ FLSN CSEQ CLSN DW_STAT_FLAG
192.168.0.11 15239 OK CJC01 OPEN PRIMARY 0 0 REALTIME VALID 688604 11554054 688604 11554054 NONE
[monitor] 2024-06-03 13:29:49:
GROUP OGUID MON_CONFIRM MODE MPP_FLAG
GRPCJC 666666 TRUE AUTO FALSE
<>
DW_IP MAL_DW_PORT WTIME WTYPE WCTLSTAT WSTATUS INAME INST_OK N_EP N_OK ISTATUS IMODE DSC_STATUS RTYPE RSTAT
192.168.0.12 55339 2024-06-03 13:29:18 GLOBAL VALID OPEN CJC02 OK 1 1 OPEN PRIMARY DSC_OPEN REALTIME VALID
EP INFO:
INST_IP INST_PORT INST_OK INAME ISTATUS IMODE DSC_SEQNO DSC_CTL_NODE RTYPE RSTAT FSEQ FLSN CSEQ CLSN DW_STAT_FLAG
192.168.0.12 15239 OK CJC02 OPEN PRIMARY 0 0 REALTIME VALID 688606 11554511 688606 11554511 NONE
ERROR INSTANCES:
<<DATABASE GLOBAL INFO:>>
DW_IP MAL_DW_PORT WTIME WTYPE WCTLSTAT WSTATUS INAME INST_OK N_EP N_OK ISTATUS IMODE DSC_STATUS RTYPE RSTAT
192.168.0.11 55339 2024-06-03 13:27:14 GLOBAL VALID ERROR CJC01 OK 1 1 OPEN PRIMARY DSC_OPEN REALTIME VALID
EP INFO:
INST_IP INST_PORT INST_OK INAME ISTATUS IMODE DSC_SEQNO DSC_CTL_NODE RTYPE RSTAT FSEQ FLSN CSEQ CLSN DW_STAT_FLAG
192.168.0.11 15239 OK CJC01 OPEN PRIMARY 0 0 REALTIME VALID 688604 11554054 688604 11554054 NONE
····
检查主库实例日志:
检查CJC01原主库实core日志:
检查core目录,没有新文件生成,看来不是数据库BUG导致的自动切换。
检查CJC01原主库实例日志:
通过日志可以看到:
时间:2024-06-03 13:27:30.770
主库实例出现异常,socket_err_should_retry errno:110;
时间:2024-06-03 13:27:32.077
mal通信出现异常, mal_site_link_check->mal_recv_force_wait failed, code:-6010!, wait_time:10
时间:2024-06-03 13:28:08.129
主库无法和备库通信,Can’t connect to DM server on ‘192.168.0.12’ port(36339) errno(115)
时间:2024-06-03 13:29:07.746
主库向备库发送arch失败,[realtime arch] mal send to site(CJC02, mal_seqno:1) failed
时间:2024-06-03 15:05:31.897
主库自动重启, [for dem]SYSTEM SHUTDOWN ABORT…,utsk_dw_sql_exec dmserver shutdown abort!!!
详细实例日志如下:
····
2024-06-03 13:26:07.692 [INFO] database P0000006806 T0000000000000006859 ckpt2_log_adjust: ckpt_lsn(11540754), ckpt_fil(0), ckpt_off(864264192), cur_lsn(11554054), l_next_seq(688605), g_next_seq(688605), cur_free(867622912), total_space(2147475456), used_space(3358720), free_space(2144116736), n_ep(1), db_open_id(10)
2024-06-03 13:26:07.693 [INFO] database P0000006806 T0000000000000006859 checkpoint end, 0 pages flushed, used_space[3358720], free_space[2144116736].
2024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006785 socket_err_should_retry errno:110
2024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006786 socket_err_should_retry errno:110
2024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006783 socket_err_should_retry errno:110
2024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006782 socket_err_should_retry errno:110
2024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006780 socket_err_should_retry errno:110
2024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006781 socket_err_should_retry errno:110
2024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006778 socket_err_should_retry errno:110
2024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006779 socket_err_should_retry errno:110
2024-06-03 13:27:30.770 [INFO] database P0000006806 T0000000000000006784 socket_err_should_retry errno:110
2024-06-03 13:27:31.330 [INFO] database P0000006806 T0000000000000026999 socket_err_should_retry errno:110
2024-06-03 13:27:31.330 [INFO] database P0000006806 T0000000000000027000 socket_err_should_retry errno:110
2024-06-03 13:27:31.809 [INFO] database P0000006806 T0000000000000026993 socket_err_should_retry errno:110
2024-06-03 13:27:31.809 [INFO] database P0000006806 T0000000000000026992 socket_err_should_retry errno:110
2024-06-03 13:27:31.809 [INFO] database P0000006806 T0000000000000026996 socket_err_should_retry errno:110
2024-06-03 13:27:31.809 [INFO] database P0000006806 T0000000000000026994 socket_err_should_retry errno:110
2024-06-03 13:27:31.810 [INFO] database P0000006806 T0000000000000026998 socket_err_should_retry errno:110
2024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006809 socket_err_should_retry errno:110
2024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006807 socket_err_should_retry errno:110
2024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006805 socket_err_should_retry errno:110
2024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006804 socket_err_should_retry errno:110
2024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006836 socket_err_should_retry errno:110
2024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006837 socket_err_should_retry errno:110
2024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006835 socket_err_should_retry errno:110
2024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006803 socket_err_should_retry errno:110
2024-06-03 13:27:31.970 [INFO] database P0000006806 T0000000000000006838 socket_err_should_retry errno:110
2024-06-03 13:27:32.077 [INFO] database P0000006806 T0000000000000006852 mal_site_link_check->mal_recv_force_wait failed, code:-6010!, wait_time:10
2024-06-03 13:27:33.077 [INFO] database P0000006806 T0000000000000006852 mal_site_link_check->mal_recv_force_wait failed, code:-6010!, wait_time:1
2024-06-03 13:27:33.077 [INFO] database P0000006806 T0000000000000006852 mal_site_link_check from mal_site(0) to mal_site(1) failed, call mal_site_port_close!
2024-06-03 13:27:33.077 [WARNING] database P0000006806 T0000000000000006852 MAL mgr thrd site(0) lost connect to site(1), ctl_handle(48), data_handle(49), dsc_handle(0)
2024-06-03 13:27:33.077 [WARNING] database P0000006806 T0000000000000006846 mal_site_tsk_check site(0) connect lost to site(1), socket handle = 0, mal sys status = 0, try get port again
2024-06-03 13:27:33.077 [WARNING] database P0000006806 T0000000000000006847 mal_site_letter_recv code=-6007, errno=107, site(0) recv from site(1) failed, socket handle = 0
2024-06-03 13:27:33.078 [INFO] database P0000006806 T0000000000000006846 send CMD_MAL_LINK_CHECK(350): (mal_id:87333158777, stmt_id:4320122, mppexec_id:0, pln_op_id:65535, org_site :0, src_site:0, dest_site:1, build_time:-1)
2024-06-03 13:27:33.078 [WARNING] database P0000006806 T0000000000000006847 MAL receive site(0) lost connect to site(1), ctl_handle(0), data_handle(0), dsc_handle(0)
2024-06-03 13:27:33.077 [WARNING] database P0000006806 T0000000000000006845 mal_site_tsk_check site(0) connect lost to site(1), socket handle = 0, mal sys status = 0, try get port again
2024-06-03 13:27:33.078 [WARNING] database P0000006806 T0000000000000006848 mal_site_letter_recv code=-6007, errno=0, site(0) recv from site(1) failed, socket handle = 0
2024-06-03 13:27:33.078 [WARNING] database P0000006806 T0000000000000006847 site(0) ctl_link mal_site_letter_recv from site(1) failed, socket handle = 0, mal sys status is 0, try to get mal_port again
2024-06-03 13:27:33.078 [WARNING] database P0000006806 T0000000000000006848 MAL receive site(0) lost connect to site(1), ctl_handle(0), data_handle(0), dsc_handle(0)
2024-06-03 13:27:33.078 [WARNING] database P0000006806 T0000000000000006848 site(0) data_link mal_site_letter_recv from site(1) failed, socket handle = 0, mal sys status is 0, try to get mal_port again
2024-06-03 13:27:33.249 [INFO] database P0000006806 T0000000000000026991 socket_err_should_retry errno:110
2024-06-03 13:27:34.369 [INFO] database P0000006806 T0000000000000006839 socket_err_should_retry errno:110
2024-06-03 13:27:35.329 [INFO] database P0000006806 T0000000000000006787 socket_err_should_retry errno:110
2024-06-03 13:27:35.409 [INFO] database P0000006806 T0000000000000026997 socket_err_should_retry errno:110
2024-06-03 13:27:35.409 [INFO] database P0000006806 T0000000000000026995 socket_err_should_retry errno:110
2024-06-03 13:27:35.490 [INFO] database P0000006806 T0000000000000025591 socket_err_should_retry errno:110
2024-06-03 13:27:35.490 [INFO] database P0000006806 T0000000000000025594 socket_err_should_retry errno:110
2024-06-03 13:27:35.490 [INFO] database P0000006806 T0000000000000025587 socket_err_should_retry errno:110
2024-06-03 13:27:35.489 [INFO] database P0000006806 T0000000000000025592 socket_err_should_retry errno:110
2024-06-03 13:27:35.489 [INFO] database P0000006806 T0000000000000025598 socket_err_should_retry errno:110
2024-06-03 13:28:08.129 [ERROR] database P0000006806 T0000000000000006852 Can’t connect to DM server on ‘192.168.0.12’ port(36339) errno(115)
2024-06-03 13:29:07.740 [INFO] database P0000006806 T0000000000000006894 checkpoint requested by CKPT_INTERVAL, rlog free space[2144116736], used space[3358720]
2024-06-03 13:29:07.740 [INFO] database P0000006806 T0000000000000006894 checkpoint generate by ckpt_interval
2024-06-03 13:29:07.740 [INFO] database P0000006806 T0000000000000006859 checkpoint begin, used_space[3358720], free_space[2144116736]…
2024-06-03 13:29:07.746 [ERROR] database P0000006806 T0000000000000006860 [realtime arch] mal send to site(CJC02, mal_seqno:1) failed, begin lsn:11554054, end lsn:11554054, code:-6010
2024-06-03 13:29:07.746 [ERROR] database P0000006806 T0000000000000006860 send realtime archive to instance[CJC02] failed, code = -6010, begin_lsn = 11554054, end_lsn = 11554054!
2024-06-03 13:29:07.746 [INFO] database P0000006806 T0000000000000006860 rlog4_process_arch_failed, need_suspend:1
2024-06-03 13:29:07.747 [INFO] database P0000006806 T0000000000000006860 Send archive log to remote instance failed, switch all ep to SUSPEND status success!
2024-06-03 14:10:39.591 [INFO] database P0000006806 T0000000000000006865 (1) License daily checking by time trigger!
2024-06-03 15:05:31.897 [INFO] database P0000006806 T0000000000000006897 utsk_dw_sql_exec, exec sql SHUTDOWN ABORT!
2024-06-03 15:05:31.897 [FATAL] database P0000006806 T0000000000000006897 [for dem]SYSTEM SHUTDOWN ABORT.
2024-06-03 15:05:31.897 [FATAL] database P0000006806 T0000000000000006897 utsk_dw_sql_exec dmserver shutdown abort!!!
2024-06-03 15:05:31.897 [INFO] database P0000006806 T0000000000000006897 total 3 rfil opened!
2024-06-03 15:05:44.947 [INFO] database P0000008699 T0000000000000008699 INI parameter DW_PORT changed, the original value 0, new value 15539
2024-06-03 15:05:44.955 [INFO] database P0000008699 T0000000000000008699 INI parameter DPC_2PC changed, the original value 1, new value 0
2024-06-03 15:05:44.975 [INFO] database P0000008699 T0000000000000008699 version info: enterprise
2024-06-03 15:05:44.977 [INFO] database P0000008699 T0000000000000008699 os_sema2_create_low, create and inc sema success, key:226172672, sem_id:131072, sem_value:1!
2024-06-03 15:05:45.845 [INFO] database P0000008699 T0000000000000008699 DM Database Server 64 V8 03134284094-20231220-212751-20067 startup…
2024-06-03 15:05:45.846 [INFO] database P0000008699 T0000000000000008699 INI parameter ROLLSEG_POOLS changed, the original value 19, new value 1
2024-06-03 15:05:46.257 [WARNING] database P0000008699 T0000000000000008699 fail to load libraster.so, libodbc.so.2: cannot open shared object file: No such file or directory
2024-06-03 15:05:46.257 [INFO] database P0000008699 T0000000000000008699 fil_sys_init
2024-06-03 15:05:46.409 [INFO] database P0000008699 T0000000000000008699 Database mode = 1, oguid = 666666
2024-06-03 15:05:46.419 [INFO] database P0000008699 T0000000000000008699 Initialize temp tablespace, file path: /dm8/data/CJC/TEMP.DBF, code: 0
2024-06-03 15:05:46.451 [INFO] database P0000008699 T0000000000000008730 dm_mal_recv_thd started, src_site:0, dest_site:0, port_data
2024-06-03 15:05:46.451 [INFO] database P0000008699 T0000000000000008735 dm_mal_recv_thd started, src_site:0, dest_site:1, port_ctl
2024-06-03 15:05:46.451 [INFO] database P0000008699 T0000000000000008733 dm_mal_tsk_thd started, src_site:0, dest_site:1, port_data
2024-06-03 15:05:46.451 [INFO] database P0000008699 T0000000000000008729 dm_mal_tsk_thd started, src_site:0, dest_site:0, pot_ctl
2024-06-03 15:05:46.451 [INFO] database P0000008699 T0000000000000008736 dm_mal_recv_thd started, src_site:0, dest_site:1, port_data
2024-06-03 15:05:46.451 [INFO] database P0000008699 T0000000000000008734 dm_mal_tsk_thd started, src_site:0, dest_site:1, pot_ctl
2024-06-03 15:05:46.452 [INFO] database P0000008699 T0000000000000008739 mal_lsnr_thread created success, port_num=36339
2024-06-03 15:05:47.452 [INFO] database P0000008699 T0000000000000008740 mal_site_ctl_link_create startup from mal_site(0) to mal_site(0)!
2024-06-03 15:05:47.453 [INFO] database P0000008699 T0000000000000008740 mal_site_magic_gen site_magic[46784], src_site:0, dst_site:0
2024-06-03 15:05:47.453 [INFO] database P0000008699 T0000000000000008740 site[0] mal_site_ctl_port_set to site[0, IP: 192.168.0.11, port_num: 36339], socket handle = 10, site_magic = 46784, link_seq = 1
····
检查主库操作系统history:
前面通过监控器日志、实例日志等可以看到达梦主、备库自动切换过程,但是没有主库故障原因的相关日志,继续查看OS日志、history等:
在CJC01实例服务器上检查history:
终于找到原因,XXX人员未按操作规范,私自在服务器上执行重启网络服务、ifdown ethxxx等操作,时间完全吻合。
history部分日志如下:
····
xxx 2024-06-03-13:26:14 [root]service network restart
xxx 2024-06-03-13:27:15 [root]ifdown ethxxx
xxx 2024-06-03-13:38:38 [root]ip a
xxx 2024-06-03-13:39:15 [root]route -n
xxx 2024-06-03-13:39:39 [root]history
xxx 2024-06-03-15:04:55 [root]ip a
xxx 2024-06-03-15:05:29 [root]ifup ethxxx
xxx 2024-06-03-15:05:33 [root]ip a
····
重启主库网卡期间,由于监控器和备库都无法和主库通信,而监控器和备库之间可以通信,所以会认为主库故障,自动将主库踢出集群,并将备库提升为主库。
自动切换后,有哪些影响?
1.由于自动切换时间很短,应用无感知,无影响;
2.检查达梦数据库自动备份执行情况:
自动备份是通过主库内部JOB实现的,每晚自动备份到主库服务器指定目录下,检查主库,自动切换后,没有生成备份文件,那么是否可以认为达梦备份失败了?
检查JOB执行情况,备份无异常,正常执行。
····
SQL> set pagesize 1000
SQL> SELECT EXEC_ID,NAME,START_TIME,END_TIME,ERRCODE,JOBID FROM SYSJOB.SYSJOBHISTORIES2 ORDER BY START_TIME;
行号 EXEC_ID NAME START_TIME END_TIME ERRCODE JOBID
…
79 xxxxxxx07 DM_BAK_FULL_CJC 2024-06-03 22:00:17 2024-06-03 22:00:23 0 yyyyyyyyy4
80 xxxxxxx08 DM_BAK_FULL_CJC_DELETE 2024-06-03 23:00:17 2024-06-03 23:00:17 0 yyyyyyyyy5
81 xxxxxxx09 DM_BAK_FULL_CJC 2024-06-04 22:00:17 2024-06-04 22:00:22 0 yyyyyyyyy4
82 xxxxxxx10 DM_BAK_FULL_CJC_DELETE 2024-06-04 23:00:17 2024-06-04 23:00:17 0 yyyyyyyyy5
83 xxxxxxx11 DM_BAK_FULL_CJC 2024-06-05 22:00:18 2024-06-05 22:00:22 0 yyyyyyyyy4
84 xxxxxxx12 DM_BAK_FULL_CJC_DELETE 2024-06-05 23:00:18 2024-06-05 23:00:18 0 yyyyyyyyy5
····
检查备库,对应目录下发现了备份文件,也就是这种类型的自动备份只能在主库执行,主备切换后,备份任务会自动在新主库上自动接管。
最后为了方便运维,手动执行了回切操作。
###chenjuchao 20240615###
欢迎关注我的公众号《IT小Chen》