Pacemaker+Corosync+PostgreSQL集群故障测试和解决方案

最新推荐文章于 2025-05-22 21:31:30 发布

转载

最新推荐文章于 2025-05-22 21:31:30 发布 · 1.5k 阅读

2 ·

CC 4.0 BY-SA版权

原文链接：https://my.oschina.net/aven92/blog/519458

文章标签：

#数据库 #php #运维

本文详细介绍了在Pacemaker+Corosync环境下，PostgreSQL集群遇到的各种故障，包括单节点和两节点故障，以及Pacemaker和Corosync自身的故障。针对每种故障，提供了详细的解决步骤，确保集群在故障发生时仍能保持服务的可用性。

摘要生成于 C知道，由 DeepSeek-R1 满血版支持，前往体验 >

为什么80%的码农都做不了架构师？>>>

· 2015年度PG大象会报名地址：http://postgres2015.eventdove.com/

· PostgreSQL中国社区： http://postgres.cn/index.php/home

· PostgreSQL专业1群： 3336901（已满）

· PostgreSQL专业2群： 100910388

· PostgreSQL专业3群： 150657323

· PostgresChina微信公众号 PostgreSQL用户会微博二维码

一、环境

$ cat /etc/redhat-release 
CentOS Linux release 7.0.1406 (Core) 
$ uname -a
Linux zhaopin-5-90 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09:22 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
$ sudo pcs status
Cluster name: pgcluster
WARNING: corosync and pacemaker node names do not match (IPs used in setup?)
Last updated: Tue Oct 20 11:02:08 2015Last change: Tue Oct 20 10:57:37 2015 by root via crm_attribute on zhaopin-5-92
Stack: corosync
Current DC: zhaopin-5-92 (version 1.1.13-a14efad) - partition with quorum
3 nodes and 5 resources configured
Online: [ zhaopin-5-90 zhaopin-5-91 zhaopin-5-92 ]
Full list of resources:
 Master/Slave Set: pgsql-master [pgsql]
     Masters: [ zhaopin-5-92 ]
     Slaves: [ zhaopin-5-90 zhaopin-5-91 ]
 Resource Group: master-group
     vip-master(ocf::heartbeat:IPaddr2):Started zhaopin-5-92
 Resource Group: slave-group
     vip-slave(ocf::heartbeat:IPaddr2):Started zhaopin-5-90
PCSD Status:
  zhaopin-5-90 (172.17.5.90): Online
  zhaopin-5-91 (172.17.5.91): Online
  zhaopin-5-92 (172.17.5.92): Online
Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

node1: 172.17.5.90
node2: 172.17.5.91
node3: 172.17.5.92
node4: 172.17.5.93 应急

vip-master: 172.17.5.99
vip-slave:  172.17.5.98

二、PostgreSQL故障

1. 单节点故障

1）master故障

在node3上执行：

$ ps aux | grep postgres | grep -v grep
postgres  29482  0.0  0.2 103216  7900 ?        S    10:57   0:00 /usr/bin/postgres -D /data/postgresql/data -c config_file=/data/postgresql/data/postgresql.conf
postgres  29524  0.0  0.0 103216  1980 ?        Ss   10:57   0:00 postgres: checkpointer process   
postgres  29525  0.0  0.0 103216  1988 ?        Ss   10:57   0:00 postgres: writer process   
postgres  29526  0.0  0.0 103216  1896 ?        Ss   10:57   0:00 postgres: wal writer process   
postgres  29527  0.0  0.0 104044  3008 ?        Ss   10:57   0:00 postgres: autovacuum launcher process   
postgres  29528  0.0  0.0  88912  1792 ?        Ss   10:57   0:00 postgres: archiver process   last was 00000001000000000000000E
postgres  29529  0.0  0.0  89024  1920 ?        Ss   10:57   0:00 postgres: stats collector process   
postgres  29790  0.0  0.0 104060  3208 ?        Ss   10:57   0:00 postgres: wal sender process postgres 172.17.5.90(35734) streaming 0/F0000E0
postgres  29791  0.0  0.0 104192  3392 ?        Ss   10:57   0:00 postgres: wal sender process postgres 172.17.5.91(60722) streaming 0/F0000E0
$ psql -U postgres
psql (9.2.13)
Type "help" for help.
postgres=# select * from pg_stat_replication ;
  pid  | usesysid | usename  | application_name | client_addr | client_hostname | client_port |         backend_start         |   state   | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state 
-------+----------+----------+------------------+-------------+-----------------+-------------+-------------------------------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
 29790 |       10 | postgres | zhaopin-5-90     | 172.17.5.90 |                 |       35734 | 2015-10-20 02:57:25.718047+00 | streaming | 0/F0000E0     | 0/F0000E0      | 0/F0000E0      | 0/F0000E0       |             1 | sync
 29791 |       10 | postgres | zhaopin-5-91     | 172.17.5.91 |                 |       60722 | 2015-10-20 02:57:25.718566+00 | streaming | 0/F0000E0     | 0/F0000E0      | 0/F0000E0      | 0/F0000E0       |             2 | potential
(2 rows)
postgres=# \q

杀掉PostgreSQL的进程

$ sudo killall postgres
$ ps aux | grep postgres | grep -v grep

PostgreSQL的进程已经没有了

$ sudo pcs status
Cluster name: pgcluster
WARNING: corosync and pacemaker node names do not match (IPs used in setup?)
Last updated: Tue Oct 20 11:11:54 2015Last change: Tue Oct 20 11:10:31 2015 by root via crm_attribute on zhaopin-5-90
Stack: corosync
Current DC: zhaopin-5-92 (version 1.1.13-a14efad) - partition with quorum
3 nodes and 5 resources configured
Online: [ zhaopin-5-90 zhaopin-5-91 zhaopin-5-92 ]
Full list of resources:
 Master/Slave Set: pgsql-master [pgsql]
     Masters: [ zhaopin-5-90 ]
     Slaves: [ zhaopin-5-91 ]
     Stopped: [ zhaopin-5-92 ]
 Resource Group: master-group
     vip-master(ocf::heartbeat:IPaddr2):Started zhaopin-5-90
 Resource Group: slave-group
     vip-slave(ocf::heartbeat:IPaddr2):Started zhaopin-5-91
Failed Actions:
* pgsql_start_0 on zhaopin-5-92 'unknown error' (1): call=33, status=complete, exitreason='My data may be inconsistent. You have to remove /var/lib/pgsql/tmp/PGSQL.lock file to force start.',
    last-rc-change='Tue Oct 20 11:10:14 2015', queued=0ms, exec=389ms
PCSD Status:
  zhaopin-5-90 (172.17.5.90): Online
  zhaopin-5-91 (172.17.5.91): Online
  zhaopin-5-92 (172.17.5.92): Online
Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

查看集群状态可见，master和vip-master已经切换到node1上了

在node1上执行:

$ psql -U postgres
psql (9.2.13)
Type "help" for help.
postgres=# select * from pg_stat_replication ;
  pid

最低0.47元/天解锁文章

200万优质内容无限畅学