I. Installing Redis Sentinel
Check whether the redis-sentinel command is already installed on your system; if it is not, you can install it:
# already installed on my machine
shell> which redis-sentinel
/usr/bin/redis-sentinel
# how to install (Ubuntu 18.04):
shell> apt search redis | grep sentinel
shell> apt install redis-sentinel
shell> redis-sentinel --help
shell> find / -name "*sentinel*"|grep conf
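If you do not have the redis-sentinel command, you do not have to install it: redis-server can be started in Sentinel mode instead. Either way, a sentinel.conf configuration file is required.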
shell> redis-server /path/to/sentinel.conf --sentinel
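II. Introduction to Redis Sentinel
1. References
I will not go into full detail here; the official Redis documentation covers it very thoroughly (it took me a whole day to read through, perhaps because I am still a beginner).
Redis Sentinel official documentation
Redis Sentinel documentation (Chinese translation)
2. What Redis Sentinel does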
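- Monitoring: Sentinel continuously checks whether your masters and replicas are working as expected.
- Notification: when a monitored Redis server runs into a problem, Sentinel can notify the administrator or other applications through an API.
- Automatic failover: when a master is not working properly, Sentinel starts an automatic failover. It promotes one of the failed master's replicas to be the new master and reconfigures the remaining replicas to replicate from it; when clients try to connect to the failed master, the cluster also returns the address of the new master, so that the new master takes over from the failed one.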
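3. Redis Sentinel failover
(1) The failover process
A failover consists of the following steps:
- Detect that the master has entered the objectively down (ODOWN) state.
- Increment our current epoch (see Raft leader election for details) and try to be elected in this epoch.
- If the election fails, retry after twice the configured failover timeout. If the election succeeds, carry out the following steps.
- Select a replica to be promoted to master.
- Send the SLAVEOF NO ONE command to the selected replica so that it becomes a master.
- Propagate the updated configuration to all other Sentinels through Pub/Sub; the other Sentinels update their own configuration.
- Send the SLAVEOF command to the replicas of the downed master so that they replicate the new master.
- Once all replicas have started replicating the new master, the leader Sentinel ends the failover.
Whenever a Redis instance is reconfigured (promoted to master, turned into a replica, or made a replica of a different master), Sentinel sends it a CONFIG REWRITE command so that the new configuration is persisted to disk.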
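The reconfiguration steps above can be sketched as a small function that lists which command is sent to which instance. This is only an illustration of the ordering described in the list: `plan_failover_commands` and the instance labels are hypothetical, the Pub/Sub propagation step is left out, and real Sentinel does all of this internally rather than through such an API.

```python
from typing import List, Tuple


def plan_failover_commands(promoted: str,
                           new_master_addr: Tuple[str, int],
                           other_replicas: List[str]) -> List[Tuple[str, str]]:
    """Return (target instance, command) pairs in the order described above."""
    host, port = new_master_addr
    plan = [
        # The selected replica stops replicating and becomes a master.
        (promoted, "SLAVEOF NO ONE"),
        # Every reconfigured instance is asked to persist its new role.
        (promoted, "CONFIG REWRITE"),
    ]
    for replica in other_replicas:
        # The remaining replicas are pointed at the new master.
        plan.append((replica, f"SLAVEOF {host} {port}"))
        plan.append((replica, "CONFIG REWRITE"))
    return plan


if __name__ == "__main__":
    for target, command in plan_failover_commands("R2", ("127.0.0.1", 6380), ["R3"]):
        print(f"{target}: {command}")
```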
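(2) An important parameter
There is one very important and interesting parameter I want to highlight:
sentinel monitor <master-group-name> <ip> <port> <quorum>
That's right, quorum. It is essential to understanding Redis Sentinel failover, so be sure to read the official documentation carefully.
Here is a brief explanation of what this parameter does: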
- Quorum: the number of Sentinel processes that need to detect an error condition in order for a master to be flagged as ODOWN.
- The failover is triggered by the ODOWN state.
- Once the failover is triggered, the Sentinel trying to failover is required to ask for authorization to a majority of Sentinels (or more than the majority if the quorum is set to a number greater than the majority).
The difference may seem subtle but is actually quite simple to understand and use.
For example if you have 5 Sentinel instances, and the quorum is set to 2, a failover will be triggered as soon as 2 Sentinels believe that the master is not reachable, however one of the two Sentinels will be able to failover only if it gets authorization at least from 3 Sentinels.
If instead the quorum is configured to 5, all the Sentinels must agree about the master error condition, and the authorization from all Sentinels is required in order to failover.
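The two thresholds described in the quoted passage can be written down as a short calculation. This is a minimal sketch of the arithmetic only; the function name `failover_thresholds` and the dictionary keys are made up for illustration.

```python
def failover_thresholds(total_sentinels: int, quorum: int) -> dict:
    """Compute the two vote counts involved in a Sentinel failover.

    odown_votes: Sentinels that must agree the master is down (the quorum).
    authorization_votes: Sentinels that must authorize the failover, which is
    the majority, or the quorum if the quorum is larger than the majority.
    """
    majority = total_sentinels // 2 + 1
    return {
        "odown_votes": quorum,
        "authorization_votes": max(quorum, majority),
    }


if __name__ == "__main__":
    print(failover_thresholds(5, 2))
    print(failover_thresholds(5, 5))
```

With 5 Sentinels and a quorum of 2 this gives 2 votes for ODOWN and 3 for authorization; with a quorum of 5 both numbers are 5, matching the two examples in the quoted text.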
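3. How Redis Sentinel selects the replica to promote
(1) From the Chinese documentation
Sentinel uses the following rules to choose the new master:
- Among the replicas of the failed master, those that are marked as subjectively down, are disconnected, or last replied to a PING more than five seconds ago are eliminated.
- Among the replicas of the failed master, those whose link to the failed master has been down for more than ten times the duration set by the down-after option are eliminated.
- Among the replicas that survive these two rounds of elimination, the one with the largest replication offset is chosen as the new master; if the replication offset is unavailable, or the offsets are equal, the replica with the smallest run ID becomes the new master.
(2) From the English documentation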
When a Sentinel instance is ready to perform a failover, since the master is in ODOWN state and the Sentinel received the authorization to failover from the majority of the Sentinel instances known, a suitable replica needs to be selected.
The replica selection process evaluates the following information about replicas:
- Disconnection time from the master.
- Replica priority.
- Replication offset processed.
- Run ID.
That's right, Replica priority is mentioned here.
The replicas are sorted by replica-priority as configured in the redis.conf file of the Redis instance. A lower priority will be preferred.
In other words, if replica-priority is configured, the instance with the lower value is preferred.
Redis masters (that may be turned into replicas after a failover), and replicas, all must be configured with a replica-priority if there are machines to be strongly preferred. Otherwise all the instances can run with the default run ID (which is the suggested setup, since it is far more interesting to select the replica by replication offset).
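In other words, if some of your machines should be strongly preferred over others, configure this setting; otherwise ignore it, and the replication offset becomes the deciding criterion.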
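Putting the two descriptions together, the selection can be sketched as a filter followed by a sort. This is a simplified illustration, not Sentinel's actual code: the `Replica` class and its field names are hypothetical, the elimination thresholds are the ones from the Chinese documentation quoted above, and the rule that a priority of 0 excludes a replica from promotion comes from the comments in redis.conf rather than from the passages quoted here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    run_id: str
    priority: int                  # replica-priority from redis.conf
    repl_offset: int               # replication offset processed
    sdown: bool = False            # marked subjectively down
    disconnected: bool = False
    last_ping_reply_ms: int = 0    # ms since the last reply to PING
    master_link_down_ms: int = 0   # ms disconnected from the failed master


def select_replica(replicas: List[Replica], down_after_ms: int) -> Optional[Replica]:
    """Pick the replica to promote, or None if no replica qualifies."""
    candidates = [
        r for r in replicas
        if not r.sdown
        and not r.disconnected
        and r.last_ping_reply_ms <= 5000
        and r.master_link_down_ms <= 10 * down_after_ms
        and r.priority != 0        # priority 0: never promote (per redis.conf)
    ]
    if not candidates:
        return None
    # Lower priority first, then larger offset, then smaller run ID.
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.run_id))


if __name__ == "__main__":
    replicas = [Replica("aaa", 100, 500), Replica("bbb", 100, 900)]
    print(select_replica(replicas, down_after_ms=6000))
```

When every replica keeps the default priority, the first sort key is a tie and the replication offset decides, which is the setup the documentation recommends.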
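4. Handling network partitions in Redis
When Redis suffers a network partition (similar to split brain), some writes are lost. How can the amount of lost data be kept to a minimum?
(1) The problem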
In every Sentinel setup, as Redis uses asynchronous replication, there is always the risk of losing some writes because a given acknowledged write may not be able to reach the replica which is promoted to master. However in the above setup there is an higher risk due to clients being partitioned away with an old master, like in the following picture:
In this case a network partition isolated the old master M1, so the replica R2 is promoted to master. However clients, like C1, that are in the same partition as the old master, may continue to write data to the old master. This data will be lost forever since when the partition will heal, the master will be reconfigured as a replica of the new master, discarding its data set.
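In other words, when M1 is partitioned away from R2 and R3, R2 is promoted to be the new master while client C1 can still write to M1. When the network recovers, the writes that went to the old M1 are lost.
(2) The mitigation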
This problem can be mitigated using the following Redis replication feature, that allows to stop accepting writes if a master detects that it is no longer able to transfer its writes to the specified number of replicas.
min-replicas-to-write 1
min-replicas-max-lag 10
With the above configuration (please see the self-commented redis.conf example in the Redis distribution for more information) a Redis instance, when acting as a master, will stop accepting writes if it can’t write to at least 1 replica. Since replication is asynchronous not being able to write actually means that the replica is either disconnected, or is not sending us asynchronous acknowledges for more than the specified max-lag number of seconds.
Using this configuration, the old Redis master M1 in the above example, will become unavailable after 10 seconds. When the partition heals, the Sentinel configuration will converge to the new one, the client C1 will be able to fetch a valid configuration and will continue with the new master.
However there is no free lunch. With this refinement, if the two replicas are down, the master will stop accepting writes. It’s a trade off.
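In other words, if the problem in (1) occurs, M1 is required to be replicating successfully to at least 1 replica, where success means having received an acknowledgement from that replica within the last 10 seconds. Otherwise M1 considers itself unhealthy and stops accepting writes. When the network recovers, M1 syncs its data from the new master (the promoted R2).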
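The check that these two settings imply can be sketched in a few lines. This is an illustration of the rule as described above, not the Redis implementation; `master_accepts_writes` and its arguments are hypothetical names, and the lag values stand for the seconds since each replica's last acknowledgement.

```python
from typing import Iterable


def master_accepts_writes(replica_lags_s: Iterable[float],
                          min_replicas_to_write: int = 1,
                          min_replicas_max_lag: int = 10) -> bool:
    """Return True if enough replicas acknowledged recently enough."""
    good = sum(1 for lag in replica_lags_s if lag <= min_replicas_max_lag)
    return good >= min_replicas_to_write


if __name__ == "__main__":
    # M1 is isolated: no replica has acknowledged within the last 10 seconds.
    print(master_accepts_writes([25.0, 31.0]))
    # Normal operation: both replicas are up to date.
    print(master_accepts_writes([0.4, 1.2]))
```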
5. Redis Sentinel before and after failover, illustrated
Normal operation:
Master down:
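III. Configuring and using Redis Sentinel
1. References
Redis Sentinel introduction and usage
Redis Sentinel configuration
Redis Sentinel official documentation
2. A reasonably ideal Sentinel setup
3. Notifying clients before and after failover
How are clients notified when a failover starts and when it ends?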
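(1) Library support
Some libraries already implement this, for example redis-py:
- redis-py checks whether the master address has changed when it establishes a connection.
- redis-py catches a special exception, ReadOnlyError, when executing write commands; in the handler it closes all old connections, so subsequent commands reconnect.
(2) What if the library does not implement it?
If the library has no such mechanism, redis-sentinel itself offers two.
When a connection error occurs, the client can actively ask Sentinel for the master address: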
127.0.0.1:5000> SENTINEL get-master-addr-by-name mymaster
1) "127.0.0.1"
2) "6379"
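A client that does this lookup by hand usually tries each known Sentinel in turn until one answers. The sketch below shows that loop with the actual network call injected as a function, so it stays independent of any particular client library; `find_master` and the `query` callback are hypothetical names, and `query` is assumed to raise `OSError` when a Sentinel is unreachable.

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple

Address = Tuple[str, int]
# A query takes (sentinel address, master name) and returns the raw
# two-element reply [ip, port], or None if the master is unknown.
Query = Callable[[Address, str], Optional[Sequence[str]]]


def find_master(sentinels: Iterable[Address], master_name: str, query: Query) -> Address:
    """Ask each Sentinel in turn for the current master address."""
    for sentinel in sentinels:
        try:
            reply = query(sentinel, master_name)
        except OSError:
            continue  # this Sentinel is unreachable, try the next one
        if reply:
            ip, port = reply
            return str(ip), int(port)
    raise LookupError(f"no Sentinel could report a master for {master_name!r}")


if __name__ == "__main__":
    def canned_query(sentinel: Address, name: str) -> Sequence[str]:
        # Stand-in for SENTINEL get-master-addr-by-name, returning the reply shown above.
        return ["127.0.0.1", "6379"]

    print(find_master([("127.0.0.1", 5000)], "mymaster", canned_query))
```

With redis-py, `redis.sentinel.Sentinel([("127.0.0.1", 5000)]).discover_master("mymaster")` performs a comparable lookup and returns an `(ip, port)` tuple, so a hand-written loop like this is only needed when the client library has no Sentinel support.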
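A client (or the calling service) can also subscribe to failover events through Sentinel's Pub/Sub mechanism, so that Sentinel notifies it when a failover happens.
See the official documentation for details: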
+reset-master <instance details> -- The master was reset.
+slave <instance details> -- A new replica was detected and attached.
+failover-state-reconf-slaves <instance details> -- Failover state changed to reconf-slaves state.
+failover-detected <instance details> -- A failover started by another Sentinel or any other external entity was detected (An attached replica turned into a master).
+slave-reconf-sent <instance details> -- The leader sentinel sent the REPLICAOF command to this instance in order to reconfigure it for the new replica.
+slave-reconf-inprog <instance details> -- The replica being reconfigured showed to be a replica of the new master ip:port pair, but the synchronization process is not yet complete.
+slave-reconf-done <instance details> -- The replica is now synchronized with the new master.
-dup-sentinel <instance details> -- One or more sentinels for the specified master were removed as duplicated (this happens for instance when a Sentinel instance is restarted).
+sentinel <instance details> -- A new sentinel for this master was detected and attached.
+sdown <instance details> -- The specified instance is now in Subjectively Down state.
-sdown <instance details> -- The specified instance is no longer in Subjectively Down state.
+odown <instance details> -- The specified instance is now in Objectively Down state.
-odown <instance details> -- The specified instance is no longer in Objectively Down state.
+new-epoch <instance details> -- The current epoch was updated.
+try-failover <instance details> -- New failover in progress, waiting to be elected by the majority.
+elected-leader <instance details> -- Won the election for the specified epoch, can do the failover.
+failover-state-select-slave <instance details> -- New failover state is select-slave: we are trying to find a suitable replica for promotion.
-failover-abort-no-good-slave <instance details> -- There is no good replica to promote. Currently we'll try after some time, but probably this will change and the state machine will abort the failover at all in this case.
+selected-slave <instance details> -- We found the specified good replica to promote.
+failover-state-send-slaveof-noone <instance details> -- We are trying to reconfigure the promoted replica as master, waiting for it to switch.
+failover-end-for-timeout <instance details> -- The failover terminated for timeout, replicas will eventually be configured to replicate with the new master anyway.
+failover-end <instance details> -- The failover terminated with success. All the replicas appears to be reconfigured to replicate with the new master.
+switch-master <master name> <oldip> <oldport> <newip> <newport> -- The master new IP and address is the specified one after a configuration change. This is the message most external users are interested in.
+tilt -- Tilt mode entered.
-tilt -- Tilt mode exited.
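Since the switch-master event is the one most clients care about, here is a minimal sketch of how its payload could be parsed and consumed. It assumes the event is published on a channel named `+switch-master` (Sentinel uses the event name as the channel name) and that the redis-py package is installed; `MasterSwitch`, `parse_switch_master` and `watch_switch_master` are hypothetical helper names, and port 26379 is just the first Sentinel port used later in this post.

```python
from typing import Iterator, NamedTuple, Tuple


class MasterSwitch(NamedTuple):
    master_name: str
    old_addr: Tuple[str, int]
    new_addr: Tuple[str, int]


def parse_switch_master(payload: str) -> MasterSwitch:
    """Parse '<master name> <oldip> <oldport> <newip> <newport>'."""
    name, old_ip, old_port, new_ip, new_port = payload.split()
    return MasterSwitch(name, (old_ip, int(old_port)), (new_ip, int(new_port)))


def watch_switch_master(sentinel_host: str = "127.0.0.1",
                        sentinel_port: int = 26379) -> Iterator[MasterSwitch]:
    """Yield a MasterSwitch for every +switch-master event from one Sentinel."""
    import redis  # third-party redis-py package, imported only when needed

    client = redis.Redis(host=sentinel_host, port=sentinel_port, decode_responses=True)
    pubsub = client.pubsub()
    pubsub.subscribe("+switch-master")
    for message in pubsub.listen():
        if message["type"] == "message":
            yield parse_switch_master(message["data"])


if __name__ == "__main__":
    print(parse_switch_master("manage1 127.0.0.1 6379 127.0.0.1 6380"))
```

A caller would typically run `watch_switch_master()` in a background thread and rebuild its connection pool whenever a new address arrives.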
4. Redis Sentinel configuration
Note that these are only the most basic settings.
(1) Redis master and replica configuration
redis_master.conf
bind 127.0.0.1
port 6379
redis_slave.conf
bind 127.0.0.1
port 6380
slaveof 127.0.0.1 6379
(2) Redis Sentinel configuration
sentinel_1.conf
# optional settings, adjust as needed
daemonize yes
pidfile "/home/ubuntu/Software/redis/log/redis-sentinel1.pid"
logfile "/home/ubuntu/Software/redis/log/redis-sentinel1.log"
# Protected mode: uncomment the next line if Sentinel is reached through a LAN or public IP
# protected-mode no
# If a "sentinel myid" line like the one below is active, be sure to comment it out
# sentinel myid 05cf578c6cf3d38402b941e50ebbed6325fff4c9
# regular settings
port 26379
sentinel monitor manage1 127.0.0.1 6379 2
sentinel down-after-milliseconds manage1 6000
sentinel failover-timeout manage1 18000
sentinel parallel-syncs manage1 1
sentinel_2.conf
# optional settings, adjust as needed
daemonize yes
pidfile "/home/ubuntu/Software/redis/log/redis-sentinel2.pid"
logfile "/home/ubuntu/Software/redis/log/redis-sentinel2.log"
# Protected mode: uncomment the next line if Sentinel is reached through a LAN or public IP
# protected-mode no
# If a "sentinel myid" line like the one below is active, be sure to comment it out
# sentinel myid 05cf578c6cf3d38402b941e50ebbed6325fff4c9
# regular settings
port 26380
sentinel monitor manage1 127.0.0.1 6379 2
sentinel down-after-milliseconds manage1 6000
sentinel failover-timeout manage1 18000
sentinel parallel-syncs manage1 1
sentinel_3.conf
# optional settings, adjust as needed
daemonize yes
pidfile "/home/ubuntu/Software/redis/log/redis-sentinel3.pid"
logfile "/home/ubuntu/Software/redis/log/redis-sentinel3.log"
# Protected mode: uncomment the next line if Sentinel is reached through a LAN or public IP
# protected-mode no
# If a "sentinel myid" line like the one below is active, be sure to comment it out
# sentinel myid 05cf578c6cf3d38402b941e50ebbed6325fff4c9
# regular settings
port 26381
sentinel monitor manage1 127.0.0.1 6379 2
sentinel down-after-milliseconds manage1 6000
sentinel failover-timeout manage1 18000
sentinel parallel-syncs manage1 1
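The three Sentinel files differ only in the port and in the pid and log file names, so one way to keep the master name consistent across every `sentinel` directive is to generate them from a single template. The sketch below does that; `render_sentinel_conf` is a hypothetical helper, and the default values are simply the ones used in the files above.

```python
def render_sentinel_conf(index: int, port: int,
                         master_name: str = "manage1",
                         master_host: str = "127.0.0.1",
                         master_port: int = 6379,
                         quorum: int = 2,
                         log_dir: str = "/home/ubuntu/Software/redis/log") -> str:
    """Render the text of sentinel_<index>.conf with one shared master name."""
    lines = [
        "daemonize yes",
        f'pidfile "{log_dir}/redis-sentinel{index}.pid"',
        f'logfile "{log_dir}/redis-sentinel{index}.log"',
        f"port {port}",
        f"sentinel monitor {master_name} {master_host} {master_port} {quorum}",
        f"sentinel down-after-milliseconds {master_name} 6000",
        f"sentinel failover-timeout {master_name} 18000",
        f"sentinel parallel-syncs {master_name} 1",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for i, port in enumerate((26379, 26380, 26381), start=1):
        print(f"# sentinel_{i}.conf")
        print(render_sentinel_conf(i, port))
```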
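(3) Running the Redis master, replica, and Sentinels
# start the Redis master and replica (preferably with root privileges)
shell> sudo redis-server ./ha/redis_master.conf
shell> sudo redis-server ./ha/redis_slave.conf
# start the Sentinels (run with root privileges, or as a user that can write to the config, pid, and log files, since Sentinel rewrites its own config)
shell> sudo redis-sentinel ./sentinel/sentinel_1.conf
shell> sudo redis-sentinel ./sentinel/sentinel_2.conf
shell> sudo redis-sentinel ./sentinel/sentinel_3.conf
The running processes are shown in the figure below; the failover test is not repeated here: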