MIT 6.824 Lec5&6&7 Fault Tolerance: Raft Q&A

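Raft is a consensus algorithm for distributed systems. It is known for being simple and easy to understand, at some cost in performance: for example, it persists each operation to the log separately and does not batch them. It is nevertheless used in projects such as Docker and etcd. Raft is easier to understand than Paxos, but it is not necessarily the best choice for large systems. The service is briefly unavailable during an election, but Raft does not suffer split brain under a network partition. Randomized election timeouts make simultaneous candidates, and the resulting split votes, unlikely. On security, Raft assumes that all servers are legitimate and relies on external mechanisms to resist attacks.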

Does Raft sacrifice anything for simplicity?

Raft gives up some performance in exchange for simplicity. For example:

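  • Each operation is written to the log and persisted on its own. Grouping many operations into one batch and writing them together would be more efficient.
  • The leader can have only one AppendEntries RPC in flight to each follower. Pipelining several RPCs would probably be better.
  • The snapshot design only suits a relatively small state, because the entire state has to be written to disk. A large system would typically write only the parts of the state that changed recently.
  • Servers may not make full use of multi-core CPUs, because operations must be executed one at a time, in log order.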

Is Raft used in real-world software, or do companies generally roll their own flavor of Paxos (or use a different consensus protocol)?

Plenty of real-world software uses Raft, including Docker (https://docs.docker.com/engine/swarm/raft/), etcd (https://etcd.io), and MongoDB.

What is Paxos? In what sense is Raft simpler?

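Paxos is also a consensus algorithm. Raft is easier to understand than Paxos as a basis for a complete replicated service, but the Paxos mechanism itself is simpler than Raft, because basic Paxos only gets a set of servers to agree on a single value. Google Chubby is built on Paxos, and ZooKeeper uses ZAB, a closely related protocol. Although Raft is easy to understand, it trades away some performance, so in practice it is not necessarily the best choice for large distributed systems.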

How long had Paxos existed before the authors created Raft?

Paxos was devised in the late 1980s, roughly two decades before Raft.

In Raft, the service which is being replicated is not available to the clients during an election process. In practice how much of a problem does this cause?

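In Raft, the pause in service caused by a leader election is typically on the order of a tenth of a second. According to the Raft paper, elections happen only when servers fail or the network misbehaves, so they should be rare.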

Are there other consensus systems that don’t have leader-election pauses?

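Yes. Some Paxos-based replication schemes have no leader and therefore no elections, so they avoid election pauses. The usual price is that each agreement takes more messages.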

How are Raft and VMware FT related?

VMware FT has a single point of failure, namely the test-and-set server, whereas Raft has none. In that respect Raft offers better fault tolerance than VMware FT.

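VMware FT replicates at the level of machine operations, so it can back up whatever runs inside the virtual machine. Raft replicates only application-level data, in software written to use it. For most such systems Raft is therefore likely to be more efficient.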

Why can’t a malicious person take over a Raft server, or forge incorrect Raft messages?

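Raft includes no defense against attackers; it assumes that every server is legitimate. Protection has to come from outside Raft, for example by encrypting RPC traffic or by placing the servers behind a firewall.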

The paper mentions that Raft works under all non-Byzantine conditions. What are Byzantine conditions and why could they make Raft fail?

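A non-Byzantine failure is a fail-stop failure: the server stops working but never produces incorrect output. A Byzantine failure is one in which a server, because of an attack or a bug, produces incorrect output or results. If a Byzantine failure occurs, Raft may return incorrect results to the client.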

What if a client sends a request to a leader, but the leader crashes before sending the client request to all followers, and the new leader doesn’t have the request in its log? Won’t that cause the client request to be lost?

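Yes. Any log entry that has not been committed may be overwritten, and so lost, when the leader changes. But because the client never receives a reply, it re-sends the request. Client retransmission in turn means that the system must be able to detect duplicate requests.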
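
Duplicate detection can be sketched as follows. This is a minimal illustration, not something the Raft paper prescribes in this form: the `Request` type, its `ClientID` and `Seq` fields, and the `KVStore` state machine are all hypothetical names chosen for the example. The idea is that each client numbers its requests, and the state machine remembers the highest number it has applied for each client.

```go
package main

// Request is a hypothetical client request. ClientID and Seq are
// illustrative fields, not part of the Raft paper's RPC definitions.
type Request struct {
	ClientID int64
	Seq      int64 // increases by one for each new request from a client
	Key      string
	Value    string
}

// KVStore is a toy state machine that remembers, per client, the highest
// sequence number it has already applied.
type KVStore struct {
	data    map[string]string
	lastSeq map[int64]int64
}

// NewKVStore returns an empty store.
func NewKVStore() *KVStore {
	return &KVStore{data: map[string]string{}, lastSeq: map[int64]int64{}}
}

// Apply executes a committed request at most once. It returns false when
// the request is a retransmission that was already applied.
func (s *KVStore) Apply(r Request) bool {
	if last, ok := s.lastSeq[r.ClientID]; ok && r.Seq <= last {
		return false // duplicate: skip re-execution
	}
	s.data[r.Key] += r.Value // an append, so executing twice would be visible
	s.lastSeq[r.ClientID] = r.Seq
	return true
}
```

With this in place, a client that retries request 1 after a leader change gets the append applied only once, even if both copies end up committed in the log.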

If there’s a network partition, can Raft end up with two leaders and split brain?

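No. Raft has at most one leader per term, and majority voting rules out split brain. When the network partitions, a partition without a majority of the servers can never elect a leader, and at most one partition can contain a majority, so split brain cannot occur.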
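
The majority arithmetic can be written down in a few lines. The function names `quorum` and `canElectLeader` are made up for this sketch; the point is only that the threshold is computed from the size of the whole cluster, not from the number of reachable servers.

```go
package main

// quorum returns the number of servers that make up a majority of the
// whole cluster, counting crashed or unreachable servers too.
func quorum(clusterSize int) int {
	return clusterSize/2 + 1
}

// canElectLeader reports whether a partition containing partitionSize of
// the cluster's servers could gather enough votes to elect a leader.
func canElectLeader(partitionSize, clusterSize int) bool {
	return partitionSize >= quorum(clusterSize)
}
```

For a five-server cluster split 3/2, only the three-server side can elect a leader. For a four-server cluster split 2/2, neither side can, which is one reason odd cluster sizes are common.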

Suppose a new leader is elected while the network is partitioned, but the old leader is in a different partition. How will the old leader know to stop committing new entries?

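The new leader's partition holds a majority, so the old leader must be in a minority partition and can never meet the commit condition for a new entry. Once the network heals, the old leader's term is lower than the new leader's, so as soon as it receives an RPC carrying the newer term it steps down to follower.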
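
The step-down rule can be sketched like this. The `Server` struct and the `observeTerm` helper are hypothetical names for this example; the rule itself (adopt any higher term seen in an RPC and revert to follower) is the one the Raft paper gives for all servers.

```go
package main

// Role is the state a Raft server is currently in.
type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// Server holds only the fields this sketch needs.
type Server struct {
	currentTerm int
	votedFor    int // -1 means "no vote cast in currentTerm"
	role        Role
}

// observeTerm is called with the term carried by any incoming RPC request
// or reply. It returns true if the server adopted a newer term and
// therefore stepped down to follower.
func (s *Server) observeTerm(term int) bool {
	if term <= s.currentTerm {
		return false
	}
	s.currentTerm = term
	s.votedFor = -1
	s.role = Follower
	return true
}
```

An old leader stuck at term 3 that receives an AppendEntries RPC for term 5 would call `observeTerm(5)`, become a follower, and stop trying to replicate its own entries.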

When some servers have failed, does “majority” refer to a majority of the live servers, or a majority of all servers (even the dead ones)?

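A majority of all servers, including the dead ones. In a five-server cluster, for example, three servers are needed even if two have failed.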

What if the election timeout is too short? Will that cause Raft to malfunction?

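If the election timeout is too short, followers may start an election before the leader has had time to send a heartbeat. The system then spends most of its time on leader elections and has none left for client requests. If the election timeout is too long, the whole system stays paused for too long after a leader fails.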

Why randomize election timeouts?

To make it unlikely that several candidates appear at the same time and split the vote.
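
A randomized timeout can be sketched as below. The 150 to 300 ms range is the example range given in the Raft paper; it is an assumption here, and a real deployment may need other values depending on heartbeat interval and network latency. The constant and function names are made up for the sketch.

```go
package main

import (
	"math/rand"
	"time"
)

// Example range only; tune it to the heartbeat interval in use.
const (
	electionTimeoutMin = 150 * time.Millisecond
	electionTimeoutMax = 300 * time.Millisecond
)

// randomElectionTimeout picks a fresh timeout uniformly from
// [electionTimeoutMin, electionTimeoutMax).
func randomElectionTimeout() time.Duration {
	span := int64(electionTimeoutMax - electionTimeoutMin)
	return electionTimeoutMin + time.Duration(rand.Int63n(span))
}
```

Each server would draw a new value every time it resets its election timer, so that after a leader failure one follower usually times out well before the others and wins the election unopposed.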

Can a candidate declare itself the leader as soon as it receives votes from a majority, and not bother waiting for further RequestVote replies?

Yes. Votes from a majority are sufficient.

Can a leader in Raft ever stop being a leader except by crashing?

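Yes. For example, the leader's network connection may fail, or congestion may cause its packets to be dropped. Any follower that hears nothing from the leader within its election timeout starts a new election, and the old leader steps down once it learns of the higher term.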

When are followers’ log entries sent to their state machines?

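A follower learns from the leaderCommit field of an AppendEntries RPC that the leader has committed a log entry. The follower can then mark the corresponding entries as committed and apply them to its state machine, in log order.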
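
The follower side of this can be sketched as follows. The `FollowerState` struct and the `onLeaderCommit` helper are hypothetical names for the example; the commit rule (take the smaller of leaderCommit and the index of the last new entry) follows the AppendEntries receiver rules in the Raft paper. The `applied` slice merely stands in for a real state machine.

```go
package main

// Entry is a log entry; Command is kept as a string for brevity.
type Entry struct {
	Term    int
	Command string
}

// FollowerState keeps a 1-indexed log: log[0] is an unused placeholder.
type FollowerState struct {
	log         []Entry
	commitIndex int
	lastApplied int
	applied     []string // stands in for the real state machine
}

// onLeaderCommit runs after an AppendEntries RPC has been accepted.
// lastNewIndex is the index of the last entry carried by that RPC.
func (f *FollowerState) onLeaderCommit(leaderCommit, lastNewIndex int) {
	newCommit := leaderCommit
	if lastNewIndex < newCommit {
		newCommit = lastNewIndex // never commit past what we actually hold
	}
	if newCommit > f.commitIndex {
		f.commitIndex = newCommit
	}
	// Apply newly committed entries strictly in log order.
	for f.lastApplied < f.commitIndex {
		f.lastApplied++
		f.applied = append(f.applied, f.log[f.lastApplied].Command)
	}
}
```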

Should the leader wait for replies to AppendEntries RPCs?

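The leader sends the AppendEntries RPCs concurrently and does not wait for one follower before contacting the next. As replies arrive, it only needs to count the successes to decide whether the entry can be committed. In Go, the structure looks roughly like this: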

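  count = 1   // the leader already holds the entry itself
  for each follower {
    go func() {
      send the AppendEntries RPC and wait for the reply
      if reply.success == true {
        lock      // count is shared by all the goroutines
        increment count
        if count == nservers/2 + 1 {
          this entry is committed
        }
        unlock
      }
    } ()
  }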
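
A runnable version of the same idea might look like the sketch below. It is only an illustration under stated assumptions: `replicate` and its `send` callback are hypothetical stand-ins for the real AppendEntries RPC, the leader counts itself as one copy, and the final `wg.Wait()` exists only so the function can return a result; a real leader would advance its commit index from inside the goroutines instead of blocking.

```go
package main

import "sync"

// replicate sends one entry to every follower in parallel and reports
// whether a majority of the cluster stored it. send stands in for the
// real AppendEntries RPC and returns true on success.
func replicate(followers []int, send func(peer int) bool) bool {
	nservers := len(followers) + 1 // followers plus the leader itself
	majority := nservers/2 + 1
	count := 1                     // the leader already holds the entry
	committed := count >= majority // true only for a one-server cluster

	var mu sync.Mutex
	var wg sync.WaitGroup
	for _, peer := range followers {
		wg.Add(1)
		go func(peer int) {
			defer wg.Done()
			ok := send(peer) // blocks only this goroutine
			mu.Lock()
			defer mu.Unlock()
			if ok {
				count++
				if count == majority {
					committed = true // this entry is committed
				}
			}
		}(peer)
	}
	wg.Wait() // for the sketch only; see the note above
	return committed
}
```

The mutex matters: without it, several goroutines could update `count` at the same time and the majority check could be missed or triggered incorrectly.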

What happens if a half (or more) of the servers die?

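The service stops making progress, and the surviving servers keep retrying leader election until enough servers come back to form a majority.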

Why is the Raft log 1-indexed?

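The Raft log starts at index 1 so that index 0 is left as an empty placeholder. When the very first entry is appended, prevLogIndex can simply be set to 0, which avoids a number of boundary checks.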
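
One way to realize this is to store a dummy entry with term 0 in slot 0, as in the sketch below. The names `LogEntry`, `newLog`, and `matchesPrev` are assumptions made for the example, and the check shown is a simplified form of the AppendEntries consistency check.

```go
package main

// LogEntry is a minimal log entry for this sketch.
type LogEntry struct {
	Term    int
	Command string
}

// newLog returns a log whose slot 0 holds a placeholder with term 0, so
// the first real entry lives at index 1.
func newLog() []LogEntry {
	return []LogEntry{{Term: 0}}
}

// matchesPrev is the consistency check run before appending entries.
// Because index 0 always exists with term 0, a leader sending its first
// entry with prevLogIndex = 0 and prevLogTerm = 0 needs no special case.
func matchesPrev(log []LogEntry, prevLogIndex, prevLogTerm int) bool {
	return prevLogIndex < len(log) && log[prevLogIndex].Term == prevLogTerm
}
```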

When network partition happens, wouldn’t client requests in minority partitions be lost?

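They cannot be committed there: only the partition that contains a majority of the servers can commit log entries. Servers in a minority partition never reply to the client, so the client keeps retrying until it reaches the majority partition.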

When raft receives a read request does it still commit a no-op?

No. The leader commits a no-op only at the start of its term, not for each read request.

The paper states that no log writes are required on a read, but then immediately goes on to introduce committing a no-op as a technique to find out which entries are committed. Is this a contradiction or are no-ops not considered log ‘writes’?

It is not a contradiction. A no-op log entry is created only at the start of each term; ordinary reads do not create one.

How does using the heartbeat mechanism to provide leases (for read-only) operations work, and why does this require timing for safety (e.g. bounded clock skew)?

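The paper does not describe this part in detail. One plausible mechanism is that the leader, when it sends AppendEntries RPCs, tells the other servers not to allow any other leader for, say, the next 100 ms. If the leader receives positive replies from a majority of the servers, it can serve reads during the next 100 ms. This is only safe if the clock error among the servers is bounded.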
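
The lease idea can be sketched as below. Everything here is speculative, in the same spirit as the answer above: the `LeaderLease` type, the 100 ms lease length, and the 10 ms skew bound are assumptions for illustration, and times are plain millisecond counters on the leader's local clock rather than real timestamps.

```go
package main

const (
	leaseMs   = 100 // lease length, taken from the example above
	maxSkewMs = 10  // assumed bound on clock error; purely illustrative
)

// LeaderLease tracks until when the leader may serve reads locally.
type LeaderLease struct {
	expiresAtMs int64 // leader-local clock, in milliseconds
}

// renew is called after a heartbeat round. sentAtMs is when the round was
// sent; acks counts positive replies plus the leader itself.
func (l *LeaderLease) renew(sentAtMs int64, acks, clusterSize int) {
	if acks < clusterSize/2+1 {
		return // no majority, so no lease
	}
	// Measure from the send time and subtract the skew bound, so the
	// leader's view of the lease ends before any follower's does.
	if exp := sentAtMs + leaseMs - maxSkewMs; exp > l.expiresAtMs {
		l.expiresAtMs = exp
	}
}

// canServeRead reports whether a read at local time nowMs falls inside
// the current lease.
func (l *LeaderLease) canServeRead(nowMs int64) bool {
	return nowMs < l.expiresAtMs
}
```

If clocks could drift by more than the assumed bound, a follower might consider the lease expired and elect a new leader while the old one still serves reads, which is why the scheme depends on bounded clock skew.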

What exactly do the C_old and C_new variables in Section 6 (and Figure 11) represent? Are they the leader in each configuration?

No. They represent the sets of servers in the old configuration and in the new configuration, respectively.
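
Treating the two configurations as sets makes the joint-consensus rule from Section 6 of the paper easy to state: while the combined configuration is in effect, agreement needs a separate majority of C_old and of C_new. The function names and the use of integer server IDs in this sketch are assumptions for illustration.

```go
package main

// hasMajority reports whether the servers marked true in acks form a
// majority of config.
func hasMajority(config []int, acks map[int]bool) bool {
	n := 0
	for _, id := range config {
		if acks[id] {
			n++
		}
	}
	return n >= len(config)/2+1
}

// jointAgreement reports whether acks satisfies the joint-consensus rule:
// separate majorities of both the old and the new configuration.
func jointAgreement(cOld, cNew []int, acks map[int]bool) bool {
	return hasMajority(cOld, acks) && hasMajority(cNew, acks)
}
```

With C_old = {1, 2, 3} and C_new = {3, 4, 5}, acknowledgements from servers 1 and 2 alone are not enough, while acknowledgements from 1, 3, and 4 are.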

Just to be clear, the process of having new members join as non-voting entities isn’t to speed up the process of replicating the log, but rather to influence the election process? How does this increase availability? These servers that need to catch up are not going to be available regardless, right?

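The purpose of adding non-voting servers is to speed up the configuration change. Otherwise, no new log entry could be committed until a large amount of log had been replicated to the new servers.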

Under what circumstances would a follower receive a snapshot that is a prefix of its own log?

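The network may deliver messages, and the RPC layer may execute them, in a different order from the one in which they were sent. For example, if the leader sends a snapshot covering the log through index 100 and then a later one through index 110, the second may arrive first; the first snapshot then describes only a prefix of what the follower already has.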
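
How a follower might handle such a snapshot is sketched below. The rule that entries after the snapshot are retained when the entry at the snapshot's last index matches comes from the InstallSnapshot description in the Raft paper; the `FollowerLog` layout, which stores only entry terms, and the method names are simplifications invented for this example.

```go
package main

// FollowerLog stores only the term of each entry to keep the sketch short.
type FollowerLog struct {
	snapshotIndex int   // last log index covered by the installed snapshot
	terms         []int // terms[i] is the term of log index snapshotIndex+1+i
}

// lastIndex returns the index of the last entry in the log.
func (l *FollowerLog) lastIndex() int { return l.snapshotIndex + len(l.terms) }

// installSnapshot applies a snapshot covering the log through snapIndex,
// whose last included entry has term snapTerm. It returns false if the
// snapshot is not newer than the one already installed.
func (l *FollowerLog) installSnapshot(snapIndex, snapTerm int) bool {
	if snapIndex <= l.snapshotIndex {
		return false // stale or duplicate snapshot: ignore it
	}
	pos := snapIndex - l.snapshotIndex - 1 // slot of log index snapIndex
	if snapIndex <= l.lastIndex() && l.terms[pos] == snapTerm {
		// The snapshot is a prefix of the log: drop the covered entries
		// and keep everything after them.
		l.terms = append([]int(nil), l.terms[pos+1:]...)
	} else {
		l.terms = nil // the log conflicts or is too short: discard it
	}
	l.snapshotIndex = snapIndex
	return true
}
```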

Is InstallSnapshot atomic? If a server crashes after partially installing a snapshot, and the leader re-sends the InstallSnapshot RPC, is this idempotent like RequestVote and AppendEntries RPCs?

The implementation of InstallSnapshot must be atomic. Re-sending the snapshot is idempotent as well, so the leader can safely retry.

What data compression schemes, such as VIZ, ZIP, Huffman encoding, etc. are most efficient for Raft snapshotting?

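That depends on what data the snapshot holds and on how it is implemented. If the state consists of images, for example, JPEG compression might be a reasonable choice.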
