Raft - learning from MIT 6.824
- Course info at https://pdos.csail.mit.edu/6.824/labs/lab-raft.html
- Paper at https://raft.github.io/raft.pdf
- A very good YouTube video
Raft Core Feature
If committed --meaning--> Present in the future leader's logs, and this is achieved by:
- Restrictions on leader election (See voting rules)
- Restrictions on commit
Leader Election
We firstly look at Leader Election
mechanism - which is implemented by two parts:
RequestVote
- each node start asFollower
and whenelection timeout
passes and no HeartBeat received before, then a node step up asCandidate
and send votes to rest of the nodes and if collect more than half (count itself as well), it steps up as aLeader
, which keeps sendingHeartBeat
to others.HeartBeat - a.k.a Empty AppendEntries RPC
Besides, the tips on the paper and the lab2a websites, I encountered a few other pitfalls (which might be mentioned somewhere in to docs)
Candidates
- In order to make the tests can pass with limited running time, I have to make sure when a candidate sends vote requests out and wait, it should stop waiting as long as it already got enough majority votes.
- The same applies when a leader is waiting for heatbeats. Don't wait after quorum votes have been fetched.
- Each places when a candidate/leader step down to a follower,
remember to set
votedFor
back tonil
or-1
so to allow it gives a valid vote in the next round of voting process. - Not like Follower, Candidates can probably send out the next round of vote
with a shorter and fixed Duration than
a Follower's
election timeout duration
, in the case when this round of vote does not get enough votes; But at the same time
Getting vote request as Candidate
- It seems not necessary, but I also added the logic (seem to be safer): If myself is a Candidate and the other node (candidate) want me to vote to her, if we have the same term value, then I don't vote to her (so as she won't vote for me neither);
Standard Voting Rules when got Voting Request:
- If request term lass than my term, ignore. Reply my term.
- So when request term >= my term,
- From
Rules for Servers
: If RPC request or response contains term T > currentTerm: set currentTerm = T, convert to follower - The candidate situation mentioned above
- if
notYetVoted
orvotedTheSameBefore
, then do the vote, setvotedFor
and reset myelection timeout
; It is important to havevotedFor
set and checked here as it helps in the case where there is multiple candidates alive, then no multiple leaders are selected, as we make sure each voter can only vote to one candidate.
- From
- Restrictions on leader election
- When voting, include
lastLogIndex
andlastLogTerm
- Voting server
v
denies vote if its log is "more complete":if (lastTerm_v > lastTerm_c) || (lastTerm_v == lastTerm_c) && (lastIndex_v > lastIndex_c) return false
- When voting, include
HeartBeat - a.k.a Empty AppendEntries RPC
After the Leader is elected, the electionTimeoutTime
or election timeout
should then be
constantly refreshed with each HeartBeat
- which is achieved by sending empty
AppendEntries
RPC calls.
The basic rules are like:
- From
Rules for Servers
: If RPC request or response contains term T > currentTerm: set currentTerm = T, convert to follower
The code could be something like this
func (rf *Raft) HeartBeat(args *HeartBeatRequest, reply *HeartBeatReply) {
rf.mu.Lock()
defer rf.mu.Unlock()
if args.Term >= rf.term {
rf.state = Follower // step down
rf.votedFor = args.From
rf.term = args.Term
rf.electionTimeoutTime = time.Now().Add(rf.getElectionTimeoutDuration())
reply.Good = true
reply.Term = rf.term
return
} else {
reply.Good = false
reply.Term = rf.term
return
}
}
AppendEntries RPC
Tricky part summary
Leader ticker and Candidate ticker
The candidate ticker can be done with just a single thread loop
and it is okay to set up a timeout as well. For the critical
conditions of TestFigure8Unreliable2C
, it will mostly work
well as long as we make sure timeout is relatively large so
we can always have a candidate selected after a few rounds.
But for the leader ticker we cannot use single thread call with a long timeout anymore as if we do that, the timeout will be a time much longer than the election timeout of other followers then it ended up constantly new candidate showing up.
However, if we let leader ticker async, for example, start a goroutine and send a HeartBeat call every 100ms, then the other subtle issue is that we have to make sure, when those goroutines finishes in different time and order, we can still handle things correctly.
A few places to worth notice and mention:
- For each leader ticker goroutine, after getting reply from any followers, do a check of the leader's states and term as the leader might have already been converted to a follower by other leader
- when leader to update
nextIndex
for a follower, check it can only be larger thanmatchedIndex
in case the HeartBeats to the same follower replied out of order.
Follower Log correction Speed-Up This is where the paper didn't say much, but basically there is two cases we need to speed-up follower log.
- During follower replying
HeartBeat
, when follower'sLastLogTerm
is greater than leader'srf.log[nextIndex-1].Term
(orargs.PrevLogTerm
), so we rollnextIndex
to the smaller side until it reaches an entry with the same term as leader'sPrevLogTerm
, by cutting follower's log and replying to leader with a smallerLastLogIndex
;
Also, when PrevLogIndex/Term matches, remove all the following entries that starts with a miss match of any entry term from leader's incoming entriesLeader: [0, 1, 1, 2, 2, 2, 4, 4,] Entries: ^ [2, 4, 4,] Follower: [0, 1, 1, 3, 3, ] then we should cut it to: [0, 1, 1, ] <-----------------------| ^-------new `nextIndex` found by `LastLogIndex`
Leader: [0, 1, 1, 2, 2, 2, 4, 4,] Entries: ^ [2, 4, 4,] Follower: [0, 1, 1, 2, 2, 2, 3, 3, 3, 3] then we should cut it to: [0, 1, 1, 2, 2, 2, 4, 4,] <-----------------------|
- During leader's handling of a follower's reply of
HeartBeat
, where the follower'sLastLogTerm
is less than leader'srf.log[nextIndex-1].Term
, so we rollnextIndex
to the smaller side until it reaches an entry with the same term as follower'sLastLogTerm
.Leader: [0, 1, 2, 2, 2, 2, 4, 4,] Entries: ^ [2, 4, 4,] Follower: [0, 1, 1, 1, 1, 1, 1, 1, ] then we should cut it to: [0, 1, 1, 1, ] <-----------------------------| ^--------------- and also send back `LastLogTerm` as 1 so to let leader set `nextIndex` to the place just after the last entry of term = LastLogTerm ^----------------- new `nextIndex` found by `LastLogTerm`
Async commit and updating rf.commitIndex
The actual apply of those logs and update of rf.commitIndex
can be done async with a single goroutine.
Important fix for TestFigure8Unreliable2C
's commits not sync issue here - as we added some optimization on
- Leader only sends part of new entries when there are more than a max number (500 in this case) of new entries that are not on the follower
- When a Follower is handling the heartbeat, it does not touch the log if the incoming entries have been existing/duplicated, leaving some entries after the incoming entry part untouched/uncut as well. However, the entries in the latter part might not be valid.
So with those optimizations mentioned above, at the place when we update the follower's rf.commitIndex
, we cannot go as far as the leader's commitIndex
, but only goes to the end index of the incoming replicating entry parts - which is args.PrevLogIndex+len(args.Entries)
Or see some comments on this Lab2C PR