Raft Consensus Algorithm Implementation with Go

Yunus Kılıç
Apr 13, 2021 · 8 min read

“A fundamental problem in distributed computing and multi-agent systems is to achieve overall system reliability in the presence of a number of faulty processes. This often requires coordinating processes to reach consensus, or agree on some data value that is needed during computation. Example applications of consensus include agreeing on what transactions to commit to a database in which order, state machine replication, and atomic broadcasts”. [1]

As you can see from the definition, consensus is a central part of distributed applications. In this article, I will describe how to implement the Raft consensus algorithm with Go, and I will try to explain the reasoning and give hints about the implementation and its logic. To do that, I will follow the MIT 6.824 course's lab and pass all of its tests. [2]

RAFT

“Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to (multi-)Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems.”[3]

If you need some background about Raft, this visualization will be good for you.[4] Also, you can read its paper.[3]

Paxos was used for a long time, but Raft offers the same efficiency with better understandability. I agree with the authors. Still, the implementation has corner cases and difficulties. There is a figure called Figure 2 in the paper[3] that defines every detail of a Raft implementation. Every word in it matters; if you do not follow these statements precisely, your code will fail easily.


Figure 2 on paper[3]

Let's implement Raft according to these specifications, with detailed explanations where needed. Raft divides the problem into three parts: Leader Election, Log Replication, and Safety. After a leader is elected, log replication takes place.

Basic Structure And Leader Election

As you can see in the top-left part of the figure, the state object has several attributes. Some of them are persistent, others are volatile. We need to define these fields to construct our object.

Let's look at the fields in detail.

  • currentTerm, votedFor, log, commitIndex, lastApplied, nextIndex, and matchIndex are defined and described in Figure 2 of the paper. I want to add one note about the log: the figure says the log contains commands, but as the paper explains, you also need to know each command's term, so I defined a struct called CommandTerm that holds both.
  • There are extra fields that are not stated in the figure; let's explain them.

Status

As you can see from the image below, the Raft algorithm is built around a set of states and the transitions between them.

State transition figure[3]

ApplyMsg

Committed entries are written to this channel so that the service using Raft becomes aware of them.

LastLogIndex

Keeps the index of the last entry in the log.

LastAccessed

Keeps the last time this Raft peer was contacted by the leader. Later, we will implement a heartbeat mechanism that uses it to avoid unnecessary elections.
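Putting these fields together, a minimal sketch of the state could look like the following. This is my reconstruction, not the exact code from the repo; ApplyMsg, labrpc.ClientEnd, and Persister come from the 6.824 lab skeleton, and the Status type and extra field names follow the descriptions above.

type CommandTerm struct {
    Command interface{} // the client command
    Term    int         // term in which the leader received the command
}

type Status int

const (
    Follower Status = iota
    Candidate
    Leader
)

type Raft struct {
    mu        sync.Mutex
    peers     []*labrpc.ClientEnd // RPC endpoints of all peers (lab skeleton)
    persister *Persister          // holds this peer's persisted state (lab skeleton)
    me        int                 // this peer's index into peers[]

    // Persistent state (Figure 2)
    currentTerm int
    votedFor    int
    log         []CommandTerm

    // Volatile state (Figure 2)
    commitIndex int
    lastApplied int
    nextIndex   []int // leader only
    matchIndex  []int // leader only

    // Extra fields described above
    status       Status
    applyCh      chan ApplyMsg // committed entries are sent here
    lastLogIndex int           // index of the last log entry
    lastAccessed time.Time     // last time we heard from the leader
}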

Now, the next step is understanding the state transition management code. To handle it, each Raft server starts a goroutine called manageLifecycle when it is created.

In the beginning, each server is a follower. Raft uses randomized timers to elect leaders. After a random duration, a follower tries to become the leader by turning into a candidate. If a heartbeat arrives while the follower is waiting, the timer is reset. Otherwise, the server increases its current term and changes its state to candidate.
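A minimal sketch of this logic inside the manageLifecycle goroutine could look like the code below. The timeout range and the helper names runElection and broadcastAppendEntries are my assumptions; rf.killed() comes from the lab skeleton, and the code in the repo may be organized differently.

func (rf *Raft) manageLifecycle() {
    for !rf.killed() {
        rf.mu.Lock()
        status := rf.status
        rf.mu.Unlock()

        switch status {
        case Follower:
            // Randomized election timeout, e.g. 300-600 ms.
            timeout := time.Duration(300+rand.Intn(300)) * time.Millisecond
            start := time.Now()
            time.Sleep(timeout)

            rf.mu.Lock()
            // No heartbeat arrived while we slept: become a candidate.
            if rf.lastAccessed.Before(start) {
                rf.status = Candidate
                rf.currentTerm++
                rf.votedFor = rf.me
                rf.persist()
            }
            rf.mu.Unlock()
        case Candidate:
            rf.runElection()
        case Leader:
            rf.broadcastAppendEntries()
            time.Sleep(100 * time.Millisecond) // heartbeat interval
        }
    }
}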

After becoming a candidate, it starts an election.

If a candidate collects a majority of the votes before the timeout expires, it is elected as the leader. For that we use the RequestVote RPC.

As stated in Figure 2:

Arguments:
    term: candidate's term
    candidateId: candidate requesting vote
    lastLogIndex: index of candidate's last log entry (§5.4)
    lastLogTerm: term of candidate's last log entry (§5.4)
Results:
    term: currentTerm, for candidate to update itself
    voteGranted: true means candidate received vote
Receiver implementation:
    1. Reply false if term < currentTerm (§5.1)
    2. If votedFor is null or candidateId, and candidate's log is at least as up-to-date as receiver's log, grant vote (§5.2, §5.4)
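Translated into Go, the RPC argument and reply types could look roughly like this (a sketch; the exact field names in the repo may differ):

type RequestVoteArgs struct {
    Term         int // candidate's term
    CandidateId  int // candidate requesting vote
    LastLogIndex int // index of candidate's last log entry
    LastLogTerm  int // term of candidate's last log entry
}

type RequestVoteReply struct {
    Term        int  // currentTerm, for candidate to update itself
    VoteGranted bool // true means the candidate received the vote
}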

If a candidate cannot collect a majority of the votes, it becomes a follower again after the timeout.

If a follower replies with a term that is bigger than the candidate's term, the election also fails and the candidate becomes a follower. The candidate updates its term before stepping down.

If a candidate wins the election, it becomes the leader and sets the nextIndex of every server to its last log index + 1.
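Here is a hedged sketch of the candidate side, assuming the types above, the runElection name from the earlier sketch, a placeholder log entry at index 0, and the sendRequestVote stub from the lab skeleton; the real code in the repo may differ.

func (rf *Raft) runElection() {
    rf.mu.Lock()
    args := RequestVoteArgs{
        Term:         rf.currentTerm,
        CandidateId:  rf.me,
        LastLogIndex: rf.lastLogIndex,
        LastLogTerm:  rf.log[rf.lastLogIndex].Term,
    }
    rf.mu.Unlock()

    votes := 1 // we vote for ourselves
    for peer := range rf.peers {
        if peer == rf.me {
            continue
        }
        go func(peer int) {
            var reply RequestVoteReply
            if !rf.sendRequestVote(peer, &args, &reply) {
                return // RPC failed; the next election timeout will retry
            }
            rf.mu.Lock()
            defer rf.mu.Unlock()

            // A bigger term means the election failed: update our term and step down.
            if reply.Term > rf.currentTerm {
                rf.currentTerm = reply.Term
                rf.votedFor = -1
                rf.status = Follower
                rf.persist()
                return
            }
            if reply.VoteGranted && rf.status == Candidate && rf.currentTerm == args.Term {
                votes++
                if votes > len(rf.peers)/2 {
                    // Won the election: become leader and reset nextIndex/matchIndex.
                    rf.status = Leader
                    for i := range rf.peers {
                        rf.nextIndex[i] = rf.lastLogIndex + 1
                        rf.matchIndex[i] = 0
                    }
                }
            }
        }(peer)
    }
}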

Let's look at the RequestVote RPC implementation.

Line 9: If the follower's term is bigger than the candidate's, it returns false together with the follower's term.

Line 14: If the argument's term is bigger, the follower updates its currentTerm and votedFor, because the follower will only return true if votedFor is null or equal to candidateId. When a bigger term arrives, the follower updates its term and clears votedFor.

Line 19: implements the requirement that the candidate's log must be at least as up-to-date as the receiver's log to grant a vote. It checks whether the follower's log is ahead of the candidate's log; if it is, the follower returns false to the candidate.
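The line numbers above refer to the code in the repo. As a rough, hedged reconstruction of the same checks, the handler could look like this:

func (rf *Raft) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
    rf.mu.Lock()
    defer rf.mu.Unlock()

    // Reply false if the candidate's term is older than ours.
    if args.Term < rf.currentTerm {
        reply.Term = rf.currentTerm
        reply.VoteGranted = false
        return
    }

    // A newer term: update currentTerm and forget any previous vote.
    if args.Term > rf.currentTerm {
        rf.currentTerm = args.Term
        rf.votedFor = -1
        rf.status = Follower
        rf.persist()
    }

    // Grant the vote only if we have not voted for someone else and the
    // candidate's log is at least as up-to-date as ours.
    lastTerm := rf.log[rf.lastLogIndex].Term
    upToDate := args.LastLogTerm > lastTerm ||
        (args.LastLogTerm == lastTerm && args.LastLogIndex >= rf.lastLogIndex)

    reply.Term = rf.currentTerm
    if (rf.votedFor == -1 || rf.votedFor == args.CandidateId) && upToDate {
        rf.votedFor = args.CandidateId
        rf.persist()
        reply.VoteGranted = true
        return
    }
    reply.VoteGranted = false
}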

Let's assume the candidate is elected. The next step is the leader state.

The leader has multiple responsibilities: accepting commands from clients, sending commands to its peers, committing commands, and sending heartbeats to the peers. So the most complex part of the code is the leader state.

  • Upon election: send initial empty AppendEntries RPCs (heartbeat) to each server; repeat during idle periods to prevent election timeouts. To handle this, the leader sends AppendEntries without any entries. When a follower accepts a heartbeat, it resets its election timer.
  • If command received from client: append entry to local log, respond after entry applied to state machine. Inside the Start method, the command is appended to the leader's log.
  • If last log index ≥ nextIndex for a follower: send AppendEntries RPC with log entries starting at nextIndex

If successful: update nextIndex and matchIndex for follower (§5.3)

If AppendEntries fails because of log inconsistency: decrement nextIndex and retry.

Line 55 of RaftLeader.go handles this job.

if nextIndex[peer] <= lastLogIndex {
    args.Entry = rf.log[prevLogIndex+1 : lastLogIndex+1]
}

If the leader has multiple entries to send to a follower, AppendEntriesArgs carries all of them.
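For reference, the AppendEntries argument and reply types could be sketched like this; the XTerm, XIndex, and XLen fields are for the fast rollback discussed below, and the exact field names are my assumption:

type AppendEntriesArgs struct {
    Term         int           // leader's term
    LeaderId     int
    PrevLogIndex int           // index of the entry immediately preceding the new ones
    PrevLogTerm  int           // term of the entry at PrevLogIndex
    Entry        []CommandTerm // entries to store (empty for a heartbeat)
    LeaderCommit int           // leader's commitIndex
}

type AppendEntriesReply struct {
    Term    int
    Success bool

    // Fast-rollback hints for the leader.
    XTerm  int // term of the conflicting entry, if any
    XIndex int // index of the first entry with XTerm, if any
    XLen   int // length of the follower's log
}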

If successful:

rf.nextIndex[peer] = min(rf.nextIndex[peer]+len(args.Entry), rf.lastLogIndex+1)
rf.matchIndex[peer] = prevLogIndex + len(args.Entry)

nextIndex and matchIndex are then increased. In some corner cases (delayed or duplicated replies, for example) the computed value could overshoot, so I cap nextIndex at the leader's lastLogIndex+1.
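min here is a plain integer helper, since the Go versions used for this lab predate the built-in min added in Go 1.21:

func min(a, b int) int {
    if a < b {
        return a
    }
    return b
}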

If fails:

The paper says to decrement nextIndex by one. But this lab has test cases that check how fast the system recovers, so to fix log inconsistencies faster the course instructor suggests the following method.[6]

"the Figure 2 design backs up one entry per RPC -- slow!
lab tester may require faster roll-back
paper outlines a scheme towards end of Section 5.3
no details; here's my guess; better schemes are possible
      Case 1         Case 2         Case 3
S1:   4 5 5          4 4 4          4
S2:   4 6 6 6    or  4 6 6 6    or  4 6 6 6
S2 is leader for term 6, S1 comes back to life, S2 sends AE for last 6
AE has prevLogTerm=6
rejection from S1 includes:
XTerm: term in the conflicting entry (if any)
XIndex: index of first entry with that term (if any)
XLen: log length
Case 1 (leader doesn't have XTerm):
nextIndex = XIndex
Case 2 (leader has XTerm):
nextIndex = leader's last entry for XTerm
Case 3 (follower's log is too short):
nextIndex = XLen"

Line 76 of RaftLeader.go and the lines after it implement this faster rollback. We will look at the AppendEntries side very soon.
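Here is a sketch of what that leader-side handling might look like, following the three cases from the notes (using -1 as an "absent" marker for XTerm is my assumption, not necessarily what the repo does):

if !reply.Success {
    if reply.XTerm == -1 {
        // Case 3: the follower's log is too short.
        rf.nextIndex[peer] = reply.XLen
    } else {
        // Find the leader's last entry with term XTerm.
        lastForXTerm := -1
        for i := rf.lastLogIndex; i > 0; i-- {
            if rf.log[i].Term == reply.XTerm {
                lastForXTerm = i
                break
            }
        }
        if lastForXTerm == -1 {
            // Case 1: the leader doesn't have XTerm at all.
            rf.nextIndex[peer] = reply.XIndex
        } else {
            // Case 2: the leader has XTerm; back up to its last entry for that term.
            rf.nextIndex[peer] = lastForXTerm
        }
    }
}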

  • If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm: set commitIndex = N

Line 16 of RaftLeader.go and the lines below it commit agreed entries. It is important that an entry can only be committed if its term equals the leader's current term.
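A sketch of that commit rule, counting how many peers have replicated index N and requiring the entry's term to match currentTerm:

for n := rf.commitIndex + 1; n <= rf.lastLogIndex; n++ {
    // Only entries from the leader's current term are committed by counting.
    if rf.log[n].Term != rf.currentTerm {
        continue
    }
    count := 1 // the leader itself has the entry
    for peer := range rf.peers {
        if peer != rf.me && rf.matchIndex[peer] >= n {
            count++
        }
    }
    if count > len(rf.peers)/2 {
        rf.commitIndex = n
    }
}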

Now let’s look at another critical method: AppendEntries RPC.

1. Reply false if term < currentTerm (§5.1)
2. Reply false if log doesn't contain an entry at prevLogIndex whose term matches prevLogTerm (§5.3)
3. If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it (§5.3)
4. Append any new entries not already in the log
5. If leaderCommit > commitIndex, set commitIndex = min(leaderCommit, index of last new entry)

XTerm, XIndex, and XLen are used for the faster rollback described above.

Line 10 is for rule 1.

Line 14 is for rule 2.

Line 45 is for rule 4.

Line 50 is for rule 5.
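Before getting to rule 3, here is a hedged sketch of how rules 1 and 2 can be handled together with filling in the fast-rollback fields (again, using -1 as the "absent" marker is my assumption):

rf.mu.Lock()
defer rf.mu.Unlock()

reply.Term = rf.currentTerm
reply.Success = false
reply.XTerm, reply.XIndex, reply.XLen = -1, -1, len(rf.log)

// Rule 1: reject a leader with a stale term.
if args.Term < rf.currentTerm {
    return
}

// Rule 2: there must be an entry at PrevLogIndex whose term matches PrevLogTerm.
if args.PrevLogIndex > rf.lastLogIndex {
    // The follower's log is too short (Case 3 above).
    return
}
if rf.log[args.PrevLogIndex].Term != args.PrevLogTerm {
    // Conflict: report the conflicting term and where that term starts.
    reply.XTerm = rf.log[args.PrevLogIndex].Term
    i := args.PrevLogIndex
    for i > 0 && rf.log[i-1].Term == reply.XTerm {
        i--
    }
    reply.XIndex = i
    return
}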

Rule 3 is a tricky one. It caused the TestFigure8Unreliable2C test to fail for me. In the beginning, I tried to truncate the log like below:

rf.log = rf.log[:args.PrevLogIndex+1]

But as stated in the Raft Q&A[5], this approach is wrong.

3. If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it

We only need to delete the conflicting entries. So I changed my code as shown below:

index := 0
for ; index < len(args.Entry); index++ {
    currentIndex := args.PrevLogIndex + 1 + index
    // The follower's log ends here; the remaining entries are simply new.
    if currentIndex > len(rf.log)-1 {
        break
    }
    // Conflict: same index but different terms. Delete this entry and everything after it.
    if rf.log[currentIndex].Term != args.Entry[index].Term {
        rf.log = rf.log[:currentIndex]
        rf.lastLogIndex = len(rf.log) - 1
        rf.persist()
        break
    }
}

reply.Success = true
rf.lastAccessed = time.Now()
// Append any new entries not already in the log (rule 4).
if len(args.Entry) > 0 {
    rf.log = append(rf.log, args.Entry[index:]...)
    rf.lastLogIndex = len(rf.log) - 1
    rf.persist()
}

We truncate the log only from the first conflicting entry onward, and then append the remaining part of the entries.
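Finally, rule 5 updates the follower's commitIndex. As a sketch (in practice the delivery to applyCh is usually done outside the lock, for example in a dedicated applier goroutine):

// Rule 5: advance commitIndex based on the leader's commit index.
if args.LeaderCommit > rf.commitIndex {
    lastNewEntry := args.PrevLogIndex + len(args.Entry) // index of last new entry
    rf.commitIndex = min(args.LeaderCommit, lastNewEntry)
}

// Newly committed entries are then delivered to the service through applyCh:
for rf.lastApplied < rf.commitIndex {
    rf.lastApplied++
    rf.applyCh <- ApplyMsg{
        CommandValid: true,
        Command:      rf.log[rf.lastApplied].Command,
        CommandIndex: rf.lastApplied,
    }
}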

With this refactoring, I am able to pass all test cases in a reasonable amount of time.

PASS
ok github.com/yunuskilicdev/distributedsystems/src/raft 209.609s

You can find the whole solution in my GitHub repo, github.com/yunuskilicdev/distributedsystems (the /src/raft folder).

This Raft implementation was very helpful for me, because I learned how distributed systems are implemented, how to fix bugs in them, and how easily they can fail. Thanks to the lab's structure, there are lots of test cases covering corner cases and different situations.

In the next article, I will build a fault-tolerant key/value storage service using my Raft library.

Resources:

[1] https://en.wikipedia.org/wiki/Consensus_(computer_science)

[2] http://nil.csail.mit.edu/6.824/2020/labs/lab-raft.html

[3] http://nil.csail.mit.edu/6.824/2020/papers/raft-extended.pdf

[4] http://thesecretlivesofdata.com/raft/

[5] https://thesquareplanet.com/blog/raft-qa/

[6] http://nil.csail.mit.edu/6.824/2020/notes/l-raft2.txt


