Why is distributed consensus so hard?

Second in a series describing Howard & Mortier’s generalization of distributed consensus, expressed in PlusCal. The first entry introduced the source papers.

Updated May 27, 2020 to incorporate points from Howard and Mortier (2020) and Santos and Schiper (2013).

I begin this series on distributed consensus by stepping back from the details, considering why the problem seems so hard and specifically why literature on the problem is so difficult to read. This ultimately results from the breadth of issues that must be addressed by any solution. “Distributed consensus” isn’t a single problem so much as a class of problems, with related but distinct solutions. Before describing the specific focus of Howard and Mortier’s paper, I want to set the broader context. This context will help readers who want to apply these results to different forms of consensus.

So why is distributed consensus so hard, not just to solve, but even to describe? It turns out there are many reasons.

1. You can’t get what you really want

The Fischer, Lynch, and Paterson (1985) paper presents a fundamental limitation on distributed consensus algorithms: in an asynchronous system where even a single process may crash, no deterministic algorithm can be guaranteed to complete. In consequence, the algorithms can only make a less powerful guarantee: although processes are not guaranteed to complete, any that do will agree on the same value. In the language of concurrent programming, the algorithms guarantee safety but not liveness. Careful reliability engineering can make the probability of completion high, but never 100%.
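
To make the safety/liveness distinction concrete, here is a minimal Python sketch (my own illustration, not drawn from any of the papers): of the processes that complete, all must agree, while some processes may never complete at all.

```python
# Illustrative sketch only: the safety property of consensus expressed as a
# check over the values each process decided. 'None' marks a process that
# never completed, which is permitted because liveness is not guaranteed.

def satisfies_safety(decided_values):
    """True if every process that decided agrees on the same value."""
    completed = [v for v in decided_values if v is not None]
    return len(set(completed)) <= 1

# Safe: some processes never decide, but all that did agree.
assert satisfies_safety(["x", None, "x", None])
# Unsafe: two processes decided different values.
assert not satisfies_safety(["x", "y", None])
```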

2. The scope matters

Writers assume different scopes for their articles. Some only consider the consensus-making processes. Others include the processes that consume the consensus results. Still others distinguish between processes that request a consensus and those that consume the results. Some versions include the option of backup processes whose only role is to observe and track the decisions but not participate in making them.

3. There are many ways of defining roles

Paxos is typically described in terms of a collection of concurrent, replicated roles. Each role executes a specific algorithm, typically comprising an infinite loop that consumes messages from the other roles and sends them replies.
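
As a rough illustration of that structure (the message fields and handler signatures below are hypothetical, not taken from any particular paper), a role reduces to a loop over an inbox:

```python
import queue

def run_role(inbox: "queue.Queue", handlers: dict, send) -> None:
    """Generic consensus role: loop forever, consuming messages and replying.

    `handlers` maps a message type to a function returning zero or more
    (destination, reply) pairs; `send` delivers a reply to another role.
    """
    while True:
        sender, msg_type, payload = inbox.get()   # block until a message arrives
        handler = handlers.get(msg_type)
        if handler is None:
            continue                              # ignore unknown message types
        for destination, reply in handler(sender, payload):
            send(destination, reply)
```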

These roles have been defined in many ways. Some particularly common roles have been assigned different names, even by the same author at different times. For example, in her dissertation Howard names the two key roles Proposer and Acceptor, while her paper with Mortier names those same roles Client and Server. Lamport’s “Paxos Made Simple” separates Howard’s Proposer (AKA Client) role into distinct Proposer and Learner roles.

The number and names of roles are also affected by the author's scope. Altınbüken and van Renesse's "Paxos Made Moderately Complex" expands the scope to include Replicas, which consume the consensus results to execute a replicated state machine algorithm.

The following table correlates the distinct names several authors have used for similar roles:

| Wikipedia | Cacogne | Altınbüken & van Renesse | Lamport (1998) | Lamport (2001) | Howard | Howard & Mortier (2019) | Howard & Mortier (2020) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Client | Client | | Citizen | | | | |
| Proposer | Suggester | Leader (including Commander and Scout) | President | Proposer | Proposer | Client | Leader |
| Acceptor | Voter | Acceptor | Priest | Acceptor | Acceptor | Server | Follower |
| Learner | Arbiter | Replica | | Learner | | | |

Many of the authors also designate a specific Proposer/Suggester/Client as the "Leader". The Leader designation simply indicates priority amongst otherwise identical processes; any member of that class may be promoted to Leader if the current one fails. Howard & Mortier's (2020) reformulation of Paxos using the Raft roles includes a Leader role of this type, as well as a Candidate role for Followers striving to become Leaders. Note, however, that only one of Altınbüken & van Renesse's Leaders is a Leader in this sense.

The correspondences above are not necessarily exact, due to the differing ways the authors separate Paxos into roles, but they are close.

4. The fundamental concepts have multiple names

Several times, the consensus literature has suffered the fate of defining perfectly good terms for basic concepts, only to have the field later select those same terms to mean something completely different. For example, Lamport’s original Paxos papers defined “instance” to mean a single consensual decision. This worked well in the 1990s but in the intervening years, “instance” has more commonly come to mean individual virtual machines (such as Amazon EC2 instances) or individual copies of replicated services. Given that consensus is now most often used in environments based on cloud technology or replicated services, this formerly straightforward nomenclature is now ambiguous.

Combining these historical redefinitions with the multiplicity of role names (Point 3 above), we get a nomenclature that is several times larger than the actual number of concepts it describes.

5. The basic algorithm leaves many key decisions to implementation

The basic consensus algorithm makes a single decision (called an "instance" in the literature), runs on a fixed set of processes, and makes no guarantee of ever completing. Actual implementations typically include extensions to:

- make a sequence of decisions rather than a single one (the "Multi-" variants, such as Multi-Paxos), as sketched below;
- reconfigure the set of participating processes as machines fail or are replaced; and
- encourage progress in practice, for example by electing and maintaining a distinguished Leader.

Variants of consensus include more or fewer of these implementation choices.
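
As a hedged sketch of the first of these extensions (my own illustration, not any specific protocol): the "Multi-" variants run the single-decision algorithm once per slot of a shared log, so the group agrees on an ordered sequence of values.

```python
class DecisionLog:
    """A sequence of consensus instances, one per log slot (illustrative only)."""

    def __init__(self):
        self.decisions = {}          # slot number -> decided value

    def record(self, slot: int, value) -> None:
        # Safety: a slot, once decided, must never change.
        assert self.decisions.get(slot, value) == value, "conflicting decision"
        self.decisions[slot] = value

    def prefix(self):
        """Return the decided values that can already be applied in order."""
        result, slot = [], 0
        while slot in self.decisions:
            result.append(self.decisions[slot])
            slot += 1
        return result

log = DecisionLog()
log.record(0, "set x=1")
log.record(1, "set y=2")
assert log.prefix() == ["set x=1", "set y=2"]
```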

6. There are several metrics of efficiency

The metrics for the efficiency of a consensus algorithm typically count the number of hops (often counted as round trips) required to complete a single decision, rather than its computation or memory requirements. Within this broad definition there remains plenty of room for variation.

For example, the scope affects the round trip count. If the scope includes the ultimate client (which I will refer to in this series as the "service client"), then the count will include the round trip from that client to the service endpoint. Including the service client has more subtle implications than simply adding an extra trip to the count of a more basic algorithm. Some algorithms, such as Lamport's Fast Paxos, are designed to allow the service client to bypass the Leader and propose values directly to the Acceptors, reducing the round trip count.

In some use cases, there are different types of round trips, with different expected latencies. For example, trips between regions, with their longer geographic distances, are typically much slower than trips between datacentres within a region, which in turn are slower than trips within a single datacentre. For high availability, services that require consensus algorithms may well use a mix of intra- and inter-datacentre processes. For these systems, a design might aim to minimize the more expensive round trips between datacentres.
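
A small worked example, with invented latency figures, shows why designs weight the two kinds of trips so differently:

```python
def path_latency_ms(intra_dc_round_trips: int, inter_dc_round_trips: int,
                    intra_ms: float = 0.5, inter_ms: float = 30.0) -> float:
    """Rough latency of one decision as a weighted sum of its round trips.

    The per-trip figures are invented placeholders; the point is only that
    inter-datacentre trips dominate, so designs try to minimize them.
    """
    return intra_dc_round_trips * intra_ms + inter_dc_round_trips * inter_ms

# Two hypothetical designs for the same decision:
same_dc_quorum = path_latency_ms(intra_dc_round_trips=2, inter_dc_round_trips=0)
cross_dc_quorum = path_latency_ms(intra_dc_round_trips=1, inter_dc_round_trips=1)
assert cross_dc_quorum > same_dc_quorum   # the inter-datacentre trip dominates
```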

Latency versus throughput

Performance analysis of consensus algorithms can emphasize latency or throughput. Latency is directly correlated with the number of hops on the path of a single request from a service client through consensus and back: more hops mean higher latency. Throughput, on the other hand, is inversely correlated with the number of messages sent between participating processes. An algorithm that requires messages be sent between every participating process will have lower throughput than one that only requires intercommunication between a fraction of them, even if both algorithms have the same number of hops in their paths and hence the same latency.
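
To make the distinction concrete with rough, invented numbers: a leader-based message pattern and an all-to-all pattern can have the same number of hops on the critical path yet very different per-decision message counts.

```python
def leader_based_messages(n: int) -> int:
    """Leader sends to n - 1 peers and each replies: 2 * (n - 1) messages."""
    return 2 * (n - 1)

def all_to_all_messages(n: int) -> int:
    """Every process messages every other process: n * (n - 1) messages."""
    return n * (n - 1)

n = 5
# Both patterns need two hops on the critical path (out and back),
# but they generate very different amounts of traffic per decision:
print(leader_based_messages(n))   # 8
print(all_to_all_messages(n))     # 20
```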

Message throughput is limited by network bandwidth and the concurrent traffic on the same links for other services. Whereas latency's limiting factor, the number of hops, is inherent to the algorithm's design, message throughput can be purchased by increasing network capacity or reducing concurrent traffic.

In organizational terms, latency and throughput are often different types of concerns, with latency being a team’s service target and network bandwidth being a cost they spend to reach that target.

Coalesced roles improve latency and throughput

Both the latency and throughput of consensus are improved by coalescing the multiple roles onto single processors. For example, every role other than “Client” in the Wikipedia column of the above table may be coalesced onto multiple threads on a processor or even (using asynchronous, nonblocking IO) a single thread. This eliminates the messages between roles on that processor but does not reduce the diameter or complexity of the algorithm, which arise from the requirement to communicate decisions between the designated Leader and the Acceptors running on different processors.
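
As a rough sketch of what coalescing looks like in an implementation (the class and method names are hypothetical, not from any of the papers): messages between co-located roles become plain method calls, while messages to roles on other processors still cross the network.

```python
class CoalescedProcess:
    """One process hosting several consensus roles (illustrative sketch only)."""

    def __init__(self, local_roles: dict, network_send):
        self.local_roles = local_roles      # role name -> role object on this process
        self.network_send = network_send    # callable for roles on other processors

    def send(self, role_name, message):
        role = self.local_roles.get(role_name)
        if role is not None:
            role.handle(message)            # co-located role: a local call,
        else:                               # no message crosses the network
            self.network_send(role_name, message)
```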

Persistent writes may limit performance

In addition to message-passing, consensus may be limited by the rate of persistent writes. Most practical implementations are designed so that when an Acceptor crashes, it restarts and rejoins the consensus process. In order to restart, each Acceptor must retain an essential part of its state in persistent storage. This typically means writing that state before sending any reply to a request from the Leader. Cacogne notes (see his “Constituent Components” section) that this is “far and away the biggest performance bottleneck for most implementations”.

Most descriptions of consensus algorithms specifically designate the portion of the Acceptor state that must be persistently retained for implementations that support restarting after a crash.
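
As a sketch of that pattern (the file layout and function names are my own, not from any implementation): the Acceptor flushes its promised/accepted state to stable storage before the reply leaves the process.

```python
import json
import os

def persist_then_reply(state: dict, path: str, send_reply) -> None:
    """Durably record the Acceptor's promised/accepted state, then reply.

    The fsync is the expensive step: the reply must not be sent until the
    state has reached stable storage, or a crash and restart could
    contradict a promise the Leader has already acted on.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())          # wait for the write to reach the disk
    os.replace(tmp_path, path)        # atomic rename keeps the old state on crash
    send_reply(state)
```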

CPU utilization may limit performance

Santos and Schiper present data (Fig. 7a, Sect. 5.2.1) showing that for the combination of small request sizes and relatively high-latency networks (such as wide-area networks), the CPU cost of consensus can limit performance. However, this case seems rarer than those where the latency of the network or of persistent storage limits performance.

7. There are several definitions of “fault-tolerant”

Distributed consensus is typically used for its fault-tolerance. But what kinds of faults must it tolerate?

Tolerating crashes

The most basic form of reliability is tolerating crashes. The classical consensus algorithms tolerate crash-stop failure, in which a process either runs correctly or it crashes, never to participate again. Consensus algorithms are parameterized by f, the maximum number of processes that can crash without preventing consensus from progressing.
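
For concreteness, here is a small sketch of the usual majority-quorum arithmetic (mine, not from the papers): classical Paxos-style algorithms deploy 2f + 1 processes so that a majority quorum survives any f crashes and any two quorums intersect.

```python
def processes_needed(f: int) -> int:
    """Majority-quorum consensus needs 2f + 1 processes to tolerate f crashes."""
    return 2 * f + 1

def quorum_size(n: int) -> int:
    """A majority quorum: any two quorums share at least one process."""
    return n // 2 + 1

n = processes_needed(f=2)    # 5 processes
q = quorum_size(n)           # quorums of 3
# With 2 processes crashed, 3 remain: still enough to form a quorum.
assert n - 2 >= q
```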

A related but independent form of fault tolerance is crash recovery: If a process crashes, it can restart, recover its prior state, and rejoin the consensus process. If the recovery is fast enough, the other processes may only observe a longer message reply time. Crash recovery is not a property of an algorithm, but of its implementation. As noted above, the choice to support crash recovery makes throughput of persistent storage a potential performance bottleneck.

Tolerating network errors

Modern networks are well-engineered but outages nonetheless occur. There are several types of errors:

- messages delayed for arbitrarily long times;
- messages lost entirely;
- messages delivered out of order;
- messages delivered more than once; and
- messages forged by a faulty or malicious party.

Classical consensus algorithms are designed to tolerate all of these but the last. They assume an asynchronous network model, which permits arbitrarily long message delivery times; this subsumes not only delays but also message loss (the message was delayed long enough for the prospective recipient to give up) and out-of-order delivery (a message sent later had a shorter delay than one sent earlier). Although tolerance of duplicate delivery may not be explicitly specified, widely used consensus algorithms tolerate it as well.
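
As a minimal sketch of why reordering and duplication are harmless (the field and method names are my own, not from any particular paper): an Acceptor tracks the highest ballot it has promised, rejects anything older, and answers repeats idempotently.

```python
class Acceptor:
    """Tolerating out-of-order and duplicate messages via ballot numbers (sketch)."""

    def __init__(self):
        self.promised_ballot = -1    # highest ballot this acceptor has promised

    def on_prepare(self, ballot: int):
        if ballot < self.promised_ballot:
            # A late or reordered message carries a stale ballot: refuse it.
            return ("nack", self.promised_ballot)
        # A duplicate of the current ballot is harmless: re-promising the same
        # ballot leaves the acceptor's state unchanged.
        self.promised_ballot = ballot
        return ("promise", ballot)
```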

The final sort of network error, forged messages, is correctly handled only by the Byzantine fault-tolerant algorithms described next.

Tolerating incorrect or hijacked processes

The standard definition of fault tolerance does not protect against a misbehaving process, which, whether due to a bug or to control by malicious hackers, can violate the safety requirements. Nor does it protect against message forgery.

Consensus algorithms that perform reliably in the presence of buggy or deliberately malicious processes are called Byzantine fault-tolerant. Byzantine fault-tolerant versions of Paxos have been developed but they have longer latencies, sustain lower throughput, and require more processors to achieve a given level of availability. Unless specifically stated to be so, consensus algorithms are not Byzantine fault-tolerant.

8. There is a tension between clarity and practicality

Descriptions of consensus have to trade off between elegant proofs and practical algorithms. For example, Lamport’s original paper develops Paxos sequentially:

Mathematicians derived the Synod protocol in a series of steps. First, they proved results showing that a protocol satisfying certain constraints would guarantee consistency and allow progress. A preliminary protocol was then derived directly from these constraints. A restricted version of the preliminary protocol provided the basic protocol that guaranteed consistency, but not progress. The complete Synod protocol, satisfying the consistency and progress requirements, was obtained by restricting the basic protocol. [pp. 136–137; emphasis in original]

In the final step, he extends the Synod protocol into the Parliamentary protocol, what we now call Multi-Paxos.

Such sequential reasoning makes for an elegant proof, with the safety invariants driving the development of the algorithm. But it has the disadvantage of spreading the logic of the algorithm across multiple representations. If your goal is to locate an algorithm suitable for production use, you have to work through the sequence when what you actually want is the final algorithm, combined with a systematic presentation of its invariants.

9. Descriptions emphasize safety but not liveness

The core purpose of consensus algorithms is ensuring safety: All processes that complete should agree on a single value. Ensuring liveness, that the processes are very likely to progress to completion, is secondary (recall that the FLP Theorem states that no algorithm can guarantee completion in the presence of processes crashing). There are multiple approaches for supporting progress, within the constraint that they not compromise the invariants upon which safety depends.

This provides useful flexibility for the implementor, who can choose methods most appropriate to their use case. But it can make descriptions of consensus algorithms incomplete, as a paper may focus on the safety guarantees and leave techniques for ensuring progress up to the reader.
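
As one example of such a technique, sketched here with invented figures: Raft, and many Paxos deployments, randomize the timeout before a process tries to become Leader, making it unlikely that competing candidates keep colliding indefinitely. This only improves the probability of progress; safety never depends on it.

```python
import random

def election_timeout_ms(base_ms: float = 150.0, jitter_ms: float = 150.0) -> float:
    """Randomized election timeout (figures are invented placeholders).

    Randomization makes it unlikely that two would-be Leaders keep starting
    competing rounds at the same moment, so some round eventually completes.
    The safety invariants hold regardless of how the timeout is chosen.
    """
    return base_ms + random.uniform(0.0, jitter_ms)
```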

When combined with the clarity/practicality tension described above, the result can be descriptions that leave the reader far from ready to implement a system.

For the sake of completeness, I note that proponents of the Raft variant of distributed consensus argue that it is more understandable in part because it directly incorporates features that are essential for practical use. As my goal in this series is to represent ideas from Howard and Mortier’s paper, which focuses on Paxos, I am not going to enter this debate.

Conclusion

Distributed consensus is a rich, complex topic with a long history. For any single paper the authors must make specific choices of emphasis, from the scope of functionality, to how the algorithm will be separated into roles, to how progress will be ensured (if at all), to which performance metrics will be emphasized, to which faults will be tolerated, to the number of implementation details described.

Authors will make different choices, with the result that many papers describe similar but not quite identical algorithms. Even papers describing the same algorithm may use different terms and roles, obscuring their common basis. To best understand a paper, locate the authors' context: the choices they have made on the topics above. This will help connect the paper to others in the field.

With the broad context of consensus laid out, in the next post I turn to the specific goals of Howard and Mortier’s paper.

Acknowledgements

I’d like to acknowledge the strong influence of Tom Cacogne’s “Understanding Paxos” on this post, in particular his early sections providing a high-level description of Paxos.