How hard is it to implement replicated state machines?

Note: Over the past eight months, I have been working with the code for Tencent’s Phxpaxos, an open-source implementation of replicated state machines, with consensus enforced by Paxos. I have learned a lot from studying this code and from comparing it with the more narrowly focused libpaxos, which does not include state machine code.

Recently I have isolated several topics well-adapted to the smaller size of blog posts and I will be adding them here.

Distributed systems are often made highly available by implementing them as replicated state machines, coordinating their operations via a consensus algorithm such as Paxos. Two popular tutorials from the 1990s describe the basic theory, but how hard is it to implement in practice? The code from the Phxpaxos project suggests that the general solution requires more work than it might seem.

The complexity of replicated state machine implementations

The basic idea of replicated state machines was first sketched in Lamport’s classic 1978 CACM article, “Time, Clocks, and the Ordering of Events in a Distributed System” (it’s easy to miss—see the left column of p. 562), though many details remained to be developed by others. More details are provided in two widely-cited surveys: Fred Schneider’s 1990 Computing Surveys article, “Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial”, and Butler Lampson’s 1996 Distributed Algorithms paper, “How to Build Highly Available Systems Using Consensus”.

Schneider’s paper is an extremely general introduction, presuming an arbitrary consensus algorithm, written when the theory was well-established but few systems were in production. As such, only a small portion of the paper is pertinent to the Phxpaxos implementation.

Lampson’s paper is primarily an introduction to specifying and reasoning about replicated state machines. He develops a notation that he then uses to specify Paxos and informally argue for its correctness. He says nothing about designing the state machines whose sequence of operations will be established by Paxos.

Schneider and Lampson’s scant coverage of the actual implementation of state machines implies that it is straightforward, perhaps trivial, once consensus is established. Yet the Phxpaxos code base implements Paxos in 5,184 lines, while implementing state machines in 2,207 lines:

Directory | Excluding / including files | Lines of .cpp, .h
--- | --- | ---
src/algorithm | Excluding checkpoint_* | 5,184
Total Paxos | | 5,184
src/sm-base | | 553
src/checkpoint | | 814
src/algorithm | Including only checkpoint_* | 840
Total state machine | | 2,207

The state machine code is 43% of the line count for the Paxos code—anything but trivial. Although line count is at best a rough indication of complexity, I think for this case, comparing proportions of a single project written by a single, small team, it provides a reasonable guide to the relative level of effort.

Bear in mind that this code does not implement any particular state machine, merely the abstract structure within which a state machine will be implemented by inheriting from the abstract classes.

What makes the code for a general replicated state machine so complex, even when the algorithm for sequencing client requests is handled separately? There are several challenges:

  1. The message lengths are arbitrary.
  2. The abstract framework requires substantial boilerplate.
  3. The system must provide a general checkpoint mechanism.

Addressing these challenges requires substantial code. This implementation effort was not just in service of Phxpaxos’s users. The Phxpaxos implementation of Paxos also uses the state machine framework to solve two common problems in general Paxos deployments: reconfiguration and leader election. These algorithms do not require the full generality of the Phxpaxos state machine framework, however. Much of the framework is only useful for more complicated cases.

I first describe the implementation complexities before turning to the two state machines internal to the library.

Arbitrary message lengths

In the replicated state machine context, consensus algorithms negotiate the “next operation” for the machines to perform. This requires passing around the full value of the machine operations. For systems such as replicated stores, the values may be large, even unbounded. The consensus algorithm must be capable of efficiently handling values of a megabyte or more. This complexity is not located in the state machine code itself but spread through the consensus code.

Phxpaxos provides an option, Options.bIsLargeValueMode, which when set increases the timeouts used when waiting for message responses. These longer timeouts are required because longer messages take measurably longer to transmit. Timeouts appropriate for shorter messages would produce too many false positives: decisions that a message was lost when it simply needed more time to arrive.
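Enabling the mode is a one-line configuration change on the Options structure. In this sketch the header path and the wrapper function are my assumptions; the flag itself is the one described above.

```cpp
#include "phxpaxos/options.h"  // header path is an assumption

// Sketch: opt in to large-value handling before starting the node, so that
// response timeouts are sized for operation values of a megabyte or more.
void ConfigureForLargeValues(phxpaxos::Options & oOptions)
{
    oOptions.bIsLargeValueMode = true;
}
```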

The abstract state machine framework

Though the framework for abstract state machines does not include any actual state machine operations, it requires considerable logic in its own right (a simplified sketch of the resulting interface follows the list):

  1. Because the system accepts an arbitrary number of state machines, not just one, it requires logic for adding state machines to the active list and locating a state machine’s address given its identifier, via the state machine factory class, SMFac. Supporting multiple state machines is not just an empty generality—as noted above, Phxpaxos’s algorithm requires two state machines itself, on top of any defined by the user of the library.
  2. The state machines require a context data structure SMCtx for carrying any necessary state from the proposing client, through the long sequence of asynchronous Paxos messages, to completion of the Paxos instance and execution by the state machine, finally returning the context, including any result code from the operation, to the proposing client.
  3. The framework must define method calls for multiple styles of operation. State machine operations in Phxpaxos can be executed singly or in batch, in active mode or checkpoint mode, requiring support of four execution styles.
  4. Generalized state machines complicate the Paxos message protocol (defined in file src/comm/commdef.h). The basic Multi-Paxos protocol implemented by Phxpaxos (enum PaxosMsgType) requires 10 message types. The checkpointing mechanism required to support general state machines adds 2 more message types to that protocol and introduces a separate protocol of 7 types (enumerated values CheckpointMsgType, CheckpointSendFileFlag, and CheckpointSendFileAckFlag). These message types and their flags must all be handled in the main Paxos logic code in files src/algorithm/instance.cpp and src/algorithm/learner.cpp. This makes the implementation harder to read than that of a pure Paxos library such as libpaxos, which does not include state machine code. I explore the complexities of checkpointing in more detail below.
  5. The framework must provide callbacks to support state machines with specialized requirements. Phxpaxos supports three such callbacks:
    1. A callback executed after a value has been chosen by a Paxos instance and just before it is executed (StateMachine.BeforePropose). This allows the state machine to update the value to reflect any changes that occurred during the Paxos rounds. This callback is used by the Phxpaxos implementation of a state machine for master election (file src/master/master_sm.cpp).

      This callback is potentially such a drag on performance of the main loop that it is only enabled when both a system-wide flag (Options.bOpenChangeValueBeforePropose) is set and a state machine-specific function (StateMachine.NeedCallBeforePropose) returns true.

    2. A callback executed after reconfiguration of Paxos node membership (Options.pMembershipChangeCallback).

    3. A callback executed upon election of a new master (Options.pMasterChangeCallback).

  6. The boilerplate state machine logic must account for the Phxpaxos “group” feature. This establishes multiple independent Paxos clusters running off the same network address, which can be used to increase throughput. The SMFac factory object and concrete classes derived from StateMachine accept a group index and must select the appropriate state for the indicated cluster.
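To make the shape of this framework concrete, here is a simplified sketch of the kind of abstract base class a user-defined machine inherits from. The method names follow those mentioned above (SMID, Execute, BeforePropose, NeedCallBeforePropose); the exact signatures and the omitted checkpoint-related methods are approximations, not the actual Phxpaxos header.

```cpp
#include <cstdint>
#include <string>

class SMCtx;  // context carried from the proposing client through to execution

// Simplified sketch of the abstract state machine base class.
class StateMachine
{
public:
    virtual ~StateMachine() {}

    // Identifier the SMFac factory uses to route a chosen value to this machine.
    virtual int SMID() const = 0;

    // Apply one chosen value; iGroupIdx selects the Paxos group and poSMCtx
    // carries the proposer's context and the operation's result code.
    virtual bool Execute(int iGroupIdx, uint64_t llInstanceID,
                         const std::string & sValue, SMCtx * poSMCtx) = 0;

    // Variant used when replaying values in checkpoint mode.
    virtual bool ExecuteForCheckpoint(int /*iGroupIdx*/, uint64_t /*llInstanceID*/,
                                      const std::string & /*sValue*/) { return true; }

    // Optional hook to rewrite a value just before it is proposed; only called
    // when both the system-wide flag and this predicate allow it.
    virtual bool NeedCallBeforePropose() { return false; }
    virtual void BeforePropose(int /*iGroupIdx*/, std::string & /*sValue*/) {}
};
```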

Taken together, these represent a substantial code base simply to establish the basic framework from which application-specific StateMachine classes are derived. Yet this is only the smaller portion of the state machine code. The largest portion implements a feature that only some state machines require and that is infrequently executed: checkpoints. I turn to that feature next.

General checkpointing

The single most complex feature required by a general state machine framework such as that of Phxpaxos is a mechanism for taking checkpoints of a state machine and restoring from them as necessary. This is complex code, yet many state machines do not need such a mechanism—neither of the two internal to Phxpaxos needs it—and those that do need checkpoints use them only in the case of crashes that require the most extensive form of recovery.

Checkpointing is complex, so complex in fact that I have moved the details into one or more separate posts. For this post, I only provide an overview.

When are checkpoints used?

Checkpoints are required in several situations, of which the simplest case is bringing a freshly-added node up to date. This typically arises when an existing node has crashed or is taken down for maintenance and the operators want to maintain the cluster’s fault-tolerance by bringing the node count back to its production level.

In this situation, checkpoints provide the history required to jump-start a fresh machine. The Paxos log may not be a complete system history because Phxpaxos garbage collects its logs, culling old values no longer necessary for active decisions. Although Paxos does not need these older values to proceed, those values are needed to bring fresh state machines up to date.

Which state machines require checkpoints?

The amount of data required to bring a newcomer up to date varies with the state machine. In the case of Paxos itself, the newcomer can use the most recent value of any active node. If the sender was not the most current node, the regular synchronization protocol within Paxos will bring both it and the newcomer up to date using the normal Paxos messages. (Phxpaxos message protocols include some shortcuts to accelerate this process but the principle applies to basic Paxos as well.)

The same principle applies to the two internal state machines of Phxpaxos, which record the current leader and current node configuration. For these machines as well, the only values that matter are the most recent settings. The history of previous leaders and configurations is irrelevant.

However, for machines with more complex state, the history is essential. Consider a key-value store. Each Paxos instance sets a single key-value pair. (I will ignore reads, as only the write history matters in this discussion.) Although for the leader and configuration machines the latest instance captures the entire state, for a key-value store the latest instance captures only the most recent write. A fresh replica of the key-value store potentially needs the complete history, incorporating every write made to the store, to be brought up to date. This complete history is provided by a checkpoint.

To support arbitrary state machines, including those that need more than the result of the last instance to become up to date, Phxpaxos must support checkpoints.
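To make the contrast concrete, here is an entirely hypothetical toy key-value machine. The class name, the “key=value” encoding, and the parsing are mine, not Phxpaxos’s; the point is that the state is the accumulation of every write, not just the latest one.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical key-value state machine: each chosen value is a single
// "key=value" write. Class and encoding are illustrative only.
class KvStateMachine /* : public StateMachine */
{
public:
    bool Execute(int iGroupIdx, uint64_t llInstanceID, const std::string & sValue)
    {
        // Each Paxos instance contributes one write. The current state is the
        // accumulation of every write, so a fresh replica cannot reconstruct
        // it from the latest instance alone; it needs a checkpoint of m_oStore
        // (or the full, un-garbage-collected log).
        const std::string::size_type iPos = sValue.find('=');
        if (iPos == std::string::npos)
            return false;
        m_oStore[sValue.substr(0, iPos)] = sValue.substr(iPos + 1);
        m_llLastInstanceID = llInstanceID;
        return true;
    }

private:
    std::map<std::string, std::string> m_oStore;
    uint64_t m_llLastInstanceID = 0;
};
```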

What checkpointing requires

You need the following to record a machine’s state and restore it from a checkpoint:

  1. A way for a node to detect when its current Paxos instance is far behind the quorum’s and it needs to be sent a checkpoint to bring it up to date. This is always the case for fresh nodes added to the cluster, but a subtle algorithm is required for an active node that has simply fallen far behind the others (a hypothetical version of this test is sketched after the list).
  2. A way for a node that has determined it needs a checkpoint to request one.
  3. A way for a node that has a recent checkpoint to decide to send it to the requesting node.
  4. A way for several eligible nodes with recent checkpoints to decide which will send its checkpoint to the requesting node. Yes, this is a consensus algorithm within the larger Paxos consensus.
  5. A way for the current node to transmit its checkpoint to the requesting node, accounting for potential message loss or corruption by the network.
  6. A way for the receiving node to reset the state of all its state machines to their checkpointed values and then enter the Paxos negotiations. Note that this may leave the node slightly behind the latest quorum and it will need to use the standard Paxos mechanisms for catching up to the quorum.
  7. A way to ensure that once a node has completed restoring its state machines from a checkpoint, it will be able to catch up to the quorum using Paxos mechanisms and not be so far behind that it needs a second checkpoint.
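To give a flavor of requirement 1, here is a hypothetical version of the “far behind” test. The names, and in particular the policy of comparing against the oldest instance still present in the garbage-collected log, are illustrative assumptions rather than Phxpaxos’s actual algorithm.

```cpp
#include <cstdint>

// Hypothetical check: if the instances this node is missing have already been
// garbage-collected from its peers' Paxos logs, replaying the log cannot bring
// it up to date; only a checkpoint can.
bool NeedCheckpoint(uint64_t llMyNextInstanceID,
                    uint64_t llQuorumMaxChosenInstanceID,
                    uint64_t llOldestInstanceStillInLog)
{
    return llMyNextInstanceID < llOldestInstanceStillInLog
        && llMyNextInstanceID < llQuorumMaxChosenInstanceID;
}
```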

It is worth emphasizing that all of this is occurring amidst the ongoing regular Paxos negotiations. Although there are no efficiency concerns for the node receiving the checkpoint, there must be minimal effects on the nodes actively participating in the Paxos cluster, especially the one sending the checkpoint.

The subtlety and complexity of these requirements make it easier to see why the checkpointing code is longer and harder to understand than one might expect at first glance.

Solving two common Paxos problems

The authors of Phxpaxos took advantage of their generalized state machine code to solve two problems common to practical Paxos deployments:

  1. Master election: Phxpaxos uses a state machine MasterStateMachine (defined in src/master/master_sm.{h,cpp}) to transition from one elected master to another.
  2. Reconfiguration: Phxpaxos uses a state machine SystemVSM (defined in src/config/system_v_sm.{h,cpp}) to transition from one configuration of nodes to another.

Both of these state machines are “memoryless”, in that their most recent instance contains all the history they need to transition to their next state, so neither of them requires the checkpoint mechanism.
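A hypothetical sketch shows why such a memoryless machine can skip checkpoints: the latest chosen value overwrites the state completely, so replaying only the most recent instance reconstructs everything. The class name and value format below are illustrative, not the actual MasterStateMachine.

```cpp
#include <cstdint>
#include <string>

// Hypothetical "memoryless" machine in the spirit of MasterStateMachine and
// SystemVSM: the latest chosen value is the entire state.
class CurrentLeaderSM /* : public StateMachine */
{
public:
    bool Execute(int iGroupIdx, uint64_t llInstanceID, const std::string & sValue)
    {
        // Overwrite, never accumulate: replaying just the most recent
        // instance reproduces the full state, so no checkpoint is needed.
        m_sCurrentLeader = sValue;
        m_llLastInstanceID = llInstanceID;
        return true;
    }

private:
    std::string m_sCurrentLeader;
    uint64_t m_llLastInstanceID = 0;
};
```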

Conclusion

The abstract definition of replicated state machines is simple, but a production-quality implementation is subtle and complex. The Phxpaxos code, which implements Paxos and provides an abstract framework for implementing replicated state machines whose operations are coordinated by that Paxos algorithm, spends roughly a third of its code implementing the state machine framework.

The state machine framework adds complexity to the Paxos implementation itself, in addition to the new code for the abstract state machine classes. Furthermore, although not all state machine implementations require checkpointing, supporting those that do adds a complicated checkpoint protocol to the implementation.

This analysis is not a criticism of Tencent’s code for Phxpaxos. Any implementation of general state machines would require comparable work. The complexity, inherent in the algorithms but concealed by the abstraction of pseudocode, is revealed when those algorithms are implemented in detail.

There have been several alternative designs that aim to manage the complexity by factoring the design along different lines:

  1. Tencent developed a storage layer implementing a version of Paxos based on optimistic concurrency. The design is described in their PaxosStore paper and the PaxosStore project code is open source. Bear in mind, though, that the state machine in PaxosStore is a direct implementation of a restricted case, not the general-purpose state machines supported by Phxpaxos.
  2. The Raft consensus algorithm integrates the state machine code into the consensus protocol.
  3. The Tango Shared Log is a new abstraction that supports implementing distributed data structures.

These alternatives may clarify the implementation of replicated state machines by drawing the abstraction boundaries differently, but the underlying complexity remains. Whichever approach you take to implementing state machines, expect to spend considerable time getting the code correct across the many edge conditions and modes of partial failure.