Exploring distributed algorithm failures using the TLA+ Toolbox

First in a series on using the TLA+ Toolbox to explore the failure tolerance of distributed algorithms. The next post describes conventions for writing distributed algorithms in PlusCal.

I have wanted to write a series about Leslie Lamport’s TLA+ Toolbox for some time. For years, I’ve been following the growing use of formal and semi-formal methods in software development, including such tools as Alloy, Coq, Spin, and Isabelle. And that’s just a taste of what’s on offer: consider the tools Byron Cook lists at the end of his talk on formal verification of security at Amazon Web Services.

Of the tools I have explored, the TLA+ language and the TLC model checker represent a fascinating design point:

Lamport has devoted his career to developing provably correct distributed algorithms, and the Toolbox distills this experience into executable form. The rest of us, those who haven’t won a Turing Award, can use the wisdom embedded in these tools to approach that level of performance.

Focusing on the failure tolerance of distributed algorithms

Despite my interest in the Toolbox, I have held off writing anything about it because so much has already been written. In fact, the newcomer is almost held back by the sheer number of tutorials and references available at the Toolbox site, to say nothing of the many others around the Web. There seemed no point in duplicating what was already so well described elsewhere.

In recent months, however, I have found a novel angle for approaching the Toolbox: using it to study the failure tolerance of distributed algorithms. This is a slippery topic, yet one of growing importance. Modern service-oriented designs, running in cloud data centres, are expected to be resilient in the face of failure. The most basic form of this resilience is that the service must run on multiple instances and continue when a subset of those instances fails. A necessary condition for such resilience is an algorithm designed to be replicated and to continue despite the failure of one or more instances.

Notably, Lamport has contributed one of the oldest and most famous resilient algorithms, Paxos. Opinions vary on how easy Paxos is to understand: Lamport considers it simple, others see it as at least moderately complex, while still others emphasize the substantial extra work required to make it practical and production-ready. Yet all these discussions focus on the cases where at least a majority of the instances run without failure. What of the cases where one instance crashes (which the algorithm should handle), or where so many instances crash that it cannot proceed?
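
One way to put such questions to a model checker is to make crashing an explicit action in the specification, so that the checker explores behaviours with and without failures. Here is a minimal sketch; the names (Instances, MaxCrashes, alive) are hypothetical choices of mine, not drawn from any Paxos paper:

    ---- MODULE CrashSketch ----
    \* A hypothetical fragment: crashing modelled as an ordinary action.
    EXTENDS Naturals, FiniteSets

    CONSTANTS Instances,   \* e.g. {"a", "b", "c"}
              MaxCrashes   \* how many instances the model may lose

    VARIABLE alive         \* the set of instances still running

    Init == alive = Instances

    \* Any running instance may crash, up to the stated bound.
    Crash == /\ Cardinality(Instances \ alive) < MaxCrashes
             /\ \E i \in alive : alive' = alive \ {i}
    ====

Composed with an algorithm’s own actions, Crash lets TLC explore every interleaving of failures with normal execution, which is exactly the question above: does the algorithm proceed after one crash, and what happens after MaxCrashes of them?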

The benefits of model checking

Model checking can help tackle the challenge of understanding whether a distributed algorithm proceeds or fails in the presence of failed instances. A model checker lets you explore, mechanically and exhaustively, the behaviours an algorithm allows.

A model checker run is a superpowered version of a regular test run. Where a test run simply demonstrates a single behaviour, a model checker run tests all behaviours up to some maximum length. Although such runs are less complete than proofs (a proof is valid for all behaviours, regardless of length), runs of a model checker are far more complete than a single run of the program. With a model checker, the designer can see the implications of a change on a wide range of behaviours in a short time. The method does not provide 100% confidence in your conclusions but it provides far more confidence than mere runs of the program could ever provide.
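
To make “all behaviours up to some maximum length” concrete, consider this hypothetical toy specification. Given the state constraint below (named as a CONSTRAINT in the TLC model), TLC enumerates every behaviour the spec allows within the bound, where a conventional test would sample just one:

    ---- MODULE BoundSketch ----
    \* A hypothetical toy spec: a clock that may tick forever.
    EXTENDS Naturals

    VARIABLE clock

    Init == clock = 0
    Tick == clock' = clock + 1
    Spec == Init /\ [][Tick]_clock

    \* Given to TLC as a CONSTRAINT: it stops exploring successors
    \* of any state in which this predicate is false.
    Bound == clock <= 5
    ====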

… and the horror, the horror!

However useful a model checker can be, I worry that I violate the express wishes of the TLA+ team when I apply their Toolbox to investigate failure tolerance, which typically requires writing liveness properties as moderately complex formulas of temporal logic. In Specifying Systems, the TLA+ book, Lamport advocates focusing instead on safety properties, which can typically be written as state predicates without temporal operators, avoiding liveness properties and the more complicated temporal logic they require:

Experience shows that most of the benefit from writing and using a specification comes from the safety part … [W]hen looking for errors, most of your effort should be devoted to examining the safety part … The unbridled use of temporal logic produces formulas that are hard to understand. (p. 116)
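
To see the contrast concretely, here are hypothetical property definitions for a mutual-exclusion specification of the kind examined later in this series; pc and Procs are assumed to come from the algorithm’s PlusCal translation:

    \* Safety: a state predicate, checked as an invariant. No temporal
    \* operator is needed beyond the implicit "always".
    MutualExclusion ==
        \A i, j \in Procs :
            (i # j) => ~(pc[i] = "cs" /\ pc[j] = "cs")

    \* Liveness: "leads to" (~>) is a temporal operator, and checking
    \* it also requires fairness assumptions on the specification.
    StarvationFreedom ==
        \A i \in Procs : (pc[i] = "waiting") ~> (pc[i] = "cs")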

I am unsure whether Lamport would consider my use of temporal logic in the following posts “well-bridled”, but I do believe that such formulas matter more than they once did, now that production systems are judged on responsiveness as well as correctness. It is not enough that a system in production ensure that at most one process is in a critical section at a given time (a safety property). The system must also ensure that processes attempting to enter that critical section are admitted without undue delay (a liveness property). As Dean and Barroso argue in “The Tail at Scale”,

Just as fault-tolerant computing aims to create a reliable whole out of less reliable parts, we suggest that large online services need to create a predictably responsive whole out of less predictable parts. [Abstract]

These guarantees are ultimately probabilistic and, as such, cannot be tested by model checking, which is fundamentally logical. Yet model checking can test the logical properties that stand as prerequisites to acceptable latency distributions: an algorithm that halts when an instance crashes or becomes partitioned can never provide acceptable latencies in the long tail. The tests described in this series are insufficient to guarantee success, but their failure guarantees that the design will be inadequate in the face of some types of failure.
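
The liveness properties in this series therefore take a characteristic shape: progress is demanded only of the instances that survive. A hypothetical example, reusing the alive set from the crash sketch above:

    \* Every process that never crashes is eventually admitted whenever
    \* it waits; crashed processes are exempt, but their failure must
    \* not leave the survivors waiting forever.
    ProgressForSurvivors ==
        \A i \in Procs :
            [](i \in alive) => ((pc[i] = "waiting") ~> (pc[i] = "cs"))

If this property fails, TLC reports a behaviour, typically ending in a loop, in which some crash leaves a surviving process waiting forever: precisely the long-tail pathology Dean and Barroso warn about.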

What we stand to learn and not learn

Taking all this together, using TLA+ to explore a distributed algorithm’s responses to failure offers several lessons: whether the algorithm continues to make progress when an instance crashes, at what point accumulating failures bring it to a halt, and how quickly design changes ripple through the space of possible behaviours.

It is also important to bear in mind what the method will not do: it is not a proof, since a model checker run covers behaviours only up to some maximum length, and it cannot establish the probabilistic latency guarantees a production service must meet, only rule out designs whose logical flaws make those guarantees unattainable.

Despite these limitations, the approach remains useful, and fun. I’ll start by considering a simpler algorithm than Paxos: the distributed mutual exclusion algorithm from Lamport’s 1978 paper “Time, Clocks, and the Ordering of Events in a Distributed System”. This algorithm was designed under an assumption of perfect instances running on a perfect network, which makes it a revealing case for analyzing what happens when instances fail. Even this simple algorithm will provide fodder for a whole series of posts.