Simulation checks of the PlusCal version of Lamport's algorithm

07 May 2019 Tags: Distributed systems

Fifth in a series on using the TLA Toolbox to explore the failure tolerance of distributed algorithms. The previous post was a description of Lamport’s 1978 mutual exclusion algorithm.

My first step when checking a new algorithm is to run a simulation. In this mode, TLC does not exhaustively search the tree but follows a single trace up to a maximum depth. The search continues until the operator cancels it. The assurances provided by such runs are both broader and more provisional than exhaustive searching: Broader because I can set simulation traces to proceed much further than those in an exhaustive search, but more provisional because there is no assurance that the odd, error-prone corner cases will be checked.

Simulation runs are useful for their capacity to identify simple errors due to typos and logical oversights. Exhaustive checks require some work to set up; there is no point to spending that effort until after I have some assurance that no easily-found mistakes remain.

Checking the PlusCal version of Lamport’s algorithm

The specification of a model checking run comprises several parts:

The invariants to be checked. These are usually defined in the TLA code trailing the TLA translation of the PlusCal.
An optional view, also defined in the trailing TLA code.
The options for the TLC model checker. In the Toolbox, these are defined in a series of dialogue box panels. In this series, I represent the options in YAML instead.

Invariants and views

The model checker requires that invariants be defined for checking.

For the algorithm to be considered correct, it must maintain two invariants. The first ensures the correct types of non-temporary variables, the second ensures that at most one process is ever in the critical section:

\* Type invariant
TypeOK ==
  \* Globals
  /\ channel \in [Pid -> [Pid -> Seq(Message)]]
  /\ crit \in SUBSET Pid
  \* Locals
  /\ clock \in [Pid -> ClockVal]
  /\ requests \in [Pid -> [Pid -> ClockVal]]

\* At most one process owning the resource
MutualExclusion == Cardinality(crit) < 2

Checking can be accelerated by defining a view that excludes temporary variables from the check. This reduces the state space explored by the model checker while ensuring the resulting space covers the essential states:

\* Essential variables to be monitored by TLC
View == <<channel, crit, clock, acks, requests, pc>>

Parameters to the model checker

The TLC parameters defining the actual run are spread across multiple dialogue box panels (in the Toolbox) or multiple files (when running TLC from the command line). This makes it hard to keep track of them all. It also makes it hard to record the parameters for reproducing the results.

I instead define the parameters in a YAML file. Ultimately, I intend to write a Python driver that reads such a file, creates the necessary configuration files for running the command-line version of TLC, and runs the check. For now, I record the parameters here and enter them into the IDE by hand.

Notes on two YAML syntax features that may be unfamiliar:

The > symbol introduces a running text block. Line boundaries are ignored.
The | symbol introduces a preformatted text block. Line boundaries are preserved, as they are by the HTML PRE tag.

Description: >
  Simulation run of PlusCal representation of Lamport 1978.

Specification:
  Behavior:
    Temporal-formula: Spec

  Definition-overrides: |
    ClockVal <- Nat

Model:
  Declared-constants: |
    N <- 2
    MaxClock <- 8
    defaultInitValue <- [model value]

Checks:
  Deadlock: false
  Invariants:
    - TypeOK
    - MutualExclusion

Run-type:
  Simulation:
    Max-trace-length: 120
    Seed: 3154741148735185214

Run-parameters:
  Workers: 2

Results of the simulation

I ran the simulation for two minutes. In that time, 19 million states were explored (some of which might be duplicates):

Time (min)	States explored
1:00	9,100,140
2:00	19,242,611

The invariants were preserved in all cases. This does not guarantee correctness because we cannot estimate how many states remained unexplored. It does provide confidence that the code has no “obvious” defects.

Checking Merz’s TLA representation of Lamport’s algorithm

For comparison, I ran a comparable simulation on Merz’s original TLA version of the algorithm.

Merz’s safety invariant is slightly different in form but tests the same property as mine above:

Mutex == \A p,q \in crit : p = q

I used specifications comparable to the ones above for my PlusCal version. I inadvertently left the deadlock check on but that had no effect on the outcome:

Model-description: >
  Simulation run of Merz's TLA representation of Lamport 1978.
  See https://github.com/tlaplus/Examples/tree/master/specifications/lamport_mutex

Specification:
  Behavior:
    Temporal-formula: Spec

  Definition-overrides: |
    Clock <- 1 .. 130

Model:
  Declared-constants:|
    N <- 2
    maxClock <- 8

Checks:
  Deadlock: true
  Invariants:
    - TypeOK
    - Mutex

Run-type:
  Simulation:
    Max-trace-length: 120
    Seed: 692067743782335802

Merz’s simpler TLA representation allowed TLC to check 37% more states than for my PlusCal representation, though again a given state may well have been counted multiple times:

Time (min)	States explored
1:00	12,898,239
2:00	26,229,053

Conclusion

These simulation runs demonstrated that this PlusCal representation of Lamport’s algorith had no basic errors. To increase our confidence, we need to run exhaustive checks. Even those will not prove the algorithm’s correctness in a mathematical sense but full checks will guarantee that we have checked every state within the parameters of the run. With that result, we can begin to address the real topic of this series: exploring how the algorithm fails when its assumptions are violated.

All my marbles in one place