Distributed algorithms in PlusCal

Second in a series on using the TLA Toolbox to explore the failure tolerance of distributed algorithms. The first post presented the context and why the topic is important. The next post describes the semantics of fairness in PlusCal.

Before describing how to analyze failures in distributed algorithms in the TLA Toolbox, I want to provide some backround in the conventions I use to represent the algorithms. Although Lamport has provided a good reference for the features of the PlusCal language, there is a lack of material on how to structure the larger scale of algorithms. As my purpose in this series is to provide an intermediate level tutorial, in this post I want to describe the conventions I use to make writing distributed algorithms in PlusCal more readable and reliable, while the next post will clarify some potentially obscure aspects of PlusCal semantics. With that background out of the way, I can turn to the main topic.

The dialect

PlusCal has two dialects, PlusCal-c, which uses syntax similar to C (but different enough in the details to be confusing), and PlusCal-p, which uses syntax similar to Pascal. I prefer the PlusCal-p syntax, as it emphasizes that I am presenting an algorithm rather than a bowdlerized C program.

The common structure of many distributed algorithms

Effective PlusCal is a matter of matching the features of the language to the structure of the algorithms that we want to represent. Many distributed algorithms have a common structure, an infinite loop whose body is a collection of alternatives, each guarded by a condition:

while TRUE do
  either
    guard1
      alternative1
  or
    guard2
      alternative2
  or
    ...
  end either
end while

Each guard specifies the enabling condition for its associated alternative. In each loop iteration, one of the enabled alternatives is chosen to be executed. So long as at least one guard is true, enabling its alternative, the loop continues to execute. The loop may also terminate if it executes an alternative that ends with a goto out of the loop.

If there is an iteration in which every alternative is disabled, the algorithm has deadlocked. When checking the algorithm with TLC, you can set an option for TLC to check for deadlocks.

What is the best way to represent this structure in PlusCal?

Alternatives, actions, and atomicity

In the above structure, an “alternative” may comprise one or more actions. In TLA, action denotes a single transition in the underlying state machine. In PlusCal code, this would be all the statements lying between a guard and the first label following it. So long as the alternative enabled by a guard has no labels, it comprises a single TLA action. If the alternative includes labels, then it comprises one more TLA action than the number of labels it contains.

The granularity of an alternative determines its atomicity. If it is only a single TLA action, it is atomic: It executes instantaneously, while all other processes are paused. An alternative that includes one or more labels, on the other hand, will have its execution interspersed with actions by other processes. I will return to this point in a moment; for now, I’ll simply state that for distributed algorithms, which only communicate through messages, atomicity of alternatives in an either statement is less important than for algorithms that communicate through shared memory.

Guards in PlusCal

PlusCal guards can have rich structure. They are constructed from three language keywords that define an enabling condition: await, when, and with. The await and when keywords are interchangeable. In the Lamport algorithm, I chose when because it gave the algorithm a straightforward reading but for other algorithms await might read better.

Whereas await and when simply establish an enabling condition, the with keyword combines an enabling condition with the operation of choosing a value from a set. An expression such as

with a \in { ... some set ...}

is enabled when the set is nonempty. Furthermore, when the set is nonempty, a value is nondeterministically chosen and assigned to a.

A guard composed of just one of these keywords is simple enough to interpret. For example, in the following snippet,

when /\ 1 < b
     /\ b < 10;
  \* guarded alternative ...

the alternative is enabled when a is strictly between 1 and 10.

However, PlusCal allows combining enabling statements and even interspersing assignments with guards. Guards can be placed at any point in an action, even as the last statement. The evaluation rule for actions is:

An action is enabled if and only if every one of its await, when, and with statements is enabled. If even one guard is disabled, no statement in the action is executed.

This rule implies that the evaluation may backtrack when a guard depends upon the result of an assignment. Consider this snippet:

a := b + 1;
await a > 2;

If b <= 1, the action will not be enabled and so a will not in fact be assigned in the preceding statement. The action comprising these two statements is only executed if the assignment were to have resulted in a > 2.

Another way of viewing this is that the mathematical equalities in the TLA underlying PlusCal allow us to rewrite the above statements as the equivalent

await b > 1;
a := b + 1;

moving the guard before the assignment.

Atomic actions and data races in distributed algorithms

The second requirement for effective use of PlusCal is knowing when and how to ensure atomic actions. This is directly related to the distinction between local and global variables.

In a completely distributed algorithm, processes only communicate via messages. Writing such an algorithm in PlusCal, the system state is split, such that

Because process variables are local, you can freely mix references and assignments to them, without concern for intervening actions by other processes, whereas for global variables, mixing references and assignments may introduce race conditions. The syntactic distinction between local and global variables highlights the ones for which atomic access is important:

When writing a message-passing algorithm in PlusCal, the only variables for which atomic access matters are the global variables representing the network.

Example contrasting local and global variables

For example, consider the following algorithm, which includes one global variable and two local variables:

--algorithm RaceAlgo
variables
  \* Global variable
  glob

process Proc \in 1 .. 2
variables
  \* Variables local to process
  loc1 \in 1 .. 2,
  loc2

begin
(* Unaffected by other processes because loc1, loc2 are local ... *)
L1:
  loc2 := loc1 + 1;
L2:
  loc2 := loc2 + 1;

(* ... but following has race condition because glob is global *)
L3:
  glob := loc1 + 1;
L4:
  glob := glob + 1;

  print <<self, loc1, loc2, glob>>

end process
end algorithm

Running this on TLC, it finds 54 distinct end states. The algorithm is symmetric, so there are 6 distinct outcomes, repeated for each process:

loc1 loc2 glob
1 3 3
1 3 4
1 3 5
2 4 3
2 4 4
2 4 5

In every case, loc2 equals loc1 + 2, but not necessarily for glob, which can have three different outcomes for each value ofloc1. We get glob equal to loc1 + 2 if the two processes are executed sequentially, producing the two highlighted rows: 1, 3, 3 and 2, 4, 4. But those aren’t the only outcomes. In addition to them, there are four other outcomes for glob, from executions that intermingled execution of the statements at L3 and L4 by Processes 1 and 2.

As the example demonstrates, we can assign to local variables without concern for the presence of labels but when assigning to global variables we must be alert to the placement of labels, as those mark points where a different process could intervene and update the globals.

Packaging network operations as macros

Given that global variables represent the network state and are also the variables for which we need to ensure atomic access, it would be useful to have a way to ensure our network operations are atomic. PlusCal macros fulfill this need because they are always atomic, not allowed to include a label. Macros also package the slightly unfamiliar representation of networks as sets and functions into familiar operation names.

For example, in the Lamport algorithm, the network is represented as a function of type Channel, a mapping from source and destination process ids to a Sequence of Messages:

Channel == [src \in Pid |-> [dst \in Pid |-> Seq(Message)]]

The network itself is a variable channel of that type, and a network broadcast is represented by

channel := [channel EXCEPT ![self] =
             [dst \in Pid |->
               IF dst = self THEN channel[self][self]
                             ELSE Append(channel[self][dst], msg)]]

This is more readably written as the macro call

Broadcast(msg);

For a more complex example, consider receiving a message. The receiving process must perform three steps in sequence:

  1. Wait for a message of a specified type to be available at the head of the sequence sent from at least one source,
  2. pick the process id of one source when the wait is satisfied, and
  3. remove that message from the sequence, passing the message contents src and time to the destination.

The resulting code is subtle:

with s \in {src \in Pid: /\ Len(channel[src][dst]) > 0
                         /\ Head(channel[src][dst]).type = type
           } do
  src := s;
  time := Head(channel[src][self]).time;
  channel := [channel EXCEPT ![src][self] = Tail(channel[src][self])]
end with

It’s far more readable when packaged into a macro call:

Receive("AckReq", src, time);

The Receive macro functions as both a guard and, when the guard is enabled and allows the assignments to be executed, updates the global state. A message is removed from the network (represented by channel) and its contents, a sender id and logical clock value, are assigned to local variables. In the actual algorithm, its call looks like

either
 ...
or
  Receive("AckReq", src, time); \* Guard; updates src and time
    clock := Max(clock, time);  \* Update local clock
    acks := acks \union {src}   \* Record receipt of acknowledgement
or
 ...

I indent the statements following the guard to indicate that their execution is controlled by the guard.

Summary

Effective use of PlusCal for distributed algorithms requires matching the language features to the structure of the algorithm. Typically, the only actions that have to be atomic are those that reference or update global variables. Package global accesses as macros, ensuring atomic access. Given that these actions are nearly always network operations, the macros also make the operations clearer than the underlying TLA expressions.

Pay attention to the guards enabling the alternative choices in an algorithm. Often these guards will be macros representing network operations that wait on a message of a specific type and retrieve the message contents when one is received.

I have found that adopting these conventions makes my algorithms clearer and more reliable. In the next post, I’ll go beneath this syntax and explore a subtle point of the semantics of PlusCal.