Distributed algorithms in PlusCal
03 Jan 2019 Tags: Distributed systemsSecond in a series on using the TLA Toolbox to explore the failure tolerance of distributed algorithms. The first post presented the context and why the topic is important. The next post describes the semantics of fairness in PlusCal.
Before describing how to analyze failures in distributed algorithms in the TLA Toolbox, I want to provide some backround in the conventions I use to represent the algorithms. Although Lamport has provided a good reference for the features of the PlusCal language, there is a lack of material on how to structure the larger scale of algorithms. As my purpose in this series is to provide an intermediate level tutorial, in this post I want to describe the conventions I use to make writing distributed algorithms in PlusCal more readable and reliable, while the next post will clarify some potentially obscure aspects of PlusCal semantics. With that background out of the way, I can turn to the main topic.
The dialect
PlusCal has two dialects, PlusCal-c, which uses syntax similar to C (but different enough in the details to be confusing), and PlusCal-p, which uses syntax similar to Pascal. I prefer the PlusCal-p syntax, as it emphasizes that I am presenting an algorithm rather than a bowdlerized C program.
The common structure of many distributed algorithms
Effective PlusCal is a matter of matching the features of the language to the structure of the algorithms that we want to represent. Many distributed algorithms have a common structure, an infinite loop whose body is a collection of alternatives, each guarded by a condition:
while TRUE do
either
guard1
alternative1
or
guard2
alternative2
or
...
end either
end while
Each guard specifies the enabling condition for its associated
alternative. In each loop iteration, one of the enabled alternatives
is chosen to be executed. So long as at least one guard is true,
enabling its alternative, the loop continues to execute. The loop may
also terminate if it executes an alternative that ends with a goto
out of the loop.
If there is an iteration in which every alternative is disabled, the algorithm has deadlocked. When checking the algorithm with TLC, you can set an option for TLC to check for deadlocks.
What is the best way to represent this structure in PlusCal?
Alternatives, actions, and atomicity
In the above structure, an “alternative” may comprise one or more actions. In TLA, action denotes a single transition in the underlying state machine. In PlusCal code, this would be all the statements lying between a guard and the first label following it. So long as the alternative enabled by a guard has no labels, it comprises a single TLA action. If the alternative includes labels, then it comprises one more TLA action than the number of labels it contains.
The granularity of an alternative determines its atomicity. If it is
only a single TLA action, it is atomic: It executes instantaneously,
while all other processes are paused. An alternative that includes one
or more labels, on the other hand, will have its execution
interspersed with actions by other processes. I will return to this
point in a moment; for now, I’ll simply state that for distributed
algorithms, which only communicate through messages, atomicity of
alternatives in an either
statement is less important than for
algorithms that communicate through shared memory.
Guards in PlusCal
PlusCal guards can have rich structure. They are constructed from
three language keywords that define an enabling condition: await
,
when
, and with
. The await
and when
keywords are
interchangeable. In the Lamport algorithm, I chose when
because it
gave the algorithm a straightforward reading but for other algorithms
await
might read better.
Whereas await
and when
simply establish an enabling condition, the
with
keyword combines an enabling condition with the operation of
choosing a value from a set. An expression such as
with a \in { ... some set ...}
is enabled when the set is nonempty. Furthermore, when the set is
nonempty, a value is nondeterministically chosen and assigned to a
.
A guard composed of just one of these keywords is simple enough to interpret. For example, in the following snippet,
when /\ 1 < b
/\ b < 10;
\* guarded alternative ...
the alternative is enabled when a
is strictly between 1 and 10.
However, PlusCal allows combining enabling statements and even interspersing assignments with guards. Guards can be placed at any point in an action, even as the last statement. The evaluation rule for actions is:
An action is enabled if and only if every one of its await
,
when
, and with
statements is enabled. If even one guard is
disabled, no statement in the action is executed.
This rule implies that the evaluation may backtrack when a guard depends upon the result of an assignment. Consider this snippet:
a := b + 1;
await a > 2;
If b <= 1
, the action will not be enabled and so a
will not in fact
be assigned in the preceding statement. The action comprising these
two statements is only executed if the assignment were to have
resulted in a > 2
.
Another way of viewing this is that the mathematical equalities in the TLA underlying PlusCal allow us to rewrite the above statements as the equivalent
await b > 1;
a := b + 1;
moving the guard before the assignment.
Atomic actions and data races in distributed algorithms
The second requirement for effective use of PlusCal is knowing when and how to ensure atomic actions. This is directly related to the distinction between local and global variables.
In a completely distributed algorithm, processes only communicate via messages. Writing such an algorithm in PlusCal, the system state is split, such that
- the state of the algorithm is contained in the local variables of its processes, and
- the state of the network, the messages in transit, is contained in the global variables.
Because process variables are local, you can freely mix references and assignments to them, without concern for intervening actions by other processes, whereas for global variables, mixing references and assignments may introduce race conditions. The syntactic distinction between local and global variables highlights the ones for which atomic access is important:
When writing a message-passing algorithm in PlusCal, the only variables for which atomic access matters are the global variables representing the network.
Example contrasting local and global variables
For example, consider the following algorithm, which includes one global variable and two local variables:
--algorithm RaceAlgo
variables
\* Global variable
glob
process Proc \in 1 .. 2
variables
\* Variables local to process
loc1 \in 1 .. 2,
loc2
begin
(* Unaffected by other processes because loc1, loc2 are local ... *)
L1:
loc2 := loc1 + 1;
L2:
loc2 := loc2 + 1;
(* ... but following has race condition because glob is global *)
L3:
glob := loc1 + 1;
L4:
glob := glob + 1;
print <<self, loc1, loc2, glob>>
end process
end algorithm
Running this on TLC, it finds 54 distinct end states. The algorithm is symmetric, so there are 6 distinct outcomes, repeated for each process:
loc1 |
loc2 |
glob |
---|---|---|
1 | 3 | 3 |
1 | 3 | 4 |
1 | 3 | 5 |
2 | 4 | 3 |
2 | 4 | 4 |
2 | 4 | 5 |
In every case, loc2
equals loc1 + 2
, but not necessarily for
glob
, which can have three different outcomes for each value
ofloc1
. We get glob
equal to loc1 + 2
if the two processes are
executed sequentially, producing the two highlighted rows: 1, 3, 3 and
2, 4, 4. But those aren’t the only outcomes. In addition to them,
there are four other outcomes for glob
, from executions that
intermingled execution of the statements at L3
and L4
by
Processes 1 and 2.
As the example demonstrates, we can assign to local variables without concern for the presence of labels but when assigning to global variables we must be alert to the placement of labels, as those mark points where a different process could intervene and update the globals.
Packaging network operations as macros
Given that global variables represent the network state and are also the variables for which we need to ensure atomic access, it would be useful to have a way to ensure our network operations are atomic. PlusCal macros fulfill this need because they are always atomic, not allowed to include a label. Macros also package the slightly unfamiliar representation of networks as sets and functions into familiar operation names.
For example, in the Lamport algorithm, the network is represented as a
function of type Channel
, a mapping from source and
destination process ids to a Sequence
of Messages
:
Channel == [src \in Pid |-> [dst \in Pid |-> Seq(Message)]]
The network itself is a variable channel
of that type, and a
network broadcast is represented by
channel := [channel EXCEPT ![self] =
[dst \in Pid |->
IF dst = self THEN channel[self][self]
ELSE Append(channel[self][dst], msg)]]
This is more readably written as the macro call
Broadcast(msg);
For a more complex example, consider receiving a message. The receiving process must perform three steps in sequence:
- Wait for a message of a specified
type
to be available at the head of the sequence sent from at least one source, - pick the process id of one source when the wait is satisfied, and
- remove that message from the sequence, passing the message contents
src
andtime
to the destination.
The resulting code is subtle:
with s \in {src \in Pid: /\ Len(channel[src][dst]) > 0
/\ Head(channel[src][dst]).type = type
} do
src := s;
time := Head(channel[src][self]).time;
channel := [channel EXCEPT ![src][self] = Tail(channel[src][self])]
end with
It’s far more readable when packaged into a macro call:
Receive("AckReq", src, time);
The Receive
macro functions as both a guard and, when the guard is
enabled and allows the assignments to be executed, updates the global
state. A message is removed from the network (represented by
channel
) and its contents, a sender id and logical clock value, are
assigned to local variables. In the actual algorithm, its call looks like
either
...
or
Receive("AckReq", src, time); \* Guard; updates src and time
clock := Max(clock, time); \* Update local clock
acks := acks \union {src} \* Record receipt of acknowledgement
or
...
I indent the statements following the guard to indicate that their execution is controlled by the guard.
Summary
Effective use of PlusCal for distributed algorithms requires matching the language features to the structure of the algorithm. Typically, the only actions that have to be atomic are those that reference or update global variables. Package global accesses as macros, ensuring atomic access. Given that these actions are nearly always network operations, the macros also make the operations clearer than the underlying TLA expressions.
Pay attention to the guards enabling the alternative choices in an algorithm. Often these guards will be macros representing network operations that wait on a message of a specific type and retrieve the message contents when one is received.
I have found that adopting these conventions makes my algorithms clearer and more reliable. In the next post, I’ll go beneath this syntax and explore a subtle point of the semantics of PlusCal.