Motivating global stabilization

By Samuel Moelius, Staff Engineer
Originally published on October 12, 2021

Consensus protocols have come to play a critical role in many applications. Fischer, Lynch, and Paterson’s classic impossibility result showed that under reasonable assumptions, it can be impossible for a protocol to reach consensus. In Dwork, Lynch, and Stockmeyer’s paper “Consensus in the Presence of Partial Synchrony” (the DLS paper), the authors circumvent this impossibility result by introducing the following “global stabilization time” (GST) assumption.

For each execution there is a global stabilization time (GST), unknown to the processors, such that the message system respects the upper bound Δ from time GST onward.

In other words, GST is a point in time after which all network messages are delivered with a delay of at most Δ. Dwork, Lynch, and Stockmeyer showed that, under this assumption, one can construct a protocol that is guaranteed to reach consensus.

But at face value, the assumption can seem unrealistic. After all, real networks do not work this way! Think of the network in your home or office. It is unlikely that at some magical point in time, the network will become reliably responsive, from then through eternity.

So why should you care? Well, even though the GST assumption can seem unrealistic, it has tremendous utility, as we explain in this post:

One can see consensus as a game between two players and the GST assumption as a means of limiting the moves of one of those players. From this perspective, the assumption is quite natural, elegant even.
Protocols that achieve consensus under this assumption exhibit a very useful property: they can recover from any configuration that could result from delays. Thus, proving a protocol correct under this assumption is not just of theoretical significance—it has concrete, practical implications.

We begin by reviewing Fischer, Lynch, and Paterson’s classic impossibility result, which laid the groundwork for the DLS paper. We then discuss the GST assumption in the context of a game between two players, with GST as a means of limiting the moves of one of those players. Finally, we shine a light on the practical implications of proving a protocol correct under the GST assumption.

The FLP impossibility result

In “Impossibility of Distributed Consensus with One Faulty Process,” Fischer, Lynch, and Paterson showed that under very mild assumptions, it can be impossible for a set of processes to reach consensus. This has come to be known as the FLP impossibility result. The DLS paper introduced the GST assumption, which makes it possible to circumvent this result, as highlighted in EATCS’s citation awarding the paper the Edsger W. Dijkstra Prize in Distributed Computing in 2007:

The eventual synchrony [assuming GST] approach introduced in this paper … has since been established as the leading approach for circumventing the FLP impossibility result and solving asynchronous consensus, atomic broadcast, and state-machine replication.

In this section, we describe the main ideas of Fischer, Lynch, and Paterson’s proof to provide context for the work presented in the DLS paper.

The model

In the FLP model, two or more processes exchange messages in an attempt to agree on a value, either 0 or 1. Each process has a write-once output register that is initially blank. Once a process believes a value has been agreed upon, it writes the value to its output register.

The processes reach consensus if at least one process writes a value to its output register, and no other process writes the opposite value to its own output register. Note that this definition of consensus is extremely lenient. For example, one might expect that all processes have to write the same value to their output registers. Adopting such a lenient definition strengthens the FLP result.

A configuration consists of the internal state of all processes (including their output registers), along with all messages that have been sent but not yet delivered. An event e is a pair (p, m) consisting of a message, m, addressed to a process, p. In our discussion of the proof in the section that follows, we use language like “e is delivered” instead of “m is delivered to p.” Furthermore, we say “e0, e1, … are delivered in configuration C0” to mean the following:

e0 is delivered in configuration C0, resulting in some configuration C1.
e1 is delivered in configuration C1, resulting in some configuration C2.
And so on.

Things can go awry for a protocol in two ways:

Any message may be delayed an arbitrary but finite amount of time.
One process may crash; that is, one process may stop responding to messages.

Note that if a process doesn’t receive a message that it expects to receive, it has no way of knowing whether the message is simply delayed or whether the sender has crashed.

Fischer, Lynch, and Paterson showed that in this model, consensus cannot be guaranteed.

The proof

Here, we try to give the intuition for the FLP result’s proof. Our description is intentionally a little fast and loose. For a more detailed explanation, we recommend Henry Robinson’s blog post, “A Brief Tour of FLP Impossibility.”

Consider the authors’ central lemma, which says essentially the following. If C is a bivalent configuration—meaning that the decision values 0 and 1 are possible—and e is an event that is deliverable in C, then e can be delivered in a way that results in a bivalent configuration. In more precise terms, there is a sequence of events ending with e that are deliverable in C and that result in a bivalent configuration. Using this lemma, the authors show that it is possible to deliver all messages in a way that no value is ever agreed upon.

The lemma’s proof goes essentially as follows. Let C be any configuration from which either 0 or 1 could be agreed upon, and let e be any event that is deliverable in C. Consider a configuration that results from delivering a sequence of events ending in e. By way of contradiction, suppose that any such configuration allows only 0 or 1 to be agreed upon, but no such configuration allows both possibilities (i.e., no such configuration is bivalent).

The authors show that, under these assumptions, both types of configurations must exist: configurations that allow only 0 to be agreed upon, and configurations that allow only 1 to be agreed upon. In fact, the authors show that there must exist a configuration C0, an event e’ for the same process as e, and a value i ∈ {0, 1}, such that the following holds:

If e is delivered in C0, then only i can be agreed upon.
If e’ and then e are delivered in C0, then only 1 – i can be agreed upon.

(It may be helpful to glance ahead at figure 1 at this point.)

Essentially, delivering e in C0 locks the processes to value i, while delivering e’ and then e in C0 locks the processes to value 1 – i. (We found jbapple’s StackExchange answer helpful in understanding this part of the proof.)

Now, let p be the recipient process named in e and e’. Because process p could crash, there must exist a sequence of events σ that is deliverable in C0, that does not involve p, and that leads to an agreement. Let A be the resulting configuration, and let j be the value agreed upon. Note that agreeing on j here does not involve e or e’, as neither is in σ.

On the other hand, while process p could crash, it might not; messages to or from p might simply be delayed. Recall that other processes cannot tell the difference between these two cases. So configuration A should be reachable if messages to or from p are simply delayed. Moreover, events e and e’ should be deliverable in A.

Consider the configuration that results from delivering e in A. This configuration should be the same as the configuration that results from delivering e and then σ in C0, as σ does not involve p. Next, consider the configuration that results from delivering e’ and then e in A. Again, because σ does not involve p, the result should be the same as the result of delivering e’, then e, and then σ in C0. (See Figure 1.)

Figure 1: Diagram used in the proof of the FLP result’s central lemma

To summarize, from C0, there are three key waypoints:

If e is delivered, the processes agree on i (D0).
If e’ and then e are delivered, the processes agree on 1 – i (D1).
If σ is delivered, the processes agree on j (A). (Remember, neither e nor e’ is in σ.)

But e and e’ are deliverable in A. Delivering e in A should lead to an agreement on i(E0), while delivering e’ and then e in A should lead to an agreement on 1 – i(E1). However, j was already agreed upon in A, and j cannot be both i and 1 – i. Thus, a contradiction results.

Limiting the adversary

Let us return to the DLS paper, which contains the following passage:

It is helpful to view each situation as a game between a protocol designer and an adversary. … If Δ holds eventually [i.e., the GST assumption holds], the adversary picks Δ, the designer (knowing Δ) supplies a consensus protocol, and the adversary picks a time T when Δ must start holding.

We can take this metaphor a bit further. After committing to time T(GST), the adversary simulates the supplied consensus protocol. Recall that there are two ways in which things can go awry for the protocol: messages may be delayed, and one process may crash. One can think of delaying a message or crashing a process as moves are available to the adversary. The adversary wins if it can get the protocol to violate one of its correctness conditions. Specifically, the adversary wins if it can get two processes to write different values to their output registers, or if it can get the simulation to run forever without any process ever writing a value to its output register.

With this in mind, one might interpret the FLP result to mean that, unhindered by the GST assumption, the adversary cannot be defeated. In other words, in the game described above, the adversary always has a (possibly infinite) winning sequence of moves.

Let us give a face to the adversary. What kind of adversary could delay messages or cause a process to crash? We like to think of an electrical storm.

Imagine a bunch of residential homes trying to communicate over telephone lines. (This used to be a thing.) The storm can introduce noise into the lines through electrical interference, causing messages to have to be retransmitted and, thus, delayed (Figure 2). Furthermore, a lightning bolt could strike one unlucky home, causing it to … well, stop responding to messages (Figure 3).

Figure 2: As an electrical storm, the adversary can use electrical interference to cause messages to have to be retransmitted and, thus, delayed.

Figure 3: The adversary can cause one unlucky home (process) to crash (i.e., stop responding to messages).

How does GST fit into this analogy? The arrival of GST would mean that the sky has cleared. Here, the analogy is imperfect, so we have to skew reality a bit. Normally, one can look out one’s window and see that the sky has cleared, but for our purposes, the residents in figures 2 and 3 have no windows! Recall that in the passage cited above, the protocol designer supplies the protocol before the adversary chooses GST. So, effectively, the adversary knows GST, but the processes that carry out the protocol do not.

Our next question is, how might we level the playing field? What limitations could we impose upon the adversary to give the protocol a chance at winning?

The adversary is already limited in some ways. For example, while it can delay infinitely many messages, it can delay any message for only a finite amount of time. What if this finiteness restriction were extended to the set of messages that the adversary can affect? That is, what if the adversary were limited to delaying only a finite number of messages for a finite amount of time?

This proposal is, in fact, another way of stating the GST assumption, and the DLS paper shows that an adversary that is limited in this way can be defeated. In other words, there is a protocol that is guaranteed to reach a consensus under the GST assumption. In the context of the electrical storm metaphor, if the sky eventually clears, a decision can be reached (Figure 4).

Figure 4: Intuitively, the DLS result says that if the sky eventually clears, a decision can be reached.

How does assuming GST circumvent the FLP result? Recall that the proof’s central lemma relies on a process’s inability to tell whether another process p has crashed, or whether messages to or from p are delayed. Under the GST assumption, this reasoning is valid before GST but not after. To be more precise, a process cannot distinguish the following two cases:

Process p has crashed.
Messages to or from p are delayed and GST has not yet arrived.

Recall that the processes cannot tell when GST has arrived, so the difference between having the GST assumption and not having it is quite subtle. Nonetheless, the DLS paper shows that this difference is enough to guarantee that consensus can be reached.

From this perspective, the GST assumption is quite elegant. It is a minor adjustment to a logical formula. It expands a concept that was already present within the model, namely finiteness.

Now, we’re not suggesting that one should seek mathematical elegance over truth. We’re simply noting that, when viewed in this way, the GST assumption is elegant. Moreover, taking the next section’s observations into account, this elegance is simply icing on the cake.

Recovering from delays

In thinking of a consensus protocol as playing a game against an adversary, what does it mean for a protocol to win?

Recall that the adversary can delay any message an arbitrary but finite amount of time up until GST. At GST, the adversary must walk away. Because the adversary chooses GST, it can leave the protocol in any configuration that could result from delays.

Next, recall that a correct protocol must eventually reach a decision; that is, some process must write a value to its output register. So if a protocol is proved correct under the GST assumption, it must reach a decision regardless of its configuration at GST. As explained above, the protocol’s configuration at GST could be any configuration that could result from delays.

So if a protocol can defeat the adversary, it can reach a decision from any configuration that could result from delays.

This point is worth emphasizing, as the existence of such protocols is rather remarkable. Imagine all of the ways in which messages could be delayed. Note that the timeliness of a message’s delivery could affect what messages are subsequently produced. In other words, a process might decide which message to send based on whether a timeout occurs before some other message is received. So, very likely, the number of configurations that could result from delays is enormous. Pick any one such configuration, put the protocol into that configuration, and let it run. If the protocol is proved correct under the GST assumption, it will eventually reach a decision.

Keep in mind that any configuration that could result from delays does not mean any configuration. For example, if p could never send m, the adversary cannot put the protocol into a configuration in which p has sent m. Still, the number of possible configurations could be huge, and being able to reach a decision from any of them is quite amazing.

Conclusion

While the GST assumption may seem contrived, it is simply an elegant adjustment to a logical formula. Furthermore, proving a protocol correct under the GST assumption has profound implications: no matter how delays are imposed upon the protocol, so long as they subside, the protocol can recover and reach a decision.

Of course, the GST assumption may not be applicable to all protocols, such as those that are expected to run in environments in which delays are unending and too frequent for the GST assumption to be meaningful. A protocol such as this may require some other assumption to demonstrate its correctness.

Furthermore, proving a protocol correct does not imply that its implementation is correct. For example, a developer may think they are assuming GST when, in reality, they are assuming something stronger. For example, a developer might make assumptions about when GST will arrive.

At Trail of Bits, we recommend a security audit for every consensus protocol implementation. If we could be of help to you in this regard, please reach out!

As a final note, while preparing this post, we came across a post by Ittai Abraham, “Flavours of Partial Synchrony,” which looks at the GST assumption from a slightly different perspective. We encourage you to read his post as well.

Acknowledgments

This research was conducted by Trail of Bits based upon work supported by DARPA under Contract No. HR001120C0084 (Distribution Statement A, Approved for Public Release: Distribution Unlimited). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Government or DARPA.