Distributed Algorithms for Message-Passing Systems
Michel Raynal
Institut Universitaire de France
IRISA-ISTIC
Université de Rennes 1
Rennes Cedex
France
The profusion of things concealed the scarcity of ideas and the wearing-out of beliefs.
[. . . ] To retain something of the time when we will no longer be.
In Les années (2008), Annie Ernaux
Midway upon the journey of our life
I found myself within a dark forest,
For the straight way had been lost.
In La divina commedia (1307–1321), Dante Alighieri (1265–1321)
We must not want to be something, but to become everything.
Johann Wolfgang von Goethe (1749–1832)
Each generation doubtless feels called upon to reform the world.
Mine knows that it will not reform it, but its task is perhaps even greater.
It consists in preventing the world from destroying itself.
Speech at the Nobel Banquet, Stockholm, December 10, 1957, Albert Camus (1913–1960)
Nothing is as precarious as living
Nothing is as fleeting as being
It is a little like melting for the frost
Or being light for the wind
I arrive where I am a stranger.
In Le voyage de Hollande (1965), Louis Aragon (1897–1982)
Preface
A distributed system is characterized by many sources of uncertainty: asynchrony, the
absence of shared memory and global time, failure, dynamicity, mobility, etc. Mastering
one form or another of uncertainty is pervasive in all distributed computing
problems. A main difficulty in designing distributed algorithms comes from the fact
that each entity cooperating in the achievement of a common goal cannot have in-
stantaneous knowledge of the current state of the other entities; it can only know
their past local states.
Although distributed algorithms are often made up of a few lines, their behavior
can be difficult to understand and their properties hard to state and prove. Hence,
distributed computing is not only a fundamental topic but also a challenging topic
where simplicity, elegance, and beauty are first-class citizens.
Why This Book? While there are a lot of books on sequential computing (on both
basic data structures and algorithms), this is not the case for distributed computing.
Most books on distributed computing consider advanced topics where the uncer-
tainty inherent to distributed computing is created by the net effect of asynchrony
and failures. It follows that these books are more appropriate for graduate students
than for undergraduate students.
The aim of this book is to present in a comprehensive way basic notions, concepts
and algorithms of distributed computing when the distributed entities cooperate by
sending and receiving messages on top of an underlying network. In this case, the
main difficulty comes from the physical distribution of the entities and the asyn-
chrony of the environment in which they evolve.
Audience This book has been written primarily for people who are not familiar
with the topic and the concepts that are presented. These include mainly:
• Senior-level undergraduate students and graduate students in computer science
or computer engineering, who are interested in the principles and foundations of
distributed computing.
• Practitioners and engineers who want to be aware of the state-of-the-art concepts,
basic principles, mechanisms, and techniques encountered in distributed comput-
ing.
Prerequisites for this book include undergraduate courses on algorithms, and ba-
sic knowledge on operating systems. Selections of chapters for undergraduate and
graduate courses are suggested in the section titled “How to Use This Book” in the
Afterword.
Content As already indicated, this book covers algorithms, basic principles, and
foundations of message-passing programming, i.e., programs where the entities
communicate by sending and receiving messages through a network. The world is
distributed, and the algorithmic thinking suited to distributed applications and sys-
tems is not reducible to sequential computing. Knowledge of the bases of distributed
computing is becoming more important than ever as more and more computer ap-
plications are now distributed. The book is composed of six parts.
• The aim of the first part, which is made up of five chapters, is to give a feel for the
nature of distributed algorithms, i.e., what makes them different from sequential
or parallel algorithms. To that end, it mainly considers distributed graph algo-
rithms. In this context, each node of the graph is a process, which has to compute
a result whose meaning depends on the whole graph.
Basic distributed algorithms such as network traversals, shortest-path algo-
rithms, vertex coloring, knot detection, etc., are first presented. Then, a general
framework for distributed graph algorithms is introduced. A chapter is devoted to
leader election algorithms on a ring network, and another chapter focuses on the
navigation of a network by mobile objects.
• The second part is on the nature of distributed executions. It is made up of four
chapters. In some sense, this part is the core of the book. It explains what a dis-
tributed execution is, the fundamental notion of a consistent global state, and the
impossibility—without freezing the computation—of knowing whether a com-
puted consistent global state has been passed through by the execution or not.
Then, this part of the book addresses an important issue of distributed compu-
tations, namely the notion of logical time: scalar (linear) time, vector time, and
matrix time. Each type of time is analyzed and examples of their uses are given.
A chapter, which extends the notion of a global state, is then devoted to asyn-
chronous distributed checkpointing. Finally, the last chapter of this part shows
how to simulate a synchronous system on top of an asynchronous system (such
simulators are called synchronizers).
• The third part of the book is made up of two chapters devoted to distributed
mutual exclusion and distributed resource allocation. Different families of
permission-based mutual exclusion algorithms are presented. The notion of an
adaptive algorithm is also introduced. The notions of a critical section with multiple
entries, and of resources with a single instance or with several instances, are also
presented. Associated deadlock prevention techniques are introduced.
• The fourth part of the book is on the definition and the implementation of commu-
nication operations whose abstraction level is higher than the simple send/receive
of messages. These communication abstractions impose order constraints on mes-
sage deliveries. Causal message delivery and total order broadcast are first pre-
sented in one chapter. Then, another chapter considers synchronous communica-
tion (also called rendezvous or logically instantaneous communication).
• The fifth part of the book, which is made up of two chapters, is on the detection
of stable properties encountered in distributed computing. A stable property is a
property that, once true, remains true forever. The properties which are studied are
the detection of the termination of a distributed computation, and the detection of
distributed deadlock. This part of the book is strongly related to the second part
(which is devoted to the notion of a global state).
• The sixth and last part of the book, which is also made up of two chapters, is
devoted to the notion of a distributed shared memory. The aim here is to provide
the entities (processes) with a set of objects that allow them to cooperate at
an abstraction level more appropriate than the use of messages. Two consistency
conditions, which can be associated with these objects, are presented and inves-
tigated, namely, atomicity (also called linearizability) and sequential consistency.
Several algorithms implementing these consistency conditions are described.
To have a more complete feeling of the spirit of this book, the reader is invited
to consult the section “The Aim of This Book” in the Afterword, which describes
what it is hoped has been learned from this book. Each chapter starts with a short
presentation and a list of the main keywords, and terminates with a summary of its
content. Each of the six parts of the book is also introduced by a brief description of
its aim and its technical content.
Acknowledgments This book originates from lecture notes for undergraduate and graduate
courses on distributed computing that I give at the University of Rennes (France) and, as an
invited professor, at several universities all over the world. I would like to thank the students
for their questions that, in one way or another, have contributed to this book. I want also to
thank Ronan Nugent (Springer) for his support and his help in putting it all together.
Last but not least (and maybe most importantly), I also want to thank all the researchers
whose results are presented in this book. Without their work, this book would not exist.
Michel Raynal
Professeur des Universités
Institut Universitaire de France
IRISA-ISTIC, Université de Rennes 1
Campus de Beaulieu, 35042, Rennes, France
March–October 2012
Rennes, Saint-Grégoire, Tokyo, Fukuoka (AINA’12), Arequipa (LATIN’12),
Reykjavik (SIROCCO’12), Palermo (CISIS’12), Madeira (PODC’12), Lisbon,
Douelle, Saint-Philibert, Rhodes Island (Europar’12),
Salvador de Bahia (DISC’12), Mexico City (Turing Year at UNAM)
Notation
no-op no operation
skip empty statement
process program in action
n number of processes
e number of edges in the process graph
D diameter of the process graph
pi process whose index is i
idi identity of process pi (very often idi = i)
τ time instant (with respect to an external observer)
⟨a, b⟩ pair with two elements a and b
−→^ev causal precedence relation on events
−→^σ causal precedence relation on local states
−→^zz z-precedence relation on local checkpoints
−→^Σ precedence relation on global states
Mutex mutual exclusion
ABCD small capital letters: message type (message tag)
abcdi italics lower-case letters: local variable of process pi
m1 ; . . . ; mq sequence of messages
ai [1..s] array of size s (local to process pi )
for each i ∈ {1, . . . , m} order irrelevant
for each i from 1 to m order relevant
wait (P ) while ¬P do no-op end while
return (v) returns v and terminates the operation invocation
% blablabla % comments
; sequentiality operator between two statements
¬(a R b) relation R does not include the pair ⟨a, b⟩
List of Figures and Algorithms
Fig. 7.9 Implementation of a vector clock system (code for process pi ) . 160
Fig. 7.10 Time propagation in a vector clock system . . . . . . . . . . . . 161
Fig. 7.11 On the development of time (1) . . . . . . . . . . . . . . . . . . 164
Fig. 7.12 On the development of time (2) . . . . . . . . . . . . . . . . . . 164
Fig. 7.13 Associating vector dates with global states . . . . . . . . . . . . 165
Fig. 7.14 First global state satisfying a global predicate (1) . . . . . . . . . 167
Fig. 7.15 First global state satisfying a global predicate (2) . . . . . . . . . . 168
Fig. 7.16 Detection of the first global state satisfying ∧_i LP_i
(code for process pi ) . . . . . . . . . . . . . . . . . . . . . . . 169
Fig. 7.17 Relevant events in a distributed computation . . . . . . . . . . . 171
Fig. 7.18 Vector clock system for relevant events (code for process pi ) . . 171
Fig. 7.19 From relevant events to Hasse diagram . . . . . . . . . . . . . . 171
Fig. 7.20 Determination of the immediate predecessors
(code for process pi ) . . . . . . . . . . . . . . . . . . . . . . . 172
Fig. 7.21 Four possible cases when updating impi [k],
while vci [k] = vc[k] . . . . . . . . . . . . . . . . . . . . . . . . 173
Fig. 7.22 A specific communication pattern . . . . . . . . . . . . . . . . . 175
Fig. 7.23 Specific communication pattern with n = 3 processes . . . . . . 175
Fig. 7.24 Management of vci [1..n] and kprimei [1..n, 1..n]
(code for process pi ): Algorithm 1 . . . . . . . . . . . . . . . . 178
Fig. 7.25 Management of vci [1..n] and kprimei [1..n, 1..n]
(code for process pi ): Algorithm 2 . . . . . . . . . . . . . . . . 179
Fig. 7.26 An adaptive communication layer (code for process pi ) . . . . . 181
Fig. 7.27 Implementation of a k-restricted vector clock system
(code for process pi ) . . . . . . . . . . . . . . . . . . . . . . . 182
Fig. 7.28 Matrix time: an example . . . . . . . . . . . . . . . . . . . . . . 183
Fig. 7.29 Implementation of matrix time (code for process pi ) . . . . . . . 184
Fig. 7.30 Discarding obsolete data: structural view (at a process pi ) . . . . 185
Fig. 7.31 A buffer management algorithm (code for process pi ) . . . . . . 185
Fig. 7.32 Yet another clock system (code for process pi ) . . . . . . . . . . 188
Fig. 8.1 A checkpoint and communication pattern (with intervals) . . . . 190
Fig. 8.2 A zigzag pattern . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Fig. 8.3 Proof of Theorem 9:
a zigzag path joining two local checkpoints of LC . . . . . . . . 194
Fig. 8.4 Proof of Theorem 9:
a zigzag path joining two local checkpoints . . . . . . . . . . . . 195
Fig. 8.5 Domino effect (in a system of two processes) . . . . . . . . . . . 196
Fig. 8.6 Proof by contradiction of Theorem 11 . . . . . . . . . . . . . . 200
Fig. 8.7 A very simple z-cycle-free checkpointing algorithm
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Fig. 8.8 To take or not to take a forced local checkpoint . . . . . . . . . . 202
Fig. 8.9 An example of z-cycle prevention . . . . . . . . . . . . . . . . . 202
Fig. 8.10 A vector clock system for rollback-dependency trackability
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Fig. 8.11 Intervals and vector clocks for rollback-dependency trackability . 204
Fig. 8.12 Russell’s pattern for ensuring the RDT consistency condition . . 205
Fig. 8.13 Russell’s checkpointing algorithm (code for pi ) . . . . . . . . . 205
Fig. 8.14 FDAS checkpointing algorithm (code for pi ) . . . . . . . . . . . 207
Fig. 8.15 Matrix causali [1..n, 1..n] . . . . . . . . . . . . . . . . . . . . . 208
Fig. 8.16 Pure (left) vs. impure (right) causal paths from pj to pi . . . . . 208
Fig. 8.17 An impure causal path from pi to itself . . . . . . . . . . . . . . 209
Fig. 8.18 An efficient checkpointing algorithm for RDT (code for pi ) . . . 210
Fig. 8.19 Sender-based optimistic message logging . . . . . . . . . . . . . 212
Fig. 8.20 To log or not to log a message? . . . . . . . . . . . . . . . . . . 212
Fig. 8.21 An uncoordinated checkpointing algorithm (code for pi ) . . . . 214
Fig. 8.22 Retrieving the messages which are in transit
with respect to the pair (ci , cj ) . . . . . . . . . . . . . . . . . . 215
Fig. 9.1 A space-time diagram of a synchronous execution . . . . . . . . 220
Fig. 9.2 Synchronous breadth-first traversal algorithm (code for pi ) . . . 221
Fig. 9.3 Synchronizer: from asynchrony to logical synchrony . . . . . . . 222
Fig. 9.4 Synchronizer α (code for pi ) . . . . . . . . . . . . . . . . . . . 226
Fig. 9.5 Synchronizer α: possible message arrival at process pi . . . . . . 227
Fig. 9.6 Synchronizer β (code for pi ) . . . . . . . . . . . . . . . . . . . 229
Fig. 9.7 A message pattern which can occur with synchronizer β
(but not with α): Case 1 . . . . . . . . . . . . . . . . . . . . . . 229
Fig. 9.8 A message pattern which can occur with synchronizer β
(but not with α): Case 2 . . . . . . . . . . . . . . . . . . . . . . 229
Fig. 9.9 Synchronizer γ : a communication graph . . . . . . . . . . . . . 230
Fig. 9.10 Synchronizer γ : a partitioning . . . . . . . . . . . . . . . . . . . 231
Fig. 9.11 Synchronizer γ (code for pi ) . . . . . . . . . . . . . . . . . . . 233
Fig. 9.12 Synchronizer δ (code for pi ) . . . . . . . . . . . . . . . . . . . 235
Fig. 9.13 Initialization of physical clocks (code for pi ) . . . . . . . . . . . 236
Fig. 9.14 The scenario to be prevented . . . . . . . . . . . . . . . . . . . 237
Fig. 9.15 Interval during which a process can receive pulse r messages . . 238
Fig. 9.16 Synchronizer λ (code for pi ) . . . . . . . . . . . . . . . . . . . 239
Fig. 9.17 Synchronizer μ (code for pi ) . . . . . . . . . . . . . . . . . . . 240
Fig. 9.18 Clock drift with respect to reference time . . . . . . . . . . . . . 241
Fig. 10.1 A mutex invocation pattern and the three states of a process . . . 248
Fig. 10.2 Mutex module at a process pi : structural view . . . . . . . . . . 250
Fig. 10.3 A mutex algorithm based on individual permissions
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
Fig. 10.4 Proof of the safety property of the algorithm of Fig. 10.3 . . . . . 253
Fig. 10.5 Proof of the liveness property of the algorithm of Fig. 10.3 . . . 254
Fig. 10.6 Generalized mutex based on individual permissions
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
Fig. 10.7 An adaptive mutex algorithm based on individual permissions
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Fig. 10.8 Non-FIFO channel in the algorithm of Fig. 10.7 . . . . . . . . . 259
Fig. 10.9 States of the message PERMISSION ({i, j }) . . . . . . . . . . . . 260
Fig. 14.7 The counting vector algorithm for termination detection . . . . . 374
Fig. 14.8 The counting vector algorithm at work . . . . . . . . . . . . . . 375
Fig. 14.9 Termination detection of a diffusing computation . . . . . . . . . 378
Fig. 14.10 Ring-based implementation of a wave . . . . . . . . . . . . . . 380
Fig. 14.11 Spanning tree-based implementation of a wave . . . . . . . . . . . 381
Fig. 14.12 Why (∧_{1≤i≤n} idle_i^x) ⇒ TERM(C, τ_α^x) is not true . . . . . . . 382
Fig. 14.13 A general algorithm for termination detection . . . . . . . . . . 384
Fig. 14.14 Atomicity associated with τix . . . . . . . . . . . . . . . . . . . 384
Fig. 14.15 Structure of the channels to pi . . . . . . . . . . . . . . . . . . 386
Fig. 14.16 An algorithm for static termination detection . . . . . . . . . . . 391
Fig. 14.17 Definition of time instants for the safety of static termination . . 392
Fig. 14.18 Cooperation between local observers . . . . . . . . . . . . . . . 394
Fig. 14.19 An algorithm for dynamic termination detection . . . . . . . . . 395
Fig. 14.20 Example of a monotonous distributed computation . . . . . . . . 398
Fig. 15.1 Examples of wait-for graphs . . . . . . . . . . . . . . . . . . . . 402
Fig. 15.2 An algorithm for deadlock detection
in the AND communication model . . . . . . . . . . . . . . . . 410
Fig. 15.3 Determining in-transit messages . . . . . . . . . . . . . . . . . 411
Fig. 15.4 PROBE () messages sent along a cycle
(with no application messages in transit) . . . . . . . . . . . . . 411
Fig. 15.5 Time instants in the proof of the safety property . . . . . . . . . 412
Fig. 15.6 A directed communication graph . . . . . . . . . . . . . . . . . 414
Fig. 15.7 Network traversal with feedback on a static graph . . . . . . . . 414
Fig. 15.8 Modification in a wait-for graph . . . . . . . . . . . . . . . . . . 415
Fig. 15.9 Inconsistent observation of a dynamic wait-for graph . . . . . . 416
Fig. 15.10 An algorithm for deadlock detection
in the OR communication model . . . . . . . . . . . . . . . . . 418
Fig. 15.11 Activation pattern for the safety proof . . . . . . . . . . . . . . . 420
Fig. 15.12 Another example of a wait-for graph . . . . . . . . . . . . . . . 423
Fig. 16.1 Structure of a distributed shared memory . . . . . . . . . . . . . 428
Fig. 16.2 Register: What values can be returned by read operations? . . . . 429
Fig. 16.3 The relation −→^op of the computation described in Fig. 16.2 . . . 430
Fig. 16.4 An execution of an atomic register . . . . . . . . . . . . . . . . 432
Fig. 16.5 Another execution of an atomic register . . . . . . . . . . . . . . 432
Fig. 16.6 Atomicity allows objects to compose for free . . . . . . . . . . . 435
Fig. 16.7 From total order broadcast to atomicity . . . . . . . . . . . . . . 436
Fig. 16.8 Why read operations have to be to-broadcast . . . . . . . . . . . 437
Fig. 16.9 Invalidation-based implementation of atomicity:
message flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
Fig. 16.10 Invalidation-based implementation of atomicity: algorithm . . . 440
Fig. 16.11 Invalidation and owner-based implementation of atomicity
(code of pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
Fig. 16.12 Invalidation and owner-based implementation of atomicity
(code of the manager pX ) . . . . . . . . . . . . . . . . . . . . . 442
Fig. 16.13 Update-based implementation of atomicity . . . . . . . . . . . . 443
This first part of the book is on distributed graph algorithms. These algorithms con-
sider the distributed system as a connected graph whose vertices are the processes
(nodes) and whose edges are the communication channels. It is made up of five
chapters.
This chapter first introduces basic definitions related to distributed algorithms. Then,
considering a distributed system as a graph whose vertices are the processes and
whose edges are the communication channels, it presents distributed algorithms for
graph traversals, namely, parallel traversal, breadth-first traversal, and depth-first
traversal. It also shows how spanning trees or rings can be constructed from these
distributed graph traversal algorithms. These trees and rings can, in turn, be used to
easily implement broadcast and convergecast algorithms.
As the reader will see, the distributed graph traversal techniques are different
from their sequential counterparts in their underlying principles, behaviors, and
complexities. This comes from the fact that, in a distributed context, the same type of
traversal can usually be realized in distinct ways, each with its own tradeoff between
its time complexity and message complexity.
1.1.1 Definition
Structural View It follows from the previous definitions that, from a structural
point of view, a distributed system can be represented by a connected undirected
graph G = (Π, C) (where C denotes the set of channels). Three types of graph are
of particular interest (Fig. 1.1):
• A ring is a graph in which each process has exactly two neighbors with which it
can communicate directly, a left neighbor and a right neighbor.
• A tree is a graph that has two noteworthy properties: it is acyclic and connected
(which means that adding a new channel would create a cycle while suppressing
a channel would disconnect it).
• A fully connected graph is a graph in which each process is directly connected to
every other process. (In graph terminology, such a graph is called a clique.)
During a round, a process sends at most one message to each of its neighbors. The
fundamental property of a synchronous system is that a message sent by a process
during a round r is received by its destination process during the very same round r.
Hence, when a process proceeds to the round r + 1, it has received (and processed)
all the messages which have been sent to it during round r, and it knows that the
same is true for any process.
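This round structure can be pictured with a small simulation in which every message sent during round r is delivered, and processed, before round r + 1 starts. The Python sketch below is only an illustration of this property; the names run_synchronous, send_fn, and deliver_fn are ours, not the book's.

def run_synchronous(procs, send_fn, deliver_fn, rounds, states):
    # procs: iterable of process identities.
    # send_fn(i, state) -> {neighbor: message} sent by p_i at the beginning of a round.
    # deliver_fn(i, state, received) -> new state of p_i, computed once all the
    # messages sent to p_i during the current round have been received.
    for r in range(1, rounds + 1):
        sent = {i: send_fn(i, states[i]) for i in procs}           # send phase of round r
        received = {i: [] for i in procs}
        for i, out in sent.items():
            for j, msg in out.items():
                received[j].append((i, msg))                       # delivered in the same round r
        states = {i: deliver_fn(i, states[i], received[i]) for i in procs}
    return states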
This knowledge concerns its identity, the total number n of processes, the identity
of its neighbors, the structure of the communication graph, etc. As an example, a
process pi may only know that
• it is on a unidirectional ring,
• it has a left neighbor from which it can receive messages,
• it has a right neighbor to which it can send messages,
• its identity is idi ,
• the fact that no two processes have the same identity, and
• the fact that the set of identities is totally ordered.
As we can see, with such an initial knowledge, no process initially knows the total
number of processes n. Learning this number requires the processes to exchange
information.
Initial Knowledge Each process pi has identity idi , and no process knows n (the
total number of processes). Initially, a process pi knows its identity and the iden-
tity idj of each of its neighbors. Hence, each process pi is initially provided with
a set neighborsi and, for each idj ∈ neighborsi , the pair idi , idj denotes locally
the channel connecting pi to pj . Let us observe that, as the channels are bidirec-
tional, both idi , idj and idj , idi denote the same channel and are consequently
considered as synonyms.
operation start() is
(1) for each idj ∈ neighborsi
(2) do send POSITION (idi , neighborsi ) to the neighbor identified idj
(3) end for;
(4) parti ← true
end operation.
the message POSITION (idi , neighborsi ) to each of its neighbors (line 2) and sets
parti to true (line 4).
When pi receives a message POSITION (id, neighbors) from one of its neighbors
px for the first time (line 7), it includes the position of the corresponding process pj
(the process whose identity is id = idj ) in the local data structures proc_knowni and
channels_knowni (lines 8–9) and, as it
has learned something new, it forwards this message POSITION () to all its neighbors,
but the one that sent it this message (line 10). If it has already received the message
POSITION (id, neighbors) (we have then j ∈ proc_knowni ), pi discards the message.
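To see this forward/discard strategy at work, here is a small Python simulation of the algorithm in which in-flight messages are delivered in FIFO order (any other order would do, mimicking asynchrony). The names learn_graph, proc_known, and channels_known follow the text, but the simulation itself is only a sketch, not the code of the corresponding figure.

from collections import deque

def learn_graph(neighbors):
    # neighbors[i]: set (or list) of the neighbors of process p_i.
    proc_known = {i: {i} for i in neighbors}
    channels_known = {i: {(i, j) for j in neighbors[i]} for i in neighbors}
    # every process starts and sends POSITION(id_i, neighbors_i) to its neighbors
    in_flight = deque((i, j, i, frozenset(neighbors[i]))
                      for i in neighbors for j in neighbors[i])
    while in_flight:
        sender, i, pid, nbrs = in_flight.popleft()      # POSITION(pid, nbrs) received by p_i
        if pid in proc_known[i]:
            continue                                    # position already known: discard
        proc_known[i].add(pid)                          # p_i learns the position of p_pid
        channels_known[i] |= {(pid, k) for k in nbrs}
        for j in neighbors[i]:
            if j != sender:                             # forward to all neighbors but the sender
                in_flight.append((i, j, pid, nbrs))
    return proc_known, channels_known

For instance, learn_graph({1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}) lets every process end up knowing the four processes and all the channels of this small graph.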
Cost Let e be the total number of channels and D be the diameter of the com-
munication graph. The diameter of a graph is the longest among all the shortest
distances connecting any pair of processes, where the shortest distance between pi
and pj is the smallest number of channels to go from pi to pj . The diameter notion
is a global notion that measures the “breadth” of the communication graph.
For any i and any channel, a message POSITION (idi , −) is sent at least once and
at most twice (once in each direction) on that channel. It follows that the message
complexity is upper bounded by 2ne.
As far as the time complexity is concerned, let us consider that each message
takes one time unit and local processing has zero duration. In the worst case, a
process learns the position of a process at distance D from it only after D time units;
hence, the time complexity is O(D).
When Initially the Channels Have Only Local Names Let us consider a pro-
cess pi that has ci neighbors to which it is point-to-point connected by ci chan-
nels locally denoted channeli [1..ci ]. When each process pi is initially given only
channeli [1..ci ], the processes can easily compute their sets neighborsi . To that end,
each process executes a preliminary communication phase during which it first
sends a message ID (i) on each channeli [x], 1 ≤ x ≤ ci , and then waits until it has
received the identities of the processes at the other end of its ci channels. When
pi has received ID (idk ) on channel channeli [x], it can associate its local address
channeli [x] with the identity idk whose scope is the whole system.
Port Name When each channel channeli [x] is defined by a local name, the index
x is sometimes called a port. Hence, a process pi has ci communication ports.
It is assumed that, while the identity of a process pi is its index i, no process knows
explicitly the value of n (i.e., pn knows that its identity is n, but does not know that
its identity is also the number of processes).
Fig. 1.5 A rooted spanning tree
single pair carrying the value vi sent by pi to the root. A non-leaf process pi waits
for the pairs (k, vk ) from all its children, adds its own pair (i, vi ), and finally sends
the resulting set val_seti to its parent (line 4). When the root has received a set of
pairs from each of its children, it has a pair from each process and can compute the
function f () (line 5).
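A sequential stand-in for this convergecast is easy to write once the rooted tree is known. The sketch below (the names convergecast, children, values, and f are ours) simply mirrors the rule just described: a leaf contributes its own pair, a non-leaf merges its children's sets with its own pair, and the root applies f().

def convergecast(children, values, root, f):
    # children[i]: children of p_i in the rooted spanning tree (absent or empty for a leaf)
    def collect(i):
        val_set = {(i, values[i])}                      # the pair (i, v_i) contributed by p_i
        for c in children.get(i, ()):
            val_set |= collect(c)                       # pairs received from the child p_c
        return val_set
    return f(collect(root))

# e.g., convergecast({1: [2, 3], 3: [4]}, {1: 5, 2: 7, 3: 1, 4: 9}, 1,
#                    lambda s: max(v for (_, v) in s))  # -> 9, the largest input value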
This section presents a simple algorithm that (a) implements broadcast and con-
vergecast, and (b) builds a spanning tree. This algorithm is sometimes called prop-
agation of information with feedback. Once a spanning tree has been constructed,
it can be used for future broadcasts and convergecasts involving the same distin-
guished process pa .
(i, vi ) plus all the pairs (k, vk ) it has received from its children (line 14). Then, pi
has terminated its participation in the algorithm (its local variable expected_msgi
then becomes useless). If pi is the distinguished process pa , the set val_set contains
a pair (x, vx ) per process px , and pa can accordingly compute f (val_set) (where
f () is the function whose result is the output of the computation).
Let us notice that, when the distinguished process pa discovers that the algo-
rithm has terminated, all the messages sent by the algorithm have been received and
processed.
Cost Let us observe that a message BACK() is eventually sent as a response to each
message GO(). Moreover, except on the channels of the spanning tree that is built,
two messages GO() can be sent (one in each direction).
Let e be the number of channels of the underlying communication graph. It fol-
lows that the algorithm gives rise to 2(n − 1) messages which travel on the chan-
nels of the tree and 4(e − (n − 1)) messages which travel on the other channels, i.e.,
2(2e − n + 1) messages. Then, once the tree is built, a broadcast/convergecast costs
only 2(n − 1) messages.
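To make the behavior of this propagation-of-information-with-feedback scheme concrete, here is a small asynchronous simulation in Python, in which the next message to deliver is chosen at random. It is a sketch under our own naming conventions (pif, val_set, expected), not a transcription of Fig. 1.7; in particular, termination at the root is represented by the root obtaining its final set of pairs.

import random
from collections import defaultdict

def pif(neighbors, values, root):
    parent, children = {root: root}, defaultdict(set)
    expected = {root: len(neighbors[root])}             # BACK() messages still awaited
    val_set = defaultdict(set)                          # pairs accumulated before answering
    result = {(root, values[root])} if not neighbors[root] else None
    msgs = [('GO', root, j, None) for j in neighbors[root]]      # triggered by START()
    while msgs:
        kind, sender, i, payload = msgs.pop(random.randrange(len(msgs)))
        if kind == 'GO':
            if i not in parent:                         # first GO(): its sender becomes the parent
                parent[i] = sender
                others = [j for j in neighbors[i] if j != sender]
                expected[i] = len(others)
                if not others:                          # nothing to wait for: answer at once
                    msgs.append(('BACK', i, sender, {(i, values[i])}))
                else:
                    for j in others:
                        msgs.append(('GO', i, j, None))
            else:                                       # already in the tree: empty answer
                msgs.append(('BACK', i, sender, set()))
        else:                                           # BACK(pairs) received from 'sender'
            if payload:
                children[i].add(sender)                 # a non-empty answer comes from a child
            val_set[i] |= payload
            expected[i] -= 1
            if expected[i] == 0:                        # all awaited BACK() messages received
                val_set[i].add((i, values[i]))
                if i == root:
                    result = val_set[i]                 # the root can now compute f(val_set)
                else:
                    msgs.append(('BACK', i, parent[i], val_set[i]))
    return parent, dict(children), result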
Assuming all messages take one time unit and local computations have zero du-
ration, it is easy to see that the time complexity is 2D where D is the diameter of
the communication graph. Once the tree is built, the time complexity of a broad-
cast/convergecast is twice the depth of the tree that has been built.
An Example An execution of the algorithm described in Fig. 1.7 for the commu-
nication graph depicted in the left part of Fig. 1.8 is described in Fig. 1.9.
Figure 1.9 is a space-time diagram. The execution of a process pi , 1 ≤ i ≤ 4, is
represented by an axis oriented from left to right. An arrow from one axis to another
represents a message transfer. In this picture, an arrow labeled GO x,y () represents a
message GO () sent by px to py . Similarly, an arrow labeled BACK x,y () represents a
message BACK () sent by px to py .
The process p1 is the distinguished process that receives the external message
START () and consequently will be the root of the tree. It sends a message GO () to its
neighbors p2 and p3 . When p3 receives this message, it defines its parent as being
p1 and forwards message GO() to its two other neighbors p2 and p4 .
Since the first message GO() received by p2 is the one sent by p3 , p2 defines its
parent as being p3 and forwards the message GO() to its other neighbor, namely p1 .
When p1 receives a message GO() from p2 , it sends back a message BACK (∅) to
p2 . In contrast, when p4 receives the message GO() from p3 , it sends by return to
p3 a message BACK () carrying the pair (4, v4 ). Moreover, when p2 has received a
message BACK () from p1 , it sends to its parent p3 a message BACK () carrying the
pair (2, v2 ).
Finally, when p3 receives the messages BACK () from p2 and p4 , it discovers
that these processes are its children and sends a message BACK () carrying the set
{(2, v2 ), (3, v3 ), (4, v4 )} to its parent p1 . When p1 receives this message, it discov-
ers that p2 is its only child. It can then compute f () on the vector [v1 , v2 , v3 , v4 ].
The tree that has been built is represented at the right of Fig. 1.8.
The Case of Non-FIFO Channels Assuming non-FIFO channels and taking into
account Fig. 1.9, let us consider that the message GO 1,2 () arrives at p2 after the mes-
sage BACK 1,2 (). It is easy to see that the algorithm remains correct (i.e., a spanning
tree is built).
The only thing that changes is the meaning associated with line 16. When a
process sends a message BACK() to its parent, it can no longer claim that its local
computation is terminated. A process needs now to have received a message on each
of its incident channels before claiming local termination.
A Spanning Tree per Process The algorithm of Fig. 1.7 can be easily general-
ized to build n trees, each one associated with a distinct process which is its dis-
tinguished process. Then, when any process pi wants to execute an efficient broad-
cast/convergecast, it has to use its associated spanning tree.
To build a spanning tree per process, the local variables parenti , childreni , and
expected_msgi of each process pi have to be replaced by the arrays parenti [1..n],
childreni [1..n] and expected_msgi [1..n] and all messages have to carry the identity
of the corresponding distinguished process. More precisely, when a process pk re-
ceives a message START (), it uses its local variables parentk [k], childrenk [k], and
expected_msgk [k]. The corresponding messages will carry the identity k, GO (k, −)
and BACK (k, −), and, when a process pi receives such messages, it will use its
local variables parenti [k], childreni [k], and expected_msgi [k].
Concurrent Initiators for a Single Spanning Tree The algorithm of Fig. 1.7 can
be easily modified to build a single spanning tree while allowing several processes to
independently start the execution of the algorithm, each receiving initially a message
START (). To that end, each process manages an additional local variable max_id i
initialized to 0, which contains the highest identity of a process competing to be the
root of the spanning tree.
• If a process pi receives a message START () while max_idi ≠ 0, pi discards this
message (in that case, it already participates in the algorithm but does not com-
pete to be the root). Otherwise, pi starts executing the algorithm and all the cor-
responding messages GO () or BACK () carry its identity.
Fig. 1.10 Two different spanning trees built from the same communication graph
• Then, when a process pi receives a message GO (j, −), pi discards the message if
j < max_idi . Otherwise, pi considers pj as the process with the highest identity
which is competing to be the root. It sets consequently max_idi to j and con-
tinues executing the algorithm by using messages GO () and BACK () carrying the
identity j .
It is easy to see that this simple application of the forward/discard strategy ensures
that a single spanning tree will be constructed, namely the one rooted at pj where j
is such that, at the end of the execution, we have max_id1 = · · · = max_idn = j .
Principle of the Algorithm This algorithm, which is due to T.-Y. Cheung (1983),
is based on parallel traversals of the communication graph. These traversals are con-
current and some of them can stop others. In addition to the local variables parenti ,
childreni , and expected_msgi , each process pi manages a local variable, denoted
leveli , which represents its current approximation of its distance to the root. More-
over, each message GO () carries now the current level of the sending process.
Then, when a process pi receives a message GO (d), there are two cases according
to the current state of pi and the value of d.
• The message GO (d) is the first message GO () received by pi . In that case, pi
initializes leveli to d + 1 and forwards the message GO (d + 1) to its neighbors
(except the sender of the message GO (d)).
• The message GO (d) is not the first message GO () received by pi and leveli >
d + 1. In that case, pi (a) updates its variable leveli to d + 1, (b) defines the
sender of the message GO (d) just received as its new parent, and (c) forwards a
message GO (d + 1) to each of its other neighbors pk in order that they recompute
their distances to the root.
As we can see, these simple principles consist of a chaotic distributed iterative
computation. They are used to extend the basic parallel network traversal algorithm
of Fig. 1.7 with a forward/discard strategy that allows processes to converge to their
final position in the breadth-first spanning tree.
when START () is received do % only the distinguished process receives this message %
(1) send GO (−1) to itself.
Fig. 1.11 Construction of a breadth-first spanning tree without centralized control (code for pi )
If the message GO (d) is not the first message GO () received by pi , there are two
cases. Let pj be the sender of the message GO (d).
• If leveli ≤ d + 1, pi cannot improve its position in the tree. It then sends by return
the message BACK (no, d + 1) to inform the sender of the message GO () that it
cannot be its child at distance d + 1 of the tree (line 16). Hence, pi stops the
network traversal associated with the message GO (d) it has received.
• If leveli > d + 1, pi has to improve its position in the tree under construction. To
that end, it propagates the network traversal associated with the message GO (d) it
has received in order to allow its other neighbors to improve their positions in the
tree. Hence, it executes the same statements as those executed when it received
its first message GO (d) (lines 10–15 are exactly the same as lines 3–8).
When a process pi receives a message BACK (resp, d), it considers it only if
leveli = d − 1 (line 19). This is because this message is meaningful only if its sender
pj sent it when its level was levelj = d = leveli + 1. In the other cases, the message
is obsolete and pi discards it.
Termination It follows from line 23 that the root learns the termination of the
algorithm.
On the other hand, the local variable leveli of a process pi , which is not the root,
can be updated each time it receives a message GO (). Unfortunately, the number
of such messages received by pi is not limited to the number of its neighbors but
depends on (a) the number of neighbors of its neighbors, etc. (i.e., the structure of
the communication graph), and (b) the speed of messages (i.e., asynchrony). As its
knowledge of the communication graph is local (it is restricted to its neighbors), a
process cannot define a local predicate indicating that its local variables have con-
verged to their final values. But, as the root can discover when the construction of
the tree has terminated, it can send a message (which will be propagated along the
tree) to inform the other processes that their local variables parenti , childreni , and
leveli have their final values.
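The forward/discard rule just described can be simulated directly. The Python sketch below uses our own names; the BACK() messages used by the root for termination detection are omitted, the simulation simply runs until no message is in flight. It shows how the levels converge to breadth-first distances whatever the order in which the GO() messages are delivered.

import random
from collections import defaultdict

def bfs_tree(neighbors, root):
    level = {i: float('inf') for i in neighbors}
    parent = {}
    msgs = [(root, root, -1)]                           # START(): the root sends GO(-1) to itself
    while msgs:
        sender, i, d = msgs.pop(random.randrange(len(msgs)))     # GO(d) from 'sender' to p_i
        if d + 1 < level[i]:                            # first GO(), or a strictly better position
            level[i] = d + 1
            parent[i] = sender
            for j in neighbors[i]:
                if j != sender:
                    msgs.append((i, j, level[i]))       # forward GO(level_i) to the other neighbors
        # otherwise the traversal carried by this GO() stops at p_i (BACK(no, d+1) in the text)
    children = defaultdict(set)
    for i, p in parent.items():
        if i != p:
            children[p].add(i)
    return parent, level, dict(children)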
Cost There are two types of messages, and each message carries an integer whose
value is bounded by the diameter D of the communication graph. Moreover, a mes-
sage BACK() carries an additional Boolean value. It follows that the size of each
message is upper bounded by 2 + log2 D bits. It is easy to see that the time com-
plexity is O(D), i.e., O(n) in the worst case.
As far as the message complexity is concerned, the worst case is a fully connected
communication graph (i.e., any pair of processes is connected by a channel) and a
process at distance d of the root updates leveli d times (as in Fig. 1.12). This means
that among the (n − 1) processes which are not the root, one updates its level once,
another one updates it twice, etc., and one updates it (n − 1) times. Moreover, each
time a process updates its level, a process forwards the messages GO () to (n − 2)
of its neighbors (all processes but itself and the sender of the GO () that entailed the
update of its own level). The root sends (n − 1) messages GO (). Hence the total
number of messages GO () is
(n − 1) + (n − 2) · Σ_{i=1}^{n−1} i = (n − 1)(n² − 2n + 2)/2.
As at most one message BACK () is associated with each message GO (), it follows
that, in a fully connected network, the message complexity is upper bounded by
O(n3 ).
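As a quick sanity check of the closed form above, the following few lines of Python compare it with the direct sum for small values of n (purely illustrative).

for n in range(2, 9):
    direct = (n - 1) + (n - 2) * sum(range(1, n))       # (n-1) + (n-2) * sum_{i=1}^{n-1} i
    closed = (n - 1) * (n * n - 2 * n + 2) // 2
    assert direct == closed, (n, direct, closed)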
This section presents a second distributed algorithm that builds a breadth-first span-
ning tree. Differently from the previous one, this algorithm—which is due to Y. Zhu
and T.-Y. Cheung (1987)—is based on a centralized control that allows each process
to locally learn when its participation in the algorithm has terminated. Moreover,
both its time and message complexities are O(n2 ). This shows an interesting trade-
off with the previous algorithm whose time complexity is O(n) while its message
complexity is O(n3 ).
at distance d. Hence, the messages implementing the wave number d + 1 can use
only the channels of the breadth-first spanning tree of depth d that has been built by
the previous waves. This reduces consequently the number of messages needed to
implement a wave.
From an implementation point of view, for any d, the wave number d going from
the root up to processes at distance d is implemented with messages GO (), while its
return to the root is implemented with messages BACK (), as depicted in Fig. 1.13.
Algorithm: Local Variables In addition to the constant set neighborsi and its
local variables parenti (initialized to ⊥) and childreni , each process pi manages the
following local variables in order to implement the underlying principle previously
described.
• distancei is a write-once local variable that keeps the distance of pi to the root.
• to_sendi is a set that, once pi has been inserted into the spanning tree, contains
its neighbors to which it has to propagate the waves it receives from the root. If
pi is at distance d, these wave propagations will concern waves whose number is
greater than d.
• waiting_fromi is a set used by pi to manage the return of the current wave to its
parent in the tree. (Its role is similar to that of the local variable expected_msgi
used in the previous algorithms.)
when START () is received do % only the distinguished process receives this message %
parenti ← i; childreni ← ∅; distancei ← 0; to_sendi ← neighborsi ;
for each k ∈ to_sendi do send GO (0) to pk end for;
waiting_fromi ← neighborsi .
Fig. 1.14 Construction of a breadth-first spanning tree with centralized control (starting code)
Fig. 1.15 Construction of a breadth-first spanning tree with centralized control (code for a pro-
cess pi )
in order to expand the tree. If to_sendi is not empty, pi is one of the children of pj
and the tree can possibly be expanded from pi . In this case, pi sends the message
BACK (continue) to its parent pj to inform it that (a) it is one of its children and (b)
possibly, during the next wave, new processes can be added to the tree with pi as
parent. These two cases are expressed at line 3.
If pi already has a parent (parenti ≠ ⊥), i.e., it is already in the tree when it
receives a message GO (d), its behavior then depends on the sender pj of the mes-
sage GO (d) (line 5). If pj is its parent, pi forwards the wave by sending the message
GO (d + 1) to its other neighbors in the set to_send i (line 6) and resets accordingly
waiting_fromi to to_sendi (line 7). If pj is not its parent (line 8), pi sends back the
message BACK (no) to the process pj to inform it that (a) it is not one of its children
and consequently (b) pj no longer has to forward waves to it.
Reception of a Message BACK () When a process pi receives a message BACK () it
has already determined its position in the breadth-first spanning tree. This message
BACK () sent by a neighbor pj is associated with a message GO () sent by pi to pj .
It carries a value resp ∈ {stop, continue, no}.
Hence, when pi receives BACK (resp) from pj , it first removes pj from the set of pro-
cesses from which it is waiting for messages (line 11). Then, if resp ∈ {stop, continue},
pj is one of its children (line 12), and if resp ∈ {stop, no}, pi discovers that it no
longer has to send messages to pj (line 13). The behavior of pi then depends on the
set to_sendi .
• If to_sendi is empty (line 14), pi knows that its participation in the algorithm
is terminated. (Let us notice that, due to lines 7, 11 and 13, we have then
waiting_fromi = ∅.) If pi is the root, it also knows that the algorithm has ter-
minated (line 15). If it is not the root, it sends the message BACK (stop) to its
parent to inform it that (a) it has locally terminated, and (b) the tree can no longer
be extended from it (line 16).
• If to_sendi is not empty, it is possible that the tree can be expanded from pi . In
this case, when waiting_fromi = ∅, pi returns the wave: if it is not the root, it does so
by sending the message BACK (continue) to its parent (line 22); if it is the root, pi
starts a new wave by sending the message GO (0) to its neighbors from which the
tree can possibly be expanded (line 20) and resets its local set to_sendi accordingly (line 21).
at line 16 (if pi has several neighbors). Of course, the fact that a process has lo-
cally terminated does not mean that the algorithm has terminated. Only the root can
learn it. This occurs at line 15 when the root pi receives a message BACK (stop)
entailing the last update of to_sendi which becomes empty.
external message START (). The distinguished process sends first a message GO () to
one of its neighbors (line 2).
Then, when a process pi receives a message GO (), it defines the message sender
as its parent in the depth-first tree (lines 3–4). The local variable visitedi is a set
containing the identities of its neighbors which have been already visited by the
depth-first traversal (implemented by the progress of the message GO ()). If pj is its
only neighbor, pi sends back to pj the message BACK (yes) to inform it that (a) it
is one of its children and (b) it has to continue the depth-first traversal (lines 5–6).
Otherwise (neighborsi ≠ visitedi ), pi propagates the depth-first traversal to one of
its neighbors that, from its point of view, has not yet been visited (line 7). Finally, if
pi has already been visited by the depth-first traversal (parenti ≠ ⊥), it sends back
to pj a message BACK (no) to inform it that it is not one of its children (line 9).
When a process pi receives a message BACK (resp), it first adds its sender pj to
its set of children if resp = yes (line 11). Moreover, it also adds pj to the set of
its neighbors which have been visited by the depth-first traversal (line 12). Then,
its behavior is similar to that of lines 5–8. If, from its point of view, not all of its
neighbors have been visited, it sends a message GO () to one of them (line 18). If
all of its neighbors have been visited (line 13), it claims termination if it is the
root (line 14). If it is not the root, it sends to its parent the message BACK (yes) to
inform the parent that it is one of its children and it has to forward the depth-first
traversal.
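Because a single GO() or BACK() message is in transit at any time, this traversal can be simulated with a plain loop over the current message. The Python sketch below follows the rules just described; the names dfs_tree, BACK_YES, and BACK_NO are ours, neighbors maps each process to the list of its neighbors, and the line numbers of Fig. 1.16 are not reproduced.

def dfs_tree(neighbors, root):
    parent = {root: root}
    children = {i: set() for i in neighbors}
    visited = {i: set() for i in neighbors}             # neighbors known (locally) to be visited
    if not neighbors[root]:
        return parent, children
    msg = ('GO', root, neighbors[root][0])              # START(): the root probes one neighbor
    while True:
        kind, frm, i = msg
        if kind == 'GO':
            if i not in parent:                         # first visit: frm becomes the parent
                parent[i] = frm
                visited[i].add(frm)
                not_yet = [j for j in neighbors[i] if j not in visited[i]]
                if not_yet:
                    msg = ('GO', i, not_yet[0])         # propagate the traversal
                else:
                    msg = ('BACK_YES', i, frm)          # only neighbor already visited: answer yes
            else:
                msg = ('BACK_NO', i, frm)               # already visited: not one of your children
        else:
            if kind == 'BACK_YES':
                children[i].add(frm)                    # the sender is one of p_i's children
            visited[i].add(frm)
            not_yet = [j for j in neighbors[i] if j not in visited[i]]
            if not_yet:
                msg = ('GO', i, not_yet[0])
            elif i == root:
                return parent, children                 # the root claims termination
            else:
                msg = ('BACK_YES', i, parent[i])        # p_i is a child of its parent: continue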
On the Tree That Is Built It is easy to see that, given a predefined root process,
the depth-first spanning tree that is built does not depend on the speed of messages
but depends on the way each process pi selects its neighbor pk to which it propa-
gates the depth-first traversal (line 7 and line 18).
Fig. 1.17 Time and message optimal depth-first traversal (code for pi )
The Notion of a Local Algorithm Both the previous algorithm (Fig. 1.16) and
its improvement are local in the sense that (a) each process has initially to know
only its identity, that of its neighbors, and the fact that no two processes have the same
identity, and (b) the size of the information exchanged between any two neighbors
is bounded.
propagates the depth-first traversal to one of its neighbors pk that has not yet been
visited and initializes childreni to {k} (lines 7–8).
Finally, when a process pi receives a message BACK (visited) such that all its
neighbors have been visited (line 10), it claims termination if it is the root (line 12).
If it is not the root, it forwards the message BACK (visited) to its parent (line 13). If
some of its neighbors have not yet been visited, pi selects one of them, propagates
the network traversal by sending to it the message GO (visited) and adds it to its set
of children (lines 15–16).
It is easy to see that this algorithm builds a depth-first spanning tree, and requires
(n − 1) messages GO () and (n − 1) messages BACK (). As no two messages are
concurrent, the time complexity is 2(n − 1). As already indicated, this algorithm
is not local: the set visited carried by each message grows until it contains all the
process identities. Hence, the size of a message includes one bit for the message
type and up to n log2 n bits for its content.
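Since each GO() and BACK() now carries the set of already visited processes, every process is visited exactly once, and the traversal can be simulated as below. The names are ours, and the single set visited plays the role of the set carried by the messages; each iteration of the loop corresponds to one GO() or BACK() message, mirroring the 2(n − 1) messages of the distributed algorithm.

def dfs_with_visited(neighbors, root):
    parent, children = {root: root}, {i: set() for i in neighbors}
    visited = {root}                                    # the set carried by GO() and BACK()
    i = root
    while True:
        not_yet = [j for j in neighbors[i] if j not in visited]
        if not_yet:                                     # GO(visited) sent to a non-visited neighbor
            k = not_yet[0]
            parent[k] = i
            children[i].add(k)                          # p_k becomes a child of p_i
            visited.add(k)
            i = k
        elif i == root:                                 # BACK(visited) has come back to the root
            return parent, children
        else:
            i = parent[i]                               # BACK(visited) forwarded to the parent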
From an operational point of view, we have the following. When the distin-
guished process receives the message START () it defines itself as the starting pro-
cess (parenti = i), selects one of its neighbors pk , and sends to pk the message
GO (visited, i) where visited = {i} (lines 1–3). Moreover, the starting process records
the identity k in order to be able to close the ring when it discovers that the depth-
first traversal has terminated (line 12).
When a process pi receives a message GO (visited, last) (let us remember that
it receives exactly one message GO ()), it defines (a) its parent with respect to the
depth-first traversal as the sender pj of the message GO (), and (b) its successor
on the ring as the last process (before it) that received a message GO (), i.e., the
process plast (line 4). Then, if all its neighbors have been visited by the depth-first
traversal, it sends back to its parent pj the message BACK (visited ∪ {i}, i) and defines
the appropriate routing for the token, namely it sets routingi [j ] = j (lines 5–6). If
there are neighbors of pi that have not yet been visited, pi selects one of them (say
pk ) and propagates the depth-first traversal by sending the message GO (visited ∪
{i}, i) to pk . As before, it also defines the appropriate routing for the token, namely
it sets routingi [k] = j (lines 7–8).
When a process pi receives a message BACK (visited, last), it does the same as
previously if some of its neighbors have not yet been visited (lines 15–16 are similar
to lines 7–8). If all its neighbors have been visited and pi is the starting process,
it closes the ring by assigning to routingi [firsti ] the identity j of the process that
sent the message BACK (−, −) (lines 11–12). If pi is not the starting process, it
forwards the message BACK (visited, last) to its parent which will make the depth-
first traversal progress. It also assigns the identity j to routingi [parenti ] for the token
to be correctly routed along the appropriate channel of the communication graph in
order to attain its destination process on the logical ring.
Remarks The two logical neighbors on the ring of each process pi depend on the
way a process selects a non-visited neighbor when it has to propagate the depth-first
traversal.
Allowing the messages GO () and BACK () to carry more information (on the struc-
ture of the network that has been visited—or has not been visited—by the depth-first
traversal) allows the length of the ring at the communication graph level to be re-
duced to x, where x ∈ [n . . . 2(n − 1)]. This number x depends on the structure of
the communication graph and the way neighbors are selected when a process prop-
agates the network traversal.
An Example Let us consider the communication graph depicted in Fig. 1.21 (with-
out the dotted arrows). The dotted arrows represent the logical ring constructed by
the execution described below.
In this example, when a process has to send a message GO () to one of its neigh-
bors, it selects the neighbor with the smallest identity in the set neighborsi \ visited.
1. The distinguished process sends the message GO ({1}, 1) to its neighbor p2 and
saves the identity 2 into first1 to be able to close the ring at the end of the network
traversal.
2. Then, when it receives this message, p2 defines succ2 = 1, forwards the
depth-first traversal by sending the message GO ({1, 2}, 2) to p3 and defines
routing2 [3] = 1.
3. When p3 receives this message, it defines succ3 = 2, forwards the depth-
first traversal by sending the message GO ({1, 2, 3}, 3) to p4 and defines
routing3 [4] = 2.
4. When p4 receives this message, it defines succ4 = 3, and propagates the depth-
first traversal by sending the message BACK ({1, 2, 3, 4}, 4) to its parent p3 .
Moreover, it defines routing4 [3] = 3.
5. When p3 receives this message, as neighbors3 ⊂ visited and it is not the
starting process, it forwards BACK ({1, 2, 3, 4}, 4) to its parent p2 and defines
routing3 [2] = 4.
6. When p2 receives this message, as neighbors2 is not included in visited, it selects
its not yet visited neighbor p5 and sends it the message GO ({1, 2, 3, 4}, 4). It also
defines routing2 [5] = 3.
7. When p5 receives this message, it defines p4 as its successor on the logical ring
(succ5 = 4), sends back to its parent p2 the message BACK ({1, 2, 3, 4, 5}, 5) and
defines routing5 [2] = 2.
8. When p2 receives this message, p2 forwards it to its parent p1 and defines
routing2 [1] = 5.
9. Finally, when p1 receives the message BACK ({1, 2, 3, 4, 5}, 5), all its neigh-
bors have been visited. Hence, the depth-first traversal is terminated. Conse-
quently, p1 closes the ring by assigning its value to routing1 [first1 ], i.e., it defines
routing1 [2] = 2.
The routing tables at each process constitute a distributed implementation of the
paths followed by the token to circulate on the ring from each process to its suc-
cessor. These paths are summarized in Table 1.1 which describes the physical paths
implementing the n virtual channels of the logical ring.
1.5 Summary
After defining the notion of a distributed algorithm, this chapter has presented sev-
eral traversal network algorithms, namely, parallel, breadth-first, and depth-first
traversal algorithms. It has also presented algorithms that construct spanning trees
or rings on top of a communication graph. In addition to being interesting in their
own right, these algorithms show that distributed traversal techniques are different
from their sequential counterparts.
M. Raynal [170]. Other depth-first traversal algorithms have been proposed in the
literature, e.g., [95, 224].
• The breadth-first traversal algorithm without centralized control is due to T.-Y.
Cheung [91]. The one with centralized control is due to Y. Zhu and T.-Y. Cheung
[395].
• The algorithm building a logical ring on top of an arbitrary network is due to J.-
M. Hélary and M. Raynal [177]. It is shown in this paper how the basic algorithm
described in Fig. 1.19 can be improved to obtain an implementation of the ring
requiring a number of messages ≤ 2n to implement a full turn of the token on the
ring.
• The distributed construction of minimum weight spanning trees has been ad-
dressed in many papers, e.g., [143, 209, 231].
• Graph algorithms and algorithmic graph theory can be found in many textbooks
(e.g., [122, 158]). The book by M. van Steen [359] constitutes an introduction
to graph and complex networks for engineers and computer scientists. Recent
advances on graph theory are presented in the collective book [164].
1. Let us assume that a message GO () is allowed to carry the position of its sender
in the communication graph. How can we improve a distributed graph traversal
algorithm so that it can benefit from this information?
Solution in [319].
2. Write the full text of the depth-first traversal algorithm corresponding to the im-
provement presented in the part titled “An easy improvement of the basic al-
gorithm” of Sect. 1.4.1. Depict then a run of it on the communication graph
described in the left part of Fig. 1.8.
3. Let us consider the case of a directed communication graph where the mean-
ing of “directed” is as follows. A channel from pi to pj allows (a) pi to send
only messages GO () to pj and (b) pj to send only messages BACK () to pi . Two
processes pi and pj are then such that either there is no communication chan-
nel connecting them, or there is one directed communication channel connecting
one to the other, or there are two directed communication channels (one in each
direction).
Design a distributed algorithm building a breadth-first spanning tree with a
distinguished root pa and compare it with the algorithm described in Fig. 1.11.
It is assumed that there is a directed path from the distinguished root pa to any
other process.
Solution in [91].
4. Consider a communication graph in which the processes have no identity and
each process pi knows its position in the communication network with the help
of a local array channeli [1..ci ] (where ci is the number of neighbors of pi ). An
example is given in Fig. 1.22. As we can see in this figure, the channel connecting
Fig. 1.22 An anonymous network
This chapter addresses three basic graph problems encountered in the context of
distributed systems. These problems are (a) the computation of the shortest paths
between a pair of processes where a positive length (or weight) is attached to each
communication channel, (b) the coloring of the vertices (processes) of a graph in
Δ + 1 colors (where Δ is the maximal number of neighbors of a process, i.e., the
maximal degree of a vertex when using the graph terminology), and (c) the detection
of knots and cycles in a graph. As for the previous chapter devoted to graph traversal
algorithms, an aim of this chapter is not only to present specific distributed graph
algorithms, but also to show that their design is not always obtained from a simple
extension of their sequential counterparts.
Bellman–Ford’s sequential algorithm computes the shortest paths from one prede-
termined vertex of a graph to every other vertex. It is an iterative algorithm based
on the dynamic programming principle. This principle and its adaptation to a dis-
tributed context are presented below.
Initial Knowledge and Local Variables Initially each process knows that there
are n processes and the set of process identities is {1, . . . , n}. It also knows its posi-
tion in the communication graph (which is captured by the set neighborsi ). Interest-
ingly, it will never learn more on the structure of this graph. From a local state point
of view, each process pi manages the following variables.
• As just indicated, gi [j ], for j ∈ neighborsi , denotes the length associated with
the channel ⟨i, j⟩.
• lengthi [1..n] is an array such that lengthi [k] will contain the length of the shortest
path from pi to pk . Initially, lengthi [i] = 0 (and keeps that value forever) while
lengthi [j ] = +∞ for j ≠ i.
• routing_toi [1..n] is an array that is not used to compute the shortest paths from
pi to each other process. It constitutes the local result of the computation. More
precisely, when the algorithm terminates, for any k, 1 ≤ k ≤ n, routing_toi [k] = j
means that pj is a neighbor of pi on a shortest path to pk , i.e., pj is an optimal
neighbor when pi has to send information to pk (where optimality is with respect
to the length of the path from pi to pk ).
For k ≠ i, these local arrays have to satisfy the equation lengthi [k] = minj ∈neighborsi (gi [j ] + lengthj [k]).
The meaning of this formula is depicted in Fig. 2.1 for a process pi such that
neighborsi = {j1 , j2 , j3 }. Each dotted line from pjx to pk , 1 ≤ x ≤ 3, represents the
shortest path joining pjx to pk and its length is lengthjx [k]. The solution of this set
of equations is computed asynchronously and iteratively by the n processes, each
process pi computing successive approximate values of its local array lengthi [1..n]
until it stabilizes at its final value.
The Algorithm The algorithm is described in Fig. 2.2 (a distributed adaptation of
Bellman–Ford’s shortest path algorithm; code for pi ). At least one process pi
has to receive the external message START () in order to launch the algorithm. It
then sends to each of its neighbors the message UPDATE (lengthi ), which describes
its current local state as far as the computation of the length of its shortest paths to
each other process is concerned.
When a process pi receives a message UPDATE (length) from one of its neighbors
pj , it applies the forward/discard strategy introduced in Chap. 1. To that end, pi first
strives to improve its current approximation of its shortest paths to any destination
process (lines 3–9). Then, if pi has discovered shorter paths than the ones it knew
before, pi sends its new current local state to each of its neighbors (lines 10–12). If
its local state (captured by the array lengthi [1..n]) has not been modified, pi does
not send a message to its neighbors.
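To make the forward/discard behavior concrete, here is a minimal Python sketch of the per-process logic just described. It is only an illustration, not the code of Fig. 2.2: the process class, the send callback, and the 1-indexed array layout are assumptions of the sketch.

```python
# Minimal sketch of the distributed Bellman-Ford behavior described above.
# Assumptions (not from the book's figure): a runtime calls on_start() when the
# process receives START(), calls on_update() when an UPDATE message arrives,
# and provides send(j, vector) to transmit on the channel towards neighbor j.
INF = float("inf")

class BellmanFordProcess:
    def __init__(self, i, n, g_i, neighbors):
        self.i = i
        self.g = g_i                          # g_i[j]: length of channel (i, j)
        self.neighbors = neighbors
        self.length = [INF] * (n + 1)         # length[k]: best known distance to p_k
        self.length[i] = 0
        self.routing_to = [None] * (n + 1)    # local routing table (the result)

    def on_start(self, send):
        for j in self.neighbors:              # launch: describe the current local state
            send(j, list(self.length))

    def on_update(self, j, length_j, send):
        improved = False
        for k in range(1, len(self.length)):  # try to improve every destination
            if k == self.i:
                continue
            d = self.g[j] + length_j[k]
            if d < self.length[k]:
                self.length[k] = d
                self.routing_to[k] = j
                improved = True
        if improved:                          # forward only if something new was learned
            for jp in self.neighbors:
                send(jp, list(self.length))
```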
Termination While there is a finite time after which the arrays lengthi [1..n] and
routing_toi [1..n], 1 ≤ i ≤ n, have obtained their final values, no process ever learns
when this time has occurred.
when r = 1, 2, . . . , D do
begin synchronous round
(1) for each j ∈ neighborsi do send UPDATE (lengthi ) to pj end for;
(2) for each j ∈ neighborsi do receive UPDATE (lengthj ) from pj end for;
(3) for each k ∈ {1, . . . , n} \ {i} do
(4) let length_ik = minj ∈neighborsi (gi [j ] + lengthj [k]);
(5) if (length_ik < lengthi [k]) then
(6) lengthi [k] ← length_ik;
(7) routing_toi [k] ← a neighbor j that realizes the previous minimum
(8) end if
(9) end for
end synchronous round.
Fig. 2.5 The principle that underlies Floyd–Warshall’s shortest paths algorithm
The Distributed Algorithm The algorithm is described in Fig. 2.6. The processes
execute concurrently a loop where the index pv takes the successive values from 1
to n (line 1). If a process receives a message while it has not yet started executing its
local algorithm, it locally starts the local algorithm before processing the message.
As the communication graph is connected, it follows that, as soon as at least one
process pi starts its local algorithm, all the processes start theirs.
As indicated just previously, when the processes execute the iteration step pv, the
process ppv has to broadcast its local array lengthpv [1..n] so that each process pi can
try to improve its shortest distance to any process pj , as indicated in Fig. 2.5.
To this end, let us observe that if, at the pvth iteration of the loop, there is a path
from pi to ppv involving only processes in the set {p1 , . . . , ppv−1 }, there is then a
favorite neighbor to attain ppv , namely the process whose index has been computed
and saved in routing_toi [pv]. This means that, at the pvth iteration, the set of local
variables routing_tox [pv] of the processes px such that lengthx [pv] ≠ +∞ defines a
tree rooted at ppv .
The algorithm executed by the processes, which ensures a correct process coor-
dination, follows from this observation. More precisely, a local algorithm is made
up of three parts:
• Part 1: lines 1–6. A process pi first sends a message to each of its neighbors pk
indicating whether or not pi is one of pk ’s children in the tree rooted at ppv . It then
waits until it has received such a message from each of its neighbors.
Then, pi executes the rest of the code for the pvth iteration only if it has a
chance to improve its shortest paths with the help of ppv , i.e., if lengthi [pv] ≠
+∞.
• Part 2: lines 8–17. This part of the algorithm ensures that each process pi such
that lengthi [pv] ≠ +∞ receives a copy of the array lengthpv [1..n] so that it can
recompute the values of its shortest paths and the associated local routing table
(which is done in Part 3).
The broadcast of lengthpv [1..n] by ppv is launched at line 13, where this pro-
cess sends the message PV _ LENGTH (pv, lengthpv ) to all its children in the tree
of which it is the root. When it receives such a message carrying the value pv and
the array pv_length[1..n] (line 9), a process pi forwards it to its children in the
tree rooted at ppv (lines 12 and 14).
• Part 3: lines 18–23. Finally, a process pi uses the array pv_length[1..n] it has
received in order to improve its shortest paths that pass through the processes
p1 , . . . , ppv .
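The local update performed in Part 3 boils down to the Floyd–Warshall relaxation applied with ppv as pivot. A small Python sketch of this step (illustrative names, not the code of Fig. 2.6; the arrays are assumed 1-indexed, and the step is only executed when lengthi [pv] ≠ +∞):

```python
# Sketch of the Part 3 relaxation: improve the paths of p_i that pass through
# p_pv, given pv_length[1..n], the array broadcast by p_pv.
def improve_with_pivot(length_i, routing_to_i, pv, pv_length):
    for j in range(1, len(length_i)):
        via_pv = length_i[pv] + pv_length[j]
        if via_pv < length_i[j]:
            length_i[j] = via_pv
            # the best neighbor towards p_j becomes the one already used to reach p_pv
            routing_to_i[j] = routing_to_i[pv]
```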
Cost Let e be the number of communication channels. It is easy to see that, dur-
ing each iteration, (a) at most two messages CHILD () are sent on each channel (one
in each direction) and (b) at most (n − 1) messages PV _ LENGTH () are sent. It fol-
lows that the number of messages is upper-bounded by n(2e + n); i.e., the message
complexity is O(n³). As far as the size of messages is concerned, a message CHILD ()
carries a bit, while PV _ LENGTH () carries n values whose size depends on the indi-
vidual lengths associated with the communication channels.
Finally, there are n iteration steps, and each has O(n) time complexity. Moreover,
in the worst case, the processes start the algorithm one after the other (a single
process starts, which entails the start of another process, etc.). When summing up,
it follows that the time complexity is upper-bounded by O(n²).
(1) for each j ∈ neighborsi do send INIT (colori [i]) to pj end for;
(2) for each j ∈ neighborsi
(3) do wait (INIT (col_j ) received from pj ); colori [j ] ← col_j
(4) end for;
(5) for ri from (Δ + 2) to m do
begin asynchronous round
(6) if (colori [i] = ri )
(7) then c ← smallest color in {1, . . . , Δ + 1} such that ∀j ∈ neighborsi : colori [j ] ≠ c;
(8) colori [i] ← c
(9) end if;
(10) for each j ∈ neighborsi do send COLOR (ri , colori [i]) to pj end for;
(11) for each j ∈ neighborsi do
(12) wait (COLOR (r, col_j ) with r = ri received from pj );
(13) colori [j ] ← col_j
(14) end for
end asynchronous round
(15) end for.
This section presents a distributed algorithm which colors the processes in at most
(Δ + 1) colors in such a way that no two neighbors have the same color. Distributed
coloring is encountered in practical problems such as resource allocation or pro-
cessor scheduling. More generally, distributed coloring algorithms are symmetry
breaking algorithms in the sense that they partition the set of processes into subsets
(a subset per color) such that no two processes in the same subset are neighbors.
Local Variables Each process pi manages a local variable colori [i] which ini-
tially contains its initial color, and will contain its final color at the end of the algo-
rithm. A process pi also manages a local variable colori [j ] for each of its neigh-
bors pj . As the algorithm is asynchronous and round-based, the local variable ri
managed by pi denotes its current local round number.
Such asynchronous rounds are not given for free by the computation model: they
have to be explicitly managed by the processes themselves. Hence, pi increases ri
each time it starts a new asynchronous round (line 5).
The first round (lines 1–2) is an initial round during which the processes ex-
change their initial color in order to fill in their local array colori [neighborsi ]. If
the processes know the initial colors of their neighbors, this communication round
can be suppressed. The processes then execute m − (Δ + 1) asynchronous rounds
(line 5).
The processes whose initial color belongs to the set of colors {1, . . . , Δ + 1} keep
their color forever. The other processes update their colors in order to obtain a color
in {1, . . . , Δ + 1}. To that end, all the processes execute sequentially the rounds
Δ + 2, . . . , m, considering that each round number corresponds to a given
distinct color. During round r, Δ + 2 ≤ r ≤ m, each process whose initial color is r
looks for a new color in {1, . . . , Δ + 1} which is not the color of its neighbors and
adopts it as its new color (lines 6–8). Then, each process exchanges its color with
its neighbors (lines 10–14) before proceeding to the next round. Hence, the round
invariant is the following one: When a round r terminates, the processes whose
initial colors were in {1, . . . , r} (a) have a color in the set {1, . . . , Δ + 1}, and (b)
have different colors if they are neighbors.
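The recoloring rule used at line 7 always succeeds: a process has at most Δ neighbors, so at least one of the Δ + 1 colors is free. A tiny Python sketch of this local rule (the helper name is illustrative):

```python
# Sketch of the rule of line 7: pick the smallest color of {1, ..., delta+1}
# not currently used by a neighbor; such a color always exists because a
# process has at most delta neighbors.
def smallest_free_color(neighbor_colors, delta):
    for c in range(1, delta + 2):
        if c not in neighbor_colors:
            return c
```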
Proof Let us first observe that the processes whose initial color belongs to
{1, . . . , Δ + 1} never modify their color. Let us assume that, up to round r, the
processes whose initial colors were in the set {1, . . . , r} have new colors in the
set {1, . . . , Δ + 1} and any two of them which are neighbors have different colors.
Thanks to the initial m-coloring, this is initially true (i.e., for the fictitious round
r = Δ + 1).
Let us assume that the previous assertion is true up to some round r ≥ Δ + 1.
It follows from the algorithm that, during round r + 1, only the processes whose
current color is r + 1 update it. Moreover, each of them updates it (line 7) with a
color that (a) belongs to the set {1, . . . , Δ + 1} and (b) is not a color of its neighbors
(we have seen in Sect. 2.2.1 that such a color does exist). Consequently, at the end
of round r + 1, the processes whose initial colors were in the set {1, . . . , r + 1}
have new colors in the set {1, . . . , Δ + 1} and no two of them have the same new
color if they are neighbors. It follows that, as claimed, this property constitutes a
round invariant from which we conclude that each process has a final color in the
set {1, . . . , Δ + 1} and no two neighbor processes have the same color.
Fig. 2.9 One bit of control information when the channels are not FIFO
It follows that the message COLOR () does not need to carry the value of r but only
a bit, namely the parity of r. The algorithm can then be simplified as follows:
• At line 10, each process pi sends the message COLOR (ri mod 2, colori [i]) to each
of its neighbors.
• At line 12, each process pi waits for a message COLOR (b, colori [i]) from each
of its neighbors where b = (ri mod 2).
Finally, it follows from previous discussion that, if the channels are FIFO, the mes-
sages COLOR () do not need to carry a control value (neither r, nor its parity bit).
• Greedy strategy: as the previous set is not necessarily maximal, the algorithm
starts with an initial independent set (defined by some color) and executes a se-
quence of rounds, each round r corresponding to a color, in which it strives to
add to the independent set under construction as many processes as possible whose
color is r. The corresponding “addition” predicate for a process pi with color r
is that none of its neighbors is already in the set.
As with the previous algorithms, the algorithm described in Fig. 2.11 simulates a syn-
chronous algorithm. The color of a process pi is kept in its local variable denoted
colori . The messages carry a round number (color) which can be replaced by its
parity. The processes execute m asynchronous rounds (a round per color). When it
executes round r, if its color is r and none of its neighbors belongs to the set un-
der construction, a process pi adds itself to the set (line 3). Then, before starting
the next round, the processes exchange their membership of the maximal indepen-
dent set in order to update their local variables selectedi [j ]. (As we can see, what
is important is not the fact that the rounds are executed in the order 1, . . . , m, but
the fact that the processes execute the rounds in the same predefined order, e.g.,
1, m, 2, (m − 1), . . . .)
The size of the maximal independent set that is computed is very sensitive to the
order in which the colors are visited by the algorithm. As an example, let us consider
the graph at the right of Fig. 2.10 where the process p1 is colored a while the other
processes are colored b. If a = 1 and b = 2, the maximal independent set that is
built is the set {1}. If a = 2 and b = 1, the maximal independent set that is built is
the set {2, 3, 4, 5}.
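The effect of the visiting order can be checked with a small centralized sketch of the greedy rule the rounds implement (illustrative code; the message exchanges of Fig. 2.11 are abstracted away, and the example graph is the star shape implied by the stated results):

```python
# Centralized sketch of the greedy construction: visit the colors in a fixed
# order and add a vertex when none of its neighbors has been selected yet.
def greedy_mis_from_coloring(vertices, neighbors, color, color_order):
    selected = set()
    for c in color_order:
        for v in vertices:
            if color[v] == c and not any(w in selected for w in neighbors[v]):
                selected.add(v)
    return selected

# Graph of Fig. 2.10 (right), as implied by the text: p1 adjacent to p2..p5.
star = {1: {2, 3, 4, 5}, 2: {1}, 3: {1}, 4: {1}, 5: {1}}
colors = {1: "a", 2: "b", 3: "b", 4: "b", 5: "b"}
# visiting color a first yields {1}; visiting color b first yields {2, 3, 4, 5}
```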
Fig. 2.12 Luby’s synchronous random algorithm for a maximal independent set (code for pi )
As the model is synchronous, a message sent during a round is received by its
destination process in the very same round. (It is easy to extend this algorithm so
that it works in the asynchronous model.)
Each process pi manages the following local variables.
• The local variable statei , whose initial value is arbitrary, is updated only once.
Its final value (in or out) indicates whether pi belongs or not to the maximal in-
dependent set that is computed. When it has updated statei to its final value, a
process pi executes the statement return() which stops its participation in the al-
gorithm. Let us notice that the processes do not necessarily terminate during the
same round.
• The local variable com_withi , which is initialized to neighborsi , is a set contain-
ing the processes with which pi will continue to communicate during the next
round.
• Each local variable randomi [j ], where j ∈ neighborsi ∪ {i}, represents the local
knowledge of pi about the last random number used by pj .
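The rest of Fig. 2.12 is not reproduced here, but the strategy underlying these variables is a common formulation of Luby's random selection: at every round, each still-active process draws a random number and enters the independent set when its draw beats those of all its still-active neighbors. A centralized, round-by-round simulation sketch of this strategy (illustrative code, not the book's message-passing figure):

```python
import random

# Centralized simulation sketch of Luby-style random MIS construction: winners
# of a round and their neighbors stop participating (this mirrors the roles of
# state_i and com_with_i in the distributed algorithm).
def luby_mis(vertices, neighbors):
    active = set(vertices)
    in_mis = set()
    while active:
        rnd = {v: random.random() for v in active}
        winners = {v for v in active
                   if all(rnd[v] > rnd[w] for w in neighbors[v] if w in active)}
        in_mis |= winners
        active -= winners
        active -= {w for v in winners for w in neighbors[v]}
    return in_mis
```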
Knots and cycles are graph patterns encountered when one has to solve distributed
computing problems such as deadlock detection. This section presents an asyn-
chronous distributed algorithm that detects such graph patterns.
A directed graph is a graph where every edge is oriented from one vertex to another
vertex. A directed path in a directed graph is a sequence of vertices i1 , i2 , . . . , ix
such that for any y, 1 ≤ y < x, there is an edge from the vertex iy to the vertex iy+1 .
A cycle is a directed path such that ix = i1 .
A knot in a directed graph G is a subgraph G′ such that (a) any pair of vertices
in G′ belongs to a cycle and (b) there is no directed path from a vertex in G′ to a
vertex which is not in G′ . Hence, a vertex of a directed graph belongs to a knot if
and only if it is reachable from all the vertices that are reachable from it. Intuitively,
a knot is a “black hole”: once in a knot, there is no way to go outside of it.
An example is given in Fig. 2.14. The directed graph has 11 vertices. The set
of vertices {7, 10, 11} defines a cycle which is not in a knot (this is because, when
traveling on this cycle, it is possible to exit from it). The subgraph restricted to the
vertices {3, 5, 6, 8, 9} is a knot (after entering this set of vertices, it is impossible to
exit from it).
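The characterization given above (a vertex belongs to a knot if and only if it is reachable from every vertex it can reach) can be checked with plain reachability computations. A small sequential sketch, given as an illustration of the definition rather than of the distributed algorithm presented next:

```python
# succ[v] is the set of immediate successors of vertex v in the directed graph.
def reachable_from(succ, start):
    seen, stack = {start}, [start]
    while stack:
        v = stack.pop()
        for w in succ[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def in_knot(succ, v):
    # a vertex without successors is on no cycle, hence in no knot
    if not succ[v]:
        return False
    # v is in a knot iff every vertex reachable from v can reach v back
    return all(v in reachable_from(succ, w) for w in reachable_from(succ, v))
```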
• Safety (consistency).
– If pa obtains the answer “knot”, it belongs to a knot. Moreover, it knows the
identity of all the processes involved in the knot.
– If pa obtains the answer “no knot”, it does not belong to a knot. Moreover, if
it belongs to at least one cycle, pa knows the identity of all the processes that
are involved in a cycle with pa .
As we can see, the safety property of the knot detection problem states what is a
correct result while its liveness property states that eventually a result has to be
computed.
The algorithm that is presented below relies on the construction of a spanning tree
enriched with appropriate statements. It is due to D. Manivannan and M. Sing-
hal (2003).
Remark The previous message GO _ DETECT () and the messages CYCLE _ BACK (),
SEEN _ BACK (), and PARENT _ BACK () introduced below are nothing more than par-
ticular instances of the messages GO () and BACK () used in the graph traversal algo-
rithms described in Chap. 1.
If a process pi that has already been visited by the detection receives a message
GO _ DETECT () from a process pk while it does not yet know whether it is on a cycle
including pa , it sends back to pk the message SEEN _ BACK (), and when pk receives this message it
includes the ordered pair ⟨k, i⟩ in a local set denoted seenk . (Basically, the message
SEEN _ BACK () informs its receiver that its sender has already received a message
GO _ DETECT ().) In that way, if later pi is found to be on a cycle including pa , it
can be concluded from the pair ⟨k, i⟩ ∈ seenk that pk is also on a cycle including pa
(this is because, due to the messages GO _ DETECT (), there is a directed path from
pa to pk and pi , and due to the cycle involving pa and pi , there is a directed path
from pi to pa ).
Finally, as in graph traversal algorithms, when it has received an acknowledg-
ment from each of its immediate successors, a process pi sends a message PAR -
ENT _ BACK () to its parent in the spanning tree. Such a message contains (a) the
processes that, due to the messages CYCLE _ BACK () received by pi from immediate
successors, are known by pi to be on a cycle including pa , and (b) the ordered pairs
stored in seeni as a result of the acknowledgment messages SEEN _ BACK ()
and PARENT _ BACK () it has received from its immediate successors in the logical
directed graph. This information, which will be propagated in the tree to pa , will
allow pa to determine if it is in a knot or a cycle.
Local Variable at the Initiator pa Only The local variable candidatesa , which
appears only at the initiator, is a set (initially empty) of process identities. If pa is in
a knot, candidatesa will contain the identities of all the processes that are in the knot
including pa , when the algorithm terminates. If pa is not in a knot, candidatesa will
contain all the processes that are in a cycle including pa (if any). If candidatesa = ∅
when the algorithm terminates, pa belongs to neither a knot, nor a cycle.
Local Variables at Each Process pi In addition, each process pi manages the
following local variables.
• The local variable parenti will contain the identity of the process from which pi
received its first message GO _ DETECT (). Once all the processes reachable from pa
have received a message GO _ DETECT (), these local variables define a directed
spanning tree rooted at pa which will be used to transmit information back to this
process.
• The local variable waiting_fromi is a set of process identities. It is initialized to
the set of immediate successors of pi in the logical directed graph.
• The local variable in_cyclei is a set (initially empty) of process identities. It will
contain processes that are on a cycle including pi .
• The local variable seeni is a set (initially empty) of ordered pairs of process iden-
tities. As we have seen, ⟨k, j⟩ ∈ seeni means that there is a directed path from pa
to pk and a directed edge from pk to pj in the directed graph. It also means that
both pk and pj have received a message GO _ DETECT () and, when pj received
the message GO _ DETECT () from pk , it did not know whether it belongs to a cycle
including pa (see Fig. 2.15).
Launching the Algorithm The only process pi that receives the external mes-
sage START () discovers that it is the initiator, i.e., pi is pa . If it has no outgoing
edges – predicate (waiting_fromi = ∅) at line 1 –, pi returns the pair (no knot,∅),
which indicates that pi belongs neither to a cycle, nor to a knot (line 4). Otherwise,
it sends the message GO _ DETECT () to all its immediate successors in the directed
graph (line 3).
Reception of a Message GO _ DETECT () When a process pi receives the message
GO _ DETECT () from pj , it sends back to pj the message CYCLE _ BACK () if it is
the initiator, i.e., if pi = pa (line 7). If it is not the initiator and this message is the
first it receives, it first defines pj as its parent in the spanning tree (line 9). Then, if
waiting_fromi ≠ ∅ (line 10), pi propagates the detection to its immediate successors
in the directed graph (line 11). If waiting_fromi = ∅, pi has no successor in the
directed graph. It then returns the message PARENT _ BACK (seeni , in_cyclei ) to its
parent (both seeni and in_cyclei are then equal to their initial value, i.e., ∅; seeni = ∅
means that pi has not seen another detection message, while in_cyclei = ∅ means
that pi is not involved in a cycle including the initiator).
If pi is already in the detection tree, it sends back to pj the message
SEEN _ BACK () or CYCLE _ BACK () according to whether the local set in_cyclei is
empty or not (lines 14–15). Hence, if in_cyclei ≠ ∅, pi is on a cycle including pa
and pj will consequently learn that it is also on a cycle including pa .
Reception of a Message XXX _ BACK () When a process pi receives a message
XXX _ BACK () (where XXX stands for SEEN , CYCLE , or PARENT ), it first suppresses
its sender pj from waiting_fromi .
As we have seen, a message SEEN _ BACK () informs its receiver pi that its sender
pj has already been visited by the detection algorithm (see Fig. 2.15). Hence, pi
adds the ordered pair ⟨i, j⟩ to seeni (line 19). Therefore, if later pj is found to be
on a cycle involving the initiator, the initiator will be able to conclude from seeni
that pi is also on a cycle involving pa . The receiver pi then invokes the internal
operation check_waiting_from().
If the message received by pi from pj is CYCLE _ BACK (), pi adds j to in_cyclei
(line 20) before invoking check_waiting_from(). This is because there is a path from
the initiator to pi and a path from pj to the initiator, hence pi and pj belong to a
same cycle including pa .
If the message received by pi from pj is PARENT _ BACK (seen, in_cycle), pi
adds the ordered pairs contained in seen sent by its child pj to its set seeni (line 22).
Moreover, if in_cycle is not empty, pi merges it with in_cyclei (line 25). Otherwise
pi adds the ordered pair ⟨i, j⟩ to seeni (line 24). In this way, the information allow-
ing pa to know (a) if it is in a knot or (b) if it is only in a cycle involving pi will be
propagated from pi first to its parent, and then propagated from its parent until pa .
Finally, pi invokes check_waiting_from().
The Internal Operation check_waiting_from() As just seen, this operation is in-
voked each time pi receives a message XXX _ BACK (). Its body is executed only
if pi has received a message XXX _ BACK () from each of its immediate successors
(line 28). There are two cases.
If pi is not the initiator, it first adds itself to in_cyclei if this set is not empty
(line 41). This is because, if in_cyclei ≠ ∅, pi knows that it is on a cycle involving
the initiator (lines 20 and 25). Then, pi sends to its parent (whose identity has been
saved in parenti at line 9) the information it knows on cycles involving the initia-
tor. This information has been incrementally stored in its local variables seeni and
in_cyclei at lines 19–27. Finally, pi invokes return(), which terminates its partici-
pation (line 42).
If pi is the initiator pa , it executes the statements of lines 30–39. First pi cleans
its local variables seeni and in_cyclei (lines 30–38). For each k ∈ in_cyclei , pi first
moves k from in_cyclei to candidatesi . This is because, if pi is in a knot, so are all
the processes which are on a cycle including pi . Then, for each x, if the ordered
pair ⟨x, k⟩ ∈ seeni , pi suppresses it from seeni and adds px to in_cyclei . This is
because, after pa has received a message XXX _ BACK () from each of its immediate
successors, we have for each process pk reachable from pa either k ∈ in_cyclea or
⟨x, k⟩ ∈ seena for some px reachable from pa . Hence, if k ∈ in_cyclea and ⟨x, k⟩ ∈
seena , then px is also in a cycle with pa .
Therefore, after the execution of lines 30–38, candidatesa contains the identities
of all the processes reachable from pa which are on a cycle with pa . It follows that,
if seena becomes empty, all the processes reachable from pa are on a cycle with pa .
The statement of line 39 is a direct consequence of this observation. If seena = ∅,
pa belongs to a knot made up of the processes which belong to the set candidatesa .
If seena ≠ ∅, candidatesa contains all the processes that are involved in a cycle
including pa (hence, if candidatesa = ∅, pa is involved neither in a knot, nor in a
cycle).
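The cleanup performed by the initiator at lines 30–39 can be summarized by the following centralized Python sketch (illustrative names that mirror those of the text):

```python
# in_cycle: identities known to be on a cycle with p_a; seen: set of ordered
# pairs (x, k) collected during the traversal. The loop mirrors lines 30-38 and
# the final test mirrors line 39.
def finalize(in_cycle, seen):
    candidates = set()
    while in_cycle:
        k = in_cycle.pop()
        candidates.add(k)                      # processes on a cycle with p_a
        for pair in [p for p in seen if p[1] == k]:
            seen.remove(pair)                  # <x, k> in seen and k on a cycle
            if pair[0] not in candidates:      # ... implies x is on a cycle too
                in_cycle.add(pair[0])
    if not seen:
        return ("knot", candidates)            # every reachable process cycles back
    return ("no knot", candidates)             # candidates: cycles through p_a only
```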
Fig. 2.17 Knot/cycle detection: example
An Example Let us consider the directed graph depicted in Fig. 2.17. This
graph has a knot composed of the processes p2 , p3 , and p4 , a cycle involving
the processes p1 , p6 , p5 and p7 , plus another process p8 . If the initiator process
pa belongs to the knot, pa will discover that it is in a knot, and we will have
candidatesa = {2, 3, 4} and seena = ∅ when the algorithm terminates. If the ini-
tiator process pa belongs to the cycle on the right of the figure (e.g., pa is p1 ), we
will have candidatesa = {1, 6, 5, 7} and seena = {⟨4, 2⟩, ⟨3, 4⟩, ⟨2, 3⟩, ⟨1, 2⟩, ⟨5, 4⟩}
when the algorithm terminates (assuming that the messages GO _ DETECT() propa-
gate first along the process chain (p1 , p2 , p3 , p4 ), and only then from p5 to p4 ).
Cost of the Algorithm As in a graph traversal algorithm, each edge of the di-
rected graph is traversed at most once by a message GO _ DETECT () and a message
SEEN _ BACK (), CYCLE _ BACK () or PARENT _ BACK () is sent in the opposite direc-
tion. It follows that the number of messages used by the algorithm is upper bounded
by 2e, where e is the number of edges of the logical directed graph.
Let DT be the depth of the spanning tree rooted at pa that is built. It is easy
to see that the time complexity is 2(DT + 1) (DT time units for the messages
GO _ DETECT () to go from the root pa to the leaves, DT time units for the mes-
sages XXX _ BACK () to go back in the other direction and 2 more time units for the
leaves to propagate the message GO _ DETECT () to their immediate successors and
obtain their acknowledgment messages XXX _ BACK ()).
2.4 Summary
Considering a distributed system as a graph whose vertices are the processes and
edges are the communication channels, this chapter has presented several distributed
graph algorithms. “Distributed” means here that each process cooperates with its
neighbors to solve a problem but never learns the whole graph structure it is part of.
The problems that have been addressed concern the computation of shortest
paths, the coloring of the vertices of a graph in Δ + 1 colors (where Δ is the maxi-
mal degree of the vertices), the computation of a maximal independent set, and the
detection of knots and cycles.
As the reader has seen, the algorithmic techniques used to solve graph problems
in a distributed context are different from their sequential counterparts.
• Graph notions and sequential graph algorithms are described in many textbooks,
e.g., [122, 158]. Advanced results on graph theory can be found in [164]. Time
complexity results of numerous graph problems are presented in [148].
• Distributed graph algorithms and associated time and message complexity analy-
ses can be found in [219, 292].
• As indicated by its name, the sequential shortest path algorithm presented in
Sect. 2.1.1 is due to R.L. Ford. It is based on Bellman’s dynamic program-
ming principle [44]. Similarly, the sequential shortest path algorithm presented in
Sect. 2.1.2 is due to R.W. Floyd and R. Warshall, who independently introduced
similar algorithms in [128] and [384], respectively. The adaptation of Floyd–
Warshall’s shortest path algorithm is due to S. Toueg [373].
Other distributed shortest path algorithms can be found in [77, 203].
• The random algorithm presented in Sect. 2.2.3, which computes a maximal inde-
pendent set, is due to M. Luby [240]. The reader will find in this paper a proof
that the expected number of rounds is O(log2 n). Another complexity analysis of
(Δ + 1)-coloring is presented in [201].
• The knot detection algorithm described in Sect. 2.3.4 is due to D. Manivannan and
M. Singhal [248] (this paper contains a correctness proof of the algorithm). Other
asynchronous distributed knot detection algorithms can be found in [59, 96, 264].
• Distributed algorithms for finding centers and medians in networks can be found
in [210].
• Deterministic distributed vertex coloring in polylogarithmic time, suited to syn-
chronous systems, is addressed in [43].
The problems addressed in this chapter require each process pi to compute a
result outi which involves the whole set of inputs in1 , in2 , . . . , inn , where ini is the
input provided by process pi . More precisely, let IN[1..n] be the vector that, from an
external observer point of view, represents the inputs, i.e., ∀i: IN[i] = ini . Similarly,
let OUT[1..n] be the vector such that ∀i: OUT[i] = outi . Hence, the processes have
to cooperate and coordinate so that they collectively compute
OUT = F (IN).
According to the function F () which is computed, all the processes may obtain
the same result out (and we have then OUT[1] = · · · = OUT[n] = out) or different
results (i.e., it is possible that there are processes pi and pj such that outi ≠ outj ).
Examples of such problems are the following.
Routing Tables In this case, the input of a process is its position in the network
and its output is a routing table that, for each process pj , defines the local channel
that pi has to use so that the messages to pj travel along the path with the shortest
distance (let us recall that the distance is the minimal number of channels separating
pi from pj ).
Maximal or Minimal Input Simple functions returning the same result to all the
processes are the computation of the maximum (or minimum) of their inputs. This
is typically the case in the election problem where, assuming that processes have
distinct and comparable identities, the input of each process is its identity and their
common output is the greatest (or smallest) of their identities, which defines the
process that is elected.
Cut Vertex A cut vertex in a graph is a vertex (process) whose suppression dis-
connects the graph. Knowledge of such processes is important to analyze message
bottleneck and fault-tolerance. The global function computed by each process is
here a predicate indicating, at each process, if it is a cut vertex.
A simple solution consists in first building a spanning tree rooted at some process pa
and using it so that pa collects all the inputs; pa then knows the input vector IN[1..n].
It can consequently compute OUT[1..n] = F (IN) with the help of a sequential
algorithm, and return its result to each process along the channels of the spanning tree.
While this solution is viable and worthwhile for some problems, we are interested
in this chapter in a solution in which the control is distributed, that is to say in a
solution in which no process plays a special role. This can be expressed in the form
of the following constraint: the processes must obey the same rules of behavior.
On the Efficiency Side: Do Not Learn More than What Is Necessary A solu-
tion satisfying the previous constraint would be for each process to flood the system
with its input so that all the processes learn all the inputs, i.e., the input vector
IN[1..n]. Then, each process pi can compute the output vector OUT[1..n] and ex-
tract from it its local result outi = OUT[i].
As an example, if the processes have to compute the shortest distances, they
could first use the algorithm described in Fig. 1.3 (Sect. 1.1.2) in order to learn the
communication graph. They could then use any sequential shortest path algorithm
to compute their shortest distances. This is not satisfactory: a process pi then learns
more information than what is needed to compute its table of shortest distances. As
we have seen in Sect. 2.1, it is not necessary for it to know the whole structure of
the communication graph in which it is working.
The solutions in which we are interested are thus characterized by a second con-
straint: a process must not learn information which is useless from the point of
view of the local output it has to compute.
This section presents an algorithmic framework which offers a solution for the class
of distributed computations which have been described previously, while respecting
the symmetry and efficiency constraints that have been stated. This framework is
due to J.-Cl. Bermond, J.-Cl. König, and M. Raynal (1987).
Local Variables at Each Process pi In addition to its local variable ri and its
identity idi , a process manages the following local variables:
• As already indicated, ini is the input parameter provided by pi to solve the prob-
lem, while outi is a local variable that will contain its local output.
• A process pi has ci (1 ≤ ci ≤ n − 1) bidirectional channels, which connect
it to ci distinct neighbor processes. The set channelsi = {1, . . . , ci } is the set
of local indexes that allow these neighbors to be addressed, and the local ar-
ray channeli [1..ci ] defines the local names of these channels (see Exercise 4,
Chap. 1). This means that the local names of the channel (if any) connecting
pi and pj are channeli [x] (where x ∈ channelsi ) at pi and channelj [y] (where
y ∈ channelsj ) at pj .
• newi is a local set that, at the end of each round r, contains all the information
that pi learned during this round (i.e., it receives this information at round r for
the first time).
• infi is a local set that contains all the information that pi has learned since the
beginning of the execution.
Principle The underlying principle is nothing more than the forward/discard prin-
ciple. During a round r, a process sends to its neighbors all the new information it
has received during the previous round (r − 1). It is assumed that it has received its
input during the fictitious round r = 0.
It follows that, during the first round a process learns the inputs of the processes
which are distance 1 from it, during the second round it learns the inputs of the
processes at distance 2, and more generally it learns the inputs of the processes at
distance d during the round r = d.
Fig. 3.1 Computation of routing tables defined from distances (code for pi )
Cost As previously, let one time unit be the maximal transit time for a message
from a process to one of its neighbors. The time complexity is 2D. The worst case
is when a single process starts the algorithm. It then takes up to D sequential
message transfers before all processes have started their local algorithm. Then,
all the processes execute D rounds.
Each process executes D rounds, and two messages are exchanged on each chan-
nel at each round. Hence, the total number of messages is 2eD, where e is the num-
ber of channels of the communication graph.
This section shows how to eliminate the a priori knowledge of the diameter.
A Simple Predicate Let us observe that, if a process does not learn any new in-
formation at a given round, it will learn no more during the next rounds.
This follows from the fact that, if pi learns nothing new in round r, there is no
process situated at distance r from pi , and consequently no process at a distance
greater than r. Hence, if it learns nothing new in round r, pi will learn nothing new
during any round r′ > r.
Actually, the last round r during which pi learns something new is the round
r = ecci (its eccentricity), but, not knowing ecci , it does not know that it has learned
everything. In contrast, at the end of round r = ecci + 1, pi will have newi = ∅ and
it will then learn that it knows everything. (Let us observe that this predicate also
allows pi to easily compute ecci .)
Not All Processes Terminate During the Same Round Let us observe that, as
not all the processes have necessarily the same eccentricity, they do not terminate
at the same round. Let us consider two neighbor processes pi and pj such that pi
learns no new information during round r while pj does. We have the following:
• As pi and pj are neighbors, we have 0 ≤ |ecci − eccj | ≤ 1. Thus, pj will have
newj = ∅ at the end of round r + 1.
• In order that pj does not block during round r + 1 waiting for a message from pi ,
this process has to execute round r + 1, sending it a message carrying the value
newi = ∅.
It follows that a process pi now executes ecci + 2 rounds. At round r = ecci it
knows everything, at round r = ecci + 1 it knows that it knows everything, and at
round r = ecci + 2 it informs its neighbors of this and terminates.
The corresponding generic algorithm is described in Fig. 3.2. Each process man-
ages an additional local variable com_withi , which contains the channels on which
it has to send and receive messages. The management of this variable is the main
novelty with respect to the algorithm of Fig. 3.1.
If the empty set is received on a channel channeli [x], this channel is withdrawn
from com_withi (line 7). Moreover, to prevent processes from being blocked waiting
forever for a message, a process pi (before it terminates) sends the message MSG (∅)
on each channel which is still open (line 15). It also empties the open channels by
waiting for a message on each of them (line 16).
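Putting the pieces together, the behavior of the generic algorithm can be summarized by a round-synchronous, centralized simulation. The sketch below only illustrates the forward/discard principle with the newi = ∅ termination predicate; the asynchrony and the channel bookkeeping of Fig. 3.2 (lines 7 and 15–16) are abstracted away.

```python
# Centralized, round-synchronous simulation sketch of the generic framework.
# neighbors[i]: set of neighbors of process i; inputs[i]: its input in_i.
def simulate(neighbors, inputs):
    info = {i: {(i, inputs[i])} for i in neighbors}   # inf_i: everything learned so far
    new = {i: {(i, inputs[i])} for i in neighbors}    # new_i: learned in the last round
    while any(new.values()):                          # some process still learns something
        received = {i: set() for i in neighbors}
        for i in neighbors:
            for j in neighbors[i]:
                received[j] |= new[i]                 # forward only the fresh information
        for i in neighbors:
            new[i] = received[i] - info[i]            # discard what was already known
            info[i] |= new[i]
    return info                                       # out_i can be computed as F(info[i])
```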
init infi ← {(idi , ini )}; newi ← {(idi , ini )}; com_withi ← {1, . . . , ci }; ri ← 0.
Denoting by a superscript r the value of a local variable at the end of round r, the
sets sent, received, and computed during a round r ≥ 1 satisfy the following relations:
∀x ∈ com_withi^{r−1} : senti [x]^r = newi^{r−1} ,
∀x ∈ com_withi^{r−1} : receivedi [x]^r = sentj [y]^r (where channeli [x] and channelj [y] are the two local names of the same channel),
newi^r = ( ∪x∈com_withi^r receivedi [x]^r ) \ infi^{r−1} ,
com_withi^r = com_withi^{r−1} \ {x | receivedi [x]^r = ∅} .
Definition A cut vertex (or articulation point) of a graph is a vertex whose sup-
pression disconnects the graph. A graph is biconnected if it remains connected after
the deletion of any of its vertices. Thus, a connected communication graph is bicon-
nected if and only if it has no cut vertex. A connected graph can be decomposed into
a tree whose vertices are its biconnected components.
Given a vertex (process) pi of a connected graph, let Ri be the local relation
defined on the edges (channels) incident to pi as follows. Let e1 and e2 be two
edges incident to pi ; e1 Ri e2 if the edges e1 and e2 belong to the same biconnected
component of the communication graph. It is easy to see that Ri is an equivalence
relation; i.e., it is reflexive, symmetric, and transitive. Thus, if e1 Ri e2 and e2 Ri e3 ,
then, the three edges e1 , e2 , and e3 incident to pi belong to the same biconnected
component.
Example As an example, let us consider the graph depicted in Fig. 3.3. The pro-
cesses p4 , p5 , and p8 are cut vertices (the deletion of any of them disconnects the
graph).
There are four biconnected components (ellipses in the figure) denoted C1 , C2 ,
C3 , and C4 . As an example, the deletion of any process (vertex) inside a component
does not disconnect it.
When looking at the four edges a, b, c, and d connecting process p8 to its neigh-
bors, we have a R8 b and c R8 d, but we do not have a R8 d. This is because the
channels a and b belong to the biconnected component C3 , the channels c and d be-
long to the biconnected component C4 , and C3 ∪ C4 is not a biconnected component
of the communication graph.
Fig. 3.4 Determining cut vertices: principle
Principle of the Algorithm The algorithm, which is due to J.-Cl. Bermond and
J.-Cl. König (1991), is based on the following simple idea.
Given a process pi which is on one or more cycles, let us consider an elementary
cycle to which pi belongs (an elementary cycle is a cycle that does not pass several
times through the same vertex/process). Let a = channeli [x] and b = channeli [y] be
the two distinct channels of pi on this cycle (see an example in Fig. 3.4). Moreover,
let pj be the process on this cycle with the greatest distance to pi , and let ℓ be this
distance. Hence, the length of the elementary cycle including pi and pj is 2ℓ or
2ℓ + 1 (in the figure, it is 2ℓ + 1 = 3 + 4).
The principle that underlies the design of the algorithm follows directly from (a)
the message exchange pattern of the generic framework, and (b) the fact that, during
each round r, a process sends only the new information it has learned during the
previous round. More precisely, we have the following.
• It follows from the message exchange pattern that pi receives idj during round
r = ℓ (in the figure, idj is learned from channel b). Moreover, pi also receives
idj from channel a during round r = ℓ if the length of the cycle is even, or during
round r = ℓ + 1 if the length of the cycle is odd. In the figure, as the length of the
elementary cycle is odd, pi receives idj a first time on channel b during round ℓ
and a second time on channel a during round ℓ + 1. Hence, (a, b) ∈ Ri , i.e.,
the channels a and b (which are incident to pi ) belong to the same biconnected
component.
This observation provides a simple local predicate, which allows a process
pi to determine if two of its incident channels belong to the same biconnected
component.
• Let us now consider two channels incident to the same process pi that are not
in the same biconnected component (as an example, this is the case of channels
b and c, both incident to p8 ). As these two channels do not belong to a same
elementary cycle including pi , this process cannot receive two messages carrying
the same process identity during the same round or two consecutive rounds. Thus,
the previous predicate is an “if and only if” predicate.
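This local test can be written directly from the statement above; a small sketch (illustrative names, received[x] being the list of (round, identity) pairs received on channel x):

```python
# Sketch of the local predicate: two incident channels of p_i belong to the
# same biconnected component iff the same identity is received on both during
# the same round or during two consecutive rounds.
def same_biconnected_component(received, x, y):
    return any(idx == idy and abs(rx - ry) <= 1
               for (rx, idx) in received[x]
               for (ry, idy) in received[y])
```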
Description of the Algorithm The algorithm is described in Fig. 3.5. This algo-
rithm is an enrichment of the generic algorithm of Fig. 3.2: the only additions are
the lines N.9_1–N.9_10, which replace line 9, and lines 17_1–17_3, which replace
line 17. Said differently, by suppressing all the lines whose number is prefixed by
N, we obtain the algorithm of Fig. 3.2.
Each process pi manages two array-like data structures, denoted routing_toi and
disti . Given a process identity id, disti [id] contains the distance from pi to the pro-
cess whose identity is id, and routing_toi [id] contains the local channel on which
messages for this process have to be sent. We use an array-like structure to make
the presentation clearer. Since initially a process pi knows neither the number of
processes, nor their identities, a dynamic list has to be used to implement routing_toi
and disti .
Thus, let us consider a process pi that, during a round r, receives on a channel
channeli [x] a message carrying the value new ≠ ∅. Let id ∈ new (line N.9_1) and
let pj be the corresponding process (i.e., id = idj ). There are two cases:
This section shows that the previous generic algorithm can be improved so as to
reduce the size and the number of messages that are exchanged.
Filtering That Affects the Content of Messages A trivial way to reduce the size
of messages exchanged during a round r is for each process pi not to send on a
channel channeli [x] the complete information newi it has learned during the previ-
ous round, but to send only the content of newi minus what has been received on
this channel during the previous round.
Let receivedi [x] be the information received by pi on the channel channeli [x]
during a round r. Thus, pi has to send only senti [x] = newi \ receivedi [x] on
channeli [x] during the round r + 1. This is the first filtering: it affects the content of
messages themselves.
Filtering That Affects the Channels That Are Used The idea is here for a pro-
cess pi to manage its channels according to their potential for carrying new infor-
mation.
To that end, let us consider two neighbor processes pi and pj such that we have
newi = newj at the end of round r. According to the message exchange pattern and
the round-based synchronization, this means that pi and pj have exactly the same
set E of processes at distance r, and during that round both learn the input values of
the processes in the set E. Hence, if pi and pj knew that newi = newj , they could
stop exchanging messages. This is because the new information they will acquire
in future rounds will be the inputs of the processes at distance r + 1, r + 2, etc.,
which they will obtain independently from one another (by way of the processes in
the set E).
The problem of exploiting this property lies in the fact that, at the end of a round,
a process does not know the value of the set new that it will receive from each of its
neighbors. However, at the end of a round r, each process pi knows, for each of its
channels channeli [x], the value senti [x] it has sent on this channel, and the value
receivedi [x] it has received on this channel. Four cases may occur (let pj denote the
neighbor to which pi is connected by channeli [x]).
1. senti [x] = receivedi [x]. In this case, pi and pj sent to each other the same infor-
mation during round r. They do not learn any new information from each other.
What is learned by pi is the fact that it has learned in round (r − 1) the informa-
tion that pj sent it in round r, and similarly for pj . Hence, from now on, they
will not learn anything more from each other.
2. senti [x] ⊂ receivedi [x]. In this case, pj does not learn anything new from pi in
the current round. Hence, it will not learn anything new from pi in the future
rounds.
3. receivedi [x] ⊂ senti [x]. This case is the inverse of the previous one: pi learns
that it will never learn new information on the channel channeli [x], in all future
rounds.
4. In the case where senti [x] and receivedi [x] cannot be compared, both pi and
pj learn new information from each other, receivedi [x] \ senti [x] as far as pi is
concerned.
These items allow for the implementation of the second filtering. It is based on
the values carried by the messages for managing the use of the communication chan-
nels.
Two More Local Variables To implement the previous filtering, each process pi
is provided with two local set variables, denoted c_ini and c_outi . They contain
indexes of local channels, and are initialized to {1, . . . , ci }. These variables are used
as follows in order to implement the management of the communication channels:
• When senti [x] = receivedi [x] (item 1), x is suppressed from both c_ini and
c_outi .
• When senti [x] ⊂ receivedi [x] (item 2), x is suppressed from c_outi .
• When receivedi [x] ⊂ senti [x] (item 3), x is suppressed from c_ini .
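A small sketch of this channel management, applied at the end of a round to each channel index x (the helper name is illustrative; c_in and c_out are Python sets):

```python
# Compare what was sent and received on channel x during the round and prune
# the channel sets accordingly (items 1-4 above); '<' is proper set inclusion.
def update_channel_sets(sent_x, received_x, x, c_in, c_out):
    if sent_x == received_x:        # item 1: nothing more will be learned either way
        c_in.discard(x)
        c_out.discard(x)
    elif sent_x < received_x:       # item 2: the neighbor learns nothing new from p_i
        c_out.discard(x)
    elif received_x < sent_x:       # item 3: p_i learns nothing new from the neighbor
        c_in.discard(x)
    # item 4: the sets are incomparable, keep the channel open in both directions
```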
The Improved Algorithm The final algorithm is described in Fig. 3.6. It is the
algorithm of Fig. 3.2 modified according to the previous discussion. The local ter-
mination is now satisfied when a process pi can no longer communicate; i.e., it is
captured by the local predicate (c_ini = ∅) ∧ (c_outi = ∅).
When it executes its “send” part of the algorithm (lines 3–7), a process pi has
now to compute the value senti [x] sent on each open output channel channeli [x].
Moreover, if senti [x] = ∅, this channel is suppressed from the set c_outi , which
contains the indexes of the output channels of pi that are still open. In this case, the
receiving process pj will also suppress the corresponding channel from its set of
open input channels c_inj (line 14).
When it executes its “receive” part of the algorithm (lines 9–17), a process pi
updates its set of input channels c_ini and its set of output channels c_outi according
to the value receivedi [x] it has received on each input channel that is still open (i.e.,
on each channeli [x] such that x ∈ c_ini ).
Complexity The algorithm terminates in D + 1 rounds. This comes from the fact
that when senti [x] = ∅, both the sender pi and the receiver pj withdraw the cor-
responding channel from c_outi and c_inj , respectively. The maximum number of
messages is consequently 2e(D + 1). The time complexity is 2D + 1 in the worst
case (which occurs when a single process starts, its first round message wakes up
other processes, etc., and the eccentricity of the starting process is equal to the di-
ameter D of the communication graph).
On the Message Complexity The impact of the structure of the communication
graph appears clearly in the message complexity (denoted C in the following) of
the previous algorithm, for which C is upper bounded
by 2e(D + 1). If D is known by the processes, one round is saved, and we have
C = 2eD. This means that
• If the graph is fully connected we have D = 1, e = n(n − 1)/2, and C = O(n²).
• If the graph is a tree we have e = (n − 1), and C = O(nD).
This shows that it can be interesting to first build a spanning tree of the com-
munication graph and then use it repeatedly to compute global functions. However,
for some problems, a tree is not satisfactory because the tree that is obtained can
be strongly unbalanced in the sense that processes may have very different numbers
of neighbors.
The Notion of a Regular Graph Hence, for some problems, we are interested in
communication graphs in which the processes have the same number of neighbors
(i.e., the same degree Δ). When they exist, such graphs are called regular. In such a
graph we have e = nΔ/2, and consequently we obtain
C = nΔD.
This formula exhibits a strong relation linking three of the main parameters associ-
ated with a regular graph.
What Regular Graphs Can Be Built? Given Δ and D, Moore’s bound (1958) is
an upper bound on the maximal number of vertices (processes) that a regular graph
with diameter D and degree Δ can have. This number is denoted n(D, Δ), and we
have n(D, Δ) ≤ 1 + Δ + Δ(Δ − 1) + · · · + Δ(Δ − 1)^{D−1} , i.e.,
n(D, 2) ≤ 2D + 1, and
n(D, Δ) ≤ (Δ(Δ − 1)^D − 2)/(Δ − 2) for Δ > 2.
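As an illustration, with Δ = 3 and D = 2 the bound gives n(2, 3) ≤ (3 × 2² − 2)/(3 − 2) = 10, a value that is actually attained by the Petersen graph.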
This is an upper bound. Therefore (a) it does not mean that regular graphs for
which n(D, Δ) is equal to the bound exist for any pair (D, Δ), and (b) when such
a graph exists, it does not state how to build it. However, this bound states that, in
the regular graphs that can be built, we have Δ ≥ n^{1/D} . It follows that, in the regular
networks that can be built, we have
C = nΔD ≥ D × n^{(D+1)/D} .
The graphs known as De Bruijn’s graphs are directed regular graphs, which can
be easily built. This section presents them, and shows their interest in computing
global functions. (These graphs can also be used as overlay structures in distributed
applications.)
Let x be a vertex of a directed graph. Δ+ (x) denotes its input degree (number of
incoming edges), while Δ− (x) denotes its output degree (number of outgoing edges).
In a regular network, we have ∀ x : Δ+ (x) = Δ− (x) = Δ, and the value Δ defines
the degree of the graph.
Fig. 3.7 The De Bruijn’s directed networks dB(2,1), dB(2,2), and dB(2,3)
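The formal definition of these graphs belongs to the part of the text not reproduced here; as a hedged reminder, the usual construction of dB(Δ, D) takes as vertices the Δ^D words of length D over an alphabet of size Δ, with an edge from every word to each word obtained by dropping its first letter and appending one letter. A small sketch:

```python
from itertools import product

# Usual construction of the De Bruijn directed graph dB(delta, D): vertices are
# the delta**D words of length D; there is an edge from u = u1...uD to every
# word u2...uD c (one edge per letter c). Illustrative sketch only.
def de_bruijn(delta, D):
    alphabet = [str(a) for a in range(delta)]
    vertices = ["".join(w) for w in product(alphabet, repeat=D)]
    edges = [(u, u[1:] + c) for u in vertices for c in alphabet]
    return vertices, edges

# de_bruijn(2, 2) gives the 4-vertex graph dB(2,2) of Fig. 3.7: every vertex has
# input degree 2 and output degree 2 (with self-loops at 00 and 11).
```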
(1) when r = 1, 2, . . . , D do
begin synchronous round
(2) for each x ∈ c_outi do send MSG (receivedi ) on channeli [x] end for;
(3) receivedi ← ∅;
(4) for each x ∈ c_ini do
(5) wait (MSG (rec) received on channeli [x]);
(6) receivedi ← receivedi ∪ rec
(7) end for
end synchronous round
(8) end when;
(9) outi ← F (receivedi );
(10) return(outi ).
Fig. 3.8 A generic algorithm for a De Bruijn’s communication graph (code for pi )
After D rounds, each process pi is such that receivedi = {(1, in1 ),
(2, in2 ), . . . , (n, inn )}. Each process can consequently compute F (receivedi ) and
return the corresponding result.
A Simple Example Let us consider the communication graph dB(2,2) (the one
described in the middle of Fig. 3.7). We have the following:
• At the end of the first round:
– The process labeled 00 is such that received00 = {(00, in00 ), (10, in10 )}.
– The process labeled 10 is such that received10 = {(01, in01 ), (11, in11 )}.
• At the end of the second round, the process labeled 01 is such that received01
contains the values of received00 and received10 computed at the previous round.
We consequently have received01 = {(00, in00 ), (10, in10 ), (01, in01 ), (11, in11 )}.
If the function F () is associative and commutative, a process can compute
to_send = F (receivedi ) at the end of each round, and send this value instead of
receivedi during the next round (line 2). Merging of files, max(), min(), and + are
examples of such functions.
3.6 Summary
This chapter has presented a general framework to compute global functions on a set
of processes which are the nodes of a network. The main features of this framework
are that all processes execute the same local algorithm, and no process learns more
than what it needs to compute F . Moreover, the knowledge of the diameter D is not
necessary and the time complexity is 2(D + 1), while the total number of messages
is upper bounded by 2e(D + 2), where e is the number of communication channels.
• The general round-based framework presented in Sect. 3.2 and its improvement
presented in Sect. 3.4 are due to J.-Cl. Bermond, J.-Cl. König, and M. Ray-
nal [48].
• The algorithm that computes the cut vertices of a communication graph is due to
J.-Cl. Bermond and J.-Cl. König [47]. Other distributed algorithms determining
cut vertices have been proposed (e.g., [187]).
• The tradeoff between the number of rounds and the number of messages is ad-
dressed in [223, 243] and in the books [292, 319].
• The use of the general framework in regular directed networks has been inves-
tigated in [46]. Properties of regular networks (such as hypercubes, De Bruijn’s
graphs, and Kautz’s graphs) are presented in [45, 49].
• A general technique for network synchronization is presented in [27].
• A graph problem is local if it can be solved by a distributed algorithm with time
complexity smaller than D (the diameter of the corresponding graph). The inter-
ested reader will find in [236] a study on the locality of the graph coloring problem
in rings, regular trees of radius r, and n-vertex graphs with largest degree Δ.
3.8 Problem
1. Adapt the general framework presented in Sect. 3.2 to communication graphs in
which the channels are unidirectional. It is, however, assumed that the graphs are
strongly connected (there is a directed path from any process to any process).
Chapter 4
Leader Election Algorithms
This chapter is on the leader election problem. Electing a leader consists for the
processes of a distributed system in selecting one of them. Usually, once elected, the
leader process is required to play a special role for coordination or control purposes.
Leader election is a form of symmetry breaking in a distributed system. After
showing that no leader can be elected in anonymous regular networks (such as
rings), this chapter presents several leader election algorithms with a special focus
on non-anonymous ring networks.
Let each process pi be endowed with two local Boolean variables electedi and
donei , both initialized to false. (Let us recall that i is the index of pi , i.e., a nota-
tional convenience that allows us to distinguish processes. Indexes are not known by
the processes.) The Boolean variables electedi are such that eventually exactly one
of them becomes true, while each Boolean variable donei becomes true when the
corresponding process pi learns that a process has been elected. More formally, the
election problem is defined by the following safety and liveness properties, where
var_i^τ denotes the value of the local variable vari at time τ .
• Safety property:
– (∀ i, τ : elected_i^τ ⇒ (∀ τ′ ≥ τ : elected_i^{τ′})) ∧ (∀ i, τ : done_i^τ ⇒ (∀ τ′ ≥ τ : done_i^{τ′})).
– ∀ i, j, τ, τ′ : (i ≠ j ) ⇒ ¬(elected_i^τ ∧ elected_j^{τ′}).
– ∀ i, τ : done_i^τ ⇒ (∃ j, τ′ ≤ τ : elected_j^{τ′}).
The first property states that the local Boolean variables electedi and donei are
stable (once true, they remain true forever). The second property states that at
most one process is elected, while the third property states that a process cannot
learn that the election has terminated while no process has yet been elected.
• Termination property:
– ∃ i, τ : elected_i^τ .
– ∀ i : ∃ τ : done_i^τ .
This liveness property states that a process is eventually elected, and this fact is
eventually known by all processes.
This section assumes that the processes have no identities. Said differently, there is
no way to distinguish a process pi from another process pj . It follows that all the
processes have the same number of neighbors, the same code, and the same initial
state (otherwise these features could be considered as their identities). The next
theorem shows that, in such an anonymity context, there is no solution to the leader
election problem. For simplicity reasons, the theorem considers the case where the
network is a ring.
Basic Assumptions Due to the previous theorem, the rest of this chapter assumes
that each process pi has an identity idi , and that distinct processes have different
identities (i.e., i ≠ j ⇒ idi ≠ idj ). Moreover, it is assumed that identities can be
compared with the help of the operators <, =, >.
Basic Principles of the Election Algorithms The basic idea of election algo-
rithms consists in electing the process whose identity is an extremum (the greatest
identity or the smallest one) among the set of all processes or a set of candidate
processes. As no two processes have the same identity and all identities can be com-
pared, there is a single extremum identity.
The code of the algorithm is described in Fig. 4.1. In addition to its identity idi , and
the Boolean variables electedi and donei , each process pi manages another Boolean
denoted parti (and initialized to false), and a variable leaderi that will contain the
identity of the elected process.
The processes that launch the election are the processes pi that receive the exter-
nal message START () while parti = false. When this occurs, pi becomes a partici-
pant and sends the message ELECTION (idi ) on its single outgoing channel (line 1).
When a process pi receives a message ELECTION (id) on its single input channel,
there are three cases. If id > idi , pi cannot be the winner, and it forwards the mes-
sage it has received (line 2). If id < idi , if not yet done, pi sends the message
ELECTION (idi ) (line 3). Finally, if id = idi , its message ELECTION (idi ) has visited all the
processes, and consequently idi is the highest identity. So, pi is the elected process.
Hence, it informs the other processes by sending the message ELECTED (idi ) on its
outgoing channel (line 4).
When a process pi receives a message ELECTED (id), it learns both the identity of
the leader and the fact that the election is terminated (line 5). Then, it forwards the
message ELECTED (id) so that all processes learn the identity of the elected leader.
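Since Fig. 4.1 itself is not reproduced here, the following is a minimal sketch of the behavior just described, written in Python rather than in the book's notation; the function name ring_election, the synchronous round-by-round delivery, and the assumption that every process receives a START () message are simplifications of this sketch.

# Sketch of a Chang-Roberts-style election on a unidirectional ring.
# ids[i] is the (distinct) identity of process p_i; p_i sends to p_{(i+1) mod n}.
def ring_election(ids):
    n = len(ids)
    leader = [None] * n                      # leader_i, learned from ELECTED()
    inbox = [[] for _ in range(n)]
    for i in range(n):                       # every process starts the election
        inbox[(i + 1) % n].append(("ELECTION", ids[i]))
    while any(l is None for l in leader):
        nxt_inbox = [[] for _ in range(n)]
        for i in range(n):
            nxt = (i + 1) % n
            for kind, idv in inbox[i]:
                if kind == "ELECTION":
                    if idv > ids[i]:                         # forward (line 2)
                        nxt_inbox[nxt].append(("ELECTION", idv))
                    elif idv == ids[i]:                      # full turn: p_i is elected (line 4)
                        leader[i] = idv
                        nxt_inbox[nxt].append(("ELECTED", idv))
                    # idv < ids[i]: the message is discarded
                else:                                        # ELECTED (line 5)
                    if leader[i] is None:
                        leader[i] = idv
                        nxt_inbox[nxt].append(("ELECTED", idv))
        inbox = nxt_inbox
    return leader

print(ring_election([3, 7, 2, 9, 5]))        # every entry is 9, the greatest identity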
In all cases, the algorithm sends n messages ELECTED (), which are sent one after
the other. Hence, in the following we focus only on the cost due to the messages
ELECTION (). To compute the time complexity, we assume that each message takes
one time unit (i.e., all messages take the worst transfer delay which is defined as
being equal to one time unit).
The best case occurs when only the process pi with the highest identity receives
a message START (). It is easy to see that the time complexity is n (the message
Fig. 4.2 Worst identity distribution for message complexity
ELECTION (idi ) sent by pi is passed from each process to its neighbor on the ring
before returning to pi ).
The worst case occurs when the process pj with the second highest identity (a)
is the only process that receives a message START (), and (b) follows on the ring the
process pi that has the highest identity. Hence, pj is at distance (n − 1) from pi on the ring.
The message ELECTION (idj ) takes (n − 1) time units before attaining pi , and then
the message ELECTION (idi ) takes n time units to travel around the ring. Hence, (2n −
1) time units are required. It follows that, whatever the case, an election requires
between n and (2n − 1) time units.
Best Case and Worst Case The best case for the number of messages is the same
as for the time complexity, which happens when only the process with the high-
est identity receives a message START (). The algorithm then gives rise to exactly
n messages ELECTION (). The worst case is when (a) each process pi receives a
message START () while parti = false, (b) each message takes one time unit, and (c)
the processes are ordered in the ring as follows: first, the process with the highest
identity, then the one with the second highest identity, etc., until the process with
the smallest identity (Fig. 4.2 where idn is the highest identity, etc., until id1 , which
is the smallest one).
It follows that the message START () received by the process with the smallest
identity gives rise to one message ELECTION (), the one with the second smallest
identity gives rise to two messages ELECTION () (one from itself to the process with
the smallest identity, plus another one from this process to the process with the
highest identity), and so on until the process with the highest identity whose message
START () gives rise to n messages ELECTION (). It follows that the total number of
messages is 1 + 2 + ··· + n = n(n + 1)/2, i.e., O(n²).
Average Case To compute the message complexity in the average case, let P (i, k)
be the probability that the message ELECTION () sent by the process px with the
ith smallest identity is forwarded k times. Assuming that the direction of the ring
is clockwise (as in Fig. 4.2), P (i, k) is the probability that the k − 1 clockwise
neighbors of px (the processes that follow px on the ring) have an identity smaller
than idx and the kth clockwise neighbor of px has an identity greater than idx . Let
us recall that there are (i − 1) processes with an identity smaller than idx and (n − i)
processes
with an identity greater than idx .
Let C(a, b) denote the number of ways of choosing b elements in a set of a elements
(i.e., the binomial coefficient). We have

P (i, k) = ( C(i − 1, k − 1) / C(n − 1, k − 1) ) × ( (n − i) / (n − k) ).
Since there is a single message that makes a full turn on the ring (the one carrying the
highest identity), let us consider each of the (n − 1) other messages. The expected
number of passes of the ith message (where i denotes the rank of the identity of the
corresponding ELECTION () message) is then, for i ≠ n,

E_i = Σ_{k=1}^{n−1} k P (i, k).

Counting separately the n forwardings of the message carrying the highest identity, the expected total number of ELECTION () messages is

E = n + Σ_{i=1}^{n−1} Σ_{k=1}^{n−1} k P (i, k),

which can be shown to be equal to

E = n + Σ_{k=1}^{n−1} n/(k + 1) = n (1 + 1/2 + 1/3 + ··· + 1/n),

i.e., the average number of ELECTION () messages is O(n log n).
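As a quick numerical illustration (this snippet is not part of the book's analysis), the harmonic-sum formula can be evaluated for a few ring sizes and compared to the n(n + 1)/2 worst case; the average cost grows only as O(n log n).

def expected_election_messages(n):
    # n * H_n, where H_n = 1 + 1/2 + ... + 1/n is the n-th harmonic number
    return n * sum(1.0 / k for k in range(1, n + 1))

for n in (8, 64, 512):
    print(n, round(expected_election_messages(n), 1), n * (n + 1) // 2)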
The previous algorithm always elects the process with the maximal identity. It is
possible to modify this algorithm to obtain an algorithm that elects the process with
the highest identity among the processes whose participation in the algorithm is
due to the reception of a START () message (this means that, whatever its identity, a
process that does not receive a START () message, or that receives such a message
only after having received an ELECTION () message, cannot be elected).
The corresponding algorithm is depicted in Fig. 4.3. Each process pi manages
a local variable idmaxi which contains the greatest identity of a competing process
seen by pi . Initially, idmaxi is equal to 0.
Fig. 4.3 A variant of Chang and Robert’s election algorithm (code for pi )
When it receives a START () message (if it ever receives one), a process pi con-
siders and processes it only if idmaxi = 0. Moreover, when it receives a message
ELECTION (id), a process pi discards it if id < idmaxi (this is because pi has seen a
competing process with a higher identity). The rest of the algorithm is similar to the
algorithm of Fig. 4.1.
Context This section considers a bidirectional ring. As before, each process has
a left neighbor and a right neighbor, but it can now send a message to, and receive
a message from, any of these neighbors. Given a process pi , the notations lefti and
righti are used to denote the channels connecting pi to its left neighbor and to its
right neighbor, respectively.
The notions of left and right are global: they are the same for all processes, i.e.,
going only to left allows a message to visit all processes (and similarly when going
only to right).
Principle The algorithm presented below is due to D.S. Hirschberg and J.B. Sin-
clair (1980). It is based on the following idea. The processes execute asynchronous
rounds. During each round, processes compete, and only the processes that win in
a round r are allowed to continue competing during round r + 1. During the first
round (denoted round 0), all processes compete.
A process pi , which is competing during round r, is a winner at the end of that
round if it has the largest identifier on the part of the ring that spans the 2^r pro-
cesses on its left side and the 2^r processes on its right side, i.e., in a continuous
neighborhood of 2^(r+1) + 1 processes. This is illustrated in Fig. 4.4.
If it has the highest identity, pi proceeds to the next round as a competitor. Oth-
erwise, it no longer competes to become leader. It follows that any two processes
that remain competitors after round r are at a distance d > 2^r (see Fig. 4.5, where
pi and pj are competitors at the end of round r). Said differently, after each round,
the number of processes that compete to become leader is divided at least by two.
If idi > id, the process whose identity is id cannot be elected (this is because
its identity is not the greatest one). Hence, pi stops the progress of the message
ELECTION (id, −, −) (line 4). Finally, if idi = id, the message ELECTION (id, r, −)
sent by pi has visited all the processes without being stopped. It follows that idi is
the greatest identity, and pi consequently sends a message ELECTED (idi ) (line 5),
which will inform all processes that the election is over (lines 13–14).
When a process pi receives a message REPLY (id, r), pi forwards it to the appro-
priate (right or left) neighbor if it is not the final destination process of this message
(line 8). If it is the final destination process (idi = id), and this is the second mes-
sage REPLY (id, r) (i.e., the message coming from the other side of the ring), pi
learns that it has the highest identity in both its left and right neighborhoods of size
2^r (line 9). It consequently proceeds to the next round by sending to its left and
right neighbors the message ELECTION (idi , r + 1, 1) (line 10), which starts its next
round.
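The following is a round-level sketch of this principle (a simplification, not the book's algorithm): instead of exchanging ELECTION () and REPLY () messages, each remaining competitor directly checks whether its identity is the largest in its neighborhood of radius 2^r; the function name hs_round_simulation and the synchronous model are assumptions of the sketch.

def hs_round_simulation(ids):
    # ids[i] is the identity of p_i on a ring of n = len(ids) processes.
    # A process remains a competitor after round r iff its identity is the
    # largest among the 2**r processes on each of its sides.
    n = len(ids)
    competitors, r = set(range(n)), 0
    while len(competitors) > 1:
        survivors = set()
        for i in competitors:
            span = [ids[(i + d) % n] for d in range(-2**r, 2**r + 1)]
            if ids[i] == max(span):
                survivors.add(i)
        competitors, r = survivors, r + 1
    return ids[competitors.pop()]            # identity of the elected process

print(hs_round_simulation([3, 7, 2, 9, 5]))  # 9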
Time Complexity To simplify, let us assume that n = 2k , and let us consider the
best case, namely, all the processes start simultaneously. Moreover, as usual for time
complexity, it is assumed that each message takes one time unit. The process with
the highest identity will execute from round 0 until round k, and a round r will
take 2^r time units. By summing the time taken by all rounds, the process with the
highest identity will be elected after at most 2(1 + 2^1 + 2^2 + ··· + 2^r + ··· + 2^k) time
units (the factor 2 is due to a message ELECTION () in one direction plus a message
REPLY () in the other direction). It follows that, in the best case, the time complexity
is upper bounded by 2(2^(k+1) − 1) = 2(2n − 1) = 4n − 2, i.e., O(n).
The worst case time complexity, which is also O(n), is addressed in Exercise 3.
This means that the time complexity is linear with respect to the size of the ring.
Context This section considers a unidirectional ring network in which the chan-
nels are FIFO (i.e., on each channel, messages are received in their sending order).
As in the previous section, each process pi knows only its identity idi and the fact
that no two processes have the same identity. No process knows the value n.
The novel idea is the way the processes simulate a bidirectional ring. Let us
consider three processes pi , pj , and pk such that pi and pk are the left and the
right neighbor of pj on the ring (Fig. 4.7). Moreover, let us assume that a process
receives messages from its right neighbor and sends messages to its left neighbor.
During the first round, each process sends its identity to its left neighbor, and
after it has received the identity idx of its right neighbor px , it forwards idx to its
left neighbor. Hence, when considering Fig. 4.7, pi receives first idj and then idk ,
which means that it knows three identities: idi , idj , and idk . It follows that it can
play the role of pj . More precisely
• If idj > max(idi , idk ), pi considers idj as the greatest identity it has seen and
progresses to the next round as a competitor on behalf of idj .
• If idj < max(idi , idk ), pi stops being a competitor and its role during the next
rounds will be to forward to the left the messages it receives from the right.
It is easy to see that, if pi remains a competitor (on behalf of idj ) during the
second round, its left neighbor ph and its right neighbor pj can no longer remain
competitors (on behalf of idi , and on behalf of idk , respectively). This is because
idj > max(idi , idk ) ⇒ (¬(idi > max(idh , idj )) ∧ ¬(idk > max(idj , idℓ ))).
It follows that, during the second round, at most half of the processes remain com-
petitors. During that round, the processes that are no longer competitors only for-
ward the messages they receive, while the processes that remain competitors do the
same as during the first round, except that they consider the greatest identity they
have seen so far instead of their initial identity. This is illustrated in Fig. 4.8, where it
is assumed that idj > max(idi , idk ), idℓ < max(idk , idm ), and idm > max(idℓ , idp ),
and consequently pi competes on behalf of idj , and pℓ competes on behalf of idm .
The processes that remain competitors during the second round define a logical ring
with at most n/2 processes. This ring is denoted with dashed arrows in the figure.
Finally, the competitor processes that are winners during the second round pro-
ceed to the third round, etc., until a round with a single competitor is attained, which
occurs after at most 1 + log2 n rounds.
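A round-level sketch of this principle is given below (again a simplification, not the code of Fig. 4.9): at each round, the identity carried by a competitor survives if and only if it is a local maximum among the identities carried by consecutive competitors; the direct three-way comparison replaces the ELECTION (1, −)/ELECTION (2, −) messages, and the function name is illustrative.

def dkr_round_simulation(ids):
    carried = list(ids)                      # identities the competitors act for
    while len(carried) > 1:
        m, survivors = len(carried), []
        for j in range(m):
            left, mid, right = carried[(j - 1) % m], carried[j], carried[(j + 1) % m]
            if mid > left and mid > right:   # local maximum: still competing
                survivors.append(mid)
        carried = survivors                  # at most half of the identities survive
    return carried[0]                        # the greatest identity

print(dkr_round_simulation([3, 7, 2, 9, 5])) # 9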
Fig. 4.9 Dolev, Klawe, and Rodeh’s election algorithm (code for pi )
As in Sect. 4.3, it is assumed that each process receives an external message START ()
before any message sent by the algorithm. The algorithm is described in Fig. 4.9.
Local Variables In addition to donei , leaderi , and electedi , each process pi man-
ages three local variables.
• competitori is a Boolean which indicates if pi is currently competing on behalf of
some process identity or is only relaying messages. The two other local variables
are meaningful only when competitori is equal to true.
• maxidi is the greatest identity known by pi .
• proxy_fori is the identity of the process for which pi is competing.
• If the message ELECTION (1, id) is such that id = maxidi , then it has made a full
turn on the ring, and consequently maxidi is the greatest identity. In this case,
pi sends the message ELECTED (maxidi , idi ) (line 6), which is propagated on the
ring to inform all the processes (lines 16–17).
• If the message ELECTION (1, id) is such that id ≠ maxidi , pi copies id into proxy_fori ,
and forwards the message ELECTION (2, id) on its outgoing channel.
When a process pi receives a message ELECTION (2, id), it forwards it (as pre-
viously) on its outgoing channel if it is no longer a competitor (lines 9–10). If it
is a competitor, pi checks if proxy_fori > max(id, maxidi ), i.e., if the identity of
the process it has to compete for (namely, proxy_fori ) is greater than both maxidi
(the identity of the process on behalf of which pi was previously competing) and
the identity id it has just received (line 11). If it is the case, pi updates maxidi
and starts a new round (line 12). Otherwise, proxy_fori is not the highest identity.
Consequently, as pi should compete for an identity that cannot be elected, it stops
competing (line 13).
Message Complexity During each round, except the last one, each process sends
two messages: a message ELECTION (1, −) and a message ELECTION (2, −). More-
over, there are only ELECTION (1, −) messages during the last round. As we have
seen, there are at most log2 n + 1 rounds. It follows that the number of messages
ELECTION (1, −) and ELECTION (2, −) sent by the algorithm is at most 2n log n + n.
When the Indexes Are the Identities When the identity of a process is its index,
and both this fact and n are known by all processes, the leader election problem
is trivial. It is sufficient to statically select an index and define the corresponding
process as the leader. While this works, it has a drawback, namely, the same process
is always elected.
There is a simple way to solve this issue, as soon as the processes can use random
numbers. Let random(1, n) be a function that returns a random number in {1, . . . , n}
each time it is called.
The algorithm described in Fig. 4.10 is a very simple randomized election al-
gorithm. Each process first obtains a random number and sends it to all. Then, it
waits until it has received all random numbers. When this occurs, the processes can
compute the same process identity, and consistently elect the corresponding process
(line 3). The costs of the algorithm are O(1) time units and O(n²) messages.
The probability that a given process px is elected can be computed from the
specific probability law associated with the function random().
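A minimal sketch of such a randomized election follows (Python, not the book's notation); the way the n random values are combined into a single index, here a sum modulo n, is just one possible deterministic rule for line 3, not necessarily the one of the figure.

import random

def randomized_election(n, seed=None):
    # Each process p_i (i in 1..n) draws random(1, n) and sends it to all;
    # after receiving the n values, every process applies the same rule.
    rng = random.Random(seed)
    values = [rng.randrange(1, n + 1) for _ in range(n)]
    return 1 + (sum(values) % n)             # index of the elected process

print(randomized_election(5, seed=42))       # some index in 1..5, the same at every process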
4.6 Summary
This chapter was devoted to election algorithms on a ring. After a simple proof that
no such algorithm exists in anonymous ring networks, the chapter presented three
algorithms for non-anonymous rings. Non-anonymous means here that (a) each pro-
cess pi has an identity idi , (b) no two processes have the same identity, (c) identities
can be compared, (d) initially, a process knows only its identity, and (e) no process
knows n (total number of processes).
Interestingly, this chapter has presented two O(n log n) election algorithms that
are optimal. The first, due to Hirschberg and Sinclair, is suited to bidirectional rings,
while the second, due to Dolev, Klawe, and Rodeh, is suited to both unidirectional
rings and bidirectional rings (this is because a bidirectional ring can always be con-
sidered as if it was unidirectional). This algorithm shows that, contrary to what one
could think, the fact that a ring is unidirectional or bidirectional has no impact on
its optimal message complexity from an asymptotic (big-O) point of view.
1. Extend the proof of Theorem 2 so that it works for any anonymous regular net-
work.
2. Considering the variant of Chang and Robert’s election algorithm described in
Fig. 4.3, and assuming that k processes send an ELECTION () message at line 1,
what is the maximal number of ELECTION () messages sent (at lines 1 and 2)
during an execution of the algorithm?
Answer: nk − k(k − 1)/2.
3. Show that the worst case for the time complexity of Hirschberg and Sinclair’s
election algorithm is when n is 1 more than a power of 2. Show that, in this case,
the time complexity is 6n − 6.
Solution in [185].
Chapter 5
Mobile Objects Navigating a Network
A mobile object is an object (such as a file or a data structure) that can be accessed
sequentially by different processes. Hence, a mobile object is a concurrent object
that moves from process to process in a network of processes.
When a process momentarily owns a mobile object, the process can use the ob-
ject as if it was its only user. It is assumed that, after using a mobile object, a pro-
cess eventually releases it, in order that the object can move to another process that
requires it. So what has to be defined is a navigation service that provides the pro-
cesses with two operations denoted acquire_object() and release_object() such that
any use of the object by a process pi is bracketed by an invocation of each of these
operations, namely: acquire_object(); use of the object by pi ; release_object().
As already noticed, in order that the state of the object remains always consistent,
it is required that the object be accessed by a single process at a time, and the object
has to be live in the sense that any process must be able to obtain the mobile object.
This is captured by the two classical safety and liveness properties which instantiate
as follows for this problem (where the sentences “the object belongs to a process”
or “a process owns the object” means that the object is currently located at this
process):
• Safety: At any time, the object belongs to at most one process.
• Liveness: Any process that invokes acquire_object() eventually becomes the
owner of the object.
Let us notice that a process pi invokes release_object() after having obtained
and used the object. Hence, it needs to invoke again acquire_object() if it wants to
use the mobile object again.
The particular case where the ownership of the mobile object gives its owner a
particular right (e.g., access to a resource) is nothing more than an instance of the
mutual exclusion problem. In that case, the mobile object is a stateless object, which
is usually called a token. The process that has the token can access the resource,
while the other processes cannot. Moreover, any process can require the token in
order to be eventually granted the resource.
Token-based mutual exclusion algorithms define a family of distributed mutual
exclusion algorithms. We will see in Chap. 10 another family of distributed mutual
exclusion algorithms, which are called permission-based algorithms.
• When the home process receives a message RELEASE _ OBJECT (i), it stores the
new value of the object (if any), and suppresses the request of pi from its local
queue. Then, if the queue is not empty, p sends the object to the first process of
the queue.
This three-way handshake algorithm is illustrated in Fig. 5.1, with two processes
pi and pj . When the home process p receives the message REQUEST (j ), it adds
it to its local queue, which already contains REQUEST (i). Hence, the home process
will answer this request when it receives the message RELEASE _ OBJECT (i), which
carries the last value of the object as modified by pi .
Discussion Let us first observe that the home process p can manage its internal
queue on a FIFO basis or use another priority discipline. This actually depends on
the application.
While they may work well in small systems, the main issue of home-based al-
gorithms lies in their poor ability to cope with scalability and locality. If the object
is heavily used, the home process can become a bottleneck. Moreover, always re-
turning the object to its home process can be inefficient (this is because when a
process releases the object, it could be sent to its next user without passing through
its home).
The algorithms that are described in this chapter are not home-based and do not
suffer the previous drawback. They all have the following noteworthy feature: If,
when a process pi releases the object, no other process wants to acquire it, the
object remains at its last user pi . It follows that, if the next user is pi again, it does
not need to send messages to obtain the object. Consequently, no message is needed
in this particular case.
The three algorithms that are presented implicitly consider that the home of the
object is dynamically defined: the home of the mobile object is its last user. These
algorithms differ in the structure of the underlying network they assume. As we will
see, their cost and their properties depend on this structure. They mainly differ in
the way the mobile object and the request messages are routed in the network.
As there is no statically defined home notion, the main issue that has to be solved
is the following: When a process pi releases the object, to which process has it to
send the mobile object?
A Control Data Inside the Mobile Object First, the mobile object is enriched
with a control data, denoted obtained[1..n], such that obtained[i] counts the number
of times that pi has received the object. Let us notice that it is easy to ensure that,
for any i, obtained[i] is modified only when pi has the object. It follows that the
array obtained[1..n] always contains exact values (and no approximate values). This
array is initialized to [0, . . . , 0].
Local Data Structure In order to know which processes are requesting the object,
a process pi , which does not have the object and wants to acquire it, sends a message
REQUEST (i) to every other process to inform them that it is interested in the object.
Moreover, each process pi manages a local array request_byi [1..n] such that
request_byi [j ] contains the number of REQUEST () messages sent by pj , as known
by pi .
Determining Requesting Processes Let pi be the process that has the mobile
object. The set of processes that, from its point of view, are requesting the token
can be locally computed from the arrays request_byi [1..n] and obtained[1..n]. It is
the set Si including the processes pk such that (to pi ’s knowledge) the number of
REQUEST () messages sent by pk is higher than the number of times pk has received
the object (which is saved in obtained[k]), i.e., it is the set
Si = {k | request_byi [k] > obtained[k]}.
This provides pi with a simple predicate (Si = ∅) that allows it to know if pro-
cesses are requesting the mobile object. This predicate can consequently be used to
ensure that, if processes want to acquire the object, eventually there are processes
that obtain it (deadlock-freedom property).
Ensuring Starvation-Freedom Let us consider the case where all processes have
requested the object and p1 has the object. When p1 releases the object, it sends
it to p2 , and just after sends a message REQUEST (1) to obtain the object again. It is
possible that, when it releases the object, p2 sends it to p1 , and just after sends a
message REQUEST (2) to obtain the object again. This scenario can repeat forever,
and, while processes p1 and p2 obtain the mobile object infinitely often, the
other processes never obtain it.
A simple way to solve this problem (and consequently obtain the starvation-
freedom property) consists in establishing an order on the processes of Si that pi
has to follow when it releases the object. The order, which depends on i, is the
following one for pi :
i + 1, i + 2, . . . , n, 1, . . . , i − 1.
The current owner of the object pi sends the object to the first process of this list
that belongs to Si . This means that, if (i + 1) ∈ Si , pi sends the object to pi+1 .
Otherwise, if (i + 2) ∈ Si , pi sends the object to pi+2 , etc. It is easy to see that,
as no process can be favored, no process can be missed, and, consequently, any
requesting process will eventually receive the mobile object.
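The following small helper (an illustration only; the name next_owner is not from the book) shows how the current owner pi can scan this order to pick the destination of the object:

def next_owner(i, S, n):
    # Scan i+1, i+2, ..., n, 1, ..., i-1 (identities are 1..n) and return the
    # first requesting process found in S, or None if S is empty.
    for d in range(1, n):
        k = (i - 1 + d) % n + 1
        if k in S:
            return k
    return None

print(next_owner(3, {1, 5}, 5))   # 5: p_5 is met before p_1 when starting from p_3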
Structural View The structural view of the navigation algorithm at each process
is described in Fig. 5.2. The module associated with each process pi contains the
previous local variables and four pieces of code, namely:
• The two pieces of code for the algorithms implementing the operations acquire_
object() and release_object(), which constitute the interface with the application
layer.
• Two additional pieces of code, each associated with the processing of a mes-
sage, namely the message REQUEST () and the message OBJECT (). As we have
seen, the latter contains the mobile object itself plus the associated control data
obtained[1..n].
The Algorithm The navigation algorithm is described in Fig. 5.3. Each piece of
code is executed atomically, except the wait() statement. This means that, at each
process pi , the execution of lines 1–4, lines 7–14, line 15, and lines 16–19, are
mutually exclusive. Let us notice that these mutual exclusion rules do not prevent
a process pi , which is currently using the mobile object, from being interrupted to
execute lines 16–19 when it receives a message REQUEST ().
When a process pi invokes acquire_object(), it first sets interestedi to true
(line 1). If it is the last user of the object, (object_presenti is then true, line 2),
pi returns from the invocation and uses the object. If it does not have the object, pi
increases request_byi [i] (line 3), sends a REQUEST (i) message to each other process
(line 4), and waits until it has received the object (lines 5 and 15).
When a process which was using the object invokes release_object(), it first in-
dicates that it is no longer interested in the mobile object (line 7), and updates the
global control variable obtained[i] to the value request_byi [i] (line 8). Then, start-
ing from pi+1 (line 9), pi looks for the first process pk that has requested the mobile
object more often than the number of times it acquired the object (line 10) (let us
notice that pi is then such that ¬ interestedi ∧ object_presenti ).
If there is such a process, pi sends it the object (lines 11–12) and returns from
the invocation of release_object(). If, to its local knowledge, no process wants the
object, pi keeps it and returns from the invocation of release_object() (let us notice
that pi is then such that ¬ interestedi ∧ object_presenti ).
As already seen, pi sets object_presenti to the value true when it receives the ob-
ject. A process can receive the mobile object only if it has previously sent a request
message to obtain it.
Finally, when a process pi receives a message REQUEST (k), it first increases
accordingly request_byi [k] (line 16). Then, if it has the object and is not using it, pi
sends it to pk by return (lines 17–19).
operation acquire_object() is
(1) interestedi ← true;
(2) if (¬ object_presenti ) then
(3) request_byi [i] ← request_byi [i] + 1;
(4) for k ∈ {1, . . . , n} \ {i} do send REQUEST (i) to pk end for;
(5) wait (object_presenti )
(6) end if.
operation release_object() is
(7) interestedi ← false;
(8) obtained[i] ← request_byi [i];
(9) for k from i + 1 to n and then from 1 to i − 1 do
(10) if (request_byi [k] > obtained[k]) then
(11) object_presenti ← false;
(12) send OBJECT () to pk ; exit loop
(13) end if
(14) end for.
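The handlers for the messages OBJECT () and REQUEST () (lines 15–19 of the figure) are not reproduced above. The class below is a hedged Python rendering of the whole per-process behavior, derived from the description in the text; the class name, the send(dest, msg) callback, and the message encoding are assumptions of this sketch, not the book's notation.

class MobileObjectSite:
    # One process p_i (i in 1..n) of the complete-network navigation algorithm.
    def __init__(self, i, n, has_object, send):
        self.i, self.n, self.send = i, n, send
        self.interested = False
        self.object_present = has_object
        self.request_by = [0] * (n + 1)                        # index 0 unused
        self.obtained = [0] * (n + 1) if has_object else None  # travels with the object

    def acquire_object(self):                              # lines 1-6
        self.interested = True
        if not self.object_present:
            self.request_by[self.i] += 1
            for k in range(1, self.n + 1):
                if k != self.i:
                    self.send(k, ("REQUEST", self.i))
            # the caller then waits until object_present becomes true (line 5)

    def release_object(self):                              # lines 7-14
        self.interested = False
        self.obtained[self.i] = self.request_by[self.i]
        for d in range(1, self.n):                         # order i+1, ..., n, 1, ..., i-1
            k = (self.i - 1 + d) % self.n + 1
            if self.request_by[k] > self.obtained[k]:
                self.object_present = False
                self.send(k, ("OBJECT", self.obtained))
                self.obtained = None
                break                                      # exit loop (line 12)

    def on_request(self, k):                               # lines 16-19
        self.request_by[k] += 1
        if self.object_present and not self.interested:
            self.object_present = False
            self.send(k, ("OBJECT", self.obtained))
            self.obtained = None

    def on_object(self, obtained):                         # line 15
        self.obtained = obtained
        self.object_present = True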
Cost of the Algorithm The number of messages needed for one use of the mobile
object is 0 when the object is already present at the process that wants to use it, or n
(n − 1 request messages plus the message carrying the object).
A message REQUEST () carries a process identity. Hence, its size is O(log2 n).
The time complexity is 0 when the object is already at the requesting process.
Otherwise, let us first observe that the transit time of request messages that
travel while the object is in use does not have to be counted. This is because, whatever their
speed, as long as the object is in use, these transit times cannot delay the transfer of the object.
Assume that all messages take one time unit; it follows that the time complexity lies
between 1 and 2. One time unit suffices under heavy load: when a process pi releases
the object, the test of line 10 is satisfied, and the object takes one time unit before
being received by its destination process. Two time units are needed under light load:
One time unit is needed for the request message to attain the process that has the
object, and one more time unit is needed for the object to travel to the requesting
process (in that case, the object is sent at line 18).
Are Early Updates Good? When a process pi receives the object (which car-
ries the control data obtained[1..n]), it is possible that some entries k are such
that obtained[k] > request_byi [k]. This is due to asynchrony, and is independent
of whether or not the channels are FIFO. This occurs when some request messages
are very slow, as depicted in Fig. 5.4, where the path followed by the mobile object
is indicated with dashed arrows (the dashed arrow on a process is the period during
which this process uses the mobile object). Moreover, as shown in the figure, the
REQUEST () message from pj to pi is particularly slow with respect to the object.
Hence the question: when a process pi receives the object, is it interesting for pi
to benefit from the array obtained[1..n] to update its local array request_byi [1..n],
i.e., to execute at line 15 the additional statement: for each k ∈ {1, . . . , n}, request_byi [k] ← max(request_byi [k], obtained[k])?
Due to the fact that each REQUEST () message has to be counted exactly once, this
early update demands other modifications so that the algorithm remains correct. To
that end, line 4 has to be replaced by
(4′) for k ∈ {1, . . . , n} \ {i} do send REQUEST (i, request_byi [i]) to pk end for,
and, when a message REQUEST (k, rnb) is received, line 16 has to be replaced by
(16′) request_byi [k] ← max(request_byi [k], rnb).
It follows that trying to benefit from the array obtained[1..n] carried by the mo-
bile object to early update the local array request_byi [1..n] requires us to add (a)
sequence numbers to REQUEST () messages, and (b) associated update statements.
These modifications make the algorithm less efficient and a little bit
more complicated. Hence, contrary to what one could a priori hope, early updates
are not good for this navigation algorithm.
The algorithm presented in this section has only bounded variables. It is based on a statically defined span-
ning tree of the network, and each process communicates only with its neighbors
in the tree; hence the algorithm is purely local. This algorithm is due to K. Ray-
mond (1989).
As just indicated, the algorithm considers a statically defined spanning tree of the
network. Only the channels of this spanning tree are used by the algorithm.
Tree Invariant Initially, the process that has the mobile object is the root of the
tree, and each process has a pointer to its neighbor on its path to the root. This is
depicted in Fig. 5.5, where the mobile object is located at process pa .
This tree structure is the invariant maintained by the algorithm: A process always
points to its neighbor in the subtree containing the object. In that way, a process
always knows in which part of the tree the object is located.
Let us consider the process pc that wants to acquire the object. Due to the tree
orientation, it can send a request to its current parent in the tree, namely process
pb , which in turn can forward it to its parent, etc., until it attains the root of the
tree. Hence, the tree orientation allows requests to be propagated to the appropriate
part of the tree. When pa receives the request, it can send the object to pb , which
forwards it to pc . The object consequently follows the same path of the tree (in the
reverse order) as the one followed by the request. Moreover, in order to maintain the invariant
associated with the tree orientation, the mobile object reverses the direction of the
edges during its travel from the process where it was previously located (pa in the
figure) to its destination process (pc in the figure). This technique, which is depicted
in Fig. 5.6, is called edge reversal.
The Notion of a Proxy Let us consider process pb that receives a request for the
mobile object from its neighbor pd after it has received a request from pc (Fig. 5.6).
Hence, it has already sent a request to pa to obtain the object on behalf of pc .
Moreover, a process only knows its neighbors in the tree. Additionally, after having
received requests from its tree neighbors pc and pd , pb may require the object for
itself. How are these conflicts between the requests of pc , pd , and pb solved?
To that end, pb manages a local FIFO queue, which is initially empty. When
it receives the request from pc , it adds the request to the queue, and as the queue
contains a single request, it plays a proxy role, namely, it sends a request for itself
to its parent pa (but this request is actually for the process pc , which is at the head
of its local queue). This is a consequence of the fact that a process knows only its
neighbors in the tree (pa does not know pc ).
Then, when it receives the request from pd , pb adds it to its queue, and, as this
request is not the only one in the queue, pb does not send another request to pa .
When later pb receives the mobile object, it forwards it to the process whose request
is at the head of its queue (i.e., pc ), and suppresses its request from the queue.
Moreover, as the queue is not empty, pb continues to play its proxy role: it sends
to pc a request (for the process at the head of its queue, which is now pd ). This
is depicted in Fig. 5.7, where the successive states of the local queue of pb are
explicitly described.
The structure of the algorithm is the same as the one described in Fig. 5.2.
• queuei is the local queue in which pi saves the requests it receives from its neigh-
bors in the tree, or from itself. Initially, queuei is empty. If pi has d neighbors in
the tree, queuei will contain at most d process identities (one per incoming edge
plus one for itself). It follows that the bit size of queuei is bounded by d log2 n.
The following notations are used:
– queuei ← queuei + j means “add j at the end of queuei ”.
– queuei ← queuei − j means “withdraw j from queuei (which is at its head)”.
– head(queuei ) denotes the first element of queuei .
– |queuei | denotes the size of queuei , while ∅ is used to denote the empty queue.
• parenti contains the identity of the parent of pi in the tree. The set of local vari-
ables {parenti }1≤i≤n is initialized in such a way that they form a tree rooted at the
process where the object is initially located. The root of the tree is the process pk
such that parentk = k.
The Boolean object_presenti used in the previous algorithm is no longer
needed. This is because it is implicitly encoded in the local variable parenti . More
precisely, we have (parenti = i) ≡ object_presenti .
operation acquire_object() is
(1) interestedi ← true;
(2) if (parenti ≠ i) then
(3) queuei ← queuei + i;
(4) if (|queuei | = 1) then send REQUEST (i) to pparenti end if;
(5) wait (parenti = i)
(6) end if.
operation release_object() is
(7) interestedi ← false;
(8) if (queuei ≠ ∅) then
(9) let k = head(queuei ); queuei ← queuei − k;
(10) send OBJECT () to pk ; parenti ← k;
(11) if (queuei ≠ ∅) then send REQUEST (i) to pparenti end if
(12) end if.
When pi receives the mobile object, it considers the request at the head of queuei (line 20). If it is its own request, the
object is for it, and consequently pi becomes the new root of the spanning tree and
is the current user of the object (line 21). Otherwise, the object has to be forwarded
to the neighbor pk whose request was at the head of queuei (line 22), and pi has
to send a request to its new parent pk if queuei ≠ ∅ (line 23). (Let us observe that
lines 22–23 are the same as lines 10–11.)
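Only the operations acquire_object() and release_object() are reproduced above; the handlers for REQUEST () (lines 13–19) and OBJECT () (lines 20–23) are only described in the text. The class below is a hedged Python rendering of the complete per-process behavior of this tree-based algorithm; the class name and the send(neighbor, msg) callback are assumptions of the sketch, not the book's notation.

class TreeNavigationSite:
    # One process p_i of the spanning-tree navigation algorithm; parent == i
    # means that p_i currently holds the mobile object (it is the root).
    def __init__(self, i, parent, send):
        self.i, self.parent, self.send = i, parent, send
        self.interested = False
        self.queue = []                                    # FIFO queue of requester ids

    def acquire_object(self):                              # lines 1-6
        self.interested = True
        if self.parent != self.i:
            self.queue.append(self.i)
            if len(self.queue) == 1:
                self.send(self.parent, ("REQUEST", self.i))
            # the caller then waits until parent == i (line 5)

    def release_object(self):                              # lines 7-12
        self.interested = False
        if self.queue:
            k = self.queue.pop(0)
            self.send(k, ("OBJECT", None)); self.parent = k
            if self.queue:                                 # keep playing the proxy role
                self.send(self.parent, ("REQUEST", self.i))

    def on_request(self, j):                               # lines 13-19, as described in the text
        if self.parent == self.i and not self.interested and not self.queue:
            self.send(j, ("OBJECT", None)); self.parent = j
        else:
            self.queue.append(j)
            if self.parent != self.i and len(self.queue) == 1:
                self.send(self.parent, ("REQUEST", self.i))

    def on_object(self, _):                                # lines 20-23
        k = self.queue.pop(0)
        if k == self.i:
            self.parent = self.i                           # p_i becomes the root and the user
        else:
            self.send(k, ("OBJECT", None)); self.parent = k
            if self.queue:
                self.send(self.parent, ("REQUEST", self.i))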
On Messages As shown in Figs. 5.5 and 5.6, it follows that, during the object’s
travel in the tree to its destination process, the object reverses the direction of the
tree edges that have been traversed (in the other direction) by the corresponding
request messages.
Each message REQUEST () carries the identity of its sender (lines 4, 11, 18, 23).
Thus, the control information carried by these messages is bounded. Moreover, all
local variables are bounded.
Non-FIFO Channels As the previous one, this algorithm does not require the
channels to be FIFO. When channels are not FIFO, could a problem happen if a
process pi , which has requested the object for itself or for another process pk , first receives
a request and then the object, both from one of its neighbors pj in the tree?
This scenario is depicted in Fig. 5.9, where we initially have parenti = j , and
pi has sent the message REQUEST (i) to its parent on behalf of the process py at
the head of queuei (hence |queuei | ≥ 1). Hence, when pi receives REQUEST (j ),
it is such that parenti ≠ i and |queuei | ≥ 1. It then follows from lines 13–19 that
the only statement executed by pi is line 17, namely, pi adds j to its local queue
queuei , which is then such that |queuei | ≥ 2.
On the Tree Invariant The channels used by the algorithm constitute an undi-
rected tree. When considering the orientation of these channels as defined by the
local variable parenti , we have the following.
Initially, these variables define a directed tree rooted at the process where the
object is initially located. Then, according to requests and the move of the object, it
is possible that two processes pi and pj are such that parenti = j and parentj = i.
Such is the case depicted in Fig. 5.9, in which the message REQUEST (j ) can be
suppressed (let us observe that this is independent of whether or not the channel
connecting pi and pj is FIFO). When this occurs we necessarily have:
• parentj = i and pj has sent the object to pi , and
• parenti = j and pi has sent the message REQUEST (i) to pj .
When the object arrives at pi , parenti is modified (line 21 or line 22, according to
the fact that pi is or is not the destination of the object), and, from then on, the edge
of the spanning tree is directed from pj to pi .
Hence, let us define an abstract spanning tree as follows. The orientation of all
its edges, except the one connecting pi and pj when the previous scenario occurs,
are defined by the local variables parentk (k ≠ i, j ). Moreover, when the previous
scenario occurs, the edge of the abstract spanning tree is from pj to pi , i.e., the
direction in which the object is moving. (This means that the abstract spanning tree
logically considers that the object has arrived at pi .) The invariant preserved by the
algorithm is then the following: At any time there is exactly one abstract directed
spanning tree, whose edges are directed to the process that has the object, or to
which the object has been sent.
Cost of the Algorithm As in the previous algorithm, the message cost is 0 mes-
sages in the most favorable case. The worst case depends on the diameter D of the
tree, and occurs when the two most distant processes are such that one has the object
and the other requests it. In this case, D messages and D time units are necessary
for the requests to arrive at the owner of the object, and again D messages and D
time units are necessary for the object to arrive at the requesting process. Hence,
both message complexity and time complexity are O(D).
On Priority As each local queue queuei is managed according to the FIFO disci-
pline, it follows that each request is eventually granted (the corresponding process
obtains the mobile object). It is possible to apply other management rules to these
queues, or particular rules to a subset of processes, in order to favor some processes,
or obtain specific priority schemes. This versatility dimension of the algorithm can
be exploited by some applications.
Theorem 3 The algorithm described in Fig. 5.8 guarantees that the object is never
located simultaneously at several sites (safety), and any process that wants the ob-
ject eventually acquires it (liveness).
Proof Proof of the safety property. Let us first observe that the object is initially
located at a single process. The proof is by induction. Let us assume that the property
is true up to the xth move of the object from a process to another process. Let pi be
the process that has the object after its xth move. If pi keeps the object forever, the
property is satisfied. Hence, let us assume that eventually pi sends the object. There
are three lines at which pi can send the object. Let us consider each of them.
exit() (i.e., release_object()) to exit the queue and allows its successor in the queue
to become the new head of the queue.
A Distributed Queue To implement the queue, each process pi manages two lo-
cal variables. The Boolean interestedi is set to the value true when pi starts entering
the queue, and is reset to the value false when it exits the queue. The second local
variable is denoted nexti , and is initialized to ⊥. It contains the successor of pi in
the queue. Moreover, nexti = ⊥ if pi is the last element of the queue.
Hence, starting from the process px that is the head of the queue, the queue is
defined by the sequence of pointers nextx , next_{nextx} , etc., until the process py such
that nexty = ⊥.
Due to its very definition, there is a single process at the head of the queue.
Hence, the algorithm considers that this is the process at which the mobile object
is currently located. When this process pi exits the queue, it has to send the mobile
object to pnexti (if nexti ≠ ⊥).
How to Enter the Queue: A Spanning Tree to Route Messages Let us consider the
process that is currently the last process of the queue (hence, for this process, interested is true and next =
⊥). The main issue that has to be solved consists in the definition of an addressing
mechanism that allows a process pi , which wants to enter the queue, to inform this last process that
it is no longer the last process in the queue and that it has a successor.
To that end, we need a distributed routing structure that permits any process to
send a message to the last process of the queue. Moreover, this distributed routing
structure has to be able to evolve, as the last process of the queue changes according
to the request issued by processes. The answer is simple, namely, the distributed
routing structure we are looking for is a dynamically evolving spanning tree whose
current root is the last process of the queue.
More specifically, let us consider Fig. 5.11. There are five processes, p1 , . . . , p5 .
The local variable parenti of each process pi points to the process that is the current
parent of pi in the tree. As usual, parentx = x means that px is the root of the
tree. The process p4 is the current root (initial configuration). The process p2 sends
a message REQUEST (2) to its parent p1 and defines itself as new root by setting
parent2 to 2. When p1 receives this message, as it is not the root, it forwards this
message to its parent p4 and redefines its new parent as being p2 (intermediary
configuration). Finally, when p4 receives the message REQUEST (2) forwarded by
p1 , it discovers that it has a successor in the queue (hence, it executes next4 ← 2),
and it considers p2 as the new root of the spanning tree (update of parent4 to 2 in
the final configuration).
As shown by the previous example, it is possible that, at some times, several trees
coexist (each spanning a distinct partition of the network). The important point is
that there is no creation of cycles, and there is a single new spanning tree when all
control messages have arrived and been processed.
Differently from the algorithm described in Fig. 5.8, which is based on “edge re-
versal” on a statically defined tree, this algorithm benefits from the fact that any channel
can be part of the dynamically evolving spanning tree. The directed path p2 , p1 , p4
of the initial spanning tree is replaced by two new directed edges (from p1 to p2 ,
and from p4 to p2 ). Hence, a path (in the initial configuration) of length d from the
new root to the old root is replaced by d edges, each directly pointing to the new
root (in the final configuration).
Heuristic Used by the Algorithm The previous discussion has shown that the
important local variable for a process pi to enter the queue is parenti . It follows
from the modification of the edges of the spanning tree, which are entailed by the
messages REQUEST (), that the variable parenti points to the process pk that has
issued the last request seen by pi .
Hence, when pi wants later to enter the queue, it sends its request to the process
pparenti , because this is the last process in the queue from pi ’s point of view.
The Case of the Empty Queue As in the diffusion-based algorithm of Fig. 5.3,
each process pi manages a local Boolean variable object_presenti , whose value is
true if and only if the mobile object is at pi (if we are not interested in the navigation
of a mobile object, this Boolean could be called first_in_queuei ).
As far as the management of the queue is concerned, its implementation has
to render an account of the case where the queue is empty. To that end, let us
consider the case where the queue contains a single process pi . This process
is consequently the first and the last process of the queue, and we have then
object_presenti ∧ (parenti = i).
If pi exits the queue, the queue becomes empty. The previous predicate remains
satisfied, but we have then ¬ interestedi . It follows that, if a process pi is such that
(¬ interestedi ∧ object_presenti ) ∧ (parenti = i), it knows that it was the last user
of the queue, which is now empty.
operation acquire_object() is
(1) interestedi ← true;
(2) if (¬ object_presenti ) then
(3) send REQUEST (i) to pparenti ; parenti ← i;
(4) wait(object_presenti )
(5) end if.
operation release_object() is
(6) interestedi ← false;
(7) if (nexti ≠ ⊥) then
(8) send OBJECT () to pnexti ; object_presenti ← false; nexti ← ⊥
(9) end if.
from which we conclude that, the last time pi has invoked release_object(), it was
such that nexti = ⊥ (line 7), and no other process has required the object (otherwise,
pi would have set object_presenti to false at line 14). It follows that the object re-
mained at pi since the last time it used it (i.e., the queue was empty and pi was its
fictitious single element).
When pi invokes release_object(), it first resets interestedi to the value false,
i.e., it exits the queue (line 6). Then, if nexti = ⊥, pi keeps the object. Otherwise, it
first sends the object to pnexti (i.e., it sends it the property “you are the head of the
queue”), and then resets object_presenti to false, and nexti to ⊥ (line 8).
When pi receives a message REQUEST (k), its behavior depends on the fact that
it is, or it is not, the last element of the queue. If it is not, it has only to forward the
request message to its current parent in the spanning tree, and redefine parenti as
being pk (lines 11–12 and 17). This is to ensure that parenti always points to the last
process that, to pi ’s knowledge, has issued a request. Let us remark that the message
REQUEST (k) that is forwarded is exactly the message received. This ensures that the
variables parentx of all the processes px visited by this message will point to pk .
If pi is such that parenti = i, there are two cases. If interestedi is true, pi is in
the queue. It consequently adds pk to the queue (line 13). If interestedi is false, we
are in the case where the queue is empty (pi is its fictitious single element). In this
case, pi has the object and sends it to pk (line 14). Moreover, whatever the case, to
preserve its correct meaning (pparenti is the last process that, to pi ’s knowledge, has
issued a request), pi updates parenti to k (line 17).
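Putting the operations of Fig. 5.12 and this message handling together, a hedged Python sketch of the per-process behavior is as follows; the class name, the send(dest, msg) callback, and the choice of an initial star-shaped tree (every process pointing directly at the initial owner) are assumptions of the sketch, not the book's notation.

class AdaptiveNavigationSite:
    # One process p_i of the adaptive (dynamic spanning tree) algorithm.
    def __init__(self, i, root, send):
        self.i, self.send = i, send
        self.interested = False
        self.object_present = (i == root)
        self.parent = i if i == root else root    # assumed initial tree: a star rooted at 'root'
        self.next = None                          # successor in the distributed queue

    def acquire_object(self):                     # lines 1-5
        self.interested = True
        if not self.object_present:
            self.send(self.parent, ("REQUEST", self.i))
            self.parent = self.i
            # the caller then waits until object_present becomes true (line 4)

    def release_object(self):                     # lines 6-9
        self.interested = False
        if self.next is not None:
            self.send(self.next, ("OBJECT", None))
            self.object_present = False
            self.next = None

    def on_request(self, k):                      # as described for lines 10-17
        if self.parent == self.i:                 # p_i is the last element of the queue
            if self.interested:
                self.next = k                     # p_k becomes p_i's successor (line 13)
            else:                                 # the queue is empty: send the object (line 14)
                self.send(k, ("OBJECT", None))
                self.object_present = False
        else:
            self.send(self.parent, ("REQUEST", k))   # forward the request unchanged (lines 11-12)
        self.parent = k                              # line 17

    def on_object(self, _):
        self.object_present = True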
5.4.4 Properties
Variable and Message Size, Message Complexity A single bit is needed for each
local variable object_presenti and interestedi ; log2 n bits are needed for each vari-
able parenti , while log2 (n + 1) bits are needed for each variable nexti . As each
message REQUEST () carries a process identity, it needs log2 n bits. If the al-
gorithm is used only to implement a queue, the message OBJECT () does not have
to carry application data, and a single bit is needed to distinguish the two message
types.
In the best case, an invocation of acquire_object() does not generate REQUEST ()
messages, while it gives rise to (n − 1) REQUEST () messages plus one OBJECT ()
message in the worst case. This case occurs when the spanning tree is a chain.
The left side of Fig. 5.13 considers the worst case: The process denoted pi issues
a request, and the object is at the process denoted pj . The right side of the figure
shows the spanning tree after the message REQUEST (i) has visited all the processes.
The new spanning tree is optimal in the sense that the next request by a process
will generate only one request message. More generally, it has been shown that
the average number of messages generated by an invocation of acquire_object() is
O(log n).
acquire_object(), they are “ignored” by the algorithm, whose cost is accordingly
reduced.
in the middle of the bottom line of the figure. Finally, the last subfigure (right of the
bottom line) depicts the value of the pointers parenti and nexti after a new invocation
of acquire_object() by process p2 .
As we can see, when all messages generated by the invocations of acquire_
object() and release_object() have been received and processed, the sets of point-
ers parenti and nexti define a single spanning tree and a single queue, respectively.
Moreover, this example illustrates also the adaptivity property. When considering
the last configuration, if p5 and p1 do not invoke the operation acquire_object(),
they will never receive messages generated by the algorithm.
5.5 Summary
This chapter was on algorithms that allow a mobile object to navigate a network
made up of distributed processes. The main issue that these algorithms have to solve
lies in the routing of both the requests and the mobile object, so that any process
that requires the object eventually obtains it. Three navigation algorithms have been
presented. They differ in the way the requests are disseminated, and in the way the
mobile object moves to its destination.
The first algorithm, which assumes a fully connected network, requires O(n)
messages per use of the mobile object. The second one, which uses only the edges
of a fixed spanning tree built on the process network, requires O(D) messages,
where D is the diameter of the spanning tree. The last algorithm, which assumes a
fully connected network, manages a spanning tree whose shape evolves according
to the requests issued by the processes. Its average message cost is O(log n). This
algorithm has the noteworthy feature of being adaptive, namely, if after some time a
process is no longer interested in the mobile object, there is a finite time after which
it is no longer required to participate in the algorithm.
When the mobile object is a (stateless) token, a mobile object algorithm is noth-
ing more than a token-based mutual exclusion algorithm. Actually, the algorithms
presented in this chapter were first introduced as token-based mutual exclusion al-
gorithms.
Finally, as far as algorithmic principles are concerned, an important algorithmic
notion presented in this chapter is the “edge reversal” notion.
• The edge reversal technique was first proposed (as far as we know) by E. Gafni
and D. Bertsekas [141]. This technique is used in [78] to solve resource alloca-
tion problems. A monograph entirely devoted to this technique has recently been
published [388].
• The algorithm presented in Sect. 5.4 is due to M. Naimi and M. Trehel [275, 374].
A formal proof of it, and the determination of its average message complexity
(namely, O(log n)), can be found in [276].
• A generic framework for mobile object navigation along trees is presented
in [172]. Both the algorithms described in Sects. 5.3 and 5.4, and many other
algorithms, are particular instances of this very general framework.
• Variants of the algorithms presented in Sects. 5.3 and 5.4 have been proposed.
Among them are the algorithm presented by J.M. Bernabéu-Aubán and M.
Ahamad [50], the algorithm proposed by M.L. Nielsen and M. Mizuno [279]
(see Exercise 4), a protocol proposed by J.L.A. van de Snepscheut [355], and the
arrow protocol proposed by M.J. Demmer and M. Herlihy [105]. (The message
complexity of this last algorithm is investigated in [182].)
• The algorithm proposed by J.L.A. van de Snepscheut [355] extends a tree-based
algorithm to work on any connected graph.
• Considering a mobile object which is a token, a dynamic heuristic-based navi-
gation algorithm is described in [345]. Techniques to regenerate lost tokens are
described in [261, 285, 321]. These techniques can be extended to more sophisti-
cated objects.
1. Modify the navigation algorithm described in Fig. 5.3 (Sect. 5.2), so that all the
local variables request_byi [k] have a bounded domain [1..M].
(Hint: consider the process that is the current user of the object.)
2. The navigation algorithm described in Fig. 5.3 assumes that the underlying com-
munication network is a complete point-to-point network. Generalize this algo-
rithm so that it works on any connected network (i.e., a non-necessarily complete
network).
Solution in [176].
3. A greedy version of the spanning tree-based algorithm described in Fig. 5.8
(Sect. 5.3) can be defined as follows. When a process pi invokes acquire_object()
while parenti ≠ i (i.e., the mobile object is not currently located at the invoking
process), pi adds its identity i at the head of queuei (and not at its tail as done in
Fig. 5.8). The rest of the algorithm is left unchanged.
What is the impact of this modification on the order in which the processes
obtain the mobile object? Does the liveness property remain satisfied? (Justify
your answers.)
Solution in [304].
operation acquire_object() is
(1) interestedi ← true;
(2) if (¬ object_presenti ) then
(3) send REQUEST (i, i) to pparenti ; parenti ← i;
(4) wait(object_presenti )
(5) end if.
when release_object() is
(6) interestedi ← false;
(7) if (nexti ≠ ⊥) then
(8) send OBJECT () to pnexti ; object_presenti ← false; nexti ← ⊥
(9) end if.
4. In the algorithm described in Fig. 5.8 (Sect. 5.3), the spanning tree is fixed, and
both the requests and the object navigate its edges (in opposite direction). Dif-
ferently, in the algorithm described in Fig. 5.12, the requests navigate a spanning
tree that they dynamically modify, and the object is sent directly from its current
owner to its next owner.
Hence the idea to design a variant of the algorithm of Fig. 5.12 in which the requests are sent along a fixed spanning tree (only the direction of its edges is modified according to requests), and the object is sent directly from its current user
to its next user. Such an algorithm is described in Fig. 5.15. A main difference
with both previous algorithms lies in the message REQUEST (). Such a message now carries two process identities j and k, where j is the identity of the process that sent the message, while k is the identity of the process at which the request originated. (In the algorithm of Fig. 5.8, which is based on the notion
of a proxy process, a request message carries only the identity of its sender. Dif-
ferently, in the algorithm of Fig. 5.12, a request message carries only the identity
of the process from which the request originates.) The local variables have the
same names (interestedi , object_presenti , parenti , nexti ), and the same meaning
as in the previous algorithms.
Is this algorithm correct? If it is not, find a counterexample. If it is, prove its correctness and compute its message complexity.
Solution in [279].
Part II
Logical Time and Global States
in Distributed Systems
This part of the book, which consists of four chapters (Chap. 6 to Chap. 9), is de-
voted to the concepts of event, local state, and global state of a distributed compu-
tation and associated notions of logical time. These are fundamental notions that
provide application designers with sound foundations on the nature of asynchronous
distributed computing in reliable distributed systems.
Message Relation Let M be the set of all the messages exchanged during an exe-
cution. Considering the associated send and receive events, let us define a “message
order” relation, denoted “−→msg”, as follows. Given any message m ∈ M, let s(m) denote its send event, and r(m) denote its receive event. We have
∀ m ∈ M : s(m) −→msg r(m).
This relation expresses the fact that any message m is sent before being received.
Causal Path A causal path is a sequence of events a(1), a(2), . . . , a(z) such that ∀x : 1 ≤ x < z : a(x) −→ev a(x + 1).
Hence, a causal path is a sequence of consecutive events related by −→ev. Let us notice that each process history is trivially a causal path.
When considering Fig. 6.1, the sequence of events e_2^2, e_2^3, e_3^2, e_3^3, e_3^4, e_1^3 is a causal path connecting the event e_2^2 (sending by p2 of a message to p3) to the event e_1^3 (reception by p1 of a message sent by p3). Let us also observe that this causal path relates an event on p2 (e_2^3) to an event on p1 (e_1^3), despite the fact that p2 never sends a message to p1.
Concurrent Events, Causal Past, Causal Future, Concurrency Set Two events a and b are concurrent (or independent) (notation a || b) if they are not causally related, i.e., if none of them belongs to the causes of the other:
a || b =def ¬(a −→ev b) ∧ ¬(b −→ev a).
Three more notions follow naturally from the causal precedence relation. Let e be an event.
• Causal past of an event: past(e) = {f | f −→ev e}. This set includes all the events that causally precede the event e.
• Causal future of an event: future(e) = {f | e −→ev f}. This set includes all the events that have e in their causal past.
• Concurrency set of an event: concur(e) = {f | f ∉ (past(e) ∪ future(e))}. This set includes all the events that are not causally related with the event e.
Examples of such sets are depicted in Fig. 6.2, where the event e_2^3 is considered. The events of past(e_2^3) are the three events to the left of the left bold line, while the events of future(e_2^3) are the six events to the right of the bold line on the right side of the figure.
Fig. 6.2 Past, future, and concurrency sets associated with an event
It is important to notice that while, with respect to physical time, e_3^1 occurs “before” e_2^3, and e_1^2 occurs “after” e_2^3, both are independent from e_2^3 (i.e., they are logically concurrent with e_2^3). Process p1 cannot learn that the event e_2^3 has been produced before receiving the message from p3 (event e_1^3, which terminates the causal path starting at e_2^3 on p2). A process can learn it only thanks to the flow of control created by the causal path e_2^3, e_3^2, e_3^3, e_3^4, e_1^3.
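These sets can be computed mechanically from the process histories and the messages. The following Python sketch is written for illustration only (the event names and the small three-process execution are hypothetical, not those of Fig. 6.1): it builds the causal precedence relation as the transitive closure of the process-order and message-order edges, and derives past(e), future(e), and concur(e).

```python
# Illustrative sketch (hypothetical three-process execution, not Fig. 6.1):
# compute past(e), future(e), and concur(e) from process histories and messages.
histories = {
    1: ["e1_1", "e1_2", "e1_3"],
    2: ["e2_1", "e2_2", "e2_3"],
    3: ["e3_1", "e3_2", "e3_3", "e3_4"],
}
messages = [("e2_2", "e3_2"), ("e3_4", "e1_3")]   # (send event, receive event)

events = [e for h in histories.values() for e in h]

# Direct edges: process order plus message order.
edges = set(messages)
for h in histories.values():
    edges |= {(h[k], h[k + 1]) for k in range(len(h) - 1)}

# The causal precedence relation is the transitive closure of these edges.
precedes = set(edges)
changed = True
while changed:
    changed = False
    pairs = list(precedes)
    for (a, b) in pairs:
        for (c, d) in pairs:
            if b == c and (a, d) not in precedes:
                precedes.add((a, d))
                changed = True

def past(e):      # events that causally precede e
    return {f for f in events if (f, e) in precedes}

def future(e):    # events that have e in their causal past
    return {f for f in events if (e, f) in precedes}

def concur(e):    # events not causally related to e
    return set(events) - {e} - past(e) - future(e)

print(sorted(past("e3_2")))     # ['e2_1', 'e2_2', 'e3_1']
print(sorted(concur("e3_2")))   # ['e1_1', 'e1_2', 'e2_3']
```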
Cut and Consistent Cut A cut C is a set of events which defines initial prefixes of process histories. Hence, a cut can be represented by a vector [prefix(h1), . . . , prefix(hn)], where prefix(hi) is the corresponding prefix for process pi. A consistent cut C is a cut such that ∀ e ∈ C : ∀ f : (f −→ev e) ⇒ (f ∈ C).
As an example, C = {e_1^2, e_2^1, e_2^3, e_3^1} is not a cut because the only event from p1 is e_1^2, which does not define an initial prefix of the history of p1 (an initial prefix also has to include e_1^1, because it has been produced before e_1^2).
C′ = {e_1^1, e_1^2, e_2^1, e_2^2, e_3^1, e_3^2} is a cut because its events can be partitioned into initial prefix histories: e_1^1, e_1^2 is an initial prefix of h1, e_2^1, e_2^2 is an initial prefix of h2, and e_3^1, e_3^2 is an initial prefix of h3. It is easy to see that the cut C′ is not consistent, because e_2^3 ∉ C′, e_3^2 ∈ C′, and e_2^3 −→ev e_3^2. Differently, the cut C″ = C′ ∪ {e_2^3} is consistent.
The term “cut” comes from the fact that a cut can be represented by a line separating events in a space-time diagram. The events at the left of the cut line are the events belonging to the cut. The cut C′ and the consistent cut C″ are represented by dashed lines in Fig. 6.3.
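The consistency test for a cut follows directly from this definition: a cut is consistent iff every cause of an event of the cut also belongs to the cut. Below is a minimal sketch, assuming the causal precedence relation is already available as an explicit (transitively closed) set of ordered pairs for a small hypothetical execution; the names are not the book's.

```python
# Minimal sketch: checking whether a cut is consistent.
# 'precedes' is assumed to be the (transitively closed) relation "->ev"
# of a tiny hypothetical two-process execution with one message.
events = {"e1_1", "e1_2", "e2_1", "e2_2", "e2_3"}
precedes = {
    ("e1_1", "e1_2"),                     # process order on p1
    ("e2_1", "e2_2"), ("e2_2", "e2_3"),   # process order on p2
    ("e2_1", "e2_3"),
    ("e2_2", "e1_2"),                     # a message sent at e2_2, received at e1_2
    ("e2_1", "e1_2"),
}

def is_consistent_cut(cut):
    """A cut is consistent iff every cause of an event of the cut is in the cut."""
    return all(f in cut for e in cut for f in events if (f, e) in precedes)

print(is_consistent_cut({"e1_1", "e2_1", "e2_2"}))   # True
print(is_consistent_cut({"e1_1", "e1_2", "e2_1"}))   # False: e2_2 -> e1_2, but e2_2 is missing
```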
As we have seen, the definition of a distributed execution does not refer to physical
time. This is similar to the definition of a sequential execution, and is meaningful
as long as we do not consider real-time distributed programs. Physical time is a
resource needed to execute a distributed program, but it is not a programming object.
In an asynchronous system, the passage of time alone does not provide infor-
mation to the processes. This is different in synchronous systems, where, when a
process proceeds to the next synchronous round, it knows that all the processes do
the same. (This point will be addressed in Chap. 9, which is devoted to synchroniz-
ers.)
From Events to Local States Each process pi starts from an initial local state denoted σ_i^0. Then, its first event e_i^1 entails its move from σ_i^0 to its next local state σ_i^1 and, more generally, its xth event e_i^x entails its progress from σ_i^{x−1} to σ_i^x. This is depicted in Fig. 6.5, where the small rectangles denote the consecutive local states of process pi.
We sometimes use the transition-like notation σ_i^x = δ(σ_i^{x−1}, e_i^x) to state that the statement generating the event e_i^x makes pi progress from the local state σ_i^{x−1} to the local state σ_i^x.
A Slight Modification of the Relation −→ev In order to obtain a simple definition for a relation on local states, let us consider the relation on events denoted −→ev′, which is −→ev enriched with reflexivity. This means that the “process order” part of the definition of −→ev, namely (i = j) ∧ (x < y), is extended to (i = j) ∧ (x ≤ y) (i.e., by definition, each event precedes itself).
A Partial Order on Local States Let S be the set of all local states produced by a distributed execution. Thanks to −→ev′, we can define, in a simple way, a partial order on the elements of S. This relation, denoted −→σ, is defined as follows:
σ_i^x −→σ σ_j^y =def e_i^{x+1} −→ev′ e_j^y.
• If the events e_i^{x+1} and e_j^y have been produced by the same process (i = j), and are consecutive (y = x + 1), then the local state σ_i^x preceding e_i^{x+1} = e_j^y “happens before” (causally precedes) the local state σ_j^y = σ_i^{x+1} generated by e_j^y (bottom of the figure). In this case, the reflexivity (e_i^{x+1} is e_j^y) can be interpreted as an “internal communication” event, where pi sends to itself a fictitious message. The corresponding send event e_i^{x+1} and receive event e_j^y are then merged to define a single internal event.
It follows from the definition of −→σ that a distributed execution can be abstracted as a partial order Ŝ on the set of the process local states, namely, Ŝ = (S, −→σ).
Concurrent Local States Two local states σ 1 and σ 2 are concurrent (or indepen-
dent, denoted σ 1||σ 2) if none of them causally precedes the other one, i.e.,
σ1 || σ2 =def ¬(σ1 −→σ σ2) ∧ ¬(σ2 −→σ σ1).
It is important to notice that two concurrent local states may coexist at the same physical time. This is, for example, the case of the local states σ_i^{x+1} and σ_j^{y−1} at the top of Fig. 6.6 (this coexistence lasts during the unknown and arbitrary transit time of the corresponding message). On the contrary, this can never occur for
causally dependent local states (a “cause” local state no longer exists when any of
its “effect” local states starts existing).
6.3 Global State and Lattice of Global States
Global State and Consistent Global State A global state Σ of a distributed exe-
cution is a vector of n local states, one per process:
Σ = [σ1 , . . . , σi , . . . , σn ],
where, for each i, σi is a local state of process pi .
Intuitively, a consistent global state is a global state that could have been ob-
served by an omniscient external observer. More formally, it is a global state
Σ = [σ1 , . . . , σi , . . . , σn ] such that
∀ i, j : (i ≠ j) ⇒ (σi || σj).
This means that no two local states of a consistent global state can be causally dependent.
A Distributed Execution as a Lattice of Global States The fact that the previous execution has two processes generates a graph with “two dimensions”: one associated with the events issued by p1 (edges going from right to left in the figure),
the other one associated with the events issued by p2 (edges going left to right in
the figure). More generally, an n-process execution gives rise to a graph with “n
dimensions”.
The reachability graph actually has a lattice structure. A lattice is a directed graph
such that any two vertices have a unique greatest common predecessor and a unique
lowest common successor. As an example, the consistent global states denoted [2, 1]
and [0, 2] have several common predecessors, but have a unique greatest common
predecessor, namely the consistent global state denoted [0, 1]. Similarly they have a
unique lowest common successor, namely the consistent global state denoted [2, 2].
As we will see later, these lattice properties become particularly relevant when one
is interested in determining if some global state property is satisfied by a distributed
execution.
• O2 = e_2^1, e_1^1, e_2^2, e_1^2, e_2^3. This sequential observation is depicted by the dotted line in the middle of Fig. 6.9.
• O3 = e_2^1, e_2^2, e_1^1, e_1^2, e_2^3. This sequential observation is depicted by the dashed/dotted line at the right of Fig. 6.9.
As each observation is a total order on all the events that respects their partial or-
dering, it follows that we can go from one observation to another one by permuting
any pair of consecutive events that are concurrent. As an example, O2 can be obtained from O1 by permuting e_1^2 and e_2^2 (which are independent events). Similarly, as e_1^1 and e_2^2 are independent events, permuting them in O2 provides us with O3.
Let us finally observe that the intersection of all the sequential observations (i.e.,
their common part) is nothing more than the partial order on the events associated
with the computation.
Remark 1 As each event makes the execution proceed from a consistent global
state to another consistent global state, an observation can also be defined as a sequence of consistent global states, each global state being directly reachable from
its immediate predecessor in the sequence. As an example, O1 corresponds to the
sequence of consistent global states [0, 0], [0, 1], [1, 1], [2, 1], [2, 2], [2, 3].
Remark 2 Let us insist on the fact that the notion of global state reachability
considers a single event at a time. During an execution, independent events can be
executed in any order or even simultaneously. If, for example, e_1^1 and e_2^2 are executed “simultaneously”, the execution proceeds “directly” from the global state [0, 1] to the global state [1, 2]. Actually, whatever really occurs, it is not possible to know it. The advantage of the lattice approach, which considers one event at a time, lies in the fact that no global state through which the execution could have passed is missed.
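The remark above also suggests a direct way to build the lattice: start from the initial global state and apply one event at a time, keeping only the consistent global states. The sketch below is illustrative only (the two-process execution, with one message, is hypothetical, not the book's figure); it identifies a global state by the vector giving the number of events executed by each process, and rejects global states containing an orphan message.

```python
from collections import deque

# Hypothetical 2-process execution: p1 produces 2 events, p2 produces 3 events,
# and p2's 2nd event sends a message that p1 receives at its 2nd event.
nb_events = [2, 3]                      # events produced by p1 and p2
messages = [((2, 2), (1, 2))]           # ((sender, send index), (receiver, receive index))

def consistent(state):
    # state[i-1] = number of events already executed by p_i.
    # No orphan message: a received message must already have been sent.
    return all(state[r - 1] < y or state[s - 1] >= x
               for (s, x), (r, y) in messages)

def successors(state):
    for i in range(len(state)):
        if state[i] < nb_events[i]:
            nxt = list(state)
            nxt[i] += 1
            if consistent(tuple(nxt)):
                yield tuple(nxt)

# Breadth-first traversal of the lattice from the initial global state [0, 0].
lattice, queue = set(), deque([tuple([0] * len(nb_events))])
while queue:
    state = queue.popleft()
    if state in lattice:
        continue
    lattice.add(state)
    queue.extend(successors(state))

print(sorted(lattice))
# [(0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (1, 3), (2, 2), (2, 3)]
```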
6.4 Global States Including Process States and Channel States
In some cases, we are not interested in global states consisting of only one local state per process, but in global states consisting both of process local states and channel
states.
To that end we consider that processes are connected by unidirectional channels.
This is without loss of generality, as a bidirectional channel connecting pi and pj can be realized with two unidirectional channels, one from pi to pj and one from pj to pi. The state of the channel from pi to pj consists of the messages that have been sent by pi and have not yet been received by pj. A global state is consequently made up of two parts:
• a vector Σ with a local state per process, plus
• a set CΣ, each element of which represents the state of a given channel.
Notation Let σi be a local state of a process pi , and e and f be two events pro-
duced by pi . The notation e ∈ σi means that pi has issued e before attaining the
local state σi, while f ∉ σi means that pi has issued f after its local state σi. We
then say “e belongs to the past of σi ”, and “f belongs to the future of σi ”. These
notations are illustrated in Fig. 6.10.
In-transit Messages and Orphan Messages The notions of in-transit and orphan
messages are with respect to an ordered pair of local states. Let m be a message sent by pi to pj, and σi and σj two local states of pi and pj, respectively. Let us recall that s(m) and r(m) are the send event (by pi) and the reception event (by pj) associated with m, respectively.
• A message m is in-transit with respect to the ordered pair of local states ⟨σi, σj⟩ if s(m) ∈ σi ∧ r(m) ∉ σj.
• A message m is orphan with respect to the ordered pair of local states ⟨σi, σj⟩ if s(m) ∉ σi ∧ r(m) ∈ σj.
These definitions are illustrated in Fig. 6.11, where there are three processes,
and two channels, one from p1 to p2 , and one from p3 to p2 . As, on the one hand,
s(m1) ∈ σ1 and s(m3) ∈ σ1, and, on the other hand, r(m1) ∈ σ2 and r(m3) ∈ σ2, both m1 and m3 are “in the past” of the directed pair ⟨σ1, σ2⟩. Differently, as s(m5) ∈ σ1 and r(m5) ∉ σ2, the message m5 is in-transit with respect to the ordered pair ⟨σ1, σ2⟩.
Similarly, the message m2 belongs to the “past” of the pair of local states ⟨σ3, σ2⟩. Differently, with respect to this ordered pair, the message m4 has been received by p2 (r(m4) ∈ σ2), but has not been sent by p3 (s(m4) ∉ σ3); hence it is an orphan
message.
Considering Fig. 6.11, let us observe that message m5, which is in-transit with respect to the pair of local states ⟨σ1, σ2⟩, does not make this directed pair inconsistent (it does not create a causal path invalidating σ1 || σ2). The case of an orphan message is different. As we can see, the message m4 creates the dependence σ3 −→σ σ2, and, consequently, we do not have σ3 || σ2. Hence, orphan messages prevent local states
from belonging to the same consistent global state.
Consistent Global State Let us define the state of a FIFO channel as a sequence
of messages, and the state of a non-FIFO channel as a set of messages. The state of a
channel from pi to pj with respect to a directed pair ⟨σi, σj⟩ is denoted c_state(i, j).
As already indicated, it is the sequence (or the set) of messages sent by pi to pj
whose send events are in the past of σi , while their receive events are not in the past
of σj (they are in its “future”).
Let C = {(i, j) | ∃ a directed channel from pi to pj}. A global state (Σ, CΣ), where Σ = [σ1, . . . , σn] and CΣ = {c_state(i, j)}_{(i,j)∈C}, is consistent if, for any message m, we have (where ⊕ stands for exclusive or)
• C1: (s(m) ∈ σi) ⇒ ((r(m) ∈ σj) ⊕ (m ∈ c_state(i, j))), and
• C2: (s(m) ∉ σi) ⇒ ((r(m) ∉ σj) ∧ (m ∉ c_state(i, j))).
This definition states that, to be consistent, a global state (Σ, CΣ) has to be
such that, with respect to the process local states defined by Σ , (C1) each in-transit
message belongs to the state of the corresponding channel, and (C2) there is no
message received and not sent (i.e., no orphan message).
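Conditions C1 and C2 translate directly into a membership test. The following sketch is illustrative only (the function name and the representation of a recorded local state by its sets of sent/received messages are not the book's); it checks both conditions for a single channel from pi to pj.

```python
# Minimal sketch: checking conditions C1 and C2 for one channel from p_i to p_j.
# A recorded local state is abstracted by the messages sent/received before it;
# channel_state is the recorded state of the channel.
def channel_consistent(sent_before_sigma_i, received_before_sigma_j,
                       channel_state, all_messages):
    for m in all_messages:
        sent = m in sent_before_sigma_i
        received = m in received_before_sigma_j
        in_channel = m in channel_state
        if sent and not (received != in_channel):   # C1: received XOR recorded in the channel
            return False
        if not sent and (received or in_channel):   # C2: no orphan, no spurious recording
            return False
    return True

# m1 received, m2 in transit (recorded in the channel state), m3 not yet sent.
print(channel_consistent({"m1", "m2"}, {"m1"}, {"m2"}, {"m1", "m2", "m3"}))  # True
# Orphan: m2 received although not sent before sigma_i.
print(channel_consistent({"m1"}, {"m1", "m2"}, set(), {"m1", "m2"}))         # False
```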
Σ Versus (Σ, CΣ) Let us observe that the knowledge of Σ contains implicitly
the knowledge of CΣ . This is because, for any channel, the past of the local state
σi contains implicitly the messages sent by pi to pj up to σi , while the past of the
local state σj contains implicitly the messages sent by pi and received by pj up
to σj . Hence, in the following we use without distinction Σ or (Σ, CΣ).
The notion of a cut was introduced in Sect. 6.1.3. A cut C is a set of events defined
from prefixes of process histories. Let σi be the local state of pi obtained after
executing the events in its prefix history prefix(hi) as defined by the cut, i.e., σi = δ(σ_i^0, prefix(hi)) using the transition function-based notation (σ_i^0 being the initial state of pi). It follows that [σ1, . . . , σi, . . . , σn] is a global state Σ. Moreover, the
cut C is consistent if and only if Σ is consistent.
When considering (Σ, CΣ), we have the following. If the cut giving rise to Σ is consistent, the states of the channels in CΣ correspond to the messages that cross the cut line (their send events belong to the cut, while their receive events do not). If the cut is not consistent, there is at least one message that crosses the cut line in the “bad direction” (its send event does not belong to the cut, while its receive event does; this message is an orphan message).
6.5 On-the-Fly Computation of Global States
The aim is to design algorithms that compute a consistent global state of a dis-
tributed application made up of n processes p1 , . . . , pn . To that end, a controller (or
observer) process, denoted cpi , is associated with each application process pi . The
role of a controller is to observe the process it is associated with, in such a way that
the set of controllers computes a consistent global state of the application.
Hence, the computation of a consistent global state of a distributed computation
is an observation problem: The controllers have to observe the application processes
without modifying their behavior. This is different from problems such as the nav-
igation of a mobile object (addressed in the previous chapter), where the aim was
to provide application processes with appropriate operations (such as acquire() and
release()) that they can invoke.
The corresponding structural view is described in Fig. 6.13. Ideally, the addi-
tion/suppression of controllers/observers must not modify the execution of the dis-
tributed application.
The global state that is computed depends actually on the execution and the inter-
leaving of the events generated by the application processes and by the controllers in
charge of the computation of the global state. The validity property states only that
Σ is a consistent global state that the execution might have passed through. Maybe
the execution passed through Σ , maybe it did not. Actually, there is no means to
know if the distributed execution passed through Σ or not. This is due to the fact that independent events can be perceived, by distinct sequential observers, as having been executed in a different order (see Fig. 6.9). Hence, the validity property charac-
terizes the best that can be done, i.e., (a) Σ is consistent, and (b) it can have been
passed through by the execution. Nothing stronger can be claimed.
The Case of Stable Properties A stable property is a property that, once true,
remains true forever. “The application has terminated”, “there is a deadlock”, or “a distributed cell has become inaccessible” are typical examples of stable properties
that can be defined on the global states of a distributed application. More precisely,
let P be a property on the global states of a distributed execution. If P is a stable
property, we have
P(Σ) ⇒ (∀ Σ′ : Σ −→Σ Σ′ ⇒ P(Σ′)).
Let Σ be a consistent global state that has been computed, and Σstart and Σend
as defined previously. We have the following:
• As P is a stable property and Σ −→Σ Σend, we have P(Σ) ⇒ P(Σend). Hence,
if P(Σ) is true, whether the distributed execution passed through Σ or not, we
conclude that P(Σend ) is satisfied, i.e., the property is satisfied in the global state
Σend , which is attained by the distributed execution (and can never be explicitly
known). Moreover, from then on, due to its stability, we know that P is satisfied
on all future global states of the distributed computation.
• If ¬P(Σ), taking the contrapositive of the definition of a stable property, we can
conclude ¬P(Σstart ), but nothing can be concluded on P(Σend ).
In this case, a new global state can be computed, until either a computed global
state satisfies P, or the computation has terminated. (Let us observe that, if P
is “the computation has terminated”, it is eventually satisfied. Special chapters
of the book are devoted to the detection of stable properties such as distributed
termination and deadlock detection.)
In order that (Σ, CΣ) be consistent, the controllers have to cooperate to ensure
the conditions C1 and C2 stated in Sect. 6.4.2. This cooperation involves both syn-
chronization and message recording.
• Synchronization. In order that there is no orphan message with respect to a pair ⟨σj, σi⟩, when a message is received from pj, the controller cpi might be forced to record the local state of pi before giving it the message.
• Message recording. In order that the in-transit messages appear in the state of the
corresponding channels, controllers have to record them in one way or another.
While nearly all global state computation algorithms use the same synchroniza-
tion technique to ensure Σ is consistent, they differ in (a) the technique they use to
record in-transit messages, and (b) the FIFO or non-FIFO assumption they consider
for the directed communication channels.
6.6 A Global State Algorithm Suited to FIFO Channels
The algorithm presented in this section is due to K.M. Chandy and L. Lamport
(1985). It was the first global state computation algorithm proposed. It assumes that
(a) the channels are FIFO, and (b) the communication graph is strongly connected
(there is a directed communication path from any process to any other process).
A Local Variable per Process Plus a Control Message The local variable
gs_statei contains the state of pi with respect to the global state computation. Its
value is red (its local state has been recorded) or green (its local state has not yet
been recorded). Initially, gs_statei = green.
A controller cpi , which has not yet recorded the local state of its associated pro-
cess pi , can do it at any time. When cpi records the local state of pi , it atomically
(a) updates gs_statei to the value red, and (b) sends a control message (denoted
MARKER (), and depicted with dashed arrows) on all its outgoing channels. As chan-
nels are FIFO, a message MARKER () is a synchronization message separating the
application messages sent by pi before it from the application messages sent after
it. This is depicted in Fig. 6.14.
When a message is received by the pair (pi , cpi ), the behavior of cpi depends on
the value of gs_statei . There are two cases.
• gs_statei = green. This case is depicted in Fig. 6.15. The controller cpi discovers
that a global state computation has been launched. It consequently participates in
it by atomically recording σi , and sending MARKER () messages on its outgoing
channels to inform their destination processes that a global state computation has
been started.
Moreover, for the ordered pair of local states ⟨σj, σi⟩ to be consistent, cpi defines the state c_state(j, i) of the incoming channel from pj as being empty.
• gs_statei = red. In this case, depicted in Fig. 6.16, cpi has already recorded σi. Hence, it has only to ensure that the recorded channel state c_state(j, i) is consistent with respect to the ordered pair ⟨σj, σi⟩. To that end, cpi records the sequence
of messages that are received on this channel between the recording of σi and the
reception of the marker sent by cpj .
Properties Assuming at least one controller process cpi launches a global state
computation (Fig. 6.14), it follows from the strong connectivity of the communication graph and the rules associated with the control messages MARKER () that
exactly one local state per process is recorded, and exactly one marker is sent on
each directed channel. It follows that a global state is eventually computed.
Let us color a process with the color currently in gs_statei. Similarly, let the
color of an application message be the color of its sender when the message is
sent. According to these colorings, an orphan message is a red message received
by a green process. But, as the channel from pj to pi is FIFO, no red message can
arrive before a marker, from which it follows that there is no orphan message. The
fact that all in-transit messages are recorded follows from the recording rules expressed in Figs. 6.15 and 6.16. The recorded global state is consequently consistent.
Fig. 6.17 Global state computation (FIFO channels, code for cpi)
Local Variables Let c_ini denote the set of process identities j such that there is a channel from pj to pi; this channel is denoted in_channeli[j]. Similarly, let c_outi denote the set of process identities j such that there is a channel from pi to pj; this channel is denoted out_channeli[j]. These sets are known both by pi and
its controller cpi . (As the network is strongly connected, no set c_ini or c_outi is
empty.)
The local array closedi [c_ini ] is a Boolean array. Each local variable closedi [j ]
is initialized to false, and is set to value true when cpi receives a marker from cpj .
Fig. 6.18 A simple automaton for process pi (i = 1, 2)
atomic. This means that they exclude each other, and also exclude concurrent execution of the process pi. (Among other issues, the current state of pi must not be modified while cpi records σi.)
One or several controller processes launch the algorithm when they receive an ex-
ternal message START (), while their local variable gs_statei = green (line 4). Such
a process then invokes the internal operation record_ls(). This operation records the
current local state of pi (line 1), switches gs_statei to red (line 2), initializes the
channel states c_state(j, i) to the empty sequence, denoted ∅ (line 3), and sends
markers on pi ’s outgoing channels (line 4).
The behavior of a process cpi that receives a message depends on the message. If it is a MARKER () control message, cpi learns that the sender cpj has recorded σj. Hence,
if not yet done, cpi executes the internal operation record_ls() (line 6). Moreover,
in all cases, it sets closedi [j ] to true (line 7), to indicate that the state of the input
channel from pj has been computed, this channel state being consistent with respect
to the pair of local states σj , σi (see Figs. 6.15 and 6.16).
If the message is an application message, and the computation of c_state(j, i)
has started and is not yet finished (predicate of line 8), cpi adds it at the end of
c_state(j, i). In all cases, the message is passed to the application process pi .
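To summarize the controller rules in an executable form, here is a compact single-controller sketch (an illustration written for this presentation, not the code of Fig. 6.17; the send callback and the method names are assumptions). It captures the three rules: record on an external start or on the first marker, send markers once when recording, and record on each incoming channel the messages that arrive between the local recording and that channel's own marker.

```python
class Controller:
    """Illustrative sketch of one controller cp_i for the FIFO (marker-based) algorithm."""

    def __init__(self, i, c_in, c_out, send):
        self.i = i
        self.c_in = list(c_in)       # j such that there is a channel p_j -> p_i
        self.c_out = list(c_out)     # j such that there is a channel p_i -> p_j
        self.send = send             # send(dest, msg): network callback (assumed)
        self.gs_state = "green"      # local state of p_i not yet recorded
        self.sigma = None            # recorded local state of p_i
        self.c_state = {}            # j -> recorded sequence of in-transit messages from p_j
        self.closed = {j: False for j in c_in}

    def record_ls(self, local_state):
        # Atomically: record sigma_i, turn red, open the channel recordings,
        # and send a marker on every outgoing channel.
        self.sigma = local_state
        self.gs_state = "red"
        self.c_state = {j: [] for j in self.c_in}
        for j in self.c_out:
            self.send(j, ("MARKER",))

    def on_start(self, local_state):
        if self.gs_state == "green":
            self.record_ls(local_state)

    def on_marker(self, j, local_state):
        if self.gs_state == "green":
            self.record_ls(local_state)  # first marker received: record now (Fig. 6.15)
        self.closed[j] = True            # the state of the channel from p_j is now recorded

    def on_application_message(self, j, m):
        if self.gs_state == "red" and not self.closed[j]:
            self.c_state[j].append(m)    # in-transit w.r.t. <sigma_j, sigma_i> (Fig. 6.16)
        return m                         # the message is always passed to p_i
```

A green process receiving an application message has nothing to record: such a message belongs to the past of the local state that will be recorded later.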
This section illustrates the previous global state computation algorithm with a sim-
ple example. Its aim is to give the reader a deeper insight on the subtleties of global
state computations.
control variables gs_state1 and gs_state2 , respectively, and send MARKER () mes-
sages. This global state computation involves four time instants (defined from an
external omniscient observer’s point of view).
• At time τ0, the controller cp1 receives a message START () (line 5). Hence, when the global state computation starts, the distributed execution is in the global state Σstart = [σ_1^0, σ_2^0]. The process cp1 consequently invokes record_ls(). It records the current state of p1, which is then σ_1^0, and sends a marker on its only outgoing channel (lines 1–5).
• Then, at time τ1, the application message m2 arrives at p1 (lines 8–11). This message is first handled by cp1, which adds a copy of it to the channel state c_state(2, 1). It is then received by p1, which moves from σ_1^0 to σ_1^1.
• Then, at time τ2, cp2 receives the marker sent by cp1 (lines 6–7). As gs_state2 = green, cp2 invokes record_ls(): It records the current state of p2, which is σ_2^1, and sends a marker to cp1.
• Finally, at time τ3, the marker sent by cp2 arrives at cp1. As gs_state1 = red, cp1 has only to stop recording the messages sent by p2 to p1. Hence, when the global state computation terminates, the distributed execution is in the global state Σend = [σ_1^1, σ_2^1].
Fig. 6.21 Consistent cut associated with the computed global state
It follows that the global state that has been cooperatively computed by cp1 and cp2 is the pair (Σ, CΣ), where Σ = [σ_1^0, σ_2^1], and CΣ = {c_state(1, 2), c_state(2, 1)} with c_state(1, 2) = ∅ and c_state(2, 1) = ⟨m2⟩. It is important to see that this com-
putation is not at all atomic: as illustrated in the space-time diagram, it is distributed
both with respect to space (processes) and time.
6.7 A Global State Algorithm Suited to Non-FIFO Channels
The algorithm presented in this section, due to Lai and Yang (1987), differs from the previous one mainly in the way it records the state of the channels.
The local variables c_ini , c_outi , in_channeli [k], and out_channeli [k] have the
same meaning as before. In addition to gs_statei , which is initialized to green and
is eventually set to red, each controller cpi manages two arrays of sets defined as
follows:
• rec_msgi [c_ini ] is such that, for each k ∈ c_ini , we have
rec_msgi [k] =
{all the messages received on in_channeli [k] since the beginning}.
• sent_msgi [c_outi ] is such that, for each k ∈ c_outi , we have
sent_msgi [k] = {all the messages sent on out_channeli [k] since the beginning}.
Each array is a log in which cpi records all the messages it receives on each input
channel, and all the messages it sends on each output channel. This means that (dif-
ferently from the previous algorithm), cpi has to continuously observe pi . But now
there is no control message: All the messages are application messages. Moreover,
each message inherits the current color of its sender (hence, each message carries
one control bit).
The basic rule is that, while a green message can always be consumed, a red
message can be consumed only when its receiver is red. It follows that, when a
green process pi receives a red message, cpi has first to record pi ’s local state so
that pi becomes red before being allowed to receive and consume this message.
The algorithm is described in Fig. 6.23. The text is self-explanatory. m.color
denotes the color of the application message m. Line 7 ensures that a red message
cannot be received by a green process, which means that there is no orphan message
with respect to an ordered pair of process local states that have been recorded.
Let cp be any controller process. After it has recorded σi , rec_msgi [c_ini ], and
sent_msgi [c_outi ] (line 3), a controller cpi sends this triple to cp (line 4).
When it has received such a triple from each controller cpi , the controller cp
pieces together all the local states to obtain Σ = [σ1 , . . . , σn ]. As far as CΣ is
concerned, it defines the state of the channel from pj to pi as follows:
c_state(j, i) = sent_msgj[i] \ rec_msgi[j].
Fig. 6.23 Global state computation (non-FIFO channels, code for cpi)
It follows from (a) this definition, (b) the fact that sent_msgj [i] is the set of
messages sent by pj to pi before σj , and (c) the fact that rec_msgi [j ] is the set of
messages received by pi from pj before σi , that c_state(j, i) records the messages
that are in-transit with respect to the ordered pair ⟨σj, σi⟩. As there is no orphan
message with respect to the recorded local states (see above), the computed global
state is consistent.
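The two key rules of this algorithm fit in a few lines. The sketch below is illustrative only (the names and the data representation are assumptions, not the book's code): the reception rule forces a green process to record its local state before consuming a red message, and each channel state is reconstructed as the set difference of the two logs.

```python
# Illustrative sketch of the non-FIFO (Lai-Yang style) rules; not the book's code.

def on_receive(controller, m_color, record_local_state):
    """Reception rule: a red message can be consumed only by a red process."""
    if m_color == "red" and controller["gs_state"] == "green":
        record_local_state()                  # record sigma_i, which turns p_i red
        controller["gs_state"] = "red"
    # the message is then passed to (and consumed by) the application process

def channel_state(sent_msg_j_to_i, rec_msg_i_from_j):
    """c_state(j, i): messages sent by p_j to p_i before sigma_j and not yet
    received by p_i before sigma_i (set difference of the two logs)."""
    return sent_msg_j_to_i - rec_msg_i_from_j

# Example: p_j logged {m1, m2, m3} as sent, p_i logged {m1} as received.
print(channel_state({"m1", "m2", "m3"}, {"m1"}))   # {'m2', 'm3'} are in transit
```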
passed to p4 . Then, p4 sends a (red) message to p3 . As this message is red, its re-
ception entails the recording of σ3 by cp3 , after which the message is received and
processed by p3 .
The cut corresponding to this global state is indicated by a bold dashed line. It is
easy to see that it is consistent. The in-transit messages are the two green messages
(plain arrows) that cross the cut line.
Remark The logs sent_msgi [c_outi ] and rec_msgi [c_ini ] may require a large
memory, which can be implemented with a local secondary storage. In some ap-
plications, based on the computation of global states, what is relevant is not the
exact value of these sets, but their cardinality. In this case, each set sent_msgi [k],
and each set rec_msgi [j ], can be replaced by a simple counter. As a simple example,
this is sufficient when one wants to know which channels are empty in a computed
global state.
6.8 Summary
This chapter was on the nature of a distributed execution and the associated notion
of a global state (snapshot). It has defined basic notions related to the execution
of a distributed program (event, process history, process local state, event concur-
rency/independence, cut, and global state). It has also introduced three approaches
to model a distributed execution, namely, a distributed execution can be represented
as a partial order on events, a partial order on process local states, or a lattice of
global states.
The chapter then presented algorithms that compute on the fly consistent global
states of a distributed execution. It has shown that the best that can be done is the
computation of a global state that might have been passed through by the distributed
computation. The global state that has been computed is consistent, but no process
can know if it really occurred during the execution. This nondeterminism is inherent
to the nature of asynchronous distributed computing. An aim of this chapter was to
give the reader an intuition of this relativistic dimension, when the processes have
to observe on the fly the distributed execution they generate.
• The algorithm by Chandy and Lamport computes a single global state. It has been
extended in [57, 357] to allow for repeated global state computations.
• The algorithm for non-FIFO channels presented in Sect. 6.7 is due to T.H. Lai and T.H. Yang [222].
• Numerous algorithms that compute consistent global states in particular contexts
have been proposed. As an example, algorithms suited to causally ordered com-
munication are presented in [1, 14]. Other algorithms that compute (on the fly)
consistent global states are presented in [169, 235, 253, 377]. A reasoned con-
struction of an algorithm computing a consistent global state is presented in [80].
A global state computation algorithm suited to large-scale distributed systems
is presented in [213]. An introductory survey on global state algorithms can be
found in [214].
• The algorithms that have been presented are non-inhibitory, in the sense that they
never freeze the distributed execution they observe. The role of inhibition for
global state computation is investigated in [167, 363].
• The view of a distributed computation as a lattice of consistent global states is
presented and investigated in [29, 98, 250]. Algorithms which determine which
global states of a distributed execution satisfy some predefined properties are pre-
sented in [28, 98]. Those algorithms are on-the-fly algorithms.
• A global state Σ of a distributed execution is inevitable if, when considering the
lattice associated with this execution, it belongs to all the sequential observations
defined by this lattice. An algorithm that computes on the fly all the inevitable
global states of a distributed execution is described in [138].
• The causal precedence (happened before) relation has been generalized in several
papers (e.g., [174, 175, 297]).
• Snapshot computation in anonymous distributed systems is addressed in [220].
1. Adapt the algorithm described in Sect. 6.6 so that the controller processes are
able to compute several consistent global states, one after the other.
Answer in [357].
2. Use the rubber band transformation to give a simple characterization of an orphan
message.
3. Let us consider the algorithm described in Fig. 6.25 in which each controller
process cpi manages the local variable gs_statei (as in the algorithms described
in this chapter), plus an integer counti . Moreover, each message m carries the
color of its sender (in its field m.color). The local control variable counti counts
the number of green messages sent by pi minus the number of green messages
received by pi . The controller cp is one of the controllers, known by all, that is
in charge of the construction of the computed global state.
When cp has received a pair (σi, counti) from each controller cpi, it computes ct = Σ1≤i≤n counti.
Fig. 6.25 Another global state computation (non-FIFO channels, code for cpi )
This chapter is on the association of consistent dates with events, local states, or
global states of a distributed computation. Consistency means that the dates gen-
erated by a dating system have to be in agreement with the “causality” generated
by the considered distributed execution. According to the view of a distributed ex-
ecution we are interested in, this causality is the causal precedence order on events (relation −→ev), the causal precedence order on local states (relation −→σ), or the reachability relation in the lattice of global states (relation −→Σ), all introduced in
the previous chapter. In all cases, this means that the date of a “cause” has to be
earlier than the date of any of its “effects”. As we consider time-free asynchronous
distributed systems, these dates cannot be physical dates. (Moreover, even if pro-
cesses were given access to a global physical clock, the clock granularity should be
small enough to always allow for a consistent dating.)
Three types of logical time are presented, namely, scalar (or linear) time, vector
time, and matrix time. Each type of time is defined, its properties are stated, and
illustrations showing how to use it are presented.
Linear time considers events and the associated partial order relation −→ev produced by a distributed execution. (The same could be done by considering local states, and the associated partial order relation −→σ.) This notion of logical time is due to L. Lamport (1978), who introduced it together with the relation −→ev.
Linear (Scalar) Clock As just indicated, the aim is to associate a logical date with
events of a distributed execution. Let date(e) be the date associated with event e. To
be consistent the dating system has to be such that
∀ e1, e2 : (e1 −→ev e2) ⇒ (date(e1) < date(e2)).
The simplest time domain which can respect event causality is the sequence of in-
creasing integers (hence the name linear or scalar time): A date is an integer.
Hence, each process pi manages an integer local variable clocki (initialized to 0), which increases according to the relation −→ev, as described by the local clock
management algorithm of Fig. 7.1. Just before producing its next internal event,
or sending a message, a process pi increases its local clock clocki , and this new
clock value is the date of the corresponding internal or send event. Moreover, each
message carries its sending date. When a process receives a message, it first updates
its local clock and then increases it, so that the receive event has a date greater than
both the date of the corresponding send event and the date of the last local event.
It follows trivially from these rules that the linear time increases along all causal paths, and, consequently, this linear time is consistent with the relation −→ev (or −→σ). Any increment value d > 0 could be used instead of 1. Considering d = 1 allows for the smallest clock increase while keeping consistency.
The fact that (a) logical time increases along causal paths, and (b) the increment value has been chosen equal to 1, provides us with the following property:
date(e) = x ⇔ (there are x events on the longest causal path ending at e).
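The clock management rules fit in a few lines of code. Here is an illustrative sketch (with the increment value d = 1 and hypothetical method names) of the three cases: internal event, send event, and receive event.

```python
class ScalarClock:
    """Sketch of the linear (scalar) clock rules, with increment d = 1."""

    def __init__(self):
        self.clock = 0

    def internal_event(self):
        self.clock += 1                    # date of the internal event
        return self.clock

    def send_event(self):
        self.clock += 1                    # date of the send event,
        return self.clock                  # piggybacked on the message

    def receive_event(self, msg_date):
        self.clock = max(self.clock, msg_date) + 1   # after both the send event
        return self.clock                             # and the last local event

# The date of a receive event is strictly greater than the sending date:
p1, p2 = ScalarClock(), ScalarClock()
sending_date = p1.send_event()            # 1
print(p2.receive_event(sending_date))     # 2
```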
Properties The following properties follow directly from clock consistency. Let e1 and e2 be two events.
• (date(e1) ≤ date(e2)) ⇒ ¬(e2 −→ev e1).
• (date(e1) = date(e2)) ⇒ (e1 || e2).
The first property is simply the contrapositive of the clock consistency property, while the second one follows from applying the first property to both (date(e1) ≤ date(e2)) and (date(e2) ≤ date(e1)).
These properties are important from an operational point of view. The partial order −→ev on events allows us to understand and reason on distributed executions,
while the dates associated with events are operational and can consequently be used
to design and write distributed algorithms. Hence, the aim is to extract information
on events from their dates.
Unfortunately, it is possible to have (date(e1) < date(e2)) ∧ ¬(e1 −→ev e2). With linear time, the fact that the date of an event e1 is earlier (smaller) than the date of an event e2 does not mean that e1 is a cause of e2.
The previous observation is the main limitation of any linear time system. But, for-
tunately, this is not a drawback if we have to totally order events while respecting
the partial order −→ev. To that end, in addition to its temporal coordinate (its date), let
us associate with each event a spatial coordinate (namely the identity of the process
pi that issued this event).
Notation In the following, the notation ⟨h, i⟩ < ⟨k, j⟩ is used to denote (h < k) ∨ ((h = k) ∧ (i < j)).
Fig. 7.4 A sequential observation obtained from timestamps
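A timestamp ⟨date, process identity⟩ thus defines a total order that extends the causal precedence relation: dates are compared first, and process identities break ties. The following sketch (illustrative; the event names are hypothetical) shows this comparison and how sorting timestamped events yields a sequential observation.

```python
# Sketch: total order on timestamps <date, process id> (lexicographic order).
def ts_less_than(ts1, ts2):
    (h, i), (k, j) = ts1, ts2
    return h < k or (h == k and i < j)

# Sorting timestamped events produces one sequential observation compatible
# with causal precedence (the events below are hypothetical).
events = [("a", (2, 1)), ("b", (1, 2)), ("c", (2, 3)), ("d", (1, 1))]
print(sorted(events, key=lambda e: e[1]))
# [('d', (1, 1)), ('b', (1, 2)), ('a', (2, 1)), ('c', (2, 3))]
```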
Linear time and timestamps are particularly useful when one has to order operations
or messages. The most typical case is the establishment of a total order on a set of
requests, which have to be serviced one after the other. This use of timestamps will
be addressed in Chap. 10, which is devoted to permission-based mutual exclusion.
This section considers another illustration of timestamps, namely, it presents a
timestamp-based implementation of a high-level communication abstraction, called
total order broadcast.
Each message has to be to_delivered in the same order at every process, and this order has to respect the causal precedence order. To simplify the presentation, it is assumed that each message is unique (this can be easily realized by associating a pair ⟨sequence number, sender identity⟩ with each message).
Total Order Broadcast: Definition The total order broadcast abstraction is for-
mally defined by the following properties. Said differently, this means that, to be
correct, an implementation of the total order broadcast abstraction has to ensure that
these properties are always satisfied.
• Validity. If a process to_delivers a message m, there is a process that has
to_broadcast m.
• Integrity. No message is to_delivered twice.
• Total order. If a process to_delivers m before m′, no process to_delivers m′ before m.
• Causal precedence order. If m →M m′, no process to_delivers m′ before m.
• Termination. If a process to_broadcasts a message m, any process to_delivers m.
The first four properties are safety properties. Validity relates the outputs to the
inputs. It states that there is neither message corruption, nor message creation. In-
tegrity states that there is no message duplication. Total order states that messages
are to_delivered in the same order at every process, while causal precedence states
that this total order respects the message causality relation →M . Finally, the termi-
nation property is a liveness property stating that no message is lost.
Fig. 7.5 Total order broadcast: the problem that has to be solved
cooperate so that they to_deliver m1 and m2 in the same order. This remains true if
only p1 to_broadcasts a message. This is because, when it issues to_broadcast(m1 ),
p1 does not know whether p2 has independently issued to_broadcast(m2 ) or not.
It follows that a to_broadcast message generates two distinct communication
events at each process. The first one is associated with the reception of the message
from the underlying communication network, while the second one is associated
with its to_delivery.
Message Stability A means to implement the same to_delivery order at each pro-
cess consists in providing each process pi with information on the clock values of
the whole set of processes. This local information can then be used by each process
pi to know which, among the to_broadcast messages it has received and not yet
to_delivered, are stable, where message stability is defined as follows.
A message timestamped ⟨k, j⟩ received by a process pi is stable (at that process) if pi knows that all the messages it will receive in the future will have a timestamp greater than ⟨k, j⟩. The main job of a timestamp-based implementation consists in
ensuring the message stability at each process.
Global Structure and Local Variables at a Process pi The structure of the im-
plementation is described in Fig. 7.6. Each process pi has a local module implementing the operations to_broadcast() and to_deliver(). Each local module manages
the following local variables.
• clocki [1..n] is an array of integers initialized to [0, . . . , 0]. The local variable
clocki [i] is the local clock of pi , which implements the global linear time. Dif-
ferently, for j ≠ i, clocki[j] is the best approximation of the value of the local
clock of pj , as known by pi . As the communication channels are FIFO, clocki [j ]
contains the last value of clockj [j ] received by pi . Hence, in addition to the fact
that the set of local clocks {clocki }1≤i≤n implement a global scalar clock, the lo-
cal array clocki [1..n] of each process pi represents its current knowledge on the
progress of the whole set of local logical clocks.
• to_deliverablei is a sequence, initially empty (the empty sequence is denoted ε).
This sequence contains the list of messages that (a) have been received by pi , (b)
have then been totally ordered, and (c) have not yet been to_delivered. Hence,
to_deliverablei is the list of messages that can be to_delivered to the local upper
layer application process.
• pendingi is a set of pairs ⟨m, ⟨d, j⟩⟩, where m is a message whose timestamp is ⟨d, j⟩. Initially, pendingi = ∅.
operation to_broadcast(m) is
(1) clocki[i] ← clocki[i] + 1;
(2) let ts(m) = ⟨clocki[i], i⟩;
(3) pendingi ← pendingi ∪ {⟨m, ts(m)⟩};
(4) for each j ∈ {1, . . . , n} \ {i} do send TOBC (m, ts(m)) to pj end for.
operation to_deliver(m) is
(5) wait (to_deliverablei ≠ ε);
(6) let m be the first message in the list to_deliverablei;
(7) withdraw m from to_deliverablei;
(8) return(m).
background task T is
(16) repeat forever
(17) wait (pendingi ≠ ∅);
(18) let ⟨m, ⟨d, k⟩⟩ be the pair in pendingi with the smallest timestamp;
(19) if (∀ j ≠ k : ⟨d, k⟩ < ⟨clocki[j], j⟩) then
(20) add m at the tail of to_deliverablei;
(21) pendingi ← pendingi \ {⟨m, ⟨d, k⟩⟩}
(22) end if
(23) end repeat.
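The heart of the background task is the stability test of line 19: the pending message with the smallest timestamp can be moved to to_deliverablei only when, for every other process, the locally known clock value places it beyond that timestamp. Below is a minimal sketch of this test (illustrative only; the function name and the dictionary representation of clocki[1..n] are assumptions).

```python
# Illustrative sketch of the delivery (stability) test of the background task.
def try_to_deliver(pending, clock):
    """pending: set of (msg, (d, k)); clock: view of clock_i[1..n] as a dict.
    Moves the smallest-timestamped message out of pending if it is stable."""
    if not pending:
        return None
    msg, (d, k) = min(pending, key=lambda e: e[1])
    # Stable: every other process is known to be beyond the timestamp <d, k>.
    if all((d, k) < (clock[j], j) for j in clock if j != k):
        pending.remove((msg, (d, k)))
        return msg
    return None

pending = {("m1", (3, 2)), ("m2", (5, 1))}
clock = {1: 5, 2: 4, 3: 6}            # p_i's current knowledge of the three clocks
print(try_to_deliver(pending, clock))  # 'm1': (3,2) < (5,1) and (3,2) < (6,3)
print(try_to_deliver(pending, clock))  # None: (5,1) is not < (4,2)
```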
process a control message denoted CATCH_UP (). This message carries the new clock value of pi (line 13). As the channels are FIFO, it follows that when this message is received by pk, 1 ≤ k ≠ i ≤ n, this process has necessarily received all
the messages TOBC () sent by pi before this CATCH _ UP () message. This allows
pk to know if, among the to_broadcast messages it has received, the one with the
smallest timestamp is stable (Fig. 7.8).
Finally, a background task checks if some of the to_broadcast messages which
have been received by pi are stable, and can consequently be moved from the set pendingi to the sequence to_deliverablei (lines 16–23).
Theorem 4 The algorithm described in Fig. 7.7 implements the total order broad-
cast abstraction.
Proof The validity property (neither corruption, nor creation of messages) follows
directly from the reliability of the channels. The integrity property (no duplica-
tion) follows from the reliability of the underlying channels, the fact that no two
to_broadcast messages have the same timestamp, and the fact that when a message
is added to to_deliverablei , it is suppressed from pendingi .
The proof of the termination property is by contradiction. Assuming that
to_broadcast messages are never to_delivered by a process pi (i.e., added to
to_deliverablei), let m be the one with the smallest timestamp, and let ts(m) = ⟨d, k⟩.
Let us observe that, each time a process pj updates its local clock (to a greater
value), it sends its new clock value to all processes. This occurs at lines 1 and 4, or
at lines 12 and 13.
As each other process pj receives the message TOBC (m, ts(m)) sent by pk , its
local clock becomes greater than d (if it was not before). It then follows from
the previous observation that a value of clockj [j ] greater than d becomes even-
tually known by each process, and we eventually have clocki [j ] > d. Hence, the
to_delivery predicate for m becomes eventually satisfied at pi . The message m is
consequently moved from pendingi to to_deliverablei , which contradicts the initial
assumption and proves the termination property.
The proof of the total order property is also by contradiction. Let mx and my be two messages timestamped ts(mx) = ⟨dx, x⟩ and ts(my) = ⟨dy, y⟩, respectively. Let us assume that ts(mx) < ts(my), and my is to_delivered by a process pi before mx (i.e., my is added to to_deliverablei before mx).
Just before my is added to to_deliverablei, ⟨my, ts(my)⟩ is the pair with the smallest timestamp in pendingi, and ∀ j ≠ y : ⟨dy, y⟩ < ⟨clocki[j], j⟩ (lines 18–19). It follows that we have then ⟨dx, x⟩ < ⟨dy, y⟩ < ⟨clocki[x], x⟩. As (a) px sends only increasing values of its local clock (lines 1 and 4, and lines 12–13), (b) dx < clocki[x], and (c) the channels are FIFO, it follows that pi has received the message TOBC (mx, ts(mx)) before the message carrying the value of clockx[x] which entailed the update of clocki[x] making true the predicate ⟨dy, y⟩ < ⟨clocki[x], x⟩.
Hence, vci [k] is the number of events produced by pk in the causal past of the
event e. The term 1(k, i) is to count the event e, which has been produced by pi .
The value of vci [1..n] is the vector date of the event e.
Vector Clock: Algorithm The algorithm implementing the vector clock system
has exactly the same structure as the one for linear time (Fig. 7.1). The difference
lies in the fact that only the entry i of the local vector clock of pi (i.e., vci [i])
is increased each time it produces a new event, and each message m piggybacks
the current value of the vector time, which defines the sending time of m. This
value allows the receiver to update its local vector clock, so that the date of the
receive event is after both the date of the sending event associated with m and the
immediately preceding local event produced by the receiver process. The operator
max() on integers is extended to vectors as follows (line 5):
max(v1, v2) = [max(v1[1], v2[1]), . . . , max(v1[j], v2[j]), . . . , max(v1[n], v2[n])].
Let us observe that, for any pair (i, k), it follows directly from the vector clock
algorithm that (a) vci [k] never decreases, and (b) at any time, vci [k] ≤ vck [k].
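The vector clock rules mirror the scalar ones, with a component-wise max at reception. An illustrative sketch follows (process indices are 0-based here, and the class and method names are assumptions).

```python
class VectorClock:
    """Sketch of the vector clock rules for process p_i (indices 0..n-1)."""

    def __init__(self, n, i):
        self.i = i
        self.vc = [0] * n

    def local_or_send_event(self):
        self.vc[self.i] += 1               # only entry i is increased locally
        return list(self.vc)               # vector date (piggybacked if it is a send)

    def receive_event(self, msg_vc):
        self.vc = [max(a, b) for a, b in zip(self.vc, msg_vc)]  # component-wise max
        self.vc[self.i] += 1               # then count the receive event itself
        return list(self.vc)

p1, p2 = VectorClock(2, 0), VectorClock(2, 1)
m_date = p1.local_or_send_event()          # [1, 0]
print(p2.receive_event(m_date))            # [1, 1]
```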
7.2 Vector Time 161
Notation Given two vectors vc1 and vc2, both of size n, we have
• vc1 ≤ vc2 =def (∀ k ∈ {1, . . . , n} : vc1[k] ≤ vc2[k]).
• vc1 < vc2 =def (vc1 ≤ vc2) ∧ (vc1 ≠ vc2).
• vc1 || vc2 =def ¬(vc1 ≤ vc2) ∧ ¬(vc2 ≤ vc1).
When considering Fig. 7.10, we have [0, 2, 0, 0] < [0, 3, 2, 2], and [0, 2, 0, 0] || [0, 0,
1, 0].
Theorem 5 Let e.vc be the vector date associated with event e by the algorithm of Fig. 7.9. These dates are such that, for any two distinct events e1 and e2, we have (a) (e1 −→ev e2) ⇔ (e1.vc < e2.vc), and (b) (e1 || e2) ⇔ (e1.vc || e2.vc).
Corollary 1 Given the dates of two events, determining if these events are causally
related or not can require up to n comparisons of integers.
Reducing the Cost of Comparing Two Vector Dates As the cost of comparing
two dates is O(n), an important question is the following: Is it possible to add con-
trol information to a date in order to reduce the cost of their comparison? If the
events are produced by the same process pi , a simple comparison of the ith entry of
their vector dates allows us to conclude. More generally, given an event e produced
by a process pi , let us associate with e a timestamp defined as the pair e.vc, i. We
have then the following theorem from which it follows that, thanks to the knowledge
of the process that produced an event, the cost of deciding if two events are or not
causally related is reduced to two comparisons of integers.
7.2 Vector Time 163
Theorem 6 Let e1 and e2 be events timestamped ⟨e1.vc, i⟩ and ⟨e2.vc, j⟩, respectively, with i ≠ j. We have (e1 −→ev e2) ⇔ (e1.vc[i] ≤ e2.vc[i]), and (e1 || e2) ⇔ ((e1.vc[i] > e2.vc[i]) ∧ (e2.vc[j] > e1.vc[j])).
Proof Let us first observe that time increases only along causal paths, and only the
process that produced an event entails an increase of the corresponding entry in a
vector clock (Observation O). The proof considers each case separately.
• (e1 −→ev e2) ⇔ (e1.vc[i] ≤ e2.vc[i]).
If e1 −→ev e2, there is a causal path from e1 to e2, and we have e1.vc ≤ e2.vc (Theorem 5), from which e1.vc[i] ≤ e2.vc[i] follows.
If e1 .vc[i] ≤ e2 .vc[i], it follows from observation O that there is a causal path
from e1 to e2 .
• (e1 || e2 ) ⇔ ((e1 .vc[i] > e2 .vc[i]) ∧ (e2 .vc[j ] > e1 .vc[j ])).
As, at any time, we have vcj [i] ≤ vci [i], pi increases vci [i] when it produces
e1 , it follows from the fact that there is no causal path from e1 to e2 and obser-
vation O that e1 .vc[i] > e2 .vc[i]. The same applies to e2 .vc[j ] with respect to
e1 .vc.
In the other direction, we conclude from e1 .vc[i] > e2 .vc[i] that there is no
causal path from e1 to e2 (otherwise we would have e1 .vc[i] ≤ e2 .vc[i]). Simi-
larly, e2 .vc[j ] > e1 .vc[j ] implies that there is no causal path from e2 to e1 .
To illustrate this theorem, let us consider Fig. 7.10, where the first event of p2 is denoted e_2^1, the first event of p3 is denoted e_3^1, and the second event of p3 is denoted e_3^2. Event e_2^1 is timestamped ⟨[0, 1, 0, 0], 2⟩, e_3^1 is timestamped ⟨[0, 0, 1, 0], 3⟩, and e_3^2 is timestamped ⟨[0, 3, 2, 2], 3⟩. As e_2^1.vc[2] = 1 ≤ e_3^2.vc[2] = 3, we conclude e_2^1 −→ev e_3^2. As e_2^1.vc[2] = 1 > e_3^1.vc[2] = 0 and e_3^1.vc[3] = 1 > e_2^1.vc[3] = 0, we conclude e_2^1 || e_3^1.
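Theorem 6 thus yields a constant-time test once the producing processes are known. The sketch below is illustrative only (it uses 0-based process indices and the timestamps of the example above).

```python
# Sketch: deciding causality/concurrency from timestamps <vc, process id>
# of events produced by two distinct processes (Theorem 6), 0-based indices.
def causally_precedes(ts1, ts2):
    (vc1, i), (vc2, _) = ts1, ts2
    return vc1[i] <= vc2[i]

def concurrent(ts1, ts2):
    (vc1, i), (vc2, j) = ts1, ts2
    return vc1[i] > vc2[i] and vc2[j] > vc1[j]

e2_1 = ([0, 1, 0, 0], 1)     # event of p2 (0-based index 1)
e3_1 = ([0, 0, 1, 0], 2)     # event of p3 (0-based index 2)
e3_2 = ([0, 3, 2, 2], 2)
print(causally_precedes(e2_1, e3_2))   # True:  1 <= 3
print(concurrent(e2_1, e3_1))          # True:  1 > 0 and 1 > 0
```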
Let us consider the message m2 . As its sending time is [3, 0] and its receive time
is [3, 4], it follows that it is impossible for an external observer to see a global time
GD such that GD[1] < 3 and GD[2] ≥ 4. The restrictions on the set of possible
vector dates due to the four messages exchanged in the computation are depicted in
Fig. 7.12. Each message prevents some vector clock values from being observed. As
an example, the date [2, 5] cannot exist, while the vector date [3, 3] can be observed
by an external observer.
The borders of the area including the vector dates that could be observed by
external observers are indicated with dotted lines in the figure. They correspond to
the history (sequence of events) of each process.
Let us consider the distributed execution described on the left top of Fig. 7.13. Vec-
tor dates of events and local states are indicated in this figure. Both initial local states σ_1^0 and σ_2^0 are dated [0, 0]. Then, each σ_i^x inherits the vector date of the event e_i^x that generated it. As an example, the date of the local state σ_2^2 is [0, 2].
The corresponding lattice of global states is described at the right of Fig. 7.13. In
this lattice, a vector date is associated with each global state as follows: the ith entry
of the vector is the number of events produced by pi. This means that, when considering the figure, the vector date of the global state [σix, σjy] is [x, y]. (This dating
system for global states, which is evidently based on vector clocks, was implicitly
introduced and used in Sect. 6.3.2, where the notion of a lattice of global states was
introduced.) Trivially, the vector time associated with global dates increases along
each path of the lattice.
Let us recall that the greatest lower bound (GLB) of a set of vertices of a lattice is
their greatest common predecessor, while the least upper bound (LUB) is their least
common successor. Due to the fact that the graph is a lattice, each of the GLB and
the LUB of a set of vertices (global states) is unique.
An important consequence of the fact that the set of consistent global states is a lattice, and of the associated dating of global states, is the following. Let us consider two global states Σ1 and Σ2 whose dates are [d1, . . . , dn] and [d1′, . . . , dn′], respectively. Let Σ− = GLB(Σ1, Σ2) and Σ+ = LUB(Σ1, Σ2). We have
• date(Σ−) = date(GLB(Σ1, Σ2)) = [min(d1, d1′), . . . , min(dn, dn′)], and
• date(Σ+) = date(LUB(Σ1, Σ2)) = [max(d1, d1′), . . . , max(dn, dn′)].
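As a minimal illustration (not from the book), the dates of the GLB and LUB of two consistent global states can thus be computed entry-wise:

def glb_date(d1, d2):
    # date of the greatest lower bound: entry-wise minimum
    return [min(a, b) for a, b in zip(d1, d2)]

def lub_date(d1, d2):
    # date of the least upper bound: entry-wise maximum
    return [max(a, b) for a, b in zip(d1, d2)]

print(glb_date([3, 3], [2, 5]))   # [2, 3]
print(lub_date([3, 3], [2, 5]))   # [3, 5]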
To illustrate the use of vector clocks, this section presents an algorithm that deter-
mines the first global state of a distributed computation that satisfies a conjunction
of stable local predicates.
On the Notion of a “First” Global State The consistent global state Σ defined by a process pi from the vector date firsti[1..n] is the first global state satisfying ∧i LPi, in the following sense: there is no global state Σ′ such that (Σ′ ≠ Σ) ∧ (Σ′ |= ∧i LPi) ∧ (Σ′ →Σ Σ).
Where Is the Difficulty Let us consider the execution described in Fig. 7.14. The
predicates LP1 , LP2 and LP3 are satisfied from the local states σ1x , σ2b , and σ3c ,
respectively. (The fact that they are stable is indicated by a bold line on the corre-
sponding process axis.) Not all local states are represented; the important point is
that σ2y is the state of p2 after it has sent the message m8 to p1, and σ3z is the state of p3 after it has sent the message m5 to p2. The first consistent global state satisfying LP1 ∧ LP2 ∧ LP3 is Σ = [σ1x, σ2y, σ3z].
Using causality created by message exchanges and appropriate information pig-
gybacked by messages, the processes can “learn” information related to the predi-
cate detection. More explicitly, we have the following when looking at the figure.
• When p1 receives m1 , it learns nothing about the predicate detection (and simi-
larly for p3 when it receives m2 ).
• When p1 receives m4 (sent by p2 ), it can learn that (a) the global state
[σ10 , σ2b , σ30 ] is consistent and (b) it “partially” satisfies the global predicate,
namely, σ2b |= LP2 .
• When p2 receives the message m3 (from p1), it can learn that the global state [σ1a, σ2b, σ30] is consistent and such that σ2b |= LP2. When it later receives the message m5 (from p3), it can learn that the global state [σ1a, σ2b, σ3c] is consistent and such that (σ2b |= LP2) ∧ (σ3c |= LP3). Process p3 can learn the same when it receives the
message m7 (sent by p2 ).
• When p1 receives the message m8 (sent by p2 ), it can learn that, while LP1 is
not yet locally satisfied, the global state [σ1a , σ2b , σ3c ] is the first consistent global
state that satisfies LP2 and LP3 .
• Finally, when p1 produces the internal event giving rise to σ1x , it can learn that
the first consistent global state satisfying the three local predicates is [σ1x, σ2y, σ3z].
The corresponding consistent cut is indicated by a dotted line on the figure.
Let us recall that the vector date of the local state σ2y is the date of the preceding event, which is the sending of m8, and this date is piggybacked by m8. Similarly, the date of σ3z is the date of the sending of m5.
Another example is given in Fig. 7.15. In this case, due to the flow of control
created by the exchange of messages, it is only when p3 receives m3 that it can learn
that [σ1x, σ2y, σ3z] is the first consistent global state satisfying ∧i LPi. As previously, the corresponding consistent cut is indicated by a dotted line on the figure.
As no control messages are allowed, it is easy to see that, in some scenarios, enough application messages have to be sent after ∧i LPi is satisfied, in order to compute the vector date of the first consistent global state satisfying the global predicate ∧i LPi.
The Algorithm The algorithm is described in Fig. 7.16. It ensures that, if consistent global states satisfying ∧i LPi exist, at least one process will compute the vector date of the first of them. As already indicated, this is under the assumption that the processes send enough application messages.
Before producing a new event, pi always increases its local clock vci [i] (lines 7,
10, and 13). If the event e is an internal event, and LPi has not yet been satisfied
(i.e., donei is false), pi invokes the operation check_lp() (line 9). If its current local
state σ satisfies LPi (line 1), pi adds its identity to sati and, as it is the first time that
LPi is satisfied, it defines accordingly firsti as being vci (which is the vector date
of the global state associated with the causal past of the event e currently produced
by pi , line 2). Process pi then checks if the consistent global state defined by the
vector date firsti (= vci) satisfies the whole global predicate ∧j LPj (lines 3–4). If it is the case, pi has determined the first global state satisfying the global predicate.
If the event e is the sending of a message by pi to pj , before sending the message,
process pi first moves to its next local state σ (line 10), and does the same as if e was
an internal event. The important point is that, in addition to the application message,
pi sends to pj (line 12) its full current state (from the global state determination
point of view), which consists of vci (current vector time at pi ), sati (processes
whose local predicates are satisfied), and firsti (date of the consistent global state
satisfying the local predicates of the processes in sati ).
If the event e is the reception by pi of a message sent by pj, pi first updates its vector clock (line 13), and moves to its next local state (line 14). As in the previous cases, if LPi has not yet been satisfied, pi then invokes check_lp() (line 15). Finally, if pi learns something new with respect to local predicates (test of line 16), it combines what it knew before (sati and firsti[1..n]) with what it has just learned (sat and first[1..n]).
The new value of firsti [1..n] is the vector date of the first consistent global state
in which all the local states of the processes in sati ∪ sat satisfy their local predi-
cates. Finally, pi checks if the global state defined by firsti [1..n] satisfies all local
predicates (lines 18–20).
When looking at the executions described in Figs. 7.14 and 7.15, we have the
following. In Fig. 7.14, p1 ends the detection at line 4 after it has produced the
event that gave rise to σ1x . In Fig. 7.15, p3 ends the detection at line 19 after it has
received the protocol message MSG(m3 , [−, −, −], {1, 2}, [x, y, 0]). Just before it
receives this message we have sat3 = {2, 3} and first3 = [0, y, z].
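The figure itself is not reproduced here, but the behavior described above can be summarized by the following Python sketch; it is an interpretation of the text rather than the book's exact code, and the class/method names and the way local states and messages are passed around are assumptions of the sketch.

class Process:
    def __init__(self, i, n, local_predicate):
        self.i, self.n = i, n
        self.lp = local_predicate      # LP_i, a stable predicate on the local state
        self.vc = [0] * n              # vector clock vc_i
        self.sat = set()               # processes whose LP is known to be satisfied
        self.first = [0] * n           # date of the first global state satisfying them
        self.done = False              # has LP_i already been satisfied locally?
        self.detected = None           # date of the first global state satisfying /\ LP_j

    def _check_lp(self, state):        # the check_lp() operation of the text
        if not self.done and self.lp(state):
            self.done = True
            self.sat.add(self.i)
            self.first = list(self.vc) # vc_i dominates first_i at this point
            self._check_global()

    def _check_global(self):
        if len(self.sat) == self.n and self.detected is None:
            self.detected = list(self.first)

    def internal_event(self, state):
        self.vc[self.i] += 1
        self._check_lp(state)

    def send_event(self, state):
        self.vc[self.i] += 1
        self._check_lp(state)
        # control data piggybacked on the application message
        return (list(self.vc), set(self.sat), list(self.first))

    def receive_event(self, state, piggyback):
        vc, sat, first = piggyback
        self.vc = [max(a, b) for a, b in zip(self.vc, vc)]
        self.vc[self.i] += 1
        self._check_lp(state)
        if not sat.issubset(self.sat):          # something new is learned
            self.sat |= sat
            self.first = [max(a, b) for a, b in zip(self.first, first)]
            self._check_global()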
Fig. 7.18 Vector clock system for relevant events (code for process pi )
An Algorithm Solving the IPT Problem: Local Variables The kth relevant
event on process pk is unambiguously identified by the pair (k, vck), where vck is the
value of vck [k] when pk has produced this event. The aim of the algorithm is conse-
quently to associate with each relevant event e a set e.imp such that (k, vck) ∈ e.imp
if and only if the corresponding event is an immediate predecessor of e.
To that end, each process pi manages the following local variables:
• vci [1..n] is the local vector clock.
• impi [1..n] is a vector initialized to [0, . . . , 0]. Each variable impi [j ] contains 0
or 1. Its meaning is the following: impi [j ] = 1 means that the last relevant event
produced by pj , as known by pi , is candidate to be an immediate predecessor of
the next relevant event produced by pi .
An Algorithm Solving the IPT Problem: Process Behavior The algorithm ex-
ecuted by each process pi is a combined management of both the vectors vci and
impi . It is described in Fig. 7.20.
When a process pi produces a relevant event e, it increases its own vector clock
(line 1). Then, it considers impi [1..n] to compute the immediate predecessors of
e (line 2). According to the definition of each entry of impi , those are the events
identified by the pairs (k, vci[k]) such that impi[k] = 1 (which indicates that the
last relevant event produced by pk and known by pi , is still a candidate to be an
immediate predecessor of e).
Then, as it produced a new relevant event e, pi must reset its local array
impi [1..n] (lines 3–4). It resets (a) impi [i] to 1 (because e is candidate to be an
immediate predecessor of relevant events that will appear in its causal future) and
(b) each impi [j ] (j = i) to 0 (because no event that will be produced in the future
of e can have them as immediate predecessors).
When pi sends a message, it attaches to this message all its local control information, namely, vci and impi (line 5). When it receives a message, pi updates its local vector clock as in the basic algorithm, so that the new value of vci is the component-wise maximum of vc and the previous value of vci.
Fig. 7.21 Four possible cases when updating impi [k], while vci [k] = vc[k]
The update of the array impi depends on the value of each entry of vci .
• If vci [k] < vc[k], pj (the sender of the message) has fresher information on pk
than pi . Consequently, pi adopts what is known by pj , and sets impi [k] to imp[k]
(line 7).
• If vci [k] = vc[k], the last relevant event produced by pk and known by pi is the
same as the one known by pj . If this event is still candidate to be an immedi-
ate predecessor of the next event produced by pi from both pi ’s point of view
(impi[k]) and pj’s point of view (imp[k]), then impi[k] remains equal to 1;
otherwise, impi [k] is set to 0 (line 8). The four possible cases are depicted in
Fig. 7.21.
• If vci [k] > vc[k], pi knows more on pk than pj . Hence, it does not modify the
value of impi [k] (line 9).
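A Python rendering of this behavior (a sketch based on the description above, not the code of Fig. 7.20; process indices are 0-based and the send/receive plumbing is left abstract) could look as follows.

class IPTProcess:
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.vc = [0] * n     # vector clock
        self.imp = [0] * n    # imp[k] = 1: the last relevant event of p_k known here
                              # is still a candidate immediate predecessor

    def relevant_event(self):
        # collect the candidates before advancing the local entry, so that the
        # pairs (k, vc[k]) identify the predecessors and not the new event itself
        predecessors = {(k, self.vc[k]) for k in range(self.n) if self.imp[k] == 1}
        self.vc[self.i] += 1
        self.imp = [0] * self.n
        self.imp[self.i] = 1  # the new event is now the local candidate
        return predecessors

    def on_send(self):
        return (list(self.vc), list(self.imp))   # control data piggybacked on the message

    def on_receive(self, piggyback):
        vc, imp = piggyback
        for k in range(self.n):
            if self.vc[k] < vc[k]:       # the sender is fresher on p_k: adopt its view
                self.vc[k] = vc[k]
                self.imp[k] = imp[k]
            elif self.vc[k] == vc[k]:    # same last relevant event of p_k: keep it as a
                self.imp[k] &= imp[k]    # candidate only if both views still consider it one
            # if self.vc[k] > vc[k]: local knowledge is fresher, nothing changes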
This section first shows that vector time has an inherent price, namely, the size of
vector clocks cannot be less than the number of processes. Then it introduces the
notion of a relevant event and presents a general technique that allows us to reduce
the number of vector entries that have to be transmitted in each message. Finally,
it presents the notions of approximation of the causality relation and approximate
vector clocks.
A vector clock has one entry per process, i.e., n entries. It follows from the algorithm
of Fig. 7.9 and Theorem 5 that vectors of size n are sufficient to capture causality
and independence among events. Hence the question: Are vectors of size n neces-
sary to capture causality and concurrency among the events produced by n asyn-
chronous processes communicating by sending and receiving messages through a
reliable asynchronous network?
This section shows that the answer to this question is “yes”. This means that there
are distributed executions in which causality and concurrency cannot be captured
by vector clocks of size smaller than n. The proof of this result, which is due to B.
Charron-Bost (1991), consists in building such a specific execution and showing a
contradiction if vector clocks of size smaller than n are used to capture causality
and independence of events.
Lemma 1 ∀ i : 1 ≤ i ≤ n : lri || fsi+1 (where fsi+1 denotes the first send event of pi+1, lri denotes the last receive event of pi, and indices are taken modulo n).

Proof As any process first sends messages before receiving any message, there is no causal chain involving more than one message. The lemma follows from this observation and the fact that pi+1 does not send a message to pi.

Lemma 2 ∀ i, j : 1 ≤ i ≠ j ≤ n : fsi+1 →ev lrj.
Proof Let us first consider the case j = i +1. We have then lrj = lri+1 . As both fsi+1
and lri+1 are produced by pi+1 , and the send phase precedes the receive phase, the
lemma follows.
Let us now consider the case j ≠ i + 1. Due to the assumption, we have then j ∉ {i, i + 1}. It then follows from the communication pattern that pi+1 sends a message to pj. Let s(i, j) and r(i, j) be the corresponding send event and receive event, respectively. We have (a) fsi+1 = s(i, j) or fsi+1 →ev s(i, j), (b) s(i, j) →ev r(i, j), and (c) r(i, j) = lrj or r(i, j) →ev lrj. Combining the previous relations, we obtain fsi+1 →ev lrj.
Theorem 7 Let k be the dimension of the vector clocks used to timestamp the events of the previous execution. Capturing both their causal precedence and their independence requires k ≥ n.

Proof Let e.vc[1..k] (e.vc for short) be the vector date associated with event e. Let us consider a process pi. It follows from Lemma 1 that, for each i ∈ {1, . . . , n}, we have lri || fsi+1, which means that the vector dates lri.vc and fsi+1.vc have to be incomparable. If we had lri.vc[j] ≥ fsi+1.vc[j] for every j, we would have lri.vc ≥ fsi+1.vc, which is impossible since the events are independent. There consequently exist indexes x such that lri.vc[x] < fsi+1.vc[x]. Let ℓ(i) be one of these indexes. As this is true for any i ∈ {1, . . . , n}, we have defined a function ℓ() from {1, . . . , n} into {1, . . . , k}.
The rest of the proof shows that k ≥ n. This is done by showing that the function ℓ() is one-to-one.
Let us assume by contradiction that ℓ() is such that there are two distinct indexes i and j such that ℓ(i) = ℓ(j) = x. Due to the definition of ℓ(), we have lri.vc[x] < fsi+1.vc[x] (A1), and lrj.vc[x] < fsj+1.vc[x] (A2). On another side, it follows from Lemma 2 that fsi+1 →ev lrj. Since vector clocks are assumed to capture causality, we have fsi+1.vc ≤ lrj.vc (A3). Combining (A1), (A2), and (A3), we obtain lri.vc[x] < fsi+1.vc[x] ≤ lrj.vc[x] < fsj+1.vc[x], from which we conclude ¬(fsj+1 →ev lri), which contradicts Lemma 2 and concludes the proof of the theorem.
The reader can notice that the previous proof relies only on the fact that vector
entries are comparable. Moreover, from a theoretical point of view, it does not re-
quire the value domain of each entry to be restricted to the set of integers (even if
integers are easier to handle).
Theorem 7 shows that there are distributed executions in which the dimension of
vector time must be n if one wants to capture causality and independence (concur-
rency) among the events generated by n processes.
Considering an abstraction level defined by relevant events (as defined in
Sect. 7.2.6), this section presents an abstract condition, and two of its implemen-
tations, that allows each message to carry only a subset of the vector clock of its
sender. Of course, in the worst case, this subset is the whole vector clock of the
sender. This condition is on a “per message” basis. This means that the part of a
vector clock carried by a message m depends on what is known by its sender about
the values of all vector clocks, when it sends this message. This control information
is consequently determined just before a message is sent.
Let us recall that communication events cannot be relevant events. Only a subset
of the internal events are relevant.
Let m be a message sent by pi to pj; let s(m, i, j) denote its send event and r(m, i, j) denote its receive event. Moreover, let pre(m, i, j) denote the last relevant event (if any) produced by pj before r(m, i, j). Moreover, e being any (relevant or not) event produced by a process px, let e.vcx[k] be the value of vcx[k] when px produces e. Let K(m, i, j, k) be the following predicate:

K(m, i, j, k) =def ( s(m, i, j).vci[k] ≤ pre(m, i, j).vcj[k] ).

When true, K(m, i, j, k) means that vci[k] is smaller than or equal to vcj[k] when pi sends m to pj; consequently, it is not necessary for pi to transmit the value of vci[k] to pj. The next theorem captures the full power of this predicate.

Theorem The pair (k, vci[k]) does not need to be attached to the message m if and only if K(m, i, j, k) is satisfied.
Proof Necessity. Let us assume that ¬K(m, i, j, k) is satisfied, i.e., s(m, i, j ).vci [k]
> pre(m, i, j ).vcj [k]. According to the definition of vector clocks we must have
s(m, i, j ).vci [k] ≤ r(m, i, j ).vcj [k]. If the pair (k, vci [k]) is not attached to m, pj
cannot update vcj [k] to its correct value, which proves the necessity part.
Sufficiency. Let us consider a message m sent by pi to pj such that K(m, i, j, k)
is satisfied. Hence, we have s(m, i, j ).vci [k] ≤ pre(m, i, j ).vcj [k]. As r(m, i, j ).
vcj [k] = pre(m, i, j ).vcj [k], we have s(m, i, j ).vci [k] ≤ r(m, i, j ).vcj [k], from
which it follows that, if the pair (k, vci [k]) is attached to m, it is useless as vcj [k]
does not need to be updated.
Fig. 7.24 Management of vci [1..n] and kprimei [1..n, 1..n] (code for process pi ): Algorithm 1
A First Algorithm The algorithm of Fig. 7.24 describes the way each process pi
has to manage its vector clock vci [1..n] and its matrix kprimei [1..n, 1..n] so that the
previous relation is satisfied. Let us recall that vci [1..n] is initialized to [0, . . . , 0],
while kprimei [1..n, 1..n] is initialized to [[1, . . . , 1], . . . , [1, . . . , 1]].
When it produces a relevant event, pi increases vci[i] (line 1) and resets to 0 (line 2) all entries of the column kprimei[1..n, i] (except its own entry). This is because pi then knows that vci[i] > vcℓ[i] for every ℓ ≠ i.
When it sends a message to a process pj , pi adds to it the set vc_set con-
taining the pairs (k, vck) such that, to its knowledge vcj [k] < vci [k] (line 3). Ac-
cording to the definition of kprimei [1..n, 1..n], those are the pairs (k, −) such that
kprimei [j, k] = 0.
When a process pi receives a message m with an associated set of pairs vc_set, it
considers separately each pair (k, vck) ∈ vc_set. This is in order to preserve the prop-
erty associated with K(m, i, j, k) for each k, i.e., (kprimei[ℓ, k] = 1) ⇒ (vci[k] ≤ vcℓ[k]). The behavior of pi depends on the values of the pair (vci[k], vck). More
precisely, we have the following.
• If vci [k] < vck, pi is less informed on pk than the sender pj of the mes-
sage. It consequently updates vci [k] to a more recent value (line 6), and sets
(a) kprimei[ℓ, k] to 0 for every ℓ ∉ {i, j, k} (this is because pi does not know whether vcℓ[k] ≥ vci[k], lines 7–9), and (b) kprimei[j, k] to 1 (because now it knows that vcj[k] ≥ vci[k], line 10).
• If vci [k] = vck, pi sets accordingly kprimei [j, k] to 1 (line 11).
• If vci [k] > vck, pi is more informed on pk than the sender pj of the message. It
consequently does not modify the array kprimei (line 12).
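The following Python sketch follows this description (it is an interpretation, not the code of Fig. 7.24; process indices are 0-based and message transport is left abstract).

class EfficientVCProcess:
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.vc = [0] * n
        # kprime[l][k] = 1 means: p_i knows that vc_l[k] >= vc_i[k]
        self.kprime = [[1] * n for _ in range(n)]

    def relevant_event(self):
        self.vc[self.i] += 1
        for l in range(self.n):            # p_i's own entry is now ahead of everybody's
            if l != self.i:
                self.kprime[l][self.i] = 0

    def on_send(self, j):
        # transmit only the entries p_j is not known to dominate
        return [(k, self.vc[k]) for k in range(self.n) if self.kprime[j][k] == 0]

    def on_receive(self, j, vc_set):
        for k, vck in vc_set:
            if self.vc[k] < vck:
                self.vc[k] = vck
                for l in range(self.n):    # other processes may now lag behind p_i on entry k
                    if l not in (self.i, j, k):
                        self.kprime[l][k] = 0
                self.kprime[j][k] = 1      # the sender obviously dominates this entry
            elif self.vc[k] == vck:
                self.kprime[j][k] = 1
            # if self.vc[k] > vck: nothing new is learned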
Fig. 7.25 Management of vci [1..n] and kprimei [1..n, 1..n] (code for process pi ): Algorithm 2
Remark When considering a process pi, the values in the column kprimei[1..n, i] (except kprimei[i, i]) remain equal to 0 after their first update (line 2).
The Case of FIFO Channels If the channels are FIFO, when a process pi sends a message m to another process pj, the following line can be added after line 3:

for each k ∈ {1, . . . , n} do kprimei[j, k] ← 1 end for.

These updates save future sendings of pairs to pj as long as pi does not produce a new relevant event (i.e., until vci[i] is modified). In particular, with this enhancement, a process pi sends the pair (i, vci[i]) to a process pj only if, since its last relevant event, this sending is the first sending to pj.
because, if vci[k] = vcj[k] and kprimej[ℓ, k] = 1, pi knows that vcℓ[k] ≥ vci[k], if it did not know it before. There is of course a tradeoff between the number of pairs whose sending is saved and the number of binary vectors which now have to be sent (see below).
predicate heuristic() is return((n − Σ1≤x≤n kprimei [x, k]) > c) end predicate.
The value c is a threshold: if the number of triplets to transmit is not greater than c,
then algorithm P2 is used, otherwise algorithm P1 is used.
Of course, plenty of possibilities are offered to the user. As a toy example, the
messages sent to processes with an even identity could be sent and received with
P1, while the other would be sent and received with P2. A more interesting strategy
is the following. Let pi and pj be any pair of processes, where pi is the sender and
pj the receiver. When P0 is not more efficient than P1 or P2 (line 2), pi alternates between P1 and P2 for its successive messages to pj. Another strategy would consist in drawing (at line 4) a random number in {1, 2}, which would be used to direct a process to use P1 or P2.
k-Restricted Vector Clocks The notion of restricted vector clocks was introduced
by F. Torres-Rojas and M. Ahamad (1999). It imposes a bound k, 1 ≤ k ≤ n, on the
size of vector clocks (i.e., the dimension of vector time). The vector clock of each
process pi has only k entries, namely, vci [1..k]. These k-restricted vector clocks are
managed by the algorithm described in Fig. 7.27 (which is the same as the vector
clock algorithm described in Fig. 7.9, except for the way the vector entries are used).
Let fk () be a deterministic surjective function from {1, . . . , n} (the set of process
identities) to {1, . . . , k} (the set of vector clock entries). As a simple example, fk (i)
can be (i mod k) + 1. The function fk () defines the set of processes that share the
same entry of the restricted vector clocks.
Let e1 and e2 be two events timestamped ⟨e1.vc[1..k], i⟩ and ⟨e2.vc[1..k], j⟩, respectively. The set of timestamps defines an approximation relation as follows:
• ((i = j) ∧ (e1.vc[i] < e2.vc[i])) ⇒ (e1 →ev e2).
• ((i = j) ∧ (e1.vc[i] > e2.vc[i])) ⇒ (e2 →ev e1).
• (e1.vc || e2.vc) ⇒ (e1 || e2).
• ((i ≠ j) ∧ (e1.vc < e2.vc)) ⇒ (e1 →app e2). (We have then e1 →ev e2 or e1 || e2.)
• ((i ≠ j) ∧ (e1.vc > e2.vc)) ⇒ (e2 →app e1). (We have then e2 →ev e1 or e1 || e2.)
Fig. 7.27 Implementation of a k-restricted vector clock system (code for process pi )
When e1 →app e2 while e1 || e2, we say that →app adds false causality.
If k = 1, we have then fk(i) = 1 for any i, and the k-restricted vector clock system boils down to linear time. If k = n and fk(i) = i, the k-restricted vector clock system implements the vector time of dimension n. The approximate relation →app then boils down to the (exact) causal precedence relation →ev. When 1 < k < n, →app adds false causality. Experimental results have shown that, for n = 100 and 1 < k ≤ 5, the percentage of false causality (with respect to all the pairs of causally related events) added by →app remains smaller than 10 %. This shows that approximations of the causality relation giving few false positives can be obtained with a k-restricted vector clock system working with a very small time dimension k. This makes k-restricted vector clock systems attractive when one has to simultaneously keep track of causality and cope with scaling problems.
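A k-restricted vector clock can be sketched as follows (this illustrates the idea, it is not the code of Fig. 7.27; the mapping fk and the 0-based indices are choices of the sketch). With k = n and fk(i) = i the sketch behaves as an ordinary vector clock, while with k = 1 it degenerates to a Lamport (linear) clock.

class KRestrictedClock:
    def __init__(self, i, k):
        self.i = i
        self.fk = lambda p: p % k      # simple deterministic surjective mapping onto k entries
        self.vc = [0] * k

    def local_or_send_event(self):
        self.vc[self.fk(self.i)] += 1
        return list(self.vc)           # value piggybacked on an outgoing message

    def receive_event(self, vc):
        self.vc = [max(a, b) for a, b in zip(self.vc, vc)]
        self.vc[self.fk(self.i)] += 1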
The matrix clock of a process pi is a two-dimensional array, denoted mci [1..n, 1..n],
such that:
• mci[i, i] counts the number of events produced by pi.
• mci[i, k] counts the number of events produced by pk, as known by pi.
• mci[j, k] = x means that, to pi's knowledge, pj knows that pk has produced x events.
It follows that mci[i, 1..n] is nothing else than the vector clock of pi.
This example is related to the previous property of matrix clocks. It concerns the
management of a message buffer.
A Buffer Management Problem A process can invoke two operations. The oper-
ation broadcast(m) allows it to send a message to all processes, while the operation
deliver() returns to it a message that has been broadcast.
For some reasons (fault-tolerance, archive recording, etc.) each process keeps in
a private space (e.g., local disk) called buffer, all the messages it has broadcast or
delivered. A condition imposed to a process which wants to destroy messages to
free buffer space is that a message has to be known by all other processes before
being destroyed. A structural view of this buffer management problem is described
in Fig. 7.30.
The progress of each process is represented here by the number of messages it has broadcast. The definition of the content of the matrix entry mci[j, k] has consequently to be interpreted as follows: mci[j, k] = x means that, to pi's knowledge, the first x messages broadcast by pk have been delivered by pj.
The corresponding algorithm (which assumes FIFO channels) is described in
Fig. 7.31. The FIFO assumption simplifies the design of the algorithm. It guarantees
that, if a process pi delivers a message m broadcast by a process pk , it has previously
delivered all the messages broadcast by pk before m.
When pi broadcasts a message m, it increases mci[i, i], which is the sequence number of m (line 1). Then, after it has associated the timestamp ⟨mci[i, 1..n], i⟩ with m, pi sends m and its timestamp to all the other processes, and deposits the triple (m, mci[i, i], i) in its local buffer (lines 2–3).
When it receives a message m with its timestamp ⟨vc, j⟩, a process pi first deposits the triple (m, vc[j], j) in its buffer and delivers m to the local application process (line 4). Then, pi increases mci[i, j] (it has delivered one more message from pj), and updates its local view of the vector clock of pj, namely, mci[j, 1..n],
to vc (line 5). The fact that a direct assignment replaces the usual vector clock update statement mci[j, 1..n] ← max(mci[j, 1..n], vc[1..n]) is due to the FIFO property of the channels.

operation broadcast(m) is
(1) mci[i, i] ← mci[i, i] + 1;
(2) for each j ∈ {1, . . . , n} \ {i} do send (m, ⟨mci[i, 1..n], i⟩) end for;
(3) deposit (m, mci[i, i], i) into the buffer and deliver m locally.

background task T is
(6) repeat forever
(7)   if (∃ (m, sn, k) ∈ buffer such that sn ≤ min(mci[1, k], . . . , mci[n, k]))
(8)   then (m, sn, k) can be discarded from the buffer
(9)   end if
(10) end repeat.
Finally, a local task T runs forever in the background. This task is devoted to the management of the buffer. If there is a message m in the buffer whose tag ⟨sn, k⟩ is such that sn ≤ min(mci[1, k], . . . , mci[n, k]), pi can conclude that all the processes
have delivered this message. As a particular case, pi knows that pj has delivered
m (which was broadcast by pk ) because (a) mci [j, k] ≥ sn and (b) the channels
are FIFO (mci [j, k] being a local variable that pi modifies only when it receives
a message from pj , it follows that pi has previously received from pj a message
carrying a vector vcj such that vcj [k] ≥ sn).
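Putting the pieces together, the following Python sketch illustrates this buffer management; it assumes FIFO channels and an abstract transport object offering network.send(dest, payload), and it is an illustration rather than the book's Fig. 7.31.

class BufferProcess:
    def __init__(self, i, n, network):
        self.i, self.n = i, n
        self.net = network                  # assumed abstraction: net.send(dest, payload)
        self.mc = [[0] * n for _ in range(n)]
        self.buffer = []                    # triples (message, sn, k)

    def broadcast(self, m):
        self.mc[self.i][self.i] += 1
        sn = self.mc[self.i][self.i]        # sequence number of m
        for j in range(self.n):
            if j != self.i:
                self.net.send(j, (m, list(self.mc[self.i]), self.i))
        self.buffer.append((m, sn, self.i)) # local delivery and local deposit

    def on_receive(self, m, vc, j):
        self.buffer.append((m, vc[j], j))   # deliver m and record it in the buffer
        self.mc[self.i][j] += 1             # one more message delivered from p_j
        self.mc[j] = list(vc)               # FIFO channels: a plain assignment is enough

    def garbage_collect(self):
        # a message can be discarded once every process is known to have delivered it
        self.buffer = [(m, sn, k) for (m, sn, k) in self.buffer
                       if sn > min(self.mc[x][k] for x in range(self.n))]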
Remark The reader can observe that this simple algorithm includes three notions of
logical time: local time with sequence numbers, vector time which allows a process
to know how many messages with a specific sender have been delivered by the other
processes, and matrix time which encapsulates the previous notions of time. Matrix
clocks are local, messages carry only vector clocks, and each buffer registers only
sequence numbers.
7.5 Summary
This chapter has addressed the concept of logical time in distributed computations.
It has introduced three types of logical time: linear (or scalar) time, vector time,
and matrix time. An appropriate notion of virtual clock is associated with each of
them. The meaning of these time notions has been investigated, and examples of
their use have been presented. Basically, linear time is fundamental when one has to
establish a total order on events that respects causal precedence, vector time captures
exactly causal precedence, while matrix time provides processes with a “second
order” knowledge on the progress of the whole set of processes.
• The notion of linear (scalar) time was introduced in 1978 by L. Lamport in [226].
This is a fundamental paper in which Lamport also introduced the happened be-
fore relation (causal precedence), which captures the essence of a distributed com-
putation.
• The timestamp-based total order broadcast algorithm presented in Sect. 7.1.4 is a
variant of algorithms described in [23, 226].
• The notions of vector time and vector clocks were introduced in 1988 (ten
years after linear clocks) simultaneously and independently by C.J. Fidge [124],
F. Mattern [250], and F. Schmuck [338]. The underlying theory is described
in [125, 250, 340].
Preliminary intuitions of vector clocks appear in several papers (e.g., [53, 238,
290, 307, 338, 360]). Surveys on vector clock systems appear in [40, 149, 325].
The power and limitations of vector clocks are investigated in [126, 312].
• Tracking of causality in specific contexts is the subject of numerous papers. The
case of synchronous systems is addressed in [6, 152], while the case of mobile
distributed systems is addressed in [300].
• The algorithm which detects the first global state satisfying a conjunction of stable
local predicates is due to M. Raynal [312].
• The proof of the lower bound showing that the size of vector clocks has to be at
least n (the number of processes) if one wants to capture causality and indepen-
dence of events is due to B. Charron-Bost [86].
• The notion of an efficient implementation of vector clocks was first intro-
duced in [350]. The efficient algorithms implementing vector clocks presented
in Sect. 7.3.2 are due to J.-M. Hélary, M. Raynal, G. Melideo, and R. Baldoni [181].
An algorithm to reset vector clocks is presented in [394].
• The notion of a dependency vector was introduced in [129]. Such a vector is
a weakened vector clock. This notion is generalized in [37] to the notion of k-
dependency vector clock (k = n provides us with vector clocks).
• The notion of immediate predecessors of relevant events was introduced in [108,
198]. The corresponding tracking algorithm presented in Sect. 7.2.6 is due to E.
Anceaume, J.-M. Hélary, and M. Raynal [17, 18].
• The notion of k-restricted vector clocks is due to F. Torres-Rojas and M.
Ahamad [370], who also introduced a more general notion of plausible clock
systems.
• Matrix time and matrix clocks were informally introduced in [127] and used
in [334, 390] to discard obsolete data (see also [8]).
• Another notion of virtual time, suited to distributed simulation, is presented and
studied in [199, 263].
Let us recall that a global state is consistent if, for any pair of distinct local states σi and σj it contains, we have σi || σj (none of them depends on the other).
In many applications, we are not interested in all the local states, but only in a
subset of them. Each process defines which of its local states are relevant. This is, for
example, the case for the detection of properties on global states (a local checkpoint
being then a local state satisfying some property), or for the definition of local states
for consistent recovery. Such local states are called local checkpoints, and a set of n
local checkpoints, one per process, is a global state called the global checkpoint.
A local checkpoint is denoted cix , where i is the index (identity) of the corre-
sponding process and x is its sequence number (among the local checkpoints of the
same process). In the following, C will denote the set of all the local checkpoints.
An example of a distributed execution with local checkpoints is represented in
Fig. 8.1, which is called a checkpoint and communication pattern. The local check-
points are depicted with grey rectangular boxes. As they are irrelevant from a check-
pointing point of view, the other local checkpoints are not represented. It is usually
assumed that the initial local state and the final local state of every process are local
checkpoints.
It is easy to see that the global checkpoint [ci1 , cj1 , ck1 ] is consistent, while the
global checkpoint [ci2 , cj2 , ck1 ] is not consistent.
Zigzag Dependency Relation and Zigzag Path These notions, which are due to
R.H.B. Netzer and J. Xu (1995), are an extension of the relation →σ defined on local states. A relation on local checkpoints, called z-dependency, is defined as follows. A checkpoint cix z-depends on a local checkpoint cjy (denoted cix →zz cjy) if:
• cix and cjy are in the same process (i = j) and cix appears before cjy (x < y), or
• there is a zigzag path from cix to cjy, i.e., a sequence of messages m1; . . . ; mq such that (a) m1 is sent by pi after cix, (b) mq is received by pj before cjy, and (c) for every 1 ≤ t < q, mt+1 is sent by the process that receives mt, in the same checkpoint interval as this reception or in a later one (the sending of mt+1 may occur before the reception of mt).
It follows from this definition that
(c1 →σ c2) ⇒ (c1 →zz c2),
while it is possible that (c1 →zz c2) ∧ ¬(c1 →σ c2). As an example, in Fig. 8.1, we have (ck0 →zz ci2) ∧ ¬(ck0 →σ ci2). In that sense, the z-dependency relation →zz is weaker (i.e., includes more pairs) than the causal precedence relation →σ on local states.
A fundamental question associated with local and global checkpoints is the fol-
lowing one: Given a checkpoint and communication pattern, and a set LC of local
checkpoints (with at most one local checkpoint per process), is it possible to extend
LC (with local checkpoints of the missing processes, if any) in order to obtain a
consistent global checkpoint?
What Is the Difficulty To illustrate this question let us again consider Fig. 8.1.
• Let us first consider LC1 = {cj1}. Is it possible to extend this set with a local checkpoint from pi, and another one from pk, such that, once pieced together, these three local checkpoints define a consistent global checkpoint? The figure shows that the answer is “yes”: the global checkpoint [ci1, cj1, ck0] answers the question (as does also the global checkpoint [ci1, cj1, ck1]).
• Let us now consider the question with LC2 = {ci2 , ck0 }. It is easy to see that neither
cjt with t ≤ 1, nor cjt with t ≥ 2, can be added to LC2 to obtain a consistent global
checkpoint. Hence, the answer is “no” for LC2.
• Let us finally consider the case LC3 = {ck2 }. The figure shows that neither ci2 nor
ci3 can be consistent with ck2 (this is due to the causal precedence relating these
local checkpoints to ck2 ). Hence, the answer is “no” for LC3.
If there is a causal path relating two local checkpoints, we know (from Chap. 6)
that they cannot belong to the same consistent global checkpoint. Hence, to better
appreciate the difficulty of the problem (and the following theorem), let us consider Fig. 8.2. Let LC = {ciα, ckγ+1}. It is easy to see that adding either cjβ or cjβ+1 to LC does not work. As depicted, each cut line (dotted line in the figure) defines an inconsistent global checkpoint. Consequently, there is no way to extend LC with a local checkpoint of pj in order to obtain a consistent global checkpoint. This observation shows that the absence of causal dependences among local checkpoints is a necessary condition for them to belong to a consistent global checkpoint, but it is not a sufficient one.
This example shows that, in addition to causal precedence, there are hidden dependences among local checkpoints that prevent them from belonging to the same consistent global checkpoint. These hidden dependences are the ones created by zigzag patterns. These patterns, together with causal precedence, are formally captured by the relation →zz. The following theorem, which characterizes the set of local checkpoints that can be extended to form a consistent global checkpoint, is due to R.H.B. Netzer and J. Xu (1995).

Theorem A set LC of local checkpoints (at most one per process) can be extended to a consistent global checkpoint if and only if no local checkpoint of LC z-depends on a local checkpoint of LC, i.e., ∀ c1, c2 ∈ LC (possibly with c1 = c2): ¬(c1 →zz c2).
local checkpoints of LC, none of them can be the starting local checkpoint of a
zigzag path to a local checkpoint in LC).
3. c1 ∈ LC, c2 ∉ LC, and c1 →σ c2.
In this case, c2 cannot be an initial local checkpoint. This is because no local checkpoint can causally precede (relation →σ) an initial local checkpoint. Hence, as c2 ∉ LC, c2 is a local checkpoint defined by Case 1, i.e., c2 is the first local checkpoint (of some process pj) that has no zigzag path to a local checkpoint in LC.
Let c2′ be the local checkpoint of pj immediately preceding c2. This local checkpoint c2′ must have a zigzag path to a local checkpoint c3 ∈ LC (otherwise, c2 would not be the first local checkpoint of pj that has no zigzag path to a local checkpoint in LC).
This zigzag path from c2′ to c3, plus the messages giving rise to c1 →σ c2, establish a zigzag path from c1 to c3 (Fig. 8.3). But this contradicts the fact that we have (assumption) ¬(c1 →zz c3) for any pair c1, c3 ∈ LC. This contradiction concludes the proof of the case.
4. c1, c2 ∉ LC and c1 →σ c2.
It follows from the argument of the previous item that there is a zigzag path from c1 to a local checkpoint in LC (see the figure where c1 ∈ LC is replaced by c1 ∉ LC). But this contradicts the definition of c1 (which is the first local checkpoint—of some process pj—with no zigzag path to any local checkpoint of LC).
Proof of the “only if” part. We have to show that, if there are two (not necessarily distinct) local checkpoints c1, c2 ∈ LC such that c1 →zz c2, there is no consistent global checkpoint Σ including c1 and c2. Hence, let us assume that c1 →zz c2. If c1
and c2 are from the same process, the proof follows directly from the definition of
global checkpoint consistency. Hence, let us assume that c1 and c2 are from different
processes. There is consequently a zigzag path m1 ; . . . ; mq starting after c1 and
finishing before c2 . The proof is by induction on the number of messages in this
path.
• Base case: q = 1. If c1 →zz c2 and the zigzag path contains a single message, we necessarily have c1 →σ c2 (a zigzag path made up of a single message is necessarily a causal path), and the proof follows.
• Induction case: q > 1. Let us assume that, if a zigzag path made up of at most
q messages joins two local checkpoints, these local checkpoints cannot belong
to the same consistent global checkpoint. We have to prove that if two lo-
cal checkpoints c1 and c2 are connected by a zigzag path of q + 1 messages
m1 ; . . . ; mq ; mq+1 , they cannot belong to the same consistent global check-
point.
The proof is by contradiction. Let us assume that there is a consistent global
checkpoint Σ including c1 and c2 such that these two local checkpoints are con-
nected by a zigzag path of q + 1 messages (if c1 = c2 , the zigzag path is a zigzag
cycle).
Let c3 be the local checkpoint of pj that immediately follows the reception of the message mq, pj being the corresponding receiving process. This means that c1 is connected to c3 by the zigzag path m1; . . . ; mq (Fig. 8.4). It follows from the induction assumption that c1 and c3 cannot belong to the same consistent global checkpoint. More generally, c1 and any local checkpoint that appears at c3 or after c3 on the same process pj cannot belong to the same consistent global checkpoint.
It follows from this observation that, for both c1 and c2 to belong to the same consistent global checkpoint Σ, this global checkpoint must include a local checkpoint of pj that precedes c3. But, due to the definition of “zigzag path”, the message mq+1 has necessarily been sent after the local checkpoint of pj that immediately precedes c3; this local checkpoint is denoted c3′ in Fig. 8.4. It follows that any local checkpoint of pj that precedes c3 causally precedes c2. Thus, c2 cannot be combined with any local checkpoint of pj preceding c3 to form a consistent global checkpoint. The fact that no local checkpoint of pj can be
combined with both c1 and c2 to form a consistent global checkpoint concludes
the proof of the theorem.
As an example, let us consider again Fig. 8.1. The local checkpoint ck2 is useless
because it belongs to the zigzag path m7 ; m5 ; m6 which includes the zigzag pattern
m7 ; m5 .
Let us recall that C denotes the set of all the local checkpoints defined during a distributed computation Ŝ = (S, →σ). The pair (C, →zz) constitutes a checkpointing abstraction of this distributed computation. Hence the fundamental question of the asynchronous checkpointing problem: Is (C, →zz) a consistent checkpointing abstraction of the distributed computation Ŝ?
Two different consistency conditions can be envisaged to answer this question.
8.2.1 Z-Cycle-Freedom
Definition An abstraction (C, →zz) is z-cycle-free if none of its local checkpoints belongs to a z-cycle: ∀ c ∈ C : ¬(c →zz c).
This means that z-cycle-freedom guarantees that no local checkpoint is useless,
or equivalently each local checkpoint belongs to at least one consistent global check-
point.
Domino Effect The domino effect is a phenomenon that may occur when looking
for a consistent global checkpoint. As a simple example, let us consider processes
that define local checkpoints in order to be able to recover after a local failure. In
order that the computation be correct, the restart of a process from one of its pre-
vious checkpoints may entail the restart of other processes (possibly all) from one
of their previous checkpoints. This is depicted in Fig. 8.5. After its local checkpoint
c21, process p2 experiences a failure and has to restart from one of its previous local checkpoints, say c2, which must be such that there is a local checkpoint c1 of process p1 for which the global checkpoint [c1, c2] is consistent. Such a global checkpoint cannot be [c13, c21] because it is not consistent. Neither can it be [c12, c21], for the same reason. The reader can check that, in this example, the only consistent global checkpoint is the initial
one, namely [c10 , c20 ]. This backtracking to find a consistent global checkpoint is
called the domino effect.
It is easy to see that, if the local checkpoints satisfy the z-cycle-freedom property, no domino effect can occur. In the example, there would be a local checkpoint c1 on p1 such that Σ = [c1, c21] would be consistent, and the computation could be restarted from this consistent global checkpoint. Moreover, this consistent global checkpoint Σ has the property of being “as fresh as possible”, in the sense that any other consistent global checkpoint Σ′ from which the computation could be restarted is such that Σ′ →Σ Σ (where →Σ is the reachability relation on global states defined
in Chap. 6).
As causal dependences can be easily tracked by vector clocks, the RDT consistency condition can benefit many checkpoint-based applications and simplify their design, such as the detection of global properties, or the definition of global checkpoints for failure recovery. A noteworthy property of RDT is the following
one: it allows us to associate with each local checkpoint c (on the fly and without
additional cooperation among processes) the first consistent global checkpoint in-
cluding c. (The notion of “first” considered here is with respect to the sublattice
of consistent global states obtained by eliminating the global states which are not
global checkpoints, see Chap. 6.) This property is particularly interesting when one
has to track software errors or recover after the detection of a deadlock.
Let us associate with each local checkpoint c an integer denoted c.date (a logical date from an operational point of view). Let (C, →zz) be a checkpointing abstraction.

Theorem 10 (∀ c1, c2 ∈ C : (c1 →zz c2) ⇒ (c1.date < c2.date)) ⇔ ((C, →zz) is z-cycle-free).
Proof Direction ⇒. Let us assume by contradiction that there is a z-cycle c →zz c. It follows from the assumption that c.date < c.date, which is clearly impossible.
Direction ⇐. Let us consider that (C, →zz) is acyclic. There is consequently a topological sort of its vertices. It follows that each vertex c (local checkpoint) can be labeled with an integer c.date such that (c1 →zz c2) ⇒ (c1.date < c2.date).
This theorem shows that all the algorithms ensuring the z-cycle-freedom prop-
erty implement (in an explicit or implicit way) a consistent logical dating of local
checkpoints (the time notion being linear time). It follows that, when considering
the algorithms that implement explicitly such a consistent dating system, the local
checkpoints that have the same date belong to the same consistent global check-
point.
Fig. 8.6 Proof by contradiction of Theorem 11
Theorem 11 Let (C, →zz) be a z-cycle-free checkpointing and communication pattern abstraction, in which the first local checkpoint of each process is dated 0, and the date of each other local checkpoint c is such that c.date = max{c′.date | c′ →zz c} + 1. Let us associate with each local checkpoint c the global checkpoint Σ(c) = [c1x1, . . . , cnxn], where cixi is the last local checkpoint of pi such that cixi.date ≤ c.date. Then, Σ(c) includes c and is consistent.
Proof By construction of Σ(c), c belongs to Σ(c). The proof that Σ(c) is consistent
is by contradiction; α being the date of c (i.e., c.date = α), let us assume that Σ(c) is
not consistent. Let next(x) be the local checkpoint (if any) that appears immediately
after x on the same process.
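Off line, the dating rule stated in Theorem 11 can be illustrated by the following sketch (an example built for this text, not part of the book), in which the z-dependency predecessors of each local checkpoint are supposed to be given.

from functools import lru_cache

def date_checkpoints(checkpoints, zz_pred):
    # date(c) = 0 for a checkpoint with no z-predecessor,
    # otherwise 1 + the maximum date of the checkpoints it z-depends on
    @lru_cache(maxsize=None)
    def date(c):
        preds = zz_pred(c)
        return 0 if not preds else 1 + max(date(p) for p in preds)
    return {c: date(c) for c in checkpoints}

# toy z-cycle-free abstraction: c3 z-depends on c1 and c2, c2 z-depends on c1
deps = {"c1": (), "c2": ("c1",), "c3": ("c1", "c2")}
print(date_checkpoints(deps, lambda c: deps[c]))   # {'c1': 0, 'c2': 1, 'c3': 2}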
Both the algorithms that follow are based on Theorems 10 and 11: They both ensure
z-cycle-freedom and associate dates with local checkpoints as described in Theo-
rem 11. Hence, given any local checkpoint c, the processes are able to determine a
consistent global checkpoint to which c belongs.
To that end, each process pi has a scalar local variable clocki that it manages and
uses to associate a date with its local checkpoints. Moreover, each process defines its
initial local state as its first local checkpoint (e.g., with date 1), and all local clocks
are initialized to that value.
Fig. 8.8 To take or not to take a forced local checkpoint
∀ c1, c2 : (c1 →zz c2) ⇒ (c1.date < c2.date) (used in Theorem 10 to obtain z-cycle-freedom). On the contrary, if clocki < sd, pi has to prevent the possible formation of a zz-pattern with nonincreasing dates. A simple way to solve this issue (and be in agreement with the assumption of Theorem 10) consists in directing pi to update clocki and take a forced local checkpoint.
A Simple Improvement Let us observe that, whatever the values of clocki and sd,
no zz-pattern can be formed if pi has not sent messages since its last local check-
point. Let us introduce a Boolean variable senti to capture this “no-send” pattern.
The previous algorithm is consequently modified as follows:
• senti is set to false at line 1 and set to true at line 4.
• Moreover, line 6 becomes
clocki ← sd; if (senti ) then take_local_checkpoint() end if.
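A Python sketch of the resulting scheme (including the improvement) follows; it is an interpretation of the description above, not the book's Figs. 8.7–8.8, and the exact point where the scalar clock is increased for basic checkpoints is an assumption of the sketch.

class CkptProcess:
    def __init__(self):
        self.clock = 1                  # all clocks start at the date of the initial checkpoint
        self.sent = False
        self.checkpoints = [1]          # dates of the local checkpoints taken so far

    def _take_local_checkpoint(self):   # records a checkpoint dated with the current clock
        self.checkpoints.append(self.clock)
        self.sent = False

    def basic_checkpoint(self):         # checkpoint decided by the application
        self.clock += 1                 # assumption: a basic checkpoint advances the date
        self._take_local_checkpoint()

    def on_send(self, m):
        self.sent = True
        return (m, self.clock)          # the sender's current date sd is piggybacked

    def on_receive(self, m, sd):
        if self.clock < sd:             # a zz-pattern with non-increasing dates could form
            self.clock = sd
            if self.sent:               # improvement: no send since last checkpoint => safe
                self._take_local_checkpoint()
        # deliver m to the application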
receives has a smaller date. This means that c(j, 2) can be combined with c(k, 1) to
form a consistent global checkpoint (let us notice that, as shown by the figure, it can
also be combined with c(k, 2), and that is the combination imposed by Theorem 11).
The forced local checkpoint c(i, 4) is taken to prevent the formation of a z-cycle
including c(k, 4), while c(k, 5) is taken to prevent the possible formation of a z-
cycle (that does not exist). Additionally, c(k, 5) is needed to associate a consistent
global checkpoint with c(i, 5), as defined by Theorem 11. If c(k, 5) is not taken
but the clock of pk is updated when m8 is received, c(i, 5) would be associated
with c(j, 5) and c(k, 4) which (due to message m7 ) defines an inconsistent global
checkpoint. If c(k, 5) is not taken and the clock of pk is not updated when m8 is
received, c(k, 6) would be dated 5 and consequently denoted c′(k, 5); c(i, 5) would then be associated with c(j, 5) and c′(k, 5), which (due to message m8) defines an
inconsistent global checkpoint.
Fig. 8.10 A vector clock system for rollback-dependency trackability (code for pi )
The rollback-dependency trackability (RDT) property is stronger than z-cycle-freedom: c1 and c2 being any pair of local checkpoints, RDT states that (c1 →zz c2) ⇒ (c1 →σ c2). This means that no z-dependence relation among local checkpoints remains hidden from a causal precedence point of view. As we have seen, given any local checkpoint c, a noteworthy property of RDT is the possibility to associate with c a global checkpoint to which it belongs, and this can be done on the fly and without additional communication among processes.
Vector Clock for RDT As causality tracking is involved in RDT, at the opera-
tional level, a vector clock system suited to checkpointing can be defined as follows.
Each process pi has a vector clock, denoted tdvi [1..n] (transitive dependence vec-
tor), managed as described in Fig. 8.10. The entry tdvi [i] is initialized to 1, while all
other entries are initialized to 0. Process pi increases tdvi [i] after it has taken a new
local checkpoint. Hence, tdvi [i] is the sequence number of the current interval of
pi (see Fig. 8.1), which means that tdvi [i] is the sequence number of its next local
checkpoint.
A vector date is associated with each local checkpoint. Its value is the current
value of the vector clock tdvi [1..n] of the process pi that takes the corresponding
local checkpoint. Let us consider Fig. 8.11 in which Ijx+1 is the interval separating
cjx and cjx+1 , which means that tdvj [j ] = x + 1 just after cjx . It follows that the
messages forming the causal path from Ijx+1 to pi carry a value tdv[j ] > x, and we
have consequently ciy.tdv[j] > x.
To ensure the RDT property, a very simple algorithm consists in preventing any
zz-pattern from forming. To that end, the algorithm forces the communication and
checkpoint pattern to be such that, at any process, there is no message sending fol-
lowed by a message reception without a local checkpoint separating them. The only
pattern allowed is the one described in Fig. 8.12. This pattern (called Russell’s pat-
tern) states that the only message pattern that can appear at a process between two
consecutive local checkpoints is a (possibly empty) sequence of message receptions,
followed by a (possibly empty) sequence of message sendings.
The corresponding algorithm (due to Russell, 1980), is described in Fig. 8.13.
The interest of this algorithm lies in its simplicity and in the fact that it needs only
one Boolean per process.
It is possible to use the vector clock algorithm of Fig. 8.10 to associate a vector
date with each checkpoint c. In this way, we obtain on the fly the first consistent
global checkpoint to which c belongs.
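A sketch of Russell's rule in Python (it is not the code of Fig. 8.13; the method names are illustrative) makes its simplicity apparent: a single Boolean per process suffices.

class RussellProcess:
    def __init__(self):
        self.sent = False
        self.nb_checkpoints = 1            # the initial local state is a checkpoint

    def take_local_checkpoint(self):       # basic or forced
        self.nb_checkpoints += 1
        self.sent = False

    def on_send(self, m):
        self.sent = True
        return m

    def on_receive(self, m):
        if self.sent:                      # a sending already occurred in this interval
            self.take_local_checkpoint()   # forced checkpoint: restores Russell's pattern
        # deliver m to the application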
The FDAS Predicate and the FDAS Algorithm The predicate is on the local
Boolean variable senti used in Russell’s checkpointing algorithm (Fig. 8.13), the
vector clock tdvi [1..n] of the process pi , and the vector date tdv[1..n] carried by
the message received by pi . The corresponding checkpointing algorithm, with the
management of vector clocks, is described in Fig. 8.14. The predicate controlling
forced checkpoints appears at line 5. It is the following:
senti ∧ (∃ k : tdv[k] > tdvi[k]).
A process takes a forced local checkpoint if, when a message m arrives, it has sent
a message since its last local checkpoint and, because of m, its vector tdvi [1..n] is
about to change.
As we have already seen, no zz-pattern can be created by the reception of a
message m if the receiver pi has not sent messages since its last checkpoint. Conse-
quently, if senti is false, no local forced checkpoint is needed. As far as the second
part of the predicate is concerned, we have the following. If ∀k : tdv[k] ≤ tdvi [k],
from a local checkpoint point of view, pi knows everything that was known by the
sender of m (when it sent m). As this message does not provide pi with new infor-
mation on dependencies among local checkpoints, it cannot create local checkpoint
dependencies that would remain unknown to pi . Hence, the name FDAS comes
from the fact that, at any process, after the first message sending in any interval, the
transitive dependency vector remains unchanged until the next local checkpoint.
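Combining the transitive dependency vector of Fig. 8.10 with this predicate, a Python sketch of an FDAS process (an interpretation of the description, not the book's Fig. 8.14; indices are 0-based) is the following.

class FDASProcess:
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.tdv = [0] * n
        self.tdv[self.i] = 1               # sequence number of the current interval
        self.sent = False
        self.checkpoint_dates = []

    def take_local_checkpoint(self):       # basic or forced
        self.checkpoint_dates.append(list(self.tdv))   # vector date of the new checkpoint
        self.tdv[self.i] += 1              # number of the next interval / next checkpoint
        self.sent = False

    def on_send(self, m):
        self.sent = True
        return (m, list(self.tdv))

    def on_receive(self, m, tdv):
        # FDAS predicate: a send occurred in this interval and m brings new dependencies
        if self.sent and any(tdv[k] > self.tdv[k] for k in range(self.n)):
            self.take_local_checkpoint()   # forced checkpoint
        self.tdv = [max(a, b) for a, b in zip(self.tdv, tdv)]
        # deliver m to the application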
The fact that a process takes a forced local checkpoint at some point of its exe-
cution has trivially a direct impact on the communication and checkpoint pattern
defined by the application program: The communication pattern is not modified but
the checkpoint pattern is. This means that it might not be possible to determine on
the fly if taking now an additional forced local checkpoint would decrease the num-
ber of forced local checkpoints taken in the future. The best that can be done is to
find predicates (such as FDAS) that, according to the control information at their
disposal, strive to take as few forced local checkpoints as possible. This section
presents such a predicate (called BHMR) and the associated checkpointing algo-
rithm. This predicate, which is stronger than FDAS, and the associated algorithm
are due to R. Baldoni, J.-M. Hélary, A. Mostéfaoui, and M. Raynal (1997).
Additional Control Variables The idea that underlies the design of this predicate
and the associated checkpointing algorithm is based on additional control variables
that capture the interplay of the causal precedence relation and the last local check-
points taken by each process. To that end, in addition to the vector clock tdvi [1..n],
each process manages the following local variables, which are all Boolean arrays.
• sent_toi [1..n] is a Boolean array such that sent_toi [j ] is true if and only if pi sent
a message to pj since its last local checkpoint. (This array replaces the Boolean
variable senti used in the FDAS algorithm, Fig. 8.14.) Initially, for any j , we have
sent_toi [j ] = false.
• causali [1..n, 1..n] is a two-dimensional Boolean array such that causali [k, j ] is
true if and only if, to pi ’s knowledge, there is a causal path from the last local
checkpoint taken by pk (as known by pi ) to the next local checkpoint that will be
taken by pj (this is the local checkpoint of pj that follows its last local checkpoint
known by pi ). Initially, causali [1..n, 1..n] is equal to true on its diagonal, and
equal to false everywhere else.
As an example, let us consider Fig. 8.15, where μ, μ′, and μ′′ are causal paths. When pi sends the message m, we have causali[k, j] = true (pi learned the causal path μ from pk to pj thanks to the causal path μ′), causali[k, i] = true (this is due to both the causal path μ′′ and the causal path μ; μ′), and causali[j, i] = true (this is due to the causal path μ′).
• purei [1..n] is a Boolean array such that purei [j ] is true if and only if, to pi ’s
knowledge, no causal path starting at the last local checkpoint of pj (known
by pi ) and ending at pi , contains a local checkpoint. An example is given in
Fig. 8.16. A causal path from a process to itself without local checkpoints is
called pure. The entry purei [i] is initialized to true and keeps that value forever;
for any j ≠ i, purei[j] is initialized to false.
The BHMR Predicate to Take Forced Local Checkpoints Let MSG (m, tdv, pure,
causal) be a message received by pi . Hence, if pj is the sender of this message,
tdv[1..n], pure[1..n], and causal[1..n, 1..n] are the values of tdvj, purej, and causalj,
respectively, when it sent the message. The predicate is made up of two parts.
Fig. 8.16 Pure (left) vs. impure (right) causal paths from pj to pi
• The first part of the predicate, which concerns the causal paths from any process pj (j ≠ i) to pi, is
∃ (k, ℓ): sent_toi[k] ∧ (tdv[k] > tdvi[k]) ∧ ¬causal[k, ℓ].
As we can see, the sub-predicate ∃ k : sent_toi[k] ∧ (tdv[k] > tdvi[k]) is just the FDAS predicate expressed on a per-process basis. If it is true, pi sent a message m′ to pk since its last checkpoint and the sender pj knows more local checkpoints of pk than pi.
But, if causal[k, ℓ] is true, pj also knows that there is a causal path from the last local checkpoint it knows of pk to pℓ. Hence, there is no need for pi to take a forced local checkpoint, as the zz-pattern created by the message m just received and the message m′ previously sent by pi to pk is doubled by a causal path (see the zigzag path μ′; m; m′ in Fig. 8.15, which is doubled by the causal path μ). Consequently, pi conservatively takes a local checkpoint only if ¬causal[k, ℓ].
• The second part of the predicate concerns the causal paths from pi to itself, which
start after its last local checkpoint. It is
(tdv[i] = tdvi[i]) ∧ ¬pure[i].
In this case (see Fig. 8.17), if the causal path whose last message is m is not pure, a local checkpoint c has been taken along this causal path, which starts and ends in the same interval Iitdv[i] of pi. In order that this checkpoint c belong to a consistent global checkpoint, pi takes a forced local checkpoint (otherwise, a z-cycle would form).
when receiving MSG (m, tdvi [1..n], pure[1..n], causal[1..n, 1..n]) from pj do
(8) if [∃(k, ) : sent_toi [k] ∧ ((tdv[k] > tdvi [k]) ∧ ¬causal[k, ])]
(9) ∨ [(tdv[i] = tdvi [i]) ∧ ¬pure[i]]
(10) then take_local_checkpoint() % forced local checkpoint
(11) end if;
(12) for each k ∈ {1, . . . , n} do
(13) case (tdv[k]> tdvi [k]) then
(14) tdvi [k] ← tdv[k]; purei [k] ← pure[k];
(15) for each ∈ {1, . . . , n} do causali [k, ] ← causal[k, ] end for
(16) (tdv[k] = tdvi [k]) then
(17) purei [k] ← purei [k] ∧ pure[k];
(18) for each ℓ ∈ {1, . . . , n}
(19) do causali [k, ℓ] ← causali [k, ℓ] ∨ causal[k, ℓ]
(20) end for
(21) (tdv[k] < tdvi [k]) then skip
(22) end case
(23) end for;
(24) for each ℓ ∈ {1, . . . , n} do causali [ℓ, i] ← causali [ℓ, i] ∨ causal[ℓ, j ] end for;
(25) Deliver the message m to the application process.
When a process pi receives a message MSG (m, tdv, pure, causal) from pj , it first
checks whether it must take a forced local checkpoint, using the predicate that has been
presented previously (lines 8–9). Then, before delivering the message
(line 25), pi updates its control data structures so that their current values correspond
to their definition (lines 12–24).
For each process pk , pi compares tdv[k] (the value of tdvj [k] when pj sent the
message) with its own value tdvi [k].
• If tdv[k] > tdvi [k], pj knows more local checkpoints of pk than pi . In this case,
pi resets tdvi [k], purei [k], and causali [k, ℓ] for every ℓ, to the corresponding
more up-to-date values sent by pj (lines 13–15).
• If tdv[k] = tdvi [k], pi and pj know the same last local checkpoint of pk . As they
possibly know it through different causal paths, pi adds what is known by pj
(pure[k] and causal[k, ℓ]) to what it already knows (lines 16–18).
• If pi knows more on pk than pj , it is more up to date and consequently does
nothing (line 21).
Finally, as the message m extends causal paths ending at pj , process pi updates
accordingly each Boolean causali [ℓ, i], 1 ≤ ℓ ≤ n (line 24).
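The following minimal Python sketch restates these update rules; the function name and the list/matrix representation of tdv, pure, and causal are mine, not the book's.

def update_on_receive(i, tdv_i, pure_i, causal_i, tdv, pure, causal, j, n):
    for k in range(n):
        if tdv[k] > tdv_i[k]:
            # p_j knows a more recent checkpoint of p_k: adopt its view (lines 13-15)
            tdv_i[k] = tdv[k]
            pure_i[k] = pure[k]
            for l in range(n):
                causal_i[k][l] = causal[k][l]
        elif tdv[k] == tdv_i[k]:
            # same last checkpoint of p_k is known: merge both views (lines 16-20)
            pure_i[k] = pure_i[k] and pure[k]
            for l in range(n):
                causal_i[k][l] = causal_i[k][l] or causal[k][l]
        # tdv[k] < tdv_i[k]: p_i is already more up to date, nothing to do (line 21)
    # the message m extends every causal path ending at p_j into one ending at p_i (line 24)
    for l in range(n):
        causal_i[l][i] = causal_i[l][i] or causal[l][j]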
This algorithm reduces the number of forced local checkpoints at the price of
more control data and more information carried by each application message, which
has to carry n² + n bits and n integers (logical dates). Nevertheless, as we have seen
in Sect. 8.3.4 for z-cycle prevention, there is no optimal communication-induced
checkpointing algorithm that ensures the RDT property.
8.5 Message Logging for Uncoordinated Checkpointing
Basic Principle Let c_i^x denote the xth local checkpoint taken by a process pi .
When pi sends a message m, it saves it in a local volatile log. When later pi takes
its next local checkpoint c_i^{x+1}, it (a) saves on stable storage c_i^{x+1} and the content of
its volatile log, and (b) re-initializes its volatile log to empty. Hence, the messages
are saved on stable storage in batches, and not individually. This decreases the number
of input/output operations and consequently the overhead associated with message logging.
A simple example is depicted in Fig. 8.19. After it has taken its local checkpoint
c_i^x, the volatile log of pi is empty. Then, when it sends m1 , m2 , and m3 , it saves
them in its volatile log. Finally, when it takes c_i^{x+1}, pi writes c_i^{x+1} and the current
content of its volatile message log on stable storage before emptying its volatile log.
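To make the batching concrete, here is a small hedged Python sketch; the class and the write_to_stable_storage() parameter are illustrative names, not part of the algorithm presented above.

class SenderLog:
    def __init__(self, write_to_stable_storage):
        self.volatile_log = []                 # messages sent since the last local checkpoint
        self.write = write_to_stable_storage   # assumed stable-storage primitive

    def on_send(self, msg):
        self.volatile_log.append(msg)          # cheap in-memory save at sending time

    def on_local_checkpoint(self, checkpoint_state):
        # a single stable-storage operation saves the checkpoint and the whole batch
        self.write({"checkpoint": checkpoint_state, "messages": list(self.volatile_log)})
        self.volatile_log.clear()              # the volatile log is re-initialized to empty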
To Log or Not to Log: That Is the Question The question is then the following
one: Which of the messages saved in its volatile log does pi have to save in stable storage
when it takes its next local checkpoint c_i^{x+1}?
To answer this question, let us consider Fig. 8.20. When looking at the execution
on the left side, the local checkpoints c_j^y and c_i^{x+1} are concurrent and can conse-
quently belong to the same consistent global checkpoint. The corresponding state of
the channel from pi to pj then comprises the message m (this message is in transit
with respect to the ordered pair (c_i^{x+1}, c_j^y)).
When looking at the execution on the right side, there is an additional causal path
starting after c_j^y and ending before c_i^{x+1}. It follows that, in this case, c_j^y and c_i^{x+1}
are no longer concurrent (independent), and it is not necessary to save the message
m on stable storage, as it cannot appear in a consistent global checkpoint. Hence,
if pi knows this fact, it does not have to save m from its volatile storage to stable
storage. Process pi can learn this fact if there is a causal path starting from pj after it has
received m, and arriving at pi before it takes c_i^{x+1} (this is depicted in the execution
at the bottom of the figure).
Fig. 8.22 Retrieving the messages which are in transit with respect to the pair (ci , cj )
Let sn(i, j ) be the value of the sequence number sni [j ] which has been saved
by pi on its stable storage together with ci ; this is the number of messages sent
by pi to pj before taking its local checkpoint ci . It follows that the messages sent
by pi to pj (after ci ) have a sequence number greater than sn(i, j ). Similarly, let
rk(j, i) be the value of rec_knownj [j, i] which has been saved by pj on its stable
storage together with cj ; this is the number of messages received (from pi ) by
pj before it took its local checkpoint cj .
It follows that the messages from pi received by pj (before cj ) have a se-
quence number smaller than or equal to rk(j, i). As the channel is FIFO, the
messages whose sequence number sqnb is such that rk(j, i) < sqnb ≤ sn(i, j ),
define the sequence of messages which are in transit with respect to ordered pair
(ci , cj ). This is depicted in Fig. 8.22, which represents the sequence numbers
attached to messages.
Due to the predicate used at line 12 of the checkpointing algorithm, these mes-
sages have not been withdrawn from the volatile log of pi and are consequently
with ci in pi ’s stable storage.
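The following illustrative Python helper (the names sn_ij and rk_ji are assumptions) applies this rule to the messages logged with c_i:

def in_transit(logged_messages, sn_ij, rk_ji):
    # logged_messages: (sequence number, message) pairs sent by p_i to p_j and logged with c_i
    # FIFO channel: exactly the sequence numbers in (rk_ji, sn_ij] are in transit
    return [m for (sqnb, m) in logged_messages if rk_ji < sqnb <= sn_ij]

# Example: p_i sent five messages before c_i (sn(i,j) = 5) and p_j had received three of them
# before c_j (rk(j,i) = 3); messages 4 and 5 form the state of the channel.
assert in_transit([(k, "m" + str(k)) for k in range(1, 6)], sn_ij=5, rk_ji=3) == ["m4", "m5"]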
Adding Forced Checkpoints Let us note that, even if the basic checkpointing
algorithm is uncoordinated, a few forced checkpoints can be periodically taken to
reduce the impact of the domino effect.
Cost The fact that each message sent by a process pi has to carry the current value
of the integer matrix rec_knowni [1..n, 1..n] can reduce efficiency and penalize the
application program. Actually, only the entries of the matrix corresponding to the
(directed) channels of the application program have to be considered. When the
communication graph is a directed ring, the matrix shrinks to a vector. A similar gain
is obtained when the communication graph is a tree with bidirectional channels.
Moreover, as channels are FIFO, for any two consecutive messages m1 and m2
sent by pi to pj , for each entry rec_knowni [k, ], m2 has only to carry the difference
between its current value and its previous value (which was communicated to pj
by m1 ).
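As an illustration of this piggybacking optimization, here is a hypothetical Python sketch; the dictionary representation of the matrix entries is an assumption, not the book's data structure.

def diff_since_last(rec_known, last_sent):
    # rec_known and last_sent map (k, l) pairs to integers; only modified entries are returned
    delta = {kl: v for kl, v in rec_known.items() if last_sent.get(kl) != v}
    last_sent.update(delta)                    # remember what the destination now knows
    return delta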
The Case of Synchronous Systems Let us finally note that synchronous systems
can easily benefit from uncoordinated checkpointing without suffering the domino
effect. To that end, it is sufficient for the processes to take local checkpoints at the
end of each round of a predefined sequence of rounds.
8.6 Summary
• Let c1 and c2 be two local checkpoints (from distinct processes). Show that
they can belong to the same consistent global checkpoint if
– no z-cycle involving c1 or c2 exists, and
– no zigzag path exists connecting c1 and c2 .
Solution in [283].
2. Let c be a local checkpoint of a process pi , which is not its initial checkpoint. The
notation pred(c) is used to denote the local checkpoint that precedes immediately
c on pi . Let us define a new relation →zz′ as follows: c1 →zz′ c2 if
(c1 →σ c2) ∨ [∃c ∈ C : (c1 →σ c) ∧ (pred(c) →zz′ c2)].
Show first that →zz′ ≡ →zz. Then give a proof of Theorem 9 based on the rela-
tions →σ and →zz′ (instead of the relations →σ and →zz).
Solution in [149] (Chap. 29).
3. Prove formally that the predicate BHMR used to take a forced checkpoint in
Sect. 8.4.4 is stronger than FDAS.
Solution in [34].
4. The communication-induced checkpointing algorithms presented in Sects. 8.3
and 8.4 do not record the messages which are in transit with respect to the corre-
sponding pairs of local checkpoints. Enrich one of the checkpointing algorithms
presented in these sections so that in-transit messages are recorded.
Solution in [173].
5. Modify the uncoordinated checkpointing algorithm described in Sect. 8.5 so that
only causal paths made up of a single message are considered (i.e., when consid-
ering that the causal path μ of Fig. 8.20 has a single message, and this message
is from pj ).
When comparing this algorithm with the one described in Fig. 8.21, does this
constraint reduce the size of the control information carried by messages? Does
it increase or reduce the number of messages that are logged on stable storage?
Which algorithm do you prefer (motivate your choice)?
Chapter 9
Simulating Synchrony
on Top of Asynchronous Systems
Synchronous distributed algorithms are easier to design and analyze than their asyn-
chronous counterparts. Unfortunately, they do not work when executed in an asyn-
chronous system. Hence, the idea to simulate synchronous systems on top of an
asynchronous one. Such a simulation algorithm is called a synchronizer. First, this
chapter presents several synchronizers in the context of fully asynchronous sys-
tems. It is important to notice that, as the underlying system is asynchronous, the
synchronous algorithms simulated on top of it cannot consider physical time as a
programming object they could use (e.g., to measure physical duration). The only
notion of time they can manipulate is a logical time associated with the concept of a
round. Then, the chapter presents synchronizers suited to partially synchronous sys-
tems. Partial synchrony means here that message delays are bounded but the clocks
of the processes (processors) are not synchronized (some private local area networks
have such characteristics).
Fig. 9.1 A space-time diagram of a synchronous execution
• The initial values of the variables leveli , 1 ≤ i ≤ n, are such that levela = 0, and
leveli = +∞ for i ≠ a. At the end of the algorithm, leveli contains the level of pi
(i.e., its distance to pa ).
• The initial values of the variables parenti , 1 ≤ i ≤ n, are such that parenta = a,
and parenti = ⊥ for i ≠ a. At the end of the algorithm, parenti contains the index
of the channel connecting pi to its parent in the tree rooted at pa .
The corresponding synchronous algorithm is described in Fig. 9.2. It is particu-
larly simple. When the clock is set to 0, the algorithm starts, and pa sends the mes-
sage LEVEL (0) to all its neighbors which discover they are at distance 1 from pa by
the end of the first pulse. Then, when the next pulse is generated (CLOCK = 1), each
of them sends the message LEVEL (1) to all its neighbors, etc. Moreover, the first
message received by a process defines its parent in the breadth-first tree rooted at
pa . The global synchronization provided by the model ensures that the first LEVEL ()
message received by a process has followed a path whose number of channels is the
distance from the root to that process.
It is easy to see that the time complexity is equal to the eccentricity of pa (maxi-
mal distance from pa to any other process), and the message complexity is O(e),
where e is the number of communication channels. (Let us observe that (n − 1) messages
can be saved by preventing a process from sending a message to its parent.)
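The following Python sketch is a round-by-round, centralized simulation of this synchronous traversal; it is not a message-passing implementation, and all names are illustrative.

def synchronous_bfs(neighbors, a):
    n = len(neighbors)                        # neighbors[i] = list of p_i's neighbors
    level = [float("inf")] * n
    parent = [None] * n
    level[a], parent[a] = 0, a
    frontier, clock = [a], 0
    while frontier:                           # one iteration = one synchronous pulse
        next_frontier = []
        for i in frontier:
            for j in neighbors[i]:            # LEVEL(clock) is sent to every neighbor
                if level[j] == float("inf"):  # the first LEVEL() received defines the parent
                    level[j], parent[j] = clock + 1, i
                    next_frontier.append(j)
        frontier, clock = next_frontier, clock + 1
    return level, parent

# Example: synchronous_bfs([[1, 2], [0, 2], [0, 1]], a=0) returns ([0, 1, 1], [0, 0, 0]).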
Time and Message Costs An important question concerns the cost added by a
synchronizer to simulate a given synchronous algorithm. Let As be a synchronous
algorithm, and let Ts (As ) and Ms (As ) be its complexities in time and in number of
messages (when executed in a synchronous system).
As we have just seen, a synchronizer Σ generates a sequence of pulses on each
process in such a way that all the processes are simultaneously (with respect to a
logical time framework) at the same pulse.
• The simulation of a pulse by Σ requires M_Σ^pulse messages and T_Σ^pulse time units.
• Moreover, the initialization of Σ requires M_Σ^init messages and T_Σ^init time units.
These values allow for the computation of the time and message complexities,
denoted Tas (As ) and Mas (As ), of the asynchronous algorithm resulting from the
execution of As by Σ . More precisely, as As requires Ts (As ) pulses, and each pulse
costs M_Σ^pulse messages and T_Σ^pulse time units, we have
Mas (As ) = Ms (As ) + M_Σ^init + Ts (As ) × M_Σ^pulse, and
Tas (As ) = T_Σ^init + Ts (As ) × T_Σ^pulse.
A synchronizer Σ will be efficient if M_Σ^init, T_Σ^init, M_Σ^pulse, and T_Σ^pulse are “rea-
sonably” small. Moreover, these four numbers, which characterize every synchro-
nizer, allow synchronizers to be compared. Of course, given a synchronizer Σ , there
is a compromise between these four values and they cannot be improved simultane-
ously.
Design Cost As indicated on the right of Fig. 9.1, combining a synchronous al-
gorithm with a synchronizer gives an asynchronous algorithm. In some cases, such
an asynchronous algorithm can “compete” with ad hoc asynchronous algorithms
designed to solve the same problem.
Another interest of the synchronizer concept lies in its implementations. Those
are based on distributed synchronization techniques, which are general and can be
used to solve other distributed computing problems.
Solving the previous issue requires adding synchronization among neighboring pro-
cesses. To that end, let us introduce the notion of a safe process.
From Safe Processes to the Property P The notion of a safe process can be used
to ensure the property P as follows: the local module implementing the synchronizer
at a process pi can generate a new pulse (r + 1) at that process when (a) pi has
carried out all its processing of pulse r and (b) learned that each of its neighbors is
safe with respect to the pulse r.
The synchronizers, which are presented below, differ in the way they deliver to
each process pi the information “your neighbor pj is safe with respect to the current
pulse r”.
9.3.1 Synchronizer α
Complexities For the complexities M_α^pulse and T_α^pulse, we have the following,
where e is the number of channels of the communication graph. At each pulse, ev-
ery application message is acknowledged, and each process pi informs its ci neigh-
bors that it is safe. Hence M_α^pulse = O(e), i.e., M_α^pulse ≤ O(n²). As far as the time
complexity is concerned, let us observe that control messages are sent only between
neighbors, hence T_α^pulse = O(1).
For the complexities M_α^init and T_α^init, it follows from the fact that there is no
specific initialization part that we have M_α^init = T_α^init = 0.
Local Variables of the Synchronizer α The local variable clocki is built by the
synchronizer and its value can only be read by the local synchronous application
process.
The local variables channelsi and channeli [1..ci ] are defined by the communi-
cation graph of the synchronous algorithm. As in previous algorithms, channelsi is
the set {1, . . . , ci } (which consists of the indexes of the ci channels of pi ), and for
each x ∈ channelsi , channeli [x] denotes locally at pi the corresponding channel.
The other two local variables, which are hidden to the upper layer, are used only
to implement the required synchronization. These variables are the following:
• expected_acki contains the number of acknowledgments that pi has still to re-
ceive before becoming safe with respect to the current round.
• neighbors_safei is a multiset (also called a bag) that captures the current percep-
tion of pi on which of its neighbors are safe. Initially, neighbors_safei is empty.
A multiset is a set that can contain several times some of its elements. As we are
about to see, it is possible that a process pi receives several messages SAFE ()
(each corresponding to a distinct pulse r, r + 1, etc.) from a neighbor pj while
pi is still at pulse r. Hence the use of a multiset.
On the Wait Statement It is assumed that the code of all the synchronizers can-
not be interrupted except in a wait statement. This means that a process receives
and processes a message (MSG (), ACK (), or SAFE ()) only when it executes line 5
or line 7 of Fig. 9.4 when considering the synchronizer α. The same holds for the
other synchronizers.
Algorithm of the Synchronizer α The algorithm associated with the local mod-
ule implementing α at a process pi is described in Fig. 9.4. When a process starts its
next pulse (line 1), it first sends the messages MSG (m) (if any), which correspond to
repeat
(1) clocki ← clocki + 1; % next pulse is generated %
(2) Send the messages MSG () of the current pulse of the local synchronous algorithm;
(3) expected_acki ← number of MSG (m) sent during the current pulse;
(4) neighbors_safei ← neighbors_safei \ channelsi ;
(5) wait (expected_acki = 0); % pi is safe with respect to the current pulse %
(6) for each x ∈ channelsi do send SAFE () on channeli [x] end for;
(7) wait (channelsi ⊆ neighbors_safei ) % The neighbors of pi are safe %
% pi has received all the messages MSG () sent to it during pulse clocki %
until the last local pulse has been executed end repeat.
the application messages m that the local synchronous algorithm must send at pulse
clocki (line 2).
Then pi initializes appropriately expected_acki (line 3) and neighbors_safei
(line 4). As neighbors_safei is a multiset, its update consists in suppressing one
copy of each channel index. Process pi then waits until it has become safe (line 5),
and when this happens, it sends a message SAFE () to each of its neighbors to in-
form them (line 6). Finally, it waits until all its neighbors are safe before proceeding
to the next pulse (line 7).
When it receives a message ACK () or SAFE (), a process pi updates the corre-
sponding local variable expected_acki (line 13) or neighbors_safei (line 14).
Let us remark that, due to the control messages SAFE () exchanged by neighbor
processes, a message MSG (m) sent at pulse r by a process pj to a process pi will
arrive before pi starts pulse (r + 1). This is because pi must learn that pj is safe
with respect to pulse r before being allowed to proceed to pulse (r + 1). But a
message MSG (m) sent at pulse r′ to a process pi can arrive before pi starts pulse r′.
This is depicted in Fig. 9.5, where pj and pk are two neighbors of pi , and where
r′ = r + 1 and pi receives a pulse (r + 1) message while it is still at pulse r.
Moreover, let us benefit from this figure to consider the case where pj does not
send application messages during pulse (r + 1). In Fig. 9.5, the message MSG () sent
by pj is consequently replaced by the message SAFE (). In that case, neighbors_safei
contains twice the local index of the channel connecting pj to pi . This explains why
neighbors_safei has to be a multiset.
When it receives a message MSG (m) on a channel channeli [x] (which connects
it to its neighbor pj ), a process pi first sends back an ACK () message (line 8).
According to the previous observation (Fig. 9.5), the behavior of pi depends then
on the current value of neighbors_safei (line 9). There are two cases.
• The channel index x ∉ neighbors_safei (line 10). In that case, pi has not received
from pj the SAFE () message that closes pulse r. Hence, the message MSG (m)
is associated with pulse r and, consequently, pi delivers m to the upper layer
local synchronous algorithm.
• The channel index x ∈ neighbors_safei (line 11). This means that pi has already
received the message SAFE () from pj concerning the current pulse r. Hence, m
is a message sent at pulse r + 1. Consequently, pi has to store the message m
and deliver it during pulse r + 1 (after it has sent its messages associated with
pulse r + 1).
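The two cases above can be illustrated by the following hedged Python sketch (it is not the book's Fig. 9.4); collections.Counter plays the role of the multiset, and the callback names are assumptions.

from collections import Counter

class AlphaReceiver:
    def __init__(self):
        self.neighbors_safe = Counter()        # the multiset of channel indexes
        self.next_pulse_buffer = []            # messages to be delivered at the next pulse

    def start_pulse(self, channels):
        self.neighbors_safe.subtract(channels) # remove one copy of each channel index (line 4)
        self.neighbors_safe += Counter()       # drop the entries that fell to zero

    def on_safe(self, x):
        self.neighbors_safe[x] += 1            # channel x sent SAFE() for some pulse

    def on_msg(self, m, x, deliver, send_ack):
        send_ack(x)                            # an ACK() is sent back first (line 8)
        if self.neighbors_safe[x] == 0:
            deliver(m)                         # no SAFE() from x yet: m belongs to this pulse
        else:
            self.next_pulse_buffer.append(m)   # SAFE() already received: m is for pulse r+1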
9.3.2 Synchronizer β
If a rooted tree pre-exists, we have Mβinit = Tβinit = 0. If the tree has to be built,
Mβinit and Tβinit are the costs of building a spanning tree (see Chap. 1).
repeat
(1) if (channeli [parenti ] ≠ ⊥) then wait (PULSE() received on channeli [parenti ]) end if;
% pi and all its neighbors are safe with respect to the pulse clocki : %
% it has received all the messages MSG () sent to it during pulse clocki %
(2) clocki ← clocki + 1; % next pulse is generated %
(3) for each x ∈ childreni do send PULSE () on channeli [x] end for;
(4) Send the messages MSG (−, pb) of the local synchronous algorithm
(5) where pb = (clocki mod 2);
(6) expected_acki ← number of MSG (m) sent during the current pulse;
(7) wait ((expected_acki = 0) ∧ (children_safei = childreni ));
% pi and all its children are safe with respect to the current pulse %
(8) if (channeli [parenti ] ≠ ⊥) then send SAFE () on channeli [parenti ] end if;
(9) children_safei ← ∅
until the last local pulse has been executed end repeat.
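To visualize what one pulse of this synchronizer performs on the tree, here is a toy, centralized Python illustration (not a distributed implementation; all names are mine): SAFE() notifications converge from the leaves to the root, then a PULSE() broadcast goes back down to start the next pulse.

def beta_pulse(children, root, on_pulse):
    def converge(i):                       # returns once i and all its descendants are safe
        for c in children.get(i, []):
            converge(c)                    # SAFE() from every child is awaited here
        # i would now send SAFE() to its parent
    def broadcast(i):
        on_pulse(i)                        # PULSE() received: i starts the next pulse
        for c in children.get(i, []):
            broadcast(c)                   # PULSE() propagates down the tree
    converge(root)
    broadcast(root)

# Example: beta_pulse({0: [1, 2], 1: [3]}, root=0, on_pulse=print)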
9.4.1 Synchronizer γ
Two Particular (Extreme) Cases If each process constitutes a group, the span-
ning trees inside each group, the tree-related variables (parenti , childreni ), and
the tree-related messages (GROUP _ SAFE ()) disappear. Moreover, we have then
it_group_channelsi = channelsi . The parity can also be suppressed. It follows that
we can suppress lines 1, 3, 9–11, 15, 17–21, 28, and 29–31. As the reader can
check, we then obtain the synchronizer α (where the messages SAFE () are replaced
by GROUP _ SAFE ()).
repeat
(1) if (channeli [parenti ] ≠ ⊥) then wait (PULSE() received on channeli [parenti ]) end if;
% all processes are safe with respect to pulse clocki %
(2) clocki ← clocki + 1; % next pulse is generated %
(3) for each x ∈ childreni do send PULSE () on channeli [x] end for;
(4) Send the messages MSG (−, pb) of the local synchronous algorithm
(5) where pb = (clocki mod 2);
(6) expected_acki ← number of MSG (m) sent during the current pulse;
(7) neighbors_safei ← neighbors_safei \ (it_group_channelsi ∪ childreni );
(8) wait ((expected_acki = 0) ∧ (children_safei = childreni ));
% pi and all its children are safe with respect to the current pulse %
(9) if (channeli [parenti ] ≠ ⊥)
(10) then send SAFE () on channeli [parenti ]
(11) else % pi is the root of its group %
(12) for each x ∈ childreni ∪ it_group_channelsi
(13) do send GROUP _ SAFE () on channeli [x]
(14) end for
(15) end if;
(16) wait (it_group_channelsi ⊆ neighbors_safei );
% pi ’s group and its neighbor groups are safe with respect to pulse clocki %
(17) wait (∀x ∈ childreni : ALL _ GROUPS _ SAFE () received on channelsi [x] );
(18) if (channeli [parenti ] ≠ ⊥)
(19) then send ALL _ GROUPS _ SAFE () on channeli [parenti ]
(20) end if;
(21) children_safei ← ∅
until the last local pulse has been executed end repeat.
when MSG (m, pb) is received on channeli [x] do
(22) send ACK () on channeli [x];
(23) if (pb = (clocki mod 2))
(24) then m belongs to the current pulse; deliver it to the synchronous algorithm
(25) else m belongs to the next pulse; keep it to deliver it at the next pulse
(26) end if.
when ACK () is received on channeli [x] do
(27) expected_acki ← expected_acki − 1.
when SAFE () is received on channeli [x] do % we have then x ∈ childreni %
(28) children_safei ← children_safei ∪ {x}.
when GROUP _ SAFE () is received on channeli [parenti ] do
(29) for each x ∈ childreni ∪ it_group_channelsi
(30) do send GROUP _ SAFE () on channeli [x]
(31) end for.
when GROUP _ SAFE () is received on channeli [x] where x ∈ it_group_channelsi do
(32) neighbors_safei ← neighbors_safei ∪ {x}.
In the other extreme case, there is a single group to which all the processes
belong. In this case, both it_group_channelsi , which is now equal to ∅, and
neighbors_safei become useless and disappear. Similarly, the messages GROUP _
SAFE () and ALL _ GROUPS _ SAFE () become useless. It follows that we can suppress
lines 7, 11–14, 16–20, 29–31, and 32, and we then obtain the synchronizer β.
9.4.2 Synchronizer δ
Graph Spanner Let us first recall that a partial graph is obtained by suppressing
edges, while a subgraph is obtained by suppressing vertices and their incident edges.
Given a connected undirected graph G = (V , E) (V is the set of vertices and E
the set of edges), a partial subgraph G′ = (V , E′) is a t-spanner if, for any edge
(x, y) ∈ E, there is a path (in G′) from the vertex x to the vertex y whose length
(number of channels) is at most t.
The notion of a graph spanner generalizes the notion of a spanning tree. It is used
in message-passing distributed systems to define overlay structures with
appropriate distance properties.
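A small Python sketch (illustrative names, not from the book) that checks this definition by breadth-first search may help:

def is_t_spanner(edges_G, adj_Gprime, t):
    def dist(src, dst):                             # breadth-first distance in G'
        seen, frontier, d = {src}, [src], 0
        while frontier and d <= t:
            if dst in frontier:
                return d
            nxt = [w for v in frontier for w in adj_Gprime.get(v, ()) if w not in seen]
            seen.update(nxt)
            frontier, d = nxt, d + 1
        return float("inf")
    return all(dist(x, y) <= t for (x, y) in edges_G)

# Example: a cycle on four vertices is a 2-spanner of the complete graph K4.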
Theorem 12 For all k ∈ [0..t] and every process pi , when phi is set to k, the pro-
cesses at distance d ≤ k from pi in the communication graph are safe.
Proof The proof is by induction. Let us observe that the invariant is true for k = 0
(pi is safe when it sets phi to 0). Let us assume that the invariant is satisfied up to
k. When pi increases its counter to k + 1, it has received (k + 1) messages SAFE ()
repeat
% pi and all its neighbors are safe with respect to the pulse clocki %
(1) clocki ← clocki + 1; % next pulse is generated %
(2) Send the messages MSG (−, pb) of the local synchronous algorithm
(3) where pb = (clocki mod 2);
(4) expected_acki ← number of MSG (m) sent during the current pulse;
(5) wait (expected_acki = 0 );
% pi is safe with respect to the current pulse %
(6) phi ← 0;
(7) repeat t_neighbors_safei ← t_neighbors_safei \ spanner_channelsi ;
(8) for each x ∈ spanner_channelsi do send SAFE () on channeli [x] end for;
(9) wait (spanner_channelsi ⊆ t_neighbors_safei );
(10) phi ← phi + 1
(11) until (phi = t) end repeat
until the last local pulse has been executed end repeat.
from each of its neighbors in the t-spanner. Let pj be one of these neighbors. When
pj sent its (k + 1)th message SAFE () to pi , we had phj = k. It follows from the
induction assumption that the processes at distance d ≤ k from pj (in the commu-
nication graph) are safe, and this applies to every neighbor of pi in the t-spanner.
So, when pi increases its counter to k + 1, all the processes at distance d ≤ k + 1
from pi in the communication graph are safe.
is the diameter of the communication graph), and we have C_γ^pulse = O(D) and
T_γ^pulse = O(nD).
While synchronous and asynchronous algorithms are two extreme points of the syn-
chronous behavior spectrum, there are distributed systems that are neither fully syn-
chronous nor entirely asynchronous. This part of this chapter is devoted to such a
type of system, and the construction of synchronizers, which benefit from its spe-
cific properties, is used to illustrate their noteworthy properties. Both synchronizers
presented here are due to C.Y. Chou, I. Cidon, I. Gopal, and S. Zaks (1987).
Bounded Delay Networks These networks are systems in which (a) commu-
nication delays are bounded (by one time unit), (b) each process (processor) has
a physical clock, (c) the clocks progress at the same speed but are not synchronized,
and (d) processing times are negligible when compared to message delay and are
consequently assumed to be equal to 0.
Thus, the local clocks do not necessarily show the same time at the same moment
but advance by equal amounts in equal time intervals. If the clocks could be started
simultaneously, we would obtain a perfectly synchronous system. The fact that the
system is not perfectly synchronous requires the addition of a synchronizer when
one wants to execute synchronous algorithms on top of such systems.
Let us consider an abstract global time, and let τi be the time at which pi sets
timeri to 0. This global time, which is not accessible to the processes, can be seen as
the time of an omniscient external observer. Its unit is assumed to be the same as the
one of the local clocks. The previous initialization provides us with the following
relation:
∀(i, j ): |τi − τj | ≤ d(i, j ) (R1),
where d(i, j ) is the distance separating pi and pj (minimal number of chan-
nels between pi and pj ). Let us notice that if pi and pj are neighbors, we have
|τi − τj | ≤ 1.
After a process pi has received a message INIT (), timeri accurately measures the
passage of time, one unit of time being always the maximum transit time for a
message.
Let us consider a scenario where, after the initialization of the physical clocks,
there is no additional synchronization. This scenario is depicted in Fig. 9.14 where
there are two neighbor processes pi and pj , and a logical pulse takes one physical
time unit. At the beginning of its first pulse, pj sends a message to pi , but this
message arrives while pi is at its second pulse. Hence, this message arrives too late,
and violates the fundamental property of synchronous message-passing.
Hence, synchronizers are not given for free in bounded delay networks. Imple-
menting a synchronizer requires computing an appropriate duration of ρ physical
time units for a logical pulse (the previous scenario shows that we must have
ρ > 1). The rth pulse of pi will then start when timeri = rρ, and will terminate
when timeri reaches the value (r + 1)ρ.
Several synchronizers can be defined. They differ in the value of ρ and the in-
stants at which they send the messages of the synchronous algorithm they interpret.
The next sections present two of them, denoted λ and μ.
Fig. 9.15 Interval during which a process can receive pulse r messages
9.5.3 Synchronizer λ
Pulse Duration Let pj be a neighbor of pi , and let τj (r) be the global time instant
at which pj receives a pulse r message m from pi . This message must be
received and processed before pj enters pulse r + 1, i.e., we must have
τj + (r + 1)ρ > τj (r) (R2);
the left-hand side of this inequality is the abstract global time at which pj starts
pulse r + 1. As transfer delays are at most one time unit, we have τj (r) < (τi +
rρ) + 1. Combined with (R1), namely τi < τj + 1, we obtain τj (r) < τj + rρ + 2,
that is to say
τj (r) < τj + (r + 1)ρ + (2 − ρ) (R3).
It follows that we have
(ρ ≥ 2) ∧ (R3) ⇒ (R2).
This means that the property (R2) required for the correct implementation of a syn-
chronizer (namely, no message arrives too late) is satisfied as soon as ρ ≥ 2.
(R3) gives an upper bound on the global time instant at which a process can
receive pulse r messages sent by its neighbors. In the same way, it is possible to
find a lower bound. Let pi be the sender of a pulse r message received by process
pj at time τj (r). We have τj (r) ≥ τi + rρ. Combining this inequality with (R1),
namely τi ≥ τj − 1, we obtain τj (r) ≥ τj + rρ − 1.
Hence, the condition ρ ≥ 2 ensures that a message sent at pulse r will be received by
its destination process pi (a) before pi progresses to the pulse (r + 1), and (b) after
pi has started its pulse (r − 1). This is illustrated in Fig. 9.15.
when timeri = ρr do
clocki ← r; % next pulse is generated %
Send the messages MSG (−, pb) of the local synchronous algorithm where
pb = (clocki mod 2);
process the pulse r messages.
It follows that the pulse r of a process pi spans the global time interval
[τi + rρ, τi + (r + 1)ρ],
and during this time interval, pi can receive from its neighbors only messages sent
at pulse r or r + 1. It follows that messages have to carry the parity bit of the pulse
at which they are sent, so that the receiver is able to know whether the received message is
for the current round r or the next one (r + 1).
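The following rough Python sketch restates the λ rule of keeping or delivering a message according to its parity bit; the class, the RHO constant, and the callback names are assumptions, not the book's code (any ρ ≥ 2 works, as shown above).

RHO = 2

class LambdaProcess:
    def __init__(self, send):
        self.clock = -1                    # last pulse number generated
        self.pending = []                  # messages tagged for the next pulse
        self.send = send                   # send(message, parity_bit)

    def on_timer(self, timer_value, outgoing, deliver):
        if timer_value == (self.clock + 1) * RHO:
            self.clock += 1                          # next pulse is generated
            for m in outgoing:
                self.send(m, self.clock % 2)         # pulse messages carry the parity bit
            for m in self.pending:
                deliver(m)                           # messages kept from the previous pulse
            self.pending = []

    def on_msg(self, m, parity, deliver):
        if parity == self.clock % 2:
            deliver(m)                     # sent at the current pulse
        else:
            self.pending.append(m)         # sent at the next pulse: keep it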
9.5.4 Synchronizer μ
From an operational point of view, this means that the parity bits used in λ have
to be eliminated.
when timeri = ρr do
clocki ← r. % next pulse is generated %
and
τj (r) ≥ τi + rρ + η ≥ τj + rρ (R7)
or again
ρ ≥ η + 1 + (τi − τj ) and η ≥ τj − τi .
But we have from (R1): τj − τi < 1 and τi − τj < 1. Hence, conditions (R6) and
(R7) are satisfied when
ρ ≥η+2 and η ≥ 1.
It follows that any pair of values (η, ρ) satisfying the previous condition ensures
that any message is received at the same pulse at which it has been sent. The smallest
values for η and ρ are thus 1 and 3, respectively.
When the Clocks Drift The previous section assumed that, while local clocks
of the processes (processors) do not output the same value at the same reference
time, they progress at the same speed. This simplifies the design of synchronizers α
and β.
Unfortunately, physical clocks drift, but fortunately their drift with respect to
the abstract global time perceived by an omniscient external observer (also called
reference time) is usually bounded and captured by a parameter denoted ε (called
the clock drift).
Hence, we consider that the faster clock counts one unit for (1 − ε) of refer-
ence time, while the slowest clock counts one unit for (1 + ε) of reference time
(Fig. 9.18). Let Δ denote a time duration measured with respect to the reference
time, and Δi be this time duration as measured by the clock of pi . We have
pj has the faster clock (its timer counts one unit for (1 − ε) of reference time). In such a
context, the condition (R6) becomes
It follows from these conditions that the greatest pulse number rmax that can be
attained without problem is such that
2 rmax ρ ε = ρ(1 − ε) − 2 − η(1 + ε) = η(1 − ε) − 1,
which is obtained for
η = (ρ(1 − ε) − 1) / 2,
and we have then
rmax = (ρ(1 − ε)² − 3 + ε) / (4ρε).
Let us remark that, when there is no drift, we have ε = 0, and we obtain rmax =
+∞ and 2η = ρ − 1. As already seen in Sect. 9.5.4, ρ = 3 and η = 1 are the smallest
values satisfying this equation.
Considering physical clocks whose drift is 10⁻¹ seconds per day (i.e., ε =
1/864 000), Table 9.1 shows a few numerical results for three increasing values
of ρ.
9.6 Summary
This chapter has presented the concept of a synchronizer, which encapsulates a gen-
eral methodology to simulate (non-real-time) distributed synchronous algorithms on
top of asynchronous systems.
Table 9.1 Values of rmax for ε = 1/864 000: ρ = 4 gives rmax = 54 000; ρ = 8 gives rmax = 135 000; ρ = 12 gives rmax = 162 000.
This part of the book and the following one are on the enrichment of the distributed
message-passing system in order to offer high-level operations to processes. This
part, which is on resource allocation, is composed of two chapters. (The next part
will be on high-level communication operations.)
Chapter 10 introduces the mutual exclusion (mutex) problem, which is the most
basic problem encountered in resource allocation. An algorithm solving the mutex
problem has to ensure that a given hardware or software object (resource) is accessed
by at most one process at a time, and that any process that wants to access it will be
able to do so. Two families of mutex algorithms are presented. The first is the family
of algorithms based on individual permissions, while the second is the family of
algorithms based on arbiter permissions. A third family of mutex algorithms is the
family of token-based algorithms. Such a family was already presented in Chap. 5,
devoted to mobile objects navigating a network (a token is a dataless mobile object).
Chapter 11 considers first the problem posed by a single resource with several
instances, and then the problem posed by several resources, each with multiple in-
stances. It assumes that a process is allowed to acquire several instances of several
resources. The main issues are then to prevent deadlocks from occurring and to pro-
vide processes with efficient allocation algorithms (i.e., algorithms which reduce
process waiting chains).
Chapter 10
Permission-Based Mutual Exclusion Algorithms
This chapter is on one of the most important synchronization problems, namely mu-
tual exclusion. This problem (whose name is usually shortened to “mutex”) consists
of ensuring that at most one process at a time is allowed to access some resource
(which can be a physical or a virtual resource).
After having defined the problem, the chapter presents two approaches which
allow us to solve it. Both are based on permissions given by processes to other
processes. The algorithms of the first approach are based on individual permis-
sions, while the algorithms of the second approach are based on arbiter permissions
(arbiter-based algorithms are also called quorum-based algorithms).
10.1.1 Definition
Fig. 10.1 A mutex invocation pattern and the three states of a process
The Three States of a Process Given a process pi , let cs_statei be a local vari-
able denoting its current local state from the critical section point of view. We have
cs_statei ∈ {out, trying, in}, where
• cs_statei = out, means that pi is not interested in executing the statement cs.
• cs_statei = trying, means that pi is executing the operation acquire_mutex().
• cs_statei = in, means that pi is executing the statement cs.
An invocation pattern of acquire_mutex() and release_mutex(), together with the
corresponding values of cs_statei , is represented in Fig. 10.1.
Mutex Versus Election The election problem was studied in Chap. 4. The aim
of both an election algorithm and a mutex algorithm is to create some asymme-
try among the processes. But these problems are deeply different. In the election
problem, any process can be elected and, once elected, a process remains elected
forever. In the mutex problem, each process that wants to enter the critical section
must eventually be allowed to enter it. In this case, the asymmetry pattern evolves
dynamically according to the requests issued by the processes.
Remark on the Underlying Network In all the algorithms that are presented in
Sects. 10.2, 10.3, and 10.4, the communication network is fully connected (there
is a bidirectional channel connecting any pair of distinct processes). Moreover, the
channels are not required to be FIFO.
10.2 A Simple Algorithm Based on Individual Permissions
The algorithm presented in this section is due to G. Ricart and A.K. Agrawala
(1981).
Principle: Permissions and Timestamps The principle that underlies this algo-
rithm is very simple. We have Ri = {1, . . . , n} \ {i}, i.e., a process pi needs the
permission of each of the (n − 1) other processes in order to be allowed to enter the
critical section. As already indicated, the intuitive meaning of the permission sent
by pj to pi is the following: “as far as pj (only) is concerned, pi is allowed to enter
the critical section”.
Hence, when a process pi wants to enter the critical section, it sends a REQUEST()
message to each other process pj , and waits until it has received the (n − 1) corre-
sponding permissions. The core of the algorithm is the predicate used by a process
pj to send its permission to a process pi when it receives a request from this pro-
cess. There are two cases. The behavior of pj depends on whether or not it is
currently interested in the critical section.
It follows that this algorithm uses timestamps to ensure both the safety property
and the liveness property defining the mutex problem.
Structural View The structure of the local module implementing the mutual ex-
clusion service at a process pi is described in Fig. 10.2. (As the reader can check,
this structure is the same as the one described in Fig. 5.2.)
operation acquire_mutex() is
(1) cs_statei ← trying;
(2) rdi ← clocki + 1;
(3) waiting_fromi ← Ri ; % Ri = {1, . . . , n} \ {i}
(4) for each j ∈ Ri do send REQUEST (rdi , i) to pj end for;
(5) wait (waiting_fromi = ∅);
(6) cs_statei ← in.
operation release_mutex() is
(7) cs_statei ← out;
(8) for each j ∈ perm_delayedi do send PERMISSION (i) to pj end for;
(9) perm_delayedi ← ∅.
When it invokes acquire_mutex(), a process pi proceeds to the local state trying
(line 1), computes the date of its request (line 2), sends a timestamped request to each
other process (lines 3–4), and waits until it has received the corresponding permissions (line 5). When this
occurs, it enters the critical section (line 6).
When it receives a permission, pi updates accordingly its set waiting_fromi
(line 15).
When it invokes release_mutex(), pi proceeds to the local state out (line 7), and
sends its permission to all the processes of the set perm_delayedi . This is the set
of processes whose requests were competing with pi ’s request, but pi delayed the
corresponding permission-sending because its own request has priority over them
(lines 8–9).
When pi receives a message REQUEST (k, j ), it first updates its local clock
(line 10). It then computes whether it has priority (line 11), which occurs if cs_statei ≠ out
(it is then interested in the critical section) and the timestamp of its current request
is smaller than the timestamp of the request it has just received. If pi has priority, it
adds the identity j to the set perm_delayedi (line 12). If pi does not have priority,
it sends by return its permission to pj (line 13).
A Remark on the Management of clocki Let us observe that clocki is not in-
creased when pi invokes acquire_mutex(): The date associated with the current re-
quest of pi is the value of clocki plus 1 (line 2). Moreover, when pi receives a
request message, it updates clocki to max(clocki , k), where k is the date of the re-
quest just received by pi (line 10), and this update is the only update of clocki .
As a very particular case, let us consider a scenario in which only pi wants to
enter the critical section. It is easy to see that, not only is ⟨1, i⟩ the timestamp of its
first request, but ⟨1, i⟩ is the timestamp of all its requests (this is because, as line 10
is never executed, clocki remains forever equal to 0).
As we are about to see in the following proof, the algorithm is correct. It actually
considers that, when a process pi enters several times the critical section while the
other processes are not interested, it is not necessary to increase the clock of pi .
This is because, in this case and from a clock point of view, the successive invoca-
tions of acquire_mutex() issued by pi can appear as a single invocation. Hence, the
algorithm increases the local clocks as slowly as possible.
Message Cost It is easy to see that each use of the critical section by a process
requires 2(n − 1) messages: (n − 1) request messages and (n − 1) permission mes-
sages.
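As an illustration of the request-handling rule described above, here is a hedged Python sketch; the class and callback names are mine, and only the receive side of the algorithm is shown.

class RicartAgrawalaSite:
    def __init__(self, i):
        self.i, self.clock, self.rd = i, 0, None
        self.cs_state = "out"
        self.perm_delayed = set()

    def acquire(self):
        self.cs_state, self.rd = "trying", self.clock + 1    # lines 1-2

    def on_request(self, k, j, send_permission):
        self.clock = max(self.clock, k)                      # line 10: only update of clock_i
        prio = (self.cs_state != "out") and ((self.rd, self.i) < (k, j))
        if prio:
            self.perm_delayed.add(j)                         # permission delayed (line 12)
        else:
            send_permission(j)                               # permission sent by return (line 13)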
Lemma 3 The algorithm described in Fig. 10.3 satisfies the mutex safety property.
Proof The proof is by contradiction. Let us assume that two processes pi and pj
are simultaneously in the critical section, i.e., from an external omniscient observer
point of view, we have cs_statei = in and cs_statej = in. It follows from the code of
Fig. 10.4 Proof of the safety property of the algorithm of Fig. 10.3
acquire_mutex() that each of them has sent a request message to the other process
and has received its permission. A priori two scenarios are possible for pi and pj to
be simultaneously inside the critical section. Let ⟨h, i⟩ and ⟨k, j⟩ be the timestamps
of the request messages sent by pi and pj , respectively.
• Each process has sent its request message before receiving the request from the
other process (left side of Fig. 10.4).
As i ≠ j , we have either ⟨h, i⟩ < ⟨k, j⟩ or ⟨k, j⟩ < ⟨h, i⟩. Let us assume (with-
out loss of generality) that ⟨h, i⟩ < ⟨k, j⟩. In this case, pj is such that ¬prioj
when it received REQUEST (h, i) (line 11) and, consequently, it sent its permis-
sion to pi (line 13). Differently, when pi received REQUEST (k, j), we had prioi ,
and consequently pi did not send its permission to pj (line 12; it will send the
permission only when it executes line 8 of release_mutex()).
It follows that, when this scenario occurs, pj cannot enter the critical section
while pi is inside the critical section, which contradicts the initial assumption.
• One process (e.g., pj ) has sent its permission to the other process (pi ) before send-
ing its own request (right side of Fig. 10.4).
When pj receives REQUEST (h, i), it executes clockj ← max(clockj , h)
(line 10), hence we have clockj ≥ h. Then, when later pj invokes acquire_
mutex(), it executes rdj ← clockj + 1 (line 2). Hence, rdj > h, and the mes-
sage REQUEST (k, j ) sent by pj to pi is such that k = rdj > h. It follows
that, when pi receives this message, we have cs_statei ≠ out (assumption), and
⟨h, i⟩ < ⟨k, j⟩. Consequently, prioi is true (line 11), and pi does not send its
permission to pj (line 12). It follows that, as in the previous case, pj cannot enter
the critical section while pi is inside the critical section, which contradicts the
initial assumption.
It can be easily checked that the previous proof remains valid if messages are
lost.
Lemma 4 The algorithm described in Fig. 10.3 satisfies the mutex liveness prop-
erty.
Proof The proof of the liveness property is done in two parts. The first part
shows that the algorithm is deadlock-free. The second part shows the algorithm
Fig. 10.5 Proof of the liveness property of the algorithm of Fig. 10.3
is starvation-free. Let us first observe that as the clocks cannot decrease, the times-
tamps cannot decrease.
Proof of the deadlock-freedom property. Let us assume, by contradiction, that
processes have invoked acquire_mutex() and none of them enters the local state in.
Among all these processes, let pi be the process that sent the request message
with the smallest timestamp ⟨h, i⟩, and let pj be any other process. When pj re-
ceives REQUEST (h, i), it sends by return the message PERMISSION (j ) to pi if
cs_statej = out. If cs_statej ≠ out, let ⟨k, j⟩ be the timestamp of its request. Due to the def-
inition of ⟨h, i⟩, we have ⟨h, i⟩ < ⟨k, j⟩. Consequently, pj sends PERMISSION (j )
to pi . It follows that pi receives a permission message from each other process
(line 15). Consequently, pi stops waiting (line 5) and enters the critical section (line 6).
Hence, the algorithm is deadlock-free.
Proof of the starvation-freedom property. To show that the algorithm is starvation-
free, let us consider two processes pi and pj which are competing to enter the crit-
ical section. Moreover, let us assume that pi repeatedly invokes acquire_mutex()
and enters the critical section, while pj remains blocked at line 5 waiting for the
permission from pi . The proof consists in showing that this cannot last forever.
Let ⟨h, i⟩ and ⟨k, j⟩ be the timestamps of the requests of pi and pj , respec-
tively, with ⟨h, i⟩ < ⟨k, j⟩. The proof shows that there is a finite time after which the
timestamp of a future request of pi will be ⟨h′, i⟩ > ⟨k, j⟩. When this occurs, the
request of pj will have priority with respect to that of pi . The worst case scenario is
described in Fig. 10.5: clocki = h − 1 < clockj = k − 1, the request messages from
pi to pj and the permission messages from pj to pi are very fast, while the request
message from pj to pi is very slow. When it receives the message REQUEST (h, i),
pj sends by return its permission to pi . This message pattern, which is surrounded
by an ellipse in the figure, can occur repeatedly an unbounded number of times, but the impor-
tant point is that it cannot appear an infinite number of times. This is because, when
pi receives the message REQUEST (k, j ), it updates clocki to k and, consequently, its
next request message (if any) will carry a date greater than k and will have a smaller
priority than pj ’s current request. Hence, no process can prevent another process
from entering the critical section.
The following property of the algorithm follows from the proof of the previous
lemma. The invocations of acquire_mutex() direct the processes to enter the critical
section according to the total order on the timestamps generated by these invoca-
tions.
Theorem 13 The algorithm described in Fig. 10.3 solves the mutex problem.
An Extended Mutex Algorithm Let us associate with each operation op() two
control operations denoted begin_op() and end_op(). These control operations are
used to bracket each invocation of op() as follows:
begin_op(); op(); end_op().
Let op_type be the synchronization type associated with the operation op() (let
us observe that several operations can be associated with the same synchroniza-
tion type). The algorithm described in Fig. 10.6 is a trivial extension of the mutex
operation begin_op() is
(1) cs_statei ← trying;
(2) rdi ← clocki + 1;
(3) waiting_fromi ← Ri ; % Ri = {1, . . . , n} \ {i}
(4′) for each j ∈ Ri do send REQUEST (rdi , i, op_type) to pj end for;
(5) wait (waiting_fromi = ∅);
(6) cs_statei ← in.
operation end_op() is
(7) cs_statei ← out;
(8) for each j ∈ perm_delayedi do send PERMISSION (i) to pj end for;
(9) perm_delayedi ← ∅.
algorithm described in Fig. 10.3. It ensures that (a) the concurrency constraints ex-
pressed by the exclusion matrix are respected, and (b) the invocations which are not
executed concurrently are executed according to their timestamp order.
Only two lines need to be modified to take into account the synchronization type
of the operation (their number is postfixed by ′). At line 4′, the request message has
to carry the type of the corresponding operation. Then, at line 11′, the Boolean value
of exclude(op_type, op_t) (where op_type is the type of the current operation of pi
and op_t is the type of the operation that pj wants to execute) is used to compute
the value of prioi , which determines if pi has to send by return its permission to pj .
It is easy to see that, if there is a single operation op(), and this operation ex-
cludes itself (i.e., its type op_type is such that exclude(op_type, op_type) = true),
the algorithm boils down to that of Fig. 10.3.
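A minimal sketch of the modified priority predicate follows, assuming an exclusion matrix exclude indexed by synchronization types; the function and parameter names are illustrative, not the book's.

def prio(cs_state, rd_i, i, k, j, my_type, their_type, exclude):
    # exclude is the Boolean exclusion matrix indexed by synchronization types
    return (cs_state != "out"
            and exclude[my_type][their_type]
            and (rd_i, i) < (k, j))

# With a single self-excluding type, prio() reduces to the predicate of Fig. 10.3.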
interested in accessing the critical section, then there is a time τ′ ≥ τ after which
this process is no longer required to participate in the mutual exclusion algorithm. It
is easy to see that the mutex algorithm described in Fig. 10.3 is not adaptive: Each
time a process pi wants to enter the critical section, every other process pj has to
send it a permission, even if we always have cs_statej = out.
This section presents two adaptive mutex algorithms. The first one is obtained
from a simple modification of the algorithm of Fig. 10.3. The second one has the
noteworthy property of being both adaptive and bounded.
operation acquire_mutex() is
(1) cs_statei ← trying;
(2) rdi ← clocki + 1;
(3) for each j ∈ Ri do send REQUEST (rdi , i) to pj end for;
(4) wait (Ri = ∅);
(5) cs_statei ← in.
operation release_mutex() is
(6) cs_statei ← out;
(7) for each j ∈ perm_delayedi do send PERMISSION ({i, j }) to pj end for;
(8) Ri ← perm_delayedi ;
(9) perm_delayedi ← ∅.
Fig. 10.7 An adaptive mutex algorithm based on individual permissions (code for pi )
of perm_delayedi , which contains the set of processes to which pi has just sent the
permission it shares with each of them.
Finally, when pi receives a message REQUEST (k, j ), it has the priority if it is
currently inside the critical section or it is waiting to enter it (line 11). If its current
request has priority with respect to the request it has just received, it delays the send-
ing of its permission (line 12). If it does not have priority, it sends by return to pj the
message PERMISSION ({i, j }) (line 13) and adds j to Ri (line 14). Moreover, if pi
is competing for the critical section (cs_statei = trying), it sends a request message
to pj (line 15) so that pj eventually returns the shared message PERMISSION ({i, j }),
which pi needs to be allowed to enter the critical section.
allows the local clocks to be synchronized in the sense that, as already
noticed, we always have |clocki − clockj | ≤ n − 1. Due to the adaptivity feature of
the previous algorithm, this local clock synchronization is no longer ensured and the
difference between any two local clocks cannot be bounded.
The mutex algorithm that is presented in this section has two main properties: It
is adaptive and has only bounded variables. Moreover, process identities (whose
scope is global) can be replaced by channel identities whose scopes are local to
each process.
This algorithm is due to K.M. Chandy and J. Misra (1984). It is derived here from
the permission-based algorithms which were presented previously. To simplify the
presentation, we consider a version of the algorithm which uses process identities.
operation acquire_mutex() is
(1) cs_statei ← trying;
(2) for each j ∈ Ri do send REQUEST () to pj end for;
(3) wait (Ri = ∅);
(4) cs_statei ← in;
(5) for each j ∈ {1, . . . , n} \ {i} do perm_statei [j ] ← used end for.
operation release_mutex() is
(6) cs_statei ← out;
(7) for each j ∈ perm_delayedi do send PERMISSION ({i, j }) to pj end for;
(8) Ri ← perm_delayedi ;
(9) perm_delayedi ← ∅.
Fig. 10.10 A bounded adaptive algorithm based on individual permissions (code for pi )
state will be new, which means that the current request of pi has priority on
the request of pj .
This scenario, which is due to the fact that channels are not FIFO, is ex-
actly the scenario which has been depicted in Fig. 10.8 for the timestamp-based
adaptive mutex algorithm (in Fig. 10.8, as pi knows both the timestamp of its
last request and the timestamp of pj ’s request, a timestamp comparison is used
instead of the predicate j ∈ Ri ).
It follows that, when pi receives a request from pj , it has priority if
(cs_statei = in) ∨ [(cs_statei = trying) ∧ ((perm_statei [j ] = new) ∨ (j ∈ Ri ))].
The code is nearly the same as in Fig. 10.7. The only modifications are the sup-
pression of the management of the local clocks, the addition of the management of
the permission states, and the appropriate modification of the priority predicate.
• Just before entering the critical section, a process pi indicates that the permission
it shares with any other process pj is now used (line 5).
• When pi receives (from pj ) the message PERMISSION ({i, j }), pi sets to new the
state of this permission (line 17).
• The computation of the value of the Boolean prioi is done as explained previously
(the timestamp-based comparison ⟨rdi , i⟩ < ⟨k, j⟩ used in Fig. 10.7 is replaced
by the predicate (perm_statei [j ] = new) ∨ (j ∈ Ri ), which is on bounded local vari-
ables).
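This bounded predicate can be written directly; the following one-function Python sketch (parameter names are mine) restates it.

def prio_bounded(cs_state, perm_state_j_is_new, j_in_R_i):
    # replaces the unbounded timestamp comparison of Fig. 10.7
    return (cs_state == "in") or (
        cs_state == "trying" and (perm_state_j_is_new or j_in_R_i)
    )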
Adaptivity and Cost As for the algorithm of Fig. 10.7, it is easy to see that
the algorithm is adaptive and each use of the critical section costs 2|Ri | messages,
where the value of |Ri | is such that 0 ≤ |Ri | ≤ n − 1 and depends on the current
state of the system. Moreover, the number of distinct messages is bounded.
As far as the local memory of a process pi is concerned, we have the following:
As in the previous algorithms, cs_statei , Ri , and perm_delayedi are bounded, and
so is perm_statei [1..n], which is an array of one-bit values.
An Acyclic Directed Graph The following directed graph G (which evolves ac-
cording to the requests issued by the processes) is central to the proof of the liveness
property of the bounded adaptive algorithm. The vertices of G are the n processes.
There is a directed edge from pi to pj (meaning that pj has priority over pi ) if:
• the message PERMISSION ({i, j }) is located at process pi and perm_statei [j ] =
used, or
• the message PERMISSION ({i, j }) is in transit from pi to pj , or
• the message PERMISSION ({i, j }) is located at process pj and perm_statej [i] =
new.
It is easy to see that the initial values are such that there is a directed edge from pi
to pj if and only if i < j . Hence, this graph is initially acyclic.
Proof Let us observe that the only statement that can change the direction of an
edge is (a) when a process uses the corresponding permission and (b) the previous
state of this permission was new. This occurs at line 5. If perm_statei [j ] = new
before executing perm_statei [j ] ← used, the execution of this statement changes
the priority edge from pj to pi into an edge from pi to pj (i.e., pj then has priority
with respect to pi ).
It follows that, whatever the values of the local variables perm_statei [x] before
pi executes line 5, after it has executed this line, the edges adjacent to pi are only
outgoing edges. Consequently, no cycle involving pi can be created by this state-
ment. It follows that, if the graph G was acyclic before the execution of line 5, it
remains acyclic. The fact that the graph is initially acyclic concludes the proof of
the lemma.
Theorem 14 The algorithm described in Fig. 10.10 solves the mutex problem.
Proof Proof of the safety property. It follows from the initialization that, for any pair
of processes pi and pj , we initially have either (i ∈ Rj ) ∧ (j ∉ Ri ) or (j ∈ Ri ) ∧
(i ∉ Rj ). Then, when a process pi sends a permission, it adds the corresponding
destination process to Ri (lines 7–8, or lines 12–13). Moreover, when it receives a
permission from a process pj , pi suppresses this process from Ri . It follows that
there is always a single copy of each permission message.
Due to the waiting predicate of line 3, a process pi has all the permissions it
shares with each other process when it is allowed to enter the critical section. From
then on, and until it executes release_mutex(), we have cs_statei = in. Hence, dur-
ing this period, the Boolean variable prioi cannot be false, and consequently, pi
does not send permissions. As permissions are not duplicated, and there is a single
permission shared by any pair of processes, it follows that, while cs_statei = in, no
process pj has the message PERMISSION ({i, j }) that it needs to enter the critical
section, which proves the safety property of the bounded adaptive mutex algorithm.
Proof of the liveness property. Considering a process pi in the acyclic graph
G, let height(i) be the maximal length of a directed path from pi to a process without
outgoing edges. Let pi be a process such that cs_statei = trying, and k = height(i).
The proof consists in showing, by induction on k, that eventually pi is such that
cs_statei = in.
Base case: k = 0. In this case, pi has only incoming edges in G. Let us consider
any other process pj .
• It follows from the directed edge from pj to pi in G that, if the message PER -
MISSION ({i, j }) is at pi (or in transit from pj to pi ), its state is (or will be) new.
It then follows from the priority computed at line 10 that, even if it receives a
request from pj , pi keeps the permission until it invokes release_mutex().
• If the message PERMISSION ({i, j }) is at pj and pj is such that cs_statej ≠ in, it
follows from the directed edge from pj to pi in G that perm_statej [i] = used.
Hence, pj sends the message PERMISSION ({i, j }) to pi . As the state of this mes-
sage is set to new when it arrives, pi keeps it until it invokes release_mutex().
• If the message PERMISSION ({i, j }) is at pj and pj is such that cs_statej = in,
the previous item applies after pj invoked release_mutex().
It follows that pi eventually obtains and keeps the (n − 1) messages, which allows
it to enter the critical section (lines 3–4).
Induction case: k > 0. Let us assume that all the processes that are at a height
≤ k − 1 eventually enter the critical section. Let pi be a process whose height is k.
This process has incoming edges and outgoing edges. As far as the incoming edges
are concerned, the situation is the same as in the base case, and pi will eventually
obtain and keep the corresponding permissions.
Let us now consider an outgoing edge from pi to some process pj . The height
of pj is ≤ k − 1. Moreover, (a) the message PERMISSION ({i, j }) is at pi in the state
used, or (b) is in transit from pi to pj , or (c) is at pj in the state new. As the height
of pj is ≤ k − 1, it follows from the induction assumption, that pj eventually enters
the critical section. When pj does, the state of the permission becomes used, and
the direction of the edge connecting pi and pj is then from pj to pi . Hence, the
height of pi eventually becomes ≤ k − 1, and the theorem follows.
Mutex Safety from Intersecting Sets As a process gives its permission to only
one process at a time, the safety property of the mutual exclusion problem is ensured
if we have
∀i, j : Ri ∩ Rj ≠ ∅.
This is because, as there is at least one process pk such that k ∈ Ri ∩ Rj , this process
cannot give PERMISSION (k) to pj , if it has sent it to pi and pi has not yet returned
it. Such a process pk is an arbiter for the conflicts between pi and pj . According to
the definition of the sets Ri and Rj , the conflicts between the processes pi and pj
can be handled by one or more arbiters.
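The following minimal sketch (hypothetical code, not from the original text) checks this intersection property on a small request-set assignment; here, for n = 3, each pair of processes shares exactly one arbiter.

from itertools import combinations

def all_pairs_intersect(R):
    # R maps each process identity i to its request set R_i
    return all(R[i] & R[j] for i, j in combinations(R, 2))

R = {1: {1, 2}, 2: {2, 3}, 3: {3, 1}}        # R_i ∩ R_j ≠ ∅ for every pair (i, j)
assert all_pairs_intersect(R)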
• Each process pi belongs to the same number D of quorums, i.e., ∀i: |{j | i ∈
Rj }| = D. This is the “equal responsibility rule”: All the processes are engaged
in the same number of quorums.
• K and D have to be as small as possible. (Of course, a solution in which ∀i: Ri =
{1, . . . , n} works, but we would then have K = D = n, which is far from being
optimal.)
The first two constraints are related to symmetry, while the third one is on the opti-
mality of the quorum system.
Optimal Values of K and D Let us observe that the previous symmetry and
optimality constraints on K and D link these values. More precisely, as both nK and
nD represent the total number of possible arbitrations, the relation K = D follows.
To compute their smallest value, let us count the greatest possible number of
different sets Ri that can be built. Let us consider a set Ri = {q1 , . . . , qK } (all qj are
distinct and qj is not necessarily pj ). Due to the definition of D, each qj belongs
to Ri and (D − 1) other distinct sets. Hence, an upper bound on the total number
of distinct quorums that can be built is 1 + K(D − 1). (The value 1 comes from the
initial set Ri , the value K comes from the size of Ri , and the value D − 1 is the
number of quorums to which each qj belongs—in addition to Ri —see Fig. 10.12.)
As there is a quorum per process and there are n processes, we have
n = K(K − 1) + 1.
It follows that the lower bound on K and D, which satisfies both the symmetry and
optimality constraints, is K = D ≃ √n.
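For completeness, here is the short derivation (ours, not in the original text) behind this value: solving the previous relation for K as a quadratic equation gives

\[
n = K(K-1) + 1 \;\Longleftrightarrow\; K^2 - K + (1-n) = 0
\;\Longrightarrow\; K = \frac{1 + \sqrt{4n-3}}{2} \simeq \sqrt{n} .
\]

For instance, n = 7 gives K = D = 3, which corresponds to the finite projective plane of order k = 2 discussed below.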
Table 10.1 Defining quorums from a √n × √n grid:

     12   8   5   9
      6   2  13   1
      3  10   4   7
     14  11   8   5

Finite Projective Planes Finding n sets Ri satisfying K = D ≃ √n amounts to finding
a finite projective plane of n points. There exist such planes of order k when k is a
power of a prime number. Such a plane has n = k(k + 1) + 1 points and the same
number of lines. Each point belongs to (k + 1) distinct lines, and each line is made
up of (k + 1) points. Two distinct points share a single line, and two distinct lines
meet a single point. A projective plane with n = 7 points (i.e., k = 2) is depicted
in Fig. 10.13. (The points are marked with a black bullet. As an example, the lines
“1, 6, 5” and “3, 2, 5” meet only at the point denoted “5”.) A line defines a quorum.
Being optimal, any two quorums (lines) defined from finite projective planes have
a single process (point) in common. Unfortunately, finite projective planes do not
exist for all values of n.
Grid Quorums A simple way to obtain quorums of size O(√n) consists in
arbitrarily placing the processes in a square grid. If n is not a square, (⌈√n⌉)² − n
arbitrary processes can be used several times to complete the grid. An example with
n = 14 processes is given in Table 10.1. As (⌈√14⌉)² − 14 = 2, two processes are
used twice to fill the grid (namely, p5 and p8 appear twice in the grid).
A quorum Ri consists then of all the processes in a line plus one process per
column. As an example, the set {6, 2, 13, 1, 8, 4, 14} constitutes a quorum. As any
quorum includes a line of the grid, it follows from their construction rule that any
two quorums intersect. Moreover, due to the grid structure, and according to the
value of n, we have √n ≤ |Ri | ≤ 2√n − 1.
Crumbling Walls In a crumbling wall, the processes are arranged in several lines
of possibly different lengths (hence, all quorums will not have the same size). A quo-
rum is then defined as a full line, plus a process from every line below this full line.
A triangular quorum system is a crumbling wall in which the processes are ar-
ranged in such a way that the ℓth line has ℓ processes (except possibly the last line).
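The following sketch (hypothetical Python code) enumerates the quorums obtained from a small crumbling wall and checks that any two of them intersect, whatever the choices of one process per line below the full line.

from itertools import combinations, product

def wall_quorums(lines):
    # lines: the wall, a list of lines (lists of process identities), from top to bottom
    quorums = []
    for r, line in enumerate(lines):
        below = lines[r + 1:]
        for choice in product(*below):       # one process chosen in every line below
            quorums.append(set(line) | set(choice))
    return quorums

wall = [[1, 2, 3], [4, 5], [6, 7, 8]]        # three lines of different lengths
qs = wall_quorums(wall)
assert all(q1 & q2 for q1, q2 in combinations(qs, 2))   # any two quorums intersect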
Vote-Based Quorums Quorums can also be defined from weighted votes as-
signed to processes, which means that each process has a weighted permission.
A vote is nothing more than a permission. Let S be the sum of the weights of all
votes. A quorum is then a set of processes whose sum of weighted votes is greater
than S/2. This vote system is called majority voting.
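As a simple illustration (hypothetical weights), the sketch below checks whether a set of processes constitutes a quorum under majority voting; two quorums necessarily intersect, since two disjoint sets of processes cannot both carry more than half of the total weight S.

weights = {1: 3, 2: 1, 3: 1, 4: 1, 5: 1}     # weighted votes; S = 7
S = sum(weights.values())

def is_quorum(procs):
    return sum(weights[p] for p in procs) > S / 2

assert is_quorum({1, 2})                      # weight 4 > S/2
assert not is_quorum({2, 3, 4})               # weight 3 is not a majority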
On the Safety Side This section describes a first sketch of an algorithm where the
focus is only on the safety property. This safe algorithm is described in Fig. 10.14.
The meaning of the local variables is the same as in the previous permission-based
algorithms. An empty queue is denoted ∅.
operation acquire_mutex() is
(1) cs_statei ← trying;
(2) waiting_fromi ← Ri ;
(3) for each j ∈ Ri do send REQUEST () to pj end for;
(4) wait (waiting_fromi = ∅);
(5) cs_statei ← in.
operation release_mutex() is
(7) cs_statei ← out;
(8) for each j ∈ Ri do send PERMISSION (j ) to pj end for.
Fig. 10.14 A safe (but not live) mutex algorithm based on arbiter permissions (code for pi )
On the client side we have the following (lines 1–8). When a process invokes
acquire_mutex(), it sends a request to each process pj of its request set Ri to obtain
the permission PERMISSION (j ) managed by pj (line 3). Then, when it has received
all the permissions (line 6 and line 4), pi enters the critical section (line 5). When it
releases the critical section, pi returns each permission to the corresponding arbiter
process (line 8).
On the arbiter side, the behavior of pi is as follows (lines 9–16). The local
Boolean variable perm_herei is equal to true if and only if pi has the permission it
manages (namely, PERMISSION (i)).
• When it receives a request from a process pj (we have then i ∈ Rj ), pi sends it
the permission it manages (PERMISSION (i)) if it has this permission (lines 9–10).
Otherwise, it adds the request of pj to a local queue (denoted queuei ) in order to
serve it later.
• When pi is returned its permission from a process pj , it suppresses pj ’s request
from its queue queuei , and, if this queue is not empty, it sends PERMISSION (i) to
the first process of this queue. Otherwise, it keeps its permission until a process
pj (such that i ∈ Rj ) requests it.
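The arbiter role just described can be summarized by the following sketch (hypothetical Python code in which message passing is abstracted by direct calls; it is not the code of Fig. 10.14, only the bookkeeping it relies on).

from collections import deque

class Arbiter:
    def __init__(self, ident):
        self.ident = ident
        self.perm_here = True            # PERMISSION(ident) is currently at its arbiter
        self.queue = deque()             # pending requests, served in FIFO order

    def on_request(self, j):
        # a process p_j (such that ident belongs to R_j) asks for PERMISSION(ident)
        if self.perm_here:
            self.perm_here = False
            return j                     # the permission is sent to p_j
        self.queue.append(j)             # otherwise the request is kept for later
        return None

    def on_permission_returned(self):
        # the permission comes back; forward it to the first queued request, if any
        if self.queue:
            return self.queue.popleft()
        self.perm_here = True
        return None

a = Arbiter(3)
assert a.on_request(1) == 1              # p_1 obtains PERMISSION(3) immediately
assert a.on_request(2) is None           # p_2 must wait
assert a.on_permission_returned() == 2   # when p_1 returns the permission, p_2 gets it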
Solving the Liveness Issue, Part 1: Using Timestamps Two additional mech-
anisms are introduced to solve the liveness issue. The first is a simple scalar clock
system that permits us to associate a timestamp with each request issued by a pro-
cess. The local clock of a process pi , denoted clocki , is initialized to 0. As we saw in
Chap. 7 and Sect. 10.2.1, this allows requests to be totally ordered, and consequently
provides us with a simple way to establish priority among conflicting requests.
according to timestamp order, with the smallest timestamp at its head), and sends
the message YIELD _ PERM () to pj1 in order to be able to serve the request with
the highest priority (here the request from pj2 , which has the smallest timestamp).
• Then pi receives a request message from pj3 . The timestamp of this message
⟨h3, j3⟩ is such that ⟨h2, j2⟩ < ⟨h3, j3⟩ < ⟨h1, j1⟩. As this timestamp is not
smaller than the timestamp at the head of the queue, pi only inserts it in queuei .
• When pi receives PERMISSION (i) from pj1 , pi forwards it to the process whose
request is at the head of its queue (namely pj2 ).
• Finally, when pj2 returns its permission to pi , pi forwards it to the process which
is at the head of its queue, namely pj3 .
The resulting algorithm, which is an extension of the safe algorithm of Fig. 10.14,
is described in Fig. 10.16. On the client side we have the following:
• When a process pi invokes acquire_mutex(), it increases its local clock (line 3),
and sends a message timestamped clocki , i (line 4).
• When pi receives a permission with a date d, pi updates its local clock (line 7)
and its waiting set waiting_fromi (line 8).
• When a process pi invokes release_mutex(), it returns its permission to each
process of the set Ri (line 10).
• When pi receives a message YIELD _ PERM () from one of its arbiters pj , it re-
turns its permission to pj only if its local state is such that cs_statei = trying
(lines 21–23). In this case, as it no longer has the permission of pj , it also adds j
to waiting_fromi .
If cs_statei = in when pi receives a message YIELD _ PERM () from one of
its arbiters pj , it does nothing. This is because, as it is in the critical section, it
will return pj ’s permission when it invokes release_mutex(). Let us finally
notice that it is not possible to have cs_statei = out when pi receives a message
YIELD _ PERM (). This is because, if pi receives such a message, it has necessarily
received the corresponding permission previously (let us recall that the channels
are FIFO).
On the arbiter side, the behavior of pi is defined by the statements it executes
when it receives a message REQUEST () or RETURNED _ PERM (), or when a process
sends it back the message PERMISSION (i).
• When pi receives a message REQUEST (d) from a process pj , it first updates its
clock (line 11) and adds the corresponding timestamp ⟨d, j ⟩ to queuei (line 12).
Then, if PERMISSION (i) is here, pi sends it by return to pj (lines 13–14). In
this case, the permission message carries the current value of clocki . This is to
allow the client pj to update its local clock (line 7). In this way, thanks to the
intersection property of the sets Ri and Rj , the local clocks of the processes are
forced to progress (and this global progress of local clocks allows each request to
eventually have the smallest timestamp).
If PERMISSION (i) is not here, pi has sent this permission message to some
process pk (whose identity has been saved in sent_toi ). If ⟨d, j ⟩ is the smallest
timestamp in queuei and pi has not already reclaimed its permission from pk ,
pi reclaims it (lines 15–18).
operation acquire_mutex() is
(1) cs_statei ← trying;
(2) waiting_fromi ← Ri ;
(3) clocki ← clocki + 1;
(4) for each j ∈ Ri do send REQUEST (clocki ) to pj end for;
(5) wait (waiting_fromi = ∅);
(6) cs_statei ← in.
operation release_mutex() is
(9) cs_statei ← out;
(10) for each j ∈ Ri do send PERMISSION (j ) to pj end for.
• When pi receives from pj the message PERMISSION (i) it manages, pi first sup-
presses pj ’s timestamp from queuei (line 27). Then, if queuei is not empty,
pi sends its permission (with its local clock value) to the first process in queuei
(lines 27–31).
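To summarize the arbiter side, the following sketch (hypothetical and simplified Python code, not the code of Fig. 10.16) shows the decision logic based on timestamps and on the YIELD_PERM() message; the scenario described above can be replayed with it.

import heapq

class TimestampedArbiter:
    def __init__(self, ident):
        self.ident = ident
        self.perm_here = True
        self.queue = []                  # heap of (timestamp, requester) pairs
        self.sent_to = None              # request currently holding PERMISSION(ident)
        self.yield_asked = False

    def on_request(self, ts, j):
        heapq.heappush(self.queue, (ts, j))
        if self.perm_here:
            self.perm_here = False
            self.sent_to = heapq.heappop(self.queue)
            return ('PERMISSION', self.sent_to[1])
        if (ts, j) < self.sent_to and not self.yield_asked:
            self.yield_asked = True
            return ('YIELD_PERM', self.sent_to[1])     # reclaim from the current holder
        return None

    def on_permission_back(self, yielded):
        # yielded = True means the holder gave the permission back without entering
        # the critical section, so its request must be queued again
        if yielded:
            heapq.heappush(self.queue, self.sent_to)
        self.yield_asked = False
        if self.queue:
            self.sent_to = heapq.heappop(self.queue)
            return ('PERMISSION', self.sent_to[1])
        self.perm_here = True
        self.sent_to = None
        return None

a = TimestampedArbiter('i')
assert a.on_request(5, 'j1') == ('PERMISSION', 'j1')   # first request served at once
assert a.on_request(2, 'j2') == ('YIELD_PERM', 'j1')   # smaller timestamp: reclaim
assert a.on_request(3, 'j3') is None                    # not the smallest: only queued
assert a.on_permission_back(True) == ('PERMISSION', 'j2')
assert a.on_permission_back(False) == ('PERMISSION', 'j3')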
10.5 Summary
The aim of this chapter was to present the mutual exclusion problem and the class
of permission-based algorithms that solve it. Two types of permission-based algo-
rithms have been presented, algorithms based on individual permissions and algo-
rithms based on arbiter permissions. They differ in the meaning of a permission. In
the first case a permission engages only its sender, while it engages a set of pro-
cesses in the second case. Arbiter-based algorithms are also known under the name
quorum-based algorithms. An aim of the chapter was to show that the concept of a
permission is a fundamental concept when one has to solve exclusion problems. The
notions of an adaptive algorithm and a bounded algorithm have also been introduced
and illustrated.
The algorithms presented in Chap. 5 (devoted to mobile objects navigating a
network) can also be used to solve the mutex problem. In this case, the mobile object
is usually called a token, and the corresponding algorithms are called token-based
mutex algorithms.
• The mutual exclusion problem was introduced in the context of shared memory
systems by E.W. Dijkstra, who presented its first solution [109].
• One of the very first solutions to the mutex problem in message-passing systems
is due to L. Lamport [226]. This algorithm is based on a general state-machine
replication technique. It requires 3n messages per use of the critical section.
• The mutex problem has received many solutions both in shared memory sys-
tems (e.g., see the books [317, 362]) and in message-passing systems [306].
Surveys and taxonomies of message-passing mutex algorithms are presented in
[310, 333, 349].
• The algorithm presented in Sect. 10.2.1 is due to G. Ricart and A.K. Agrawala
[327].
• The unbounded adaptive algorithm presented in Sect. 10.3.2 is due to O. Carvalho
and G. Roucairol [71]. This algorithm was obtained from a Galois field-based
systematic distribution of an assertion [70].
• The bounded adaptive algorithm presented in Sect. 10.3.3 is due to K.M. Chandy
and J. Misra [78]. This is one of the very first algorithms that used the notion of
edge reversal to maintain a dynamic cycle-free directed graph. A novelty of this
paper was the implementation of edge reversals with the notion of a new/used
permission.
• The algorithm based on arbiter permissions is due to M. Maekawa [243], who
was the first to define optimal quorums from finite projective planes.
• The notion of a quorum was implicitly introduced by R.H. Thomas and D.K.
Gifford in 1979, who were the first to introduce the notion of a vote to solve
resource allocation problems [159, 368].
The mathematics which underlie quorum and vote systems are studied in [9, 42,
147, 195]. The notion of an anti-quorum is from [41, 147]. Tree-based quorums
were introduced in [7]. Properties of crumbling walls are investigated in [294].
Availability of quorum systems is addressed in [277]. A general method to define
quorums is presented in [281]. A monograph on quorum systems was recently
published [380].
• Numerous algorithms for message-passing mutual exclusion have been proposed.
See [7, 68, 226, 239, 282, 348] to cite a few. A mutex algorithm combining in-
dividual permissions and arbiter permissions is presented in [347]. An algorithm
for arbitrary networks is presented in [176].
The mutual exclusion problem considers the case of a single resource with a sin-
gle instance. A simple generalization is when there are M instances of the same
resource, and a process may require several instances of it. More precisely, (a) each
instance can be used by a single process at a time (mutual exclusion), and (b) each
time it issues a request, a process specifies the number k of instances that it needs
(this number is specific to each request).
Solving the k-out-of-M problem consists in ensuring that, at any time:
• no resource instance is accessed by more than one process,
• each process is eventually granted the number of resource instances it has re-
quested, and
• when possible, resource instances have to be granted simultaneously. This last
requirement is related to efficiency (if M = 5 and two processes ask for k = 2
and k = 3 resource instances, respectively, and no other process is using or re-
questing resource instances, these two processes must be allowed to access them
simultaneously).
The 1-out-of-M problem is when every request of a process is always for a single
resource instance. This problem is also called mutual exclusion with multiple entries.
This section presents an algorithm that solves the 1-out-of-M problem. This al-
gorithm, which is due to K. Raymond (1989), is a straightforward adaptation of the
mutex algorithm based on individual permissions described in Sect. 10.2.
operation acquire_resource() is
(1) cs_statei ← trying;
(2) rdi ← clocki + 1;
(3) nb_permi ← 0;
(4) for each j ∈ Ri do send REQUEST (rdi , i) to pj ;
(5) wait_permi [j ] ← wait_permi [j ] + 1
(6) end for;
(7) wait (nb_permi ≥ n − M);
(8) cs_statei ← in.
operation release_resource() is
(9) cs_statei ← out;
(10) foreach j such that perm_delayedi [j ] ≠ 0 do
(11) send PERMISSION (i, perm_delayedi [j ]) to pj ; perm_delayedi [j ] ← 0
(12) end for.
Fig. 11.1 An algorithm for the multiple entries mutex problem (code for pi )
• wait_permi [1..n] is an array such that wait_permi [j ] contains the number of per-
missions that pi is waiting on from pj .
• perm_delayedi [1..n] is an array such that perm_delayedi [j ] contains the number
of permissions that pi has to send to pj when it will release the resource instance
it is currently using.
• nb_permi , which is meaningful only when pi is waiting for permission, con-
tains the number of permissions received so far by pi (i.e., nb_permi = |{j ≠
i such that wait_permi [j ] = 0}|).
Message Cost of the Algorithm There are always (n − 1) request messages per
use of a resource instance, while the number of permission messages depends on
the system state and varies between (n − M) and (n − 1). The total number of
messages per use of a resource instance is consequently at most 2(n − 1) and at
least 2n − (M + 1).
Basic Message Pattern (1) When a process pi sends a timestamped request mes-
sage to pj , it conservatively considers that pj uses the M resource instances. When
it receives the request sent by pi , pj sends by return a message NOT _ USED (M) if it
is not interested, or if its current request has a greater timestamp.
If its current request has priority over pi ’s request, pj sends it by return a mes-
sage NOT _ USED (M − kj ) (where kj is the number of instances it asked for). This
is depicted in Fig. 11.2. This message allows pi to know how many instances of
the resource are currently used or requested by pj . Then, when it later invokes
release_resource(kj ), pj will send to pi the message NOT _ USED (kj ) to indicate
that it has finished using its kj instances of the resource.
When considering the permission-based terminology, a message PERMISSION ()
sent by a process pj to a process pi in the mutex algorithm based on individual
permissions is replaced here by two messages, namely a message NOT _ USED (M −
kj ) and a message NOT _ USED (kj ). Said in another way, a message NOT _ USED (x)
represents a fraction of a whole permission, namely, x/M of it.
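The following small sketch (hypothetical code, with the variables of Fig. 11.4 simplified) shows this bookkeeping from the point of view of a requesting process: each NOT_USED(x) message lowers its conservative estimate of the number of instances held by its sender.

M, n = 5, 3                               # hypothetical values; code for p_1
used_by = {1: 0, 2: 0, 3: 0}

def request_instances(k_i):
    for j in (2, 3):
        used_by[j] += M                   # conservative assumption: p_j may use M instances
    used_by[1] = k_i                      # p_1 itself asks for k_i instances

def on_not_used(j, x):
    used_by[j] -= x                       # p_j is not using (or has released) x instances

request_instances(2)                      # p_1 asks for k = 2 instances
assert sum(used_by.values()) > M          # 2 + 5 + 5 = 12: p_1 must wait
on_not_used(2, M)                         # p_2 is not interested: NOT_USED(M)
on_not_used(3, M - 3)                     # p_3 has priority and requested 3: NOT_USED(M - 3)
assert sum(used_by.values()) <= M         # 2 + 0 + 3 = 5: p_1 can use its 2 instances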
operation acquire_resource(ki ) is
(1) cs_statei ← trying;
(2) rdi ← clocki + 1;
(3) for each j ∈ Ri do send REQUEST (rdi , i) to pj ;
(4) used_byi [j ] ← used_byi [j ] + M
(5) end for;
(6) used_byi [i] ← ki ;
(7) wait (Σ1≤j≤n used_byi [j ] ≤ M);
(8) cs_statei ← in.
operation release_resource(ki ) is
(9) cs_statei ← out;
(10) foreach j ∈ perm_delayedi do send NOT _ USED (ki ) to pj end for;
(11) perm_delayedi ← ∅.
Fig. 11.4 An algorithm for the k-out-of-M mutex problem (code for pi )
The particular case where a process asks for all instances of the resource is addressed
at line 16.
Let us remember that, except for the wait statement of line 7, the code of each
operation and the code associated with message receptions are locally executed in
mutual exclusion.
The requests are served in their timestamp order. As an example, let us consider
that M = 8 and let us assume that the first five invocations (according to their times-
tamp) are for 3, 1, 2, 5, 1 instances of the resource, respectively. Then, the first three
invocations are served concurrently. The fourth invocation will be served as soon as
any three instances of the resource have been released.
Let us finally observe that it is easy to modify the algorithm to allow a process
pi to release separately its ki instances of the resource.
Message Cost of the Algorithm There are always (n − 1) request messages per
use of a resource instance. The number of NOT _ USED () messages depends on the
system state and varies between (n − 1) and 2(n − 1). Hence, the total number of
messages per use of a set of instances of the resource is at least 2(n − 1) and at most
3(n − 1).
This section proves that the previous algorithm solves the k-out-of-M problem. It
assumes that (a) each invocation acquire_resource(k) is such that 1 ≤ k ≤ M, and
(b) the periods during which a process uses the instances of the resource that it has
obtained are finite.
Lemma 6 Let x^τ be the number of instances of the resource accessed by the pro-
cesses at time τ . The algorithm described in Fig. 11.4 guarantees that ∀ τ : x^τ ≤ M.
Claim C1 ∀ i, j : used_byi [j ] ≥ 0.
Proof of Claim C1 As all the local variables used_byi [j ] are initialized to 0, the
claim is initially true.
A process pi increases all its local counters used_byi [j ] at line 4 when it in-
vokes acquire_resource(ki ) (and this is the only place where pi increases them).
Moreover, it increases each of them by M. After these increases, the claim remains
trivially true.
The reception of a message REQUEST () sent by pi to another process pj gives
rise to either one message NOT _ USED (M) or at most one message NOT _ USED (M −
kj ) followed by at most one message NOT _ USED (kj ). It follows from line 19 that,
when these messages are received by pi , used_byi [j ] is decreased and can be set to
0. Hence, used_byi [j ] never becomes negative and the claim follows. (End of the
proof of the claim.)
Claim C2 ∀ i : ∀ τ : (cs_statei^τ = in) ⇒ (Σ1≤j≤n used_byi^τ [j ] ≤ M).
Proof of Claim C2 It follows from line 7 that the claim is true when pi sets cs_statei
to the value in. As pi does not invoke acquire_resource() while cs_statei = in, no
local variable used_byi [j ] is increased when cs_statei = in, and the claim follows.
(End of the proof of the claim.)
Let U^τ be the number of instances of the resource which are used at time τ .
If IN^τ = ∅, no instance is used and consequently U^τ = 0 ≤ M. So, let us assume
IN^τ ≠ ∅. Let pm be defined as in Claim C3. We have the following:
• U^τ = Σj∈IN^τ kj (definition of U^τ ),
• Σj∈IN^τ kj ≤ km + Σj∈IN^τ\{m} used_bym^τ [j ] (from Claim C3),
• km + Σj∈IN^τ\{m} used_bym^τ [j ] ≤ Σ1≤j≤n used_bym^τ [j ] (from Claim C1),
• Σ1≤j≤n used_bym^τ [j ] ≤ M (from Claim C2, as cs_statem^τ = in).
It follows that U^τ ≤ M, which concludes the proof of the lemma.
Lemma 7 The algorithm described in Fig. 11.4 ensures that any invocation of
acquire_resource() eventually terminates.
Claim C4 Considering WAIT^τ ≠ ∅, let pm be the process such that m ∈ WAIT^τ , and
its request has the smallest timestamp among all the requests issued by the processes
in WAIT^τ .
Then, the quantity Σ1≤j≤n used_bym [j ] decreases when τ increases, and even-
tually either cs_statem = in or Σ1≤j≤n used_bym [j ] = km + Σz∈PRIO kz > M,
where PRIO = {z | (⟨dz , z⟩ < ⟨dm , m⟩) ∧ (cs_statez = in)}.
Proof of Claim C4 Let j ≠ m. There are two cases.
1. Case j ∈ WAIT^τ . Due to the definition of m we have ⟨dj , j ⟩ > ⟨dm , m⟩. It follows
that, when pj receives the message REQUEST (dm , m), it sends NOT _ USED (M)
by return to pm (lines 13–15). It follows that j ∉ PRIO and we eventually have
used_bym [j ] = 0.
2. Case j ∉ WAIT^τ . If cs_statej ≠ in or ⟨dj , j ⟩ > ⟨dm , m⟩, we are in the same case
as previously, j ∉ PRIO and eventually we have used_bym [j ] = 0. If cs_statej =
in and ⟨dj , j ⟩ < ⟨dm , m⟩, we have j ∈ PRIO and pj sends NOT _ USED (M − kj )
by return to pm (line 16). Hence, we eventually have used_bym [j ] = kj .
While pm is such that cs_statem = trying, it does not send new requests, and conse-
quently no used_bym [x] can increase. It follows that, after a finite time, we have
cs_statem = in or Σ1≤j≤n used_bym [j ] = km + Σz∈PRIO kz > M, which con-
cludes the proof of the claim. (End of the proof of the claim.)
We now prove that any process that belongs to WAIT^τ , for any τ , is eventually such
that its local variable cs_state equals in. Let us consider the process pm of WAIT^τ
that has the smallest timestamp. It follows from Claim C4 that, after some finite time, we have
• Either cs_statem = in. In this case, pm returns from its invocation of acquire_
resource().
• Or (cs_statem = trying) ∧ (km + Σz∈PRIO kz > M). In this case, as the resource
instances are eventually released (assumption), the set PRIO decreases and the
quantity Σz∈PRIO kz decreases accordingly, allowing the predicate cs_statem = in
to eventually become true.
Hence, pm eventually returns from its invocation of acquire_resource().
The proof that any process px returns from acquire_resource() follows from
the fact that the new invocations will eventually have timestamps greater than the
one of px . The proof is the same as for the mutex algorithm based on individual
permissions (see the proof of its liveness property in Lemma 4, Sect. 10.2.3).
Theorem 15 The algorithm described in Fig. 11.4 solves the k-out-of-M problem.
As noted when it was presented, the previous algorithm, which solves the k-out-of-
M mutex problem, is an adaptation of the mutex algorithm described in Sect. 10.2.
More generally, it is possible to extend other mutex algorithms to solve the k-out-of-
M mutex problem. Problem 1 at the end of this chapter considers such an extension
of the adaptive mutex algorithm described in Sect. 10.3.
The pattern formed by (a) the acquisition of resources by a process, (b) their use,
and (c) finally their release, is called a (resource) session. During its execution, a
process usually executes several sessions.
The Notion of a Conflict Graph A graph, called conflict graph, is associated with
each resource. Let CG(x) be the conflict graph associated with the resource type x.
This graph is an undirected fully connected graph whose vertices are the processes
allowed to access the resource x. An edge means a possible conflict between the
two processes it connects: their accesses to the resource must be executed in mutual
exclusion.
As an example, let us consider six processes p1 , . . . , p6 , and three resource types
x1 , x2 , and x3 . The three corresponding conflict graphs are depicted in Fig. 11.5.
The edges of CG(x1 ) are depicted with plain segments, the edges of CG(x2 ) with
dotted segments, and edges of CG(x3 ) with dashed segments. As we can see, the
resource x1 can be accessed by the four processes: p2 , p3 , p5 , and p6 ; p1 accesses
only x3 , while both p2 and p4 are interested in the resources x1 and x3 .
The union of these three conflict graphs defines a graph in which each edge is
labeled by the resource from which it originates (the label is here the fact that the
edge is plain/dotted/dashed). This graph, called global conflict graph, gives a global
view on the conflicts on the whole set of resources.
The global graph associated with the previous three resources is depicted in
Fig. 11.6. As we can see, this global graph allows p1 to access x3 while p6 is access-
ing the pair resources (x1 , x2 ). Differently, p3 and p6 conflict on both the resources
x1 and x2 .
Fig. 11.6 Global conflict graph
When considering the global conflict graph of Fig. 11.6, this means that each
session of p6 is always on both x1 and x2 , while each resource session of p4 is
always on x2 and x3 , etc.
• In the second case, each session of a process pi is dynamic in the sense that it is
on a dynamically defined subset of the resources that pi is allowed to access.
In this case, some sessions of p6 can be only on x1 , others on x2 , and others on
both x1 and x2 . According to the request pattern, dynamic allocation may allow
for more concurrency than static allocation. As a simple example, if p6 wants to
access x1 while p3 wants to access x2 , and no other process wants to access these
resources, then p6 and p3 can access them concurrently.
The first case is sometimes called in the literature the dining philosophers prob-
lem, while the second case is sometimes called the drinking philosophers problem.
Access Pattern We consider here the case where a process pi requires the re-
sources it needs for its session one after the other, hence the name incremental re-
quests. Moreover, a session can be static or dynamic.
Possibility of Deadlock The main issue that has to be solved is the prevention
of deadlocks. Let us consider the processes p3 and p6 in Fig. 11.6. When both
processes want to acquire both the resources x1 and x2 , the following can happen
with incremental requests.
• p3 invokes acquire_resource(x1 ) and obtains the resource x1 .
• p6 invokes acquire_resource(x2 ) and obtains the resource x2 .
• Then p3 and p6 invoke acquire_resource(x2 ) and acquire_resource(x1 ), respec-
tively. As the resource asked by a process is currently owned by the other process,
each of p3 and p6 starts waiting, and the waiting period of each of them will ter-
minate when the resource it is waiting on is released. But, as each process releases
resources only after it has obtained all the resources it needs for the current ses-
sion, this will never occur (Fig. 11.7).
Fig. 11.7 A deadlock scenario involving two processes and two resources
This is a classical deadlock scenario. (Deadlocks involving more than two pro-
cesses accessing several resources can easily be constructed.)
Theorem 16 If, during each of its sessions, every process asks for the resources (it
needs for that session) according to the total order ≺, no deadlock can occur.
Proof Let us first introduce the notion of a wait-for graph (WFG). Such a graph
is a directed graph defined as follows (see also Sect. 15.1.1). Its vertices are the
processes and its edges evolve dynamically. There is an edge from pi to pj if (a)
pj is currently owning a resource x (i.e., pj has obtained and not yet released the
resource x) and (b) pi wants to acquire it (i.e., pi has invoked acquire_resource(x)).
Let us observe that, when there is a (directed) cycle, this cycle never disappears (for
it to disappear, a process in the cycle should release a resource, but this process
is blocked waiting for another resource). A cycle in the WFG means that there is a
deadlock: each process pk in the cycle is waiting for a resource owned by the next
process of the cycle, which is in turn waiting for a resource owned by the following
process of the cycle, etc., without ever exiting from the cycle. The
progress of each process pk on the directed cycle depends on a statement (releasing
a resource) that it cannot execute.
The proof is by contradiction. Let us suppose that there is a directed cycle in
the WFG. To simplify the proof, and without loss of generality, let us assume that
the cycle involves only two processes p1 and p2 : p1 wants to acquire the resource
xa that is currently owned by p2 , and p2 wants to acquire the resource xb that is
currently owned by p1 (Fig. 11.8).
The following information can be extracted from the graph:
• As p1 is owning the resource xa and waiting for the resource xb , it invoked first
acquire_resource(xa ) and then acquire_resource(xb ).
• As p2 is owning the resource xb and waiting for the resource xa , it invoked first
acquire_resource(xb ) and then acquire_resource(xa ).
As there is an imposed total order to acquire resources, it follows that either p1 or
p2 does not follow the rule. Taking the contrapositive, we have that “each process
follows the rule” ⇒ “there is no cycle in WFG”. Hence, no deadlock can occur if
every process follows the rule (during all its sessions of resource use).
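The rule of Theorem 16 is easy to enforce locally, as illustrated by the following sketch (hypothetical Python code using threads and locks to play the role of processes and single-instance resources): since every session acquires its resources in the same total order, the scenario of Fig. 11.7 cannot occur.

import threading

resources = {x: threading.Lock() for x in ('x1', 'x2', 'x3')}

def session(needed, work=lambda: None):
    for x in sorted(needed):              # acquisition follows the total order x1 < x2 < x3
        resources[x].acquire()
    try:
        work()
    finally:
        for x in sorted(needed, reverse=True):
            resources[x].release()

# Both p3 and p6 of Fig. 11.6 need x1 and x2; as both lock x1 first, no deadlock can occur.
t1 = threading.Thread(target=session, args=({'x1', 'x2'},))
t2 = threading.Thread(target=session, args=({'x1', 'x2'},))
t1.start(); t2.start(); t1.join(); t2.join()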
Hence, the length of the waiting chain (longest path in the WFG) is n − 1 = 5, which
is the worst case: only one process is active.
Example When considering Fig. 11.11, the minimal number of colors is trivially
two. Let us color the non-neighbor resources x1 , x3 , and x5 with color a and the
non-neighbor resources x2 , x4 , and x6 with color b. Moreover, let a < b be the
total order imposed on these colors. Hence, the edges of the corresponding resource
graph are directed from top to bottom, which means that p4 has to acquire first x3
and then x4 , while p5 has to acquire first x5 and then x4 .
The worst case scenario described in Sect. 11.2.2 cannot occur. When p6 has
obtained x5 and x6 , p5 is blocked waiting for x5 (which it has to request before x4 ).
Hence, p4 can obtain x3 and x4 , and p2 can obtain x1 and x2 . More generally, the
maximal length of a waiting chain is the number of colors minus 1.
Remark Let us remember that the optimal coloring of the vertices of a graph is
an NP-complete problem (of course, a more efficient near-optimal coloring may be
used). Let us observe that the non-optimal coloring in which all the resources are
colored with different colors corresponds to the “total order” strategy presented in
Sect. 11.2.2.
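The coloring-based order can be implemented by sorting the resources of a session on the pair (color, identity), as in the following sketch (hypothetical code using the two-color example described above).

color = {'x1': 'a', 'x2': 'b', 'x3': 'a', 'x4': 'b', 'x5': 'a', 'x6': 'b'}   # colors a < b

def acquisition_order(needed):
    return sorted(needed, key=lambda x: (color[x], x))

assert acquisition_order({'x3', 'x4'}) == ['x3', 'x4']   # p4: first x3 (color a), then x4
assert acquisition_order({'x5', 'x4'}) == ['x5', 'x4']   # p5: first x5 (color a), then x4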
Differently from the incremental request approach, where a process requires, one
after the other, the resources it needs during a session, the simultaneous request
approach directs each process to require simultaneously (i.e., with a single operation
invocation) all the resources it needs during a session. To that end, the processes are
provided with an operation
acquire_resource(res_seti ),
whose input parameter res_seti is the set of resources that the invoking process pi
needs for its session.
As we consider static sessions, every session of a process pi involves all the re-
sources that this process is allowed to access (i.e., res_seti = {x | i ∈ CG(x)}).
Conflict Graph for Static Sessions As, during a session, the requests of a pro-
cess are on all the resources it is allowed to access, the global conflict graph (see
Fig. 11.6) can be simplified. Namely, when two processes conflict on several re-
sources, e.g., xa and xb , these resources can be considered as a single virtual re-
source xa,b . This is because, the sessions being static, each session of these pro-
cesses involves both xa and xb .
The corresponding conflict graph for static sessions is denoted SS_CG. Its vertices
are the vertices of the conflict graphs CG(x), 1 ≤ x ≤ M. There is an edge (py , pz ) in
SS_CG if there is a conflict graph CG(x) containing such an edge. The graph SS_CG
associated with the global conflict graph of Fig. 11.6 is described in Fig. 11.12.
Mutual Exclusion with Neighbor Processes in the Conflict Graph Let us con-
sider the mutex algorithms based on individual permissions described in Chap. 10.
These algorithms are modified as follows:
• Each site manages an additional local variable neighborsi which contains the
identities of its neighbors in the conflict graph. (Let us notice that if j ∈
neighborsi , then i ∈ neighborsj .)
• When considering the non-adaptive algorithm of Fig. 10.3, the set Ri (which
was a constant equal to {1, . . . n} \ {i}) remains a constant which is now equal to
neighborsi .
• When considering the adaptive algorithm of Fig. 10.7, let pi and pj be two pro-
cesses which are neighbors in the conflict graph. The associated message PER -
MISSION ({i, j }) is initially placed at one of these processes, e.g., pi , and the initial
values of Ri and Rj are then such that j ∉ Ri and i ∈ Rj .
• When considering the bounded adaptive algorithm of Fig. 10.10, the initialization
is as follows. For any two neighbors pi and pj in the conflict graph such that i >
j , the message PERMISSION ({i, j }) is initially at pi (hence, i ∈ Rj and j ∉ Ri ),
and its initial state is used (i.e., perm_statei [j ] = used).
In the last two cases, Ri evolves according to requests, but we always have
Ri ⊆ neighborsi . Moreover, when the priority on requests (liveness) is determined
from timestamps (the first two cases), as any two processes pi and pj which are not
neighbors in the conflict graph never compete for resources, it follows that pi and
pj can have the same identity if neighborsi ∩ neighborsj = ∅ (this is because their
common neighbors, if any, must be able to distinguish them). Hence, any two pro-
cesses at a distance greater than 2 in the conflict graph can share the same identity.
This can help reduce the length of waiting chains.
The Deadlock Issue In this case, each session of a process pi involves a dynami-
cally defined subset of the resources for which pi is competing with other processes,
i.e., a subset of {x | i ∈ CG(x)}.
Let us consider the global conflict graph depicted in Fig. 11.6. If, during a ses-
sion, the process p6 wants to access the resource x1 only, it needs a mutual exclusion
Cooperation Between GM and the Mutex Algorithms Associated with Each Re-
source Mutex algorithms between neighbor processes in a graph were introduced
in the last part of Sect. 11.2.4. Let us consider that the mutex algorithm GM and all
the mutex algorithms M(x) are implemented by the one described in the last item
of Sect. 11.2.4, in which all variables are bounded. (As we have seen, this algo-
rithm was obtained from a simple modification of the adaptive mutex algorithm of
Fig. 10.10.)
Let us recall that such a mutex algorithm is fair: Any process pi that invokes
acquire_mutex() eventually enters the critical section state, and none of its neighbors
pj will enter it simultaneously. Moreover, two processes pi and pj which are not
neighbors can be simultaneously in critical section.
Let cs_statei be the local variable of pi that describes its current mutual exclusion
state with respect to the general algorithm GM. As we have seen, its value belongs to
{out, trying, in}. Similarly, let cs_statei [x] be the local variable of pi that describes
its current state with respect to the mutex algorithm M(x).
Let us consider the algorithm GM. Given a process pi , we have the following
with regard to the transitions of its local variable cs_statei . Let us recall that the tran-
sition of cs_statei from trying to in is managed by the mutex algorithm itself. Dif-
ferently, the transitions from out to trying, and from in to out, are managed by the
each resource. As an example, for a given session, pi may request ki [x] instances
of resource type x and ki [y] instances of the resource type y.
This section presents solutions for dynamic sessions (as we have seen this means
that, in each session, a process defines the specific subset of resources it needs from
the set of resources it is allowed to access). Let us observe that, as these solutions
work for dynamic sessions, they trivially work for static sessions.
The Case of Dynamic Sessions with Incremental Requests In this case, the
same techniques as the ones described in Sect. 11.2 (which was on resources with
a single instance) can be used to prevent deadlocks from occurring (total order on
the whole set of resources, or partial order defined from a vertex-coloring of the
resource graph).
As far as algorithms are concerned, a k-out-of-M mutex algorithm is associated
with each resource type x. A process then invokes acquire_resource(x, ki ), where
x is the resource type and ki the number of its instances requested by pi , with
1 ≤ ki ≤ M[x].
and used_byi^x is an array with one entry per process in CG(x). Moreover, each
message is tagged with the corresponding resource type. Figure 11.14 is a simple
extension of the basic k-out-of-M algorithm.
When pi invokes acquire_resource({(x, ki^x )}x∈RX i ), it first proceeds to the state
trying for each resource x (line 1). Then, it computes a date for its request (line 2).
Let us observe that this date is independent of the set RX i . Process pi then sends a
timestamped request to all the processes with which it competes for the resources in
RX i , and computes an upper bound of the number of instances of the resources in
which it is interested (lines 3–8). When there are enough available instances of the
resources it needs, pi is allowed to use them (line 9). It then proceeds to the state in
with respect to each of these resources (line 10).
When pi invokes release_resource({(x, ki^x )}x∈RX i ), it executes the same code as
in Fig. 11.4 for each resource of RX i (lines 11–15).
The “server” role of a process (management of the message reception) is the same
as in Fig. 11.4. The messages for a resource type x are processed by the correspond-
ing instance of the basic algorithm. The important point is that all these instances
share the same logical clock clocki . The key to the solution lies in the fact that a
single timestamp is associated with all the request messages sent during an invoca-
tion, and all conflicting invocations on one or several resources are totally ordered
by their timestamps.
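The following sketch (hypothetical code) shows the only point that matters here: a single (clock, process identity) timestamp is computed per invocation and attached to every request sent during that invocation, whatever the resource types involved.

clock_i, i = 0, 2                              # logical clock and identity of the invoking process

def timestamped_requests(rx_i, competitors):
    # rx_i: {resource_type: nb of requested instances}; competitors: processes per type
    global clock_i
    clock_i += 1
    ts = (clock_i, i)                          # one timestamp for the whole invocation
    return [(j, x, k, ts) for x, k in rx_i.items() for j in competitors[x]]

msgs = timestamped_requests({'x1': 1, 'x2': 2}, {'x1': [3, 6], 'x2': [3, 4, 6]})
assert all(m[3] == (1, 2) for m in msgs)       # every REQUEST() carries the same timestamp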
11.4 Summary
This chapter was devoted to the generalized k-out-of-M problem. This problem cap-
tures and abstracts resource allocation problems where (a) there are one or several
types of resource, (b) each resource has one or several instances, and (c) each pro-
cess may request several instances of each resource type. Several algorithms solving
this problem have been presented. Due to the multiplicity of resources, one of the
main issues that these algorithms have to solve is deadlock prevention. Approaches
that address this issue have been discussed.
The chapter has also introduced the notion of a conflict graph, which is an im-
portant conceptual tool used to capture conflicts among processes. It has also shown
how the length of process waiting chains can be reduced. Both incremental ver-
sus simultaneous requests on the one side, and static versus dynamic sessions of
resource allocation on the other side, have been addressed in detail.
operation acquire_resource(ki ) is
(1) cs_statei ← trying;
(2) rdi ← clocki + 1;
(3) for each j ∈ Ri do
(4) if (used_byi [j ] = 0) then send REQUEST (rdi , i) to pj ;
(5) sent_toi [j ] ← true; used_byi [j ] ← M
(6) else sent_toi [j ] ← false
(7) end if
(8) end for;
(9) used_byi [i] ← ki ;
(10) wait (Σ1≤j≤n used_byi [j ] ≤ M);
(11) cs_statei ← in.
operation release_resource(ki ) is
(12) cs_statei ← out;
(13) foreach j ∈ perm_delayedi do send NOT _ USED (ki ) to pj end for;
(14) perm_delayedi ← ∅.
Fig. 11.15 Another algorithm for the k-out-of-M mutex problem (code for pi )
1. The k-out-of-M algorithm described in Fig. 11.4 requires between 2(n − 1) and
3(n − 1) messages per use of a set of instances of the resource. The following al-
gorithm is proposed to reduce this number of messages. When used_byi [j ] ≠ 0,
the process pi knows an upper bound on the number of instances of the resource
used by pj . In that case, it is not needed for pi to send a request message to pj .
Consequently, pi sends a request to pj only when used_byi [j ] = 0; when this
occurs, pi records it by setting a flag sent_toi [j ] to the value true.
The corresponding algorithm is described in Fig. 11.15 (where Ri = {1, . . . , n}
\ {i}). Let us recall that the quantity Σ1≤j≤n used_byi [j ], which appears in the
wait statement (line 10), is always computed in local mutual exclusion with the
code associated with the reception of a message NOT _ USED ().
• Show that the algorithm is correct when the channels are FIFO.
• Show that the algorithm is no longer correct when the channels are not FIFO.
To that end construct a counterexample.
• Is the statement sent_toi [j ] ← true at line 24 necessary? Justify your answer.
• Is it possible to replace the static set Ri by a dynamic set (as done in the mutex
algorithm of Fig. 10.7)?
• Can the message exchange pattern described in Fig. 11.3 occur?
• What are the lower and upper bounds on the number of messages per use of a
set of k instances of the resource?
• Is the algorithm adaptive?
• Let the waiting time be the time spent in the wait statement. Compare the
waiting time of the previous algorithm with the waiting time of the algorithm
described in Fig. 11.4. From a waiting time point of view, is one algorithm
better than the other?
Solution in [311].
2. Write the server code (i.e., the code associated with message receptions) of the
generalized k-out-of-M algorithm for simultaneous requests whose client code
is described in Fig. 11.14.
3. The generalized k-out-of-M algorithm for simultaneous requests in dynamic ses-
sions described in Fig. 11.14 uses timestamps.
Assuming the conflict graph of Fig. 11.6 in which each resource has a single
instance (hence k = 1), let us consider an execution in which concurrently
• p2 issues acquire_resource(1) (i.e., p2 requests resource x1 ),
• p6 issues acquire_resource(1, 3) (i.e., p6 requests each of the resources x1
and x3 ),
• p4 issues acquire_resource(3) (i.e., p4 requests resource x3 ).
Moreover, let ⟨hi , i⟩ be the timestamp of the request of pi . How are these
requests served if
• ⟨h2 , 2⟩ < ⟨h6 , 6⟩ < ⟨h4 , 4⟩?
• ⟨h2 , 2⟩ < ⟨h4 , 4⟩ < ⟨h6 , 6⟩?
What can be concluded about the order in which the requests are served?
4. The algorithm of Fig. 11.14 solves the generalized k-out-of-M problem for si-
multaneous requests in dynamic sessions. This algorithm uses timestamps and is
consequently unbounded. Design a bounded algorithm for the same problem.
5. Write the full code of the algorithm for simultaneous requests in dynamic
sessions (see Sect. 11.2.5), i.e., the code of (a) the corresponding operations
acquire_resource(RX i ) and release_resource(RX i ), and (b) the code associated
with the corresponding message receptions. (As in the algorithm described in
Fig. 11.14, RX i denotes the dynamically defined set of resources that pi needs
for its current session.)
Elements for a solution in [242, 387].
Part IV
High-Level Communication Abstractions
The notion of causal message delivery was introduced by K.P. Birman and
T.A. Joseph (1987).
The Problem Let us consider the communication pattern described at the left of
Fig. 12.1. Process p1 sends first the message m1 to p3 and then the message m2
to p2 . Moreover, after it has received m2 , p2 sends the message m3 to p3 . Hence,
when considering the partial order relation →ev on events (defined in Chap. 6), the
sending of m1 belongs to the causal past of the sending of m2 , and (by the transitivity
created by m2 ) belongs to the causal past of the sending of m3 . We consequently
have s(m1 ) →ev s(m3 ) (where s(m) denotes the “sending of m” event). But we do
not have r(m1 ) →ev r(m3 ) (where r(m) denotes the “reception of m” event).
While the messages m1 and m2 are sent to the same destination process, and
their sendings are causally related, their reception order does not comply with their
sending order. The causal message delivery order property (formally defined below)
is not ensured. Differently, the reception order in the communication pattern de-
scribed at the right in Fig. 12.1 is such that r(m1 ) →ev r(m3 ) and satisfies the causal
message delivery order property.
Definition The causal message delivery (also called causal order or causal mes-
sage ordering) abstraction provides the processes with two operations denoted
co_send() and co_deliver(). When a process invokes them, we say that it co_sends
or co_delivers a message. The abstraction is defined by the following properties.
It is assumed that all messages are different (which can be easily realized by associ-
ating with each message a pair made up of a sequence number plus the identity of
the sender process). Let co_s(m) and co_del(m) be the events associated with the
co_sending of m and its co_delivery, respectively.
• Validity. If a process pi co_delivers a message m from a process pj , then m was
co_sent by pj .
• Integrity. No message is co_delivered more than once.
• Causal delivery order. For any pair of messages m and m′ , if co_s(m) →ev
co_s(m′ ) and m and m′ have the same destination process, we have
co_del(m) →ev co_del(m′ ).
Fig. 12.2 The delivery pattern prevented by the empty interval property
three requirements define the safety property of causal message delivery. Validity
states that no message is created from thin air or is corrupted. Integrity states that
there is no message duplication. Causal order states the added value provided by the
abstraction. The last requirement (termination) is a liveness property stating that no
message is lost.
While Fig. 12.1 considers a causal chain involving only two messages (m2 and
m3 ), the length of such a chain in the third requirement can be arbitrary.
A Geometrical Remark As suggested by the right side of Fig. 12.1, causal order
on message delivery is nothing more than the application of the famous “triangle
inequality” to messages.
Let us recall the following definitions associated with each event e (Sect. 6.1.3):
• Causal past of e: past(e) = {f | f →ev e},
• Causal future of e: future(e) = {f | e →ev f }.
Let us consider an execution in which the messages are sent and received with the
operations send() and receive(), respectively. Moreover, let M be the set of messages
that have been sent. The message exchange pattern of this execution satisfies the causal
message delivery property, if and only if we have
∀ m ∈ M: future(s(m)) ∩ past(r(m)) = ∅,
or equivalently,
∀ m ∈ M: {e | (s(m) →ev e) ∧ (e →ev r(m))} = ∅.
This formula (illustrated in Fig. 12.2) states that, for any message m, there is a
single causal path from s(m) to r(m) (namely the path made up of the two events
s(m) followed by r(m)). Considered as a predicate, this formula is called the empty
interval property.
It is important to notice that two messages m and m′ , which have been co_sent to
the same destination process and whose co_sendings are not causally related, can be
received in any order. Concerning the abstraction power of causal message delivery
with respect to other ordering constraints, we have the following:
• The message delivery constraint guaranteed by FIFO channels is simply causal
order on each channel taken separately. Consequently, FIFO channels define an
ordering property weaker than causal delivery.
• Let us extend causal message delivery from point-to-point communication to
broadcast communication. This is addressed in Sect. 12.3. Let us recall that the
total order broadcast abstraction presented in Sect. 7.1.4 is stronger than causal
broadcast. It is actually “causal broadcast” plus “same message delivery order at
each process” (even the messages whose co_broadcasts are not causally related
must be co_delivered in the same order at any process).
We consequently have the hierarchies of communication abstractions described
in Table 12.1, where “asynchronous” means no constraint on message delivery, and
≺ means “strictly weaker than”.
• senti [1..n, 1..n] is an array of integers, each initialized to 0. The entry senti [k, ℓ]
represents the number of messages co_sent by pk to pℓ , as known by pi . (Let us
recall that a causal message chain ending at a process pi is the only way for pi to
“learn” new information.)
• deliveredi [1..n] is an array of integers, each initialized to 0. The entry
deliveredi [j ] represents the number of messages co_delivered by pi , which have
been co_sent by pj to pi .
• bufferi is the local buffer where pi stores the messages that have been received
and cannot yet be co_delivered. The algorithm is expressed at an abstraction level
at which the use of this buffer is implicit.
operation co_send(m) to pj is
(1) send CO(m, senti ) to pj ;
(2) senti [i, j ] ← senti [i, j ] + 1.
The Algorithm The algorithm based on the previous data structures is described
in Fig. 12.4. When a process pi invokes co_send(m), it first sends the message
CO(m, senti ) to the destination process pj (line 1), and then increases the se-
quence number senti [i, j ], which counts the number of messages co_sent by pi
to pj (line 2). Let us notice that the sequence number of an application message m
is carried by the algorithm message CO(m, sent) (this sequence number is equal to
sent[i, j ] + 1).
When pi receives from the network a message CO(m, sent) from a process pj ,
it stores the message in bufferi until its delivery condition DC(m) becomes satisfied
(line 3). When this occurs, m is co_delivered (line 4), and the control variables are
updated to take this co_delivery into account. First senti [j, i] and deliveredi [j ] are
increased (lines 5–6) to record the fact that m has been co_delivered. Moreover, the
knowledge on the causal past of m (which is captured in sent[1..n, 1..n]) is added
to the current knowledge of pi (namely, every local variable senti [x, y] is updated to
max(senti [x, y], sent[x, y]), lines 7–9).
It is assumed that a message is co_delivered as soon as its delivery condition be-
comes true. If, due to the co_delivery of some message m, the conditions associated
with several messages become true simultaneously, these messages are co_delivered
in any order, one after the other.
Remark Let us observe that, for any pair (i, j ), both senti [j, i] and deliveredi [j ]
are initialized to 0 and are updated the same way at the same time (lines 5–6). It
follows that the vector deliveredi [1..n] can be replaced by the vector senti [1..n, i].
Line 6 can then be suppressed and the delivery condition becomes
DC(m) ≡ (∀k : senti [k, i] ≥ sent[k, i]).
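The complete behavior can be summarized by the following sketch (hypothetical Python code in a single address space; it only illustrates the algorithm of Fig. 12.4, with the channels abstracted by direct calls).

import copy

class Process:
    def __init__(self, ident, n):
        self.i, self.n = ident, n
        self.sent = [[0] * (n + 1) for _ in range(n + 1)]    # senti[1..n][1..n]
        self.delivered = [0] * (n + 1)                        # deliveredi[1..n]
        self.buffer = []                                      # received, not yet co_delivered
        self.log = []                                         # co_delivered payloads, in order

    def co_send(self, m, dest):
        msg = (self.i, m, copy.deepcopy(self.sent))           # CO(m, senti)
        self.sent[self.i][dest.i] += 1
        dest.receive(msg)

    def receive(self, msg):
        self.buffer.append(msg)
        self.try_deliver()

    def try_deliver(self):
        progress = True
        while progress:
            progress = False
            for msg in list(self.buffer):
                j, m, sent_m = msg
                if all(self.delivered[k] >= sent_m[k][self.i] for k in range(1, self.n + 1)):
                    self.buffer.remove(msg)
                    self.log.append(m)                        # co_delivery of m
                    self.sent[j][self.i] += 1
                    self.delivered[j] += 1
                    for x in range(1, self.n + 1):            # merge the causal knowledge
                        for y in range(1, self.n + 1):
                            self.sent[x][y] = max(self.sent[x][y], sent_m[x][y])
                    progress = True

# The left scenario of Fig. 12.1: even if m1 is slow, p3 co_delivers m1 before m3.
p1, p2, p3 = Process(1, 3), Process(2, 3), Process(3, 3)
m1_in_transit = (1, 'm1', copy.deepcopy(p1.sent)); p1.sent[1][3] += 1   # co_send of m1 to p3, delayed
p1.co_send('m2', p2)
p2.co_send('m3', p3)                           # m3 arrives at p3 first and is buffered
p3.receive(m1_in_transit)                      # m1 finally arrives
assert p3.log == ['m1', 'm3']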
Proof As the process pj does not co_send messages to itself, the only line at which
sentj [j, i] is modified is line 2. The proof trivially follows from the initialization of
sentj [j, i] to 0 and the sequentiality of lines 1 and 2.
Proof Let m1 and m2 be two application messages such that (a) co_send(m1)
causally precedes co_send(m2), (b) both are co_sent to the same process pi , (c) m1
is co_sent by pj 1 , and (d) m2 is co_sent by pj 2 . Moreover, let CO(m1, sent1) and
CO(m2, sent2) be the corresponding messages sent by the algorithm (see Fig. 12.5).
As there is a causal path from co_send(m1) to co_send(m2), it follows from the
fact that line 2 is executed between send(m1, sent1) and send(m2, sent2) that we
have sent1[j 1, i] < sent2[j 1, i]. We consider two cases.
• j 1 = j 2 (m1 and m2 are co_sent by the same sender). As deliveredi [j 1] is the se-
quence number of the last message co_delivered from pj 1 , it follows from (a) the
predicate deliveredi [j 1] ≥ sent[j 1, i], and (b) the fact that sent[j 1, i] is the se-
quence number of the previous message co_sent by pj 1 to pi (Lemma 8), that
the messages co_sent by pj 1 to pi are co_delivered according to their sequence
number. This proves the lemma for the messages co_sent by the same process.
• j 1 ≠ j 2 (m1 and m2 are co_sent by distinct senders). If the delivery condition
of m2 is satisfied we have deliveredi [j 1] ≥ sent2[j 1, i], which means that, as
sent2[j 1, i] > sent1[j 1, i], we have then deliveredi [j 1] > sent1[j 1, i]. But, it
follows from the previous item that the messages from pj 1 are received according
to their sequence numbers, from which we conclude that m1 has been previously
co_delivered, which concludes the proof of the lemma.
Proof Let ≺ denote the following relation on the application messages. Let m and
m′ be any pair of application messages; m ≺ m′ if co_send(m) causally precedes
co_send(m′ ). As the causal precedence relation →ev on events is a partial order, so
is the relation ≺.
Given a process pi , let pendingi be the set of messages which have been co_sent
to it and are never co_delivered. Assuming pendingi ≠ ∅, let m ∈ pendingi be a mes-
sage that is minimal with respect to ≺ (i.e., a message which has no predecessor—
according to ≺—in pendingi ).
As m cannot be co_delivered by pi there is (at least) one process pk such that
deliveredi [k] < sent[k, i] (line 3). It then follows from Lemma 9 that there is a
message m′ such that (a) m′ has been co_sent by pk to pi, (b) co_send(m′) causally
precedes co_send(m), and (c) m′ is not co_delivered by pi (Fig. 12.5 can still be con-
sidered after replacing m1 by m′ and m2 by m). Hence this message belongs to
pendingi and is such that m′ ≺ m. But this contradicts the fact that m is minimal in
pendingi , which concludes the proof of the liveness property.
Theorem 17 The algorithm described in Fig. 12.4 implements the causal message
delivery abstraction.
Proof The proofs of the validity and integrity properties are trivial and are left to the
reader. The proof of the causal message delivery follows from Lemma 9, and the
proof of the termination property follows from Lemma 10.
The main drawback of the previous algorithm lies in the size of the control infor-
mation that has to be attached to each application message m. Let b be the number
of bits used for each entry of a matrix senti [1..n, 1..n]. As a process does not send
messages to itself, the diagonal entries senti[j, j], 1 ≤ j ≤ n, need not be transmitted. The size of
the control information that is transmitted with each application message is conse-
quently (n² − n)b bits. This section shows how this number can be reduced.
Basic Principle Let us consider a process pi that sends a message CO(m, senti )
to a process pj . The idea consists in sending to pj only the set of values of the
entries of the matrix senti that have been modified since the last message CO() sent
to pj. The set of values that has to be transmitted by pi to pj is the set of 3-tuples
{⟨k, ℓ, senti[k, ℓ]⟩ | senti[k, ℓ] has been modified since the last co_send of pi to pj}.
Let us remember that a similar approach was used in Sect. 7.3.2 to reduce the
size of the vector dates carried by messages.
A Better Solution: Data Structures This section presents a solution for which
the additional data structures at each process need only (n² + 1)b bits. These data
structures are the following.
• clocki is a local logical clock which measures the progress of pi , counted as
the number of messages it has co_sent (i.e., clocki counts the number of rele-
vant events—here they are the invocations of co_send()—issued by pi ). Initially,
clocki = 0.
• last_sent_toi[1..n] is a vector, initialized to [0, . . . , 0], such that last_sent_toi[j] records the local date of the last co_send to pj.
• last_modi[1..n, 1..n] is an array such that last_modi[k, ℓ] is the local date of the last modification of senti[k, ℓ]. The initial value of each last_modi[k, ℓ] is −1.
As the diagonal of last_modi is useless, it can be used to store the vector last_sent_toi[1..n].
operation co_send(m) to pj is
(1) let seti = {⟨k, ℓ, senti[k, ℓ]⟩ | last_modi[k, ℓ] ≥ last_sent_toi[j]};
(2) send CO(m, seti ) to pj ;
(3) clocki ← clocki + 1;
(4) senti [i, j ] ← senti [i, j ] + 1; last_modi [i, j ] ← clocki ;
(5) last_sent_toi [j ] ← clocki .
Fig. 12.6 An implementation reducing the size of control information (code for pi )
received by pi . Let us notice that—except possibly for the first message co_sent by
pj —the set set cannot be empty because there is at least the triple j, i, x, where x
is the sequence number of the previous message co_sent by pj to pi . (See Fig. 12.7.)
After m has been co_delivered, pi updates its local control variables
deliveredi [j ], senti [j, i] as in the basic algorithm (line 8–9). It also updates en-
tries of senti [1..n, 1..n] according to the values it has received in the set seti
(line 10–12).
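As an illustration of this principle, the sender side of the refined algorithm can be sketched as follows in Python (the class name Sender, the merge_delta() helper, and the send() callback are illustrative assumptions of the sketch; indices are 0-based).

class Sender:
    def __init__(self, i, n, send):
        self.i, self.n, self.send = i, n, send
        self.clock = 0
        self.sent = [[0] * n for _ in range(n)]
        self.last_sent_to = [0] * n
        self.last_mod = [[-1] * n for _ in range(n)]

    def co_send(self, m, j):
        # line 1: only the entries modified since the last co_send to pj are shipped
        delta = [(k, l, self.sent[k][l])
                 for k in range(self.n) for l in range(self.n)
                 if self.last_mod[k][l] >= self.last_sent_to[j]]
        self.send(j, (m, self.i, delta))            # line 2
        self.clock += 1                             # line 3
        self.sent[self.i][j] += 1                   # line 4
        self.last_mod[self.i][j] = self.clock
        self.last_sent_to[j] = self.clock           # line 5

    def merge_delta(self, delta):
        # receiver side (lines 10-12 in the text): merge the received entries
        for (k, l, v) in delta:
            self.sent[k][l] = max(self.sent[k][l], v)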
operation co_broadcast(m) is
(1) for each j ∈ {1, . . . , n} \ {i} do send CO_BR(m, broadcasti [1..n]) to pj end for;
(2) broadcasti [i] ← broadcasti [i] + 1;
(3) co_delivery of m to the application layer.
When this condition becomes true, m is locally co_delivered (line 5), and
broadcasti [j ] is increased to register this co_delivery of a message co_broadcast
by pj (line 6).
Remark on the Vectors broadcasti Let us notice that each array broadcasti [1..n]
is nothing more than a vector clock, where the local progress of each process is mea-
sured by the number of messages it has co_broadcast. Due to the delivery condition,
the update at line 6 is equivalent to a vector clock update based on the vector carried by the message.
It follows that the technique presented in Sect. 7.3.2 to reduce the size of control
information carried by application messages can be used.
This section presents a simple causal broadcast algorithm that reduces the size of
the control information carried by messages.
Message Causality Graph Let M be the set of all the messages which are
co_broadcast during the execution of an application. Let us define a partial order
relation, denoted ≺im, on application messages as follows. Let m and m′ be any two
application messages. We have m ≺im m′ (read: m immediately precedes m′ in the
message graph) if:
• co_broadcast(m) causally precedes co_broadcast(m′).
• There is no message m″ such that (a) co_broadcast(m) causally precedes co_broadcast(m″), and (b) co_broadcast(m″) causally precedes co_broadcast(m′).
An example is depicted in Fig. 12.12. The execution is on the left side, and
the corresponding message graph is on the right side. This graph, which captures
the immediate causal precedence on the co_broadcast of messages, states that the
co_delivery of a message depends only on the co_delivery of its immediate pre-
decessors in the graph. As an example, the co_delivery of m4 is constrained by
the co_delivery of m3 and m2 , and the co_delivery of m3 is constrained by the
co_delivery of m1 . Hence, the co_delivery of m4 is not (directly) constrained by the
co_delivery of m1 . It follows that a message has to carry control information only
on its immediate predecessors in the graph defined by the relation ≺im .
Causal Barrier and Local Data Structures Let us associate with each applica-
tion message an identity (a pair made up of a sequence number plus a process iden-
tity). The causal barrier associated with an application message is the set of iden-
tities of the messages that are its immediate predecessors in the message causality
graph. Each process manages accordingly the following data structures.
• causal_barrieri is the set of identities of the messages that are immediate prede-
cessors of the next message that will be co_broadcast by pi . This set is initially
empty.
• deliveredi [1..n] has the same meaning as in previous algorithms. It is initialized
to [0, . . . , 0], and deliveredi [j ] contains the sequence number of the last message
from pj that has been co_delivered by pi .
• sni is a local integer variable initialized to 0. It counts the number of messages
that have been co_broadcast by pi .
operation co_broadcast(m) is
(1) for each j ∈ {1, . . . , n} \ {i} do send CO_BR(m, causal_barrieri ) to pj end for;
(2) co_delivery of m to the application layer;
(3) sni ← sni + 1;
(4) causal_barrieri ← {⟨sni, i⟩}.
Fig. 12.13 A causal broadcast algorithm based on causal barriers (code for pi )
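The following Python sketch illustrates how a process could use these data structures; it is not a transcription of Fig. 12.13 (whose delivery side is not reproduced above). The delivery rule used here—a message is co_deliverable when every identity in its causal barrier has already been co_delivered—and the way causal_barrier is maintained on delivery are plausible renditions of the description, and all names are illustrative.

class CausalBarrierProcess:
    def __init__(self, i, n, broadcast):
        self.i, self.n = i, n
        self.broadcast = broadcast           # broadcast(payload): assumed transport primitive
        self.causal_barrier = set()          # identities of the immediate predecessors
        self.delivered = [0] * n             # delivered[j]: seq number of last msg from pj delivered
        self.sn = 0                          # number of messages co_broadcast by pi
        self.pending = []                    # received messages waiting for their condition

    def co_broadcast(self, m):
        self.broadcast((m, self.i, frozenset(self.causal_barrier)))
        print(f"p{self.i} co_delivers {m}")  # local co_delivery, as in Fig. 12.13
        self.sn += 1
        self.delivered[self.i] += 1          # pi has co_delivered its own message
        self.causal_barrier = {(self.sn, self.i)}

    def on_receive(self, payload):
        self.pending.append(payload)
        self.try_deliver()

    def try_deliver(self):
        progress = True
        while progress:
            progress = False
            for (m, j, barrier) in list(self.pending):
                # DC(m): every immediate predecessor of m has already been co_delivered
                if all(self.delivered[k] >= sn for (sn, k) in barrier):
                    self.pending.remove((m, j, barrier))
                    self.delivered[j] += 1
                    # m replaces its own predecessors among the predecessors of pi's next message
                    self.causal_barrier = (self.causal_barrier - set(barrier)) | {(self.delivered[j], j)}
                    print(f"p{self.i} co_delivers {m}")
                    progress = True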
This section has two aims: show the versatility of the causal barrier notion and
introduce the notion of messages with bounded lifetime. The algorithm presented in
this section is due to R. Baldoni, R. Prakash, M. Raynal, and M. Singhal (1998).
Bounded Lifetime Message The lifetime of a message is the physical time du-
ration during which, after the message has been sent, its content is meaningful and
can consequently be used by its destination process(es). A message that arrives at
its destination process after its lifetime has elapsed becomes useless and must be
discarded. For the destination process, it is as if the message was lost. A message
that arrives at a destination process before its lifetime has elapsed must be delivered
by the expiration of its lifetime.
For simplicity, we assume that all the messages have the same lifetime Δ and that processing times are negligible when compared to message transit times.
Let τ be the sending time of a message. The physical date τ + Δ is the deadline after which this message is useless for its destination process. This is illustrated in Fig. 12.14. The message m arrives by its deadline and must be processed by its deadline. On the opposite, m′ arrives after its deadline and must be discarded.
It is also assumed that the lifetime Δ is such that, in practice, a great percentage of messages arrive by their deadline, as is usually the case in distributed multimedia applications.
When Δ ≠ +∞, it is possible that there is a process pk that, in the causal past of co_broadcast(m), has co_broadcast a message m′ (identified ⟨sdt, k⟩) such that (a) ⟨sdt, k⟩ ∈ causal_barrier, (b) deliveredi[k] < sdt, and (c) m′ will be discarded because it will arrive after its deadline. The fact that (i) the co_broadcast of m′ causally precedes the co_broadcast of m, and (ii) m′ is not co_delivered, must not prevent pi from co_delivering m. To solve this issue, pi delays the co_delivery of m until the deadline of m′ (namely sdt + Δ), but no more. The final delivery condition is consequently
DC(m) ≡ ∀ ⟨sdt, k⟩ ∈ causal_barrier : (deliveredi[k] ≥ sdt) ∨ (CLOCK − sdt > Δ).
operation co_broadcast(m) is
(1) st ← CLOCK;
(2) for each j ∈ {1, . . . , n} \ {i} do send CO_BR(m, st, causal_barrieri ) to pj end for;
(3) co_delivery of m to the application layer;
(4) causal_barrieri ← {⟨st, i⟩}.
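A direct transcription of this Δ-aware delivery condition might look as follows (a sketch only: DELTA, the clock function, and the convention that delivered[k] stores the greatest sending date co_delivered from pk are assumptions consistent with the description above).

import time

DELTA = 2.0  # assumed message lifetime (seconds), written Δ in the text

def delivery_condition(causal_barrier, delivered, clock=time.monotonic):
    # DC(m): every immediate predecessor <sdt, k> of m has either been co_delivered,
    # or its lifetime has expired and it may be ignored
    now = clock()
    return all(delivered[k] >= sdt or now - sdt > DELTA
               for (sdt, k) in causal_barrier)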
Strong Total Order Broadcast Abstraction The total order broadcast abstraction
was introduced in Sect. 7.1.4 to illustrate the use of scalar clocks. For the self-
completeness of this chapter, its definition is repeated here. This abstraction provides
the processes with two operations denoted to_broadcast() and to_deliver() which
satisfy the following properties. Let us recall that it is assumed that all the messages
which are to_broadcast are different.
• Validity. If a process to_delivers a message m, there is a process that has
to_broadcast m.
• Integrity. No message is to_delivered more than once.
• Total order. If a process to_delivers a message m before a message m′, then no process to_delivers m′ before m.
• Causal precedence order. If to_broadcast(m) causally precedes to_broadcast(m′), then no process to_delivers m′ before m.
• Termination. Every message that is to_broadcast is to_delivered by every process.
In the following, this communication abstraction is called the strong total order broadcast abstraction.
Weak Total Order Broadcast Abstraction Some applications do not require that
message delivery complies with causal order. For these applications, the important
point is the fact that the messages are delivered in the same order at each process,
the causality among their broadcast being irrelevant. This defines a weak total order
abstraction, which is total order broadcast without the “causal precedence order”
requirement.
Causal Order Versus Total Order As has been seen in the algorithms ensuring
causal message delivery, when a process pi receives an algorithm message carrying
an application message m, it can be forced to delay the local delivery of m (until
the associated delivery condition becomes true), but it is not required to coordinate
with other processes.
This is the fundamental difference with the algorithms that implement total order
message delivery. Let us consider Fig. 12.17 in which, independently, p1 invokes
to_broadcast(m) and p2 invokes to_broadcast(m′). Let us consider any two distinct
processes pi and pj. As the to_broadcasts of m and m′ are not causally related, it
is possible that pi receives first the algorithm message carrying m and then the one
carrying m′, while pj receives them in the opposite order. If the message delivery
requirement were causal order, there would be no problem, but this is no longer the
case for total order: pi and pj have to coordinate in one way or another, directly or
indirectly, to agree on the same delivery order. This explains why the algorithms im-
plementing the total order message delivery requirement are inherently more costly
(in terms of time and additional exchanged messages) than the algorithms imple-
menting only causal message delivery.
The token carries a global sequence number, denoted gsn, initialized to 0. When a process pi invokes to_broadcast(m), it waits for the token and, when it receives the token, it increases gsn by 1, whose new value becomes the global sequence number associated with m. Then, pi sends the message GTO_BR(m, gsn) to all the processes. Finally, a process to_delivers the application messages it receives according to their sequence numbers. An example is depicted in Fig. 12.19.
As far as the moves of the token are concerned, two approaches are possible.
• As the token is a mobile object, any of the algorithms presented in Chap. 5 can be
used. A process that needs a global sequence number invokes the corresponding
operations acquire_object() and release_object().
This approach can be interesting when the underlying navigation algorithm is
adaptive as, in this case, only the processes that want to to_broadcast messages
are required to participate in the navigation algorithm.
• The processes are arranged along a unidirectional logical ring, and the token
moves perpetually along the ring. When a process receives the token, it com-
putes the next sequence number (if needed), and then forwards the token to the
next process on the ring.
This solution has the drawback that all the processes are required to participate
in the move of the token, even if they do not want to to_broadcast messages. This
approach is consequently more interesting in applications where all the processes
very frequently want to to_broadcast messages.
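The token-based scheme can be summarized by the following Python sketch (illustrative names; the token is modeled as a dictionary carrying gsn, and broadcast() is an assumed transport primitive).

class TokenSite:
    """One process of the token-based total order broadcast (illustrative sketch)."""
    def __init__(self, i, broadcast):
        self.i = i
        self.broadcast = broadcast      # broadcast(payload): assumed transport primitive
        self.next_to_deliver = 1
        self.pending = {}               # gsn -> message

    def on_token(self, token, msgs_to_broadcast):
        # the token carries the global sequence number gsn
        for m in msgs_to_broadcast:
            token['gsn'] += 1
            self.broadcast(('GTO_BR', m, token['gsn']))
        return token                    # then forwarded to the next process / requester

    def on_gto_br(self, m, gsn):
        # to_deliver messages strictly according to their global sequence numbers
        self.pending[gsn] = m
        while self.next_to_deliver in self.pending:
            print(f"p{self.i} to_delivers {self.pending.pop(self.next_to_deliver)}")
            self.next_to_deliver += 1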
Fig. 12.20 Clients and servers in total order broadcast
Delivery Condition The set pendingj contains tuples of the form ⟨m, date, i, tag⟩,
where m is an application message, i the identity of its sender, date the tentative de-
livery date currently associated with m, and tag is a control bit. If tag = no_del,
the message m has not yet been assigned its final delivery date and cannot consequently
be to_delivered (added to to_deliverablej ). Let us notice that its final delivery date
will be greater than or equal to its current tentative delivery date. If tag = del, m
has been assigned its final delivery date and can be to_delivered if it is stable. Stabil-
ity means that no message in pendingj (and by transitivity, no message to_broadcast
in the future) can have a timestamp smaller than the one associated with m.
The delivery condition DC(m) for a message m such that ⟨m, date, i, del⟩ ∈ pendingj is consequently the following (let us recall that all application messages are assumed to be different):
DC(m) ≡ ∀ ⟨m′, date′, i′, −⟩ ∈ pendingj : (m′ ≠ m) ⇒ ((date < date′) ∨ ((date = date′) ∧ (i < i′))).
The Total Order Broadcast Algorithm The corresponding total order broadcast
algorithm is described in Fig. 12.21. A client process pi is allowed to invoke again
to_broadcast() only when it has completed its previous invocation (this means that,
while it is waiting at line 3, a process remains blocked until it has received the
appropriate messages).
When it invokes to_broadcast(m), pi first computes the sequence number that
will identify m (line 1). It then sends the message INQUIRY (m, sni ) to each server
(line 2) and waits for their delivery date proposals (line 3). When pi has received
326 12 Order Constraints on Message Delivery
them, pi computes the final delivery date of m (line 4) and sends it to the servers
(line 5). (As we can see, this is nothing more than a classical handshake coordination
mechanism which has already been encountered in other chapters.)
The behavior of a server qj is made up of two parts. When qj receives a message
TO_1(m, sn) from a client pi, qj stores the sequence number of the message (line 6), and in-
creases its local clock (line 7), adds ⟨m, clockj, i, no_del⟩ to pendingj (line 8), and
sends to pi a proposed delivery date for m (line 9). Then, as far as m is concerned,
qj waits until it has received the final date associated with m (line 10). When this
occurs, it replaces in pendingj the proposed date by the final delivery date and marks
the message as deliverable (line 11), and updates its local scalar clock (line 12).
The second processing part associated with a server is a background task
which suppresses messages from pendingj and adds them at the tail of the queue
to_deliverablej . The core of this task, described at lines 13–17, is the delivery con-
dition DC(m), which has been previously introduced.
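The stability test performed by this background task can be sketched as follows (a sketch only: the tuple layout and the function name are illustrative, and the timestamp order is the lexicographical order on pairs (date, sender identity)).

DEL, NO_DEL = "del", "no_del"

def stable_deliverable(pending):
    """Return the messages of pending that satisfy DC(m), in delivery order.
    pending: list of tuples (m, date, sender, tag) as described above (sketch)."""
    ready = []
    for (m, date, sender, tag) in pending:
        if tag != DEL:
            continue
        # DC(m): no other pending message can have a smaller timestamp (date, sender)
        if all((date, sender) < (d2, s2)
               for (m2, d2, s2, _) in pending if m2 != m):
            ready.append((date, sender, m))
    return [m for (_, _, m) in sorted(ready)]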
This section considers total order broadcast in a synchronous system. The algorithm
which is described is a simplified version of a fault-tolerant algorithm due to F.
Cristian, H. Aghili, R. Strong, and D. Dolev (1995).
operation to_broadcast(m) is
(1) sdt ← CLOCK;
(2) for each j ∈ {1, . . . , n} do send TO_BR(m, sdt) to pj end for.
The Algorithm The algorithm is described in Fig. 12.22. When a process invokes
to_broadcast(m), it sends the message TO_BR(m, sdt) to all the processes (including
itself), where sdt is the message sending date.
When a process pi receives a message TO_BR(m, sdt), pi delays its delivery until
time sdt + Δ. If several messages have the same delivery date, they are to_delivered
according to their timestamp order (i.e., according to the identity of their senders).
The local variables pendingi and to_deliverablei have the same meaning as in the
previous algorithms.
As we can see, this algorithm is based on the following principle: It systemati-
cally delays the delivery of each message as if its transit duration was equal to the
upper bound Δ. Hence, this algorithm reduces all cases to the worst-case scenario.
It is easy to see that the total order on message delivery, which is the total order on
their sending times (with process identities used to order the messages sent at the
same time), is the same at all the processes.
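A minimal sketch of this principle in Python (DELTA stands for the upper bound Δ, and tick() is an assumed function called as physical time progresses):

import heapq

DELTA = 0.1  # assumed upper bound on message transfer delay (Δ)

class SyncToBroadcast:
    def __init__(self, i):
        self.i = i
        self.pending = []            # heap of (delivery date, sender id, message)

    def on_to_br(self, m, sdt, sender):
        # deliver at sdt + Δ; ties broken by the sender identity (timestamp order)
        heapq.heappush(self.pending, (sdt + DELTA, sender, m))

    def tick(self, now):
        while self.pending and self.pending[0][0] <= now:
            _, sender, m = heapq.heappop(self.pending)
            print(f"p{self.i} to_delivers {m} (from p{sender})")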
Let us notice that, on average, the to_delivery of a message takes two message transfer delays in the coordinator-based algorithm of Sect. 12.4.2, and three message transfer delays in the client–server algorithm of Sect. 12.4.3. Differently, the to_delivery of a message always takes Δ in the synchronous algorithm of Fig. 12.22. It follows that, when considering algorithms executed on top of a synchronous system, an asynchronous algorithm can be more efficient than a synchronous algorithm when two (resp., three) message transfer delays are smaller than Δ.
This section considers four ordering properties that can be
associated with each message sent on a channel, and algorithms which implement
them. This section has to be considered as an exercise in two-process communica-
tion.
• type(m) = ct_past. In this case, m cannot bypass the messages m′ sent before it (m is controlled by its past). Moreover, nothing prevents m from being bypassed by messages m′ sent after it. This constraint is depicted in Fig. 12.24: all the messages m′ sent before m are delivered by pj before m.
Formally, if type(m) = ct_past, we have for any other message m′ sent by pi to pj:
(s(m′) −→ev s(m)) ⇒ (del(m′) −→ev del(m)).
• type(m) = marker. In this case, m can neither bypass messages m′ sent after it, nor be bypassed by messages m′ sent before it. This constraint is depicted in Fig. 12.25. (This type of message has been implicitly used in Sect. 6.6, devoted to the determination of a consistent global state of a distributed computation.)
Formally, if type(m) = marker, we have for any other message m′ sent by pi to pj:
((s(m) −→ev s(m′)) ⇒ (del(m) −→ev del(m′))) ∧ ((s(m′) −→ev s(m)) ⇒ (del(m′) −→ev del(m))).
This section presents an algorithm which ensures that the messages sent by pi to pj
are delivered according to the constraints defined by their types.
Such an algorithm is described in Fig. 12.26. The sender pi manages a local variable sni (initialized to 0), which it uses to associate a sequence number with each message. The receiver process pj manages two local variables: last_snj (initialized to 0) contains the sequence number of the last message of pi that it has delivered; pendingj is a set (initially empty) which contains the messages (with their sequence numbers) received and not yet delivered by pj. The sequence numbers allow pj to deliver the messages in their sending order.
The Delivery Condition Let (m, sn, type, barrier) be an algorithm message re-
ceived by pj . The delivery condition DC(m) associated with m depends on the type
of m.
• If type(m) is ct_past or marker, m has to be delivered after all the messages
sent before it, which means that we need to have {1, . . . , barrier} ⊆ deliv_snj .
• If type(m) is ct_future or ordinary, m can be delivered before messages
sent before it. Hence, the only constraint is that its delivery must not violate the
requirements imposed by the type of other messages. But these requirements are
captured by the message parameter barrier, which states that the delivery of the
message whose sequence number is barrier has to occur before the delivery of
m. Hence, for these message types, the delivery condition is barrier ∈ deliv_snj .
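The two delivery conditions above can be summarized by the following sketch (illustrative names; the convention that barrier = 0 means "no constraint" is an assumption of the sketch).

def delivery_condition(msg_type, barrier, deliv_sn):
    """Receiver-side test for a message (m, sn, msg_type, barrier); deliv_sn is the
    set of sequence numbers already delivered (sketch of the rule described above)."""
    if msg_type in ("ct_past", "marker"):
        # m must be delivered after all the messages sent before it
        return set(range(1, barrier + 1)) <= deliv_sn
    else:  # "ct_future" or "ordinary"
        # only the message whose sequence number is barrier must precede m
        return barrier == 0 or barrier in deliv_sn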
An example is depicted in Fig. 12.28. The sequence numbers of m1 , m2 ,
m3 , and m4 , are 15, 16, 17, and 18, respectively. Before the sending of m1 ,
no_bypassi = 10. These messages can be correctly delivered in the order m2 ,
m4 , m3 , m1 .
12.6 Summary
The aim of this chapter was to present two communication abstractions, namely,
the causal message delivery abstraction and the total order broadcast abstraction.
These abstractions have been defined and algorithms implementing them have been
described. Variants suited to bounded lifetime messages and synchronous systems
have also been presented. Finally, as an exercise, the chapter has investigated order-
ing properties which can be imposed on messages sent on a channel.
• The notion of a causal barrier and the associated causal broadcast algorithm of
Sect. 12.3.2 are due to R. Baldoni, R. Prakash, M. Raynal, and M. Singhal [39].
This algorithm was extended to mobile environments in [298].
• The notion of bounded lifetime messages and the associated causal order algo-
rithm presented in Sect. 12.3.3 are due to R. Baldoni, A. Mostéfaoui, and M. Ray-
nal [38].
• The notion of total order broadcast was introduced in many systems (e.g., [84]
for an early reference).
• The client/server algorithm presented in Sect. 12.4.3 is due to D. Skeen [353].
• The total order algorithm for synchronous systems presented in Sect. 12.4.4 is
a simplified version of a fault-tolerant algorithm due to F. Cristian, H. Aghili,
R. Strong, and D. Dolev [102].
• State machine replication was introduced by L. Lamport in [226]. A general pre-
sentation of state machine replication can be found in [339].
• The notions of the message types ordinary, marker, controlling the past, and con-
trolling the future, are due to M. Ahuja [12]. The genuine algorithm presented in
Sect. 12.5.2 is due to M. Ahuja and M. Raynal [13].
• A characterization of message ordering specifications and algorithms can be
found in [274]. The interconnection of systems with different message delivery
guarantees is addressed in [15].
• While this book is devoted to algorithms in reliable message-passing systems,
the reader will find algorithms that implement causal message broadcast in asyn-
chronous message-passing systems in [316], and algorithms that implement total
order broadcast in these systems in [24, 67, 242, 316].
1. Prove that the empty interval predicate stated in Sect. 12.1.2 is a characterization
of causal message delivery.
2. Prove that the causal order broadcast algorithm described in Fig. 12.10 is correct.
Solution in [39].
3. When considering the notion of a causal barrier introduced in Sect. 12.3.2, show
that any two messages that belong simultaneously to causal_barrieri are inde-
pendent (i.e., the corresponding invocations of co_broadcast() are not causally
related).
4. Modify the physical time-free algorithm of Fig. 12.4 so that it implements causal
message delivery in a system where the application messages have a bounded
lifetime Δ. Prove then that the resulting algorithm is correct.
Solution in [38].
5. Let us consider an asynchronous system where the processes are structured
into (possibly overlapping) groups. As an example, when considering five pro-
cesses, a possible structuring into four groups is G1 = {p1 , p2 , p4 , p5 }, G2 =
{p2 , p3 , p4 }, G3 = {p3 , p4 }, and G4 = {p1 , p5 }.
13.1.1 Definition
Sense of Message Transfer As shown by the execution on the right of Fig. 13.1,
it is easy to see that, when considering synchronous communication, the sense of
direction for message transfer is irrelevant.
The fact that any of the messages m1 , m2 , or m3 would be transmitted in the other
direction, does not change the fact that message transfers remain points when con-
sidering synchronous communication. We can even assume such a point abstracts
the fact that two messages are simultaneously exchanged, one in each direction.
Hence, the notions of sender and receiver are not central in synchronous communi-
cation.
Let us consider a simple application in which two processes p1 and p2 share two
objects, each implemented by a server process. These objects are two FIFO queues
Q1 and Q2 , implemented by the processes q1 and q2 , respectively.
The operations on a queue object Q are denoted Q.enqueue(x) (where x is the
item to enqueue), and Q.dequeue(), which returns the item at the head of the queue
or ⊥ if the queue is empty. The invocation of Qj .enqueue(x) and Qj .dequeue()
by a process are translated as described in Table 13.1. (Due to the fact that commu-
nications are synchronous, the value returned by Qj .dequeue() can be returned by
the corresponding synchronous invocation.)
Let us consider the following sequences of invocations by p1 and p2 :
Table 13.1:
Qj.enqueue(x)    synchr_send(enqueue, x) to qj
Qj.dequeue()     synchr_send(dequeue) to qj
Let us consider two messages m1 and m2, sent to the same process, which are such that the event sy_s(m1) causally precedes the event sy_s(m2). We have the following.
• D(sy_s(m1)) = D(sy_del(m1)) (from communication synchrony).
• D(sy_s(m2)) = D(sy_del(m2)) (from communication synchrony).
• D(sy_s(m1)) < D(sy_s(m2)) (due to sy_s(m1) −→ev sy_s(m2)).
• D(sy_del(m1)) < D(sy_del(m2)) follows from the previous items. Consequently, it follows from the fact that the function D() is strictly increasing inside a process, and any two events of a process are ordered, that we necessarily have sy_del(m1) −→ev sy_del(m2).
To show that synchronous communication is strictly stronger than causal order, let
us observe that the message pattern described in the left part of Fig. 13.1 satisfies
the causal message delivery order, while it does not satisfy the synchronous com-
munication property.
The Crown Structure As we are about to see, the notion of a crown allows for a
simple characterization of synchronous communication. A crown (of size k ≥ 2) is
a sequence of messages m1 , m2 , . . . , mk , such that we have
sy_s(m1) −→ev sy_del(m2),
sy_s(m2) −→ev sy_del(m3),
. . . ,
sy_s(mk−1) −→ev sy_del(mk),
sy_s(mk) −→ev sy_del(m1).
Examples of crowns are depicted in Fig. 13.4. On the left side, the event sy_s(m1 )
causally precedes the event sy_del(m2), and the event sy_s(m2) causally precedes the event sy_del(m1), i.e., these two messages form a crown of size 2.
Proof Let us first show that if the communication pattern satisfies the synchronous
communication property, there is no crown. To that end, let us assume (by contra-
diction) that the communications are synchronous and there is a crown. Hence:
• There is a sequence of k ≥ 2 messages m1, m2, . . . , mk such that sy_s(m1) −→ev sy_del(m2), . . . , sy_s(mk) −→ev sy_del(m1). It follows that ∀x ∈ {1, . . . , k − 1}, we have D(sy_s(mx)) < D(sy_del(mx+1)), and D(sy_s(mk)) < D(sy_del(m1)).
• As the communications are synchronous, we have ∀x ∈ {1, . . . , k}: D(sy_s(mx ))
= D(sy_del(mx )).
Combining the two previous items, we obtain
D(sy_s(m1)) < D(sy_del(m2)) = D(sy_s(m2)),
D(sy_s(m2)) < D(sy_del(m3)) = D(sy_s(m3)), etc., until
D(sy_del(mk)) = D(sy_s(mk)) < D(sy_del(m1)) = D(sy_s(m1)),
i.e., D(sy_s(m1 )) < D(sy_s(m1 )), which is a contradiction.
To show that the communication pattern is synchronous if there is no crown, let us consider the following directed graph G. Its vertices are the application messages exchanged by the processes, and there is an edge from a message m to a message m′ if one of the following patterns occurs (Fig. 13.5).
• sy_s(m) −→ev sy_s(m′), or
• sy_s(m) −→ev sy_del(m′), or
• sy_del(m) −→ev sy_s(m′), or
• sy_del(m) −→ev sy_del(m′).
It is easy to see that each case implies sy_s(m) −→ev sy_del(m′), which means that a directed edge (m, m′) belongs to G if and only if sy_s(m) −→ev sy_del(m′). As
there is no crown, it follows that G is acyclic. G can consequently be topologically
sorted, and such a topological sort defines a dating function D() trivially satisfying
the properties defining synchronous communication.
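The constructive argument of this proof can be turned into a small checker: build the graph G and topologically sort it; a valid dating function exists exactly when there is no cycle (i.e., no crown). The function below is a sketch under the assumption that the caller supplies the causal precedence test.

def synchrony_dates(messages, precedes):
    """messages: list of message ids; precedes(a, b): True if sy_s(a) causally
    precedes sy_del(b).  Returns a dating function D (message -> int) if the
    graph G is acyclic (no crown), or None if a crown exists.  Sketch only."""
    succ = {m: [m2 for m2 in messages if m2 != m and precedes(m, m2)] for m in messages}
    indeg = {m: 0 for m in messages}
    for m in messages:
        for m2 in succ[m]:
            indeg[m2] += 1
    ready = [m for m in messages if indeg[m] == 0]
    date, D = 0, {}
    while ready:
        m = ready.pop()
        date += 1
        D[m] = date          # D(sy_s(m)) = D(sy_del(m)) = date
        for m2 in succ[m]:
            indeg[m2] -= 1
            if indeg[m2] == 0:
                ready.append(m2)
    return D if len(D) == len(messages) else None   # None: a crown (cycle) exists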
which means that the process pi executes first the set of statements defined by
statements_1, and then invokes the synchronous operation synchr_send(m) to px .
It remains blocked until px invokes the matching synchronous operation synchr_
deliver(v) from pi . When this occurs the value of m is copied into the local variable
v of px , and then both pi and px continue their sequential execution. As far as pi is
concerned, it executes the set of statements defined by statements_2.
there is a physical time at which both processes are executing their synchronous
communication operations.
operations synchr_send() and synchr_del(). More precisely, any two matching in-
vocations are allowed to appear both in a deterministic context, both in a nondeterministic context, or (as in the previous section) one in a deterministic context and the other one in a nondeterministic context. Hence, this section focuses only on the
control needed to realize an interaction and not on the fact that a single message, or
a message in each direction, is transmitted during the rendezvous. It is even possi-
ble that no message at all is transmitted. In this case, the interaction boils down to
a pure rendezvous synchronization mechanism. The corresponding algorithm is due
to R. Bagrodia (1989).
This section uses the term interaction, which has to be considered as a synonym
of rendezvous or synchronous communication.
Solving this issue requires us to break the symmetry among the three messages
TOKEN (). As seen in Sect. 11.2.2 when discussing resource allocation, one way to
do that consists in defining a total order on the request messages TOKEN (). Such
a total order can be obtained from the identity of the processes. As an example,
assuming j < i < k, let us consider the process pi when it receives the message
TOKEN ({i, k}, k). Its behavior is governed by the following rule:
• tokensi is a set containing the tokens currently owned by pi (let us recall that the
token TOKEN ({i, j }), allows its current owner—pi or pj —to send a request for
an interaction with the other process).
Initially, each interaction token is placed at one of the processes associated
with it.
• delayedi is a variable which contains the token for which pi has delayed sending
an answer; delayedi = ⊥ if no answer is delayed.
as candidates for a synchronous communication (line 1). Then, if it has the token
for one of these interactions (line 3), it selects one of them (line 4), and sends the
associated token to the corresponding process pj (line 5). It becomes then engaged
(line 5), and sets delayedi to ⊥ (line 7). If the predicate of line 3 is false, pi remains
in the local state interested.
The core of the algorithm is the reception of a message TOKEN ({i, j }, x) that
a process pi receives from a process pj . As x is the identity of the process that
initiated the interaction, we necessarily have x = i or x = j . When it receives such
a message, the behavior of pi depends on its state.
• If it is not interested in interactions or, while interested, j ∉ interactionsi, pi
sends by return the message NO () to pj and saves TOKEN ({i, j }) in tokensi
(lines 9–10). In that way, pi will have the initiative for the next interaction involv-
ing pi and pj , and during that time pj will no longer try to establish a rendezvous
with it.
• If j ∈ interactionsi , and pi has no pending request (i.e., statei = interested), it
commits the interaction with pj by sending back the message TOKEN ({i, j }, j )
(lines 11–12).
• Otherwise, we have j ∈ interactionsi and statei = engaged. Hence, pi has sent
a message TOKEN ({i, k}, i) for an interaction with a process pk and has not yet
received an answer. According to the previous discussion on deadlock prevention,
there are three cases.
– If x > i and pi has not yet delayed an answer, it delays the answer to pj
(lines 14–15).
– If x < i, or x > i and pi has already delayed an answer, it sends the answer
NO () to pj . In that way pj will be able to try other interactions (lines 16–17).
Let us remark that, if delayedi ≠ ⊥, pi will commit an interaction: either
the interaction with the process pk from which pi is waiting for an answer to
the message TOKEN({i, k}, i) that pi previously sent to pk, or with the process
pℓ waiting for its answer (pℓ is such that delayedi = TOKEN({ℓ, i}, ℓ)).
– If x = i, by returning the message TOKEN ({i, j }, i) to pi , its sender pj com-
mits the interaction. In that case, before moving to state out (line 24), pi stores
TOKEN({i, j}) in tokensi (hence, pi will have the initiative for the next interac-
tion with pj ), and sends the answer NO () to the delayed process pk , if any
(lines 18–24).
Finally, when pi receives the answer NO () from a process pj (which means that
it previously became engaged by sending the request TOKEN ({i, j }, i) to pj ), the
behavior of pi depends on the value of delayedi. If delayedi ≠ ⊥, pi has delayed
its answer to the process pk such that delayedi = TOKEN({k, i}, k). In that case, it
commits the interaction with pk (lines 28–29). Otherwise, it moves to the local state
interested and tries to establish an interaction with another process in interactionsi
for which it has the token (lines 2–8).
Properties Due to the round-trip of a message TOKEN ({i, j }, i), it is easy to see
that a process can simultaneously participate in at most one interaction at a time.
350 13 Rendezvous (Synchronous) Communication
Each request (sending of a message TOKEN ({i, j }, i)) gives rise to exactly one
answer (the same message echoed by the receiver or a message NO()). Moreover,
due to the total order on process identities, no process can indefinitely delay another
one. Let us also notice that, if a process pi is such that k ∈ interactionsi and TO -
KEN ({i, k}) ∈ tokensi , pi will send the request TOKEN ({i, k}, k) to pk (if it does not
commit another interaction before).
Let us consider the following directed graph. Its vertices are the processes, and
there is an edge from pi to pj if delayedj = TOKEN({j, i}, i) (i.e., pj is delaying
the sending of an answer to pi ). This graph, whose structure evolves dynamically,
is always acyclic. This follows from the fact that an edge can go from a process
pi to a process pj only if i > j . As the processes that are sink nodes of the graph
eventually send an answer to the processes they delay, it follows (by induction) that
no edge can last forever.
The liveness property follows from the previous observations, namely, if (a) two
processes pi and pj are such that i ∈ interactionsj and j ∈ interactionsi , and
(b) none of them commits another interaction, then pi and pj will commit their
common interaction.
To prevent crown patterns from forming (at the level of application messages), pro-
cesses have sometimes to deliver a message instead of sending one. In the context of
the previous section, this is planned at the programming level and, to that end, pro-
cesses are allowed to use explicitly a nondeterministic communication construct. As
we have seen, this construct allows the communication operation which will prevent
crown formation to be selected at run time.
This section considers a different approach in which there is no nondeterministic
choice planned by the programmer, and all the invocations of synchr_send() appear
in a deterministic context. To prevent crowns from forming, a process can be forced
at any time to deliver a message or to delay the sending of a message, hence the
name forced interaction.
operation synchr_send(m) to pj is
(1) if (i > j ) then wait (¬ engagedi );
(2) send MSG(m) to pj; engagedi ← true
(3) else bufferi [j ] ← m; send REQUEST () to pj
(4) end if.
The Algorithm Each process pi manages a local Boolean variable engagedi , ini-
tially equal to false. This variable is set to true, when pi is managing a synchronous
communication with a process pj such that i > j . The aim of the variables engagedi
is to prevent the processes from being involved in a cycle that would prevent live-
ness.
To present the algorithm we consider two cases according to the values of i and
j , where pi is the sender and pj the receiver.
• i > j . In that case the sender has the initiative of the interaction. It can send the
message m only if it is not currently engaged with another process. It sends then
MSG (m) to pj and becomes engaged (lines 1–2).
When pj receives MSG (m) from pi , it waits until it is no longer engaged in an-
other rendezvous (line 6). Then it synchr_delivers the message m, and sends the
message ACK () to pi (line 7). When pi receives this message, it learns that the
synchronous communication has terminated and consequently resets engagedi to
false (line 9). The corresponding pattern of messages exchanged at the implemen-
tation level is described in Fig. 13.12.
• i < j . In this case, pi has to ask pj to manage the interaction. To that end, it
sends the message REQUEST () to pj (line 3). Let us notice that pi is not yet en-
gaged in a synchronous communication. When pj receives this message, it waits
until it is no longer engaged in another interaction (line 10). When this occurs,
pj sends to pi a message PROCEED () and becomes engaged with pi (line 11).
When it receives this message, pi sends the application message (which has been
previously saved in bufferi[j]) and locally terminates the interaction (line 12).
Fig. 13.12 Forced interaction: message pattern when i > j
Finally, the reception of this message MSG(m) by pj entails the synchr_delivery of
m and the end of the interaction on pj ’s side (line 5). The corresponding pattern
of messages exchanged at the implementation level is described in Fig. 13.13. As
in the previous figure, this figure indicates the logical time that can be associated
with the interaction (from an application level point of view).
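The two message patterns can be sketched as follows in Python (an illustration only: blocking waits are modeled by assertions in this single-threaded sketch, and the class name and the send() callback are assumptions).

class ForcedInteraction:
    def __init__(self, i, send):
        self.i, self.send = i, send          # send(j, tag, payload): assumed transport primitive
        self.engaged = False
        self.buffer = {}                     # buffer[j]: message waiting for PROCEED()

    def synchr_send(self, m, j):
        if self.i > j:                       # pi has the initiative (lines 1-2)
            assert not self.engaged          # stands for "wait (not engaged)"
            self.send(j, "MSG", m)
            self.engaged = True
        else:                                # pj will manage the interaction (line 3)
            self.buffer[j] = m
            self.send(j, "REQUEST", None)

    def on_msg(self, m, j):
        if j > self.i:                       # pattern of Fig. 13.12 (sender has the initiative)
            assert not self.engaged          # stands for "wait (not engaged)" (line 6)
            print(f"p{self.i} synchr_delivers {m}")
            self.send(j, "ACK", None)        # line 7
        else:                                # pattern of Fig. 13.13 (receiver manages it)
            print(f"p{self.i} synchr_delivers {m}")
            self.engaged = False             # end of the interaction on pj's side (line 5)

    def on_request(self, j):                 # lines 10-11
        assert not self.engaged              # stands for "wait (not engaged)"
        self.engaged = True
        self.send(j, "PROCEED", None)

    def on_proceed(self, j):                 # line 12
        self.send(j, "MSG", self.buffer.pop(j))

    def on_ack(self, j):                     # line 9
        self.engaged = False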
This section shows that, when controlled by the previous algorithm, the pattern of
application messages satisfies the synchrony property, and all application messages
are synchr_delivered.
Proof To show that the algorithm ensures that the application messages synchr_sent
and synchr_delivered satisfy the synchrony property, we show that there is no crown
at the application level. It follows then from Theorem 18 that the synchrony property
is satisfied.
The proof is made up of two parts. We first show that, if the computation has a
crown of size k > 2, it has also a crown of size 2 (Part I). Then, we show that there
is no crown of size 2 (Part II). As in Sect. 13.1, let sy_s(m) and sy_del(m) be the
events associated with the synchronous send and the synchronous delivery of the
application message m, respectively.
The rest of the proof is by induction. Let us assume that each process
p1, p2, . . . , pi, . . . , pk, eventually moves to the state ¬ engagedi and
answers the messages it receives. Let us observe that pk+1 can become engaged
only when it sends a message MSG () to a process with a smaller identity (line 2), or
when it sends a message PROCEED () to a process with a smaller identity (line 11).
As these processes will answer the message from pk+1 (induction assumption), it
follows that pk+1 is eventually such that ¬ engagedk+1 , and will consequently an-
swer the message it receives. Hence, no process will remain blocked forever, and all
the invocations of synchr_send() terminate.
The fact that the corresponding messages are synchr_delivered follows trivially
from line 5 and line 7.
Theorem 19 The algorithm of Fig. 13.11 ensures that the message communication
pattern satisfies the synchrony property, no process blocks forever, and each message
that is synchr_sent is synchr_delivered.
If there is a common physical clock that can be read by all processes, the local
clocks can be replaced by this common clock, and then θ = 0.
Let us consider the base case of two processes p and q such that p invokes
timed_send(m, deadlinep ) and q invokes timed_receive(x, deadlineq ). Moreover,
let τp be the time, measured at its local clock, at which p starts its invocation of
timed_send(m, deadlinep ). Similarly, let τq be the time, measured at its local clock,
at which q starts its invocation of timed_receive(x, deadlineq ).
Theorem 20 The algorithm described in Fig. 13.15 ensures that both processes
return the same value. Moreover, this value is commit if and only if the predicate
(τp + δ + θ ≤ deadlineq ) ∧ (τq + δ + θ ≤ deadlinep ) is satisfied, and then the mes-
sage m is delivered by q.
Proof Let us observe that if each process receives a control message from the other
process, they necessarily return the same result (commit or timeout) because
they compute the same predicate on the same values. Moreover, if no process re-
ceives a message by its deadline, it follows from the temporal scope construct that
they both return timeout.
If a process receives a control message by its deadline, while the other does not,
we have the following, where, without loss of generality, p is the process that does
not receive a message by its deadline. Hence p returns timeout. As it has not
received the message READY (τq , deadlineq ) by its deadline, we have τq + δ + θ >
deadlinep . It follows that, when evaluated by q, the predicate will be false, and q
will return timeout (line 17).
Finally, it follows from lines 15–16 that q delivers the application message m if and only
if the predicate is satisfied.
A Simple Improvement If the message READY () (resp., MSG ()) has already ar-
rived when the sender (resp., receiver) invokes its operation, this process can imme-
diately evaluate the predicate and return timeout if the predicate is false. In that
case, it may also save the sending of the message MSG () (resp., READY ()). If the
predicate is true, it executes its algorithm as described in Fig. 13.15.
It follows that, whatever its fate (commit or timeout), a real-time rendezvous
between two processes requires one or two implementation messages.
In some applications, a sender process may want to propose a rendezvous with dead-
line to any receiver process of a predefined set of processes, or a receiver may want
to propose a rendezvous to any sender of a predefined set of processes. The first
rendezvous that is possible is then committed. We consider here the second case
(multiple senders and one receiver). Each of the senders and the receiver define
their own deadline. Let p1 , p2 , . . . , pn denote the senders and q denote the receiver.
As we can see, differently from the two-process case where the algorithms exe-
cuted by the sender and the receiver are symmetric, the principle is now based on
an asymmetric relation: The sender (which is in a deterministic context) acts as a
client, while the receiver (which is in a nondeterministic context) acts as a server.
Despite this asymmetry, the aim is the same as before: allow the processes to take a
consistent decision concerning their rendezvous.
Remark This asymmetric algorithm can be used when there is a single sender.
We then obtain an asymmetric algorithm for a rendezvous between two processes
(see Problem 5).
The Predicate As each process is now simultaneously a sender and a receiver, the
algorithm is a symmetric algorithm, in the sense that all the processes execute the
same code. Actually, this algorithm is a simple extension of the symmetric algorithm
for two processes, described in Fig. 13.15.
Hence, when a process pi invokes multi_rdv(mi , deadlinei ), it sends the message
MSG (mi , τi , deadlinei ) to each other process (where, as before, τi is the sending
date of this message). Let us consider a process pi that has received such a message
from each other process. As the rendezvous is global, the two-process predicate
(τp +δ +θ ≤ deadlineq )∧(τq +δ +θ ≤ deadlinep ) has to be replaced by a predicate
involving all pairs of processes. We consequently obtain the generalized predicate
∀(k, ℓ): τk + δ + θ ≤ deadlineℓ.
Instead of considering all pairs (τk, deadlineℓ), this predicate can be refined as fol-
lows:
• The n starting dates τ1 , . . . , τn , are replaced by a single one, namely the worst one
from the predicate point of view, i.e., the latest sending time τ = max({τx }1≤x≤n ).
• Similarly, the n deadlines can be replaced by the worst one from the pred-
icate point of view, i.e., the earliest deadline deadline = min({deadlinex }1≤x≤n ).
The resulting predicate is consequently:
τ + δ + θ ≤ deadline.
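The resulting test is easy to state in code (a sketch; DELTA and THETA stand for the bounds δ and θ, which are assumed to be known).

DELTA, THETA = 0.05, 0.01   # assumed bounds: message transfer delay (δ) and clock drift (θ)

def multi_rdv_commit(offers):
    """offers: list of (tau_k, deadline_k) pairs, one per process.
    Returns True if the n-process rendezvous can be committed (sketch)."""
    tau = max(t for (t, _) in offers)            # latest sending time
    deadline = min(d for (_, d) in offers)       # earliest deadline
    return tau + DELTA + THETA <= deadline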
13.5 Summary
This chapter was on synchronous communication. This type of communication syn-
chronizes the sender and the receiver, and is consequently also called rendezvous
or interaction. The chapter has first given a precise meaning to what is called syn-
chronous communication. It has also presented a characterization based on a specific
message pattern called a crown.
The chapter then presented several algorithms implementing synchronous com-
munication, each appropriate to a specific context: planned interactions or forced
interaction in fully asynchronous systems, and rendezvous with deadline suited to
synchronous systems.
Fig. 13.19 Comparing two date patterns for rendezvous with deadline
To show this, in addition to τ and deadline, let us define two new values as follows: τ′ is the second greatest sending time of a message MSG(m, τi, deadlinei) sent at line 2, and deadline′ is the second smallest deadline value. These values can be computed by a process at lines 7 and 8. Moreover, let same_proc(τ, deadline) be a predicate whose value is true if and only if the values τ and deadline come from the same process. Show that the predicate (τ + δ + θ ≤ deadline) used at line 9 can be replaced by the following weaker predicate:
(same_proc(τ, deadline) ∧ (τ′ + δ + θ ≤ deadline) ∧ (τ + δ + θ ≤ deadline′))
∨ (¬ same_proc(τ, deadline) ∧ (τ + δ + θ ≤ deadline)).
Part V
Detection of Properties on Distributed Executions
The two previous parts of the book were on the enrichment of the system to pro-
vide processes with high-level operations. Part III was on the definition and the
implementation of operations suited to the consistent use of shared resources, while
Part IV introduced communication abstractions with specific ordering properties.
In both cases, the aim is to allow application programmers to concentrate on their
problems and not on the way some operations have to be implemented.
This part of the book is devoted to the observation of distributed computations.
Solving an observation problem consists in superimposing a distributed algorithm
on a computation, which records appropriate information on this computation in or-
der to be able to detect if it satisfies some property. The specificity of the information
is, of course, related to the property one is interested in detecting.
Two detection problems are investigated in this part of the book: the detection of
the termination of a distributed execution (Chap. 14), and the detection of deadlocks
(Chap. 15). Both properties “the computation has terminated” and “there is deadlock” are stable properties, i.e., once satisfied they remain satisfied in any future
state of the computation.
Process States Each process pi has a local variable denoted statei , the value of
which is active or passive. Initially, some processes are in the active state, while the
others are in the passive state. Moreover, at least one process is assumed to be in
the active state. As far as the behavior of a process pi is concerned, we have the
following (Fig. 14.1):
• When statei = active:
– pi can execute local computations and send messages to the other processes.
– pi can invoke a message reception. In that case, the value of statei automati-
cally changes from active to passive, and pi starts waiting for a message.
• When pi receives a message (we then have statei = passive), the value of statei automatically changes from passive to active.
It is assumed that there is an invocation of a message reception for every message
that is sent. (This assumption will be removed in Sect. 14.5.)
Let stateτi and emptyτ(i,j) denote the values of statei and empty(i,j) at time τ, respectively. C being a
distributed execution, let us define the predicate TERM(C, τ ) as follows:
TERM(C, τ ) ≡ ∀i: stateτi = passive ∧ ∀i, j : emptyτ(i,j ) .
This predicate captures the fact that, at time τ , C has terminated. Finally, the predi-
cate that captures the termination of C, denoted TERM(C), is defined as follows:
TERM(C) ≡ ∃τ : TERM(C, τ ) .
As already seen, a stable property is a property that, once true, remains true forever.
• A distributed algorithm (with a module per process) which allows the local ob-
servers to cooperate in order to detect the termination.
The termination detection algorithms differ on two points: (a) the assumptions
they make on the behavior of the computation, and (b) the way they cooperate.
As far as notations are concerned, the local observer of a process pi is denoted
obsi .
The atomic model is a simplified model in which only messages take time to travel
from their senders to their destination processes. Message processing is done atom-
ically in the sense that it appears as taking no time. Hence, whenever it can be observed, a process is passive.
A time-diagram of an execution in this model is represented in Fig. 14.3. Message processing is denoted by black dots. Initially, p2 sends a message m1 to p1, while p1
and p3 are waiting for messages. When it receives m1 , p1 processes it and sends
two messages, m2 to p3 and m3 to p2 , etc. The processing of the messages m4 ,
m7 , and m8 entails no message sending, and consequently (from a global observer
point of view), the distributed computation has terminated after the reception and
the processing of m8 by p2 .
An Inquiry-Based Principle The idea that underlies this algorithm is fairly sim-
ple. On the observation side, each process pi is required to count, in reci, the number of messages it has received, and, in senti, the number of messages it has sent.
To simplify the presentation we consider that there is an additional control pro-
cess denoted observer. This process is in charge of the detection of the termination.
It sends an inquiry message to each process pi , which answers by returning its pair
of values (senti , reci ). When the observer has all the pairs, it computes the total
number of messages which have been sent S, and the total number of messages
which have been received R. If S ≠ R, it starts its next inquiry.
Unfortunately, as shown by the counterexample presented in Fig. 14.4, the ob-
server cannot conclude from S = R that the computation has terminated. This is
due to the asynchrony of communication. The inquiry messages REQUEST () are
not received at the same time by the application processes, and p1 sends back the
message ANSWER (0, 1), p2 sends back the message ANSWER (1, 0), and p3 sends
back the message ANSWER (0, 0). We then have S = 1 and R = 1, while the com-
putation is not yet terminated. Due to the asynchrony among the reception of the
messages REQUEST () at different processes, the final counting erroneously asso-
ciates the sending of m1 with the reception of m2 . Of course, it could be possible to
replace the counting mechanism by recording the identity of all the messages sent
and received and then compare sets instead of counters, but this would be too costly
and, as we are about to see, there is a simpler counter-based solution.
Proof Let us first prove the liveness property. If the computation terminates, there
is a time τ after which no message is sent. Hence, after τ , no message is sent and no
message is received. Let us consider the first two inquiries launched by the observer
after τ. It follows from the previous observation that S1 = R1 = S2 = R2, and the
algorithm detects termination.
Proving the safety property consists in showing that, if termination is claimed,
then the computation has terminated. To that end, let sentτi be the value of senti
at time τ. As counters are not decreasing, we trivially have (τ ≤ τ′) ⇒ (sentτi ≤ sentτ′i), and the same holds for the counters reci.
Let us consider Fig. 14.6, which abstracts two consecutive inquiries launched by
the observer. The first inquiry obtains the pair of values (S1, R1), and the second
inquiry obtains the pair (S2, R2). As the second inquiry is launched after the results
of the previous one have been computed, they are sequential. Hence, there is a time,
between the end of the first inquiry and the beginning of the second one at which the number of messages sent is S′, and the number of messages received is R′ (these values are known neither by the processes, nor by the observer). We have the following.
• S1 ≤ S′ ≤ S2 and R1 ≤ R′ ≤ R2 (from the previous observation on the nondecreasing property of the counters senti and reci).
• If S2 = R1, we obtain (from the previous item) S′ ≤ S2 = R1 ≤ R′.
• S′ ≥ R′ (due to the computation itself).
• It follows from the two previous items that (S2 = R1) ⇒ (S′ = R′).
Hence, if the predicate is satisfied, the computation was terminated before the sec-
ond inquiry was launched, which concludes the proof of the theorem.
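The observer's loop, based on the predicate S2 = R1 established by this proof, can be sketched as follows (query() is an assumed helper that performs one inquiry round and returns the collected totals).

def detect_termination(query):
    """query(): performs one inquiry round and returns (S, R), the total numbers of
    messages sent and received reported by the processes (sketch of the observer)."""
    S_prev, R_prev = query()                 # first inquiry
    while True:
        S, R = query()                       # next inquiry, launched after the previous one
        if S == R_prev:                      # safety predicate proved above: S2 = R1
            return True                      # the computation has terminated
        S_prev, R_prev = S, R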
Principle The principle that underlies this detection algorithm, which is due to
F. Mattern (1987), consists in using a token that navigates the network of pro-
cesses. This token carries a vector msg_count[1..n] such that, from its point of
view, msg_count[i] is the number of messages sent to pi minus the number of mes-
sages received by pi . When a process receives the token and the token is such that
msg_count = [0, . . . , 0], it claims termination.
After a process pi has received the token and updated its value, it checks the
predicate (msg_count = [0, . . . , 0]), provided that the token has visited each process at least once
(line 10). If the predicate is satisfied, pi claims termination (line 11). Otherwise, it
propagates the token to the next process (line 12).
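One plausible way to maintain the counting vector carried by the token is sketched below; this is not a transcription of the corresponding figure, and the bookkeeping of per-process counters between two token visits is an assumption of the sketch.

class CountingVectorSite:
    """One process of the counting-vector (token) termination detection (sketch)."""
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.sent_to = [0] * n      # messages sent by pi to each process since the last token visit
        self.received = 0           # messages received by pi since the last token visit

    def on_application_send(self, j):
        self.sent_to[j] += 1

    def on_application_receive(self):
        self.received += 1

    def on_token(self, msg_count, visits):
        # add what pi has sent, subtract what it has received, then test or forward
        for j in range(self.n):
            msg_count[j] += self.sent_to[j]
        msg_count[self.i] -= self.received
        self.sent_to = [0] * self.n
        self.received = 0
        visits += 1
        if visits >= self.n and all(c == 0 for c in msg_count):
            print("termination claimed")
            return None
        return (msg_count, visits)   # to be forwarded to the next process on the ring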
Termination Detection, Global States, and Cuts The notions of a cut and of a
global state were introduced in Sects. 6.1.3, and 6.3.1, respectively. Moreover, we
have seen that these notions are equivalent.
Actually, the value of the vector msg_count defines a global state. The global
state Σ2 associated with msg_count2 and the global state Σ4 associated with
msg_count4 are represented on the figure.
As we can see, the global states associated with vectors all of whose values are
not negative are consistent, while the ones associated with vectors in which some
values are negative are inconsistent. This comes from the fact that a negative value
in msg_count[i] means that pi is seen by the token as having received messages,
but their corresponding sendings have not yet been recorded by the token (as is
the case for msg_count3 and msg_count4, which do not record the sending of m′).
The positive entries of a vector with only non-negative values indicate how many
messages have been sent and are not yet received by the corresponding process (as
an example, in the consistent global state associated with msg_count5 , the message
m sent to p2 is in transit).
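This observation can be restated as a tiny check on a counting vector; the function below is only an illustration of the characterization, with invented names.

def classify_msg_count(msg_count):
    # a negative entry means some receptions were recorded before the matching sendings
    if any(c < 0 for c in msg_count):
        return "inconsistent global state"
    in_transit = {i + 1: c for i, c in enumerate(msg_count) if c > 0}
    if not in_transit:
        return "consistent global state, computation terminated"
    return "consistent global state, messages in transit: " + str(in_transit)

# e.g., classify_msg_count([0, 1, 0]) reports one message still in transit to p2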
This section and the following ones consider the base asynchronous model in which
a process can be active or passive as defined in Sect. 14.1.1.
Some applications are structured in such a way that initially a single process is
active. Without loss of generality, let p1 be this process (which is sometimes called
the environment).
Process p1 can send messages to other processes, which then become active. Let
pi be such a process. It can, in turn, send messages to other processes, etc. (hence the
name diffusing computation). After a process has executed some local computation
and possibly sent messages to other processes, it becomes passive. It can become
active again if it receives a new message. The aim is here to design a termination
detection algorithm suited to diffusing computations.
Principle: Use a Spanning Tree As processes become active when they receive
messages, the idea is to capture the activity of the application with a spanning
tree which evolves dynamically. A process that is not in the tree enters the tree
when it becomes active, i.e., when it receives a message. A process leaves the tree
when there is no more activity in the application that depends on it. This princi-
ple and the corresponding detection algorithm are due to E.W. Dijkstra and C.S.
Scholten (1980).
How to Implement the Previous Idea Initially, only p1 is in the tree. Then, we
have the following. Let us first consider a process pi that receives a message from
a process pj . As we will see, a process that sends a message necessarily belongs to
the tree. Moreover, due to the rules governing the behavior of a process, a process
is always passive when it receives a message (see Fig. 14.1).
• If pi is not in the tree, it becomes active and enters the tree. To that end, it defines
the sender pj of the message as its parent in the tree. Hence, each process pi
manages a local variable parenti . Initially, parenti = ⊥ at any process pi , except
for p1 , for which we have parent1 = 1.
• If pi is in the tree when it receives the message from pj , it becomes active and,
as it does not need to enter the tree, it sends by return to pj a message ACK () so
that pj does not consider it as one of its children.
Let us now introduce a predicate that allows a process to leave the tree. First, the
process has to be passive. But this is not sufficient. If a process pi sent messages,
these messages created activity, and it is possible that, while pi is passive, the ac-
tivated processes are still active (and may have activated other processes, etc.). To
solve this issue, each process pi manages a local variable, called deficiti , which
counts the number of messages sent by pi minus the number of acknowledgment
messages it has received.
• If deficiti > 0, not all the messages sent by pi have been acknowledged.
Hence, there is possibly some activity in the subtree rooted at pi . Consequently,
pi conservatively remains in the tree.
• If deficiti = 0, all the messages that pi has sent have been acknowledged, and
consequently no more activity depends on pi . In this case, it can leave the tree,
and does this by sending the message ACK () to its parent.
Hence, when considering a process pi which is in the tree (i.e., such that parenti ≠
⊥), the local predicate that allows it to leave the tree is
(statei = passive) ∧ (deficiti = 0).
Finally, the environment process p1 concludes that the diffusing computation has
terminated when its local predicate deficit1 = 0 becomes satisfied.
Fig. 14.9 Termination detection of a diffusing computation
when pi starts waiting for a message do
(1) statei ← passive;
(2) let k = parenti ;
(3) if (deficiti = 0)
(4) then send ACK () to pk ; parenti ← ⊥
(5) end if.
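Putting together the rules described in this section (of which the figure above shows only an excerpt), a possible Python rendering of the Dijkstra–Scholten scheme is sketched below; send(msg, dest) is an assumed network primitive and the class layout is illustrative.

class DSProcess:
    def __init__(self, i, send, is_environment=False):
        self.i, self.send = i, send                      # send(msg, dest): assumed primitive
        self.parent = i if is_environment else None      # None plays the role of ⊥
        self.deficit = 0                                 # messages sent, not yet acknowledged
        self.state = 'active' if is_environment else 'passive'
        self.terminated = False

    def app_send(self, j):                     # p_i sends an application message to p_j
        self.deficit += 1
        self.send(('APP', self.i), j)

    def app_receive(self, j):                  # p_i receives an application message from p_j
        self.state = 'active'
        if self.parent is None:
            self.parent = j                    # p_i enters the tree, p_j becomes its parent
        else:
            self.send(('ACK',), j)             # already in the tree: acknowledge at once

    def ack_receive(self):                     # p_i receives ACK()
        self.deficit -= 1
        self._try_leave_or_detect()

    def start_waiting(self):                   # p_i becomes passive
        self.state = 'passive'
        self._try_leave_or_detect()

    def _try_leave_or_detect(self):
        if self.state == 'passive' and self.deficit == 0:
            if self.parent == self.i:          # the environment process p_1
                self.terminated = True         # the diffusing computation has terminated
            elif self.parent is not None:
                self.send(('ACK',), self.parent)   # leave the tree
                self.parent = None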
also generic in the sense that it uses an abstraction called a wave, which can be im-
plemented in many different ways, each providing a specific instance of the general
algorithm. This reasoned construction is due to J.-M. Hélary and M. Raynal (1991).
(line 1) and, when a process pi is visited by the wave (i.e., receives a message
GO ()), it forwards the message GO () to each of its own children (lines 2–3).
Then, when a process pi invokes return_wave(b), where b is its contribution to
the final result, it waits for the contributions from the processes belonging to the
subtree for which it is the root (lines 5–9). When it has received all of these contri-
butions, it sends back to its parent the whole contribution of this subtree (line 10).
Finally, when pα (which is the root of the spanning tree) has received the contribu-
tion of all its children, it deposits the Boolean result in its local variable res.
Fig. 14.12 Why (∧1≤i≤n idlei^x) ⇒ TERM(C, τα^x) is not true
This predicate is safe in the sense that idlei ⇒ (statei = passive) ∧ (∧j≠i empty(i, j)).
It consequently allows each process pi to know the state of the channels from it to
the other processes. Let us nevertheless observe that this predicate is unstable. This
is because, if pi receives a message while idlei is satisfied, statei becomes equal to
active, and idlei consequently becomes false.
This follows from the following observation. It follows from idlei^x ∧ ct_passi^x that
pi remained passive during the time interval [τi^x, τα^x], and consequently its outgoing
channels remained empty in this time interval. As this is true for all the processes,
we conclude that, at time τα^x, all the processes are passive and all the channels are
empty.
Unfortunately, as a process pi does not know the time τα^x, it cannot evaluate the
predicate ct_passi^x. As previously, a simple way to solve this problem consists in
strengthening the predicate ct_passi^x in order to obtain a predicate locally evaluable
by pi . As the waves are issued sequentially by pα , such a strengthening is easy. As
τα^x < τi^x+1, τα^x (which is not known by pi ) is replaced by τi^x+1 (which is known by
pi ). Hence, let us replace ct_passi^x by
Fig. 14.13 A general algorithm for termination detection
when pi starts waiting for a message do
(1) statei ← passive.
when pi executes send(m) to pj do
(2) deficiti ← deficiti + 1.
AND Model In this reception pattern, a receive statement has the following struc-
ture:
We have then dep_seti = {a, . . . , x}. When invoked by pi , this statement terminates
when a message from each process pj , such that j ∈ dep_seti , has arrived at pi .
The reception statement then withdraws these messages from their input buffers and
returns them to pi . Hence, this message reception statement allows for the atomic
reception of several messages from distinct senders.
OR Model The receive statement of the OR pattern has the following structure:
As previously, the dependency set is dep_seti = {a, . . . , x}, but its meaning is dif-
ferent. This receive statement terminates when a message from a process pj , such
that j ∈ dep_seti , has arrived at pi . If messages from several processes in dep_seti
have arrived, a single one is withdrawn from its input buffer and received and con-
sumed by pi . Hence, the “or” is an exclusive “or”. This is consequently a simple
nondeterministic receive statement.
Basic k-out-of-m Model The receive statement of the k-out-of-m pattern has the
following structure:
receive ki messages from dep_seti ,
where dep_seti is a set of process identities. It is assumed that 1 ≤ ki ≤ |dep_seti |.
This statement terminates when messages from ki distinct processes belonging to
dep_seti have arrived at pi .
Let A be a set of process identities such that a message has arrived from each
of these processes and these messages have not yet been consumed (hence, they are
still in their input buffers). fulfilled(A) is satisfied if and only if these messages are
sufficient to reactivate the invoking process pi .
By definition fulfilled(∅) is equal to false. It is moreover assumed that the predi-
cate fulfilled() is monotonous, i.e.,
(A ⊆ A′) ⇒ (fulfilled(A) ⇒ fulfilled(A′)).
(Monotonicity states only that, if a process can be reactivated with the messages
from a set of processes A, it can also be reactivated with the messages from a bigger
set of processes A′, i.e., such that A ⊆ A′.)
The result of an invocation of fulfilled(A) depends on the type of receive state-
ment issued by the corresponding process pi . As a few examples, we have the fol-
lowing:
• In the AND model: fulfilled(A) ≡ (dep_seti ⊆ A).
• In the OR model: fulfilled(A) ≡ (A ∩ dep_seti = ∅).
• In the OR/AND model: fulfilled(A) ≡ (∃y: dpi^y ⊆ A).
• In the k-out-of-n model: fulfilled(A) ≡ (|A ∩ dep_seti | ≥ ki ).
• In the disjunctive k-out-of-n model: fulfilled(A) ≡ (∃y: |A ∩ dpi^y| ≥ ki^y).
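For illustration, these predicates can be written as small Python functions over sets of process identities (the function names are, of course, not part of the model):

def fulfilled_and(A, dep_set):                   # AND model: dep_set_i ⊆ A
    return dep_set <= A

def fulfilled_or(A, dep_set):                    # OR model: A ∩ dep_set_i ≠ ∅
    return bool(A & dep_set)

def fulfilled_or_and(A, dp_sets):                # OR/AND model: ∃y such that dp^y ⊆ A
    return any(dp <= A for dp in dp_sets)

def fulfilled_k_out_of_n(A, dep_set, k):         # k-out-of-n model: |A ∩ dep_set_i| ≥ k_i
    return len(A & dep_set) >= k

def fulfilled_disjunctive(A, pairs):             # disjunctive model: ∃y, |A ∩ dp^y| ≥ k^y
    return any(len(A & dp) >= k for dp, k in pairs)

# Example (OR model): fulfilled_or({3, 7}, {2, 7}) returns True.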
A process pi stops executing because it is blocked on a receive statement, or be-
cause it has attained its “end” statement. This case is modeled as a receive statement
in which dep_seti = ∅. As processes can invoke distinct receive statements at dif-
ferent times, the indexed predicate fulfilledi () is used to denote the corresponding
predicate as far as pi is concerned.
Channel Predicates and Process Sets Let us define the following predicates
and abstract variables that will be used to define two notions of termination of a
computation in the very general distributed model.
• empty(j, i) is a Boolean predicate (already introduced) which is true if and only
if the network component of the channel from pj to pi is empty.
• arrivedi (j ) is a Boolean predicate which is true if and only if the buffer compo-
nent of the channel from pj to pi is not empty.
• arr_fromi = {j | arrivedi (j )} (the set of processes from which messages have
arrived at pi but have not yet been received—i.e., consumed—by pi ).
• NEi = {j | ¬empty(j, i)} (the set of processes that sent messages to pi , and these
messages have not yet arrived at pi ).
When looking at Fig. 14.15, we have arr_fromi = {j, k}, and NEi = {j, ℓ}.
Local Variables and Locally Evaluable Predicate As the set NEi appears in
S_TERM(C, τ ), a value of it must consequently be computed or approximated. As a
local observer obsi cannot compute the state of its input channels without freezing
(momentarily blocking) sender processes, it instead computes the state of its output
channels. More precisely, as done in previous algorithms, a local observer can learn
the state of the network component of its output channels by associating an acknowl-
edgment with each application message. When a message from a process pi arrives
at a process pj , the local observer obsj sends by return an acknowledgment to the
local observer obsi associated with pi . As previously, let deficiti be the number of
messages sent by pi for which obsi has not yet received an acknowledgment. When
satisfied, the predicate deficiti = 0 allows the local observer obsi to know that all
the messages sent by pi have arrived at their destination processes. As we will see in
the proof of the algorithm, replacing the predicate NEi = ∅ by deficiti = 0 allows for a safe detection
of static termination.
To know the state of pi ’s input buffers, the local observer obsi can use the set
arr_fromi (which contains the identities of the processes pj such that there is a
message from pj in the input buffer of pi ). The content of such a set arr_fromi
can be locally computed by the local observer obsi . This is because this observer
(a) sends back an acknowledgment each time a message arrives, and (b) observes
the reception of messages by pi (i.e., when messages are withdrawn from their input
buffers to be consumed by pi ).
Finally, the predicate (which is not locally evaluable)
(statei = passive) ∧ (NEi = ∅) ∧ (¬fulfilledi (arr_fromi ))
is replaced by the locally evaluable predicate
(statei = passive) ∧ (deficiti = 0) ∧ (¬fulfilledi (arr_fromi )).
In addition to the previous local variables, the algorithm uses also the Boolean
variable cont_passivei , the role of which is exactly the same as in Fig. 14.13.
The cooperation among the local observers is described at lines 8–19. To simplify
the presentation, we consider that each wave is implemented by a star which is
(a) centered at a process obsα (which does not participate in the computation), and
(b) implemented by messages REQUEST () and ANSWER () (as done in Fig. 14.5).
When obsα launches a new wave it sends a message REQUEST () to each process
observer obsi (line 10). Then, it waits for a message ANSWER (bi ) from each of them
(line 11), and claims termination if the conjunction b1 ∧ · · · ∧ bn is true (lines 12–
14).
When a local observer obsi is visited by a new wave (line 15), it waits until its
local predicate (statei = passive) ∧ (deficiti = 0) ∧ (¬fulfilledi (arr_fromi )) becomes
true. When this occurs, it sends the current value of cont_passivei to obsα and resets
this variable to the value true (lines 17–18).
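The cooperation just described can be summarized by the following sketch of one star-shaped wave; broadcast_request, collect_answers, and the obs object with its wait_until, fulfilled, and send_answer helpers are assumptions of the illustration.

def observer_alpha(broadcast_request, collect_answers, n):
    # obs_alpha repeatedly launches waves until static termination is claimed
    while True:
        broadcast_request()                 # REQUEST() to every obs_i (line 10)
        answers = collect_answers(n)        # wait for ANSWER(b_i) from every obs_i (line 11)
        if all(answers):                    # lines 12-14
            return "static termination claimed"

def on_request(obs):
    # obs_i: wait until the locally evaluable predicate holds (line 16)
    obs.wait_until(lambda: obs.state == 'passive'
                           and obs.deficit == 0
                           and not obs.fulfilled(obs.arr_from))
    b = obs.cont_passive                    # current value of cont_passive_i
    obs.cont_passive = True                 # reset for the next observation period
    obs.send_answer(b)                      # ANSWER(b_i) sent to obs_alpha (lines 17-18)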
Fig. 14.17 Definition of time instants for the safety of static termination
Proof Proof of the liveness property. We have to show that, if the computation C is
statically terminated, the algorithm eventually claims it.
If the computation is statically terminated at time τ , we have S_TERM(C, τ )
(definition). As static termination is a stable property, it follows that all the pro-
cesses pi are continuously passive, their re-activation conditions are not satisfied
(i.e., ¬fulfilledi (arr_fromi ) at each pi ), and the network component of each chan-
nel is empty. Moreover, it follows from the acknowledgment mechanism that there
is a time τ′ ≥ τ such that, from τ′, the predicate deficiti = 0 is always satisfied at
each process pi .
Let the wave launched by obsα after τ′ be the xth wave. It follows from the
predicate at line 16 that neither the xth wave, nor any future wave, will be delayed
at this line. Moreover, while the value bi returned by obsi to the xth wave can be
true or false, the value bi returned to the (x + 1)th wave is necessarily the value true
(this is because the xth wave set the local variables cont_passivei to true, and none
of them has been modified thereafter).
It follows that (if not yet satisfied for the xth wave) the (x + 1)th wave will be
such that bi = true for each i. The observer obsα will consequently claim termina-
tion as soon as it has received all the answers for the (x + 1)th wave.
Proof of the safety property. We have to show that, if the algorithm claims termi-
nation at time τ′, there is a time τ ≤ τ′ such that S_TERM(C, τ ).
To that end, let τ be a time such that, for each i, we have τi^x < τα^x < τ < τi^x+1 <
τα^x+1, where τi^x is the time at which obsi sends its answer and locally terminates the
visit of the xth wave, and τα^x is the time at which obsα terminates its wait statement
of the xth wave (line 11). The time instants τi^x+1 and τα^x+1 are defined similarly for
the (x + 1)th wave. These time instants are depicted in Fig. 14.17.
The proof consists in showing that, if the algorithm claims termination at the end
of the (x + 1)th wave (i.e., just after time τα^x+1), then the computation was statically
terminated at time τ (i.e., S_TERM(C, τ ) is true). The proof is decomposed into
three parts.
• Proof that, for each i, statei^τ = passive.
It follows from the management of the local variables cont_passivei (lines 10
and 17) that, if obsα claims termination just after time τα^x+1, each process pi has
been continuously passive between τi^x and τi^x+1. As τi^x < τ < τi^x+1, we conclude
that we have statei^τ = passive.
• Proof that, for each i, NEi^τ = ∅.
Due to the algorithm, each wave is delayed at an observer obsi (at line 16,
before the send of an answer message) until all messages sent by pi have been
acknowledged (predicate deficiti = 0). On the other hand, no process pi sent mes-
sages between τi^x and τi^x+1 (because it was continuously passive during the time
interval [τi^x, τi^x+1], see the previous item). It follows from these two observations
that, at time τ , there is no message sent by pi and not yet arrived at its destination.
As this is true for all the processes, it follows that we have NEi^τ = ∅.
• Proof that, for each i, ¬fulfilledi (arr_fromi^τ).
As previously, a wave is delayed at an observer obsi until ¬fulfilledi (arr_fromi ).
Hence, as the algorithm claims termination just after τα^x+1, it follows
that, for any i, we have ¬fulfilledi (arr_fromi^x+1), where arr_fromi^x+1 denotes
the value of arr_fromi at time τi^x+1 (line 16).
As any process pi has been continuously passive in the time interval
[τi^x, τi^x+1], it consumed no message during that period. Consequently the set
arr_fromi can only increase during this period, i.e., we have arr_fromi^τ ⊆
arr_fromi^x+1. It follows from the monotonicity of the predicate fulfilledi () that
(¬fulfilledi (arr_fromi^x+1)) ⇒ (¬fulfilledi (arr_fromi^τ)). Hence, for any i we have
¬fulfilledi (arr_fromi^τ), which concludes the proof of the theorem.
Cost of the Algorithm Each application message entails the sending of an ac-
knowledgment. A wave requires two types of messages, n request messages which
carry no value, and n answers which carry a Boolean. Moreover, after the compu-
tation has terminated, one or two waves are necessary to detect termination. Hence,
4n control messages are necessary in the worst case. If, instead of a star, a ring is
used to implement waves, this number reduces to 2n.
• arri [1..n] is a second array, managed by obsi , such that arri [j ] contains the num-
ber of messages which, since the beginning of the execution, have been sent by
pj and have arrived at pi .
• ap_sentα [1..n, 1..n] is a matrix, managed by obsα such that ap_sentα [i, j ] con-
tains the number of messages that, from obsα ’s point of view, have been sent by pi
to pj . As obsα can learn new information only when it receives control messages
from the local observers obsi , ap_sentα [i, j ] represents an approximate knowl-
edge (hence the identifier prefix ap). The way this array is used is described in
Fig. 14.18.
• As a local observer obsi does not know the value of NEi , it cannot compute
the value of the predicate ¬fulfilledi (NEi ∪ arr_fromi ) (which characterizes dy-
namic termination). To cope with this issue, obsi computes the set ap_nei = {j |
ap_st[j ] > arri [j ]} (line 16), which is an approximation of NEi (maybe more
messages have been sent to pi than the ones whose sending is recorded in the
array ap_st[1..n] sent by obsα ).
The observer obsi can now compute the value of the local predicate ¬fulfilledi
(ap_nei ∪ arr_fromi ) (line 17). Then, obsi sends to obsα (line 19) a Boolean
bi which is true (line 17) if and only if (a) pi has been continuously passive
since the previous wave (as in static termination detection), and (b) the predicate
fulfilledi (ap_nei ∪ arr_fromi ) is false (i.e., pi cannot be reactivated with the mes-
sages that have arrived and are not yet consumed, and a subset of the messages
which are potentially in transit to it).
Moreover, after bi has been computed, obsi resets its local variable cont_
passivei to the Boolean value (statei = passive). This is due to the fact that, dif-
ferently from the previous termination detection algorithms, the waves are not delayed at the local observers.
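As an illustration of this last point, the per-wave computation performed by obsi in the dynamic case could look as follows; ap_st is the row of ap_sentα shipped with the request, and the idea that the answer also carries obsi's send counters (so that obsα can refresh its matrix) is an assumption of the sketch.

def on_dynamic_request(obs, ap_st):
    # ap_ne: approximation of NE_i, i.e., processes that (according to obs_alpha's
    # knowledge) sent more messages to p_i than have arrived at p_i (line 16);
    # obs.arr and ap_st are indexed 1..n
    ap_ne = {j for j in range(1, obs.n + 1) if ap_st[j] > obs.arr[j]}
    b = obs.cont_passive and not obs.fulfilled(ap_ne | obs.arr_from)   # line 17
    obs.cont_passive = (obs.state == 'passive')   # waves are not delayed: reset to the
                                                  # current state of p_i, not to true
    obs.send_answer(b, obs.sent)                  # answer to obs_alpha (line 19); shipping
                                                  # the send counters is an assumption here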
Cost of the Algorithm After dynamic termination has occurred, two waves are
necessary to detect it, in the worst case. No acknowledgment is used, but a wave
requires 2n messages, each carrying an array of unbounded integers (freezing the
observed computation would allow us to reset the counters). The algorithm does not
require the channels to be FIFO (neither for application, nor for control messages).
But as waves are sequential, the channels from obsα to each obsi , and the chan-
nels from every obsi to obsα , behave as FIFO channels for the control messages
REQUEST () and control messages ANSWER (), respectively. It follows that, instead
of transmitting the value of a counter, it is possible to transmit only the difference
between its current value and the sum of the differences already sent.
14.6 Summary
A distributed computation has terminated when all processes are passive and all
channels are empty. This defines a stable property (once terminated, a computa-
tion remains terminated forever) which has to be detected, in order that the system
can reallocate local and global resources used by the processes (e.g., local memory
space).
This chapter presented several distributed algorithms which detect the termina-
tion of a distributed computation. Algorithms suited to specific models (such as the
asynchronous atomic model), and algorithms suited to specific types of computa-
tions (such as diffusing computations) were first presented. Then the chapter has
considered more general algorithms. In this context it presented a reasoned con-
struction of a very general termination detection algorithm. It also introduced a very
general distributed model which allows for very versatile receive statements, and
presented two termination detection algorithms suited to this distributed computa-
tion model.
• The asynchronous atomic model was introduced by F. Mattern in [249], who pro-
posed the four-counter algorithm (Fig. 14.5), and the counting vector algorithm
(Fig. 14.7) to detect termination in this distributed computing model.
• The notion of diffusing computations was introduced by E.W. Dijkstra and C.S.
Scholten [115], who proposed the algorithm described in Fig. 14.9 to detect the
termination of such computations.
• The reasoned construction presented in Sect. 14.4, and the associated general
termination detection algorithm presented in Fig. 14.13 are due to J.-M. Hélary
and M. Raynal [178, 180].
• The wave concept is investigated in [82, 180, 319, 365].
• The notion of freezing in termination detection algorithms is studied in [132].
• The problem of detecting the termination of a distributed computation in a very
general asynchronous model, and the associated detection algorithms, are due to
J. Brzezinski, J.-M. Hélary, and M. Raynal [64].
• The termination detection problem has given rise to an abundant literature and
many algorithms. Some are designed for the synchronous communication model
(e.g., [114, 265]); some are based on snapshots [190]; some are based on roughly
synchronized clocks [257]; some are based on prime numbers [309] (where the
unicity of the prime number factorization of an integer is used to ensure con-
sistent observations of a global state); some are obtained from a methodological
construction (e.g., [79, 342]); and some are based on the notion of credit dis-
tribution [251]. Other investigations of the termination detection problem and
algorithms can be found in [74, 169, 191, 221, 252, 261, 302, 330] to cite a few.
• Termination detection in asynchronous systems where processes may crash is
addressed in [168].
• Relations (in both directions) between termination detection and garbage collec-
tion are investigated in [366, 367].
Hint. When a process enters the tree, it sends a message ACK (in) to its parent and
sends no message if it is already in the tree. It will send to its parent a message
ACK (out) when it leaves the tree.
Solution in [308].
3. Modify the algorithms implementing a wave and the generic termination detec-
tion algorithm described in Figs. 14.10, 14.11, and 14.13, respectively, so that
the initiator pα is also an application process (which has consequently to be ob-
served).
4. The termination detection algorithm presented in Fig. 14.13 assumes that the
application process pi is frozen (Fig. 14.14) from the time the predicate idlei
becomes true (line 13) until obsi starts the invocation of return_wave(b) (line 15).
Considering the xth wave, let us define τi^x as the time at which idlei becomes
true at line 13, and let us replace line 14 by
b ← cont_passivei ; cont_passivei ← (statei = passive).
Do these modifications allow for the suppression of the freezing of the appli-
cation process pi ?
5. Let us consider the family of distributed applications in which each process has
a local variable xi whose value domain is a (finite or infinite) ordered set, and
each process follows the following rules:
• Each message m sent by pi carries a value cm ≥ xi .
• When it modifies the value of xi , a process pi can only increase it.
• When it receives a message m, which carries the value cm , pi can modify xi
to a value equal to or greater than min(xi , cm ).
These computations are called monotonous computations. An example of a
computation following these rules is given in Fig. 14.20, where the value domain
is the set of positive integers. The value cm associated with a message m is in-
dicated with the corresponding arrow, and a value v on the axis of a process pi
indicates that its local variable xi is updated to that value, e.g., x3 = 1 at time τ1 ,
and x1 = 5 at time τ2 .
At some time τ , the values of the local variables xi and the values cm carried
by the messages define the current global state of the application. Let F (τ ) be
the smallest of these values. Two examples are given in Fig. 14.20.
Distributed simulation programs are an example of programs where the pro-
cesses follow the previous rules (introductory surveys on distributed simulation
can be found in [140, 263]). The variables xi are the local simulation times at
each process, and the function F (τ ) defines the global simulation time.
• Show that (τ1 ≤ τ2 ) ⇒ [F (τ1 ) ≤ F (τ2 )].
• An observation algorithm for such a computation is an algorithm that com-
putes approximations of the value of F (τ ) such that
– Safety. If at time τ the algorithm returns the value z, then z ≤ F (τ ).
– Liveness. For any τ , if observations are launched after τ , there is a time after
which all the observations return values greater than or equal to z = F (τ ).
When considering distributed simulation programs, the safety property states a
correct lower bound on the global simulation time, while the liveness property
states that if the global simulation progresses then this progress can eventually
be observed.
Adapt the algorithm of Fig. 14.13 so that it computes F (τ ) and satisfies the
previous safety and liveness properties.
• Considering the value domain {active, passive} with the total order active <
passive, show that the problem of detecting the termination of a distributed
computation consists in repeatedly computing F (τ ) until F (τ ) = passive.
Solutions to all the previous questions in [87, 179].
6. Prove the dynamic termination detection algorithm described in Fig. 14.19. (This
proof is close to that of the static termination detection algorithm presented in
Fig. 14.16. In addition to the monotonicity of the predicate fulfilledi (), the fact
that the counters senti [j ], reci [j ] and ap_sentα [i, j ] are monotonically increas-
ing has to be used.)
Solution in [64].
Chapter 15
Distributed Deadlock Detection
This chapter addresses the deadlock detection problem. After having introduced the
AND deadlock model and the OR deadlock model, it presents distributed algorithms
that detect their occurrence. Let us recall that the property “there is a deadlock” is
a stable property (once deadlocked, a set of processes remain deadlocked until an
external agent—the underlying system—resolves it). Hence, as seen in Sect. 6.5,
algorithms computing global states of a computation can be used to detect dead-
locks. Differently, the algorithms presented in this chapter are specific to deadlock
detection. For simplicity, they all assume FIFO channels.
Waiting for Resources A process pi becomes blocked when it starts waiting for a
resource currently owned by another process pj . This introduces a waiting relation
between pi and pj . It is possible that a process pi needs several resources simul-
taneously. If these resources are currently owned by several processes pj , pk , etc.,
the progress of pi depends on each of these processes.
Waiting for Messages As seen in the previous chapter (Sect. 14.5), a receive state-
ment can be on a message from a specific sender, or on a message from any sender
of a specific set of senders. The specific sender, or set of possible senders, is defined
in the receive statement. (While more general receive statements were introduced in
Sect. 14.5, this chapter considers only the case of a receive statement with a specific
sender or a specific set of senders.)
The important point is that, when a process pi enters a receive statement, it starts
waiting for a message from a predefined process pj , or from any process from a
predefined set of processes.
Resource vs. Message After it has been used, a resource has to be released so that
it can be used by another process. Let us observe that a message can be seen as a
resource that is dynamically created by its sender and consumed by its destination
processes. A receiver has to wait until a message is received. In that sense, a message
can be seen as a consumable resource (the notion of release being then meaningless).
While a resource is shared among several processes and a message involves only
its sender and its receiver, there is no conceptual difference between a process which
is waiting for a resource and a process which is waiting for a message. In both cases,
another process has to produce an action (release a resource, or send a message) in
order that the waiting process be allowed to progress.
It follows that there is no fundamental difference between detection algorithms
for deadlocks due to resources and detection algorithms for deadlocks due to mes-
sages.
Dependency Set Given a wait-for graph at some time τ , the dependency set of
a process pi , denoted dep_seti (as in the previous chapter), is the set of all the
processes pj such that the directed edge (pi , pj ) belongs to the graph at time τ . If
a process is not waiting for a resource (or a message, according to the model), its
dependency set is empty. As the wait-for graph evolves with time, the dependency
sets are not static sets.
When looking at Fig. 15.1, we have dep_set1 = {2, 5} in both graphs, dep_set5
= ∅ in the graph on the left and dep_set5 = {2} in the graph on the right.
In the AND model, the progress of a process is stopped until it has received a mes-
sage from (or been granted a resource by) each process in its current dependency set. This
means that a process is deadlocked when it belongs to a cycle of the current wait-for
graph, or belongs to a path leading to a cycle of this graph.
Let us consider the wait-for graph on the left of Fig. 15.1. The processes p1 , p2 ,
and p3 are deadlocked because, transitively, each of them depends on itself: p2 is
As for the other problems (e.g., resource allocation, communication, termination de-
tection), the deadlock detection problem is defined by safety and liveness properties
that any of its solutions has to satisfy. These properties are the following:
• Safety. If, at time τ , an observer claims that there is a deadlock, the process it is
associated with is deadlocked at time τ .
Remark When comparing deadlock detection algorithms with algorithms that al-
low a process to know if it belongs to a cycle, or a knot, of a given communication
graph (see the algorithms presented in Sect. 2.3), the additional difficulty lies in
the fact that the graphs that deadlock detection algorithms consider are dy-
namic. The waiting relation depends on the computation itself, and consequently the
wait-for graph evolves with time.
Moreover, while the abstract wait-for graph is modified instantaneously from an
external observer’s point of view (i.e., when reasoning to obtain a characterization
of deadlocks in terms of properties on a graph), at the operational level messages
take time to inform processes about changes in the waiting relation. This creates
an uncertainty on the view of the waiting relation as perceived by each process.
The main issue of deadlock detection algorithms is to provide processes with an
approximate view that allows them to correctly detect deadlocks that occur.
Local Variables Two local control variables, denoted pubi (for public) and privi
(for private) are associated with each process pi . These variables are integers which
can only increase. Moreover, the values of the private local variables are always
distinct from one another, and the local public variables can be read by the other
processes.
A simple way to ensure that no two private variables (privi and privj ) have the
same value consists in considering that the integer implementing a private variable
privi is obtained from the concatenation of an integer with the identity of the corre-
sponding process pi . The reading of the local variable pubj by an observer obsi can
be easily realized by a query/response mechanism involving obsi and obsj .
Hence, the behavior of each observer obsi consists in an appropriate management
of the pair (privi , pubi ) so that, if a cycle appears in the abstract wait-for graph, a
single process in a cycle detects the cycle, and the processes which are not involved
in a cycle never claim a deadlock.
• R1 (Blocking rule).
When pi becomes blocked due to pj , obsi resets the values of its local vari-
ables privi and pubi such that
• R2 (Propagation rule).
When pi is blocked by pj , obsi repeatedly reads the value of pubj , and exe-
cutes
if (pubi < pubj ) then pubi ← pubj end if.
This means that, while pi is blocked by pj , if obsi discovers that the public value
pubj is greater than its own public value pubi , it propagates the value of pubj
by assigning it to pubi . Assuming that pubk is the greatest public value in a path
of the wait-for graph, this rule ensures that pubk is propagated from pk to the
process pℓ blocked by pk , then from pℓ to the process blocked by pℓ, etc.
• R3 (Activation rule).
Let pj , pk , etc., be the processes blocked by a process pi . When pi unblocks
one of them (e.g., pj ), obsi informs the observers of the other processes so that
they re-execute rule R1. (This is due to the fact that these processes are now
blocked by pj .)
• R4 (Detect rule).
If after it has read the value of pubj (where pj is the process that blocks pi ),
obsi discovers that
privi = pubj ,
it claims that there is a cycle in the wait-for graph, and it is consequently involved
in a deadlock.
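A minimal sketch of these rules in Python follows; since the exact update of rule R1 is not reproduced above, the helper fresh_value(), which returns a value unique to pi and greater than both current public values, is an assumption of the illustration.

def r1_blocking(me, blocker, fresh_value):
    # R1: p_i becomes blocked by p_j; both labels jump to a fresh, larger value
    v = fresh_value(me['id'], me['pub'], blocker['pub'])
    me['priv'] = me['pub'] = v

def r2_propagation(me, blocker):
    # R2: while blocked, propagate the larger public value backwards along the edge
    if me['pub'] < blocker['pub']:
        me['pub'] = blocker['pub']

def r4_detection(me, blocker):
    # R4: p_i reads pub_j; equality with its own private value reveals a cycle
    return me['priv'] == blocker['pub']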
Proof Proof of the liveness property. We have to prove that, if a deadlock occurs,
a process involved in the associated cycle will detect it. Let us assume that there
is a cycle in the wait-for graph. Let us observe that, as a deadlock defines a stable
property, this cycle lasts forever.
It follows from rule R1 that, when it becomes blocked, each process px belonging
to this cycle sets privx and pubx to a value greater than its previous value of pubx
and greater than the value puby of the process py that blocks it. Moreover, due to
the definition of private values, no other process pz sets privz and pubz to the same
value as px . It follows that there is a process (say pi ) whose value pubi = v is the
greatest among the values pubx computed by the processes belonging to the cycle.
When the processes of the cycle execute rule R2, the value of v is propagated,
in the opposite direction, along the edges of the wait-for graph. Hence, the process
pj that blocks pi is eventually such that pubj = v. There is consequently a finite
time after which pi reads v from pubj and concludes that there is a cycle (rule R4).
Moreover, as v = privi , v is a value that has necessarily been forged by pi (no other
process can have forged it). It follows that deadlock is claimed by a single process,
which is a process involved in the corresponding cycle of the wait-for graph.
Proof of the safety property. Let us first observe that ∀x : privx ≤ pubx (this is
initially true, and is kept invariant by the rules R1 and R2 executed thereafter). It
follows from this invariant and R1 that, if px is blocked, (privx < pubx ) ⇒ px has
executed R2.
Let us assume that a process pi , blocked by process pj 1 , is such that privi =
pubj 1 = v. We have to show that pi belongs to a cycle of the wait-for graph. This is
proved in three steps.
• It follows from the rule R2 that the value v (which has been forged by obsi )
has been propagated from pi to a process blocked by pi , etc., until pj 1 . Hence,
there is a set of processes pi , pjk , pjk−1 , . . . , pj1 , such that the edges (pjk , pi ),
(pjk−1 , pjk ), . . . , (pj1 , pi ) have been, at some time, edges of the wait-for graph.
The next two items show that all these edges exist simultaneously when pi claims
deadlock.
• The process pi remained continuously blocked since the time it computed the
value v. If it had become active and then blocked again, it would have executed
again R1, and (due to the invariant privi ≤ pubi and the function greater()) we
would consequently have privi = v′ > v.
• All the other processes pjk , pjk−1 , . . . , pj1 remained continuously blocked since
the time they have forwarded v. This follows from the following observation. Let
us assume (by contradiction) that one of these processes, say pjy , became ac-
tive after having transmitted v. This process has been unblocked by pjy−1 , which
has transmitted v before being unblocked by pjy−2 , which in turn, etc., until pj1
which has transmitted v to pi before being unblocked by pi . It follows that pi
has not been continuously passive since the time it computed the value v, which
contradicts the previous item, and completes the proof.
As already indicated, in the AND model, a process is blocked by several other pro-
cesses and each of them has to release a resource or send it a message in order to
allow it to progress. This section presents a relatively simple deadlock detection
algorithm for the communication AND model.
Model with Input Buffers As in the previous chapter, let statei be a control vari-
able whose value domain is {active, passive}; this variable is such that statei =
passive when pi is blocked waiting for messages from a predefined set of processes
whose identities define the set dep_seti .
As a process pi consumes simultaneously a message from each process in
dep_seti (and proceeds then to the state statei = active), it is possible that messages
from processes in dep_seti have arrived and cannot be received and consumed by
pi because there is not yet a message from each process in dep_seti . To take this
into account, the communication model described in Fig. 14.15 is considered. This
model allows a process pi (or more precisely its observer obsi ) to look into its input
buffers to know if messages have arrived and are ready to be consumed.
Fig. 15.2 An algorithm for deadlock detection in the AND communication model
The Algorithm The algorithm is described in Fig. 15.2. All the sequences of state-
ments prefixed by when are executed atomically. Lines 1–6 describe the behavior
of obsi as far as the observation of pi is concerned.
When obsi suspects pi to be deadlocked, we have statei = passive and
arr_fromi ⊉ dep_seti . This suspicion can be activated by an internal timer, or any
predicate internal to obsi . When this occurs, obsi increases sni [i] (line 7), and sends
a message PROBE () to each observer obsj associated with a process pj from which
it is waiting for a message (lines 8–10).
A probe message is as follows: PROBE (k, seqnb, j, arrived). Let us consider an
observer obsi that receives such a message. The pair (k, seqnb) means that this
probe has been initiated by obsk and it is the seqnbth probe launched by pk ;
j is the identity of the sender of the message (this parameter could be saved, as a
receiver knows which observer sent this message; it appears as a message parameter
for clarity). Finally, the integer arrived is the number of messages that have been
sent by pi to pj and have arrived at pj .
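A hedged sketch of this probe mechanism follows; the duplicate-handling details of lines 13–17 of Fig. 15.2 are not fully visible above, so they are approximated here with a seen set, and send_probe and claim_deadlock are assumed helpers.

def launch_probe(obs, send_probe):
    # obs_i suspects p_i: it is passive and some expected messages are still missing
    obs.sn[obs.i] += 1                                        # line 7
    for j in obs.dep_set - obs.arr_from:                      # lines 8-10
        send_probe(('PROBE', obs.i, obs.sn[obs.i], obs.i, obs.arr[j]), j)

def on_probe(obs, k, seqnb, j, arrived, send_probe, claim_deadlock):
    if obs.state != 'passive':
        return                                # an active process stops the probe
    if obs.sent[j] != arrived:
        return                                # channel from p_i to p_j not empty: discard
    if k == obs.i:
        claim_deadlock()                      # the probe came back along a cycle (line 12)
    elif (k, seqnb) not in obs.seen:          # forward the probe once (approximation of
        obs.seen.add((k, seqnb))              # the bookkeeping of lines 13-17)
        for m in obs.dep_set - obs.arr_from:
            send_probe(('PROBE', k, seqnb, obs.i, obs.arr[m]), m)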
Fig. 15.4 PROBE () messages sent along a cycle (with no application messages in transit)
Proof Proof of the liveness property. Let us assume that a process pi is deadlocked.
This means that (a) there is a cycle of processes pj1 , pj2 , . . . , pjk , pj1 , where i =
j1 , j2 ∈ dep_setj1 , j3 ∈ dep_setj2 , etc., j1 ∈ dep_setjk , (b) there is no application
message in transit from pj2 to pj1 , from pj3 to pj2 , etc., and from pj1 to pjk , and
(c) none of these processes can be re-activated from the content of its input buffer.
Let us consider the observer obsi , which launches a probe after the cycle
has been formed (see Fig. 15.4, where k = 3). The observer obsi sends the
message PROBE (i, sn, i, arri [j2 ]) to obsj2 (line 9). As the channel from pj2 to
pj1 is empty of application messages, the first time obsj2 receives the mes-
sage PROBE (i, sn, i, arri [j2 ]), we have sentj2 [i] = arri [j2 ], and obsj2 executes
lines 14–17. Consequently obsj2 sends the message PROBE (i, sn, j2 , arrj2 [j3 ]) to
obsj3 . And so on, until obsjk , which receives the message PROBE (i, sn, jk−1 , arrjk−1 [jk ])
and sends the message PROBE (i, sn, jk , arrjk [i]) to obsi . As the channel from pi to
pjk is empty, it follows from line 12 that, when it receives this message, obsi is such
that senti [jk ] = arrjk [i]. Hence, obsi claims that pi is deadlocked, which proves the
liveness property.
Proof of the safety property. Let us consider an observer obsi that claims that pi
is deadlocked. We have to show that pi belongs to a cycle pi = pj1 , pj2 , . . . , pjk ,
pj1 , such that there is a time at which simultaneously (a) j2 ∈ dep_setj1 and there is
no message in transit from pj2 to pj1 , (b) j3 ∈ dep_setj2 and there is no message in
transit from pj3 to pj2 , (c) etc. until process pj1 such that j1 ∈ dep_setjk and there
is no message in transit from pj1 to pjk .
As obsj1 claims that pj1 is deadlocked (line 12), there is a time τ at which obsj1
received a message PROBE (j1 , sn, jk , arrjk [j1 ]) from some observer obsjk . Process
pjk was passive when obsjk sent this message at some time τk < τ . Moreover,
as obsj1 did not discard this message when it received it (predicate sentj1 [jk ] =
arrived = arrjk [j1 ], line 11), the channel from pj1 to pjk did not contain applica-
tion messages between τk and τ . It follows that pjk remained continuously passive
from the time τk to time τ . (See Fig. 15.5. The fact that there is no message in transit
from one process to another is indicated by a crossed-out arrow.)
The same observation applies to obsjk . This local observer received at time
τk a control message PROBE (j1 , sn, jk−1 , arrjk−1 [jk ]), which was sent by an ob-
server obsjk−1 at time τk−1 . As obsjk did not discard this message, we have
sentjk [jk−1 ] = arrjk−1 [jk ] from which it follows that the channel from pjk to
pjk−1 did not contain application messages between τk−1 and τk . Moreover, as
jk ∈ dep_setjk−1 \ arr_fromjk−1 , and pjk remained continuously passive from time
τk to time τ , it follows that pjk−1 remained continuously passive from time τk−1
to time τ . This reasoning can be repeated until the sending by pj1 of the message
PROBE (j1 , sn, j1 , arrj1 [j2 ]), at time τ1 , from which we conclude that pj1 remained
continuously passive from time τ1 to time τ .
It follows that (a) the processes pj1 , . . . , pjk are passive at time τ , (b) the channel
from pj1 to pjk , the channel from pjk to pjk−1 , etc., until the channel from pj2 to
pj1 are empty at time τ , and (c) none of these processes can be re-activated from the
messages in its input buffer. Consequently these processes are deadlocked (which
means that the cycle is a cycle of the wait-for graph), which concludes the proof of
the safety property.
As seen in Sect. 15.1, in the OR communication model, each receive statement spec-
ifies a set of processes, and the invoking process pi waits for a message from any of
these processes. This set of processes is the current dependence set of pi , denoted
dep_seti . As soon as a message from a process of dep_seti has arrived, pi stops wait-
ing and consumes it. This section presents an algorithm which detects deadlocks in
this model. This algorithm is due to K.M. Chandy, J. Misra, and L.M. Haas (1983).
This algorithm considers the following slightly modified definition for a set of
deadlocked processes, namely, a set D of processes is deadlocked if (a) all the pro-
cesses of D are passive, (b) the dependency set of each of them is a subset of D, and
(c) for each pair of processes {pi , pj } ∈ D such that j ∈ dep_seti , there is no mes-
sage in transit from pj to pi . When compared to the definition given in Sect. 15.1.4,
this definition includes the processes blocked by processes belonging to a knot of
deadlocked processes. When considering the wait-for graph on the left of Fig. 15.1,
this definition considers that p4 is deadlocked, while the definition of Sect. 15.1.4
considers it is blocked by a deadlocked process (p2 ).
This algorithm assumes also that a process pi is passive only when it is waiting
for a message from a process belonging to its current dependency set dep_seti ,
which means that processes do not terminate. The case where a process locally
terminates (i.e., attains an “end” statement, after which it remains forever passive)
is addressed in Problem 4.
15.4.1 Principle
Network Traversal with Feedback When the observer obsi associated with a
process pi suspects that pi is involved in a deadlock, it launches a parallel network
traversal with feedback on the edges of the wait-for graph of which it is the origin, i.e.,
on the channels from pi to pj such that j ∈ dep_seti . (A network traversal algorithm
with feedback that builds a spanning tree has been presented in Sect. 1.2.4.) If such
a process pj is active, it discards the message and, consequently, stops the network
traversal. If it is itself blocked, it propagates the network traversal to the processes
which currently define its set dep_setj . And so on. These control messages are used
to build a spanning tree rooted at pi .
If the network traversal with feedback terminates (i.e., obsi receives an answer
from each process pj in dep_seti ), the aim is to allow obsi to conclude that pi is
deadlocked. If an observer obsj has locally stopped the progress of the network
traversal launched by obsi , pj is active and may send in the future a message to
the process pk that sent it a network traversal message. If re-activated, this process
pk may in turn re-activate a process pℓ such that ℓ ∈ dep_setk , etc. This chain of
process re-activations can end in the re-activation of pi .
to obs4 (message QUERY 2,4 ()), while obs3 forwards it to obs1 and obs4 (messages
QUERY 3,1 () and QUERY 3,4 ()). As it cannot extend the traversal, obs4 sends back a
message ANSWER () each time it receives a query (messages ANSWER 4,2 () and AN -
SWER 4,3 ()). As it has already been visited by the network traversal, obs1 sends by
return the message ANSWER 1,3 () when it receives QUERY 3,1 (). When each of obs2
and obs3 has received all the answers matching its queries, it sends a message AN -
SWER () to obs1 . Finally, when obs1 has received the answers from obs2 and obs3 ,
the network traversal with feedback terminates.
values of the sets dep_setj . (By definition dep_setj = ∅ if pj is active.) The ver-
tices of the corresponding directed graph are the observers in the set DPi , which is
recursively defined as follows:
DPi = {i} ∪ ( ∪x∈DPi dep_setx ).
identified (k, seqnb) will not terminate, and consequently this network traversal will
not allow obsk to claim that pk is deadlocked. If pi is passive, there are two cases:
• If seqnb > sni [k], obsi discovers that this query concerns a new network traversal
launched by obsk . Consequently, it updates sni [k] and defines obsj as its parent
in this network traversal (line 17). It then extends the network traversal by prop-
agating the query message it has received to each observer of dep_seti (line 18),
updates accordingly expected_aswri [k] (line 19), and starts a new observation pe-
riod of pi (with respect to this network traversal) by setting cont_passivei [k] to
true (line 20).
• If seqnb ≤ sni [k], obsi stops the network traversal if it is an old one (seqnb <
sni [k]), or if pi has been re-activated since the beginning of the observation period
(which started at the first reception of a message QUERY (k, seqnb)).
If the query concerns the last network traversal launched by pk and pi has
not been re-activated since the start of the local observation period associated
with this network traversal, obsi sends by return the message ANSWER (k, seqnb)
to obsj (line 22). This is needed to allow the network traversal to return to its
initiator (if no other observer stops it).
Finally, when an observer obsi receives a message ANSWER (k, seqnb) it dis-
cards the message if this message is related to an old traversal, or if pi did not
remain continuously passive since the beginning of this network traversal. Hence,
if (seqnb = sni [k]) ∧ cont_passivei [k], obsi first decreases expected_aswri [k]
(line 27). Then, if expected_aswri [k] = 0, the network traversal with feedback can
leave obsi . To that end, if k = i, obsi forwards the message ANSWER (k, seqnb) to
its parent in the tree built by the first message QUERY (k, seqnb) it has received. If
k = i, the network traversal has returned to obsi , which claims that pi is deadlocked.
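The behavior just described can be summarized by the following sketch; send and claim_deadlock are assumed helpers, and the per-initiator arrays sn, parent, expected_aswr (abbreviated expected below), and cont_passive are those introduced in the text.

def launch_detection(obs, send):
    obs.sn[obs.i] += 1
    obs.expected[obs.i] = len(obs.dep_set)
    for j in obs.dep_set:
        send(('QUERY', obs.i, obs.sn[obs.i]), j)

def on_query(obs, k, seqnb, j, send):
    if obs.state == 'active':
        return                                       # an active process stops the traversal
    if seqnb > obs.sn[k]:                            # first visit of a new traversal
        obs.sn[k] = seqnb
        obs.parent[k] = j                            # line 17
        for m in obs.dep_set:                        # line 18
            send(('QUERY', k, seqnb), m)
        obs.expected[k] = len(obs.dep_set)           # line 19
        obs.cont_passive[k] = True                   # line 20
    elif seqnb == obs.sn[k] and obs.cont_passive[k]:
        send(('ANSWER', k, seqnb), j)                # line 22

def on_answer(obs, k, seqnb, send, claim_deadlock):
    if seqnb == obs.sn[k] and obs.cont_passive[k]:
        obs.expected[k] -= 1                         # line 27
        if obs.expected[k] == 0:
            if k == obs.i:
                claim_deadlock()                     # line 29
            else:
                send(('ANSWER', k, seqnb), obs.parent[k])   # line 30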
Proof Proof of the liveness property. Let D be a set of deadlocked processes, and pi
be one of these processes. Let us assume that, after the deadlock has occurred, obsi
launches a deadlock detection. We have to show that obsi eventually claims that pi
is deadlocked.
As there is deadlock, there is no message in transit among the pairs of processes
of D such that i, j ∈ D and j ∈ dep_seti . Let (i, sn) be the identity of the network
traversal with feedback launched by obsi after the deadlock occurred. The observer
obsi sends a message QUERY (i, sn) to each observer obsj , such that j ∈ dep_seti .
When such an observer receives this message it sets cont_passivej [i] to true, and
forwards QUERY (i, sn) to all the observers obsk such that k ∈ dep_setj , etc. As there
is no message in transit among the processes of D, and all the processes in D are
passive, all the Boolean variables cont_passivex [i] of the processes in D are set to
true and thereafter remain true forever. It follows that no message QUERY (i, sn) or
ANSWER (i, sn) is discarded. Consequently, no observer stops the progress of the
network traversal which returns at obsi and terminates, which concludes the proof
of the liveness of the detection algorithm.
Proof of the safety property. We have to show that no observer obsi claims that
pi is deadlocked while it is not. Figure 15.11 is used to illustrate the proof.
Claim C Let obsi be an observer that sends ANSWER () to obsk where parenti = k.
Let us assume that, after this answer has been sent, pi sends a message m to pk
at time τi . Then, there is an observer obsj such that j ∈ dep_seti , which received
QUERY () from obsi and, subsequently, obsj sent an answer to obsi at some time τja
and pj became active at some time τj such that τja < τj < τi .
Proof of the Claim C In order for obsi to send ANSWER () to its parent, it needs to
have received an answer from each observer obsj such that j ∈ dep_seti (lines 19,
27–28, and 30). Moreover, for pi to become active and send a message m to pk ,
it needs to have received an application message m′ from a process belonging to
dep_seti (lines 6 and 8). Let pj be this process.
As the channels are FIFO, and obsj sent a message ANSWER () to obsi , m′ is
not an old message still in transit; it has necessarily been sent after obsj sent the
answer message to obsi . It follows that m′ is received after the answer message and
we have consequently τja < τj . Finally, as the sending of m′ by pj occurs before
its reception by pi (which occurs after obsi sent an answer to obsk ), it follows that
pj becomes active at some time τj such that τj < τi , and we obtain τja < τj < τi .
(End of the proof of the claim.)
Let D be the set of processes involved in the network traversal with feedback ini-
tiated by an observer obsz which executes line 29 and claims that pz is deadlocked.
As the network traversal returns to its initiator obsz , it follows that every observer
in D has received a matching answer for each query it has sent. Hence, every ob-
server sent an answer message to its parent in the spanning tree built by the network
traversal. Moreover, there is at least one cycle in the set D (otherwise, the network
traversal will not have returned to obsz ).
Let us consider any of these cycles after obsz has claimed a deadlock at some
time τ . This cycle includes necessarily a process pi that, when it receives a query
from a process pk of D defines pk as its parent (line 17). Let us assume that such
a process pi is re-activated at some time τi > τ . It follows from the Claim C that
there is a process pj , such that j ∈ dep_seti and pj sent to pi a message m such that
τja < τj < τi . When considering the pair (pj , obsj ), and applying again Claim C, it
follows that there is a process pℓ such that ℓ ∈ dep_setj and pℓ sent to pj a message
m′ such that τℓa < τℓ < τj . Hence, τℓ < τi .
Let pi , pj , pℓ, . . . , px , pi be the cycle. Using inductively the previous argument,
it follows that i ∈ dep_setx and pi sent to px a message mx at some time τi′ such
that τi′ < τx . Hence, we obtain τi′ < τx < · · · < τℓ < τj < τi . But as (a) pi remained
continuously passive between the reception of QUERY () from px and the sending
of the matching message ANSWER (), and (b) the channel from pi to px is FIFO, it
follows that τi′ > τi , a contradiction.
As the previous reasoning is for any cycle in the set D, it follows that, for any
pair {py , py′ } ∈ D such that y′ ∈ dep_sety , there is no message in transit from py′ to
py at time τ , which concludes the proof of the safety property.
15.5 Summary
A process becomes deadlocked when, while blocked, its progress depends transi-
tively on itself. This chapter has presented two types of communication (or resource)
models, which are the most often encountered. In the AND model, a process waits
for messages from several processes (or for resources currently owned by other
processes). In the OR model, it waits for one message (or resource) from several
possible senders (several distinct resources). After an analysis of the deadlock phe-
nomenon and its capture by the notion of a wait-for graph, the chapter presented
three deadlock detection algorithms. The first one is suited to the one-at-a-time
model, which is the simplest instance of both the AND and OR models, namely,
a process waits for a single message from a given sender (or a single resource) at a
time. The second algorithm allows for the detection of deadlocks in the AND com-
munication model, while the third one allows for the detection of deadlocks in the
OR communication model.
(e.g., [97, 160, 165, 166, 188]). The deadlock problem has also been intensively
studied in database systems (e.g., [218, 259, 287, 331]).
• The advent of distributed systems, distributed computing, and graph-based algo-
rithms has given rise to a new impetus for deadlock detection, and many new
algorithms have been designed (e.g., [61, 169, 216, 278] to cite a few).
• Algorithms to detect cycles and knots in static graphs are presented in [248, 264].
• The deadlock detection algorithm for the one-at-a-time model presented in
Sect. 15.2 is due to D.L. Mitchell and M. Merritt [266].
• The deadlock detection algorithm for the AND communication model presented
in Sect. 15.3 is new. Another detection algorithm suited to this model is presented
in [260], and a deadlock avoidance algorithm for the AND model is described
in [389].
• The deadlock detection algorithm for the OR communication model presented in
Sect. 15.4 is due to K.M. Chandy, J. Misra, and L.M. Haas [81]. This algorithm is
based on the notion of a diffusing computation introduced by E.W. Dijkstra and
C.S. Scholten in [115]. Another algorithm for the OR model is described in [379].
• A deadlock detection algorithm for a very general communication model includ-
ing the AND model, the OR model, the k-out-of-n model, and their combination
is described and proved correct in [65].
• Proof techniques for deadlock absence in networks of processes are addressed
in [76] and an invariant-based verification method for a deadlock detection algo-
rithm is investigated in [215].
• Deadlock detection algorithms for synchronous systems are described in
[308, 391].
• Deadlock detection in transaction systems is addressed in [163].
• Introductory surveys on distributed deadlock detection can be found in [206, 346].
1. Considering the deadlock detection algorithm presented in Sect. 15.2, let us re-
place the detection predicate privi = pubj by pubi = pubj , in order to eliminate
the local control variable privi .
Is this predicate correct to ensure that each deadlock is detected, and is it de-
tected by a single process, which is a process belonging to a cycle? To show the
answer is “yes”, a proof has to be designed. To show the answer is “no”, a coun-
terexample has to be produced. (To answer this question, one can investigate the
particular case of the wait-for graph described in Fig. 15.12.)
2. Adapt the deadlock detection algorithm suited to the AND communication model
presented in Sect. 15.3 to obtain an algorithm that works for the AND resource
model.
Solution in [81].
3. Let us consider the general OR/AND receive statement defined in the previ-
ous chapter devoted to termination detection (Sect. 14.5). Using the predicate
This part of the book is made up of two chapters. After having presented the
general problem of building a shared memory on top of a message-passing system,
Chap. 16 addresses the atomicity (linearizability) consistency condition. It defines
it, presents its main composability property, and describes distributed algorithms
that implement it. Then, Chap. 17 considers the sequential consistency condition,
explains its fundamental difference with respect to atomicity, and presents several
implementations of it.
Chapter 16
Atomic Consistency (Linearizability)
This chapter is on the strongest consistency condition for concurrent objects. This
condition is called atomicity when considering shared registers, and linearizability
when considering more sophisticated types of objects. In the following, these two
terms are considered as synonyms.
The chapter first introduces the notion of a distributed shared memory. It then de-
fines formally the atomicity concept, presents its main composability property, and
describes several implementations on top of a message-passing system.
All the concurrent objects considered in the following are assumed to be defined
by a sequential specification.
Let us consider Fig. 16.2, which represents a computation involving two processes
accessing a shared register R. The process pw issues write operations, while the
process pr issues read operations (the notation R.read() ← v means that the value
returned by the corresponding read is v).
The question is: Which values v and v′ can be returned for this register execution
to be correct? As an example, do v = 0 and v′ = 2 define a correct execution? Or do
v = 2 and v′ = 1 define a correct execution? Are several correct executions possible
or is a single one possible?
The aim of a consistency condition is to answer this question. Whatever the ob-
ject (register, queue, etc.), there are several meaningful answers to this question, and
atomicity is one of them.
By a slight abuse of language, we use the term “operation” also for “execution of
an operation on an object”. Let OP be the set of all the operations issued by the
processes.
A computation of a set of processes accessing a set of concurrent objects is a
partial order on the set of operations issued by the processes. This partial order,
denoted (OP, →op), is defined as follows. Let op1 be any operation issued by
a process pi , and op2 be any operation issued by a process pj ; op1 is on object X,
while op2 is on object Y (possibly i = j or X = Y ). op1 →op op2 if op1 terminated
before op2 started.
The projection of →op on the operations issued by a process is called the process
order relation. As each process pi is sequential, the process order relation defines n
total orders (one per process). When op1 and op2 are operations on the same object
X, the projection of →op on the operations on X is called the object order relation.
Two operations which are not ordered by →op are said to be concurrent or over-
lapping. Otherwise they are non-overlapping.
The relation →op associated with the computation described in Fig. 16.2 is de-
picted in Fig. 16.3. The first read operation issued by pr is concurrent with the two
last write operations issued by pw , while its last read operation is concurrent only
with the last write operation.
Fig. 16.3 The relation →op of the computation described in Fig. 16.2
Remark Let us notice that this definition generalizes the definition of a message-
passing execution given in Sect. 6.1.2. A message-passing system is a system where
any directed pair of processes (pi , pj ) communicate by pi depositing (sending)
values (messages) in an object (the channel from pi to pj ), and pj withdrawing
(receiving) values from this object. Moreover, the inescapable transit time of each
message is captured by the fact that a value is always withdrawn after it has been
deposited.
R.write(3) by pk are ordered in the reverse order. If the read operations issued by
pi and pj return the value 3, the register is atomic. If one of them returns another
value, it is not.
Let us finally notice that, as the second read by pi , the operation R.write(2) by
pj , and the operation R.write(3) by pk are all concurrent, it is possible that the
execution be such that the second read by pi appears as being linearized between
R.write(2) by pj and R.write(3) by pk . In that case, for R to be atomic, the second
read by pi has to return the value 2.
The Notion of a Local Property Let P be any property defined on a set of ob-
jects. The property P is said to be local if the set of objects as a whole satisfies P
whenever each object taken separately satisfies P .
Proof The “⇒” direction (only if) is an immediate consequence of the definition
of atomicity: If OP is linearizable then, for each object X involved in OP, OP|X is
linearizable. So, the rest of the proof is restricted to the “⇐” direction.
Given an object X, let SX be a linearization of OP|X. It follows from the defi-
nition of atomicity that SX defines a total order on the operations involving X. Let
→X denote this total order. We construct an order relation → defined on the whole
set of operations of OP as follows:
1. For each object X: →X ⊆ →,
2. →op ⊆ →.
Basically, “→” totally orders all operations on the same object X, according to
→X (first item), while preserving →op, i.e., the real-time occurrence order on the
operations (second item).
Claim “→ is acyclic”. This claim means that → defines a partial order on the set
of all the operations of OP.
Assuming this claim (see its proof below), it is thus possible to construct a se-
quential history S including all operations of OP and respecting →. We trivially
have → ⊆ →S , where →S is the total order (on the operations) defined from S. We
have the three following conditions: (1) OP and S are equivalent (they contain the
same operations, and the operations of each process are ordered the same way in
OP and S), (2) S is sequential (by construction) and legal (due to the first item stated
above), and (3) →op ⊆ →S (due to the second item stated above and the relation
inclusion → ⊆ →S ). It follows that OP is linearizable.
Proof of the Claim We show (by contradiction) that → is acyclic. Assume first that
→ induces a cycle involving the operations on a single object X. Indeed, as →X is
a total order, in particular transitive, there must be two operations opi and opj on X
such that opi →X opj and opj →op opi .
• As opi →X opj and X is linearizable (respects object order, i.e., respects real-
time order), it follows that opi started before opj terminated (otherwise, we would
have opj →X opi ). Let us denote this as start[opi ] < term[opj ].
• Similarly, it follows from the definition of →op that opj →op opi ⇒ term[opj ] <
start[opi ].
These two items contradict each other, from which we conclude that, if there is a
cycle in →, it cannot come from two operations opi and opj , and an object X such
that opi →X opj and opj →op opi .
It follows that any cycle must involve at least two objects. To obtain a contra-
diction we show that, in this case, a cycle in → implies a cycle in →op (which is
acyclic). Let us examine the way the cycle could be obtained. If two consecutive
edges of the cycle are due to just some →X or just →op, then the cycle can be short-
ened, as any of these relations is transitive. Moreover, opi →X opj →Y opk is not
possible for X ≠ Y , as each operation is on only one object (opi →X opj →Y opk
would imply that opj is on both X and Y ). So let us consider any sequence of edges
of the cycle such that: op1 →op op2 →X op3 →op op4 . We have:
• op1 →op op2 ⇒ term[op1 ] < start[op2 ] (definition of →op).
• op2 →X op3 ⇒ start[op2 ] < term[op3 ] (as X is linearizable).
• op3 →op op4 ⇒ term[op3 ] < start[op4 ] (definition of →op).
Combining these statements, we obtain term[op1 ] < start[op4 ], from which we can
conclude that op1 →op op4 . It follows that any cycle in → can be reduced to a cycle
in →op, which is a contradiction as →op is an irreflexive partial order. (End of the
proof of the claim.)
An Example Locality means that atomic objects compose for free. As an exam-
ple, let us consider two atomic queue objects Q1 and Q2, each with its own im-
plementation I 1 and I 2, respectively (hence, the implementations can use different
algorithms).
Let us define the object Q as the composition of Q1 and Q2 defined as follows
(Fig. 16.6). Q provides processes with the four following operations Q.enq1(),
16.4 Message-Passing Implementations of Atomicity
Principle The total order broadcast abstraction has been introduced in Sect. 7.1.4
(where we also presented an implementation of it based on scalar clocks), and in
Sect. 12.4 (where coordinator-based and token-based implementations of it have
been described and proved correct).
This abstraction provides the processes with two operations, denoted to_
broadcast() and to_deliver(), which allow them to broadcast messages and deliver
these messages in the very same order. Let us recall that we then say that a process
to-broadcasts and to-delivers a message.
An algorithm based on this communication abstraction, that implements an
atomic object X, can be easily designed. Each process pi maintains a copy xi of
the object X, and each time pi invokes an operation X.oper(), it to-broadcasts a
message describing this operation and waits until it to-delivers this message.
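To give a concrete flavor of this construction, the following fragment is a small Python sketch of the pattern; it is only an illustration, not the algorithm given in the figures of this chapter. It assumes that an underlying total order broadcast layer is available and calls on_to_deliver() at every process in the same delivery order; all other names (invoke, pending, etc.) are illustrative, and in a real system the message would carry the name of the operation rather than a callable.

import threading

class AtomicObject:
    """Local replica of an object X at process p_i (sketch, not the book's figures)."""

    def __init__(self, pid, initial_state, to_broadcast):
        self.pid = pid                    # identity of this process p_i
        self.state = initial_state        # local copy x_i of the object X
        self.to_broadcast = to_broadcast  # assumed total order broadcast primitive
        self.pending = {}                 # local invocations waiting for to-delivery
        self.counter = 0

    def invoke(self, operation, *args):
        # to-broadcast a description of the operation, then wait until this
        # very message is to-delivered locally before returning the result
        self.counter += 1
        op_id = (self.pid, self.counter)
        event, cell = threading.Event(), []
        self.pending[op_id] = (event, cell)
        self.to_broadcast(("OPERATION", op_id, operation, args))
        event.wait()
        return cell[0]

    def on_to_deliver(self, msg):
        # called by the to-broadcast layer, in the same order at every process
        _, op_id, operation, args = msg
        result = operation(self.state, *args)    # apply the operation to the local copy
        if op_id in self.pending:                # the invocation was issued locally
            event, cell = self.pending.pop(op_id)
            cell.append(result)
            event.set()

Since every replica applies the to-delivered operations in the same total order on its local copy, all invocations appear as if they had been executed sequentially, which is exactly what atomicity requires.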
The Case of Operations Which Do Not Modify the Objects Let us consider the
particular case where the object is a read/write register R. One could wonder why,
as the read operations do not modify the object and each process pi has a local copy
xi of R, it is necessary to to-broadcast the invocations of read().
To answer this question, let us consider Fig. 16.8. The register R is initialized
to the value 0. Process p1 invokes R.write(1), and consequently issues an under-
lying to-broadcast of OPERATION (1, write , 1). When a process pi to-delivers this
message, it assigns the new value 1 to xi . After it has to-delivered the message OP -
ERATION (1, write , 1), p3 invokes R.read() and, as x3 = 1, it returns the value 1.
Differently, the message OPERATION (1, write , 1) sent by p1 to p2 is slow, and p2
invokes R.read() before this message has been to-delivered. This read consequently
returns the current value of x2 , i.e., the initial value 0.
As the invocation of R.read() by p3 terminates before the invocation of R.read()
by p2 starts we have (R.read() by p3 ) →op (R.read() by p2 ), and consequently, the
read by p3 has to be ordered (in S) before the read by p2. But then, the sequence
S cannot be legal. This is because the read by p3 obtains the new value, while the
read by p2 (which occurs later with respect to real time as formally captured by
→op) obtains the initial value, which has been overwritten (as witnessed by the read
of p3 ).
Preventing such incorrect executions requires all operations to be totally ordered
with the to-broadcast abstraction. Hence, when implementing atomicity with to-
broadcast, even the operations which do not modify the object have to participate in
the to-broadcast.
register. This is a consequence of the fact that atomicity is a local property: If X and
Y are atomic, their implementations can be independent (i.e., the manager of X and
the manager of Y never need to cooperate).
Hence, without loss of generality, we consider in the following that there is a
single register X managed by a single server, denoted pX . In addition to pX , each
process pi has a local copy xi of X. The local copy at pX is sometimes called
primary copy. To simplify the presentation (and again without loss of generality)
we consider that the role of the server pX is only to manage X (i.e., it does not
invoke operations on X).
The role of pX is to ensure atomicity by managing the local copies so that (a)
each read operation returns a correct value, and (b)—when possible—read opera-
tions are local, i.e., do not need to send or receive messages. To attain this goal, two
approaches are possible.
• Invalidation. In this case, at each write of X, the manager pX invalidates all the
local copies of X.
• Update. In this case, at each write of X, the manager pX updates all the local
copies of X.
The next two sections develop each of these approaches.
pi that there are no more copies of the previous value and consequently the write
operation can terminate (message WRITE _ ACK ()).
Moreover, during the period starting at the reception of the message WRITE _
REQ () and finishing at the sending of the corresponding WRITE _ ACK (), pX be-
comes locked with respect to X, which means that it delays the processing of all the
messages WRITE _ REQ () and READ _ REQ (). These messages can be processed only
when pX is not locked.
Finally, when the manager pX receives a message READ _ REQ () from a process
pj , it sends by return the current copy of X to pj . When pj receives it, it writes it
in its copy xj and, if it reads again X, it uses this value until it is invalidated. Hence,
some read operations need two messages, while others are purely local.
It is easy to see that the resulting register object X is atomic. All the write op-
erations are totally ordered by pX , and each read operation obtains the last written
value. The fact that the meaning of “last” is with respect to real time (as captured
by →op) follows from the following observation: When pX stops being locked (time τ
in Fig. 16.9), only the current writer and itself have the last value of X.
operation X.read() is
(4) if (xi = ⊥)
(5) then send READ _ REQ () to pX ;
(6) wait READ _ ACK (v) from pX ;
(7) xi ← v
(8) end if;
(9) return(xi ).
The Basic Pattern The basic pattern associated with the ownership notion is the
following one:
• A process pi writes X: It becomes the owner, which entails the invalidation of
all the copies of the previous value of X. Moreover, pi can continue to update its
local copy of X, without informing pX , until another process pj invokes a read
or a write operation.
• If the operation issued by pj is a write, it becomes the new owner, and the situa-
tion is as in the previous item.
• If the operation issued by pj is a read, the current owner pi is demoted: it is no
longer the owner of X, and it is downgraded from the writing/reading mode
to the reading mode only. It can continue reading its copy xi (without passing
through the manager pX ), but its next write operation will have to be managed
by pX . Moreover, pi has to send the current value of xi to pX , and from now on
pX knows the last value of X.
operation X.write(v) is
(1) if (¬ owneri )
(2) then send WRITE _ REQ (v) to pX ;
(3) wait WRITE _ ACK () from pX ;
(4) owneri ← true
(5) end if;
(6) xi ← v.
operation X.read() is
(7) if (xi = ⊥)
(8) then send READ _ REQ () to pX ;
(9) wait READ _ ACK (v) from pX ;
(10) xi ← v
(11) end if;
(12) return(xi ).
If another process pk wants to read X, it can then obtain from pX the last value
of X, and keeps reading its new copy until a new write operation invalidates all
copies.
The Read and Write Operations These operations are described in Fig. 16.11.
Each process pi manages an additional control variable, denoted owneri , which is
true if and only if pi is the current owner of X.
If, when pi invokes X.write(v), pi is the current owner of X, it has only to up-
date its local copy xi (lines 1 and 6). Otherwise it becomes the new owner (line 4)
and sends a message WRITE _ REQ (v) to the manager pX so that it downgrades the
previous owner, if any (line 2). When this downgrading has been done (line 3), pi
writes v in its local copy xi , and the write terminates.
The algorithm implementing the read operation is similar to the one implement-
ing the write operation. If there is a local copy of X (xi ≠ ⊥), pi returns it. Other-
wise, it sends a message READ _ REQ () to pX in order to obtain the last value written
into X. This behavior is described by lines 7–12.
The lines 13–15 are related to the management of the ownership of X. When
the manager pX discovers that pi is no longer the owner of X, pX sends to pi a
message DOWNGRADE _ REQ (type), where type = w if the downgrading is due to a
write operation, and type = r if it is due to a read operation. Hence, when it receives
DOWNGRADE _ REQ (type), pi first sets owneri to false. Then it sends by return to
pX an acknowledgment carrying its value of xi (which is the last value that has been
written if type = r).
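To illustrate this message pattern, here is a rough Python sketch of a possible behavior of the manager pX. It is only an illustration, not the code of Fig. 16.12 (which is not reproduced in this excerpt); the helpers request() (a blocking request/reply exchange) and send(), as well as the INVAL message used to invalidate plain reader copies, are assumptions of the sketch.

class ManagerPX:
    """Sketch of a manager of one register X (invalidation plus ownership)."""

    def __init__(self, request, send):
        self.request = request     # request(dest, msg) -> reply (blocking exchange)
        self.send = send           # send(dest, msg)     (one-way message)
        self.value = None          # primary copy of X held by p_X
        self.owner = None          # process currently owning X, if any
        self.readers = set()       # processes holding a valid read-only copy

    def on_write_req(self, i, v):
        # p_i is not the owner and wants to write v into X
        if self.owner is not None and self.owner != i:
            self.request(self.owner, ("DOWNGRADE_REQ", "w"))   # demote the owner
        for r in self.readers - {i}:
            self.request(r, ("INVAL",))       # invalidate the stale read-only copies
        self.owner, self.readers = i, set()
        self.send(i, ("WRITE_ACK",))          # p_i may now write x_i locally

    def on_read_req(self, j):
        # p_j has no valid copy of X
        if self.owner is not None:
            # the owner is downgraded to read-only mode and returns its value;
            # the reply is assumed to be ("DOWNGRADE_ACK", last_value)
            _, last = self.request(self.owner, ("DOWNGRADE_REQ", "r"))
            self.value = last
            self.readers.add(self.owner)
            self.owner = None
        self.readers.add(j)
        self.send(j, ("READ_ACK", self.value))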
Fig. 16.12 Invalidation and owner-based implementation of atomicity (code of the manager pX )
WRITE _ REQ () – or READ _ REQ () –, DOWNGRADE _ REQ (), DOWNGRADE _ ACK (),
and WRITE _ ACK () – or READ _ ACK () –).
Principle The idea is very similar to the one used in the invalidation approach. It
differs in the fact that, when the manager pX learns a new value, instead of invali-
dating copies, it forwards the new value to the other processes.
To illustrate this principle, let us consider Fig. 16.13. When pX receives a mes-
sage WRITE _ REQ (v), it forwards the value v to all the other processes, and (as
previously) becomes locked until it has received all the corresponding acknowledg-
ments, which means that it processes sequentially all write requests. When a process
pi receives a message UPDATE & LOCK (v), it updates xi to v, sends an acknowledg-
ment to pX , and becomes locked. When pX has received all the acknowledgments,
it knows that all processes have the new value v. This time is denoted τ on the
figure. When this occurs, pX sends a message to all the processes to inform them
that the write has terminated. When a process pi receives this message, it becomes
unlocked, which means that it can again issue read or write operations.
It follows that, differently from the invalidation approach, all reads are purely
local in the update approach (they neither send nor receive messages).
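A minimal Python sketch of this update-and-lock exchange, assuming a blocking request() helper, is given below; the message names follow the text, everything else is illustrative.

class UpdateManagerPX:
    """Sketch of the manager p_X in the update approach."""

    def __init__(self, processes, request, send):
        self.processes = processes   # identities of p_1, ..., p_n
        self.request = request       # request(dest, msg) -> reply (blocking exchange)
        self.send = send             # send(dest, msg)

    def on_write_req(self, i, v):
        # p_X is locked while the exchange takes place, so write requests are
        # processed one after the other
        others = [p for p in self.processes if p != i]
        for p in others:
            # each process updates its copy x_p to v and becomes locked
            self.request(p, ("UPDATE_AND_LOCK", v))
        # time tau: every process now holds the new value v
        for p in others:
            self.send(p, ("WRITE_DONE",))     # the processes become unlocked
        self.send(i, ("WRITE_ACK",))          # the write by p_i terminates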
The Algorithm The code of the algorithm, described in Fig. 16.14, is similar to
the previous one. This algorithm considers that there are several atomic objects X,
Y , etc., each managed by its own server pX , pY , etc. The local Boolean variable
lockedi [X] is used by pi to control the accesses of pi to X.
For each object X, the cooperation between the server pX and the application
processes pi , pj , etc., is locally managed by the Boolean variables lockedi [X],
lockedj [X], etc. Thanks to this cooperation, the server pX of each object X guar-
antees that X behaves atomically. As proved in Theorem 27, the set of servers pX ,
pY , etc., do not have to coordinate to ensure that the whole execution is atomic.
operation X.read() is
(5) wait (¬lockedi [X]);
(6) return(xi ).
16.5 Summary
This chapter first introduced the concept of a distributed shared memory. It then
presented, from both intuitive and formal points of view, the atomicity consistency
condition (also called linearizability). It then showed that atomicity is a consistency
condition that allows objects to be composed for free (a set of objects is atomic
if and only if each object is atomic). Then the chapter presented three message-
passing algorithms that implement atomic objects: a first one based on a total order
broadcast abstraction, a second one based on an invalidation technique, and a third
one based on an update technique.
1. Design an algorithm that implements read/write atomic objects, and uses update
(instead of invalidation) and the process ownership notion.
2. Extend the algorithm described in Figs. 16.11 and 16.12, which implements
atomicity with invalidation and the ownership notion, to obtain an algorithm that
implements a distributed shared memory in which each atomic object is a page
of a classical shared virtual memory.
Solution in [234].
17.1.1 Definition
As we can see, this “witness” execution S does not respect the real time occurrence
order of the operations from different processes. Hence, it is not atomic.
The dotted arrow depicts what is called the read-from relation when the shared
objects are read/write registers. It indicates, for each read operation, the write
operation that wrote the value read. The read-from relation is the analog of the
send/receive relation in message-passing systems, with two main differences: (1) not
all values written are read, and (2) a value written can be read by several processes
and several times by the same process.
An example of a computation which is not sequentially consistent is described
in Fig. 17.2. Despite the fact that each value that is returned by a read operation has
been previously written, it is not possible to build there a sequence S which respects
both process order and the sequential specification of each register.
Actually, determining whether a computation made up of processes accessing
concurrent read/write registers is sequentially consistent is an NP-complete prob-
lem (Taylor, 1983). This result rules out the design of efficient algorithms that would
While atomic consistency is a local property, and consequently atomic objects com-
pose for free (see Sect. 16.3), this is no longer the case for sequentially consistent
objects. The following counter-example proves this claim.
Let us consider a computation with two processes accessing queues. Each queue
is accessed by the usual operations denoted enqueue(), which adds an item at the
head of the queue, and dequeue(), which withdraws from the queue the item at its
tail and returns it (⊥ is returned if the queue is empty).
The computation described in Fig. 17.3, which involves one queue denoted Q, is
sequentially consistent. This follows from the fact that the sequence
S = Q.enqueuej (b), Q.dequeuej () → b, Q.enqueuei (b)
is legal and respects the order of the operations in each process. As this computation
involves a single object, trivially Q is sequentially consistent.
Let us now consider the computation described in Fig. 17.4, which involves two
queues, Q and Q′ . It is easy to see that each queue is sequentially consistent. The
existence of the previous sequence S proves it for Q, and the existence of the fol-
lowing sequence proves it for Q′ ,
S′ = Q′.enqueuei (a′ ), Q′.dequeuei () → a′ , Q′.enqueuej (b′ ).
Proof Assuming that OP is legal and satisfies the constraint WW, let us consider
the directed graph denoted G and defined as follows. Its vertices are the operations,
and there is an edge from op1 to op2 if (a) op1 →ww op2 , or (b) op1 →rf op2 , or
(c) op1 and op2 have been issued by the same process with op1 first (process order
at each pi ). As OP is acyclic, so is G. The proof consists of two steps.
First step. For each object X, we add edges to G so that all (conflicting) opera-
tions on X are ordered, and these additions preserve the acyclicity and legality of
the modified graph G.
Let us consider a register X. As all write operations on X are totally ordered,
we have only to order the read operations on X with respect to the write operations
on X. Let ri (X)a be a read operation. As ri (X)a is legal (assumption), there exists
wj (X)a such that wj (X)a →rf ri (X)a. (See Fig. 17.5, where the label associated
with an edge explains it.)
Let wk (X) be any write operation such that wj (X)a →ww wk (X). For any such
wk (X), let us add an edge from ri (X)a to wk (X) (dotted edge in the figure). This
addition cannot create a cycle because, due to the legality of G, there is no path from
wk (X) to ri (X)a. Let us now show that this addition preserves the legality of G.
Legality can be violated if adding an edge creates a path from some write op-
eration wx to some read operation ry . Hence, let us assume that the addition of the
edge from ri (X)a to wk (X) creates a new path from wx to ry . It follows that, before
adding the edge from ri (X)a to wk (X), there was a path from wx to ri (X)a and a
path from wk (X) to ry . Due to the relation →ww, two cases have to be analyzed.
• wx = wk (X) or wx →ww wk (X). In this case, there is a path from wx to ry which
does not use the edge from ri (X)a to wk (X). This path goes from wx to wk (X)
and then from wk (X) to ry , and (by induction on the previous edges added to G)
none of these paths violate legality.
• wk (X) →ww wx . This case implies that there is a path from wk (X) to ri (X)a (this
path goes from wk (X) to wx and then from wx to ri (X)a). But this path violates
the assumption stating that the graph G built before adding the edge from ri (X)a to
wk (X) is legal. Hence, this case cannot occur.
It follows that the addition of the edge from ri (X)a to wk (X) does not violate the
legality of the updated graph G.
Second step. The previous step is repeated until, for each object X, all read op-
erations on X are ordered with respect to all write operations. When this is done,
let us consider any topological sort S of the resulting acyclic graph. It is easy to
see that S preserves process order and legality (i.e., no process reads an overwritten
value). (Reminder: a topological sort of a directed acyclic graph is a sequence of all
its vertices, which preserves their partial ordering.)
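As a small illustration of this second step, once the graph has been completed and remains acyclic, any topological sort of it can serve as the witness sequence S. The Python fragment below uses the standard graphlib module; the operation names and edges are purely illustrative.

from graphlib import TopologicalSorter

# each operation is mapped to the set of its predecessors in the completed graph
edges = {
    "w1(X)a": set(),
    "r2(X)a": {"w1(X)a"},      # read-from edge
    "w3(X)b": {"r2(X)a"},      # edge added during the first step
    "r4(X)b": {"w3(X)b"},
}

S = list(TopologicalSorter(edges).static_order())
print(S)   # e.g. ['w1(X)a', 'r2(X)a', 'w3(X)b', 'r4(X)b']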
The proof of this theorem is similar to the proof of the previous theorem. It is left
to the reader.
Total Order on Write Operations This algorithm uses the total order broadcast
abstraction to order all write operations. Moreover, as we are about to see, the legal-
ity of the read operations is obtained for free with a reading of the local copy of the
register X at pi , hence the name fast read algorithm.
operation X.read() is
(4) return(xi ).
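The write side of the fast read algorithm does not appear in this excerpt; the Python sketch below shows one plausible form of the whole pattern (a write to-broadcasts the new value and waits for its own local to-delivery, while a read is purely local). The to_broadcast/on_to_deliver interface and all names are assumptions of the sketch.

import threading

class FastReadRegister:
    """Sketch of a sequentially consistent register with fast (local) reads."""

    def __init__(self, name, pid, to_broadcast, initial=0):
        self.name, self.pid = name, pid
        self.x = initial                     # local copy x_i of the register X
        self.to_broadcast = to_broadcast     # assumed total order broadcast
        self.write_done = threading.Event()

    def write(self, v):
        self.write_done.clear()
        self.to_broadcast(("SEQ_CONS", self.pid, self.name, v))
        self.write_done.wait()               # wait for the local to-delivery

    def read(self):
        return self.x                        # fast read: no message at all

    def on_to_deliver(self, msg):
        _, j, X, v = msg
        if X == self.name:
            self.x = v                       # writes applied in to-delivery order
            if j == self.pid:
                self.write_done.set()        # the pending local write terminates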
Proof Let us first observe that, thanks to the to-broadcast of all the values which
are written, the algorithm satisfies the WW constraint. Hence, due to Theorem 29,
it only remains to show that no read operation obtains an overwritten value.
To that end, let us construct a sequence S by enriching the total order on write op-
erations (→ww) as follows. Let SEQ _ CONS (j, X, v) and SEQ _ CONS (k, Y, v′ ) be the
messages associated with any two write operations which are consecutive in →ww.
Due to the to-broadcast abstraction, any process to-delivers first SEQ _ CONS (j, X, v)
and then SEQ _ CONS (k, Y, v′ ). For any process pi , let us add to S (while respecting
the process order defined by pi ) all the read operations issued by pi between the time
it to-delivered SEQ _ CONS (j, X, v) and the time it to-delivered SEQ _ CONS (k, Y, v′ ).
It follows from the algorithms that all these read operations obtain the last value
written in each register X, Y , etc., where the meaning of last is with respect to the
total order →ww. It follows that, with respect to this total order,
no read operation obtains an overwritten value, which concludes the proof of the
theorem.
and ensure legality. Ordering all write operations allows for a simpler algorithm,
which ensures more than sequential consistency but less than atomicity.
Proof As before, due to Theorem 29, we have only to show that no read opera-
tion obtains an overwritten value. To that end, let ri (X)a be a read operation and
wj (X, a) be the corresponding write operation. We have to show that there is no
write operation wk (X, b) such that wj (X, a) →op wk (X, b) ∧ wk (X, b) →op ri (X)a.
Let us assume by contradiction that such an operation wk (X, b) exists. There are
two cases.
• k = i, i.e., wk (X, b) and ri (X)a have been issued by the same process pi . It
follows from the read algorithm that nb_writei = 0 when ri (X)a is executed.
As wk (X, b) →op ri (X)a, this means that xi has been updated to the value b and
due to total order broadcast, this occurs after xi has been assigned the value a. It
follows that pi cannot return the value a, a contradiction.
• k ≠ i, i.e., wk (X, b) and ri (X)a have been issued by different processes.
Since wk (X, b) →op ri (X)a and a ≠ b, there is an operation operi () (issued
by pi ) such that wk (X, b) →op operi () →op ri (X)a (otherwise, pi would have
read b). There are two subcases according to the fact that operi () is a read or a
write operation.
operation X.read() is
(4) wait (nb_writei = 0);
(5) return(xi ).
An interesting property of the previous total order-based fast read and fast write
algorithms lies in the fact that their skeleton (namely, total order broadcast and a
fast operation) can be used to design an algorithm implementing a fast enqueue
sequentially consistent queue. Such an implementation is presented in Fig. 17.8.
The algorithm implementing the operation Q.enqueue(v) is similar to that of the
fast write algorithm of Fig. 17.7, while the algorithm implementing the operation
Q.dequeue() is similar to that of the corresponding read algorithm. The algorithm
assumes that the default value ⊥ can neither be enqueued nor represent the empty
queue.
Fig. 17.8 Fast enqueue algorithm implementing a sequentially consistent queue (code for pi )
operation Q.enqueue(v) is
(1) to_broadcast SEQ _ CONS (i, Q, enq, v);
(2) return().
operation Q.dequeue() is
(3) resulti ← ⊥;
(4) to_broadcast SEQ _ CONS (i, Q, deq, −);
(5) wait (resulti ≠ ⊥);
(6) return(resulti ).
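The processing of the to-delivered SEQ _ CONS () messages is not part of this excerpt; the Python sketch below shows one plausible completion, in which every process applies the enqueue and dequeue operations to its local copy of Q in to-delivery order. All names are illustrative.

import threading
from collections import deque

class SeqConsistentQueue:
    """Sketch of a sequentially consistent queue with fast enqueues."""

    def __init__(self, pid, to_broadcast):
        self.pid = pid
        self.to_broadcast = to_broadcast     # assumed total order broadcast
        self.local = deque()                 # local copy of Q at p_i
        self.result = None                   # result_i of the pending dequeue
        self.got_result = threading.Event()

    def enqueue(self, v):                    # fast: to-broadcast and return
        self.to_broadcast(("SEQ_CONS", self.pid, "enq", v))

    def dequeue(self):                       # waits for its own to-delivery
        self.got_result.clear()
        self.to_broadcast(("SEQ_CONS", self.pid, "deq", None))
        self.got_result.wait()               # stands for wait(result_i != bottom)
        return self.result

    def on_to_deliver(self, msg):
        _, j, kind, v = msg                  # same delivery order at every process
        if kind == "enq":
            self.local.append(v)
        else:
            item = self.local.popleft() if self.local else "EMPTY"
            if j == self.pid:                # the dequeue was issued locally
                self.result = item
                self.got_result.set()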
process. It manages all the objects. Moreover, each process pi manages a copy of
each object, and its object copies behave as a cache memory. As previously, for any
object Y (capital letter), yi (lowercase letter) denotes its local copy at pi .
When it invokes X.write(a), a process contacts psm , which thereby defines a to-
tal order on all the write operations. It also contacts psm when it invokes a read
operation Y.read() and its local copy of Y is not up to date (i.e., yi = ⊥).
The object manager process psm keeps track of which processes have the last
value of each object. In that way, when psm answers a request from a process pi ,
psm can inform pi on which of its local copies are no longer up to date.
operation X.read() is
(5) if (xi = ⊥)
(6) then send READ _ REQ (X) to psm ;
(7) wait (READ _ ACK (inval, v) from psm );
(8) for each Y ∈ inval do yi ← ⊥ end for;
(9) xi ← v
(10) end if;
(11) return(xi ).
which pi does not have the last value, and sends this set to pi (lines 15–16). The behavior
of psm when it receives READ _ REQ (X) is similar.
which is the value of xsm when psm sent an acknowledgment to pi to terminate op1 .
This is illustrated in Fig. 17.10.
It follows that all the read operations of pi that occurred between op1 and op2
appear as having occurred after op1 and before the first operation processed by psm
after op1 (these read operations are inside an ellipsis and an arrow shows the point
at which—from the “read from” relation point of view—these operations seem to
have logically occurred). Hence, no read operation returns an overwritten value, and
the computation is legal, which concludes the proof of the theorem.
Ensuring the WW Constraint and the Legality of Read Operations The pro-
cess psm used in the previous algorithm is static and each pi sends (receives) mes-
sages to (from) it. The key point is that psm orders all write operations. Another
way to order all write operations consists in using a dynamic approach, namely a
navigating token (see Chap. 5). To write a value in a register, a process has first to
acquire the token. As there is a single token, this generates a total order on all the
write operations.
The legality of read operations can be ensured by requiring the token to carry
the same Boolean array hlv[1..n, [X, Y, . . .]] as the one used in the algorithm of
Fig. 17.9. Moreover, as the current owner of the token is the only process that can
read and write this array, the current owner can use it to provide the process pj to
which it sends the token with up to date values.
operation X.write(v) is
(1) acquire_token();
(2) let (hlv, new_values) be the pair carried by the token;
(3) for each (Y, w) ∈ new_values do yi ← w; hlv[i, Y ] ← true end for;
(4) hlv[1..n, X] ← [false, . . . , false];
(5) xi ← v; hlv[i, X] ← true;
(6) let pj be the process to which the token is sent;
(7) let new_values = {(Y, yi ) | ¬hlv[j, Y ]};
(8) add the pair (hlv, new_values) to the token;
(9) release_token() % for pj %.
operation X.read() is
(10) return(xi ).
writes the new value of X into xi and updates accordingly hlv[i, X] and the other
entries hlv[j, X] such that j ≠ i (lines 4–5).
On the token side, pi then computes, with the help of the vector hlv[j, [X, Y, . . .]],
the set of pairs (Y, yi ) for which the next process pj that will have the token does
not have the last values (lines 6–7). After it has added this set of pairs to the token
(line 8), pi releases the token, which is sent to pj (line 9). Finally, read operations
are purely local (line 10).
Let us notice that, when it has the token, a process can issue several write oper-
ations on the same or several registers. The only requirement is that a process that
wants to write eventually must acquire the token.
On the Moves of the Token The algorithm assumes that when a process releases
the token, it knows the next user of the token. This can be easily implemented by
having the token move on a ring. In this case, the token acts as an object used to
update the local memories of the processes with the last values written. Between
two consecutive visits of the token, a process reads its local copies. The consistency
argument is the same as the one depicted in Fig. 17.10.
The structural view is described in Fig. 17.12. There is a manager process pX per
object X, and the set of managers {pA , . . . , pZ } implements, in a distributed way,
the whole shared memory.
Let us consider the computation described in Fig. 17.13, where X and Y are initial-
ized to 0, in which the last read by p2 is missing. As it is equivalent to the following
legal sequence S,
Y.write2 (1), X.write2 (2), X.read1 () → 2, Y.write1 (3), X.read1 () → 2, X.write2 (4),
Fig. 17.14 Sequential consistency with a manager per object: process side
operation X.write(v) is
(1) valid ← {Y | yi ≠ ⊥};
(2) send WRITE _ REQ (X, v, valid) to pX ;
(3) wait (WRITE _ ACK (inval) from pX );
(4) for each Y ∈ inval do yi ← ⊥ end for;
(5) xi ← v.
operation X.read() is
(6) if (xi = ⊥)
(7) then valid ← {Y | yi ≠ ⊥};
(8) send READ _ REQ (X, valid) to pX ;
(9) wait (READ _ ACK (inval, v) from pX );
(10) for each Y ∈ inval do yi ← ⊥ end for;
(11) xi ← v
(12) end if;
(13) return(xi ).
Fig. 17.15 Sequential consistency with a manager per object: manager side
copy of Y at pi is still valid (its future reads by pi will still be legal). The answer
no means that the copy of Y at pi must be invalidated.
The behavior of a manager pY , when it receives a message VALID _ REQ (i) from
a manager pX , is described in Fig. 17.16. When this occurs, pY (conservatively)
asks pX to invalidate the copy of Y at pi if it knows that this copy is no longer up
to date. This is the case if hlwY [i], or if pY is currently processing a write (which,
by construction is a write of Y ). When this occurs, pY answers no. Otherwise, the
copy of Y at pi is still valid, and pY answers yes to pX .
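Neither Fig. 17.15 nor Fig. 17.16 is reproduced in this excerpt; the Python sketch below only illustrates the cooperation pattern just described (one manager per object, VALID _ REQ () exchanges between managers). The helpers request() and send(), and the way the set of valid copies is tracked, are assumptions of the sketch.

class ObjectManager:
    """Rough sketch of the manager p_X of one object X."""

    def __init__(self, name, managers, request, send):
        self.name = name             # the object X managed by this process
        self.managers = managers     # object name -> its manager process
        self.request = request       # request(dest, msg) -> reply (blocking exchange)
        self.send = send             # send(process, msg)
        self.value = 0               # last value of X known by p_X
        self.holders = set()         # processes holding an up-to-date copy of X
        self.writing = False         # p_X is currently serving a write of X

    def on_valid_req(self, i):
        # another manager asks whether the copy of X held by p_i is still valid;
        # the answer is conservative, as explained in the text
        return "yes" if (i in self.holders and not self.writing) else "no"

    def on_write_req(self, i, v, valid):
        # p_i writes v into X; valid is the set of objects p_i holds a copy of
        self.writing = True
        self.value, self.holders = v, {i}
        inval = [Y for Y in valid
                 if Y != self.name
                 and self.request(self.managers[Y], ("VALID_REQ", i)) == "no"]
        self.writing = False
        self.send(i, ("WRITE_ACK", inval))   # p_i invalidates its copies of inval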
17.5.1 Definition
Definition The set of operations that may affect a process pi are its read and
write operations, plus the write operations issued by the other processes. Given a
computation OP, let OPi be the computation from which all the read operations not
issued by pi have been removed.
A computation OP is causally consistent if, for any process pi , there is a legal
sequential computation Hi that is equivalent to OPi .
While this means that, for each process pi , OPi is sequentially consistent, it does
not mean that OP is sequentially consistent. (This type of consistency condition is
oriented toward cooperative work.)
Example 1 Let us first consider the computation described in Fig. 17.2. The fol-
lowing sequential computations Hi and Hj can be associated with pi and pj , respec-
tively:
Hi ≡ R′.writej (0), R′.readi (0), R′.writej (2), R′.readi (2),
Hj ≡ R.writei (0), R.readi (0), R.writej (1), R.readi (1).
As both Hi and Hj are legal, the computation is causally consistent.
Let us now assume that R.write4 (1) →op R.write3 (2). In this case, both H1
and H2 must order R.write4 (1) before R.write3 (2), and for the computation to be
causally consistent, both read operations have to return the value 2. Let us observe
that, in this case, this computation is also atomic.
Let us notice that, in this case, the computation is also sequentially consistent.
If a = 1 the computation is still causally consistent, but is no longer sequentially
consistent. In this case p3 orders the concurrent write operations of R in the reverse
order, i.e., we have
Causal Consistency and Causal Message Delivery It is easy to see that, for
read/write registers, causal consistency is analogous to causal broadcast on message
deliveries. The notion of causal message delivery was introduced in Sect. 12.1, and
its broadcast instance was developed in Sect. 12.3. It states that no message m can
be delivered to a process before the messages broadcast in the causal past of m.
Fig. 17.19 A simple algorithm implementing causal consistency
operation X.write(v) is
(1) co_broadcast CAUSAL _ CONS (X, v);
(2) xi ← v.
operation X.read() is
(3) return(xi ).
For causal consistency, the causal past is defined with respect to the relation →op;
a write operation corresponds to the broadcast of a message, while a read operation
corresponds to a message reception (with the differences that a written value is not
necessarily read, and that the same value can be read several times).
A Simple Algorithm It follows from the previous discussion that a simple way
to implement causal consistency lies in using an underlying causal broadcast algo-
rithm. Several algorithms were presented in Sect. 12.3; they provide the processes
with the operations co_broadcast() and co_deliver(). We consider here the causal
broadcast algorithm presented in Fig. 12.10.
The algorithm is described in Fig. 17.19. Each process manages a copy xi of
every register object X. It is easy to see that both the read operation and the write
operation are fast. This is due to the fact that, as causality involves only the causal
past, no process coordination is required.
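The co-delivery rule associated with Fig. 17.19 is not reproduced in this excerpt; the Python sketch below shows the whole pattern under the assumption that a causal broadcast layer (co_broadcast()/on_co_deliver()) is available. All names are illustrative.

class CausallyConsistentMemory:
    """Sketch of a causally consistent memory built on top of causal broadcast."""

    def __init__(self, pid, co_broadcast, registers):
        self.pid = pid
        self.co_broadcast = co_broadcast     # assumed causal broadcast primitive
        self.copy = dict(registers)          # local copy x_i of every register X

    def write(self, X, v):                   # fast write (lines 1-2 of Fig. 17.19)
        self.co_broadcast(("CAUSAL_CONS", self.pid, X, v))
        self.copy[X] = v

    def read(self, X):                       # fast read: purely local
        return self.copy[X]

    def on_co_deliver(self, msg):            # causal delivery order is sufficient
        _, j, X, v = msg
        if j != self.pid:                    # the local copy was already updated
            self.copy[X] = v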
A Simple Algorithm This section presents a very simple algorithm that imple-
ments causal consistency when there is a single object X. This algorithm is based
on scalar clocks (these logical clocks were introduced in Sect. 7.1.1).
The algorithm is described in Fig. 17.20. The scalar clock of pi is denoted clocki
and is initialized to 0.
operation X.write(v) is
(1) clocki ← clocki + 1;
(2) xi ← v;
(3) for each j ∈ {1, . . . , n} \ {i} do send CAUSAL _ CONS (v, clocki ) to pj end for.
operation X.read() is
(4) return(xi );
As before, the read and write operations are fast. When a process invokes a write
operation it increases its local clock, associates the corresponding date with its write,
and sends the message CAUSAL _ CONS (v, clocki ) to all the other processes. The
scalar clocks establish a total order on the write operations which are causally de-
pendent.
Write operations with the same date w_date are concurrent. Only the first of
them that is received by a process pi is taken into account by pi . The algorithm
considers that, from the receiver pi ’s point of view, the other ones are overwritten
by the first one. Hence, two processes pi and pj can order differently concurrent
write operations.
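The reception rule of Fig. 17.20 is not shown in this excerpt; the Python sketch below gives one possible reception handling that matches the prose (only the first write received with a given date is taken into account). The field names and the clock update on reception are assumptions of the sketch.

class SingleObjectCausalMemory:
    """Sketch of the scalar-clock algorithm for a single register X."""

    def __init__(self, send_to_all):
        self.clock = 0                  # scalar clock clock_i
        self.x = 0                      # local copy x_i of X
        self.last_date = 0              # date of the last write applied locally
        self.send_to_all = send_to_all  # sends a message to all other processes

    def write(self, v):
        self.clock += 1
        self.x, self.last_date = v, self.clock
        self.send_to_all(("CAUSAL_CONS", v, self.clock))

    def read(self):
        return self.x

    def on_receive(self, msg):
        _, v, w_date = msg
        self.clock = max(self.clock, w_date)   # scalar clock update
        if w_date > self.last_date:            # apply only strictly more recent writes
            self.x, self.last_date = v, w_date
        # a write carrying the same (or an older) date is ignored: from p_i's
        # point of view it is overwritten by the write applied first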
17.7 Summary
This chapter introduced sequential consistency and causal consistency, and pre-
sented several algorithms which implement them. As far as sequential consistency
is concerned, it presented two properties (denoted WW and OO) which simplify the
design of algorithms implementing this consistency condition.
17.8 Bibliographic Notes
• Other consistency conditions weaker than sequential consistency have been pro-
posed, e.g., slow memory [194], lazy release consistency [204], and PRAM con-
sistency [237] (to cite a few). See [323] for a short introductory survey.
• Normality is a consistency condition which considers process order and object
order [151]. When operations are on a single object, it is equivalent to sequential
consistency.
• Very general algorithms, which can be instantiated with one of many consistency
conditions, are described in [200, 212].
The practice of sequential computing has greatly benefited from the results of the
theory of sequential computing that were captured in the study of formal languages
and automata theory. Everyone knows what can be computed (computability) and
what can be computed efficiently (complexity). All these results constitute the foun-
dations of sequential computing, which, thanks to them, has become a science.
These theoretical results and algorithmic principles have been described in many
books from which students can learn basic results, algorithms, and principles of se-
quential computing (e.g., [99, 107, 148, 189, 205, 219, 258, 270, 351] to cite a few).
Since Lamport’s seminal paper “Time, clocks, and the ordering of events in a dis-
tributed system”, which appeared in 1978 [226], distributed computing is no longer
a set of tricks or recipes, but a domain of computing science with its own concepts,
methods, and applications. The world is distributed, and today the major part of ap-
plications are distributed. This means that message-passing algorithms are now an
important part of any computing science or computing engineering curriculum.
Thanks to appropriate curricula—and good associated books—students have a
good background in the theory and practice of sequential computing. In the same
spirit, an aim of this book is to try to provide them with an appropriate background
when they have to solve distributed computing problems.
Technology is what makes everyday life easier. Science is what allows us to
transcend it, and capture the deep nature of the objects we are manipulating. To that
end, it provides us with the right concepts to master and understand what we are
doing. Considering failure-free asynchronous distributed computing, an ambition of
this book is to be a step in this direction.
Chapter 11: Conflict graph, deadlock prevention, graph coloring, incremental re-
quests, k-out-of-M problem, permission, resource allocation, resource graph, re-
source type, resource instance, simultaneous requests, static/dynamic (resource)
session, timestamp, total order, waiting chain, wait-for graph.
Chapter 12: Asynchronous system, bounded lifetime message, causal barrier,
causal broadcast, causal message delivery order, circulating token, client/server
broadcast, coordinator process, delivery condition, first in first out (FIFO) channel,
order properties on a channel, size of control information, synchronous system.
Chapter 13: Asynchronous system, client-server hierarchy, communication initia-
tive, communicating sequential processes, crown, deadline-constrained interac-
tion, deterministic vs. nondeterministic context, logically instantaneous commu-
nication, planned vs. forced interaction, rendezvous, multiparty interaction, syn-
chronous communication, synchronous system, token.
Chapter 14: AND receive, asynchronous system, atomic model, counting, diffusing
computation, distributed iteration, global state, k-out-of-n receive statement, loop
invariant, message arrival vs. message reception, network traversal, nondetermin-
istic statement, OR receive statement, reasoned construction, receive statement,
ring, spanning tree, stable property, termination detection, wave.
Chapter 15: AND communication model, cycle, deadlock, deadlock detection,
knot, one-at-a-time model, OR communication model, probe-based algorithm, re-
source vs. message, stable property, wait-for graph.
Chapter 16: Atomicity, composability, concurrent object, consistency condition,
distributed shared memory, invalidation vs. update, linearizability, linearization
point, local property, manager process, object operation, partial order on opera-
tions, read/write register, real time, sequential specification, server process, shared
memory abstraction, total order broadcast abstraction.
Chapter 17: Causal consistency, concurrent object, consistency condition, dis-
tributed shared memory, invalidation, logical time, manager process, OO con-
straint, partial order on operations, read/write register, sequential consistency,
server processes, shared memory abstraction, total order broadcast abstraction,
WW constraint.
broadcast (Chap. 12), and (c) distributed termination detection (Chap. 14), if time
permits.
The spirit of this course is to be an introductory course, giving students a cor-
rect intuition of what distributed algorithms are (they are not simple “extensions”
of sequential algorithms), and showing them that there are problems which are spe-
cific to distributed computing.
• A second one-semester course on distributed computing could first address the
concept of a global state (Chap. 6). The aim is here to give the student a precise
view of what a distributed execution is and introduce the notion of a global state.
Then, the course could develop and illustrate the different notions of logical times
(Chap. 7).
Distributed checkpointing (Chap. 8), synchronizers (Chap. 9), resource alloca-
tion (Chap. 11), rendezvous communication (Chap. 13), and deadlock detection
(Chap. 15), can be used to illustrate the previous notions.
Finally, the meaning and the implementation of a distributed shared memory
(Part VI) could be presented to introduce the notion of a consistency condition,
which is a fundamental notion of distributed computing.
Of course, this book can also be used by engineers and researchers who work
on distributed applications to better understand the concepts and mechanisms that
underlie their work.
A Series of Books
This book completes a series of four books, written by the author, devoted to concur-
rent and distributed computing [315–317]. More precisely, we have the following.
• As has been seen, this book is on elementary distributed computing for failure-
free asynchronous systems.
• The book [317] is on algorithms in asynchronous shared memory systems where
processes can commit crash failures. It focuses on the construction of reliable
concurrent objects in the presence of process crashes.
1. A. Acharya, B.R. Badrinath, Recording distributed snapshot based on causal order of mes-
sage delivery. Inf. Process. Lett. 44, 317–321 (1992)
2. A. Acharya, B.R. Badrinath, Checkpointing distributed application on mobile computers,
in 3rd Int’l Conference on Parallel and Distributed Information Systems (IEEE Press, New
York, 1994), pp. 73–80
3. S. Adve, K. Gharachorloo, Shared memory consistency models. IEEE Comput. 29(12), 66–
76 (1996)
4. S.V. Adve, M.D. Hill, A unified formalization of four shared memory models. IEEE Trans. Par-
allel Distrib. Syst. 4(6), 613–624 (1993)
5. Y. Afek, G.M. Brown, M. Merritt, Lazy caching. ACM Trans. Program. Lang. Syst. 15(1),
182–205 (1993)
6. A. Agarwal, V.K. Garg, Efficient dependency tracking for relevant events in concurrent sys-
tems. Distrib. Comput. 19(3), 163–183 (2007)
7. D. Agrawal, A. El Abbadi, An efficient and fault-tolerant solution for distributed mutual
exclusion. ACM Trans. Comput. Syst. 9(1), 1–20 (1991)
8. D. Agrawal, A. Malpini, Efficient dissemination of information in computer networks. Com-
put. J. 34(6), 534–541 (1991)
9. M. Ahamad, M.H. Ammar, S.Y. Cheung, Multidimensional voting. ACM Trans. Comput.
Syst. 9(4), 399–431 (1991)
10. M. Ahamad, G. Neiger, J.E. Burns, P.W. Hutto, P. Kohli, Causal memory: definitions, imple-
mentation and programming. Distrib. Comput. 9, 37–49 (1995)
11. M. Ahamad, M. Raynal, G. Thia-Kime, An adaptive protocol for implementing causally con-
sistent distributed services, in Proc. 18th Int’l Conference on Distributed Computing Systems
(ICDCS’98) (IEEE Press, New York, 1998), pp. 86–93
12. M. Ahuja, Flush primitives for asynchronous distributed systems. Inf. Process. Lett. 34, 5–12
(1990)
13. M. Ahuja, M. Raynal, An implementation of global flush primitives using counters. Parallel
Process. Lett. 5(2), 171–178 (1995)
14. S. Alagar, S. Venkatesan, An optimal algorithm for distributed snapshots with message causal
ordering. Inf. Process. Lett. 50, 310–316 (1994)
15. A. Alvarez, S. Arévalo, V. Cholvi, A. Fernández, E. Jiménez, On the interconnection of
message passing systems. Inf. Process. Lett. 105(6), 249–254 (2008)
16. L. Alvisi, K. Marzullo, Message logging: pessimistic, optimistic, and causal. IEEE Trans.
Softw. Eng. 24(2), 149–159 (1998)
17. E. Anceaume, J.-M. Hélary, M. Raynal, Tracking immediate predecessors in distributed com-
putations, in Proc. 14th Annual ACM Symposium on Parallel Algorithms and Architectures
(SPAA’002) (ACM Press, New York, 2002), pp. 210–219
18. E. Anceaume, J.-M. Hélary, M. Raynal, A note on the determination of the immediate pre-
decessors in a distributed computation. Int. J. Found. Comput. Sci. 13(6), 865–872 (2002)
19. D. Angluin, Local and global properties in networks of processors, in Proc. 12th ACM Sym-
posium on Theory of Computation (STOC’81) (ACM Press, New York, 1981), pp. 82–93
20. I. Arrieta, F. Fariña, J.-R. Mendívil, M. Raynal, Leader election: from Higham-Przytycka’s
algorithm to a gracefully degrading algorithm, in Proc. 6th Int’l Conference on Complex,
Intelligent, and Software Intensive Systems (CISIS’12) (IEEE Press, New York, 2012), pp.
225–232
21. H. Attiya, S. Chaudhuri, R. Friedman, J.L. Welch, Non-sequential consistency conditions
for shared memory, in Proc. 5th ACM Symposium on Parallel Algorithms and Architectures
(SPAA’93) (ACM Press, New York, 1993), pp. 241–250
22. H. Attiya, M. Snir, M. Warmuth, Computing on an anonymous ring. J. ACM 35(4), 845–876
(1988)
23. H. Attiya, J.L. Welch, Sequential consistency versus linearizability. ACM Trans. Comput.
Syst. 12(2), 91–122 (1994)
24. H. Attiya, J.L. Welch, Distributed Computing: Fundamentals, Simulations and Advanced
Topics, 2nd edn. (Wiley-Interscience, New York, 2004). 414 pages. ISBN 0-471-45324-2
25. B. Awerbuch, A new distributed depth-first search algorithm. Inf. Process. Lett. 20(3), 147–
150 (1985)
26. B. Awerbuch, Reducing complexities of the distributed max-flow and breadth-first algorithms
by means of network synchronization. Networks 15, 425–437 (1985)
27. B. Awerbuch, Complexity of network synchronization. J. ACM 32(4), 804–823 (1985)
28. O. Babaoğlu, E. Fromentin, M. Raynal, A unified framework for the specification and the
run-time detection of dynamic properties in distributed executions. J. Syst. Softw. 33, 287–
298 (1996)
29. O. Babaoğlu, K. Marzullo, Consistent global states of distributed systems: fundamental con-
cepts and mechanisms, in Distributed Systems (ACM/Addison-Wesley Press, New York,
1993), pp. 55–93. Chap. 4
30. R. Bagrodia, Process synchronization: design and performance evaluation for distributed al-
gorithms. IEEE Trans. Softw. Eng. SE15(9), 1053–1065 (1989)
31. R. Bagrodia, Synchronization of asynchronous processes in CSP. ACM Trans. Program.
Lang. Syst. 11(4), 585–597 (1989)
32. H.E. Bal, F. Kaashoek, A. Tanenbaum, Orca: a language for parallel programming of dis-
tributed systems. IEEE Trans. Softw. Eng. 18(3), 180–205 (1992)
33. R. Baldoni, J.M. Hélary, A. Mostéfaoui, M. Raynal, Impossibility of scalar clock-based
communication-induced checkpointing protocols ensuring the RDT property. Inf. Process.
Lett. 80(2), 105–111 (2001)
34. R. Baldoni, J.M. Hélary, A. Mostéfaoui, M. Raynal, A communication-induced checkpoint-
ing protocol that ensures rollback-dependency trackability, in Proc. 27th IEEE Symposium
on Fault-Tolerant Computing (FTCS-27) (IEEE Press, New York, 1997), pp. 68–77
35. R. Baldoni, J.-M. Hélary, M. Raynal, Consistent records in asynchronous computations. Acta
Inform. 35(6), 441–455 (1998)
36. R. Baldoni, J.M. Hélary, M. Raynal, Rollback-dependency trackability: a minimal character-
ization and its protocol. Inf. Comput. 165(2), 144–173 (2001)
37. R. Baldoni, G. Melideo, k-dependency vectors: a scalable causality-tracking protocol, in
Proc. 11th Euromicro Workshop on Parallel, Distributed and Network-Based Processing
(PDP’03) (2003), pp. 219–226
38. R. Baldoni, A. Mostéfaoui, M. Raynal, Causal delivery of messages with real-time data in
unreliable networks. Real-Time Syst. 10(3), 245–262 (1996)
39. R. Baldoni, R. Prakash, M. Raynal, M. Singhal, Efficient delta-causal broadcasting. Comput.
Syst. Sci. Eng. 13(5), 263–270 (1998)
40. R. Baldoni, M. Raynal, Fundamentals of distributed computing: a practical tour of vector
clock systems. IEEE Distrib. Syst. Online 3(2), 1–18 (2002)
41. D. Barbara, H. Garcia Molina, Mutual exclusion in partitioned distributed systems. Distrib.
Comput. 1(2), 119–132 (1986)
42. D. Barbara, H. Garcia Molina, A. Spauster, Increasing availability under mutual exclu-
sion constraints with dynamic vote assignments. ACM Trans. Comput. Syst. 7(7), 394–426
(1989)
43. L. Barenboim, M. Elkin, Deterministic distributed vertex coloring in polylogarithmic time.
J. ACM 58(5), 23 (2011), 25 pages
44. R. Bellman, Dynamic Programming (Princeton University Press, Princeton, 1957)
45. J.-C. Bermond, C. Delorme, J.-J. Quisquater, Strategies for interconnection networks: some
methods from graph theory. J. Parallel Distrib. Comput. 3(4), 433–449 (1986)
46. J.-C. Bermond, J.-C. König, General and efficient decentralized consensus protocols II, in
Proc. Int’l Workshop on Parallel and Distributed Algorithms, ed. by M. Cosnard, P. Quinton,
M. Raynal, Y. Robert (North-Holland, Amsterdam, 1989), pp. 199–210
47. J.-C. Bermond, J.-C. König, Un protocole distribué pour la 2-connexité. TSI. Tech. Sci. In-
form. 10(4), 269–274 (1991)
48. J.-C. Bermond, J.-C. König, M. Raynal, General and efficient decentralized consensus pro-
tocols, in Proc. 2nd Int’l Workshop on Distributed Algorithms (WDAG’87). LNCS, vol. 312
(Springer, Berlin, 1987), pp. 41–56
49. J.-C. Bermond, C. Peyrat, de Bruijn and Kautz networks: a competitor for the hypercube? in
Proc. Int’l Conference on Hypercube and Distributed Computers (North-Holland, Amster-
dam, 1989), pp. 279–284
50. J.M. Bernabéu-Aubán, M. Ahamad, Applying a path-compression technique to obtain an
effective distributed mutual exclusion algorithm, in Proc. 3rd Int’l Workshop on Distributed
Algorithms (WDAG’89). LNCS, vol. 392 (Springer, Berlin, 1989), pp. 33–44
51. A.J. Bernstein, Output guards and non-determinism in “communicating sequential pro-
cesses”. ACM Trans. Program. Lang. Syst. 2(2), 234–238 (1980)
52. B.K. Bhargava, S.-R. Lian, Independent checkpointing and concurrent rollback for recovery
in distributed systems: an optimistic approach, in Proc. 7th IEEE Symposium on Reliable
Distributed Systems (SRDS’88) (IEEE Press, New York, 1988), pp. 3–12
53. K. Birman, T. Joseph, Reliable communication in the presence of failures. ACM Trans. Com-
put. Syst. 5(1), 47–76 (1987)
54. A.D. Birrell, B.J. Nelson, Implementing remote procedure calls. ACM Trans. Comput. Syst.
2(1), 39–59 (1984)
55. K.P. Birman, A. Schiper, P. Stephenson, Lightweight causal and atomic group multicast.
ACM Trans. Comput. Syst. 9(3), 272–314 (1991)
56. H.L. Bodlaender, Some lower bound results for decentralized extrema finding in rings of
processors. J. Comput. Syst. Sci. 42, 97–118 (1991)
57. L. Bougé, Repeated snapshots in distributed systems with synchronous communications and
their implementation in CSP. Theor. Comput. Sci. 49, 145–169 (1987)
58. L. Bougé, N. Francez, A compositional approach to super-imposition, in Proc. 15th Annual
ACM Symposium on Principles of Programming Languages (POPL’88) (ACM Press, New
York, 1988), pp. 240–249
59. A. Boukerche, C. Tropper, A distributed graph algorithm for the detection of local cycles and
knots. IEEE Trans. Parallel Distrib. Syst. 9(8), 748–757 (1998)
60. Ch. Boulinier, F. Petit, V. Villain, Synchronous vs asynchronous unison. Algorithmica 51(1),
61–80 (2008)
61. G. Bracha, S. Toueg, Distributed deadlock detection. Distrib. Comput. 2(3), 127–138 (1987)
62. D. Briatico, A. Ciuffoletti, L.A. Simoncini, Distributed domino-effect free recovery algo-
rithm, in 4th IEEE Symposium on Reliability in Distributed Software and Database Systems
(IEEE Press, New York, 1984), pp. 207–215
63. P. Brinch Hansen, Distributed processes: a concurrent programming concept. Commun.
ACM 21(11), 934–941 (1978)
64. J. Brzezinski, J.-M. Hélary, M. Raynal, Termination detection in a very general distributed
computing model, in Proc. 13th IEEE Int’l Conference on Distributed Computing Systems
(ICDCS’93) (IEEE Press, New York, 1993), pp. 374–381
65. J. Brzezinski, J.-M. Hélary, M. Raynal, M. Singhal, Deadlock models and a general algo-
rithm for distributed deadlock detection. J. Parallel Distrib. Comput. 31(2), 112–125 (1995)
(Erratum printed in Journal of Parallel and Distributed Computing, 32(2), 232 (1996))
66. G.N. Buckley, A. Silberschatz, An effective implementation for the generalized input-output
construct of CSP. ACM Trans. Program. Lang. Syst. 5(2), 223–235 (1983)
67. C. Cachin, R. Guerraoui, L. Rodrigues, Introduction to Reliable and Secure Distributed Pro-
gramming, 2nd edn. (Springer, Berlin, 2012), 367 pages. ISBN 978-3-642-15259-7
68. G. Cao, M. Singhal, A delay-optimal quorum-based mutual exclusion algorithm for dis-
tributed systems. IEEE Trans. Parallel Distrib. Syst. 12(12), 1256–1268 (2001)
69. N. Carriero, D. Gelernter, T.G. Mattson, A.H. Sherman, The Linda alternative to message-
passing systems. Parallel Comput. 20(4), 633–655 (1994)
70. O. Carvalho, G. Roucairol, On the distribution of an assertion, in Proc. First ACM Symposium
on Principles of Distributed Computing (PODC’1982) (ACM Press, New York, 1982), pp.
18–20
71. O. Carvalho, G. Roucairol, On mutual exclusion in computer networks. Commun. ACM
26(2), 146–147 (1983)
72. L.M. Censier, P. Feautrier, A new solution to coherence problems in multicache systems.
IEEE Trans. Comput. C-27(12), 1112–1118 (1978)
73. P. Chandra, A.K. Kshemkalyani, Causality-based predicate detection across space and time.
IEEE Trans. Comput. 54(11), 1438–1453 (2005)
74. S. Chandrasekaran, S. Venkatesan, A message-optimal algorithm for distributed termination
detection. J. Parallel Distrib. Comput. 8(3), 245–252 (1990)
75. K.M. Chandy, L. Lamport, Distributed snapshots: determining global states of distributed
systems. ACM Trans. Comput. Syst. 3(1), 63–75 (1985)
76. K.M. Chandy, J. Misra, Deadlock absence proof for networks of communicating processes.
Inf. Process. Lett. 9(4), 185–189 (1979)
77. K.M. Chandy, J. Misra, Distributed computation on graphs: shortest path algorithms. Com-
mun. ACM 25(11), 833–837 (1982)
78. K.M. Chandy, J. Misra, The drinking philosophers problem. ACM Trans. Program. Lang.
Syst. 6(4), 632–646 (1984)
79. K.M. Chandy, J. Misra, An example of stepwise refinement of distributed programs: quies-
cence detection. ACM Trans. Program. Lang. Syst. 8(3), 326–343 (1986)
80. K.M. Chandy, J. Misra, Parallel Program Design (Addison-Wesley, Reading, 1988), 516
pages
81. K.M. Chandy, J. Misra, L.M. Haas, Distributed deadlock detection. ACM Trans. Comput.
Syst. 1(2), 144–156 (1983)
82. E.J.H. Chang, Echo algorithms: depth-first algorithms on graphs. IEEE Trans. Softw. Eng.
SE-8(4), 391–402 (1982)
83. E.J.H. Chang, R. Roberts, An improved algorithm for decentralized extrema finding in cir-
cular configurations of processes. Commun. ACM 22(5), 281–283 (1979)
84. J.-M. Chang, N.F. Maxemchuk, Reliable broadcast protocols. ACM Trans. Comput. Syst.
2(3), 251–273 (1984)
85. A. Charlesworth, The multiway rendezvous. ACM Trans. Program. Lang. Syst. 9, 350–366
(1987)
86. B. Charron-Bost, Concerning the size of logical clocks in distributed systems. Inf. Process.
Lett. 39, 11–16 (1991)
87. B. Charron-Bost, G. Tel, Calcul approché de la borne inférieure de valeurs réparties. Inform.
Théor. Appl. 31(4), 305–330 (1997)
88. B. Charron-Bost, G. Tel, F. Mattern, Synchronous, asynchronous, and causally ordered com-
munications. Distrib. Comput. 9(4), 173–191 (1996)
89. C. Chase, V.K. Garg, Detection of global predicates: techniques and their limitations. Distrib.
Comput. 11(4), 191–201 (1998)
90. D.R. Cheriton, D. Skeen, Understanding the limitations of causally and totally ordered com-
munication, in Proc. 14th ACM Symposium on Operating System Principles (SOSP’93)
(ACM Press, New York, 1993), pp. 44–57
91. T.-Y. Cheung, Graph traversal techniques and the maximum flow problem in distributed com-
putation. IEEE Trans. Softw. Eng. SE-9(4), 504–512 (1983)
92. V. Cholvi, A. Fernández, E. Jiménez, P. Manzano, M. Raynal, A methodological construction
of an efficient sequentially consistent distributed shared memory. Comput. J. 53(9), 1523–
1534 (2010)
93. C.T. Chou, I. Cidon, I. Gopal, S. Zaks, Synchronizing asynchronous bounded delays net-
works, in Proc. 2nd Int’l Workshop on Distributed Algorithms (WDAG’87). LNCS, vol. 312
(Springer, Berlin, 1987), pp. 212–218
94. M. Choy, A.K. Singh, Efficient implementation of synchronous communication over asyn-
chronous networks. J. Parallel Distrib. Comput. 26, 166–180 (1995)
95. I. Cidon, Yet another distributed depth-first search algorithm. Inf. Process. Lett. 26(6), 301–
305 (1988)
96. I. Cidon, An efficient knot detection algorithm. IEEE Trans. Softw. Eng. 15(5), 644–649
(1989)
97. E.G. Coffman Jr., M.J. Elphick, A. Shoshani, System deadlocks. ACM Comput. Surv. 3(2),
67–78 (1971)
98. R. Cooper, K. Marzullo, Consistent detection of global predicates, in Proc. ACM/ONR Work-
shop on Parallel and Distributed Debugging (ACM Press, New York, 1991), pp. 163–173
99. Th.H. Cormen, Ch.E. Leiserson, R.L. Rivest, Introduction to Algorithms (The MIT Press,
Cambridge, 1998), 1028 pages
100. J.-M. Couvreur, N. Francez, M. Gouda, Asynchronous unison, in Proc. 12th IEEE Int’l Con-
ference on Distributed Computing Systems (ICDCS’92) (IEEE Press, New York, 1992), pp.
486–493
101. F. Cristian, Probabilistic clock synchronization. Distrib. Comput. 3(3), 146–158 (1989)
102. F. Cristian, H. Aghili, R. Strong, D. Dolev, Atomic broadcast: from simple message diffusion
to Byzantine agreement. Inf. Comput. 118(1), 158–179 (1995)
103. F. Cristian, F. Jahanian, A timestamping-based checkpointing protocol for long-lived dis-
tributed computations, in Proc. 10th IEEE Symposium on Reliable Distributed Systems
(SRDS’91) (IEEE Press, New York, 1991), pp. 12–20
104. O.P. Damani, Y.-M. Wang, V.K. Garg, Distributed recovery with k-optimistic logging. J. Par-
allel Distrib. Comput. 63(12), 1193–1218 (2003)
105. M.J. Demmer, M. Herlihy, The arrow distributed directory protocol, in Proc. 12th Int’l Sym-
posium on Distributed Computing (DISC’98). LNCS, vol. 1499 (Springer, Berlin, 1998), pp.
119–133
106. P.J. Denning, Virtual memory. ACM Comput. Surv. 2(3), 153–189 (1970)
107. P.J. Denning, J.B. Dennis, J.E. Qualitz, Machines, Languages and Computation (Prentice
Hall, New York, 1978), 612 pages
108. Cl. Diehl, Cl. Jard, Interval approximations of message causality in distributed executions,
in Proc. 9th Annual Symposium on Theoretical Aspects of Computer Science (STACS’92).
LNCS, vol. 577 (Springer, Berlin, 1992), pp. 363–374
109. E.W. Dijkstra, Solution of a problem in concurrent programming control. Commun. ACM
8(9), 569 (1965)
110. E.W. Dijkstra, The structure of “THE” multiprogramming system. Commun. ACM 11(5),
341–346 (1968)
111. E.W. Dijkstra, Hierarchical ordering of sequential processes. Acta Inform. 1, 115–138
(1971)
112. E.W. Dijkstra, Self stabilizing systems in spite of distributed control. Commun. ACM 17,
643–644 (1974)
113. E.W. Dijkstra, Guarded commands, nondeterminacy, and formal derivation of programs.
Commun. ACM 18(8), 453–457 (1975)
114. E.W. Dijkstra, W.H.J. Feijen, A.J.M. van Gasteren, Derivation of a termination detection
algorithm for distributed computations. Inf. Process. Lett. 16(5), 217–219 (1983)
115. E.W. Dijkstra, C.S. Scholten, Termination detection for diffusing computations. Inf. Pro-
cess. Lett. 11(1), 1–4 (1980)
116. D. Dolev, J.Y. Halpern, H.R. Strong, On the possibility and impossibility of achieving clock
synchronization. J. Comput. Syst. Sci. 33(2), 230–250 (1986)
117. D. Dolev, M. Klawe, M. Rodeh, An O(n log n) unidirectional distributed algorithm for ex-
trema finding in a circle. J. Algorithms 3, 245–260 (1982)
118. S. Dolev, Self-Stabilization (The MIT Press, Cambridge, 2000), 197 pages
119. M. Dubois, C. Scheurich, Memory access dependencies in shared memory multiprocessors.
IEEE Trans. Softw. Eng. 16(6), 660–673 (1990)
120. E.N. Elnozahy, L. Alvisi, Y.-M. Wang, D.B. Johnson, A survey of rollback-recovery proto-
cols in message-passing systems. ACM Comput. Surv. 34(3), 375–408 (2002)
121. E. Evangelist, N. Francez, S. Katz, Multiparty interactions for interprocess communication
and synchronization. IEEE Trans. Softw. Eng. 15(11), 1417–1426 (1989)
122. S. Even, Graph Algorithms, 2nd edn. (Cambridge University Press, Cambridge, 2011), 202
pages (edited by G. Even)
123. A. Fekete, N.A. Lynch, L. Shrira, A modular proof of correctness for a network synchro-
nizer, in Proc. 2nd Int’l Workshop on Distributed Algorithms (WDAG’87). LNCS, vol. 312
(Springer, Berlin, 1987), pp. 219–256
124. C.J. Fidge, Timestamps in message-passing systems that preserve the partial ordering, in Proc. 11th Australian Computer Science Conference (1988), pp. 56–66
125. C.J. Fidge, Logical time in distributed computing systems. IEEE Comput. 24(8), 28–33
(1991)
126. C.J. Fidge, Limitation of vector timestamps for reconstructing distributed computations. Inf.
Process. Lett. 68, 87–91 (1998)
127. M.J. Fischer, A. Michael, Sacrificing serializability to attain high availability of data, in Proc.
First ACM Symposium on Principles of Database Systems (PODS’82) (ACM Press, New
York, 1982), pp. 70–75
128. R.W. Floyd, Algorithm 97: shortest path. Commun. ACM 5(6), 345 (1962)
129. J. Fowler, W. Zwaenepoel, Causal distributed breakpoints, in Proc. 10th Int’l IEEE Con-
ference on Distributed Computing Systems (ICDCS’90) (IEEE Press, New York, 1990), pp.
134–141
130. N. Francez, Distributed termination. ACM Trans. Program. Lang. Syst. 2(1), 42–55 (1980)
131. N. Francez, B. Halpern, G. Taubenfeld, Script: a communication abstraction mechanism. Sci.
Comput. Program. 6(1), 35–88 (1986)
132. N. Francez, M. Rodeh, Achieving distributed termination without freezing. IEEE Trans.
Softw. Eng. 8(3), 287–292 (1982)
133. N. Francez, S. Yemini, Symmetric intertask communication. ACM Trans. Program. Lang.
Syst. 7(4), 622–636 (1985)
134. W.R. Franklin, On an improved algorithm for decentralized extrema-finding in circular con-
figurations of processors. Commun. ACM 25(5), 336–337 (1982)
135. U. Fridzke, P. Ingels, A. Mostéfaoui, M. Raynal, Fault-tolerant consensus-based total order
multicast. IEEE Trans. Parallel Distrib. Syst. 13(2), 147–157 (2001)
136. R. Friedman, Implementing hybrid consistency with high-level synchronization operations,
in Proc. 12th Annual ACM Symposium on Principles of Distributed Computing (PODC’93)
(ACM Press, New York, 1993), pp. 229–240
137. E. Fromentin, Cl. Jard, G.-V. Jourdan, M. Raynal, On-the-fly analysis of distributed compu-
tations. Inf. Process. Lett. 54(5), 267–274 (1995)
138. E. Fromentin, M. Raynal, Shared global states in distributed computations. J. Comput. Syst.
Sci. 55(3), 522–528 (1997)
139. E. Fromentin, M. Raynal, V.K. Garg, A.I. Tomlinson, On the fly testing of regular patterns in
distributed computations, in Proc. Int’l Conference on Parallel Processing (ICPP’94) (1994),
pp. 73–76
140. R. Fujimoto, Parallel discrete event simulation. Commun. ACM 33(10), 31–53 (1990)
141. E. Gafni, D. Bertsekas, Distributed algorithms for generating loop-free routes in networks
with frequently changing topologies. IEEE Trans. Commun. C-29(1), 11–18 (1981)
142. R.G. Gallager, Distributed minimum hop algorithms. Tech Report LIDS 1175, MIT, 1982
143. R.G. Gallager, P.A. Humblet, P.M. Spira, A distributed algorithm for minimum-weight span-
ning trees. ACM Trans. Program. Lang. Syst. 5(1), 66–77 (1983)
144. I.C. Garcia, L.E. Buzato, Progressive construction of consistent global checkpoints, in Proc.
19th Int’l Conference on Distributed Computing Systems (ICDCS’99) (IEEE Press, New
York, 1999), pp. 55–62
145. I.C. Garcia, L.E. Buzato, On the minimal characterization of the rollback-dependency tracka-
bility property, in Proc. 21st Int’l Conference on Distributed Computing Systems (ICDCS’01)
(IEEE Press, New York, 2001), pp. 342–349
146. I.C. Garcia, L.E. Buzato, An efficient checkpointing protocol for the minimal characteri-
zation of operational rollback-dependency trackability, in Proc. 23rd Int’l Symposium on
Reliable Distributed Systems (SRDS’04) (IEEE Press, New York, 2004), pp. 126–135
147. H. Garcia Molina, D. Barbara, How to assign votes in a distributed system. J. ACM 32(4),
841–860 (1985)
148. M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-
Completeness (Freeman, New York, 1979), 340 pages
149. V.K. Garg, Principles of Distributed Systems (Kluwer Academic, Dordrecht, 1996), 274
pages
150. V.K. Garg, Elements of Distributed Computing (Wiley-Interscience, New York, 2002), 423
pages
151. V.K. Garg, M. Raynal, Normality: a consistency condition for concurrent objects. Parallel
Process. Lett. 9(1), 123–134 (1999)
152. V.K. Garg, S. Skawratananond, N. Mittal, Timestamping messages and events in a distributed
system using synchronous communication. Distrib. Comput. 19(5–6), 387–402 (2007)
153. V.K. Garg, B. Waldecker, Detection of weak unstable predicates in distributed programs.
IEEE Trans. Parallel Distrib. Syst. 5(3), 299–307 (1994)
154. V.K. Garg, B. Waldecker, Detection of strong unstable predicates in distributed programs.
IEEE Trans. Parallel Distrib. Syst. 7(12), 1323–1333 (1996)
155. Ch. Georgiou, A. Shvartsman, Do-All Computing in Distributed Systems: Cooperation in the
Presence of Adversity (Springer, Berlin, 2008), 219 pages. ISBN 978-0-387-69045-2
156. K. Gharachorloo, P. Gibbons, Detecting violations of sequential consistency, in Proc. 3rd
ACM Symposium on Parallel Algorithms and Architectures (SPAA’91) (ACM Press, New
York, 1991), pp. 316–326
157. K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, J.L. Hennessy, Memory
consistency and event ordering in scalable shared memory multiprocessors, in Proc. 17th
ACM Int’l Symposium on Computer Architecture (ISCA’90) (1990), pp. 15–26
158. A. Gibbons, Algorithmic Graph Theory (Cambridge University Press, Cambridge, 1985),
260 pages
159. D.K. Gifford, Weighted voting for replicated data, in Proc. 7th ACM Symposium on Operat-
ing System Principles (SOSP’79) (ACM Press, New York, 1979), pp. 150–172
160. V.D. Gligor, S.H. Shattuck, Deadlock detection in distributed systems. IEEE Trans. Softw.
Eng. SE-6(5), 435–440 (1980)
161. A.P. Goldberg, A. Gopal, A. Lowry, R. Strom, Restoring consistent global states of dis-
tributed computations, in Proc. ACM/ONR Workshop on Parallel and Distributed Debugging
(ACM Press, New York, 1991), pp. 144–156
162. M. Gouda, T. Herman, Stabilizing unison. Inf. Process. Lett. 35(4), 171–175 (1990)
163. J. Gray, A. Reuter, Transaction Processing: Concepts and Techniques (Morgan Kaufmann,
San Mateo, 1993), 1060 pages. ISBN 1-55860-190-2
164. J.L. Gross, J. Yellen (eds.), Graph Theory (CRC Press, Boca Raton, 2004), 1167 pages
165. A.N. Habermann, Prevention of system deadlocks. Commun. ACM 12(7), 373–377 (1969)
166. J.W. Havender, Avoiding deadlock in multitasking systems. IBM Syst. J. 7(2), 74–84 (1968)
167. J.-M. Hélary, Observing global states of asynchronous distributed applications, in Proc. 3rd
Int’l Workshop on Distributed Algorithms (WDAG’89). LNCS, vol. 392 (Springer, Berlin, 1989), pp. 124–135
168. J.-M. Hélary, M. Hurfin, A. Mostéfaoui, M. Raynal, F. Tronel, Computing global functions in
asynchronous distributed systems with perfect failure detectors. IEEE Trans. Parallel Distrib.
Syst. 11(9), 897–909 (2000)
169. J.-M. Hélary, C. Jard, N. Plouzeau, M. Raynal, Detection of stable properties in distributed
systems, in Proc. 6th ACM Symposium on Principles of Distributed Computing (PODC’87)
(ACM Press, New York, 1987), pp. 125–136
170. J.-M. Hélary, A. Maddi, M. Raynal, Controlling information transfers in distributed applica-
tions, application to deadlock detection, in Proc. Int’l IFIP WG 10.3 Conference on Parallel
Processing (North-Holland, Amsterdam, 1987), pp. 85–92
171. J.-M. Hélary, A. Mostéfaoui, R.H.B. Netzer, M. Raynal, Communication-based prevention
of useless checkpoints in distributed computations. Distrib. Comput. 13(1), 29–43 (2000)
172. J.-M. Hélary, A. Mostéfaoui, M. Raynal, A general scheme for token and tree-based dis-
tributed mutual exclusion algorithms. IEEE Trans. Parallel Distrib. Syst. 5(11), 1185–1196
(1994)
173. J.-M. Hélary, A. Mostéfaoui, M. Raynal, Communication-induced determination of consis-
tent snapshots. IEEE Trans. Parallel Distrib. Syst. 10(9), 865–877 (1999)
174. J.-M. Hélary, A. Mostéfaoui, M. Raynal, Interval consistency of asynchronous distributed
computations. J. Comput. Syst. Sci. 64(2), 329–349 (2002)
175. J.-M. Hélary, R.H.B. Netzer, M. Raynal, Consistency criteria for distributed checkpoints.
IEEE Trans. Softw. Eng. 25(2), 274–281 (1999)
176. J.-M. Hélary, N. Plouzeau, M. Raynal, A distributed algorithm for mutual exclusion in arbi-
trary networks. Comput. J. 31(4), 289–295 (1988)
177. J.-M. Hélary, M. Raynal, Depth-first traversal and virtual ring construction in distributed
systems, in Proc. IFIP WG 10.3 Conference on Parallel Processing (North-Holland, Ams-
terdam, 1988), pp. 333–346
178. J.-M. Hélary, M. Raynal, Vers la construction raisonnée d’algorithmes répartis: le cas de la
terminaison. TSI. Tech. Sci. Inform. 10(3), 203–209 (1991)
179. J.-M. Hélary, M. Raynal, Synchronization and Control of Distributed Systems and Programs
(Wiley, New York, 1991), 160 pages
180. J.-M. Hélary, M. Raynal, Towards the construction of distributed detection programs with an
application to distributed termination. Distrib. Comput. 7(3), 137–147 (1994)
181. J.-M. Hélary, M. Raynal, G. Melideo, R. Baldoni, Efficient causality-tracking timestamping.
IEEE Trans. Knowl. Data Eng. 15(5), 1239–1250 (2003)
182. M. Herlihy, F. Kuhn, S. Tirthapura, R. Wattenhofer, Dynamic analysis of the arrow distributed
protocol. Theory Comput. Syst. 39(6), 875–901 (2006)
183. M. Herlihy, J. Wing, Linearizability: a correctness condition for concurrent objects. ACM
Trans. Program. Lang. Syst. 12(3), 463–492 (1990)
184. L. Higham, T. Przytycka, A simple efficient algorithm for maximum finding on rings. Inf.
Process. Lett. 58(6), 319–324 (1996)
185. D.S. Hirschberg, J.B. Sinclair, Decentralized extrema finding in circular configuration of
processors. Commun. ACM 23, 627–628 (1980)
186. C.A.R. Hoare, Communicating sequential processes. Commun. ACM 21(8), 666–677 (1978)
187. W. Hohberg, How to find biconnected components in distributed networks. J. Parallel Distrib.
Comput. 9(4), 374–386 (1990)
188. R.C. Holt, Comments on prevention of system deadlocks. Commun. ACM 14(1), 36–38
(1971)
189. J.E. Hopcroft, R. Motwani, J.D. Ullman, Introduction to Automata Theory, Languages and
Computation, 2nd edn. (Addison-Wesley, Reading, 2001), 521 pages
190. S.-T. Huang, Termination detection by using distributed snapshots. Inf. Process. Lett. 32(3),
113–119 (1989)
191. S.-T. Huang, Detecting termination of distributed computations by external agents, in Proc.
9th IEEE Int’l Conference on Distributed Computing Systems (ICDCS’89) (IEEE Press, New
York, 1989), pp. 79–84
192. M. Hurfin, N. Plouzeau, M. Raynal, Detecting atomic sequences of predicates in distributed
computations. SIGPLAN Not. 28(12), 32–42 (1993). Proc. ACM/ONR Workshop on Parallel
and Distributed Debugging
193. M. Hurfin, M. Mizuno, M. Raynal, M. Singhal, Efficient distributed detection of conjunctions
of local predicates. IEEE Trans. Softw. Eng. 24(8), 664–677 (1998)
194. P. Hutto, M. Ahamad, Slow memory: weakening consistency to enhance concurrency in dis-
tributed shared memories, in Proc. 10th IEEE Int’l Conference on Distributed Computing
Systems (ICDCS’90) (IEEE Press, New York, 1990), pp. 302–311
195. T. Ibaraki, T. Kameda, A theory of coteries: mutual exclusion in distributed systems. IEEE Trans. Parallel Distrib. Syst. 4(7), 779–794 (1993)
196. T. Ibaraki, T. Kameda, T. Minoura, Serializability with constraints. ACM Trans. Database
Syst. 12(3), 429–452 (1987)
197. R. Ingram, P. Shields, J.E. Walter, J.L. Welch, An asynchronous leader election algorithm for
dynamic networks, in Proc. 23rd Int’l IEEE Parallel and Distributed Processing Symposium
(IPDPS’09) (IEEE Press, New York, 2009), pp. 1–12
198. Cl. Jard, G.-V. Jourdan, Incremental transitive dependency tracking in distributed computa-
tions. Parallel Process. Lett. 6(3), 427–435 (1996)
199. D. Jefferson, Virtual time. ACM Trans. Program. Lang. Syst. 7(3), 404–425 (1985)
200. E. Jiménez, A. Fernández, V. Cholvi, A parameterized algorithm that implements sequential,
causal, and cache memory consistencies. J. Syst. Softw. 81(1), 120–131 (2008)
201. Ö. Johansson, Simple distributed (Δ + 1)-coloring of graphs. Inf. Process. Lett. 70(5), 229–
232 (1999)
202. D.B. Johnson, W. Zwaenepoel, Recovery in distributed systems using optimistic message
logging and checkpointing. J. Algorithms 11(3), 462–491 (1990)
203. S. Kanchi, D. Vineyard, An optimal distributed algorithm for all-pairs shortest-path. Int. J.
Inf. Theories Appl. 11(2), 141–146 (2004)
204. P. Keleher, A.L. Cox, W. Zwaenepoel, Lazy release consistency for software distributed
shared memory, in Proc. 19th ACM Int’l Symposium on Computer Architecture (ISCA’92),
(1992), pp. 13–21
205. J. Kleinberg, E. Tardos, Algorithm Design (Addison-Wesley, Reading, 2005), 838 pages
206. P. Knapp, Deadlock detection in distributed databases. ACM Comput. Surv. 19(4), 303–328
(1987)
207. R. Koo, S. Toueg, Checkpointing and rollback-recovery for distributed systems. IEEE Trans.
Softw. Eng. 13(1), 23–31 (1987)
208. E. Korach, S. Moran, S. Zaks, Tight lower and upper bounds for some distributed algorithms
for a complete network of processors, in Proc. 4th ACM Symposium on Principles of Dis-
tributed Computing (PODC’84) (ACM Press, New York, 1984), pp. 199–207
209. E. Korach, S. Moran, S. Zaks, The optimality of distributive constructions of minimum
weight and degree restricted spanning tree in complete networks of processes. SIAM J. Com-
put. 16(2), 231–236 (1987)
210. E. Korach, D. Rotem, N. Santoro, Distributed algorithms for finding centers and medians in
networks. ACM Trans. Program. Lang. Syst. 6(3), 380–401 (1984)
211. E. Korach, G. Tel, S. Zaks, Optimal synchronization of ABD networks, in Proc. Int’l Con-
ference on Concurrency. LNCS, vol. 335 (Springer, Berlin, 1988), pp. 353–367
212. R. Kordale, M. Ahamad, A scalable technique for implementing multiple consistency lev-
els for distributed objects, in Proc. 16th IEEE Int’l Conference on Distributed Computing
Systems (ICDCS’96) (IEEE Press, New York, 1996), pp. 369–376
213. A.D. Kshemkalyani, Fast and message-efficient global snapshot algorithms for large-scale
distributed systems. IEEE Trans. Parallel Distrib. Syst. 21(9), 1281–1289 (2010)
214. A.D. Kshemkalyani, M. Raynal, M. Singhal, Global snapshots of a distributed system. Dis-
trib. Syst. Eng. 2(4), 224–233 (1995)
215. A.D. Kshemkalyani, M. Singhal, Invariant-based verification of a distributed deadlock de-
tection algorithm. IEEE Trans. Softw. Eng. 17(8), 789–799 (1991)
216. A.D. Kshemkalyani, M. Singhal, Efficient detection and resolution of generalized distributed
deadlocks. IEEE Trans. Softw. Eng. 20(1), 43–54 (1994)
217. A.D. Kshemkalyani, M. Singhal, Necessary and sufficient conditions on information for
causal message ordering and their optimal implementation. Distrib. Comput. 11(2), 91–111
(1998)
218. A.D. Kshemkalyani, M. Singhal, A one-phase algorithm to detect distributed deadlocks in
replicated databases. IEEE Trans. Knowl. Data Eng. 11(6), 880–895 (1999)
219. A.D. Kshemkalyani, M. Singhal, Distributed Computing: Principles, Algorithms and Sys-
tems (Cambridge University Press, Cambridge, 2008), 736 pages
220. A.D. Kshemkalyani, M. Singhal, Efficient distributed snapshots in an anonymous asyn-
chronous message-passing system. J. Parallel Distrib. Comput. 73, 621–629 (2013)
221. T.-H. Lai, Termination detection for dynamically distributed systems with non-first-in-first-
out communication. J. Parallel Distrib. Comput. 3(4), 577–599 (1986)
222. T.H. Lai, T.H. Yang, On distributed snapshots. Inf. Process. Lett. 25, 153–158 (1987)
223. T.V. Lakshman, A.K. Agrawala, Efficient decentralized consensus protocols. IEEE Trans.
Softw. Eng. SE-12(5), 600–607 (1986)
224. K.B. Lakshmanan, N. Meenakshi, K. Thulasiraman, A time-optimal message-efficient dis-
tributed algorithm for depth-first search. Inf. Process. Lett. 25, 103–109 (1987)
225. K.B. Lakshmanan, K. Thulasiraman, On the use of synchronizers for asynchronous com-
munication networks, in Proc. 2nd Int’l Workshop on Distributed Algorithms (WDAG’87).
LNCS, vol. 312 (Springer, Berlin, 1987), pp. 257–267
226. L. Lamport, Time, clocks, and the ordering of events in a distributed system. Commun. ACM
21(7), 558–565 (1978)
227. L. Lamport, How to make a multiprocessor computer that correctly executes multiprocess
programs. IEEE Trans. Comput. C-28(9), 690–691 (1979)
228. L. Lamport, On inter-process communications, part I: basic formalism. Distrib. Comput.
1(2), 77–85 (1986)
229. L. Lamport, On inter-process communications, part II: algorithms. Distrib. Comput. 1(2),
86–101 (1986)
230. L. Lamport, P.M. Melliar-Smith, Synchronizing clocks in the presence of faults. J. ACM
32(1), 52–78 (1985)
231. Y. Lavallée, G. Roucairol, A fully distributed minimal spanning tree algorithm. Inf. Process.
Lett. 23(2), 55–62 (1986)
232. G. Le Lann, Distributed systems: towards a formal approach, in IFIP World Congress,
(1977), pp. 155–160
233. I. Lee, S.B. Davidson, Adding time to synchronous processes. IEEE Trans. Comput. C-36(8),
941–948 (1987)
234. K. Li, P. Hudak, Memory coherence in shared virtual memory systems. ACM Trans. Com-
put. Syst. 7(4), 321–359 (1989)
235. T.F. Li, Th. Radhakrishnan, K. Venkatesh, Global state detection in non-FIFO networks, in
Proc. 7th Int’l Conference on Distributed Computing Systems (ICDCS’87) (IEEE Press, New
York, 1987), pp. 364–370
236. N. Linial, Locality in distributed graph algorithms. SIAM J. Comput. 21(1), 193–201 (1992)
237. R.J. Lipton, J.S. Sandberg, PRAM: a scalable shared memory. Tech Report CS-TR-180-88,
Princeton University, 1988
238. B. Liskov, R. Ladin, Highly available distributed services and fault-tolerant distributed
garbage collection, in Proc. 5th ACM Symposium on Principles of Distributed Computing
(PODC’86) (ACM Press, New York, 1986), pp. 29–39
239. S. Lodha, A.D. Kshemkalyani, A fair distributed mutual exclusion algorithm. IEEE Trans.
Parallel Distrib. Syst. 11(6), 537–549 (2000)
240. M. Luby, A simple parallel algorithm for the maximal independent set problem. SIAM J.
Comput. 15(4), 1036–1053 (1986)
241. N.A. Lynch, Upper bounds for static resource allocation in a distributed system. J. Comput.
Syst. Sci. 23(2), 254–278 (1981)
242. N.A. Lynch, Distributed Algorithms (Morgan Kaufmann, San Francisco, 1996), 872 pages
243. M. Maekawa, A √N algorithm for mutual exclusion in decentralized systems. ACM Trans. Comput. Syst. 3(2), 145–159 (1985)
244. N. Malpani, J.L. Welch, N. Vaidya, Leader election algorithms for mobile ad hoc networks,
in Proc. 4th Int’l ACM Workshop on Discrete Algorithms and Methods for Mobile Computing
and Communications (DIAL-M’00) (ACM Press, New York, 2000), pp. 96–103
245. Y. Manabe, R. Baldoni, M. Raynal, S. Aoyagi, k-arbiter: a safe and general scheme for h-out-of-k mutual exclusion. Theor. Comput. Sci. 193(1–2), 97–112 (1998)
246. D. Manivannan, R.H.B. Netzer, M. Singhal, Finding consistent global checkpoints in a dis-
tributed computation. IEEE Trans. Parallel Distrib. Syst. 8(6), 623–627 (1997)
247. D. Manivannan, M. Singhal, A low overhead recovery technique using quasi-synchronous
checkpointing, in Proc. 16th IEEE Int’l Conference on Distributed Computing Systems
(ICDCS’96) (IEEE Press, New York, 1996), pp. 100–107
248. D. Manivannan, M. Singhal, An efficient distributed algorithm for detection of knots and
cycles in a distributed graph. IEEE Trans. Parallel Distrib. Syst. 14(10), 961–972 (2003)
249. F. Mattern, Algorithms for distributed termination detection. Distrib. Comput. 2(3), 161–175
(1987)
250. F. Mattern, Virtual time and global states of distributed systems, in Proc. Parallel and Dis-
tributed Algorithms Conference, ed. by M. Cosnard, P. Quinton, M. Raynal, Y. Robert (North-
Holland, Amsterdam, 1988), pp. 215–226
251. F. Mattern, Global quiescence detection based on credit distribution and recovery. Inf. Pro-
cess. Lett. 30(4), 195–200 (1989)
252. F. Mattern, An efficient distributed termination test. Inf. Process. Lett. 31(4), 203–208 (1989)
253. F. Mattern, Efficient algorithms for distributed snapshots and global virtual time approxima-
tion. J. Parallel Distrib. Comput. 18, 423–434 (1993)
254. F. Mattern, Distributed algorithms and causally consistent observations, in Proc. 16th Int’l
Conference on Application and Theory of Petri Nets, (Invited Paper). LNCS, vol. 935
(Springer, Berlin, 1995), pp. 21–22
255. F. Mattern, S. Fünfrocken, A non-blocking lightweight implementation of causal order mes-
sage delivery, in Proc. Int’l Dagstuhl Workshop on Theory and Practice in Distributed Sys-
tems. LNCS, vol. 938 (Springer, Berlin, 1995), pp. 197–213
256. M. Mavronicolas, D. Roth, Efficient, strong consistent implementations of shared memory, in
Proc. 6th Int’l Workshop on Distributed Algorithms (WDAG’92). LNCS, vol. 647 (Springer,
Berlin, 1992), pp. 346–361
257. J. Mayo, Ph. Kearns, Efficient distributed termination detection with roughly synchronized
clocks. Inf. Process. Lett. 52(2), 105–108 (1994)
258. K. Mehlhorn, P. Sanders, Algorithms and Data Structures (Springer, Berlin, 2008), 300
pages
259. D. Menasce, R. Muntz, Locking and deadlock detection in distributed database. IEEE Trans.
Softw. Eng. SE-5(3), 195–202 (1979)
260. J.R. Mendívil, F. Fariña, C.F. Garitagoitia, C.F. Alastruey, J.M. Barnabeu-Auban, A dis-
tributed deadlock resolution algorithm for the AND model. IEEE Trans. Parallel Distrib.
Syst. 10(5), 433–447 (1999)
261. J. Misra, Detecting termination of distributed computations using markers, in Proc. 2nd ACM
Symposium on Principles of Distributed Computing (PODC’83) (ACM Press, New York,
1983), pp. 290–294
262. J. Misra, Axioms for memory access in asynchronous hardware systems. ACM Trans. Pro-
gram. Lang. Syst. 8(1), 142–153 (1986)
263. J. Misra, Distributed discrete event simulation. ACM Comput. Surv. 18(1), 39–65 (1986)
264. J. Misra, K.M. Chandy, A distributed graph algorithm: knot detection. ACM Trans. Program.
Lang. Syst. 4(4), 678–686 (1982)
265. J. Misra, K.M. Chandy, Termination detection of diffusing computations in communicating
sequential processes. ACM Trans. Program. Lang. Syst. 4(1), 37–43 (1982)
266. D.P. Mitchell, M. Merritt, A distributed algorithm for deadlock detection and resolution, in
Proc. 3rd ACM Symposium on Principles of Distributed Computing (PODC’84) (ACM Press,
New York, 1984), pp. 282–284
267. N. Mittal, V.K. Garg, Consistency conditions for multi-objects operations, in Proc. 18th IEEE
Int’l Conference on Distributed Computing Systems (ICDCS’98) (IEEE Press, New York,
1998), pp. 582–589
268. M. Mizuno, M.L. Nielsen, M. Raynal, An optimistic protocol for a linearizable distributed
shared memory service. Parallel Process. Lett. 6(2), 265–278 (1996)
269. M. Mizuno, M. Raynal, J.Z. Zhou, Sequential consistency in distributed systems, in Int’l
Dagstuhl Workshop on the Theory and Practice in Distributed Systems. LNCS, vol. 938
(Springer, Berlin, 1994), pp. 224–241
270. B. Moret, The Theory of Computation (Addison-Wesley, Reading, 1998), 453 pages
271. A. Mostéfaoui, M. Raynal, Efficient message logging for uncoordinated checkpointing pro-
tocols, in Proc. 2nd European Dependable Computing Conference (EDCC’96). LNCS,
vol. 1150 (Springer, Berlin, 1996), pp. 353–364
272. A. Mostéfaoui, M. Raynal, P. Veríssimo, Logically instantaneous communication on top of
distributed memory parallel machines, in Proc. 5th Int’l Conference on Parallel Computing
Technologies (PACT’99). LNCS, vol. 1662 (Springer, Berlin, 1999), pp. 258–270
273. V.V. Murty, V.K. Garg, An algorithm to guarantee synchronous ordering of messages, in
Proc. 2nd Int’l IEEE Symposium on Autonomous Decentralized Systems (IEEE Press, New
York, 1995), pp. 208–214
274. V.V. Murty, V.K. Garg, Characterization of message ordering specifications and protocols, in
Proc. 17th Int’l Conference on Distributed Computing Systems (ICDCS’97) (IEEE Press, New
York, 1997), pp. 492–499
275. M. Naimi, M. Trehel, An improvement of the log n distributed algorithm for mutual exclu-
sion, in Proc. 7th Int’l IEEE Conference on Distributed Computing Systems (ICDCS’87)
(IEEE Press, New York, 1987), pp. 371–375
276. M. Naimi, M. Trehel, A. Arnold, A log(n) distributed mutual exclusion algorithm based on
path reversal. J. Parallel Distrib. Comput. 34(1), 1–13 (1996)
277. M. Naor, A. Wool, The load, capacity and availability of quorum systems. SIAM J. Comput. 27(2), 423–447 (1998)
278. N. Natarajan, A distributed scheme for detecting communication deadlocks. IEEE Trans.
Softw. Eng. 12(4), 531–537 (1986)
279. M.L. Neilsen, M. Mizuno, A DAG-based algorithm for distributed mutual exclusion, in Proc.
11th IEEE Int’l Conference on Distributed Computing Systems (ICDCS’91) (IEEE
Press, New York, 1991), pp. 354–360
280. M.L. Neilsen, M. Mizuno, Nondominated k-coteries for multiple mutual exclusion. Inf. Pro-
cess. Lett. 50(5), 247–252 (1994)
281. M.L. Neilsen, M. Mizuno, M. Raynal, A general method to define quorums, in Proc. 12th
Int’l IEEE Conference on Distributed Computing Systems (ICDCS’92) (IEEE Press, New
York, 1992), pp. 657–664
282. M. Nesterenko, M. Mizuno, A quorum-based self-stabilizing distributed mutual exclusion
algorithm. J. Parallel Distrib. Comput. 62(2), 284–305 (2002)
283. R.H.B. Netzer, J. Xu, Necessary and sufficient conditions for consistent global snapshots.
IEEE Trans. Parallel Distrib. Syst. 6(2), 165–169 (1995)
284. N. Neves, W.K. Fuchs, Adaptive recovery for mobile environments. Commun. ACM 40(1),
68–74 (1997)
285. S. Nishio, K.F. Li, F.G. Manning, A resilient distributed mutual exclusion algorithm for com-
puter networks. IEEE Trans. Parallel Distrib. Syst. 1(3), 344–356 (1990)
286. B. Nitzberg, V. Lo, Distributed shared memory: a survey of issues and algorithms. IEEE
Comput. 24(8), 52–60 (1991)
287. R. Obermarck, Distributed deadlock detection algorithm. ACM Trans. Database Syst. 7(2),
197–208 (1982)
288. J.K. Pachl, E. Korach, D. Rotem, Lower bounds for distributed maximum-finding algorithms.
J. ACM 31(4), 905–918 (1984)
289. Ch.H. Papadimitriou, The serializability of concurrent database updates. J. ACM 26(4), 631–
653 (1979)
290. D.S. Parker, G.L. Popek, G. Rudisin, L. Stoughton, B.J. Walker, E. Walton, J.M. Chow, D.A.
Edwards, S. Kiser, C.S. Kline, Detection of mutual inconsistency in distributed systems.
IEEE Trans. Softw. Eng. SE9(3), 240–246 (1983)
291. B. Patt-Shamir, S. Rajsbaum, A theory of clock synchronization, in Proc. 26th Annual ACM
Symposium on Theory of Computing (STOC’94) (ACM Press, New York, 1994), pp. 810–
819
292. D. Peleg, Distributed Computing: A Locality-Sensitive Approach. SIAM Monographs on Dis-
crete Mathematics and Applications (2000), 343 pages
293. D. Peleg, J.D. Ullman, An optimal synchronizer for the hypercube. SIAM J. Comput. 18,
740–747 (1989)
294. D. Peleg, A. Wool, Crumbling walls: a class of practical and efficient quorum systems. Dis-
trib. Comput. 10(2), 87–97 (1997)
295. G.L. Peterson, An O(n log n) unidirectional algorithm for the circular extrema problem.
ACM Trans. Program. Lang. Syst. 4(4), 758–762 (1982)
296. L.L. Peterson, N.C. Bucholz, R.D. Schlichting, Preserving and using context information in
interprocess communication. ACM Trans. Comput. Syst. 7(3), 217–246 (1989)
297. S.E. Pomares Hernadez, J.R. Perez Cruz, M. Raynal, From the happened before relation to
the causal ordered set abstraction. J. Parallel Distrib. Comput. 72, 791–795 (2012)
298. R. Prakash, M. Raynal, M. Singhal, An adaptive causal ordering algorithm suited to mobile
computing environments. J. Parallel Distrib. Comput. 41(1), 190–204 (1997)
299. R. Prakash, M. Singhal, Low-cost checkpointing and failure recovery in mobile computing
systems. IEEE Trans. Parallel Distrib. Syst. 7(10), 1035–1048 (1996)
300. R. Prakash, M. Singhal, Dependency sequences and hierarchical clocks: efficient alternatives
to vector clocks for mobile computing systems. Wirel. Netw. 3(5), 349–360 (1997)
301. J. Protic, M. Tomasevic, Distributed shared memory: concepts and systems. IEEE Concurr.
4(2), 63–79 (1996)
302. S.P. Rana, A distributed solution of the distributed termination problem. Inf. Process. Lett.
17(1), 43–46 (1983)
303. B. Randell, System structure for software fault-tolerance. IEEE Trans. Softw. Eng. SE1(2),
220–232 (1975)
304. K. Raymond, A tree-based algorithm for distributed mutual exclusion. ACM Trans. Comput.
Syst. 7(1), 61–77 (1989)
305. K. Raymond, A distributed algorithm for multiple entries to a critical section. Inf. Process.
Lett. 30(4), 189–193 (1989)
306. M. Raynal, Algorithms for Mutual Exclusion (The MIT Press, Cambridge, 1986), 107 pages.
ISBN 0-262-18119-3
307. M. Raynal, A distributed algorithm to prevent mutual drift between n logical clocks. Inf.
Process. Lett. 24, 199–202 (1987)
308. M. Raynal, Networks and Distributed Computation: Concepts, Tools and Algorithms (The
MIT Press, Cambridge, 1987), 168 pages. ISBN 0-262-18130-4
309. M. Raynal, Prime numbers as a tool to design distributed algorithms. Inf. Process. Lett. 33,
53–58 (1989)
310. M. Raynal, A simple taxonomy of distributed mutual exclusion algorithms. Oper. Syst. Rev.
25(2), 47–50 (1991)
311. M. Raynal, A distributed solution to the k-out-of-M resource allocation problem, in Proc.
Int’l Conference on Computing and Information. LNCS, vol. 497 (Springer, Berlin, 1991),
pp. 509–518
312. M. Raynal, Illustrating the use of vector clocks in property detection: an example and a
counter-example, in Proc. 5th European Conference on Parallelism (EUROPAR’99). LNCS,
vol. 1685 (Springer, Berlin, 1999), pp. 806–814
313. M. Raynal, Sequential consistency as lazy linearizability, in Proc. 14th ACM Symposium on
Parallel Algorithms and Architectures (SPAA’02) (ACM Press, New York, 2002), pp. 151–
152
314. M. Raynal, Token-based sequential consistency. Comput. Syst. Sci. Eng. 17(6), 359–365
(2002)
315. M. Raynal, Fault-Tolerant Agreement in Synchronous Distributed Systems (Morgan & Clay-
pool, San Francisco, 2010), 167 pages. ISBN 9781608455256
316. M. Raynal, Communication and Agreement Abstractions for Fault-Tolerant Asynchronous
Distributed Systems (Morgan & Claypool, San Francisco, 2010), 251 pages. ISBN
9781608452934
317. M. Raynal, Concurrent Programming: Algorithms, Principles, and Foundations (Springer,
Berlin, 2012), 500 pages. ISBN 978-3-642-32026-2
318. M. Raynal, M. Ahamad, Exploiting write semantics in implementing partially replicated
causal objects, in Proc. 6th EUROMICRO Conference on Parallel and Distributed Processing
(PDP’98) (IEEE Press, New York, 1998), pp. 157–163
319. M. Raynal, J.-M. Hélary, Synchronization and Control of Distributed Systems and Programs.
Wiley Series in Parallel Computing (1991), 126 pages. ISBN 0-471-92453-9
320. M. Raynal, M. Roy, C. Tutu, A simple protocol offering both atomic consistent read op-
erations and sequentially consistent read operations, in Proc. 19th Int’l Conference on Ad-
vanced Information Networking and Applications (AINA’05) (IEEE Press, New York, 2005),
pp. 961–966
321. M. Raynal, G. Rubino, An algorithm to detect token loss on a logical ring and to regenerate
lost tokens, in Int’l Conference on Parallel Processing and Applications (North-Holland,
Amsterdam, 1987), pp. 457–467
322. M. Raynal, A. Schiper, From causal consistency to sequential consistency in shared memory
systems, in Proc. 15th Int’l Conference on Foundations of Software Technology and The-
oretical Computer Science (FST&TCS’95). LNCS, vol. 1026 (Springer, Berlin, 1995), pp.
180–194
323. M. Raynal, A. Schiper, A suite of formal definitions for consistency criteria in distributed
shared memories, in Proc. 9th Int’l IEEE Conference on Parallel and Distributed Computing
Systems (PDCS’96) (IEEE Press, New York, 1996), pp. 125–131
324. M. Raynal, A. Schiper, S. Toueg, The causal ordering abstraction and a simple way to implement it. Inf. Process. Lett. 39(6), 343–350 (1991)
325. M. Raynal, M. Singhal, Logical time: capturing causality in distributed systems. IEEE Com-
put. 29(2), 49–57 (1996)
326. M. Raynal, K. Vidyasankar, A distributed implementation of sequential consistency with
multi-object operations, in Proc. 24th IEEE Int’l Conference on Distributed Computing Sys-
tems (ICDCS’04) (IEEE Press, New York, 2004), pp. 544–551
327. G. Ricart, A.K. Agrawala, An optimal algorithm for mutual exclusion in computer networks.
Commun. ACM 24(1), 9–17 (1981)
328. G. Ricart, A.K. Agrawala, Author response to “on mutual exclusion in computer networks”
by Carvalho and Roucairol. Commun. ACM 26(2), 147–148 (1983)
329. R. Righter, J.C. Walrand, Distributed simulation of discrete event systems. Proc. IEEE 77(1),
99–113 (1989)
330. S. Ronn, H. Saikkonen, Distributed termination detection with counters. Inf. Process. Lett.
34(5), 223–227 (1990)
331. D.J. Rosenkrantz, R.E. Stearns, P.M. Lewis, System level concurrency control in distributed
databases. ACM Trans. Database Syst. 3(2), 178–198 (1978)
332. D.L. Russell, State restoration in systems of communicating processes. IEEE Trans. Softw.
Eng. SE6(2), 183–194 (1980)
333. B. Sanders, The information structure of distributed mutual exclusion algorithms. ACM
Trans. Comput. Syst. 5(3), 284–299 (1987)
334. S.K. Sarin, N.A. Lynch, Discarding obsolete information in a replicated database system.
IEEE Trans. Softw. Eng. 13(1), 39–46 (1987)
335. N. Santoro, Design and Analysis of Distributed Algorithms (Wiley, New York, 2007), 589
pages
336. A. Schiper, J. Eggli, A. Sandoz, A new algorithm to implement causal ordering, in Proc. 3rd
Int’l Workshop on Distributed Algorithms (WDAG’89). LNCS, vol. 392 (Springer, Berlin,
1989), pp. 219–232
337. R. Schmid, I.C. Garcia, F. Pedone, L.E. Buzato, Optimal asynchronous garbage collection
for RDT checkpointing protocols, in Proc. 25th Int’l Conference on Distributed Computing
Systems (ICDCS’05) (IEEE Press, New York, 2005), pp. 167–176
338. F. Schmuck, The use of efficient broadcast in asynchronous distributed systems. Doctoral
Dissertation, Tech. Report TR88-928, Dept of Computer Science, Cornell University, 124
pages, 1988
339. F.B. Schneider, Implementing fault-tolerant services using the state machine approach. ACM
Comput. Surv. 22(4), 299–319 (1990)
340. R. Schwarz, F. Mattern, Detecting causal relationships in distributed computations: in search
of the Holy Grail. Distrib. Comput. 7, 149–174 (1994)
341. A. Segall, Distributed network protocols. IEEE Trans. Inf. Theory 29(1), 23–35 (1983)
342. N. Shavit, N. Francez, A new approach to detection of locally indicative stability, in 13th
Int’l Colloquium on Automata, Languages and Programming (ICALP’86). LNCS, vol. 226
(Springer, Berlin, 1986), pp. 344–358
343. A. Silberschatz, Synchronization and communication in distributed systems. IEEE Trans.
Softw. Eng. SE5(6), 542–546 (1979)
344. L.M. Silva, J.G. Silva, Global checkpoints for distributed programs, in Proc. 11th Symposium
on Reliable Distributed Systems (SRDS’92) (IEEE Press, New York, 1992), pp. 155–162
345. M. Singhal, A heuristically-aided algorithm for mutual exclusion in distributed systems.
IEEE Trans. Comput. 38(5), 651–662 (1989)
346. M. Singhal, Deadlock detection in distributed systems. IEEE Comput. 22(11), 37–48 (1989)
347. M. Singhal, A class of deadlock-free Maekawa-type algorithms for mutual exclusion in dis-
tributed systems. Distrib. Comput. 4(3), 131–138 (1991)
348. M. Singhal, A dynamic information-structure mutual exclusion algorithm for distributed sys-
tems. IEEE Trans. Parallel Distrib. Syst. 3(1), 121–125 (1992)
349. M. Singhal, A taxonomy of distributed mutual exclusion. J. Parallel Distrib. Comput. 18(1),
94–101 (1993)
350. M. Singhal, A.D. Kshemkalyani, An efficient implementation of vector clocks. Inf. Process.
Lett. 43, 47–52 (1992)
351. M. Sipser, Introduction to the Theory of Computation (PWS, Boston, 1996), 396 pages
352. A.P. Sistla, J.L. Welch, Efficient distributed recovery using message logging, in Proc. 8th
ACM Symposium on Principles of Distributed Computing (PODC’89) (ACM Press, New
York, 1989), pp. 223–238
353. D. Skeen, A quorum-based commit protocol, in Proc. 6th Berkeley Workshop on Distributed
Data Management and Computer Networks (1982), pp. 69–80
354. J.L.A. van de Snepscheut, Synchronous communication between asynchronous components.
Inf. Process. Lett. 13(3), 127–130 (1981)
355. J.L.A. van de Snepscheut, Fair mutual exclusion on a graph of processes. Distrib. Comput.
2(2), 113–115 (1987)
356. T. Soneoka, T. Ibaraki, Logically instantaneous message passing in asynchronous distributed
systems. IEEE Trans. Comput. 43(5), 513–527 (1994)
357. M. Spezialetti, Ph. Kearns, Efficient distributed snapshots, in Proc. 6th Int’l Conference on
Distributed Computing Systems (ICDCS’86) (IEEE Press, New York, 1986), pp. 382–388
358. T.K. Srikanth, S. Toueg, Optimal clock synchronization. J. ACM 34(3), 626–645 (1987)
359. M. van Steen, Graph Theory and Complex Networks: An Introduction (2011), 285 pages.
ISBN 978-90-815406-1-2
360. R.E. Strom, S. Yemini, Optimistic recovery in distributed systems. ACM Trans. Comput.
Syst. 3(3), 204–226 (1985)
361. I. Suzuki, T. Kasami, A distributed mutual exclusion algorithm. ACM Trans. Comput. Syst.
3(4), 344–349 (1985)
362. G. Taubenfeld, Synchronization Algorithms and Concurrent Programming (Pearson Prentice-
Hall, Upper Saddle River, 2006), 423 pages. ISBN 0-131-97259-6
363. K. Taylor, The role of inhibition in asynchronous consistent-cut protocols, in Proc. 3rd Int’l
Workshop on Distributed Algorithms (WDAG’89). LNCS, vol. 392 (Springer, Berlin, 1989),
pp. 280–291
364. R.N. Taylor, Complexity of analyzing the synchronization structure of concurrent programs.
Acta Inform. 19(1), 57–84 (1983)
365. G. Tel, Introduction to Distributed Algorithms, 2nd edn. (Cambridge University Press, Cam-
bridge, 2000), 596 pages. ISBN 0-521-79483-8
366. G. Tel, F. Mattern, The derivation of distributed termination detection algorithms from
garbage collection schemes. ACM Trans. Program. Lang. Syst. 15(1), 1–35 (1993)
367. G. Tel, R.B. Tan, J. van Leeuwen, The derivation of graph-marking algorithms from dis-
tributed termination detection protocols. Sci. Comput. Program. 10(1), 107–137 (1988)
368. R.H. Thomas, A majority consensus approach to concurrency control for multiple copy
databases. ACM Trans. Database Syst. 4(2), 180–209 (1979)
369. O. Theel, M. Raynal, Static and dynamic adaptation of transactional consistency, in Proc.
30th Hawaii Int’l Conference on System Sciences (HICSS-30), vol. I (1997), pp. 533–542
370. F.J. Torres-Rojas, M. Ahamad, Plausible clocks: constant size logical clocks for distributed
systems. Distrib. Comput. 12(4), 179–195 (1999)
371. F. Torres-Rojas, M. Ahamad, M. Raynal, Lifetime-based consistency protocols for dis-
tributed objects, in Proc. 12th Int’l Symposium on Distributed Computing (DISC’98). LNCS,
vol. 1499 (Springer, Berlin, 1998), pp. 378–392
372. F. Torres-Rojas, M. Ahamad, M. Raynal, Timed consistency for shared distributed objects,
in Proc. 18th Annual ACM Symposium on Principles of Distributed Computing (PODC’99)
(ACM Press, New York, 1999), pp. 163–172
373. S. Toueg, An all-pairs shortest paths distributed algorithm. IBM Technical Report RC 8327,
1980
374. M. Trehel, M. Naimi, Un algorithme distribué d’exclusion mutuelle en log n. TSI. Tech. Sci.
Inform. 6(2), 141–150 (1987)
375. J. Tsai, S.-Y. Kuo, Y.-M. Wang, Theoretical analysis for communication-induced check-
pointing protocols with rollback-dependency trackability. IEEE Trans. Parallel Distrib. Syst.
9(10), 963–971 (1998)
376. J. Tsai, Y.-M. Wang, S.-Y. Kuo, Evaluations of domino-free communication-induced check-
pointing protocols. Inf. Process. Lett. 69(1), 31–37 (1999)
377. S. Venkatesan, Message optimal incremental snapshots, in Proc. 9th Int’l Conference on
Distributed Computing Systems (ICDCS’89) (IEEE Press, New York, 1989), pp. 53–60
378. S. Venkatesan, B. Dathan, Testing and debugging distributed programs using global predi-
cates. IEEE Trans. Softw. Eng. 21(2), 163–177 (1995)
379. J. Villadangos, F. Fariña, J.R. Mendívil, C.F. Garitagoitia, A. Cordoba, A safe algorithm for
resolving OR deadlocks. IEEE Trans. Softw. Eng. 29(7), 608–622 (2003)
380. M. Vukolić, Quorum Systems with Applications to Storage and Consensus (Morgan & Clay-
pool, San Francisco, 2012), 130 pages. ISBN 978-1-60845-683-3
381. Y.-M. Wang, Consistent global checkpoints that contain a given set of local checkpoints.
IEEE Trans. Comput. 46(4), 456–468 (1997)
382. Y.-M. Wang, P.Y. Chung, I.J. Lin, W.K. Fuchs, Checkpointing space reclamation for uncoor-
dinated checkpointing in message-passing systems. IEEE Trans. Parallel Distrib. Syst. 6(5),
546–554 (1995)
383. Y.-M. Wang, W.K. Fuchs, Optimistic message logging for independent checkpointing
in message-passing systems, in Proc. 11th Symposium on Reliable Distributed Systems
(SRDS’92) (IEEE Press, New York, 1992), pp. 147–154
384. S. Warshall, A theorem on Boolean matrices. J. ACM 9(1), 11–12 (1962)
385. J.L. Welch, Simulating synchronous processors. Inf. Comput. 74, 159–171 (1987)
386. J.L. Welch, N.A. Lynch, A new fault-tolerance algorithm for clock synchronization. Inf.
Comput. 77(1), 1–36 (1988)
387. J.L. Welch, N.A. Lynch, A modular drinking philosophers algorithm. Distrib. Comput. 6(4),
233–244 (1993)
388. J.L. Welch, J.E. Walter, Link Reversal Algorithms (Morgan & Claypool, San Francisco,
2011), 93 pages. ISBN 9781608450411
389. H. Wu, W.-N. Chin, J. Jaffar, An efficient distributed deadlock avoidance algorithm for the
AND model. IEEE Trans. Softw. Eng. 28(1), 18–29 (2002)
390. G.T.J. Wuu, A.J. Bernstein, Efficient solutions to the replicated log and dictionary problems,
in Proc. 3rd Annual ACM Symposium on Principles of Distributed Computing (PODC’84)
(ACM Press, New York, 1984), pp. 233–242
391. G.T.J. Wuu, A.J. Bernstein, False deadlock detection in distributed systems. IEEE Trans.
Softw. Eng. SE-11(8), 820–821 (1985)
392. M. Yamashita, T. Kameda, Computing on anonymous networks, part I: characterizing the
solvable cases. IEEE Trans. Parallel Distrib. Syst. 7(1), 69–89 (1996)
393. M. Yamashita, T. Kameda, Computing on anonymous networks, part II: decision and mem-
bership problems. IEEE Trans. Parallel Distrib. Syst. 7(1), 90–96 (1996)
394. L.-Y. Yen, T.-L. Huang, Resetting vector clocks in distributed systems. J. Parallel Distrib.
Comput. 43, 15–20 (1997)
395. Y. Zhu, T.-Y. Cheung, A new distributed breadth-first search algorithm. Inf. Process. Lett.
25(5), 329–333 (1987)
Index
A
Abstraction
  Checkpointing, 196
Adaptive algorithm, 108
  Bounded mutual exclusion, 259
  Causal order, 312
  Mutual exclusion, 256
AND model
  Deadlock detection, 403
  Receive statement, 386
Anonymous systems, 78
Antiquorum, 267
Arbiter process, 264
Asynchronous atomic model, 370
Asynchronous system, 5
Atomicity
  Definition, 430
  From copy invalidation, 438
  From copy update, 443
  From server processes, 437
  From total order broadcast, 435
  Is a local property, 433
  Linearization point, 431

B
Bounded delay network
  Definition, 236
  Local clock drift, 240
Breadth-first spanning-tree
  Built with centralized control, 20
  Built without centralized control, 17
Broadcast
  Definition, 9
  On a rooted spanning tree, 10

C
Causal future, 124
Causal order
  Bounded lifetime message, 317
  Broadcast, 313
  Broadcast causal barrier, 315
  Causality-based characterization, 305
  Definition, 304
  Point-to-point, 306
  Point-to-point delivery condition, 307
  Reduce the size of control information, 310
Causal past, 124
Causal path, 123
Causal precedence, 123
Center of graph, 60
Channel, 4
  FIFO, 4
  FIFO in leader election, 89
  Four delivery properties, 328
  State, 132, 368
Checkpoint and communication pattern, 189
Checkpointing
  Classes of algorithms, 198
  Consistent abstractions, 196
  Domino effect, 196
  Recovery algorithm, 214
  Rollback-dependency trackability, 197
  Stable storage, 211
  Uncoordinated, 211
  Z-cycle-freedom, 196
Communication
  Deterministic context, 341
  Nondeterministic context, 342
Communication graph, 6
Concurrency set, 124
Conflict graph, 286
Consistency condition, 425
  Atomicity, 430
  Causal consistency, 464

V
Vector clock
  Adaptive communication layer, 180
  Approximation, 181
  Basic algorithm, 160
  Definition, 160
  Efficient implementation, 176
  k-restricted, 181
  Lower bound on the size, 174
  Properties, 162

Z
Z-cycle-freedom
  Algorithms, 201
  Dating system, 199
  Definition, 196
  Notion of an optimal algorithm, 203
  Operational characterization, 199
Z-dependency relation, 190
Zigzag path, 190
Zigzag pattern, 191