
Distributed Algorithms
for Message-Passing Systems

Michel Raynal
Institut Universitaire de France
IRISA-ISTIC
Université de Rennes 1
Rennes Cedex
France

ISBN 978-3-642-38122-5 ISBN 978-3-642-38123-2 (eBook)


DOI 10.1007/978-3-642-38123-2
Springer Heidelberg New York Dordrecht London

Library of Congress Control Number: 2013942973

ACM Computing Classification (1998): F.1, D.1, B.3

© Springer-Verlag Berlin Heidelberg 2013


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of
this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publisher’s location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations
are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of pub-
lication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any
errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect
to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Preface

The profusion of things hid the scarcity of ideas and the wearing-out of beliefs.
[. . . ] To retain something of the time when we shall be no more.
In Les années (2008), Annie Ernaux
Midway upon the journey of our life
I found myself within a forest dark,
For the straightforward pathway had been lost.
In La divina commedia (1307–1321), Dante Alighieri (1265–1321)
We must be nothing, but want to become everything.
Johann Wolfgang von Goethe (1749–1832)
Each generation doubtless feels called upon to reform the world.
Mine knows that it will not reform it, but its task is perhaps even greater.
It consists in preventing the world from destroying itself.
Speech at the Nobel Banquet, Stockholm, December 10, 1957, Albert Camus (1913–1960)
Nothing is as precarious as living
Nothing as fleeting as being
It is a little like melting, for frost,
Or being light, for the wind
I arrive where I am a stranger.
In Le voyage de Hollande (1965), Louis Aragon (1897–1982)

What Is Distributed Computing? Distributed computing was born in the late
1970s, when researchers and practitioners started taking into account the intrinsic
characteristics of physically distributed systems. The field then emerged as a
specialized research area distinct from networking, operating systems, and parallel
computing.
Distributed computing arises when one has to solve a problem in terms of dis-
tributed entities (usually called processors, nodes, processes, actors, agents, sen-
sors, peers, etc.) such that each entity has only a partial knowledge of the many
parameters involved in the problem that has to be solved. While parallel computing
and real-time computing can be characterized, respectively, by the terms efficiency
and on-time computing, distributed computing can be characterized by the term un-
certainty. This uncertainty is created by asynchrony, multiplicity of control flows,

absence of shared memory and global time, failure, dynamicity, mobility, etc. Mas-
tering one form or another of uncertainty is pervasive in all distributed computing
problems. A main difficulty in designing distributed algorithms comes from the fact
that each entity cooperating in the achievement of a common goal cannot have in-
stantaneous knowledge of the current state of the other entities; it can only know
their past local states.
Although distributed algorithms are often made up of a few lines, their behavior
can be difficult to understand and their properties hard to state and prove. Hence,
distributed computing is not only a fundamental topic but also a challenging topic
where simplicity, elegance, and beauty are first-class citizens.

Why This Book? While there are many books on sequential computing (on both
basic data structures and algorithms), this is not the case for distributed computing.
Most books on distributed computing consider advanced topics where the uncer-
tainty inherent to distributed computing is created by the net effect of asynchrony
and failures. It follows that these books are more appropriate for graduate students
than for undergraduate students.
The aim of this book is to present in a comprehensive way basic notions, concepts,
and algorithms of distributed computing when the distributed entities cooperate by
sending and receiving messages on top of an underlying network. In this case, the
main difficulty comes from the physical distribution of the entities and the asyn-
chrony of the environment in which they evolve.

Audience This book has been written primarily for people who are not familiar
with the topic and the concepts that are presented. These include mainly:
• Senior-level undergraduate students and graduate students in computer science
or computer engineering, who are interested in the principles and foundations of
distributed computing.
• Practitioners and engineers who want to be aware of the state-of-the-art concepts,
basic principles, mechanisms, and techniques encountered in distributed comput-
ing.
Prerequisites for this book include undergraduate courses on algorithms and ba-
sic knowledge of operating systems. Selections of chapters for undergraduate and
graduate courses are suggested in the section titled “How to Use This Book” in the
Afterword.

Content As already indicated, this book covers algorithms, basic principles, and
foundations of message-passing programming, i.e., programs where the entities
communicate by sending and receiving messages through a network. The world is
distributed, and the algorithmic thinking suited to distributed applications and sys-
tems is not reducible to sequential computing. Knowledge of the bases of distributed
computing is becoming more important than ever as more and more computer ap-
plications are now distributed. The book is composed of six parts.

• The aim of the first part, which is made up of five chapters, is to give a feel for the
nature of distributed algorithms, i.e., what makes them different from sequential
or parallel algorithms. To that end, it mainly considers distributed graph algo-
rithms. In this context, each node of the graph is a process, which has to compute
a result whose meaning depends on the whole graph.
Basic distributed algorithms such as network traversals, shortest-path algo-
rithms, vertex coloring, knot detection, etc., are first presented. Then, a general
framework for distributed graph algorithms is introduced. A chapter is devoted to
leader election algorithms on a ring network, and another chapter focuses on the
navigation of a network by mobile objects.
• The second part is on the nature of distributed executions. It is made up of four
chapters. In some sense, this part is the core of the book. It explains what a dis-
tributed execution is, the fundamental notion of a consistent global state, and the
impossibility—without freezing the computation—of knowing whether a com-
puted consistent global state has been passed through by the execution or not.
Then, this part of the book addresses an important issue of distributed compu-
tations, namely the notion of logical time: scalar (linear) time, vector time, and
matrix time. Each type of time is analyzed and examples of its use are given (a
minimal sketch of scalar time appears just after this list).
A chapter, which extends the notion of a global state, is then devoted to asyn-
chronous distributed checkpointing. Finally, the last chapter of this part shows
how to simulate a synchronous system on top of an asynchronous system (such
simulators are called synchronizers).
• The third part of the book is made up of two chapters devoted to distributed
mutual exclusion and distributed resource allocation. Different families of
permission-based mutual exclusion algorithms are presented. The notion of an
adaptive algorithm is also introduced. The notion of a critical section with mul-
tiple entries, and the case of resources with a single or several instances, are also
presented. Associated deadlock prevention techniques are introduced.
• The fourth part of the book is on the definition and the implementation of commu-
nication operations whose abstraction level is higher than the simple send/receive
of messages. These communication abstractions impose order constraints on mes-
sage deliveries. Causal message delivery and total order broadcast are first pre-
sented in one chapter. Then, another chapter considers synchronous communica-
tion (also called rendezvous or logically instantaneous communication).
• The fifth part of the book, which is made up of two chapters, is on the detection
of stable properties encountered in distributed computing. A stable property is a
property that, once true, remains true forever. The properties which are studied are
the detection of the termination of a distributed computation, and the detection of
distributed deadlock. This part of the book is strongly related to the second part
(which is devoted to the notion of a global state).
• The sixth and last part of the book, which is also made up of two chapters, is
devoted to the notion of a distributed shared memory. The aim is here to pro-
vide the entities (processes) with a set of objects that allow them to cooperate at
an abstraction level more appropriate than the use of messages. Two consistency
conditions, which can be associated with these objects, are presented and inves-
tigated, namely, atomicity (also called linearizability) and sequential consistency.
Several algorithms implementing these consistency conditions are described.
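
To give a first concrete taste of the second part, here is a minimal sketch, written
in Python rather than in the pseudo-code used throughout the book, of the scalar
(Lamport) clock studied in Chap. 7. The names LamportClock, tick, on_send, and
on_receive are illustrative choices of this sketch, not notation from the book.

    class LamportClock:
        """Scalar (linear) logical clock: one integer counter per process."""

        def __init__(self):
            self.time = 0

        def tick(self):
            # Rule for an internal event: advance the local clock.
            self.time += 1
            return self.time

        def on_send(self):
            # Rule for a send event: advance the clock and use the new
            # value as the timestamp carried by the outgoing message.
            return self.tick()

        def on_receive(self, msg_time):
            # Rule for a receive event: jump above the timestamp carried
            # by the incoming message, then advance.
            self.time = max(self.time, msg_time) + 1
            return self.time

The key property is that if an event e causally precedes an event f, then the date of
e is smaller than the date of f; the converse does not hold, and overcoming this
limitation is precisely what motivates the vector time also studied in Chap. 7.
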
To get a more complete feel for the spirit of this book, the reader is invited
to consult the section “The Aim of This Book” in the Afterword, which describes
what it is hoped the reader will have learned. Each chapter starts with a short
presentation and a list of the main keywords, and terminates with a summary of its
content. Each of the six parts of the book is also introduced by a brief description of
its aim and its technical content.

Acknowledgments This book originates from lecture notes for undergraduate and graduate
courses on distributed computing that I give at the University of Rennes (France) and, as an
invited professor, at several universities all over the world. I would like to thank the students
for their questions that, in one way or another, have contributed to this book. I also want to
thank Ronan Nugent (Springer) for his support and his help in putting it all together.
Last but not least (and maybe most importantly), I also want to thank all the researchers
whose results are presented in this book. Without their work, this book would not exist.

Michel Raynal
Professeur des Universités
Institut Universitaire de France
IRISA-ISTIC, Université de Rennes 1
Campus de Beaulieu, 35042, Rennes, France

March–October 2012
Rennes, Saint-Grégoire, Tokyo, Fukuoka (AINA’12), Arequipa (LATIN’12),
Reykjavik (SIROCCO’12), Palermo (CISIS’12), Madeira (PODC’12), Lisbon,
Douelle, Saint-Philibert, Rhodes Island (Europar’12),
Salvador de Bahia (DISC’12), Mexico City (Turing Year at UNAM)
Contents

Part I Distributed Graph Algorithms


1 Basic Definitions and Network Traversal Algorithms . . . . . . . . . 3
1.1 Distributed Algorithms . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 An Introductory Example:
Learning the Communication Graph . . . . . . . . . . . . 6
1.2 Parallel Traversal: Broadcast and Convergecast . . . . . . . . . . . 9
1.2.1 Broadcast and Convergecast . . . . . . . . . . . . . . . . . 9
1.2.2 A Flooding Algorithm . . . . . . . . . . . . . . . . . . . . 10
1.2.3 Broadcast/Convergecast Based on a Rooted Spanning Tree 10
1.2.4 Building a Spanning Tree . . . . . . . . . . . . . . . . . . 12
1.3 Breadth-First Spanning Tree . . . . . . . . . . . . . . . . . . . . . 16
1.3.1 Breadth-First Spanning Tree
Built Without Centralized Control . . . . . . . . . . . . . . 17
1.3.2 Breadth-First Spanning Tree Built with Centralized Control 20
1.4 Depth-First Traversal . . . . . . . . . . . . . . . . . . . . . . . . 24
1.4.1 A Simple Algorithm . . . . . . . . . . . . . . . . . . . . . 24
1.4.2 Application: Construction of a Logical Ring . . . . . . . . 27
1.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.7 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . 33
2 Distributed Graph Algorithms . . . . . . . . . . . . . . . . . . . . . 35
2.1 Distributed Shortest Path Algorithms . . . . . . . . . . . . . . . . 35
2.1.1 A Distributed Adaptation
of Bellman–Ford’s Shortest Path Algorithm . . . . . . . . 35
2.1.2 A Distributed Adaptation
of Floyd–Warshall’s Shortest Paths Algorithm . . . . . . . 38
2.2 Vertex Coloring and Maximal Independent Set . . . . . . . . . . . 42
2.2.1 On Sequential Vertex Coloring . . . . . . . . . . . . . . . 42

2.2.2 Distributed (Δ + 1)-Coloring of Processes . . . . . . . . . 43
2.2.3 Computing a Maximal Independent Set . . . . . . . . . . . 46
2.3 Knot and Cycle Detection . . . . . . . . . . . . . . . . . . . . . . 50
2.3.1 Directed Graph, Knot, and Cycle . . . . . . . . . . . . . . 50
2.3.2 Communication Graph, Logical Directed Graph,
and Reachability . . . . . . . . . . . . . . . . . . . . . . . 51
2.3.3 Specification of the Knot Detection Problem . . . . . . . . 51
2.3.4 Principle of the Knot/Cycle Detection Algorithm . . . . . . 52
2.3.5 Local Variables . . . . . . . . . . . . . . . . . . . . . . . 53
2.3.6 Behavior of a Process . . . . . . . . . . . . . . . . . . . . 54
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
2.5 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 58
2.6 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . 58
3 An Algorithmic Framework
to Compute Global Functions on a Process Graph . . . . . . . . . . . 59
3.1 Distributed Computation of Global Functions . . . . . . . . . . . . 59
3.1.1 Type of Global Functions . . . . . . . . . . . . . . . . . . 59
3.1.2 Constraints on the Computation . . . . . . . . . . . . . . . 60
3.2 An Algorithmic Framework . . . . . . . . . . . . . . . . . . . . . 61
3.2.1 A Round-Based Framework . . . . . . . . . . . . . . . . . 61
3.2.2 When the Diameter Is Not Known . . . . . . . . . . . . . 64
3.3 Distributed Determination of Cut Vertices . . . . . . . . . . . . . 66
3.3.1 Cut Vertices . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.3.2 An Algorithm Determining Cut Vertices . . . . . . . . . . 67
3.4 Improving the Framework . . . . . . . . . . . . . . . . . . . . . . 69
3.4.1 Two Types of Filtering . . . . . . . . . . . . . . . . . . . . 69
3.4.2 An Improved Algorithm . . . . . . . . . . . . . . . . . . . 70
3.5 The Case of Regular Communication Graphs . . . . . . . . . . . . 72
3.5.1 Tradeoff Between Graph Topology and Number of Rounds 72
3.5.2 De Bruijn Graphs . . . . . . . . . . . . . . . . . . . . . . 73
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.8 Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
4 Leader Election Algorithms . . . . . . . . . . . . . . . . . . . . . . . 77
4.1 The Leader Election Problem . . . . . . . . . . . . . . . . . . . . 77
4.1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . 77
4.1.2 Anonymous Systems: An Impossibility Result . . . . . . . 78
4.1.3 Basic Assumptions and Principles
of the Election Algorithms . . . . . . . . . . . . . . . . . . 79
4.2 A Simple O(n²) Leader Election Algorithm
for Unidirectional Rings . . . . . . . . . . . . . . . . . . . . . . . 79
4.2.1 Context and Principle . . . . . . . . . . . . . . . . . . . . 79
4.2.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 80
4.2.3 Time Cost of the Algorithm . . . . . . . . . . . . . . . . . 80
4.2.4 Message Cost of the Algorithm . . . . . . . . . . . . . . . 81
4.2.5 A Simple Variant . . . . . . . . . . . . . . . . . . . . . . . 82
4.3 An O(n log n) Leader Election Algorithm for Bidirectional Rings . 83
4.3.1 Context and Principle . . . . . . . . . . . . . . . . . . . . 83
4.3.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 84
4.3.3 Time and Message Complexities . . . . . . . . . . . . . . 85
4.4 An O(n log n) Election Algorithm for Unidirectional Rings . . . . 86
4.4.1 Context and Principles . . . . . . . . . . . . . . . . . . . . 86
4.4.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 88
4.4.3 Discussion: Message Complexity and FIFO Channels . . . 89
4.5 Two Particular Cases . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.8 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . 91
5 Mobile Objects Navigating a Network . . . . . . . . . . . . . . . . . 93
5.1 Mobile Object in a Process Graph . . . . . . . . . . . . . . . . . . 93
5.1.1 Problem Definition . . . . . . . . . . . . . . . . . . . . . . 93
5.1.2 Mobile Object Versus Mutual Exclusion . . . . . . . . . . 94
5.1.3 A Centralized (Home-Based) Algorithm . . . . . . . . . . 94
5.1.4 The Algorithms Presented in This Chapter . . . . . . . . . 95
5.2 A Navigation Algorithm for a Complete Network . . . . . . . . . 96
5.2.1 Underlying Principles . . . . . . . . . . . . . . . . . . . . 96
5.2.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 97
5.3 A Navigation Algorithm Based on a Spanning Tree . . . . . . . . 100
5.3.1 Principles of the Algorithm:
Tree Invariant and Proxy Behavior . . . . . . . . . . . . . 101
5.3.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3.3 Discussion and Properties . . . . . . . . . . . . . . . . . . 104
5.3.4 Proof of the Algorithm . . . . . . . . . . . . . . . . . . . . 106
5.4 An Adaptive Navigation Algorithm . . . . . . . . . . . . . . . . . 108
5.4.1 The Adaptivity Property . . . . . . . . . . . . . . . . . . . 109
5.4.2 Principle of the Implementation . . . . . . . . . . . . . . . 109
5.4.3 An Adaptive Algorithm Based on a Distributed Queue . . . 111
5.4.4 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4.5 Example of an Execution . . . . . . . . . . . . . . . . . . 114
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.7 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . 116

Part II Logical Time and Global States in Distributed Systems


6 Nature of Distributed Computations
and the Concept of a Global State . . . . . . . . . . . . . . . . . . . . 121
6.1 A Distributed Execution Is a Partial Order on Local Events . . . . 122
6.1.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . 122
6.1.2 A Distributed Execution Is a Partial Order on Local Events 122
6.1.3 Causal Past, Causal Future, Concurrency, Cut . . . . . . . 123
6.1.4 Asynchronous Distributed Execution
with Respect to Physical Time . . . . . . . . . . . . . . . . 125
6.2 A Distributed Execution Is a Partial Order on Local States . . . . . 127
6.3 Global State and Lattice of Global States . . . . . . . . . . . . . . 129
6.3.1 The Concept of a Global State . . . . . . . . . . . . . . . . 129
6.3.2 Lattice of Global States . . . . . . . . . . . . . . . . . . . 129
6.3.3 Sequential Observations . . . . . . . . . . . . . . . . . . . 131
6.4 Global States Including Process States and Channel States . . . . . 132
6.4.1 Global State Including Channel States . . . . . . . . . . . 132
6.4.2 Consistent Global State Including Channel States . . . . . 133
6.4.3 Consistent Global State Versus Consistent Cut . . . . . . . 134
6.5 On-the-Fly Computation of Global States . . . . . . . . . . . . . . 135
6.5.1 Global State Computation Is an Observation Problem . . . 135
6.5.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . 136
6.5.3 On the Meaning of the Computed Global State . . . . . . . 136
6.5.4 Principles of Algorithms Computing a Global State . . . . 137
6.6 A Global State Algorithm Suited to FIFO Channels . . . . . . . . 138
6.6.1 Principle of the Algorithm . . . . . . . . . . . . . . . . . . 138
6.6.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 140
6.6.3 Example of an Execution . . . . . . . . . . . . . . . . . . 141
6.7 A Global State Algorithm Suited to Non-FIFO Channels . . . . . . 143
6.7.1 The Algorithm and Its Principles . . . . . . . . . . . . . . 144
6.7.2 How to Compute the State of the Channels . . . . . . . . . 144
6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.9 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 146
6.10 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . 147
7 Logical Time in Asynchronous Distributed Systems . . . . . . . . . . 149
7.1 Linear Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
7.1.1 Scalar (or Linear) Time . . . . . . . . . . . . . . . . . . . 150
7.1.2 From Partial Order to Total Order:
The Notion of a Timestamp . . . . . . . . . . . . . . . . . 151
7.1.3 Relating Logical Time and Timestamps with Observations . 152
7.1.4 Timestamps in Action: Total Order Broadcast . . . . . . . 153
7.2 Vector Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
7.2.1 Vector Time and Vector Clocks . . . . . . . . . . . . . . . 159
7.2.2 Vector Clock Properties . . . . . . . . . . . . . . . . . . . 162
7.2.3 On the Development of Vector Time . . . . . . . . . . . . 163
7.2.4 Relating Vector Time and Global States . . . . . . . . . . . 165
7.2.5 Vector Clocks in Action:
On-the-Fly Determination of a Global State Property . . . . 166
7.2.6 Vector Clocks in Action:
On-the-Fly Determination of the Immediate Predecessors . 170
7.3 On the Size of Vector Clocks . . . . . . . . . . . . . . . . . . . . 173
7.3.1 A Lower Bound on the Size of Vector Clocks . . . . . . . . 174
7.3.2 An Efficient Implementation of Vector Clocks . . . . . . . 176
7.3.3 k-Restricted Vector Clock . . . . . . . . . . . . . . . . . . 181
7.4 Matrix Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
7.4.1 Matrix Clock: Definition and Algorithm . . . . . . . . . . 182
7.4.2 A Variant of Matrix Time in Action: Discard Old Data . . . 184
7.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
7.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 186
7.7 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . 187
8 Asynchronous Distributed Checkpointing . . . . . . . . . . . . . . . 189
8.1 Definitions and Main Theorem . . . . . . . . . . . . . . . . . . . 189
8.1.1 Local and Global Checkpoints . . . . . . . . . . . . . . . . 189
8.1.2 Z-Dependency, Zigzag Paths, and Z-Cycles . . . . . . . . . 190
8.1.3 The Main Theorem . . . . . . . . . . . . . . . . . . . . . 192
8.2 Consistent Checkpointing Abstractions . . . . . . . . . . . . . . . 196
8.2.1 Z-Cycle-Freedom . . . . . . . . . . . . . . . . . . . . . . 196
8.2.2 Rollback-Dependency Trackability . . . . . . . . . . . . . 197
8.2.3 On Distributed Checkpointing Algorithms . . . . . . . . . 198
8.3 Checkpointing Algorithms Ensuring Z-Cycle Prevention . . . . . . 199
8.3.1 An Operational Characterization of Z-Cycle-Freedom . . . 199
8.3.2 A Property of a Particular Dating System . . . . . . . . . . 199
8.3.3 Two Simple Algorithms Ensuring Z-Cycle Prevention . . . 201
8.3.4 On the Notion of an Optimal Algorithm
for Z-Cycle Prevention . . . . . . . . . . . . . . . . . . . 203
8.4 Checkpointing Algorithms
Ensuring Rollback-Dependency Trackability . . . . . . . . . . . . 203
8.4.1 Rollback-Dependency Trackability (RDT) . . . . . . . . . 203
8.4.2 A Simple Brute Force RDT Checkpointing Algorithm . . . 205
8.4.3 The Fixed Dependency After Send (FDAS)
RDT Checkpointing Algorithm . . . . . . . . . . . . . . . 206
8.4.4 Still Reducing the Number of Forced Local Checkpoints . . 207
8.5 Message Logging for Uncoordinated Checkpointing . . . . . . . . 211
8.5.1 Uncoordinated Checkpointing . . . . . . . . . . . . . . . . 211
8.5.2 To Log or Not to Log Messages on Stable Storage . . . . . 211
8.5.3 A Recovery Algorithm . . . . . . . . . . . . . . . . . . . . 214
8.5.4 A Few Improvements . . . . . . . . . . . . . . . . . . . . 215
8.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
8.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 216
8.8 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . 217
9 Simulating Synchrony on Top of Asynchronous Systems . . . . . . . 219
9.1 Synchronous Systems, Asynchronous Systems, and Synchronizers 219
9.1.1 Synchronous Systems . . . . . . . . . . . . . . . . . . . . 219
9.1.2 Asynchronous Systems and Synchronizers . . . . . . . . . 221
9.1.3 On the Efficiency Side . . . . . . . . . . . . . . . . . . . . 222
9.2 Basic Principle for a Synchronizer . . . . . . . . . . . . . . . 223
9.2.1 The Main Problem to Solve . . . . . . . . . . . . . . . . . 223
9.2.2 Principle of the Solutions . . . . . . . . . . . . . . . . . . 224
9.3 Basic Synchronizers: α and β . . . . . . . . . . . . . . . . . . . . 224
9.3.1 Synchronizer α . . . . . . . . . . . . . . . . . . . . . . . 224
9.3.2 Synchronizer β . . . . . . . . . . . . . . . . . . . . . . . 227
9.4 Advanced Synchronizers: γ and δ . . . . . . . . . . . . . . . . . . 230
9.4.1 Synchronizer γ . . . . . . . . . . . . . . . . . . . . . . . 230
9.4.2 Synchronizer δ . . . . . . . . . . . . . . . . . . . . . . . . 234
9.5 The Case of Networks with Bounded Delays . . . . . . . . . . . . 236
9.5.1 Context and Hypotheses . . . . . . . . . . . . . . . . . . . 236
9.5.2 The Problem to Solve . . . . . . . . . . . . . . . . . . . . 237
9.5.3 Synchronizer λ . . . . . . . . . . . . . . . . . . . . . . . . 238
9.5.4 Synchronizer μ . . . . . . . . . . . . . . . . . . . . . . . 239
9.5.5 When the Local Physical Clocks Drift . . . . . . . . . . . 240
9.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
9.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 243
9.8 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . 244

Part III Mutual Exclusion and Resource Allocation


10 Permission-Based Mutual Exclusion Algorithms . . . . . . . . . . . 247
10.1 The Mutual Exclusion Problem . . . . . . . . . . . . . . . . . . . 247
10.1.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 247
10.1.2 Classes of Distributed Mutex Algorithms . . . . . . . . . . 248
10.2 A Simple Algorithm Based on Individual Permissions . . . . . . . 249
10.2.1 Principle of the Algorithm . . . . . . . . . . . . . . . . . . 249
10.2.2 The Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 251
10.2.3 Proof of the Algorithm . . . . . . . . . . . . . . . . . . . . 252
10.2.4 From Simple Mutex to Mutex on Classes of Operations . . 255
10.3 Adaptive Mutex Algorithms Based on Individual Permissions . . . 256
10.3.1 The Notion of an Adaptive Algorithm . . . . . . . . . . . . 256
10.3.2 A Timestamp-Based Adaptive Algorithm . . . . . . . . . . 257
10.3.3 A Bounded Adaptive Algorithm . . . . . . . . . . . . . . . 259
10.3.4 Proof of the Bounded Adaptive Mutex Algorithm . . . . . 262
10.4 An Algorithm Based on Arbiter Permissions . . . . . . . . . . . . 264
10.4.1 Permissions Managed by Arbiters . . . . . . . . . . . . . . 264
10.4.2 Permissions Versus Quorums . . . . . . . . . . . . . . . . 265
10.4.3 Quorum Construction . . . . . . . . . . . . . . . . . . . . 266
10.4.4 An Adaptive Mutex Algorithm
Based on Arbiter Permissions . . . . . . . . . . . . . . . . 268
10.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 273
10.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 273
10.7 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . 274
11 Distributed Resource Allocation . . . . . . . . . . . . . . . . . . . 277
11.1 A Single Resource with Several Instances . . . . . . . . . . . . . . 277
11.1.1 The k-out-of-M Problem . . . . . . . . . . . . . . . . . . 277
11.1.2 Mutual Exclusion with Multiple Entries:
The 1-out-of-M Mutex Problem . . . . . . . . . . . . . . . 278
11.1.3 An Algorithm for the k-out-of-M Mutex Problem . . . . . 280
11.1.4 Proof of the Algorithm . . . . . . . . . . . . . . . . . . . . 283
11.1.5 From Mutex Algorithms to k-out-of-M Algorithms . . . . 285
11.2 Several Resources with a Single Instance . . . . . . . . . . . . . . 285
11.2.1 Several Resources with a Single Instance . . . . . . . . . . 286
11.2.2 Incremental Requests for Single Instance Resources:
Using a Total Order . . . . . . . . . . . . . . . . . . . . . 287
11.2.3 Incremental Requests for Single Instance Resources:
Reducing Process Waiting Chains . . . . . . . . . . . . . . 290
11.2.4 Simultaneous Requests for Single Instance Resources
and Static Sessions . . . . . . . . . . . . . . . . . . . . . . 292
11.2.5 Simultaneous Requests for Single Instance Resources
and Dynamic Sessions . . . . . . . . . . . . . . . . . . . . 293
11.3 Several Resources with Multiple Instances . . . . . . . . . . . . . 295
11.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
11.5 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 298
11.6 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . 299

Part IV High-Level Communication Abstractions


12 Order Constraints on Message Delivery . . . . . . . . . . . . . . . . 303
12.1 The Causal Message Delivery Abstraction . . . . . . . . . . . . . 303
12.1.1 Definition of Causal Message Delivery . . . . . . . . . . . 304
12.1.2 A Causality-Based Characterization
of Causal Message Delivery . . . . . . . . . . . . . . . . . 305
12.1.3 Causal Order
with Respect to Other Message Ordering Constraints . . . . 306
12.2 A Basic Algorithm for Point-to-Point Causal Message Delivery . . 306
12.2.1 A Simple Algorithm . . . . . . . . . . . . . . . . . . . . . 306
12.2.2 Proof of the Algorithm . . . . . . . . . . . . . . . . . . . . 309
12.2.3 Reduce the Size of Control Information
Carried by Messages . . . . . . . . . . . . . . . . . . . . . 310
12.3 Causal Broadcast . . . . . . . . . . . . . . . . . . . . . . . . . . . 313
12.3.1 Definition and a Simple Algorithm . . . . . . . . . . . . . 313
12.3.2 The Notion of a Causal Barrier . . . . . . . . . . . . . . . 315
12.3.3 Causal Broadcast with Bounded Lifetime Messages . . . . 317
12.4 The Total Order Broadcast Abstraction . . . . . . . . . . . . . . . 320
12.4.1 Strong Total Order Versus Weak Total Order . . . . . . . . 320
12.4.2 An Algorithm Based on a Coordinator Process
or a Circulating Token . . . . . . . . . . . . . . . . . . . . 322
12.4.3 An Inquiry-Based Algorithm . . . . . . . . . . . . . . . . 324
12.4.4 An Algorithm for Synchronous Systems . . . . . . . . . . 326
12.5 Playing with a Single Channel . . . . . . . . . . . . . . . . . . . . 328
12.5.1 Four Order Properties on a Channel . . . . . . . . . . . . . 328
12.5.2 A General Algorithm Implementing These Properties . . . 329
12.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 332
12.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 332
12.8 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . 333
13 Rendezvous (Synchronous) Communication . . . . . . . . . . . . . . 335
13.1 The Synchronous Communication Abstraction . . . . . . . . . . . 335
13.1.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 335
13.1.2 An Example of Use . . . . . . . . . . . . . . . . . . . . . 337
13.1.3 A Message Pattern-Based Characterization . . . . . . . . . 338
13.1.4 Types of Algorithms
Implementing Synchronous Communications . . . . . . . . 341
13.2 Algorithms for Nondeterministic Planned Interactions . . . . . . . 341
13.2.1 Deterministic and Nondeterministic Communication
Contexts . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
13.2.2 An Asymmetric (Static) Client–Server Implementation . . 342
13.2.3 An Asymmetric Token-Based Implementation . . . . . . . 345
13.3 An Algorithm for Nondeterministic Forced Interactions . . . . . . 350
13.3.1 Nondeterministic Forced Interactions . . . . . . . . . . . . 350
13.3.2 A Simple Algorithm . . . . . . . . . . . . . . . . . . . . . 350
13.3.3 Proof of the Algorithm . . . . . . . . . . . . . . . . . . . . 352
13.4 Rendezvous with Deadlines in Synchronous Systems . . . . . . . . 354
13.4.1 Synchronous Systems and Rendezvous with Deadline . . . 354
13.4.2 Rendezvous with Deadline Between Two Processes . . . . 355
13.4.3 Introducing Nondeterministic Choice . . . . . . . . . . . . 358
13.4.4 n-Way Rendezvous with Deadline . . . . . . . . . . . . . 360
13.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 361
13.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 361
13.7 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . 362

Part V Detection of Properties on Distributed Executions


14 Distributed Termination Detection . . . . . . . . . . . . . . . . . . . 367
14.1 The Distributed Termination Detection Problem . . . . . . . . . . 367
14.1.1 Process and Channel States . . . . . . . . . . . . . . . . . 367
14.1.2 Termination Predicate . . . . . . . . . . . . . . . . . . . . 368
14.1.3 The Termination Detection Problem . . . . . . . . . . . . 369
14.1.4 Types and Structure of Termination Detection Algorithms . 369
14.2 Termination Detection in the Asynchronous Atomic Model . . . . 370
14.2.1 The Atomic Model . . . . . . . . . . . . . . . . . . . . . . 370
14.2.2 The Four-Counter Algorithm . . . . . . . . . . . . . . . . 371
14.2.3 The Counting Vector Algorithm . . . . . . . . . . . . . . . 373
14.2.4 The Four-Counter Algorithm
vs. the Counting Vector Algorithm . . . . . . . . . . . . . 376
14.3 Termination Detection in Diffusing Computations . . . . . . . . . 376
14.3.1 The Notion of a Diffusing Computation . . . . . . . . . . . 376
14.3.2 A Detection Algorithm Suited to Diffusing Computations . 377
14.4 A General Termination Detection Algorithm . . . . . . . . . . . . 378
14.4.1 Wave and Sequence of Waves . . . . . . . . . . . . . . . . 379
14.4.2 A Reasoned Construction . . . . . . . . . . . . . . . . . . 381
14.5 Termination Detection in a Very General Distributed Model . . . . 385
14.5.1 Model and Nondeterministic Atomic Receive Statement . . 385
14.5.2 The Predicate fulfilled() . . . . . . . . . . . . . . . . . . . 387
14.5.3 Static vs. Dynamic Termination: Definition . . . . . . . . . 388
14.5.4 Detection of Static Termination . . . . . . . . . . . . . . . 390
14.5.5 Detection of Dynamic Termination . . . . . . . . . . . . . 393
14.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
14.7 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 396
14.8 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . 397

15 Distributed Deadlock Detection . . . . . . . . . . . . . . . . . . . . . 401


15.1 The Deadlock Detection Problem . . . . . . . . . . . . . . . . . . 401
15.1.1 Wait-For Graph (WFG) . . . . . . . . . . . . . . . . . . . 401
15.1.2 AND and OR Models Associated with Deadlock . . . . . . 403
15.1.3 Deadlock in the AND Model . . . . . . . . . . . . . . . . 403
15.1.4 Deadlock in the OR Model . . . . . . . . . . . . . . . . . 404
15.1.5 The Deadlock Detection Problem . . . . . . . . . . . . . . 404
15.1.6 Structure of Deadlock Detection Algorithms . . . . . . . . 405
15.2 Deadlock Detection in the One-at-a-Time Model . . . . . . . . . . 405
15.2.1 Principle and Local Variables . . . . . . . . . . . . . . . . 406
15.2.2 A Detection Algorithm . . . . . . . . . . . . . . . . . . . 406
15.2.3 Proof of the Algorithm . . . . . . . . . . . . . . . . . . . . 407
15.3 Deadlock Detection in the AND Communication Model . . . . . . 408
15.3.1 Model and Principle of the Algorithm . . . . . . . . . . . . 409
15.3.2 A Detection Algorithm . . . . . . . . . . . . . . . . . . . 409
15.3.3 Proof of the Algorithm . . . . . . . . . . . . . . . . . . . . 411
15.4 Deadlock Detection in the OR Communication Model . . . . . . . 413
15.4.1 Principle . . . . . . . . . . . . . . . . . . . . . . . . . . . 413
15.4.2 A Detection Algorithm . . . . . . . . . . . . . . . . . . . 416
15.4.3 Proof of the Algorithm . . . . . . . . . . . . . . . . . . . . 419
15.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
15.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 421
15.7 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . 422

Part VI Distributed Shared Memory


16 Atomic Consistency (Linearizability) . . . . . . . . . . . . . . . . . . 427
16.1 The Concept of a Distributed Shared Memory . . . . . . . . . . . 427
16.2 The Atomicity Consistency Condition . . . . . . . . . . . . . . . . 429
16.2.1 What Is the Issue? . . . . . . . . . . . . . . . . . . . . . . 429
16.2.2 An Execution Is a Partial Order on Operations . . . . . . . 429
16.2.3 Atomicity: Formal Definition . . . . . . . . . . . . . . . . 430
16.3 Atomic Objects Compose for Free . . . . . . . . . . . . . . . . . 432
16.4 Message-Passing Implementations of Atomicity . . . . . . . . . . 435
16.4.1 Atomicity Based on
a Total Order Broadcast Abstraction . . . . . . . . . . . . 435
16.4.2 Atomicity of Read/Write Objects Based on
Server Processes . . . . . . . . . . . . . . . . . . . . . . . 437
16.4.3 Atomicity Based on
a Server Process and Copy Invalidation . . . . . . . . . . . 438
16.4.4 Introducing the Notion of an Owner Process . . . . . . . . 439
16.4.5 Atomicity Based on a Server Process and Copy Update . . 443
16.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
16.6 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 444
16.7 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . 445
17 Sequential Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . 447
17.1 Sequential Consistency . . . . . . . . . . . . . . . . . . . . . . . 447
17.1.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 447
17.1.2 Sequential Consistency Is Not a Local Property . . . . . . 449
17.1.3 Partial Order for Sequential Consistency . . . . . . . . . . 450
17.1.4 Two Theorems
for Sequentially Consistent Read/Write Registers . . . . . 451
17.1.5 From Theorems to Algorithms . . . . . . . . . . . . . . . 453
17.2 Sequential Consistency from Total Order Broadcast . . . . . . . . 453
17.2.1 A Fast Read Algorithm for Read/Write Objects . . . . . . . 453
17.2.2 A Fast Write Algorithm for Read/Write Objects . . . . . . 455
17.2.3 A Fast Enqueue Algorithm for Queue Objects . . . . . . . 456
17.3 Sequential Consistency from a Single Server . . . . . . . . . . . . 456
17.3.1 The Single Server Is a Process . . . . . . . . . . . . . . . . 456
17.3.2 The Single Server Is a Navigating Token . . . . . . . . . . 459
17.4 Sequential Consistency with a Server per Object . . . . . . . . . . 460
17.4.1 Structural View . . . . . . . . . . . . . . . . . . . . . . . 460
17.4.2 The Object Managers Must Cooperate . . . . . . . . . . . 461
17.4.3 An Algorithm Based on the OO Constraint . . . . . . . . . 462
17.5 A Weaker Consistency Condition: Causal Consistency . . . . . . . 464
17.5.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . 464
17.5.2 A Simple Algorithm . . . . . . . . . . . . . . . . . . . . . 466
17.5.3 The Case of a Single Object . . . . . . . . . . . . . . . . . 467
17.6 A Hierarchy of Consistency Conditions . . . . . . . . . . . . . . . 468
17.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 468
17.8 Bibliographic Notes . . . . . . . . . . . . . . . . . . . . . . . . . 469
17.9 Exercises and Problems . . . . . . . . . . . . . . . . . . . . . . . 470
Afterword . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
The Aim of This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . 471
Most Important Concepts, Notions, and Mechanisms
Presented in This Book . . . . . . . . . . . . . . . . . . . . . . . 471
How to Use This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . 473
From Failure-Free Systems to Failure-Prone Systems . . . . . . . . . . 474
A Series of Books . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 477
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
Notation

no-op no operation
skip empty statement
process program in action
n number of processes
e number of edges in the process graph
D diameter of the process graph
pi process whose index is i
idi identity of process pi (very often idi = i)
τ time instant (with respect to an external observer)
⟨a, b⟩ pair with two elements a and b
−→ev   causal precedence relation on events
−→σ    causal precedence relation on local states
−→zz   z-precedence relation on local checkpoints
−→Σ    precedence relation on global states
Mutex mutual exclusion
ABCD small capital letters: message type (message tag)
abcdi italics lower-case letters: local variable of process pi
⟨m1 ; . . . ; mq ⟩ sequence of messages
ai [1..s] array of size s (local to process pi )
for each i ∈ {1, . . . , m} order irrelevant
for each i from 1 to m order relevant
wait (P ) while ¬P do no-op end while
return (v) returns v and terminates the operation invocation
% blablabla % comments
; sequentiality operator between two statements
¬(a R b) relation R does not include the pair ⟨a, b⟩

List of Figures and Algorithms

Fig. 1.1 Three graph types of particular interest . . . . . . . . . . . . . . 4
Fig. 1.2 Synchronous execution (left) vs. asynchronous (right) execution 5
Fig. 1.3 Learning the communication graph (code for pi ) . . . . . . . . . 7
Fig. 1.4 A simple flooding algorithm (code for pi ) . . . . . . . . . . . . 10
Fig. 1.5 A rooted spanning tree . . . . . . . . . . . . . . . . . . . . . . . 11
Fig. 1.6 Tree-based broadcast/convergecast (code for pi ) . . . . . . . . . 11
Fig. 1.7 Construction of a rooted spanning tree (code for pi ) . . . . . . . 13
Fig. 1.8 Left: Underlying communication graph; Right: Spanning tree . . 14
Fig. 1.9 An execution of the algorithm constructing a spanning tree . . . 14
Fig. 1.10 Two different spanning trees built from the same communication
graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
Fig. 1.11 Construction of a breadth-first spanning tree without centralized
control (code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . 18
Fig. 1.12 An execution of the algorithm of Fig. 1.11 . . . . . . . . . . . . 19
Fig. 1.13 Successive waves launched by the root process pa . . . . . . . . 21
Fig. 1.14 Construction of a breadth-first spanning tree with centralized
control (starting code) . . . . . . . . . . . . . . . . . . . . . . . 22
Fig. 1.15 Construction of a breadth-first spanning tree with centralized
control (code for a process pi ) . . . . . . . . . . . . . . . . . . 22
Fig. 1.16 Depth-first traversal of a communication graph (code for pi ) . . 25
Fig. 1.17 Time and message optimal depth-first traversal (code for pi ) . . 27
Fig. 1.18 Management of the token at process pi . . . . . . . . . . . . . . 29
Fig. 1.19 From a depth-first traversal to a ring (code for pi ) . . . . . . . . 29
Fig. 1.20 Sense of direction of the ring and computation of routing tables . 30
Fig. 1.21 An example of a logical ring construction . . . . . . . . . . . . . 31
Fig. 1.22 An anonymous network . . . . . . . . . . . . . . . . . . . . . . 34
Fig. 2.1 Bellman–Ford’s dynamic programming principle . . . . . . . . . 36
Fig. 2.2 A distributed adaptation of Bellman–Ford’s shortest path
algorithm (code for pi ) . . . . . . . . . . . . . . . . . . . . . . 37
Fig. 2.3 A distributed synchronous shortest path algorithm (code for pi ) . 38
Fig. 2.4 Floyd–Warshall’s sequential shortest path algorithm . . . . . . . 39

Fig. 2.5 The principle that underlies Floyd–Warshall’s shortest paths
algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
Fig. 2.6 Distributed Floyd–Warshall’s shortest path algorithm . . . . . . 41
Fig. 2.7 Sequential (Δ + 1)-coloring of the vertices of a graph . . . . . . 42
Fig. 2.8 Distributed (Δ + 1)-coloring from an initial m-coloring where
n ≥ m ≥ Δ + 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Fig. 2.9 One bit of control information when the channels are not FIFO . 45
Fig. 2.10 Examples of maximal independent sets . . . . . . . . . . . . . . 46
Fig. 2.11 From m-coloring to a maximal independent set (code for pi ) . . 47
Fig. 2.12 Luby’s synchronous random algorithm for a maximal
independent set (code for pi ) . . . . . . . . . . . . . . . . . . . 48
Fig. 2.13 Messages exchanged during three consecutive rounds . . . . . . 49
Fig. 2.14 A directed graph with a knot . . . . . . . . . . . . . . . . . . . 51
Fig. 2.15 Possible message pattern during a knot detection . . . . . . . . . 53
Fig. 2.16 Asynchronous knot detection (code of pi ) . . . . . . . . . . . . 55
Fig. 2.17 Knot/cycle detection: example . . . . . . . . . . . . . . . . . . 57
Fig. 3.1 Computation of routing tables defined from distances
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Fig. 3.2 A diameter-independent generic algorithm (code for pi ) . . . . . 65
Fig. 3.3 A process graph with three cut vertices . . . . . . . . . . . . . . 66
Fig. 3.4 Determining cut vertices: principle . . . . . . . . . . . . . . . . 67
Fig. 3.5 An algorithm determining the cut vertices (code for pi ) . . . . . 68
Fig. 3.6 A general algorithm with filtering (code for pi ) . . . . . . . . . 71
Fig. 3.7 The De Bruijn directed networks dB(2,1), dB(2,2), and dB(2,3) . 74
Fig. 3.8 A generic algorithm for a De Bruijn communication graph
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
Fig. 4.1 Chang and Roberts’ election algorithm (code for pi ) . . . . . . . 80
Fig. 4.2 Worst identity distribution for message complexity . . . . . . . . 81
Fig. 4.3 A variant of Chang and Roberts’ election algorithm (code for pi ) 83
Fig. 4.4 Neighborhood of a process pi competing during round r . . . . 84
Fig. 4.5 Competitors at the end of round r are at distance greater than 2r . 84
Fig. 4.6 Hirschberg and Sinclair’s election algorithm (code for pi ) . . . . 85
Fig. 4.7 Neighbor processes on the unidirectional ring . . . . . . . . . . 87
Fig. 4.8 From the first to the second round . . . . . . . . . . . . . . . . . 87
Fig. 4.9 Dolev, Klawe, and Rodeh’s election algorithm (code for pi ) . . . 88
Fig. 4.10 Index-based randomized election (code for pi ) . . . . . . . . . . 90
Fig. 5.1 Home-based three-way handshake mechanism . . . . . . . . . . 95
Fig. 5.2 Structural view of the navigation algorithm
(module at process pi ) . . . . . . . . . . . . . . . . . . . . . . . 98
Fig. 5.3 A navigation algorithm for a complete network (code for pi ) . . 99
Fig. 5.4 Asynchrony involving a mobile object and request messages . . 100
Fig. 5.5 Navigation tree: initial state . . . . . . . . . . . . . . . . . . . . 101
Fig. 5.6 Navigation tree: after the object has moved to pc . . . . . . . . . 102
Fig. 5.7 Navigation tree: proxy role of a process . . . . . . . . . . . . . . 102
Fig. 5.8 A spanning tree-based navigation algorithm (code for pi ) . . . . 104
Fig. 5.9 The case of non-FIFO channels . . . . . . . . . . . . . . . . . . 105
Fig. 5.10 The meaning of vector
R = [d(i1 ), d(i2 ), . . . , d(ix−1 ), d(ix ), 0, . . . , 0] . . . . . . . . . . 108
Fig. 5.11 A dynamically evolving spanning tree . . . . . . . . . . . . . . 110
Fig. 5.12 A navigation algorithm based on a distributed queue
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Fig. 5.13 From the worst to the best case . . . . . . . . . . . . . . . . . . 113
Fig. 5.14 Example of an execution . . . . . . . . . . . . . . . . . . . . . 114
Fig. 5.15 A hybrid navigation algorithm (code for pi ) . . . . . . . . . . . 117
Fig. 6.1 A distributed execution as a partial order . . . . . . . . . . . . . 124
Fig. 6.2 Past, future, and concurrency sets associated with an event . . . . 125
Fig. 6.3 Cut and consistent cut . . . . . . . . . . . . . . . . . . . . . . . 126
Fig. 6.4 Two instances of the same execution . . . . . . . . . . . . . . . 126
Fig. 6.5 Consecutive local states of a process pi . . . . . . . . . . . . . . 127
Fig. 6.6 From a relation on events to a relation on local states . . . . . . . 128
Fig. 6.7 A two-process distributed execution . . . . . . . . . . . . . . . . 130
Fig. 6.8 Lattice of consistent global states . . . . . . . . . . . . . . . . . 130
Fig. 6.9 Sequential observations of a distributed computation . . . . . . . 131
Fig. 6.10 Illustrating the notations “e ∈ σi ” and “f ∈ σi ” . . . . . . . . . . 133
Fig. 6.11 In-transit and orphan messages . . . . . . . . . . . . . . . . . . 133
Fig. 6.12 Cut versus global state . . . . . . . . . . . . . . . . . . . . . . . 135
Fig. 6.13 Global state computation: structural view . . . . . . . . . . . . . 136
Fig. 6.14 Recording of a local state . . . . . . . . . . . . . . . . . . . . . 139
Fig. 6.15 Reception of a MARKER () message: case 1 . . . . . . . . . . . . 139
Fig. 6.16 Reception of a MARKER () message: case 2 . . . . . . . . . . . . 139
Fig. 6.17 Global state computation (FIFO channels, code for cpi ) . . . . . 140
Fig. 6.18 A simple automaton for process pi (i = 1, 2) . . . . . . . . . . . 141
Fig. 6.19 Prefix of a simple execution . . . . . . . . . . . . . . . . . . . . 142
Fig. 6.20 Superimposing a global state computation
on a distributed execution . . . . . . . . . . . . . . . . . . . . . 142
Fig. 6.21 Consistent cut associated with the computed global state . . . . . 143
Fig. 6.22 A rubber band transformation . . . . . . . . . . . . . . . . . . . 143
Fig. 6.23 Global state computation (non-FIFO channels, code for cpi ) . . . 145
Fig. 6.24 Example of a global state computation (non-FIFO channels) . . . 145
Fig. 6.25 Another global state computation
(non-FIFO channels, code for cpi ) . . . . . . . . . . . . . . . . 148
Fig. 7.1 Implementation of a linear clock (code for process pi ) . . . . . . 150
Fig. 7.2 A simple example of a linear clock system . . . . . . . . . . . . 151
Fig. 7.3 A non-sequential observation obtained from linear time . . . . . 152
Fig. 7.4 A sequential observation obtained from timestamps . . . . . . . 153
Fig. 7.5 Total order broadcast: the problem that has to be solved . . . . . 155
Fig. 7.6 Structure of the total order broadcast implementation . . . . . . 155
Fig. 7.7 Implementation of total order broadcast (code for process pi ) . . 157
Fig. 7.8 To_delivery predicate of a message at process pi . . . . . . . . . 157
Fig. 7.9 Implementation of a vector clock system (code for process pi ) . 160
Fig. 7.10 Time propagation in a vector clock system . . . . . . . . . . . . 161
Fig. 7.11 On the development of time (1) . . . . . . . . . . . . . . . . . . 164
Fig. 7.12 On the development of time (2) . . . . . . . . . . . . . . . . . . 164
Fig. 7.13 Associating vector dates with global states . . . . . . . . . . . . 165
Fig. 7.14 First global state satisfying a global predicate (1) . . . . . . . . . 167
Fig. 7.15 First global state satisfying a global predicate (2) . . . . . . . . . 168
Fig. 7.16 Detection of the first global state satisfying ∧i LPi
(code for process pi ) . . . . . . . . . . . . . . . . . . . . . . . 169
Fig. 7.17 Relevant events in a distributed computation . . . . . . . . . . . 171
Fig. 7.18 Vector clock system for relevant events (code for process pi ) . . 171
Fig. 7.19 From relevant events to Hasse diagram . . . . . . . . . . . . . . 171
Fig. 7.20 Determination of the immediate predecessors
(code for process pi ) . . . . . . . . . . . . . . . . . . . . . . . 172
Fig. 7.21 Four possible cases when updating impi [k],
while vci [k] = vc[k] . . . . . . . . . . . . . . . . . . . . . . . . 173
Fig. 7.22 A specific communication pattern . . . . . . . . . . . . . . . . . 175
Fig. 7.23 Specific communication pattern with n = 3 processes . . . . . . 175
Fig. 7.24 Management of vci [1..n] and kprimei [1..n, 1..n]
(code for process pi ): Algorithm 1 . . . . . . . . . . . . . . . . 178
Fig. 7.25 Management of vci [1..n] and kprimei [1..n, 1..n]
(code for process pi ): Algorithm 2 . . . . . . . . . . . . . . . . 179
Fig. 7.26 An adaptive communication layer (code for process pi ) . . . . . 181
Fig. 7.27 Implementation of a k-restricted vector clock system
(code for process pi ) . . . . . . . . . . . . . . . . . . . . . . . 182
Fig. 7.28 Matrix time: an example . . . . . . . . . . . . . . . . . . . . . . 183
Fig. 7.29 Implementation of matrix time (code for process pi ) . . . . . . . 184
Fig. 7.30 Discarding obsolete data: structural view (at a process pi ) . . . . 185
Fig. 7.31 A buffer management algorithm (code for process pi ) . . . . . . 185
Fig. 7.32 Yet another clock system (code for process pi ) . . . . . . . . . . 188
Fig. 8.1 A checkpoint and communication pattern (with intervals) . . . . 190
Fig. 8.2 A zigzag pattern . . . . . . . . . . . . . . . . . . . . . . . . . . 192
Fig. 8.3 Proof of Theorem 9:
a zigzag path joining two local checkpoints of LC . . . . . . . . 194
Fig. 8.4 Proof of Theorem 9:
a zigzag path joining two local checkpoints . . . . . . . . . . . . 195
Fig. 8.5 Domino effect (in a system of two processes) . . . . . . . . . . . 196
Fig. 8.6 Proof by contradiction of Theorem 11 . . . . . . . . . . . . . . 200
Fig. 8.7 A very simple z-cycle-free checkpointing algorithm
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
Fig. 8.8 To take or not to take a forced local checkpoint . . . . . . . . . . 202
Fig. 8.9 An example of z-cycle prevention . . . . . . . . . . . . . . . . . 202
Fig. 8.10 A vector clock system for rollback-dependency trackability
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
Fig. 8.11 Intervals and vector clocks for rollback-dependency trackability . 204
Fig. 8.12 Russell’s pattern for ensuring the RDT consistency condition . . 205
Fig. 8.13 Russell’s checkpointing algorithm (code for pi ) . . . . . . . . . 205
Fig. 8.14 FDAS checkpointing algorithm (code for pi ) . . . . . . . . . . . 207
Fig. 8.15 Matrix causali [1..n, 1..n] . . . . . . . . . . . . . . . . . . . . . 208
Fig. 8.16 Pure (left) vs. impure (right) causal paths from pj to pi . . . . . 208
Fig. 8.17 An impure causal path from pi to itself . . . . . . . . . . . . . . 209
Fig. 8.18 An efficient checkpointing algorithm for RDT (code for pi ) . . . 210
Fig. 8.19 Sender-based optimistic message logging . . . . . . . . . . . . . 212
Fig. 8.20 To log or not to log a message? . . . . . . . . . . . . . . . . . . 212
Fig. 8.21 An uncoordinated checkpointing algorithm (code for pi ) . . . . 214
Fig. 8.22 Retrieving the messages which are in transit
with respect to the pair (ci , cj ) . . . . . . . . . . . . . . . . . . 215
Fig. 9.1 A space-time diagram of a synchronous execution . . . . . . . . 220
Fig. 9.2 Synchronous breadth-first traversal algorithm (code for pi ) . . . 221
Fig. 9.3 Synchronizer: from asynchrony to logical synchrony . . . . . . . 222
Fig. 9.4 Synchronizer α (code for pi ) . . . . . . . . . . . . . . . . . . . 226
Fig. 9.5 Synchronizer α: possible message arrival at process pi . . . . . . 227
Fig. 9.6 Synchronizer β (code for pi ) . . . . . . . . . . . . . . . . . . . 229
Fig. 9.7 A message pattern which can occur with synchronizer β
(but not with α): Case 1 . . . . . . . . . . . . . . . . . . . . . . 229
Fig. 9.8 A message pattern which can occur with synchronizer β
(but not with α): Case 2 . . . . . . . . . . . . . . . . . . . . . . 229
Fig. 9.9 Synchronizer γ : a communication graph . . . . . . . . . . . . . 230
Fig. 9.10 Synchronizer γ : a partitioning . . . . . . . . . . . . . . . . . . . 231
Fig. 9.11 Synchronizer γ (code for pi ) . . . . . . . . . . . . . . . . . . . 233
Fig. 9.12 Synchronizer δ (code for pi ) . . . . . . . . . . . . . . . . . . . 235
Fig. 9.13 Initialization of physical clocks (code for pi ) . . . . . . . . . . . 236
Fig. 9.14 The scenario to be prevented . . . . . . . . . . . . . . . . . . . 237
Fig. 9.15 Interval during which a process can receive pulse r messages . . 238
Fig. 9.16 Synchronizer λ (code for pi ) . . . . . . . . . . . . . . . . . . . 239
Fig. 9.17 Synchronizer μ (code for pi ) . . . . . . . . . . . . . . . . . . . 240
Fig. 9.18 Clock drift with respect to reference time . . . . . . . . . . . . . 241
Fig. 10.1 A mutex invocation pattern and the three states of a process . . . 248
Fig. 10.2 Mutex module at a process pi : structural view . . . . . . . . . . 250
Fig. 10.3 A mutex algorithm based on individual permissions
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
Fig. 10.4 Proof of the safety property of the algorithm of Fig. 10.3 . . . . . 253
Fig. 10.5 Proof of the liveness property of the algorithm of Fig. 10.3 . . . 254
Fig. 10.6 Generalized mutex based on individual permissions
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256
Fig. 10.7 An adaptive mutex algorithm based on individual permissions
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Fig. 10.8 Non-FIFO channel in the algorithm of Fig. 10.7 . . . . . . . . . 259
Fig. 10.9 States of the message PERMISSION ({i, j }) . . . . . . . . . . . . 260
Fig. 10.10 A bounded adaptive algorithm based on individual permissions
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
Fig. 10.11 Arbiter permission-based mechanism . . . . . . . . . . . . . . . 265
Fig. 10.12 Values of K and D for symmetric optimal quorums . . . . . . . 266
Fig. 10.13 An order two projective plane . . . . . . . . . . . . . . . . . . . 267
Fig. 10.14 A safe (but not live) mutex algorithm
based on arbiter permissions (code for pi ) . . . . . . . . . . . . 269
Fig. 10.15 Permission preemption to prevent deadlock . . . . . . . . . . . . 270
Fig. 10.16 A mutex algorithm based on arbiter permissions (code for pi ) . . 272
Fig. 11.1 An algorithm for the multiple entries mutex problem
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
Fig. 11.2 Sending pattern of NOT _ USED () messages: Case 1 . . . . . . . . 281
Fig. 11.3 Sending pattern of NOT _ USED () messages: Case 2 . . . . . . . . 282
Fig. 11.4 An algorithm for the k-out-of-M mutex problem (code for pi ) . . 282
Fig. 11.5 Examples of conflict graphs . . . . . . . . . . . . . . . . . . . . 286
Fig. 11.6 Global conflict graph . . . . . . . . . . . . . . . . . . . . . . . 287
Fig. 11.7 A deadlock scenario involving two processes and two resources . 288
Fig. 11.8 No deadlock with ordered resources . . . . . . . . . . . . . . . 289
Fig. 11.9 A particular pattern in using resources . . . . . . . . . . . . . . 289
Fig. 11.10 Conflict graph for six processes, each resource being shared
by two processes . . . . . . . . . . . . . . . . . . . . . . . . . . 290
Fig. 11.11 Optimal vertex-coloring of a resource graph . . . . . . . . . . . 291
Fig. 11.12 Conflict graph for static sessions (SS_CG) . . . . . . . . . . . . 292
Fig. 11.13 Simultaneous requests in dynamic sessions
(sketch of code for pi ) . . . . . . . . . . . . . . . . . . . . . . . 296
Fig. 11.14 Algorithms for generalized k-out-of-M (code for pi ) . . . . . . . 297
Fig. 11.15 Another algorithm for the k-out-of-M mutex problem
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
Fig. 12.1 The causal message delivery order property . . . . . . . . . . . 304
Fig. 12.2 The delivery pattern prevented by the empty interval property . . 305
Fig. 12.3 Structure of a causal message delivery implementation . . . . . . 307
Fig. 12.4 An implementation of causal message delivery (code for pi ) . . . 308
Fig. 12.5 Message pattern for the proof of the causal order delivery . . . . 309
Fig. 12.6 An implementation reducing the size of control information
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Fig. 12.7 Control information carried by consecutive messages sent by pj
to pi . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 312
Fig. 12.8 An adaptive sending procedure for causal message delivery . . . 313
Fig. 12.9 Illustration of causal broadcast . . . . . . . . . . . . . . . . . . 313
Fig. 12.10 A simple algorithm for causal broadcast (code for pi ) . . . . . . 314
Fig. 12.11 The causal broadcast algorithm in action . . . . . . . . . . . . . 315
Fig. 12.12 The graph of immediate predecessor messages . . . . . . . . . . 316
Fig. 12.13 A causal broadcast algorithm based on causal barriers
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 317
Fig. 12.14 Message with bounded lifetime . . . . . . . . . . . . . . . . . . 318
Fig. 12.15 On-time versus too late . . . . . . . . . . . . . . . . . . . . . . 319
Fig. 12.16 A Δ-causal broadcast algorithm (code for pi ) . . . . . . . . . . . 320
Fig. 12.17 Implementation of total order message delivery requires
coordination . . . . . . . . . . . . . . . . . . . . . . . . . . . . 321
Fig. 12.18 Total order broadcast based on a coordinator process . . . . . . . 322
Fig. 12.19 Token-based total order broadcast . . . . . . . . . . . . . . . . . 323
Fig. 12.20 Clients and servers in total order broadcast . . . . . . . . . . . . 324
Fig. 12.21 A total order algorithm from clients pi to servers qj . . . . . . . 326
Fig. 12.22 A total order algorithm from synchronous systems . . . . . . . . 327
Fig. 12.23 Message m with type ct_future (cannot be bypassed) . . . . 329
Fig. 12.24 Message m with type ct_past
(cannot bypass other messages) . . . . . . . . . . . . . . . . . . 329
Fig. 12.25 Message m with type marker . . . . . . . . . . . . . . . . . . 329
Fig. 12.26 Building a first in first out channel . . . . . . . . . . . . . . . . 330
Fig. 12.27 Message delivery according to message types . . . . . . . . . . 331
Fig. 12.28 Delivery of messages typed ordinary and ct_future . . . . 332
Fig. 13.1 Synchronous communication:
messages as “points” instead of “intervals” . . . . . . . . . . . . 336
Fig. 13.2 When communications are not synchronous . . . . . . . . . . . 338
Fig. 13.3 Accessing objects with synchronous communication . . . . . . . 339
Fig. 13.4 A crown of size k = 2 (left) and a crown of size k = 3 (right) . . 339
Fig. 13.5 Four message patterns . . . . . . . . . . . . . . . . . . . . . . . 340
Fig. 13.6 Implementation of a rendezvous when the client is the sender . . 344
Fig. 13.7 Implementation of a rendezvous when the client is the receiver . 345
Fig. 13.8 A token-based mechanism to implement an interaction . . . . . . 346
Fig. 13.9 Deadlock and livelock prevention in interaction implementation . 347
Fig. 13.10 A general token-based implementation for planned interactions
(rendezvous) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
Fig. 13.11 An algorithm for forced interactions (rendezvous) . . . . . . . . 351
Fig. 13.12 Forced interaction: message pattern when i > j . . . . . . . . . 352
Fig. 13.13 Forced interaction: message pattern when i < j . . . . . . . . . 352
Fig. 13.14 When the rendezvous must be successful
(two-process symmetric algorithm) . . . . . . . . . . . . . . . . 356
Fig. 13.15 Real-time rendezvous between two processes p and q . . . . . . 357
Fig. 13.16 When the rendezvous must be successful
(asymmetric algorithm) . . . . . . . . . . . . . . . . . . . . . . 358
Fig. 13.17 Nondeterministic rendezvous with deadline . . . . . . . . . . . . 359
Fig. 13.18 Multirendezvous with deadline . . . . . . . . . . . . . . . . . . 361
Fig. 13.19 Comparing two date patterns for rendezvous with deadline . . . 363
Fig. 14.1 Process states for termination detection . . . . . . . . . . . . . . 368
Fig. 14.2 Global structure of the observation modules . . . . . . . . . . . 370
Fig. 14.3 An execution in the asynchronous atomic model . . . . . . . . . 371
Fig. 14.4 One visit is not sufficient . . . . . . . . . . . . . . . . . . . . . 371
Fig. 14.5 The four-counter algorithm for termination detection . . . . . . . 372
Fig. 14.6 Two consecutive inquiries . . . . . . . . . . . . . . . . . . . . . 373
Fig. 14.7 The counting vector algorithm for termination detection . . . . . 374
Fig. 14.8 The counting vector algorithm at work . . . . . . . . . . . . . . 375
Fig. 14.9 Termination detection of a diffusing computation . . . . . . . . . 378
Fig. 14.10 Ring-based implementation of a wave . . . . . . . . . . . . . . 380
Fig. 14.11 Spanning tree-based implementation of a wave . . . . . . . . . . 381
Fig. 14.12 Why (⋀1≤i≤n idlexi ) ⇒ TERM(C, ταx ) is not true . . . . . . . . . 382
Fig. 14.13 A general algorithm for termination detection . . . . . . . . . . 384
Fig. 14.14 Atomicity associated with τix . . . . . . . . . . . . . . . . . . . 384
Fig. 14.15 Structure of the channels to pi . . . . . . . . . . . . . . . . . . 386
Fig. 14.16 An algorithm for static termination detection . . . . . . . . . . . 391
Fig. 14.17 Definition of time instants for the safety of static termination . . 392
Fig. 14.18 Cooperation between local observers . . . . . . . . . . . . . . . 394
Fig. 14.19 An algorithm for dynamic termination detection . . . . . . . . . 395
Fig. 14.20 Example of a monotonous distributed computation . . . . . . . . 398
Fig. 15.1 Examples of wait-for graphs . . . . . . . . . . . . . . . . . . . . 402
Fig. 15.2 An algorithm for deadlock detection
in the AND communication model . . . . . . . . . . . . . . . . 410
Fig. 15.3 Determining in-transit messages . . . . . . . . . . . . . . . . . 411
Fig. 15.4 PROBE () messages sent along a cycle
(with no application messages in transit) . . . . . . . . . . . . . 411
Fig. 15.5 Time instants in the proof of the safety property . . . . . . . . . 412
Fig. 15.6 A directed communication graph . . . . . . . . . . . . . . . . . 414
Fig. 15.7 Network traversal with feedback on a static graph . . . . . . . . 414
Fig. 15.8 Modification in a wait-for graph . . . . . . . . . . . . . . . . . . 415
Fig. 15.9 Inconsistent observation of a dynamic wait-for graph . . . . . . 416
Fig. 15.10 An algorithm for deadlock detection
in the OR communication model . . . . . . . . . . . . . . . . . 418
Fig. 15.11 Activation pattern for the safety proof . . . . . . . . . . . . . . . 420
Fig. 15.12 Another example of a wait-for graph . . . . . . . . . . . . . . . 423
Fig. 16.1 Structure of a distributed shared memory . . . . . . . . . . . . . 428
Fig. 16.2 Register: What values can be returned by read operations? . . . . 429
Fig. 16.3 The relation −→op of the computation described in Fig. 16.2 . . . 430
Fig. 16.4 An execution of an atomic register . . . . . . . . . . . . . . . . 432
Fig. 16.5 Another execution of an atomic register . . . . . . . . . . . . . . 432
Fig. 16.6 Atomicity allows objects to compose for free . . . . . . . . . . . 435
Fig. 16.7 From total order broadcast to atomicity . . . . . . . . . . . . . . 436
Fig. 16.8 Why read operations have to be to-broadcast . . . . . . . . . . . 437
Fig. 16.9 Invalidation-based implementation of atomicity:
message flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
Fig. 16.10 Invalidation-based implementation of atomicity: algorithm . . . 440
Fig. 16.11 Invalidation and owner-based implementation of atomicity
(code of pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 441
Fig. 16.12 Invalidation and owner-based implementation of atomicity
(code of the manager pX ) . . . . . . . . . . . . . . . . . . . . . 442
Fig. 16.13 Update-based implementation of atomicity . . . . . . . . . . . . 443
Fig. 16.14 Update-based algorithm implementing atomicity . . . . . . . . . 444
Fig. 17.1 A sequentially consistent computation (which is not atomic) . . . 448
Fig. 17.2 A computation which is not sequentially consistent . . . . . . . 449
Fig. 17.3 A sequentially consistent queue . . . . . . . . . . . . . . . . . . 449
Fig. 17.4 Sequential consistency is not a local property . . . . . . . . . . . 450
Fig. 17.5 Part of the graph G used in the proof of Theorem 29 . . . . . . . 452
Fig. 17.6 Fast read algorithm implementing sequential consistency
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 454
Fig. 17.7 Fast write algorithm implementing sequential consistency
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 456
Fig. 17.8 Fast enqueue algorithm implementing a sequentially consistent
queue (code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . 457
Fig. 17.9 Read/write sequentially consistent registers
from a central manager . . . . . . . . . . . . . . . . . . . . . . 458
Fig. 17.10 Pattern of read/write accesses used in the proof of Theorem 33 . 459
Fig. 17.11 Token-based sequentially consistent shared memory
(code for pi ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 460
Fig. 17.12 Architectural view associated with the OO constraint . . . . . . 461
Fig. 17.13 Why the object managers must cooperate . . . . . . . . . . . . . 461
Fig. 17.14 Sequential consistency with a manager per object:
process side . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
Fig. 17.15 Sequential consistency with a manager per object:
manager side . . . . . . . . . . . . . . . . . . . . . . . . . . . . 463
Fig. 17.16 Cooperation between managers is required by the OO constraint 464
Fig. 17.17 An example of a causally consistent computation . . . . . . . . . 465
Fig. 17.18 Another example of a causally consistent computation . . . . . . 466
Fig. 17.19 A simple algorithm implementing causal consistency . . . . . . 467
Fig. 17.20 Causal consistency for a single object . . . . . . . . . . . . . . . 467
Fig. 17.21 Hierarchy of consistency conditions . . . . . . . . . . . . . . . . 469
Part I
Distributed Graph Algorithms

This first part of the book is on distributed graph algorithms. These algorithms consider the distributed system as a connected graph whose vertices are the processes (nodes) and whose edges are the communication channels. This part is made up of five chapters.

After having introduced base definitions, Chap. 1 addresses network traversals.
It presents distributed algorithms that realize parallel, depth-first, and breadth-first
network traversals. Chapter 2 is on distributed algorithms solving classical graph
problems such as shortest paths, vertex coloring, maximal independent set, and
knot detection. This chapter shows that the distributed techniques to solve graph
problems are not obtained by a simple extension of their sequential counterparts.
Chapter 3 presents a general technique to compute a global function on a process
graph, each process providing its own input parameter, and obtaining its own output
(which depends on the whole set of inputs). Chapter 4 is on the leader election
problem, with a strong emphasis on uni/bidirectional rings. Finally, the last chapter of
this part, Chap. 5, presents several algorithms that allow a mobile object to navigate
a network.
In addition to the presentation of distributed graph algorithms, which can be used
in distributed applications, an aim of this part of the book is to allow readers to have
a better intuition of the term distributed when comparing distributed algorithms and
sequential algorithms.
Chapter 1
Basic Definitions and Network Traversal Algorithms

This chapter first introduces basic definitions related to distributed algorithms. Then,
considering a distributed system as a graph whose vertices are the processes and
whose edges are the communication channels, it presents distributed algorithms for
graph traversals, namely, parallel traversal, breadth-first traversal, and depth-first
traversal. It also shows how spanning trees or rings can be constructed from these
distributed graph traversal algorithms. These trees and rings can, in turn, be used to
easily implement broadcast and convergecast algorithms.
As the reader will see, the distributed graph traversal techniques are different
from their sequential counterparts in their underlying principles, behaviors, and
complexities. This comes from the fact that, in a distributed context, the same type of
traversal can usually be realized in distinct ways, each with its own tradeoff between
its time complexity and message complexity.

Keywords Asynchronous/synchronous system · Breadth-first traversal ·
Broadcast · Convergecast · Depth-first traversal · Distributed algorithm ·
Forward/discard principle · Initial knowledge · Local algorithm ·
Parallel traversal · Spanning tree · Unidirectional logical ring

1.1 Distributed Algorithms

1.1.1 Definition

Processes A distributed system is made up of a collection of computing units, each one abstracted through the notion of a process. The processes are assumed to
cooperate on a common goal, which means that they exchange information in one
way or another.
The set of processes is static. It is composed of n processes and denoted Π =
{p1 , . . . , pn }, where each pi , 1 ≤ i ≤ n, represents a distinct process. Each process
pi is sequential, i.e., it executes one step at a time.
The integer i denotes the index of process pi , i.e., the way an external observer
can distinguish processes. It is nearly always assumed that each process pi has its
own identity, denoted idi ; then pi knows idi (in a lot of cases—but not always—
idi = i).


Fig. 1.1 Three graph types of particular interest

Communication Medium The processes communicate by sending and receiving messages through channels. Each channel is assumed to be reliable (it does not create, modify, or duplicate messages).
In some cases, we assume that channels are first in first out (FIFO) which means
that the messages are received in the order in which they have been sent. Each
channel is assumed to be bidirectional (can carry messages in both directions) and
to have an infinite capacity (can contain any number of messages, each of any size).
In some particular cases, we will consider channels which are unidirectional (such
channels carry messages in one direction only).
Each process pi has a set of neighbors, denoted neighborsi . According to the
context, this set contains either the local identities of the channels connecting pi to
its neighbor processes or the identities of these processes.

Structural View It follows from the previous definitions that, from a structural
point of view, a distributed system can be represented by a connected undirected
graph G = (Π, C) (where C denotes the set of channels). Three types of graph are
of particular interest (Fig. 1.1):
• A ring is a graph in which each process has exactly two neighbors with which it
can communicate directly, a left neighbor and a right neighbor.
• A tree is a graph that has two noteworthy properties: it is acyclic and connected
(which means that adding a new channel would create a cycle while suppressing
a channel would disconnect it).
• A fully connected graph is a graph in which each process is directly connected to
every other process. (In graph terminology, such a graph is called a clique.)

Distributed Algorithm A distributed algorithm is a collection of n automata, one per process. An automaton describes the sequence of steps executed by the corresponding process.
In addition to the power of a Turing machine, an automaton is enriched with two communication operations, send() and receive(), which allow it to send a message on a channel and receive a message on any channel.

Synchronous Algorithm A distributed synchronous algorithm is an algorithm designed to be executed on a synchronous distributed system. The progress of such a system is governed by an external global clock, and the processes collectively execute a sequence of rounds, each round corresponding to a value of the global clock.

Fig. 1.2 Synchronous execution (left) vs. asynchronous (right) execution

During a round, a process sends at most one message to each of its neighbors. The
fundamental property of a synchronous system is that a message sent by a process
during a round r is received by its destination process during the very same round r.
Hence, when a process proceeds to the round r + 1, it has received (and processed)
all the messages which have been sent to it during round r, and it knows that the
same is true for any process.
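
To make the round structure concrete, here is a minimal, self-contained sketch in Python; the ring topology, the max computation, and all names are illustrative choices, not taken from the book. The key property it encodes is that every message sent during round r is delivered before round r + 1 starts.

```python
# Simulation of the synchronous model on a bidirectional ring: in each
# round every process sends its current knowledge to its two neighbors,
# and all round-r messages are delivered before round r + 1 starts.
def synchronous_max(ids):
    n = len(ids)
    known = list(ids)                      # known[i]: largest identity seen by p_i
    for _ in range(n // 2 + 1):            # enough rounds to cover the ring
        sent = list(known)                 # the messages of this round, frozen
        for i in range(n):                 # delivery: round-r messages only
            known[i] = max(known[i], sent[(i - 1) % n], sent[(i + 1) % n])
    return known

print(synchronous_max([3, 7, 2, 9, 5]))    # [9, 9, 9, 9, 9]
```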

Space/time Diagram A distributed execution can be graphically represented by what is called a space/time diagram. The sequential progress of each process is represented by an arrow from left to right, and a message is represented by an arrow from the sending process to the destination process. These notions will be made more precise in Chap. 6.
The space/time diagram on the left of Fig. 1.2 represents a synchronous execu-
tion. The vertical lines are used to separate the successive rounds. During the first
round, p1 sends a message to p3 , and p2 sends a message to p1 , etc.

Asynchronous Algorithm A distributed asynchronous algorithm is an algorithm designed to be executed on an asynchronous distributed system. In such a system, there is no notion of an external time. That is why asynchronous systems are sometimes called time-free systems.
In an asynchronous algorithm, the progress of a process is ensured by its own
computation and the messages it receives. When a process receives a message, it
processes the message and, according to its local algorithm, possibly sends mes-
sages to its neighbors.
A process processes one message at a time. This means that the processing of a
message cannot be interrupted by the arrival of another message. When a message
arrives, it is added to the input buffer of the receiving process. It will be processed
after all the messages that precede it in this buffer have been processed.
The space/time diagram of a simple asynchronous execution is depicted on the
right of Fig. 1.2. One can see that, in this example, the messages from p1 to p2
are not received in their sending order. Hence, the channel from p1 to p2 is not a
FIFO (first in first out) channel. It is easy to see from the figure that a synchronous
execution is more structured than an asynchronous execution.

Initial Knowledge of a Process When solving a problem in a synchronous/asynchronous system, a process is characterized by its input parameters (which are related to the problem to solve) and its initial knowledge of its environment.

This knowledge concerns its identity, the total number n of processes, the identity
of its neighbors, the structure of the communication graph, etc. As an example, a
process pi may only know that
• it is on a unidirectional ring,
• it has a left neighbor from which it can receive messages,
• it has a right neighbor to which it can send messages,
• its identity is idi ,
• the fact that no two processes have the same identity, and
• the fact that the set of identities is totally ordered.
As we can see, with such an initial knowledge, no process initially knows the total
number of processes n. Learning this number requires the processes to exchange
information.

1.1.2 An Introductory Example: Learning the Communication Graph

As a simple example, this section presents an asynchronous algorithm that allows each process to learn the communication graph in which it evolves. It is assumed that the channels are bidirectional and that the communication graph is connected (there is a path from any process to any other process).

Initial Knowledge Each process pi has identity idi , and no process knows n (the
total number of processes). Initially, a process pi knows its identity and the iden-
tity idj of each of its neighbors. Hence, each process pi is initially provided with
a set neighborsi and, for each idj ∈ neighborsi , the pair ⟨idi , idj ⟩ denotes locally the channel connecting pi to pj . Let us observe that, as the channels are bidirectional, both ⟨idi , idj ⟩ and ⟨idj , idi ⟩ denote the same channel and are consequently considered as synonyms.

The Forward/Discard Principle The principle on which the algorithm relies is pretty simple: Each process initially sends its position in the graph to each of its neighbors. This position is represented by the pair (idi , neighborsi ).
Then, when a process pi receives a pair (idk , neighborsk ) for the first time, it up-
dates its local representation of the communication graph and forwards the message
it has received to all its neighbors (except the one that sends this message). This is
the “when new, forward” principle. On the contrary, if it is not the first time that pi
receives the pair (idk , neighborsk ), it discards it. This is the “when not new, discard”
principle.
When pi has received a pair (idk , neighborsk ), we say that it “knows the po-
sition” of pk in the graph. This means that it knows both the identity idk and the
channels connecting pk to its neighbors.

operation start() is
(1) for each idj ∈ neighborsi
(2) do send POSITION (idi , neighborsi ) to the neighbor identified idj
(3) end for;
(4) parti ← true
end operation.

when START () is received do
(5) if (¬parti ) then start() end if.

when POSITION (id, neighbors) is received from neighbor identified idx do
(6) if (¬parti ) then start() end if;
(7) if (id ∉ proc_knowni ) then
(8) proc_knowni ← proc_knowni ∪ {id};
(9) channels_knowni ← channels_knowni ∪ {⟨id, idk ⟩ such that idk ∈ neighbors};
(10) for each idy ∈ neighborsi \ {idx }
(11) do send POSITION (id, neighbors) to the neighbor identified idy
(12) end for;
(13) if (∀ ⟨idj , idk ⟩ ∈ channels_knowni : {idj , idk } ⊆ proc_knowni )
(14) then pi knows the communication graph; return()
(15) end if
(16) end if.

Fig. 1.3 Learning the communication graph (code for pi )

Local Representation of the Communication Graph The graph is locally represented at each process pi with two local variables.
• The local variable proc_knowni is a set that contains all the processes whose
position is known by pi . Initially, proc_knowni = {idi }.
• The local variable channels_knowni is a set that contains all the channels known
by pi . Initially, channels_knowni = {⟨idi , idj ⟩ such that idj ∈ neighborsi }.
Hence, after a process has received a message containing the pair (idj , neighborsj ), we have idj ∈ proc_knowni and {⟨idj , idk ⟩ such that idk ∈ neighborsj } ⊆ channels_knowni .
In addition to the local representation of the graph, pi has a local Boolean vari-
able parti , initialized to false, which is set to true when pi starts participating in the
algorithm.

Internal Versus External Messages This participation of a process starts when it receives an external message START () or an internal message POSITION ().
An internal message is a message generated by the algorithm, while an external
message is a message coming from outside. External messages are used to launch
the algorithm. It is assumed that at least one process receives such a message.

Algorithm: Forward/Discard The algorithm is described in Fig. 1.3. As previously indicated, when a process pi receives a message START () or POSITION (), it starts participating in the algorithm if not yet done (line 5 or 6). To that end it sends
the message POSITION (idi , neighborsi ) to each of its neighbors (line 2) and sets
parti to true (line 4).
When pi receives a message POSITION (id, neighbors) from one of its neighbors
px for the first time (line 7), it includes the position of the corresponding process pj
in the local data structures proc_knowni and channels_knowni (lines 8–9) and, as it
has learned something new, it forwards this message POSITION () to all its neighbors,
but the one that sent it this message (line 10). If it has already received the message
POSITION (id, neighbors) (we have then id ∈ proc_knowni ), pi discards the message.

Algorithm: Termination As the communication graph is connected, it is easy to see that, as soon as a process receives a message START (), each process pi will
send a message POSITION (idi , neighborsi ) which, from neighbor to neighbor, will
be received by each process. Consequently, for any pair of processes (pi , pj ), pi
will receive a message POSITION (idj , neighborsj ), from which it follows that any
process pi eventually learns the communication graph.
Moreover, as (a) there is a bounded number of processes n, (b) each process pi is
the only process to initiate the sending of the message POSITION (idi , neighborsi ),
and (c) any process pj forwards this message only once, it follows that there is a
finite time after which no more message is sent. Consequently, the algorithm termi-
nates at each process. While the algorithm always terminates, the important question
is the following: When does a process know that it can stop participating in the al-
gorithm? Trivially, a process can stop when it knows that it has learned the whole
communication graph (due to the “forward” strategy, when a process knows the
whole graph, it also knows that its neighbors eventually know it). This knowledge
can be easily captured by a process pi with the help of its local data structures
proc_knowni and channels_knowni . More precisely, remembering that the pairs
idi , idj  and idj , idi  are synonyms and using a classical graph closure prop-
erty, a process pi knows the whole graph when ∀ idj , idk  ∈ channels_knowni :
{idj , idk } ⊆ proc_knowni . This local termination predicate appears at line 13. When
it becomes locally satisfied, a process pi learns that it knows the whole graph and
learns also that its neighbors eventually know it. That process can consequently stop
its execution by invoking the statement return() line 14.
It is important to notice that the simplicity of the termination predicate
comes from an appropriate choice of the local data structures (proc_knowni and
proc_knowni ) used to represent the communication graph.

Cost Let e be the total number of channels and D be the diameter of the com-
munication graph. The diameter of a graph is the longest among all the shortest
distances connecting any pair of processes, where the shortest distance between pi
and pj is the smallest number of channels to go from pi to pj . The diameter notion
is a global notion that measures the “breadth” of the communication graph.
For any i and any channel, a message POSITION (idi , −) is sent at least once and
at most twice (once in each direction) on that channel. It follows that the message
complexity is upper bounded by 2ne.
As far as the time complexity is concerned, let us consider that each message
takes one time unit and local processing has zero duration. In the worst case, a
single process pk receives a message START () and there is a process pℓ at distance D from pk . In this case, it takes D time units for a message POSITION (idk , −) to arrive at pℓ . This message wakes up pℓ , and it then takes D time units for a message POSITION (idℓ , −) to arrive at pk . It follows that the time complexity is upper bounded by 2D.
Finally, let d be the maximal degree of the communication graph (i.e., d = max1≤i≤n |neighborsi |), and b the number of bits required to encode any iden-
tity idi . The maximal number of bits needed for a message POSITION () is b(d + 1).

When Initially the Channels Have Only Local Names Let us consider a pro-
cess pi that has ci neighbors to which it is point-to-point connected by ci chan-
nels locally denoted channeli [1..ci ]. When each process pi is initially given only
channeli [1..ci ], the processes can easily compute their sets neighborsi . To that end,
each process executes a preliminary communication phase during which it first
sends a message ID (i) on each channeli [x], 1 ≤ x ≤ ci , and then waits until it has
received the identities of the processes at the other end of its ci channels. When
pi has received ID (idk ) on channel channeli [x], it can associate its local address
channeli [x] with the identity idk whose scope is the whole system.
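
A possible rendering of this preliminary naming phase is sketched below in Python; channels are modeled as shared per-direction mailboxes, an encoding chosen only for the example, and all names are ours:

```python
# Preliminary phase: each process sends ID (id_i) on every local channel
# and records, per port, the identity received from the other end.
def exchange_ids(channels, ids):
    # channels: list of (i, j) index pairs joined by a bidirectional channel
    mailbox = {}
    for (i, j) in channels:                    # send ID () on both ends
        mailbox[(i, j)] = ids[i]               # what i put on the channel toward j
        mailbox[(j, i)] = ids[j]
    neighbors = {i: set() for i in ids}
    for (i, j) in channels:                    # receive on every port
        neighbors[i].add(mailbox[(j, i)])
        neighbors[j].add(mailbox[(i, j)])
    return neighbors

ids = {0: "a", 1: "b", 2: "c"}
print(exchange_ids([(0, 1), (1, 2)], ids))     # process 1 learns {'a', 'c'}
```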

Port Name When each channel channeli [x] is defined by a local name, the index
x is sometimes called a port. Hence, a process pi has ci communication ports.

1.2 Parallel Traversal: Broadcast and Convergecast

It is assumed that, while the identity of a process pi is its index i, no process knows
explicitly the value of n (i.e., pn knows that its identity is n, but does not know that
its identity is also the number of processes).

1.2.1 Broadcast and Convergecast

Two frequent problems encountered in distributed computing are broadcast and convergecast. These two problems are defined with respect to a distinguished process pa .
• The broadcast problem is a one-to-many communication problem. It consists in
designing an algorithm that allows the distinguished process pa to disseminate
information to the whole set of processes.
A variant of the broadcast problem is the multicast problem. In that case, the
distinguished process pa has to disseminate information to a subset of the pro-
cesses. This subset can be statically defined or dynamically defined at the time of
the multicast invocation.

when GO (data) is received from pk do
(1) if (first reception of GO (data)) then
(2) for each j ∈ neighborsi \ {k} do send GO (data) to pj end for
(3) end if.

Fig. 1.4 A simple flooding algorithm (code for pi )

• The convergecast problem is a many-to-one communication problem. It consists in designing an algorithm that allows each process pj to send information vj to
a distinguished process pa for it to compute some function f (), which is on a
vector [v1 , . . . , vn ] containing one value per process.
Broadcast and convergecast can be seen as dual communication operations. They
are usually used as a pair: pa broadcasts a query to obtain values, one from each
process, from which it computes the resulting value f (). As a simple example, pa
is a process that queries sensors for temperature values, from which it computes
output values (e.g., maximal, minimal and average temperature values).

1.2.2 A Flooding Algorithm

A simple way to implement a broadcast consists of what is called a flooding algorithm. Such an algorithm is described in Fig. 1.4. To simplify the description,
the distinguished process pa initially sends to itself a message denoted GO (data),
which carries the information it wants to disseminate. Then, when a process pi re-
ceives for the first time a copy of this message, it forwards the message to all its
neighbors (except to the sender of the message).
Each message GO (data) can be identified by a sequence number sna . Moreover,
the flooding algorithm can be easily adapted to work with any number of distin-
guished processes by identifying each message broadcast by a distinguished process
pa with an identity pair (a, sna ).
As the set of processes is assumed to be connected, it is easy to see that the algo-
rithm described in Fig. 1.4 guarantees that the information sent by a distinguished
process is eventually received exactly once by each process.
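
As an illustration, the following self-contained Python sketch simulates this flooding scheme, identifying the broadcast by a pair (initiator, sequence number) as suggested above; the FIFO queue stands for one possible asynchronous schedule, and all names are ours:

```python
from collections import deque

# Flooding simulation: a process forwards GO (data) to all its neighbors
# except the sender on first reception, and discards any further copy.
def flood(neighbors, initiator, data, sn=0):
    ident = (initiator, sn)                        # identity pair of the broadcast
    delivered = set()                              # processes past their first copy
    queue = deque([(initiator, initiator, data)])  # GO (data) sent to itself
    while queue:
        dest, sender, payload = queue.popleft()
        if dest in delivered:
            continue                               # not the first reception: discard
        delivered.add(dest)
        queue.extend((j, dest, payload) for j in neighbors[dest] - {sender})
    return ident, delivered

g = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(flood(g, 1, "data"))                         # ((1, 0), {1, 2, 3, 4})
```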

1.2.3 Broadcast/Convergecast Based on a Rooted Spanning Tree

The previous flooding algorithm may use up to 2e − |neighborsa | messages (where e is the number of channels), and is consequently not very efficient. A simple way to
improve it consists of using an underlying spanning tree rooted at the distinguished
process pa .

Fig. 1.5 A rooted spanning tree

Rooted Spanning Tree A spanning tree rooted at pa is a tree which contains n processes and whose channels (edges) are channels of the communication graph. Each process pi has a single parent, locally denoted parenti , and a (possibly empty) set of children, locally denoted childreni . To simplify the notation, the parent of the root is the root itself, i.e., the distinguished process pa is the only process pi such that parenti = i. Moreover, if j ≠ a, we have j ∈ childreni ⇔ parentj = i, and the channel ⟨i, j ⟩ belongs to the communication graph.
An example of a rooted spanning tree is described in Fig. 1.5. The arrows (ori-
ented toward the root) describe the channels of the communication graph that belong
to the spanning tree. The dotted edges are the channels of the communication graph
that do not belong to the spanning tree. This spanning tree rooted at pa is such
that, when considering the position of process pi where neighborsi = {a, k, j, f },
we have parenti = a, childreni = {j, k} and consequently parentj = parentk = i.
Moreover, childreni ∪ {parenti } ⊆ neighborsi = {a, k, j, f }.
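
As a small illustration, the Python sketch below checks these defining properties on the example just described: the root is its own parent, every other process reaches the root by following parent pointers without a cycle, and each parent is a neighbor of its child. The function name is ours, and pf 's parent, which the text leaves unspecified here, is assumed to be pa .

```python
# Sanity checker, as an illustration, for the rooted-spanning-tree
# properties stated above.
def is_rooted_spanning_tree(parent, root, neighbors):
    if parent[root] != root:                   # the root is its own parent
        return False
    for i in neighbors:
        seen, j = set(), i
        while j != root:                       # climb toward the root
            if j in seen or parent[j] not in neighbors[j]:
                return False                   # cycle, or parent not a neighbor
            seen.add(j)
            j = parent[j]
    return True

g = {"a": {"i", "f"}, "i": {"a", "j", "k", "f"},
     "j": {"i"}, "k": {"i"}, "f": {"a", "i"}}
parent = {"a": "a", "i": "a", "j": "i", "k": "i", "f": "a"}  # f's parent assumed
print(is_rooted_spanning_tree(parent, "a", g))               # True
```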

Algorithms Given such a rooted spanning tree, the algorithms implementing a broadcast by pa and the associated convergecast to pa are described in Fig. 1.6.
As far as the broadcast is concerned, pa first sends the message GO (data) to itself,
and then this message is forwarded along the channels of the spanning tree, and this
restricted flooding stops at the leaves of the tree.
As far as the convergecast is concerned, each leaf pi sends a message BACK (val_seti ) to its parent (line 4) where val_seti = {(i, vi )} (line 2), i.e., val_seti contains a

============= Broadcast =============================

when GO (data) is received from pk do
(1) for each j ∈ childreni \ {k} do send GO (data) to pj end for.

============= Convergecast ===========================

when BACK (val_setj ) is received from each pj such that j ∈ childreni do
(2) val_seti ← (⋃j∈childreni val_setj ) ∪ {(i, vi )};
(3) let k = parenti ;
(4) if (k ≠ i) then send BACK (val_seti ) to pk
(5) else the root pi (= pa ) can compute f (val_seti )
(6) end if.

Fig. 1.6 Tree-based broadcast/convergecast (code for pi )



single pair carrying the value vi sent by pi to the root. A non-leaf process pi waits
for the pairs (k, vk ) from all its children, adds its own pair (i, vi ), and finally sends
the resulting set val_seti to its parent (line 4). When the root has received a set of
pairs from each of its children, it has a pair from each process and can compute the
function f () (line 5).
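
Seen from a global observer, this convergecast is a bottom-up aggregation over the tree. The following Python sketch expresses it as a recursion in which the call gather(j) plays the role of the message BACK (val_setj ) sent by pj to its parent; the tree, the values, and the function f below are example data, not the book's:

```python
# Convergecast over a rooted spanning tree: each process contributes its
# pair (i, v_i); a parent merges its children's sets before reporting up.
def convergecast(children, values, root, f):
    def gather(i):                              # stands for BACK (val_set_i)
        val_set = {(i, values[i])}
        for j in children.get(i, ()):
            val_set |= gather(j)
        return val_set
    return f(gather(root))                      # only the root applies f

children = {1: [2, 3], 3: [4]}                  # an example rooted tree
values = {1: 10, 2: 7, 3: 12, 4: 5}
print(convergecast(children, values, 1, lambda s: max(v for _, v in s)))  # 12
```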

1.2.4 Building a Spanning Tree

This section presents a simple algorithm that (a) implements broadcast and con-
vergecast, and (b) builds a spanning tree. This algorithm is sometimes called prop-
agation of information with feedback. Once a spanning tree has been constructed,
it can be used for future broadcasts and convergecasts involving the same distin-
guished process pa .

Local Variables As before, each process pi is provided with a set neighborsi which defines its position in the communication graph and, at the end of the execu-
tion, its local variables parenti and childreni will define its position in the spanning
tree rooted at pa .
To compute its position in the spanning tree rooted at pa , each process pi uses
an auxiliary integer local variable denoted expected_msgi . This variable contains
the number of messages that pi is waiting for from its children before sending a
message BACK () to its parent.

Algorithm The broadcast/convergecast algorithm building a spanning tree is described in Fig. 1.7. To simplify the presentation, it is first assumed that the channels
are FIFO (first in, first out). The distinguished process pa is the only process which
receives the external message START () (line 1). Upon its reception, pa initializes
parenta , childrena and expected_msga and sends a message GO (data) to each of its
neighbors (line 2).
When a process pi receives a message GO (data) for the first time, it defines
the sender pj as its parent in the spanning tree, and initializes childreni to ∅ and expected_msgi to the number of its neighbors other than pj (line 4). If its parent is its
only neighbor, it sends back the pair (i, vi ) thereby indicating to pj that it is one
of its children (lines 5–6). Otherwise, pi forwards the message GO (data) to all its
neighbors but its parent pj (line 7).
If parenti ≠ ⊥ when pi receives GO (data), it has already determined its parent
in the spanning tree and forwarded the message GO (data). It consequently sends by
return to pj the message BACK (∅), where ∅ is used to indicate to pj that pi is not
one of its children (line 9).
When a process pi receives a message BACK (val_set) from a neighbor pj , it decreases expected_msgi (line 11) and adds pj to childreni if val_set ≠ ∅ (line 12).
Then, if pi has received a message BACK () from all its neighbors (but its parent,
line 13), it sends to its parent (lines 15–16) the set val_set containing its own pair

when START () is received do % only pa receives this message %
(1) parenti ← i; childreni ← ∅; expected_msgi ← |neighborsi |;
(2) for each j ∈ neighborsi do send GO (data) to pj end for.

when GO (data) is received from pj do
(3) if (parenti = ⊥)
(4) then parenti ← j ; childreni ← ∅; expected_msgi ← |neighborsi | − 1;
(5) if (expected_msgi = 0)
(6) then send BACK ({(i, vi )}) to pj
(7) else for each k ∈ neighborsi \ {j } do send GO (data) to pk end for
(8) end if
(9) else send BACK (∅) to pj
(10) end if.

when BACK (val_set) is received from pj do
(11) expected_msgi ← expected_msgi − 1;
(12) if (val_set ≠ ∅) then childreni ← childreni ∪ {j } end if;
(13) if (expected_msgi = 0) then % a set val_setx has been received from each child px %
(14) let val_set = (⋃x∈childreni val_setx ) ∪ {(i, vi )}; let pr = parenti ;
(15) if (pr ≠ i)
(16) then send BACK (val_set) to ppr % local termination for pi %
(17) else pi (= pa ) can compute f (val_set) % global termination %
(18) end if
(19) end if.

Fig. 1.7 Construction of a rooted spanning tree (code for pi )

(i, vi ) plus all the pairs (k, vk ) it has received from its children (line 14). Then, pi
has terminated its participation in the algorithm (its local variable expected_msgi
then becomes useless). If pi is the distinguished process pa , the set val_set contains
a pair (x, vx ) per process px , and pa can accordingly compute f (val_set) (where
f () is the function whose result is the output of the computation).
Let us notice that, when the distinguished process pa discovers that the algo-
rithm has terminated, all the messages sent by the algorithm have been received and
processed.
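
For readers who want to execute the algorithm, here is a compact self-contained Python simulation of Fig. 1.7. A single FIFO queue stands for the channels, which fixes one particular asynchronous schedule among many (so, as discussed below, another schedule may build a different tree), and the values vi of the convergecast are omitted so as to show only the construction of the tree.

```python
from collections import deque

def build_spanning_tree(neighbors, a):           # a: the distinguished process
    parent = {i: None for i in neighbors}
    children = {i: set() for i in neighbors}
    parent[a] = a
    expected = {a: len(neighbors[a])}
    queue = deque(("GO", a, j) for j in neighbors[a])   # (kind, sender, dest)
    while queue:
        kind, sender, dest = queue.popleft()
        if kind == "GO":
            if parent[dest] is None:             # first GO (): join the tree
                parent[dest] = sender
                expected[dest] = len(neighbors[dest]) - 1
                if expected[dest] == 0:          # a leaf answers immediately
                    queue.append(("BACK_CHILD", dest, sender))
                else:
                    queue.extend(("GO", dest, k)
                                 for k in neighbors[dest] - {sender})
            else:                                # already in the tree: BACK (∅)
                queue.append(("BACK_NOT", dest, sender))
        else:                                    # a BACK () message
            if kind == "BACK_CHILD":
                children[dest].add(sender)
            expected[dest] -= 1
            if expected[dest] == 0 and parent[dest] != dest:
                queue.append(("BACK_CHILD", dest, parent[dest]))
    return parent, children

g = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(build_spanning_tree(g, 1))   # one possible tree rooted at process 1
```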

Cost Let us observe that a message BACK() is eventually sent as a response to each
message GO(). Moreover, except on the channels of the spanning tree that is built,
two messages GO() can be sent (one in each direction).
Let e be the number of channels of the underlying communication graph. It fol-
lows that the algorithm gives rise to 2(n − 1) messages which travel on the channels of the tree and 4(e − (n − 1)) messages which travel on the other channels, i.e.,
2(2e − n + 1) messages. Then, once the tree is built, a broadcast/convergecast costs
only 2(n − 1) messages.
Assuming all messages take one time unit and local computations have zero duration, it is easy to see that the time complexity is 2D, where D is the diameter of the communication graph. Once the tree is built, the time complexity of a broadcast/convergecast is 2Da , where Da is the longest distance from pa to any other process.

Fig. 1.8 Left: Underlying communication graph; Right: Spanning tree

Fig. 1.9 An execution of the algorithm constructing a spanning tree

An Example An execution of the algorithm described in Fig. 1.7 for the commu-
nication graph depicted in the left part of Fig. 1.8 is described in Fig. 1.9.
Figure 1.9 is a space-time diagram. The execution of a process pi , 1 ≤ i ≤ 4, is
represented by an axis oriented from left to right. An arrow from one axis to another
represents a message transfer. In this picture, an arrow labeled GO x,y () represents a
message GO () sent by px to py . Similarly, an arrow labeled BACK x,y () represents a
message BACK () sent by px to py .
The process p1 is the distinguished process that receives the external message
START () and consequently will be the root of the tree. It sends a message GO () to its
neighbors p2 and p3 . When p3 receives this message, it defines its parent as being
p1 and forwards message GO() to its two other neighbors p2 and p4 .
Since the first message GO() received by p2 is the one sent by p3 , p2 defines its
parent as being p3 and forwards the message GO() to its other neighbor, namely p1 .
When p1 receives a message GO() from p2 , it sends back a message BACK (∅) to
p2 . In contrast, when p4 receives the message GO() from p3 , it sends by return to
p3 a message BACK () carrying the pair (4, v4 ). Moreover, when p2 has received a
message BACK () from p1 , it sends to its parent p3 a message BACK () carrying the
pair (2, v2 ).
Finally, when p3 receives the messages BACK () from p2 and p4 , it discovers
that these processes are its children and sends a message BACK () carrying the set
{(2, v2 ), (3, v3 ), (4, v4 )} to its parent p1 . When p1 receives this message, it discovers that p3 is its only child. It can then compute f () on the vector [v1 , v2 , v3 , v4 ].
The tree that has been built is represented at the right of Fig. 1.8.

On the Parenthesized Structure of the Execution It is important to notice that the spanning tree that has been built depends on the speed of the messages GO().
Another execution of the same algorithm on the same network with the same distin-
guished process could produce a different tree rooted at p1 .
It is also interesting to observe that each message GO() can be seen as an opening
bracket that can be associated with a message BACK(), which is the corresponding
closing bracket. This appears on the figure as follows: GO x,y () is an opening bracket
whose associated closing bracket is BACK y,x ().

The Case of Non-FIFO Channels Assuming non-FIFO channels and taking into
account Fig. 1.9, let us consider that the message GO 1,2 () arrives at p2 after the mes-
sage BACK 1,2 (). It is easy to see that the algorithm remains correct (i.e., a spanning
tree is built).
The only thing that changes is the meaning associated with line 16. When a
process sends a message BACK() to its parent, it can no longer claim that its local
computation is terminated. A process needs now to have received a message on each
of its incident channels before claiming local termination.

A Spanning Tree per Process The algorithm of Fig. 1.7 can be easily general-
ized to build n trees, each one associated with a distinct process which is its dis-
tinguished process. Then, when any process pi wants to execute an efficient broad-
cast/convergecast, it has to use its associated spanning tree.
To build a spanning tree per process, the local variables parenti , childreni , and
expected_msgi of each process pi have to be replaced by the arrays parenti [1..n],
childreni [1..n] and expected_msgi [1..n] and all messages have to carry the identity
of the corresponding distinguished process. More precisely, when a process pk re-
ceives a message START (), it uses its local variables parentk [k], childrenk [k], and
expected_msgk [k]. The corresponding messages will carry the identity k, GO (k, −)
and BACK (k, −), and, when a process pi receives such messages, it will use its
local variables parenti [k], childreni [k] and expected_msgi [k].

Concurrent Initiators for a Single Spanning Tree The algorithm of Fig. 1.7 can
be easily modified to build a single spanning tree while allowing several processes to
independently start the execution of the algorithm, each receiving initially a message
START (). To that end, each process manages an additional local variable max_idi initialized to 0, which contains the highest identity of a process competing to be the root of the spanning tree.
• If a process pi receives a message START () while max_idi ≠ 0, pi discards this
message (in that case, it already participates in the algorithm but does not com-
pete to be the root). Otherwise, pi starts executing the algorithm and all the cor-
responding messages GO () or BACK () carry its identity.

Fig. 1.10 Two different spanning trees built from the same communication graph

• Then, when a process pi receives a message GO (j, −), pi discards the message if
j < max_idi . Otherwise, pi considers pj as the process with the highest identity
which is competing to be the root. It sets consequently max_idi to j and con-
tinues executing the algorithm by using messages GO () and BACK () carrying the
identity j .
It is easy to see that this simple application of the forward/discard strategy ensures
that a single spanning tree will be constructed, namely the one rooted at pj where j
is such that, at the end of the execution, we have max_id1 = · · · = max_idn = j .
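
The discard rule at the heart of this modification can be isolated as the small Python sketch below; how the rest of the local traversal state is reset when a stronger competitor appears is left implicit by the text, so the reset shown here is an assumption of the example:

```python
# Filter applied on reception of GO (j, -): keep only the traversal tagged
# with the highest competing identity seen so far (max_id).
def accept_go(state, j):
    if j < state["max_id"]:
        return False                       # weaker competitor: discard
    if j > state["max_id"]:                # stronger competitor: adopt it
        state["max_id"] = j
        state["parent"] = None             # assumed reset of the traversal state
    return True                            # process the GO () as in Fig. 1.7

state = {"max_id": 0, "parent": None}
print(accept_go(state, 3), accept_go(state, 2), accept_go(state, 5))
# True False True
```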

1.3 Breadth-First Spanning Tree


Let us remember that the distance between a process pi and a process pj is the
length of the shortest path connecting these two processes, where the length of a
path is measured by the number of channels it is made up of (this distance is also
called the hop distance).
The algorithm of Fig. 1.7 does not necessarily build a breadth-first tree, i.e., a tree where the children of the root are its neighbors, and
more generally the processes at distance d of the root in the tree are the processes at
distance d of the root process in the communication graph. As we have seen in the
example described in Fig. 1.8, the tree that is built depends on the speed of messages
during the execution of the algorithm, and consequently distinct trees can be built
by different executions.
Breadth-first traversal does not imply that the tree that is built is independent
of the execution. It only means that two processes at distance d of the root in the
communication graph are at distance d in the tree. According to the structure of
the graph, two processes at distance d of the root do not necessarily have the same
parent in different executions. A simple example is given in Fig. 1.10 where process
p5 , which is at distance 2 of the root p1 , has different parents in the breadth-first
trees described on the right part of the figure.
This section presents two algorithms that build breadth-first trees. Both are itera-
tive algorithms, but the first one has no centralized control, while the second one is
based on a centralized control governed by the distinguished root process.

1.3.1 Breadth-First Spanning Tree Built Without Centralized Control

Principle of the Algorithm This algorithm, which is due to T.-Y. Cheung (1983),
is based on parallel traversals of the communication graph. These traversals are con-
current and some of them can stop others. In addition to the local variables parenti ,
childreni , and expected_msgi , each process pi manages a local variable, denoted
leveli , which represents its current approximation of its distance to the root. More-
over, each message GO () carries now the current level of the sending process.
Then, when a process pi receives a message GO (d), there are two cases according
to the current state of pi and the value of d.
• The message GO (d) is the first message GO () received by pi . In that case, pi
initializes leveli to d + 1 and forwards the message GO (d + 1) to its neighbors
(except the sender of the message GO (d)).
• The message GO (d) is not the first message GO () received by pi and leveli >
d + 1. In that case, pi (a) updates its variable leveli to d + 1, (b) defines the
sender of the message GO (d) just received as its new parent, and (c) forwards a
message GO (d + 1) to each of its other neighbors pk in order that they recompute
their distances to the root.
As we can see, these simple principles consist of a chaotic distributed iterative
computation. They are used to extend the basic parallel network traversal algorithm
of Fig. 1.7 with a forward/discard strategy that allows processes to converge to their
final position in the breadth-first spanning tree.
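
The core of this chaotic iteration is the relaxation rule "adopt the sender of GO (d) as parent only if d + 1 strictly improves leveli". The following self-contained Python sketch simulates this rule alone; the BACK () messages and the termination detection of Fig. 1.11 are deliberately omitted, and the FIFO scheduler is just one possible asynchrony:

```python
from collections import deque

def bfs_levels(neighbors, root):
    level = {i: float("inf") for i in neighbors}
    parent = {i: None for i in neighbors}
    queue = deque([(root, root, -1)])            # GO (-1) sent by the root to itself
    while queue:
        dest, sender, d = queue.popleft()        # GO (d) received from sender
        if level[dest] > d + 1:                  # first GO (), or strict improvement
            level[dest] = d + 1
            parent[dest] = sender
            queue.extend((k, dest, d + 1) for k in neighbors[dest] - {sender})
    return level, parent

g = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(bfs_levels(g, 1))                          # levels 0, 1, 1, 2 from the root
```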

Description of the Algorithm The algorithm is described in Fig. 1.11. As just indicated, it uses the parallel network traversal described in Fig. 1.7 as a skeleton
on which are grafted appropriate statements to obtain a breadth-first rooted span-
ning tree. These new statements implement the convergence of the local variables
parenti , childreni , and leveli to their final values. More precisely, we have the fol-
lowing.
Initially, a single process pi receives an external message START (). This process,
which will be the root of the tree, sends a message GO (−1) to itself (line 1). When
it receives the message, pi sets parenti = i (hence it is the root) and its distance to
itself is set to leveli = 0.
As previously indicated, there are two cases when a process pi receives a mes-
sage GO (d). Let us remember that d represents the current approximation of the
distance of the sender of the message GO () to the root. If parenti = ⊥, this message
is the first message GO () received by pi (line 2). In that case, pi enters the tree at
level d + 1 (line 3) and propagates the network traversal to its other neighbors by
sending them the message GO (d + 1) in order that they enter the tree or improve
their position in the tree under construction (line 5). If the sender of the message
GO (d) is its only neighbor, pi sends by return the message BACK (yes, d + 1) to
inform it that it is one of its children at level d + 1 (line 6).

when START () is received do % only the distinguished process receives this message %
(1) send GO (−1) to itself.

when GO (d) is received from pj do
(2) if (parenti = ⊥)
(3) then parenti ← j ; childreni ← ∅; leveli ← d + 1;
(4) expected_msgi ← |neighborsi \ {j }|;
(5) if (expected_msgi = 0)
(6) then send BACK (yes, d + 1) to pparenti
(7) else for each k ∈ neighborsi \ {j } do send GO (d + 1) to pk end for
(8) end if
(9) else if (leveli > d + 1)
(10) then parenti ← j ; childreni ← ∅; leveli ← d + 1;
(11) expected_msgi ← |neighborsi \ {j }|;
(12) if (expected_msgi = 0)
(13) then send BACK (yes, leveli ) to pparenti
(14) else for each k ∈ neighborsi \ {j } do send GO (d + 1) to pk end for
(15) end if
(16) else send BACK (no, d + 1) to pj
(17) end if
(18) end if.

when BACK (resp, d) is received from pj do
(19) if (d = leveli + 1)
(20) then if (resp = yes) then childreni ← childreni ∪ {j } end if;
(21) expected_msgi ← expected_msgi − 1;
(22) if (expected_msgi = 0)
(23) then if (parenti ≠ i) then send BACK (yes, leveli ) to pparenti
(24) else pi learns that the breadth-first tree is built
(25) end if
(26) end if
(27) end if.

Fig. 1.11 Construction of a breadth-first spanning tree without centralized control (code for pi )

If the message GO (d) is not the first message GO () received by pi , there are two
cases. Let pj be the sender of the message GO (d).
• If leveli ≤ d + 1, pi cannot improve its position in the tree. It then sends by return
the message BACK (no, d + 1) to inform the sender of the message GO () that it
cannot be its child at distance d + 1 of the tree (line 16). Hence, pi stops the
network traversal associated with the message GO (d) it has received.
• If leveli > d + 1, pi has to improve its position in the tree under construction. To
that end, it propagates the network traversal associated with the message GO (d) it
has received in order to allow its other neighbors to improve their positions in the
tree. Hence, it executes the same statements as those executed when it received
its first message GO (d) (lines 10–15 are exactly the same as lines 3–8).
When a process pi receives a message BACK (resp, d), it considers it only if
leveli = d − 1 (line 19). This is because this message is meaningful only if its sender
pj sent it when its level was levelj = d = leveli + 1. In the other cases, the message

Fig. 1.12 An execution of the algorithm of Fig. 1.11

BACK () is discarded. If the message is meaningful and resp = yes, pi adds pj to


its children (and those are at level leveli + 1, line 20).
Finally, as in a simple parallel network traversal, if it has received a mes-
sage BACK (−, leveli + 1) from each of its other neighbors, pi sends a message
BACK (yes, leveli ) to its current parent. If pi is the root, it learns that the breadth-
first tree is built. Let us observe that, if pi is not the root, it is possible that it later
receives other messages GO () that will improve the current value of leveli .

Termination It follows from line 23 that the root learns the termination of the
algorithm.
On the other hand, the local variable leveli of a process pi , which is not the root,
can be updated each time it receives a message GO (). Unfortunately, the number
of such messages received by pi is not limited to the number of its neighbors but
depends on (a) the number of neighbors of its neighbors, etc. (i.e., the structure of
the communication graph), and (b) the speed of messages (i.e., asynchrony). As its
knowledge of the communication graph is local (it is restricted to its neighbors), a
process cannot define a local predicate indicating that its local variables have con-
verged to their final values. But, as the root can discover when the construction of
the tree has terminated, it can send a message (which will be propagated along the
tree) to inform the other processes that their local variables parenti , childreni , and
leveli have their final values.

A Simple Example Let us consider the communication graph described on the


left of Fig. 1.12. As the graph is a clique, a single breadth-first tree can be obtained,
namely the star graph centered at the distinguished process (here p1 ). This star is
depicted on the right of the figure.
A worst case execution of the algorithm of Fig. 1.11 is depicted in the middle of
Fig. 1.12 (worst case with respect to the number of messages). Only the message
GO () from a process pi to a process pj with i < j is represented in the space-time
diagram. The worst case occurs when the message GO () sent by p1 to p3 arrives
after the message GO () sent by p2 , and both the message GO () sent by p1 to p4 and
the message GO () sent by p2 to p4 arrive after the message GO () sent by p3 to p4 .
It follows that, with this pattern of message arrivals, p2 discovers it is at distance
1 when it receives its first message GO (), p3 discovers that it is at distance 1 when
it receives the message GO () sent by p1 , and, similarly, p4 discovers it when it
receives the message GO () sent by p1 .

Cost There are two types of messages, and each message carries an integer whose
value is bounded by the diameter D of the communication graph. Moreover, a mes-
sage BACK() carries an additional Boolean value. It follows that the size of each
message is upper bounded by 2 + log2 D bits. It is easy to see that the time com-
plexity is O(D), i.e., O(n) in the worst case.
As far as the message complexity is concerned, the worst case is a fully connected
communication graph (i.e., any pair of processes is connected by a channel) and a
process at distance d of the root updates leveli d times (as in Fig. 1.12). This means
that among the (n − 1) processes which are not the root, one updates its level once,
another one updates it twice, etc., and one updates it (n − 1) times. Moreover, each
time a process updates its level, it forwards the message GO () to (n − 2)
of its neighbors (all processes but itself and the sender of the GO () that entailed the
update of its own level). The root sends (n − 1) messages GO (). Hence the total
number of messages GO () is


(n − 1) + (n − 2) Σ1≤i≤n−1 i = (n − 1)(n² − 2n + 2)/2.

As at most one message BACK () is associated with each message GO (), it follows
that, in a fully connected network, the message complexity is upper bounded by
O(n3 ).
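
To make these exchanges concrete, the following minimal Python sketch simulates the algorithm of Fig. 1.11 on a given graph. The simulation is a free illustration, not part of the original algorithm: the names Process and build_bfs_tree and the dictionary encoding of the graph are assumptions, and asynchrony is mimicked by delivering, at each step, a randomly chosen message from the pool of messages in transit.

import random

class Process:
    def __init__(self, i, neighbors):
        self.i, self.neighbors = i, set(neighbors)
        self.parent, self.children, self.level = None, set(), None
        self.expected = 0                         # plays the role of expected_msgi

def build_bfs_tree(graph, root, seed=0):
    random.seed(seed)
    procs = {i: Process(i, nbrs) for i, nbrs in graph.items()}
    pool = [(root, root, ('GO', -1))]             # (sender, destination, message)
    while pool:
        j, i, (kind, d) = pool.pop(random.randrange(len(pool)))  # asynchrony
        p = procs[i]
        if kind == 'GO':
            if p.parent is None or p.level > d + 1:   # enter the tree, or improve
                p.parent, p.children, p.level = j, set(), d + 1
                p.expected = len(p.neighbors - {j})
                if p.expected == 0:
                    pool.append((i, p.parent, ('BACK_YES', p.level)))
                else:
                    pool.extend((i, k, ('GO', d + 1)) for k in p.neighbors - {j})
            else:                                 # cannot improve: negative answer
                pool.append((i, j, ('BACK_NO', d + 1)))
        elif d == p.level + 1:                    # a BACK() is meaningful only then
            if kind == 'BACK_YES':
                p.children.add(j)
            p.expected -= 1
            if p.expected == 0 and p.parent != i: # the root would learn termination
                pool.append((i, p.parent, ('BACK_YES', p.level)))
    return {i: (p.parent, p.level) for i, p in procs.items()}

print(build_bfs_tree({1: [2, 3, 4], 2: [1, 3, 4], 3: [1, 2, 4], 4: [1, 2, 3]}, 1))

On the four-process clique used above, every random schedule converges to the star centered at the root, as in the example of Fig. 1.12.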

1.3.2 Breadth-First Spanning Tree Built with Centralized Control

This section presents a second distributed algorithm that builds a breadth-first span-
ning tree. Differently from the previous one, this algorithm—which is due to Y. Zhu
and T.-Y. Cheung (1987)—is based on a centralized control that allows each process
to locally learn when its participation in the algorithm has terminated. Moreover,
both its time and message complexities are O(n2 ). This shows an interesting trade-
off with the previous algorithm whose time complexity is O(n) while its message
complexity is O(n3 ).

Underlying Principle This principle consists in a distributed iteration whose


progress is handled by the distinguished process pa . Let us call wave each new
iteration launched by pa . The first wave (first iteration) attains only the neighbors
of the root pa , which discover they are at distance 1 and enter the tree. The second
wave directs the processes at distance 2 of the root to enter the tree. More generally,
the wave number d directs the processes at distance d of the root to enter the tree.
After the wave number d has attained the processes at distance d, it returns to the
root in order to launch the next wave (see Fig. 1.13).
It follows that, when pa launches the wave number d + 1, the processes at dis-
tance less than d + 1 know their position in the tree, and the processes at distance
d + 1 will discover they are at distance d + 1 by receiving a message from a process

Fig. 1.13 Successive waves launched by the root process pa

at distance d. Hence, the messages implementing the wave number d + 1 can use
only the channels of the breadth-first spanning tree of depth d that has been built by
the previous waves. This reduces consequently the number of messages needed to
implement a wave.
From an implementation point of view, for any d, the wave number d going from
the root up to processes at distance d is implemented with messages GO (), while its
return to the root is implemented with messages BACK (), as depicted in Fig. 1.13.

Algorithm: Local Variables In addition to the constant set neighborsi and its
local variables parenti (initialized to ⊥) and childreni , each process pi manages the
following local variables in order to implement the underlying principle previously
described.
• distancei is a write-once local variable that keeps the distance of pi to the root.
• to_sendi is a set that, once pi has been inserted into the spanning tree, contains
its neighbors to which it has to propagate the waves it receives from the root. If
pi is at distance d, these wave propagations will concern waves whose number is
greater than d.
• waiting_fromi is a set used by pi to manage the return of the current wave to its
parent in the tree. (Its role is similar to that of the local variable expected_msgi
used in the previous algorithms.)

Algorithm: Launching by pa The algorithm assumes that the system is made up


of at least two processes.
The initial code executed by the distinguished process is described in Fig. 1.14.
This process defines itself as the root (as in previous algorithms, it will be the only
process pi such that parenti = i), initializes its other local variables and sends a
message GO (0) to each of its neighbors. More generally, the value d in a message
GO (d) sent by a process pi means that d is the distance of pi to the root of the tree.
The behavior of a process pi when it receives a message GO () or BACK () is
described in Fig. 1.15 (the root process receives only messages BACK ()).

when START () is received do % only the distinguished process receives this message %
parenti ← i; childreni ← ∅; distancei ← 0; to_sendi ← neighborsi ;
for each k ∈ to_sendi do send GO (0) to pk end for;
waiting_fromi ← neighborsi .

Fig. 1.14 Construction of a breadth-first spanning tree with centralized control (starting code)

when GO (d) is received from pj do


(1) if (parenti = ⊥)
(2) then parenti ← j ; childreni ← ∅; distancei ← d + 1; to_sendi ← neighborsi \ {j };
(3) if (to_sendi = ∅) then send BACK (stop) to pj
(4) else send BACK (continue) to pj end if
(5) else if (parenti = j )
(6) then for each k ∈ to_sendi do send GO (distancei ) to pk end for;
(7) waiting_fromi ← to_sendi
(8) else send BACK (no) to pj
(9) end if
(10) end if.

when BACK (resp) is received from pj do


(11) waiting_fromi ← waiting_fromi \ {j };
(12) if resp ∈ {continue, stop} then childreni ← childreni ∪ {j } end if;
(13) if resp ∈ {stop, no} then to_sendi ← to_sendi \ {j } end if;
(14) if (to_sendi = ∅) % we have then waiting_fromi = ∅ %
(15) then if (parenti = i) then the root learns that the tree is built
(16) else send BACK (stop) to pparenti
(17) end if
(18) else if (waiting_fromi = ∅)
(19) then if (parenti = i)
(20) then for each k ∈ to_sendi do send GO (distancei ) to pk end for;
(21) waiting_fromi ← to_sendi
(22) else send BACK (continue) to pparenti
(23) end if
(24) end if
(25) end if.

Fig. 1.15 Construction of a breadth-first spanning tree with centralized control (code for a pro-
cess pi )

Reception of a Message GO () When a process pi receives for the first time a


message GO (d), it discovers that it is at distance d + 1 of the root (line 1). Let pj be
the sender of this message GO (d). The receiving process pi consequently initializes
its local variables parenti to j and distancei to d + 1 (line 2). It also initializes its
set childreni to ∅ and to_sendi to its set of neighbors except pj .
Then, pi returns the current wave to its parent pj by returning to it a message
BACK () (line 3). If to_sendi is empty, pj is the only neighbor of pi , and conse-
quently pi returns the message BACK (stop) to indicate to pj that (a) it is one of
its children and (b) no more processes can be added to the tree as far as it is concerned.
Hence, if there are new waves, its parent pj will not have to send it messages GO (d)

in order to expand the tree. If to_sendi is not empty, pi is one of the children of pj
and the tree can possibly be expanded from pi . In this case, pi sends the message
BACK (continue) to its parent pj to inform it that (a) it is one of its children and (b)
possibly, during the next wave, new processes can be added to the tree with pi as
parent. These two cases are expressed at line 3.
If pi already has a parent (parenti ≠ ⊥), i.e., it is already in the tree when it
receives a message GO (d), its behavior then depends on the sender pj of the mes-
sage GO (d) (line 5). If pj is its parent, pi forwards the wave by sending the message
GO (d + 1) to its other neighbors in the set to_sendi (line 6) and resets accordingly
waiting_fromi to to_sendi (line 7). If pj is not its parent (line 8), pi sends back the
message BACK (no) to the process pj to inform it that (a) it is not one of its children
and consequently (b) pj no longer has to forward waves to it.
Reception of a Message BACK () When a process pi receives a message BACK () it
has already determined its position in the breadth-first spanning tree. This message
BACK () sent by a neighbor pj is associated with a message GO () sent by pi to pj .
It carries a value resp ∈ {stop, continue, no}.
Hence, when pi receives BACK (resp) from pj , it first removes pj from the set of
processes from which it is waiting for messages (line 11). Then, if resp ∈ {stop, continue},
pj is one of its children (line 12), and if resp ∈ {stop, no}, pi discovers that it
no longer has to send messages to pj (line 13). The behavior of pi depends then on the
set to_sendi .
• If to_sendi is empty (line 14), pi knows that its participation in the algorithm
is terminated. (Let us notice that, due to lines 7, 11 and 13, we have then
waiting_fromi = ∅.) If pi is the root, it also knows that the algorithm has ter-
minated (line 15). If it is not the root, it sends the message BACK (stop) to its
parent to inform it that (a) it has locally terminated, and (b) the tree can no longer
be extended from it (line 16).
• If to_sendi is not empty, it is possible that the tree can be expanded from pi . In
this case, if waiting_fromi = ∅ (the current wave has returned to pi ), there are two
cases. If pi is not the root, it returns the wave to its parent by sending it the
message BACK (continue) (line 22). If pi is the root, it starts a new wave by
sending the message GO (0) to its neighbors from which the tree can possibly be
expanded (line 20) and resets appropriately its local set waiting_fromi (line 21).

On Distributed Synchronization As we can see, this algorithm exhibits two


types of synchronization. The first, which is global, is realized by the root process
that controls the sequence of waves. This appears at lines 20–21. The second is local
at each process. It occurs at line 3, line 16, or line 22 when a process pi sends back
a message BACK () to its parent.
Local Versus Global Termination Differently from the algorithm described in
Fig. 1.11 in which no process—except the root—knows when its participation in
the algorithm has terminated, the previous algorithm allows each process to know
that it has terminated.
For a process which is not the root, this occurs when it sends a message
BACK (stop) to its parent at line 3 (if the parent of pi is its only neighbor) or

at line 16 (if pi has several neighbors). Of course, the fact that a process has lo-
cally terminated does not mean that the algorithm has terminated. Only the root can
learn it. This occurs at line 15 when the root pi receives a message BACK (stop)
entailing the last update of to_sendi which becomes empty.

Cost Let us first consider the number of messages. As in previous algorithms,


each message BACK () is associated with a message GO (). Hence, we have only to
determine the number of messages GO ().
Let e be the number of channels of the communication graph. At most two mes-
sages GO () are exchanged on each channel that will not belong to the tree. It follows
that at most 2(e − (n − 1)) messages GO () are exchanged on these channels. Since the
tree will involve (n − 1) channels and, in the worst case, there are at most n waves,
it follows that at most O(n2 ) messages GO () travel on the channels of the tree. As e
is upper bounded by O(n2 ), it follows that the total number of messages GO () and
BACK () is upper bounded by O(n2 ).
As far as the time complexity is concerned we have the following. The first wave
takes two time units (messages GO () from the root to its neighbors followed by
messages BACK () from these processes to the root). The second wave takes 4 time
units (two sequential sendings of messages GO () from the root to the processes
at distance 2 from it and two sequential sendings of messages BACK () from these
processes to the root), etc. As there are at most D waves (where D is the diameter
of the communication graph), the time complexity is upper bounded by 2(1 + 2 +
· · · + D), i.e., O(D 2 ).
It follows that both the message and the time complexities are bounded by O(n2 ).
As already noticed, this shows an interesting tradeoff when comparing this algo-
rithm with the algorithm without centralized synchronization described in Fig. 1.11
whose message and time complexities are O(n3 ) and O(n), respectively. The added
synchronization allows for a reduction of the number of messages at the price of an
increase in the time complexity.
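
The wave mechanism of Figs. 1.14–1.15 can be exercised with a similar simulation sketch, given below. The Python encoding (class P, build_with_waves(), random scheduling of the message pool) is again a free illustration; observe that a process launches or returns a wave only when its set waiting_fromi has emptied.

import random

class P:
    def __init__(self, i, nbrs):
        self.i, self.nbrs = i, set(nbrs)
        self.parent, self.children, self.distance = None, set(), None
        self.to_send, self.waiting = set(), set()

def build_with_waves(graph, root, seed=0):
    random.seed(seed)
    ps = {i: P(i, n) for i, n in graph.items()}
    r = ps[root]
    r.parent, r.distance = root, 0
    r.to_send, r.waiting = set(r.nbrs), set(r.nbrs)
    pool = [(root, k, ('GO', 0)) for k in r.to_send]
    done = False
    while pool:
        j, i, (kind, val) = pool.pop(random.randrange(len(pool)))
        p = ps[i]
        if kind == 'GO':
            if p.parent is None:                  # first GO(): enter the tree
                p.parent, p.distance = j, val + 1
                p.to_send = p.nbrs - {j}
                pool.append((i, j, ('BACK', 'stop' if not p.to_send else 'continue')))
            elif p.parent == j:                   # wave forwarded by the parent
                pool.extend((i, k, ('GO', p.distance)) for k in p.to_send)
                p.waiting = set(p.to_send)
            else:
                pool.append((i, j, ('BACK', 'no')))
        else:
            p.waiting.discard(j)
            if val in ('continue', 'stop'): p.children.add(j)
            if val in ('stop', 'no'): p.to_send.discard(j)
            if not p.to_send:                     # local (or global) termination
                if p.parent == i: done = True
                else: pool.append((i, p.parent, ('BACK', 'stop')))
            elif not p.waiting:
                if p.parent == i:                 # the root launches a new wave
                    pool.extend((i, k, ('GO', 0)) for k in p.to_send)
                    p.waiting = set(p.to_send)
                else:
                    pool.append((i, p.parent, ('BACK', 'continue')))
    assert done
    return {i: (q.parent, q.distance) for i, q in ps.items()}

print(build_with_waves({1: [2, 3, 4], 2: [1, 3, 4], 3: [1, 2, 4], 4: [1, 2, 3]}, 1))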

1.4 Depth-First Traversal


Starting from a distinguished process pa , a distributed depth-first network traversal
visits one process at a time. This section presents first a simple distributed depth-first
traversal algorithm which benefits from the traversal to construct a rooted tree. Two
improvements of this algorithm are then presented. Finally, one of them is enriched
to obtain a distributed algorithm that builds a logical ring on top of an arbitrary
connected communication graph.

1.4.1 A Simple Algorithm

Description of the Algorithm The basic depth-first distributed algorithm is de-


scribed in Fig. 1.16. The algorithm starts when a distinguished process received the

when START () is received do % only pa receives this message %


(1) parenti ← i; childreni ← ∅; visitedi ← ∅;
(2) let k ∈ neighborsi ; send GO () to pk .

when GO () is received from pj do


(3) if (parenti = ⊥)
(4) then parenti ← j ; childreni ← ∅; visitedi ← {j };
(5) if (visitedi = neighborsi )
(6) then send BACK (yes) to pj
(7) else let k ∈ neighborsi \ visitedi ; send GO () to pk
(8) end if
(9) else send BACK (no) to pj
(10) end if.

when BACK (resp) is received from pj do


(11) if (resp = yes) then childreni ← childreni ∪ {j } end if;
(12) visitedi ← visitedi ∪ {j };
(13) if (visitedi = neighborsi )
(14) then if (parenti = i)
(15) then the traversal is terminated % global termination %
(16) else send BACK (yes) to pparenti % local termination %
(17) end if
(18) else let k ∈ neighborsi \ visitedi ; send GO () to pk
(19) end if.

Fig. 1.16 Depth-first traversal of a communication graph (code for pi )

external message START (). The distinguished process sends first a message GO () to
one of its neighbors (line 2).
Then, when a process pi receives a message GO (), it defines the message sender
as its parent in the depth-first tree (lines 3–4). The local variable visitedi is a set
containing the identities of its neighbors which have been already visited by the
depth-first traversal (implemented by the progress of the message GO ()). If pj is its
only neighbor, pi sends back to pj the message BACK (yes) to inform it that (a) it
is one of its children and (b) it has to continue the depth-first traversal (lines 5–6).
Otherwise (visitedi ≠ neighborsi ), pi propagates the depth-first traversal to one of
its neighbors that, from its point of view, has not yet been visited (line 7). Finally, if
pi has already been visited by the depth-first traversal (parenti ≠ ⊥), it sends back
to pj a message BACK (no) to inform it that it is not one of its children (line 9).
When a process pi receives a message BACK (resp), it first adds its sender pj to
its set of children if resp = yes (line 11). Moreover, it also adds pj to the set of
its neighbors which have been visited by the depth-first traversal (line 12). Then,
its behavior is similar to that of lines 5–8. If, from its point of view, not all of its
neighbors have been visited, it sends a message GO () to one of them (line 18). If
all of its neighbors have been visited (line 13), it claims the termination if it is the
root (line 14). If it is not the root, it sends to its parent the message BACK (yes) to
inform the parent that it is one of its children and it has to forward the depth-first
traversal.

On the Tree That Is Built It is easy to see that, given a predefined root process,
the depth-first spanning tree that is built does not depend on the speed of messages
but depends on the way each process pi selects its neighbor pk to which it propa-
gates the depth-first traversal (line 7 and line 18).

Cost As in previous algorithms, a message BACK () is associated with each mes-


sage GO (). Moreover, at most one message GO () is sent in each direction on every
channel. It follows that the message complexity is O(e), where e is the number of
channels of the communication graph. Hence, it is upper bounded by O(n2 ).
There are two types of messages, and message BACK () carries a binary value.
Hence, the size of a message is one or two bits.
Finally, at most one message GO () or BACK () is traveling on a channel at any
given time. It follows that the time complexity is the same as the message complex-
ity, i.e., O(e).
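
Since at most one message GO () or BACK () is in transit at any given time, the traversal of Fig. 1.16 can be simulated with a single message variable, as in the sketch below. The deterministic min() choice materializes the arbitrary statement "let k ∈ neighborsi \ visitedi" (it is also the choice used in the example of Sect. 1.4.2); the function name and graph encoding are illustrative assumptions.

def dfs_traversal(graph, start):
    parent = {i: None for i in graph}
    children = {i: set() for i in graph}
    visited = {i: set() for i in graph}
    parent[start] = start
    msg = ('GO', start, min(graph[start]))        # (kind, sender, destination)
    while True:
        kind, j, i = msg
        if kind == 'GO':
            if parent[i] is None:                 # first visit of pi
                parent[i], visited[i] = j, {j}
                if visited[i] == set(graph[i]):
                    msg = ('BACK_YES', i, j)
                else:
                    msg = ('GO', i, min(set(graph[i]) - visited[i]))
            else:                                 # pi was already visited
                msg = ('BACK_NO', i, j)
        else:
            if kind == 'BACK_YES':
                children[i].add(j)
            visited[i].add(j)
            if visited[i] == set(graph[i]):
                if parent[i] == i:
                    return parent, children       # global termination at the root
                msg = ('BACK_YES', i, parent[i])  # local termination
            else:
                msg = ('GO', i, min(set(graph[i]) - visited[i]))

print(dfs_traversal({1: [2, 3], 2: [1, 3], 3: [1, 2]}, start=1))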

An Easy Improvement of the Basic Algorithm A simple way to improve the


time complexity of the previous distributed depth-first traversal algorithm consists
in adding a local information exchange phase when the depth-first traversal visits a
process for the first time. This additional communication phase, which is local (it
involves only the process receiving a message GO () and its neighbors), allows for
an O(n) time complexity (let us remember that n is the number of processes).
This additional exchange phase is realized as follows. When a process pi receives
a message GO () for the first (and, as we will see, the only) time, the following state-
ments are executed before it forwards the depth-first traversal to one of its neighbors
(line 7):
• pi sends an additional control message VISITED () to each of its neighbors pj ,
and waits until it has received an answer message KNOWN () from each of them.
• When a process pj receives a message VISITED () from one of its neighbors pi ,
it adds i to its local set visitedj and sends by return the message KNOWN () to pi .
It is easy to see that when a message GO () has been received by a process pi , all
its neighbors are informed that it has been visited by the depth-first traversal, and
consequently none of them will forward the depth-first traversal to it. Moreover,
thanks to this modification, no message BACK () carries the value no (line 9 disap-
pears). Consequently, as all the messages BACK () carry the value yes, this value
can be left implicit and line 11 can be shortened to “childreni ← childreni ∪ {j }”.
It follows that the modified algorithm sends (n − 1) messages GO (), the same
number of messages BACK (), 2e − (n − 1) messages VISITED (), and the same num-
ber of messages KNOWN (). Hence, the message complexity is O(e).
As far as the time complexity is concerned, we have the following. There are
(n − 1) messages GO (), the same number of messages BACK (), and no two of these
messages are traveling concurrently. When it receives a message GO (), each process
sends a message VISITED () to its neighbors (except its parent), and these messages
travel concurrently. The same occurs for the answer messages KNOWN (). Let n1 be
the number of processes that have a single neighbor in the communication graph.
The time complexity is consequently 2(n − 1) + 2n − 2n1 , i.e., O(n).
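
The following sketch grafts this improvement onto the previous simulation. Being sequential, the simulation collapses the VISITED ()/KNOWN () round trip into direct updates of the sets visitedj ; it also assumes that the distinguished process performs the same exchange when it receives START (), so that no process ever sends it a message GO () and the value no indeed disappears from the messages BACK ().

def dfs_improved(graph, start):
    parent = {i: None for i in graph}
    children = {i: set() for i in graph}
    visited = {i: set() for i in graph}
    def notify(i):            # VISITED() to all neighbors, answered by KNOWN()
        for j in graph[i]:
            visited[j].add(i)
    parent[start] = start
    notify(start)
    msg = ('GO', start, min(graph[start]))
    while True:
        kind, j, i = msg
        if kind == 'GO':      # necessarily the first (and only) GO() received by pi
            parent[i] = j
            visited[i].add(j)
            notify(i)
        else:                 # every BACK() now implicitly carries the value yes
            children[i].add(j)
            visited[i].add(j)
        rest = set(graph[i]) - visited[i]
        if rest:
            msg = ('GO', i, min(rest))
        elif parent[i] == i:
            return parent, children
        else:
            msg = ('BACK', i, parent[i])

print(dfs_improved({1: [2, 3], 2: [1, 3], 3: [1, 2]}, start=1))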

when START () is received do % only pa receives this message %


(1) parenti ← i;
(2) let k ∈ neighborsi ;
(3) send GO ({i}) to pk ; childreni ← {k}.

when GO (visited) is received from pj do


(4) parenti ← j ;
(5) if (neighborsi ⊆ visited)
(6) then send BACK (visited ∪ {i}) to pj ; childreni ← ∅;
(7) else let k ∈ neighborsi \ visited;
(8) send GO (visited ∪ {i}) to pk ; childreni ← {k}
(9) end if.

when BACK (visited) is received from pj do


(10) if (neighborsi ⊆ visited)
(11) then if (parenti = i)
(12) then the traversal is terminated % global termination %
(13) else send BACK (visited) to pparenti % local termination %
(14) end if
(15) else let k ∈ neighborsi \ visited;
(16) send GO (visited) to pk ; childreni ← childreni ∪ {k}
(17) end if.

Fig. 1.17 Time and message optimal depth-first traversal (code for pi )

1.4.2 Application: Construction of a Logical Ring

The Notion of a Local Algorithm Both the previous algorithm (Fig. 1.16) and
its improvement are local in the sense that (a) each process has initially to know
only its identity, that of its neighbors, and the fact that no two processes have the same
identity, and (b) the size of the information exchanged between any two neighbors
is bounded.

Another Improvement of the Basic Depth-First Traversal Algorithm This part


presents a depth-first traversal algorithm that is not local but whose message com-
plexity and time complexity are O(n). The idea is to replace the local synchroniza-
tion used by the improved version of the previous algorithm (implemented with the
messages VISITED () and KNOWN ()) by global control information, namely, the
set (denoted visited) of the processes which have been already visited by the net-
work traversal (i.e., by messages GO ()). To that end, each message GO () and each
message BACK () carry the current value of the set visited.
The corresponding depth-first traversal algorithm is described in Fig. 1.17. When
the distinguished process pa receives the external message START (), it defines itself
as the root (parenti = i, line 1), and launches the depth-first traversal by sending a mes-
sage GO ({i}) to one of its neighbors that it includes in its set of children (lines 2–3).
Then, when a process pi receives a message GO (visited) from a neighbor process
pj , it defines pj as its parent (line 4). If all of its neighbors have been visited,
pi sends back the message BACK (visited ∪ {i}) to its parent (line 6). Otherwise, it

propagates the depth-first traversal to one of its neighbors pk that has not yet been
visited and initializes childreni to {k} (lines 7–8).
Finally, when a process pi receives a message BACK (visited) such that all its
neighbors have been visited (line 10), it claims termination if it is the root (line 12).
If it is not the root, it forwards the message BACK (visited) to its parent (line 13). If
some of its neighbors have not yet been visited, pi selects one of them, propagates
the network traversal by sending to it the message GO (visited) and adds it to its set
of children (lines 15–16).
It is easy to see that this algorithm builds a depth-first spanning tree, and requires
(n − 1) messages GO () and (n − 1) messages BACK (). As no two messages are
concurrent, the time complexity is 2(n − 1). As already indicated, this algorithm
is not local: the set visited carried by each message grows until it contains all the
process identities. Hence, the size of a message includes one bit for the message
type and up to n log2 n bits for its content.
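
A Python transcription sketch of Fig. 1.17 follows; as before, min() stands for the arbitrary choice "let k ∈ neighborsi \ visited", and the function name and graph encoding are free illustrations.

def dfs_visited(graph, start):
    parent = {start: start}
    children = {i: set() for i in graph}
    k = min(graph[start])
    children[start].add(k)
    msg = ('GO', start, k, {start})            # (kind, sender, dest, visited)
    while True:
        kind, j, i, visited = msg
        nbrs = set(graph[i])
        if kind == 'GO':
            parent[i] = j
            if nbrs <= visited:                # all neighbors already visited
                msg = ('BACK', i, j, visited | {i})
            else:
                k = min(nbrs - visited)
                children[i] = {k}
                msg = ('GO', i, k, visited | {i})
        elif nbrs <= visited:
            if parent[i] == i:
                return parent, children        # the root learns global termination
            msg = ('BACK', i, parent[i], visited)
        else:
            k = min(nbrs - visited)
            children[i].add(k)
            msg = ('GO', i, k, visited)

print(dfs_visited({1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}, start=1))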

A Token Traveling Along a Logical Ring A token is a message that navigates a


network. A simple way to exploit a token consists in using a logical unidirectional
ring. The token progresses then from a process to the next process on the ring.
Assuming that no process keeps the token forever, it follows that no process that
wants to use the token will miss it.
The problem that then has to be solved consists in building a logical unidirec-
tional ring on top of a connected arbitrary network. Such a construction is presented
below.
In order that the ring can be exploited by the token, the algorithm building the
ring computes the value of two local variables at each process pi :
• The local variable succi denotes the identity of the successor of pi on the unidi-
rectional ring.
• The local variable routingi [j ], where pj is a neighbor of pi , contains the identity
of its neighbor to which pi will have to forward the token when it receives it
from pj .
The set of local variables routingi [j ], 1 ≤ i ≤ n, defines the appropriate rout-
ing (at the level of the underlying communication graph) which ensures the cor-
rect move of the token on the logical ring from any process to its logical successor.
Once these two local variables have been computed at each process pi , the navi-
gation of the token is easily ensured by the statements described in Fig. 1.18. When
it leaves a process, the token carries the identity dest of its destination, i.e., the next
process on the unidirectional ring. When it receives the token and dest = i, a pro-
cess pi uses it. Before releasing it, the process sets the token field dest to succi , the
next process on the ring (i.e., the next process allowed to use the token). Finally,
(whether it was its previous destination or not) pi sends the token to its neighbor pk
such that k = routingi [j ] (where pj is the process from which it received the token
in the communication graph). Hence, when considering the underlying communi-
cation graph, the token progresses from pi to pk where k = routingi [j ], then to pℓ
where ℓ = routingk [i], then to pm where m = routingℓ [k], etc., where the sequence

when TOKEN (dest) is received from pj do
if (dest = i) then use the token; dest ← succi end if;
let k = routingi [j ]; send TOKEN (dest) to pk .

Fig. 1.18 Management of the token at process pi

of successive neighbor processes pj , pi , pk , pℓ , . . . , pj constitutes the circuit im-


plementing the logical unidirectional ring.

A Distributed Algorithm Building a Logical Ring The distributed depth-first


traversal algorithm described in Fig. 1.17 constitutes the skeleton on which are
grafted statements that build the logical ring, i.e., the statements that give their
values to the local variables succi and routingi [j ] of each process pi . The result-
ing algorithm, which is due to J.-M. Hélary and M. Raynal (1988), is described in
Fig. 1.19.
As we can see, this algorithm is nearly the same as that of Fig. 1.17. Its under-
lying principle is the following: as each process receives exactly one message GO (), a process is
added to the logical ring when it receives this message. Moreover, in addition to
the set of processes visited, each message is required to carry the identity (denoted
last) of the last process that received a message GO ().
In order to establish the sense of direction of the ring and compute the distributed
routing tables, the algorithm uses the total order in which processes are added to the
ring, i.e., the order in which the processes receive a message GO (). More precisely,
the sense of direction of the ring is then defined as opposite to the order in which the
processes are added to the ring during its construction. This is depicted in Fig. 1.20.

when START () is received do % only pa receives this message %


(1) parenti ← i;
(2) let k ∈ neighborsi ;
(3) send GO ({i}, i) to pk ; firsti ← k.

when GO (visited, last) is received from pj do


(4) parenti ← j ; succi ← last;
(5) if (neighborsi ⊆ visited)
(6) then send BACK (visited ∪ {i}, i) to pj ; routingi [j ] ← j
(7) else let k ∈ neighborsi \ visited;
(8) send GO (visited ∪ {i}, i) to pk ; routingi [k] ← j
(9) end if.

when BACK (visited, last) is received from pj do


(10) if (neighborsi ⊆ visited)
(11) then if (parenti = i)
(12) then succi ← last; routingi [firsti ] ← j % the ring is built %
(13) else send BACK (visited, last) to pparenti ; routingi [parenti ] ← j
(14) end if
(15) else let k ∈ neighborsi \ visited;
(16) send GO (visited, last) to pk ; routingi [k] ← j
(17) end if.

Fig. 1.19 From a depth-first traversal to a ring (code for pi )



Fig. 1.20 Sense of direction of the ring and computation of routing tables

From an operational point of view, we have the following. When the distin-
guished process receives the message START () it defines itself as the starting pro-
cess (parenti = i), selects one of its neighbors pk , and sends to pk the message
GO (visited, i) where visited = {i} (lines 1–3). Moreover, the starting process records
the identity k in order to be able to close the ring when it discovers that the depth-
first traversal has terminated (line 12).
When a process pi receives a message GO (visited, last) (let us remember that
it receives exactly one message GO ()), it defines (a) its parent with respect to the
depth-first traversal as the sender pj of the message GO (), and (b) its successor
on the ring as the last process (before it) that received a message GO (), i.e., the
process plast (line 4). Then, if all its neighbors have been visited by the depth-first
traversal, it sends back to its parent pj the message BACK (visited ∪ {i}, i) and defines
the appropriate routing for the token, namely it sets routingi [j ] = j (lines 5–6). If
there are neighbors of pi that have not yet been visited, pi selects one of them (say
pk ) and propagates the depth-first traversal by sending the message GO (visited ∪
{i}, i) to pk . As before, it also defines the appropriate routing for the token, namely
it sets routingi [k] = j (lines 7–8).
When a process pi receives a message BACK (visited, last), it does the same as
previously if some of its neighbors have not yet been visited (lines 15–16 are similar
to lines 7–8). If all its neighbors have been visited and pi is the starting process,
it closes the ring by assigning to routingi [firsti ] the identity j of the process that
sent the message BACK (−, −) (lines 11–12). If pi is not the starting process, it
forwards the message BACK (visited, last) to its parent which will make the depth-
first traversal progress. It also assigns the identity j to routingi [parenti ] for the token
to be correctly routed along the appropriate channel of the communication graph in
order to attain its destination process on the logical ring.

Cost This ring construction requires always 2(n − 1) messages: (n − 1) mes-


sages GO () and (n − 1) messages BACK (). Moreover the ring has n virtual channels
vc1 , . . . , vcn and the length ℓx of vcx , 1 ≤ x ≤ n, is such that 1 ≤ ℓx ≤ (n − 1) and
Σ1≤x≤n ℓx = 2(n − 1). Hence, both the cost of building the ring and ensuring a full
turn of the token on it are 2(n − 1) messages sent one after the other.

Remarks The two logical neighbors on the ring of each process pi depend on the
way a process selects a non-visited neighbor when it has to propagate the depth-first
traversal.
Allowing the messages GO () and BACK () to carry more information (on the struc-
ture of the network that has been visited—or has not been visited—by the depth-first

Fig. 1.21 An example of a logical ring construction

traversal) allows the length of the ring at the communication graph level to be re-
duced to x, where x ∈ [n . . . 2(n − 1)]. This number x depends on the structure of
the communication graph and the way neighbors are selected when a process prop-
agates the network traversal.

An Example Let us consider the communication graph depicted in Fig. 1.21 (with-
out the dotted arrows). The dotted arrows represent the logical ring constructed by
the execution described below.
In this example, when a process has to send a message GO () to one of its neigh-
bors, it selects the neighbor with the smallest identity in the set neighborsi \ visited.
1. The distinguished process sends the message GO ({1}, 1) to its neighbor p2 and
saves the identity 2 into firsti to be able to close the ring at the end of the network
traversal.
2. Then, when it receives this message, p2 defines succ2 = 1, forwards the
depth-first traversal by sending the message GO ({1, 2}, 2) to p3 and defines
routing2 [3] = 1.
3. When p3 receives this message, it defines succ3 = 2, forwards the depth-
first traversal by sending the message GO ({1, 2, 3}, 3) to p4 and defines
routing3 [4] = 2.
4. When p4 receives this message, it defines succ4 = 3, and returns the depth-
first traversal by sending the message BACK ({1, 2, 3, 4}, 4) to its parent p3 .
Moreover, it defines routing4 [3] = 3.
5. When p3 receives this message, as neighbors3 ⊆ visited and it is not the
starting process, it forwards BACK ({1, 2, 3, 4}, 4) to its parent p2 and defines
routing3 [2] = 4.
6. When p2 receives this message, as neighbors2 is not included in visited, it selects
its not yet visited neighbor p5 and sends it the message GO ({1, 2, 3, 4}, 4). It also
defines routing2 [5] = 3.
7. When p5 receives this message, it defines p4 as its successor on the logical ring
(succ5 = 4), sends back to its parent p2 the message BACK ({1, 2, 3, 4, 5}, 5) and
defines routing5 [2] = 2.
8. When p2 receives this message, p2 forwards it to its parent p1 and defines
routing2 [1] = 5.

Table 1.1 The paths implementing the virtual channels of the logical ring

Virtual channel of the ring      Implemented by the path
2 → 1                            2 → 1
1 → 5                            1 → 2 → 5
5 → 4                            5 → 2 → 3 → 4
4 → 3                            4 → 3
3 → 2                            3 → 2

9. Finally, when p1 receives the message BACK ({1, 2, 3, 4, 5}, 5), all its neigh-
bors have been visited. Hence, the depth-first traversal is terminated. Conse-
quently, p1 closes the ring by assigning to routing1 [first1 ] the identity of the sender
of this message, i.e., it defines routing1 [2] = 2.
The routing tables at each process constitute a distributed implementation of the
paths followed by the token to circulate on the ring from each process to its suc-
cessor. These paths are summarized in Table 1.1 which describes the physical paths
implementing the n virtual channels of the logical ring.
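
The ring construction of Fig. 1.19 and the token management of Fig. 1.18 can be combined in the simulation sketch below, which reproduces the min() selection rule of the example. One point is a modeling choice that does not appear in the figures: to inject the token initially, the root behaves as if the token had just arrived from the process recorded in firsti , which places it correctly on the physical circuit.

def build_ring(graph, start):
    parent = {start: start}
    routing = {i: {} for i in graph}
    succ = {}
    first = min(graph[start])
    msg = ('GO', start, first, {start}, start)  # (kind, sender, dest, visited, last)
    while True:
        kind, j, i, visited, last = msg
        nbrs = set(graph[i])
        if kind == 'GO':
            parent[i], succ[i] = j, last
            if nbrs <= visited:
                routing[i][j] = j                        # line 6 of Fig. 1.19
                msg = ('BACK', i, j, visited | {i}, i)
            else:
                k = min(nbrs - visited)
                routing[i][k] = j                        # line 8
                msg = ('GO', i, k, visited | {i}, i)
        elif nbrs <= visited:
            if parent[i] == i:
                succ[i], routing[i][first] = last, j     # line 12: close the ring
                return succ, routing, first
            routing[i][parent[i]] = j                    # line 13
            msg = ('BACK', i, parent[i], visited, last)
        else:
            k = min(nbrs - visited)
            routing[i][k] = j                            # line 16
            msg = ('GO', i, k, visited, last)

def token_turn(succ, routing, root, first):
    users, j, i, dest = [root], first, root, succ[root]  # the root uses it first
    while True:
        k = routing[i][j]                                # rule of Fig. 1.18
        j, i = i, k
        if dest == i:
            users.append(i)
            if i == root:
                return users                             # back at the root
            dest = succ[i]

g = {1: [2, 3], 2: [1, 3, 4], 3: [1, 2], 4: [2]}
succ, routing, first = build_ring(g, start=1)
print(token_turn(succ, routing, 1, first))               # [1, 4, 3, 2, 1]

On this (hypothetical) four-process graph, the ring 1 → 4 → 3 → 2 → 1 is indeed the opposite of the order 1, 2, 3, 4 in which the processes received their message GO (), and a full turn of the token uses 2(n − 1) = 6 messages TOKEN ().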

1.5 Summary
After defining the notion of a distributed algorithm, this chapter has presented sev-
eral network traversal algorithms, namely, parallel, breadth-first, and depth-first
traversal algorithms. It has also presented algorithms that construct spanning trees
or rings on top of a communication graph. In addition to being interesting in their
own right, these algorithms show that distributed traversal techniques are different
from their sequential counterparts.

1.6 Bibliographic Notes


• Network traversal algorithms are presented in several books devoted to distributed
computing, e.g., [24, 150, 219, 242, 335]. Several distributed traversal algorithms
are described by A. Segall in [341].
• A distributed version of Ford and Fulkerson’s algorithm for the maximum flow
problem is described in [91]. This adaptation is based on a distributed depth-first
search algorithm.
• The notion of a wave-based distributed iteration has been extensively studied in
several books, e.g., [308, 365]. The book by Raynal and Hélary [319] is entirely
devoted to wave-based and phase-based distributed algorithms.
• The improved depth-first traversal algorithm with O(n) time complexity is due to
B. Awerbuch [25]. The technique of controlling the progress of a network traver-
sal with a set visited carried by messages is due to J.-M. Hélary, A. Maddi and

M. Raynal [170]. Other depth-first traversal algorithms have been proposed in the
literature, e.g., [95, 224].
• The breadth-first traversal algorithm without centralized control is due to T.-Y.
Cheung [91]. The one with centralized control is due to Y. Zhu and T.-Y. Cheung
[395].
• The algorithm building a logical ring on top of an arbitrary network is due to J.-
M. Hélary and M. Raynal [177]. It is shown in this paper how the basic algorithm
described in Fig. 1.19 can be improved to obtain an implementation of the ring
requiring a number of messages ≤ 2n to implement a full turn of the token on the
ring.
• The distributed construction of minimum weight spanning trees has been ad-
dressed in many papers, e.g., [143, 209, 231].
• Graph algorithms and algorithmic graph theory can be found in many textbooks
(e.g., [122, 158]). The book by M. van Steen [359] constitutes an introduction
to graph and complex networks for engineers and computer scientists. Recent
advances on graph theory are presented in the collective book [164].

1.7 Exercises and Problems

1. Let us assume that a message GO () is allowed to carry the position of its sender
in the communication graph. How can we improve a distributed graph traversal
algorithm so that it can benefit from this information?
Solution in [319].
2. Write the full text of the depth-first traversal algorithm corresponding to the im-
provement presented in the part titled “An easy improvement of the basic al-
gorithm” of Sect. 1.4.1. Depict then a run of it on the communication graph
described in the left part of Fig. 1.8.
3. Let us consider the case of a directed communication graph where the mean-
ing of “directed” is as follows. A channel from pi to pj allows (a) pi to send
only messages GO () to pj and (b) pj to send only messages BACK () to pi . Two
processes pi and pj are then such that either there is no communication chan-
nel connecting them, or there is one directed communication channel connecting
one to the other, or there are two directed communication channels (one in each
direction).
Design a distributed algorithm building a breadth-first spanning tree with a
distinguished root pa and compare it with the algorithm described in Fig. 1.11.
It is assumed that there is a directed path from the distinguished root pa to any
other process.
Solution in [91].
4. Consider a communication graph in which the processes have no identity and
each process pi knows its position in the communication network with the help
of a local array channeli [1..ci ] (where ci is the number of neighbors of pi ). An
example is given in Fig. 1.22. As we can see in this figure, the channel connecting

Fig. 1.22 An anonymous network

pi and pj is locally known by pi as channeli [2] and locally known by pj as


channelj [1].
Using a distributed depth-first traversal algorithm where a distinguished pro-
cess receives a message START (), design an algorithm that associates with each
process pi an identity idi and an identity table neighbor_namei [1..ci ] such that:
• ∀ i : idi ∈ {1, 2, . . . , n}.
• ∀ i, j : i ≠ j ⇒ idi ≠ idj .
• If pi and pj are neighbors and the channel connecting them is denoted
channeli [k] at pi and channelj [ℓ] at pj , we have neighbor_namei [k] = idj
and neighbor_namej [ℓ] = idi .
5. Enrich the content of the messages GO () and BACK () of the algorithm described
in Fig. 1.19 in order to obtain a distributed algorithm that builds a logical ring
and uses as few messages as possible. This means that the number of messages
x will be such that n ≤ x ≤ 2(n − 1). Moreover, the length of the logical ring
(counted as the number of channels of the underlying communication graph) has
to be equal to x.
Solution in [177].
6. Design a distributed algorithm that builds a logical ring such that the length of
the path in the communication graph of any two neighbors in the logical ring is
at most 3.
Let us recall that, if G is a connected graph, G3 is Hamiltonian. Hence, the
construction is possible. (Given a graph G, Gk is a graph with the same vertices
as G and any two vertices of Gk are neighbors if their distance in G is at most k.)
Chapter 2
Distributed Graph Algorithms

This chapter addresses three basic graph problems encountered in the context of
distributed systems. These problems are (a) the computation of the shortest paths
between a pair of processes where a positive length (or weight) is attached to each
communication channel, (b) the coloring of the vertices (processes) of a graph in
Δ + 1 colors (where Δ is the maximal number of neighbors of a process, i.e., the
maximal degree of a vertex when using the graph terminology), and (c) the detection
of knots and cycles in a graph. As for the previous chapter devoted to graph traversal
algorithms, an aim of this chapter is not only to present specific distributed graph
algorithms, but also to show that their design is not always obtained from a simple
extension of their sequential counterparts.

Keywords Distributed graph algorithm · Cycle detection · Graph coloring ·


Knot detection · Maximal independent set · Problem reduction ·
Shortest path computation

2.1 Distributed Shortest Path Algorithms


This section presents distributed algorithms which allow each process to compute
its shortest paths to every other process in the system. These algorithms can be seen
as “adaptations” of centralized algorithmic principles to the distributed context.
The notations are the same as in the previous chapter. Each process pi has a
set of neighbors denoted neighborsi ; if it exists, the channel connecting pi and pj
is denoted ⟨i, j⟩. The communication channels are bidirectional (hence ⟨i, j⟩ and
⟨j, i⟩ denote the same channel). Moreover, the communication graph is connected
and each channel ⟨i, j⟩ has a positive length (or weight) denoted gi [j ] (as ⟨i, j⟩
and ⟨j, i⟩ are the same channel, we have gi [j ] = gj [i]).

2.1.1 A Distributed Adaptation


of Bellman–Ford’s Shortest Path Algorithm

Bellman–Ford’s sequential algorithm computes the shortest paths from one prede-
termined vertex of a graph to every other vertex. It is an iterative algorithm based


Fig. 2.1 Bellman–Ford’s


dynamic programming principle

on the dynamic programming principle. This principle and its adaptation to a dis-
tributed context are presented below.

Initial Knowledge and Local Variables Initially each process knows that there
are n processes and the set of process identities is {1, . . . , n}. It also knows its posi-
tion in the communication graph (which is captured by the set neighborsi ). Interest-
ingly, it will never learn more about the structure of this graph. From a local state point
of view, each process pi manages the following variables.
• As just indicated, gi [j ], for j ∈ neighborsi , denotes the length associated with
the channel ⟨i, j⟩.
• lengthi [1..n] is an array such that lengthi [k] will contain the length of the shortest
path from pi to pk . Initially, lengthi [i] = 0 (and keeps that value forever) while
lengthi [j ] = +∞ for j = i.
• routing_toi [1..n] is an array that is not used to compute the shortest paths from
pi to each other process. It constitutes the local result of the computation. More
precisely, when the algorithm terminates, for any k, 1 ≤ k ≤ n, routing_toi [k] = j
means that pj is a neighbor of pi on a shortest path to pk , i.e., pj is an optimal
neighbor when pi has to send information to pk (where optimality is with respect
to the length of the path from pi to pk ).

Bellman–Ford Principle The dynamic programming principle on which the al-


gorithm relies is the following. The local inputs at each process pi are the values
of the set neighborsi and the array gi [neighborsi ]. The output at each process pi
is the array lengthi [1..n]. The algorithm has to solve the following set of equations
(where the unknown variables are the arrays lengthi [1..n]):

∀ i, k ∈ {1, . . . , n} : lengthi [k] = minj∈neighborsi (gi [j ] + lengthj [k]).

The meaning of this formula is depicted in Fig. 2.1 for a process pi such that
neighborsi = {j1 , j2 , j3 }. Each dotted line from pjx to pk , 1 ≤ x ≤ 3, represents the
shortest path joining pjx to pk and its length is lengthjx [k]. The solution of this set
of equations is computed asynchronously and iteratively by the n processes, each
process pi computing successive approximate values of its local array lengthi [1..n]
until it stabilizes at its final value.

The Algorithm The algorithm is described in Fig. 2.2. At least one process pi
has to receive the external message START () in order to launch the algorithm. It

when START () is received do


(1) for each j ∈ neighborsi do send UPDATE (lengthi ) to pj end for.

when UPDATE (length) is received from pj do


(2) updatedi ← false;
(3) for each k ∈ {1, . . . , n} \ {i} do
(4) if (lengthi [k] > gi [j ] + length[k])
(5) then lengthi [k] ← gi [j ] + length[k];
(6) routing_toi [k] ← j ;
(7) updatedi ← true
(8) end if
(9) end for;
(10) if (updatedi )
(11) then for each j ∈ neighborsi do send UPDATE (lengthi ) to pj end for
(12) end if.

Fig. 2.2 A distributed adaptation of Bellman–Ford’s shortest path algorithm (code for pi )

sends then to each of its neighbors the message UPDATE (lengthi ) which describes
its current local state as far as the computation of the length of its shortest paths to
each other process is concerned.
When a process pi receives a message UPDATE (length) from one of its neighbors
pj , it applies the forward/discard strategy introduced in Chap. 1. To that end, pi first
strives to improve its current approximation of its shortest paths to any destination
process (lines 3–9). Then, if pi has discovered shorter paths than the ones it knew
before, pi sends its new current local state to each of its neighbors (lines 10–12). If
its local state (captured by the array lengthi [1..n]) has not been modified, pi does
not send a message to its neighbors.

Termination While there is a finite time after which the arrays lengthi [1..n] and
routing_toi [1..n], 1 ≤ i ≤ n, have obtained their final values, no process ever learns
when this time has occurred.
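
The following sketch simulates the algorithm of Fig. 2.2, delivering the messages UPDATE () of a pool in random order to mimic asynchrony. INF, the function name, and the weighted-graph encoding g[i][j] = gi [j ] are illustrative assumptions.

import random

INF = float('inf')

def bellman_ford(g, n, starters=(1,), seed=0):
    random.seed(seed)
    length = {i: {k: (0 if k == i else INF) for k in range(1, n + 1)} for i in g}
    routing_to = {i: {} for i in g}
    pool = [(i, j, dict(length[i])) for i in starters for j in g[i]]
    while pool:
        j, i, lj = pool.pop(random.randrange(len(pool)))   # (sender, dest, vector)
        updated = False
        for k in range(1, n + 1):
            if k != i and length[i][k] > g[i][j] + lj[k]:
                length[i][k] = g[i][j] + lj[k]             # a shorter path via pj
                routing_to[i][k] = j
                updated = True
        if updated:                                        # forward/discard strategy
            pool.extend((i, j2, dict(length[i])) for j2 in g[i])
    return length, routing_to

g = {1: {2: 1, 3: 4}, 2: {1: 1, 3: 2, 4: 7}, 3: {1: 4, 2: 2, 4: 1}, 4: {2: 7, 3: 1}}
print(bellman_ford(g, 4)[0][1])            # from p1: {1: 0, 2: 1, 3: 3, 4: 4}

When the pool empties, all the arrays have reached their final values; this global termination is visible to the external observer of the simulation, but, as explained above, not to the processes themselves.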

Adding Synchronization in Order that Each Process Learns Termination The


algorithm described in Fig. 2.3 allows each process pi not only to compute the
shortest paths but also to learn that it knows them (i.e., learn that its local arrays
lengthi [1..n] and routing_toi [1..n] have converged to their final values).
This algorithm is synchronous: the processes execute a sequence of synchronous
rounds, and rounds are given for free: they belong to the computation model. During
each round r, in addition to local computation, each process sends a message to and
receives a message from each of its neighbors. The important synchrony property
lies in the fact that a message sent by a process pi to a neighbor pj at round r
is received and processed by pj during the very same round r. The progress of
a round r to the round r + 1 is governed by the underlying system. (A general
technique to simulate a synchronous algorithm on top of an asynchronous system
will be described in Chap. 9.)

when r = 1, 2, . . . , D do
begin synchronous round
(1) for each j ∈ neighborsi do send UPDATE (lengthi ) to pj end for;
(2) for each j ∈ neighborsi do receive UPDATE (lengthj ) from pj end for;
(3) for each k ∈ {1, . . . , n} \ {i} do
(4) let length_ik = minj∈neighborsi (gi [j ] + lengthj [k]);
(5) if (length_ik < lengthi [k]) then
(6) lengthi [k] ← length_ik;
(7) routing_toi [k] ← a neighbor j that realizes the previous minimum
(8) end if
(9) end for
end synchronous round.

Fig. 2.3 A distributed synchronous shortest path algorithm (code for pi )

The algorithm considers that the diameter D of the communication graph is


known by the processes (let us remember that the diameter is the number of chan-
nels separating the two most distant processes). If D is not explicitly known, it can
be replaced by an upper bound, namely the value (n − 1).
The text of the algorithm is self-explanatory. There is a strong connection be-
tween the current round number and the number of channels composing the paths
from which pi learns information. Let us consider a process pi at the end of a
round r.
• When r < D, pi knows the shortest among the paths from itself to any other
process pk which are composed of at most r channels. Hence, this length to pk is
not necessarily the shortest one.
• Differently, when it terminates round r = D, pi has computed both (a) the short-
est lengths from it to all the other processes and (b) the corresponding appropriate
routing neighbors.
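
A round-based sketch of Fig. 2.3 is given below. Each iteration of the outer loop is one synchronous round: the vectors "sent" during a round are snapshots of the values held at the beginning of that round. The number of rounds is a parameter of the sketch; as indicated above, the diameter D can always be replaced by the upper bound n − 1.

INF = float('inf')

def synchronous_shortest_paths(g, n, rounds):
    length = {i: {k: (0 if k == i else INF) for k in range(1, n + 1)} for i in g}
    routing_to = {i: {} for i in g}
    for _ in range(rounds):
        sent = {i: dict(length[i]) for i in g}   # the round's UPDATE() messages
        for i in g:
            for k in range(1, n + 1):
                if k == i:
                    continue
                j = min(g[i], key=lambda x: g[i][x] + sent[x][k])
                d = g[i][j] + sent[j][k]
                if d < length[i][k]:
                    length[i][k], routing_to[i][k] = d, j
    return length, routing_to

g = {1: {2: 1, 3: 4}, 2: {1: 1, 3: 2, 4: 7}, 3: {1: 4, 2: 2, 4: 1}, 4: {2: 7, 3: 1}}
print(synchronous_shortest_paths(g, 4, rounds=3)[0][1])   # {1: 0, 2: 1, 3: 3, 4: 4}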

2.1.2 A Distributed Adaptation


of Floyd–Warshall’s Shortest Paths Algorithm

Floyd–Warshall’s algorithm is a sequential algorithm that computes the shortest


paths between any two vertices of a non-directed graph. This section presents an
adaptation of this algorithm to the distributed context. This adaptation is due to
S. Toueg (1980). As previously, to make the presentation easier, we consider that
the graph (communication network) is connected and the length of each edge of
the graph (communication channel) is a positive number. (Actually, the algorithm
works when edges have negative lengths as long as no cycle has a negative length.)

Floyd–Warshall’s Sequential Algorithm Let LENGTH[1..n, 1..n] be a matrix


such that, when the algorithm terminates, LENGTH[i, j ] represents the length of the
shortest path from pi to pj . Initially, for any i, LENGTH[i, i] = 0, for any pair (i, j )

(1) for pv from 1 to n do


(2) for i from 1 to n do
(3) for j from 1 to n do
(4) if LENGTH[i, pv] + LENGTH[pv, j ] < LENGTH[i, j ]
(5) then LENGTH[i, j ] ← LENGTH[i, pv] + LENGTH[pv, j ];
(6) routing_toi [j ] ← routing_toi [pv]
(7) end if
(8) end for
(9) end for
(10) end for.

Fig. 2.4 Floyd–Warshall’s sequential shortest path algorithm

Fig. 2.5 The principle that underlies Floyd–Warshall’s shortest paths algorithm

such that j ∈ neighborsi , LENGTH[i, j ] is initialized to the length of the channel


from pi to pj , and LENGTH[i, j ] = +∞ in the other cases. Moreover, for any i, the
array routing_toi [1..n] is such that routing_toi [i] = i, routing_toi [j ] = j for each
j ∈ neighborsi and routing_toi [j ] is initially undefined for the other values of j .
Floyd–Warshall's algorithm is an iterative algorithm based on
the following principle. It first computes, for any pair of processes pi and pj , the shortest
path from pi to pj that (if any) passes through process p1 .
Then, it computes the shortest path from any process pi to any process pj among
all the paths from pi to pj which pass only through processes in the set {p1 , p2 }.
More generally, at the step pv of the iteration, the algorithm computes the shortest
path from any process pi to any process pj among all the paths from pi to pj which
are allowed to pass through the set of processes {p1 , . . . , ppv }. The text of the algo-
rithm is given in Fig. 2.4. As we can see, the algorithm is made up of three nested
for loops. The external one defines the processes (namely p1 , . . . , ppv ) allowed to
appear in the current computation of the shortest paths from any process pi to any process
pj . The process index pv is usually called the pivot.
The important feature of this sequential algorithm lies in the fact that, when com-
puting the shortest path from pi to pj involving the communication channels con-
necting the processes in the set {p1 , . . . , ppv }, the variable LENGTH[i, pv] contains
the length of the shortest path from pi to ppv involving only the communication
channels connecting the processes in the set {p1 , . . . , ppv−1 } (and similarly for the
variable LENGTH[pv, j ]). This is described in Fig. 2.5 when considering the com-
putation of the shortest path from pi to pj involving the processes {p1 , . . . , ppv } (this
constitutes the pvth iteration step of the external loop).
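
A direct Python transcription sketch of Fig. 2.4 follows (1-based indices; float('inf') stands for +∞; the initialization is the one described above). It can serve as a reference point when reading the distributed version below.

INF = float('inf')

def floyd_warshall(g, n):
    LENGTH = [[INF] * (n + 1) for _ in range(n + 1)]
    routing_to = [[None] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        LENGTH[i][i], routing_to[i][i] = 0, i
        for j, w in g[i].items():                  # neighbors and channel lengths
            LENGTH[i][j], routing_to[i][j] = w, j
    for pv in range(1, n + 1):                     # the pivot (external loop)
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                if LENGTH[i][pv] + LENGTH[pv][j] < LENGTH[i][j]:
                    LENGTH[i][j] = LENGTH[i][pv] + LENGTH[pv][j]
                    routing_to[i][j] = routing_to[i][pv]   # line 6 of Fig. 2.4
    return LENGTH, routing_to

g = {1: {2: 1, 3: 4}, 2: {1: 1, 3: 2, 4: 7}, 3: {1: 4, 2: 2, 4: 1}, 4: {2: 7, 3: 1}}
print(floyd_warshall(g, 4)[0][1][1:])              # from p1: [0, 1, 3, 4]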

From a Sequential to a Distributed Algorithm As a distributed system has no


central memory and communication is by message passing between neighbor pro-
cesses, two issues have to be resolved to obtain a distributed algorithm. The first
concerns the distribution of the data structures; the second the synchronization of
processes so that they can correctly compute the shortest paths and the associated
routing tables.
The array LENGTH[1..n, 1..n] is split into n vectors such that each pro-
cess pi computes and maintains the value of LENGTH[i, 1..n] in its local array
lengthi [1..n]. Moreover, as seen before, each process pi computes the value of its
routing local array routing_toi [1..n]. On the synchronization side, there are two is-
sues:
• When pi computes the value of lengthi [j ] during the iteration step pv, process pi
locally knows the current values of lengthi [j ] and lengthi [pv], but it has to obtain
the current value of lengthpv [j ] (see line 4 of Fig. 2.4).
• To obtain from ppv a correct value for lengthpv [j ], the processes must execute
simultaneously the same iteration step pv. If a process pi is executing an iteration
step with the pivot value pv while another process pk is simultaneously executing
an iteration step with the pivot value pv′ ≠ pv, the values they obtain, respectively,
from ppv for lengthpv [j ] and from ppv′ for lengthpv′ [j ] can be mutually inconsistent
if these computations are done without appropriate synchronization.

The Distributed Algorithm The algorithm is described in Fig. 2.6. The processes
execute concurrently a loop where the index pv takes the successive values from 1
to n (line 1). If a process receives a message while it has not yet started executing its
local algorithm, it starts its local algorithm before processing the message.
As the communication graph is connected, it follows that, as soon as at least one
process pi starts its local algorithm, all the processes start theirs.
As indicated just previously, when the processes execute the iteration step pv, the
process ppv has to broadcast its local array lengthpv [1..n] so that each process pi can
try to improve its shortest distance to any process pj , as indicated in Fig. 2.5.
To this end, let us observe that if, at the pvth iteration of the loop, there is a path
from pi to ppv involving only processes in the set {p1 , . . . , ppv−1 }, there is then a
favorite neighbor to attain ppv , namely the process whose index has been computed
and saved in routing_toi [pv]. This means that, at the pvth iteration, the set of local
variables routing_tox [pv] of the processes px such that lengthx [pv] = +∞ define a
tree rooted at ppv .
The algorithm executed by the processes, which ensures a correct process coor-
dination, follows from this observation. More precisely, a local algorithm is made
up of three parts:
• Part 1: lines 1–6. A process pi first sends a message to each of its neighbors pk
indicating if pi is or not one of pk ’s children in the tree rooted at ppv . It then
waits until it has received such a message from each of its neighbors.
Then, pi executes the rest of the code for the pvth iteration only if it has a
chance to improve its shortest paths with the help of ppv , i.e., if lengthi [pv] ≠ +∞.

(1) for pv from 1 to n do
(2) for each k ∈ neighborsi do
(3) if (routing_toi [pv] = k) then child ← yes else child ← no end if;
(4) send CHILD (pv, child) to pk
(5) end for;
(6) wait (a message CHILD (pv, −) received from each neighbor);
(7) if (lengthi [pv] ≠ +∞) then
(8) if (pv ≠ i) then
(9) wait (message PV _ LENGTH (pv, pv_length[1..n]) from prouting_toi [pv] )
(10) end if;
(11) for each k ∈ neighborsi do
(12) if (CHILD (pv, yes) received from pk ) then
(13) if (pv = i) then send PV _ LENGTH (pv, lengthi [1..n]) to pk
(14) else send PV _ LENGTH (pv, pv_length[1..n]) to pk
(15) end if
(16) end if
(17) end for;
(18) for j from 1 to n do
(19) if lengthi [pv] + pv_length[j ] < lengthi [j ]
(20) then lengthi [j ] ← lengthi [pv] + pv_length[j ];
(21) routing_toi [j ] ← routing_toi [pv]
(22) end if
(23) end for
(24) end if
(25) end for.

Fig. 2.6 Distributed Floyd–Warshall’s shortest path algorithm

• Part 2: lines 8–17. This part of the algorithm ensures that each process pi such
that lengthi [pv] ≠ +∞ receives a copy of the array lengthpv [1..n] so that it can
recompute the values of its shortest paths and the associated local routing table
(which is done in Part 3).
The broadcast of lengthpv [1..n] by ppv is launched at line 13, where this pro-
cess sends the message PV _ LENGTH (pv, lengthpv ) to all its children in the tree
of which it is the root. When it receives such a message carrying the value pv and
the array pv_length[1..n] (line 9), a process pi forwards it to its children in the
tree rooted at ppv (lines 12 and 14).
• Part 3: lines 18–23. Finally, a process pi uses the array pv_length[1..n] it has
received in order to improve its shortest paths that pass through the processes
p1 , . . . , ppv .

Cost Let e be the number of communication channels. It is easy to see that, dur-
ing each iteration, (a) at most two messages CHILD () are sent on each channel (one
in each direction) and (b) at most (n − 1) messages PV _ LENGTH () are sent. It fol-
lows that the number of messages is upper-bounded by n(2e + n); i.e., the message
complexity is O(n3 ). As far as the size of messages is concerned, a message CHILD ()
carries a bit, while PV _ LENGTH () carries n values whose size depends on the indi-
vidual lengths associated with the communication channels.

(1) for i from 1 to n do
(2) c ← 1;
(3) while (COLOR[i] = ⊥) do
(4) if (∧ j ∈neighborsi (COLOR[j ] ≠ c)) then COLOR[i] ← c else c ← c + 1 end if
(5) end while
(6) end for.

Fig. 2.7 Sequential (Δ + 1)-coloring of the vertices of a graph

Finally, there are n iteration steps, and each has O(n) time complexity. Moreover,
in the worst case, the processes start the algorithm one after the other (a single
process starts, which entails the start of another process, etc.). When summing up,
it follows that the time complexity is upper-bounded by O(n2 ).

2.2 Vertex Coloring and Maximal Independent Set

2.2.1 On Sequential Vertex Coloring

Vertex Coloring An important graph problem, which is encountered when one
has to model application-level problems, concerns vertex coloring. It consists in
assigning a value (color) to each vertex such that (a) no two vertices which are
neighbors have the same color, and (b) the number of colors is “reasonably small”.
When the number of colors has to be the smallest possible one, the problem is NP-
complete.
Let Δ be the maximal degree of a graph (let us remember that, assuming a graph
where any two vertices are connected by at most one edge, the degree of a vertex is
the number of its neighbors). It is always possible to color the vertices of a graph in
Δ + 1 colors. This follows from the following simple reasoning by induction. The
assertion is trivially true for any graph with at most Δ vertices. Then, assuming it is
true for any graph made up of n ≥ Δ vertices and whose maximal degree is at most
Δ, let us add a new vertex to the graph. As (by assumption) the maximal degree of
the graph is Δ, it follows that this new vertex has at most Δ neighbors. Hence, this
vertex can be colored with one of the remaining colors.

A Simple Sequential Algorithm A simple sequential algorithm that colors ver-
tices in at most (Δ + 1) colors is described in Fig. 2.7. The array variable
COLOR[1..n], which is initialized to [⊥, . . . , ⊥], is such that, when the algorithm
terminates, for any i, COLOR[i] will contain the color assigned to process pi .
The colors are represented by the integers 1 to (Δ + 1). The algorithm considers
sequentially each vertex i (process pi ) and assigns to it the first color not assigned
to its neighbors. (This algorithm is sensitive to the order in which the vertices and
the colors are considered.)
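The following Python transcription of Fig. 2.7 may help (a sketch: neighbors[i] is an assumed adjacency structure, and None plays the role of ⊥):

    def greedy_coloring(neighbors):
        n = len(neighbors)
        color = [None] * n              # no vertex is colored yet
        for i in range(n):              # consider each vertex in turn
            c = 1
            while any(color[j] == c for j in neighbors[i]):
                c += 1                  # c is used by a neighbor: try the next one
            color[i] = c                # first color free at i's neighbors
        return color                    # at most Δ + 1 colors are ever needed

Since a vertex has at most Δ neighbors, the while loop always stops at some c ≤ Δ + 1, which is exactly the inductive argument given above.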

(1) for each j ∈ neighborsi do send INIT (colori [i]) to pj end for;
(2) for each j ∈ neighborsi
(3) do wait (INIT (col_j ) received from pj ); colori [j ] ← col_j
(4) end for;
(5) for ri from (Δ + 2) to m do
begin asynchronous round
(6) if (colori [i] = ri )
(7) then c ← smallest color in {1, . . . , Δ + 1} such that ∀j ∈ neighborsi : colori [j ] ≠ c;
(8) colori [i] ← c
(9) end if;
(10) for each j ∈ neighborsi do send COLOR (ri , colori [i]) to pj end for;
(11) for each j ∈ neighborsi do
(12) wait (COLOR (r, col_j ) with r = ri received from pj );
(13) colori [j ] ← col_j
(14) end for
end asynchronous round
(15) end for.

Fig. 2.8 Distributed (Δ + 1)-coloring from an initial m-coloring where n ≥ m ≥ Δ + 2

2.2.2 Distributed (Δ + 1)-Coloring of Processes

This section presents a distributed algorithm which colors the processes in at most
(Δ + 1) colors in such a way that no two neighbors have the same color. Distributed
coloring is encountered in practical problems such as resource allocation or pro-
cessor scheduling. More generally, distributed coloring algorithms are symmetry
breaking algorithms in the sense that they partition the set of processes into subsets
(a subset per color) such that no two processes in the same subset are neighbors.

Initial Context of the Distributed Algorithm Such a distributed algorithm is
described in Fig. 2.8. This algorithm assumes that the processes are already colored
in m ≥ Δ + 1 colors in such a way that no two neighbors have the same color. Let
us observe that, from a computability point of view, this is a “no-cost” assumption
(because taking m = n and defining the color of a process pi as its index i trivially
satisfies this initial coloring assumption). Differently, taking m = Δ + 1 assumes
that the problem is already solved. Hence, the assumption on the value of m is a
complexity-related assumption.

Local Variables Each process pi manages a local variable colori [i] which ini-
tially contains its initial color, and will contain its final color at the end of the algo-
rithm. A process pi also manages a local variable colori [j ] for each of its neigh-
bors pj . As the algorithm is asynchronous and round-based, the local variable ri
managed by pi denotes its current local round number.

Behavior of a Process pi The processes proceed in consecutive asynchronous
rounds and, at each round, each process synchronizes its progress with its neigh-
bors. As the rounds are asynchronous, the round numbers are not given for free by
the computation model. They have to be explicitly managed by the processes them-
selves. Hence, each process pi manages a local variable ri that it increases when it
starts a new asynchronous round (line 5).
The first round (lines 1–4) is an initial round during which the processes ex-
change their initial color in order to fill in their local array colori [neighborsi ]. If
the processes know the initial colors of their neighbors, this communication round
can be suppressed. The processes then execute m − (Δ + 1) asynchronous rounds
(line 5).
The processes whose initial color belongs to the set of colors {1, . . . , Δ + 1} keep
their color forever. The other processes update their colors in order to obtain a color
in {1, . . . , Δ + 1}. To that end, all the processes execute sequentially the rounds
Δ + 2, . . . , until m, considering that each round number corresponds to a given
distinct color. During round r, Δ + 2 ≤ r ≤ m, each process whose initial color is r
looks for a new color in {1, . . . , Δ + 1} which is not the color of its neighbors and
adopts it as its new color (lines 6–8). Then, each process exchanges its color with
its neighbors (lines 10–14) before proceeding to the next round. Hence, the round
invariant is the following one: When a round r terminates, the processes whose
initial colors were in {1, . . . , r} (a) have a color in the set {1, . . . , Δ + 1}, and (b)
have different colors if they are neighbors.

Cost The time complexity (counted in number of rounds) is m − Δ rounds (an
initial round plus m − (Δ + 1) rounds). Each message carries a tag, a color, and
possibly a round number which is also a color. As the initial colors are in {1, . . . , m},
the message bit complexity is O(log2 m).
Finally, during each round, two messages are sent on each channel. The message
complexity is consequently 2e(m − Δ), where e denotes the number of channels.
It is easy to see that, the better the initial process coloring (i.e., the smaller the
value of m), the more efficient the algorithm.

Theorem 1 Let m ≥ Δ + 2. The algorithm described in Fig. 2.8 computes a legal
(Δ + 1)-coloring of the processes (where legal means that no two neighbors have
the same color).

Proof Let us first observe that the processes whose initial color belongs to
{1, . . . , Δ + 1} never modify their color. Let us assume that, up to round r, the
processes whose initial colors were in the set {1, . . . , r} have new colors in the
set {1, . . . , Δ + 1} and any two of them which are neighbors have different colors.
Thanks to the initial m-coloring, this is initially true (i.e., for the fictitious round
r = Δ + 1).
Let us assume that the previous assertion is true up to some round r ≥ Δ + 1.
It follows from the algorithm that, during round r + 1, only the processes whose
current color is r + 1 update it. Moreover, each of them updates it (line 7) with a
color that (a) belongs to the set {1, . . . , Δ + 1} and (b) is not a color of its neighbors
(we have seen in Sect. 2.2.1 that such a color does exist). Consequently, at the end
of round r + 1, the processes whose initial colors were in the set {1, . . . , r + 1}

Fig. 2.9 One bit of control information when the channels are not FIFO

have new colors in the set {1, . . . , Δ + 1} and no two of them have the same new
color if they are neighbors. It follows that, as claimed, this property constitutes a
round invariant from which we conclude that each process has a final color in the
set {1, . . . , Δ + 1} and no two neighbor processes have the same color. □

Remark on the Behavior of the Communication Channels Let us remember
that the only assumption on channels is that they are reliable. No other behavioral
assumption is made, hence the channels are implicitly non-FIFO channels.
Let us consider two neighbor processes that execute a round r as depicted in
Fig. 2.9. Each of them sends its message COLOR (r, −) to its neighbors (line 10),
and waits for a message COLOR () from each of them, carrying the very same round
number (line 12).
In the figure, pj has received the round r message from pi , proceeded to the
next round, and sent the message COLOR (r + 1, −) to pi while pi is still waiting
for the round r message from pj . Moreover, as the channel is not FIFO, the figure
depicts the case where the message COLOR (r + 1, −) sent by pj to pi arrives before
the message COLOR (r, −) it sent previously. As indicated in line 12, the algorithm
forces pi to wait for the message COLOR (r, −) in order to terminate its round r.
As, in each round, each process sends a message to each of its neighbors, a closer
analysis of the message exchange pattern shows that the following relation on round
numbers is invariant. At any time we have:
∀(i, j ) : (pi and pj are neighbors) ⇒ (0 ≤ |ri − rj | ≤ 1).

It follows that the message COLOR () does not need to carry the value of r but only
a bit, namely the parity of r. The algorithm can then be simplified as follows:
• At line 10, each process pi sends the message COLOR (ri mod 2, colori [i]) to each
of its neighbors.
• At line 12, each process pi waits for a message COLOR (b, −), with b = (ri mod 2),
from each of its neighbors.
Finally, it follows from the previous discussion that, if the channels are FIFO, the mes-
sages COLOR () do not need to carry a control value (neither r, nor its parity bit).

Fig. 2.10 Examples of maximal independent sets

2.2.3 Computing a Maximal Independent Set

Maximal Independent Set: Definition An independent set is a subset of the ver-
tices of a graph such that no two of them are neighbors. An independent set M is
maximal if none of its strict supersets M′ (i.e., M ⊂ M′ and M ≠ M′) is an inde-
pendent set. A graph can have several maximal independent sets.
The subset of vertices {1, 4, 5, 8} of the graph depicted in the left part of
Fig. 2.10 is a maximal independent set. The subsets {1, 5, 7} and {2, 3, 6, 7} are
other examples of maximal independent sets of the same graph. The graph depicted
on the right part has two maximal independent sets, the set {1} and the set {2, 3, 4, 5}.
There is a trivial greedy algorithm to compute a maximal independent set in a
sequential context. Select a vertex, add it to the independent set, suppress it and its
neighbors from the graph, and iterate until there are no more vertices. It follows that
the problem of computing a maximal independent set belongs to the time complex-
ity class P (the class of problems that can be solved by an algorithm whose time
complexity is polynomial).
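The greedy procedure just described can be sketched as follows (neighbors is again an assumed adjacency map from each vertex to its set of neighbors):

    def greedy_mis(neighbors):
        remaining = set(neighbors)      # vertices still in the graph
        mis = set()
        while remaining:
            v = remaining.pop()         # select a vertex
            mis.add(v)                  # add it to the independent set
            remaining -= neighbors[v]   # suppress its neighbors
        return mis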
A maximum independent set is an independent set with maximal cardinality.
When considering the graph at the left of Fig. 2.10, the maximal independent sets
{1, 4, 5, 8} and {2, 3, 6, 7} are maximum independent sets. The graph on the right of
the figure has a single maximum independent set, namely the set {2, 3, 4, 5}.
While, from a time complexity point of view, the computation of a maximal
independent set is an easy problem, the computation of a maximum independent set
is a hard problem: it belongs to the class of NP-complete problems.

From m-Coloring to a Maximal Independent Set An asynchronous distributed
algorithm that computes a maximal independent set is presented in Fig. 2.11. Each
process pi manages a local array selectedi [j ], j ∈ neighborsi ∪ {i}, initialized to
[false, . . . , false]. At the end of the algorithm pi belongs to the maximal independent
set if and only if selectedi [i] is equal to true.
This algorithm assumes that there is an initial m-coloring of the processes (as
we have just seen, this can be obtained from the algorithm of Fig. 2.8). Hence, the
algorithm of Fig. 2.11 is a distributed reduction of the maximal independent set
problem to the m-coloring problem. Its underlying principle is based on a simple
observation and a greedy strategy. More precisely,
• Simple observation: the processes that have the same color define an independent
set, but this set is not necessarily maximal.

(1) for ri from 1 to m do
begin asynchronous round
(2) if (colori = ri ) then
(3) if (∧ j ∈neighborsi (¬selectedi [j ])) then selectedi [i] ← true end if;
(4) end if;
(5) for each j ∈ neighborsi do send SELECTED (ri , selectedi [i]) to pj end for;
(6) for each j ∈ neighborsi do
(7) wait (SELECTED (r, selected_j ) with r = ri received from pj );
(8) selectedi [j ] ← selected_j
(9) end for
end asynchronous round
(10) end for.

Fig. 2.11 From m-coloring to a maximal independent set (code for pi )

• Greedy strategy: as the previous set is not necessarily maximal, the algorithm
starts with an initial independent set (defined by some color) and executes a se-
quence of rounds, each round r corresponding to a color, in which it strives to
add to the independent set under construction as many processes as possible whose
color is r. The corresponding “addition” predicate for a process pi with color r
is that none of its neighbors is already in the set.
Like the previous algorithms, the algorithm described in Fig. 2.11 simulates a syn-
chronous algorithm. The color of a process pi is kept in its local variable denoted
colori . The messages carry a round number (color) which can be replaced by its
parity. The processes execute m asynchronous rounds (a round per color). When it
executes round r, if its color is r and none of its neighbors belongs to the set un-
der construction, a process pi adds itself to the set (line 3). Then, before starting
the next round, the processes exchange their membership of the maximal indepen-
dent set in order to update their local variables selectedi [j ]. (As we can see, what
is important is not the fact that the rounds are executed in the order 1, . . . , m, but
the fact that the processes execute the rounds in the same predefined order, e.g.,
1, m, 2, (m − 1), . . . .)
The size of the maximal independent set that is computed is very sensitive to the
order in which the colors are visited by the algorithm. As an example, let us consider
the graph at the right of Fig. 2.10 where the process p1 is colored a while the other
processes are colored b. If a = 1 and b = 2, the maximal independent set that is
built is the set {1}. If a = 2 and b = 1, the maximal independent set that is built is
the set {2, 3, 4, 5}.
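The reduction can be observed on a centralized, round-by-round simulation of Fig. 2.11 (a sketch only: real processes exchange SELECTED () messages instead of reading each other's state; color[i] is the initial legal color of process i, a value in {1, . . . , m}):

    def mis_from_coloring(color, neighbors, m):
        n = len(color)
        selected = [False] * n
        for r in range(1, m + 1):       # one (simulated) round per color
            for i in range(n):          # only the processes colored r may act
                if color[i] == r and not any(selected[j] for j in neighbors[i]):
                    selected[i] = True  # line 3 of Fig. 2.11
        return {i for i in range(n) if selected[i]}

As the processes colored r are never neighbors of each other, the order of the inner loop is immaterial, which is why a single round per color suffices in the distributed algorithm.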

A Simple Algorithm for Maximal Independent Set This section presents an
algorithm, due to M. Luby (1987), that builds a maximal independent set.
This algorithm uses a random function denoted random() which outputs a ran-
dom value each time it is called (the benefit of using random values is motivated
below). For ease of exposition, this algorithm, which is described in Fig. 2.12, is
expressed in the synchronous model. Let us remember that the main property of the
synchronous model lies in the fact that a message sent in a round is received by its

(1) repeat forever
begin three synchronous rounds r, r + 1 and r + 2
beginning of round r
(2) randomi [i] ← random();
(3) for each j ∈ com_withi do send RANDOM (randomi [i]) to pj end for;
(4) for each j ∈ com_withi do
(5) wait (RANDOM (random_j ) received from pj ); randomi [j ] ← random_j
(6) end for;
end of round r and beginning of round r + 1
(7) if (∀ j ∈ com_withi : randomi [j ] > randomi [i])
(8) then for each j ∈ com_withi do send SELECTED (yes) to pj end for;
(9) statei ← in; return(in)
(10) else for each j ∈ com_withi do send SELECTED (no) to pj end for;
(11) for each j ∈ com_withi do wait (SELECTED (−) received from pj ) end for;
end of round r + 1 and beginning of round r + 2
(12) if (∃ k ∈ com_withi : SELECTED (yes) received from pk )
(13) then for each j ∈ com_withi : SELECTED (no) received from pj
(14) do send ELIMINATED (yes) to pj
(15) end for;
(16) statei ← out ; return(out)
(17) else for each j ∈ com_withi do send ELIMINATED (no) to pj end for;
(18) for each j ∈ com_withi
(19) do wait (ELIMINATED (−) received from pj )
(20) end for;
(21) for each j ∈ com_withi : ELIMINATED (yes) received from pj
(22) do com_withi ← com_withi \ {j }
(23) end for;
(24) if (com_withi = ∅) then statei ← in; return(in) end if
(25) end if
(26) end if
end three synchronous rounds
(27) end repeat.

Fig. 2.12 Luby’s synchronous random algorithm for a maximal independent set (code for pi )

destination process in the very same round. (It is easy to extend this algorithm so
that it works in the asynchronous model.)
Each process pi manages the following local variables.
• The local variable statei , whose initial value is arbitrary, is updated only once.
Its final value (in or out) indicates whether pi belongs or not to the maximal in-
dependent set that is computed. When it has updated statei to its final value, a
process pi executes the statement return(), which stops its participation in the al-
gorithm. Let us notice that the processes do not necessarily terminate during the
same round.
• The local variable com_withi , which is initialized to neighborsi , is a set contain-
ing the processes with which pi will continue to communicate during the next
round.
• Each local variable randomi [j ], where j ∈ neighborsi ∪ {i}, represents the local
knowledge of pi about the last random number used by pj .

Fig. 2.13 Messages exchanged during three consecutive rounds

As indicated, the processes execute a sequence of synchronous rounds. The code
of the algorithm consists in the description of three consecutive rounds, namely the
rounds r, r + 1, and r + 2, where r = 1, 4, 7, 10, . . . . The messages exchanged
during these three consecutive rounds are depicted in Fig. 2.13.
The behavior of the synchronous algorithm during these three consecutive rounds
is as follows:
• Round r: lines 2–6.
Each process pi invokes first the function random() to obtain a random number
(line 2) that it sends to all its neighbors it is still communicating with (line 3).
Then, it stores all the random numbers it has received, each coming from a process
in com_withi .
• Round r + 1: lines 7–11.
Then, pi sends the message SELECTED (yes) to its neighbors in com_withi if
its random number is smaller than theirs (line 8). In this case, it progresses to the
local state in and stops (line 9).
Otherwise, its random number is not the smallest. In this case, pi first sends
the message SELECTED (no) to its neighbors in com_withi (line 10), and then
waits for a message from each of these neighbors (line 11).
• Round r + 2: lines 12–26.
Finally, if pi has not entered the maximal independent set under construction,
it checks if one of its neighbors in com_withi has been added to this set (line 12).
– If one of its neighbors has been added to the independent set, pi cannot be
added to this set in the future. It consequently sends the message ELIMI -
NATED (yes) to its neighbors in com_withi to inform them that it no longer
competes to enter the independent set (line 13). In that case, it also enters the
local state out and returns it (line 16).
– If none of its neighbors in com_withi has been added to the independent set,
pi sends them the message ELIMINATED (no) to inform them that it is still
competing to enter the independent set (line 17). Then, it waits for a mes-
sage ELIMINATED (−) from each of them (line 18) and suppresses from the set
com_withi its neighbors that are no longer competing (those are the processes
which sent it the message ELIMINATED (yes), lines 21–23).
Finally, pi checks if com_withi = ∅. If it is the case, it enters the indepen-
dent set and returns (line 24). Otherwise, it proceeds to the next round.

The algorithm computes an independent set because when a process is added to
the set, all its neighbors stop competing to be in the set (lines 12–15). This set is
maximal because when a process enters the independent set, only its neighbors are
eliminated from being candidates.
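The three-round structure can be simulated centrally as follows (a sketch: real processes draw their random values locally, and ties are neglected since random.random() makes them essentially impossible):

    import random

    def luby_mis(neighbors):
        com_with = {i: set(neighbors[i]) for i in neighbors}
        state = {}
        active = set(neighbors)
        while active:
            rnd = {i: random.random() for i in active}       # round r
            winners = {i for i in active
                       if all(rnd[j] > rnd[i] for j in com_with[i])}
            for i in winners:
                state[i] = 'in'                              # round r+1: enter the set
            eliminated = {i for i in active
                          if any(j in winners for j in com_with[i])}
            for i in eliminated:
                state[i] = 'out'                             # round r+2: a neighbor entered
            active -= winners | eliminated
            for i in active:
                com_with[i] -= eliminated                    # lines 21-23
                if not com_with[i]:
                    state[i] = 'in'                          # line 24
            active = {i for i in active if com_with[i]}
        return {i for i, s in state.items() if s == 'in'}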

Why Use Random Numbers Instead of Initial Names or Precomputed Colors
As we have seen, the previous algorithm associates a new random number with each
process when this process starts a new round triple. The reader can check that the
algorithm works if each process uses its identity or a legal color instead of a new
random number at each round. Hence, the question: Why use random numbers?
The instance of the algorithm using n distinct identities (or a legal process m-
coloring) requires a number of round triples upper bounded by n/2 (or m/2).
This is because, in each round triple, at least one process enters the maximal inde-
pendent set and at least one process is eliminated. Taking random numbers does not
reduce this upper bound (because always taking initial identities corresponds to par-
ticular random choices) but reduces it drastically in the average case (the expected
number of round triples is then O(log2 n)).

2.3 Knot and Cycle Detection

Knots and cycles are graph patterns encountered when one has to solve distributed
computing problems such as deadlock detection. This section presents an asyn-
chronous distributed algorithm that detects such graph patterns.

2.3.1 Directed Graph, Knot, and Cycle

A directed graph is a graph where every edge is oriented from one vertex to another
vertex. A directed path in a directed graph is a sequence of vertices i1 , i2 , . . . , ix
such that for any y, 1 ≤ y < x, there is an edge from the vertex iy to the vertex iy+1 .
A cycle is a directed path such that ix = i1 .
A knot in a directed graph G is a subgraph G′ such that (a) any pair of vertices
in G′ belongs to a cycle and (b) there is no directed path from a vertex in G′ to a
vertex which is not in G′ . Hence, a vertex of a directed graph belongs to a knot if
and only if it is reachable from all the vertices that are reachable from it. Intuitively,
a knot is a “black hole”: once in a knot, there is no way to go outside of it.
An example is given in Fig. 2.14. The directed graph has 11 vertices. The set
of vertices {7, 10, 11} defines a cycle which is not in a knot (this is because, when
traveling on this cycle, it is possible to exit from it). The subgraph restricted to the
vertices {3, 5, 6, 8, 9} is a knot (after entering this set of vertices, it is impossible to
exit from it).
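This characterization translates directly into a centralized check, sketched here in Python (succ is an assumed map from each vertex to the set of its immediate successors):

    def reachable_from(succ, v):
        # iterative graph traversal: all vertices reachable from v (v included)
        seen, stack = {v}, [v]
        while stack:
            for w in succ[stack.pop()]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return seen

    def in_knot(succ, v):
        # v is in a knot iff every vertex reachable from v can reach v back
        # (the degenerate case of a vertex without outgoing edge is ignored)
        return all(v in reachable_from(succ, w) for w in reachable_from(succ, v))

On the graph of Fig. 2.14, this predicate holds for the vertices of {3, 5, 6, 8, 9} and fails for vertex 7, whose cycle can be exited.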

Fig. 2.14 A directed graph with a knot

2.3.2 Communication Graph, Logical Directed Graph, and Reachability

As previously, the underlying communication graph is not directed. Each channel is
bidirectional which means that, if two processes are neighbors, either of them can
send messages to the other.
It is assumed that a directed graph is defined on the communication graph. Its
vertices are the processes, and if pi and pj are connected by a communication
channel, there is (a) either a logical directed edge from pi to pj , or (b) a logical
directed edge from pj to pi , or (c) two logical directed edges (one in each direction).
If there is a directed edge from pi to pj , we say “pj is an immediate successor of
pi ” and “pi is an immediate predecessor of pj ”. A vertex pj is said to be reachable
from a vertex pi if there is a directed path from pi to pj .
From an application point of view, a directed edge corresponds to a dependence
relation linking a process pi to its neighbor pj (e.g., pi is waiting for “something”
from pj ).

2.3.3 Specification of the Knot Detection Problem

The problem consists in detecting if a given process belongs to a knot of a directed
graph. For simplicity, it is assumed that only one process initiates the knot detection.
Multiple instantiations can be distinguished by associating with each of them an
identification pair made up of a process identity and a sequence number.
The knot detection problem is defined by the following properties, where pa is
the process that initiates the detection:
• Liveness (termination). If pa starts the knot detection algorithm, it eventually
obtains an answer.

• Safety (consistency).
– If pa obtains the answer “knot”, it belongs to a knot. Moreover, it knows the
identity of all the processes involved in the knot.
– If pa obtains the answer “no knot”, it does not belong to a knot. Moreover, if
it belongs to at least one cycle, pa knows the identity of all the processes that
are involved in a cycle with pa .
As we can see, the safety property of the knot detection problem states what is a
correct result while its liveness property states that eventually a result has to be
computed.

2.3.4 Principle of the Knot/Cycle Detection Algorithm

The algorithm that is presented below relies on the construction of a spanning tree
enriched with appropriate statements. It is due to D. Manivannan and M. Sing-
hal (2003).

Build a Directed Spanning Tree To determine if it belongs to a knot, the initiator
pa needs to check that every process that is reachable from it is on a cycle which
includes it (pa ).
To that end, pa sends a message GO _ DETECT () to its immediate successors in
the directed graph and these messages are propagated from immediate successors
to immediate successors along directed edges to all the processes that are reachable
from pa . The first time it receives such a message from a process pj , the receiver
process pi defines pj as its parent in the directed spanning tree.

Remark The previous message GO _ DETECT () and the messages CYCLE _ BACK (),
SEEN _ BACK (), and PARENT _ BACK () introduced below are nothing more than par-
ticular instances of the messages GO () and BACK () used in the graph traversal algo-
rithms described in Chap. 1.

How to Determine Efficiently that pa Is on a Cycle If the initiator pa receives
a message GO _ DETECT () from a process pj , it knows that it is on a cycle. The issue
is then for pa to know which are the processes involved in the cycles to which it
belongs.
To that end, pa sends a message CYCLE _ BACK () to pj and, more generally
when a process pi knows that it is on a cycle including pa , it will send a message
CYCLE _ BACK () to each process from which it receives a message GO _ DETECT ()
thereafter. Hence, these processes will learn that they are on a cycle including the
initiator pa .
But it is possible that, after it has received a first message GO _ DETECT (), a pro-
cess pi receives more GO _ DETECT () messages from other immediate predecessors
(let pk be one of them, see Fig. 2.15). If this message exchange pattern occurs, pi

Fig. 2.15 Possible message pattern during a knot detection

sends back to pk the message SEEN _ BACK (), and when pk receives this message it
includes the ordered pair ⟨k, i⟩ in a local set denoted seenk . (Basically, the message
SEEN _ BACK () informs its receiver that its sender has already received a message
GO _ DETECT ().) In that way, if later pi is found to be on a cycle including pa , it
can be concluded from the pair ⟨k, i⟩ ∈ seenk that pk is also on a cycle including pa
(this is because, due to the messages GO _ DETECT (), there is a directed path from
pa to pk and pi , and due to the cycle involving pa and pi , there is a directed path
from pi to pa ).
Finally, as in graph traversal algorithms, when it has received an acknowledg-
ment from each of its immediate successors, a process pi sends a message PAR -
ENT _ BACK () to its parent in the spanning tree. Such a message contains (a) the
processes that, due to the messages CYCLE _ BACK () received by pi from immediate
successors, are known by pi to be on a cycle including pa , and (b) the ordered pairs
⟨i, −⟩ stored in seeni as a result of the acknowledgment messages SEEN _ BACK ()
and PARENT _ BACK () it has received from its immediate successors in the logical
directed graph. This information, which will be propagated in the tree to pa , will
allow pa to determine if it is in a knot or a cycle.

2.3.5 Local Variables

Local Variable at the Initiator pa Only The local variable candidatesa , which
appears only at the initiator, is a set (initially empty) of process identities. If pa is in
a knot, candidatesa will contain the identities of all the processes that are in the knot
including pa , when the algorithm terminates. If pa is not in a knot, candidatesa will
contain all the processes that are in a cycle including pa (if any). If candidatesa = ∅
when the algorithm terminates, pa belongs to neither a knot, nor a cycle.

Local Variables at Each Process pi Each (initiator or not) process pi manages
the following four local variables.
• The local variable parenti is initialized to ⊥. If pi is the initiator we will have
parenti = i when it starts the detection algorithm. If pi is not the initiator,
parenti will contain the identity of the process from which the first message
GO _ DETECT () was received by pi . When all the processes reachable from pa
have received a message GO _ DETECT (), these local variables define a directed
spanning tree rooted at pa which will be used to transmit information back to this
process.
• The local variable waiting_fromi is a set of process identities. It is initialized to
the set of the immediate successors of pi in the logical directed graph.
• The local variable in_cyclei is a set (initially empty) of process identities. It will
contain processes that are on a cycle including pi .
• The local variable seeni is a set (initially empty) of ordered pairs of process iden-
tities. As we have seen, ⟨k, j⟩ ∈ seeni means that there is a directed path from pa
to pk and a directed edge from pk to pj in the directed graph. It also means that
both pk and pj have received a message GO _ DETECT () and, when pj received
the message GO _ DETECT () from pk , it did not know whether it belongs to a cycle
including pa (see Fig. 2.15).

2.3.6 Behavior of a Process

The knot detection algorithm is described in Fig. 2.16.

Launching the Algorithm The only process pi that receives the external mes-
sage START () discovers that it is the initiator, i.e., pi is pa . If it has no outgoing
edges – predicate (waiting_fromi = ∅) at line 1 –, pi returns the pair (no knot,∅),
which indicates that pi belongs neither to a cycle, nor to a knot (line 4). Otherwise,
it sends the message GO _ DETECT () to all its immediate successors in the directed
graph (line 3).
Reception of a Message GO _ DETECT () When a process pi receives the message
GO _ DETECT () from pj , it sends back to pj the message CYCLE _ BACK () if it is
the initiator, i.e., if pi = pa (line 7). If it is not the initiator and this message is the
first it receives, it first defines pj as its parent in the spanning tree (line 9). Then, if
waiting_fromi ≠ ∅ (line 10), pi propagates the detection to its immediate successors
in the directed graph (line 11). If waiting_fromi = ∅, pi has no successor in the
directed graph. It then returns the message PARENT _ BACK (seeni , in_cyclei ) to its
parent (both seeni and in_cyclei are then equal to their initial value, i.e., ∅; seeni = ∅
means that pi has not seen another detection message, while in_cyclei = ∅ means
that pi is not involved in a cycle including the initiator).
If pi is already in the detection tree, it sends back to pj the message
SEEN _ BACK () or CYCLE _ BACK () according to whether the local set in_cyclei is
empty or not (lines 14–15). Hence, if in_cyclei ≠ ∅, pi is on a cycle including pa
and pj will consequently learn that it is also on a cycle including pa .
Reception of a Message XXX _ BACK () When a process pi receives a message
XXX _ BACK () (where XXX stands for SEEN , CYCLE , or PARENT ), it first suppresses
its sender pj from waiting_fromi .

when START () is received do
(1) if (waiting_fromi ≠ ∅)
(2) then parenti ← i;
(3) for each j ∈ waiting_fromi do send GO _ DETECT () to pj end for
(4) else return(no knot, ∅)
(5) end if.
when GO _ DETECT () is received from pj do
(6) if (parenti = i)
(7) then send CYCLE _ BACK () to pj
(8) else if (parenti = ⊥)
(9) then parenti ← j ;
(10) if (waiting_fromi ≠ ∅)
(11) then for each k ∈ waiting_fromi do send GO _ DETECT () to pk end for
(12) else send PARENT _ BACK (seeni , in_cyclei ) to pparenti
(13) end if
(14) else if (in_cyclei ≠ ∅) then send CYCLE _ BACK () to pj
(15) else send SEEN _ BACK () to pj
(16) end if
(17) end if
(18) end if.
when SEEN _ BACK () is received from pj do
(19) waiting_fromi ← waiting_fromi \ {j }; seeni ← seeni ∪ {⟨i, j ⟩}; check_waiting_from().
when CYCLE _ BACK () is received from pj do
(20) waiting_fromi ← waiting_fromi \ {j }; in_cyclei ← in_cyclei ∪ {j };
(21) check_waiting_from().
when PARENT _ BACK (seen, in_cycle) is received from pj do
(22) waiting_fromi ← waiting_fromi \ {j }; seeni ← seeni ∪ seen;
(23) if (in_cycle = ∅)
(24) then seeni ← seeni ∪ {⟨i, j ⟩}
(25) else in_cyclei ← in_cyclei ∪ in_cycle
(26) end if;
(27) check_waiting_from().
internal operation check_waiting_from() is
(28) if (waiting_fromi = ∅) then
(29) if (parenti = i)
(30) then for each k ∈ in_cyclei do
(31) in_cyclei ← in_cyclei \ {k}; candidatesi ← candidatesi ∪ {k};
(32) for each x ∈ {1, . . . , n} do
(33) if (⟨x, k⟩ ∈ seeni )
(34) then in_cyclei ← in_cyclei ∪ {x};
(35) seeni ← seeni \ {⟨x, k⟩}
(36) end if
(37) end for
(38) end for;
(39) if (seeni = ∅) then res ← knot else res ← no knot end if;
(40) return(res, candidatesi )
(41) else if (in_cyclei ≠ ∅) then in_cyclei ← in_cyclei ∪ {i} end if;
(42) send PARENT _ BACK (seeni , in_cyclei ) to pparenti ; return()
(43) end if
(44) end if.

Fig. 2.16 Asynchronous knot detection (code of pi )


As we have seen, a message SEEN _ BACK () informs its receiver pi that its sender
pj has already been visited by the detection algorithm (see Fig. 2.15). Hence, pi
adds the ordered pair ⟨i, j⟩ to seeni (line 19). Therefore, if later pj is found to be
on a cycle involving the initiator, the initiator will be able to conclude from seeni
that pi is also on a cycle involving pa . The receiver pi then invokes the internal
operation check_waiting_from().
If the message received by pi from pj is CYCLE _ BACK (), pi adds j to in_cyclei
(line 20) before invoking check_waiting_from(). This is because there is a path from
the initiator to pi and a path from pj to the initiator, hence pi and pj belong to a
same cycle including pa .
If the message received by pi from pj is PARENT _ BACK (seen, in_cycle), pi
adds the ordered pairs contained in seen sent by its child pj to its set seeni (line 22).
Moreover, if in_cycle is not empty, pi merges it with in_cyclei (line 25). Otherwise
pi adds the ordered pair ⟨i, j⟩ to seeni (line 24). In this way, the information allow-
ing pa to know (a) if it is in a knot or (b) if it is only in a cycle involving pi will be
propagated from pi first to its parent, and then propagated from its parent until pa .
Finally, pi invokes check_waiting_from().
The Internal Operation check_waiting_from() As just seen, this operation is in-
voked each time pi receives a message XXX _ BACK (). Its body is executed only
if pi has received a message XXX _ BACK () from each of its immediate successors
(line 28). There are two cases.
If pi is not the initiator, it first adds itself to in_cyclei if this set is not empty
(line 41). This is because, if in_cyclei ≠ ∅, pi knows that it is on a cycle involving
the initiator (lines 20 and 25). Then, pi sends to its parent (whose identity has been
saved in parenti at line 9) the information it knows on cycles involving the initia-
tor. This information has been incrementally stored in its local variables seeni and
in_cyclei at lines 19–27. Finally, pi invokes return(), which terminates its partici-
pation (line 42).
If pi is the initiator pa , it executes the statements of lines 30–39. First pi cleans
its local variables seeni and in_cyclei (lines 30–38). For each k ∈ in_cyclei , pi first
moves k from in_cyclei to candidatesi . This is because, if pi is in a knot, so are all
the processes which are on a cycle including pi . Then, for each x, if the ordered
pair ⟨x, k⟩ ∈ seeni , pi suppresses it from seeni and adds px to in_cyclei . This is
because, after pa has received a message XXX _ BACK () from each of its immediate
successors, we have for each process pk reachable from pa either k ∈ in_cyclea or
⟨x, k⟩ ∈ seena for some px reachable from pa . Hence, if k ∈ in_cyclea and ⟨x, k⟩ ∈
seena , then px is also in a cycle with pa .
Therefore, after the execution of lines 30–38, candidatesa contains the identities
of all the processes reachable from pa which are on a cycle with pa . It follows that,
if seena becomes empty, all the processes reachable from pa are on a cycle with pa .
The statement of line 39 is a direct consequence of this observation. If seena = ∅,
pa belongs to a knot made up of the processes which belong to the set candidatesa .
If seena ≠ ∅, candidatesa contains all the processes that are involved in a cycle
including pa (hence, if candidatesa = ∅, pa is involved neither in a knot, nor in a
cycle).
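The cleaning of lines 30–38 and the test of line 39 amount to a small closure computation, sketched here in Python (in_cycle and seen stand for the values of in_cyclea and seena when the last acknowledgment arrives at pa):

    def knot_verdict(in_cycle, seen):
        in_cycle, seen = set(in_cycle), set(seen)    # work on local copies
        candidates = set()
        while in_cycle:
            k = in_cycle.pop()
            candidates.add(k)                        # k is on a cycle with p_a
            for pair in list(seen):
                x, kk = pair
                if kk == k:                          # x reaches k, k reaches p_a,
                    seen.discard(pair)               # and p_a reaches x: hence x
                    if x not in candidates:          # is on a cycle with p_a too
                        in_cycle.add(x)
        return ('knot' if not seen else 'no knot'), candidates

On the example of Fig. 2.17 discussed below with pa = p1 , no pair of seena ends with a process of in_cyclea , so seena remains nonempty and the verdict is “no knot”.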

Fig. 2.17 Knot/cycle detection: example

An Example Let us consider the directed graph depicted in Fig. 2.17. This
graph has a knot composed of the processes p2 , p3 , and p4 , a cycle involving
the processes p1 , p6 , p5 and p7 , plus another process p8 . If the initiator process
pa belongs to the knot, pa will discover that it is in a knot, and we will have
candidatesa = {2, 3, 4} and seena = ∅ when the algorithm terminates. If the ini-
tiator process pa belongs to the cycle on the right of the figure (e.g., pa is p1 ), we
will have candidatesa = {1, 6, 5, 7} and seena = {⟨4, 2⟩, ⟨3, 4⟩, ⟨2, 3⟩, ⟨1, 2⟩, ⟨5, 4⟩}
when the algorithm terminates (assuming that the messages GO _ DETECT() propa-
gate first along the process chain (p1 , p2 , p3 , p4 ), and only then from p5 to p4 ).

Cost of the Algorithm As in a graph traversal algorithm, each edge of the di-
rected graph is traversed at most once by a message GO _ DETECT () and a message
SEEN _ BACK (), CYCLE _ BACK () or PARENT _ BACK () is sent in the opposite direc-
tion. It follows that the number of messages used by the algorithm is upper bounded
by 2e, where e is the number of edges of the logical directed graph.
Let DT be the depth of the spanning tree rooted at pa that is built. It is easy
to see that the time complexity is 2(DT + 1) (DT time units for the messages
GO _ DETECT () to go from the root pa to the leaves, DT time units for the mes-
sages XXX _ BACK () to go back in the other direction and 2 more time units for the
leaves to propagate the message GO _ DETECT () to their immediate successors and
obtain their acknowledgment messages XXX _ BACK ()).

2.4 Summary
Considering a distributed system as a graph whose vertices are the processes and
edges are the communication channels, this chapter has presented several distributed
graph algorithms. “Distributed” means here that each process cooperates with its neigh-
bors to solve a problem but never learns the whole graph structure it is part of.
The problems that have been addressed concern the computation of shortest
paths, the coloring of the vertices of a graph in Δ + 1 colors (where Δ is the maxi-
mal degree of the vertices), the computation of a maximal independent set, and the
detection of knots and cycles.
As the reader has seen, the algorithmic techniques used to solve graph problems
in a distributed context are different from their sequential counterparts.

2.5 Bibliographic Notes

• Graph notions and sequential graph algorithms are described in many textbooks,
e.g., [122, 158]. Advanced results on graph theory can be found in [164]. Time
complexity results of numerous graph problems are presented in [148].
• Distributed graph algorithms and associated time and message complexity analy-
ses can be found in [219, 292].
• As indicated by its name, the sequential shortest path algorithm presented in
Sect. 2.1.1 is due to R.L. Ford. It is based on Bellman’s dynamic program-
ming principle [44]. Similarly, the sequential shortest path algorithm presented in
Sect. 2.1.2 is due to R.W. Floyd and R. Warshall who introduced independently
similar algorithms in [128] and [384], respectively. The adaptation of Floyd–
Warshall’s shortest path algorithm is due to S. Toueg [373].
Other distributed shortest path algorithms can be found in [77, 203].
• The random algorithm presented in Sect. 2.2.3, which computes a maximal inde-
pendent set, is due to M. Luby [240]. The reader will find in this paper a proof
that the expected number of rounds is O(log2 n). Another complexity analysis of
(Δ + 1)-coloring is presented in [201].
• The knot detection algorithm described in Sect. 2.3.4 is due to D. Manivannan and
M. Singhal [248] (this paper contains a correctness proof of the algorithm). Other
asynchronous distributed knot detection algorithms can be found in [59, 96, 264].
• Distributed algorithms for finding centers and medians in networks can be found
in [210].
• Deterministic distributed vertex coloring in polylogarithmic time, suited to syn-
chronous systems, is addressed in [43].

2.6 Exercises and Problems


1. Adapt the algorithms described in Sect. 2.1.1 to the case where the communica-
tion channels are unidirectional in the sense that a channel transmits messages in
one direction only.
2. Execute Luby’s maximal independent set algorithm described in Fig. 2.12 on both graphs
described in Fig. 2.10 with various values output by the function random().
3. Let us consider the instances of Luby’s algorithm where, for each process, the
random numbers are statically replaced by its initial identity or its color (where
no two neighbor processes have the same color).
Compare these two instances. Do they always have the same time complexity?
4. Adapt Luby’s synchronous maximal independent set algorithm to an asyn-
chronous message-passing system.
5. Considering the directed graph depicted in Fig. 2.14, execute the knot detection
algorithm described in Sect. 2.3.4 (a) when p5 launches the algorithm, (b) when
p10 launches the algorithm, and (c) when p4 launches the algorithm.
Chapter 3
An Algorithmic Framework
to Compute Global Functions
on a Process Graph

This chapter is devoted to distributed graph algorithms that compute a function or a
predicate whose inputs are disseminated at the processes of a network. The function
(or the predicate) is global because the output at each process depends on the inputs
at all the processes. It follows that the processes have to communicate in order to
compute their results.
A general algorithmic framework is presented which allows global functions to
be computed. This distributed framework is (a) symmetric in the sense that all pro-
cesses obey the same rules of behavior, and (b) does not require the processes to
exchange more information than needed. The computation of shortest distances and
the determination of a cut vertex in a graph are used to illustrate the framework.
The framework is then improved to allow for a reduction of the size and the num-
ber of messages that are exchanged. Finally, the chapter analyzes the particular case
of regular networks (networks in which all the processes have the same number of
neighbors).

Keywords Cut vertex · De Bruijn’s graph · Determination of cut vertices ·
Global function · Message filtering · Regular communication graph ·
Round-based framework

3.1 Distributed Computation of Global Functions


3.1.1 Type of Global Functions

The problems addressed in this chapter consist for each process pi in computing a
result outi which involves the whole set of inputs in1 , in2 , . . . , inn , where ini is the
input provided by process pi . More precisely, let IN[1..n] be the vector that, from an
external observer point of view, represents the inputs, i.e., ∀i: IN[i] = ini . Similarly,
let OUT[1..n] be the vector such that ∀i: OUT[i] = outi . Hence, the processes have
to cooperate and coordinate so that they collectively compute
OUT = F (IN).
According to the function F () which is computed, all the processes may obtain
the same result out (and we have then OUT[1] = · · · = OUT[n] = out) or different


results (i.e., it is possible that there are processes pi and pj such that outi ≠ outj ).
Examples of such problems are the following.

Routing Tables In this case, the input of a process is its position in the network
and its output is a routing table that, for each process pj , defines the local channel
that pi has to use so that the messages to pj travel along the path with the shortest
distance (let us recall that the distance is the minimal number of channels separating
pi from pj ).

Eccentricity, Diameter, Radius, Center, and Peripheral Vertex The eccentric-
ity of a process pi (vertex) in a communication graph is the longest distance from
pi to any other process. The eccentricity of pi is denoted ecci .
The diameter D of a graph is its largest eccentricity: D = max1≤i≤n (ecci ). The
radius of a graph is its minimal eccentricity, a center of a graph is a vertex (process)
whose eccentricity is equal to the radius, while a peripheral vertex (process) is a
vertex (process) whose eccentricity is equal to the diameter.
The computation of the diameter (or the radius) of a communication graph cor-
responds to a function that provides the same parameter to all the processes. Differ-
ently, the computation of its eccentricity by each process corresponds to a function
that does not necessarily provide the processes with the same result. The same oc-
curs when each process has to know if it is a center (or a peripheral process) of the
communication graph.
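These definitions are easily illustrated by a small centralized computation (a sketch: neighbors is an assumed adjacency map of a connected graph, and distances are hop counts obtained by a breadth-first traversal):

    from collections import deque

    def eccentricities(neighbors):
        ecc = {}
        for src in neighbors:
            dist = {src: 0}
            queue = deque([src])
            while queue:                        # breadth-first traversal from src
                u = queue.popleft()
                for w in neighbors[u]:
                    if w not in dist:
                        dist[w] = dist[u] + 1
                        queue.append(w)
            ecc[src] = max(dist.values())       # longest distance from src
        return ecc

    def diameter(ecc): return max(ecc.values())
    def radius(ecc):   return min(ecc.values())
    def centers(ecc):  return {v for v, e in ecc.items() if e == min(ecc.values())}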

Maximal or Minimal Input Simple functions returning the same result to all the
processes are the computation of the maximum (or minimum) of their inputs. This
is typically the case in the election problem where, assuming that processes have
distinct and comparable identities, the input of each process is its identity and their
common output is the greatest (or smallest) of their identities, which defines the
process that is elected.

Cut Vertex A cut vertex in a graph is a vertex (process) whose suppression dis-
connects the graph. Knowledge of such processes is important to analyze message
bottlenecks and fault-tolerance. The global function computed by each process is
here a predicate indicating, at each process, if it is a cut vertex.

3.1.2 Constraints on the Computation

On the Symmetry Side: No Centralized Control A simple solution to compute
a function F () would be to use a broadcast/convergecast as presented in Sect. 1.2.1.
The process pa , which is the root of the spanning tree, broadcasts a request message
GO () and waits until it has received response messages BACK () from each of its
children (each of these messages carries input values from a subset of processes, and
they all collectively carry all the input values except the one of pa ). The root process
pa then knows the input vector IN[1..n]. It can consequently compute OUT[1..n] =
F (IN) with the help of a sequential algorithm, and return its result to each process
along the channels of the spanning tree.
While this solution is viable and worthwhile for some problems, we are interested
in this chapter in a solution in which the control is distributed, that is to say in a
solution in which no process plays a special role. This can be expressed in the form
of the following constraint: the processes must obey the same rules of behavior.

On the Efficiency Side: Do Not Learn More than What Is Necessary A solu-
tion satisfying the previous constraint would be for each process to flood the system
with its input so that all the processes learn all the inputs, i.e., the input vector
IN[1..n]. Then, each process pi can compute the output vector OUT[1..n] and ex-
tracts from it its local result outi = OUT[i].
As an example, if the processes have to compute the shortest distances, they
could first use the algorithm described in Fig. 1.3 (Sect. 1.1.2) in order to learn the
communication graph. They could then use any sequential shortest path algorithm
to compute their shortest distances. This is not satisfactory, since a process pi then learns
more information than what is needed to compute its table of shortest distances. As
we have seen in Sect. 2.1, it is not necessary for it to know the whole structure of
the communication graph in which it is working.
The solutions in which we are interested are thus characterized by a second con-
straint: a process has not to learn information which is useless from the point of
view of the local output it has to compute.

3.2 An Algorithmic Framework

This section presents an algorithmic framework which offers a solution for the class
of distributed computations which have been described previously, while respecting
the symmetry and efficiency constraints that have been stated. This framework is
due to J.-Cl. Bermond, J.-Cl. König, and M. Raynal (1987).

3.2.1 A Round-Based Framework

The algorithmic framework is based on rounds. Moreover, to simplify the presen-
tation, we consider that the communication channels are FIFO (if they are not—as
we have seen in Fig. 2.9, Sect. 2.2.2—each message has to carry a parity bit defined
from the current round of its sender).
Each process executes first an initialization part, followed by an asynchronous
sequence of rounds. If a process receives a message before it started its participation
in the algorithm, it executes its initialization part and starts its first round before
processing the message.

Asynchronous Round-Based Computation Differently from synchronous sys-
tems, the rounds are not given for free in an asynchronous system. Each process pi
has to handle a local variable ri which denotes its current round number.
We first consider that, in each round r it executes, a process sends a message to
each of its neighbors, and receives a message from each of them. (This assumption
on the communication pattern will be relaxed in the next section.) It then executes a
local processing which depends on the particular problem that is solved.

Local Variables at Each Process pi In addition to its local variable ri and its
identity idi , a process manages the following local variables:
• As already indicated, ini is the input parameter provided by pi to solve the prob-
lem, while outi is a local variable that will contain its local output.
• A process pi has ci (1 ≤ ci ≤ n − 1) bidirectional channels, which connect
it to ci distinct neighbor processes. The set channelsi = {1, . . . , ci } is the set
of local indexes that allow these neighbors to be addressed, and the local ar-
ray channeli [1..ci ] defines the local names of these channels (see Exercise 4,
Chap. 1). This means that the local names of the channel (if any) connecting
pi and pj are channeli [x] (where x ∈ channelsi ) at pi and channelj [y] (where
y ∈ channelsj ) at pj .
• newi is a local set that, at the end of each round r, contains all the information
that pi learned during this round (i.e., it receives this information at round r for
the first time).
• infi is a local set that contains all the information that pi has learned since the
beginning of the execution.

Principle The underlying principle is nothing more than the forward/discard prin-
ciple. During a round r, a process sends to its neighbors all the new information it
has received during the previous round (r − 1). It is assumed that it has received its
input during the fictitious round r = 0.
It follows that, during the first round a process learns the inputs of the processes
which are at distance 1 from it, during the second round it learns the inputs of the
processes at distance 2, and more generally it learns the inputs of the processes at
distance d during the round r = d.

Example: Computation of the Routing Tables As a first example, let us consider
the computation of the routing tables at each process pi . Here ini is the identity of
pi (idi ), and it is assumed that no two processes have the same identity. As far as the
output is concerned, outi is here a routing table outi [1..ci ], where each outi [x] is a
set initially empty. At the end of the algorithm, outi [x] will contain the identities of
the processes pj such that channeli [x] is the channel that connects pi to its neighbor
on the shortest distance to pj .
To simplify the presentation, the description of the algorithm assumes that the
processes know the diameter D of the communication graph. It follows from the
design principle of the algorithm that, when ri = D, each process has learned all the

init infi ← {idi }; newi ← {idi }; ri ← 0.

(1) while (ri < D) do
begin asynchronous round
(2) ri ← ri + 1;
(3) for each x ∈ channelsi do send MSG (newi ) on channeli [x] end for;
(4) newi ← ∅;
(5) for each x ∈ channelsi do
(6) wait (MSG (new) received on channeli [x]);
(7) let aux = new \ (infi ∪ newi );
(8) outi [x] ← outi [x] ∪ aux;
(9) newi ← newi ∪ aux
(10) end for;
(11) infi ← infi ∪ newi
end asynchronous round
(12) end for;
(13) return(outi [1..ci ]).

Fig. 3.1 Computation of routing tables defined from distances (code for pi )

information it can ever know. Consequently, if D is known, D rounds are necessary


and sufficient to compute the routing tables.
The corresponding algorithm is described in Fig. 3.1. Each process pi executes
D rounds (line 1). During a round, it first sends to its neighbors what it has learned
during the previous round (line 3), which has been saved in its local variable newi
(line 9; initially, it knows only its identity idi ).
Then, pi waits for a round r message from each of its neighbors. When it receives
such a message on channel channeli [x], it first computes what it learns from this
message (line 7). As already noticed, what it learns are identities of processes at
distance r from it. It consequently adds these identities to the routing table outi [x].
The update of outi [x] (line 8) is such that channeli [x] is the favorite channel to
send messages to the processes whose identity belongs to outi [x]. (If we want to
use an array routing_toi [id], as done in the previous chapter, id ∈ outi [x] means
that routing_toi [id] = channeli [x].) Then pi updates appropriately its local variable
newi (line 9).
Finally, before proceeding to the next round, pi updates infi , which stores all the
identities learned since the beginning (line 11). When it has executed its D rounds, it
returns its local routing table outi [1..ci ] (line 13).
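To make the round structure concrete, the following small sketch (a global synchronous simulation written in Python; the 4-process path, the value of D, and the use of neighbor identities as channel names are illustrative assumptions) mimics the lines of Fig. 3.1:

def routing_tables(adj, D):
    # adj: dict id -> list of neighbor ids (a channel is named here by the
    # neighbor it leads to, a simplification of channel_i[1..c_i])
    inf = {i: {i} for i in adj}                          # inf_i
    new = {i: {i} for i in adj}                          # new_i
    out = {i: {j: set() for j in adj[i]} for i in adj}   # out_i[x]
    for _ in range(D):
        msgs = {i: {j: set(new[j]) for j in adj[i]} for i in adj}  # line 3
        for i in adj:
            new[i] = set()                               # line 4
            for j in adj[i]:
                aux = msgs[i][j] - inf[i] - new[i]       # line 7
                out[i][j] |= aux                         # line 8
                new[i] |= aux                            # line 9
            inf[i] |= new[i]                             # line 11
    return out                                           # line 13

# On the path 1-2-3-4 (diameter D = 3), all messages sent by p1 to p2, p3,
# and p4 leave p1 through its single channel, the one leading to p2.
print(routing_tables({1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}, D=3)[1])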

Cost As previously, let one time unit be the maximal transit time for a message
from a process to one of its neighbors. The time complexity is 2D. The worst case
is when a single process starts the algorithm. It then takes D messages sent sequen-
tially from a process to another process before all processes have started their local
algorithm. Then, all the processes execute D rounds.
Each process executes D rounds, and two messages are exchanged on each chan-
nel at each round. Hence, the total number of messages is 2eD, where e is the num-
ber of channels of the communication graph.

A General Algorithm A general algorithm is easily obtained as follows. Let
ini be the input of pi . The initialization is then replaced by the statements “infi ←
{(idi , ini )}; newi ← {(idi , ini )}”, and line 8 is modified according to the function or the
predicate that one wants to compute.

3.2.2 When the Diameter Is Not Known

This section shows how to eliminate the a priori knowledge of the diameter.

A Simple Predicate Let us observe that, if a process does not learn any new in-
formation at a given round, it will learn no more during the next rounds.
This follows from the fact that, if pi learns nothing new in round r, there is no
process situated at distance r from pi , and consequently no process at a distance
greater than r. Hence, if it learns nothing new in round r, pi will learn nothing new
during any round r′ > r.
Actually, the last round r during which pi learns something new is the round
r = ecci (its eccentricity), but, not knowing ecci , it does not know that it has learned
everything. In contrast, at the end of round r = ecci + 1, pi will have newi = ∅ and
it will then learn that it knows everything. (Let us observe that this predicate also
allows pi to easily compute ecci .)
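The following small sketch (a global simulation in Python; the 5-process ring is an illustrative example) shows the predicate at work: newi becomes empty for the first time at round ecci + 1, which also gives ecci for free:

def first_empty_round(adj, i):
    # Returns the first round r at which new_i = ∅, i.e., ecc_i + 1.
    inf, new, r = {i}, {i}, 0
    while new:
        r += 1
        new = {k for j in new for k in adj[j]} - inf   # learned at round r
        inf |= new
    return r

# On a ring of 5 processes, every eccentricity is 2, so the answer is 3.
ring5 = {i: [(i - 1) % 5, (i + 1) % 5] for i in range(5)}
print(first_empty_round(ring5, 0))   # 3 = ecc + 1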

Not All Processes Terminate During the Same Round Let us observe that, as
not all the processes have necessarily the same eccentricity, they do not terminate
at the same round. Let us consider two neighbor processes pi and pj such that pi
learns no new information during round r while pj does. We have the following:
• As pi and pj are neighbors, we have 0 ≤ |ecci − eccj | ≤ 1. Thus, pj will have
newj = ∅ at the end of round r + 1.
• So that pj does not block during round r + 1 waiting for a message from pi ,
pi has to execute round r + 1, sending pj a message carrying the value
newi = ∅.
It follows that a process pi now executes ecci + 2 rounds. At round r = ecci it
knows everything, at round r = ecci + 1 it knows that it knows everything, and at
round r = ecci + 2 it informs its neighbors of this and terminates.
The corresponding generic algorithm is described in Fig. 3.2. Each process man-
ages an additional local variable com_withi , which contains the channels on which
it has to send and receive messages. The management of this variable is the main
novelty with respect to the algorithm of Fig. 3.1.
If the empty set is received on a channel channeli [x], this channel is withdrawn
from com_withi (line 7). Moreover, to prevent processes from being blocked waiting
forever for a message, a process pi (before it terminates) sends the message MSG (∅)
on each channel which is still open (line 15). It also empties the open channels by
waiting for a message on each of them (line 16).

init infi ← {(idi , ini )}; newi ← {(idi , ini )}; com_withi ← {1, . . . , ci }; ri ← 0.

(1) while (newi ≠ ∅) do


begin asynchronous round
(2) ri ← ri + 1;
(3) for each x ∈ com_withi do send MSG (newi ) on channeli [x] end for;
(4) newi ← ∅;
(5) for each x ∈ com_withi do
(6) wait (MSG (new) received on channeli [x]);
(7) if (new = ∅) then com_withi ← com_withi \ {x}
(8) else let aux = new \ (infi ∪ newi );
(9) specific statements related to the function that is computed;
(10) newi ← newi ∪ aux
(11) end if
(12) end for;
(13) infi ← infi ∪ newi
end asynchronous round
(14) end while;
(15) for each x ∈ com_withi do send MSG (newi ) on channeli [x] end for;
(16) for each x ∈ com_withi do wait (MSG (new) on channeli [x]) end for;
(17) specific statements which compute outi ;
(18) return(outi ).

Fig. 3.2 A diameter-independent generic algorithm (code for pi )

Cost It is easy to see that the time complexity is upper-bounded by D + (D + 2) =


2D + 2, and the message complexity is O(2e(D + 2)).

A Round-Based Algorithm as an Iteration on a Set of Equations The algorithm


of Fig. 3.2 can be rephrased as a set of equations relating the local variables at round
r − 1 and round r. This makes the algorithm instantiated with a function F () easier
to verify than an algorithm computing the same function F () which would not be
based on rounds. These equations use the following notations:
• senti [x]^r denotes the content of the message sent by pi on channeli [x] during
round r,
• receivedi [x]^r denotes the content of the message received by pi on channeli [x]
during round r; moreover, pj is the process that sent this message on its local
channel channelj [y],
• newi^{r−1} is the content of newi at the end of round r − 1,
• com_withi^{r−1} and infi^{r−1} are the values of com_withi and infi at the end of round
r − 1, respectively.
Using these notations, the equations that characterize the generic round-based
algorithm are the following ones for all i ∈ [1..n]:

∀ x ∈ com_withi^{r−1} : senti [x]^r = newi^{r−1} ,
∀ x ∈ com_withi^{r−1} : receivedi [x]^r = sentj [y]^r ,
newi^r = ( ∪_{x ∈ com_withi^r} receivedi [x]^r ) \ infi^{r−1} ,
com_withi^r = com_withi^{r−1} \ {x | receivedi [x]^r = ∅}.

Fig. 3.3 A process graph with three cut vertices

3.3 Distributed Determination of Cut Vertices

3.3.1 Cut Vertices

Definition A cut vertex (or articulation point) of a graph is a vertex whose removal
disconnects the graph. A graph is biconnected if it remains connected after
the deletion of any of its vertices. Thus, a connected communication graph is biconnected
if and only if it has no cut vertex. A connected graph can be decomposed into
a tree whose vertices are its biconnected components.
Given a vertex (process) pi of a connected graph, let Ri be the local relation
defined on the edges (channels) incident to pi as follows. Let e1 and e2 be two
edges incident to pi ; e1 Ri e2 if the edges e1 and e2 belong to the same biconnected
component of the communication graph. It is easy to see that Ri is an equivalence
relation; i.e., it is reflexive, symmetric, and transitive. Thus, if e1 Ri e2 and e2 Ri e3 ,
then the three edges e1 , e2 , and e3 incident to pi belong to the same biconnected
component.

Example As an example, let us consider the graph depicted in Fig. 3.3. The pro-
cesses p4 , p5 , and p8 are cut vertices (the deletion of any of them disconnects the
graph).
There are four biconnected components (ellipses in the figure) denoted C1 , C2 ,
C3 , and C4 . The deletion of any single process (vertex) inside a component
does not disconnect that component.
When looking at the four edges a, b, c, and d connecting process p8 to its neigh-
bors, we have a R8 b and c R8 d, but we do not have a R8 d. This is because the
channels a and b belong to the biconnected component C3 , the channels c and d be-
long to the biconnected component C4 , and C3 ∪ C4 is not a biconnected component
of the communication graph.

Fig. 3.4 Determining cut vertices: principle

3.3.2 An Algorithm Determining Cut Vertices

Principle of the Algorithm The algorithm, which is due to J.-Cl. Bermond and
J.-Cl. König (1991), is based on the following simple idea.
Given a process pi which is on one or more cycles, let us consider an elementary
cycle to which pi belongs (an elementary cycle is a cycle that does not pass several
times through the same vertex/process). Let a = channeli [x] and b = channeli [y] be
the two distinct channels of pi on this cycle (see an example in Fig. 3.4). Moreover,
let pj be the process on this cycle with the greatest distance to pi , and let ℓ be this
distance. Hence, the length of the elementary cycle including pi and pj is 2ℓ or
2ℓ + 1 (in the figure, it is 2ℓ + 1 = 3 + 4).
The principle that underlies the design of the algorithm follows directly from (a)
the message exchange pattern of the generic framework, and (b) the fact that, during
each round r, a process sends only the new information it has learned during the
previous round. More precisely, we have the following.
• It follows from the message exchange pattern that pi receives idj during round
r = ℓ (in the figure, idj is learned from channel b). Moreover, pi also receives
idj from channel a during round r = ℓ if the length of the cycle is even, or during
round r = ℓ + 1 if the length of the cycle is odd. In the figure, as the length of the
elementary cycle is odd, pi receives idj a first time on channel b during round ℓ
and a second time on channel a during round ℓ + 1. Hence, (a, b) ∈ Ri , i.e.,
the channels a and b (which are incident to pi ) belong to the same biconnected
component.
This observation provides a simple local predicate, which allows a process
pi to determine if two of its incident channels belong to the same biconnected
component.
• Let us now consider two channels incident to the same process pi that are not
in the same biconnected component (as an example, this is the case of channels
b and c, both incident to p8 ). As these two channels do not belong to a same
elementary cycle including pi , this process cannot receive two messages carrying
the same process identity during the same round or two consecutive rounds. Thus,
the previous predicate is an “if and only if” predicate.

Description of the Algorithm The algorithm is described in Fig. 3.5. This algorithm
is an enrichment of the generic algorithm of Fig. 3.2: the only additions are
lines N.9_1–N.9_10, which replace line 9, and lines N.17_1–N.17_5, which replace

init infi ← {idi }; newi ← {idi }; com_withi ← {1, . . . , ci }; ri ← 0.

(1) while (newi ≠ ∅) do


begin asynchronous round
(2) ri ← ri + 1;
(3) for each x ∈ com_withi do send MSG (newi ) on channeli [x] end for;
(4) newi ← ∅;
(5) for each x ∈ com_withi do
(6) wait (MSG (new) received on channeli [x]);
(7) if (new = ∅) then com_withi ← com_withi \ {x}
(8) else let aux = new \ (infi ∪ newi );
(N.9_1) for each id ∈ new do
(N.9_2) if (id ∉ infi ∪ newi )
(N.9_3) then routing_toi [id] ← x; disti [id] ← ri
(N.9_4) else if (ri = disti [id] ∨ ri = disti [id] + 1)
(N.9_5) then let y = routing_toi [id];
(N.9_6) (channeli [y], channeli [x]) ∈ Ri
(N.9_7) (they belong to the same biconnected component)
(N.9_8) end if
(N.9_9) end if
(N.9_10) end for;
(10) newi ← newi ∪ aux
(11) end if
(12) end for;
(13) infi ← infi ∪ newi
end asynchronous round
(14) end while;
(15) for each x ∈ com_withi do send MSG (newi ) on channeli [x] end for;
(16) for each x ∈ com_withi do wait (MSG (new) on channeli [x]) end for;
(N.17_1) Considering channeli [1..ci ] and the pairs computed at line N.9_6, compute the
(N.17_2) transitive closure of the local relation Ri (belong to a same biconnected component);
(N.17_3) if (all channels of pi belong to the same biconnected component)
(N.17_4) then outi ← no else outi ← yes
(N.17_5) end if;
(18) return(outi ).

Fig. 3.5 An algorithm determining the cut vertices (code for pi )

line 17. Said differently, by suppressing all the lines whose number is prefixed by
N, we obtain the algorithm of Fig. 3.2.
Each process pi manages two array-like data structures, denoted routing_toi and
disti . Given a process identity id, disti [id] contains the distance from pi to the pro-
cess whose identity is id, and routing_toi [id] contains the local channel on which
messages for this process have to be sent. We use an array-like structure to make
the presentation clearer. Since initially a process pi knows neither the number of
processes nor their identities, a dynamic list has to be used to implement routing_toi
and disti .
Thus, let us consider a process pi that, during a round r, receives on a channel
channeli [x] a message carrying the value new = ∅. Let id ∈ new (line N.9_1) and
let pj be the corresponding process (i.e., id = idj ). There are two cases:

• If id is an identity not yet known by pi (line N.9_2), pi creates the variables


routing_toi [id] and disti [id] and assigns them x (the appropriate local channel
to send message to pj ) and ri (the distance separating pi and pj ), respectively
(line N.9_3).
• If id is an identity known by pi , it checks the biconnectivity local predicate,
namely (ri = disti [id]) ∨ (ri = disti [id] + 1) (line N.9_4). If this predicate is true,
pi has received the same identity twice, either during the same round ri or during
the consecutive rounds ri − 1 and ri . As we have seen, this means that channeli [x] (the chan-
on which the message MSG (new) has been received) and channeli [y] (the chan-
nel on which id has been received for the first time, line N.9_5), belong to the
same biconnected component. Hence, we have (channeli [x], channeli [y]) ∈ Ri
(line N.9_6).
The second enrichment of the generic algorithm is the computation of the re-
sult at lines N.17_1–N.17_3. A process pi computes the transitive closure of its
relation Ri , which has been incrementally computed at line N.9_6. Let Ri∗ denote
this transitive closure. If Ri∗ is made up of a single equivalence class, then pi is
not a cut vertex, and the result is consequently the value no. If Ri∗ contains two or
more equivalence classes, each class defines a distinct biconnected component to
which pi belongs. In this case, pi is a cut vertex of the communication graph, and
it consequently locally outputs yes (lines N.17_3–N.17_4).
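The transitive closure itself is a plain union-find computation. The following sketch (in Python; the channel names and the pairs mirror process p8 of Fig. 3.3 and are illustrative) decides whether a process is a cut vertex from the pairs collected at line N.9_6:

def is_cut_vertex(channels, related_pairs):
    # related_pairs: the pairs (channel_i[y], channel_i[x]) in R_i collected
    # at line N.9_6; union-find yields the transitive closure R_i*.
    parent = {c: c for c in channels}
    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]   # path halving
            c = parent[c]
        return c
    for a, b in related_pairs:
        parent[find(a)] = find(b)
    # p_i is a cut vertex iff R_i* has two or more equivalence classes
    return len({find(c) for c in channels}) > 1

# Process p8: a R8 b and c R8 d, but not a R8 d, so p8 is a cut vertex.
print(is_cut_vertex(['a', 'b', 'c', 'd'], [('a', 'b'), ('c', 'd')]))  # True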

3.4 Improving the Framework

This section shows that the previous generic algorithm can be improved so as to
reduce the size and the number of messages that are exchanged.

3.4.1 Two Types of Filtering

Filtering That Affects the Content of Messages A trivial way to reduce the size
of messages exchanged during a round r is for each process pi to not send on a
channel channeli [x] the complete information newi it has learned during the previ-
ous round, but to send only the content of newi minus what has been received on
this channel during the previous round.
Let receivedi [x] be the information received by pi on the channel channeli [x]
during a round r. Thus, pi has to send only senti [x] = newi \ receivedi [x] on
channeli [x] during the round r + 1. This is the first filtering: it affects the content of
messages themselves.

Filtering That Affects the Channels That Are Used The idea is here for a pro-
cess pi to manage its channels according to their potential for carrying new infor-
mation.

To that end, let us consider two neighbor processes pi and pj such that we have
newi = newj at the end of round r. According to the message exchange pattern and
the round-based synchronization, this means that pi and pj have exactly the same
set E of processes at distance r, and during that round both learn the input values of
the processes in the set E. Hence, if pi and pj knew that newi = newj , they could
stop exchanging messages. This is because the new information they will acquire
in future rounds will be the inputs of the processes at distance r + 1, r + 2, etc.,
which they will obtain independently from one another (by way of the processes in
the set E).
The problem of exploiting this property lies in the fact that, at the end of a round,
a process does not know the value of the set new that it will receive from each of its
neighbors. However, at the end of a round r, each process pi knows, for each of its
channels channeli [x], the value senti [x] it has sent on this channel, and the value
receivedi [x] it has received on this channel. Four cases may occur (let pj denote the
neighbor to which pi is connected by channeli [x]).
1. senti [x] = receivedi [x]. In this case, pi and pj sent to each other the same infor-
mation during round r. They do not learn any new information from each other.
What is learned by pi is the fact that it has learned in round (r − 1) the informa-
tion that pj sent it in round r, and similarly for pj . Hence, from now on, they
will not learn anything more from each other.
2. senti [x] ⊂ receivedi [x]. In this case, pj does not learn anything new from pi in
the current round. Hence, it will not learn anything new from pi in the future
rounds.
3. receivedi [x] ⊂ senti [x]. This case is the inverse of the previous one: pi learns
that it will never learn new information on the channel channeli [x], in all future
rounds.
4. In the case where senti [x] and receivedi [x] cannot be compared, both pi and
pj learn new information from each other, receivedi [x] \ senti [x] as far as pi is
concerned.
These items allow for the implementation of the second filtering. It is based on
the values carried by the messages for managing the use of the communication chan-
nels.

3.4.2 An Improved Algorithm

Two More Local Variables To implement the previous filtering, each process pi
is provided with two local set variables, denoted c_ini and c_outi . They contain
indexes of local channels, and are initialized to {1, . . . , ci }. These variables are used
as follows in order to implement the management of the communication channels
(a small sketch follows the list):
• When senti [x] = receivedi [x] (item 1), x is suppressed from both c_ini and
c_outi .
• When senti [x] ⊂ receivedi [x] (item 2), x is suppressed from c_outi .
• When receivedi [x] ⊂ senti [x] (item 3), x is suppressed from c_ini .
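Here is the announced sketch of these rules (in Python; the channel index and the example sets are illustrative, and <= denotes set inclusion):

def filter_channel(x, sent, received, c_in, c_out):
    # Compare what was sent and received on channel x during the round.
    if sent <= received:      # items 1 and 2: p_j learns nothing new from p_i
        c_out.discard(x)
    if received <= sent:      # items 1 and 3: p_i learns nothing new from p_j
        c_in.discard(x)
    # item 4 (incomparable sets): both directions remain open

c_in, c_out = {1, 2}, {1, 2}
filter_channel(1, sent={'a', 'b'}, received={'a'}, c_in=c_in, c_out=c_out)
print(c_in, c_out)   # {2} {1, 2}: channel 1 is closed for input only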

init infi ← {(idi , ini )}; newi ← {(idi , ini )}; ri ← 0;


c_ini ← {1, . . . , ci }; c_outi ← {1, . . . , ci };
for each x ∈ {1, . . . , ci } do receivedi [x] ← ∅ end for.

(1) while (c_ini ≠ ∅) ∨ (c_outi ≠ ∅) do


begin asynchronous round
(2) ri ← ri + 1;
(3) for each x ∈ c_outi do
(4) if (x ∈ c_ini ) then senti [x] ← newi \ receivedi [x] else senti [x] ← newi end if;
(5) send MSG (senti [x]) on channeli [x];
(6) if (senti [x] = ∅) then c_outi ← c_outi \ {x} end if
(7) end for;
(8) newi ← ∅;
(9) for each x ∈ c_ini do
(10) wait (MSG (receivedi [x]) received on channeli [x]);
(11) if (x ∈ c_outi )
(12) then if (senti [x] ⊆ receivedi [x]) then c_outi ← c_outi \ {x} end if;
(13) if (receivedi [x] ⊆ senti [x]) then c_ini ← c_ini \ {x} end if
(14) else if (receivedi [x] = ∅) then c_ini ← c_ini \ {x} end if
(15) end if;
(16) newi ← newi ∪ (receivedi [x] \ infi )
(17) end for;
(18) infi ← infi ∪ newi
end asynchronous round
(19) end while;
(20) return(infi ).

Fig. 3.6 A general algorithm with filtering (code for pi )

The Improved Algorithm The final algorithm is described in Fig. 3.6. It is the
algorithm of Fig. 3.2 modified according to the previous discussion. The local ter-
mination is now satisfied when a process pi can no longer communicate; i.e., it is
captured by the local predicate (c_ini = ∅) ∧ (c_outi = ∅).
When it executes its “send” part of the algorithm (lines 3–7), a process pi has
now to compute the value senti [x] sent on each open output channel channeli [x].
Moreover, if senti [x] = ∅, this channel is suppressed from the set c_outi , which
contains the indexes of the output channels of pi that are still open. In this case, the
receiving process pj will also suppress the corresponding channel from its set of
open input channels c_inj (line 14).
When it executes its “receive” part of the algorithm (lines 9–17), a process pi
updates its set of input channels c_ini and its set of output channels c_outi according
to the value receivedi [x] it has received on each input channel that is still open (i.e.,
on each channeli [x] such that x ∈ c_ini ).

Complexity The algorithm terminates in D + 1 rounds. This comes from the fact
that when senti [x] = ∅, both the sender pi and the receiver pj withdraw the cor-
responding channel from c_outi and c_inj , respectively. The maximum number of
messages is consequently 2e(D + 1). The time complexity is 2D + 1 in the worst
case (which occurs when a single process starts, its first round message wakes up

other processes, etc., and the eccentricity of the starting process is equal to the di-
ameter D of the communication graph).

3.5 The Case of Regular Communication Graphs

3.5.1 Tradeoff Between Graph Topology and Number of Rounds

A graph is characterized by several parameters, among which its number of vertices


n, its number of edges e, its diameter D, and its maximal degree Δ are particularly
important.

On the Message Complexity This appears clearly in the message complexity (de-
noted C in the following) of the previous algorithm for which C is upper bounded
by 2e(D + 1). If D is known by the processes, one round is saved, and we have
C = 2eD. This means that
• If the graph is fully connected we have D = 1, e = n(n − 1)/2, and C = O(n²).
• If the graph is a tree we have e = (n − 1), and C = O(nD).
This shows that it can be interesting to first build a spanning tree of the com-
munication graph and then use it repeatedly to compute global functions. However,
for some problems, a tree is not satisfactory because the tree that is obtained can
be strongly unbalanced in the sense that processes may have a distinct number of
neighbors.

The Notion of a Regular Graph Hence, for some problems, we are interested in
communication graphs in which the processes have the same number of neighbors
(i.e., the same degree Δ). When they exist, such graphs are called regular. In such a
graph we have e = nΔ/2, and consequently we obtain

C = nΔD.

This formula exhibits a strong relation linking three of the main parameters
associated with a regular graph.

What Regular Graphs Can Be Built? Given Δ and D, Moore’s bound (1958) is
an upper bound on the maximal number of vertices (processes) that a regular graph
with diameter D and degree Δ can have. This number is denoted n(D, Δ), and we
have n(D, Δ) ≤ 1 + Δ + Δ(Δ − 1) + · · · + Δ(Δ − 1)^{D−1} , i.e.,

n(D, 2) ≤ 2D + 1, and
n(D, Δ) ≤ (Δ(Δ − 1)^D − 2)/(Δ − 2) for Δ > 2.

This is an upper bound. Therefore (a) it does not mean that regular graphs for
which n(D, Δ) is equal to the bound exist for any pair (D, Δ), and (b) when such
a graph exists, it does not state how to build it. However, this bound states that, in
the regular graphs that can be built, we have Δ ≥ n^{1/D}. It follows that, in the
regular networks that can be built, we have

C = D n^{(D+1)/D} .

This formula clearly relates the number of rounds D and the number of messages
exchanged at each round, namely n^{(D+1)/D} = n · n^{1/D} .
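The bound is easy to evaluate numerically; here is a small sketch in Python (the example graphs are classical ones known to attain the bound):

def moore_bound(D, delta):
    # Maximal number of vertices of a regular graph of degree delta and
    # diameter D (Moore's bound).
    if delta == 2:
        return 2 * D + 1
    return (delta * (delta - 1) ** D - 2) // (delta - 2)

print(moore_bound(1, 2))   # 3: attained by the ring of three processes
print(moore_bound(1, 4))   # 5: attained by the complete graph K5
print(moore_bound(2, 3))   # 10: attained by the Petersen graph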

3.5.2 De Bruijn Graphs

The graphs known as De Bruijn’s graphs are directed regular graphs, which can
be easily built. This section presents them, and shows their interest in computing
global functions. (These graphs can also be used as overlay structures in distributed
applications.)
Let x be a vertex of a directed graph. Δ+ (x) denotes its input degree (number of
incoming edges), while Δ− (x) denotes its output degree (number of outgoing edges).
In a regular network, we have ∀ x : Δ+ (x) = Δ− (x) = Δ, and the value Δ defines
the degree of the graph.

De Bruijn’s Graph Let us consider a vocabulary V of Δ letters (e.g., {0, 1, 2} for
Δ = 3).
• The vertices are all the words of length D that can be built on a vocabulary V of
Δ letters. Hence, there are n = Δ^D vertices.
• Each vertex x = [x1 , . . . , xD−1 , xD ] has Δ output edges that connect it to the
vertices y = [x2 , . . . , xD , α], where α ∈ V (this is called the shifting property).
It follows from this definition that the input channels of a vertex x = [x1 , . . . ,
xD−1 , xD ] come from the Δ vertices labeled [β, x1 , . . . , xD−1 ], where β ∈ V .
Let us also observe that the definition of the directed edges implies that each
vertex labeled [a, a, . . . , a], a ∈ V , has a channel to itself (this channel counts then
as both an input channel and an output channel).
A De Bruijn’s graph defined from a specific pair (Δ, D) is denoted dB(Δ, D).
Such a graph has n = Δ^D vertices and e = nΔ directed edges.
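The following sketch (in Python; words over the vocabulary {0, . . . , Δ − 1} are represented as strings) builds dB(Δ, D) directly from the shifting property:

from itertools import product

def de_bruijn(delta, D):
    vocab = [str(a) for a in range(delta)]
    vertices = [''.join(w) for w in product(vocab, repeat=D)]
    # shifting property: x1...xD has an output edge to x2...xD a, for a in V
    edges = [(x, x[1:] + a) for x in vertices for a in vocab]
    return vertices, edges

vertices, edges = de_bruijn(2, 2)
print(len(vertices), len(edges))                 # 4 vertices, 8 = nΔ edges
print(sorted(e for e in edges if e[0] == '00'))  # [('00', '00'), ('00', '01')]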

Examples of De Bruijn’s Graphs Examples of directed De Bruijn’s graphs are
described in Fig. 3.7.
• The graph at the top of the figure is dB(2,1). We have Δ = 2, D = 1, and n =
2^1 = 2 vertices.
• The graph in the middle of the figure is dB(2,2). We have Δ = 2, D = 2, and
n = 2^2 = 4 vertices.
• The graph at the bottom of the figure is dB(2,3). We have Δ = 2, D = 3, and
n = 2^3 = 8 vertices.

Fig. 3.7 The De Bruijn’s directed networks dB(2,1), dB(2,2), and dB(2,3)

A Fundamental Property of a De Bruijn’s Graph In addition to being easily


built, De Bruijn’s graphs possess a noteworthy property which makes them attractive
for the computation of global functions, namely there is exactly one directed path
of length D between any pair of vertices (including each pair of the form (x, x)).

Computing a Function on a De Bruijn’s Graph The algorithm described in


Fig. 3.8 is tailored to compute a global function F () in a round-based synchronous
(or asynchronous) distributed system whose communication graph is a De Bruijn’s
graph (as seen in Fig. 2.9, messages have to carry the parity bit of the current round
number in the asynchronous case).
This algorithm is designed to benefit from the fundamental property of De
Bruijn’s graphs. Let c_ini denote the set of Δ input channels and c_outi denote the
set of Δ output channels of pi . At the end of a round, the local variable receivedi
contains all the pairs (j, inj ) received by pi during that round (lines 3–7). Initially,
receivedi contains the input pair of pi , i.e., (i, ini ). When it starts a round, a process
pi sends the value of receivedi on each of its output channels (line 2). Hence, during
a round r, a process sends on each of its output channels what it has learned from
all its input channels during the previous round (r − 1). There is neither filtering nor
memorization of what has been received during all previous rounds r′ < r.
It follows from the fundamental property of De Bruijn’s graphs that, at the
end of the last round r = D, the set receivedi of each process pi contains all
the input values, each exactly once; i.e., we then have receivedi = {(1, in1 ),
(2, in2 ), . . . , (n, inn )}.

init receivedi ← {(i, ini )}.

(1) when r = 1, 2, . . . , D do
begin synchronous round
(2) for each x ∈ c_outi do send MSG (receivedi ) on channeli [x] end for;
(3) receivedi ← ∅;
(4) for each x ∈ c_ini do
(5) wait (MSG (rec) received on channeli [x]);
(6) receivedi ← receivedi ∪ rec
(7) end for
end synchronous round
(8) end when;
(9) outi ← F (receivedi );
(10) return(outi ).

Fig. 3.8 A generic algorithm for a De Bruijn’s communication graph (code for pi )

Each process can consequently compute F (receivedi ) and
return the corresponding result.

A Simple Example Let us consider the communication graph dB(2,2) (the one
described in the middle of Fig. 3.7). We have the following:
• At the end of the first round:
– The process labeled 00 is such that received00 = {(00, in00 ), (10, in10 )}.
– The process labeled 10 is such that received10 = {(01, in01 ), (11, in11 )}.
• At the end of the second round, the process labeled 01 is such that received01
contains the values of received00 and received10 computed at the previous round.
We consequently have received01 = {(00, in00 ), (10, in10 ), (01, in01 ), (11, in11 )}.
If the function F () is associative and commutative, a process can compute
to_send = F (receivedi ) at the end of each round, and send this value instead of
receivedi during the next round (line 2). Merging of files, max(), min(), and + are
examples of such functions.
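The following global simulation (in Python; the input values are illustrative, and the D rounds are simulated globally instead of with real messages) runs the algorithm of Fig. 3.8 on dB(2,2) and then applies F = max():

from itertools import product

def de_bruijn_rounds(delta, D, inputs):
    vocab = [str(a) for a in range(delta)]
    vertices = [''.join(w) for w in product(vocab, repeat=D)]
    received = {x: {(x, inputs[x])} for x in vertices}   # init of Fig. 3.8
    for _ in range(D):
        nxt = {x: set() for x in vertices}
        for x in vertices:            # x sends received_x on its output edges
            for a in vocab:
                nxt[x[1:] + a] |= received[x]
        received = nxt                # lines 3-7: union of incoming messages
    return received

out = de_bruijn_rounds(2, 2, {'00': 3, '01': 1, '10': 4, '11': 5})
print(sorted(out['01']))                            # all four (id, input) pairs
print({x: max(v for _, v in out[x]) for x in out})  # F = max: 5 everywhere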

3.6 Summary

This chapter has presented a general framework to compute global functions on a set
of processes which are the nodes of a network. The main features of this framework
are that all processes execute the same local algorithm, and no process learns more
than what it needs to compute F . Moreover, the knowledge of the diameter D is not
necessary and the time complexity is 2(D + 1), while the total number of messages
is upper bounded by 2e(D + 2), where e is the number of communication channels.

3.7 Bibliographic Notes

• The general round-based framework presented in Sect. 3.2 and its improvement
presented in Sect. 3.4 are due to J.-Cl. Bermond, J.-Cl. König, and M. Ray-
nal [48].
• The algorithm that computes the cut vertices of a communication graph is due to
J.-Cl. Bermond and J.-Cl. König [47]. Other distributed algorithms determining
cut vertices have been proposed (e.g., [187]).
• The tradeoff between the number of rounds and the number of messages is ad-
dressed in [223, 243] and in the books [292, 319].
• The use of the general framework in regular directed networks has been inves-
tigated in [46]. Properties of regular networks (such as hypercubes, De Bruijn’s
graphs, and Kautz’s graphs) are presented in [45, 49].
• A general technique for network synchronization is presented in [27].
• A graph problem is local if it can be solved by a distributed algorithm with time
complexity smaller than D (the diameter of the corresponding graph). The inter-
ested reader will find in [236] a study on the locality of the graph coloring problem
in rings, regular trees of radius r, and n-vertex graphs with largest degree Δ.

3.8 Problem
1. Adapt the general framework presented in Sect. 3.2 to communication graphs in
which the channels are unidirectional. It is, however, assumed that the graphs are
strongly connected (there is a directed path from any process to any process).
Chapter 4
Leader Election Algorithms

This chapter is on the leader election problem. Electing a leader consists, for the
processes of a distributed system, in selecting one of them. Usually, once elected, the
leader process is required to play a special role for coordination or control purposes.
Leader election is a form of symmetry breaking in a distributed system. After
showing that no leader can be elected in anonymous regular networks (such as
rings), this chapter presents several leader election algorithms with a special focus
on non-anonymous ring networks.

Keywords Anonymous network · Election · Message complexity ·


Process identity · Ring network · Time complexity ·
Unidirectional versus bidirectional ring

4.1 The Leader Election Problem


4.1.1 Problem Definition

Let each process pi be endowed with two local Boolean variables electedi and
donei , both initialized to false. (Let us recall that i is the index of pi , i.e., a nota-
tional convenience that allows us to distinguish processes. Indexes are not known by
the processes.) The Boolean variables electedi are such that eventually exactly one
of them becomes true, while each Boolean variable donei becomes true when the
corresponding process pi learns that a process has been elected. More formally, the
election problem is defined by the following safety and liveness properties, where
vari^τ denotes the value of the local variable vari at time τ .
• Safety property:
– (∀ i : electedi^τ ⇒ (∀ τ′ ≥ τ : electedi^{τ′} )) ∧ (∀ i : donei^τ ⇒ (∀ τ′ ≥ τ : donei^{τ′} )).
– ∀ i, j, τ, τ′ : (i ≠ j ) ⇒ ¬(electedi^τ ∧ electedj^{τ′} ).
– ∀ i : donei^τ ⇒ (∃ j, τ′ ≤ τ : electedj^{τ′} ).
The first property states that the local Boolean variables electedi and donei are
stable (once true, they remain true forever). The second property states that at
most one process is elected, while the third property states that a process cannot
learn that the election has terminated while no process has yet been elected.


• Termination property:
– ∃ i, τ : electedi^τ .
– ∀ i : ∃ τ : donei^τ .
This liveness property states that a process is eventually elected, and this fact is
eventually known by all processes.

4.1.2 Anonymous Systems: An Impossibility Result

This section assumes that the processes have no identities. Said differently, there is
no way to distinguish a process pi from another process pj . It follows that all the
processes have the same number of neighbors, the same code, and the same initial
state (otherwise these features could be considered as their identities). The next
theorem shows that, in such an anonymity context, there is no solution to the leader
election problem. For simplicity reasons, the theorem considers the case where the
network is a ring.

Theorem 2 There is no deterministic algorithm for leader election in a


(bi/unidirectional) ring of n > 1 processes.

Proof The proof is by contradiction. Let us assume that there is a deterministic


distributed algorithm A that solves the leader election problem in an anonymous
ring. A is composed of n local deterministic algorithms A1 , . . . , Ai , . . . , An , where
Ai is the local algorithm executed by pi . Moreover, due to anonymity, we have
A1 = · · · = An . We show that A cannot satisfy both the safety and the termination
properties that define the leader election problem.
For any i, let σi^0 denote the initial state of pi , and let Σ^0 = (σ1^0 , . . . , σn^0 ) denote
the initial global state. As all the processes execute the same local deterministic
algorithm Ai , there is a synchronous execution during which all the processes execute
the same step of their local deterministic algorithm Ai , and each process pi
proceeds consequently from σi^0 to σi^1 (this step can be the execution of an internal
statement or a communication operation). Moreover, as σ1^0 = · · · = σn^0 and Ai
is deterministic and the same for all processes, it follows that the set of processes
progresses from Σ^0 = (σ1^0 , . . . , σn^0 ) to Σ^1 = (σ1^1 , . . . , σn^1 ), where σ1^1 = · · · = σn^1 . The
important point here is that the execution progresses from a symmetric global state
Σ^0 to another symmetric global state Σ^1 (“symmetric” meaning here that all processes
are in the same local state).
As all the Ai are deterministic and identical, the previous synchronous execution can
be continued, and the set of processes then progresses from the symmetric global
state Σ^1 to a new symmetric global state Σ^2 , etc. It follows that the synchronous
execution can be extended forever, from a symmetric global state Σ to another
symmetric global state Σ′ .

Consequently, the previous synchronous execution never terminates. This is be-


cause for it to terminate, it has to enter an asymmetric global state (due to its very
definition, the global state in which a process is elected is asymmetric). Hence, the
algorithm A fails to satisfy the termination property of the leader election problem,
which concludes the proof of the theorem. 

4.1.3 Basic Assumptions and Principles of the Election Algorithms

Basic Assumptions Due to the previous theorem, the rest of this chapter assumes
that each process pi has an identity idi , and that distinct processes have different
identities (i.e., i ≠ j ⇒ idi ≠ idj ). Moreover, it is assumed that identities can be
compared with the help of the operators <, =, >.

Basic Principles of the Election Algorithms The basic idea of election algo-
rithms consists in electing the process whose identity is an extremum (the greatest
identity or the smallest one) among the set of all processes or a set of candidate
processes. As no two processes have the same identity and all identities can be com-
pared, there is a single extremum identity.

4.2 A Simple O(n²) Leader Election Algorithm for Unidirectional Rings

4.2.1 Context and Principle

Context This section considers a network structured as a unidirectional ring, i.e.,


each process has two neighbors, one from which it can receive messages and another
one to which it can send messages. Moreover, a channel connecting two neighbor
processes is not required to be FIFO (first in first out).
A process knows initially its identity and the fact that no two processes have the
same identity. It does not know the total number of processes n that define the ring.

Principle This algorithm, which is due to E. Chang and R. Roberts (1979), is


one of the very first election algorithms. The idea is the following. Each process pi
sends its identity on the ring, and stops the progress of any identity idj it receives
if idj < idi . As all the processes have distinct identities, it follows that the only
identity whose progress on the ring cannot be stopped is the highest identity. This is
the identity of the process that is eventually elected.

when START () is received do


(1) if (¬parti ) then parti ← true; send ELECTION (idi ) end if.

when ELECTION (id) is received do


(2) case (id > idi ) then parti ← true; send ELECTION (id)
(3) (id < idi ) then if (¬parti ) then parti ← true; send ELECTION (idi ) end if
(4) (id = idi ) then send ELECTED (id); electedi ← true
(5) end case.

when ELECTED (id) is received do


(6) leaderi ← id; donei ← true;
(7) if (id ≠ idi ) then electedi ← false; send ELECTED (id) end if.

Fig. 4.1 Chang and Robert’s election algorithm (code for pi )

4.2.2 The Algorithm

The code of the algorithm is described in Fig. 4.1. In addition to its identity idi , and
the Boolean variables electedi and donei , each process pi manages another Boolean
denoted parti (and initialized to false), and a variable leaderi that will contain the
identity of the elected process.
The processes that launch the election are the processes pi that receive the exter-
nal message START () while parti = false. When this occurs, pi becomes a partici-
pant and sends the message ELECTION (idi ) on its single outgoing channel (line 1).
When a process pi receives a message ELECTION (id) on its single input channel,
there are three cases. If id > idi , pi cannot be the winner, and it forwards the mes-
sage it has received (line 2). If id < idi , then, if not yet done, pi sends the message
ELECTION (idi ) (line 3). Finally, if id = idi , its message ELECTION (idi ) has visited all the
processes, and consequently idi is the highest identity. So, pi is the elected process.
Hence, it informs the other processes by sending the message ELECTED (idi ) on its
outgoing channel (line 4).
When a process pi receives a message ELECTED (id), it learns both the identity of
the leader and the fact that the election is terminated (line 6). Then, it forwards the
message ELECTED (id) so that all processes learn the identity of the elected leader.
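The following compact simulation (in Python; it assumes that every process receives a START () message initially, and it counts only the ELECTION () transfers, not the final turn of ELECTED () messages) illustrates the algorithm:

def chang_roberts(ids):
    # ids[i] is the identity of p_i; p_i sends to p_((i+1) mod n).
    n = len(ids)
    in_transit = {i: [ids[i]] for i in range(n)}   # line 1: everybody starts
    transfers, elected = n, None
    while elected is None:
        nxt = {i: [] for i in range(n)}
        for i in range(n):
            for id_ in in_transit[i]:
                j = (i + 1) % n                    # p_j receives ELECTION(id_)
                if id_ == ids[j]:
                    elected = id_                  # line 4: full turn
                elif id_ > ids[j]:
                    nxt[j].append(id_)             # line 2: forward
                    transfers += 1
                # id_ < ids[j]: discarded (p_j already participates, line 3)
        in_transit = nxt
    return elected, transfers

print(chang_roberts([3, 7, 2, 9, 4]))   # (9, 11)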

4.2.3 Time Cost of the Algorithm

In all cases, the algorithm sends n messages ELECTED (), which are sent one after
the other. Hence, in the following we focus only on the cost due to the messages
ELECTION (). To compute the time complexity, we assume that each message takes
one time unit (i.e., all messages take the worst transfer delay which is defined as
being equal to one time unit).
The best case occurs when only the process pi with the highest identity receives
a message START (). It is easy to see that the time complexity is n (the message

Fig. 4.2 Worst identity distribution for message complexity

ELECTION (idi ) sent by pi is passed from each process to its neighbor on the ring
before returning to pi ).
The worst case occurs when the process pj with the second highest identity (a)
is the only process that receives a message START (), and (b) follows on the ring the
process pi that has the highest identity. Hence, (n − 1) processes separate pj and pi .
The message ELECTION (idj ) takes (n − 1) time units before attaining pi , and then
the message ELECTION (idi ) takes n time units to travel around the ring. Hence, (2n −
1) time units are required. It follows that, whatever the case, an election requires
between n and (2n − 1) time units.

4.2.4 Message Cost of the Algorithm

Best Case and Worst Case The best case for the number of messages is the same
as for the time complexity, which happens when only the process with the high-
est identity receives a message START (). The algorithm then gives rise to exactly
n messages ELECTION (). The worst case is when (a) each process pi receives a
message START () while parti = false, (b) each message takes one time unit, and (c)
the processes are ordered in the ring as follows: first, the process with the highest
identity, then the one with the second highest identity, etc., until the process with
the smallest identity (Fig. 4.2 where idn is the highest identity, etc., until id1 , which
is the smallest one).
It follows that the message START () received by the process with the smallest
identity gives rise to one message ELECTION (), the one with the second smallest
identity gives rise to two messages ELECTION () (one from itself to the process with
the smallest identity, plus another one from this process to the process with the
highest identity), and so on until the process with the highest identity whose message
START () gives rise to n messages ELECTION (). It follows that the total number of
messages is Σ_{i=1}^{n} i = n(n + 1)/2, i.e., O(n²).

Average Case To compute the message complexity in the average case, let P (i, k)
be the probability that the message ELECTION () sent by the process px with the
ith smallest identity is forwarded k times. Assuming that the direction of the ring

is clockwise (as in Fig. 4.2), P (i, k) is the probability that the k − 1 clockwise
neighbors of px (the processes that follow px on the ring) have an identity smaller
than idx and the kth clockwise neighbor of px has an identity greater than idx . Let
us recall that there are (i − 1) processes with an identity smaller than idx and (n − i)
processes with an identity greater than idx .
Let C(a, b) denote the number of ways of choosing b elements in a set of a
elements (the binomial coefficient). We have

P (i, k) = ( C(i − 1, k − 1) / C(n − 1, k − 1) ) × ( (n − i) / (n − k) ).
Since there is a single message that makes a full turn on the ring (the one carrying the
highest identity), let us consider each of the (n − 1) other messages. The expected
number of passes of the ith message (where i denotes the rank of the identity of the
corresponding ELECTION () message) is then, for i ≠ n,

Ei = Σ_{k=1}^{n−1} k P (i, k).

Hence, the expected number of transfers for all messages is

E = n + Σ_{i=1}^{n−1} Σ_{k=1}^{n−1} k P (i, k),

which can be simplified to

E = n + Σ_{k=1}^{n−1} n/(k + 1) = n ( 1 + 1/2 + 1/3 + · · · + 1/n ).

As the harmonic sum 1 + 1/2 + · · · + 1/n is upper bounded by c + log_e n
(where c is a constant), the average number E of ELECTION () messages is
upper bounded by O(n log n).
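This average is easy to check numerically. The following small experiment (in Python; the ring size and the number of trials are illustrative) compares the simulated number of ELECTION () transfers, when all processes start, with n(1 + 1/2 + · · · + 1/n):

import random

def election_transfers(ids):
    # The message of ids[i] stops at the first larger identity met clockwise;
    # the message carrying the highest identity makes a full turn.
    n, total = len(ids), 0
    for i in range(n):
        d = 1
        while d < n and ids[(i + d) % n] < ids[i]:
            d += 1
        total += d if d < n else n
    return total

n, trials = 50, 2000
avg = sum(election_transfers(random.sample(range(10**6), n))
          for _ in range(trials)) / trials
print(round(avg, 1), round(n * sum(1 / k for k in range(1, n + 1)), 1))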

4.2.5 A Simple Variant

The previous algorithm always elects the process with the maximal identity. It is
possible to modify this algorithm to obtain an algorithm that elects the process with
the highest identity among the processes whose participation in the algorithm is
due to the reception of a START () message (this means that, whatever its identity, a
process that does not receive a START () message, or that receives such a message
only after having received an ELECTION () message, cannot be elected).
The corresponding algorithm is depicted in Fig. 4.3. Each process pi manages
a local variable idmaxi which contains the greatest identity of a competing process
seen by pi . Initially, idmaxi is equal to 0.

when START () is received do


(1) if (idmaxi = 0) then idmaxi ← idi ; send ELECTION (idi ) end if.

when ELECTION (id) is received do


(2) case (id > idmaxi ) then idmaxi ← id; send ELECTION (id)
(3) (id < idmaxi ) then skip
(4) (id = idi ) then send ELECTED (id); electedi ← true
(5) end case.

when ELECTED (id) is received do


(6) leaderi ← id; donei ← true;
(7) if (id ≠ idi ) then electedi ← false; send ELECTED (id) end if.

Fig. 4.3 A variant of Chang and Robert’s election algorithm (code for pi )

When it receives a START () message (if it ever receives one), a process pi con-
siders and processes it only if idmaxi = 0. Moreover, when it receives a message
ELECTION (id), a process pi discards it if id < idmaxi (this is because pi has seen a
competing process with a higher identity). The rest of the algorithm is similar to the
algorithm of Fig. 4.1.

4.3 An O(n log n) Leader Election Algorithm for Bidirectional Rings

4.3.1 Context and Principle

Context This section considers a bidirectional ring. As before, each process has
a left neighbor and a right neighbor, but it can now send a message to, and receive
a message from, any of these neighbors. Given a process pi , the notations lefti and
righti are used to denote the channels connecting pi to its left neighbor and to its
right neighbor, respectively.
The notions of left and right are global: they are the same for all processes, i.e.,
going only to left allows a message to visit all processes (and similarly when going
only to right).

Principle The algorithm presented below is due to D.S. Hirschberg and J.B. Sin-
clair (1980). It is based on the following idea. The processes execute asynchronous
rounds. During each round, processes compete, and only the processes that win in
a round r are allowed to continue competing during round r + 1. During the first
round (denoted round 0), all processes compete.
A process pi , which is competing during round r, is a winner at the end of that
round if it has the largest identifier on the part of the ring that spans the 2^r
processes on its left side and the 2^r processes on its right side, i.e., in a continuous
neighborhood of 2^{r+1} + 1 processes. This is illustrated in Fig. 4.4.

Fig. 4.4 Neighborhood of a process pi competing during round r

Fig. 4.5 Competitors at the end of round r are at distance greater than 2^r

If it has the highest identity, pi proceeds to the next round as a competitor. Otherwise,
it no longer competes to become leader. It follows that any two processes
that remain competitors after round r are at a distance d > 2^r (see Fig. 4.5, where
pi and pj are competitors at the end of round r). Said differently, after each round,
the number of processes that compete to become leader is divided at least by two.
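The following small check (in Python; the ring size and identities are illustrative) computes, for each round r, the set of processes whose identity is the largest among the 2^r processes on each side; the counts are non-increasing and reach a single winner:

import random

def survivors(ids, r):
    # Indexes of the processes that win round r.
    n, w = len(ids), 2 ** r
    return [i for i in range(n)
            if all(ids[i] > ids[(i + d) % n] and ids[i] > ids[(i - d) % n]
                   for d in range(1, w + 1))]

ids = random.sample(range(1000), 16)
for r in range(4):                    # rounds such that 2^r < n
    print(r, len(survivors(ids, r)))  # ends with a single competitor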

4.3.2 The Algorithm

To simplify the presentation, it is assumed that each process receives a message


START () before any message sent by the algorithm.
The algorithm is described in Fig. 4.6. When it starts participating in the algo-
rithm, a process pi sends a message ELECTION (idi , r, d) to its left and right neigh-
bors (line 1), where idi is its identity, r = 0 is the round number associated with this
message (let us recall that the first round is numbered 0), and d = 1 is the number
of processes that have already been visited by this message.
When a process pi receives a message ELECTION (id, r, d), its behavior depends
on the respective values of idi and id. If id > idi , there are two cases.
• If d < 2^r , the message has not yet visited the full left (or right) neighborhood of
size 2^r it has to visit. In this case, pi forwards the message ELECTION (id, r, d +
1) to the appropriate right or left neighbor (line 2).
• If d ≥ 2^r , the message has visited all the process neighborhood it had to visit.
As the progress of this message has not been stopped, pi sends back (line 3)
the message REPLY (id, r) to inform the process px whose identity is id that the
message ELECTION (id, r, −) it sent has visited all the processes of its right (or
left) neighborhood of size 2^r without being stopped (this neighborhood starts at
px and ends at pi ).

when START () is received do


(1) send ELECTION (idi , 0, 1) on both lefti and righti .

when ELECTION (id, r, d) is received on lefti (resp., righti ) do


(2) case (id > idi ) ∧ (d < 2^r ) then send ELECTION (id, r, d + 1) to righti (resp., lefti )
(3) (id > idi ) ∧ (d ≥ 2^r ) then send REPLY (id, r) to lefti (resp., righti )
(4) (id < idi ) then skip
(5) (id = idi ) then send ELECTED (id) on lefti ; electedi ← true
(6) end case.

when REPLY (id, r) is received on lefti (resp., righti ) do


(7) if (id = idi )
(8) then send REPLY (id, r) on righti (resp., lefti )
(9) else if (already received REPLY (id, r) from righti (resp., lefti ))
(10) then send ELECTION (idi , r + 1, 1) on both lefti and righti
(11) end if
(12) end if.

when ELECTED (id) is received on righti do


(13) leaderi ← id; donei ← true;
(14) if (id ≠ idi ) then electedi ← false; send ELECTED (id) on lefti end if.

Fig. 4.6 Hirschberg and Sinclair’s election algorithm (code for pi )

If idi > id, the process whose identity is id cannot be elected (this is because
its identity is not the greatest one). Hence, pi stops the progress of the message
ELECTION (id, −, −) (line 4). Finally, if id i = id, the message ELECTION (id, r, −)
sent by pi has visited all the processes without being stopped. It follows that idi is
the greatest identity, and pi consequently sends a message ELECTED (idi ) (line 5),
which will inform all processes that the election is over (lines 13–14).
When a process pi receives a message REPLY (id, r), pi forwards it to the appro-
priate (right or left) neighbor if it is not the final destination process of this message
(line 8). If it is the final destination process (idi = id), and this is the second mes-
sage REPLY (id, r) (i.e., the message coming from the other side of the ring), pi
learns that it has the highest identity in both its left and right neighborhoods of size
2^r (line 9). It consequently proceeds to the next round by sending to its left and
right neighbors the message ELECTION (idi , r + 1, 1) (line 10), which starts its next
round.

4.3.3 Time and Message Complexities

Message Complexity Let us first notice that a process pi remains a competitor on
its left and right neighborhoods, each of size 2^r , only if it has not been defeated by
a process within distance 2^{r−1} on its left or its right neighborhood. Moreover, in a
set of 2^{r−1} + 1 consecutive processes, at most one process can remain a competitor
for round r. It follows that we have the following:

• n processes are competitors on paths of length 2^0 = 1 (round 0),
• at most ⌈n/2⌉ processes are competitors on paths of length 2^1 = 2 (round 1),
• at most ⌈n/3⌉ processes are competitors on paths of length 2^2 (round 2), etc.,
• at most ⌈n/(2^{r−1} + 1)⌉ processes are competitors on paths of length 2^r (round r), etc.
Let us also observe that a process which is a competitor at round r entails the
transfer of at most 4 × 2^r messages (a message ELECTION () and a message REPLY (),
one in each direction, both on the right path and the left path of size 2^r ). It follows
that the total number of ELECTION () and REPLY () messages is upper bounded by

4 ( 1 × n + 2^1 ⌈n/(2^0 + 1)⌉ + 2^2 ⌈n/(2^1 + 1)⌉ + 2^3 ⌈n/(2^2 + 1)⌉ + · · · + 2^r ⌈n/(2^{r−1} + 1)⌉ + · · · ).

Each internal term is less than 2n, and there are at most 1 + log_2 n terms. It follows
that the total number of messages is upper bounded by 8n(1 + log_2 n), i.e.,
O(n log_2 n).

Time Complexity To simplify, let us assume that n = 2^k , and let us consider the
best case, namely, all the processes start simultaneously. Moreover, as usual for time
complexity, it is assumed that each message takes one time unit. The process with
the highest identity will execute from round 0 until round k, and a round r will
take 2^r time units. By summing the time taken by all rounds, the process with the
highest identity will be elected after at most 2(1 + 2^1 + 2^2 + · · · + 2^r + · · · + 2^k ) time
units (the factor 2 is due to a message ELECTION () in one direction plus a message
REPLY () in the other direction). It follows that, in the best case, the time complexity
is upper bounded by 2( (2^{k+1} − 1)/(2 − 1) ) = 4n − 2, i.e., O(n).

The worst case time complexity, which is also O(n), is addressed in Exercise 3.
This means that the time complexity is linear with respect to the size of the ring.

4.4 An O(n log n) Election Algorithm for Unidirectional Rings

4.4.1 Context and Principles

Context This section considers a unidirectional ring network in which the chan-
nels are FIFO (i.e., on each channel, messages are received in their sending order).
As in the previous section, each process pi knows only its identity idi and the fact
that no two processes have the same identity. No process knows the value n.

Principle The algorithm presented below is due to D. Dolev, M. Klawe, and M.


Rodeh (1982). It is based on a very simple and elegant idea.
As in the previous election algorithms, initially all processes compete to be
elected as a leader, and execute consecutive rounds to that end. During each round,
at most half of the processes that are competitors remain competitors during the next
round. Hence, there will be at most O(log2 n) rounds.

Fig. 4.7 Neighbor processes on the unidirectional ring

Fig. 4.8 From the first to the second round

The novel idea is the way the processes simulate a bidirectional ring. Let us
consider three processes pi , pj , and pk such that pi and pk are the left and the
right neighbor of pj on the ring (Fig. 4.7). Moreover, let us assume that a process
receives messages from its right neighbor and sends messages to its left neighbor.
During the first round, each process sends its identity to its left neighbor, and
after it has received the identity idx of its right neighbor px , it forwards idx to its
left neighbor. Hence, when considering Fig. 4.7, pi receives first idj and then idk ,
which means that it knows three identities: idi , idj , and idk . It follows that it can
play the role of pj . More precisely
• If idj > max(idi , idk ), pi considers idj as the greatest identity it has seen and
progresses to the next round as a competitor on behalf of idj .
• If idj < max(idi , idk ), pi stops being a competitor and its role during the next
rounds will be to forward to the left the messages it receives from the right.
It is easy to see that, if pi remains a competitor (on behalf of idj ) during the
second round, its left neighbor ph and its right neighbor pj can no longer remain
competitors (on behalf of idi , and on behalf of idk , respectively). This is because

(idj > max(idi , idk )) ⇒ ( ¬(idi > max(idh , idj )) ∧ ¬(idk > max(idj , idℓ )) ).

It follows that, during the second round, at most half of the processes remain com-
petitors. During that round, the processes that are no longer competitors only for-
ward the messages they receive, while the processes that remain competitors do the
same as during the first round, except that they consider the greatest identity they
have seen so far instead of their initial identity. This is illustrated in Fig. 4.8, where it
is assumed that idj > max(idi , idk ), idℓ < max(idk , idm ), and idm > max(idℓ , idp ),
and consequently pi competes on behalf of idj , and pℓ competes on behalf of idm .
The processes that remain competitors during the second round define a logical ring
with at most n/2 processes. This ring is denoted with dashed arrows in the figure.
Finally, the competitor processes that are winners during the second round pro-
ceed to the third round, etc., until a round with a single competitor is attained, which
occurs after at most 1 + log2 n rounds.

when START () is received do
(1) competitori ← true; maxidi ← idi ; send ELECTION (1, idi ).

when ELECTION (1, id) is received do
(2) if (¬competitori )
(3) then send ELECTION (1, id)
(4) else if (id ≠ maxidi )
(5) then send ELECTION (2, id); proxy_fori ← id
(6) else send ELECTED (id, idi )
(7) end if
(8) end if.

when ELECTION (2, id) is received do
(9) if (¬competitori )
(10) then send ELECTION (2, id)
(11) else if (proxy_fori > max(id, maxidi ))
(12) then maxidi ← proxy_fori ; send ELECTION (1, proxy_fori )
(13) else competitori ← false
(14) end if
(15) end if.

when ELECTED (id1, id2) is received do
(16) leaderi ← id1; donei ← true; electedi ← (id1 = idi );
(17) if (id2 ≠ idi ) then send ELECTED (id1, id2) end if.

Fig. 4.9 Dolev, Klawe, and Rodeh’s election algorithm (code for pi )

4.4.2 The Algorithm

As in Sect. 4.3, it is assumed that each process receives an external message START ()
before any message sent by the algorithm. The algorithm is described in Fig. 4.9.

Local Variables In addition to donei , leaderi , and electedi , each process pi man-
ages three local variables.
• competitori is a Boolean which indicates if pi is currently competing on behalf of
some process identity or is only relaying messages. The two other local variables
are meaningful only when competitori is equal to true.
• maxidi is the greatest identity known by pi .
• proxy_fori is the identity of the process for which pi is competing.

Process Behavior When a process receives a message START (), it initializes
competitori and maxidi before sending a message ELECTION (1, idi ) on its single
outgoing channel (line 1). Let us observe that messages do not carry round numbers.
Actually, round numbers are used only to explain the behavior of the algorithm and
compute the total number of messages.
When a process pi receives a message ELECTION (1, id), pi forwards it on its
outgoing channel if it is no longer a competitor (lines 2–3). If pi is a competitor,
there are two cases.

• If the message ELECTION (1, id) is such that id = maxidi , then it has made a full
turn on the ring, and consequently maxidi is the greatest identity. In this case,
pi sends the message ELECTED (maxidi , idi ) (line 6), which is propagated on the
ring to inform all the processes (lines 16–17).
• If the message ELECTION (1, id) is such that id ≠ maxidi , pi copies id into
proxy_fori , and forwards the message ELECTION (2, id) on its outgoing channel.
When a process pi receives a message ELECTION (2, id), it forwards it (as pre-
viously) on its outgoing channel if it is no longer a competitor (lines 9–10). If it
is a competitor, pi checks if proxy_fori > max(id, maxidi ), i.e., if the identity of
the process it has to compete for (namely, proxy_fori ) is greater than both maxidi
(the identity of the process on behalf of which pi was previously competing) and
the identity id it has just received (line 11). If it is the case, pi updates maxidi
and starts a new round (line 12). Otherwise, proxy_fori is not the highest identity.
Consequently, as pi should compete for an identity that cannot be elected, it stops
competing (line 13).
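To make the rounds concrete, here is a small Python simulation of the algorithm of
Fig. 4.9. It is a sketch only: the message tuples, the per-channel deques, and the
sequential delivery loop are simplifying assumptions of ours (the book's model is
asynchronous), but the tests are exactly those of lines 1–17.

from collections import deque

def dkr_election(ids):
    # Simulate Dolev, Klawe, and Rodeh's election on a unidirectional
    # ring: process i sends to process (i + 1) % n over a FIFO channel.
    n = len(ids)
    competitor = [True] * n              # competitor_i
    maxid = list(ids)                    # maxid_i
    proxy_for = [None] * n               # proxy_for_i
    leader = [None] * n                  # leader_i
    chan = [deque() for _ in range(n)]   # chan[i]: FIFO channel toward p_i
    for i in range(n):                   # START(): line 1
        chan[(i + 1) % n].append(("ELECTION", 1, ids[i]))
    while any(chan):
        i = next(k for k in range(n) if chan[k])
        kind, a, b = chan[i].popleft()
        out = chan[(i + 1) % n]
        if kind == "ELECTION" and a == 1:          # lines 2-8
            if not competitor[i]:
                out.append((kind, a, b))           # mere relay
            elif b != maxid[i]:
                proxy_for[i] = b                   # compete on behalf of b
                out.append(("ELECTION", 2, b))
            else:                                  # b made a full turn
                out.append(("ELECTED", b, ids[i]))
        elif kind == "ELECTION":                   # lines 9-15
            if not competitor[i]:
                out.append((kind, a, b))
            elif proxy_for[i] > max(b, maxid[i]):
                maxid[i] = proxy_for[i]            # start a new round
                out.append(("ELECTION", 1, proxy_for[i]))
            else:
                competitor[i] = False              # relay from now on
        else:                                      # ELECTED: lines 16-17
            leader[i] = a
            if b != ids[i]:
                out.append((kind, a, b))
    return leader

print(dkr_election([3, 7, 2, 9, 5, 1]))            # [9, 9, 9, 9, 9, 9]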

4.4.3 Discussion: Message Complexity and FIFO Channels

Message Complexity During each round, except the last one, each process sends
two messages: a message ELECTION (1, −) and a message ELECTION (2, −). More-
over, there are only ELECTION (1, −) messages during the last round. As we have
seen, there are at most log2 n + 1 rounds. It follows that the number of messages
ELECTION (1, −) and ELECTION (2, −) sent by the algorithm is at most 2n log2 n + n.

Type of Channels Each process receives alternately a message ELECTION (1, −)
followed by a message ELECTION (2, −). As the channels are FIFO, it follows that
the numbers 1 and 2 used to tag the messages are useless: A message ELECTION ()
needs to carry only a process identifier, and consequently, there are only n different
messages ELECTION ().

4.5 Two Particular Cases


Leader Election in an Arbitrary Network To elect a leader in a connected arbi-
trary network, one can use network traversal algorithms such as those described in
Chap. 1.
Each process launches a network traversal with feedback, and all the messages
related to the network traversal launched by a process pi are tagged with its identity
idi . Moreover, each process continues participating only in the network traversal it
has seen that has been launched with the greatest identity. It follows that a single
network traversal will terminate, namely the one launched by the process with the
greatest identity.

(1) rdi ← random(1, n); broadcast RANDOM (rdi );
(2) wait (a message RANDOM (rdx ) from each process px );
(3) electedi ← ((rd1 + · · · + rdn ) mod n) + 1.

Fig. 4.10 Index-based randomized election (code for pi )

When the Indexes Are the Identities When the identity of a process is its index,
and both this fact and n are known by all processes, the leader election problem
is trivial. It is sufficient to statically select an index and define the corresponding
process as the leader. While this works, it has a drawback, namely, the same process
is always elected.
There is a simple way to solve this issue, as soon as the processes can use random
numbers. Let random(1, n) be a function that returns a random number in {1, . . . , n}
each time it is called.
The algorithm described in Fig. 4.10 is a very simple randomized election al-
gorithm. Each process first obtains a random number and sends it to all. Then, it
waits until it has received all random numbers. When this occurs, the processes can
compute the same process identity, and consistently elect the corresponding process
(line 3). The costs of the algorithm are O(1) time units and O(n^2) messages.
The probability that a given process px is elected can be computed from the
specific probability law associated with the function random().
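In Python, the whole algorithm of Fig. 4.10 boils down to a few lines (a sketch;
the function name is ours, and the broadcast/wait phase is condensed into drawing
all values at once):

import random

def randomized_election(n):
    # Line 1: each process p_x draws rd_x in {1, ..., n} and broadcasts it.
    rd = [random.randint(1, n) for _ in range(n)]
    # Line 3: once all draws are known, every process applies the same
    # deterministic rule, hence all elect the same index in {1, ..., n}.
    return (sum(rd) % n) + 1

print(randomized_election(5))   # some index in {1, ..., 5}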

4.6 Summary
This chapter was devoted to election algorithms on a ring. After a simple proof that
no such algorithm exists in anonymous ring networks, the chapter presented three
algorithms for non-anonymous rings. Non-anonymous means here that (a) each pro-
cess pi has an identity idi , (b) no two processes have the same identity, (c) identities
can be compared, (d) initially, a process knows only its identity, and (e) no process
knows n (total number of processes).
Interestingly, this chapter has presented two O(n log n) election algorithms that
are optimal. The first, due to Hirschberg and Sinclair, is suited to bidirectional rings,
while the second, due to Dolev, Klawe, and Rodeh, is suited to both unidirectional
rings and bidirectional rings (this is because a bidirectional ring can always be con-
sidered as if it was unidirectional). This algorithm shows that, contrary to what one
could think, whether a ring is unidirectional or bidirectional has no impact on its
optimal message complexity from an O() point of view.

4.7 Bibliographic Notes


• The impossibility of solving the leader election problem in anonymous rings is
due to D. Angluin [19].

• The general problem of what can be computed in anonymous networks is ad-
dressed in [22, 392, 393].
• The distributed election problem and its first solution are due to G. Le Lann [232].
• The O(n^2) election algorithm presented in Sect. 4.2 is due to E.J.H. Chang and
R. Roberts [83].
• The O(n log n) election algorithm for bidirectional rings presented in Sect. 4.3 is
due to D.S. Hirschberg and J.B. Sinclair [185].
• The O(n log n) election algorithm for unidirectional rings presented in Sect. 4.4
is due to D. Dolev, M. Klawe, and M. Rodeh [117]. Simultaneously and indepen-
dently, a similar algorithm was presented by G.L. Peterson [295]. Another leader
election algorithm for rings is presented in [134].
• Leader election algorithms in complete networks are studied in [208]. Numerous
leader election algorithms suited to various network topologies (e.g., trees) are
presented by N. Santoro in [335].
• Several authors have shown that Ω(n log n) is a lower bound on the number of
messages in both ring networks and complete networks (e.g., [24, 56, 288]).
• The best election algorithm on unidirectional ring networks known so far is due
to L. Higham and T. Przytycka [184]. Its message complexity is 1.271 n log n +
O(n). However, it is not yet known what is the smallest constant c such that
an election can be solved on unidirectional rings with message complexity c ×
n log n + O(n) (it is only known that c ≥ 0.69 [56]).
Higham and Przytycka’s algorithm is based on rounds and assumes that all
processes start with the same initial round number. An extension of this algorithm
is presented in [20], which allows the processes to start with arbitrary round num-
bers. This extension, which is motivated by fault-tolerance with respect to initial
values, guarantees that the message complexity is O(n log n) when the processes
start with the same round number, and may increase up to O(n^2) when each pro-
cess starts with an arbitrary round number. This fault-tolerance property is called
graceful degradation with respect to initial values.
• Leader election in dynamic networks and mobile ad hoc networks is addressed
in [197, 244].

4.8 Exercises and Problems

1. Extend the proof of Theorem 2 so that it works for any anonymous regular net-
work.
2. Considering the variant of Chang and Robert’s election algorithm described in
Fig. 4.3, and assuming that k processes send an ELECTION () message at line 1,
what is the maximal number of ELECTION () messages sent (at lines 1 and 2)
during an execution of the algorithm?
Answer: nk − k(k − 1)/2.

3. Show that the worst case for the time complexity of Hirschberg and Sinclair’s
election algorithm is when n is 1 more than a power of 2. Show that, in this case,
the time complexity is 6n − 6.
Solution in [185].
Chapter 5
Mobile Objects Navigating a Network

A mobile object is an object that, according to requests issued by user processes,
travels from process to process. This chapter is on algorithms that allow a mobile
object to navigate a network. It presents three distributed navigation algorithms with
different properties. All these algorithms ensure both that the object remains always
consistent (i.e., it is never present simultaneously at several processes), and that any
process that requires the object eventually obtains it.

Keywords Adaptive algorithm · Distributed queuing · Edge/link reversal ·
Mobile object · Mutual exclusion · Network navigation · Object consistency ·
Routing · Scalability · Spanning tree · Starvation-freedom · Token

5.1 Mobile Object in a Process Graph

5.1.1 Problem Definition

A mobile object is an object (such as a file or a data structure) that can be accessed
sequentially by different processes. Hence, a mobile object is a concurrent object
that moves from process to process in a network of processes.
When a process momentarily owns a mobile object, the process can use the ob-
ject as if it was its only user. It is assumed that, after using a mobile object, a pro-
cess eventually releases it, in order that the object can move to another process that
requires it. So what has to be defined is a navigation service that provides the pro-
cesses with two operations denoted acquire_object() and release_object() such that
any use of the object by a process pi is bracketed by an invocation to each of these
operations, namely

acquire_object(); use of the object by pi ; release_object().

As already noticed, in order that the state of the object remains always consistent,
it is required that the object be accessed by a single process at a time, and the object
has to be live in the sense that any process must be able to obtain the mobile object.
This is captured by the two classical safety and liveness properties which instantiate
as follows for this problem (where the sentences “the object belongs to a process”


or “a process owns the object” means that the object is currently located at this
process):
• Safety: At any time, the object belongs to at most one process.
• Liveness: Any process that invokes acquire_object() eventually becomes the
owner of the object.
Let us notice that a process pi invokes release_object() after having obtained
and used the object. Hence, it needs to invoke again acquire_object() if it wants to
use the mobile object again.

5.1.2 Mobile Object Versus Mutual Exclusion

The particular case where the ownership of the mobile object gives its owner a
particular right (e.g., access to a resource) is nothing more than an instance of the
mutual exclusion problem. In that case, the mobile object is a stateless object, which
is usually called a token. The process that has the token can access the resource,
while the other processes cannot. Moreover, any process can require the token in
order to be eventually granted the resource.
Token-based mutual exclusion algorithms define a family of distributed mutual
exclusion algorithms. We will see in Chap. 10 another family of distributed mutual
exclusion algorithms, which are called permission-based algorithms.

5.1.3 A Centralized (Home-Based) Algorithm

A simple solution to the navigation of a mobile object consists in using a home-
based structure by statically associating a fixed process p with the mobile object.
Such a home-based scheme is easy to implement. When it is not used, the object
resides at its home process p, and a three-way handshake algorithm is used to ensure
that any process that invokes acquire_object() eventually obtains the object.

Three-Way Handshake Algorithm The three-way handshake algorithm works
as follows. Its name comes from the three messages used to satisfy an object request.
• When a process pi invokes acquire_object(), pi sends a message REQUEST (i) to
the home process p, and waits for the object.
• When the home process receives a message REQUEST (i) from a process pi , it
adds this message in a local queue and sends back the object to pi if this request
is at the head of the queue.
• When pi receives the object, it uses the object, and eventually invokes release_
object(). This invocation entails sending a message RELEASE _ OBJECT (i) to
the home process p. Moreover, if the object has been updated by pi , the RE -
LEASE _ OBJECT (i) message carries the last value of the object as modified by pi .

Fig. 5.1 Home-based three-way handshake mechanism

• When the home process receives a message RELEASE _ OBJECT (i), it stores the
new value of the object (if any), and suppresses the request of pi from its local
queue. Then, if the queue is not empty, p sends the object to the first process of
the queue.
This three-way handshake algorithm is illustrated in Fig. 5.1, with two processes
pi and pj . When the home process p receives the message REQUEST (j ), it adds
it to its local queue, which already contains REQUEST (i). Hence, the home process
will answer this request when it receives the message RELEASE _ OBJECT (i), which
carries the last value of the object as modified by pi .
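The behavior of the home process can be sketched in Python as follows (the class,
the send callback, and the message-handler style are illustrative assumptions of
ours, not the book's code):

from collections import deque

class HomeProcess:
    # Sketch of the home process p of the three-way handshake protocol.
    def __init__(self, initial_object, send):
        self.obj = initial_object    # object value; None while lent out
        self.queue = deque()         # pending requester identities
        self.send = send             # assumed point-to-point callback

    def on_request(self, i):         # REQUEST(i) received from p_i
        self.queue.append(i)
        if self.obj is not None and self.queue[0] == i:
            self.send(i, ("OBJECT", self.obj))   # grant immediately
            self.obj = None

    def on_release(self, i, new_value):  # RELEASE_OBJECT(i) received
        self.obj = new_value         # store the possibly updated object
        self.queue.popleft()         # the request of p_i is now served
        if self.queue:               # grant the object to the next requester
            self.send(self.queue[0], ("OBJECT", self.obj))
            self.obj = None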

Discussion Let us first observe that the home process p can manage its internal
queue on a FIFO basis or use another priority discipline. This actually depends on
the application.
While they may work well in small systems, the main issue of home-based al-
gorithms lies in their poor ability to cope with scalability and locality. If the object
is heavily used, the home process can become a bottleneck. Moreover, always re-
turning the object to its home process can be inefficient (this is because when a
process releases the object, it could be sent to its next user without passing through
its home).

5.1.4 The Algorithms Presented in This Chapter

The algorithms that are described in this chapter are not home-based and do not
suffer the previous drawback. They all have the following noteworthy feature: If,
when a process pi releases the object, no other process wants to acquire it, the
object remains at its last user pi . It follows that, if the next user is pi again, it does
not need to send messages to obtain the object. Consequently, no message is needed
in this particular case.
The three algorithms that are presented implicitly consider that the home of the
object is dynamically defined: the home of the mobile object is its last user. These

algorithms differ in the structure of the underlying network they assume. As we will
see, their cost and their properties depend on this structure. They mainly differ in
the way the mobile object and the request messages are routed in the network.

5.2 A Navigation Algorithm for a Complete Network


This section presents a navigation algorithm which assumes a complete network
(any pair of processes is connected by an asynchronous channel). The channels are
not required to be FIFO. This algorithm was proposed independently by G. Ricart
and A.K. Agrawala (1983), and I. Suzuki and T. Kasami (1985). The system is made
up of n processes, p1 , . . . , pn and the index i is used as the name of pi .

5.2.1 Underlying Principles

As there is no statically defined home notion, the main issue that has to be solved
is the following: When a process pi releases the object, to which process must it
send the mobile object?

A Control Data Inside the Mobile Object First, the mobile object is enriched
with a control data, denoted obtained[1..n], such that obtained[i] counts the number
of times that pi has received the object. Let us notice that it is easy to ensure that,
for any i, obtained[i] is modified only when pi has the object. It follows that the
array obtained[1..n] always contains exact values (and no approximate values). This
array is initialized to [0, . . . , 0].

Local Data Structure In order to know which processes are requesting the object,
a process pi , which does not have the object and wants to acquire it, sends a message
REQUEST (i) to every other process to inform them that it is interested in the object.
Moreover, each process pi manages a local array request_byi [1..n] such that
request_byi [j ] contains the number of REQUEST () messages sent by pj , as known
by pi .

Determining Requesting Processes Let pi be the process that has the mobile
object. The set of processes that, from its point of view, are requesting the token
can be locally computed from the arrays request_byi [1..n] and obtained[1..n]. It is
the set Si including the processes pk such that (to pi ’s knowledge) the number of
REQUEST () messages sent by pk is higher than the number of times pk has received
the object (which is saved in obtained[k]), i.e., it is the set

Si = { k | request_byi [k] > obtained[k] }.

This provides pi with a simple predicate (Si ≠ ∅) that allows it to know if pro-
cesses are requesting the mobile object. This predicate can consequently be used to

ensure that, if processes want to acquire the object, eventually there are processes
that obtain it (deadlock-freedom property).

Ensuring Starvation-Freedom Let us consider the case where all processes have
requested the object and p1 has the object. When p1 releases the object, it sends
it to p2 , and just after sends a message REQUEST (1) to again obtain the object. It is
possible that, when it releases the object, p2 sends it to p1 , and just after sends a
message REQUEST (2) to again obtain the object. This scenario can repeat forever,
and, while processes p1 and p2 forever obtain the mobile object in turn, the
other processes never obtain it.
A simple way to solve this problem (and consequently obtain the starvation-
freedom property) consists in establishing an order on the processes of Si that pi
has to follow when it releases the object. The order, which depends on i, is the
following one for pi :

i + 1, i + 2, . . . , n, 1, . . . , i − 1.

The current owner of the object pi sends the object to the first process of this list
that belongs to Si . This means that, if (i + 1) ∈ Si , pi sends the object to pi+1 .
Otherwise, if (i + 2) ∈ Si , pi sends the object to pi+2 , etc. It is easy to see that,
as no process can be favored, no process can be missed, and, consequently, any
requesting process will eventually receive the mobile object.
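The selection rule can be written as a small Python helper (a sketch with names of
ours; request_by and obtained are the arrays introduced above):

def next_owner(i, n, request_by, obtained):
    # Scan i+1, i+2, ..., n, 1, ..., i-1 (identities are 1-based) and
    # return the first process of S_i, i.e., the first process whose
    # pending requests outnumber its grants; None if S_i is empty.
    for d in range(1, n):
        k = ((i - 1 + d) % n) + 1
        if request_by[k] > obtained[k]:
            return k
    return None

# Example: p_2 owns the object; p_1 and p_4 have unserved requests.
request_by = {1: 2, 2: 1, 3: 0, 4: 1}
obtained = {1: 1, 2: 1, 3: 0, 4: 0}
print(next_owner(2, 4, request_by, obtained))   # 4 (scanned before 1)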

5.2.2 The Algorithm

Additional Local Variables and Initialization In addition to the array
request_byi [1..n] (which is initialized to [0, . . . , 0]), each process pi manages the
following local variables:
• interestedi is a Boolean initialized to false. It is set to true when pi is interested
in the object (it is waiting for it or is using it).
• object_presenti is a local Boolean variable, which is true if and only if the object
is present at process pi . Initially, all these Boolean variables are equal to false,
except at the process where the object has been initially deposited.
It is important to notice that, in addition to the fact that processes have distinct
names (known by all of them), the initialization is asymmetric: the mobile object is
initially present at a single predetermined process pj .

Structural View The structural view of the navigation algorithm at each process
is described in Fig. 5.2. The module associated with each process pi contains the
previous local variables and four pieces of code, namely:
• The two pieces of code for the algorithms implementing the operations acquire_
object() and release_object(), which constitute the interface with the application
layer.

Fig. 5.2 Structural view of the navigation algorithm (module at process pi )

• Two additional pieces of code, each associated with the processing of a mes-
sage, namely the message REQUEST () and the message OBJECT (). As we have
seen, the latter contains the mobile object itself plus the associated control data
obtained[1..n].

The Algorithm The navigation algorithm is described in Fig. 5.3. Each piece of
code is executed atomically, except the wait() statement. This means that, at each
process pi , the execution of lines 1–4, lines 7–14, line 15, and lines 16–19, are
mutually exclusive. Let us notice that these mutual exclusion rules do not prevent
a process pi , which is currently using the mobile object, from being interrupted to
execute lines 16–19 when it receives a message REQUEST ().
When a process pi invokes acquire_object(), it first sets interestedi to true
(line 1). If it is the last user of the object, (object_presenti is then true, line 2),
pi returns from the invocation and uses the object. If it does not have the object, pi
increments request_byi [i] (line 3), sends a REQUEST (i) message to each other process
(line 4), and waits until it has received the object (lines 5 and 15).
When a process which was using the object invokes release_object(), it first in-
dicates that it is no longer interested in the mobile object (line 7), and updates the
global control variable obtained[i] to the value request_byi [i] (line 8). Then, start-
ing from pi+1 (line 9), pi looks for the first process pk that has requested the mobile
object more often than the number of times it acquired the object (line 10) (let us
notice that pi is then such that ¬ interestedi ∧ object_presenti ).
If there is such a process, pi sends it the object (lines 11–12) and returns from
the invocation of release_object(). If, to its local knowledge, no process wants the
object, pi keeps it and returns from the invocation of release_object() (let us notice
that pi is then such that ¬ interestedi ∧ object_presenti ).
As already seen, pi sets object_presenti to the value true when it receives the ob-
ject. A process can receive the mobile object only if it has previously sent a request
message to obtain it.
Finally, when a process pi receives a message REQUEST (k), it first increases
accordingly request_byi [k] (line 16). Then, if it has the object and is not using it, pi
sends it to pk by return (lines 17–19).

operation acquire_object() is
(1) interestedi ← true;
(2) if (¬ object_presenti ) then
(3) request_byi [i] ← request_byi [i] + 1;
(4) for k ∈ {1, . . . , n} \ {i} do send REQUEST (i) to pk end for;
(5) wait (object_presenti )
(6) end if.

operation release_object() is
(7) interestedi ← false;
(8) obtained[i] ← request_byi [i];
(9) for k from i + 1 to n and then from 1 to i − 1 do
(10) if (request_byi [k] > obtained[k]) then
(11) object_presenti ← false;
(12) send OBJECT () to pk ; exit loop
(13) end if
(14) end for.

when OBJECT () is received do


(15) object_presenti ← true.

when REQUEST (k) is received do


(16) request_byi [k] ← request_byi [k] + 1;
(17) if (object_presenti ∧ ¬ interestedi ) then
(18) object_presenti ← false; send OBJECT () to pk
(19) end if.

Fig. 5.3 A navigation algorithm for a complete network (code for pi )
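The following Python class is a near-literal transcription of Fig. 5.3 (a sketch: the
send callback is an assumption of ours, each handler is supposed to run atomically,
as stated above, and the wait of line 5 is left to the caller):

class Navigator:
    def __init__(self, i, n, has_object, send):
        self.i, self.n, self.send = i, n, send
        self.interested = False               # interested_i
        self.object_present = has_object      # object_present_i
        self.request_by = [0] * (n + 1)       # request_by_i[1..n], 1-based
        self.obtained = [0] * (n + 1)         # travels with the object

    def acquire_object(self):                 # lines 1-6
        self.interested = True
        if not self.object_present:
            self.request_by[self.i] += 1
            for k in range(1, self.n + 1):    # line 4: ask everybody else
                if k != self.i:
                    self.send(k, ("REQUEST", self.i))
            # the caller then waits until self.object_present is True

    def release_object(self):                 # lines 7-14
        self.interested = False
        self.obtained[self.i] = self.request_by[self.i]
        for d in range(1, self.n):            # scan i+1, ..., n, 1, ..., i-1
            k = ((self.i - 1 + d) % self.n) + 1
            if self.request_by[k] > self.obtained[k]:
                self.object_present = False
                self.send(k, ("OBJECT", list(self.obtained)))
                break                         # otherwise p_i keeps the object

    def on_object(self, obtained):            # line 15
        self.obtained = obtained
        self.object_present = True

    def on_request(self, k):                  # lines 16-19
        self.request_by[k] += 1
        if self.object_present and not self.interested:
            self.object_present = False
            self.send(k, ("OBJECT", list(self.obtained)))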

Cost of the Algorithm The number of messages needed for one use of the mobile
object is 0 when the object is already present at the process that wants to use it, or n
(n − 1 request messages plus the message carrying the object).
A message REQUEST () carries a process identity. Hence, its size is O(log2 n).
The time complexity is 0 when the object is already at the requesting process.
Otherwise, let us first observe that the transit time of the request messages that
travel while the object is in use does not need to be counted. This is because, whatever
their speed, as the object is in use, these transit times cannot delay the transfer of
the object. Assume that all messages take one time unit; it follows that the time
complexity lies between 1 and 2. One time unit suffices in heavy load: When a
process pi releases the object, the test of line 10 is satisfied, and the object takes
one time unit before being received by its destination process. Two time units are
needed in light load: One time unit is needed for the request message to attain the
process that has the object, and one more time unit is needed for the object to travel
to the requesting process (in that case, the object is sent at line 18).

Are Early Updates Good? When a process pi receives the object (which car-
ries the control data obtained[1..n]), it is possible that some entries k are such
that obtained[k] > request_byi [k]. This is due to asynchrony, and is independent
of whether or not the channels are FIFO. This occurs when some request messages

Fig. 5.4 Asynchrony involving a mobile object and request messages

are very slow, as depicted in Fig. 5.4, where the path followed by the mobile object
is indicated with dashed arrows (the dashed arrow on a process is the period during
which this process uses the mobile object). Moreover, as shown in the figure, the
REQUEST () message from pj to pi is particularly slow with respect to the object.
Hence the question: when a process pi receives the object, is it interesting for pi
to benefit from the array obtained[1..n] to update its local array request_byi [1..n],
i.e., to execute at line 15 the additional statement

for k ∈ {1, . . . , n} do request_byi [k] ← max(request_byi [k], obtained[k]) end for.

Due to the fact that each REQUEST () message has to be counted exactly once, this
early update demands other modifications so that the algorithm remains correct. To
that end, line 4 has to be replaced by

(4′) for k ∈ {1, . . . , n} \ {i} do send REQUEST(i, request_byi [i]) to pk end for,

and, when a message REQUEST (k, rnb) is received, line 16 has to be replaced by

(16′) request_byi [k] ← max(request_byi [k], rnb).
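In terms of the class sketched after Fig. 5.3 (our names, not the book's), the early
update of line 15 and the two replacements 4′ and 16′ would read as follows; as
argued next, these changes are not worthwhile:

# line 4': the REQUEST message now carries the sender's request counter
def acquire_object(self):
    self.interested = True
    if not self.object_present:
        self.request_by[self.i] += 1
        for k in range(1, self.n + 1):
            if k != self.i:
                self.send(k, ("REQUEST", self.i, self.request_by[self.i]))

# line 15 plus the early update of request_by from the obtained array
def on_object(self, obtained):
    self.obtained = obtained
    for k in range(1, self.n + 1):
        self.request_by[k] = max(self.request_by[k], obtained[k])
    self.object_present = True

# line 16': each REQUEST() is counted exactly once, even if an early
# update has already raised request_by[k] in the meantime
def on_request(self, k, rnb):
    self.request_by[k] = max(self.request_by[k], rnb)
    if self.object_present and not self.interested:
        self.object_present = False
        self.send(k, ("OBJECT", list(self.obtained)))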

It follows that trying to benefit from the array obtained[1..n] carried by the mo-
bile object to early update the local array request_byi [1..n] requires us to add (a)
sequence numbers to REQUEST () messages, and (b) associated update statements.
It follows that these modifications make the algorithm less efficient and a little bit
more complicated. Hence, contrary to what one could a priori hope, early updates
are not good for this navigation algorithm.

5.3 A Navigation Algorithm Based on a Spanning Tree


The previous navigation algorithm is based on the broadcast of requests and uses
sequence numbers, which can increase forever. In contrast, the algorithm presented

Fig. 5.5 Navigation tree: initial state

in this section has only bounded variables. It is based on a statically defined span-
ning tree of the network, and each process communicates only with its neighbors
in the tree, hence the algorithm is purely local. This algorithm is due to K. Ray-
mond (1989).

5.3.1 Principles of the Algorithm: Tree Invariant and Proxy Behavior

As just indicated, the algorithm considers a statically defined spanning tree of the
network. Only the channels of this spanning tree are used by the algorithm.

Tree Invariant Initially, the process that has the mobile object is the root of the
tree, and each process has a pointer to its neighbor on its path to the root. This is
depicted in Fig. 5.5, where the mobile object is located at process pa .
This tree structure is the invariant maintained by the algorithm: A process always
points to its neighbor in the subtree containing the object. In that way, a process
always knows in which part of the tree the object is located.
Let us consider the process pc that wants to acquire the object. Due to the tree
orientation, it can send a request to its current parent in the tree, namely process
pb , which in turn can forward it to its parent, etc., until it attains the root of the
tree. Hence, the tree orientation allows requests to be propagated to the appropriate
part of the tree. When pa receives the request, it can send the object to pb , which
forwards it to pc . The object consequently follows the same path of the tree (in the
reverse order) as the one followed by the request. Moreover, in order to maintain the invariant
associated with the tree orientation, the mobile object reverses the direction of the
edges during its travel from the process where it was previously located (pa in the
figure) to its destination process (pc in the figure). This technique, which is depicted
in Fig. 5.6, is called edge reversal.

The Notion of a Proxy Let us consider process pb that receives a request for the
mobile object from its neighbor pd after it has received a request from pc (Fig. 5.6).
Hence, it has already sent a request to pa to obtain the object on behalf of pc .
Moreover, a process only knows its neighbors in the tree. Additionally, after having

Fig. 5.6 Navigation tree: after the object has moved to pc

received requests from its tree neighbors pc and pd , pb may require the object for
itself. How to solve these conflicts between the requests by pc , pd , and pb ?
To that end, pb manages a local FIFO queue, which is initially empty. When
it receives the request from pc , it adds the request to the queue, and as the queue
contains a single request, it plays a proxy role, namely, it sends a request for itself
to its parent pa (but this request is actually for the process pc , which is at the head
of its local queue). This is a consequence of the fact that a process knows only its
neighbors in the tree (pa does not know pc ).
Then, when it receives the request from pd , pb adds it to its queue, and, as this
request is not the only one in the queue, pb does not send another request to pa .
When later pb receives the mobile object, it forwards it to the process whose request
is at the head of its queue (i.e., pc ), and suppresses its request from the queue.
Moreover, as the queue is not empty, pb continues to play its proxy role: it sends
to pc a request (for the process at the head of its queue, which is now pd ). This
is depicted in Fig. 5.7, where the successive states of the local queue of pb are
explicitly described.

5.3.2 The Algorithm

The structure of the algorithm is the same as the one described in Fig. 5.2.

Local Variables and Initialization In addition to the Boolean variable interestedi ,
whose meaning and behavior are the same as in the algorithm described in Fig. 5.3,
each process pi manages the following local variables.

Fig. 5.7 Navigation tree: proxy role of a process



• queuei is the local queue in which pi saves the requests it receives from its neigh-
bors in the tree, or from itself. Initially, queuei is empty. If pi has d neighbors in
the tree, queuei will contain at most d process identities (one per incoming edge
plus one for itself). It follows that the bit size of queuei is bounded by d log2 n.
The following notations are used:
– queuei ← queuei + j means “add j at the end of queuei ”.
– queuei ← queuei − j means “withdraw j from queuei (which is at its head)”.
– head(queuei ) denotes the first element of queuei .
– |queuei | denotes the size of queuei , while ∅ is used to denote the empty queue.
• parenti contains the identity of the parent of pi in the tree. The set of local vari-
ables {parenti }1≤i≤n is initialized in such a way that they form a tree rooted at the
process where the object is initially located. The root of the tree is the process pk
such that parentk = k.
The Boolean object_presenti used in the previous algorithm is no longer
needed. This is because it is implicitly encoded in the local variable parenti . More
precisely, we have (parenti = i) ≡ object_presenti .

Behavior of a Process The algorithm is described in Fig. 5.8. When a process
pi invokes acquire_object(), it first sets interestedi to the value true (line 1), and
checks if it has the object (predicate parenti = i, line 2). If it does not have the object, it
adds its request to its local queue (line 3), and sends a request to its current parent
in the spanning tree if queuei contains only its own request (line 4). Then, pi waits
until it is at the head of its queue and receives the object (lines 5, 20, and 21).
When, after it has used the mobile object, a process pi invokes release_object(),
it first resets interestedi to the value false (line 7), and then checks the state of its
local queue (line 8). If queuei is not empty, pi sends the object to its neighbor in
the tree that is at the head of queuei , and defines it as the new parent in the tree.
Let us notice that, in that way, parenti points to the neighbor of pi in the subtree
containing the object. Finally, if queuei is not empty, pi sends a request to its new
parent (line 11) so that the object eventually returns to it to satisfy this request. (Let
us remark that, if two messages are sent at lines 10 and 11, they can be sent as a
single message carrying both the object and a request.)
When a process pi receives a message REQUEST () from a neighbor pk in the
spanning tree, (pi is then the parent of pk in the tree), its behavior depends on
whether or not it has the object. If pi has the object and is using it, it enqueues the
request (lines 13–14). If pi has the object and it is not using it (pi was then the last
user of the object), it sends the object to pk , and defines pk as its new parent in the
spanning tree (line 15). If pi does not have the object when it receives a message
REQUEST (k), it adds k to its request queue (line 17), and, as a proxy, sends a request
message to its parent if the request it has received is the only request in its queue
(line 18, which is the same as line 4).
Finally, let us observe that a process can receive the object only if it has previ-
ously sent a request. This means that, when a process pi receives the mobile object,
we necessarily have queuei ≠ ∅. Hence, when it receives the object, a process pi

operation acquire_object() is
(1) interestedi ← true;
(2) if (parenti = i) then
(3) queuei ← queuei + i;
(4) if (|queuei | = 1) then send REQUEST (i) to pparenti end if;
(5) wait (parenti = i)
(6) end if.

operation release_object() is
(7) interestedi ← false;
(8) if (queuei ≠ ∅) then
(9) let k = head(queuei ); queuei ← queuei − k;
(10) send OBJECT () to pk ; parenti ← k;
(11) if (queuei ≠ ∅) then send REQUEST (i) to pparenti end if
(12) end if.

when REQUEST (k) is received do


(13) if (parenti = i)
(14) then if (interestedi ) then queuei ← queuei + k
(15) else send OBJECT () to pk ; parenti ← k
(16) end if
(17) else queuei ← queuei + k;
(18) if (|queuei | = 1) then send REQUEST (i) to pparenti end if
(19) end if.

when OBJECT () is received do
(20) let k = head(queuei ); queuei ← queuei − k;
(21) if (i = k) then parenti ← i
(22) else send OBJECT () to pk ; parenti ← k;
(23) if (queuei ≠ ∅) then send REQUEST (i) to pparenti end if
(24) end if.

Fig. 5.8 A spanning tree-based navigation algorithm (code for pi )

considers the request at the head of queuei (line 20). If it is its own request, the
object is for it, and consequently pi becomes the new root of the spanning tree and
is the current user of the object (line 21). Otherwise, the object has to be forwarded
to the neighbor pk whose request was at the head of queuei (line 22), and pi has
to send a request to its new parent pk if queuei ≠ ∅ (line 23). (Let us observe that
lines 22–23 are the same as lines 10–11.)
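As for the previous algorithm, a Python transcription of Fig. 5.8 may help (a sketch:
send is an assumed callback restricted to tree neighbors, each handler is assumed
atomic, and the wait of line 5 is left to the caller):

from collections import deque

class RaymondNode:
    def __init__(self, i, parent, send):
        self.i, self.parent, self.send = i, parent, send
        self.interested = False      # interested_i
        self.queue = deque()         # queue_i; parent == i encodes object_present_i

    def acquire_object(self):        # lines 1-6
        self.interested = True
        if self.parent != self.i:
            self.queue.append(self.i)
            if len(self.queue) == 1:             # proxy: one request per wave
                self.send(self.parent, ("REQUEST", self.i))
            # the caller then waits until self.parent == self.i

    def release_object(self):        # lines 7-12
        self.interested = False
        if self.queue:
            k = self.queue.popleft()
            self.send(k, ("OBJECT",))
            self.parent = k                      # edge reversal toward the object
            if self.queue:                       # ask it back for the next request
                self.send(self.parent, ("REQUEST", self.i))

    def on_request(self, k):         # lines 13-19
        if self.parent == self.i:                # the object is here
            if self.interested:
                self.queue.append(k)             # served at the next release
            else:
                self.send(k, ("OBJECT",))
                self.parent = k
        else:
            self.queue.append(k)
            if len(self.queue) == 1:
                self.send(self.parent, ("REQUEST", self.i))

    def on_object(self):             # lines 20-24
        k = self.queue.popleft()                 # queue_i is necessarily non-empty
        if k == self.i:
            self.parent = self.i                 # p_i becomes root and user
        else:
            self.send(k, ("OBJECT",))
            self.parent = k
            if self.queue:
                self.send(self.parent, ("REQUEST", self.i))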

5.3.3 Discussion and Properties

On Messages As shown in Figs. 5.5 and 5.6, it follows that, during the object’s
travel in the tree to its destination process, the object reverses the direction of the
tree edges that have been traversed (in the other direction) by the corresponding
request messages.

Fig. 5.9 The case of non-FIFO channels

Each message REQUEST () carries the identity of its sender (lines 4, 11, 18, 23).
Thus, the control information carried by these messages is bounded. Moreover, all
local variables are bounded.

Non-FIFO Channels Like the previous algorithm, this one does not require the
channels to be FIFO. When channels are not FIFO, could a problem happen if a
process pi , which has requested the object for itself or for another process pk , first
receives a request and then the object, both from one of its neighbors pj in the tree?
This scenario is depicted in Fig. 5.9, where we initially have parenti = j , and
pi has sent the message REQUEST (i) to its parent on behalf of the process py at
the head of queuei (hence |queuei | ≥ 1). Hence, when pi receives REQUEST (j ),
it is such that parenti ≠ i and |queuei | ≥ 1. It then follows from lines 13–19 that
the only statement executed by pi is line 17, namely, pi adds j to its local queue
queuei (which is then such that |queuei | ≥ 2).

On the Tree Invariant The channels used by the algorithm constitute an undi-
rected tree. When considering the orientation of these channels as defined by the
local variable parenti , we have the following.
Initially, these variables define a directed tree rooted at the process where the
object is initially located. Then, according to requests and the move of the object, it
is possible that two processes pi and pj are such that parenti = j and parentj = i.
Such is the case depicted in Fig. 5.9, in which the message REQUEST (j ) can be
suppressed (let us observe that this is independent of whether or not the channel
connecting pi and pj is FIFO). When this occurs we necessarily have:
• parentj = i and pj has sent the object to pi , and
• parenti = j and pi has sent the message REQUEST (i) to pj .
When the object arrives at pi , parenti is modified (line 21 or line 22, according to
whether pi is or is not the destination of the object), and, from then on, the edge
of the spanning tree is directed from pj to pi .
Hence, let us define an abstract spanning tree as follows. The orientation of all
its edges, except the one connecting pi and pj when the previous scenario occurs,
is defined by the local variables parentk (k ≠ i, j ). Moreover, when the previous
scenario occurs, the edge of the abstract spanning tree is from pj to pi , i.e., the

direction in which the object is moving. (This means that the abstract spanning tree
logically considers that the object has arrived at pi .) The invariant preserved by the
algorithm is then the following: At any time there is exactly one abstract directed
spanning tree, whose edges are directed to the process that has the object, or to
which the object has been sent.

Cost of the Algorithm As in the previous algorithm, the message cost is 0 mes-
sages in the most favorable case. The worst case depends on the diameter D of the
tree, and occurs when the two most distant processes are such that one has the object
and the other requests it. In this case, D messages and D time units are necessary
for the requests to arrive at the owner of the object, and again D messages and D
time units are necessary for the object to arrive at the requesting process. Hence,
both message complexity and time complexity are O(D).

On Process Identities As any process pi communicates only with its neighbors
in the spanning tree, the only identities that can appear in queuei are the identities
of these neighbors plus its own identity i. So that no two identities in queuei can be
confused, the algorithm requires them to be different.
As the identities of any two processes pi and pj that are at a distance greater than
2 in the tree can never appear in the same queue, it follows that any two processes
at a distance greater than 2 in the tree can have the same name without making
the algorithm incorrect. This noteworthy property is particularly interesting from a
scalability and locality point of view.

On Priority As each local queue queuei is managed according to the FIFO disci-
pline, it follows that each request is eventually granted (the corresponding process
obtains the mobile object). It is possible to apply other management rules to these
queues, or particular rules to a subset of processes, in order to favor some processes,
or obtain specific priority schemes. This versatility dimension of the algorithm can
be exploited by some applications.

5.3.4 Proof of the Algorithm

Theorem 3 The algorithm described in Fig. 5.8 guarantees that the object is never
located simultaneously at several sites (safety), and any process that wants the ob-
ject eventually acquires it (liveness).

Proof Proof of the safety property. Let us first observe that the object is initially
located at a single process. The proof is by induction. Let us assume that the property
is true up to the xth move of the object from a process to another process. Let pi be
the process that has the object after its xth move. If pi keeps the object forever, the
property is satisfied. Hence, let us assume that eventually pi sends the object. There
are three lines at which pi can send the object. Let us consider each of them.

• pi invokes release_object(). In this case, we have i ∉ queuei . This is be-
cause when, after its last acquire_object(), pi received the object, it was at the
head of queuei but it suppressed its name from queuei before using the object
(lines 20–21).
If queuei = ∅, pi keeps the object and the property is satisfied. If queuei ≠ ∅,
pi sends the object to the process pk whose request is at the head of queuei (and
k ≠ i, due to the previous observation). It also sets parenti to k (lines 7–10). We
have then parenti = k ≠ i, which means that the object is no longer at pi , and the
property is satisfied.
• pi executes line 15. In this case, we have k ≠ i at line 15 (this is because only pi
can send messages REQUEST (i), and a process does not send request messages to
itself). It follows that, after it has executed line 15, we have parenti ≠ i, i.e., pi
no longer has the object, and the safety property is satisfied.
• pi executes line 22. In this case, we have i ≠ k (line 21). Hence, parenti = k ≠ i
after pi has forwarded the object, and the safety property is satisfied.
Proof of the liveness property. This proof is made up of two parts. It is first
proved that, if processes want to acquire the object, one process obtains it (deadlock-
freedom). It is then proved that any process that wants to acquire the object, eventu-
ally obtains it (starvation-freedom).
Proof of the deadlock-freedom property. Let us consider that the object is located
at pi , and at least one process pj has sent a request. Hence, pj has sent a message
REQUEST (j ) to parentj , which in turn has sent a message REQUEST (parentj ) to its
parent, etc. (these request messages are not necessarily due to the request of pj if
other processes have issued requests). The important point is that, due to the orientation
of the edges and the loop-freedom property of the abstract dynamic spanning tree,
pi receives a message REQUEST (k) (this message is not necessarily due to pj ’s
request). When this occurs, if interestedi , pi adds k to queuei (the important point
here is that queuei ≠ ∅, line 14), and will send the object to the head of queuei when
it will execute line 10 of the operation release_object(). If ¬ interestedi when pi
receives REQUEST (k), it sends by return the object to pk (line 15). Hence, whatever
the value of interestedi , pi eventually sends the object to pk . If k is at the head of
queuek , it has requested the object and obtains it. Otherwise, pk forwards the object
to the process pk′ such that k′ is at the head of queuek . Iterating this reasoning, and
considering the orientation property of the abstract spanning tree, it follows that the
object attains a process that has a pending request. It follows that the algorithm is
deadlock-free.
Proof of the starvation-freedom property. Let D be the diameter of the spanning
tree, 1 ≤ D ≤ n − 1. Let pi be a process that has sent a message REQUEST (i) and pj
the process that has the mobile object. Let us associate the following vector R with
the request of pi (see Fig. 5.10):

R = [d(i1 ), d(i2 ), . . . , d(ix−1 ), d(ix ), 0, . . . , 0],

where i1 = i, ix = j , N (iy ) is the degree of piy in the spanning tree, and
• d(i1 ) is the rank of i1 in queuei1 ≡ queuei (1 ≤ d(i1 ) ≤ N (i1 )),

Fig. 5.10 The meaning of vector R = [d(i1 ), d(i2 ), . . . , d(ix−1 ), d(ix ), 0, . . . , 0]

• d(i2 ) is the rank of i1 in queuei2 (1 ≤ d(i2 ) ≤ N (i2 )),
• d(i3 ) is the rank of i2 in queuei3 (1 ≤ d(i3 ) ≤ N (i3 )), etc.,
• d(ix ) is the rank of ix−1 in queueix ≡ queuej (1 ≤ d(ix ) ≤ N (ix )).
Let us observe that, as R has D entries and each entry has a bounded domain, it
follows that the set of all possible vectors R is bounded and totally ordered (when
considering lexicographical ordering).
Moreover, as each local queue is a FIFO queue and a process cannot issue a new
request while its previous request is still pending, it follows that, when pix sends the
object (this necessarily occurs from the deadlock-freedom property), we proceed
from the vector R to the vector
• R′ = [d(i1 ), d(i2 ), . . . , d(ix−1 ) − 1, 0, . . . , 0] if d(ix ) = 1, or
• R′′ = [d(i1 ), d(i2 ), . . . , d(ix−1 ), d(ix ) − 1, ∗, . . . , ∗] if d(ix ) > 1.
The first case (d(ix ) = 1) is when pix sends the object to pix−1 (which is at the
head of its queue), while the second case (d(ix ) > 1) is when a process different
from pix−1 is at the head of queueix (where the “∗” stands for appropriate values
according to the current states of the remaining (D − x) local queues).
The important point is that we proceed from vector R to a vector R′ or R′′, which
is smaller than R according to the total order on the set of possible vectors. Hence,
starting from R′ or R′′, we proceed to another vector R′′′ (smaller than R′ or R′′),
etc., until we attain the vector [1, 0, . . . , 0]. When this occurs, pi has the object,
which concludes the proof of the starvation-freedom property.

5.4 An Adaptive Navigation Algorithm

This section presents a distributed algorithm that implements a distributed FIFO
queue. A process invokes the operation enter() to append itself at the end of the
queue. Once it has become the head of the queue, it invokes the operation exit() to
leave the queue.
This algorithm, which was proposed by M. Naimi and M. Trehel (1987), as-
sumes both a complete asynchronous network, and an underlying spanning tree
whose structure evolves according to the requests to enter the queue issued by the
processes.
From a mobile object point of view, this means that a process invokes enter()
(i.e., acquire_object()) to obtain the object, and this operation terminates when the
invoking process is at the head of the queue. After using the object, a process invokes

exit() (i.e., release_object()) to exit the queue and allows its successor in the queue
to become the new head of the queue.

5.4.1 The Adaptivity Property

A distributed algorithm implementing a distributed object (or a distributed service)
is adaptive if, for each process that is not interested in using the object (or the ser-
vice), there is a finite time after which this process no longer has to participate
in the algorithm (hence, it no longer receives messages sent by the algorithm).
As an example, let us consider the two algorithms described in Figs. 5.3 and 5.8,
which implement a navigation service for a mobile object. None of them is adap-
tive. Let px be a process that, after some time, never invokes the operation
acquire_object(). In the algorithm of Fig. 5.3, px receives all the messages RE -
QUEST () sent by the other processes. In the algorithm of Fig. 5.8, according to its
position in the spanning tree, px has to play a proxy role for the requests issued by
its neighbors in the tree, and, in the other direction, the object has to pass through
px to attain its destination. Hence, in both cases, px has to forever participate in the
navigation algorithm.
As we are about to see, the algorithm described in this section is adaptive: If after
some time τ , a process px never invokes enter(), there is a finite time τ′ ≥ τ after
which px is no longer required to participate in the distributed algorithm.

5.4.2 Principle of the Implementation

A Distributed Queue To implement the queue, each process pi manages two lo-
cal variables. The Boolean interestedi is set to the value true when pi starts entering
the queue, and is reset to the value false when it exits the queue. The second local
variable is denoted nexti , and is initialized to ⊥. It contains the successor of pi in
the queue. Moreover, nexti = ⊥ if pi is the last element of the queue.
Hence, starting from the process px that is the head of the queue, the queue is
defined by following the next pointers: from px to pnextx , then to the successor of
that process, etc., until the process py such that nexty = ⊥.
Due to its very definition, there is a single process at the head of the queue.
Hence, the algorithm considers that this is the process at which the mobile object
is currently located. When this process pi exits the queue, it has to send the mobile
object to pnexti (if nexti ≠ ⊥).

How to Enter the Queue: A Spanning Tree to Route Messages Let pℓ be the
process that is currently the last process of the queue (hence, interestedℓ ∧ nextℓ =
⊥). The main issue that has to be solved consists in the definition of an addressing
mechanism that allows a process pi , which wants to enter the queue, to inform pℓ
that it is no longer the last process in the queue and that it now has a successor.

Fig. 5.11 A dynamically evolving spanning tree

To that end, we need a distributed routing structure that permits any process to
send a message to the last process of the queue. Moreover, this distributed routing
structure has to be able to evolve, as the last process of the queue changes according
to the request issued by processes. The answer is simple, namely, the distributed
routing structure we are looking for is a dynamically evolving spanning tree whose
current root is the last process of the queue.
More specifically, let us consider Fig. 5.11. There are five processes, p1 , . . . , p5 .
The local variable parenti of each process pi points to the process that is the current
parent of pi in the tree. As usual, parentx = x means that px is the root of the
tree. The process p4 is the current root (initial configuration). The process p2 sends
a message REQUEST (2) to its parent p1 and defines itself as new root by setting
parent2 to 2. When p1 receives this message, as it is not the root, it forwards this
message to its parent p4 and redefines its new parent as being p2 (intermediary
configuration). Finally, when p4 receives the message REQUEST (2) forwarded by
p1 , it discovers that it has a successor in the queue (hence, it executes next4 ← 2),
and it considers p2 as the new root of the spanning tree (update of parent4 to 2 in
the final configuration).
As shown by the previous example, it is possible that, at some times, several trees
coexist (each spanning a distinct partition of the network). The important point is
that there is no creation of cycles, and there is a single new spanning tree when all
control messages have arrived and been processed.
Differently from the algorithm described in Fig. 5.8, which is based on “edge re-
versal” on a statically defined tree, this algorithm benefits from the fact that any
channel can be part of the dynamically evolving spanning tree. The directed path
p2 , p1 , p4 of the initial spanning tree is replaced by two new directed edges (from p1 to p2 ,
new root to the old root is replaced by d edges, each directly pointing to the new
root (in the final configuration).
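This rewiring can be captured by a few lines of Python (a sketch with names of
ours; the recursive call stands in for an asynchronous send, and the parents chosen
for p3 and p5 are arbitrary, as the text does not specify them). Replaying the
scenario of Fig. 5.11 shows the path p2 , p1 , p4 being replaced by direct edges to
the new root p2 .

class Node:
    def __init__(self, i, parent):
        self.i, self.parent, self.next = i, parent, None

def on_request(nodes, i, k):
    # Process p_i handles REQUEST(k): route it toward the current root,
    # then re-point parent_i to the requester p_k (path flattening).
    me = nodes[i]
    if me.parent != me.i:
        forward_to = me.parent
        me.parent = k
        on_request(nodes, forward_to, k)   # synchronous stand-in for send
    else:
        me.next = k      # the old root gains a successor (assuming it is
        me.parent = k    # still using the object, cf. Fig. 5.12, line 13)

# Fig. 5.11: p4 is the root; p2 requests and becomes the new root.
nodes = {x: Node(x, p) for x, p in [(1, 4), (2, 1), (3, 4), (4, 4), (5, 2)]}
nodes[2].parent = 2                          # p_2 roots itself first
on_request(nodes, 1, 2)                      # REQUEST(2) sent to parent p_1
print({x: nodes[x].parent for x in nodes})   # p_1 and p_4 now point to p_2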

Heuristic Used by the Algorithm The previous discussion has shown that the
important local variable for a process pi to enter the queue is parenti . It follows
from the modification of the edges of the spanning tree, which are entailed by the

messages REQUEST (), that the variable parenti points to the process pk that has
issued the last request seen by pi .
Hence, when pi wants later to enter the queue, it sends its request to the process
pparenti , because this is the last process in the queue from pi ’s point of view.

The Case of the Empty Queue As in the diffusion-based algorithm of Fig. 5.3,
each process pi manages a local Boolean variable object_presenti , whose value is
true if and only if the mobile object is at pi (if we are not interested in the navigation
of a mobile object, this Boolean could be called first_in_queuei ).
As far as the management of the queue is concerned, its implementation has
to render an account of the case where the queue is empty. To that end, let us
consider the case where the queue contains a single process pi . This process
is consequently the first and the last process of the queue, and we have then
object_presenti ∧ (parenti = i).
If pi exits the queue, the queue becomes empty. The previous predicate remains
satisfied, but we have then ¬ interestedi . It follows that, if a process pi is such that
(¬ interestedi ∧ object_presenti ) ∧ (parenti = i), it knows that it was the last user
of the object, and that the queue is now empty.

5.4.3 An Adaptive Algorithm Based on a Distributed Queue

Structure of the Algorithm and Initialization The structure of the algorithm is
the same as in both previous algorithms (see Fig. 5.2). In order to be as generic
as possible, and allow for an easy comparison with the previous algorithms, we
continue using the operation names acquire_object() and release_object(), instead
of enter() and exit(), respectively.
Initially, the object is located at some predetermined process pk . This is a process
that is fictitiously placed at the head of the distributed empty queue (see the predicate
associated with an empty queue). The local variables are then initialized as follows:
• object_presentk is initialized to true,
• ∀ i ≠ k: object_presenti is initialized to false,
• ∀ i: parenti is initialized to k, and interestedi is initialized to false.

Behavior of a Process The algorithm is described in Fig. 5.12. As in the previous
algorithms, the four parts of code (except the wait statement at line 4) are mutually
exclusive.
When a process invokes acquire_object() (i.e., when it wants to enter the queue),
it first sets interestedi to the value true (line 1). If object_presenti is equal to false,
pi sends the message REQUEST (i) to its parent in order to be added at the end of the
queue, and defines itself as the new end of the queue (line 3). Then, it waits until it
has obtained the object, i.e., it is the head of the queue (lines 4 and 10).
If object_presenti is equal to true, pi is at the head of the queue (it has the object).
This is due to the following observation. Just before invoking acquire_object() to en-
ter the queue (obtain the object), pi was such that (¬ interestedi ) ∧ object_presenti ,

operation acquire_object() is
(1) interestedi ← true;
(2) if (¬ object_presenti ) then
(3) send REQUEST (i) to pparenti ; parenti ← i;
(4) wait(object_presenti )
(5) end if.

operation release_object() is
(6) interestedi ← false;
(7) if (nexti ≠ ⊥) then
(8) send OBJECT () to pnexti ; object_presenti ← false; nexti ← ⊥
(9) end if.

when OBJECT () is received do
(10) object_presenti ← true.

when REQUEST (k) is received do
(11) if (parenti ≠ i)
(12) then send REQUEST (k) to pparenti
(13) else if (interestedi ) then nexti ← k
(14) else send OBJECT () to pk ; object_presenti ← false
(15) end if
(16) end if;
(17) parenti ← k.

Fig. 5.12 A navigation algorithm based on a distributed queue (code for pi )

from which we conclude that, the last time pi has invoked release_object(), it was
such that nexti = ⊥ (line 7), and no other process has required the object (otherwise,
pi would have set object_presenti to false at line 14). It follows that the object re-
mained at pi since the last time it used it (i.e., the queue was empty and pi was its
fictitious single element).
When pi invokes release_object(), it first resets interestedi to the value false,
i.e., it exits the queue (line 6). Then, if nexti = ⊥, pi keeps the object. Otherwise, it
first sends the object to pnexti (i.e., it sends it the property “you are the head of the
queue”), and then resets object_presenti to false, and nexti to ⊥ (line 8).
When pi receives a message REQUEST (k), its behavior depends on the fact that
it is, or it is not, the last element of the queue. If it is not, it has only to forward the
request message to its current parent in the spanning tree, and redefine parenti as
being pk (lines 11–12 and 17). This is to ensure that parenti always points to the last
process that, to pi ’s knowledge, has issued a request. Let us remark that the message
REQUEST (k) that is forwarded is exactly the message received. This ensures that the
variables parentx of all the processes px visited by this message will point to pk .
If pi is such that parenti = i, there are two cases. If interestedi is true, pi is in
the queue. It consequently adds pk to the queue (line 13). If interestedi is false, we
are in the case where the queue is empty (pi is its fictitious single element). In this
case, pi has the object and sends it to pk (line 14). Moreover, whatever the case, to
preserve its correct meaning (pparenti is the last process that, to pi ’s knowledge, has
issued a request), pi updates parenti to k (line 17).
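To make the message flow of Fig. 5.12 concrete, here is a minimal, single-threaded simulation sketch (in Python, not part of the book); the class and function names are illustrative assumptions, message passing is modeled by a FIFO list of (destination, message) pairs, and the wait of line 4 is replaced by delivering all pending messages until quiescence.

# Minimal single-threaded simulation sketch of the algorithm of Fig. 5.12
# (illustrative names, not the book's code).

class Process:
    def __init__(self, i, k):
        self.i = i
        self.parent = k             # parent_i: presumed last requester
        self.next = None            # next_i: successor in the queue
        self.interested = False     # interested_i
        self.present = (i == k)     # object_present_i

def acquire(p, net):                # lines 1-5
    p.interested = True
    if not p.present:
        net.append((p.parent, ('REQUEST', p.i)))
        p.parent = p.i              # p_i is the new root of the tree

def release(p, net):                # lines 6-9
    p.interested = False
    if p.next is not None:
        net.append((p.next, ('OBJECT',)))
        p.present, p.next = False, None

def deliver(procs, net):
    while net:
        dest, msg = net.pop(0)
        p = procs[dest]
        if msg[0] == 'OBJECT':      # line 10
            p.present = True
        else:                       # REQUEST(k), lines 11-17
            k = msg[1]
            if p.parent != p.i:     # not the root: forward unchanged
                net.append((p.parent, ('REQUEST', k)))
            elif p.interested:      # last element of the queue
                p.next = k
            else:                   # empty queue: send the unused object
                net.append((k, ('OBJECT',)))
                p.present = False
            p.parent = k            # p_k is now the last known requester

procs = {i: Process(i, k=1) for i in (1, 2, 3)}   # object initially at p1
net = []
acquire(procs[2], net); deliver(procs, net)       # p2 obtains the object
acquire(procs[3], net); deliver(procs, net)       # p3 is queued behind p2
release(procs[2], net); deliver(procs, net)
assert procs[3].present and not procs[2].present

Running the sketch shows the parent pointers converging toward the last requester, exactly as in the walkthrough above.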

Fig. 5.13 From the worst to the best case

5.4.4 Properties

Variable and Message Size, Message Complexity A single bit is needed for each
local variable object_presenti and interestedi ; log2 n bits are needed for each vari-
able parenti , while log2 (n + 1) bits are needed for each variable nexti . As each
message REQUEST () carries a process identity, it needs log2 n bits. If the al-
gorithm is used only to implement a queue, the message OBJECT () does not have
to carry application data, and a single bit is needed to distinguish the two message
types.
In the best case, an invocation of acquire_object() does not generate REQUEST ()
messages, while it gives rise to (n − 1) REQUEST () messages plus one OBJECT ()
message in the worst case. This case occurs when the spanning tree is a chain.
The left side of Fig. 5.13 considers the worst case: The process denoted pi issues
a request, and the object is at the process denoted pj . The right side of the figure
shows the spanning tree after the message REQUEST (i) has visited all the processes.
The new spanning tree is optimal in the sense that the next request by a process
will generate only one request message. More generally, it has been shown that
the average number of messages generated by an invocation of acquire_object() is
O(log n).

Adaptivity The adaptivity property concerns each process pi taken individually.
It has been defined as follows: If after some time pi never invokes acquire_object(),
there is a finite time after which it does not receive REQUEST () messages, and it
consequently stops participating in the algorithm.
Considering such a process pi , let τ be a time after which all the messages
REQUEST (i) that have been sent have been received and processed. Moreover, let
{k1 , k2 , . . .} be the set of identities of the processes whose parent is pi at time τ (this
set of processes is bounded).
Let us first observe that, as pi never sends a message REQUEST (i) at any time
after τ , it follows from line 17 that no process py , y ∈ {1, . . . , n} \ {k1 , k2 , . . .} can
be such that parenty = i after time τ .
Moreover, let us assume that after time τ , pi forwards a message REQUEST (k) it
has received from some process px , x ∈ {k1 , k2 , . . .}. It follows from line 17 that we
have parentx = k after px has forwarded the message REQUEST (k) to pi . Hence, in
the future, pi will no longer receive messages REQUEST () from px . As this is true
for any process px , x ∈ {k1 , k2 , . . .}, and this set is bounded, it follows that there is
a time τ′ ≥ τ after which pi will no longer receive REQUEST () messages.
This adaptivity property is particularly interesting from a scalability point of
view. Intuitively, it means that if, during a long period, some processes do not invoke

Fig. 5.14 Example of an execution

acquire_object(), they are “ignored” by the algorithm whose cost is accordingly
reduced.

5.4.5 Example of an Execution

An example of an execution of the algorithm in a system of five processes,
p1 , . . . , p5 is described in Fig. 5.14. The initial configuration is depicted on the left
of the top line of the figure. The plain arrows represent the spanning tree (pointer
parenti ), while the dashed arrows represent the queue (pointer nexti ). The processes
that are in the queue, or are entering it, are represented with dashed circles. The star
represents the object; it is placed close to the process at which it is currently located.
Both the processes p2 and p3 invoke acquire_object() and send to their parent
p1 the message REQUEST (2) and REQUEST (3), respectively. Let us assume that the
message REQUEST (2) is the one that arrives first at p1 . This process consequently
sends the object to p2 . We obtain the structure depicted in the middle of the top line
of the figure (in which the message REQUEST (3) has not yet been processed by p1 ).
When p1 receives the message REQUEST (3), it forwards it to p2 , and defines p3
as its new parent in the tree. Then, when p2 receives the message REQUEST (3), it
defines p3 as its new parent (for its next requests), and sets next2 to point to p3 . The
queue including p2 and p3 is now formed (right of the top line of the figure).
Then, p4 invokes acquire_object(). It sends the message REQUEST (4) to its par-
ent p1 , and considers itself the new root. When p1 receives this message, it forwards
REQUEST (4) to its previous parent p3 , and then updates its pointer parent1 to 4.
Finally, when p3 receives REQUEST (4), it updates accordingly parent3 and next3 ,
which now point to p4 (left of the bottom line of the figure).
Let us assume that p2 invokes release_object(). It consequently leaves the queue
by sending the object to the process pointed to by next2 (i.e., p3 ). This is depicted

in the middle of the bottom line of the figure. Finally, the last subfigure (right of the
bottom line) depicts the value of the pointers parenti and nexti after a new invocation
of acquire_object() by process p2 .
As we can see, when all messages generated by the invocations of acquire_
object() and release_object() have been received and processed, the sets of point-
ers parenti and nexti define a single spanning tree and a single queue, respectively.
Moreover, this example illustrates also the adaptivity property. When considering
the last configuration, if p5 and p1 do not invoke the operation acquire_object(),
they will never receive messages generated by the algorithm.

5.5 Summary
This chapter was on algorithms that allow a mobile object to navigate a network
made up of distributed processes. The main issue that these algorithms have to solve
lies in the routing of both the requests and the mobile object, so that any process
that requires the object eventually obtains it. Three navigation algorithms have been
presented. They differ in the way the requests are disseminated, and in the way the
mobile object moves to its destination.
The first algorithm, which assumes a fully connected network, requires O(n)
messages per use of the mobile object. The second one, which uses only the edges
of a fixed spanning tree built on the process network, requires O(D) messages,
where D is the diameter of the spanning tree. The last algorithm, which assumes a
fully connected network, manages a spanning tree whose shape evolves according
to the requests issued by the processes. Its average message cost is O(log n). This
algorithm has the noteworthy feature of being adaptive, namely, if after some time a
process is no longer interested in the mobile object, there is a finite time after which
it is no longer required to participate in the algorithm.
When the mobile object is a (stateless) token, a mobile object algorithm is noth-
ing more than a token-based mutual exclusion algorithm. Actually, the algorithms
presented in this chapter were first introduced as token-based mutual exclusion al-
gorithms.
Finally, as far as algorithmic principles are concerned, an important algorithmic
notion presented in this chapter is the “edge reversal” notion.

5.6 Bibliographic Notes
• The diffusion-based algorithm presented in Sect. 5.2 was proposed independently
by G. Ricart and A.K. Agrawala [328], and I. Suzuki and T. Kasami [361]. This
algorithm is generalized to arbitrary (connected) networks in [176].
• The algorithm presented in Sect. 5.3 is due to K. Raymond [304]. This algorithm,
which assumes a statically defined tree spanning the network of processes, is
based on the “edge reversal” technique.

• The edge reversal technique was first proposed (as far as we know) by E. Gafni
and D. Bertsekas [141]. This technique is used in [78] to solve resource alloca-
tion problems. A monograph entirely devoted to this technique has recently been
published [388].
• The algorithm presented in Sect. 5.4 is due to M. Naimi and M. Trehel [275, 374].
A formal proof of it, and the determination of its average message complexity
(namely, O(log n)), can be found in [276].
• A generic framework for mobile object navigation along trees is presented
in [172]. Both the algorithms described in Sects. 5.3 and 5.4, and many other
algorithms, are particular instances of this very general framework.
• Variants of the algorithms presented in Sects. 5.3 and 5.4 have been proposed.
Among them are the algorithm presented by J.M. Bernabéu-Aubán and M.
Ahamad [50], the algorithm proposed by M.L. Nielsen and M. Mizuno [279]
(see Exercise 4), a protocol proposed by J.L.A. van de Snepscheut [355], and the
arrow protocol proposed by M.J. Demmer and M. Herlihy [105]. (The message
complexity of this last algorithm is investigated in [182].)
• The algorithm proposed by J.L.A. van de Snepscheut [355] extends a tree-based
algorithm to work on any connected graph.
• Considering a mobile object which is a token, a dynamic heuristic-based navi-
gation algorithm is described in [345]. Techniques to regenerate lost tokens are
described in [261, 285, 321]. These techniques can be extended to more sophisti-
cated objects.

5.7 Exercises and Problems

1. Modify the navigation algorithm described in Fig. 5.3 (Sect. 5.2), so that all the
local variables request_byi [k] have a bounded domain [1..M].
(Hint: consider the process that is the current user of the object.)
2. The navigation algorithm described in Fig. 5.3 assumes that the underlying com-
munication network is a complete point-to-point network. Generalize this algo-
rithm so that it works on any connected network (i.e., a non-necessarily complete
network).
Solution in [176].
3. A greedy version of the spanning tree-based algorithm described in Fig. 5.8
(Sect. 5.3) can be defined as follows. When a process pi invokes acquire_object()
while parenti ≠ i (i.e., the mobile object is not currently located at the invoking
process), pi adds its identity i at the head of queuei (and not at its tail as done in
Fig. 5.8). The rest of the algorithm is left unchanged.
What is the impact of this modification on the order in which the processes
obtain the mobile object? Does the liveness property remain satisfied? (Justify
your answers.)
Solution in [304].

operation acquire_object() is
(1) interestedi ← true;
(2) if (¬ object_presenti ) then
(3) send REQUEST (i, i) to pparenti ; parenti ← i;
(4) wait(object_presenti )
(5) end if.

operation release_object() is
(6) interestedi ← false;
(7) if (nexti ≠ ⊥) then
(8) send OBJECT () to pnexti ; object_presenti ← false; nexti ← ⊥
(9) end if.

when OBJECT () is received do
(10) object_presenti ← true.

when REQUEST (j, k) is received do
(11) if (parenti ≠ i)
(12) then send REQUEST (i, k) to pparenti
(13) else if (interestedi ) then nexti ← k
(14) else send OBJECT () to pk ; object_presenti ← false
(15) end if
(16) end if;
(17) parenti ← j .

Fig. 5.15 A hybrid navigation algorithm (code for pi )

4. In the algorithm described in Fig. 5.8 (Sect. 5.3), the spanning tree is fixed, and
both the requests and the object navigate its edges (in opposite direction). Dif-
ferently, in the algorithm described in Fig. 5.12, the requests navigate a spanning
tree that they dynamically modify, and the object is sent directly from its current
owner to its next owner.
Hence the idea to design a variant of the algorithm of Fig. 5.12 in which the
requests are sent along a fixed spanning tree (only the direction of its edges is mod-
ified according to requests), and the object is sent directly from its current user
to its next user. Such an algorithm is described in Fig. 5.15. A main difference
with both previous algorithms lies in the message REQUEST (). Such a message
carries now two process identities j and k, where j is the identity of the process that sent the message, while k is the identity of the process from which this request originates. (In the algorithm of Fig. 5.8, which is based on the notion
of a proxy process, a request message carries only the identity of its sender. Dif-
ferently, in the algorithm of Fig. 5.12, a request message carries only the identity
of the process from which the request originates.) The local variables have the
same names (interestedi , object_presenti , parenti , nexti ), and the same meaning
as in the previous algorithms.
Is this algorithm correct? If it is not, find a counterexample. If it is, prove its
correctness and compute its message complexity.
Solution in [279].
Part II
Logical Time and Global States
in Distributed Systems

This part of the book, which consists of four chapters (Chap. 6 to Chap. 9), is de-
voted to the concepts of event, local state, and global state of a distributed compu-
tation and associated notions of logical time. These are fundamental notions that
provide application designers with sane foundations on the nature of asynchronous
distributed computing in reliable distributed systems.

Chapter 6 shows how a distributed computation can be represented as a partial
order on the set of events produced by the processes. It also introduces the notion of
a consistent global state and presents two algorithms that compute such global states
on the fly. The notion of a lattice of global states and its interest are also discussed.
Chapter 7 introduces distinct notions of logical time encountered in distributed
systems, namely, linear time (also called scalar time, or Lamport’s time), vector
time, and matrix time. Each type of time is defined, its main properties are stated,
and examples of its uses are given.
Chapter 8 addresses distributed checkpointing in asynchronous message-passing
systems. It introduces the notion of communication and checkpoint pattern, and
presents two consistency conditions which can be associated with such an abstrac-
tion of a distributed computation. The first one (called z-cycle-freedom) captures the
absence of cycles among local checkpoints, while the second one (called rollback-
dependency trackability) allows us to associate, without additional computation, a
global checkpoint with each local checkpoint. Checkpointing algorithms that ensure
these properties are presented.
Finally, Chap. 9, which is the last chapter of this part, presents general techniques
(called synchronizers) to simulate a synchronous system on top of an asynchronous
distributed system.
Chapter 6
Nature of Distributed Computations
and the Concept of a Global State

A sequential execution can be represented by the sequence (trace) of consecutive lo-
cal states it has produced, or, given its initial state, by the sequence of statements that
have been executed. Hence, a question that comes naturally to mind is the following
one: How do we model a distributed execution?
This chapter first answers this question. To that end, it gives basic definitions,
and presents three ways to model a distributed execution, namely, a partial order on
a set of events, a partial order on a set of local states, and a lattice of global states.
While these three types of models are equivalent, it appears that each one is more
appropriate than the others to analyze and understand specific features of distributed
executions.
The chapter then focuses on a fundamental notion of distributed computing,
namely, the notion of a global state. The chapter analyzes global states and presents
several distributed algorithms, which compute on the fly global states of a distributed
application. These algorithms are observation algorithms (they have to observe an
execution without modifying its behavior). It is shown that the best that can be done
is the computation of a global state in which a distributed execution has passed or
could have passed. This means that no process can know if the distributed execution
has passed or has not passed through the global state which is returned as a re-
sult. This noteworthy feature illustrates the relativistic nature of the observation of
distributed computations. Despite this relativistic feature, the computation of such
global states allows distributed computing problems to be solved (such as the detec-
tion of stable properties).
Both the terms “global state” and “snapshot” are used with the same meaning in
the literature. They have to be considered as synonyms. This chapter uses the term
global state.

Keywords Event · Causal dependence relation · Causal future · Causal path ·


Causal past · Concurrent (independent) events · Causal precedence relation ·
Consistent global state · Cut · Global state · Happened before relation ·
Lattice of global states · Observation · Marker message · Nondeterminism ·
Partial order on events · Partial order on local states · Process history ·
Process local state · Sequential observation


6.1 A Distributed Execution Is a Partial Order on Local Events

This chapter considers asynchronous systems made up of n processes p1 , . . . , pn ,
where the identity of process pi is its index i. The power of a process is that of a
Turing machine enriched with point-to-point send and receive operations.

6.1.1 Basic Definitions

Events The execution of a statement by a process is called an event. Hence, being
sequential, a process consists of a sequence of events. In a distributed system, three
types of events can be distinguished.
• A communication event involves a process and a channel connecting this process
to another process. There are two types of communication events:
– A send event occurs when a process sends a message to another process.
– A receive event occurs when a process receives a message from another pro-
cess.
• The other events are called internal events. Such an event involves a single pro-
cess and no communication channel. Its granularity depends on the observation
level. It can be the execution of a subprogram, the execution of a statement ex-
pressed in some programming language, the execution of a single machine in-
struction, etc. The important point is that an internal event does not involve com-
munication. Hence, any sequence of statements executed by a process between
two communication events can be abstracted by a single internal event.

Process History As each process is sequential, the events executed by a process
pi are totally ordered. The corresponding sequence is sometimes called the history
of pi .
Let e_i^x and hi denote the xth event produced by process pi , and the history of pi , respectively. We consequently have

hi = e_i^1 , e_i^2 , . . . , e_i^x , e_i^(x+1) , . . .

From a notational point of view, we sometimes denote ĥi as the pair (hi , →i ), where hi is the set of events produced by pi , and →i is the (local) total order on the events of hi .

6.1.2 A Distributed Execution Is a Partial Order on Local Events

Let H = ∪1≤i≤n hi (i.e., the set of all the events produced by a distributed execu-
tion). Moreover (to simplify the presentation), let us assume that no process sends
the same message twice.

Message Relation Let M be the set of all the messages exchanged during an exe-
cution. Considering the associated send and receive events, let us define a “message
order” relation, denoted “−→msg ”, as follows. Given any message m ∈ M, let s(m)
denote its send event, and r(m) denote its receive event. We have

s(m) −→msg r(m).

This relation expresses the fact that any message m is sent before being received.

Distributed Computation The flow of control of a distributed execution (computation) is captured by the smallest partial order relation, denoted Ĥ = (H, −→ev), where e_i^x −→ev e_j^y if:
• Process order: i = j ∧ x < y, or
• Message order: ∃m ∈ M: e_i^x = s(m) and e_j^y = r(m), or
• Transitive closure: ∃e: (e_i^x −→ev e) ∧ (e −→ev e_j^y).
“Process order” comes from the fact that each process is sequential; a process
produces one event at a time. “Message order” comes from the fact that a message
has first to be sent in order to be later received. Finally, “transitive closure” binds
process order and message order.
The relation −→ev is usually called the happened before relation. It is also sometimes called the causal precedence relation. There is then a slight abuse of language as, while we can conclude from ¬(e_i^x −→ev e_j^y) that the event e_i^x is not a cause of the event e_j^y, it is not possible to conclude from e_i^x −→ev e_j^y that e_i^x is a cause of e_j^y (as a simple example, two consecutive events on a process are not necessarily “causally related”; they appear one after the other only because the process that issued them is sequential). Despite this approximation, we will use “causally precede” as a synonym of “happen before”. Hence, e_i^x −→ev e_j^y has to be understood as “it is possible that event e_i^x causally affects event e_j^y”.
This approach to model a distributed computation as a partial order on a set of
events is due to L. Lamport (1978). It is fundamental as, being free from physical
time, it captures the essence of asynchronous distributed computations, providing
thereby a simple way to abstract them so that we can analyze and reason on them.
An example of a distributed execution is depicted in the classical space-time diagram in Fig. 6.1. There are three processes, and each bullet represents an event; e_3^3 is an internal event, e_1^3 is a receive event, while e_2^3 is a send event.
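As a small illustration, the following sketch (in Python; the event encoding and all names are assumptions made for this example, not the book's notation) computes the happened-before relation of a toy execution as the transitive closure of process order and message order.

# A small illustrative sketch: events are (process, rank) pairs, and msgs
# maps each send event to its receive event. The function computes the
# happened-before relation by transitive closure.

from itertools import product

def happened_before(histories, msgs):
    hb = set()
    for h in histories:                              # process order
        hb |= {(h[x], h[y]) for x in range(len(h))
                            for y in range(x + 1, len(h))}
    hb |= set(msgs.items())                          # message order
    changed = True
    while changed:                                   # transitive closure
        changed = False
        for (a, b), (c, d) in product(list(hb), repeat=2):
            if b == c and (a, d) not in hb:
                hb.add((a, d)); changed = True
    return hb

# A two-process execution: p1 receives a message m and answers; p2 sends m,
# produces an intermediate event, and receives the answer.
h1 = [('p1', 1), ('p1', 2)]
h2 = [('p2', 1), ('p2', 2), ('p2', 3)]
msgs = {('p2', 1): ('p1', 1),    # m: e_2^1 -> e_1^1
        ('p1', 2): ('p2', 3)}    # answer: e_1^2 -> e_2^3
hb = happened_before([h1, h2], msgs)
assert (('p2', 1), ('p2', 3)) in hb                  # via a causal path
assert (('p1', 1), ('p2', 2)) not in hb              # e_1^1 || e_2^2
assert (('p2', 2), ('p1', 1)) not in hb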

6.1.3 Causal Past, Causal Future, Concurrency, Cut

Causal Path A causal path is a sequence of events a(1), a(2), . . . , a(z) such that
∀x : 1 ≤ x < z : a(x) −→ev a(x + 1).

Fig. 6.1 A distributed execution as a partial order

Hence, a causal path is a sequence of consecutive events related by −→ev. Let us notice that each process history is trivially a causal path.
When considering Fig. 6.1, the sequence of events e_2^2 , e_2^3 , e_3^2 , e_3^3 , e_3^4 , e_1^3 is a causal path connecting the event e_2^2 (sending by p2 of a message to p3 ) to the event e_1^3 (reception by p1 of a message sent by p3 ). Let us also observe that this causal path relates an event on p2 (e_2^3 ) to an event on p1 (e_1^3 ), despite the fact that p2 never sends a message to p1 .

Concurrent Events, Causal Past, Causal Future, Concurrency Set Two events
a and b are concurrent (or independent), denoted a||b, if they are not causally related, i.e., none of them belongs to the causes of the other:

a||b =def ¬(a −→ev b) ∧ ¬(b −→ev a).

Three more notions follow naturally from the causal precedence relation. Let e be
an event.
• Causal past of an event: past(e) = {f | f −→ev e}. This set includes all the events that causally precede the event e.
• Causal future of an event: future(e) = {f | e −→ev f }. This set includes all the events that have e in their causal past.
• Concurrency set of an event: concur(e) = {f | f ∉ (past(e) ∪ future(e))}. This set includes all the events that are not causally related with the event e.
Examples of such sets are depicted in Fig. 6.2, where the event e_2^3 is considered. The events of past(e_2^3 ) are the three events to the left of the left bold line, while the events of future(e_2^3 ) are the six events to the right of the bold line on the right side of the figure.
Fig. 6.2 Past, future, and concurrency sets associated with an event

It is important to notice that while, with respect to physical time, e_3^1 occurs “before” e_2^3 , and e_1^2 occurs “after” e_2^3 , both are independent from e_2^3 (i.e., they are logically concurrent with e_2^3 ). Process p1 cannot learn that the event e_2^3 has been produced before receiving the message from p3 (event e_1^3 , which terminates the causal path starting at e_2^3 on p2 ). A process can learn it only thanks to the flow of control created by the causal path e_2^3 , e_3^2 , e_3^3 , e_3^4 , e_1^3 .

Cut and Consistent Cut A cut C is a set of events which define initial prefixes of process histories. Hence, a cut can be represented by a vector [prefix(h1 ), . . . , prefix(hn )], where prefix(hi ) is the corresponding prefix for process pi . A consistent cut C is a cut such that ∀ e ∈ C: (f −→ev e) ⇒ f ∈ C.
As an example, C = {e_1^2 , e_2^1 , e_2^3 , e_3^1 } is not a cut because the only event from p1 is e_1^2 , which does not define a prefix of the history of p1 (an initial prefix also has to include e_1^1 , because it has been produced before e_1^2 ).
C = {e_1^1 , e_1^2 , e_2^1 , e_2^2 , e_3^1 , e_3^2 } is a cut because its events can be partitioned into initial prefix histories: e_1^1 , e_1^2 is an initial prefix of h1 , e_2^1 , e_2^2 is an initial prefix of h2 , and e_3^1 , e_3^2 is an initial prefix of h3 . It is easy to see that the cut C is not consistent, because e_2^3 ∉ C, e_3^2 ∈ C, and e_2^3 −→ev e_3^2 . Differently, the cut C′ = C ∪ {e_2^3 } is consistent.
The term “cut” comes from the fact that a cut can be represented by a line separating events in a space-time diagram. The events at the left of the cut line are the events belonging to the cut. The cut C and the consistent cut C′ are represented by dashed lines in Fig. 6.3.
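Reusing the encoding and the relation hb of the previous sketch, the left-closure condition defining a consistent cut can be checked directly (again an illustrative sketch, not the book's code):

# A cut, given as a set of events, is consistent iff it is left-closed
# under the happened-before relation hb computed above.

def is_consistent_cut(cut, hb):
    return all(f in cut for (f, e) in hb if e in cut)

C = {('p1', 1), ('p2', 1)}            # initial prefixes of h1 and h2
assert is_consistent_cut(C, hb)
D = {('p1', 1)}                       # receive of m without its send
assert not is_consistent_cut(D, hb)   # hence not consistent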

6.1.4 Asynchronous Distributed Execution with Respect to Physical Time

As we have seen, the definition of a distributed execution does not refer to physical
time. This is similar to the definition of a sequential execution, and is meaningful
as long as we do not consider real-time distributed programs. Physical time is a
resource needed to execute a distributed program, but is not a programming object

Fig. 6.3 Cut and consistent cut

Fig. 6.4 Two instances of the same execution

accessible to the processes. (Physical time can be known only by an omniscient
global observer, i.e., an observer that is outside the computation.)
This means that the sets of distributed executions with the same set of events and
the same partial order on these events are the very same execution, whatever the
physical time at which the events have been produced. This is depicted in Fig. 6.4.
This means that equivalent executions are obtained when abstracting space/time
diagrams (as introduced in Sect. 1.1.1) from the duration of both real-time intervals
between events and message transfer delays.
The events, which are denoted a, . . . , j , are the same in both executions. As an
example, a is the event abstracting the execution of the statement x1 ← 0 (where x1
is a local variable of the process p1 ), g is the sending of a given message by p2 to
p1 , and d is the reception of this message by p1 .
Physical time is represented by a graduated arrow at the bottom of each execu-
tion, where τ0 denotes the starting time of both executions. Each small segment of
a graduated arrow represents one physical time unit.
This shows that the partial order on events produced by a distributed execution
captures causal dependences between events, and only them. There is no notion of
“duration” known by the processes. Moreover, this shows that the only way for a
process to learn information from its environment is by receiving messages.

Fig. 6.5 Consecutive local states of a process pi

In an asynchronous system, the passage of time alone does not provide infor-
mation to the processes. This is different in synchronous systems, where, when a
process proceeds to the next synchronous round, it knows that all the processes do
the same. (This point will be addressed in Chap. 9, which is devoted to synchroniz-
ers.)

6.2 A Distributed Execution Is a Partial Order on Local States

From Events to Local States Each process pi starts from an initial local state denoted σ_i^0 . Then, its first event e_i^1 entails its move from σ_i^0 to its next local state σ_i^1 , and more generally, its xth event e_i^x entails its progress from σ_i^(x−1) to σ_i^x . This is depicted in Fig. 6.5, where the small rectangles denote the consecutive local states of process pi .
We sometimes use the transition-like notation σ_i^x = δ(σ_i^(x−1) , e_i^x ) to state that the statement generating the event e_i^x makes pi progress from the local state σ_i^(x−1) to the local state σ_i^x .

A Slight Modification of the Relation −→ev In order to obtain a simple definition for a relation on local states, let us consider the relation on events denoted −→ev′, which is −→ev enriched with reflexivity. This means that the “process order” part of the definition of −→ev, namely i = j ∧ x < y, is extended to i = j ∧ x ≤ y (i.e., by definition, each event precedes itself).

A Partial Order on Local States Let S be the set of all local states produced by a distributed execution. Thanks to −→ev′, we can define, in a simple way, a partial order on the elements of S. This relation, denoted −→σ, is defined as follows:

σ_i^x −→σ σ_j^y =def e_i^(x+1) −→ev′ e_j^y .

This definition is illustrated in Fig. 6.6. There are two cases.
• If e_i^(x+1) is the event associated with the sending of a message, and e_j^y the event associated with its reception, the local state σ_i^x preceding e_i^(x+1) “happens before” (causally precedes) the local state σ_j^y generated by e_j^y (top of the figure).

Fig. 6.6 From a relation on events to a relation on local states

• If the events e_i^(x+1) and e_j^y have been produced by the same process (i = j ), and are consecutive (y = x + 1), then the local state σ_i^x preceding e_i^(x+1) = e_j^y “happens before” (causally precedes) the local state σ_j^y = σ_i^(x+1) generated by e_j^y (bottom of the figure). In this case, the reflexivity (e_i^(x+1) is e_j^y ) can be interpreted as an “internal communication” event, where pi sends to itself a fictitious message. The corresponding send event e_i^(x+1) and receive event e_j^y are then merged to define a single internal event.
It follows from the definition of −→σ that a distributed execution can be abstracted as a partial order Ŝ on the set of the process local states, namely, Ŝ = (S, −→σ).

Concurrent Local States Two local states σ 1 and σ 2 are concurrent (or indepen-
dent, denoted σ 1||σ 2) if none of them causally precedes the other one, i.e.,

σ1 || σ2 =def ¬(σ1 −→σ σ2) ∧ ¬(σ2 −→σ σ1).

It is important to notice that two concurrent local states may coexist at the same physical time. This is, for example, the case of the local states σ_i^(x+1) and σ_j^(y−1)
in the top of Fig. 6.6 (this coexistence lasts during the—unknown and arbitrary—
transit time of the corresponding message). On the contrary, this can never occur for
causally dependent local states (a “cause” local state no longer exists when any of
its “effect” local states starts existing).

6.3 Global State and Lattice of Global States
6.3.1 The Concept of a Global State

Global State and Consistent Global State A global state Σ of a distributed exe-
cution is a vector of n local states, one per process:
Σ = [σ1 , . . . , σi , . . . , σn ],
where, for each i, σi is a local state of process pi .
Intuitively, a consistent global state is a global state that could have been ob-
served by an omniscient external observer. More formally, it is a global state
Σ = [σ1 , . . . , σi , . . . , σn ] such that
∀i, j : i ≠ j ⇒ σi ||σj .
This means no two local states of a consistent global state can be causally dependent.
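Anticipating the correspondence between cuts and consistent global states established in Sect. 6.4.3, the following sketch (which extends the previous ones and keeps their assumed encoding) checks the consistency of a global state given as a vector of per-process prefix lengths:

# A global state is given by per-process prefix lengths; it is consistent
# iff the corresponding cut -- the set of events executed to reach each
# local state -- is a consistent cut.

def is_consistent_global_state(lengths, histories, hb):
    cut = {e for h, n in zip(histories, lengths) for e in h[:n]}
    return is_consistent_cut(cut, hb)

assert is_consistent_global_state([1, 1], [h1, h2], hb)      # both sigma^1
assert not is_consistent_global_state([1, 0], [h1, h2], hb)  # orphan m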

Global State Reachability Let Σ1 = [σ1 , . . . , σi , . . . , σn ] be a consistent global state. The global state Σ2 = [σ′1 , . . . , σ′i , . . . , σ′n ] is directly reachable from Σ1 if there is a process pi and an event ei (representing a statement that pi can execute in its local state σi ) such that
• ∀j ≠ i : σ′j = σj , and
• σ′i = δ(σi , ei ).
Hence, direct reachability states that, the distributed execution being in the global
state Σ1 , there is a process pi that produces its next event, and this event directs the
distributed execution to enter the global state Σ2 , which is the same as Σ1 except
for its ith entry.
It follows that, given a consistent global state Σ1 = [σ1 , . . . , σi , . . . , σn ], for
any i, 1 ≤ i ≤ n, and assuming ei exists, the consistent global state Σ2 =
[σ1 , . . . , δ(σi , ei ), . . . , σn ] is directly reachable from Σ1 . This is denoted Σ2 =
δ(Σ1 , ei ).
More generally, Σ1 and Σz being two consistent global states, Σz is reachable from Σ1 (denoted Σ1 −→Σ Σz ) if there is a sequence of consistent global states Σ2 , Σ3 , etc., such that, for any k, 1 < k ≤ z, the global state Σk is directly reachable from the global state Σk−1 . By definition, for any Σ1 , Σ1 −→Σ Σ1 .

6.3.2 Lattice of Global States

Let us consider the very simple two-process distributed execution described in
Fig. 6.7. While it is in its initial local state, p1 waits for a message from p2 . The
reception of this message entails its progress from σ10 to σ11 . Then, p1 sends back
an answer to p2 and moves to σ12 . The behavior of p2 can be easily deduced from
the figure.

Fig. 6.7 A two-process


distributed execution

Reachability Graph Interestingly, it is possible to represent all the consistent
global states through which a distributed execution can pass, as depicted in Fig. 6.8. To simplify the notation, the global state [σ_1^x , σ_2^y ] is denoted [x, y]. Hence, the ini-
tial global state Σinit is denoted [0, 0], while the final global state Σfinal is denoted
[2, 3].
Starting from the initial global state, only p2 can produce an event, namely the
sending of a message to p1 . It follows that a single consistent global state can be
directly attained from the initial global state [0, 0], namely the global state [0, 1].
Then, the execution being in the global state [0, 1], both p1 and p2 can produce
an event. If p1 produces e_1^1 , the execution progresses from the global state [0, 1] to [1, 1]. Differently, if p2 produces its next event e_2^2 , the execution progresses from the global state [0, 1] to the global state [0, 2]. Progressing this way, we can build the reachability graph including all the consistent global states that can be visited by the distributed execution depicted in Fig. 6.7.

Fig. 6.8 Lattice of consistent global states

Fig. 6.9 Sequential observations of a distributed computation

A Distributed Execution as a Lattice of Global States The fact that the previous execution has two processes generates a graph with “two dimensions”: one associated with the events issued by p1 (edges going from right to left in the figure),
the other one associated with the events issued by p2 (edges going left to right in
the figure). More generally, an n-process execution gives rise to a graph with “n
dimensions”.
The reachability graph actually has a lattice structure. A lattice is a directed graph
such that any two vertices have a unique greatest common predecessor and a unique
lowest common successor. As an example, the consistent global states denoted [2, 1]
and [0, 2] have several common predecessors, but have a unique greatest common
predecessor, namely the consistent global state denoted [0, 1]. Similarly they have a
unique lowest common successor, namely the consistent global state denoted [2, 2].
As we will see later, these lattice properties become particularly relevant when one
is interested in determining if some global state property is satisfied by a distributed
execution.
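The lattice can be computed mechanically for small executions. The following sketch (extending the previous ones; names are illustrative) enumerates, by breadth-first search, all consistent global states reachable one event at a time from the initial state:

# Breadth-first enumeration of the lattice of consistent global states
# of the two-process execution above, firing one event at a time.

def lattice(histories, hb):
    init = (0,) * len(histories)
    seen, frontier = {init}, [init]
    while frontier:
        state = frontier.pop(0)
        for i, h in enumerate(histories):
            if state[i] < len(h):
                nxt = state[:i] + (state[i] + 1,) + state[i + 1:]
                if nxt not in seen and \
                   is_consistent_global_state(list(nxt), histories, hb):
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen

states = lattice([h1, h2], hb)
assert (0, 0) in states and (2, 3) in states   # initial and final states
assert (1, 0) not in states                    # would contain the orphan m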

6.3.3 Sequential Observations

Sequential Observation of a Distributed Execution A sequential observation of a distributed execution is a sequence including all its events, respecting their partial ordering. The distributed execution of Fig. 6.7 has three sequential observations, which are depicted in Fig. 6.9. They are the following:
• O1 = e_2^1 , e_1^1 , e_1^2 , e_2^2 , e_2^3 . This sequential observation is depicted by the dashed line at the left of Fig. 6.9.

• O2 = e_2^1 , e_1^1 , e_2^2 , e_1^2 , e_2^3 . This sequential observation is depicted by the dotted line in the middle of Fig. 6.9.
• O3 = e_2^1 , e_2^2 , e_1^1 , e_1^2 , e_2^3 . This sequential observation is depicted by the dashed/dotted line at the right of Fig. 6.9.
As each observation is a total order on all the events that respects their partial or-
dering, it follows that we can go from one observation to another one by permuting
any pair of consecutive events that are concurrent. As an example, O2 can be obtained from O1 by permuting e_1^2 and e_2^2 (which are independent events). Similarly, as e_1^1 and e_2^2 are independent events, permuting them in O2 provides us with O3 .
Let us finally observe that the intersection of all the sequential observations (i.e.,
their common part) is nothing more than the partial order on the events associated
with the computation.
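Equivalently, the sequential observations of a (small) execution are exactly the total orders on its events that respect −→ev, i.e., the linear extensions of the partial order. A brute-force sketch, extending the previous ones (factorial cost, so only for toy executions):

# Generate the sequential observations of the execution above as the
# permutations of its events that respect hb.

from itertools import permutations

def observations(histories, hb):
    events = [e for h in histories for e in h]
    for perm in permutations(events):
        pos = {e: x for x, e in enumerate(perm)}
        if all(pos[a] < pos[b] for (a, b) in hb):
            yield perm

assert len(list(observations([h1, h2], hb))) == 3   # O1, O2, O3 of Fig. 6.9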

Remark 1 As each event makes the execution proceed from a consistent global
state to another consistent global state, an observation can also be defined as a se-
quence of consistent global states, each global state being directly reachable from
its immediate predecessor in the sequence. As an example, O1 corresponds to the
sequence of consistent global states [0, 0], [0, 1], [1, 1], [2, 1], [2, 2], [2, 3].

Remark 2 Let us insist on the fact that the notion of global state reachability
considers a single event at a time. During an execution, independent events can be
executed in any order or even simultaneously. If, for example, e11 and e22 are executed
“simultaneously”, the execution proceeds “directly” from the global state [0, 1] to
the global state [1, 2]. Actually, whatever does really occur, it is not possible to know
it. The advantage of the lattice approach, which considers one event at a time, lies
in the fact that no global state in which the execution could have passed is missed.

6.4 Global States Including Process States and Channel States

6.4.1 Global State Including Channel States

In some cases, we are not interested in global states consisting of only one local state
per process, but in global states consisting both of process local states and channel
states.
To that end we consider that processes are connected by unidirectional channels.
This is without loss of generality as a bidirectional channel connecting pi and pj
can be realized with two unidirectional channels, one from pi to pj and one from pj
to pi . The state of the channel from pi to pj consists in the messages that have been
sent by pi , and have not yet been received by pj . A global state is consequently
made up of two parts:
• a vector Σ with a local state per process, plus
• a set ΣC whose each element represents the state of a given channel.

Fig. 6.10 Illustrating the notations “e ∈ σi ” and “f ∉ σi ”

Fig. 6.11 In-transit and orphan messages

6.4.2 Consistent Global State Including Channel States

Notation Let σi be a local state of a process pi , and e and f be two events pro-
duced by pi . The notation e ∈ σi means that pi has issued e before attaining the
local state σi , while f ∉ σi means that pi has issued f after its local state σi . We
then say “e belongs to the past of σi ”, and “f belongs to the future of σi ”. These
notations are illustrated in Fig. 6.10.

In-transit Messages and Orphan Messages The notions of in-transit and orphan
messages are with respect to an ordered pair of local states. Let m be a message
sent by pi to pj , and σi and σj two local states of pi and pj , respectively. Let us
recall that s(m) and r(m) are the send event (by pi ) and the reception event (by pj )
associated with m, respectively.
• A message m is in-transit with respect to the ordered pair of local states ⟨σi , σj ⟩ if s(m) ∈ σi ∧ r(m) ∉ σj .
• A message m is orphan with respect to the ordered pair of local states ⟨σi , σj ⟩ if s(m) ∉ σi ∧ r(m) ∈ σj .
These definitions are illustrated in Fig. 6.11, where there are three processes, and two channels, one from p1 to p2 , and one from p3 to p2 . As, on the one hand, s(m1 ) ∈ σ1 and s(m3 ) ∈ σ1 , and, on the other hand, r(m1 ) ∈ σ2 and r(m3 ) ∈ σ2 , both m1 and m3 are “in the past” of the directed pair ⟨σ1 , σ2 ⟩. Differently, as s(m5 ) ∈ σ1 and r(m5 ) ∉ σ2 , the message m5 is in-transit with respect to the ordered pair ⟨σ1 , σ2 ⟩.
Similarly, the message m2 belongs to the “past” of the pair of local states ⟨σ3 , σ2 ⟩. Differently, with respect to this ordered pair, the message m4 has been received by p2 (r(m4 ) ∈ σ2 ), but has not been sent by p3 (s(m4 ) ∉ σ3 ); hence it is an orphan message.
Considering Fig. 6.11, let us observe that message m5 , which is in-transit with respect to the pair of local states ⟨σ1 , σ2 ⟩, does not make this directed pair inconsistent (it does not create a causal path invalidating σ1 || σ2 ). The case of an orphan message is different. As we can see, the message m4 creates the dependence σ3 −→σ σ2 , and, consequently, we do not have σ3 || σ2 . Hence, orphan messages prevent local states
from belonging to the same consistent global state.

Consistent Global State Let us define the state of a FIFO channel as a sequence
of messages, and the state of a non-FIFO channel as a set of messages. The state of a
channel from pi to pj with respect to a directed pair ⟨σi , σj ⟩ is denoted c_state(i, j ).
As already indicated, it is the sequence (or the set) of messages sent by pi to pj
whose send events are in the past of σi , while their receive events are not in the past
of σj (they are in its “future”).
Let C = {(i, j ) | ∃ a directed channel from pi to pj }. A global state (Σ, CΣ),
where Σ = [σ1 , . . . , σn ] and CΣ = {c_state(i, j )}(i,j )∈C , is consistent if, for any
message m, we have (where ⊕ stands for exclusive or)
• C1: (s(m) ∈ σi ) ⇒ (r(m) ∈ σj ⊕ m ∈ c_state(i, j )), and
• C2: (s(m) ∉ σi ) ⇒ (r(m) ∉ σj ∧ m ∉ c_state(i, j )).
This definition states that, to be consistent, a global state (Σ, CΣ) has to be
such that, with respect to the process local states defined by Σ , (C1) each in-transit
message belongs to the state of the corresponding channel, and (C2) there is no
message received and not sent (i.e., no orphan message).
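The two conditions can be checked mechanically once each local state is encoded by the set of events in its past. The following self-contained sketch (illustrative names and encoding, not the book's code) classifies the messages of a channel as in-transit or orphan with respect to an ordered pair of local states:

# Each local state is encoded by the set of events in its past; msgs maps
# a message name to its (send, receive) event pair.

def channel_state(past_i, past_j, msgs):
    in_transit = [m for m, (s, r) in msgs.items()
                  if s in past_i and r not in past_j]
    orphan = [m for m, (s, r) in msgs.items()
              if s not in past_i and r in past_j]
    return in_transit, orphan

# A channel from p2 to p1 carrying one message m (send e_2^1, receive e_1^1):
msgs21 = {'m': (('p2', 1), ('p1', 1))}
assert channel_state({('p2', 1)}, set(), msgs21) == (['m'], [])  # in-transit
assert channel_state(set(), {('p1', 1)}, msgs21) == ([], ['m'])  # orphan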

Σ Versus (Σ, CΣ) Let us observe that the knowledge of Σ contains implicitly
the knowledge of CΣ . This is because, for any channel, the past of the local state
σi contains implicitly the messages sent by pi to pj up to σi , while the past of the
local state σj contains implicitly the messages sent by pi and received by pj up
to σj . Hence, in the following we use without distinction Σ or (Σ, CΣ).

6.4.3 Consistent Global State Versus Consistent Cut

The notion of a cut was introduced in Sect. 6.1.3. A cut C is a set of events defined
from prefixes of process histories. Let σi be the local state of pi obtained after
executing the events in its prefix history prefix(hi ) as defined by the cut, i.e., σi = δ(σ_i^0 , prefix(hi )) using the transition function-based notation (σ_i^0 being the initial
state of pi ). It follows that [σ1 , . . . , σi , . . . , σn ] is a global state Σ . Moreover, the
cut C is consistent if and only if Σ is consistent.
When considering (Σ, CΣ), we have the following. If the cut giving rise to Σ is consistent, the states of the channels in CΣ correspond to the messages that cross the cut line (their send events belong to the cut, while their receive events do not). If the cut is not consistent, there is at least one message that crosses the cut line in the “bad direction” (its send event does not belong to the cut, while its receive event does; this message is an orphan message).

Fig. 6.12 Cut versus global state

Examples of a cut C′, which is consistent, and a cut C, which is not consistent, were depicted in Fig. 6.3. This execution is reproduced in Fig. 6.12, which is enriched with local states. When considering the consistent cut C′, we obtain the consistent global state Σ′ = [σ_1^2 , σ_2^3 , σ_3^2 ]. Moreover, CΣ′ is such that the state of each channel is empty, except for the channel from p1 to p3 for which we have c_state(1, 3) = {m′}. When considering the inconsistent cut C, we obtain the (inconsistent) global state Σ = [σ_1^2 , σ_2^2 , σ_3^2 ], whose inconsistency is due to the orphan message m.

6.5 On-the-Fly Computation of Global States

6.5.1 Global State Computation Is an Observation Problem

The aim is to design algorithms that compute a consistent global state of a dis-
tributed application made up of n processes p1 , . . . , pn . To that end, a controller (or
observer) process, denoted cpi , is associated with each application process pi . The
role of a controller is to observe the process it is associated with, in such a way that
the set of controllers computes a consistent global state of the application.
Hence, the computation of a consistent global state of a distributed computation
is an observation problem: The controllers have to observe the application processes
without modifying their behavior. This is different from problems such as the nav-
igation of a mobile object (addressed in the previous chapter), where the aim was
to provide application processes with appropriate operations (such as acquire() and
release()) that they can invoke.
The corresponding structural view is described in Fig. 6.13. Ideally, the addi-
tion/suppression of controllers/observers must not modify the execution of the dis-
tributed application.

Fig. 6.13 Global state computation: structural view

6.5.2 Problem Definition

The computation of a global state is launched independently by any number x of
controllers, where 1 ≤ x ≤ n. The problem is defined by the following properties.
• Liveness. If at least one controller launches a global state computation, a global
state is computed.
• Safety. Let (Σ, CΣ) be the global state that is computed.
– Validity. Let Σstart be the global state of the application when the global state
computation starts and Σend its global state when it terminates. Σ is such that Σstart −→Σ Σ and Σ −→Σ Σend .
– Channel consistency. CΣ records all the messages (and only those) that are
in-transit with respect to Σ .
The validity property states that the global state Σ that is computed is consis-
tent and up to date (always returning the initial state would not solve the prob-
lem). Moreover, when considering the lattice of global states associated with the
distributed execution, the computed global state Σ is reachable from Σstart , and
Σend is reachable from it. This defines the “freshness” of Σ . The channel consis-
tency states that CΣ (recorded channel states) is in agreement with Σ (recorded
process local states).

6.5.3 On the Meaning of the Computed Global State

On the Nondeterminism of the Result Considering the lattice of global states
described in Fig. 6.8, let us assume that the computation starts when the application
is in state Σstart = [0, 1] and terminates when the application is in state Σend =
[1, 2]. The validity property states that the computed global state Σ is one of the
global states [0, 1], [1, 1], [0, 2], or [1, 2].

The global state that is computed depends actually on the execution and the inter-
leaving of the events generated by the application processes and by the controllers in
charge of the computation of the global state. The validity property states only that
Σ is a consistent global state that the execution might have passed through. Maybe
the execution passed through Σ , maybe it did not. Actually, there is no means to
know if the distributed execution passed through Σ or not. This is due to fact that
independent events can be perceived, by distinct sequential observers, as having
been executed in a different order (see Fig. 6.9). Hence, the validity property charac-
terizes the best that can be done, i.e., (a) Σ is consistent, and (b) it can have been
passed through by the execution. Nothing stronger can be claimed.

The Case of Stable Properties A stable property is a property that, once true,
remains true forever. “The application has terminated”, “there is a deadlock”, or “a
distributed cell has become inaccessible”, are typical examples of stable properties
that can be defined on the global states of a distributed application. More precisely,
let P be a property on the global states of a distributed execution. If P is a stable
property, we have
P(Σ) ⇒ (∀Σ′ : Σ −→Σ Σ′ : P(Σ′)).

Let Σ be a consistent global state that has been computed, and Σstart and Σend
as defined previously. We have the following:
• As P is a stable property and Σ −→Σ Σend , we have P(Σ) ⇒ P(Σend ). Hence,
if P(Σ) is true, whether the distributed execution passed through Σ or not, we
conclude that P(Σend ) is satisfied, i.e., the property is satisfied in the global state
Σend , which is attained by the distributed execution (and can never be explicitly
known). Moreover, from then on, due to its stability, we know that P is satisfied
on all future global states of the distributed computation.
• If ¬P(Σ), taking the contrapositive of the definition of a stable property, we can
conclude ¬P(Σstart ), but nothing can be concluded on P(Σend ).
In this case, a new global state can be computed, until either a computed global
state satisfies P, or the computation has terminated. (Let us observe that, if P
is “the computation has terminated”, it is eventually satisfied. Special chapters
of the book are devoted to the detection of stable properties such as distributed
termination and deadlock detection.)

6.5.4 Principles of Algorithms Computing a Global State

An algorithm that computes a global state of a distributed execution has to ensure
that each controller cpi records a local state σi of the process pi it is associated with,
and each pair of controllers (cpj , cpi ) has to compute the value c_state(j, i) (i.e.,
the state of the directed channel from pj to pi ).

In order that (Σ, CΣ) be consistent, the controllers have to cooperate to ensure
the conditions C1 and C2 stated in Sect. 6.4.2. This cooperation involves both syn-
chronization and message recording.
• Synchronization. In order that there is no orphan message with respect to a pair
σj , σi , when a message is received from pj , the controller cpi might be forced
to record the local state of pi before giving it the message.
• Message recording. In order that the in-transit messages appear in the state of the
corresponding channels, controllers have to record them in one way or another.
While nearly all global state computation algorithms use the same synchroniza-
tion technique to ensure Σ is consistent, they differ in (a) the technique they use to
record in-transit messages, and (b) the FIFO or non-FIFO assumption they consider
for the directed communication channels.

6.6 A Global State Algorithm Suited to FIFO Channels

The algorithm presented in this section is due to K.M. Chandy and L. Lamport
(1985). It was the first global state computation algorithm proposed. It assumes that
(a) the channels are FIFO, and (b) the communication graph is strongly connected
(there is a directed communication path from any process to any other process).

6.6.1 Principle of the Algorithm

A Local Variable per Process Plus a Control Message The local variable
gs_statei contains the state of pi with respect to the global state computation. Its
value is red (its local state has been recorded) or green (its local state has not yet
been recorded). Initially, gs_statei = green.
A controller cpi , which has not yet recorded the local state of its associated pro-
cess pi , can do it at any time. When cpi records the local state of pi , it atomically
(a) updates gs_statei to the value red, and (b) sends a control message (denoted
MARKER (), and depicted with dashed arrows) on all its outgoing channels. As chan-
nels are FIFO, a message MARKER () is a synchronization message separating the
application messages sent by pi before it from the application messages sent after
it. This is depicted in Fig. 6.14.
When a message is received by the pair (pi , cpi ), the behavior of cpi depends on
the value of gs_statei . There are two cases.
• gs_statei = green. This case is depicted in Fig. 6.15. The controller cpi discovers
that a global state computation has been launched. It consequently participates in
it by atomically recording σi , and sending MARKER () messages on its outgoing
channels to inform their destination processes that a global state computation has
been started.

Fig. 6.14 Recording of a local state

Fig. 6.15 Reception of a MARKER () message: case 1

Fig. 6.16 Reception of a MARKER () message: case 2

Moreover, for the ordered pair of local states ⟨σj , σi ⟩ to be consistent, cpi
defines the state c_state(j, i) of the incoming channel from pj as being empty.
• gs_statei = red. In this case, depicted in Fig. 6.16, cpi has already recorded σi .
Hence, it has only to ensure that the recorded channel state c_state(j, i) is consis-
tent with respect to the ordered pair ⟨σj , σi ⟩. To that end, cpi records the sequence
of messages that are received on this channel between the recording of σi and the
reception of the marker sent by cpj .

internal operation record_ls() is
(1) σi ← current local state of pi ;
(2) gs_statei ← red;
(3) for each k ∈ c_ini do c_state(k, i) ← ∅ end for;
(4) for each j ∈ c_outi do send MARKER () on out_channeli [j ] end for.

when START () is received do
(5) if (gs_statei = green) then record_ls() end if.

when MSG (v) is received on in_channeli [j ] do
(6) case MSG = MARKER then if (gs_statei = green) then record_ls() end if;
(7)                         closedi [j ] ← true
(8)      MSG = APPL   then if (gs_statei = red) ∧ (¬closedi [j ])
(9)                         then add the message APPL (v) at the tail of c_state(j, i)
(10)                        end if;
(11)                        pass the message APPL (v) to pi
(12) end case.

Fig. 6.17 Global state computation (FIFO channels, code for cpi )

Properties Assuming at least one controller process cpi launches a global state
computation (Fig. 6.14), it follows from the strong connectivity of the communication
graph and the rules associated with the control messages MARKER () that exactly
one local state per process is recorded, and exactly one marker is sent on each
directed channel. It follows that a global state is eventually computed.
Let us color a process with the color currently in gs_statei . Similarly, let the
color of an application message be the color of its sender when the message is
sent. According to these colorings, an orphan message is a red message received
by a green process. But, as the channel from pj to pi is FIFO, no red message can
arrive before a marker, from which it follows that there is no orphan message. The
fact that all in-transit messages are recorded follows from the recording rules expressed in
Figs. 6.15 and 6.16. The recorded global state is consequently consistent.

6.6.2 The Algorithm

Local Variables Let c_ini denote the set of process identities j such that there
is a channel from pj to pi ; this channel is denoted in_channeli [j ]. Similarly, let
c_outi denote the set of process identities j such that there is a channel from pi to
pj ; this channel is denoted out_channeli [j ]. These sets are known both by pi and
its controller cpi . (As the network is strongly connected, no set c_ini or c_outi is
empty.)
The local array closedi [c_ini ] is a Boolean array. Each local variable closedi [j ]
is initialized to false, and is set to value true when cpi receives a marker from cpj .

Algorithm Executed by cpi The algorithm executed by a controller process cpi


is described in Fig. 6.17. The execution of the internal operation record_ls(), and
the processing associated with the reception of a message START () or MSG () are
atomic. This means that they exclude each other, and also exclude any concurrent execution
of the process pi . (Among other issues, the current state of pi must not be modified
while cpi records σi .)
One or several controller processes launch the algorithm when they receive an ex-
ternal message START (), while their local variable gs_statei = green (line 5). Such
a process then invokes the internal operation record_ls(). This operation records the
current local state of pi (line 1), switches gs_statei to red (line 2), initializes the
channel states c_state(k, i) to the empty sequence, denoted ∅ (line 3), and sends
markers on pi ’s outgoing channels (line 4).
The behavior of a process cpi that receives a message depends on the message. If
it is a marker (a control message), cpi learns that the sender cpj has recorded σj . Hence,
if not yet done, cpi executes the internal operation record_ls() (line 6). Moreover,
in all cases, it sets closedi [j ] to true (line 7), to indicate that the state of the input
channel from pj has been computed, this channel state being consistent with respect
to the pair of local states ⟨σj , σi ⟩ (see Figs. 6.15 and 6.16).
If the message is an application message, and the computation of c_state(j, i)
has started and is not yet finished (predicate of line 8), cpi adds it at the end of
c_state(j, i). In all cases, the message is passed to the application process pi .
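To make these recording rules concrete, here is a minimal Python sketch of a controller
(the class name, the callback names, and the simulated network of per-channel FIFO lists
are illustrative assumptions; this is a sequential mock-up of the rules of Fig. 6.17, not a
full distributed implementation):

class ChandyLamportController:
    # Sketch of cp_i (Fig. 6.17); network[(i, j)] is a FIFO list simulating
    # the directed channel from p_i to p_j.
    def __init__(self, pid, in_chans, out_chans, network, process_state):
        self.pid = pid
        self.in_chans = in_chans            # ids j such that p_j -> p_i exists
        self.out_chans = out_chans          # ids j such that p_i -> p_j exists
        self.network = network
        self.process_state = process_state  # callable returning p_i's current state
        self.gs_state = "green"
        self.sigma = None                   # recorded local state sigma_i
        self.c_state = {}                   # j -> recorded state of channel (p_j, p_i)
        self.closed = {j: False for j in in_chans}

    def record_ls(self):                    # lines 1-4
        self.sigma = self.process_state()
        self.gs_state = "red"
        self.c_state = {j: [] for j in self.in_chans}
        for j in self.out_chans:
            self.network[(self.pid, j)].append(("MARKER", None))

    def on_start(self):                     # line 5
        if self.gs_state == "green":
            self.record_ls()

    def on_receive(self, j, kind, payload): # lines 6-12
        if kind == "MARKER":
            if self.gs_state == "green":
                self.record_ls()
            self.closed[j] = True
        else:                               # application message
            if self.gs_state == "red" and not self.closed[j]:
                self.c_state[j].append(payload)
            return payload                  # passed to the application process p_i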

6.6.3 Example of an Execution

This section illustrates the previous global state computation algorithm with a sim-
ple example. Its aim is to give the reader a deeper insight into the subtleties of global
state computations.

Application Program and a Simple Execution The application is made up of


two processes p1 and p2 and two channels, one from p1 to p2 , and one from p2 to
p1 . Moreover, both processes have the same behavior, which is described by the
automaton of Fig. 6.18. Process pi is in state σi0 and moves to state σi1 by sending
the message mi to the other process pj . It then returns to its initial state when it
receives the message mj from process pj , and this is repeated forever.

Fig. 6.18 A simple automaton for process pi (i = 1, 2)

A prefix of an execution of p1 and p2 is described in Fig. 6.19.

Superimposing a Global State Computation Figure 6.20 describes a global state


computation. This global state computation is superimposed on the distributed ex-
ecution described in Fig. 6.19. The control processes cp1 and cp2 access the local

Fig. 6.19 Prefix of a simple execution

Fig. 6.20 Superimposing a global state computation on a distributed execution

control variables gs_state1 and gs_state2 , respectively, and send MARKER () mes-
sages. This global state computation involves four time instants (defined from an
external omniscient observer’s point of view).
• At time τ0 , the controller cp1 receives a message START () (line 5). Hence, when
the global state computation starts, the distributed execution is in the global state
Σstart = [σ10 , σ20 ]. The process cp1 consequently invokes record_ls(). It records
the current state of p1 , which is then σ10 , and sends a marker on its only outgoing
channel (lines 1–5).
• Then, at time τ1 , the application message m2 arrives at p1 (lines 8–11). This
message is first handled by cp1 , which adds a copy of it to the channel state
c_state(2, 1). It is then received by p1 which moves from σ11 to σ10 .
• Then, at time τ2 , cp2 receives the marker sent by cp1 (lines 6–7). As gs_state2 =
green, cp2 invokes record_ls(): It records the current state of p2 , which is σ21 , and
sends a marker to cp1 .
• Finally, at time τ3 , the marker sent by cp2 arrives at cp1 . As gs_state1 = red,
cp1 has only to stop recording the messages sent by p2 to p1 . Hence, when the
global state computation terminates, the distributed execution is in the global state
Σend = [σ10 , σ21 ].

Fig. 6.21 Consistent cut associated with the computed global state

Fig. 6.22 A rubber band transformation

It follows that the global state that has been cooperatively computed by cp1 and cp2
is the pair (Σ, CΣ), where Σ = [σ10 , σ21 ], and CΣ = {c_state(1, 2), c_state(2, 1)}
with c_state(1, 2) = ∅ and c_state(2, 1) = ⟨m2 ⟩. It is important to see that this com-
putation is not at all atomic: as illustrated in the space-time diagram, it is distributed
both with respect to space (processes) and time.

Consistent Cut and Rubber Band Transformation Figure 6.21 is a copy of


Fig. 6.19, which explicitly represents the consistent cut associated with the global
state that has been computed. This figure shows that the distributed execution did
not pass through the computed global state.
Actually, as suggested in Sect. 6.1.4, we can play with asynchrony and physi-
cal time to exhibit a distributed execution which has passed through the computed
global state. This distributed execution is obtained by the “rubber band transformation”: in the
space-time diagram, we consider that the process axes are rubber bands that can be stretched
or shrunk as long as no message is going backwards. Such a rubber band transfor-
mation is described in Fig. 6.22. As the execution described in Fig. 6.21 and the
execution described in Fig. 6.22 define the same partial order, they are actually the
same execution.

6.7 A Global State Algorithm Suited to Non-FIFO Channels


This section presents a global state computation algorithm suited to systems with
directed non-FIFO channels. This algorithm, which is due to T.H. Lai and T.H.
Yang (1987), differs from the previous one mainly in the way it records the state of
the channels.

6.7.1 The Algorithm and Its Principles

The local variables c_ini , c_outi , in_channeli [k], and out_channeli [k] have the
same meaning as before. In addition to gs_statei , which is initialized to green and
is eventually set to red, each controller cpi manages two arrays of sets defined as
follows:
• rec_msgi [c_ini ] is such that, for each k ∈ c_ini , we have
rec_msgi [k] = {all the messages received on in_channeli [k] since the beginning}.
• sent_msgi [c_outi ] is such that, for each k ∈ c_outi , we have
sent_msgi [k] = {all the messages sent on out_channeli [k] since the beginning}.
Each array is a log in which cpi records all the messages it receives on each input
channel, and all the messages it sends on each output channel. This means that (dif-
ferently from the previous algorithm), cpi has to continuously observe pi . But now
there is no control message: All the messages are application messages. Moreover,
each message inherits the current color of its sender (hence, each message carries
one control bit).
The basic rule is that, while a green message can always be consumed, a red
message can be consumed only when its receiver is red. It follows that, when a
green process pi receives a red message, cpi has first to record pi ’s local state so
that pi becomes red before being allowed to receive and consume this message.
The algorithm is described in Fig. 6.23. The text is self-explanatory. m.color
denotes the color of the application message m. Line 7 ensures that a red message
cannot be received by a green process, which means that there is no orphan message
with respect to an ordered pair of process local states that have been recorded.
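As an illustration, the following Python sketch transcribes these rules (the class name and
the send_to_cp callback, which forwards the recorded triple to cp, are assumptions made
for the example; only the coloring logic mirrors Fig. 6.23):

class LaiYangController:
    # Sketch of cp_i for non-FIFO channels (Fig. 6.23).
    def __init__(self, pid, in_chans, out_chans, process_state, send_to_cp):
        self.pid = pid
        self.gs_state = "green"
        self.process_state = process_state   # callable returning p_i's current state
        self.send_to_cp = send_to_cp         # callable forwarding the triple to cp
        self.rec_msg = {j: set() for j in in_chans}    # log of received messages
        self.sent_msg = {j: set() for j in out_chans}  # log of sent messages

    def record_ls(self):                     # lines 1-4
        self.gs_state = "red"
        self.send_to_cp(self.pid, self.process_state(),
                        dict(self.rec_msg), dict(self.sent_msg))

    def on_receive(self, j, color, m):       # lines 6-8
        self.rec_msg[j].add(m)
        if color == "red" and self.gs_state == "green":
            self.record_ls()                 # become red before consuming m
        return m                             # passed to the application process p_i

    def on_send(self, j, m):                 # lines 9-10
        self.sent_msg[j].add(m)
        return (self.gs_state, m)            # the message carries its sender's color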

6.7.2 How to Compute the State of the Channels

Let cp be a controller process, known by all, which is in charge of assembling the
computed global state. After it has recorded σi , rec_msgi [c_ini ], and
sent_msgi [c_outi ] (line 3), a controller cpi sends this triple to cp (line 4).
When it has received such a triple from each controller cpi , the controller cp
pieces together all the local states to obtain Σ = [σ1 , . . . , σn ]. As far as CΣ is
concerned, it defines the state of the channel from pj to pi as follows:

c_state(j, i) = sent_msgj [i] \ rec_msgi [j ].



internal operation record_ls() is


(1) σi ← current local state of pi ;
(2) gs_statei ← red;
(3) record σi , rec_msgi [c_ini ], and sent_msgi [c_outi ];
(4) send the previous triple to cp.

when START () is received do


(5) if (gs_statei = green) then record_ls() end if.

when MSG (m) is received on in_channeli [j ] do


(6) rec_msgi [j ] ← rec_msgi [j ] ∪ {m};
(7) if (m.color = red) ∧ (gs_statei = green) then record_ls() end if;
(8) pass MSG (m) to pi .

when MSG (m) is sent on out_channeli [j ] do


(9) m.color ← gs_statei ;
(10) sent_msgi [j ] ← sent_msgi [j ] ∪ {m}.

Fig. 6.23 Global state computation (non-FIFO channels, code for cpi )

Fig. 6.24 Example of a global state computation (non-FIFO channels)

It follows from (a) this definition, (b) the fact that sent_msgj [i] is the set of
messages sent by pj to pi before σj , and (c) the fact that rec_msgi [j ] is the set of
messages received by pi from pj before σi , that c_state(j, i) records the messages
that are in-transit with respect to the ordered pair ⟨σj , σi ⟩. As there is no orphan
message with respect to the recorded local states (see above), the computed global
state is consistent.
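In programming terms, once cp has gathered all the triples, each channel state is a plain
set difference; a minimal sketch (the argument names are illustrative):

def channel_state(sent_log_j, rec_log_i, i, j):
    # c_state(j, i) = sent_msg_j[i] \ rec_msg_i[j]: the messages sent by p_j
    # to p_i before sigma_j and not received by p_i before sigma_i.
    return sent_log_j[i] - rec_log_i[j]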

An Example An example of execution of the previous algorithm is described in


Fig. 6.24. Green messages are depicted with plain arrows, while red messages are
depicted with dashed arrows.
In this example, independently from the other controller processes, each of cp1
and cp2 receives an external START () message, and consequently records the lo-
cal states of p1 and p2 , respectively. After σ2 has been recorded, p2 sends a (red)
message to p4 . When this message is received, σ4 is recorded by cp4 before the message is

passed to p4 . Then, p4 sends a (red) message to p3 . As this message is red, its re-
ception entails the recording of σ3 by cp3 , after which the message is received and
processed by p3 .
The cut corresponding to this global state is indicated by a bold dashed line. It is
easy to see that it is consistent. The in-transit messages are the two green messages
(plain arrows) that cross the cut line.

Remark The logs sent_msgi [c_outi ] and rec_msgi [c_ini ] may require a large
memory, which can be implemented with a local secondary storage. In some ap-
plications, based on the computation of global states, what is relevant is not the
exact value of these sets, but their cardinality. In this case, each set sent_msgi [k],
and each set rec_msgi [j ], can be replaced by a simple counter. As a simple example,
this is sufficient when one wants to know which channels are empty in a computed
global state.
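As a hypothetical illustration, assuming sent_cnt[j][i] (resp. rec_cnt[i][j]) denotes the
recorded number of messages sent by pj to pi (resp. received by pi from pj ), the empty
channels of the computed global state could be listed as follows:

def empty_channels(sent_cnt, rec_cnt, n):
    # Channel (p_j, p_i) is empty in the computed global state iff the two
    # recorded counters are equal (no message is in transit on it).
    return [(j, i) for j in range(n) for i in range(n)
            if j != i and sent_cnt[j].get(i, 0) == rec_cnt[i].get(j, 0)]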

6.8 Summary

This chapter was on the nature of a distributed execution and the associated notion
of a global state (snapshot). It has defined basic notions related to the execution
of a distributed program (event, process history, process local state, event concur-
rency/independence, cut, and global state). It has also introduced three approaches
to model a distributed execution, namely, a distributed execution can be represented
as a partial order on events, a partial order on process local states, or a lattice of
global states.
The chapter then presented algorithms that compute on the fly consistent global
states of a distributed execution. It has shown that the best that can be done is the
computation of a global state that might have been passed through by the distributed
computation. The global state that has been computed is consistent, but no process
can know if it really occurred during the execution. This nondeterminism is inherent
to the nature of asynchronous distributed computing. An aim of this chapter was to
give the reader an intuition of this relativistic dimension, when the processes have
to observe on the fly the distributed execution they generate.

6.9 Bibliographic Notes


• The capture of a distributed execution as a partial order on the events produced
by processes is due to L. Lamport [226]. This is the first paper that formalized the
notion of a distributed execution.
• The first paper that presented a precise definition of a global state of a distributed
computation and an algorithm to compute a consistent global state is due to K.M.
Chandy and L. Lamport [75]. Their algorithm is the one presented in Sect. 6.6.
A formal proof can be found in [75].

• The algorithm by Chandy and Lamport computes a single global state. It has been
extended in [57, 357] to allow for repeated global state computations.
• The algorithm for non-FIFO channels presented in Sect. 6.7 is due to T.H. Lai and
T.H. Yang [222].
• Numerous algorithms that compute consistent global states in particular contexts
have been proposed. As an example, algorithms suited to causally ordered com-
munication are presented in [1, 14]. Other algorithms that compute (on the fly)
consistent global states are presented in [169, 235, 253, 377]. A reasoned con-
struction of an algorithm computing a consistent global state is presented in [80].
A global state computation algorithm suited to large-scale distributed systems
is presented in [213]. An introductory survey on global state algorithms can be
found in [214].
• The algorithms that have been presented are non-inhibitory, in the sense that they
never freeze the distributed execution they observe. The role of inhibition for
global state computation is investigated in [167, 363].
• The view of a distributed computation as a lattice of consistent global states is
presented and investigated in [29, 98, 250]. Algorithms which determine which
global states of a distributed execution satisfy some predefined properties are pre-
sented in [28, 98]. Those algorithms are on-the-fly algorithms.
• A global state Σ of a distributed execution is inevitable if, when considering the
lattice associated with this execution, it belongs to all the sequential observations
defined by this lattice. An algorithm that computes on the fly all the inevitable
global states of a distributed execution is described in [138].
• The causal precedence (happened before) relation has been generalized in several
papers (e.g., [174, 175, 297]).
• Snapshot computation in anonymous distributed systems is addressed in [220].

6.10 Exercises and Problems

1. Adapt the algorithm described in Sect. 6.6 so that the controller processes are
able to compute several consistent global states, one after the other.
Answer in [357].
2. Use the rubber band transformation to give a simple characterization of an orphan
message.
3. Let us consider the algorithm described in Fig. 6.25 in which each controller
process cpi manages the local variable gs_statei (as in the algorithms described
in this chapter), plus an integer counti . Moreover, each message m carries the
color of its sender (in its field m.color). The local control variable counti counts
the number of green messages sent by pi minus the number of green messages
received by pi . The controller cp is one of the controllers, known by all, that is
in charge of the construction of the computed global state.
When cp has received a pair (σi , counti ) from each controller cpi , it computes
ct = Σ1≤i≤n counti .

internal operation record_ls() is


(1) σi ← current local state of pi ;
(2) gs_statei ← red;
(3) record σi , counti ; send the pair (σi , counti ) to cp.

when START () is received do


(4) if (gs_statei = green) then record_ls() end if.

when MSG (m) is received on in_channeli [j ] do


(5) if (m.color = red) ∧ (gs_statei = green) then record_ls() end if;
(6) if (m.color = green) then
(7) if (gs_statei = green) then counti ← counti − 1 else send a copy of m to cp end if
(8) end if;
(9) pass MSG (m) to pi .

when MSG (m) is sent on out_channeli [j ] do


(10) m.color ← gs_statei ;
(11) if (m.color = green) then counti ← counti + 1 end if.

Fig. 6.25 Another global state computation (non-FIFO channels, code for cpi )

• What does the counter ct represent?


• What can be concluded on a message m sent to cp by a controller cpi at line 7?
• Show that the global state Σ = [σ1 , . . . , σn ] obtained by cp is consistent.
• How can cp compute the state of the channels?
• Does this algorithm compute a consistent pair (Σ, CΣ)? To answer “yes”,
you need to show that each channel state c_state(j, i) contains all the green
messages sent by pj to pi that are received by pi while it is red. To answer
“no”, you need to show that there is no means for cp to always compute correct
values for all channel states.
Answer in [250].
Chapter 7
Logical Time
in Asynchronous Distributed Systems

This chapter is on the association of consistent dates with events, local states, or
global states of a distributed computation. Consistency means that the dates gen-
erated by a dating system have to be in agreement with the “causality” generated
by the considered distributed execution. According to the view of a distributed ex-
ecution we are interested in, this causality is the causal precedence order on events
(relation −→ev ), the causal precedence order on local states (relation −→σ ), or the
reachability relation in the lattice of global states (relation −→Σ ), all introduced in
the previous chapter. In all cases, this means that the date of a “cause” has to be
earlier than the date of any of its “effects”. As we consider time-free asynchronous
distributed systems, these dates cannot be physical dates. (Moreover, even if pro-
cesses were given access to a global physical clock, the clock granularity should be
small enough to always allow for a consistent dating.)
Three types of logical time are presented, namely, scalar (or linear) time, vector
time, and matrix time. Each type of time is defined, its properties are stated, and
illustrations showing how to use it are presented.

Keywords Adaptive communication layer · Approximate causality relation ·


Causal precedence · Causality tracking · Conjunction of stable local predicates ·
Detection of a global state property · Discarding old data · Hasse diagram ·
Immediate predecessor · Linear (scalar) time (clock) · Logical time ·
Matrix time (clock) · Message stability · Partial (total) order · Relevant event ·
k-Restricted vector clock · Sequential observation · Size of a vector clock ·
Timestamp · Time propagation · Total order broadcast · Vector time (clock)

7.1 Linear Time

Linear time considers events and the associated partial order relation −→ev produced
by a distributed execution. (The same could be done by considering local states, and
the associated partial order relation −→σ .) This notion of logical time is due to L.
Lamport (1978), who introduced it together with the relation −→ev .


when producing an internal event e do


(1) clocki ← clocki + 1. % date of the internal event
(2) Produce event e.

when sending MSG (m) to pj do


(3) clocki ← clocki + 1; % date of the send event
(4) send MSG (m, clocki ) to pj .

when MSG (m, h) is received from pj do


(5) clocki ← max(clocki , h);
(6) clocki ← clocki + 1. % date of the receive event.

Fig. 7.1 Implementation of a linear clock (code for process pi )

7.1.1 Scalar (or Linear) Time

Linear (Scalar) Clock As just indicated, the aim is to associate a logical date with
events of a distributed execution. Let date(e) be the date associated with event e. To
be consistent the dating system has to be such that
∀ e1 , e2 : (e1 −→ev e2 ) ⇒ date(e1 ) < date(e2 ).

The simplest time domain which can respect event causality is the sequence of in-
creasing integers (hence the name linear or scalar time): A date is an integer.
Hence, each process pi manages an integer local variable clocki (initialized to
0), which increases according to the relation −→ev , as described by the local clock
management algorithm of Fig. 7.1. Just before producing its next internal event,
or sending a message, a process pi increases its local clock clocki , and this new
clock value is the date of the corresponding internal or send event. Moreover, each
message carries its sending date. When a process receives a message, it first updates
its local clock and then increases it, so that the receive event has a date greater than
both the date of the corresponding send event and the date of the last local event.
It follows trivially from these rules that the linear time increases along all causal
paths, and, consequently, this linear time is consistent with the relation −→ev (or
−→σ ). Any increment value d > 0 could be used instead of 1. Considering d = 1
allows for the smallest clock increase while keeping consistency.
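A direct Python transcription of Fig. 7.1 may help fix these rules (the class and method
names are illustrative assumptions):

class LinearClock:
    # Sketch of the scalar clock of Fig. 7.1 for one process.
    def __init__(self):
        self.clock = 0

    def internal_event(self):        # lines 1-2
        self.clock += 1
        return self.clock            # date of the internal event

    def send_event(self):            # lines 3-4: the returned date travels with m
        self.clock += 1
        return self.clock

    def receive_event(self, h):      # lines 5-6: h is the sending date of m
        self.clock = max(self.clock, h) + 1
        return self.clock            # date of the receive event

For instance, a clock whose value is 4 and which receives a message dated 5 jumps to 6,
skipping the value 5; this is the kind of jump clock1 makes in the example that follows.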

An Example An illustration of the previous linear clock system is depicted in


Fig. 7.2. The global linear clock is implemented by three integer local clocks, clock1 ,
clock2 , and clock3 , and all the integers associated with events or messages represent
clock values. Let us consider process p2 . Its first event is the sending of a message
to p3 : Its date is 1. Its next event is an internal event, and its date is 2. Its third event
is the reception of a message whose sending date is 3, hence this receive event is
dated 4. Similarly, its next receive event is dated 5, etc.
This example shows that, due to resetting entailed by message receptions, a local
clock clocki may skip integer values. As an example, clock1 jumps from value 4

Fig. 7.2 A simple example of a linear clock system

to value 6. The fact that (a) logical time increases along causal paths, and (b) the
increment value has been chosen equal to 1, provides us with the following property

date(e) = x ⇔ (there are x events on the longest causal path ending at e).

Properties The following properties follow directly from clock consistency. Let
e1 and e2 be two events.
• (date(e1 ) ≤ date(e2 )) ⇒ ¬(e2 −→ev e1 ).
• (date(e1 ) = date(e2 )) ⇒ (e1 ||e2 ).
The first property is simply the contrapositive of the clock consistency prop-
erty, while the second one is a restatement of (date(e1 ) ≤ date(e2 )) ∧ (date(e2 ) ≤
date(e1 )).
These properties are important from an operational point of view. The partial
order −→ev on events allows us to understand and reason on distributed executions,
while the dates associated with events are operational and can consequently be used
to design and write distributed algorithms. Hence, the aim is to extract information
on events from their dates.
Unfortunately, it is possible to have (date(e1 ) < date(e2 )) ∧ ¬(e1 −→ev e2 ). With
linear time, it is not because the date of an event e1 is earlier (smaller) than the date
of an event e2 , that e1 is a cause of e2 .

7.1.2 From Partial Order to Total Order: The Notion of a Timestamp

The previous observation is the main limitation of any linear time system. But, for-
tunately, this is not a drawback if we have to totally order events while respecting
the partial order −→ev . To that end, in addition to its temporal coordinate (its date), let
us associate with each event a spatial coordinate (namely the identity of the process
pi that issued this event).

Fig. 7.3 A non-sequential observation obtained from linear time

The Notion of a Timestamp Hence let us define the timestamp of an event e as


the pair ⟨h, i⟩ such that:
• pi is the process that issued e, and
• h is the date at which e has been issued.
Timestamps allow events to be totally ordered without violating causal prece-
dence. Let e1 and e2 be two events whose timestamps are ⟨h, i⟩ and ⟨k, j ⟩, respec-
tively. The causality-compliant total order relation on the events, denoted −→to_ev , is
defined as follows:

e1 −→to_ev e2 def= (h < k) ∨ ((h = k) ∧ (i < j )).
This total order is nothing more than the lexicographical order on the pairs made
up of two integers, a date and a process identity. (It is of course assumed that no two
processes have the same identity.)

Notation In the following the notation ⟨h, i⟩ < ⟨k, j ⟩ is used to denote (h <
k) ∨ ((h = k) ∧ (i < j )).
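In code, this lexicographic order is immediate; in Python, pairs even compare
lexicographically by default, so sorting timestamped events yields a causality-compliant
total order (a small sketch with made-up data):

def ts_less(ts1, ts2):
    # Total order on timestamps <h, i>: date first, process identity second.
    (h, i), (k, j) = ts1, ts2
    return h < k or (h == k and i < j)

events = [((2, 3), "e"), ((1, 2), "f"), ((2, 1), "g")]
events.sort(key=lambda ev: ev[0])   # -> [((1, 2), 'f'), ((2, 1), 'g'), ((2, 3), 'e')]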

7.1.3 Relating Logical Time and Timestamps with Observations

The notion of a sequential observation was introduced in Chap. 6 devoted to global


states, namely, it is a sequence (a) including the events generated by a distributed
execution, and (b) respecting their partial ordering.

Linear Time and Non-sequential Observation Let us consider the distributed


execution depicted in Fig. 7.3, in which the corresponding timestamp is associated
with each event. Let us first consider all the events whose date is H = 1, then all the
events whose date is H = 2, etc. This is indicated in the figure with bold lines, which
partition the events according to their dates. This corresponds to a non-sequential
observation, i.e., one that could be obtained by an observer able to see several
events at the same time, namely all the events with the same logical date.

Fig. 7.4 A sequential observation obtained from timestamps

Timestamps and Sequential Observation Figure 7.4 represents the lattice of


global states of the distributed execution of Fig. 7.3. Let us now totally order all
the events produced by this execution according to the relation −→to_ev . We obtain the
sequence of events e21 , e11 , e22 , e12 , e23 , which is denoted with a bold zigzag arrow
on Fig. 7.4. It is easy to see that this sequence is nothing more than a particular
sequential observation of the distributed execution depicted in Fig. 7.3.
More generally, the total order on timestamps defines a sequential observation of
the corresponding distributed execution.

7.1.4 Timestamps in Action: Total Order Broadcast

Linear time and timestamps are particularly useful when one has to order operations
or messages. The most typical case is the establishment of a total order on a set of
requests, which have to be serviced one after the other. This use of timestamps will
be addressed in Chap. 10, which is devoted to permission-based mutual exclusion.
This section considers another illustration of timestamps, namely, it presents a
timestamp-based implementation of a high-level communication abstraction, called
total order broadcast.

The Total Order Broadcast Abstraction Total order broadcast is a communica-


tion abstraction defined by two operations, denoted to_broadcast() and to_deliver().
Intuitively, to_broadcast() allows a process pi to send a message m to all the pro-
cesses (we then say “pi to_broadcasts m”), while to_deliver() allows a process to
receive such a message (we then say “pi to_delivers m”). Moreover, all the mes-
sages that have been to_broadcast must be to_delivered in the same order at each

process, and this order has to respect the causal precedence order. To simplify the
presentation, it is assumed that each message is unique (this can be easily realized
by associating a pair ⟨sequence number, sender identity⟩ with each message).

A Causality-Compliant Partial Order on Messages M being the set of mes-
sages which are to_broadcast during an execution, let M̂ = (M, →M ) be the rela-
tion where →M is defined on M as follows. Given m, m′ ∈ M, m →M m′ (and we
then say “m causally precedes m′ ”) if:
• m and m′ have been to_broadcast by the same process, and m has been
to_broadcast before m′ , or
• m has been to_delivered by a process pi before pi to_broadcasts m′ , or
• there is a message m′′ ∈ M such that m →M m′′ and m′′ →M m′ .

Total Order Broadcast: Definition The total order broadcast abstraction is for-
mally defined by the following properties. Said differently, this means that, to be
correct, an implementation of the total order broadcast abstraction has to ensure that
these properties are always satisfied.
• Validity. If a process to_delivers a message m, there is a process that has
to_broadcast m.
• Integrity. No message is to_delivered twice.
• Total order. If a process to_delivers m before m′ , no process to_delivers m′ be-
fore m.
• Causal precedence order. If m →M m′ , no process to_delivers m′ before m.
• Termination. If a process to_broadcasts a message m, any process to_delivers m.
The first four properties are safety properties. Validity relates the outputs to the
inputs. It states that there is neither message corruption, nor message creation. In-
tegrity states that there is no message duplication. Total order states that messages
are to_delivered in the same order at every process, while causal precedence states
that this total order respects the message causality relation →M . Finally, the termi-
nation property is a liveness property stating that no message is lost.

Principle of the Implementation To simplify the description, we consider that


the communication channels are FIFO. Moreover, each pair of processes is con-
nected by a bidirectional channel.
The principle that underlies the implementation is very simple: it consists in as-
sociating a timestamp with each message, and to_delivering the messages according
to their timestamp order. As timestamps are totally ordered and this order respects
causal precedence, we obtain both total order and causal precedence order proper-
ties.
To illustrate the main issue posed by associating appropriate timestamps with
messages, and define accordingly a correct message delivery rule, let us consider
Fig. 7.5. Independently one from the other, p1 and p2 to_broadcast the messages m1
and m2 , respectively. Neither of p1 and p2 can immediately to_deliver its message,
otherwise the total order delivery property would be violated. The processes have to

Fig. 7.5 Total order broadcast: the problem that has to be solved

Fig. 7.6 Structure of the total order broadcast implementation

cooperate so that they to_deliver m1 and m2 in the same order. This remains true if
only p1 to_broadcasts a message. This is because, when it issues to_broadcast(m1 ),
p1 does not know whether p2 has independently issued to_broadcast(m2 ) or not.
It follows that a to_broadcast message generates two distinct communication
events at each process. The first one is associated with the reception of the message
from the underlying communication network, while the second one is associated
with its to_delivery.

Message Stability A means to implement the same to_delivery order at each pro-
cess consists in providing each process pi with information on the clock values of
the whole set of processes. This local information can then be used by each process
pi to know which, among the to_broadcast messages it has received and not yet
to_delivered, are stable, where message stability is defined as follows.
A message timestamped ⟨k, j ⟩ received by a process pi is stable (at that process)
if pi knows that all the messages it will receive in the future will have a timestamp
greater than ⟨k, j ⟩. The main job of a timestamp-based implementation consists in
ensuring the message stability at each process.

Global Structure and Local Variables at a Process pi The structure of the im-
plementation is described in Fig. 7.6. Each process pi has a local module imple-
menting the operations to_broadcast() and to_deliver(). Each local module manages
the following local variables.
• clocki [1..n] is an array of integers initialized to [0, . . . , 0]. The local variable
clocki [i] is the local clock of pi , which implements the global linear time. Dif-
ferently, for j ≠ i, clocki [j ] is the best approximation of the value of the local
clock of pj , as known by pi . As the communication channels are FIFO, clocki [j ]
contains the last value of clockj [j ] received by pi . Hence, in addition to the fact
that the set of local clocks {clocki }1≤i≤n implement a global scalar clock, the lo-
cal array clocki [1..n] of each process pi represents its current knowledge on the
progress of the whole set of local logical clocks.
• to_deliverablei is a sequence, initially empty (the empty sequence is denoted ∅).
This sequence contains the list of messages that (a) have been received by pi , (b)
have then been totally ordered, and (c) have not yet been to_delivered. Hence,
to_deliverablei is the list of messages that can be to_delivered to the local upper
layer application process.
• pendingi is a set of pairs ⟨m, ⟨d, j ⟩⟩, where m is a message whose timestamp is
⟨d, j ⟩. Initially, pendingi = ∅.

Description of the Implementation The timestamp-based algorithm implement-


ing the total order broadcast abstraction is described in Fig. 7.7. It is assumed that
pi does not interleave the execution of the statements at lines 1–4, lines 9–14, and
lines 18–22. Let us recall that there is a bidirectional FIFO point-to-point commu-
nication channel connecting any pair of distinct processes. This algorithm works as
described below.
When a process pi invokes to_broadcast(m), it first associates with m a new
timestamp ts(m) = ⟨clocki [i], i⟩ (line 2). Then, it adds the pair ⟨m, ts(m)⟩ to
its local set pendingi (line 3), and sends to all the other processes the mes-
sage TOBC (m, ts(m)) to inform them that a new message has been to_broadcast
(line 4). An invocation of to_deliver() returns the first message in the local list
to_deliverablei (lines 5–8).
The behavior of a process pi when it receives a message TOBC (m, sd_date, j )
can be decomposed into two parts.
• The process pi first modifies its local context according to content of the TOBC ()
message it has just received. It updates clocki [j ] (line 9), and adds the pair
m, sd_date, j  to its local set pendingi (line 10).
Let us notice that, as (a) clockj [j ] never decreases, (b) pj increased it be-
fore sending the message TOBC (m, −), and (c) the channels are FIFO, it follows
that we have clocki [j ] < sd_date when the message TOBC (m, sd_date, j ) is
received. Hence the systematic update at line 9.
• The second set of statements executed by pi (lines 11–14) is related to the update
of its local clock clocki [i], and the dissemination of its new value to the other
processes. This is to ensure both that (a) each message that has been to_broadcast
is eventually to_delivered, and (b) message to_deliveries satisfy total order.
If sd_date ≥ clocki [i], pi resets its local clock (line 12) to a value greater than
sd_date (as in the basic scalar clock algorithm of Fig. 7.1). Then pi sends to each

operation to_broadcast(m) is
(1) clocki [i] ← clocki [i] + 1;
(2) let ts(m) = ⟨clocki [i], i⟩;
(3) pendingi ← pendingi ∪ {⟨m, ts(m)⟩};
(4) for each j ∈ {1, . . . , n} \ {i} do send TOBC (m, ts(m)) to pj end for.

operation to_deliver(m) is
(5) wait (to_deliverablei ≠ ∅);
(6) let m be the first message in the list to_deliverablei ;
(7) withdraw m from to_deliverablei ;
(8) return(m).

when TOBC (m, sd_date, j ) is received do


(9) clocki [j ] ← sd_date;
(10) pendingi ← pendingi ∪ {⟨m, ⟨sd_date, j ⟩⟩};
(11) if (sd_date ≥ clocki [i]) then
(12) clocki [i] ← sd_date + 1;
(13) for each k ∈ {1, . . . , n} \ {i} do send CATCH _ UP (clocki [i], i) to pk end for
(14) end if.

when CATCH _ UP (last_date, j ) is received do


(15) clocki [j ] ← last_date.

background task T is
(16) repeat forever
(17) wait (pendingi ≠ ∅);
(18) let ⟨m, ⟨d, k⟩⟩ be the pair in pendingi with the smallest timestamp;
(19) if (∀j ≠ k : ⟨d, k⟩ < ⟨clocki [j ], j ⟩) then
(20) add m at the tail of to_deliverablei ;
(21) pendingi ← pendingi \ {⟨m, ⟨d, k⟩⟩}
(22) end if
(23) end repeat.

Fig. 7.7 Implementation of total order broadcast (code for process pi )

Fig. 7.8 To_delivery predicate of a message at process pi

process a control message denoted CATCH _ UP (). This message carries the new
clock value of pi (line 13). As the channels are FIFO, it follows that when this
message is received by pk , 1 ≤ k ≠ i ≤ n, this process has necessarily received all
the messages TOBC () sent by pi before this CATCH _ UP () message. This allows
pk to know if, among the to_broadcast messages it has received, the one with the
smallest timestamp is stable (Fig. 7.8).
Finally, a background task checks if some of the to_broadcast messages which
have been received by pi are stable, and can consequently be moved from the set
pendingi to the local sequence to_deliverablei . To that end, when pendingi ≠ ∅
(line 17), pi looks for the message m with the smallest timestamp (line 18). Let
ts(m) = ⟨d, k⟩. If, for any j ≠ k, ts(m) is smaller than ⟨clocki [j ], j ⟩, it follows
(from lines 11–14 executed by each process when it received the pair ⟨m, ts(m)⟩)
that any to_broadcast message not yet received by pi will have a timestamp greater
than ts(m). Hence, if the previous predicate is true, the message m is stable at pi
(Fig. 7.8). Consequently, pi adds m at the tail of to_deliverablei (line 20) and
withdraws ⟨m, ts(m)⟩ from the set pendingi (line 21). As we will see in the proof,
message stability at each process ensures that the to_broadcast messages are added
in the same order to all the sequences to_deliverablex , 1 ≤ x ≤ n.
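Expressed in Python, the stability test of line 19 (the to_delivery predicate of Fig. 7.8)
reads as follows (a sketch assuming processes are numbered 0, . . . , n − 1 and that clock
stands for pi ’s local array clocki ; the names are illustrative):

def is_stable(ts, clock, n):
    # Timestamp <d, k> is stable at p_i when every message still to be received
    # must carry a larger timestamp: for all j != k, <d, k> < <clock[j], j>.
    d, k = ts
    return all((d, k) < (clock[j], j) for j in range(n) if j != k)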

Theorem 4 The algorithm described in Fig. 7.7 implements the total order broad-
cast abstraction.

Proof The validity property (neither corruption, nor creation of messages) follows
directly from the reliability of the channels. The integrity property (no duplica-
tion) follows from the reliability of the underlying channels, the fact that no two
to_broadcast messages have the same timestamp, and the fact that when a message
is added to to_deliverablei , it is suppressed from pendingi .
The proof of the termination property is by contradiction. Assuming that
to_broadcast messages are never to_delivered by a process pi (i.e., added to
to_deliverablei ), let m be the one with the smallest timestamp, and let ts(m) =
⟨d, k⟩.
Let us observe that, each time a process pj updates its local clock (to a greater
value), it sends its new clock value to all processes. This occurs at lines 1 and 4, or
at lines 12 and 13.
As each other process pj receives the message TOBC (m, ts(m)) sent by pk , its
local clock becomes greater than d (if it was not before). It then follows from
the previous observation that a value of clockj [j ] greater than d becomes even-
tually known by each process, and we eventually have clocki [j ] > d. Hence, the
to_delivery predicate for m becomes eventually satisfied at pi . The message m is
consequently moved from pendingi to to_deliverablei , which contradicts the initial
assumption and proves the termination property.
The proof of the total order property is also by contradiction. Let mx and my be
two messages timestamped ts(mx ) = ⟨dx , x⟩ and ts(my ) = ⟨dy , y⟩, respectively. Let
us assume that ts(mx ) < ts(my ), and my is to_delivered by a process pi before mx
(i.e., my is added to to_deliverablei before mx ).
Just before my is added to to_deliverablei , (my , ts(my )) is the pair with the
smallest timestamp in pendingi , and ∀j ≠ y : ⟨dy , y⟩ < ⟨clocki [j ], j ⟩ (lines 18–19).
It follows that we have then ⟨dx , x⟩ < ⟨dy , y⟩ < ⟨clocki [x], x⟩. As (a) px sends
only increasing values of its local clock (lines 1 and 4, and lines 12–13), (b)
dx < clocki [x], and (c) the channels are FIFO, it follows that pi has received the
message TOBC (mx , ts(mx )) before the message carrying the value of clockx [x]
which entailed the update of clocki [x] making true the predicate ⟨dy , y⟩ <

⟨clocki [x], x⟩. Consequently, as mx is not yet to_delivered (assumption), it fol-


lows that (my , ts(my )) cannot be the pair with the smallest timestamp in pendingi .
This contradicts the initial assumption. It follows that the messages that have been
to_broadcast and are to_delivered, are to_delivered at each process in the same total
order (as defined by their timestamp).
The proof that the total order on timestamps respects the causal precedence order
on to_broadcast messages follows from the two following observations.
• A process pi increases its local clock clocki [i] each time it invokes to_broadcast().
Hence, if pi to_broadcasts m before m′ , we have ts(m) < ts(m′ ).
• Due to line 11, it follows that, after it has received a message TOBC (m, ⟨d, j ⟩),
we have clocki [i] > d. It follows that, if later pi to_broadcasts a message m′ , we
necessarily have ts(m) < ts(m′ ). □

7.2 Vector Time


As we have seen, the domain of linear time is the set of non-negative integers, and
a set of n logical clocks (one per process) allows us to associate dates with events,
in such a way that the dates respect the causal precedence relation −→ev . However,
as we have seen, except for those events that have the same date, linear dates do not
allow us to determine with certainty if one event belongs to the causes of another
one. The aim of vector time (which is implemented by vector clocks) is to solve
this issue. Vector time, and its capture by vector clocks, was simultaneously and
independently proposed in 1988 by C.J. Fidge, F. Mattern, and F. Schmuck.

7.2.1 Vector Time and Vector Clocks

Vector Time: Definition A vector clock system is a mechanism that is able to


associate dates with events (or local states) in such a way that the comparison of
their dates allows us to determine if the corresponding events are causally related or
not, and, if they are, which one belongs to the cause of the other. Vector time is the
time notion captured by vector clocks.
More precisely, let date(e) be the date associated with an event e. The aim of
a vector clock system is to provide us with a time domain in which any two dates
either can be compared (<, >), or are incomparable (denoted ||), in such a way that
the following properties are satisfied
• ∀ e1 , e2 : (e1 −→ev e2 ) ⇔ date(e1 ) < date(e2 ), and
• ∀ e1 , e2 : (e1 ||e2 ) ⇔ date(e1 ) || date(e2 ) (the dates cannot be compared).
This means that the dating system has to be in perfect agreement with the causal
precedence relation −→ev . To attain this goal, vector time considers that the time
domain is made up of all the integer vectors of size n (the number of processes).

when producing an internal event e do


(1) vci [i] ← vci [i] + 1;
(2) Produce event e. % The date of e is vci [1..n].

when sending MSG (m) to pj do


(3) vci [i] ← vci [i] + 1; % vci [1..n] is the sending date of the message.
(4) send MSG (m, vci [1..n]) to pj .

when MSG (m, vc) is received from pj do


(5) vci [i] ← vci [i] + 1;
(6) vci [1..n] ← max(vci [1..n], vc[1..n]). % vci [1..n] is the date of the receive event.

Fig. 7.9 Implementation of a vector clock system (code for process pi )

Vector Clock: Definition To implement vector time, each process pi manages a


vector of non-negative integers vci [1..n], initialized to [0, . . . , 0]. This vector is such
that:
• vci [i] counts the number of events produced by pi , and
• vci [j ], j ≠ i, counts the number of events produced by pj , as known by pi .
More formally, let e be an event produced by some process pi . Just after pi has
produced e we have (where 1(k, i) = 1 if pk is pi , and 1(k, i) = 0 otherwise):

vci [k] = |{f : (f has been produced by pk ) ∧ (f −→ev e)}| + 1(k, i).

Hence, vci [k] is the number of events produced by pk in the causal past of the
event e. The term 1(k, i) is to count the event e, which has been produced by pi .
The value of vci [1..n] is the vector date of the event e.

Vector Clock: Algorithm The algorithm implementing the vector clock system
has exactly the same structure as the one for linear time (Fig. 7.1). The difference
lies in the fact that only the entry i of the local vector clock of pi (i.e., vci [i])
is increased each time it produces a new event, and each message m piggybacks
the current value of the vector time, which defines the sending time of m. This
value allows the receiver to update its local vector clock, so that the date of the
receive event is after both the date of the sending event associated with m and the
immediately preceding local event produced by the receiver process. The operator
max() on integers is extended to vectors as follows (line 5):

max(v1, v2) = [max(v1[1], v2[1]), . . . , max(v1[j ], v2[j ]), . . . , max(v1[n], v2[n])].

Let us observe that, for any pair (i, k), it follows directly from the vector clock
algorithm that (a) vci [k] never decreases, and (b) at any time, vci [k] ≤ vck [k].
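As for the linear clock, a short Python transcription of Fig. 7.9 may help (the names are
illustrative; processes are numbered 0, . . . , n − 1):

class VectorClock:
    # Sketch of the vector clock of Fig. 7.9 for process p_i.
    def __init__(self, i, n):
        self.i = i
        self.vc = [0] * n

    def internal_event(self):        # lines 1-2
        self.vc[self.i] += 1
        return list(self.vc)         # date of the event

    def send_event(self):            # lines 3-4: the returned copy travels with m
        self.vc[self.i] += 1
        return list(self.vc)

    def receive_event(self, vc_msg): # lines 5-6
        self.vc[self.i] += 1
        self.vc = [max(a, b) for a, b in zip(self.vc, vc_msg)]
        return list(self.vc)         # date of the receive event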

Fig. 7.10 Time propagation in a vector clock system

Example of Time Propagation An example of an execution of the previous al-


gorithm is depicted in Fig. 7.10. The four local vector clocks are initialized to
[0, 0, 0, 0]. Then, p2 produces an internal event dated [0, 1, 0, 0]. Its next event is the
sending of a message to p1 , dated [0, 2, 0, 0]. The reception of this message by p1
is its first event, and consequently the reception of this message is dated [1, 2, 0, 0],
etc.
The bottom of the figure shows what could be seen by an omniscient external ob-
server. The global time seen by this observer is defined as follows: At any time τx , it
sees how many events have been produced by each process up to τx . Hence, the cur-
rent value of this global time is [vc1 [1], vc2 [2], vc3 [3], vc4 [4]]. Initially, no process
has yet produced events, and the observer sees the global vector time [0, 0, 0, 0] at
external time τ1 . Then, at τ3 , it sees the sending by p2 of a message to p1 at the
global time [0, 2, 0, 0], etc. Some global time instants are indicated on the figure.
The value of the last global time seen by the omniscient observer is [2, 3, 2, 2].
The global time increases when events are produced. As the observer is omni-
scient, this is captured by the fact that the global time values it sees increase in the
sense that, if we consider any two global time vectors GD1 followed by GD2, we
have (∀k : GD1[k] ≤ GD2[k]) ∧ (∃ k : GD1[k] < GD2[k]).

Notation Given two vectors vc1 and vc2, both of size n, we have
• vc1 ≤ vc2 def= (∀ k ∈ {1, . . . , n} : vc1[k] ≤ vc2[k]).
• vc1 < vc2 def= (vc1 ≤ vc2) ∧ (vc1 ≠ vc2).
• vc1 || vc2 def= ¬(vc1 ≤ vc2) ∧ ¬(vc2 ≤ vc1).
When considering Fig. 7.10, we have [0, 2, 0, 0] < [0, 3, 2, 2], and [0, 2, 0, 0] || [0, 0,
1, 0].

7.2.2 Vector Clock Properties

The following theorem characterizes the power of vector clocks.

Theorem 5 Let e.vc be the vector date associated with event e, by the algorithm of
Fig. 7.9. These dates are such that, for any two distinct events e1 and e2 we have (a)
(e1 −→ev e2 ) ⇔ (e1 .vc < e2 .vc), and (b) (e1 ||e2 ) ⇔ (e1 .vc || e2 .vc).

Proof The proof is made up of three cases.


• (e1 −→ev e2 ) ⇒ (e1 .vc < e2 .vc). This case follows directly from the fact that the
vector time increases along all causal paths (lines 1 and 3–6).
• (e1 .vc < e2 .vc) ⇒ (e1 −→ev e2 ). Let pi be the process that produced event e1 . We
have e1 .vc < e2 .vc ⇒ e1 .vc[i] ≤ e2 .vc[i]. Let us observe that only process pi can
entail an increase of the entry i of any vector (if pi no longer increases vci [i] = a
at line 1, no process pj can be such that vcj [i] > a). We conclude from this
observation, the code of the algorithm (vector time increases only along causal
paths), and e1 .vc[i] ≤ e2 .vc[i] that there is a causal path from e1 to e2 .
• (e1 ||e2 ) ⇔ (e1 .vc || e2 .vc). By definition we have (e1 || e2 ) = (¬(e1 −→ev
e2 ) ∧ ¬(e2 −→ev e1 )). It follows from the previous items that ¬(e1 −→ev e2 ) ⇔
¬(e1 .vc ≤ e2 .vc), i.e., ¬(e1 −→ev e2 ) ⇔ (e1 .vc > e2 .vc) ∨ (e1 .vc || e2 .vc). Sim-
ilarly, ¬(e2 −→ev e1 ) ⇔ (e2 .vc > e1 .vc) ∨ (e2 .vc || e1 .vc). Combining the previ-
ous observations, and observing that we cannot have simultaneously (e1 .vc >
e2 .vc) ∧ (e2 .vc > e1 .vc), we obtain (e1 || e2 ) ⇔ (e1 .vc || e2 .vc). □

The next corollary follows directly from the previous theorem.

Corollary 1 Given the dates of two events, determining if these events are causally
related or not can require up to n comparisons of integers.

Reducing the Cost of Comparing Two Vector Dates As the cost of comparing
two dates is O(n), an important question is the following: Is it possible to add con-
trol information to a date in order to reduce the cost of their comparison? If the
events are produced by the same process pi , a simple comparison of the ith entry of
their vector dates allows us to conclude. More generally, given an event e produced
by a process pi , let us associate with e a timestamp defined as the pair ⟨e.vc, i⟩. We
have then the following theorem from which it follows that, thanks to the knowledge
of the process that produced an event, the cost of deciding if two events are
causally related or not is reduced to two comparisons of integers.

Theorem 6 Let e1 and e2 be events timestamped ⟨e1 .vc, i⟩ and ⟨e2 .vc, j ⟩, re-
spectively, with i ≠ j . We have (e1 −→ev e2 ) ⇔ (e1 .vc[i] ≤ e2 .vc[i]), and
(e1 || e2 ) ⇔ ((e1 .vc[i] > e2 .vc[i]) ∧ (e2 .vc[j ] > e1 .vc[j ])).

Proof Let us first observe that time increases only along causal paths, and only the
process that produced an event entails an increase of the corresponding entry in a
vector clock (Observation O). The proof considers each case separately.
• (e1 −→ev e2 ) ⇔ (e1 .vc[i] ≤ e2 .vc[i]).
If e1 −→ev e2 , there is a causal path from e1 to e2 , and we have e1 .vc ≤ e2 .vc
(Theorem 5), from which e1 .vc[i] ≤ e2 .vc[i] follows.
If e1 .vc[i] ≤ e2 .vc[i], it follows from observation O that there is a causal path
from e1 to e2 .
• (e1 || e2 ) ⇔ ((e1 .vc[i] > e2 .vc[i]) ∧ (e2 .vc[j ] > e1 .vc[j ])).
As, at any time, we have vcj [i] ≤ vci [i], pi increases vci [i] when it produces
e1 , it follows from the fact that there is no causal path from e1 to e2 and obser-
vation O that e1 .vc[i] > e2 .vc[i]. The same applies to e2 .vc[j ] with respect to
e1 .vc.
In the other direction, we conclude from e1 .vc[i] > e2 .vc[i] that there is no
causal path from e1 to e2 (otherwise we would have e1 .vc[i] ≤ e2 .vc[i]). Simi-
larly, e2 .vc[j ] > e1 .vc[j ] implies that there is no causal path from e2 to e1 . □

To illustrate this theorem, let us consider Fig. 7.10, where the first event of p2 is
denoted e21 , the first event of p3 is denoted e31 , and the second event of p3 is denoted e32 .
Event e21 is timestamped ⟨[0, 1, 0, 0], 2⟩, e31 is timestamped ⟨[0, 0, 1, 0], 3⟩, and e32 is
timestamped ⟨[0, 3, 2, 2], 3⟩. As e21 .vc[2] = 1 ≤ e32 .vc[2] = 3, we conclude e21 −→ev
e32 . As e21 .vc[2] = 1 > e31 .vc[2] = 0 and e31 .vc[3] = 1 > e21 .vc[3] = 0, we conclude
e21 || e31 .
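These constant-time tests are directly programmable. Here is a Python sketch (0-indexed
processes; the same-process case, which Theorem 6 leaves aside, is handled by a strict
comparison, since two distinct events of the same process are always causally related):

def precedes(ts1, ts2):
    # e1 -> e2 iff e1.vc[i] <= e2.vc[i]: one integer comparison when i != j.
    (vc1, i), (vc2, j) = ts1, ts2
    return vc1[i] < vc2[i] if i == j else vc1[i] <= vc2[i]

def concurrent(ts1, ts2):
    # e1 || e2 iff e1.vc[i] > e2.vc[i] and e2.vc[j] > e1.vc[j].
    (vc1, i), (vc2, j) = ts1, ts2
    return i != j and vc1[i] > vc2[i] and vc2[j] > vc1[j]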

7.2.3 On the Development of Vector Time

When considering vector time, messages put restrictions on the development of


time. More precisely, let m be a message sent by a process pi at local time vci [i] = a
and received by a process pj at local time vcj [j ] = b. This message induces the re-
striction that no event e can be dated e.vc such that e.vc[i] < a and e.vc[j ] ≥ b.
This restriction simply states that there is no event such that there is a message
whose reception belongs to its causal past while its sending does not.
To illustrate this restriction created by messages on the vector time domain, let
us consider the execution depicted in Fig. 7.11. There are two processes, and each
of them produces six events. The vector time date of each event is indicated above
or below the corresponding event. Let us notice that the channel from p1 to p2 is
not FIFO (p1 sends m1 before m2 , but p2 receives m2 before m1 ).

Fig. 7.11 On the development of time (1)

Let us consider the message m2 . As its sending time is [3, 0] and its receive time
is [3, 4], it follows that it is impossible for an external observer to see a global time
GD such that GD[1] < 3 and GD[2] ≥ 4. The restrictions on the set of possible
vector dates due to the four messages exchanged in the computation are depicted in
Fig. 7.12. Each message prevents some vector clock values from being observed. As
an example, the date [2, 5] cannot exist, while the vector date [3, 3] can be observed
by an external observer.
The borders of the area including the vector dates that could be observed by
external observers are indicated with dotted lines in the figure. They correspond to
the history (sequence of events) of each process.

Fig. 7.12 On the development of time (2)



Fig. 7.13 Associating vector dates with global states

7.2.4 Relating Vector Time and Global States

Let us consider the distributed execution described on the left top of Fig. 7.13. Vec-
tor dates of events and local states are indicated in this figure. Both initial local states
σ10 and σ20 are dated [0, 0]. Then, each σix inherits the vector date of the event eix
that generated it. As an example the date of the local state σ22 is [0, 2].
The corresponding lattice of global states is described at the right of Fig. 7.13. In
this lattice, a vector date is associated with each global state as follows: the ith entry
of the vector is the number of events produced by pi . This means that, when con-
sidering the figure, the vector date of the global state [σix , σjy ] is [x, y]. (This dating
system for global states, which is evidently based on vector clocks, was implicitly
introduced and used in Sect. 6.3.2, where the notion of a lattice of global states was
introduced.) Trivially, the vector time associated with global dates increases along
each path of the lattice.
Let us recall that the greatest lower bound (GLB) of a set of vertices of a lattice is
their greatest common predecessor, while the least upper bound (LUB) is their least
common successor. Due to the fact that the graph is a lattice, each of the GLB and
the LUB of a set of vertices (global states) is unique.
An important consequence of the fact that the set of consistent global states is a
lattice and the associated dating of global states is the following. Let us consider two
global states Σ′ and Σ″ whose dates are [d′1 , . . . , d′n ] and [d″1 , . . . , d″n ], respectively.
Let Σ− = GLB(Σ′, Σ″) and Σ+ = LUB(Σ′, Σ″). We have
• date(Σ−) = date(GLB(Σ′, Σ″)) = [min(d′1 , d″1 ), . . . , min(d′n , d″n )], and

• date(Σ+) = date(LUB(Σ′, Σ″)) = [max(d′1 , d″1 ), . . . , max(d′n , d″n )].


As an example, let us consider the global states denoted Σd and Σe in the
lattice depicted at the right of Fig. 7.13. We have Σb = GLB(Σd , Σe ) and Σf =
LUB(Σd , Σe ). It follows that we have date(Σb ) = [min(2, 1), min(1, 2)] = [1, 1].
Similarly, we have date(Σf ) = [max(2, 1), max(1, 2)] = [2, 2]. The consistent cuts
associated with Σd and Σe are depicted in the bottom of the left side of Fig. 7.13.
These properties of the vector dates associated with global states are particularly
important when one has to detect properties on global states. They allow processes to
obtain information on the lattice without having to build it explicitly. At the oper-
ational level, processes have to compute vector dates identifying consistent global
states which are relevant with respect to the property of interest (see Sect. 7.2.5).
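
Expressed in code, the lattice property reduces to component-wise minima and maxima. The following helpers are a small Python sketch (the names glb_date and lub_date are illustrative assumptions):

def glb_date(d1, d2):
    # date of the greatest lower bound of two consistent global states
    return [min(a, b) for a, b in zip(d1, d2)]

def lub_date(d1, d2):
    # date of the least upper bound of two consistent global states
    return [max(a, b) for a, b in zip(d1, d2)]

# Example of Fig. 7.13: date(Σd) = [2, 1] and date(Σe) = [1, 2]
print(glb_date([2, 1], [1, 2]))  # [1, 1] = date(Σb)
print(lub_date([2, 1], [1, 2]))  # [2, 2] = date(Σf)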

7.2.5 Vector Clocks in Action:
On-the-Fly Determination of a Global State Property

To illustrate the use of vector clocks, this section presents an algorithm that deter-
mines the first global state of a distributed computation that satisfies a conjunction
of stable local predicates.

Conjunction of Stable Local Predicates A predicate is local to a process pi if it
involves only local variables of pi . Let LPi be a predicate local to process pi . The fact
that the local state σi of pi satisfies the predicate LPi is denoted σi |= LPi .
Let {LPi }1≤i≤n be a set of n local predicates, one per process. (If there is no
local predicate for a process px , we can consider a fictitious local predicate LPx =
true satisfied by all local states of px .) The predicate ∧i LPi is a global predicate,
called conjunction of local predicates. Let Σ = [σ1 , . . . , σn ] be a consistent global
state. We say that Σ satisfies the global predicate ∧i LPi if ∧i (σi |= LPi ). This is
denoted Σ |= ∧i LPi .
A predicate is stable if, once true, it remains true forever. Hence, if a local state
σi satisfies a local stable predicate LPi , all the local states that follow σi in pi ’s local
history satisfy LPi . Let us observe that, if each local predicate LPi is stable, so is
the global predicate ∧i LPi .
This section presents a distributed algorithm that computes, on the fly and with-
out using additional control messages, the first consistent global state that satisfies a
conjunction of stable local predicates. The algorithm, which only adds control data
to application messages, assumes that the application sends “enough” messages (the
meaning of “enough” will appear clearly in the description of the algorithm).

On the Notion of a “First” Global State The consistent global state Σ defined
by a process pi from the vector date firsti [1..n] is the first global state satisfying
∧i LPi , in the following sense: there is no global state Σ′ such that (Σ′ ≠ Σ) ∧
(Σ′ |= ∧i LPi ) ∧ (Σ′ −→Σ Σ).

Fig. 7.14 First global state satisfying a global predicate (1)

Where Is the Difficulty Let us consider the execution described in Fig. 7.14. The
predicates LP1 , LP2 and LP3 are satisfied from the local states σ1x , σ2b , and σ3c ,
respectively. (The fact that they are stable is indicated by a bold line on the corre-
sponding process axis.) Not all local states are represented; the important point is
that σ2y is the state of p2 after it has sent the message m8 to p1 , and σ3z is the state of
p3 after it has sent the message m5 to p2 . The first consistent global state satisfying
LP1 ∧ LP2 ∧ LP3 is Σ = [σ1x , σ2y , σ3z ].
Using causality created by message exchanges and appropriate information pig-
gybacked by messages, the processes can “learn” information related to the predi-
cate detection. More explicitly, we have the following when looking at the figure.
• When p1 receives m1 , it learns nothing about the predicate detection (and simi-
larly for p3 when it receives m2 ).
• When p1 receives m4 (sent by p2 ), it can learn that (a) the global state
[σ10 , σ2b , σ30 ] is consistent and (b) it “partially” satisfies the global predicate,
namely, σ2b |= LP2 .
• When p2 receives the message m3 (from p1 ), it can learn that the global state
[σ1a , σ2b , σ30 ] is consistent and such that σ2b |= LP2 . When it later receives the message
m5 (from p3 ), it can learn that the global state [σ1a , σ2b , σ3c ] is consistent and such that
(σ2b |= LP2 ) ∧ (σ3c |= LP3 ). Process p3 can learn the same when it receives the
message m7 (sent by p2 ).
• When p1 receives the message m8 (sent by p2 ), it can learn that, while LP1 is
not yet locally satisfied, the global state [σ1a , σ2b , σ3c ] is the first consistent global
state that satisfies LP2 and LP3 .
• Finally, when p1 produces the internal event giving rise to σ1x , it can learn that
the first consistent global state satisfying the three local predicates is [σ1x , σ2y , σ3z ].
The corresponding consistent cut is indicated by a dotted line on the figure.
Let us recall that the vector date of the local state σ2y is the date of the preceding
event, which is the sending of m8 , and this date is piggybacked by m8 . Similarly,
the date of σ3z is the date of the sending of m5 .
Another example is given in Fig. 7.15. In this case, due to the flow of control
created by the exchange of messages, it is only when p3 receives m3 that it can learn

Fig. 7.15 First global state satisfying a global predicate (2)

that [σ1x , σ2y , σ3z ] is the first consistent global state satisfying ∧i LPi . As previously,
the corresponding consistent cut is indicated by a dotted line on the figure.
As no control messages are allowed, it is easy to see that, in some scenarios,
enough application messages have to be sent after ∧i LPi is satisfied, in order to
compute the vector date of the first consistent global state satisfying the global pred-
icate ∧i LPi .

Local Control Variables In order to implement the determination of the first
global state satisfying a conjunction of stable local predicates, each process pi man-
ages the following local variables:
• vci [1..n] is the local vector clock.
• sati is a set, initially empty, of process identities. It has the following meaning:
(j ∈ sati ) ⇔ (pi knows that pj has entered a local state satisfying LPj ).
• donei is Boolean, initialized to false, and then set to the value true by pi when
LPi becomes satisfied for the first time. As LPi is a stable predicate, donei then
keeps that value forever.
• firsti is the vector date of the first consistent global state, known by pi , for
which all the processes in sati satisfy their local predicate. Initially, firsti =
[0, . . . , 0]. Hence, [σ1^firsti[1] , . . . , σn^firsti[n] ] is this global state, and we have
∀ j ∈ sati : σj^firsti[j] |= LPj .
As an example, considering Fig. 7.14, after p1 has received the message m4 ,
we have sat1 = {2} and first1 = [0, b, 0]. After it has received the message m8 ,
we have sat1 = {2, 3} and first1 = [0, b, c].

The Algorithm The algorithm is described in Fig. 7.16. It ensures that, if consis-
tent global states satisfy ∧i LPi , at least one process will compute the vector date of
the first of them. As already indicated, this is under the assumption that the processes
send enough application messages.
Before producing a new event, pi always increases its local clock vci [i] (lines 7,
10, and 13). If the event e is an internal event, and LPi has not yet been satisfied
(i.e., donei is false), pi invokes the operation check_lp() (line 9). If its current local
state σ satisfies LPi (line 1), pi adds its identity to sati and, as it is the first time that

internal operation check_lp(σ ) is


(1) if (σ |= LPi ) then
(2) sati ← sati ∪ {i}; firsti [1..n] ← vci [1..n]; donei ← true;
(3) if (sati = {1, . . . , n}) then
(4) firsti [1..n] is the vector date of the first global state satisfying ∧j LPj
(5) end if
(6) end if.

when producing an internal event e do


(7) vci [i] ← vci [i] + 1;
(8) Produce event e and move to the next state σ ;
(9) if (¬donei ) then check_lp(σ ) end if.

when sending MSG (m) to pj do


(10) vci [i] ← vci [i] + 1; move to the next local state σ ;
(11) if (¬donei ) then check_lp(σ ) end if;
(12) send MSG (m, vci , sati , firsti ) to pj .

when MSG (m, vc, sat, first) is received from pj do


(13) vci [i] ← vci [i] + 1; vci ← max(vci [1..n], vc[1..n]);
(14) move to the next local state σ ;
(15) if (¬donei ) then check_lp(σ ) end if;
(16) if (sat ⊈ sati ) then
(17) sati ← sati ∪ sat; firsti ← max(firsti [1..n], first[1..n]);
(18) if (sati = {1, . . . , n}) then
(19) firsti [1..n] is the vector date of the first global state satisfying ∧j LPj
(20) end if
(21) end if.

Fig. 7.16 Detection of the first global state satisfying ∧i LPi (code for process pi )

LPi is satisfied, it defines accordingly firsti as being vci (which is the vector date
of the global state associated with the causal past of the event e currently produced
by pi , line 2). Process pi then checks if the consistent global state defined by the
vector date firsti (= vci ) satisfies the whole global predicate ∧j LPj (lines 3–4). If
it is the case, pi has determined the first global state satisfying the global predicate.
If the event e is the sending of a message by pi to pj , before sending the message,
process pi first moves to its next local state σ (line 10), and does the same as if e was
an internal event. The important point is that, in addition to the application message,
pi sends to pj (line 12) its full current state (from the global state determination
point of view), which consists of vci (current vector time at pi ), sati (processes
whose local predicates are satisfied), and firsti (date of the consistent global state
satisfying the local predicates of the processes in sati ).
If the event e is the reception by pi of a message sent by pj , pi updates first its
vector clock (line 13), and moves to its next local state (line 14). As in the previous
cases, if LPi has not yet been satisfied, pi then invokes check_lp() (line 15). Finally,
if pi learns something new with respect to local predicates (test of line 16), it “adds”
what it knew before (sati and firsti [1..n]) with what it learns (sat and first[1..n]).
The new value of firsti [1..n] is the vector date of the first consistent global state

in which all the local states of the processes in sati ∪ sat satisfy their local predi-
cates. Finally, pi checks if the global state defined by firsti [1..n] satisfies all local
predicates (lines 18–20).
When looking at the executions described in Figs. 7.14 and 7.15, we have the
following. In Fig. 7.14, p1 ends the detection at line 4 after it has produced the
event that gave rise to σ1x . In Fig. 7.15, p3 ends the detection at line 19 after it has
received the protocol message MSG(m3 , [−, −, −], {1, 2}, [x, y, 0]). Just before it
receives this message we have sat3 = {2, 3} and first3 = [0, y, z].
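
To make the control flow of Fig. 7.16 concrete, here is a compact Python sketch written as a single class simulating one process (the class name Detector, the predicate parameter lp, the 0-based process identities, and the print statements are illustrative assumptions; the message transport is left abstract):

class Detector:
    def __init__(self, i, n, lp):
        self.i, self.n, self.lp = i, n, lp     # identity, #processes, local predicate
        self.vc = [0] * n                      # vector clock
        self.sat = set()                       # processes whose local predicate holds
        self.done = False                      # has LP_i already been satisfied?
        self.first = [0] * n                   # candidate vector date

    def _check_lp(self, state):                # lines 1-6
        if not self.done and self.lp(self.i, state):
            self.sat.add(self.i)
            self.first = self.vc[:]
            self.done = True
            if self.sat == set(range(self.n)):
                print('first global state:', self.first)   # line 4

    def internal_event(self, state):           # lines 7-9
        self.vc[self.i] += 1
        self._check_lp(state)

    def on_send(self, state):                  # lines 10-12: control data to piggyback
        self.vc[self.i] += 1
        self._check_lp(state)
        return (self.vc[:], set(self.sat), self.first[:])

    def on_receive(self, state, vc, sat, first):   # lines 13-21
        self.vc[self.i] += 1
        self.vc = [max(a, b) for a, b in zip(self.vc, vc)]
        self._check_lp(state)
        if not sat <= self.sat:                # pi learns something new (line 16)
            self.sat |= sat
            self.first = [max(a, b) for a, b in zip(self.first, first)]
            if self.sat == set(range(self.n)):
                print('first global state:', self.first)   # line 19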

7.2.6 Vector Clocks in Action:
On-the-Fly Determination of the Immediate Predecessors

The Notion of a Relevant Event At some abstraction level, only a subset of the
events are relevant. As an example, in some applications only the modification of
some local variables, or the processing of specific messages, is relevant. Given a
distributed computation Ĥ = (H, −→ev), let R ⊂ H be the subset of its events that
are defined as relevant.
Let us accordingly define the causal precedence relation on these events, denoted
−→re, as follows:

∀e1 , e2 ∈ R : (e1 −→re e2 ) ⇔ (e1 −→ev e2 ).

This relation is nothing else than the projection of −→ev on the elements of R. The
pair R̂ = (R, −→re) constitutes an abstraction of the distributed computation Ĥ =
(H, −→ev).
Without loss of generality, we consider that the set of relevant events consists
only of internal events. Communication events are low-level events which, without
being themselves relevant, participate in the establishment of causal precedence on
relevant events. If a communication event has to be explicitly observed as relevant,
an associated relevant internal event can be generated just before or after it.
An example of a distributed execution with both relevant events and non-relevant
events is described in Fig. 7.17. The relevant events are represented by black dots,
while the non-relevant events are represented by white dots.
The management of vector clocks restricted to relevant events (Fig. 7.18) is a
simplified version of the one described in Fig. 7.9. As we can see, despite the fact
that they are not relevant, communication events participate in the tracking of causal
precedence. Given a relevant event e timestamped ⟨e.vc[1..n], i⟩, the integer e.vc[j ] is the
number of relevant events produced by pj and known by pi .
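
In code, this restricted management is a one-line variation on the basic vector clock: only relevant internal events increment the clock, while communication events merely merge and propagate it. A minimal Python sketch (the class name RelevantVC is an illustrative assumption):

class RelevantVC:
    def __init__(self, i, n):
        self.i, self.vc = i, [0] * n

    def relevant_event(self):
        self.vc[self.i] += 1          # count one more relevant event (line 1)
        return self.vc[:]             # vector date of the event

    def on_send(self):
        return self.vc[:]             # piggyback the clock, no increment (line 3)

    def on_receive(self, vc):
        self.vc = [max(a, b) for a, b in zip(self.vc, vc)]  # merge only (line 4)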

The Notion of an Immediate Predecessor and the Immediate Predecessor
Tracking Problem Given two events e1 , e2 ∈ R, the relevant event e1 is an im-
mediate predecessor of the relevant event e2 if

(e1 −→re e2 ) ∧ (∄ e ∈ R : (e1 −→re e) ∧ (e −→re e2 )).

Fig. 7.17 Relevant events in a distributed computation

when producing a relevant internal event e do


(1) vci [i] ← vci [i] + 1;
(2) Produce the relevant event e. % The date of e is vci [1..n].

when sending MSG (m) to pj do


(3) send MSG (m, vci [1..n]) to pj .

when MSG (m, vc) is received from pj do


(4) vci [1..n] ← max(vci [1..n], vc[1..n]).

Fig. 7.18 Vector clock system for relevant events (code for process pi )

Fig. 7.19 From relevant events to Hasse diagram

The immediate predecessor tracking (IPT) problem consists in associating with
each relevant event the set of relevant events that are its immediate predecessors.
Moreover, this has to be done on the fly and without adding control messages. The
determination of the immediate predecessors consists in computing the transitive
reduction (or Hasse diagram) of the partial order R̂ = (R, −→re). This reduction cap-
tures the essential causality of R̂.
The left of Fig. 7.19 represents the distributed computation of Fig. 7.17, in which
only the relevant events are explicitly indicated, together with their vector dates and
an identification pair (made up of a process identity plus a sequence number). The
right of the figure shows the corresponding Hasse diagram.

An Algorithm Solving the IPT Problem: Local Variables A relevant
event produced by a process pk is unambiguously identified by the pair (k, vck), where vck is the
value of vck [k] when pk has produced this event. The aim of the algorithm is conse-

when producing a relevant internal event e do


(1) vci [i] ← vci [i] + 1;
(2) Produce the relevant event e; let e.imp = {(k, vci [k]) | impi [k] = 1};
(3) impi [i] ← 1;
(4) for each j ∈ {1, . . . , n} \ {i} do impi [j ] ← 0 end for.

when sending MSG (m) to pj do


(5) send MSG (m, vci [1..n], impi [1..n]) to pj .

when MSG (m, vc, imp) is received do


(6) for each k ∈ {1, . . . , n} do
(7) case vci [k] < vc[k] then vci [k] ← vc[k]; impi [k] ← imp[k]
(8) vci [k] = vc[k] then impi [k] ← min(impi [k], imp[k])
(9) vci [k] > vc[k] then skip
(10) end case
(11) end for.

Fig. 7.20 Determination of the immediate predecessors (code for process pi )

quently to associate with each relevant event e a set e.imp such that (k, vck) ∈ e.imp
if and only if the corresponding event is an immediate predecessor of e.
To that end, each process pi manages the following local variables:
• vci [1..n] is the local vector clock.
• impi [1..n] is a vector initialized to [0, . . . , 0]. Each variable impi [j ] contains 0
or 1. Its meaning is the following: impi [j ] = 1 means that the last relevant event
produced by pj , as known by pi , is candidate to be an immediate predecessor of
the next relevant event produced by pi .

An Algorithm Solving the IPT Problem: Process Behavior The algorithm ex-
ecuted by each process pi is a combined management of both the vectors vci and
impi . It is described in Fig. 7.20.
When a process pi produces a relevant event e, it increases its own vector clock
(line 1). Then, it considers impi [1..n] to compute the immediate predecessors of
e (line 2). According to the definition of each entry of impi , those are the events
identified by the pairs (k, vci [k]) such that impi [k] = 1 (which indicates that the
last relevant event produced by pk and known by pi , is still a candidate to be an
immediate predecessor of e).
Then, as it produced a new relevant event e, pi must reset its local array
impi [1..n] (lines 3–4). It resets (a) impi [i] to 1 (because e is candidate to be an
immediate predecessor of relevant events that will appear in its causal future) and
(b) each impi [j ] (j ≠ i) to 0 (because no event that will be produced in the future
of e can have them as immediate predecessors).
When pi sends a message, it attaches to this message all its local control in-
formation, namely, vci and impi (line 5). When it receives a message, pi updates
its local vector clock as in the basic algorithm so that the new value of vci is the
component-wise maximum of vc and the previous value of vci .

Fig. 7.21 Four possible cases when updating impi [k], while vci [k] = vc[k]

The update of the array impi depends on the value of each entry of vci .
• If vci [k] < vc[k], pj (the sender of the message) has fresher information on pk
than pi . Consequently, pi adopts what is known by pj , and sets impi [k] to imp[k]
(line 7).
• If vci [k] = vc[k], the last relevant event produced by pk and known by pi is the
same as the one known by pj . If this event is still a candidate to be an immedi-
ate predecessor of the next event produced by pi from both pi 's point of view
(impi [k]) and pj 's point of view (imp[k]), then impi [k] remains equal to 1;
otherwise, impi [k] is set to 0 (line 8). The four possible cases are depicted in
Fig. 7.21.
• If vci [k] > vc[k], pi knows more about pk than pj does. Hence, it does not modify the
value of impi [k] (line 9). (A sketch of this combined update is given after this list.)
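
The following Python sketch mirrors this combined management of vci and impi (the class name IPT and the 0-based indexes are illustrative assumptions; in the sketch the candidate set is computed from the pre-increment clock so that, for k = i, the recorded pair identifies pi 's previous relevant event):

class IPT:
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.vc = [0] * n             # vector clock restricted to relevant events
        self.imp = [0] * n            # imp[k] = 1: last known event of p_k is a candidate

    def relevant_event(self):         # lines 1-4
        preds = {(k, self.vc[k]) for k in range(self.n) if self.imp[k] == 1}
        self.vc[self.i] += 1
        self.imp = [0] * self.n       # older events can no longer be immediate predecessors
        self.imp[self.i] = 1          # the new event is now the local candidate
        return preds                  # e.imp of the event just produced

    def on_send(self):                # line 5
        return (self.vc[:], self.imp[:])

    def on_receive(self, vc, imp):    # lines 6-11
        for k in range(self.n):
            if self.vc[k] < vc[k]:    # the sender has fresher information on p_k
                self.vc[k] = vc[k]
                self.imp[k] = imp[k]
            elif self.vc[k] == vc[k]: # same last event: it stays a candidate only
                self.imp[k] = min(self.imp[k], imp[k])  # if both sides still agree
            # if self.vc[k] > vc[k]: local knowledge is fresher, keep as is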

7.3 On the Size of Vector Clocks

This section first shows that vector time has an inherent price, namely, the size of
vector clocks cannot be less than the number of processes. Then it introduces the
notion of a relevant event and presents a general technique that allows us to reduce
the number of vector entries that have to be transmitted in each message. Finally,
it presents the notions of approximation of the causality relation and approximate
vector clocks.

7.3.1 A Lower Bound on the Size of Vector Clocks

A vector clock has one entry per process, i.e., n entries. It follows from the algorithm
of Fig. 7.9 and Theorem 5 that vectors of size n are sufficient to capture causality
and independence among events. Hence the question: Are vectors of size n neces-
sary to capture causality and concurrency among the events produced by n asyn-
chronous processes communicating by sending and receiving messages through a
reliable asynchronous network?
This section shows that the answer to this question is “yes”. This means that there
are distributed executions in which causality and concurrency cannot be captured
by vector clocks of size smaller than n. The proof of this result, which is due to B.
Charron-Bost (1991), consists in building such a specific execution and showing a
contradiction if vector clocks of size smaller than n are used to capture causality
and independence of events.

The Basic Execution Let us consider an execution of n processes in which each
process pi , 1 ≤ i ≤ n, executes a sending phase followed by a reception phase.
There is a communication channel between any pair of distinct processes, and the
communication pattern is based on a logical ring defined as follows on the process
identities: i, i + 1, i + 2, . . . , n, 1, . . . , i − 1, i. The notations i + x and i − y are
used to denote the xth successor and the yth predecessor of the identity i on the
ring, respectively. More precisely, the behavior of each process pi is as follows.
• A process pi first sends, one after the other, a message to each process of the
following “increasing” list of (n − 2) processes: pi+1 , pi+2 , . . . , pn , p1 , . . . ,
pi−2 . The important points are the “increasing” order in the list of processes and
the fact that pi does not send a message to pi−1 .
• Then, pi receives, one after the other, the (n − 2) messages sent to it, in the
“decreasing” order on the identity of their senders, namely, pi receives first the
message from pi−1 , then the one from pi−2 , . . . , p1 , pn , . . . , pi+2 . As before, the
important points are the “decreasing” order in the list of processes and the fact
that pi does not receive a message from pi+1 .
This communication pattern is described in Fig. 7.22. (A simpler figure with only
three processes is described in Fig. 7.23.)
Considering a process pi , let fsi denote its first send event (which is the sending
of a message to pi+1 ), and lri denote its last receive event (which is the reception of
a message from pi+2 ).

Lemma 1 ∀ i ∈ {1, . . . , n} : lri || fsi+1 .

Proof As any process first sends messages before receiving any message, there is
no causal chain involving more than one message. The lemma follows from this
observation and the fact that pi+1 does not send a message to pi .
Lemma 2 ∀ i, j : 1 ≤ i ≠ j ≤ n : fsi+1 −→ev lrj .

Fig. 7.22 A specific communication pattern

Fig. 7.23 Specific communication pattern with n = 3 processes

Proof Let us first consider the case j = i + 1. We have then lrj = lri+1 . As both fsi+1
and lri+1 are produced by pi+1 , and the send phase precedes the receive phase, the
lemma follows.
Let us now consider the case j ≠ i + 1. Due to the assumption, we have then
j ∉ {i, i + 1}. It then follows from the communication pattern that pi+1 sends a
message to pj . Let s(i, j ) and r(i, j ) be the corresponding send event and receive
event, respectively. We have (a) fsi+1 = s(i, j ) or fsi+1 −→ev s(i, j ), (b) s(i, j ) −→ev
r(i, j ), and (c) r(i, j ) = lrj or r(i, j ) −→ev lrj . Combining the previous relations,
we obtain fsi+1 −→ev lrj .

Theorem 7 Let n ≥ 3 be the number of processes. To capture causality and in-
dependence of events produced by asynchronous computations, the dimension of
vector time has to be at least n.

Proof Let e.vc[1..k] (e.vc) be the vector date associated with event e. Let us con-
sider a process pi . It follows from Lemma 1 that, for each i ∈ {1, . . . , n}, we have
lri || fsi+1 , which means that the vector dates lri .vc and fsi+1 .vc have to be incom-
parable. If lri .vc[j ] ≥ fsi+1 .vc[j ] for every j , we would have lri .vc ≥ fsi+1 .vc, which
is impossible since the events are independent. There consequently exist indexes x
such that lri .vc[x] < fsi+1 .vc[x]. Let ϕ(i) be one of these indexes. As this is true for
any i ∈ {1, . . . , n}, we have defined a function

ϕ : {1, . . . , n} → {1, . . . , k}.

The rest of the proof shows that k ≥ n. This is done by showing that the function
ϕ() is one-to-one.
Let us assume by contradiction that ϕ() is such that there are two distinct indexes
i and j such that ϕ(i) = ϕ(j ) = x. Due to the definition of ϕ(), we have lri .vc[x] <
fsi+1 .vc[x] (A1), and lrj .vc[x] < fsj+1 .vc[x] (A2). On another side, it follows from
Lemma 2 that fsi+1 −→ev lrj . Since vector clocks are assumed to capture causality, we
have fsi+1 .vc ≤ lrj .vc (A3). Combining (A1), (A2), and (A3), we obtain lri .vc[x] <
fsi+1 .vc[x] ≤ lrj .vc[x] < fsj+1 .vc[x], from which we conclude ¬(fsj+1 −→ev lri ),
which contradicts Lemma 2 and concludes the proof of the theorem.

The reader can notice that the previous proof relies only on the fact that vector
entries are comparable. Moreover, from a theoretical point of view, it does not re-
quire the value domain of each entry to be restricted to the set of integers (even if
integers are easier to handle).

7.3.2 An Efficient Implementation of Vector Clocks

Theorem 7 shows that there are distributed executions in which the dimension of
vector time must be n if one wants to capture causality and independence (concur-
rency) among the events generated by n processes.
Considering an abstraction level defined by relevant events (as defined in
Sect. 7.2.6), this section presents an abstract condition, and two of its implemen-
tations, that allow each message to carry only a subset of the vector clock of its
sender. Of course, in the worst case, this subset is the whole vector clock of the
sender. This condition is on a “per message” basis. This means that the part of a
vector clock carried by a message m depends on what is known by its sender about
the values of all vector clocks, when it sends this message. This control information
is consequently determined just before a message is sent.
Let us recall that communication events cannot be relevant events. Only a subset
of the internal events are relevant.

To Transmit or Not to Transmit Control Information: A Necessary and Suf-
ficient Condition Given a message m sent by a process pi to a process pj , let

s(m, i, j ) denote its send event and r(m, i, j ) denote its receive event. Moreover, let
pre(m, i, j ) denote the last relevant event (if any) produced by pj before r(m, i, j ).
Moreover, e being any (relevant or not) event produced by a process px , let e.vcx [k]
be the value of vcx [k] when px produces e. Let K(m, i, j, k) be the following pred-
icate:
K(m, i, j, k) =def (s(m, i, j ).vci [k] ≤ pre(m, i, j ).vcj [k]).
When true, K(m, i, j, k) means that vci [k] is smaller than or equal to vcj [k] when pi
sends m to pj ; consequently, it is not necessary for pi to transmit the value of vci [k]
to pj . The next theorem captures the full power of this predicate.

Theorem 8 The predicate ¬K(m, i, j, k) is a necessary and sufficient condition for


pi to transmit the pair (k, vci [k]) when it sends a message m to pj .

Proof Necessity. Let us assume that ¬K(m, i, j, k) is satisfied, i.e., s(m, i, j ).vci [k]
> pre(m, i, j ).vcj [k]. According to the definition of vector clocks we must have
s(m, i, j ).vci [k] ≤ r(m, i, j ).vcj [k]. If the pair (k, vci [k]) is not attached to m, pj
cannot update vcj [k] to its correct value, which proves the necessity part.
Sufficiency. Let us consider a message m sent by pi to pj such that K(m, i, j, k)
is satisfied. Hence, we have s(m, i, j ).vci [k] ≤ pre(m, i, j ).vcj [k]. As r(m, i, j ).
vcj [k] ≥ pre(m, i, j ).vcj [k], we have s(m, i, j ).vci [k] ≤ r(m, i, j ).vcj [k], from
which it follows that, if the pair (k, vci [k]) is attached to m, it is useless as vcj [k]
does not need to be updated. 

From an Abstract Predicate to an Operational Predicate Unfortunately, pi
cannot evaluate the predicates K(m, i, j, k), 1 ≤ k ≤ n, before sending a message
m to pj . The aim is consequently to find an operational predicate K′(m, i, j, k),
such that K′(m, i, j, k) ⇒ K(m, i, j, k) (operational means that the corresponding
predicate can be locally computed by pi ). This means that, if K′(m, i, j, k) indicates
that it is useless to transmit the pair (k, vci [k]), then this decision does not entail an
incorrect management of vector clocks.
Of course, always taking K′(m, i, j, k) = false works, but would force each mes-
sage to carry the full vector clock. In order to attach to each message m as few pairs
(k, vci [k]) as possible, the aim is to find an operational predicate K′(m, i, j, k) that
is the best possible approximation of K(m, i, j, k), which can be locally computed.
To implement such a local predicate K′(m, i, j, k), each process pi manages an
additional control data structure, namely a matrix denoted kprimei [1..n, 1..n]. The value of
each entry kprimei [ℓ, k] is 0 or 1, and its initial value is 1. It is managed in such a
way that, to pi 's knowledge,

(kprimei [ℓ, k] = 1) ⇒ (vci [k] ≤ vcℓ [k]).

Hence,

K′(m, i, j, k) =def (s(m, i, j ).kprimei [j, k] = 1).

The algorithms that follow are due to J.-M. Hélary, M. Raynal, G. Melideo, and
R. Baldoni (2003).

when producing a relevant internal event e do


(1) vci [i] ← vci [i] + 1; % e.vci [1..n] is the vector date of e
(2) for each ℓ ∈ {1, . . . , n} \ {i} do kprimei [ℓ, i] ← 0 end for.

when sending MSG (m) to pj do


(3) let vc_set = {(k, vci [k]) such that kprimei [j, k] = 0 };
(4) send MSG (m, vc_set) to pj .

when MSG (m, vc_set) is received do


(5) for each (k, vck) ∈ vc_set do
(6) case vci [k] < vck then vci [k] ← vck;
(7) for each ℓ ∈ {1, . . . , n} \ {i, j, k}
(8) do kprimei [ℓ, k] ← 0
(9) end for;
(10) kprimei [j, k] ← 1
(11) vci [k] = vck then kprimei [j, k] ← 1
(12) vci [k] > vck then skip
(13) end case
(14) end for.

Fig. 7.24 Management of vci [1..n] and kprimei [1..n, 1..n] (code for process pi ): Algorithm 1

A First Algorithm The algorithm of Fig. 7.24 describes the way each process pi
has to manage its vector clock vci [1..n] and its matrix kprimei [1..n, 1..n] so that the
previous relation is satisfied. Let us recall that vci [1..n] is initialized to [0, . . . , 0],
while kprimei [1..n, 1..n] is initialized to [[1, . . . , 1], . . . , [1, . . . , 1]].
When it produces a relevant event, pi increases vci [i] (line 1) and resets to 0
(line 2) all entries of the column kprimei [1..n, i] (except its own entry). This is
because pi then knows that vci [i] > vcℓ [i] for every ℓ ≠ i.
When it sends a message to a process pj , pi adds to it the set vc_set con-
taining the pairs (k, vck) such that, to its knowledge vcj [k] < vci [k] (line 3). Ac-
cording to the definition of kprimei [1..n, 1..n], those are the pairs (k, −) such that
kprimei [j, k] = 0.
When a process pi receives a message m with an associated set of pairs vc_set, it
considers separately each pair (k, vck) ∈ vc_set. This is in order to preserve the prop-
erty associated with K′(m, i, j, k) for each k, i.e., (kprimei [ℓ, k] = 1) ⇒ (vci [k] ≤
vcℓ [k]). The behavior of pi depends on the values of the pair (vci [k], vck). More
precisely, we have the following.
• If vci [k] < vck, pi is less informed on pk than the sender pj of the mes-
sage. It consequently updates vci [k] to a more recent value (line 6), and sets
(a) kprimei [ℓ, k] to 0 for ℓ ∉ {i, j, k} (this is because pi does not know whether
vcℓ [k] ≥ vci [k], lines 7–9), and (b) kprimei [j, k] to 1 (because now it knows that
vcj [k] ≥ vci [k], line 10).
• If vci [k] = vck, pi accordingly sets kprimei [j, k] to 1 (line 11).
• If vci [k] > vck, pi is more informed on pk than the sender pj of the message. It
consequently does not modify the array kprimei (line 12). (A sketch of this management is given after this list.)
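
A Python sketch of this management (the class name EfficientVC and the 0-based indexes are illustrative assumptions) combines the vector clock with the kprime matrix along the three cases above:

class EfficientVC:
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.vc = [0] * n
        self.kprime = [[1] * n for _ in range(n)]   # kprime[l][k] = 1: vc_i[k] <= vc_l[k]

    def relevant_event(self):                       # lines 1-2
        self.vc[self.i] += 1
        for l in range(self.n):
            if l != self.i:
                self.kprime[l][self.i] = 0          # the others now lag behind on entry i

    def on_send(self, j):                           # lines 3-4: pairs to piggyback
        return [(k, self.vc[k]) for k in range(self.n) if self.kprime[j][k] == 0]

    def on_receive(self, j, vc_set):                # lines 5-14
        for (k, vck) in vc_set:
            if self.vc[k] < vck:                    # the sender is fresher on p_k
                self.vc[k] = vck
                for l in range(self.n):
                    if l not in (self.i, j, k):
                        self.kprime[l][k] = 0       # lines 7-9
                self.kprime[j][k] = 1               # line 10
            elif self.vc[k] == vck:
                self.kprime[j][k] = 1               # line 11
            # self.vc[k] > vck: nothing to do (line 12)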

when producing a relevant internal event e do
(1) vci [i] ← vci [i] + 1; % e.vci [1..n] is the vector date of e
(2) for each ℓ ∈ {1, . . . , n} \ {i} do kprimei [ℓ, i] ← 0 end for.
when sending MSG (m) to pj do
(3′) let vc_set = {(k, vci [k], kprimei [1..n, k]) such that kprimei [j, k] = 0};
(4) send MSG (m, vc_set) to pj .
when MSG (m, vc_set) is received do
(5′) for each (k, vck, kpk[1..n]) ∈ vc_set do
(6) case vci [k] < vck then vci [k] ← vck;
(7-10′) for each ℓ ∈ {1, . . . , n} \ {i}
(7-10′) do kprimei [ℓ, k] ← kpk[ℓ]
(7-10′) end for;
(11′) vci [k] = vck then for each ℓ ∈ {1, . . . , n} \ {i}
(11′) do kprimei [ℓ, k] ← max(kprimei [ℓ, k], kpk[ℓ])
(11′) end for;
(12) vci [k] > vck then skip
(13) end case
(14) end for.

Fig. 7.25 Management of vci [1..n] and kprimei [1..n, 1..n] (code for process pi ): Algorithm 2

Remark When considering a process pi , the values in the column kprimei [1..n, i]
(except kprimei [i, i]) remain equal to 0 after their first update (line 2).

The Case of FIFO Channels If the channels are FIFO, when a process pi sends
a message m to another process pj , the following line can be added after line 3:

for each (k, −) ∈ vc_set do kprimei [j, k] ← 1 end for.

These updates save future sendings of pairs to pj as long as pi does not produce
a new relevant event (i.e., until vci [i] is modified). In particular, with this enhance-
ment, a process pi sends the pair (i, vci [i]) to a process pj only if, since its last
relevant event, this sending is the first sending to pj .

A Modified Algorithm When pi sends a message to pj , the more entries of
kprimei [j, k] are equal to 1, the fewer pairs (k, vck) have to be transmitted (line 3).
Hence, the idea is to design an algorithm that increases the number of entries of the
local arrays kprimei [1..n, 1..n] equal to 1, in order to decrease the size of vc_set.
To that end, let us replace each pair (k, vck) transmitted in vc_set by a triplet
(k, vck, kprimei [1..n, k]), and modify the statements associated with a message re-
ception to benefit from the values of the binary vectors which have been received.
The corresponding algorithm is described in Fig. 7.25.
When pi receives a message sent by a process pj , it can update kprimei [ℓ, k]
to kprimej [ℓ, k] if pj was more informed on pk than pi (case vci [k] < vcj [k]).
In this case, for every ℓ, pi updates it to the received value kprimej [ℓ, k]
(lines 7–10′, which replace lines 7–10 of the algorithm in Fig. 7.24).
If pi and pj are such that vci [k] = vcj [k], pi updates each kprimei [ℓ, k] to
max(kprimei [ℓ, k], kprimej [ℓ, k]) (line 11′, which replaces line 11). This is

because, if vci [k] = vcj [k] and kprimej [ℓ, k] = 1, pi knows that vcℓ [k] ≥ vci [k], if
it did not know it before. There is of course a tradeoff between the number of pairs
whose sending is saved and the number of binary vectors which have now to be sent
(see below).

An Adaptive Communication Layer Let s be the number of bits required to
encode the sequence numbers of the relevant events of each process. Let P0 be the
basic vector clock algorithm suited to relevant events (Fig. 7.18), P1 the algorithm
in which messages carry pairs (Fig. 7.24), and P2 the algorithm in which messages
carry triplets (Fig. 7.25).
As vectors have a canonical representation, the bit size of the control information
associated with a message m is n × s bits in P0. Given some point of an execution of
algorithm P0, let m be a message sent by pi to pj . If, instead of the full vector clock,
m had to piggyback the set vc_set defined at line 3 of P1, the bit size of the control
information associated with m would be α1 = set_size × (s + log2 n) bits. Similarly,
it would be α2 = set_size × (n + s + log2 n) bits if m had to piggyback the set vc_set
defined at line 3 of P2.
Let us observe that, while α1 < α2 , the algorithm P2 has the ability to make more
entries of the matrices kprimei [1..n, 1..n] equal to 1, which has a direct impact on
the size of the control information associated with messages that will be sent in the
future. Hence, choosing between P1 and P2 has to depend on some heuristic func-
tion based on the structure of the computation. It is nevertheless possible to define
an adaptive communication layer which selects, dynamically for each message m,
the “best” algorithm among P0, P1, and P2, to send and receive m.
These observations direct us to the adaptive algorithm described in Fig. 7.26, in
which the statements associated with the sending and the reception of each message
m are dynamically selected. If sending the full vector clock is cheaper, the sending
and receiving rules of P0 are chosen (lines 2–3). Otherwise, a heuristic function is
used to select either the rules of P1, or the rules of P2. This is expressed with the
predicate heuristic() (line 4), which may depend on the communication graph or the
application structure. A simple example of such a predicate is the following one:

predicate heuristic() is return((n − Σ1≤x≤n kprimei [x, k]) > c) end predicate.

The value c is a threshold: if the number of triplets to transmit is not greater than c,
then algorithm P2 is used, otherwise algorithm P1 is used.
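
Condensed in code, the selection rule reads as follows (a small Python sketch; choose_algorithm, zeros, and c are illustrative names, zeros standing for the number of entries k such that kprimei [j, k] = 0 for the destination pj ):

import math

def choose_algorithm(n, s, zeros, c):
    log_n = math.ceil(math.log2(n))
    if s * n < zeros * (s + log_n):     # the full vector clock is cheaper: P0 (line 2)
        return 'P0'
    return 'P1' if zeros > c else 'P2'  # threshold heuristic of the text (lines 4-5)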
Of course, plenty of possibilities are offered to the user. As a toy example, the
messages sent to processes with an even identity could be sent and received with
P1, while the others would be sent and received with P2. A more interesting strategy
is the following. Let pi and pj be any pair of processes, where pi is the sender and
pj the receiver. When P0 is not more efficient than P1 or P2 (line 2), pi alternates in
using P1 and P2 for its successive messages to pj . Another strategy consists in drawing
(at line 4) a random number in {1, 2}, which would be used to direct a process
to use P1 or P2.

when sending MSG (m) to pj do


(1) let set_size = |{k such that kprimei [j, k] = 0 }|;
(2) if (s × n) < set_size × (s + log2 n)
(3) then attach tag “0” to the message and use algorithm P0 to send it
(4) else if (heuristic) then attach tag “1” to the message and use algorithm P1 to send it
(5) else attach tag “2” to the message and use algorithm P2 to send it
(6) end if
(7) end if.

when MSG (tag, m, vc_set) is received do


(8) According to the tag of the message, use the reception statements of P0, P1, or P2.

Fig. 7.26 An adaptive communication layer (code for process pi )

7.3.3 k-Restricted Vector Clock

An Approximation of the Causal Precedence Relation A relation −→app on the
set of events is an approximation of the causal precedence relation −→ev if

∀e1 , e2 : (e1 −→ev e2 ) ⇒ (e1 −→app e2 ).

This means that the relation −→app orders correctly any pair of causally related events.
Hence, when e1 || e2 , we have either e1 −→app e2 , or e2 −→app e1 , or e1 and e2 are not
ordered by the relation −→app. The important point is that any approximation has to
respect causal precedence. As a simple example, the order on events defined by
linear time (see Sect. 7.1.1) is an approximation of causal precedence.

k-Restricted Vector Clocks The notion of restricted vector clocks was introduced
by F. Torres-Rojas and M. Ahamad (1999). It imposes a bound k, 1 ≤ k ≤ n, on the
size of vector clocks (i.e., the dimension of vector time). The vector clock of each
process pi has only k entries, namely, vci [1..k]. These k-restricted vector clocks are
managed by the algorithm described in Fig. 7.27 (which is the same as the vector
clock algorithm described in Fig. 7.9, except for the way the vector entries are used).
Let fk () be a deterministic surjective function from {1, . . . , n} (the set of process
identities) to {1, . . . , k} (the set of vector clock entries). As a simple example, fk (i)
can be (i mod k) + 1. The function fk () defines the set of processes that share the
same entry of the restricted vector clocks.
Let e1 and e2 be two events timestamped ⟨e1 .vc[1..k], i⟩ and ⟨e2 .vc[1..k], j ⟩,
respectively. The set of timestamps defines an approximation relation as follows:
• ((i = j ) ∧ (e1 .vc[fk (i)] < e2 .vc[fk (i)])) ⇒ (e1 −→ev e2 ).
• ((i = j ) ∧ (e1 .vc[fk (i)] > e2 .vc[fk (i)])) ⇒ (e2 −→ev e1 ).
• (e1 .vc || e2 .vc) ⇒ (e1 || e2 ).
• ((i ≠ j ) ∧ (e1 .vc < e2 .vc)) ⇒ (e1 −→app e2 ). (We then have e1 −→ev e2 or e1 || e2 .)
• ((i ≠ j ) ∧ (e1 .vc > e2 .vc)) ⇒ (e2 −→app e1 ). (We then have e2 −→ev e1 or e1 || e2 .)

when producing an internal event e do


(1) vci [fk (i)] ← vci [fk (i)] + 1;
(2) Produce event e. % The date of e is vci [1..k].

when sending MSG (m) to pj do


(3) vci [fk (i)] ← vci [fk (i)] + 1; % vci [1..k] is the sending date of the message.
(4) send MSG (m, vci [1..k]) to pj .

when MSG (m, vc) is received from pj do


(5) vci [fk (i)] ← vci [fk (i)] + 1;
(6) vci [1..k] ← max(vci [1..k], vc[1..k]). % vci [1..k] is the date of the receive event.

Fig. 7.27 Implementation of a k-restricted vector clock system (code for process pi )

When e1 −→app e2 , while e1 || e2 , we say that −→app adds false causality.
If k = 1, we have then fk (i) = 1 for any i, and the k-restricted vector clock
system boils down to linear time. If k = n and fk (i) = i, the k-restricted vector clock
system implements the vector time of dimension n. The approximate relation −→app
then boils down to the (exact) causal precedence relation −→ev. When 1 < k < n, −→app
adds false causality. Experimental results have shown that, for n = 100 and 1 < k ≤
5, the percentage of false causality (with respect to all the pairs of causally related
events) added by −→app remains smaller than 10 %. This shows that approximations of
the causality relation giving few false positives can be obtained with a k-restricted
vector clock system working with a very small time dimension k. This makes k-
restricted vector clock systems attractive when one has to simultaneously keep track
of causality and cope with scaling problems.
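
A minimal Python sketch of a k-restricted vector clock (Fig. 7.27), using the sample mapping fk (i) = i mod k on 0-based indexes (the class and method names are illustrative assumptions):

class KRestrictedVC:
    def __init__(self, i, k):
        self.i, self.k = i, k
        self.vc = [0] * k                  # only k entries, shared among processes

    def f(self, pid):
        return pid % self.k                # deterministic surjection onto 0..k-1

    def _tick(self):
        self.vc[self.f(self.i)] += 1       # every event increments the shared entry

    def internal_event(self):
        self._tick()
        return self.vc[:]                  # date of the event

    def on_send(self):
        self._tick()
        return self.vc[:]                  # sending date, piggybacked on the message

    def on_receive(self, vc):
        self._tick()
        self.vc = [max(a, b) for a, b in zip(self.vc, vc)]

With k = 1 this degenerates to a linear clock, and with k = n and f the identity it is the full vector clock, as stated above.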

7.4 Matrix Time


Linear time and vector time can be generalized to matrix time. While vector time
captures “first-order” knowledge (a process knows that another process has issued
some number of events), matrix time captures “second-order” knowledge (a process
knows that another process knows that . . . ). This section introduces matrix time and
illustrates its use with a simple example. Matrix time is due to M.J. Fischer and
A. Michael (1982).

7.4.1 Matrix Clock: Definition and Algorithm

The matrix clock of a process pi is a two-dimensional array, denoted mci [1..n, 1..n],
such that:
• mci [i, i] counts the number of events produced by pi .
• mci [i, k] counts the number of events produced by pk , as known by pi .
It follows that mci [i, 1..n] is nothing else than the vector clock of pi .

Fig. 7.28 Matrix time: an example

• mci [j, k] counts the number of events produced by pk and known by pj , as
known by pi .
Hence, for any i, j, k, mci [j, k] = x means “pi knows that pj knows that pk has
produced x events”.
Development of matrix time is illustrated in Fig. 7.28. Let e be the last event
produced by pi . The values of mci considered are the values of that matrix clock
just after e has been produced. It follows that, if e is the xth event produced by
pi , we have e.mc[i, i] = mci [i, i] = x (these numbers are indicated in the figure on
top/bottom of the corresponding event).
• It is easy to see that e2 is the last event produced by pk and known by pi (ac-
cording to the relation −→ev). If it is the yth event produced by pk , we have
e.mc[i, k] = mci [i, k] = mci [k, k] = y.
• Similarly, if e4 (the last event produced by pj and known by pi ) is the zth event
produced by pj , we have e.mc[i, j ] = mci [i, j ] = mci [j, j ] = z.
• The figure shows that e1 is the last event of pk which is known by pj , and this is
known by pi . Let this event be the y′th event produced by pk . We consequently
have e.mc[j, k] = mci [j, k] = y′.
• Finally, e3 is the last event produced by pj and known by pk , and this is known
by pi when it produces event e. Let this event be the z′th event produced by pj .
We have e.mc[k, j ] = mci [k, j ] = z′.

Matrix Time Algorithm The algorithm implementing matrix time is described in
Fig. 7.29. A process pi increases its event counter each time it produces an event
(lines 1, 3, and 5). Each message piggybacks the current value of the matrix clock
of its sender.
When a process pi receives a message it updates its matrix clock as follows.
• The vector clock of pi , namely mci [i, 1..n], is first updated as in the basic vector
clock algorithm (line 6). The value of the vector clock sent by pj is in the vector
mc[j, 1..n].
• Then, pi updates each entry of its matrix clock so that the matrix contains every-
thing that can be known according to the relation −→ev (line 7).

when producing an internal event e do


(1) mci [i, i] ← mci [i, i] + 1;
(2) Produce event e. % The matrix date of e is mci [1..n][1..n].

when sending MSG (m) to pj do


(3) mci [i, i] ← mci [i, i] + 1; % mci [1..n][1..n] is the sending date of the message.
(4) send MSG (m, mci [1..n][1..n]) to pj .

when MSG (m, mc) is received from pj do


(5) mci [i, i] ← mci [i, i] + 1;
(6) mci [i, 1..n] ← max(mci [i, 1..n], mc[j, 1..n]);
(7) for each (k, ℓ) ∈ [1..n, 1..n] do mci [k, ℓ] ← max(mci [k, ℓ], mc[k, ℓ]) end for.
% mci [1..n, 1..n] is the matrix date of the receive event.

Fig. 7.29 Implementation of matrix time (code for process pi )

Property of Matrix Clocks Let us consider the two following cases:


• Let min(mci [1, k], . . . , mci [n, k]) = x. This means that, to pi ’s knowledge, all the
processes know that pk has produced x events. This can be used by pi to forget
events produced by pk which are older than the (x + 1)th one.
• Let min(mci [1, i], . . . , mci [n, i]) = x. This means that, to pi 's knowledge, all the
processes know that it has produced x events. This can be used by pi to forget
parts of its past older than its (x + 1)th event.
This means that, in some applications, the two previous observations can be used by a
process to discard old data, as soon as it knows that these data are known by all the
processes.
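
In code, each of these discard tests is a minimum over one column of the matrix clock; a one-function Python sketch (all_know_at_least is an assumed helper name):

def all_know_at_least(mc, k, x):
    """mc is an n x n matrix clock; True if every process is known (to the owner
    of mc) to know that p_k has produced at least x events."""
    return min(row[k] for row in mc) >= x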

7.4.2 A Variant of Matrix Time in Action: Discard Old Data

This example is related to the previous property of matrix clocks. It concerns the
management of a message buffer.

A Buffer Management Problem A process can invoke two operations. The oper-
ation broadcast(m) allows it to send a message to all processes, while the operation
deliver() returns to it a message that has been broadcast.
For various reasons (fault-tolerance, archive recording, etc.), each process keeps in
a private space (e.g., a local disk), called a buffer, all the messages it has broadcast or
delivered. A condition imposed on a process which wants to destroy messages to
free buffer space is that a message has to be known by all other processes before
being destroyed. A structural view of this buffer management problem is described
in Fig. 7.30.

The Buffer Management Algorithm A simple adaptation of matrix clocks solves
this problem. As only broadcast messages are relevant, the local time of a process is

Fig. 7.30 Discarding obsolete data: structural view (at a process pi )

represented here by the number of messages it has broadcast. The definition of the
content of the matrix entry mci [j, k] has consequently to be interpreted as follows:
mci [j, k] = x means that, to pi 's knowledge, the first x messages broadcast by pk
have been delivered by pj .
The corresponding algorithm (which assumes FIFO channels) is described in
Fig. 7.31. The FIFO assumption simplifies the design of the algorithm. It guarantees
that, if a process pi delivers a message m broadcast by a process pk , it has previously
delivered all the messages broadcast by pk before m.
When pi broadcasts a message m, it increases mci [i, i], which is the sequence
number of m (line 1). Then, after it has associated the timestamp ⟨mci [i, 1..n], i⟩
with m, pi sends m and its timestamp to all the other processes, and deposits the pair
(m, mci [i, i], i) in its local buffer (lines 2–3).
When it receives a message m with its timestamp ⟨vc, j ⟩, a process pi first de-
posits the pair (m, vc[j ], j ) in its buffer and delivers m to the local application
process (line 4). Then, pi increases mci [i, j ] (it has delivered one more message
from pj ), and updates its local view of the vector clock of pj , namely, mci [j, 1..n],
to vc (line 5). The fact that a direct assignment replaces the usual vector clock update

operation broadcast(m) is
(1) mci [i, i] ← mci [i, i] + 1;
(2) for each j ∈ {1, . . . , n} \ {i} do send(m, mci [i, 1..n], i) end for;
(3) deposit(m, mci [i, i], i) into the buffer and deliver m locally.

when (m, vc, j ) is received do


(4) deposit(m, vc[j ], j ) into the buffer; deliver(m) to the upper layer;
(5) mci [i, j ] ← mci [i, j ] + 1; mci [j, 1..n] ← vc[1..n].

background task T is
(6) repeat forever
(7) if (∃(m, sn, k) ∈ buffer such that sn ≤ min(mci [1, k], . . . , mci [n, k]))
(8) then (m, sn, k) can be discarded from the buffer
(9) end if
(10) end repeat.

Fig. 7.31 A buffer management algorithm (code for process pi )



statement mci [j, 1..n] ← max(mci [j, 1..n], vc[1..n]) is due to the FIFO property of
the channels.
Finally, a local task T runs forever in the background. This task is devoted to the
management of the buffer. If there is a message m in the buffer whose tag ⟨sn, k⟩ is
such that sn ≤ min(mci [1, k], . . . , mci [n, k]), pi can conclude that all the processes
have delivered this message. As a particular case, pi knows that pj has delivered
m (which was broadcast by pk ) because (a) mci [j, k] ≥ sn and (b) the channels
are FIFO (mci [j, k] being a local variable that pi modifies only when it receives
a message from pj , it follows that pi has previously received from pj a message
carrying a vector vcj such that vcj [k] ≥ sn).

Remark The reader can observe that this simple algorithm includes three notions of
logical time: local time with sequence numbers, vector time which allows a process
to know how many messages from a specific sender have been delivered by the other
processes, and matrix time which encapsulates the previous notions of time. Matrix
clocks are local, messages carry only vector clocks, and each buffer registers only
sequence numbers.

7.5 Summary

This chapter has addressed the concept of logical time in distributed computations.
It has introduced three types of logical time: linear (or scalar) time, vector time,
and matrix time. An appropriate notion of virtual clock is associated with each of
them. The meaning of these time notions has been investigated, and examples of
their use have been presented. Basically, linear time is fundamental when one has to
establish a total order on events that respects causal precedence, vector time captures
exactly causal precedence, while matrix time provides processes with a “second
order” knowledge on the progress of the whole set of processes.

7.6 Bibliographic Notes

• The notion of linear (scalar) time was introduced in 1978 by L. Lamport in [226].
This is a fundamental paper in which Lamport also introduced the happened be-
fore relation (causal precedence), which captures the essence of a distributed com-
putation.
• The timestamp-based total order broadcast algorithm presented in Sect. 7.1.4 is a
variant of algorithms described in [23, 226].
• The notions of vector time and vector clocks were introduced in 1988 (ten
years after linear clocks) simultaneously and independently by C.J. Fidge [124],
F. Mattern [250], and F. Schmuck [338]. The underlying theory is described
in [125, 250, 340].

Preliminary intuitions of vector clocks appear in several papers (e.g., [53, 238,
290, 307, 338, 360]). Surveys on vector clock systems appear in [40, 149, 325].
The power and limitations of vector clocks are investigated in [126, 312].
• Tracking of causality in specific contexts is the subject of numerous papers. The
case of synchronous systems is addressed in [6, 152], while the case of mobile
distributed systems is addressed in [300].
• The algorithm which detects the first global state satisfying a conjunction of stable
local predicates is due to M. Raynal [312].
• The proof of the lower bound showing that the size of vector clocks has to be at
least n (the number of processes) if one wants to capture causality and indepen-
dence of events is due to B. Charron-Bost [86].
• The notion of an efficient implementation of vector clocks was first intro-
duced in [350]. The efficient algorithms implementing vector clocks presented
in Sect. 7.3.2 are due J.-M. Hélary, M. Raynal, G. Melideo, and R. Baldoni [181].
An algorithm to reset vector clocks is presented in [394].
• The notion of a dependency vector was introduced in [129]. Such a vector is
a weakened vector clock. This notion is generalized in [37] to the notion of k-
dependency vector clock (k = n provides us with vector clocks).
• The notion of immediate predecessors of relevant events was introduced in [108,
198]. The corresponding tracking algorithm presented in Sect. 7.2.6 is due to E.
Anceaume, J.-M. Hélary, and M. Raynal [17, 18].
• The notion of k-restricted vector clocks is due to F. Torres-Rojas and M.
Ahamad [370], who also introduced a more general notion of plausible clock
systems.
• Matrix time and matrix clocks were informally introduced in [127] and used
in [334, 390] to discard obsolete data (see also [8]).
• Another notion of virtual time, suited to distributed simulation, is presented and
studied in [199, 263].

7.7 Exercises and Problems


1. Let D be the set of all the linear clock systems that are consistent. Given any
D ∈ D, let dateD (e) be the date associated by D with the event e. Let E be
a distributed execution. As any D ∈ D is consistent, we have (e1 −→ev e2 ) ⇒
dateD (e1 ) < dateD (e2 ), for any pair (e1 , e2 ) of events of E.
Given a distributed execution E, show that, for any pair (e1 , e2 ) of events of
E, we have
• (e1 || e2 ) ⇒ (∃D′ ∈ D : dateD′ (e1 ) = dateD′ (e2 )).
• (e1 || e2 ) ⇒
[∃D′, D″ ∈ D : (dateD′ (e1 ) ≤ dateD′ (e2 )) ∧ (dateD″ (e1 ) ≥ dateD″ (e2 ))].
2. Considering the algorithm implementing the total order broadcast abstraction
described in Fig. 7.7, let us replace the predicate of line 11, sd_date ≥ clocki [i],
by the following predicate: ⟨sd_date, j ⟩ < ⟨clocki [i], i⟩.

when producing an internal event e do


(1) g1i ← g1i + 1;
(2) Produce event e. % The date of e is the pair (g1i , g2i ).

when sending MSG (m) to pj do


(3) g1i ← g1i + 1; % (g1i , g2i ) is the sending date of the message.
(4) send MSG (m, g1i ) to pj .

when MSG (m, g1) is received from pj do


(5) g1i ← max(g1i , g1) + 1;
(6) g2i ← max(g2i , g1). % (g1i , g2i ) is the date of the receive event.

Fig. 7.32 Yet another clock system (code for process pi )

• Is the algorithm still correct? (Either a proof or a counterexample is needed.)


• What is the impact of this new predicate on the increase of local clock values?
3. Show that the determination of the first global state satisfying a conjunction of
stable local predicates cannot be done by repeated global state computations
(which were introduced in Sect. 6.6 for the detection of stable properties).
4. When considering k-restricted vector time, prove that the statements relating the
timestamps with the relation −→app are correct.
5. Let us consider the dating system whose clock management is described in
Fig. 7.32.
The clock of each process pi is composed of two integers g1i and g2i , both
initialized to 0. The date of an event is the current value of the pair (g1i , g2i ),
and its timestamp is the pair ⟨(g1i , g2i ), i⟩.
• What relation between g1i and g2i is kept invariant?
• What do g1i and g2i mean, respectively?
• Let e1 and e2 be two events timestamped ⟨(g1, g2), i⟩ and ⟨(h1, h2), j ⟩, re-
spectively. What can be concluded on the causality relation linking the events
e1 and e2 when (g1 > h2) ∧ (h1 > g2)?
More generally, define from their timestamps ⟨(g1, g2), i⟩ and ⟨(h1, h2), j ⟩
an approximation of the causal precedence relation on e1 and e2 .
• Compare this system with a k-restricted vector clock system, 1 < k < n. Is one
more powerful than the other from an approximation of the causal precedence
relation point of view?
Solution in [370].
6. Design an algorithm that associates with each relevant event its immediate pre-
decessors and uses the matrices kprimei [1..n, 1..n] (introduced in Sect. 7.3.2) to
reduce the bit size of the control information piggybacked by messages.
Solution in [18].
Chapter 8
Asynchronous Distributed Checkpointing

This chapter is devoted to checkpointing in asynchronous message-passing systems.


It first presents the notions of local and global checkpoints and a theorem stating a
necessary and sufficient condition for a set of local checkpoints to belong to the
same consistent global checkpoint.
Then, the chapter considers two consistency conditions, which can be associated with a distributed computation enriched with local checkpoints (the corresponding execution is called a communication and checkpoint pattern). The first consistency condition (called z-cycle-freedom) ensures that any local checkpoint, which has been taken by a process, belongs to a consistent global checkpoint. The second consistency condition (called rollback-dependency trackability) is stronger. It states that a consistent global checkpoint can be associated on the fly with each local checkpoint (i.e., without additional communication).
The chapter discusses these consistency conditions and presents algorithms that, once superimposed on a distributed execution, ensure that the corresponding consistency condition is satisfied. It also presents a message logging algorithm suited to uncoordinated checkpointing.

Keywords Causal path · Causal precedence ·


Communication-induced checkpointing · Interval (of events) · Local checkpoint ·
Forced local checkpoint · Global checkpoint · Hidden dependency · Recovery ·
Rollback-dependency trackability · Scalar clock · Spontaneous local checkpoint ·
Uncoordinated checkpoint · Useless checkpoint · Vector clock · Z-dependence ·
Zigzag cycle · Zigzag pattern · Zigzag path · Zigzag prevention

8.1 Definitions and Main Theorem

8.1.1 Local and Global Checkpoints

It was shown in Chap. 6 that a distributed computation can be represented by a partial order Ŝ = (S, −σ→), where S is the set of all the local states produced by the processes and −σ→ is the causal precedence relation on these local states. Chapter 6 also defined the notion of a consistent global state, namely Σ = [σ1, . . . , σn] is
consistent if, for any pair of distinct local states σi and σj, we have σi || σj (none of them depends on the other).

Fig. 8.1 A checkpoint and communication pattern (with intervals)
In many applications, we are not interested in all the local states, but only in a
subset of them. Each process defines which of its local states are relevant. This is, for
example, the case for the detection of properties on global states (a local checkpoint
being then a local state satisfying some property), or for the definition of local states
for consistent recovery. Such local states are called local checkpoints, and a set of n local checkpoints, one per process, is a global state called a global checkpoint.
A local checkpoint is denoted cix , where i is the index (identity) of the corre-
sponding process and x is its sequence number (among the local checkpoints of the
same process). In the following, C will denote the set of all the local checkpoints.
An example of a distributed execution with local checkpoints is represented in
Fig. 8.1, which is called a checkpoint and communication pattern. The local check-
points are depicted with grey rectangular boxes. As they are irrelevant from a checkpointing point of view, the other local states are not represented. It is usually
assumed that the initial local state and the final local state of every process are local
checkpoints.
It is easy to see that the global checkpoint [ci1 , cj1 , ck1 ] is consistent, while the
global checkpoint [ci2 , cj2 , ck1 ] is not consistent.
8.1.2 Z-Dependency, Zigzag Paths, and Z-Cycles

The sequence of internal and communication events occurring at a process pi between two consecutive local checkpoints cix−1 and cix, x > 0, is called the interval Iix. Some intervals are represented in Fig. 8.1.
Zigzag Dependency Relation and Zigzag Path These notions, which are due to R.H.B. Netzer and J. Xu (1995), are an extension of the relation −σ→ defined on local states. A relation on local checkpoints, called z-dependency, is defined as follows. A local checkpoint cix z-depends on a local checkpoint cjy (denoted cix −zz→ cjy) if:
• cix and cjy are in the same process (i = j) and cix appears before cjy (x < y), or
• there is a sequence of messages ⟨m1; . . . ; mq⟩, q ≥ 1, such that (let us recall that s(m) and r(m) are the sending and receiving events associated with message m):
– s(m1) ∈ Iix+1 (i.e., m1 has been sent in the interval that starts just after cix),
– r(mq) ∈ Ijy (i.e., mq has been received in the interval finishing just before cjy),
– if q > 1, ∀ℓ : 1 ≤ ℓ < q, let Ikt be the interval in which r(mℓ) occurs (i.e., mℓ is received during Ikt). Then s(mℓ+1) ∈ Ikt′ where t′ ≥ t (i.e., mℓ+1 is sent by pk in the interval in which mℓ has been received, or in a later interval). Let us observe that it is possible that mℓ+1 has been sent before mℓ is received.
Such a sequence of messages is called a zigzag path.
• there exists a local checkpoint c such that cix −zz→ c and c −zz→ cjy.
As an example, due to the sequence of messages ⟨m3; m2⟩, we have ck0 −zz→ ci2. Similarly, due to the sequence of messages ⟨m5; m4⟩ (or the sequence ⟨m5; m6⟩), we have ci2 −zz→ ck2. Let us observe that we have ci2 −σ→ ck2, while we do not have ck0 −σ→ ci2.
A local checkpoint cix belongs to a zigzag path ⟨m1; . . . ; mq⟩ if this path contains two consecutive messages mℓ and mℓ+1 such that mℓ is received by pi before cix and mℓ+1 is sent by pi after cix. As an example, in the figure, the local checkpoint ci2 belongs to the zigzag path ⟨m3; m2; m5⟩.
A local checkpoint c belongs to a zigzag cycle if c −zz→ c. As an example, in Fig. 8.1, we have ck2 −zz→ ck2. This is because the sequence ⟨m7; m5; m6⟩ is a zigzag cycle in which the reception of m7 and the sending of m5 (both by pi) belong to the same interval Ii3.
Zigzag Pattern Let us consider a zigzag path ⟨m1; . . . ; mq⟩. Two consecutive messages mℓ and mℓ+1 define a zigzag pattern (also denoted zz-pattern) if the reception of mℓ and the sending of mℓ+1 occur in the same interval, and the sending of mℓ+1 occurs before the reception of mℓ.
As we have just seen, the sequence of messages ⟨m7; m5⟩ is a zigzag pattern. Other zigzag patterns are the sequences of messages ⟨m3; m2⟩ and ⟨m5; m4⟩.
It is easy to notice that zigzag patterns are what make zigzag paths different from causal paths. Every causal path is a zigzag path, but a zigzag path that includes a zigzag pattern is not necessarily a causal path. Hence, c1 and c2 being any two distinct local checkpoints, we always have
(c1 −σ→ c2) ⇒ (c1 −zz→ c2),
while it is possible that (c1 −zz→ c2) ∧ ¬(c1 −σ→ c2). As an example, in Fig. 8.1, we have (ck0 −zz→ ci2) ∧ ¬(ck0 −σ→ ci2). In that sense, the z-dependency relation −zz→ is weaker (i.e., includes more pairs) than the causal precedence relation −σ→.
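To make these definitions concrete, here is a small Python sketch that decides whether cix −zz→ cjy. The encoding is our own assumption: a message is a tuple (src, s_int, dst, r_int), meaning that it is sent by p_src in interval number s_int of p_src and received by p_dst in interval number r_int of p_dst; the search follows the definition of zigzag paths (including transitivity) by exploring message chains breadth-first.

from collections import deque

def z_depends(msgs, ci, cj):
    """Return True iff c_i^x -zz-> c_j^y, for ci = (i, x), cj = (j, y)."""
    (i, x), (j, y) = ci, cj
    if i == j:
        return x < y
    # start with every message sent by p_i after c_i^x
    frontier = deque(m for m in msgs if m[0] == i and m[1] >= x + 1)
    seen = set(frontier)
    while frontier:
        src, s_int, dst, r_int = frontier.popleft()
        if dst == j and r_int <= y:          # the path ends before c_j^y
            return True
        for m in msgs:
            # the next message is sent by p_dst in the interval in which
            # the previous one is received, or in a later interval
            if m not in seen and m[0] == dst and m[1] >= r_int:
                seen.add(m)
                frontier.append(m)
    return False

# a two-process zz-pattern: m2 is sent in the interval in which m1 is received
msgs = [(0, 1, 1, 2),    # m1: p0 -> p1, sent in interval 1, received in interval 2
        (1, 2, 0, 2)]    # m2: p1 -> p0, sent in interval 2, received in interval 2
print(z_depends(msgs, (1, 1), (0, 2)))   # True: a hidden (non-causal) dependency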
Fig. 8.2 A zigzag pattern

8.1.3 The Main Theorem
A fundamental question associated with local and global checkpoints is the following one: Given a checkpoint and communication pattern, and a set LC of local checkpoints (with at most one local checkpoint per process), is it possible to extend LC (with local checkpoints of the missing processes, if any) in order to obtain a consistent global checkpoint?

What Is the Difficulty To illustrate this question, let us again consider Fig. 8.1.
• Let us first consider LC1 = {cj1}. Is it possible to extend this set with a local checkpoint from pi, and another one from pk, such that, once pieced together, these three local checkpoints define a consistent global checkpoint? The figure shows that the answer is "yes": the global checkpoint [ci1, cj1, ck0] answers the question (as does the global checkpoint [ci1, cj1, ck1]).
• Let us now consider the question with LC2 = {ci2, ck0}. It is easy to see that neither cjt with t ≤ 1, nor cjt with t ≥ 2, can be added to LC2 to obtain a consistent global checkpoint. Hence, the answer is "no" for LC2.
• Let us finally consider the case LC3 = {ck2}. The figure shows that neither ci2 nor ci3 can be consistent with ck2 (this is due to the causal precedence relating these local checkpoints to ck2). Hence, the answer is "no" for LC3.
If there is a causal path relating two local checkpoints, we know (from Chap. 6) that they cannot belong to the same consistent global checkpoint. Hence, to better appreciate the difficulty of the problem (and the following theorem), let us consider Fig. 8.2. Let LC = {ciα, ckγ+1}. It is easy to see that adding either cjβ or cjβ+1 to LC does not work. As depicted, each cut line (dotted line in the figure) defines an inconsistent global checkpoint. Consequently, there is no way to extend LC with a local checkpoint of pj in order to obtain a consistent global checkpoint. This observation shows that the absence of causal dependencies among local checkpoints is a necessary condition for obtaining a consistent global checkpoint, but not a sufficient one.
This example shows that, in addition to causal precedence, there can be hidden dependencies among local checkpoints that prevent them from belonging to the same consistent global checkpoint. These hidden dependencies are the ones created by zigzag patterns. These patterns, together with causal precedence, are formally captured by the relation −zz→. The following theorem, which characterizes the sets of local checkpoints that can be extended to form a consistent global checkpoint, is due to R.H.B. Netzer and J. Xu (1995).
Theorem 9 Let LC be a set of local checkpoints. LC can be extended to a consistent global checkpoint if and only if ∀c1, c2 ∈ LC, we have ¬(c1 −zz→ c2).
Proof Proof of the "if" part. Let us assume that, ∀c1, c2 ∈ LC, we have ¬(c1 −zz→ c2). The proof consists in (a) first constructing a global checkpoint Σ that includes the local checkpoints of LC, plus one local checkpoint per process not in LC, and (b) then proving that Σ is consistent.
The global checkpoint Σ is built as follows. Let pj be a process that has no local checkpoint in LC.
• Case 1: pj has a local checkpoint c(j) with a zigzag path to a local checkpoint in LC (i.e., ∃c ∈ LC : c(j) −zz→ c).
The local checkpoint c′(j) is added to Σ, where c′(j) is the first checkpoint of pj that has no zigzag path to a local checkpoint in LC. Such a local checkpoint c′(j) exists because it is assumed that the last local state of a process is its last local checkpoint (and such a local checkpoint cannot be the starting point of a zigzag path to any other local checkpoint).
• Case 2: pj has no local checkpoint c(j) with a zigzag path to a local checkpoint in LC (i.e., ∄c ∈ LC : c(j) −zz→ c).
The local checkpoint c′(j) is added to Σ, where c′(j) is the first checkpoint of pj. Such a local checkpoint c′(j) exists because it is assumed that the first local state of a process is its first local checkpoint.
Let us recall that the definition of the consistency of a global checkpoint (global state) Σ is based on the relation −σ→. More precisely, to prove that Σ is consistent, we have to show that, for any c1, c2 ∈ Σ, we cannot have c1 −σ→ c2. Let LC′ be the set of local checkpoints which are in Σ and not in LC. The proof is by contradiction: it assumes that there are c1, c2 ∈ Σ such that c1 −σ→ c2, and shows that this assumption is impossible. According to whether c1 and c2 belong to LC or LC′, there are four possible cases to analyze:
1. c1, c2 ∈ LC and c1 −σ→ c2.
This case is clearly impossible: as −σ→ ⊆ −zz→, c1 −σ→ c2 contradicts ¬(c1 −zz→ c2).
2. c1 ∈ LC′, c2 ∈ LC, and c1 −σ→ c2.
This implies that c1 −zz→ c2, which is impossible because it contradicts the way the local checkpoint c1 ∈ LC′ is defined (according to the definition of the local checkpoints of LC′, none of them can be the starting local checkpoint of a zigzag path to a local checkpoint in LC).

Fig. 8.3 Proof of Theorem 9: a zigzag path joining two local checkpoints of LC
3. c1 ∈ LC, c2 ∈ LC′, and c1 −σ→ c2.
In this case, c2 cannot be an initial local checkpoint. This is because no local checkpoint can causally precede (relation −σ→) an initial local checkpoint. Hence, as c2 ∈ LC′, c2 is a local checkpoint defined by Case 1, i.e., c2 is the first local checkpoint (of some process pj) that has no zigzag path to a local checkpoint in LC.
Let c2′ be the local checkpoint of pj immediately preceding c2. This local checkpoint c2′ must have a zigzag path to a local checkpoint c3 ∈ LC (otherwise, c2 would not be the first local checkpoint of pj that has no zigzag path to a local checkpoint in LC).
This zigzag path from c2′ to c3, plus the messages giving rise to c1 −σ→ c2, establish a zigzag path from c1 to c3 (Fig. 8.3). But this contradicts the fact that we have (assumption) ¬(c1 −zz→ c3) for any pair c1, c3 ∈ LC. This contradiction concludes the proof of the case.
4. c1, c2 ∈ LC′ and c1 −σ→ c2.
It follows from the argument of the previous item that there is a zigzag path from c1 to a local checkpoint in LC (see the figure, where c1 ∈ LC is replaced by c1 ∈ LC′). But this contradicts the definition of c1 (which is the first local checkpoint, of some process pj, with no zigzag path to any local checkpoint of LC).
Proof of the "only if" part. We have to show that, if there are two (not necessarily distinct) local checkpoints c1, c2 ∈ LC such that c1 −zz→ c2, there is no consistent global checkpoint Σ including c1 and c2. Hence, let us assume that c1 −zz→ c2. If c1 and c2 are from the same process, the proof follows directly from the definition of global checkpoint consistency. Hence, let us assume that c1 and c2 are from different processes. There is consequently a zigzag path ⟨m1; . . . ; mq⟩ starting after c1 and finishing before c2. The proof is by induction on the number of messages in this path.
• Base case: q = 1. If c1 −zz→ c2 and the zigzag path contains a single message, we necessarily have c1 −σ→ c2 (a zigzag path made up of a single message is necessarily a causal path), and the proof follows.
Fig. 8.4 Proof of Theorem 9: a zigzag path joining two local checkpoints
• Induction case: q > 1. Let us assume that, if a zigzag path made up of at most q messages joins two local checkpoints, these local checkpoints cannot belong to the same consistent global checkpoint. We have to prove that, if two local checkpoints c1 and c2 are connected by a zigzag path of q + 1 messages ⟨m1; . . . ; mq; mq+1⟩, they cannot belong to the same consistent global checkpoint.
The proof is by contradiction. Let us assume that there is a consistent global checkpoint Σ including c1 and c2 such that these two local checkpoints are connected by a zigzag path of q + 1 messages (if c1 = c2, the zigzag path is a zigzag cycle).
Let c3 be the local checkpoint that immediately follows the reception of the message mq, and pj be the corresponding receiving process. This means that c1 is connected to c3 by the zigzag path ⟨m1; . . . ; mq⟩ (Fig. 8.4). It follows from the induction assumption that c1 and c3 cannot belong to the same consistent global checkpoint. More generally, given any local checkpoint of pj that appears after c3, c1 and this local checkpoint cannot belong to the same consistent global checkpoint either.
It follows from this observation that, for both c1 and c2 to belong to the same consistent global checkpoint Σ, this global checkpoint must include a local checkpoint of pj that precedes c3. But, due to the definition of "zigzag path", the message mq+1 has necessarily been sent after the local checkpoint of pj that immediately precedes c3. This local checkpoint is denoted c3′ in Fig. 8.4. It follows that any local checkpoint of pj that causally precedes c3 also causally precedes c2. Thus, c2 cannot be combined with any local checkpoint of pj preceding c3 to form a consistent global checkpoint. The fact that no local checkpoint of pj can be combined with both c1 and c2 to form a consistent global checkpoint concludes the proof of the theorem. □
It follows that a global checkpoint is consistent if and only if there is no z-dependence among its local checkpoints.

Useless Checkpoint A local checkpoint is useless if it cannot belong to any consistent global checkpoint. It follows from the previous theorem that a local checkpoint c is useless if and only if c −zz→ c (i.e., it belongs to a cycle of the z-precedence relation). Stated the other way around, a local checkpoint belongs to a consistent global checkpoint if and only if it does not belong to a z-cycle.
Fig. 8.5 Domino effect (in a system of two processes)
As an example, let us consider again Fig. 8.1. The local checkpoint ck2 is useless because it belongs to the zigzag cycle ⟨m7; m5; m6⟩, which includes the zigzag pattern ⟨m7; m5⟩.
8.2 Consistent Checkpointing Abstractions

Let us recall that C denotes the set of all the local checkpoints defined during a distributed computation Ŝ = (S, −σ→). The pair (C, −zz→) constitutes a checkpointing abstraction of this distributed computation. Hence the fundamental question of the asynchronous checkpointing problem: Is (C, −zz→) a consistent checkpointing abstraction of the distributed computation Ŝ?
Two different consistency conditions can be envisaged to answer this question.
8.2.1 Z-Cycle-Freedom

Definition An abstraction (C, −zz→) is z-cycle-free if none of its local checkpoints belongs to a z-cycle: ∀c ∈ C : ¬(c −zz→ c).
This means that z-cycle-freedom guarantees that no local checkpoint is useless or, equivalently, that each local checkpoint belongs to at least one consistent global checkpoint.
Domino Effect The domino effect is a phenomenon that may occur when looking for a consistent global checkpoint. As a simple example, let us consider processes that define local checkpoints in order to be able to recover after a local failure. In order for the computation to be correct, the restart of a process from one of its previous checkpoints may entail the restart of other processes (possibly all of them) from one of their previous checkpoints. This is depicted in Fig. 8.5. After its local checkpoint c21, process p2 experiences a failure and has to restart from one of its previous local states c2′, which must be such that there is a local state c1′ of process p1 for which the global checkpoint [c1′, c2′] is consistent. Such a global checkpoint cannot be [c13, c21] because it is not consistent. Neither can it be [c12, c21], for the same reason. The reader can check that, in this example, the only consistent global checkpoint is the initial
one, namely [c10, c20]. This backtracking to find a consistent global checkpoint is called the domino effect.
It is easy to see that, if the local checkpoints satisfy the z-cycle-freedom property, no domino effect can occur. In the example, there would be a local checkpoint c1′ on p1 such that Σ = [c1′, c21] would be consistent, and the computation could be restarted from this consistent global checkpoint. Moreover, this consistent global checkpoint Σ has the property of being "as fresh as possible", in the sense that any other consistent global checkpoint Σ′ from which the computation could be restarted is such that Σ′ −Σ→ Σ (where −Σ→ is the reachability relation on global states defined in Chap. 6).
8.2.2 Rollback-Dependency Trackability

Definition Rollback-dependency trackability (RDT) is a consistency condition stronger than the absence of z-cycles. It was introduced by Y.-M. Wang (1997). An abstraction (C, −zz→) can be z-cycle-free while having pairs of local checkpoints which are related only by zigzag paths which are not causal paths (hence, each of these zigzag paths includes a zigzag pattern). This means that, c1 and c2 being two such local checkpoints, we have c1 −zz→ c2 and ¬(c1 −σ→ c2).
RDT states that any hidden dependency (created by a zigzag pattern) is "doubled" by a causal path. Formally, an abstraction (C, −zz→) satisfies the RDT consistency condition if
∀c1, c2 ∈ C : (c1 −zz→ c2) ⇒ (c1 −σ→ c2).
It follows that a checkpoint-based abstraction (C, −zz→) of a distributed execution satisfies the RDT consistency condition if it is such that, when considering only the local states which are local checkpoints, we have −zz→ ≡ −σ→. This does not mean that there are no zigzag patterns. It means that, if there is a zigzag pattern on a zigzag path connecting two local checkpoints, there is also a causal path connecting these local checkpoints.
An example of such a "doubling" is given in Fig. 8.1. We have ci2 −zz→ ck2, and the zigzag pattern ⟨m5; m4⟩ is "doubled" by the causal path ⟨m5; m6⟩. Differently, while we have ck0 −zz→ ci2, the zigzag pattern ⟨m3; m2⟩ is not "doubled" by a causal path.
Why RDT Is Important When a checkpoint-based abstraction of a distributed execution satisfies the RDT consistency condition, we know that the hidden dependencies (if any) among local checkpoints are doubled by causal paths. It follows that, in such a context, the statement of Theorem 9 simplifies and becomes: Any set LC of local checkpoints which are not pairwise causally related can be extended to form a consistent global checkpoint.
As causal dependencies can be easily tracked by vector clocks, the RDT consistency condition benefits and simplifies the design of many checkpoint-based applications, such as the detection of global properties or the definition of global checkpoints for failure recovery. A noteworthy property of RDT is the following: it allows us to associate with each local checkpoint c (on the fly and without additional cooperation among processes) the first consistent global checkpoint including c. (The notion of "first" considered here is with respect to the sublattice of consistent global states obtained by eliminating the global states which are not global checkpoints; see Chap. 6.) This property is particularly interesting when one has to track software errors or recover after the detection of a deadlock.
8.2.3 On Distributed Checkpointing Algorithms

Spontaneous vs. Forced Local Checkpoints It is assumed that, for application-dependent reasons, each process pi defines some of its local states as local checkpoints. Such checkpoints are called spontaneous checkpoints. However, there is no guarantee that the resulting abstraction (C, −zz→) satisfies any of the previous consistency conditions. To this end, checkpointing algorithms have to be superimposed on the distributed execution so that additional local checkpoints are automatically defined. These are called forced checkpoints.
When a local state of a process pi is defined as a local checkpoint, it is said that "pi takes a local checkpoint". According to the application, the corresponding local checkpoint can be kept locally in volatile memory, saved in a local stable storage, or sent to a predetermined process.
Classes of Checkpointing Algorithms According to the underlying coordination mechanism they use, two classes of distributed checkpointing algorithms can be distinguished.
• In addition to the application messages (which they can overload with control information), the algorithms of the first class use specific control messages (which do not belong to the application). Hence, this class of algorithms is based on explicit synchronization. Such an algorithm (suited to FIFO channels) was presented in Sect. 6.6. In the literature, these algorithms are called coordinated checkpointing algorithms.
• The second family of checkpointing algorithms is based on implicit synchronization. The processes can only add control information to application messages (they are not allowed to add control messages). This control information is used by the receiver process to decide if the message reception has to entail a forced local checkpoint. In the literature, these algorithms are called communication-induced checkpointing algorithms.
8.3 Checkpointing Algorithms Ensuring Z-Cycle Prevention

This section presents checkpointing algorithms that (a) ensure the z-cycle-freedom property, and (b) associate with each local checkpoint a consistent global checkpoint to which it belongs. It is important to notice that (b) consists in associating a global identity with each local checkpoint, namely, the identity of the corresponding consistent global checkpoint.
8.3.1 An Operational Characterization of Z-Cycle-Freedom

Let us associate with each local checkpoint c an integer denoted c.date (a logical date from an operational point of view). Let (C, −zz→) be a checkpointing abstraction.

Theorem 10 (∀c1, c2 ∈ C : (c1 −zz→ c2) ⇒ (c1.date < c2.date)) ⇔ ((C, −zz→) is z-cycle-free).
Proof Direction ⇒. Let us assume by contradiction that there is a z-cycle c −zz→ c. It follows from the assumption that c.date < c.date, which is clearly impossible.
Direction ⇐. Let us consider that (C, −zz→) is acyclic. There is consequently a topological sort of its vertices. It follows that each vertex c (local checkpoint) can be labeled with an integer c.date such that (c1 −zz→ c2) ⇒ (c1.date < c2.date). □
This theorem shows that all the algorithms ensuring the z-cycle-freedom prop-
erty implement (in an explicit or implicit way) a consistent logical dating of local
checkpoints (the time notion being linear time). It follows that, when considering
the algorithms that implement explicitly such a consistent dating system, the local
checkpoints that have the same date belong to the same consistent global check-
point.
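The second direction of the proof is effective: from an acyclic z-dependency graph, consistent dates can be computed by a topological traversal. A minimal Python sketch (under our own encoding, where zz_pred maps each local checkpoint to the set of its zz-predecessors, and assuming the relation is acyclic) is:

from graphlib import TopologicalSorter

def consistent_dates(zz_pred):
    """Label each checkpoint c with a date such that c1 -zz-> c2
    implies date[c1] < date[c2] (raises CycleError on a z-cycle)."""
    date = {}
    for c in TopologicalSorter(zz_pred).static_order():
        # one more than the largest date of any zz-predecessor (0 if none)
        date[c] = 1 + max((date[p] for p in zz_pred.get(c, ())), default=0)
    return date

Up to the choice of the base date, this labeling is the "lowest possible consistent date" considered in the next section.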
8.3.2 A Property of a Particular Dating System

Given a checkpointing abstraction (C, −zz→) that satisfies the z-cycle-freedom property, let us consider the dating system that associates with each local checkpoint c the lowest possible consistent date, i.e., c.date = max{c′.date | c′ −zz→ c} + 1.
This dating system has an interesting property: it allows us to systematically associate with each local checkpoint c a consistent global checkpoint to which c belongs. It follows that any communication-induced checkpointing algorithm that implements this dating system has this property. The following theorem, which captures this dating system, is due to J.-M. Hélary, A. Mostéfaoui, R.H.B. Netzer, and M. Raynal (2000).
Fig. 8.6 Proof by contradiction of Theorem 11
Theorem 11 Let (C, −zz→) be a z-cycle-free checkpointing and communication pattern abstraction, in which the first local checkpoint of each process is dated 0, and the date of each other local checkpoint c is such that c.date = max{c′.date | c′ −zz→ c} + 1. Let us associate with each local checkpoint c the global checkpoint Σ(c) = [c1x1, . . . , cnxn], where cixi is the last local checkpoint of pi such that cixi.date ≤ c.date. Then, Σ(c) includes c and is consistent.
Proof By construction of Σ(c), c belongs to Σ(c). The proof that Σ(c) is consistent is by contradiction; α being the date of c (i.e., c.date = α), let us assume that Σ(c) is not consistent. Let next(x) be the local checkpoint (if any) that appears immediately after x on the same process.
R1 Due to the theorem assumption, we have y.date ≤ α for any y ∈ Σ(c).
R2 If y ∈ Σ(c) and next(y) exists, due to the definition of y.date, we have next(y).date > α.
R3 As Σ(c) is not consistent (assumption), there are local checkpoints a, b ∈ Σ(c) such that a −zz→ b (Fig. 8.6).
R4 It follows from R3, the way dates are defined, and R1 that a.date < b.date ≤ α.
R5 As the last local state of any process is a local checkpoint, and this local checkpoint does not z-precede any other local checkpoint, it follows from R3 that next(a) exists.
R6 It follows from R5 and R2 applied to next(a) that next(a).date > α. Moreover, due to the property of the underlying dating system (stated in the theorem), we also have next(a).date = max{c′.date | c′ −zz→ next(a)} + 1 > α. Hence, max{c′.date | c′ −zz→ next(a)} ≥ α.
R7 From a −zz→ next(a) and R4 (namely, a.date < α), we conclude that the maximal date used in R6 is associated with a local checkpoint g produced on a process different from the one on which a has been produced, and we have g.date = max{c′.date | c′ −zz→ next(a)} ≥ α.
R8 It follows from R7 that g −zz→ next(a). As a −zz→ b, we have g −zz→ b and, due to the dating system, g.date < b.date.
R9 It follows from R7 and R8 that b.date > α. But, as b ∈ Σ(c), this contradicts R1, which states that b.date ≤ α. This concludes the proof of the theorem. □
internal operation take_local_checkpoint() is
(1) c ← copy of current local state; c.date ← clocki;
(2) save c and its date c.date.

when pi decides to take a spontaneous checkpoint do
(3) clocki ← clocki + 1; take_local_checkpoint().

when sending MSG(m) to pj do
(4) send MSG(m, clocki) to pj.

when receiving MSG(m, sd) from pj do
(5) if (clocki < sd) then
(6)   clocki ← sd; take_local_checkpoint() % forced local checkpoint
(7) end if;
(8) Deliver the message m to the application process.

Fig. 8.7 A very simple z-cycle-free checkpointing algorithm (code for pi)
8.3.3 Two Simple Algorithms Ensuring Z-Cycle Prevention
Both the algorithms that follow are based on Theorems 10 and 11: They both ensure
z-cycle-freedom and associate dates with local checkpoints as described in Theo-
rem 11. Hence, given any local checkpoint c, the processes are able to determine a
consistent global checkpoint to which c belongs.
To that end, each process pi has a scalar local variable clocki that it manages and
uses to associate a date with its local checkpoints. Moreover, each process defines its
initial local state as its first local checkpoint (e.g., with date 1), and all local clocks
are initialized to that value.
A Simple Algorithm A fairly simple checkpointing algorithm, which ensures the z-cycle-freedom property, is described in Fig. 8.7. This algorithm is due to D. Manivannan and M. Singhal (1996). Before taking a spontaneous local checkpoint, a process pi increases its local clock (line 3), and this new value is associated with the checkpoint (line 1). Each message carries the current value of the clock of its sender (line 4). Finally, when a process pi receives a pair (m, sd) (where sd is the date of the last local checkpoint taken by the sender before it sent this message), it compares sd with clocki (which is the date of its last local checkpoint). If sd > clocki, pi resets its clock to sd and takes a forced local checkpoint (lines 5–6) whose date is sd; it passes the message to the application process only after the local checkpoint has been taken.
The behavior associated with a message reception is motivated by the following observation. The message m received by a process pi might create a zz-pattern (from cj to ck, as depicted in Fig. 8.8). However, if, since its last checkpoint, pi has sent messages with dates clocki ≥ sd, the zz-pattern created by m does not prevent local checkpoint dates from increasing in agreement with the z-precedence relation. When this occurs, pi has nothing specific to do in order to guarantee the property ∀c1, c2 : (c1 −zz→ c2) ⇒ (c1.date < c2.date) (used in Theorem 10 to obtain z-cycle-freedom). On the contrary, if clocki < sd, pi has to prevent the possible formation of a zz-pattern with nonincreasing dates. A simple way to solve this issue (and to be in agreement with the assumption of Theorem 10) consists in directing pi to update clocki and take a forced local checkpoint.

Fig. 8.8 To take or not to take a forced local checkpoint
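For illustration, here is a direct Python transcription of the rules of Fig. 8.7 for one process (a sketch: the class and method names are ours, and message transport is abstracted away).

class ZCycleFreeProcess:
    """The algorithm of Fig. 8.7 for process p_i. The initial local
    state is a checkpoint with date 1, and clock_i starts at that value."""
    def __init__(self, state=None):
        self.clock = 1
        self.state = state
        self.checkpoints = []
        self.take_local_checkpoint()              # initial checkpoint

    def take_local_checkpoint(self):
        self.checkpoints.append((self.state, self.clock))   # lines 1-2

    def spontaneous_checkpoint(self):
        self.clock += 1                           # line 3
        self.take_local_checkpoint()

    def on_send(self, m):
        return (m, self.clock)                    # line 4: piggyback clock_i

    def on_receive(self, m, sd):
        if self.clock < sd:                       # lines 5-6
            self.clock = sd
            self.take_local_checkpoint()          # forced local checkpoint
        return m                                  # line 8: deliver m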
A Simple Improvement Let us observe that, whatever the values of clocki and sd, no zz-pattern can be formed if pi has not sent messages since its last local checkpoint. Let us introduce a Boolean variable senti to capture this "no-send" pattern. The previous algorithm is consequently modified as follows:
• senti is set to false at line 1 and set to true at line 4.
• Moreover, line 6 becomes
clocki ← sd; if (senti) then take_local_checkpoint() end if.
Figure 8.9 represents an execution of the checkpointing algorithm of Fig. 8.7 enriched with the local Boolean variables senti. The added forced local checkpoints are depicted with white rectangles. The base execution is the one described in Fig. 8.1, with one more spontaneous local checkpoint (the one on pj whose date is 5) and one more message, namely m8.
Let c(x, y) be the checkpoint of px whose date is y. The forced local checkpoint c(j, 3) is taken to prevent the zz-pattern ⟨m5; m4⟩ from forming. If, when pj receives m5, clockj had been greater than or equal to 3 (the date of the last local checkpoint of the sender of m5), this forced local checkpoint would not have been taken.
Differently, pj does not have to prevent the zz-pattern ⟨m3; m2⟩ from forming, because it has previously taken a local checkpoint dated 2, and the message m3 it receives has a smaller date. This means that c(j, 2) can be combined with c(k, 1) to form a consistent global checkpoint (let us notice that, as shown by the figure, it can also be combined with c(k, 2), and that is the combination imposed by Theorem 11).

Fig. 8.9 An example of z-cycle prevention
The forced local checkpoint c(i, 4) is taken to prevent the formation of a z-cycle including c(k, 4), while c(k, 5) is taken to prevent the possible formation of a z-cycle (one that, in this execution, does not actually form). Additionally, c(k, 5) is needed to associate a consistent global checkpoint with c(i, 5), as defined by Theorem 11. If c(k, 5) were not taken but the clock of pk updated when m8 is received, c(i, 5) would be associated with c(j, 5) and c(k, 4), which (due to message m7) defines an inconsistent global checkpoint. If c(k, 5) were not taken and the clock of pk not updated when m8 is received, c(k, 6) would be dated 5 and consequently denoted c′(k, 5); c(i, 5) would then be associated with c(j, 5) and c′(k, 5), which (due to message m8) defines an inconsistent global checkpoint.
8.3.4 On the Notion of an Optimal Algorithm for Z-Cycle Prevention
An important issue concerns the design of an optimal communication-induced checkpointing algorithm which ensures z-cycle prevention. "Optimal" means here that the algorithm has to take as few forced local checkpoints as possible.
As has been seen in Chaps. 6 and 7, the knowledge which is accessible to a process (and from which it can then benefit) is restricted to its causal past. This is because a process learns new information only when it receives messages. It is possible to design a communication-induced z-cycle-preventing algorithm which is optimal with respect to the communication and message pattern included in its causal past (see the bibliographic notes at the end of the chapter). But such an algorithm is not necessarily optimal with respect to the whole computation. This is due to the fact that, based on the information extracted from its causal past, a process pi may not be forced to take a local checkpoint when it receives some message m. But, due to the (currently unknown) pattern of messages exchanged in the future, taking a local checkpoint when m is received could save the taking of local checkpoints in the future of pi or other processes. In that sense, there is no optimal algorithm for z-cycle prevention.
8.4 Checkpointing Algorithms Ensuring Rollback-Dependency Trackability

8.4.1 Rollback-Dependency Trackability (RDT)

Definition Reminder As seen in Sect. 8.2.2, rollback-dependency trackability (RDT) is a consistency condition for communication and checkpoint patterns, which is stronger than z-cycle-freedom: c1 and c2 being any pair of local checkpoints, RDT states that (c1 −zz→ c2) ⇒ (c1 −σ→ c2). This means that no z-dependence relation among local checkpoints remains hidden from a causal precedence point of view. As we have seen, given any local checkpoint c, a noteworthy property of RDT is the possibility to associate with c a global checkpoint to which it belongs, and this can be done on the fly and without additional communication among processes.

just after pi has taken a spontaneous/forced local checkpoint do
(1) tdvi[i] ← tdvi[i] + 1.

when sending MSG(m) to pj do
(2) send MSG(m, tdvi[1..n]) to pj.

when receiving MSG(m, tdv) from pj do
(3) for each k ∈ {1, . . . , n} do tdvi[k] ← max(tdvi[k], tdv[k]) end for.

Fig. 8.10 A vector clock system for rollback-dependency trackability (code for pi)

Fig. 8.11 Intervals and vector clocks for rollback-dependency trackability
Vector Clock for RDT As causality tracking is involved in RDT, at the operational level, a vector clock system suited to checkpointing can be defined as follows. Each process pi has a vector clock, denoted tdvi[1..n] (transitive dependence vector), managed as described in Fig. 8.10. The entry tdvi[i] is initialized to 1, while all the other entries are initialized to 0. Process pi increases tdvi[i] after it has taken a new local checkpoint. Hence, tdvi[i] is the sequence number of the current interval of pi (see Fig. 8.1), which means that tdvi[i] is the sequence number of its next local checkpoint.
A vector date is associated with each local checkpoint. Its value is the current value of the vector clock tdvi[1..n] of the process pi that takes the corresponding local checkpoint. Let us consider Fig. 8.11, in which Ijx+1 is the interval separating cjx and cjx+1, which means that tdvj[j] = x + 1 just after cjx. It follows that the messages forming the causal path from Ijx+1 to pi carry a value tdv[j] > x, and we consequently have ciy.tdv[j] > x.
Fig. 8.12 Russell's pattern for ensuring the RDT consistency condition
The RDT property can consequently be re-stated in an operational way as follows, where c1 and c2 are any two distinct local checkpoints and c1 belongs to pj:
(c1 −zz→ c2) ⇔ (c1.tdv[j] < c2.tdv[j]).
If the communication and checkpoint pattern satisfies the RDT consistency condition, the previous vector clock system allows a process pi to associate on the fly with each of its local checkpoints c the first consistent global checkpoint including c. This global checkpoint Σ = [c1, . . . , cn] is defined as follows: (a) ci is c (the (c.tdv[i])th local checkpoint of pi), and (b) for any j ≠ i, cj is the (c.tdv[j])th local checkpoint of pj.
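A minimal Python sketch of this clock system and of the on-the-fly extraction (our names; indices are 0-based, so the vector date directly gives, for each process, the sequence number of the checkpoint to retain):

class RDTClock:
    """The tdv clock of Fig. 8.10 for process p_i."""
    def __init__(self, i, n):
        self.i = i
        self.tdv = [1 if k == i else 0 for k in range(n)]

    def after_checkpoint(self):       # rule (1): just after a checkpoint
        self.tdv[self.i] += 1

    def on_send(self):                # rule (2): piggyback a copy of tdv_i
        return list(self.tdv)

    def on_receive(self, tdv):        # rule (3): componentwise maximum
        self.tdv = [max(a, b) for a, b in zip(self.tdv, tdv)]

def first_consistent_gc(c_tdv):
    """Under RDT, a checkpoint c with vector date c_tdv belongs to the
    consistent global checkpoint made of the (c_tdv[j])th local
    checkpoint of every process p_j."""
    return [(j, c_tdv[j]) for j in range(len(c_tdv))]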
8.4.2 A Simple Brute Force RDT Checkpointing Algorithm
To ensure the RDT property, a very simple algorithm consists in preventing any
zz-pattern from forming. To that end, the algorithm forces the communication and
checkpoint pattern to be such that, at any process, there is no message sending fol-
lowed by a message reception without a local checkpoint separating them. The only
pattern allowed is the one described in Fig. 8.12. This pattern (called Russell’s pat-
tern) states that the only message pattern that can appear at a process between two
consecutive local checkpoints is a (possibly empty) sequence of message receptions,
followed by a (possibly empty) sequence of message sendings.
The corresponding algorithm (due to Russell, 1980) is described in Fig. 8.13. The interest of this algorithm lies in its simplicity and in the fact that it needs only one Boolean per process.
internal operation take_local_checkpoint() is
(1) c ← copy of current local state; save c;
(2) senti ← false.

when pi decides to take a spontaneous checkpoint do
(3) take_local_checkpoint().

when sending MSG(m) to pj do
(4) send MSG(m) to pj; senti ← true.

when receiving MSG(m) from pj do
(5) if (senti) then take_local_checkpoint() end if; % forced local checkpoint
(6) Deliver the message m to the application process.

Fig. 8.13 Russell's checkpointing algorithm (code for pi)
It is possible to use the vector clock algorithm of Fig. 8.10 to associate a vector
date with each checkpoint c. In this way, we obtain on the fly the first consistent
global checkpoint to which c belongs.
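A Python transcription of Russell's algorithm (Fig. 8.13) shows how little state it needs, namely one Boolean per process (a sketch, with our names and message transport abstracted away):

class RussellProcess:
    """Russell's algorithm: a forced checkpoint is taken when a reception
    follows a sending with no local checkpoint in between."""
    def __init__(self, state=None):
        self.sent = False                     # sent_i
        self.state = state
        self.checkpoints = []

    def take_local_checkpoint(self):
        self.checkpoints.append(self.state)   # line 1
        self.sent = False                     # line 2

    def on_send(self, m):
        self.sent = True                      # line 4
        return m

    def on_receive(self, m):
        if self.sent:                         # line 5: forced checkpoint
            self.take_local_checkpoint()
        return m                              # line 6: deliver m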
8.4.3 The Fixed Dependency After Send (FDAS) RDT Checkpointing Algorithm
On Predicates for Forced Local Checkpoints According to the control data managed by the processes and carried by application messages, stronger predicates governing forced checkpoints can be designed, where the meaning of "stronger" is the following. Let P1 and P2 be two predicates used to take forced checkpoints (when true, a checkpoint has to be taken). P1 is stronger than P2 if, when evaluated in the same context, we always have P1 ⇒ P2. This means that, when P2 is true while P1 is false, no forced local checkpoint is taken if P1 is used, while P2 might force a local checkpoint to be taken. Hence, P1 is stronger in the sense that it allows for fewer forced checkpoints.
Let us recall that taking a local checkpoint can be expensive, especially when it has to be saved on a disk. Hence, finding strong predicates, i.e., predicates that allow for "as few as possible" forced checkpoints, is important.
This section presents a predicate which is stronger than the one used in the algorithm of Fig. 8.13. This predicate, which is called "fixed dependency after send" (FDAS), was introduced by Y.-M. Wang (1997).
The FDAS Predicate and the FDAS Algorithm The predicate is on the local Boolean variable senti used in Russell's checkpointing algorithm (Fig. 8.13), the vector clock tdvi[1..n] of the process pi, and the vector date tdv[1..n] carried by the message received by pi. The corresponding checkpointing algorithm, with the management of the vector clocks, is described in Fig. 8.14. The predicate controlling forced checkpoints appears at line 5. It is the following:
senti ∧ (∃k : tdv[k] > tdvi[k]).
A process takes a forced local checkpoint if, when a message m arrives, it has sent a message since its last local checkpoint and, because of m, its vector tdvi[1..n] is about to change.
As we have already seen, no zz-pattern can be created by the reception of a message m if the receiver pi has not sent messages since its last checkpoint. Consequently, if senti is false, no forced local checkpoint is needed. As far as the second part of the predicate is concerned, we have the following. If ∀k : tdv[k] ≤ tdvi[k], then, from a local checkpoint point of view, pi knows everything that was known by the sender of m (when it sent m). As this message does not provide pi with new information on dependencies among local checkpoints, it cannot create local checkpoint dependencies that would remain unknown to pi. Hence, the name FDAS comes from the fact that, at any process, after the first message sending in any interval, the transitive dependency vector remains unchanged until the next local checkpoint.

internal operation take_local_checkpoint() is
(1) c ← copy of current local state; save c and its vector date c.tdv = tdvi[1..n];
(2) senti ← false; tdvi[i] ← tdvi[i] + 1.

when pi decides to take a spontaneous checkpoint do
(3) take_local_checkpoint().

when sending MSG(m) to pj do
(4) send MSG(m, tdvi) to pj; senti ← true.

when receiving MSG(m, tdv) from pj do
(5) if (senti ∧ (∃k : tdv[k] > tdvi[k]))
(6) then take_local_checkpoint() % forced local checkpoint
(7) end if;
(8) for each k ∈ {1, . . . , n} do tdvi[k] ← max(tdvi[k], tdv[k]) end for;
(9) Deliver the message m to the application process.

Fig. 8.14 FDAS checkpointing algorithm (code for pi)
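The FDAS rules can be transcribed in the same style (a Python sketch with our names; indices are 0-based):

class FDASProcess:
    """The FDAS algorithm of Fig. 8.14 for process p_i."""
    def __init__(self, i, n, state=None):
        self.i, self.n = i, n
        self.sent = False
        self.tdv = [1 if k == i else 0 for k in range(n)]
        self.state = state
        self.checkpoints = []

    def take_local_checkpoint(self):
        self.checkpoints.append((self.state, list(self.tdv)))  # line 1
        self.sent = False                                      # line 2
        self.tdv[self.i] += 1

    def on_send(self, m):
        self.sent = True                      # line 4
        return (m, list(self.tdv))

    def on_receive(self, m, tdv):
        # line 5: a send occurred in the current interval and the
        # received vector date is about to change tdv_i
        if self.sent and any(tdv[k] > self.tdv[k] for k in range(self.n)):
            self.take_local_checkpoint()      # line 6: forced checkpoint
        for k in range(self.n):               # line 8
            self.tdv[k] = max(self.tdv[k], tdv[k])
        return m                              # line 9: deliver m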
8.4.4 Still Reducing the Number of Forced Local Checkpoints

The fact that a process takes a forced local checkpoint at some point of its execution trivially has a direct impact on the communication and checkpoint pattern defined by the application program: The communication pattern is not modified, but the checkpoint pattern is. This means that it might not be possible to determine on the fly whether taking an additional forced local checkpoint now would decrease the number of forced local checkpoints taken in the future. The best that can be done is to find predicates (such as FDAS) that, according to the control information at their disposal, take as few forced local checkpoints as possible. This section presents such a predicate (called BHMR) and the associated checkpointing algorithm. This predicate, which is stronger than FDAS, and the associated algorithm are due to R. Baldoni, J.-M. Hélary, A. Mostéfaoui, and M. Raynal (1997).
Additional Control Variables The idea that underlies the design of this predicate and the associated checkpointing algorithm is based on additional control variables that capture the interplay of the causal precedence relation and the last local checkpoints taken by each process. To that end, in addition to the vector clock tdvi[1..n], each process manages the following local variables, which are all Boolean arrays.
• sent_toi[1..n] is a Boolean array such that sent_toi[j] is true if and only if pi sent a message to pj since its last local checkpoint. (This array replaces the Boolean variable senti used in the FDAS algorithm, Fig. 8.14.) Initially, for any j, we have sent_toi[j] = false.
Fig. 8.15 Matrix causali[1..n, 1..n]
• causali[1..n, 1..n] is a two-dimensional Boolean array such that causali[k, j] is true if and only if, to pi's knowledge, there is a causal path from the last local checkpoint taken by pk (as known by pi) to the next local checkpoint that will be taken by pj (this is the local checkpoint of pj that follows its last local checkpoint known by pi). Initially, causali[1..n, 1..n] is equal to true on its diagonal, and equal to false everywhere else.
As an example, let us consider Fig. 8.15, where μ, μ′ and μ″ are causal paths. When pi sends the message m, we have causali[k, j] = true (pi learned the causal path μ, which goes from pk to pj, thanks to the causal path μ′), causali[k, i] = true (this is due to both the causal path μ″ and the concatenation ⟨μ; μ′⟩), and causali[j, i] = true (this is due to the causal path μ′).
• purei[1..n] is a Boolean array such that purei[j] is true if and only if, to pi's knowledge, no causal path starting at the last local checkpoint of pj (known by pi) and ending at pi contains a local checkpoint. An example is given in Fig. 8.16. A causal path without local checkpoints is called pure. The entry purei[i] is initialized to true and keeps that value forever; for any j ≠ i, purei[j] is initialized to false.
The BHMR Predicate to Take Forced Local Checkpoints Let MSG(m, tdv, pure, causal) be a message received by pi. Hence, if pj is the sender of this message, tdv[1..n], pure[1..n], and causal[1..n, 1..n] are the values of tdvj, purej, and causalj, respectively, when it sent the message. The predicate is made up of two parts.
Fig. 8.16 Pure (left) vs. impure (right) causal paths from pj to pi
Fig. 8.17 An impure causal path from pi to itself
• The first part of the predicate, which concerns the causal paths from any process pj (j ≠ i) to pi, is
∃(k, ℓ) : sent_toi[k] ∧ (tdv[k] > tdvi[k]) ∧ ¬causal[k, ℓ].
As we can see, the sub-predicate ∃k : sent_toi[k] ∧ (tdv[k] > tdvi[k]) is just the FDAS predicate expressed on a per-process basis. If it is true, pi has sent a message m′ to pk since its last checkpoint, and the sender pj knows more local checkpoints of pk than pi.
But, if causal[k, ℓ] is true, pj also knows that there is a causal path from the last local checkpoint it knows of pk to pℓ. Hence, there is no need for pi to take a forced local checkpoint, as the zz-pattern created by the message m just received and the message m′ previously sent by pi to pk is doubled by a causal path (see the zigzag path ⟨μ′; m; m′⟩ in Fig. 8.15, which is doubled by the causal path μ). Consequently, pi conservatively takes a local checkpoint only if ¬causal[k, ℓ].
• The second part of the predicate concerns the causal paths from pi to itself which start after its last local checkpoint. It is
(tdv[i] = tdvi[i]) ∧ ¬pure[i].
In this case (see Fig. 8.17), if the causal path whose last message is m is not pure, a local checkpoint c has been taken along this causal path, which starts and ends in the same interval Iitdv[i] of pi. In order for this checkpoint c to belong to a consistent global checkpoint, pi takes a forced local checkpoint (otherwise, a z-cycle would form).
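For reference, the two parts of the predicate, as they appear at lines 8–9 of the algorithm of Fig. 8.18 below, can be transcribed as a single Boolean function (a Python sketch with our parameter names; tdv, pure, and causal are the values piggybacked on the received message):

def bhmr_forced(i, n, sent_to, tdv_i, tdv, pure, causal):
    """True iff p_i must take a forced checkpoint on this reception."""
    part1 = any(sent_to[k] and tdv[k] > tdv_i[k] and not causal[k][l]
                for k in range(n) for l in range(n))
    part2 = (tdv[i] == tdv_i[i]) and not pure[i]
    return part1 or part2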
The BHMR Checkpointing Algorithm The algorithm based on the previous control data structures and predicates is described in Fig. 8.18. When a process pi takes a local checkpoint, it defines its vector date as the current value of its vector clock tdvi[1..n] (line 1), and accordingly updates the appropriate entries of its arrays sent_toi, purei, and causali (lines 2–3). Finally, it increases tdvi[i] to the sequence number of its new interval.
When it sends a message m, a process pi adds to it the current values of its three control data structures (line 6). When it receives a message MSG(m, tdv[1..n], pure[1..n], causal[1..n, 1..n]) from a process pj, pi first checks the BHMR predicate presented previously (lines 8–9).
internal operation take_local_checkpoint() is
(1) c ← copy of current local state; save c and its vector date c.tdv = tdvi[1..n];
(2) for each k ∈ {1, . . . , n} do sent_toi[k] ← false end for;
(3) for each k ∈ {1, . . . , n} \ {i} do purei[k] ← false; causali[i, k] ← false end for;
(4) tdvi[i] ← tdvi[i] + 1.

when pi decides to take a spontaneous checkpoint do
(5) take_local_checkpoint().

when sending MSG(m) to pj do
(6) send MSG(m, tdvi[1..n], purei[1..n], causali[1..n, 1..n]) to pj;
(7) sent_toi[j] ← true.

when receiving MSG(m, tdv[1..n], pure[1..n], causal[1..n, 1..n]) from pj do
(8) if [∃(k, ℓ) : sent_toi[k] ∧ ((tdv[k] > tdvi[k]) ∧ ¬causal[k, ℓ])]
(9)    ∨ [(tdv[i] = tdvi[i]) ∧ ¬pure[i]]
(10)   then take_local_checkpoint() % forced local checkpoint
(11) end if;
(12) for each k ∈ {1, . . . , n} do
(13)   case (tdv[k] > tdvi[k]) then
(14)     tdvi[k] ← tdv[k]; purei[k] ← pure[k];
(15)     for each ℓ ∈ {1, . . . , n} do causali[k, ℓ] ← causal[k, ℓ] end for
(16)   (tdv[k] = tdvi[k]) then
(17)     purei[k] ← purei[k] ∧ pure[k];
(18)     for each ℓ ∈ {1, . . . , n}
(19)       do causali[k, ℓ] ← causali[k, ℓ] ∨ causal[k, ℓ]
(20)     end for
(21)   (tdv[k] < tdvi[k]) then skip
(22)   end case
(23) end for;
(24) for each ℓ ∈ {1, . . . , n} do causali[ℓ, i] ← causali[ℓ, i] ∨ causal[ℓ, j] end for;
(25) Deliver the message m to the application process.

Fig. 8.18 An efficient checkpointing algorithm for RDT (code for pi)
Then, before delivering the message (line 25), pi updates its control data structures so that their current values correspond to their definition (lines 12–24).
For each process pk, pi compares tdv[k] (the value of tdvj[k] when pj sent the message) with its own value tdvi[k].
• If tdv[k] > tdvi[k], pj knows more local checkpoints of pk than pi. In this case, pi resets tdvi[k], purei[k], and causali[k, ℓ] for every ℓ, to the corresponding more up-to-date values sent by pj (lines 13–15).
• If tdv[k] = tdvi[k], pi and pj know the same last local checkpoint of pk. As they possibly know it through different causal paths, pi adds what is known by pj (pure[k] and causal[k, ℓ]) to what it already knows (lines 16–20).
• If pi knows more about pk than pj, it is more up to date and consequently does nothing (line 21).
Finally, as the message m extends the causal paths ending at pj, process pi accordingly updates each Boolean causali[ℓ, i], 1 ≤ ℓ ≤ n (line 24).
This algorithm reduces the number of forced local checkpoints at the price of more control data and more information carried by each application message, which has to carry n² + n bits and n integers (logical dates). Nevertheless, as we have seen in Sect. 8.3.4 for z-cycle prevention, there is no optimal communication-induced checkpointing algorithm that ensures the RDT property.
8.5 Message Logging for Uncoordinated Checkpointing

8.5.1 Uncoordinated Checkpointing

In uncoordinated checkpointing, each process defines its local checkpoints independently, and there is no notion of a forced checkpoint. While this approach is prone to the domino effect, it can be interesting for some applications (for example, backward recovery after a failure) where processes take few local checkpoints and do so periodically. A consistent global checkpoint has then to be computed by a recovery algorithm. Moreover, according to the aim of the computed global checkpoint, this algorithm might also have to compute channel states, which have to be consistent with the computed global checkpoint. It follows that, if channel states have to be computed, the processes have to log messages on stable storage during their execution.
Pessimistic vs. Optimistic Message Logging The message logging technique is called sender-based (resp., receiver-based) if messages are logged by their senders (resp., receivers). Two logging techniques are possible.
• In the case of pessimistic logging, a message is saved on stable storage by its sender (resp., receiver) at the time it is sent (resp., received). This can incur a high overhead in failure-free executions, as each message entails an additional input/output.
• In the case of optimistic logging, a message is first saved in a volatile log, and this log is then saved on stable storage when a process takes a local checkpoint. When this occurs, the corresponding process saves on stable storage both its local state and the messages which are in its volatile log.
Content of This Section Considering an uncoordinated checkpointing algorithm, this section presents an optimistic sender-based message logging algorithm. The main feature of this algorithm is that only a subset of the messages logged in a volatile log have to be saved on stable storage. To simplify the presentation, the channels are assumed to be FIFO.
8.5.2 To Log or Not to Log Messages on Stable Storage

Basic Principle Let cix denote the xth local checkpoint taken by a process pi. When pi sends a message m, it saves it in a local volatile log. When later pi takes
its next local checkpoint cix+1, it (a) saves on stable storage cix+1 and the content of its volatile log, and (b) re-initializes its volatile log to empty. Hence, the messages are saved on stable storage in batches, and not individually. This decreases the number of input/output operations and consequently the overhead associated with message logging.
A simple example is depicted in Fig. 8.19. After it has taken its local checkpoint cix, the volatile log of pi is empty. Then, when it sends m1, m2, and m3, it saves them in its volatile log. Finally, when it takes cix+1, pi writes cix+1 and the current content of its volatile message log on stable storage before emptying its volatile log.

Fig. 8.19 Sender-based optimistic message logging
To Log or Not to Log: That Is the Question The question is then the following one: Which of the messages saved in its volatile log does pi have to save on stable storage when it takes its next local checkpoint cix+1?
To answer this question, let us consider Fig. 8.20. When looking at the execution on the left side, the local checkpoints cjy and cix+1 are concurrent and can consequently belong to the same consistent global checkpoint. The corresponding state of the channel from pi to pj then comprises the message m (this message is in transit with respect to the ordered pair (cix+1, cjy)).
When looking at the execution on the right side, there is an additional causal path starting after cjy and ending before cix+1. It follows that, in this case, cjy and cix+1 are no longer concurrent (independent), and it is not necessary to save the message m on stable storage, as it cannot appear in a consistent global checkpoint. Hence, if pi knows this fact, it does not have to save m from its volatile storage to stable storage. Process pi can learn it if there is a causal path starting from pj after it has
received m and arriving at pi before it takes cix+1 (this is depicted in the execution at the bottom of the figure).

Fig. 8.20 To log or not to log a message?
Checkpointing Algorithm: Local Data Structures Hence, a process pi has to log on stable storage the messages it sends only if (from its point of view) they might belong to a consistent global checkpoint. To implement this idea, the checkpointing algorithm is enriched with the following control data structures.
• volatile_logi is the volatile log of pi. It is initially empty.
• sni[1..n] is an array of sequence numbers. sni[j] = α means that pi has sent α messages to pj (as the channels are FIFO, those are the first α messages sent by pi to pj).
• rec_knowni[1..n, 1..n] is an array of sequence numbers which captures the knowledge of pi on the messages that have been exchanged by each pair of processes. rec_knowni[j, k] = β means that pi knows (due to causal paths) that pj has received β messages from pk.
(If processes do not send messages to themselves, all the entries rec_knowni[j, j] remain equal to 0. From an implementation point of view, they can be used to store the array sni[1..n].)
• ckpt_vci[1..n] is a vector clock associated with local checkpoints. ckpt_vci[j] = γ means that pi knows that pj has taken γ local checkpoints. This control data is managed by the checkpointing algorithm, but used by the recovery algorithm to compute a consistent global checkpoint.

Checkpointing Algorithm: Process Behavior The behavior of a process pi is described in Fig. 8.21.
When a process pi sends a message m to a process pj, it first adds the triple ⟨m, sni[j], j⟩ to its volatile log (the pair (sni[j], j) is the identity of m, lines 1–2).
Then it sends m to pj with its vector date ckpt_vci [1..n] and its current knowledge
on the messages that have been received (line 3).
When it receives a message, pi first updates its vector clock ckpt_vci [1..n]
(line 4), and its knowledge on which messages have been received (lines 5–6). Then
it delivers the message (line 9).
When pi takes an uncoordinated local checkpoint, it first updates its entry of
its vector clock (line 10). Then it withdraws from its volatile log every message m that it has sent and that, to its current knowledge, has been received by its destination process p_dest (lines 11–13). Finally, before emptying its volatile log (line 15), pi
saves in stable storage the current state σi of the application process plus the control
data that will allow for a computation of a consistent global checkpoint and the
corresponding channel states (line 14). These control data comprise the volatile log
plus three vectors of size n.

when sending MSG (m) to pj do
(1) sni[j] ← sni[j] + 1;
(2) add ⟨m, sni[j], j⟩ to volatile_logi;
(3) send MSG (m, rec_knowni[1..n, 1..n], ckpt_vci[1..n]) to pj.

when receiving MSG (m, rec_known[1..n, 1..n], ckpt_vc[1..n]) from pj do
(4) for each k ∈ {1, . . . , n} do ckpt_vci[k] ← max(ckpt_vci[k], ckpt_vc[k]) end for;
(5) rec_knowni[i, j] ← rec_knowni[i, j] + 1;
(6) for each k, ℓ ∈ {1, . . . , n}
(7) do rec_knowni[k, ℓ] ← max(rec_knowni[k, ℓ], rec_known[k, ℓ])
(8) end for;
(9) Deliver the message m to the application process.

when pi decides to take a (spontaneous) local checkpoint do
(10) ckpt_vci[i] ← ckpt_vci[i] + 1;
(11) for each ⟨m, sn, dest⟩ ∈ volatile_logi do
(12) if (sn ≤ rec_knowni[dest, i]) then withdraw ⟨m, sn, dest⟩ from volatile_logi end if
(13) end for;
(14) save on stable storage the current local state plus
     volatile_logi, sni[1..n], ckpt_vci[1..n], and rec_knowni[i, 1..n];
(15) empty volatile_logi.

Fig. 8.21 An uncoordinated checkpointing algorithm (code for pi)
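As an illustration, the handlers of Fig. 8.21 can be transcribed in Python as follows. This is only a sketch with illustrative names: the transport layer is abstracted away (send() returns the payload a MSG() would carry, and the caller passes it to receive()).

class Process:
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.volatile_log = []                         # triples (m, sn, dest)
        self.sn = [0] * n                              # sn_i[1..n]
        self.rec_known = [[0] * n for _ in range(n)]   # rec_known_i[1..n,1..n]
        self.ckpt_vc = [0] * n                         # checkpoint vector clock
        self.stable = []                               # stable storage

    def send(self, m, j):                              # lines 1-3
        self.sn[j] += 1
        self.volatile_log.append((m, self.sn[j], j))
        return (m, [row[:] for row in self.rec_known], self.ckpt_vc[:], self.i)

    def receive(self, m, rec_known, ckpt_vc, j):       # lines 4-9
        for k in range(self.n):
            self.ckpt_vc[k] = max(self.ckpt_vc[k], ckpt_vc[k])
        self.rec_known[self.i][j] += 1
        for k in range(self.n):
            for l in range(self.n):
                self.rec_known[k][l] = max(self.rec_known[k][l], rec_known[k][l])
        return m                                       # delivered message

    def checkpoint(self, state):                       # lines 10-15
        self.ckpt_vc[self.i] += 1
        self.volatile_log = [(m, sn, d) for (m, sn, d) in self.volatile_log
                             if sn > self.rec_known[d][self.i]]
        self.stable.append((state, list(self.volatile_log), self.sn[:],
                            self.ckpt_vc[:], self.rec_known[self.i][:]))
        self.volatile_log = []

p0, p1 = Process(0, 2), Process(1, 2)
p1.receive(*p0.send("m", 1))       # m travels from p0 to p1
p0.checkpoint("sigma_0")           # m is kept in the batch saved with the checkpoint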

8.5.3 A Recovery Algorithm

The following recovery algorithm is associated with the previous uncoordinated checkpointing algorithm. When a failure occurs, the recovery algorithm executes the following sequence of steps.
1. The non-faulty processes are first required to take a local checkpoint.
2. The most recent set of n concurrent local checkpoints is then computed iteratively:
   Let Σ = [c1, . . . , ci, . . . , cn], where ci is the last local checkpoint taken by pi;
   while (∃ (i, j) such that ci →^σ cj) do
      cj ← the first predecessor of cj such that ¬(ci →^σ cj);
      Σ ← [c1, . . . , ci, . . . , cj, . . . , cn]
   end while.
   Thanks to Theorem 6, the values of ci →^σ cj and ¬(ci →^σ cj) can be efficiently determined from the vector date ci.ckpt_vc associated with each local checkpoint.
   Assuming that the initial local state of each process is a local checkpoint, the previous loop terminates and the resulting global checkpoint Σ is consistent.
3. The state of the channel connecting pi to pj (i.e., the sequence of messages
which are in transit with respect to the ordered pair (ci , cj )) is extracted from the
stable storage of pi as follows (let us note that only the sender pi is involved in
this computation).

Fig. 8.22 Retrieving the messages which are in transit with respect to the pair (ci , cj )

Let sn(i, j ) be the value of the sequence number sni [j ] which has been saved
by pi on its stable storage together with ci ; this is the number of messages sent
by pi to pj before taking its local checkpoint ci . It follows that the messages sent
by pi to pj (after ci ) have a sequence number greater than sn(i, j ). Similarly, let
rk(j, i) be the value of rec_knownj [j, i] which has been saved by pj on its stable
storage together with cj ; this is the number of messages received (from pi ) by
pj before it took its local checkpoint cj .
It follows that the messages from pi received by pj (before cj) have a sequence number smaller than or equal to rk(j, i). As the channel is FIFO, the messages whose sequence number sqnb is such that rk(j, i) < sqnb ≤ sn(i, j) define the sequence of messages which are in transit with respect to the ordered pair (ci, cj). This is depicted in Fig. 8.22, which represents the sequence numbers
attached to messages.
Due to the predicate used at line 12 of the checkpointing algorithm, these messages have not been withdrawn from the volatile log of pi and are consequently saved with ci in pi's stable storage.
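The two computations above can be sketched in Python on top of the records saved by the previous sketch (each record being (state, log, sn, ckpt_vc, rec_known_row)). The σ-precedence test below is an assumed instantiation of the vector-date test referred to by Theorem 6, and ckpts[i][0] is assumed to be the initial checkpoint of pi, with an all-zero vector date.

def sigma(ci_vc, cj_vc, i):
    # Assumed test: c_i --sigma--> c_j iff p_j knew, when taking c_j, that
    # p_i had already taken its ci_vc[i]-th checkpoint (0 = initial state).
    return ci_vc[i] >= 1 and ci_vc[i] <= cj_vc[i]

def most_recent_consistent(ckpts, n):
    # Step 2: start from the last checkpoints and walk backwards.
    idx = [len(ckpts[i]) - 1 for i in range(n)]
    vc = lambda k: ckpts[k][idx[k]][3]
    progress = True
    while progress:
        progress = False
        for i in range(n):
            for j in range(n):
                while i != j and idx[j] > 0 and sigma(vc(i), vc(j), i):
                    idx[j] -= 1               # first predecessor of c_j
                    progress = True
    return idx                                # Sigma = [c_1, ..., c_n]

def in_transit(ckpts, idx, i, j):
    # Step 3: messages from p_i to p_j with rk(j,i) < sn <= sn(i,j),
    # retrieved from the batches saved by p_i up to c_i.
    sn_ij = ckpts[i][idx[i]][2][j]            # sn(i, j), saved with c_i
    rk_ji = ckpts[j][idx[j]][4][i]            # rk(j, i), saved with c_j
    msgs = [(sn, m) for rec in ckpts[i][:idx[i] + 1]
            for (m, sn, d) in rec[1] if d == j and rk_ji < sn <= sn_ij]
    return [m for (_, m) in sorted(msgs)]     # FIFO order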

8.5.4 A Few Improvements

Adding Forced Checkpoints Let us note that, even if the basic checkpointing
algorithm is uncoordinated, a few forced checkpoints can be periodically taken to
reduce the impact of the domino effect.

Space Reclamation As soon as a consistent global checkpoint Σ = [c1, . . . , ci, . . . , cn] has been determined, the control data kept in stable storage which concern the local checkpoints that precede ci, 1 ≤ i ≤ n, become useless and can be discarded.
Moreover, a background task can periodically compute the most recent consistent
global checkpoint in order to save space in the stable storage of each process.

Cost The fact that each message sent by a process pi has to carry the current value
of the integer matrix rec_knowni [1..n, 1..n] can reduce efficiency and penalize the
application program. Actually, only the entries of the matrix corresponding to the
(directed) channels of the application program have to be considered. When the
communication graph is a directed ring, the matrix shrinks to a vector. A similar gain
is obtained when the communication graph is a tree with bidirectional channels.

Moreover, as channels are FIFO, for any two consecutive messages m1 and m2 sent by pi to pj, for each entry rec_knowni[k, ℓ], m2 has only to carry the difference between its current value and its previous value (which was communicated to pj by m1).
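As a hedged sketch (function names are illustrative), this differential encoding can be implemented as follows: the sender remembers the last matrix value it transmitted to each destination and sends only the entries that have changed, which the receiver merges back.

def matrix_diff(last_sent, current):
    # Entries of rec_known that changed since the previous MSG() to p_j.
    n = len(current)
    return {(k, l): current[k][l] for k in range(n) for l in range(n)
            if current[k][l] != last_sent[k][l]}

def apply_diff(view, diff):
    # Receiver side: merge the received entries (max keeps monotonicity).
    for (k, l), v in diff.items():
        view[k][l] = max(view[k][l], v)

last = [[0, 0], [0, 0]]                  # what m1 carried
current = [[0, 3], [1, 0]]               # value when m2 is sent
diff = matrix_diff(last, current)        # only {(0,1): 3, (1,0): 1} travels
view = [[0, 0], [0, 0]]
apply_diff(view, diff)                   # view == current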

The Case of Synchronous Systems Let us finally note that synchronous systems
can easily benefit from uncoordinated checkpointing without suffering the domino
effect. To that end, it is sufficient for the processes to take local checkpoints at the
end of each round of a predefined sequence of rounds.

8.6 Summary

This chapter was on asynchronous distributed checkpointing. Assuming that processes take local checkpoints independently of one another, it has presented two
consistency conditions which can be associated with local checkpoints, namely,
z-cycle-freedom and rollback-dependency trackability. The chapter then presented
distributed algorithms that force processes to take additional local checkpoints so
that the resulting communication and checkpoint pattern satisfies the consistency
condition we are interested in. The chapter also introduced important notions such
as the notion of a zigzag path, and the z-dependence relation among local check-
points. Finally, a message-logging algorithm suited to uncoordinated checkpointing
was presented.

8.7 Bibliographic Notes

• The checkpointing problem is addressed in many textbooks devoted to operating systems. One of the very first papers that formalized this problem is due to B. Randell [303]. The domino effect notion was also introduced in this paper.
• The fundamental theorem on the necessary and sufficient condition for a set of lo-
cal checkpoints to belong to a same consistent global checkpoint is due to R.H.B.
Netzer and J. Xu [283]. This theorem was generalized to more general communi-
cation models in [175], and to the read/write shared memory model in [35].
• The notions of zigzag path and z-cycle-freedom were introduced by R.H.B. Netzer and J. Xu [283].
• The notion of an interval and associated consistency theorems are presented
in [174].
• The rollback-dependency trackability consistency condition was introduced by
Y.-M. Wang [381]. An associated theory is presented and investigated in [33, 36,
145, 375].
• Theorems 10 and 11 on linear dating to ensure z-cycle-freedom are due to J.-M. Hélary, A. Mostéfaoui, R.H.B. Netzer, and M. Raynal [171].

• The algorithm that ensures z-cycle-freedom presented in Fig. 8.7 is due to D. Manivannan and M. Singhal [247].
A more sophisticated algorithm, which takes fewer forced checkpoints, is presented in [171]. This algorithm manages additional control data and uses a sophis-
ticated predicate (whose spirit is similar to that of Fig. 8.18) in order to decrease
the number of forced checkpoints. This algorithm is optimal with respect to the
causal past of each process.
An evaluation of z-cycle-free checkpointing algorithms is presented in [376].
• The FDAS algorithm that ensures rollback-dependency trackability described in Fig. 8.14 is due to Y.-M. Wang [381]. Russell's algorithm is described in [332].
• The algorithm that ensures rollback-dependency trackability described in Fig. 8.18 is due to R. Baldoni, J.-M. Hélary, A. Mostéfaoui, and M. Raynal [34].
• Another algorithm that ensures rollback-dependency trackability is presented
in [146]. A garbage collection algorithm suited to RDT checkpointing protocols
is presented in [337].
• Numerous checkpointing algorithms are presented in the literature, e.g., [2, 52,
62, 75, 103, 129, 144, 207, 246, 299, 344] to cite a few. The textbooks [149, 219]
contain specific chapters devoted to checkpointing.
• The sender-based message logging algorithm presented in Sect. 8.5 is due to
A. Mostéfaoui and M. Raynal [271]. The matrix of sequence numbers used in
this algorithm is actually a matrix time (as defined in Chap. 7).
Other message logging algorithms have been proposed (e.g., [16, 104,
202, 383]). Recovery is the topic of many papers (e.g., [161, 284, 352] to
cite a few). Space reclamation in uncoordinated checkpointing is addressed
in [382]. A nice survey on rollback-recovery protocols for message-passing sys-
tems is presented in [120].

8.8 Exercises and Problems


1. Let us consider the z-dependency relation →^zz introduced in Sect. 8.1.
• Show that:

(c_i^x →^zz c_k^{t+1}) ∧ (c_k^t →^zz c_j^y) ⇒ (c_i^x →^zz c_j^y).

• Let c1 and c2 be two local checkpoints (from distinct processes). Show that
they can belong to the same consistent global checkpoint if
– no z-cycle involving c1 or c2 exists, and
– no zigzag path exists connecting c1 and c2 .
Solution in [283].
2. Let c be a local checkpoint of a process pi , which is not its initial checkpoint. The
notation pred(c) is used to denote the local checkpoint that immediately precedes

c on pi. Let us define a new relation →^zz′ as follows: c1 →^zz′ c2 if

(c1 →^σ c2) ∨ (∃c ∈ C : (c1 →^σ c) ∧ (pred(c) →^zz′ c2)).

Show first that →^zz′ ≡ →^zz. Then give a proof of Theorem 9 based on the relations →^σ and →^zz′ (instead of the relations →^σ and →^zz).
Solution in [149] (Chap. 29).
3. Prove formally that the predicate BHMR used to take a forced checkpoint in
Sect. 8.4.4 is stronger than FDAS.
Solution in [34].
4. The communication-induced checkpointing algorithms presented in Sects. 8.3
and 8.4 do not record the messages which are in transit with respect to the corre-
sponding pairs of local checkpoints. Enrich one of the checkpointing algorithms
presented in these sections so that in-transit messages are recorded.
Solution in [173].
5. Modify the uncoordinated checkpointing algorithm described in Sect. 8.5 so that only causal paths made up of a single message are considered (i.e., the causal path μ of Fig. 8.20 is assumed to be made up of a single message, and this message is from pj).
When comparing this algorithm with the one described in Fig. 8.21, does this
constraint reduce the size of the control information carried by messages? Does
it increase or reduce the number of messages that are logged on stable storage?
Which algorithm do you prefer (motivate your choice)?
Chapter 9
Simulating Synchrony on Top of Asynchronous Systems

Synchronous distributed algorithms are easier to design and analyze than their asyn-
chronous counterparts. Unfortunately, they do not work when executed in an asyn-
chronous system. Hence the idea of simulating a synchronous system on top of an asynchronous one. Such a simulation algorithm is called a synchronizer. First, this
chapter presents several synchronizers in the context of fully asynchronous sys-
tems. It is important to notice that, as the underlying system is asynchronous, the
synchronous algorithms simulated on top of it cannot consider physical time as a
programming object they could use (e.g., to measure physical duration). The only
notion of time they can manipulate is a logical time associated with the concept of a
round. Then, the chapter presents synchronizers suited to partially synchronous sys-
tems. Partial synchrony means here that message delays are bounded but the clocks
of the processes (processors) are not synchronized (some private local area networks
have such characteristics).

Keywords Asynchronous system · Bounded delay network · Complexity · Graph covering structure · Physical clock drift · Pulse-based programming · Synchronizer · Synchronous algorithm

9.1 Synchronous Systems, Asynchronous Systems, and Synchronizers

9.1.1 Synchronous Systems

Synchronous systems were introduced in Chap. 1, and several synchronous algorithms have been presented in Chap. 2, namely the computation of shortest paths
in Fig. 2.3, and the construction of a maximal independent set in Fig. 2.12. In the
first of these algorithms, a process sends a message to each of its neighbors at ev-
ery round, while in the second algorithm the set of neighbors to which it sends a
message monotonically decreases as rounds progress.
In a general synchronous setting, a process may send messages to some subset
of neighbors during a round, and to a different subset during another round. It can
also send no message at all during a round. In the following, the terms “pulse” and
“round” are considered synonyms.


Fig. 9.1 A space-time diagram of a synchronous execution

A Pulse-Based Synchronous Model We consider here another way to define the synchronous model. There is a logical global clock (denoted CLOCK) that can be read by all the processes. This clock produces pulses at instants defined by the integers 0, 1, 2, etc. The behavior of the processes and the channels is governed by the clock. They behave as follows.
• A message sent by a process pi to a neighbor pj at the beginning of a pulse r
is received and processed by pj before the pulse r + 1 starts. Hence, in terms of
pulses, the transfer delays are bounded.
• It is assumed that local processing times are negligible with respect to transfer
delays. Hence, they are assumed to have a null duration. (Actually, the processing
time associated with the reception of a message can be seen as being “absorbed”
in the message transfer time.)
• A process sends to a given neighbor at most one message per pulse. Hence, when
a new pulse r + 1 is generated, a process pi knows that (a) all the messages sent
during pulse r have been received and processed, and (b) all the other processes
are at pulse r + 1.
As we can see, this definition captures the same behavior as the definition given in Sect. 1.1. Only the terminology changes: a pulse is simply a round number. The only difference lies in the writing of algorithms: the clock/pulse-based notation allows synchronous algorithms to be easily expressed with the pattern "when . . . do . . . " (instead of the round-based pattern used in Chap. 2).
A synchronous execution is described in Fig. 9.1, where time flows from left to
right and space is from top to bottom. In this space-time diagram, there are three
processes p1 , p2 , and p3 . During the first pulse (round), p1 sends a message only
to p2 , and p2 and p3 send a message to each other. Each of the other rounds has a
specific message exchange pattern.

A Synchronous Breadth-First Traversal Algorithm To illustrate the synchronous model expressed with pulses, let us consider the breadth-first traversal
problem. The channels are bidirectional, and the communication graph is connected.
Let pa be the process that launches the traversal. As in the previous chapter,
channeli [1..ci ] denotes the array that defines the local addresses of the ci channels
of a process pi . Moreover, each process has two local variables, denoted leveli and
parenti .

when CLOCK is increased do
% a new pulse (round) is generated by the synchronous system %
if (leveli = CLOCK) then
   for each x ∈ {1, . . . , ci} do send LEVEL (leveli) on channeli[x] end for
end if.

when LEVEL (ℓ) is received on channeli[x] do
if (parenti = ⊥) then leveli ← ℓ + 1; parenti ← x end if.

Fig. 9.2 Synchronous breadth-first traversal algorithm (code for pi)

• The initial values of the variables leveli, 1 ≤ i ≤ n, are such that levela = 0, and leveli = +∞ for i ≠ a. At the end of the algorithm, leveli contains the level of pi (i.e., its distance to pa).
• The initial values of the variables parenti, 1 ≤ i ≤ n, are such that parenta = a, and parenti = ⊥ for i ≠ a. At the end of the algorithm, parenti contains the index of the channel connecting pi to its parent in the tree rooted at pa.
The corresponding synchronous algorithm is described in Fig. 9.2. It is particu-
larly simple. When the clock is set to 0, the algorithm starts, and pa sends the mes-
sage LEVEL (0) to all its neighbors which discover they are at distance 1 from pa by
the end of the first pulse. Then, when the next pulse is generated (CLOCK = 1), each
of them sends the message LEVEL (1) to all its neighbors, etc. Moreover, the first
message received by a process defines its parent in the breadth-first tree rooted at
pa . The global synchronization provided by the model ensures that the first LEVEL ()
message received by a process has followed a path whose number of channels is the
distance from the root to that process.
It is easy to see that the time complexity is equal to the eccentricity of pa (maxi-
mal distance from pa to any other process), and the message complexity is equal to
O(e), the number of communication channels. (Let us observe that (n−1) messages
can be saved by preventing a process from sending a message to its parent.)
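The behavior of this algorithm can be checked with a small round-by-round simulation. The Python sketch below (graph and names are illustrative) uses one loop iteration per pulse, and records in parent the index of the sending process rather than of the channel.

INF = float("inf")

def synchronous_bfs(neighbors, a):
    n = len(neighbors)
    level = [INF] * n
    parent = [None] * n
    level[a], parent[a] = 0, a
    clock = 0
    while True:
        # Pulse `clock`: processes at distance `clock` send LEVEL(level).
        msgs = [(i, j, level[i]) for i in range(n) if level[i] == clock
                for j in neighbors[i]]
        if not msgs:
            break
        # All pulse-`clock` messages are processed before pulse clock + 1.
        for (i, j, l) in msgs:
            if parent[j] is None:          # first LEVEL() message wins
                level[j] = l + 1
                parent[j] = i
        clock += 1
    return level, parent

# A path p0 - p1 - p2 rooted at p0 gives levels [0, 1, 2]:
print(synchronous_bfs({0: [1], 1: [0, 2], 2: [1]}, a=0))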

9.1.2 Asynchronous Systems and Synchronizers

Asynchronous Systems As we saw in Chap. 1, an asynchronous distributed system is a time-free system in the sense that there is no notion of an external time
that would be given a priori, and could be used in the algorithms executed by the
processes. In an asynchronous system, the progress of time is related only to the
sequence of operations executed by each process and the flow of messages that they
exchange.
To simplify the presentation, it is assumed that all the channels are FIFO chan-
nels. (If the channels are not FIFO, sequence numbers can be used to provide a
communication layer where channels are FIFO.)

Fig. 9.3 Synchronizer: from asynchrony to logical synchrony

Synchronizer A synchronizer is a distributed asynchronous algorithm that simulates a synchronous system on top of an asynchronous system, as shown in Fig. 9.3. This concept was introduced by B. Awerbuch (1985).
Hence, if we consider a synchronizer and an asynchronous system, we obtain a simulator of a synchronous system on which synchronous algorithms can be executed (left part of Fig. 9.3). Conversely, if we consider a synchronous algorithm and a synchronizer, we obtain an asynchronous algorithm that can be executed on top of an asynchronous system (as depicted on the right part of Fig. 9.3). It is important to see that the simulation has to be general, in the sense that it must not depend on the specific synchronous algorithm that will be executed on top of it. A synchronizer is consequently an interpreter for synchronous algorithms.
Thus, the design of a synchronizer consists of developing a simulation technique enabling users to design distributed algorithms as if they were intended to be executed on a synchronous system (as defined previously). To that end, a synchronizer has to
ensure the following:
• Implement a local variable clocki on each process pi such that the set of logical
clocks clock1 , . . . , and clockn , behave as if there was a single read-only clock
CLOCK.
• Ensure that each message sent at pulse (round) r is received and processed at the
very same pulse r by its destination process.

9.1.3 On the Efficiency Side

Time and Message Costs An important question concerns the cost added by a synchronizer to simulate a given synchronous algorithm. Let As be a synchronous algorithm, and let Ts(As) and Ms(As) be its complexities in time and in number of messages (when executed in a synchronous system).
As we have just seen, a synchronizer Σ generates a sequence of pulses on each process in such a way that all the processes are simultaneously (with respect to a logical time framework) at the same pulse at the same time.
• The simulation of a pulse by Σ requires M_Σ^pulse messages and T_Σ^pulse time units.
• Moreover, the initialization of Σ requires M_Σ^init messages and T_Σ^init time units.

These values allow for the computation of the time and message complexities, denoted Tas(As) and Mas(As), of the asynchronous algorithm resulting from the execution of As by Σ. More precisely, as As requires Ts(As) pulses, and each pulse costs M_Σ^pulse messages and T_Σ^pulse time units, we have

Mas(As) = Ms(As) + (M_Σ^init + Ts(As) × M_Σ^pulse), and
Tas(As) = T_Σ^init + Ts(As) × T_Σ^pulse.
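For instance (a hedged illustration, anticipating the costs of synchronizer α derived in Sect. 9.3.1, namely M_Σ^init = T_Σ^init = 0, M_Σ^pulse = O(e), and T_Σ^pulse = O(1)), a synchronous algorithm As with Ts(As) = D pulses and Ms(As) = O(e) messages yields

Mas(As) = O(e) + (0 + D × O(e)) = O(D × e), and
Tas(As) = 0 + D × O(1) = O(D).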

A synchronizer Σ will be efficient if M_Σ^init, T_Σ^init, M_Σ^pulse, and T_Σ^pulse are "reasonably" small. Moreover, these four numbers, which characterize every synchronizer, allow synchronizers to be compared. Of course, given a synchronizer Σ, there is a compromise between these four values, and they cannot all be improved simultaneously.

Design Cost As indicated on the right of Fig. 9.3, combining a synchronous algorithm with a synchronizer gives an asynchronous algorithm. In some cases, such an asynchronous algorithm can "compete" with ad hoc asynchronous algorithms designed to solve the same problem.
Another interest of the synchronizer concept lies in its implementations. Those
are based on distributed synchronization techniques, which are general and can be
used to solve other distributed computing problems.

9.2 Basic Principle for a Synchronizer

9.2.1 The Main Problem to Solve

As stated previously, the role of a synchronizer is to


• generate pulses at each process of an asynchronous distributed system as if these
pulses were issued by a global clock whose changes of value would be known
instantaneously by all the processes, and
• ensure that a message sent at the beginning of a pulse r is received by its destina-
tion process before this process starts pulse r + 1.
It follows that a sequence of pulses has to satisfy the following property, denoted
P: A new pulse r + 1 can be generated at a process only after this process has
received all the pulse r messages (of the synchronous algorithm) sent to it by its
neighbors.
The crucial issue is that, during a round r, a process does not know which of its neighbors sent it a message (see Fig. 9.1). At the level of the underlying asynchronous system, transit delays are finite but not bounded; consequently, having a process wait for a "long enough" period of time cannot work.

9.2.2 Principle of the Solutions

Solving the previous issue requires adding synchronization among neighboring pro-
cesses. To that end, let us introduce the notion of a safe process.

Notion of a Safe Process A process pi is safe with respect to a pulse r if all the messages it has sent to its neighbors at the beginning of this pulse have been received by their destination process.
Let us observe that, using acknowledgment messages, it is easy for a process pi
to learn when it becomes safe with respect to a given pulse r. This occurs when
it has received the acknowledgments associated with the messages it sent during
the pulse r. With respect to the synchronous algorithm, this doubles the number of messages but does not change its order of complexity. Moreover, a process that sends no message during a pulse is safe (at no cost) from the beginning of the corresponding round.

From Safe Processes to the Property P The notion of a safe process can be used to ensure the property P as follows: the local module implementing the synchronizer at a process pi can generate a new pulse (r + 1) at that process when (a) pi has carried out all its processing of pulse r and (b) has learned that each of its neighbors is safe with respect to the pulse r.
The synchronizers, which are presented below, differ in the way they deliver to
each process pi the information “your neighbor pj is safe with respect to the current
pulse r”.

9.3 Basic Synchronizers: α and β

Both the synchronizers α and β are due to B. Awerbuch (1985).

9.3.1 Synchronizer α

Principle of the Synchronizer α The underlying principle is very simple. When, thanks to the acknowledgment messages, a process pi learns that it is safe with respect to its current pulse, it indicates this to its neighbors by sending them a control message denoted SAFE (). So, when a process has (a) terminated its actions with respect to a pulse r and (b) learned that its neighbors are safe with respect to this pulse r, the local synchronizer module can generate the pulse r + 1 at pi. This is because no message related to a pulse r′ ≤ r can still be in transit on a channel incident to pi.

Complexities For the complexities M_α^pulse and T_α^pulse, we have the following, where e is the number of channels of the communication graph. At each pulse, every application message is acknowledged, and each process pi informs its ci neighbors that it is safe. Hence M_α^pulse = O(e), i.e., M_α^pulse ≤ O(n^2). As far as the time complexity is concerned, let us observe that control messages are sent only between neighbors, hence T_α^pulse = O(1).
For the complexities M_α^init and T_α^init, it follows from the fact that there is no specific initialization part that we have M_α^init = T_α^init = 0.

Messages Used by the Synchronizer α The local modules implementing the synchronizer α exchange three types of messages.
• An acknowledgment message is denoted ACK ().
• A control message denoted SAFE () is sent by a process pi to indicate that it is
safe with respect to its last pulse.
• Each application message m sent by a synchronous application process to one of
its neighbors is encapsulated in a simulation message denoted MSG (m).

Local Variables of the Synchronizer α The local variable clocki is built by the
synchronizer and its value can only be read by the local synchronous application
process.
The local variables channelsi and channeli [1..ci ] are defined by the communi-
cation graph of the synchronous algorithm. As in previous algorithms, channelsi is
the set {1, . . . , ci } (which consists of the indexes of the ci channels of pi ), and for
each x ∈ channelsi , channeli [x] denotes locally at pi the corresponding channel.
The other two local variables, which are hidden to the upper layer, are used only
to implement the required synchronization. These variables are the following:
• expected_acki contains the number of acknowledgments that pi has still to re-
ceive before becoming safe with respect to the current round.
• neighbors_safei is a multiset (also called a bag) that captures the current percep-
tion of pi on which of its neighbors are safe. Initially, neighbors_safei is empty.
A multiset is a set that can contain several times some of its elements. As we are about to see, it is possible that a process pi receives several messages SAFE () (each corresponding to a distinct pulse r, r + 1, etc.) from a neighbor pj while pi is still at pulse r. Hence the use of a multiset.

On the Wait Statement It is assumed that the code of all the synchronizers can-
not be interrupted except in a wait statement. This means that a process receives
and processes a message (MSG (), ACK (), or SAFE ()) only when it executes line 5
or line 7 of Fig. 9.4 when considering the synchronizer α. The same holds for the
other synchronizers.

Algorithm of the Synchronizer α The algorithm associated with the local mod-
ule implementing α at a process pi is described in Fig. 9.4. When a process starts its
next pulse (line 1), it first sends the messages MSG (m) (if any), which correspond to

repeat
(1) clocki ← clocki + 1; % next pulse is generated %
(2) Send the messages MSG () of the current pulse of the local synchronous algorithm;
(3) expected_acki ← number of MSG (m) sent during the current pulse;
(4) neighbors_safei ← neighbors_safei \ channelsi ;
(5) wait (expected_acki = 0); % pi is safe with respect to the current pulse %
(6) for each x ∈ channelsi do send SAFE () on channeli [x] end for;
(7) wait (channelsi ⊆ neighbors_safei ) % The neighbors of pi are safe %
% pi has received all the messages MSG () sent to it during pulse clocki %
until the last local pulse has been executed end repeat.

when MSG (m) is received on channeli [x] do


(8) send ACK () on channeli [x];
(9) if (x ∉ neighbors_safei )
(10) then m belongs to the current pulse; deliver it to the synchronous algorithm
(11) else m belongs to the next pulse; keep it to deliver it at the next pulse
(12) end if.

when ACK () is received on channeli [x] do


(13) expected_acki ← expected_acki − 1.

when SAFE () is received on channeli [x] do


(14) neighbors_safei ← neighbors_safei ∪ {x}.

Fig. 9.4 Synchronizer α (code for pi )

the application messages m that the local synchronous algorithm must send at pulse
clocki (line 2).
Then pi initializes appropriately expected_acki (line 3), and neighbors_safei
(line 4). As neighbors_safei is a multiset, its update consists in suppressing one
copy of each channel index. Process pi then waits until it has become safe (line 5),
and when this happens, it sends a message SAFE () to each of its neighbors to in-
form them (line 6). Finally, it waits until all its neighbors are safe before proceeding
to the next pulse (line 7).
When it receives a message ACK () or SAFE (), a process pi updates the corre-
sponding local variable expected_acki (line 13), or neighbors_safei (line 14).
Let us remark that, due to the control messages SAFE () exchanged by neighbor
processes, a message MSG (m) sent at pulse r by a process pj to a process pi will
arrive before pi starts pulse (r + 1). This is because pi must learn that pj is safe
with respect to pulse r before being allowed to proceed to pulse (r + 1). But a
message MSG (m) sent at pulse r  to a process pi can arrive before pi starts pulse r  .
This is depicted on Fig. 9.5, where pj and pk are two neighbors of pi , and where
r  = r + 1 and pi receives a pulse (r + 1) message while it is still at pulse r.
Moreover, let us benefit from this figure to consider the case where pj does not
send application messages during pulse (r + 1). In Fig. 9.5, the message MSG () sent
by pj is consequently replaced by the message SAFE (). In that case, neighbors_safei
contains twice the local index of the channel connecting pj to pi . This explains why
neighbors_safei has to be a multiset.

Fig. 9.5 Synchronizer α: possible message arrival at process pi

When it receives a message MSG (m) on a channel channeli [x] (which connects
it to its neighbor pj ), a process pi first sends back an ACK () message (line 8).
According to the previous observation (Fig. 9.5), the behavior of pi depends then
on the current value of neighbors_safei (line 9). There are two cases.
• The channel index x ∉ neighbors_safei (line 10). In that case, pi has not received
from pj the SAFE () message that closes pulse r. Hence, the message MSG (m)
is associated with pulse r, and consequently pi delivers m to the upper layer
local synchronous algorithm.
• The channel index x ∈ neighbors_safei (line 11). This means that pi has already
received the message SAFE () from pj concerning the current pulse r. Hence, m
is a message sent at pulse r + 1. Consequently, pi has to store the message m and deliver it during pulse r + 1 (after it has sent its messages associated with
pulse r + 1).
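The following Python fragment sketches the state kept by the local module of Fig. 9.4 (transport and upper layer are placeholders, and the blocking waits are only indicated by comments). It shows, in particular, why neighbors_safe must be a multiset: a SAFE() of pulse r + 1 received while the process is still at pulse r simply accumulates a second copy.

from collections import Counter

class AlphaSync:
    def __init__(self, channels):
        self.clock = 0
        self.channels = set(channels)       # indexes of incident channels
        self.expected_ack = 0
        self.neighbors_safe = Counter()     # multiset, as in the text
        self.pending = []                   # MSG() kept for the next pulse

    def start_pulse(self, nb_msg_sent):     # lines 1-4
        self.clock += 1
        self.expected_ack = nb_msg_sent     # one ACK() expected per MSG()
        self.neighbors_safe -= Counter(self.channels)  # remove one copy each
        # then: wait until expected_ack == 0 (line 5), send SAFE() on every
        # channel (line 6), and wait until every channel appears at least
        # once in the multiset (line 7) before the next pulse.

    def on_msg(self, m, x):                 # lines 8-12 (ACK() sending omitted)
        if self.neighbors_safe[x] == 0:
            return ("deliver", m)           # belongs to the current pulse
        self.pending.append((m, x))         # belongs to the next pulse
        return ("kept", m)

    def on_ack(self, x):                    # line 13
        self.expected_ack -= 1

    def on_safe(self, x):                   # line 14
        self.neighbors_safe[x] += 1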

9.3.2 Synchronizer β

Principle of the Synchronizer β The synchronizer β is based on a spanning tree rooted at some process pa. The construction of this tree has to be done in an initialization stage. This tree is used to convey the control messages SAFE () from the leaves to the root, and (new) control messages PULSE () from the root to the leaves.
As soon as a process that is a leaf in the control tree is safe, it indicates this to
its parent in the tree (messages SAFE ()). A non-leaf process waits until it and its
children are safe before informing its parent of this. When the root learns that all the
processes are safe with respect to the current round, it sends a message PULSE (),
which is propagated to all the processes along the channels of the tree.
Let us notice that, when the root learns that all the processes are safe, no message
MSG (m) is in transit in the system.

Complexities Every message MSG (m) gives rise to an acknowledgment message, and at every pulse, two control messages (SAFE () and PULSE ()) are sent on each channel of the tree. Moreover, the height of the tree rooted at a process pa is equal to its eccentricity ecca, which is at most (n − 1). Hence, we have M_β^pulse = T_β^pulse = O(n).

If a rooted tree pre-exists, we have M_β^init = T_β^init = 0. If the tree has to be built, M_β^init and T_β^init are the costs of building a spanning tree (see Chap. 1).

Local Variables of the Synchronizer β As before, each process pi has ci neighbors with which it can communicate through the channels channeli[1..ci]. The spanning tree is implemented with the following local variables at each process pi.
• channeli [parenti ] denotes the channel connecting pi to its parent in the tree.
Moreover, the root pa has an additional channel index parenta such that
channela [parenta ] = ⊥.
• childreni is a set containing the indexes of the channels connecting process pi to
its children. If childreni = ∅, pi is a leaf of the tree.
Finally, as the messages SAFE () are sent only from a process to its parent, the multiset neighbors_safei (used in the synchronizer α) is replaced by a local variable
denoted children_safei . Since a process pi waits for messages SAFE () only from its
children in the control tree, this variable is no longer required to be a multiset. The
set children_safei is initially empty.

Algorithm of the Synchronizer β The behavior of the synchronizer β is described in Fig. 9.6. The root sends a message PULSE () to its neighbors, and this message is propagated along the tree to all the processes (lines 1–3). Then, pi sends (if any) its messages related to pulse clocki of its local synchronous algorithm (line 4), and resets expected_acki to the number of these messages (line 6).
After it has sent the messages of the current pulse (if any), pi waits until both
itself and its children are safe (line 7). When this happens, it sends a message SAFE ()
to its parent to inform it that the subtree it controls is safe (line 8). Process pi also
resets children_safei to ∅ (line 9). In this way, all the local variables children_safei
are equal to their initial value (∅) when the root triggers the next pulse. If, while pi
is waiting at line 7, or after it has sent the message SAFE () to its parent, pi receives messages MSG (m) from some neighbor, it processes them, and this lasts until it enters the next pulse at line 2.
This last point is one where synchronizer β differs from synchronizer α. In α, a process pi learns locally when its neighbors are safe. In the synchronizer β, a process learns this locally only from its children (which are a subset of its neighbors). It learns that all its neighbors are safe only when it receives a message PULSE (), which carries the global information that all the processes are safe with respect to pulse
clocki . This means that the message pattern depicted in Fig. 9.7 can happen with
synchronizer β, while it cannot with synchronizer α.
Moreover, a message MSG (m) sent at pulse r can arrive at a process pi while it is still at pulse clocki = r − 1. This is due to the fact that messages PULSE () are not sent between each pair of neighbors, but only along the channels of the spanning tree. This pattern is described in Fig. 9.8. A simple way to allow the destination process pi to know whether a given message MSG (m) is related to pulse clocki or pulse clocki + 1 consists in associating a sequence number with these messages. Actually, as the channels are FIFO, a simple parity bit (denoted pb in Fig. 9.6) is sufficient to disambiguate messages.

repeat
(1) if (channeli [parenti ] ≠ ⊥) then wait (PULSE() received on channeli [parenti ]) end if;
% pi and all its neighbors are safe with respect to the pulse clocki : %
% it has received all the messages MSG () sent to it during pulse clocki %
(2) clocki ← clocki + 1; % next pulse is generated %
(3) for each x ∈ childreni do send PULSE () on channeli [x] end for;
(4) Send the messages MSG (−, pb) of the local synchronous algorithm
(5) where pb = (clocki mod 2);
(6) expected_acki ← number of MSG (m) sent during the current pulse;
(7) wait ((expected_acki = 0) ∧ (children_safei = childreni ));
% pi and all its children are safe with respect to the current pulse %
(8) if (channeli [parenti ] ≠ ⊥) then send SAFE () on channeli [parenti ] end if;
(9) children_safei ← ∅
until the last local pulse has been executed end repeat.

when MSG (m, pb) is received on channeli [x] do


(10) send ACK () on channeli [x];
(11) if (pb = (clocki mod 2))
(12) then m belongs to the current pulse; deliver it to the synchronous algorithm
(13) else m belongs to the next pulse; keep it to deliver it at the next pulse
(14) end if.

when ACK () is received on channeli [x] do


(15) expected_acki ← expected_acki − 1.

when SAFE () is received on channeli [x] do % we have then x ∈ childreni %


(16) children_safei ← children_safei ∪ {x}.

Fig. 9.6 Synchronizer β (code for pi )
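The parity mechanism of lines 11–14 can be isolated in a few lines of Python: since channels are FIFO, a received message can only belong to the receiver's current pulse or to the next one, and one bit distinguishes the two cases (a small sketch, with illustrative names).

def classify(pb, clock):
    # pb is the parity bit carried by MSG(); clock is the receiver's pulse.
    return "current pulse" if pb == clock % 2 else "next pulse"

# A pulse-3 message reaching a process still at pulse 2 is kept for later,
# while a pulse-2 message is delivered immediately:
assert classify(pb=3 % 2, clock=2) == "next pulse"
assert classify(pb=2 % 2, clock=2) == "current pulse"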

Fig. 9.7 A message pattern which can occur with synchronizer β (but not with α): Case 1

Fig. 9.8 A message pattern which can occur with synchronizer β (but not with α): Case 2

Fig. 9.9 Synchronizer γ : a communication graph

9.4 Advanced Synchronizers: γ and δ

Both the synchronizers γ and δ are generalizations of α and β. Synchronizer γ is due to B. Awerbuch (1985), while synchronizer δ is due to D. Peleg and J.D. Ullman (1989). These synchronizers differ in the ways they generalize α and β.

9.4.1 Synchronizer γ

Looking for an Overlay Structure: Partitioning the Communication Graph When looking at complexity, the synchronizers α and β have opposite performances: α is better than β for time but worse for the number of messages. Synchronizer α is good for communication graphs with limited degree, while synchronizer β is good for communication graphs with small diameter.
The principle that underlies the design of synchronizer γ is to combine α and β in order to obtain a better time complexity than β and a better message complexity than α. Such a combination relies on a partitioning of the system (realized during the initialization part of γ).
To fix ideas, let us consider Fig. 9.9, which describes a communication graph with eleven processes. A partitioning of this graph is described in Fig. 9.10, where the communication graph is decomposed into three spanning trees which are interconnected. Each tree is surrounded by an ellipse, and its corresponding edges (communication channels) are denoted by bold segments. The trees are connected by communication channels denoted by dashed segments. The communication channels which are neither in a tree nor interconnecting two trees are denoted by dotted segments.
While messages of the synchronous application algorithm can be sent on any
communication channel, control messages sent by the local modules of the syn-
chronizer γ are sent only on the (bold) edges of a tree, or on the (dashed) edges that
interconnect trees.

Fig. 9.10 Synchronizer γ : a partitioning

A tree is called a group, and we accordingly speak of "intragroup channels" and "intergroup channels". Hence, when considering Fig. 9.10, the dashed segments are intergroup channels.

Principle of Synchronizer γ Synchronizer γ is based on a two-level synchronization mechanism.
• Inside each group, the tree is used (as in β) to allow the processes of the group to
learn that the group is safe (i.e., all its processes are safe).
• Then, each group behaves as if it was a single process and the synchronizer α is
used among the groups to allow each of them to learn that its neighbor groups are
safe. When this occurs, a new pulse can be generated inside the corresponding
group.

Local Variables Used by Synchronizer γ The local variables of a process pi related to communication are the following ones.
• As previously, channelsi = {1, . . . , ci} is the set of local indexes such that, for each x ∈ channelsi, channeli[x] locally denotes a communication channel of pi. These are all the channels connecting process pi to its neighbor processes.
• As for β, channeli [parenti ] denotes the channel connecting pi to its parent in
the tree associated with its group, and childreni denotes the set of its children.
The root pa of a group is the process pa such that channela [parenta ] = ⊥. These
channels are depicted as bold segments in Fig. 9.10.
• The set it_group_channelsi ⊆ {1, . . . , ci} contains the indexes of the channels connecting pi to processes of other groups. From a control point of view, two neighbor groups are connected by a single channel on which (application messages and) control messages travel. These channels are called intergroup channels. They are depicted by dashed segments in Fig. 9.10.
The local variable expected_acki has the same meaning as in α or β, and
children_safei has the same meaning as in β, while neighbors_safei is a multiset
(initialized to ∅) which has the same meaning as in α.

Messages Used by Synchronizer γ In addition to the messages MSG () and ACK () used as in α or β, and the messages SAFE () used as in β (i.e., used inside a group and sent along the tree to inform the root), processes exchange messages GROUP _ SAFE () and ALL _ GROUPS _ SAFE ().
A message GROUP _ SAFE () is sent by a process pi to its children, and to its neighbors which belong to other groups. These messages are used to inform their destination processes that all the processes of the group to which the sender belongs are safe. A message ALL _ GROUPS _ SAFE () is used inside a group (from the leaves of the spanning tree to its root) to inform its processes that its neighbor groups are safe.

Algorithm of the Synchronizer γ The algorithm implementing synchronizer γ at a process pi is described in Fig. 9.11.
Inside each group, the root process first sends a message PULSE () that is propagated along the channels of the tree associated with this group (lines 1–3). After it has received a message PULSE (), a process pi starts interpreting the current pulse of the synchronous algorithm by sending the corresponding application messages (line 4). As we have seen with the tree-based synchronizer β, these messages carry the parity of the corresponding pulse so that they are processed at the correct pulse (lines 23–26). Then, pi initializes expected_acki to its new value (line 6), and neighbors_safei similarly to what is done in the synchronizer α (line 7).
After these statements, pi waits until it and its children are safe (hence, all the processes of its group in the subtree of which it is the root are safe; line 8). When this occurs, pi informs its parent by sending it the message SAFE () (line 10). If it is the root, pi learns that all the processes in its group are safe. It consequently sends a message GROUP _ SAFE () (lines 12–14) to its children and the processes of the other
groups to which it is connected with a channel in it_group_channelsi . These mes-
sages GROUP _ SAFE () are propagated to all the processes of the group, and along the
channels in it_group_channelsi (lines 12–14 and 29–31). Moreover, when a process
pi receives a message GROUP _ SAFE () from a process in another group (this oc-
curs on a channel in it_group_channelsi ), it updates neighbors_safei accordingly
(line 32).
Then, pi waits until it knows that all its neighbor groups are safe with respect to the current pulse clocki (line 16). When this occurs, the corresponding information propagates inside each group from the leaves of the spanning tree to its root (lines 17–20). (Let us observe that the waiting predicate of line 17 is trivially satisfied at a leaf of the tree.) When the root learns that the processes of all groups are safe with respect to the current round, it starts the next pulse inside its group.

Two Particular (Extreme) Cases If each process constitutes a group, the spanning trees inside the groups, the tree-related variables (parenti, childreni), and the tree-related messages (PULSE (), SAFE (), and ALL _ GROUPS _ SAFE ()) disappear. Moreover, we then have it_group_channelsi = channelsi. The parity bit can also be suppressed. It follows that we can suppress lines 1, 3, 9–11, 15, 17–21, 28, and 29–31. As the reader can check, we then obtain the synchronizer α (where the messages SAFE () are replaced by GROUP _ SAFE ()).

repeat
(1) if (channeli [parenti ] ≠ ⊥) then wait (PULSE() received on channeli [parenti ]) end if;
% all processes are safe with respect to pulse clocki %
(2) clocki ← clocki + 1; % next pulse is generated %
(3) for each x ∈ childreni do send PULSE () on channeli [x] end for;
(4) Send the messages MSG (−, pb) of the local synchronous algorithm
(5) where pb = (clocki mod 2);
(6) expected_acki ← number of MSG (m) sent during the current pulse;
(7) neighbors_safei ← neighbors_safei \ (it_group_channelsi ∪ childreni );
(8) wait ((expected_acki = 0) ∧ (children_safei = childreni ));
% pi and all its children are safe with respect to the current pulse %
(9) if (channeli [parenti ] ≠ ⊥)
(10) then send SAFE () on channeli [parenti ]
(11) else % pi is the root of its group %
(12) for each x ∈ childreni ∪ it_group_channelsi
(13) do send GROUP _ SAFE () on channeli [x]
(14) end for
(15) end if;
(16) wait (it_group_channelsi ⊆ neighbors_safei );
% pi ’s group and its neighbor groups are safe with respect to pulse clocki %
(17) wait (∀x ∈ childreni : ALL _ GROUPS _ SAFE () received on channeli [x] );
(18) if (channeli [parenti ] ≠ ⊥)
(19) then send ALL _ GROUPS _ SAFE () on channeli [parenti ]
(20) end if;
(21) children_safei ← ∅
until the last local pulse has been executed end repeat.
when MSG (m, pb) is received on channeli [x] do
(22) send ACK () on channeli [x];
(23) if (pb = (clocki mod 2))
(24) then m belongs to the current pulse; deliver it to the synchronous algorithm
(25) else m belongs to the next pulse; keep it to deliver it at the next pulse
(26) end if.
when ACK () is received on channeli [x] do
(27) expected_acki ← expected_acki − 1.
when SAFE () is received on channeli [x] do % we have then x ∈ childreni %
(28) children_safei ← children_safei ∪ {x}.
when GROUP _ SAFE () is received on channeli [parenti ] do
(29) for each x ∈ childreni ∪ it_group_channelsi
(30) do send GROUP _ SAFE () on channeli [x]
(31) end for.
when GROUP _ SAFE () is received on channeli [x] where x ∈ it_group_channelsi do
(32) neighbors_safei ← neighbors_safei ∪ {x}.

Fig. 9.11 Synchronizer γ (code for pi )

In the other extreme case, there is a single group to which all the processes belong. In this case, both it_group_channelsi, which is now equal to ∅, and neighbors_safei become useless and disappear. Similarly, the messages GROUP _ SAFE () and ALL _ GROUPS _ SAFE () become useless. It follows that we can suppress lines 7, 11–14, 16–20, 29–31, and 32; we then obtain the synchronizer β.

Complexities Let Ep be the set of channels on which control messages other than ACK () travel (these are the channels of the spanning trees and the intergroup channels). We do not consider the messages ACK () because they are used in all synchronizers. Let Hp be the maximum height of a spanning tree.
At most four messages are sent on each channel of Ep, namely PULSE (), SAFE (), GROUP _ SAFE (), and ALL _ GROUPS _ SAFE (). It follows that we have C_γ^pulse = O(|Ep|) and T_γ^pulse = O(Hp). It is possible to find partitions such that C_γ^pulse ≤ kn and T_γ^pulse ≤ log_k n, where 1 ≤ k ≤ n.
More generally, according to the partition that is initially built, we have O(n) ≤ C_γ^pulse ≤ O(e) ≤ O(n^2) (where e is the number of communication channels) and O(1) ≤ T_γ^pulse ≤ O(n).

9.4.2 Synchronizer δ

Graph Spanner Let us first recall that a partial graph is obtained by suppressing edges, while a subgraph is obtained by suppressing vertices and their incident edges. Given a connected undirected graph G = (V, E) (V is the set of vertices and E the set of edges), a partial subgraph G′ = (V, E′) is a t-spanner if, for any edge (x, y) ∈ E, there is a path (in G′) from the vertex x to the vertex y whose length (number of channels) is at most t.
The notion of a graph spanner generalizes the notion of a spanning tree. It is used in message-passing distributed systems to define overlay structures with appropriate distance properties.

Principle of Synchronizer δ This synchronizer assumes that a t-spanner has been built on the communication graph defined by the synchronous algorithm. Its principle is simple. When a process becomes safe, it executes t communication phases with its neighbors in the t-spanner, at the end of which it knows that all its neighbors in the communication graph are safe.
From an operational point of view, let us consider a process pi that becomes safe. It sets a local variable phi to 0, sends a message SAFE () to its neighbors in the t-spanner, and waits for such a message from each of its neighbors in the t-spanner. When it has received these messages, it increases phi to phi + 1 and re-executes the previous communication pattern. After this has been repeated t times, pi locally generates its next pulse.

Theorem 12 For all k ∈ [0..t] and every process pi , when phi is set to k, the pro-
cesses at distance d ≤ k from pi in the communication graph are safe.

Proof The proof is by induction. Let us observe that the invariant is true for k = 0
(pi is safe when it sets phi to 0). Let us assume that the invariant is satisfied up to
k. When pi increases its counter to k + 1, it has received (k + 1) messages SAFE ()

repeat
% pi and all its neighbors are safe with respect to the pulse clocki %
(1) clocki ← clocki + 1; % next pulse is generated %
(2) Send the messages MSG (−, pb) of the local synchronous algorithm
(3) where pb = (clocki mod 2);
(4) expected_acki ← number of MSG (m) sent during the current pulse;
(5) wait (expected_acki = 0 );
% pi is safe with respect to the current pulse %
(6) phi ← 0;
(7) repeat t_neighbors_safei ← t_neighbors_safei \ spanner_channelsi ;
(8) for each x ∈ spanner_channelsi do send SAFE () on channeli [x] end for;
(9) wait (spanner_channelsi ⊆ t_neighbors_safei );
(10) phi ← phi + 1
(11) until (phi = t) end repeat
until the last local pulse has been executed end repeat.

when MSG (m, pb) is received on channeli [x] do


(12) send ACK () on channeli [x];
(13) if (pb = (clocki mod 2))
(14) then m belongs to the current pulse; deliver it to the synchronous algorithm
(15) else m belongs to the next pulse; keep it to deliver it at the next pulse
(16) end if.

when ACK () is received on channeli [x] do


(17) expected_acki ← expected_acki − 1.

when SAFE () is received on channeli [x] do % we have then x ∈ spanner_channelsi %


(18) t_neighbors_safei ← t_neighbors_safei ∪ {x}.

Fig. 9.12 Synchronizer δ (code for pi )

from each of its neighbors in the t-spanner. Let pj be one of these neighbors. When
pj sent its (k + 1)th message SAFE () to pi , we had phj = k. It follows from the
induction assumption that the processes at distance d ≤ k from pj (in the commu-
nication graph) are safe, and this applies to every neighbor of pi in the t-spanner.
So, when pi increases its counter to k + 1, all the processes of the communication graph at distance d ≤ k + 1 from pi are safe. □

Algorithm of Synchronizer δ The behavior of a process pi is described in Fig. 9.12. The local set spanner_channelsi contains the indexes of the channels of pi that belong to the t-spanner. The local variable t_neighbors_safei (which is initially empty) is a multiset whose role is to contain the indexes of the t-spanner channels on which messages SAFE () have been received.
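The t communication phases of lines 6–11 can be sketched as follows in Python; exchange(ph) abstracts one phase (it stands for sending SAFE() on every spanner channel and returning the multiset of channels on which SAFE() arrived). All names are illustrative.

from collections import Counter

def run_phases(t, spanner_channels, exchange):
    t_neighbors_safe = Counter()
    ph = 0
    while ph < t:                                      # lines 6-11
        t_neighbors_safe -= Counter(spanner_channels)  # remove one copy each
        t_neighbors_safe += exchange(ph)               # send/collect SAFE()
        assert all(t_neighbors_safe[x] >= 1 for x in spanner_channels)
        ph += 1
    return ph   # ph == t: by Theorem 12, processes at distance <= t are safe

# With a lock-step exchange, each phase yields one SAFE() per channel:
run_phases(3, ["c1", "c2"], lambda ph: Counter(["c1", "c2"]))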

Complexities Let m be the number of channels in the t-spanner. At every pulse, each of the t phases gives rise to one message SAFE () per spanner channel, so it is easy to see that C_δ^pulse = O(mt) and T_δ^pulse = O(t).
It is easy to see that the case t = 1 corresponds to the synchronizer α. If the t-spanner is a spanning tree, we have m = n − 1 and t ≤ D (where D is the diameter of the communication graph), and we consequently have C_δ^pulse = O(nD) and T_δ^pulse = O(D).

when INIT () received do


if (¬ donei ) then
donei ← true; set timeri to 0;
for each x ∈ neighborsi do send INIT () on channeli [x] end for
end if.

Fig. 9.13 Initialization of physical clocks (code for pi )


9.5 The Case of Networks with Bounded Delays

While synchronous and asynchronous algorithms are two extreme points of the synchrony spectrum, there are distributed systems that are neither fully synchronous nor fully asynchronous. This part of the chapter is devoted to such a type of system; the construction of synchronizers that benefit from its specific properties is used to illustrate its noteworthy features. Both synchronizers presented here are due to C.Y. Chou, I. Cidon, I. Gopal, and S. Zaks (1987).

9.5.1 Context and Hypotheses

Bounded Delay Networks These networks are systems in which (a) communication delays are bounded (by one time unit), (b) each process (processor) has a physical clock, (c) the clocks progress at the same speed but are not synchronized, and (d) processing times are negligible when compared to message delays and are consequently assumed to be equal to 0.
Thus, the local clocks do not necessarily show the same time at the same moment, but advance by equal amounts in equal time intervals. If the clocks could be started simultaneously, we would obtain a perfectly synchronous system. The fact that the system is not perfectly synchronous requires the addition of a synchronizer when one wants to execute synchronous algorithms on top of such systems.

Initialization of the Local Clocks The initialization (reset to 0) of the local


clocks can be easily realized by a network traversal algorithm as described in
Fig. 9.13. Each process has a Boolean donei initialized to false. Its local clock is
denoted timeri .
At least one process (possibly more) receives an external message INIT (). The
first time it receives such a message, a process sets its timer and propagates the
message to its neighbors. The local variables neighborsi and channeli [neighborsi ]
have the same meaning as before.

Fig. 9.14 The scenario to be prevented

Let us consider an abstract global time, and let τi be the time at which pi sets
timeri to 0. This global time, which is not accessible to the processes, can be seen as
the time of an omniscient external observer. Its unit is assumed to be the same as the
one of the local clocks. The previous initialization provides us with the following
relation:
∀(i, j ): |τi − τj | ≤ d(i, j ) (R1),
where d(i, j ) is the distance separating pi and pj (minimal number of chan-
nels between pi and pj ). Let us notice that if pi and pj are neighbors, we have
|τi − τj | ≤ 1.

9.5.2 The Problem to Solve

After a process pi has received a message INIT (), timeri accurately measures the
passage of time, one unit of time being always the maximum transit time for a
message.
Let us consider a scenario where, after the initialization of the physical clocks,
there is no additional synchronization. This scenario is depicted in Fig. 9.14 where
there are two neighbor processes pi and pj , and a logical pulse takes one physical
time unit. At the beginning of its first pulse, pj sends a message to pi , but this
message arrives while pi is at its second pulse. Hence, this message arrives too late,
and violates the fundamental property of synchronous message-passing.
Hence, synchronizers are not given for free in bounded delay networks. Implementing a synchronizer requires computing an appropriate duration of ρ physical time units for a logical pulse (the previous scenario shows that we necessarily have
ρ > 1). The rth pulse of pi will then start when timeri = rρ, and will terminate
when timeri reaches the value (r + 1)ρ.
Several synchronizers can be defined. They differ in the value of ρ and the in-
stants at which they send messages of the synchronous algorithm they interpret.
The next sections present two of them, denoted λ and μ.

Fig. 9.15 Interval during which a process can receive pulse r messages

9.5.3 Synchronizer λ

In the synchronizer λ, a process pi sends a message m of the synchronous algorithm


relative to pulse r when timeri = ρr. It remains to compute the value of ρ so that
no message is received too late.

Pulse Duration Let pj be a neighbor of pi , and let τj (r) be the global time instant at which pj receives a pulse r message m from pi . This message must be received and processed before pj enters pulse r + 1, i.e., we must have

τj (r) < τj + (r + 1)ρ (R2),

where the right-hand side of this inequality is the abstract global time at which pj starts pulse r + 1. As transfer delays are at most one time unit, we have τj (r) < (τi + rρ) + 1. Combined with (R1), namely τi ≤ τj + 1, we obtain τj (r) < τj + rρ + 2,
that is to say
τj (r) < τj + (r + 1)ρ + (2 − ρ) (R3).
It follows that we have

(ρ ≥ 2) ∧ (R3) ⇒ (R2).
This means that the property (R2) required for the correct implementation of a syn-
chronizer (namely, no message arrives too late) is satisfied as soon as ρ ≥ 2.
(R3) gives an upper bound on the global time instant at which a process can
receive pulse r messages sent by its neighbors. In the same way, it is possible to
find a lower bound. Let pi be the sender of a pulse r message received by process
pj at time τj (r). We have τj (r) ≥ τi + rρ. Combining this inequality with (R1) (τi ≥ τj − 1), we obtain

τj (r) ≥ τj + rρ − 1 = τj + (r − 1)ρ + (ρ − 1) (R4).

Hence, the condition ρ ≥ 2 ensures that a message sent at pulse r will be received by its destination process pi (a) before pi progresses to pulse r + 1, and (b) after
pi has started its pulse (r − 1). This is illustrated in Fig. 9.15.

when timeri = ρr do
clocki ← r; % next pulse is generated %
Send the messages MSG (−, pb) of the local synchronous algorithm where
pb = (clocki mod 2);
process the pulse r messages.

when MSG (m, pb) is received on channeli [x] do


if (clocki × ρ ≤ timeri < (clocki + 1)ρ) ∧ (pb = (clocki mod 2))
then m belongs to the current pulse; deliver it to the synchronous algorithm
else m belongs to the next pulse; keep it to deliver it at the next pulse
end if.

Fig. 9.16 Synchronizer λ (code for pi )

It follows that the pulse r of a process pi spans the global time interval

[τi + rρ, τi + (r + 1)ρ],

and during this time interval, pi can receive from its neighbors only messages sent at pulse r or r + 1. It follows that messages have to carry the parity bit of the pulse at which they are sent, so that the receiver is able to know whether a received message is for the current pulse r or the next one (r + 1).
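The two bounds (R3) and (R4) can also be checked numerically. The following small sketch (in Python; the sampling ranges merely encode the hypotheses stated above, and none of it comes from the book) draws random neighbor start times and transfer delays and verifies the reception window for ρ = 2.

import random

# Sketch: check that, with rho = 2, |tau_i - tau_j| <= 1 (relation (R1)) and
# transfer delays of at most one time unit, a pulse-r message sent at
# tau_i + r*rho arrives in [tau_j + (r-1)*rho + (rho-1), tau_j + (r+1)*rho),
# i.e., during the receiver's pulse r-1 or r, never too late.
rho = 2
for _ in range(100_000):
    tau_i = random.random()                      # start time of the sender p_i
    tau_j = tau_i + random.uniform(-1, 1)        # neighbor start times differ by at most 1
    r = random.randrange(1, 10)                  # an arbitrary pulse number
    arrival = tau_i + r * rho + random.random()  # transfer delay in [0, 1)
    assert arrival < tau_j + (r + 1) * rho                 # (R2)/(R3): not too late
    assert arrival >= tau_j + (r - 1) * rho + (rho - 1)    # (R4): not too early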

Algorithm of the Synchronizer λ The algorithm defining the behavior of a pro-


cess pi is described in Fig. 9.16. This description follows the previous explanation.

Complexities The time complexity of the synchronizer λ is a function of ρ. Taking ρ = 2, the time of the synchronous algorithm is multiplied by 2 and we have Tλ^pulse = O(1). Moreover, as every message m has to carry one control bit and there are no additional control messages, the message complexity is Cλ^pulse = O(1). Finally, the initialization part consists of the initialization of the local physical clocks of the processes.

9.5.4 Synchronizer μ

Aim of Synchronizer μ The aim of synchronizer μ is to ensure that each message


sent at pulse r is received by its destination process pj while it executes pulse r. So that no message arrives too early, we need to have

τj + ρr ≤ τj (r) < τj + (r + 1)ρ (R5).

From an operational point of view, this means that the parity bits used in λ can be eliminated.

when timeri = ρr do
clocki ← r. % next pulse is generated %

when timeri = ρ × clocki + η do


Send the messages MSG (−) of the local synchronous algorithm;
When received, process pulse clocki messages.

when MSG (m) is received on channeli [x] do


m belongs to the current pulse; deliver it to the synchronous algorithm.

Fig. 9.17 Synchronizer μ (code for pi )

Determining the Appropriate Timing Parameters One way of achieving the


previous requirement consists in delaying the sending of messages by an appro-
priate amount η of time units (in such a way that η = 0 would correspond to the
synchronizer λ).
As the transit time is upper bounded by 1, and the sender pi sends a pulse r
message at time τi + rρ + η, we have on the one hand τj (r) ≤ τi + rρ + η + 1, and
on the other hand τj (r) ≥ τi + rρ + η (as transit times are not negative). Hence, to
satisfy (R5), it is sufficient to ensure

τj (r) ≤ τi + rρ + η + 1 ≤ τj + (r + 1)ρ (R6),

and
τj (r) ≥ τi + rρ + η ≥ τj + rρ (R7)
or again
ρ ≥ η + 1 + (τi − τj ) and η ≥ τj − τi .
But we have from (R1): τj − τi ≤ 1 and τi − τj ≤ 1. Hence, conditions (R6) and (R7) are satisfied when
ρ ≥ η + 2 and η ≥ 1.
It follows that any pair of values (η, ρ) satisfying the previous condition ensures
that any message is received at the same pulse at which it has been sent. The smallest
values for η and ρ are thus 1 and 3, respectively.
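As a quick sanity check of this derivation, the following sketch (in Python; the sampling ranges encode the stated hypotheses and are an illustration, not the book's code) verifies that η = 1 and ρ = 3 satisfy (R5) for all start-time offsets and delays allowed by (R1).

import random

# Sketch: a pulse-r message is sent at tau_i + r*rho + eta and delayed by at
# most one time unit; with eta = 1 and rho = 3 it must arrive inside the
# receiver's pulse r, i.e., in [tau_j + r*rho, tau_j + (r+1)*rho).
eta, rho = 1, 3
for _ in range(100_000):
    tau_i = random.random()
    tau_j = tau_i + random.uniform(-1, 1)        # relation (R1) for neighbors
    r = random.randrange(10)
    arrival = tau_i + r * rho + eta + random.random()  # transfer delay in [0, 1)
    assert tau_j + r * rho <= arrival < tau_j + (r + 1) * rho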

Algorithm of Synchronizer μ The algorithm defining the behavior of a process


pi is described in Fig. 9.17. This synchronizer adds neither control messages nor control information to application messages, and we have Cμ^pulse = Tμ^pulse = O(1).

9.5.5 When the Local Physical Clocks Drift

When the Clocks Drift The previous section assumed that, while local clocks
of the processes (processors) do not output the same value at the same reference

Fig. 9.18 Clock drift with respect to reference time

time, they progress at the same speed. This assumption simplifies the design of synchronizers λ and μ.
Unfortunately, physical clocks drift, but fortunately their drift with respect to the abstract global time perceived by an omniscient external observer (also called reference time) is usually bounded and captured by a parameter denoted ε (called the clock drift).
Hence, we consider that the faster clock counts one unit for (1 − ε) of reference time, while the slowest clock counts one unit for (1 + ε) of reference time (Fig. 9.18). Let Δ denote a time duration measured with respect to the reference time, and Δi be this time duration as measured by the clock of pi . We have

Δ(1 − ε) ≤ Δi ≤ Δ(1 + ε).

This formula is depicted in Fig. 9.18. As ε is usually very small, ε^2 is assumed to be equal to 0 (hence (1 − ε)(1 + ε) ≃ 1). The upper and lower lines define what is called the linear time envelope of a clock. The upper line in the figure considers the case where the clock of pi always counts 1 for (1 − ε) of reference time, while the lower line considers the case where its clock always counts 1 for (1 + ε) of reference time. The dotted line in the middle considers the case where the clock does not drift. Of course, the drift of a clock is not constant, and a given clock can have a positive drift at some time and a negative one at another time.
It follows that, advancing at different rates, the clocks can measure differently
a given interval of the reference time. It is consequently important to know how many pulses can be generated before the resulting discrepancy makes a message arrive too late with respect to its pulse number, thereby making the corresponding synchronizer incorrect.

Conditions for Synchronizer μ We consider here the case of synchronizer μ. Let


pi and pj be two neighbor processes, and let us assume that their clocks (timeri and timerj ) have maximum and opposite drifts (i.e., of ε with respect to the reference time).
As we have seen, the conditions (R6) and (R7) stated in Sect. 9.5.4 ensure the respect of condition (R5), which defines the correct behavior of μ, namely, a message sent by pi at pulse r arrives at pj during its pulse r.
When considering (R6), the worst case is produced when the sender pi has the slower clock (timeri counts 1 for 1 + ε of reference time), while the receiver process

pj has the faster clock (timerj counts 1 for 1 − ε of reference time). In such a context, the condition (R6) becomes

τi + (rρ + η)(1 + ε) + 1 ≤ τj + (r + 1)ρ(1 − ε) (R6′),

where everything is expressed within the abstract reference time.


When considering (R7), the worst case is produced when the sender pi has the faster clock, while the receiver pj has the slower clock. In such a context, the condition (R7) becomes

τi + (rρ + η)(1 − ε) ≥ τj + rρ(1 + ε) (R7′).

Let us now consider (R1) to eliminate τi and τj and obtain conditions on ρ, η, and ε, which imply simultaneously (R6′) and (R7′).
• Taking τi − τj ≤ 1, and considering (R6′), we obtain the following condition (C1), which implies (R6′):

C1 = (2rρε ≤ ρ(1 − ε) − 2 − η(1 + ε)).

• Taking τj − τi ≤ 1, and considering (R7′), we obtain the following condition (C2), which implies (R7′):

C2 = (2rρε ≤ η(1 − ε) − 1).

It follows from these conditions that the greatest pulse number rmax that can be attained without problem is such that 2rmax ρε = ρ(1 − ε) − 2 − η(1 + ε) = η(1 − ε) − 1. Equating the two right-hand sides gives ρ(1 − ε) − 1 = η(1 + ε) + η(1 − ε) = 2η, i.e.,

η = (ρ(1 − ε) − 1)/2,

and we have then

rmax = (ρ(1 − ε)^2 − 3 + ε)/(4ρε).

Let us remark that, when there is no drift, we have ε = 0, and we obtain rmax = +∞ and 2η = ρ − 1. As already seen in Sect. 9.5.4, ρ = 3 and η = 1 are the smallest values satisfying this equation.
Considering physical clocks whose drift is 10^−1 seconds a day (i.e., ε = 1/864 000), Table 9.1 shows a few numerical results for three increasing values of ρ.

Table 9.1 Value of rmax as a function of ρ

    Value of ρ     Value of rmax
    4               54 000
    8              135 000
    12             162 000
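The entries of this table can be recomputed directly from the formula for rmax; the following small sketch (in Python, for illustration only) does so.

# Sketch: recompute Table 9.1 with eps = 1/864000 (a drift of 0.1 second per day).
eps = 1 / 864_000
for rho in (4, 8, 12):
    r_max = (rho * (1 - eps) ** 2 - 3 + eps) / (4 * rho * eps)
    print(rho, round(r_max))   # prints (approximately) 54000, 135000, 162000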

9.6 Summary

This chapter has presented the concept of a synchronizer, which encapsulates a general methodology to simulate (non-real-time) distributed synchronous algorithms on top of asynchronous distributed systems.


It has presented several synchronizers, both for pure asynchronous systems and for bounded delay networks. The chapter has also shown that graph covering structures (spanning trees and t-spanners) are important concepts when one has to define appropriate overlay structures on which control messages are sent.

9.7 Bibliographic Notes

• The interpretation of synchronous distributed algorithms on asynchronous distributed systems was introduced by B. Awerbuch [27], who in 1985 introduced the concept of a synchronizer and defined synchronizers α, β, and γ . This paper also presents a distributed algorithm that builds partitions for γ where the spanning trees and the intergroup channels are such that, given k ∈ [1..n], we obtain a message complexity Cγ^pulse ≤ kn and a time complexity Tγ^pulse ≤ log_k n. The work of B. Awerbuch is mainly focused on the design of synchronizers from the point of view of their time and message complexities.
• Synchronizer δ is due to D. Peleg and J.D. Ullman [293]. They applied it to net-
works whose t-spanner is a hypercube structure. The notion of a graph t-spanner
is due to D. Peleg. An in-depth study of these graph covering structures can be
found in [292].
• The synchronization-related concepts used in the design of synchronizers have
their origin in the work of R.G. Gallager [142] (1982), A. Segall [341] (1983),
and B. Awerbuch [26] (1985).
The idea of a safe state of a process (i.e., it knows that all its messages have
been received) has been used in many distributed algorithms. One of the very first
works using this idea is a paper by E.W. Dijkstra and C.S. Scholten [115], who
use it to control termination of what they called a “diffusing computation”.
• The study of synchronizers for bounded delay networks is due to C.T. Chou, I.
Cidon, I. Gopal, and S. Zaks [93], who defined the synchronizers λ and μ. Their
bounded delay assumption is valuable for many local area networks and real-time
embedded systems. Adaptation to the case where clocks can drift is addressed
in [319].
• The interested reader will find in [385] a method of simulating a system com-
posed of synchronous processes and asynchronous channels on top of a system
where both processes and channels are asynchronous. This type of simulation is
investigated in the presence of different types of faults.

• On the implementation side, K.B. Lakshmanan and K. Thulasiraman studied the


problems posed by the management of waiting queues in which messages are
stored, when one has to implement a synchronizer on top of a fully asynchronous
system [225]. On the theoretical side, A. Fekete, N.A. Lynch, and L. Shrira pre-
sented a general methodology, based on communicating automata, for the design
of modular proofs of synchronizers [123].
• Numerous works have addressed the synchronization of distributed physical
clocks (e.g., [101, 116, 230, 291, 358, 386]).
• Optimal synchronization for asynchronous bounded delay networks with drifting
local clocks is addressed in [211].
• The distributed unison problem requires that (a) no process starts its round r + 1
before all processes have executed their round r, and (b) no process remains
blocked forever in a given round. Self-stabilizing algorithms are algorithms that
can cope with transient faults such as the corruption of values [112, 118]. An
abundant literature has investigated distributed unison in the context of self-
stabilizing algorithms. The interested reader can consult [60, 100, 162] to cite
a few.

9.8 Exercises and Problems

1. The simulation of a synchronous system implemented by the synchronizer α presented in Sect. 9.3.1 is more synchronized than what is needed. More precisely, it allows a process to proceed to the next pulse r + 1 only when (a) its neighbors are safe with respect to pulse r, and (b) it is safe itself with respect to pulse r. But, as the reader can observe, item (b) is not required by the property P, which has to be satisfied for a process to locally progress from its current pulse to the next one (property P is stated in Sect. 9.2.1).
Modify the synchronizer α in order to obtain a less synchronized synchronizer α′ in which only item (a) is used to synchronize neighbor processes.
Solutions in [27, 319].
2. Write versions of synchronizers λ and μ in which the drift of the physical clocks of processors is upper bounded by ε.
3. Prove that it is impossible to implement a synchronizer in an asynchronous dis-
tributed system in which processes (even a single one) may crash.
Part III
Mutual Exclusion and Resource Allocation

This part of the book and the following one are on the enrichment of the distributed
message-passing system in order to offer high-level operations to processes. This
part, which is on resource allocation, is composed of two chapters. (The next part
will be on high-level communication operations.)

Chapter 10 introduces the mutual exclusion (mutex) problem, which is the most
basic problem encountered in resource allocation. An algorithm solving the mutex
problem has to ensure that a given hardware or software object (resource) is accessed
by at most one process at a time, and that any process that wants to access it will be
able to do so. Two families of mutex algorithms are presented. The first is the family
of algorithms based on individual permissions, while the second is the family of
algorithms based on arbiter permissions. A third family of mutex algorithms is the
family of token-based algorithms. Such a family was already presented in Chap. 5,
devoted to mobile objects navigating a network (a token is a dataless mobile object).
Chapter 11 considers first the problem posed by a single resource with several
instances, and then the problem posed by several resources, each with multiple in-
stances. It assumes that a process is allowed to acquire several instances of several
resources. The main issues are then to prevent deadlocks from occurring and to pro-
vide processes with efficient allocation algorithms (i.e., algorithms which reduce
process waiting chains).
Chapter 10
Permission-Based Mutual Exclusion Algorithms

This chapter is on one of the most important synchronization problems, namely mu-
tual exclusion. This problem (whose name is usually shortened to “mutex”) consists
of ensuring that at most one process at a time is allowed to access some resource
(which can be a physical or a virtual resource).
After having defined the problem, the chapter presents two approaches which
allow us to solve it. Both are based on permissions given by processes to other
processes. The algorithms of the first approach are based on individual permis-
sions, while the algorithms of the second approach are based on arbiter permissions
(arbiter-based algorithms are also called quorum-based algorithms).

Keywords Adaptive algorithm · Arbiter permission · Bounded algorithm · Deadlock-freedom · Directed acyclic graph · Extended mutex · Grid quorum · Individual permission · Liveness property · Mutual exclusion (mutex) · Preemption · Quorum · Readers/writers problem · Safety property · Starvation-freedom · Timestamp · Vote

10.1 The Mutual Exclusion Problem

10.1.1 Definition

The Operations acquire_mutex() and release_mutex() The mutual exclusion


problem consists in enriching the underlying system with two operations denoted
acquire_mutex() and release_mutex(), which are used as “control statements” to
encapsulate a set of application statements usually called a critical section. Let cs
denote such a set of application statements. A process pi that wants to execute cs
issues the following sequence of statements:
acquire_mutex(); cs; release_mutex().
The operations acquire_mutex() and release_mutex() can be seen as “control
brackets” which open and close a particular execution context, respectively.
It is assumed that the processes are well-formed, i.e., they always execute the pre-
vious pattern when they want to execute the statements cs. Moreover, it is assumed
that, when executed by a single process, the code denoted cs always terminates.


Fig. 10.1 A mutex invocation pattern and the three states of a process

The Three States of a Process Given a process pi , let cs_statei be a local vari-
able denoting its current local state from the critical section point of view. We have
cs_statei ∈ {out, trying, in}, where
• cs_statei = out, means that pi is not interested in executing the statement cs.
• cs_statei = trying, means that pi is executing the operation acquire_mutex().
• cs_statei = in, means that pi is executing the statement cs.
An invocation pattern of acquire_mutex() and release_mutex(), together with the
corresponding values of cs_statei , is represented in Fig. 10.1.

Problem Definition The mutual exclusion problem consists in designing an algo-


rithm that implements the operations acquire_mutex() and release_mutex() in such
a way that the following properties are satisfied:
• Safety. At any time, at most one process pi is such that cs_statei = in (we say
that at most one process is inside the critical section).
• Liveness. If a process pi invokes acquire_mutex(), then we eventually have
cs_statei = in (i.e., if pi wants to enter the critical section, it eventually enters it).
This liveness property is sometimes called starvation-freedom. It is important
to notice that it is a property stronger than the absence of deadlock. Deadlock-
freedom states that if processes want to enter the critical section (i.e., invoke
acquire_mutex()), at least one process will enter it. Hence, a solution that would
allow some processes to repeatedly enter the critical section, while preventing other
processes from entering, would ensure deadlock-freedom but would not ensure
starvation-freedom.

10.1.2 Classes of Distributed Mutex Algorithms

Mutex Versus Election The election problem was studied in Chap. 4. The aim
of both an election algorithm and a mutex algorithm is to create some asymme-
try among the processes. But these problems are deeply different. In the election
problem, any process can be elected and, once elected, a process remains elected
forever. In the mutex problem, each process that wants to enter the critical section
must eventually be allowed to enter it. In this case, the asymmetry pattern evolves
dynamically according to the requests issued by the processes.

Token-Based Algorithms One way to implement mutual exclusion in a message-


passing system consists in using a token that is never duplicated. Only the process
that currently owns the token can enter the critical section. Hence, the safety prop-
erty follows immediately from the fact that there is a single token. A token-based
mutex algorithm has only to ensure that any process that wants the token will even-
tually obtain it.
Such algorithms were presented in Chap. 5, which is devoted to mobile objects
navigating a network. In this case, the mobile object is a pure control object (i.e.,
the token carries no application-specific data) which moves from process to process
according to process requests.

Algorithms Based on Individual/Arbiter Permissions This chapter is on the


class of mutex algorithms which are based on permissions. In this case, a process
that wants to enter the critical section must ask for permissions from other processes.
It can enter the critical section only when it has received all the permissions it has
requested. Two subclasses of permission-based algorithms can be distinguished.
• In the case of individual permissions, the permission given by a process pi to a
process pj engages only pi . Mutex algorithms based on individual permissions
are studied in Sects. 10.2 and 10.3.
• In the case of arbiter permissions, the permission given by a process pi to a pro-
cess pj engages all the processes that need pi ’s permission to enter the critical
section. Mutex algorithms based on arbiter permissions are studied in Sect. 10.4.
In the following, Ri denotes the set of processes from which pi must obtain a permission in order to enter the critical section.

Remark on the Underlying Network In all the algorithms that are presented in
Sects. 10.2, 10.3, and 10.4, the communication network is fully connected (there
is a bidirectional channel connecting any pair of distinct processes). Moreover, the
channels are not required to be FIFO.

10.2 A Simple Algorithm Based on Individual Permissions

10.2.1 Principle of the Algorithm

The algorithm presented in this section is due to G. Ricart and A.K. Agrawala
(1981).

Principle: Permissions and Timestamps The principle that underlies this algo-
rithm is very simple. We have Ri = {1, . . . , n} \ {i}, i.e., a process pi needs the
permission of each of the (n − 1) other processes in order to be allowed to enter the
critical section. As already indicated, the intuitive meaning of the permission sent
by pj to pi is the following: “as far as pj (only) is concerned, pj allows pi to enter the critical section”.

Fig. 10.2 Mutex module at a process pi : structural view

Hence, when a process pi wants to enter the critical section, it sends a REQUEST()
message to each other process pj , and waits until it has received the (n − 1) corre-
sponding permissions. The core of the algorithm is the predicate used by a process
pj to send its permission to a process pi when it receives a request from this pro-
cess. There are two cases. The behavior of pj depends on the fact that it is currently
interested or not in the critical section.

• If pj is not interested (i.e., cs_statej = out), it sends by return its permission


to pi .
• If pj is interested (i.e., cs_statej ≠ out), it is either waiting to enter the critical
section or is inside the critical section. In both cases, pj has issued a request, and
the requests from pj and pi are conflicting. One process has to give its permission
to the other one (otherwise, they will deadlock each other), and only one has to
give its permission by return (otherwise, the safety property would be violated).
This issue can be solved by associating a priority with each request.

Priorities can easily be implemented with timestamps. As we have seen in Chap. 7, a timestamp is a pair ⟨h, i⟩, where h is a logical clock value and i a process identity. As we have seen in Sect. 7.1.2, any set of timestamps can be totally ordered by using a topological sort, namely, ⟨h1, i⟩ and ⟨h2, j⟩ being two timestamps, we have

⟨h1, i⟩ < ⟨h2, j⟩  =def  (h1 < h2) ∨ ((h1 = h2) ∧ (i < j)).

It follows that this algorithm uses timestamps to ensure both the safety property
and the liveness property defining the mutex problem.
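As an aside, this total order is exactly the lexicographic comparison of pairs; in Python, for instance, tuple comparison implements it directly (a small illustration, not part of the algorithm).

# Timestamps as (clock_value, process_identity) pairs: tuple comparison is
# exactly the total order defined above.
assert (3, 2) < (4, 1)     # a smaller clock value wins
assert (3, 2) < (3, 5)     # equal clocks: the smaller process identity breaks the tie
assert sorted([(4, 1), (3, 5), (3, 2)]) == [(3, 2), (3, 5), (4, 1)]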

Structural View The structure of the local module implementing the mutual ex-
clusion service at a process pi is described in Fig. 10.2. (As the reader can check,
this structure is the same as the one described in Fig. 5.2.)

operation acquire_mutex() is
(1) cs_statei ← trying;
(2) rdi ← clocki + 1;
(3) waiting_fromi ← Ri ; % Ri = {1, . . . , n} \ {i}
(4) for each j ∈ Ri do send REQUEST ( rdi , i) to pj end for;
(5) wait (waiting_fromi = ∅);
(6) cs_statei ← in.

operation release_mutex() is
(7) cs_statei ← out;
(8) for each j ∈ perm_delayedi do send PERMISSION (i) to pj end for;
(9) perm_delayedi ← ∅.

when REQUEST (k, j ) is received do


(10) clocki ← max(clocki , k);
(11) prioi ← (cs_statei ≠ out) ∧ (⟨rdi , i⟩ < ⟨k, j⟩);
(12) if (prioi ) then perm_delayedi ← perm_delayedi ∪ {j }
(13) else send PERMISSION (i) to pj
(14) end if.

when PERMISSION (j ) is received do


(15) waiting_fromi ← waiting_fromi \ {j }.

Fig. 10.3 A mutex algorithm based on individual permissions (code for pi )

10.2.2 The Algorithm

Description of the Algorithm: Local Variables In addition to the constant set


Ri and the local variable cs_statei (initialized to out), each process pi manages the
following variables:
• clocki is a scalar clock initialized to 0. Its scope is the whole execution.
• rdi (for last request date) is a local variable used by pi to save the logical date of its last invocation of acquire_mutex().
• waiting_fromi is a set used to contain the identities of the processes from which
pi is waiting for a permission.
• perm_delayedi is a set used by pi to contain the identities of the processes to
which it will have to send its permission when it exits the critical section.
• prioi is an auxiliary Boolean variable where pi computes its priority when it
receives a request message.

Description of the Algorithm: Behavior of a Process The algorithm imple-


menting mutual exclusion at a process pi is described in Fig. 10.3. Except for the
wait statement at line 5, the four sets of statements are locally executed in mutual
exclusion.
When it invokes acquire_mutex(), a process pi first updates cs_statei (line 1),
computes a clock value for its current request (line 2), and sets waiting_fromi to Ri
(line 3). Then, it sends a timestamped request message to each other process (line 4),

and waits until it has received the corresponding permissions (line 5). When this
occurs, it enters the critical section (line 6).
When it receives a permission, pi updates accordingly its set waiting_fromi
(line 15).
When it invokes release_mutex(), pi proceeds to the local state out (line 7), and
sends its permission to all the processes of the set perm_delayedi . This is the set
of processes whose requests were competing with pi ’s request, but pi delayed the
corresponding permission-sending because its own request has priority over them
(lines 8–9).
When pi receives a message REQUEST (k, j ), it first updates its local clock
(line 10). It then computes if it has priority (line 11), which occurs if cs_statei = out
(it is then interested in the critical section) and the timestamp of its current request
is smaller than the timestamp of the request it has just received. If pi has priority, it
adds the identity j to the set perm_delayedi (line 12). If pi does not have priority,
it sends by return its permission to pj (line 13).

A Remark on the Management of clocki Let us observe that clocki is not in-
creased when pi invokes acquire_mutex(): The date associated with the current re-
quest of pi is the value of clocki plus 1 (line 2). Moreover, when pi receives a
request message, it updates clocki to max(clocki , k), where k is the date of the re-
quest just received by pi (line 10), and this update is the only update of clocki .
As a very particular case, let us consider a scenario in which only pi wants to enter the critical section. It is easy to see that not only is ⟨1, i⟩ the timestamp of its first request, but ⟨1, i⟩ is the timestamp of all its requests (this is because, as line 10 is never executed, clocki remains forever equal to 0).
As we are about to see in the following proof, the algorithm is correct. It actually
considers that, when a process pi enters several times the critical section while the
other processes are not interested, it is not necessary to increase the clock of pi .
This is because, in this case and from a clock point of view, the successive invoca-
tions of acquire_mutex() issued by pi can appear as a single invocation. Hence, the
algorithm increases the local clocks as slowly as possible.

Message Cost It is easy to see that each use of the critical section by a process
requires 2(n − 1) messages: (n − 1) request messages and (n − 1) permission mes-
sages.
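To make Fig. 10.3 concrete, here is a minimal executable sketch of its handlers (in Python; the send callback, the message encoding, and the state names are illustrative assumptions, and the runtime that delivers messages and implements the wait of line 5 is left out).

# Sketch of the handlers of Fig. 10.3; send(j, msg) is an assumed asynchronous
# communication primitive provided by the environment.
class RicartAgrawala:
    def __init__(self, i, n, send):
        self.i, self.n, self.send = i, n, send
        self.R = set(range(1, n + 1)) - {i}    # R_i = {1,...,n} \ {i}
        self.cs_state = "out"
        self.clock = 0                         # scalar logical clock
        self.rd = 0                            # date of the last request
        self.waiting_from = set()
        self.perm_delayed = set()

    def acquire_mutex(self):                   # lines 1-5 (the caller then waits
        self.cs_state = "trying"               # until cs_state becomes "in")
        self.rd = self.clock + 1
        self.waiting_from = set(self.R)
        for j in self.R:
            self.send(j, ("REQUEST", self.rd, self.i))

    def release_mutex(self):                   # lines 7-9
        self.cs_state = "out"
        for j in self.perm_delayed:
            self.send(j, ("PERMISSION", self.i))
        self.perm_delayed = set()

    def on_request(self, k, j):                # lines 10-14
        self.clock = max(self.clock, k)
        prio = self.cs_state != "out" and (self.rd, self.i) < (k, j)
        if prio:
            self.perm_delayed.add(j)
        else:
            self.send(j, ("PERMISSION", self.i))

    def on_permission(self, j):                # line 15, plus the test of line 5
        self.waiting_from.discard(j)
        if self.cs_state == "trying" and not self.waiting_from:
            self.cs_state = "in"               # line 6: enter the critical section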

10.2.3 Proof of the Algorithm

Lemma 3 The algorithm described in Fig. 10.3 satisfies the mutex safety property.

Proof The proof is by contradiction. Let us assume that two processes pi and pj
are simultaneously in the critical section, i.e., from an external omniscient observer
point of view, we have cs_statei = in and cs_statej = in. It follows from the code of

Fig. 10.4 Proof of the safety property of the algorithm of Fig. 10.3

acquire_mutex() that each of them has sent a request message to the other process
and has received its permission. A priori two scenarios are possible for pi and pj to
be simultaneously inside the critical section. Let ⟨h, i⟩ and ⟨k, j⟩ be the timestamps of the request messages sent by pi and pj , respectively.
• Each process has sent its request message before receiving the request from the
other process (left side of Fig. 10.4).
As i ≠ j , we have either ⟨h, i⟩ < ⟨k, j⟩ or ⟨k, j⟩ < ⟨h, i⟩. Let us assume (without loss of generality) that ⟨h, i⟩ < ⟨k, j⟩. In this case, pj is such that ¬prioj when it received REQUEST (h, i) (line 11) and, consequently, it sent its permis-
sion to pi (line 13). Differently, when pi received REQUEST (k, j ), we had prioi , and consequently pi did not send its permission to pj (line 12); it will send the permission only when it executes line 8 of release_mutex().
It follows that, when this scenario occurs, pj cannot enter the critical section
while pi is inside the critical section, which contradicts the initial assumption.
• One process (e.g., pj ) has sent its permission to the other process (pi ) before send-
ing its own request (right side of Fig. 10.4).
When pj receives REQUEST (h, i), it executes clockj ← max(clockj , h) (line 10), hence we have clockj ≥ h. Then, when later pj invokes acquire_mutex(), it executes rdj ← clockj + 1 (line 2). Hence, rdj > h, and the message REQUEST (k, j ) sent by pj to pi is such that k = rdj > h. It follows that, when pi receives this message, we have cs_statei ≠ out (assumption), and ⟨h, i⟩ < ⟨k, j⟩. Consequently, prioi is true (line 11), and pi does not send its per-
mission to pj (line 12). It follows that, as in the previous case, pj cannot enter
the critical section while pi is inside the critical section, which contradicts the
initial assumption. 

It can be easily checked that the previous proof remains valid if messages are
lost.

Lemma 4 The algorithm described in Fig. 10.3 satisfies the mutex liveness prop-
erty.

Proof The proof of the liveness property is done in two parts. The first part
shows that the algorithm is deadlock-free. The second part shows the algorithm

Fig. 10.5 Proof of the liveness property of the algorithm of Fig. 10.3

is starvation-free. Let us first observe that as the clocks cannot decrease, the times-
tamps cannot decrease.
Proof of the deadlock-freedom property. Let us assume, by contradiction, that
processes have invoked acquire_mutex() and none of them enters the local state in.
Among all these processes, let pi be the process that sent the request message with the smallest timestamp ⟨h, i⟩, and let pj be any other process. When pj receives REQUEST (h, i), it sends by return the message PERMISSION (j ) to pi if cs_statej = out. If cs_statej ≠ out, let ⟨k, j⟩ be the timestamp of its request. Due to the definition of ⟨h, i⟩, we have ⟨h, i⟩ < ⟨k, j⟩. Consequently, pj sends PERMISSION (j ) to pi . It follows that pi receives a permission message from each other process (line 15). Consequently, pi stops waiting (line 5) and enters the critical section (line 6).
Hence, the algorithm is deadlock-free.
Proof of the starvation-freedom property. To show that the algorithm is starvation-
free, let us consider two processes pi and pj which are competing to enter the crit-
ical section. Moreover, let us assume that pi repeatedly invokes acquire_mutex()
and enters the critical section, while pj remains blocked at line 5 waiting for the
permission from pi . The proof consists in showing that this cannot last forever.
Let ⟨h, i⟩ and ⟨k, j⟩ be the timestamps of the requests of pi and pj , respectively, with ⟨h, i⟩ < ⟨k, j⟩. The proof shows that there is a finite time after which the timestamp of a future request of pi will be ⟨h′, i⟩ > ⟨k, j⟩. When this occurs, the request of pj will have priority with respect to that of pi . The worst case scenario is
described in Fig. 10.5: clocki = h − 1 < clockj = k − 1, the request messages from
pi to pj and the permission messages from pj to pi are very fast, while the request
message from pj to pi is very slow. When it receives the message REQUEST (h, i),
pj sends by return its permission to pi . This message pattern, which is surrounded by an ellipse in Fig. 10.5, can occur repeatedly an unbounded number of times, but the impor-
tant point is that it cannot appear an infinite number of times. This is because, when
pi receives the message REQUEST (k, j ), it updates clocki to k and, consequently, its
next request message (if any) will carry a date greater than k and will have a smaller
priority than pj ’s current request. Hence, no process can prevent another process
from entering the critical section. 

The following property of the algorithm follows from the proof of the previous lemma: the invocations of acquire_mutex() direct the processes to enter the critical section according to the total order on the timestamps generated by these invocations.

Theorem 13 The algorithm described in Fig. 10.3 solves the mutex problem.

Proof The proof is a direct consequence of Lemmas 3 and 4. 

10.2.4 From Simple Mutex to Mutex on Classes of Operations

The Readers/Writers Problem The best-known generalization of the mutex


problem is the readers/writers problem. This problem is defined by two operations,
denoted read() and write(), whose invocations are constrained by the following syn-
chronization rule.
• Any execution of the operation write() is mutually exclusive with the simultane-
ous execution of any (read() or write()) operation.
It follows from this rule that simultaneous executions of the operation read() are
possible, as long as there is no concurrent execution of the operation write().

Generalized Mutex More generally, it is possible to consider several types of


operations and associated exclusion rules defined by a concurrency matrix. Let
op1() and op2() be two operations whose synchronization types are st1 and st2,
respectively. These synchronization types are used to state concurrency constraints
on the execution of the corresponding operations. To that end, a Boolean sym-
metric matrix denoted exclude is used (symmetric means that exclude[st1, st2] =
exclude[st2, st1]). Its meaning is the following: An operation whose type is
st1 and an operation whose type is st2 cannot be executed simultaneously if
exclude[st1, st2] is true.
As an example, let us consider the readers/writers problem. There are two operations and a synchronization type per operation, namely r is the type of the operation read() and w is the type associated with the operation write(). The concurrency matrix is such that exclude[r, r] = false, and exclude[r, w] = exclude[w, r] = exclude[w, w] = true.
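Encoded as a table, this matrix reads as follows (a small sketch; the dictionary encoding is an assumption made for illustration).

# Sketch of the readers/writers concurrency matrix.
exclude = {
    ("r", "r"): False,                     # reads may run concurrently
    ("r", "w"): True, ("w", "r"): True,    # a write excludes any read
    ("w", "w"): True,                      # a write excludes any other write
}

def may_run_concurrently(st1, st2):
    return not exclude[(st1, st2)]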

An Extended Mutex Algorithm Let us associate with each operation op() two
control operations denoted begin_op() and end_op(). These control operations are
used to bracket each invocation of op() as follows:
begin_op(); op(); end_op().

Let op_type be the synchronization type associated with the operation op() (let
us observe that several operations can be associated with the same synchroniza-
tion type). The algorithm described in Fig. 10.6 is a trivial extension of the mutex

operation begin_op() is
(1) cs_statei ← trying;
(2) rdi ← clocki + 1;
(3) waiting_fromi ← Ri ; % Ri = {1, . . . , n} \ {i}
(4 ) for each j ∈ Ri do send REQUEST ( rdi , i, op_type) to pj end for;
(5) wait (waiting_fromi = ∅);
(6) cs_statei ← in.

operation end_op() is
(7) cs_statei ← out;
(8) for each j ∈ perm_delayedi do send PERMISSION (i) to pj end for;
(9) perm_delayedi ← ∅.

when REQUEST (k, j, op_t) is received do


(10) clocki ← max(clocki , k);
(11′) prioi ← (cs_statei ≠ out) ∧ (⟨rdi , i⟩ < ⟨k, j⟩) ∧ exclude[op_type, op_t];
(12) if (prioi ) then perm_delayedi ← perm_delayedi ∪ {j }
(13) else send PERMISSION (i) to pj
(14) end if.

when PERMISSION (j ) is received do


(15) waiting_fromi ← waiting_fromi \ {j }.

Fig. 10.6 Generalized mutex based on individual permissions (code for pi )

algorithm described in Fig. 10.3. It ensures that (a) the concurrency constraints ex-
pressed by the exclusion matrix are respected, and (b) the invocations which are not
executed concurrently are executed according to their timestamp order.
Only two lines need to be modified to take into account the synchronization type of the operation (their numbers are postfixed by ′). At line 4′, the request message has to carry the type of the corresponding operation. Then, at line 11′, the Boolean value of exclude[op_type, op_t] (where op_type is the type of the current operation of pi and op_t is the type of the operation that pj wants to execute) is used to compute the value of prioi , which determines whether pi has to send by return its permission to pj .
It is easy to see that, if there is a single operation op(), and this operation excludes itself (i.e., its type op_type is such that exclude[op_type, op_type] = true), the algorithm boils down to that of Fig. 10.3.

10.3 Adaptive Mutex Algorithms


Based on Individual Permissions

10.3.1 The Notion of an Adaptive Algorithm

The notion of an adaptive message-passing algorithm was introduced in Sect. 5.4.1,


in the context of a mobile object navigating a network. In the context of mutual
exclusion, it translates as follows. If after some time τ a process pi is no longer

interested in accessing the critical section, then there is a time τ′ ≥ τ after which
this process is no longer required to participate in the mutual exclusion algorithm. It
is easy to see that the mutex algorithm described in Fig. 10.3 is not adaptive: Each
time a process pi wants to enter the critical section, every other process pj has to
send it a permission, even if we always have cs_statej = out.
This section presents two adaptive mutex algorithms. The first one is obtained
from a simple modification of the algorithm of Fig. 10.3. The second one has the
noteworthy property of being both adaptive and bounded.

10.3.2 A Timestamp-Based Adaptive Algorithm

This algorithm is due to O. Carvalho and G. Roucairol (1983).

Underlying Principle: Shared Permissions Let us consider two processes pi


and pj such that pi wants to enter the critical section several times, while pj is
not interested in the critical section. The idea is the following: pi and pj share a
permission and, once pi has this permission, it keeps it until pj asks for it. Hence,
if pj is not interested in the critical section, pj will not reclaim it, and pi will not have to ask pj for it again.
As an example, let us consider the case where only pi wants to enter the critical
section and, due to its previous invocation of acquire_mutex(), it has the (n − 1)
permissions that it shares with every other process. In this scenario, pi can enter the
critical section without sending request messages, and its use of the critical section
then costs no message.
The previous idea can be easily implemented with a message PERMISSION ({i, j })
shared by each pair of processes pi and pj (PERMISSION ({j, i}) is not another per-
mission but a synonym of PERMISSION ({i, j })). Initially, this message is placed
either on pi or pj , and the set Ri , containing the identities of the processes from which pi has to ask for the permission, is initialized as follows

Ri = {j | PERMISSION ({j, i}) is initially on pj }.
Then, pi adds j to Ri when it sends PERMISSION ({i, j }) to pj , and suppresses k
from Ri when it receives PERMISSION ({k, i}) from pk .
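For instance, under the placement convention used later for the bounded algorithm of Sect. 10.3.3 (for i < j , the permission shared by pi and pj initially resides at pi ), each process can compute its initial set Ri locally, as the following small sketch (in Python, an illustration only) shows.

# Sketch of the initialization of R_i, assuming that, for i < j,
# PERMISSION({i,j}) is initially located at p_i.
def initial_R(i, n):
    # p_i must ask p_j for their shared permission iff it starts at p_j,
    # i.e., iff j < i under this convention
    return {j for j in range(1, n + 1) if j < i}

assert initial_R(1, 4) == set()      # p_1 initially holds all its shared permissions
assert initial_R(3, 4) == {1, 2}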

Timestamp-Based Adaptive Mutex Algorithm The corresponding algorithm is


described in Fig. 10.7.
The code implementing acquire_mutex() is nearly the same as in Fig. 10.3. The
only difference lies in the fact that the set Ri is no longer a constant. A process pi
sends a request message only to the processes from which it does not have shared
permission (line 3), and then waits until it has the (n − 1) permissions it individu-
ally shares with each other process (line 4). When it receives such a permission it
withdraws the corresponding process from Ri (line 17).
The code implementing release_mutex() is the same as in Fig. 10.3, with an ad-
ditional statement related to the management of Ri (line 8). This set takes the value

operation acquire_mutex() is
(1) cs_statei ← trying;
(2) rdi ← clocki + 1;
(3) for each j ∈ Ri do send REQUEST ( rdi , i) to pj end for;
(4) wait (Ri = ∅);
(5) cs_statei ← in.

operation release_mutex() is
(6) cs_statei ← out;
(7) for each j ∈ perm_delayedi do send PERMISSION ({i, j }) to pj end for;
(8) Ri ← perm_delayedi ;
(9) perm_delayedi ← ∅.

when REQUEST (k, j ) is received from pj do


(10) clocki ← max(clocki , k);
(11) prioi ← (cs_statei = in) ∨ ((cs_statei = trying) ∧ ( rdi , i < k, j ));
(12) if (prioi ) then perm_delayedi ← perm_delayedi ∪ {j }
(13) else send PERMISSION ({i, j }) to pj
(14) Ri ← Ri ∪ {j };
(15) if (cs_statei = trying) then send REQUEST ( rdi , i) to pj end if
(16) end if.

when PERMISSION ({i, j }) is received from pj do


(17) Ri ← Ri \ {j }.

Fig. 10.7 An adaptive mutex algorithm based on individual permissions (code for pi )

of perm_delayedi , which contains the set of processes to which pi has just sent the
permission it shares with each of them.
Finally, when pi receives a message REQUEST (k, j ), it has the priority if it is
currently inside the critical section or it is waiting to enter it (line 11). If its current
request has priority with respect to the request it has just received, it delays the send-
ing of its permission (line 12). If it does not have priority, it sends by return to pj the
message PERMISSION ({i, j }) (line 13) and adds j to Ri (line 14). Moreover, if pi
is competing for the critical section (cs_statei = trying), it sends a request message
to pj so that pj eventually returns to it the shared message PERMISSION ({i, j })
(line 15) so that it will be allowed to enter the critical section.

Adaptivity, Message Cost, and the Management of Local Clocks It is easy to


see that the algorithm is adaptive. If after some time a process pi does not invoke
acquire_mutex(), while other processes pj do invoke this operation, pi sends to
each of them the permission it shares with them, after which it will no longer receive
request messages.
It follows from the adaptivity property that the number of messages involved in
one use of the critical section is 2|Ri | (requests plus the associated permissions),
where 0 ≤ |Ri | ≤ n − 1. The exact number depends on the current state of the pro-
cesses with respect to their invocations of acquire_mutex() and release_mutex().
When considering the basic algorithm of Fig. 10.3, a process that invokes
acquire_mutex() sends a timestamped request message to each other process. This

Fig. 10.8 Non-FIFO channel in the algorithm of Fig. 10.7

allows the local clocks to be synchronized in the sense that, as it has been already
noticed, we always have |clocki − clockj | ≤ n − 1. Due to the adaptivity feature of
the previous algorithm, this local clock synchronization is no longer ensured and the
difference between any two local clocks cannot be bounded.

Non-FIFO Channels and Management of the Ri Sets It is possible that a pro-


cess pi sends to a process pj a message PERMISSION ({i, j }) followed by a message
REQUEST ( rdi , i) (lines 13 and 15). As the channel is not required to be FIFO, it
is possible that pj receives and processes first the message REQUEST ( rdi , i) and
then the message PERMISSION ({i, j }). Does this create a problem?
To see why the answer is “no”, let us consider Fig. 10.8, where are depicted a
request message from pi to pj , the corresponding permission message from pj to
pi , and a request message from pj to pi , which arrives at pi before the permission message. Predicates on the values of Ri and Rj are indicated on the correspond-
ing process axis. Due to line 10, which is executed when pj receives the message
REQUEST (h, i), and line 2, which is executed before pj sends the message RE -
QUEST (k, j ), we have k > h. It follows that, when pi receives REQUEST (k, j ), we
have h, i < k, j , i.e., prioi is true, and pi delays the sending of the permission to
pj (line 12). Hence, pi and pj do not enter a livelock in which they would repeat-
edly send to each other the permission message without ever entering the critical
section. The fact that the message REQUEST (k, j ) arrives before or after the mes-
sage PERMISSION ({i, j }) has no impact on the way it is processed.

10.3.3 A Bounded Adaptive Algorithm

The mutex algorithm that is presented in this section has two main properties: It
is adaptive and has only bounded variables. Moreover, process identities (whose
scope is global) can be replaced by channel identities whose scopes are local to
each process.
This algorithm is due to K.M. Chandy and J. Misra (1984). It is derived here from
the permission-based algorithms which were presented previously. To simplify the
presentation, we consider a version of the algorithm which uses process identities.

Fig. 10.9 States of the message PERMISSION ({i, j })

Principle of the Algorithm: State of a Permission As before, each pair of dis-


tinct processes pi and pj share a permission message denoted PERMISSION ({i, j }),
and, to enter the critical section, a process needs each of the (n − 1) permissions it
shares with each other process.
The main issue consists in ensuring, without using (unbounded) timestamps, that
each invocation of acquire_mutex() terminates. To that end, a state is associated
with each permission message; the value of such a state is used or new.
When a process pi receives the message PERMISSION ({i, j }) from pj , the state
of the permission is new. After it has benefited from this message to enter the critical
section, the state of the permission becomes used. The automaton associated with
the permission shared by pi and pj is described in Fig. 10.9.

Principle of the Algorithm: Establish a Priority on Requests The core of the


algorithm is the way it ensures that each invocation of acquire_mutex() terminates.
The state of a permission is used to establish a priority among conflicting requests.
Let perm_statei [j ] be a local variable of pi , which stores the current state of the
permission shared by pi and pj when this permission is at process pi .
When, while it has issued a request, process pi receives a request from pj , it has
priority on pj if one of the following cases occurs:
• cs_statei = in. (Similarly to the corresponding case in the algorithm of Fig. 10.7, pi trivially has priority because it is inside the critical section.)
• If cs_statei = trying, there are two subcases.
– Case j ∉ Ri . In this case, the permission shared by pi and pj is located at pi ,
and pi does not have to ask for it. The priority depends then on the state of
this permission. If perm_statei [j ] = new, pi has not yet used the permission
and consequently it has priority. If perm_statei [j ] = used, pi does not have
priority with respect to pj .
– Case j ∈ Ri . As pi receives a request from pj , process pj was such that i ∈
Rj when it sent the request, which means that pj did not have the message
PERMISSION ({i, j }). As j ∈ Ri , this permission message is not at process pi
either. It follows that this permission is still in transit from pj to pi (it has been
sent by pj before its request, but will be received by pi after pj ’s request).
When this permission message PERMISSION ({i, j }) is received by pi , its

operation acquire_mutex() is
(1) cs_statei ← trying;
(2) for each j ∈ Ri do send REQUEST () to pj end for;
(3) wait (Ri = ∅);
(4) cs_statei ← in;
(5) for each j ∈ {1, . . . , n} \ {i} do perm_statei [j ] ← used end for.

operation release_mutex() is
(6) cs_statei ← out;
(7) for each j ∈ perm_delayedi do send PERMISSION ({i, j }) to pj end for;
(8) Ri ← perm_delayedi ;
(9) perm_delayedi ← ∅.

when REQUEST () is received from pj do


(10) prioi ← (cs_statei = in)∨
((cs_statei = trying) ∧ [(perm_statei [j ] = new) ∨ (j ∈ Ri )]);
(11) if (prioi ) then perm_delayedi ← perm_delayedi ∪ {j }
(12) else send PERMISSION ({i, j }) to pj
(13) Ri ← Ri ∪ {j };
(14) if (cs_statei = trying) then send REQUEST () to pj end if
(15) end if.

when PERMISSION ({i, j }) is received from pj do


(16) Ri ← Ri \ {j };
(17) perm_statei [j ] ← new.

Fig. 10.10 A bounded adaptive algorithm based on individual permissions (code for pi )

state will be new, which means that the current request of pi has priority on
the request of pj .
This scenario, which is due to the fact that channels are not FIFO, is ex-
actly the scenario which has been depicted in Fig. 10.8 for the timestamp-based
adaptive mutex algorithm (in Fig. 10.8, as pi knows both the timestamp of its
last request and the timestamp of pj ’s request, a timestamp comparison is used
instead of the predicate j ∈ Ri ).
It follows that, when pi receives a request from pj , it has priority if

(cs_statei = in) ∨ ((cs_statei = trying) ∧ ((perm_statei [j ] = new) ∨ (j ∈ Ri ))).
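In code form, this priority predicate reads as follows (a small sketch with assumed state encodings, mirroring line 10 of Fig. 10.10).

# Sketch of the bounded priority predicate; perm_state[j] is "new" or "used",
# and R is the set of processes whose shared permission p_i does not hold.
def has_priority(cs_state, perm_state, R, j):
    return (cs_state == "in") or (
        cs_state == "trying" and (perm_state[j] == "new" or j in R)
    )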

Initialization As in both previous algorithms, the local variables cs_statei and


perm_delayedi of each process pi are initialized to out and ∅, respectively. More-
over, for each pair of processes pi and pj such that i < j , the message PERMIS -
SION ({i, j }) is initially at pi (hence, i ∈ Rj and j ∉ Ri ), and its initial state is used
(i.e., perm_statei [j ] = used).

Bounded Adaptive Mutex Algorithm The corresponding algorithm is described


in Fig. 10.10.

The code is nearly the same as in Fig. 10.7. The only modifications are the sup-
pression of the management of the local clocks, the addition of the management of
the permission states, and the appropriate modification of the priority predicate.
• Just before entering the critical section, a process pi indicates that the permission
it shares with any other process pj is now used (line 5).
• When pi receives (from pj ) the message PERMISSION ({i, j }), pi sets to new the
state of this permission (line 17).
• The computation of the value of the Boolean prioi is done as explained previously (the timestamp-based comparison ⟨rdi , i⟩ < ⟨k, j⟩ used in Fig. 10.7 is replaced by the predicate (perm_statei [j ] = new) ∨ (j ∈ Ri ), which is on bounded local variables).

Adaptivity and Cost As for the algorithm of Fig. 10.7, it is easy to see that
the algorithm is adaptive and each use of the critical section costs 2|Ri | messages,
where the value of |Ri | is such that 0 ≤ |Ri | ≤ n − 1 and depends on the current
state of the system. Moreover, the number of distinct messages is bounded.
As far as the local memory of a process pi is concerned, we have the following:
As in the previous algorithms, cs_statei , Ri , and perm_delayedi are bounded, and
so is perm_statei [1..n], which is an array of one-bit values.

10.3.4 Proof of the Bounded Adaptive Mutex Algorithm

An Acyclic Directed Graph The following directed graph G (which evolves ac-
cording to the requests issued by the processes) is central to the proof of the liveness
property of the bounded adaptive algorithm. The vertices of G are the n processes.
There is a directed edge from pi to pj (meaning that pj has priority over pi ) if:
• the message PERMISSION ({i, j }) is located at process pi and perm_statei [j ] =
used, or
• the message PERMISSION ({i, j }) is in transit from pi to pj , or
• the message PERMISSION ({i, j }) is located at process pj and perm_statej [i] =
new.
It is easy to see that the initial values are such that there is a directed edge from pi
to pj if and only if i < j . Hence, this graph is initially acyclic.

Lemma 5 The graph G always remains acyclic.

Proof Let us observe that the only statement that can change the direction of an
edge is (a) when a process uses the corresponding permission and (b) the previous
state of this permission was new. This occurs at line 5. If perm_statei [j ] = new
before executing perm_statei [j ] ← used, the execution of this statement changes
the priority edge from pj to pi into an edge from pi to pj (i.e., pj then has priority
with respect to pi ).

It follows that, whatever the values of the local variables perm_statei [x] before
pi executes line 5, after it has executed this line, the edges adjacent to pi are only
outgoing edges. Consequently, no cycle involving pi can be created by this state-
ment. It follows that, if the graph G was acyclic before the execution of line 5, it
remains acyclic. The fact that the graph is initially acyclic concludes the proof of
the lemma. 

Theorem 14 The algorithm described in Fig. 10.10 solves the mutex problem.

Proof Proof of the safety property. It follows from the initialization that, for any pair
of processes pi and pj, we initially have either (i ∈ Rj) ∧ (j ∉ Ri) or (j ∈ Ri) ∧
(i ∉ Rj). Then, when a process pi sends a permission it adds the corresponding
destination process to Ri (lines 7–8, or lines 12–13). Moreover, when it receives a
permission from a process pj , pi suppresses this process from Ri . It follows that
there is always a single copy of each permission message.
Due to the waiting predicate of line 3, a process pi has all the permissions it
shares with each other process when it is allowed to enter the critical section. From
then on, and until it executes release_mutex(), we have cs_statei = in. Hence, dur-
ing this period, the Boolean variable prioi cannot be false, and consequently, pi
does not send permissions. As permissions are not duplicated, and there is a single
permission shared by any pair of processes, it follows that, while cs_statei = in, no
process pj has the message PERMISSION ({i, j }) that it needs to enter the critical
section, which proves the safety property of the bounded adaptive mutex algorithm.
Proof of the liveness property. Considering a process pi in the acyclic graph
G, let height(i) be the maximal length of a path from pi to a process without
outgoing edges. Let pi be a process such that cs_statei = trying, and k = height(i).
The proof consists in showing, by induction on k, that eventually pi is such that
cs_statei = in.
Base case: k = 0. In this case, pi has only incoming edges in G. Let us consider
any other process pj .
• It follows from the directed edge from pj to pi in G that, if the message PER -
MISSION ({i, j }) is at pi (or in transit from pj to pi ), its state is (or will be) new.
It then follows from the priority computed at line 10 that, even if it receives a
request from pj , pi keeps the permission until it invokes release_mutex().
• If the message PERMISSION({i, j}) is at pj and pj is such that cs_statej ≠ in, it
follows from the directed edge from pj to pi in G that perm_statej[i] = used.
Hence, when it receives pi's request, pj sends the message PERMISSION({i, j}) to pi.
As the state of this message is set to new when it arrives, pi keeps it until it invokes release_mutex().
• If the message PERMISSION ({i, j }) is at pj and pj is such that cs_statej = in,
the previous item applies after pj invoked release_mutex().
It follows that pi eventually obtains and keeps the (n − 1) messages, which allows
it to enter the critical section (lines 3–4).
Induction case: k > 0. Let us assume that all the processes that are at a height
≤ k − 1 eventually enter the critical section. Let pi be a process whose height is k.

This process has incoming edges and outgoing edges. As far as the incoming edges
are concerned, the situation is the same as in the base case, and pi will eventually
obtain and keep the corresponding permissions.
Let us now consider an outgoing edge from pi to some process pj . The height
of pj is ≤ k − 1. Moreover, (a) the message PERMISSION ({i, j }) is at pi in the state
used, or (b) is in transit from pi to pj , or (c) is at pj in the state new. As the height
of pj is ≤ k − 1, it follows from the induction assumption, that pj eventually enters
the critical section. When pj does, the state of the permission becomes used, and
the direction of the edge connecting pi and pj is then from pj to pi . Hence, the
height of pi becomes eventually ≤ k − 1, and the theorem follows. 

10.4 An Algorithm Based on Arbiter Permissions

10.4.1 Permissions Managed by Arbiters

Meaning of an Arbiter Permission In an algorithm based on arbiter permissions


(similarly to the algorithms based on individual permissions), each process pi must
ask the permission of each process in its request set Ri . As indicated in Sect. 10.1.2,
the important difference lies in the meaning of a permission whose scope is no
longer restricted to a pair of processes.
A permission has now a global scope. More precisely, let PERMISSION (i) be
the permission managed by pi . When it gives its permission to pj , pi gives this
permission, not on behalf of itself, but on behalf of all the processes that need this
permission to enter the critical section. It follows that, when a process pj exits
the critical section, it has to send back to each process pi such that i ∈ Rj the
permission it has previously obtained from it. This is needed to allow pi to later
give the permission PERMISSION (i) it manages to another requesting process.

Mutex Safety from Intersecting Sets As a process gives its permission to only
one process at a time, the safety property of the mutual exclusion problem is ensured
if we have

∀i, j: Ri ∩ Rj ≠ ∅.

This is because, as there is at least one process pk such that k ∈ Ri ∩ Rj , this process
cannot give PERMISSION (k) to pj , if it has sent it to pi and pi has not yet returned
it. Such a process pk is an arbiter for the conflicts between pi and pj . According to
the definition of the sets Ri and Rj , the conflicts between the processes pi and pj
can be handled by one or more arbiters.

Fig. 10.11 Arbiter permission-based mechanism

Example This arbiter-based mechanism is depicted in Fig. 10.11, where there


are five processes, p1 , p2 , p3 , p4 , and p5 , R1 = {2, 3} and R5 = {3, 4}. As 3 ∈
R1 ∩ R5 , process p3 is an arbiter for solving the conflicts involving p1 and p5 .
Hence, as soon as p3 has sent PERMISSION (3) to p1 , it cannot honor the request
from p5 . Consequently, p3 enqueues this request in a local queue in order to be
able to satisfy it after p1 has returned the message PERMISSION (3) to it. (A similar
three-way handshake mechanism was described in Fig. 5.1, where the home process
of a mobile object acts as an arbiter process.)

10.4.2 Permissions Versus Quorums

Quorums A quorum system is a set of pairwise intersecting sets. Each intersect-


ing set is called a quorum. Hence, in mutual exclusion algorithms based on arbiter
permissions, each set Ri is a quorum and the sets R1 , . . . , Rn define a quorum sys-
tem.

The Case of a Centralized System An extreme case consists in defining a quo-


rum system made up of a single quorum containing a single process, i.e.,
∀i: Ri = {k}.
In this case, pk arbitrates all the processes, and the control is consequently central-
ized. (This corresponds to the home-based solution for a mobile object—token—
navigating a network; see Chap. 5.)

Constraints on the Definition of Quorums Ideally, a quorum system should be


symmetric and optimal, i.e., such that
• All quorums have the size K, i.e., ∀i: |Ri | = K. This is the “equal effort rule”: All
the processes need the same number of permissions to enter the critical section.

Fig. 10.12 Values of K and D for symmetric optimal quorums

• Each process pi belongs to the same number D of quorums i.e., ∀i: |{j | i ∈
Rj }| = D. This is the “equal responsibility rule”: All the processes are engaged
in the same number of quorums.
• K and D have to be as small as possible. (Of course, a solution in which ∀i: Ri =
{1, . . . , n} works, but we would then have K = D = n, which is far from being
optimal.)
The first two constraints are related to symmetry, while the third one is on the opti-
mality of the quorum system.

10.4.3 Quorum Construction

Optimal Values of K and D Let us observe that the previous symmetry and
optimality constraints on K and D link these values. More precisely, as both nK and
nD represent the total number of possible arbitrations, the relation K = D follows.
To compute their smallest value, let us count the greatest possible number of
different sets Ri that can be built. Let us consider a set Ri = {q1 , . . . , qK } (all qj are
distinct and qj is not necessarily pj ). Due to the definition of D, each qj belongs
to Ri and (D − 1) other distinct sets. Hence, an upper bound on the total number
of distinct quorums that can be built is 1 + K(D − 1). (The value 1 comes from the
initial set Ri , the value K comes from the size of Ri , and the value D − 1 is the
number of quorums to which each qj belongs—in addition to Ri —see Fig. 10.12.)
As there is a quorum per process and there are n processes, we have
n = K(K − 1) + 1.
It follows that the lower bound on K and D which satisfies both the symmetry and
optimality constraints is K = D ≃ √n.

Finite Projective Planes Finding n sets Ri satisfying K = D ≃ √n amounts to
finding a finite projective plane of n points. Such planes exist of order k when k is a

Fig. 10.13 An order two projective plane

Table 10.1 Defining quorums from a √n × √n grid:

    12   8   5   9
     6   2  13   1
     3   4  10   7
    14  11   8   5

power of a prime number. Such a plane has n = k(k + 1) + 1 points and the same
number of lines. Each point belongs to (k + 1) distinct lines, and each line is made
up of (k + 1) points. Two distinct points share a single line, and two distinct lines
meet a single point. A projective plane with n = 7 points (i.e., k = 2) is depicted
in Fig. 10.13. (The points are marked with a black bullet. As an example, the lines
“1, 6, 5” and “3, 2, 5” meet only at the point denoted “5”.) A line defines a quorum.
Being optimal, any two quorums (lines) defined from a finite projective plane have
a single process (point) in common. Unfortunately, finite projective planes do not
exist for all values of n.

Grid Quorums A simple way to obtain quorums of size O(√n) consists in
arbitrarily placing the processes in a square grid. If n is not a square, (⌈√n⌉)2 − n
arbitrary processes can be used several times to complete the grid. An example with
n = 14 processes is given in Table 10.1. As (⌈√14⌉)2 − 14 = 2, two processes are
used twice to fill the grid (namely, p5 and p8 appear twice in the grid).
A quorum Ri then consists of all the processes in a line plus one process from
each other line. As an example, the set {6, 2, 13, 1, 8, 4, 14} constitutes a quorum.
As any quorum includes a full line of the grid and one process from every line, it
follows from this construction rule that any two quorums intersect. Moreover, due to
the grid structure, and according to the value of n, we have ⌈√n⌉ ≤ |Ri| ≤ 2⌈√n⌉ − 1.
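The construction and its intersection property can be checked with the following
Python sketch (restricted, for simplicity, to the case where n is a perfect square;
the per-line representatives are taken here in the process's own column, which
yields the classic line-plus-column quorum):

    import math

    def grid_quorums(n):
        # A sketch, assuming n is a perfect square (when it is not, the grid
        # would be completed by reusing some processes, as in Table 10.1).
        # The quorum of the process placed at (r, c) is its full line plus
        # one process per line; here the per-line representative is simply
        # taken in column c.
        side = math.isqrt(n)
        assert side * side == n, "sketch restricted to perfect squares"
        grid = [[r * side + c for c in range(side)] for r in range(side)]
        quorums = []
        for r in range(side):
            for c in range(side):
                line = set(grid[r])                           # the full line
                per_line = {grid[x][c] for x in range(side)}  # one per line
                quorums.append(line | per_line)
        return quorums

    quorums = grid_quorums(16)
    assert all(q1 & q2 for q1 in quorums for q2 in quorums)  # pairwise intersection
    print(len(quorums[0]))  # 2 * sqrt(n) - 1 = 7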

Quorums and Antiquorums Let R = {R1, . . . , Rn} be a set of quorums. An
antiquorum (with respect to R) is a set R′ such that ∀i: R′ ∩ Ri ≠ ∅. Let us observe
that two antiquorums are not required to intersect.
An arbiter permission-based system that solves the readers/writers problem can
be easily designed as follows. A quorum Ri and an antiquorum R′i are associated
with each process pi. The quorum Ri defines the set of processes from which pi
must obtain the permission in order to write, while the antiquorum R′i defines the
set of processes from which it must obtain the permission in order to read.

When considering grid-based quorums, an antiquorum can be defined as composed
of all the processes in a column. The size of such an antiquorum is consequently
⌈√n⌉. It is easy to see that we then have, for all pairs (i, j):
• Ri ∩ Rj ≠ ∅, which ensures mutual exclusion among each pair of concurrent
write operations.
• R′i ∩ Rj ≠ ∅, which ensures mutual exclusion among each pair of concurrent read
and write operations.
It is also easy to see that the square grid structure can be replaced by a rectangle.
When this rectangle becomes a vector (one line and n columns) a write operation
needs all permissions, while a read operation needs a single permission. This ex-
treme case is called ROWA (Read One, Write All).

Crumbling Walls In a crumbling wall, the processes are arranged in several lines
of possibly different lengths (hence, all quorums will not have the same size). A quo-
rum is then defined as a full line, plus a process from every line below this full line.
A triangular quorum system is a crumbling wall in which the processes are ar-
ranged in such a way that the ℓth line has ℓ processes (except possibly the last line).

Vote-Based Quorums Quorums can also be defined from weighted votes as-
signed to processes, which means that each process has a weighted permission.
A vote is nothing more than a permission. Let S be the sum of the weights of all
the votes. A quorum is then a set of processes whose weighted votes sum to more
than S/2. This vote system is called majority voting.
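A minimal Python sketch of the majority-voting test (the weights below are
hypothetical and only for illustration):

    def is_vote_quorum(weights, group):
        # Majority voting: a set of processes is a quorum iff the sum of its
        # weighted votes is greater than half of the total weight S.
        total = sum(weights.values())            # S
        return 2 * sum(weights[p] for p in group) > total

    weights = {"p1": 3, "p2": 1, "p3": 1}        # hypothetical weights, S = 5
    assert is_vote_quorum(weights, {"p1"})            # 2*3 > 5
    assert not is_vote_quorum(weights, {"p2", "p3"})  # 2*2 <= 5

Since every quorum carries more than half of the total weight, any two quorums
necessarily share at least one process.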

10.4.4 An Adaptive Mutex Algorithm


Based on Arbiter Permissions

The algorithm that is presented in this section is a variant of an algorithm proposed
by M. Maekawa (1985). The introduction of quorums of size ≃ √n (defined from
projective planes and grids) is also due to M. Maekawa.

Two Simplifying Assumptions To simplify the presentation, the channels are


assumed to be FIFO. Moreover, while in practice it is interesting to have i ∈ Ri (to
save messages), we assume here that i ∉ Ri. This simplification allows for a simpler
explanation of the behavior of pi as a client (which wants to enter the critical section
and, to that end, asks for and releases permissions), and its behavior as a server
(which implements an arbiter by managing and granting a permission).

On the Safety Side This section describes a first sketch of an algorithm where the
focus is only on the safety property. This safe algorithm is described in Fig. 10.14.
The meaning of the local variables is the same as in the previous permission-based
algorithms. An empty queue is denoted ∅.

operation acquire_mutex() is
(1) cs_statei ← trying;
(2) waiting_fromi ← Ri ;
(3) for each j ∈ Ri do send REQUEST () to pj end for;
(4) wait (waiting_fromi = ∅);
(5) cs_statei ← in.

when PERMISSION (j ) is received from pj do % j ∈ Ri %


(6) waiting_fromi ← waiting_fromi \ {j }.

operation release_mutex() is
(7) cs_statei ← out;
(8) for each j ∈ Ri do send PERMISSION (j ) to pj end for.

when REQUEST () is received from pj do % i ∈ Rj %


(9) if (perm_herei ) then send PERMISSION (i) to pj ;
(10) perm_herei ← false
(11) else append pj ’s request to queuei
(12) end if.

when PERMISSION (i) is received from pj do % i ∈ Rj %


(13) withdraw pj ’s request from queuei ;
(14) if (queuei ≠ ∅) then let pk be the process at the head of queuei ;
(15) send PERMISSION (i) to pk
(16) end if.

Fig. 10.14 A safe (but not live) mutex algorithm based on arbiter permissions (code for pi )

On the client side we have the following (lines 1–8). When a process invokes
acquire_mutex(), it sends a request to each process pj of its request set Ri to obtain
the permission PERMISSION (j ) managed by pj (line 3). Then, when it has received
all the permissions (lines 4 and 6), pi enters the critical section (line 5). When it
releases the critical section, pi returns each permission to the corresponding arbiter
process (line 8).
On the arbiter side, the behavior of pi is as follows (lines 9–16). The local
Boolean variable perm_herei is equal to true if and only if pi has the permission it
manages (namely, PERMISSION (i)).
• When it receives a request from a process pj (we have then i ∈ Rj ), pi sends it
the permission it manages (PERMISSION (i)) if it has this permission (lines 9–10).
Otherwise, it adds the request of pj to a local queue (denoted queuei ) in order to
serve it later.
• When pi is returned its permission from a process pj , it suppresses pj ’s request
from its queue queuei , and, if this queue is not empty, it sends PERMISSION (i) to
the first process of this queue. Otherwise, it keeps its permission until a process
pj (such that i ∈ Rj) requests it.

The Liveness Issue Due to the quorum intersection property (∀i, j : Ri ∩ Rj = ∅)


and the management of permissions (which are returned to their managers after be-
ing used), the previous algorithm satisfies the safety property of the mutex problem.

Fig. 10.15 Permission preemption to prevent deadlock

Unfortunately, this algorithm is deadlock-prone. As a simple case, let us consider
two processes pj1 and pj2 such that i1, i2 ∈ Rj1 ∩ Rj2. If pi1 gives its permission
to pj1 while concurrently pi2 gives its permission to pj2, neither pj1 nor pj2 will
obtain both permissions, and each will wait forever for the other permission.

Such a deadlock scenario may occur even if all the intersections of the sets Ri
contain a single process. To see this, it is sufficient to consider a system of three pro-
cesses such that R1 = {2, 3}, R2 = {1, 3}, R3 = {1, 2}, and an execution in which p1
gives its permission to p2 , p2 gives its permission to p3 , and p3 gives its permission
to p1 . When this scenario occurs, no process can progress.

Solving the Liveness Issue, Part 1: Using Timestamps Two additional mech-
anisms are introduced to solve the liveness issue. The first is a simple scalar clock
system that permits us to associate a timestamp with each request issued by a pro-
cess. The local clock of a process pi , denoted clocki , is initialized to 0. As we saw in
Chap. 7 and Sect. 10.2.1, this allows requests to be totally ordered, and consequently
provides us with a simple way to establish priority among conflicting requests.

Solving the Liveness Issue, Part 2: Permission Preemption As a process is


an arbiter for a subset of processes and not for all processes, it is possible that a
process pi gives its permission to a process pj1 (whose request is timestamped
⟨h1, j1⟩) and, while this permission has not yet been returned to it, receives a request
timestamped ⟨h2, j2⟩ such that ⟨h2, j2⟩ < ⟨h1, j1⟩. As the request from pj2 has a
smaller timestamp, it has priority over the request of pj1.
In that case, pi asks pj 1 to yield the permission in order to prevent a possible
deadlock. This is described in Fig. 10.15.
• pi first receives a request message from pj1, timestamped ⟨h1, j1⟩. As queuei =
∅, pi sends its permission to pj1 and adds ⟨h1, j1⟩ to queuei.
• Then pi receives a request message from pj2, timestamped ⟨h2, j2⟩ and such that
⟨h2, j2⟩ < ⟨h1, j1⟩. pi then adds ⟨h2, j2⟩ to queuei (which is always ordered
according to timestamp order, with the smallest timestamp at its head), and sends
the message YIELD_PERM() to pj1 in order to be able to serve the request with
the highest priority (here the request from pj2, which has the smallest timestamp).
• Then pi receives a request message from pj3. The timestamp ⟨h3, j3⟩ of this
message is such that ⟨h2, j2⟩ < ⟨h3, j3⟩ < ⟨h1, j1⟩. As this timestamp is not
smaller than the timestamp at the head of the queue, pi only inserts it in queuei.
• When pi receives PERMISSION(i) back from pj1, pi forwards it to the process
whose request is at the head of its queue (namely pj2).
• Finally, when pj2 returns its permission to pi, pi forwards it to the process which
is now at the head of its queue, namely pj3.
The resulting algorithm, which is an extension of the safe algorithm of Fig. 10.14,
is described in Fig. 10.16. On the client side we have the following:
• When a process pi invokes acquire_mutex(), it increases its local clock (line 3)
and sends a message timestamped ⟨clocki, i⟩ (line 4).
• When pi receives a permission with a date d, pi updates its local clock (line 7)
and its waiting set waiting_fromi (line 8).
• When a process pi invokes release_mutex(), it returns its permission to each
process of the set Ri (line 10).
• When pi receives a message YIELD_PERM() from one of its arbiters pj, it re-
turns its permission to pj only if its local state is such that cs_statei = trying
(lines 21–23). In this case, as it no longer has the permission of pj, it also adds j
to waiting_fromi.
If cs_statei = in when pi receives a message YIELD_PERM() from one of
its arbiters pj, it does nothing. This is because, as it is in the critical section, it
will return pj's permission when it invokes release_mutex(). Let us finally
notice that it is not possible to have cs_statei = out when pi receives a message
YIELD_PERM(). This is because, if pi receives such a message, it is because it has
previously received the corresponding permission (let us recall that the channels
are FIFO).
On the arbiter side, the behavior of pi is defined by the statements it executes
when it receives a message REQUEST() or RETURNED_PERM(), or when a process
sends back to it the message PERMISSION(i).
• When pi receives a message REQUEST(d) from a process pj, it first updates its
clock (line 11) and adds the corresponding timestamp ⟨d, j⟩ to queuei (line 12).
Then, if PERMISSION(i) is here, pi sends it by return to pj (lines 13–14). In
this case, the permission message carries the current value of clocki. This is to
allow the client pj to update its local clock (line 7). In this way, thanks to the
intersection property of the sets Ri and Rj, the local clocks of the processes are
forced to progress (and this global progress of local clocks allows each request to
eventually have the smallest timestamp).
If PERMISSION(i) is not here, pi has sent this permission message to some
process pk (whose identity has been saved in sent_toi). If ⟨d, j⟩ is the smallest
timestamp in queuei and pi has not yet reclaimed its permission from pk, pi
reclaims it (lines 15–18).

operation acquire_mutex() is
(1) cs_statei ← trying;
(2) waiting_fromi ← Ri ;
(3) clocki ← clocki + 1;
(4) for each j ∈ Ri do send REQUEST (clocki ) to pj end for;
(5) wait (waiting_fromi = ∅);
(6) cs_statei ← in.

when PERMISSION (j, d) is received from pj do % j ∈ Ri %


(7) clocki ← max(clocki , d);
(8) waiting_fromi ← waiting_fromi \ {j }.

operation release_mutex() is
(9) cs_statei ← out;
(10) for each j ∈ Ri do send PERMISSION (j ) to pj end for.

when REQUEST (d) is received from pj do


(11) clocki ← max(clocki , d);
(12) add the timestamp ⟨d, j⟩ to queuei and sort it according to timestamp order;
(13) if (perm_herei ) then send PERMISSION (i, clocki ) to pj ;
(14) sent_toi ← j ; perm_herei ← false
(15) else if ((⟨d, j⟩ = head of queuei ) ∧ (¬ perm_askedi ))
(16) then let k = sent_toi ;
(17) send YIELD _ PERM () to pk ;
(18) perm_askedi ← true
(19) end if
(20) end if.

when YIELD _ PERM () is received from pj do % j ∈ Ri %


(21) if (cs_statei = trying)
(22) then send RETURNED _ PERM (j ) to pj ; waiting_fromi ← waiting_fromi ∪ {j }
(23) end if.

when RETURNED _ PERM (i) is received from pj do % i ∈ Rj %


(24) let ⟨−, k⟩ = head of queuei ;
(25) send PERMISSION (i) to pk ;
(26) sent_toi ← k; perm_askedi ← false.

when PERMISSION (i) is received from pj do % i ∈ Rj %


(27) withdraw ⟨−, j⟩ from queuei ;
(28) if (queuei ≠ ∅) then let ⟨−, k⟩ = head of queuei ;
(29) send PERMISSION (i, clocki ) to pk ;
(30) sent_toi ← k; perm_askedi ← false
(31) end if.

Fig. 10.16 A mutex algorithm based on arbiter permissions (code for pi )

• When pi receives RETURNED _ PERM () from pj , it forwards the message PER -


MISSION (i) to the process at the head of the queue, which is the requesting pro-
cess with the smallest timestamp (lines 24–26). As the request of pj remains
pending, pi does not suppress pj ’s timestamp from queuei .

• When pi receives from pj the message PERMISSION(i) it manages, pi first sup-
presses pj's timestamp from queuei (line 27). Then, if queuei is not empty,
pi sends its permission (with its local clock value) to the first process in queuei
(lines 28–31).

Message Cost of the Algorithm The number of messages generated by one


use of the critical section depends on the current system state. In the best case, a
single process pi wants to use the critical section, and consequently 3|Ri | mes-
sages are used. When several processes are competing, the average number of mes-
sages per use of the critical section can be up to 6|Ri|. This is due to the follow-
ing observations. First, a YIELD_PERM() message can entail the sending of a RE-
TURNED_PERM() message and, in that case, the permission message will have to be
sent again to the yielding process. Second, it is possible that several YIELD_PERM()
messages are (sequentially) sent to the same process pi by the same arbiter pj (each
other process pk such that j ∈ Rk can entail such a scenario once).

10.5 Summary

The aim of this chapter was to present the mutual exclusion problem and the class
of permission-based algorithms that solve it. Two types of permission-based algo-
rithms have been presented, algorithms based on individual permissions and algo-
rithms based on arbiter permissions. They differ in the meaning of a permission. In
the first case a permission engages only its sender, while it engages a set of pro-
cesses in the second case. Arbiter-based algorithms are also known under the name
quorum-based algorithms. An aim of the chapter was to show that the concept of a
permission is a fundamental concept when one has to solve exclusion problems. The
notions of an adaptive algorithm and a bounded algorithm have also been introduced
and illustrated.
The algorithms presented in Chap. 5 (devoted to mobile objects navigating a
network) can also be used to solve the mutex problem. In this case, the mobile object
is usually called a token, and the corresponding algorithms are called token-based
mutex algorithms.

10.6 Bibliographic Notes

• The mutual exclusion problem was introduced in the context of shared memory
systems by E.W. Dijkstra, who presented its first solution [109].
• One of the very first solutions to the mutex problem in message-passing systems
is due to L. Lamport [226]. This algorithm is based on a general state-machine
replication technique. It requires 3n messages per use of the critical section.

• The mutex problem has received many solutions both in shared memory sys-
tems (e.g., see the books [317, 362]) and in message-passing systems [306].
Surveys and taxonomies of message-passing mutex algorithms are presented in
[310, 333, 349].
• The algorithm presented in Sect. 10.2.1 is due to G. Ricart and A.K. Agrawala
[327].
• The unbounded adaptive algorithm presented in Sect. 10.3.2 is due to O. Carvalho
and G. Roucairol [71]. This algorithm was obtained from a Galois field-based
systematic distribution of an assertion [70].
• The bounded adaptive algorithm presented in Sect. 10.3.3 is due to K.M. Chandy
and J. Misra [78]. This is one of the very first algorithms that used the notion of
edge reversal to maintain a dynamic cycle-free directed graph. A novelty of this
paper was the implementation of edge reversals with the notion of a new/used
permission.
• The algorithm based on arbiter permissions is due to M. Maekawa [243], who
was the first to define optimal quorums from finite projective planes.
• The notion of a quorum was implicitly introduced by R.H. Thomas and D.K.
Gifford in 1979, who were the first to introduce the notion of a vote to solve
resource allocation problems [159, 368].
The mathematics which underly quorum and vote systems are studied in [9, 42,
147, 195]. The notion of an anti-quorum is from [41, 147]. Tree-based quorums
were introduced in [7]. Properties of crumbling walls are investigated in [294].
Availability of quorum systems is addressed in [277]. A general method to define
quorums is presented in [281]. A monograph on quorum systems was recently
published [380].
• Numerous algorithms for message-passing mutual exclusion have been proposed.
See [7, 68, 226, 239, 282, 348] to cite a few. A mutex algorithm combining in-
dividual permissions and arbiter permissions is presented in [347]. An algorithm
for arbitrary networks is presented in [176].

10.7 Exercises and Problems

1. Let us consider the mutex algorithm based on individual permission described


in Sect. 10.2. The aim here is to prevent the phenomenon depicted in Fig. 10.5.
More precisely, if pi and pj are conflicting and pi wins (i.e., the timestamp as-
sociated with the request of pi is smaller than the timestamp associated with the
request of pj ), then pj has to enter the critical section before the next invocation
of pi .
• Modify the algorithm so that it satisfies the previous sequencing property.
• Obtain the sequencing property by adding a simple requirement on the behav-
ior of the channels (and without modifying the algorithm).
• Which of the previous solutions is the most interesting? Why?

2. When considering the mutex algorithm based on individual permission described


in Sect. 10.2, each use of the critical section by a process requires 2(n − 1) mes-
sages. Modify this algorithm in such a way that each use of the critical section
requires exactly n messages when the fully connected network is replaced by a
unidirectional ring network.
Solution in [327].
3. When considering the mutex algorithm based on individual permission described
in Sect. 10.2, show that the relation |clocki − clockj | ≤ n − 1 is invariant. Exploit
this property to obtain an improved algorithm in which the clock value domain
is bounded by 2n − 1.
Solution in [327].
4. Rewrite the bounded adaptive mutex algorithm described in Fig. 10.10, in such
a way that the identities of the processes (which have a global meaning) are
replaced by channel identities (which have a local meaning).
5. When considering the bounded adaptive mutex algorithm described in Fig. 10.10,
can the predicate “j ∈ Ri ” (used to compute the priority at line 10) be suppressed
when the channels are FIFO?
6. When considering the mutex algorithm based on arbiter permissions described
in Fig. 10.16, is the process identity j or i necessary in the following messages
received by pi: PERMISSION(j, d), RETURNED_PERM(i), and PERMISSION(i)?
7. To simplify its presentation, the algorithm described in Fig. 10.16 assumes that
i ∉ Ri. To save messages, it is interesting to always have i ∈ Ri. Modify the
algorithm so that it works when, for any i, we have i ∈ Ri .
Chapter 11
Distributed Resource Allocation

This chapter is on resource allocation in distributed systems. It first considers the


case where there are M instances of the same resource, and a process may request
several instances of it. The corresponding resource allocation problem is called the
k-out-of-M problem (where k, 1 ≤ k ≤ M, stands for the dynamically defined
number of instances requested by a process). Then, the chapter addresses the case
where there are several resources, each with a single or several instances.
The multiplicity of resources may generate deadlocks if resources are arbitrar-
ily allocated to processes. Hence, the chapter visits deadlock prevention techniques
suited to resource allocation. It also introduces the notion of a conflict graph among
processes. Such a graph is a conceptual tool, which captures the possible conflicts
among processes, when each resource can be accessed by a subset of processes. Fi-
nally, the chapter considers two distinct cases, depending on whether the subset of
resources required by a process is always the same or may vary from one resource
session to another.

Keywords Conflict graph · Deadlock prevention · Graph coloring ·


Incremental requests · k-out-of-M problem · Permission · Resource allocation ·
Resource graph · Resource type · Resource instance · Simultaneous requests ·
Static/dynamic (resource) session · Timestamp · Total order · Waiting chain ·
Wait-for graph

11.1 A Single Resource with Several Instances


11.1.1 The k-out-of-M Problem

The mutual exclusion problem considers the case of a single resource with a sin-
gle instance. A simple generalization is when there are M instances of the same
resource, and a process may require several instances of it. More precisely, (a) each
instance can be used by a single process at a time (mutual exclusion), and (b) each
time it issues a request, a process specifies the number k of instances that it needs
(this number is specific to each request).
Solving the k-out-of-M problem consists in ensuring that, at any time:
• no resource instance is accessed by more than one process,


• each process is eventually granted the number of resource instances it has re-
quested, and
• when possible, resource instances have to be granted simultaneously. This last
requirement is related to efficiency (if M = 5 and two processes ask for k = 2
and k′ = 3 resource instances, respectively, and no other process is using or re-
questing resource instances, these two processes must be allowed to access them
simultaneously; a small sketch of this admission rule is given below).
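The admission rule underlying these requirements can be made concrete by a
minimal, centralized Python sketch (the class and method names are ours; the
distributed algorithms of this chapter implement the same rule without any
central allocator):

    class KOutOfM:
        # m: total number of instances; used: instances currently allocated.
        def __init__(self, m):
            self.m = m
            self.used = 0

        def try_acquire(self, k):
            # Grant the k requested instances only if enough are free;
            # otherwise the caller must wait (here: retry later).
            assert 1 <= k <= self.m
            if self.used + k <= self.m:
                self.used += k
                return True
            return False

        def release(self, k):
            self.used -= k

    alloc = KOutOfM(5)
    assert alloc.try_acquire(2) and alloc.try_acquire(3)  # k=2 and k'=3 together
    assert not alloc.try_acquire(1)                       # all 5 instances busy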

11.1.2 Mutual Exclusion with Multiple Entries:


The 1-out-of-M Mutex Problem

The 1-out-of-M problem is when every request of a process is always for a single
resource instance. This problem is also called mutual exclusion with multiple entries.
This section presents an algorithm that solves the 1-out-of-M problem. This al-
gorithm, which is due to K. Raymond (1989), is a straightforward adaptation of the
mutex algorithm based on individual permission described in Sect. 10.2.

A Permission-Based Predicate As in Sect. 10.2, a process pi that wants to ac-


quire an instance of the resource, asks for the individual permission of each other
process pj . As there are M instances of the resource, pi has to wait until it knows
that at most (M − 1) other processes are using a resource instance. This translates
into the waiting statement: Wait until (n − 1) − (M − 1) = n − M permissions have
been received.
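As a simple numerical illustration, with n = 5 processes and M = 2 instances, a
requesting process waits until it has received (5 − 1) − (2 − 1) = 3 permissions;
the single permission that may still be missing can be held by the (at most one)
other process currently using the second instance.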

Management of Individual Permissions When a process has received (n − M)


permissions, it is allowed to use an instance of the resource. The (M − 1) permis-
sions that have not yet arrived, will be received later, i.e., while pi is using the
resource instance it has obtained, after it has used it, or even during a new use of a
resource instance.
Moreover, let us notice that a process pi may delay the sending of several per-
missions with respect to another process pj . As an example, let us consider three
processes, p1 , p2 , and p3 , and two instances of a resource (M = 2). Process p1
obtains n − M = 1 permission (from p2 or p3 ) and uses a resource instance for a
very long period of time. During this period of time, process p2 may use the other
resource instance several times due to the permission sent by p3 (which is not inter-
ested in the resource). In this scenario, process p1 receives several requests from p2
that it will answer only when it releases its instance of the resource.

Local Variables of a Process pi In addition to the local variables cs_statei ,


clocki , rdi , and the constant set Ri = {1, . . . , n} \ {i}, and according to the previous
discussion on the permissions sent and received by the processes, each process pi
manages two local arrays of non-negative integers (both initialized to [0, . . . , 0]),
and a non-negative integer variable.

operation acquire_resource() is
(1) cs_statei ← trying;
(2) rdi ← clocki + 1;
(3) nb_permi ← 0;
(4) for each j ∈ Ri do send REQUEST (⟨rdi , i⟩) to pj ;
(5) wait_permi [j ] ← wait_permi [j ] + 1
(6) end for;
(7) wait (nb_permi ≥ n − M);
(8) cs_statei ← in.

operation release_resource() is
(9) cs_statei ← out;
(10) for each j such that perm_delayedi [j ] ≠ 0 do
(11) send PERMISSION (i, perm_delayedi [j ]) to pj ; perm_delayedi [j ] ← 0
(12) end for.

when REQUEST (k, j ) is received do


(13) clocki ← max(clocki , k);
(14) prioi ← (cs_statei = in) ∨ [(cs_statei = trying) ∧ (⟨rdi , i⟩ < ⟨k, j⟩)];
(15) if (prioi ) then perm_delayedi [j ] ← perm_delayedi [j ] + 1
(16) else send PERMISSION (i, 1) to pj
(17) end if.

when PERMISSION (j, x) is received do


(18) wait_permi [j ] ← wait_permi [j ] − x;
(19) if ((cs_statei = trying) ∧ (wait_permi [j ] = 0)) then nb_permi ← nb_permi + 1 end if.

Fig. 11.1 An algorithm for the multiple entries mutex problem (code for pi )

• wait_permi [1..n] is an array such that wait_permi [j ] contains the number of per-
missions that pi is waiting on from pj .
• perm_delayedi [1..n] is an array such that perm_delayedi [j ] contains the number
of permissions that pi has to send to pj when it will release the resource instance
it is currently using.
• nb_permi , which is meaningful only when pi is waiting for permission, con-
tains the number of permissions received so far by pi (i.e., nb_permi = |{j ≠ i
such that wait_permi [j ] = 0}|).

Behavior of a Process pi The algorithm based on individual permission that


solves the multiple entries mutual exclusion problem is described in Fig. 11.1. As
already indicated, this algorithm is a simple extension of the mutex algorithm de-
scribed in Fig. 10.3. The following comments are only on the modified parts of this
algorithm.
When a process pi invokes acquire_resource(), it sends a timestamped request
message to each other process pj in order to obtain its permission, and updates
accordingly wait_permi [j ] (lines 4–5). Then, pi waits until it has received (n − M)
permissions (line 7).

When a process pi invokes release_resource(), it sends to each process pj all


the permissions whose sending has been delayed. If several permissions have been
delayed with respect to pj , they are sent in a single message (line 11).
When pi receives a message REQUEST (k, j ), pi has priority if it is currently
using an instance of the resource (cs_statei = in), or it is waiting and its current
request has a smaller timestamp than the one it just received (line 14). If pi has
priority, it increases accordingly perm_delayedi [j ] (line 15), otherwise it sends its
permission by return (line 16).
Finally, when pi receives a message PERMISSION (j, x), it updates accordingly
wait_permi [j ] (line 18). Moreover, if pi is waiting to obtain a resource instance
and is no longer waiting for permissions from pj , pi increases nb_permi (line 19),
which may allow it to use an instance of the resource if nb_permi ≥ n − M (line 7).

Message Cost of the Algorithm There are always (n − 1) request messages per
use of a resource instance, while the number of permission messages depends on
the system state and varies between (n − M) and (n − 1). The total number of
messages per use of a resource instance is consequently at most 2(n − 1) and at
least 2n − (M + 1).

11.1.3 An Algorithm for the k-out-of-M Mutex Problem

From 1-out-of-M to k-out-of-M As previously indicated, in the k-out-of-M


problem, there are M instances of the same resource type and, each time a pro-
cess pi invokes acquire_resource(), pi passes as an input parameter the number k
of instances it wants to acquire. The parameter k, 1 ≤ k ≤ M, is specific to each
instance (which means that there is no relation linking the parameters k and k  of
any two invocations acquire_resource(k) and acquire_resource(k  )).
An algorithm solving the k-out-of-M problem can be obtained from an appro-
priate modification of a mutex algorithm. As for the 1-out-of-M problem studied in
the previous section, we consider here the mutex algorithm based on individual per-
missions described in Sect. 10.2. The resulting algorithm, which is described below,
is due to M. Raynal (1991).

Algorithmic Principles: Guaranteeing Safety and Liveness In order to never


violate the safety property (each resource instance is accessed in mutual exclusion
and no more than M instances are simultaneously allocated to processes), a pro-
cess pi first computes an upper bound on the number of instances of the resource
which are currently allocated to the other processes. To that end, each process pi
manages a local array used_byi [1..n] such that used_byi [j ] is an upper bound on
the number of instances currently used by pj . Hence, when a process pi invokes
acquire_resource(ki ), it first computes the values of used_byi [1..n], and then waits
until the predicate

    Σ1≤j≤n used_byi [j ] ≤ M

Fig. 11.2 Sending pattern of NOT _ USED () messages: Case 1

becomes satisfied. When this occurs, pi is allowed to access ki resource instances.


As far as the liveness property is concerned, each request message carries a times-
tamp, and the total order on timestamps is used to establish a priority among the
requests.

Basic Message Pattern (1) When a process pi sends a timestamped request mes-
sage to pj , it conservatively considers that pj uses the M resource instances. When
it receives the request sent by pi , pj sends by return a message NOT _ USED (M) if it
is not interested, or if its current request has a greater timestamp.
If its current request has priority over pi ’s request, pj sends it by return a mes-
sage NOT _ USED (M − kj ) (where kj is the number of instances it asked for). This
is depicted on Fig. 11.2. This message allows pi to know how many instances of
the resource are currently used or requested by pj . Then, when it will later invoke
release_resource(kj ), pj will send to pi the message NOT _ USED (kj ) to indicate
that it has finished using its kj instances of the resource.
When considering the permission-based terminology, a message PERMISSION ()
sent by a process pj to a process pi in the mutex algorithm based on individual
permissions is replaced here by two messages, namely a message NOT _ USED (M −
kj ) and a message NOT _ USED (kj ). Said in another way, a message NOT _ USED (x)
represents a fraction of a whole permission, namely the fraction x/M of it.

Basic Message Pattern (2) It is possible that a process pj uses kj instances of a


resource during a long period, and during this period pi invokes several times:
first acquire_resource(ki ) (where ki + kj ≤ M), later acquire_resource(k′i )
(where k′i + kj ≤ M), etc. This scenario is depicted in Fig. 11.3.
When pi issues its second invocation, it adds M to used_byi [j ], which becomes
then equal to M + kj . In order that this value does not prevent pi from progressing,
pj is required to send to pi a message NOT _ USED (M) when it receives the second
request of pi . The corresponding message exchange pattern is described in Fig. 11.3.

Behavior of a Process pi The corresponding k-out-of-M algorithm executed by


each process pi is described in Fig. 11.4. The local variables cs_statei , clocki , rdi ,
and perm_delayedi are the same as in the basic-mutex algorithm of Fig. 10.3. The
meaning of the additional array variable used_byi [1..n] has been explained above.
The message pattern described in Fig. 11.3 is generated by the execution of line 15.

Fig. 11.3 Sending pattern of NOT _ USED () messages: Case 2

operation acquire_resource(ki ) is
(1) cs_statei ← trying;
(2) rdi ← clocki + 1;
(3) for each j ∈ Ri do send REQUEST (⟨rdi , i⟩) to pj ;
(4) used_byi [j ] ← used_byi [j ] + M
(5) end for;
(6) used_byi [i] ← ki ;
(7) wait (Σ1≤j≤n used_byi [j ] ≤ M);
(8) cs_statei ← in.

operation release_resource(ki ) is
(9) cs_statei ← out;
(10) for each j ∈ perm_delayedi do send NOT _ USED (ki ) to pj end for;
(11) perm_delayedi ← ∅.

when REQUEST (k, j ) is received do


(12) clocki ← max(clocki , k);
(13) prioi ← ((cs_statei ≠ out) ∧ (⟨rdi , i⟩ < ⟨k, j⟩));
(14) if (¬ prioi ∨ [prioi ∧ (j ∈ perm_delayedi )])
(15) then send NOT _ USED (M) to pj
(16) else if (ki ≠ M) then send NOT _ USED (M − ki ) to pj end if;
(17) perm_delayedi ← perm_delayedi ∪ {j }
(18) end if.

when NOT _ USED (x) is received from pj do


(19) used_byi [j ] ← used_byi [j ] − x.
% this line can render true the waiting predicate of line 7 %.

Fig. 11.4 An algorithm for the k-out-of-M mutex problem (code for pi )

The particular case where a process asks for all instances of the resource is addressed
at line 16.
Let us remember that, except for the wait statement of line 7, the code of each
operation and the code associated with message receptions are locally executed in
mutual exclusion.
The requests are served in their timestamp order. As an example, let us consider
that M = 8 and let us assume that the first five invocations (according to their times-

tamp) are for 3, 1, 2, 5, 1 instances of the resource, respectively. Then, the first three
invocations are served concurrently. The fourth invocation will be served as soon as
any three instances of the resource will have been released.
Let us finally observe that it is easy to modify the algorithm to allow a process
pi to release separately its ki instances of the resource.

Message Cost of the Algorithm There are always (n − 1) request messages per
use of a resource instance. The number of NOT _ USED () messages depends on the
system state and varies between (n − 1) and 2(n − 1). Hence, the total number of
messages per use of a set of instances of the resource is at least 2(n − 1) and at most
3(n − 1).

11.1.4 Proof of the Algorithm

This section proves that the previous algorithm solves the k-out-of-M problem. It
assumes that (a) each invocation acquire_resource(k) is such that 1 ≤ k ≤ M, and
(b) the periods during which a process uses the instances of the resource that it has
obtained are finite.

Lemma 6 Let xτ be the number of instances of the resource accessed by the pro-
cesses at time τ. The algorithm described in Fig. 11.4 guarantees that ∀τ: xτ ≤ M.

Proof The proof of the safety property is based on three claims.

Claim C1 ∀ i, j : used_byi [j ] ≥ 0.
Proof of Claim C1 As all the local variables used_byi [j ] are initialized to 0, the
claim is initially true.
A process pi increases all its local counters used_byi [j ] at line 4 when it in-
vokes acquire_resource(ki ) (and this is the only place where pi increases them).
Moreover, it increases each of them by M. After these increases, the claim remains
trivially true.
The reception of a message REQUEST () sent by pi to another process pj gives
rise to either one message NOT _ USED (M) or at most one message NOT _ USED (M −
kj ) followed by at most one message NOT _ USED (kj ). It follows from line 19 that,
when these messages are received by pi , used_byi [j ] is decreased and can be set to
0. Hence, used_byi [j ] never becomes negative and the claim follows. (End of the
proof of the claim.)

Claim C2 ∀i: ∀τ: (cs_stateτi = in) ⇒ (Σ1≤j≤n used_byτi [j ] ≤ M).
Proof of Claim C2 It follows from line 7 that the claim is true when pi sets cs_statei
to the value in. As pi does not invoke acquire_resource() while cs_statei = in, no
local variable used_byi [j ] is increased when cs_statei = in, and the claim follows.
(End of the proof of the claim.)

Claim C3 Let INτ = {i | cs_stateτi = in}. Moreover, let nb_allocj be such that
nb_allocj = kj if j ∈ INτ, and nb_allocj = 0 otherwise (kj is the number of in-
stances requested by pj).
Considering INτ ≠ ∅, let pm be the process such that m ∈ INτ and its request has
the greatest timestamp among all the requests issued by the processes in INτ (i.e.,
∀k ∈ INτ, k ≠ m: ⟨dm, m⟩ > ⟨dk, k⟩, where ⟨dx, x⟩ is the timestamp associated
with the request of px). We claim: ∀j ≠ m: used_byτm[j] ≥ nb_allocj.

Proof of Claim C3 If j ∉ INτ, we have (from Claim C1) used_byτm[j] ≥ 0 =
nb_allocj, and the claim follows. Hence, let us assume j ∈ INτ and j ≠ m. We have
then nb_allocj = kj. Due to the definition of pm, we have ⟨dm, m⟩ > ⟨dj, j⟩. Con-
sequently, when pj receives REQUEST(dm, m), we have prioj = true and pj sends
to pm the message NOT_USED(M − kj) (line 16). It follows that at time τ, pm has
not yet received NOT_USED(kj) and consequently used_byτm[j] ≥ kj = nb_allocj.
(End of the proof of the claim.)

Let Uτ be the number of instances of the resource which are used at time τ.
If INτ = ∅, no instance is used and consequently Uτ = 0 ≤ M. So, let us assume
INτ ≠ ∅. Let pm be defined as in Claim C3. We have the following:
• Uτ = Σj∈INτ kj (definition of Uτ),
• Σj∈INτ kj ≤ km + Σj∈INτ\{m} used_byτm[j] (from Claim C3),
• km + Σj∈INτ\{m} used_byτm[j] ≤ Σ1≤j≤n used_byτm[j] (from Claim C1),
• Σ1≤j≤n used_byτm[j] ≤ M (from Claim C2).
It follows by transitivity that Uτ ≤ M, which concludes the proof of the safety
property. □

Lemma 7 The algorithm described in Fig. 11.4 ensures that any invocation of
acquire_resource() eventually terminates.

Proof Let us notice that the only statement at which a process may block is the wait
statement at line 7. Let WAITτ = {i | (cs_statei = trying) ∧ (Σ1≤j≤n used_byτi [j ] >
M)}. The proof is based on the following claim.

Claim C4 Considering WAITτ ≠ ∅, let pm be the process such that m ∈ WAITτ and
its request has the smallest timestamp among all the requests issued by the processes
in WAITτ. Then, the quantity Σ1≤j≤n used_bym[j] decreases as τ increases, and even-
tually either cs_statem = in or Σ1≤j≤n used_bym[j] = km + Σz∈PRIO kz > M,
where PRIO = {z | (⟨dz, z⟩ < ⟨dm, m⟩) ∧ (cs_statez = in)}.

Proof of Claim C4 Let j ≠ m. There are two cases.
1. Case j ∈ WAITτ. Due to the definition of m we have ⟨dj, j⟩ > ⟨dm, m⟩. It follows
that, when pj receives the message REQUEST(dm, m), it sends NOT_USED(M)
by return to pm (lines 13–15). It follows that j ∉ PRIO and we eventually have
used_bym[j] = 0.
2. Case j ∉ WAITτ. If cs_statej ≠ in or ⟨dj, j⟩ > ⟨dm, m⟩, we are in the same case
as previously: j ∉ PRIO and eventually we have used_bym[j] = 0. If cs_statej =
in and ⟨dj, j⟩ < ⟨dm, m⟩, we have j ∈ PRIO and pj sends NOT_USED(M − kj)
by return to pm (line 16). Hence, we eventually have used_bym[j] = kj.
While pm is such that cs_statem = trying, it does not send new requests, and conse-
quently no used_bym[x] can increase. It follows that, after a finite time, we have
cs_statem = in or Σ1≤j≤n used_bym[j] = km + Σz∈PRIO kz > M, which con-
cludes the proof of the claim. (End of the proof of the claim.)

We now prove that, for any τ, if a process pi belongs to WAITτ, it will eventually
be such that cs_statei = in. Let us consider the process pm of WAITτ that has the
smallest timestamp. It follows from Claim C4 that, after some finite time, we have
• Either cs_statem = in. In this case, pm returns from its invocation of acquire_
resource().
• Or (cs_statem = trying) ∧ (km + Σz∈PRIO kz > M). In this case, as the resource
instances are eventually released (assumption), the set PRIO shrinks and the
quantity Σz∈PRIO kz decreases accordingly, eventually allowing the predicate
cs_statem = in to become true.
Hence, pm eventually returns from its invocation of acquire_resource().
The proof that any process px returns from acquire_resource() follows from
the fact that the new invocations will eventually have timestamps greater than the
one of px . The proof is the same as for the mutex algorithm based on individual
permissions (see the proof of its liveness property in Lemma 4, Sect. 10.2.3). 

Theorem 15 The algorithm described in Fig. 11.4 solves the k-out-of-M problem.

Proof The theorem follows from Lemmas 6 and 7. 

11.1.5 From Mutex Algorithms to k-out-of-M Algorithms

As noted when it was presented, the previous algorithm, which solves the k-out-of-
M mutex problem, is an adaptation of the mutex algorithm described in Sect. 10.2.
More generally, it is possible to extend other mutex algorithms to solve the k-out-of-
M mutex problem. Problem 1 at the end of this chapter considers such an extension
of the adaptive mutex algorithm described in Sect. 10.3.

11.2 Several Resources with a Single Instance


This section considers the case where there are X resource types, and there is one
instance of each resource type x, 1 ≤ x ≤ X.

Fig. 11.5 Examples of conflict graphs

The pattern formed by (a) the acquisition of resources by a process, (b) their use,
and (c) finally their release, is called a (resource) session. During its execution, a
process usually executes several sessions.

11.2.1 Several Resources with a Single Instance

The Notion of a Conflict Graph A graph, called conflict graph, is associated with
each resource. Let CG(x) be the conflict graph associated with the resource type x.
This graph is an undirected fully connected graph whose vertices are the processes
allowed to access the resource x. An edge means a possible conflict between the
two processes it connects: their accesses to the resource must be executed in mutual
exclusion.
As an example, let us consider six processes p1 , . . . , p6 , and three resource types
x1 , x2 , and x3 . The three corresponding conflict graphs are depicted in Fig. 11.5.
The edges of CG(x1 ) are depicted with plain segments, the edges of CG(x2 ) with
dotted segments, and edges of CG(x3 ) with dashed segments. As we can see, the
resource x1 can be accessed by the four processes: p2 , p3 , p5 , and p6 ; p1 accesses
only x3 , while both p2 and p4 are interested in the resources x1 and x3 .
The union of these three conflict graphs defines a graph in which each edge is
labeled by the resource from which it originates (the label is here the fact that the
edge is plain/dotted/dashed). This graph, called global conflict graph, gives a global
view on the conflicts on the whole set of resources.
The global graph associated with the previous three resources is depicted in
Fig. 11.6. As we can see, this global graph allows p1 to access x3 while p6 is access-
ing the pair resources (x1 , x2 ). Differently, p3 and p6 conflict on both the resources
x1 and x2 .

Requests Are on All or a Subset of the Resources As already indicated, two


types of requests can be defined.
• In the first case, each resource session of a process pi is static in the sense that it
is always on all the resources that pi is allowed to access. This set of resources is
the set {x | i ∈ CG(x)}.

Fig. 11.6 Global conflict graph

When considering the global conflict graph of Fig. 11.6, this means that each
session of p6 is always on both x1 and x2 , while each resource session of p4 is
always on x2 and x3 , etc.
• In the second case, each session of a process pi is dynamic in the sense that it is
on a dynamically defined subset of the resources that pi is allowed to access.
In this case, some sessions of p6 can be only on x1 , others on x2 , and others on
both x1 and x2 . According to the request pattern, dynamic allocation may allow
for more concurrency than static allocation. As a simple example, if p6 wants to
access x1 while p3 wants to access x2 , and no other process wants to access these
resources, then p6 and p3 can access them concurrently.
The first case is sometimes called in the literature the dining philosophers prob-
lem, while the second case is sometimes called the drinking philosophers problem.

11.2.2 Incremental Requests for Single Instance Resources:


Using a Total Order

Access Pattern We consider here the case where a process pi requires the re-
sources it needs for its session one after the other, hence the name incremental re-
quests. Moreover, a session can be static or dynamic.

Possibility of Deadlock The main issue that has to be solved is the prevention
of deadlocks. Let us consider the processes p3 and p6 in Fig. 11.6. When both
processes want to acquire both the resources x1 and x2 , the following can happen
with incremental requests.
• p3 invokes acquire_resource(x1 ) and obtains the resource x1 .
• p6 invokes acquire_resource(x2 ) and obtains the resource x2 .
• Then p3 and p6 invoke acquire_resource(x2 ) and acquire_resource(x1 ), respec-
tively. As the resource asked by a process is currently owned by the other process,
each of p3 and p6 starts waiting, and the waiting period of each of them will ter-
minate when the resource it is waiting on is released. But, as each process releases
resources only after it has obtained all the resources it needs for the current ses-
sion, this will never occur (Fig. 11.7).

Fig. 11.7 A deadlock scenario involving two processes and two resources

This is a classical deadlock scenario. (Deadlocks involving more than two pro-
cesses accessing several resources can easily be constructed.)

Deadlock Prevention: A Total Order on Resources Let {x1 , x2 , . . . , xX } be the whole set of resources accessed by the processes, each process accessing possibly only a subset of them (as an example, see Fig. 11.6). Let ≺ be a total order on this set of resources, e.g., x1 ≺ · · · ≺ xX . The processes are required to obey the following rule:
rule:
• During a session, a process may invoke acquire_resource(xk ) only if it has al-
ready obtained all the resources xj it needs (during this session) which are such
that xj ≺ xk .
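To make the rule concrete, here is a minimal sketch (in Python, a language used here only for illustration) in which local locks stand in for the distributed acquire_resource()/release_resource() operations; sorting the resource identifiers implements the total order ≺.

import threading

# Hypothetical stand-ins: each resource is represented by a local lock,
# and the total order x1 ≺ x2 ≺ x3 is the order of the identifiers.
resources = {name: threading.Lock() for name in ("x1", "x2", "x3")}

def session(needed):
    # Acquire the needed resources following the agreed total order.
    for name in sorted(needed):
        resources[name].acquire()
    try:
        pass  # critical section: use the acquired resources
    finally:
        for name in sorted(needed, reverse=True):
            resources[name].release()

session({"x2", "x1"})  # always acquires x1 before x2, whatever the caller's order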

Theorem 16 If, during each of its sessions, every process asks for the resources (it
needs for that session) according to the total order ≺, no deadlock can occur.

Proof Let us first introduce the notion of a wait-for graph (WFG). Such a graph
is a directed graph defined as follows (see also Sect. 15.1.1). Its vertices are the
processes and its edges evolve dynamically. There is an edge from pi to pj if (a)
pj is currently owning a resource x (i.e., pj has obtained and not yet released the
resource x) and (b) pi wants to acquire it (i.e., pi has invoked acquire_resource(x)).
Let us observe that, when there is a (directed) cycle, this cycle never disappears (for
it to disappear, a process in the cycle should release a resource, but this process
is blocked waiting for another resource). A cycle in the WFG means that there is
deadlock: each process pk in the cycle is waiting for a resource owned by a process pk′ that belongs to the cycle, which in turn is waiting for a resource owned by a process pk′′ that belongs to the cycle, etc., without ever exiting from the cycle. The
progress of each process pk on the directed cycle depends on a statement (releasing
a resource) that it cannot execute.
The proof is by contradiction. Let us suppose that there is a directed cycle in
the WFG. To simplify the proof, and without loss of generality, let us assume that
the cycle involves only two processes p1 and p2 : p1 wants to acquire the resource
xa that is currently owned by p2 , and p2 wants to acquire the resource xb that is
currently owned by p1 (Fig. 11.8).
The following information can be extracted from the graph:

Fig. 11.8 No deadlock with ordered resources

Fig. 11.9 A particular pattern in using resources

• As p1 is owning the resource xa and waiting for the resource xb , it invoked first
acquire_resource(xa ) and then acquire_resource(xb ).
• As p2 is owning the resource xb and waiting for the resource xa , it invoked first
acquire_resource(xb ) and then acquire_resource(xa ).
As there is an imposed total order to acquire resources, it follows that either p1 or
p2 does not follow the rule. Taking the contrapositive, we have that “each process
follows the rule” ⇒ “there is no cycle in WFG”. Hence, no deadlock can occur if
every process follows the rule (during all its sessions of resource use). 

Drawback of the Previous Approach Accessing resources according to a predefined total order agreed upon by all processes prevents deadlock from occurring. But deadlock prevention always has a price. Let us consider a process pi that, during a session, wants to use first a resource xb alone, then both the resources xa and xb , and finally the resource xa alone. The corresponding time pattern is described (above the process axis) in Fig. 11.9.
• If xb ≺ xa , the order in which pi needs the resources is the same as the imposed
total order. In this case, the allocation is optimal in the sense that, during the
session, xb and xa are obtained just when pi needs them. This is the case depicted
above the process axis in Fig. 11.9.
• If xa ≺ xb , to obey the rule, pi is forced to require xa before xb . This case is de-
picted below the process axis in Fig. 11.9. The backward arrow means that pi has
to invoke acquire_resource(xa ) before invoking acquire_resource(xb ). The total order on xa and xb does not fit the order in which pi needs them. Hence, during some duration, xa is owned but not used by pi . If, during that period of time, another process wants to use the resource xa , it cannot. This is the inescapable price that has to be paid when a total order on resources is used to prevent deadlocks from occurring.

Fig. 11.10 Conflict graph for six processes, each resource being shared by two processes

11.2.3 Incremental Requests for Single Instance Resources: Reducing Process Waiting Chains

A Worst-Case Scenario Let us consider six processes p1 , . . . , p6 , and six resources x1 , . . . , x6 , such that, in each session, each process needs two resources, namely
pi requires xi−1 and xi , for 1 < i ≤ 6, and
p1 requires x1 and x6 .
The corresponding global conflict graph is described in Fig. 11.10. The label
associated with an edge indicates the resource on which its two processes (ver-
tices) conflict. (This example is the conflict graph of the resource allocation problem
known as the dining philosophers problem. Each process is a philosopher who needs
both the fork at its left and the fork at its right—the forks are the resources—to eat
spaghetti.)
Let us consider an execution in which
First p6 obtains x5 and x6 ,
Then p5 obtains x4 and waits for x5 (blocked by p6 ),
Then p4 obtains x3 and waits for x4 (blocked by p5 ),
Then p3 obtains x2 and waits for x3 (blocked by p4 ),
Then p2 obtains x1 and waits for x2 (blocked by p3 ),
Then p1 waits for x1 (blocked by p2 ).
There is no deadlock (process p6 is active, using x5 and x6 ). However, all the
processes, except p6 , are passive (waiting for a resource instance). Moreover, p1 ,
p2 , p3 , and p4 are waiting for a resource currently owned by a passive process. The
corresponding WFG is as follows (let us recall that a directed edge means “blocked
by”)
p1 → p2 → p3 → p4 → p5 → p6 .
Fig. 11.11 Optimal vertex-coloring of a resource graph

Hence, the length of the waiting chain (longest path in the WFG) is n − 1 = 5, which
is the worst case: only one process is active.

Reduce the Length of Waiting Chains: Coloring a Resource Graph An approach to reduce the length of waiting chains consists in imposing on each process separately a total order on the resources it wants to acquire. Moreover, the resulting partial order obtained when considering all the resources has to contain as few edges as possible. This approach is due to N.A. Lynch (1981).
To that end, let us define a resource graph as follows. Its vertices are the re-
sources and there is an edge connecting two resources if these two resources can be
requested by the same process during one of its sessions. The resource graph asso-
ciated with the previous example is described in Fig. 11.11 (the label on an edge
denotes the processes that request the corresponding resources). The meaning of a
(non-directed) edge is that the corresponding resources are requested in some order
by some process(es). As there is no process accessing both x1 and x4 , there is no
edge connecting these vertices.
Hence, the aim is to find a partial order on the resources such that (a) the re-
sources requested by each process (during a session) are totally ordered and (b)
there are as few edges as possible. Such a partial order can be obtained as follows.
• First, find an optimal vertex-coloring of the resource graph (no two resources connected by an edge have the same color and the number of colors is minimal).
• Then, define a total order on the colors.
• Finally, each process acquires the resources it needs during a session according to
the order on their colors. (Let us observe that, due to its construction, the resources
accessed by a process are totally ordered.)
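The construction can be sketched as follows (illustrative Python; the greedy coloring below is only a stand-in for an optimal or near-optimal coloring, and the graph is the 6-cycle of Fig. 11.11). Resources are then acquired by increasing color, ties being broken by identifier.

def greedy_coloring(resource_graph):
    # Heuristic vertex coloring: correct but not optimal in general
    # (optimal vertex coloring is NP-complete).
    color = {}
    for v in sorted(resource_graph):
        used = {color[u] for u in resource_graph[v] if u in color}
        color[v] = next(c for c in range(len(resource_graph)) if c not in used)
    return color

# Resource graph of Fig. 11.11: the 6-cycle x1-x2-x3-x4-x5-x6-x1.
rg = {1: {2, 6}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5, 1}}
color = greedy_coloring(rg)

def acquisition_order(needed):
    # Acquire by increasing color; ties broken by resource identifier.
    return sorted(needed, key=lambda x: (color[x], x))

print(acquisition_order({3, 4}))  # [3, 4]: p4 acquires x3 (first color) before x4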

Example When considering Fig. 11.11, the minimal number of colors is trivially
two. Let us color the non-neighbor resources x1 , x3 , and x5 with color a and the
non-neighbor resources x2 , x4 , and x6 with color b. Moreover, let a < b be the
total order imposed on these colors. Hence, the edges of the corresponding resource
graph are directed from top to bottom, which means that p4 has to acquire first x3
and then x4 , while p5 has to acquire first x5 and then x4 .
The worst-case scenario described at the beginning of this section cannot occur. When p6 has obtained x5 and x6 , p5 is blocked waiting for x5 (which it has to require before x4 ).
Hence, p4 can obtain x3 and x4 , and p2 can obtain x1 and x2 . More generally, the
maximal length of a waiting chain is the number of colors minus 1.
Fig. 11.12 Conflict graph for static sessions (SS_CG)

Remark Let us remember that the optimal coloring of the vertices of a graph is
an NP-complete problem (of course, a more efficient near-optimal coloring may be
used). Let us observe that the non-optimal coloring in which all the resources are
colored with different colors corresponds to the “total order” strategy presented in
Sect. 11.2.2.

11.2.4 Simultaneous Requests for Single Instance Resources and Static Sessions

Differently from the incremental request approach, where a process requires, one
after the other, the resources it needs during a session, the simultaneous request
approach directs each process to require simultaneously (i.e., with a single operation
invocation) all the resources it needs during a session. To that end, the processes are
provided with an operation

acquire_resource(res_seti ),

whose input parameter res_seti is the set of resources that the invoking process pi
needs for its session.
As we consider static sessions, every session of a process pi involves all the resources that this process is allowed to access (i.e., res_seti = {x | i ∈ CG(x)}).

Conflict Graph for Static Sessions As, during a session, the requests of a pro-
cess are on all the resources it is allowed to access, the global conflict graph (see
Fig. 11.6) can be simplified. Namely, when two processes conflict on several re-
sources, e.g., xa and xb , these resources can be considered as a single virtual re-
source xa,b . This is because, the sessions being static, each session of these pro-
cesses involves both xa and xb .
The corresponding conflict graph for static sessions is denoted SS_CG. Its vertices are the vertices of the union ∪1≤x≤M CG(x) of the individual conflict graphs. There is an edge (py , pz ) in SS_CG if there is a conflict graph CG(x) containing such an edge. The graph SS_CG associated with the global conflict graph of Fig. 11.6 is described in Fig. 11.12. Let us observe that, when p6 accesses x1 and x2 , nothing prevents p1 from accessing x3 .
Due to the fact that, in each of its sessions, a process pi requires all the resources
x such that i ∈ CG(x), it follows that each process is in mutual exclusion with all
its neighbors in the static session conflict graph. Hence, a simple solution consists
in adapting a mutex algorithm so that mutual exclusion is required only between
processes which are neighbors in the static conflict graph.
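The construction of SS_CG can be sketched as follows (illustrative Python; the per-resource sets below are a plausible reading of Fig. 11.5, not data given explicitly by the text). Two processes are neighbors in SS_CG as soon as some CG(x) connects them.

from itertools import combinations

# Each CG(x) is the complete graph on the processes allowed to access x;
# it is represented here by that set of processes (hypothetical values).
cg = {"x1": {2, 3, 5, 6}, "x2": {3, 4, 6}, "x3": {1, 2, 4}}

def ss_cg_edges(cg):
    # (py, pz) is an edge of SS_CG if some CG(x) contains it.
    edges = set()
    for procs in cg.values():
        edges |= {frozenset(e) for e in combinations(sorted(procs), 2)}
    return edges

neighbors = {}
for e in ss_cg_edges(cg):
    a, b = tuple(e)
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)
print(sorted(neighbors[6]))  # processes with which p6 must be in mutual exclusion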

Mutual Exclusion with Neighbor Processes in the Conflict Graph Let us con-
sider the mutex algorithms based on individual permissions described in Chap. 10.
These algorithms are modified as follows:
• Each site manages an additional local variable neighborsi which contains the
identities of its neighbors in the conflict graph. (Let us notice that if j ∈
neighborsi , then i ∈ neighborsj .)
• When considering the non-adaptive algorithm of Fig. 10.3, the set Ri (which was a constant equal to {1, . . . , n} \ {i}) remains a constant which is now equal to neighborsi .
• When considering the adaptive algorithm of Fig. 10.7, let pi and pj be two processes which are neighbors in the conflict graph. The associated message PERMISSION ({i, j }) is initially placed at one of these processes, e.g., pi , and the initial values of Ri and Rj are then such that j ∉ Ri and i ∈ Rj .
• When considering the bounded adaptive algorithm of Fig. 10.10, the initialization
is as follows. For any two neighbors pi and pj in the conflict graph such that i >
j , the message PERMISSION ({i, j }) is initially at pi (hence, i ∈ Rj and j ∉ Ri ),
and its initial state is used (i.e., perm_statei [j ] = used).
In the last two cases, Ri evolves according to requests, but we always have
Ri ⊆ neighborsi . Moreover, when the priority on requests (liveness) is determined
from timestamps (the first two cases), as any two processes pi and pj which are not
neighbors in the conflict graph never compete for resources, it follows that pi and
pj can have the same identity if neighborsi ∩ neighborsj = ∅ (this is because their
common neighbors, if any, must be able to distinguish them). Hence, any two pro-
cesses at a distance greater than 2 in the conflict graph can share the same identity.
This can help reduce the length of waiting chains.
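This last observation can be checked mechanically: two processes may share an identity when they are not neighbors and have no common neighbor, i.e., when their distance in the conflict graph is greater than 2. A brute-force sketch (illustrative Python, on a hypothetical conflict graph):

from itertools import combinations

# Hypothetical conflict graph (adjacency sets over process identities).
g = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5}}

def can_share_identity(g, a, b):
    # pa and pb may share an identity iff they are not neighbors and
    # have no common neighbor (their distance is greater than 2).
    return b not in g[a] and not (g[a] & g[b])

print([p for p in combinations(g, 2) if can_share_identity(g, *p)])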

11.2.5 Simultaneous Requests for Single Instance Resources and Dynamic Sessions

The Deadlock Issue In this case, each session of a process pi involves a dynami-
cally defined subset of the resources for which pi is competing with other processes,
i.e., a subset of {x | i ∈ CG(x)}.
Let us consider the global conflict graph depicted in Fig. 11.6. If, during a ses-
sion, the process p6 wants to access the resource x1 only, it needs a mutual exclusion
algorithm in order to prevent p2 , p3 , or p5 from simultaneously accessing this resource. To that end, a mutex algorithm per resource is used. Let M(x) denote the mutex algorithm associated with x.
mutex algorithm associated with x.
As each of these algorithms involves only the processes accessing the corre-
sponding resource, this allows processes whose current sessions require different
sets of resources to access them concurrently. As an example, if the current session
of p6 requires only x1 while, simultaneously, the current session of p3 requires only
x2 , being managed by distinct mutex algorithms, these resources can be accessed
concurrently by p6 and p3 .
Unfortunately, as the mutex algorithms M(x) are independent of each other, there is a risk of deadlock when the current sessions of both p6 and p3 require both the resources x1 and x2 . This is because the mutex algorithm M(x1 ) can allocate x1 to p3 , while the algorithm M(x2 ) allocates x2 to p6 .

Establish a Priority to Prevent Deadlock A way to prevent deadlock from occurring consists in stacking an additional mutex algorithm on top of the M(x) algorithms. More precisely, let us consider an additional mutex algorithm, denoted GM, which ensures mutual exclusion on neighbor processes in the conflict graph for static sessions SS_CG introduced in the previous section (Fig. 11.12).
This algorithm is used to solve conflicts between neighbor processes in SS_CG
when they want to use several resources. As an example, when both p6 and p3 want
to use the resources x1 and x2 , their conflict is solved by GM: If p6 has priority over
p3 in GM, it has priority to access both resources. This has a side effect, namely, if M(x2 ) has already granted the resource x2 to p3 , this process has to release it so that p6 can obtain it. (If p3 has obtained both resources, whatever the priority given by GM, it keeps them until the end of its session.)

Cooperation Between GM and the Mutex Algorithms Associated with Each Resource Mutex algorithms between neighbor processes in a graph were introduced in the last part of Sect. 11.2.4. Let us consider that the mutex algorithm GM and all
the mutex algorithms M(x) are implemented by the one described in the last item
of Sect. 11.2.4, in which all variables are bounded. (As we have seen, this algo-
rithm was obtained from a simple modification of the adaptive mutex algorithm of
Fig. 10.10.)
Let us recall that such a mutex algorithm is fair: Any process pi that invokes
acquire_mutex() eventually enters the critical section state, and none of its neighbors
pj will enter it simultaneously. Moreover, two processes pi and pj which are not
neighbors can be simultaneously in critical section.
Let cs_statei be the local variable of pi that describes its current mutual exclusion
state with respect to the general algorithm GM. As we have seen, its value belongs to
{out, trying, in}. Similarly, let cs_statei [x] be the local variable of pi that describes
its current state with respect to the mutex algorithm M(x).
Let us consider the algorithm GM. Given a process pi , we have the following
with regard to the transitions of its local variable cs_statei . Let us recall that the tran-
sition of cs_statei from trying to in is managed by the mutex algorithm itself. Dif-
ferently, the transitions from out to trying, and from in to out, are managed by the
invoking process pi . Hence, as far as GM is concerned, rules that force a process to proceed from out to trying and from in to out have to be defined. These rules are as
follows:
• R1. If (cs_statei = out) ∧ (∃x : cs_statei [x] = trying), pi must invoke the oper-
ation acquire_resource() of GM to acquire the mutual exclusion with respect to
its neighbors in SS_CG, so that eventually cs_statei = trying.
• R2. If (cs_statei = in) ∧ (∀x such that i ∈ CG(x) : cs_statei [x] ≠ trying), pi must invoke the operation release_resource() of GM so that eventually cs_statei = out.
Finally, when a process pi receives a request for a resource x from its neighbor
pj , the priority rule followed by pi is the following:
• R3. If (a) pi is not interested in the resource x (cs_statei [x] = out), or (b) pi is
neither using the resource x nor having priority with respect to pj in GM, then
pi allows pj to use the resource x (i.e., pi sends its permission to pj ).
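The rules R1 and R2 are local predicates that pi can re-evaluate whenever one of its cs_statei [x] variables changes. A minimal sketch (illustrative Python; gm_acquire() and gm_release() are hypothetical placeholders for the acquire and release operations of GM):

# Hypothetical local state of pi: its state in GM and in each M(x).
cs_state = "out"                             # with respect to GM
cs_state_x = {"x1": "trying", "x2": "out"}   # with respect to each M(x)

def gm_acquire():   # placeholder for acquire_resource() of GM
    global cs_state; cs_state = "trying"

def gm_release():   # placeholder for release_resource() of GM
    global cs_state; cs_state = "out"

def apply_rules():
    # R1: pi is out of GM while some resource acquisition is pending.
    if cs_state == "out" and any(s == "trying" for s in cs_state_x.values()):
        gm_acquire()
    # R2: pi is in GM and no resource acquisition is pending anymore.
    elif cs_state == "in" and all(s != "trying" for s in cs_state_x.values()):
        gm_release()

apply_rules()
print(cs_state)  # "trying": R1 fired because x1 is being requested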

Sketch of a Bounded Resource Allocation Algorithm for Dynamic Sessions At each process pi and for each resource x, cs_statei [x] is initialized to out. Moreover, each process pi manages the following additional local variables.
• needi [x] is a Boolean which is true when pi is interested in the resource x.
• reqi [x, j ] is a Boolean which is true when pi is allowed to request (to pj ) the per-
mission to access the resource x. This permission is represented by the message
X _ PERMISSION (x(i, j )).
• herei [x, j ] is a Boolean which is true when pi has obtained (from pj ) the per-
mission to access the resource x.
Initially, for any pair of neighbors pi and pj the message X _ PERMISSION(x(i, j )) is
placed at one process, e.g., pi . We have then herei [x, j ] = true, reqi [x, j ] = false,
herej [x, i] = false, and reqj [x, i] = true.
As already indicated, the transitions of the mutex algorithm GM are governed by
the rules R1 and R2. A sketch of a description of the resource allocation algorithm
is presented in Fig. 11.13. It considers a single pair of neighbor processes pi and pj
and a single resource x among the possibly many resources that they share. When it receives from
pj a request for the resource x, pi computes its priority (line 5); as explained above,
this computation involves the state of pi with respect to GM. (This is expressed
by the presence of the message PERMISSION ({i, j }) at pi . Let us recall that this
message is a message of the algorithm GM.) The writing of the whole detailed
algorithm is the topic of Problem 5 at the end of this chapter.

11.3 Several Resources with Multiple Instances


The Generalized k-out-of-M Problem This section considers the case where
there are X resource types, and for each x, 1 ≤ x ≤ X, there are M[x] instances
of the resource type x. A request of a process pi may concern several instances of
to acquire x with respect to pj do
(1) if ((cs_statei [x] = trying) ∧ needi [x] ∧ reqi [x, j ] ∧ (¬herei [x, j ]))
(2)    then send REQUEST (x) to pj ; reqi [x, j ] ← false
(3) end if.

to release x with respect to pj do
(4) if (reqi [x, j ] ∧ herei [x, j ])
(5)    then if ((¬needi [x]) ∨ ¬[(cs_statei [x] = in) ∧ (PERMISSION ({i, j }) is at pi )])
(6)       then send X _ PERMISSION (x) to pj ; herei [x, j ] ← false
(7)    end if
(8) end if.

when REQUEST (x) is received from pj do
(9) reqi [x, j ] ← true.

when X _ PERMISSION (x) is received from pj do
(10) herei [x, j ] ← true.

Fig. 11.13 Simultaneous requests in dynamic sessions (sketch of code for pi )

each resource. As an example, for a given session, pi may request ki [x] instances
of resource type x and ki [y] instances of the resource type y.
This section presents solutions for dynamic sessions (as we have seen this means
that, in each session, a process defines the specific subset of resources it needs from
the set of resources it is allowed to access). Let us observe that, as these solutions
work for dynamic sessions, they trivially work for static sessions.

The Case of Dynamic Sessions with Incremental Requests In this case, the
same techniques as the ones described in Sect. 11.2 (which was on resources with
a single instance) can be used to prevent deadlocks from occurring (total order on
the whole set of resources, or partial order defined from a vertex-coloring of the
resource graph).
As far as algorithms are concerned, a k-out-of-M mutex algorithm is associated
with each resource type x. A process then invokes acquire_resource(x, ki ), where
x is the resource type and ki the number of its instances requested by pi , with
1 ≤ ki ≤ M[x].

The Case of Dynamic Sessions with Simultaneous Requests Let RX i ⊆ {x | i ∈ CG(x)} denote the dynamically defined set of resource types that the invoking process pi wants to simultaneously acquire during a session, and, for each x ∈ RX i , let kix denote the number of x’s instances that it needs.
A generalization of the k-out-of-M algorithm presented in Fig. 11.4, which im-
plements the operations acquire_resource({(x, kix )}x∈RX i ) and release_
resource({(x, kix )}x∈RX i ), is presented in Fig. 11.14. It assumes a partial instance of the basic algorithm of Fig. 11.4 for each resource type x ∈ RX i . Let cs_statexi , used_byxi , and perm_delayedxi be the local variables of pi associated with the resource type x; perm_delayedxi contains only identities of the processes in CG(x),

operation acquire_resource({(x, kix )}x∈RX i ) is


(1) for each x ∈ RX i do cs_statexi ← trying end for;
(2) rdi ← clocki + 1;
(3) for each x ∈ RX i do
(4) for each j ∈ CG(x) do send REQUEST (x, kix , ⟨rdi , i⟩) to pj ;
(5) used_byxi [j ] ← used_byxi [j ] + M[x]
(6) end for;
(7) used_byxi [i] ← kix ;
(8) end for;

(9) wait (∧x∈RX i (Σ1≤j ≤n used_byxi [j ] ≤ M[x]));
(10) for each x ∈ RX i do cs_statexi ← in end for.

operation release_resource({(x, kix )}x∈RX i ) is


(11) for each x ∈ RX i do
(12) cs_statexi ← out;
(13) for each j ∈ perm_delayedxi do send NOT _ USED (x, kix ) to pj end for;
(14) perm_delayedxi ← ∅
(15) end for.

Fig. 11.14 Algorithms for generalized k-out-of-M (code for pi )

and used_byxi is an array with one entry per process in CG(x). Moreover, each
message is tagged with the corresponding resource type. Figure 11.14 is a simple
extension of the basic k-out-of-M algorithm.
When pi invokes acquire_resource({(x, kix )}x∈RX i ), it first proceeds to the state
trying for each resource x (line 1). Then, it computes a date for its request (line 2).
Let us observe that this date is independent of the set RX i . Process pi then sends a
timestamped request to all the processes with which it competes for the resources in
RX i , and computes an upper bound of the number of instances of the resources in
which it is interested (lines 3–8). When there are enough available instances of the
resources it needs, pi is allowed to use them (line 9). It then proceeds to the state in
with respect to each of these resources (line 10).
When pi invokes release_resource({(x, kix )}x∈RX i ), it executes the same code as
in Fig. 11.4 for each resource of RX i (lines 11–15).
The “server” role of a process (management of the message reception) is the same
as in Fig. 11.4. The messages for a resource type x are processed by the corresponding instance of the basic algorithm. The important point is that all these instances
share the same logical clock clocki . The key of the solution lies in the fact that a
single timestamp is associated with all the request messages sent during an invoca-
tion, and all conflicting invocations on one or several resources are totally ordered
by their timestamps.
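The following sketch isolates this key point (illustrative Python; the class and field names are not the book's): one logical clock per process, shared by all the per-resource instances, so that a single invocation stamps all its REQUEST messages with the same ⟨date, identity⟩ pair.

class Requester:
    # One logical clock per process, shared by all per-resource instances.
    def __init__(self, ident):
        self.ident = ident
        self.clock = 0
        self.outbox = []

    def acquire(self, demands):          # demands: {resource: nb of instances}
        self.clock += 1
        ts = (self.clock, self.ident)    # single timestamp for the invocation
        for x, k in demands.items():
            self.outbox.append(("REQUEST", x, k, ts))
        return ts

p6 = Requester(6)
print(p6.acquire({"x1": 1, "x2": 2}))
# Both REQUEST messages carry the same timestamp, so conflicting
# invocations on one or several resources are totally ordered.
print(p6.outbox)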

11.4 Summary
This chapter was devoted to the generalized k-out-of-M problem. This problem cap-
tures and abstracts resource allocation problems where (a) there are one or several

types of resource, (b) each resource has one or several instances, and (c) each pro-
cess may request several instances of each resource type. Several algorithms solving
this problem have been presented. Due to the multiplicity of resources, one of the
main issues that these algorithms have to solve is deadlock prevention. Approaches
that address this issue have been discussed.
The chapter has also introduced the notion of a conflict graph, which is an im-
portant conceptual tool used to capture conflicts among processes. It has also shown
how the length of process waiting chains can be reduced. Both incremental ver-
sus simultaneous requests on the one side, and static versus dynamic sessions of
resource allocation on the other side, have been addressed in detail.

11.5 Bibliographic Notes

• Resource allocation and the associated deadlock prevention problem originated in the design and the implementation of the very first operating systems (e.g., [110, 165]).
• The dining philosophers problem was introduced by E.W. Dijkstra in [111]. It
abstracts the case where, in each of its sessions, each process requires always
the same set of resources. The drinking philosophers problem was introduced by
K.M. Chandy and J. Misra in [78]. It generalizes the dining philosophers problem
in the sense that the set of resources required by a process is defined dynamically
in each session.
• The deadlock prevention technique based on a total ordering of resources is due
to J.W. Havender [166]. The technique based on vertex coloring of a resource
graph is due to N.A. Lynch [241].
• The algorithm solving the multiple entries mutex problem (1-out-of-M) presented
in Fig. 11.1 is due to K. Raymond [305].
The general k-out-of-M resource allocation algorithm presented in Fig. 11.4
is due to M. Raynal [311].
The basic mutex algorithm from which these two algorithms have been derived
is due to G. Ricart and A.K. Agrawala [327] (this algorithm was presented in
Chap. 10).
• The bounded resource allocation algorithm for single instance resources in dy-
namic sessions presented in Sect. 11.2.5 is due to K.M. Chandy and J. Misra [78].
This algorithm is known under the name drinking philosophers algorithm (the re-
sources are bottles shared by neighbor processes). Presentation of this algorithm
can also be found in [149, 242, 387].
• Quorum (arbiter)-based algorithms that solve the k-out-of-M are presented
in [245, 280].

operation acquire_resource(ki ) is
(1) cs_statei ← trying;
(2) rdi ← clocki + 1;
(3) for each j ∈ Ri do
(4) if (used_byi [j ] = 0) then send REQUEST (⟨rdi , i⟩) to pj ;
(5) sent_toi [j ] ← true; used_byi [j ] ← M
(6) else sent_toi [j ] ← false
(7) end if
(8) end for;
(9) used_byi [i] ← ki ;
(10) wait (Σ1≤j ≤n used_byi [j ] ≤ M);
(11) cs_statei ← in.

operation release_resource(ki ) is
(12) cs_statei ← out;
(13) for each j ∈ perm_delayedi do send NOT _ USED (ki ) to pj end for;
(14) perm_delayedi ← ∅.

when REQUEST (⟨k, j ⟩) is received do


(15) clocki ← max(clocki , k);
(16) prioi ← ((cs_statei = in) ∨ [(cs_statei = trying) ∧ (⟨rdi , i⟩ < ⟨k, j ⟩)]);
(17) if (¬ prioi ) then send NOT _ USED (M) to pj
(18) else if (ki ≠ M) then send NOT _ USED (M − ki ) to pj end if;
(19) perm_delayedi ← perm_delayedi ∪ {j }
(20) end if.

when NOT _ USED (x) is received from pj do


(21) used_byi [j ] ← used_byi [j ] − x;
(22) if ((cs_statei = trying) ∧ (used_byi [j ] = 0) ∧ (¬ sent_toi [j ]))
(23) then send REQUEST (⟨rdi , i⟩) to pj ;
(24) sent_toi [j ] ← true; used_byi [j ] ← M
(25) end if.

Fig. 11.15 Another algorithm for the k-out-of-M mutex problem (code for pi )

11.6 Exercises and Problems

1. The k-out-of-M algorithm described in Fig. 11.4 requires between 2(n − 1) and
3(n − 1) messages per use of a set of instances of the resource. The following al-
gorithm is proposed to reduce this number of messages. When used_byi [j ] = 0,
the process pi knows an upper bound on the number of instances of the resource
used by pj . In that case, it is not needed for pi to send a request message to pj .
Consequently, pi sends a request to pj only when used_byi [j ] = 0; when this
occurs, pi records it by setting a flag sent_toi [j ] to the value true.
The corresponding algorithm is described in Fig. 11.15 (where Ri = {1, . . . , n} \ {i}). Let us recall that the quantity Σ1≤j ≤n used_byi [j ], which appears in the wait statement (line 10), is always computed in local mutual exclusion with the
code associated with the reception of a message NOT _ USED ().
• Show that the algorithm is correct when the channels are FIFO.

• Show that the algorithm is no longer correct when the channels are not FIFO.
To that end construct a counterexample.
• Is the statement sent_toi [j ] ← true at line 24 necessary? Justify your answer.
• Is it possible to replace the static set Ri by a dynamic set (as done in the mutex
algorithm of Fig. 10.7)?
• Can the message exchange pattern described in Fig. 11.3 occur?
• What are the lower and upper bounds on the number of messages per use of a
set of k instances of the resource?
• Is the algorithm adaptive?
• Let the waiting time be the time spent in the wait statement. Compare the
waiting time of the previous algorithm with the waiting time of the algorithm
described in Fig. 11.4. From a waiting time point of view, is one algorithm
better than the other?
Solution in [311].
2. Write the server code (i.e., the code associated with message receptions) of the
generalized k-out-of-M algorithm for simultaneous requests whose client code
is described in Fig. 11.14.
3. The generalized k-out-of-M algorithm for simultaneous requests in dynamic ses-
sions described in Fig. 11.14 uses timestamps.
Assuming the conflict graph of Fig. 11.6 in which each resource has a single
instance (hence k = 1), let us consider an execution in which concurrently
• p2 issues acquire_resource(1) (i.e., p2 requests resource x1 ),
• p6 issues acquire_resource(1, 3) (i.e., p6 requests each of the resources x1
and x3 ),
• p4 issues acquire_resource(3) (i.e., p4 requests resource x3 ).
Moreover, let ⟨hi , i⟩ be the timestamp of the request of pi . How are these requests served if
• ⟨h2 , 2⟩ < ⟨h6 , 6⟩ < ⟨h4 , 4⟩?
• ⟨h2 , 2⟩ < ⟨h4 , 4⟩ < ⟨h6 , 6⟩?
What can be concluded about the order in which the requests are served?
4. The algorithm of Fig. 11.14 solves the generalized k-out-of-M problem for si-
multaneous requests in dynamic sessions. This algorithm uses timestamps and is
consequently unbounded. Design a bounded algorithm for the same problem.
5. Write the full code of the algorithm for simultaneous requests in dynamic
sessions (see Sect. 11.2.5), i.e., the code of (a) the corresponding operations
acquire_resource(RX i ) and release_resource(RX i ), and (b) the code associated
with the corresponding message receptions. (As in the algorithm described in
Fig. 11.14, RX i denotes the dynamically defined set of resources that pi needs
for its current session.)
Elements for a solution in [242, 387].
Part IV
High-Level Communication Abstractions

This part of the book is on the enrichment of a base send/receive distributed message-passing system with high-level communication abstractions. Chapter 12
focuses on abstractions that ensure specific order properties on message delivery.
The most important of such communication abstractions is causal message delivery
(also called causal order). Chapter 13 is on the rendezvous communication abstrac-
tion and logically instantaneous communication (also called synchronous commu-
nication).
These communication abstractions reduce the asynchrony of the system and con-
sequently can facilitate the design of distributed applications. Their aim is to hide the “basic machinery” from users and offer them high-level communication operations so
that they can concentrate only on the essence of the problem they have to solve.
Chapter 12
Order Constraints on Message Delivery

High-level communication abstractions offer communication operations which ensure order properties on message delivery. The simplest (and best known) order
property is the first in first out (FIFO) property, which ensures that, on each chan-
nel, the messages are received in their sending order. Another order property on
message delivery is captured by the total order broadcast abstraction, which was
presented in Sect. 7.1.4. This communication abstraction ensures that all the mes-
sages are delivered in the same order at each process, and this order complies with
their causal sending order.
This chapter focuses first on causal message delivery. It defines the corresponding
message delivery property and presents several algorithms that implement it, both
for point-to-point communication and broadcast communication. Then the chapter
presents new algorithms that implement the total order broadcast abstraction. Fi-
nally, the chapter plays with a channel by considering four order properties which
can be associated with each channel taken individually.
When discussing a communication abstraction, it is assumed that all the mes-
sages sent at the application level are sent with the communication operation pro-
vided by this communication abstraction. Hence, there is no hidden relation on messages that would be unknown to the algorithms implementing these abstractions.

Keywords Asynchronous system · Bounded lifetime message · Causal barrier · Causal broadcast · Causal message delivery order · Circulating token · Client/server broadcast · Coordinator process · Delivery condition · First in first out (FIFO) channel · Order properties on a channel · Size of control information · Synchronous system

12.1 The Causal Message Delivery Abstraction

The notion of causal message delivery was introduced by K.P. Birman and
T.A. Joseph (1987).


Fig. 12.1 The causal message delivery order property

12.1.1 Definition of Causal Message Delivery

The Problem Let us consider the communication pattern described at the left of
Fig. 12.1. Process p1 sends first the message m1 to p3 and then the message m2
to p2 . Moreover, after it has received m2 , p2 sends the message m3 to p3 . Hence, when considering the partial order relation −→ev on events (defined in Chap. 6), the sending of m1 belongs to the causal past of the sending of m2 , and (by the transitivity created by m2 ) belongs to the causal past of the sending of m3 . We consequently have s(m1 ) −→ev s(m3 ) (where s(m) denotes the “sending of m” event). But we do not have r(m1 ) −→ev r(m3 ) (where r(m) denotes the “reception of m” event).
While the messages m1 and m3 are sent to the same destination process, and their sendings are causally related, their reception order does not comply with their sending order. The causal message delivery order property (formally defined below)
is not ensured. Differently, the reception order in the communication pattern described at the right in Fig. 12.1 is such that r(m1 ) −→ev r(m3 ) and satisfies the causal
message delivery order property.

Definition The causal message delivery (also called causal order or causal mes-
sage ordering) abstraction provides the processes with two operations denoted
co_send() and co_deliver(). When a process invokes them, we say that it co_sends
or co_delivers a message. The abstraction is defined by the following properties. It is assumed that all messages are different (which can be easily realized by associ-
ating with each message a pair made up of a sequence number plus the identity of
the sender process). Let co_s(m) and co_del(m) be the events associated with the
co_sending of m and its co_delivery, respectively.
• Validity. If a process pi co_delivers a message m from a process pj , then m was
co_sent by pj .
• Integrity. No message is co_delivered more than once.
• Causal delivery order. For any pair of messages m and m′, if co_s(m) −→ev co_s(m′) and m and m′ have the same destination process, we have co_del(m) −→ev co_del(m′).

• Termination. Any message that was co_sent is co_delivered by its destination process.
Fig. 12.2 The delivery pattern prevented by the empty interval property

This definition is similar to the one of the total order broadcast abstraction given in Sect. 7.1.4, from which the “total order” requirement is suppressed. The first

three requirements define the safety property of causal message delivery. Validity
states that no message is created from thin air or is corrupted. Integrity states that
there is no message duplication. Causal order states the added value provided by the
abstraction. The last requirement (termination) is a liveness property stating that no
message is lost.
While Fig. 12.1 considers a causal chain involving only two messages (m2 and
m3 ), the length of such a chain in the third requirement can be arbitrary.

A Geometrical Remark As suggested by the right side of Fig. 12.1, causal order
on message delivery is nothing more than the application of the famous “triangle
inequality” to messages.

12.1.2 A Causality-Based Characterization of Causal Message Delivery

Let us recall the following definitions associated with each event e (Sect. 6.1.3):
• Causal past of e: past(e) = {f | f −→ev e},
• Causal future of e: future(e) = {f | e −→ev f }.

Let us consider an execution in which the messages are sent and received with the operations send() and receive(), respectively. Moreover, let M be the set of messages that
have been sent. The message exchange pattern of this execution satisfies the causal
message delivery property, if and only if we have

∀ m ∈ M: future(s(m)) ∩ past(r(m)) = ∅,

or equivalently,

∀ m ∈ M: {e | (s(m) −→ev e) ∧ (e −→ev r(m))} = ∅.

This formula (illustrated in Fig. 12.2) states that, for any message m, there is a
single causal path from s(m) to r(m) (namely the path made up of the two events
s(m) followed by r(m)). Considered as a predicate, this formula is called the empty
interval property.

Table 12.1 Hierarchies of communication abstractions

Point-to-point: Asynchronous ≺ FIFO channels ≺ Causal message delivery
Broadcast: Asynchronous ≺ FIFO channels ≺ Causal message delivery ≺ Total order

12.1.3 Causal Order with Respect to Other Message Ordering Constraints

It is important to notice that two messages m and m′, which have been co_sent to the same destination process and whose co_sendings are not causally related, can be received in any order. Concerning the abstraction power of causal message delivery
with respect to other ordering constraints, we have the following:
• The message delivery constraint guaranteed by FIFO channels is simply causal
order on each channel taken separately. Consequently, FIFO channels define an
ordering property weaker than causal delivery.
• Let us extend causal message delivery from point-to-point communication to
broadcast communication. This is addressed in Sect. 12.3. Let us recall that the
total order broadcast abstraction presented in Sect. 7.1.4 is stronger than causal
broadcast. It is actually “causal broadcast” plus “same message delivery order at
each process” (even the messages whose co_broadcasts are not causally related
must be co_delivered in the same order at any process).
We consequently have the hierarchies of communication abstractions described
in Table 12.1, where “asynchronous” means no constraint on message delivery, and
≺ means “strictly weaker than”.

12.2 A Basic Algorithm for Point-to-Point Causal Message Delivery

12.2.1 A Simple Algorithm

The algorithm described in this section is due to M. Raynal, A. Schiper, and S. Toueg (1991). It assumes that no process sends messages to itself.

Underlying Principle As done when implementing the FIFO property on top of a non-FIFO channel, a simple solution consists in storing in a local buffer the mes-
sages whose delivery at reception time would violate causal delivery order. The cor-
responding structure of an implementation at a process pi is described in Fig. 12.3.
The key of the algorithm consists in devising a delivery condition that allows us
(a) to delay the delivery of messages that arrive “too early” (to ensure the safety
property of message deliveries), and (b) to eventually deliver all the messages (to
ensure the liveness property). To that end, each process manages the following local
data structures.

Fig. 12.3 Structure of a causal message delivery implementation

• senti [1..n, 1..n] is an array of integers, each initialized to 0. The entry senti [k, ℓ] represents the number of messages co_sent by pk to pℓ , as known by pi . (Let us
recall that a causal message chain ending at a process pi is the only way for pi to
“learn” new information.)
• deliveredi [1..n] is an array of integers, each initialized to 0. The entry
deliveredi [j ] represents the number of messages co_delivered by pi , which have
been co_sent by pj to pi .
• bufferi is the local buffer where pi stores the messages that have been received
and cannot yet be co_delivered. The algorithm is expressed at an abstraction level
at which the use of this buffer is implicit.

Delivery Condition When a process pj co_sends a message m to a process pi , it associates with m its current knowledge of which messages have been co_sent in the system, i.e., the current value of its array sentj [1..n, 1..n]. Let CO(m, sentj ) denote the corresponding message sent to pi .
When pi receives, at the underlying network level, the message CO(m, sent)
from pj , it is allowed to co_deliver m only if it has already co_delivered all the messages m′ which have been sent to it and are such that co_s(m′) −→ev co_s(m).
This delivery condition is captured by the following predicate, which can be locally
evaluated by pi :
 
DC(m) ≡ (∀k : deliveredi [k] ≥ sent[k, i]).

Due to its definition, sent[k, i] is the number of messages sent by pk to pi , to m’s knowledge (i.e., as known by pj when it sent m). Hence, if deliveredi [k] ≥ sent[k, i], pi has already co_delivered all the messages m′ whose sending is in the causal past of the event co_s(m). If this is true for all k ∈ {1, . . . , n}, pi can safely co_deliver m. If there exists a process pk such that deliveredi [k] < sent[k, i], there is at least one message sent by pk to pi , whose sending belongs to the causal past of m, which has not yet been co_delivered by pi . In this case, m is stored in the local buffer bufferi , and remains in this buffer until its delivery condition becomes true.
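The delivery condition is a purely local test on two integer arrays. A direct transcription (illustrative Python, with 0-based indices standing for process identities):

def delivery_condition(delivered_i, sent, i):
    # m, received by pi, can be co_delivered iff pi has already co_delivered,
    # for every pk, at least as many messages from pk as were co_sent to pi
    # in the causal past of m (to m's knowledge).
    return all(delivered_i[k] >= sent[k][i] for k in range(len(delivered_i)))

# p2 (index 2) has co_delivered one message from p0; the matrix attached
# to m says that p0 already co_sent two messages to p2: m must wait.
print(delivery_condition([1, 0, 0], [[0, 0, 2], [0, 0, 0], [0, 0, 0]], 2))  # False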

operation co_send(m) to pj is
(1) send CO(m, senti ) to pj ;
(2) senti [i, j ] ← senti [i, j ] + 1.

when CO(m, sent) is received from pj do


(3) wait (∀k : deliveredi [k] ≥ sent[k, i]);
(4) co_delivery of m to the application layer;
(5) senti [j, i] ← senti [j, i] + 1;
(6) deliveredi [j ] ← deliveredi [j ] + 1;
(7) for each (x, y) ∈ {1, . . . , n}²
(8) do senti [x, y] ← max(senti [x, y], sent[x, y])
(9) end for.

Fig. 12.4 An implementation of causal message delivery (code for pi )

The Algorithm The algorithm based on the previous data structures is described
in Fig. 12.4. When a process pi invokes co_send(m), it first sends the message
CO(m, senti ) to the destination process pj (line 1), and then increases the se-
quence number senti [i, j ], which counts the number of messages co_sent by pi
to pj (line 2). Let us notice that the sequence number of an application message m
is carried by the algorithm message CO(m, sent) (this sequence number is equal to
sent[i, j ] + 1).
When pi receives from the network a message CO(m, sent) from a process pj ,
it stores the message in bufferi until its delivery condition DC(m) becomes satisfied
(line 3). When this occurs, m is co_delivered (line 4), and the control variables are
updated to take this co_delivery into account. First senti [j, i] and deliveredi [j ] are
increased (lines 5–6) to record the fact that m has been co_delivered. Moreover, the
knowledge on the causal past of m (which is captured in sent[1..n, 1..n]) is added
to current knowledge of pi (namely, every local variable senti [x, y] is updated to
max(senti [x, y], sent[x, y]), lines 7–9).
It is assumed that a message is co_delivered as soon as its delivery condition be-
comes true. If, due to the co_delivery of some message m, the conditions associated
with several messages become true simultaneously, these messages are co_delivered
in any order, one after the other.
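To see the algorithm at work, the following self-contained sketch (an illustrative Python transcription of Fig. 12.4, not the book's own code) replays the left pattern of Fig. 12.1: m3 reaches p3 before m1 and is buffered until m1 has been co_delivered.

class Process:
    def __init__(self, i, n):
        self.i, self.n = i, n
        self.sent = [[0] * n for _ in range(n)]
        self.delivered = [0] * n
        self.buffer, self.log = [], []

    def co_send(self, m, j):                 # lines 1-2 of Fig. 12.4
        msg = (m, self.i, [row[:] for row in self.sent])
        self.sent[self.i][j] += 1
        return msg

    def receive(self, msg):                  # lines 3-9 of Fig. 12.4
        self.buffer.append(msg)
        progress = True
        while progress:
            progress = False
            for entry in list(self.buffer):
                m, j, sent = entry
                if all(self.delivered[k] >= sent[k][self.i] for k in range(self.n)):
                    self.buffer.remove(entry)
                    self.log.append(m)       # co_delivery
                    self.sent[j][self.i] += 1
                    self.delivered[j] += 1
                    for x in range(self.n):
                        for y in range(self.n):
                            self.sent[x][y] = max(self.sent[x][y], sent[x][y])
                    progress = True

p1, p2, p3 = (Process(i, 3) for i in range(3))
m1 = p1.co_send("m1", 2)   # p1 -> p3
m2 = p1.co_send("m2", 1)   # p1 -> p2
p2.receive(m2)
m3 = p2.co_send("m3", 2)   # p2 -> p3
p3.receive(m3)             # arrives first: buffered (delivery condition false)
p3.receive(m1)             # both can now be co_delivered, in causal order
print(p3.log)              # ['m1', 'm3']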

Remark Let us observe that, for any pair (i, j ), both senti [j, i] and deliveredi [j ] are initialized to 0 and are updated the same way at the same time (lines 5–6). It follows that the vector deliveredi [1..n] can be replaced by the vector senti [1..n, i]. Line 6 can then be suppressed and the delivery condition becomes
 
DC(m) ≡ (∀k : senti [k, i] ≥ sent[k, i]).

In the following we nevertheless consider the algorithm as written in Fig. 12.4, as it is easier to understand.
Fig. 12.5 Message pattern for the proof of causal order delivery

12.2.2 Proof of the Algorithm

Lemma 8 Let m be an application message sent by pj to pi , and sent[1..n, 1..n] the control information attached to this message. When considering all the messages co_sent by pj to pi , and assuming that sequence numbers start at value 1, sent[j, i] + 1 is the sequence number of m.

Proof As the process pj does not co_send messages to itself, the only line at which
sentj [j, i] is modified is line 2. The proof trivially follows from the initialization of
sentj [j, i] to 0 and the sequentiality of the lines 1 and 2. 

Lemma 9 Message co_delivery respects causal delivery order.

Proof Let m1 and m2 be two application messages such that (a) co_send(m1)
causally precedes co_send(m2), (b) both are co_sent to the same process pi , (c) m1
is co_sent by pj 1 , and (d) m2 is co_sent by pj 2 . Moreover, let CO(m1, sent1) and
CO(m2, sent2) be the corresponding messages sent by the algorithm (see Fig. 12.5).
As there is a causal path from co_send(m1) to co_send(m2), it follows from the
fact that line 2 is executed between send(m1, sent1) and send(m2, sent2) that we
have sent1[j 1, i] < sent2[j 1, i]. We consider two cases.
• j 1 = j 2 (m1 and m2 are co_sent by the same sender). As deliveredi [j 1] is the se-
quence number of the last message co_delivered from pj 1 , it follows from (a) the
predicate deliveredi [j 1] ≥ sent[j 1, i], and (b) the fact that sent[j 1, i] is the se-
quence number of the previous message co_sent by pj 1 to pi (Lemma 8), that
the messages co_sent by pj 1 to pi are co_delivered according to their sequence
number. This proves the lemma for the messages co_sent by the same process.
• j 1 ≠ j 2 (m1 and m2 are co_sent by distinct senders). If the delivery condition of m2 is satisfied we have deliveredi [j 1] ≥ sent2[j 1, i], which means that, as sent2[j 1, i] > sent1[j 1, i], we have then deliveredi [j 1] > sent1[j 1, i]. But, it follows from the previous item that the messages from pj 1 are received according to their sequence numbers, from which we conclude that m1 has been previously co_delivered, which concludes the proof of the lemma. 

Lemma 10 Any message that is co_sent by a process is co_delivered by its destination process.

Proof Let ≺ denote the following relation on the application messages. Let m and m′ be any pair of application messages; m ≺ m′ if co_send(m) causally precedes co_send(m′). As the causal precedence relation −→ev on events is a partial order, so is the relation ≺.
Given a process pi , let pendingi be the set of messages which have been co_sent to it and are never co_delivered. Assuming pendingi ≠ ∅, let m ∈ pendingi be a message that is minimal with respect to ≺ (i.e., a message which has no predecessor—according to ≺—in pendingi ).
As m cannot be co_delivered by pi there is (at least) one process pk such that deliveredi [k] < sent[k, i] (line 3). It then follows from Lemma 9 that there is a message m′ such that (a) m′ has been co_sent by pk to pi , (b) co_send(m′) causally precedes co_send(m), and (c) m′ is not co_delivered by pi (Fig. 12.5 can still be considered after replacing m1 by m′ and m2 by m). Hence this message belongs to pendingi and is such that m′ ≺ m. But this contradicts the fact that m is minimal in pendingi , which concludes the proof of the liveness property. 

Theorem 17 The algorithm described in Fig. 12.4 implements the causal message
delivery abstraction.

Proof The proofs of the validity and integrity properties are trivial and are left to the reader. The proof of the causal message delivery property follows from Lemma 9, and the
proof of the termination property follows from Lemma 10. 

12.2.3 Reduce the Size of Control Information Carried by Messages

The main drawback of the previous algorithm lies in the size of the control information that has to be attached to each application message m. Let b be the number of bits used for each entry of a matrix senti [1..n, 1..n]. As a process does not send messages to itself, the diagonal senti [j, j ], 1 ≤ j ≤ n, can be saved. The size of the control information that is transmitted with each application message is consequently (n² − n)b bits. This section shows how this number can be reduced.

Basic Principle Let us consider a process pi that sends a message CO(m, senti ) to a process pj . The idea consists in sending to pj only the set of values of the entries of the matrix senti that have been modified since the last message CO() sent to pj . The set of values that has to be transmitted by pi to pj is the set of 3-tuples

{⟨k, ℓ, senti [k, ℓ]⟩ | senti [k, ℓ] has been modified since the last co_send of pi to pj }.

Let us remember that a similar approach was used in Sect. 7.3.2 to reduce the
size of the vector dates carried by messages.

A First Solution An easy solution consists in directing each process pi to manage (n − 1) additional matrices, one per process pj , j ≠ i, in such a way that the matrix last_senti [j ] contains the value of the matrix senti when pi sent its last message to pj . Each such matrix is initialized as senti (i.e., 0 everywhere). The code of the operation co_send(m) then becomes:

operation co_send(m) to pj is
   let seti = {⟨k, ℓ, senti [k, ℓ]⟩ | senti [k, ℓ] ≠ last_senti [j ][k, ℓ]};
   send CO(m, seti ) to pj ;
   last_senti [j ] ← senti ;
   senti [i, j ] ← senti [i, j ] + 1.
The code associated with the reception of a message CO(m, set) can be easily mod-
ified to reconstruct the matrix senti used in the algorithm of Fig. 12.4.
As far as local memory is concerned, this approach costs n − 1 additional matrices (without their diagonal) per process, i.e., (n − 1)(n² − n)b ∈ O(n³ b) bits per process. It is consequently worthwhile only when n is small.

A Better Solution: Data Structures This section presents a solution for which
the additional data structures at each process need only (n² + 1)b bits. These data
structures are the following.
• clocki is a local logical clock which measures the progress of pi , counted as the number of messages it has co_sent (i.e., clocki counts the number of relevant events—here they are the invocations of co_send()—issued by pi ). Initially, clocki = 0.
• last_sent_toi [1..n] is a vector, initialized to [0, . . . , 0], such that last_sent_toi [j ] records the local date of the last co_send to pj .
• last_modi [1..n, 1..n] is an array such that last_modi [k, ℓ] is the local date of the last modification of senti [k, ℓ]. The initial value of each last_modi [k, ℓ] is −1. As the diagonal of last_modi is useless, it can be used to store the vector last_sent_toi [1..n].

A Better Solution: Algorithm The corresponding algorithm is described in Fig. 12.6. When a process pi wants to co_send a message m to a process pj , it first computes the set of entries of senti [1..n, 1..n] that have been modified since the last message it sent to pj (line 1). Then, it attaches the corresponding tuples to m, sends them to pj (line 2), and increases the local clock (line 3). Finally, pi updates the other control variables: senti [i, j ] (as in the basic algorithm) and last_modi [i, j ] (line 4), and last_sent_toi [j ] (line 5).
When pi receives a message CO(m, set) from pj , it first checks the delivery condition (line 6). Let us notice that the delivery condition is now only on the pairs (k, i) such that ⟨k, i, −⟩ ∈ set. This is because, for each pair (k′, i) such that ⟨k′, i, −⟩ ∉ set, the local variable sentj [k′, i] of the sender pj has not been modified since its last sending to pi . Consequently the test deliveredi [k′] ≥ sentj [k′, i] was done when the message m′ carrying ⟨k′, i, −⟩ (previously sent by pj to pi ) was

operation co_send(m) to pj is
(1) let seti = {⟨k, ℓ, senti [k, ℓ]⟩ | last_modi [k, ℓ] ≥ last_sent_toi [j ]};
(2) send CO(m, seti ) to pj ;
(3) clocki ← clocki + 1;
(4) senti [i, j ] ← senti [i, j ] + 1; last_modi [i, j ] ← clocki ;
(5) last_sent_toi [j ] ← clocki .

when CO(m, set) is received from pj do
(6) wait (∀ ⟨k, i, x⟩ ∈ set : deliveredi [k] ≥ x);
(7) co_delivery of m to the application layer;
(8) deliveredi [j ] ← deliveredi [j ] + 1;
(9) senti [j, i] ← senti [j, i] + 1; last_modi [j, i] ← clocki ;
(10) for each ⟨k, ℓ, x⟩ ∈ set do
(11)    if (senti [k, ℓ] < x) then senti [k, ℓ] ← x; last_modi [k, ℓ] ← clocki end if
(12) end for.

Fig. 12.6 An implementation reducing the size of control information (code for pi )

Fig. 12.7 Control information carried by consecutive messages sent by pj to pi

received by pi . Let us notice that—except possibly for the first message co_sent by pj —the set set cannot be empty because there is at least the triple ⟨j, i, x⟩, where x is the sequence number of the previous message co_sent by pj to pi . (See Fig. 12.7.)
After m has been co_delivered, pi updates its local control variables deliveredi [j ] and senti [j, i] as in the basic algorithm (lines 8–9). It also updates entries of senti [1..n, 1..n] according to the values it has received in the set set (lines 10–12).
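The date-based bookkeeping of line 1 can be sketched in isolation (illustrative Python; the names follow Fig. 12.6, the surrounding machinery is omitted): the set to send is rebuilt from last_modi and last_sent_toi , so no per-destination matrix has to be stored.

def entries_to_send(sent, last_mod, last_sent_to, j, n):
    # Triples <k, l, sent[k][l]> whose entry was modified at or after the
    # local date of the last co_send to pj (line 1 of Fig. 12.6).
    return {(k, l, sent[k][l])
            for k in range(n) for l in range(n)
            if last_mod[k][l] >= last_sent_to[j]}

n = 3
sent = [[0, 2, 1], [0, 0, 0], [0, 0, 0]]
last_mod = [[-1, 5, 3], [-1] * 3, [-1] * 3]  # local dates of last modification
last_sent_to = [0, 4, 2]                     # date of the last co_send to each pj
print(entries_to_send(sent, last_mod, last_sent_to, 1, n))  # {(0, 1, 2)}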

An Adaptive Solution As done in Sect. 7.3.2 for vector clocks, it is possible to combine the basic algorithm of Fig. 12.4 with the algorithm of Fig. 12.6 to obtain an adaptive algorithm. The resulting sending procedure to be used at line 2 of Fig. 12.6 is described in Fig. 12.8.
As we can see, the test at line 2 is a simple switch that directs pi to attach the least
control information to the message m. The associated delivery condition is then that
of Fig. 12.4 or that of Fig. 12.6 according to the tag of the message.

before sending CO(m, −) to pj is
(1) let s = |seti |;
(2) if ((n² − n)b < s(2 log2 n + b))
(3)    then tag the message to be sent with “0” and send the full matrix senti
(4)    else tag the message to be sent with “1” and send the set seti
(5) end if.

Fig. 12.8 An adaptive sending procedure for causal message delivery

Fig. 12.9 Illustration of causal broadcast

12.3 Causal Broadcast

12.3.1 Definition and a Simple Algorithm

Definition Causal broadcast is a communication abstraction that ensures causal message delivery in the context of broadcast communication. It provides processes
with two operations denoted co_broadcast() and co_deliver(). The only difference
from point-to-point communication is that each message has to be co_delivered by
all the processes (including its sender).
An example is depicted in Fig. 12.9. The message pattern on the left side does
not satisfy the causal message delivery property. This is due to the messages m1 and
m3 : while the broadcast of m1 causally precedes the broadcast of m3 , their delivery
order at p3 does not comply with their broadcast order. Differently, as there is no
causal relation linking the broadcast of m1 and m2 , they can be delivered in any
order at each process (and similarly for m2 and m3 ). The message pattern on the
right side of the figure satisfies the causal message delivery property.

A Simple Causal Broadcast Algorithm A simple algorithm can be derived from


the point-to-point algorithm described in Fig. 12.4. As the sending of a message m to
a process is replaced by the sending of m to all processes, the matrix senti [1..n, 1..n]
can be shrunk into a vector broadcasti [1..n] such that

broadcasti [j ] = senti [j, 1] = · · · = senti [j, n],

which means that broadcasti [j ] represents the number of messages that, to pi ’s


knowledge, have been broadcast by pj . (The initial value of broadcasti [j ] is 0.)

operation co_broadcast(m) is
(1) for each j ∈ {1, . . . , n} \ {i} do send CO_BR(m, broadcasti [1..n]) to pj end for;
(2) broadcasti [i] ← broadcasti [i] + 1;
(3) co_delivery of m to the application layer.

when CO_BR(m, broadcast[1..n]) is received from pj do


(4) wait (∀k : broadcasti [k] ≥ broadcast[k]);
(5) co_delivery of m to the application layer;
(6) broadcasti [j ] ← broadcasti [j ] + 1.

Fig. 12.10 A simple algorithm for causal broadcast (code for pi )

In addition to the previous observation, let us remark that a process pi that
co_broadcasts a message can locally co_deliver it at the very same time. This
is because the local co_delivery of such a message m cannot depend on messages
not yet co_delivered by pi (it causally depends only on the messages previously
co_broadcast or co_delivered by pi ).
The corresponding algorithm is described in Fig. 12.10. When a process pi in-
vokes co_broadcast(m), it sends the message CO_BR(m, broadcasti ) to each other
process (line 1), increases accordingly broadcasti [i] (line 2), and co_delivers m to
itself (line 3).
When it receives a message CO_BR(m, broadcast), a process pi first checks
the delivery condition (line 4). As the sequence numbers of all the messages m
that have been co_broadcast in the causal past of m are registered in the vector
broadcast[1..n], the delivery condition is

∀k : broadcasti [k] ≥ broadcast[k].

When this condition becomes true, m is locally co_delivered (line 5), and
broadcasti [j ] is increased to register this co_delivery of a message co_broadcast
by pj (line 6).
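
The following Python sketch summarizes the sender and receiver sides of Fig. 12.10; it is a minimal single-threaded rendering in which send_to_all and deliver stand for the transport layer and the application layer, respectively (both names are assumptions of this sketch).

class CausalBroadcast:
    def __init__(self, i, n, send_to_all, deliver):
        self.i, self.n = i, n
        self.broadcast = [0] * n      # broadcast[j]: #messages co_broadcast by p_j, to p_i's knowledge
        self.send_to_all = send_to_all
        self.deliver = deliver        # co_delivery to the application layer
        self.pending = []             # received messages whose delivery condition is not yet true

    def co_broadcast(self, m):
        self.send_to_all((m, list(self.broadcast)))   # line 1
        self.broadcast[self.i] += 1                   # line 2
        self.deliver(m)                               # line 3

    def on_receive(self, j, m, bc):
        self.pending.append((j, m, bc))
        self._try_deliver()

    def _try_deliver(self):
        progress = True
        while progress:
            progress = False
            for entry in list(self.pending):
                j, m, bc = entry
                if all(self.broadcast[k] >= bc[k] for k in range(self.n)):  # line 4
                    self.pending.remove(entry)
                    self.deliver(m)                   # line 5
                    self.broadcast[j] += 1            # line 6
                    progress = True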

Remark on the Vectors broadcasti Let us notice that each array broadcasti [1..n]
is nothing more than a vector clock, where the local progress of each process is mea-
sured by the number of messages it has co_broadcast. Due to the delivery condition,
the update at line 6 is equivalent to the vector clock update

for each k ∈ {1, . . . , n} do broadcasti [k] ← max(broadcasti [k], broadcast[k]) end for.

It follows that the technique presented in Sect. 7.3.2 to reduce the size of control
information carried by application messages can be used.

An Example An example of an execution of the previous algorithm is described


in Fig. 12.11. The co_broadcasts of the application messages m1 and m2 are inde-
pendent (not causally related). The co_broadcast of m1 generates the algorithm mes-
sage CO_BR(m1 , [0, 0, 0]). As it has an empty causal past (from the co_broadcast

Fig. 12.11 The causal broadcast algorithm in action

point of view), this message can be co_delivered as soon as it arrives at a pro-


cess. We have the same for the application message m2 , which gives rise to the
algorithm message CO_BR(m2 , [0, 0, 0]). When this message arrives at p1 we
have broadcast1 = [1, 0, 0] > [0, 0, 0], and after p1 has co_delivered m2 , we have
broadcast1 = [1, 1, 0], witnessing that the co_broadcast of m1 and m2 are in the
causal past of the next message that will be co_broadcast by p1 .
Differently, the co_broadcast of m1 belongs to the causal past of m3 , and p3
consequently sends the algorithm message CO_BR(m3 , [1, 0, 0]). As broadcast1 =
[1, 1, 0] when p1 receives CO_BR(m3 , [1, 0, 0]), p1 can immediately co_deliver
m3 when it receives CO_BR(m3 , [1, 0, 0]). Differently, broadcast2 = [0, 1, 0] when
p2 receives CO_BR(m3 , [1, 0, 0]). As broadcast2 [1] = 0 < 1, p2 is forced to delay
the co_delivery of m3 until it has co_delivered m1 .

12.3.2 The Notion of a Causal Barrier

This section presents a simple causal broadcast algorithm that reduces the size of
the control information carried by messages.

Message Causality Graph Let M be the set of all the messages which are
co_broadcast during the execution of an application. Let us define a partial order
relation, denoted ≺im , on application messages as follows. Let m and m′ be any two
application messages. We have m ≺im m′ (read: m immediately precedes m′ in the
message graph) if:
• co_broadcast(m) causally precedes co_broadcast(m′).
• There is no message m′′ such that (a) co_broadcast(m) causally pre-
cedes co_broadcast(m′′), and (b) co_broadcast(m′′) causally precedes
co_broadcast(m′).

Fig. 12.12 The graph of immediate predecessor messages

An example is depicted on Fig. 12.12. The execution is on the left side, and
the corresponding message graph is on the right side. This graph, which captures
the immediate causal precedence on the co_broadcast of messages, states that the
co_delivery of a message depends only on the co_delivery of its immediate pre-
decessors in the graph. As an example, the co_delivery of m4 is constrained by
the co_delivery of m3 and m2 , and the co_delivery of m3 is constrained by the
co_delivery of m1 . Hence, the co_delivery of m4 is not (directly) constrained by the
co_delivery of m1 . It follows that a message has to carry control information only
on its immediate predecessors in the graph defined by the relation ≺im .

Causal Barrier and Local Data Structures Let us associate with each applica-
tion message an identity (a pair made up of a sequence number plus a process iden-
tity). The causal barrier associated with an application message is the set of iden-
tities of the messages that are its immediate predecessors in the message causality
graph. Each process manages accordingly the following data structures.
• causal_barrieri is the set of identities of the messages that are immediate prede-
cessors of the next message that will be co_broadcast by pi . This set is initially
empty.
• deliveredi [1..n] has the same meaning as in previous algorithms. It is initialized
to [0, . . . , 0], and deliveredi [j ] contains the sequence number of the last message
from pj that has been co_delivered by pi .
• sni is a local integer variable initialized to 0. It counts the number of messages
that have been co_broadcast by pi .

Algorithm The corresponding algorithm, in which the causal barrier of a message


m constitutes the control information attached to it, is described in Fig. 12.13.
When a process pi invokes co_broadcast(m), it first builds the message
CO_BR(m, causal_barrieri ) and sends it to all the other processes (line 1). Ac-
cording to the definition of causal_barrieri , a destination process will be able
to co_deliver m only after having co_delivered all the messages whose identi-
ties belong to causal_barrieri . Then, pi co_delivers locally m (line 2), increases
sni (line 3), and resets its causal barrier to {⟨sni , i⟩} (line 4). This is because the
co_delivery of the next message co_broadcast by pi will be constrained by m
(whose identity is the pair ⟨sni , i⟩).

operation co_broadcast(m) is
(1) for each j ∈ {1, . . . , n} \ {i} do send CO_BR(m, causal_barrieri ) to pj end for;
(2) co_delivery of m to the application layer;
(3) sni ← sni + 1;
(4) causal_barrieri ← {⟨sni , i⟩}.

when CO_BR(m, causal_barrier) is received from pj do
(5) wait (∀ ⟨sn, k⟩ ∈ causal_barrier : deliveredi [k] ≥ sn);
(6) co_delivery of m to the application layer;
(7) deliveredi [j ] ← deliveredi [j ] + 1;
(8) causal_barrieri ← (causal_barrieri \ causal_barrier) ∪ {⟨deliveredi [j ], j ⟩}.

Fig. 12.13 A causal broadcast algorithm based on causal barriers (code for pi )

When it receives an algorithm message (m, causal_barrier) from pj , pi delays


the co_delivery of m until it has co_delivered all the messages whose identity be-
longs to causal_barrier, i.e., until it has co_delivered all the immediate predeces-
sors of m (lines 5–6). Then, pi updates deliveredi [j ] (line 7). Finally, pi updates its
causal barrier (line 8). To that end, pi first suppresses from it the message identities
which are in causal_barrier (this is because, due to the delivery condition, it has already
co_delivered the corresponding messages). Then, pi adds to causal_barrieri the
identity of the message m it has just co_delivered, namely the pair ⟨deliveredi [j ], j ⟩
(this is because the message m will be an immediate predecessor of the next mes-
sage that pi will co_broadcast).
Let us finally observe that the initial size of the set causal_barrieri is 0.
Then, as soon as pi has co_broadcast or co_delivered a message, we have 1 ≤
|causal_barrieri | ≤ n.
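
A minimal Python sketch of Fig. 12.13, under the same assumptions as the previous sketch (illustrative send_to_all and deliver callbacks), shows how small the per-process state is: a sequence number, the delivered vector, and a set of message identities.

class CausalBarrierBroadcast:
    def __init__(self, i, n, send_to_all, deliver):
        self.i = i
        self.sn = 0                     # number of messages co_broadcast by p_i
        self.delivered = [0] * n        # delivered[j]: seq. number of last message from p_j
        self.causal_barrier = set()     # identities <sn, k> of the immediate predecessors
        self.send_to_all = send_to_all
        self.deliver = deliver
        self.pending = []

    def co_broadcast(self, m):
        self.send_to_all((m, set(self.causal_barrier)))   # line 1
        self.deliver(m)                                   # line 2
        self.sn += 1                                      # line 3
        self.causal_barrier = {(self.sn, self.i)}         # line 4

    def on_receive(self, j, m, barrier):
        self.pending.append((j, m, barrier))
        self._try_deliver()

    def _try_deliver(self):
        progress = True
        while progress:
            progress = False
            for entry in list(self.pending):
                j, m, barrier = entry
                if all(self.delivered[k] >= sn for (sn, k) in barrier):   # line 5
                    self.pending.remove(entry)
                    self.deliver(m)                                       # line 6
                    self.delivered[j] += 1                                # line 7
                    self.causal_barrier = (self.causal_barrier - barrier) \
                                          | {(self.delivered[j], j)}      # line 8
                    progress = True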

12.3.3 Causal Broadcast with Bounded Lifetime Messages

This section has two aims: show the versatility of the causal barrier notion and
introduce the notion of messages with bounded lifetime. The algorithm presented in
this section is due to R. Baldoni, R. Prakash, M. Raynal, and M. Singhal (1998).

Asynchronous System with a Global Clock Up to now we have considered fully


asynchronous systems in which there is no notion of physical time accessible to
processes. Hence, no notion of expiry date can be associated with messages in such
systems.
We now consider that, while the speed of each process and the transit time of
each message remains arbitrary (asynchrony assumption), the processes have access
to a common physical clock that they can only read. This global clock is denoted
CLOCK. (Such a common clock can be implemented when the distributed applica-
tion covers a restricted geographical area.)

Fig. 12.14 Message with bounded lifetime

Bounded Lifetime Message The lifetime of a message is the physical time du-
ration during which, after the message has been sent, its content is meaningful and
can consequently be used by its destination process(es). A message that arrives at
its destination process after its lifetime has elapsed becomes useless and must be
discarded. For the destination process, it is as if the message was lost. A message
that arrives at a destination process before its lifetime has elapsed must be delivered
by the expiration of its lifetime.
For simplicity, we assume that all the messages have the same lifetime Δ and
processing times are negligible when compared to message transit times.
Let τ be the sending time of a message. The physical date τ + Δ is the deadline
after which this message is useless for its destination process. This is illustrated in
Fig. 12.14. The message m arrives by its deadline and must be processed by its
deadline. By contrast, m′ arrives after its deadline and must be discarded.
It is also assumed that the lifetime Δ is such that, in practice, a great percentage
of messages arrive by their deadline, as is usually the case in distributed multimedia
applications.

Δ-Causal Broadcast Δ-causal broadcast is a causal broadcast in a context in
which the application messages have a bounded lifetime Δ. It is defined by the
operations co_broadcast() and co_deliver() which satisfy the following properties.
• Validity. If a process pi co_delivers a message m from a process pj , m was
co_broadcast by pj .
• Integrity. No message is co_delivered more than once.
• Causal delivery order. For any pair of messages m and m′ that arrive at a process
pj by their deadlines, pj co_delivers m before m′ if co_broadcast(m) causally
precedes co_broadcast(m′).
• Expiry constraint. No message that arrives after its deadline at a process is
co_delivered by this process.
• Termination. Any message that arrives at a destination process pj by its deadline
is processed by pj by its deadline.

Fig. 12.15 On-time versus too late

As shown in Fig. 12.15, it is possible for a message m to be co_delivered by its


deadline at a process pj (termination property), and discarded at another process pk
(expiry constraint).

Message Delivery Condition Let us replace the sequence numbers associated


with messages in the time-free algorithm of Fig. 12.13 by their sending dates, as
defined by the common clock CLOCK. Consequently, the identity of a message m
is now a pair made up of a physical date (denoted sdt) plus a process identity.
An algorithm message sent by a process pj now carries three parts: the concerned
application message m, its sending date st, and the current value of causal_barrierj .
When pi receives such a message CO_BR(m, st, causal_barrier) from pj , it
discards the message if it arrives too late, i.e., if CLOCK − st > Δ. In the other case
(CLOCK − st ≤ Δ), the message has to be co_delivered by time st + Δ. If Δ were
equal to +∞, all the messages would arrive by their deadlines, and they all would
have to be co_delivered. The delivery condition would then be that of the time-free
algorithm, where sequence numbers are replaced by physical sending dates, i.e., we
would have

DC(m) ≡ (∀ ⟨sdt, k⟩ ∈ causal_barrier : deliveredi [k] ≥ sdt).

When Δ ≠ +∞, it is possible that there is a process pk that, in the causal past
of co_broadcast(m), has co_broadcast a message m′ (identified ⟨sdt, k⟩) such that
(a) ⟨sdt, k⟩ ∈ causal_barrier, (b) deliveredi [k] < sdt, and (c) m′ will be discarded
because it will arrive after its deadline. The fact that (i) the co_broadcast of m′
causally precedes the co_broadcast of m, and (ii) m′ is not co_delivered, must not
prevent pi from co_delivering m. To solve this issue, pi delays the co_delivery of m
until the deadline of m′ (namely sdt + Δ), but no more. The final delivery condition
is consequently

DC(m) ≡ ∀ ⟨sdt, k⟩ ∈ causal_barrier : (deliveredi [k] ≥ sdt) ∨ (CLOCK − sdt > Δ).

operation co_broadcast(m) is
(1) st ← CLOCK;
(2) for each j ∈ {1, . . . , n} \ {i} do send CO_BR(m, st, causal_barrieri ) to pj end for;
(3) co_delivery of m to the application layer;
(4) causal_barrieri ← {⟨st, i⟩}.

when CO_BR(m, st, causal_barrier) is received from pj do
(5) current_time ← CLOCK;
(6) if current_time > st + Δ
(7) then discard m
(8) else wait (∀ ⟨sdt, k⟩ ∈ causal_barrier :
(9) (deliveredi [k] ≥ sdt) ∨ (CLOCK > sdt + Δ));
(10) co_delivery of m to the application layer;
(11) deliveredi [j ] ← st;
(12) causal_barrieri ← (causal_barrieri \ causal_barrier) ∪ {⟨st, j ⟩};
(13) end if.

Fig. 12.16 A Δ-causal broadcast algorithm (code for pi )

Algorithm The corresponding Δ-causal broadcast algorithm is described in
Fig. 12.16. It is a simple adaptation of the previous time-free causal broadcast al-
gorithm to (a) physical time and (b) messages with a bounded lifetime Δ. (Taking
Δ = +∞ provides us with a deadline-free causal order algorithm based on physical
clocks.)
As already said, it is assumed that the duration of local processing is 0. Moreover,
the granularity of the common clock is assumed to be such that any two consecutive
invocations of co_broadcast() by a process have different dates.
A message CO_BR(m, st, causal_barrieri ) sent at line 2 carries its sending date
st, so that the receiver can discard it as soon as it arrives, if it arrives too late
(lines 6–7).
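
The heart of Fig. 12.16 is the expiry-aware delivery condition DC(m). The following Python fragment is a sketch of that test, assuming a common clock read through time.monotonic() and an illustrative lifetime value DELTA; the wait is shown in a naive polling form for brevity.

import time

DELTA = 0.100   # assumed lifetime (in seconds); an illustrative value

def dc(causal_barrier, delivered, clock=time.monotonic):
    # DC(m): every immediate predecessor <sdt, k> of m has either been
    # co_delivered (delivered[k] >= sdt) or has expired (clock() - sdt > DELTA)
    return all(delivered[k] >= sdt or clock() - sdt > DELTA
               for (sdt, k) in causal_barrier)

def on_receive(m, st, causal_barrier, delivered, deliver, clock=time.monotonic):
    if clock() - st > DELTA:        # lines 6-7: m arrived after its deadline
        return                      # discard m
    while not dc(causal_barrier, delivered, clock):   # lines 8-9: wait for DC(m)
        time.sleep(0.001)
    deliver(m)                      # line 10: co_delivery by m's deadline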

12.4 The Total Order Broadcast Abstraction

12.4.1 Strong Total Order Versus Weak Total Order

Strong Total Order Broadcast Abstraction The total order broadcast abstraction
was introduced in Sect. 7.1.4 to illustrate the use of scalar clocks. For the self-
completeness of this chapter, its definition is repeated here. This abstraction provides
the processes with two operations denoted to_broadcast() and to_deliver() which
satisfy the following properties. Let us recall that it is assumed that all the messages
which are to_broadcast are different.
• Validity. If a process to_delivers a message m, there is a process that has
to_broadcast m.
• Integrity. No message is to_delivered more than once.

Fig. 12.17 Implementation of total order message delivery requires coordination

• Total order. If a process to_delivers m before m′, no process to_delivers m′ be-
fore m.
• Causal precedence order. If the to_broadcast of m causally precedes the to_
broadcast of m′, no process to_delivers m′ before m.
• Termination. If a process to_broadcasts a message m, every process to_delivers m.

In the following this communication abstraction is called strong total order abstrac-
tion.

Weak Total Order Broadcast Abstraction Some applications do not require that
message delivery complies with causal order. For these applications, the important
point is the fact that the messages are delivered in the same order at each process,
the causality among their broadcast being irrelevant. This defines a weak total order
abstraction, which is total order broadcast without the “causal precedence order”
requirement.

Causal Order Versus Total Order As has been seen in the algorithms ensuring
causal message delivery, when a process pi receives an algorithm message carrying
an application message m, it can be forced to delay the local delivery of m (until
the associated delivery condition becomes true), but it is not required to coordinate
with other processes.
This is the fundamental difference with the algorithms that implement total order
message delivery. Let us consider Fig. 12.17 in which, independently, p1 invokes
to_broadcast(m) and p2 invokes to_broadcast(m′). Let us consider any two distinct
processes pi and pj . As the to_broadcasts of m and m′ are not causally related, it
is possible that pi receives first the algorithm message carrying m and then the one
carrying m′, while pj receives them in the opposite order. If the message delivery
requirement were causal order, there would be no problem, but this is no longer the
case for total order: pi and pj have to coordinate in one way or another, directly or
indirectly, to agree on the same delivery order. This explains why the algorithms im-
plementing the total order message delivery requirement are inherently more costly
(in terms of time and additional exchanged messages) than the algorithms imple-
menting only causal message delivery.

Fig. 12.18 Total order broadcast based on a coordinator process

12.4.2 An Algorithm Based on a Coordinator Process
or a Circulating Token

Using a Coordinator Process A simple solution to implement a total order on


message delivery consists in using a coordinator process pcoord , as depicted in the
example on the left of Fig. 12.18. More precisely, we have the following.
• Each process pi manages a local variable lsni , which is used to associate a se-
quence number with each application message it to_broadcasts. Then, when a
process pi invokes to_broadcast(m), it increases lsni and sends the algorithm
message LTO_BR(m, lsni ) to pcoord .
• The coordinator process pcoord manages a local variable gsn, used to associate
a global sequence number (i.e., a sequence number whose scope is the whole
system) with each message m it receives.
For each process pi , the coordinator processes the messages LTO_BR(m, lsn) it
receives from pi according to their local sequence numbers. Hence, when looking
at the arrival pattern described on the right of Fig. 12.18, pcoord is able to recover
the sending order of m and m′.
The processing of a message m is as follows. First pcoord increases gsn, and
associates the new value with m. It then sends the message GTO_BR(m, gsn) to
all the processes.
• A process pi to_delivers the application messages m, that it receives in algorithm
messages GTO_BR(m, gsn), according to their global sequence numbers.
This simple algorithm implements the strong total order broadcast abstraction
(i.e., total order on the delivery of application messages which complies with mes-
sage causal precedence). It is a two-phase algorithm: For each application message
m, a process sends first an algorithm message to the coordinator process, and then
the coordinator sends an algorithm message to all.
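
A minimal Python sketch of the two phases follows; the wiring functions send_to_coord and send_to_all are assumptions standing for reliable (possibly non-FIFO) channels, which is why local sequence numbers are needed on the coordinator side.

class Client:
    def __init__(self, i, send_to_coord):
        self.i, self.lsn = i, 0
        self.send_to_coord = send_to_coord

    def to_broadcast(self, m):
        self.lsn += 1
        self.send_to_coord((m, self.i, self.lsn))    # phase 1: LTO_BR(m, lsn)

class Coordinator:
    def __init__(self, n, send_to_all):
        self.gsn = 0                      # global sequence number generator
        self.next_lsn = [1] * (n + 1)     # next local seq. number expected from each p_i
        self.out_of_order = {}            # (i, lsn) -> m, received before its turn
        self.send_to_all = send_to_all

    def on_lto_br(self, m, i, lsn):
        self.out_of_order[(i, lsn)] = m
        # process p_i's messages in their local sending order
        while (i, self.next_lsn[i]) in self.out_of_order:
            msg = self.out_of_order.pop((i, self.next_lsn[i]))
            self.next_lsn[i] += 1
            self.gsn += 1
            self.send_to_all((msg, self.gsn))        # phase 2: GTO_BR(m, gsn)

Each process then to_delivers the GTO_BR messages in increasing gsn order, reordering them exactly as one reorders messages on a FIFO channel.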

Replacing the Coordinator Process by a Token The coordinator process can be


replaced by a mobile token which acts as a moving coordinator. The token mes-
sage, denoted TOKEN (gsn), carries a sequence number generator gsn, initialized

Fig. 12.19 Token-based total order broadcast

to 0. When a process pi invokes to_broadcast(m), it waits for the token and, when
it receives the token, it increases gsn by 1, whose new value becomes the global se-
quence number associated with m. Then, pi sends the message GTO_BR(m, gsn) to
all the processes. Finally, a process to_delivers the application messages it receives
according to their sequence numbers. An example is depicted in Fig. 12.19.
As far as the moves of the token are concerned, two approaches are possible.

• As the token is a mobile object, any of the algorithms presented in Chap. 5 can be
used. A process that needs a global sequence number invokes the corresponding
operations acquire_object() and release_object().
This approach can be interesting when the underlying navigation algorithm is
adaptive as, in this case, only the processes that want to to_broadcast messages
are required to participate in the navigation algorithm.
• The processes are arranged along a unidirectional logical ring, and the token
moves perpetually along the ring. When a process receives the token, it com-
putes the next sequence number (if needed), and then forwards the token to the
next process on the ring.
This solution has the drawback that all the processes are required to participate
in the move of the token, even if they do not want to to_broadcast messages. This
approach is consequently more interesting in applications where all the processes
very frequently want to to_broadcast messages.

Mutual Exclusion Versus Total Order As shown by the previous algorithms,


there is a strong relation between mutual exclusion and total order broadcast. In
both cases, a total order has to be implemented: on the accesses to a critical section,
or on the delivery of messages.
Nevertheless, mutual exclusion and total order broadcast differ in the sense that,
in addition to a total order, mutual exclusion prevents two or more processes from
being simultaneously in the critical section (hence the operation release_resource(),
which needs to be invoked to allow the next process to enter the critical section; such
a “release” operation does not exist in total order broadcast).

Fig. 12.20 Clients and servers in total order broadcast

12.4.3 An Inquiry-Based Algorithm

Structural Decomposition: Clients, Servers, and State Machine Let us con-


sider that the processes are decomposed into two groups: the client processes
p1 , . . . , pn , and the server processes q1 , . . . , qm (see Fig. 12.20). Of course sev-
eral processes can be both client and server, but, to simplify the presentation and
without loss of generality, we consider here clients and servers as distinct entities.
This algorithm is due to D. Skeen (1982).
The clients broadcast messages to the servers, and all servers have to deliver
these messages in the same order. Hence, this structure is particularly well suited to
the duplication of a deterministic state machine (e.g., a queue, a stack, etc.), where
each server maintains a copy of the state machine. When a process wants to issue
an operation (or command) on the state machine, it builds a message describing this
operation and broadcasts it to all the servers using a total order broadcast communi-
cation abstraction. The same sequence of operations (commands) will consequently
be applied to all the copies of the state machine.

Underlying Principle The idea is to associate a timestamp with each application


message and to_deliver messages according to their timestamp order. The dates used
in timestamps, which can be considered as delivery dates, are computed as follows
from logical scalar clocks associated with each server.
When a process pi wants to broadcast an application message m, it sends an
inquiry message to each server qj , which sends back to it the current value of its
clock. This value is a proposed delivery date for m. When pi has received all the
dates proposed for the delivery of m, it computes the maximal one, which becomes
the final delivery date of m, and sends it to all the servers. As the same delivery
date can be associated with different messages, (as already announced) the control
data associated with m is actually a timestamp, i.e., a pair made up of the delivery
date of m and the identity of the client that sent it. Finally, each server delivers the
application messages according to their timestamp order.

Communication Graph Let us remark that in the communication pattern gen-


erated by the previous process and server behavior, the clients (resp., the servers)
do not communicate among themselves: A client communicates only with servers
and a server communicates only with clients. The communication graph is a bi-
partite graph with a channel connecting each client and each server as depicted in
Fig. 12.20.

Local Variable at a Client Process A client pi manages a single local variable


denoted sni . Initialized to 0, this variable allows pi to associate a sequence number
with every application message that it to_broadcasts. Hence, each pair (sni , i) is the
identity of a distinct message.

Local Variables at a Server Process A server process qj manages three local


variables.
• clockj is the local scalar clock of qj . It is initialized to 0.
• pendingj is a set, initialized to ∅, which contains the application messages (and
associated control data) received and not yet to_delivered by qj .
• to_deliverablej is a queue, initially empty, at the tail of which qj deposits the next
message to be to_delivered.

Delivery Condition The set pendingj contains tuples of the form ⟨m, date, i, tag⟩,
where m is an application message, i the identity of its sender, date the tentative de-
livery date currently associated with m, and tag is a control bit. If tag = no_del,
the message m has not yet been assigned its final delivery date and cannot consequently
be to_delivered (added to to_deliverablej ). Let us notice that its final delivery date
will be greater than or equal to its current tentative delivery date. If tag = del, m
has been assigned its final delivery date and can be to_delivered if it is stable. Stabil-
ity means that no message in pendingj (and by transitivity, no message to_broadcast
in the future) can have a timestamp smaller than the one associated with m.
The delivery condition DC(m) for a message m such that ⟨m, date, i, del⟩ ∈
pendingj is consequently the following (let us recall that all application messages
are assumed to be different):

∀ ⟨m′, date′, i′, −⟩ ∈ pendingj :
(m′ ≠ m) ⇒ [(date < date′) ∨ ((date = date′) ∧ (i < i′))].

The Total Order Broadcast Algorithm The corresponding total order broadcast
algorithm is described in Fig. 12.21. A client process pi is allowed to invoke again
to_broadcast() only when it has completed its previous invocation (this means that,
while it is waiting at line 3, a process remains blocked until it has received the
appropriate messages).
When it invokes to_broadcast(m), pi first computes the sequence number that
will identify m (line 1). It then sends the message INQUIRY (m, sni ) to each server
(line 2) and waits for their delivery date proposals (line 3). When pi has received

==================== on client side =========================
operation to_broadcast(m) by a client pi is
(1) sni ← sni + 1;
(2) for each j ∈ {1, . . . , m} do send INQUIRY(m, sni ) to qj end for;
(3) wait (a message PROP_DATE(sni , dj ) has been received from each qj );
(4) let date ← max({dj }1≤j≤m );
(5) for each j ∈ {1, . . . , m} do send FINAL_DATE(sni , date) to qj end for.

==================== on server side =========================
when INQUIRY(m, sn) is received from pi by a server qj do
(6) msg_sn ← sn;
(7) clockj ← clockj + 1;
(8) pendingj ← pendingj ∪ {⟨m, clockj , i, no_del⟩};
(9) send PROP_DATE(sn, clockj ) to pi ;
(10) wait (FINAL_DATE(sn, date) received from pi where sn = msg_sn);
(11) replace ⟨m, −, i, no_del⟩ in pendingj by ⟨m, date, i, del⟩;
(12) clockj ← max(clockj , date).

background task T is
(13) repeat forever
(14) wait (∃ ⟨m, date, i, del⟩ ∈ pendingj such that ∀ ⟨m′, date′, i′, −⟩ ∈ pendingj :
(m′ ≠ m) ⇒ [(date < date′) ∨ ((date = date′) ∧ (i < i′))]);
(15) withdraw ⟨m, date, i, del⟩ from pendingj ;
(16) add m at the tail of to_deliverablej
(17) end repeat.

Fig. 12.21 A total order algorithm from clients pi to servers qj

them, pi computes the final delivery date of m (line 4) and sends it to the servers
(line 5). (As we can see, this is nothing more than a classical handshake coordination
mechanism which has been already encountered in other chapters.)
The behavior of a server qj is made up of two parts. When qj receives a message
INQUIRY(m, sn) from a client pi , qj stores the sequence number of the message (line 6),
increases its local clock (line 7), adds ⟨m, clockj , i, no_del⟩ to pendingj (line 8), and
sends to pi a proposed delivery date for m (line 9). Then, as far as m is concerned,
qj waits until it has received the final date associated with m (line 10). When this
occurs, it replaces in pendingj the proposed date by the final delivery date, marks
the message as deliverable (line 11), and updates its local scalar clock (line 12).
The second processing part associated with a server is a background task
which suppresses messages from pendingj and adds them at the tail of the queue
to_deliverablej . The core of this task, described at lines 13–17, is the delivery con-
dition DC(m), which has been previously introduced.
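
The server-side logic of Fig. 12.21, and in particular the stability test of DC(m), can be sketched in Python as follows; the message plumbing is left out, and the class and method names are assumptions of the sketch. Comparing a final timestamp against the tentative timestamps of no_del entries is safe because a final date is never smaller than the tentative date it replaces.

class Server:
    def __init__(self, j):
        self.j = j
        self.clock = 0
        self.pending = {}          # (i, sn) -> [m, date, final?]
        self.to_deliverable = []   # queue of to_delivered messages

    def on_inquiry(self, m, i, sn):
        self.clock += 1                                 # line 7
        self.pending[(i, sn)] = [m, self.clock, False]  # line 8: tag no_del
        return self.clock                               # line 9: PROP_DATE sent back to p_i

    def on_final_date(self, i, sn, date):
        self.pending[(i, sn)][1:] = [date, True]        # line 11: final date, tag del
        self.clock = max(self.clock, date)              # line 12
        self._deliver_stable()

    def _deliver_stable(self):
        # background task, lines 13-17: deliver a message once its timestamp
        # (date, sender identity) is smaller than that of every other pending message
        while True:
            stable = next((key for key, (m, date, final) in self.pending.items()
                           if final and all((date, key[0]) < (d2, k2[0])
                                            for k2, (m2, d2, f2) in self.pending.items()
                                            if k2 != key)),
                          None)
            if stable is None:
                return
            m, _, _ = self.pending.pop(stable)
            self.to_deliverable.append(m)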

12.4.4 An Algorithm for Synchronous Systems

This section considers total order broadcast in a synchronous system. The algorithm
which is described is a simplified version of a fault-tolerant algorithm due to F.
Cristian, H. Aghili, R. Strong, and D. Dolev (1995).

operation to_broadcast(m) is
(1) sdt ← CLOCK;
(2) for each j ∈ {1, . . . , n} do send TO_BR(m, sdt) to pj end for.

when TO_BR(m, sdt) is received from pj do
(3) pendingi ← pendingi ∪ {⟨m, sdt + Δ⟩}.

background task T is
(4) repeat forever
(5) let dmin = smallest date in pendingi ;
(6) wait (CLOCK = dmin);
(7) withdraw from pendingi the messages whose delivery date is dmin,
(8) and add them at the tail of to_deliverablei in increasing timestamp order
(9) end repeat.

Fig. 12.22 A total order algorithm for synchronous systems

Synchronous System The synchrony provided to the processes by the system is


defined as follows:
• There is an upper bound, denoted Δ, on message transit durations.
• There is a global physical clock, denoted CLOCK, that all the processes can read.
(The case where CLOCK is implemented with local physical clocks—which can
drift—is considered in Problem 8.)
The granularity of this clock is such that no two messages sent by the same
process are sent at the same physical date.

The Algorithm The algorithm is described in Fig. 12.22. When a process invokes
to_broadcast(m), it sends the message TO_BR(m, sdt) to all the processes (including
itself), where sdt is the message sending date.
When a process pi receives a message TO_BR(m, sdt), pi delays its delivery until
time sdt + Δ. If several messages have the same delivery date, they are to_delivered
according to their timestamp order (i.e., according to the identity of their senders).
The local variables pendingi and to_deliverablei have the same meaning as in the
previous algorithms.
As we can see, this algorithm is based on the following principle: It systemati-
cally delays the delivery of each message as if its transit duration were equal to the
upper bound Δ. Hence, this algorithm reduces all cases to the worst-case scenario.
It is easy to see that the total order on message delivery, which is the total order on
their sending times (with process identities used to order the messages sent at the
same time), is the same at all the processes.
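
A Python sketch of this Δ-delay principle follows; it assumes a common clock read through time.monotonic(), an illustrative bound DELTA, and a periodic call to tick() standing for the background task of Fig. 12.22.

import heapq
import time

DELTA = 0.050   # assumed upper bound on transit durations; an illustrative value

class SyncTotalOrder:
    def __init__(self, i, send_to_all, deliver, clock=time.monotonic):
        self.i = i
        self.send_to_all = send_to_all
        self.deliver = deliver
        self.clock = clock
        self.pending = []            # heap of (delivery date, sender id, m)

    def to_broadcast(self, m):
        # TO_BR(m, sdt) is sent to all the processes, including p_i itself
        self.send_to_all((m, self.clock(), self.i))

    def on_receive(self, m, sdt, sender):
        # delivery date sdt + DELTA; ties broken by sender identity
        # (by assumption no process sends two messages at the same date)
        heapq.heappush(self.pending, (sdt + DELTA, sender, m))

    def tick(self):
        # called periodically: deliver every message whose delivery date has passed,
        # in (date, sender) order, which is the same total order at every process
        while self.pending and self.pending[0][0] <= self.clock():
            _, _, m = heapq.heappop(self.pending)
            self.deliver(m)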

Total Order in Synchronous Versus Asynchronous Systems Whether the sys-
tem is synchronous or asynchronous, let δ be the average message transit time. As-
suming that processing times are negligible with respect to transit times, this means
that, on average, the to_delivery of a message takes 2δ in the coordinator-based
algorithm of Sect. 12.4.2, and 3δ in the client–server algorithm of Sect. 12.4.3.
Differently, the to_delivery of a message always takes Δ in the synchronous algo-
rithm of Fig. 12.22. It follows that, when considering algorithms executed on top
of a synchronous system, an asynchronous algorithm can be more efficient than a
synchronous algorithm when 2δ < Δ (or 3δ < Δ).

12.5 Playing with a Single Channel

Considering a channel, this section presents four ordering properties that can be
associated with each message sent on this channel, and algorithms which implement
them. This section has to be considered as an exercise in two-process communica-
tion.

12.5.1 Four Order Properties on a Channel

Definitions Let us consider a message m sent by a process pi to a process pj ,
and m′ any other message sent by pi to pj . Four types of delivery constraints can
be associated with the message m. Let s(m) and del(m) be the events "sending of
m" and "delivery of m", respectively (and similarly for m′). The four types of deliv-
ery constraints which can be imposed on m are denoted ct_future, ct_past,
marker, and ordinary. They are defined as follows, where type(m) denotes the
constraint associated with m.
constraint associated with m.
• type(m) = ct_future. In this case, m cannot be bypassed by messages m′ sent
after it (m controls its future). Moreover, nothing prevents m from bypassing mes-
sages m′ sent before it. This constraint is depicted in Fig. 12.23: all the messages
m′ sent after m are delivered by pj after m.
Formally, if type(m) = ct_future, we have for any other message m′ sent
by pi to pj :

(s(m) →ev s(m′)) ⇒ (del(m) →ev del(m′)).

• type(m) = ct_past. In this case, m cannot bypass the messages m′ sent before
it (m is controlled by its past). Moreover, nothing prevents m from being bypassed
by messages m′ sent after it. This constraint is depicted in Fig. 12.24: all the
messages m′ sent before m are delivered by pj before m.
Formally, if type(m) = ct_past, we have for any other message m′ sent by
pi to pj :

(s(m′) →ev s(m)) ⇒ (del(m′) →ev del(m)).

Fig. 12.23 Message m with type ct_future (cannot be bypassed)

Fig. 12.24 Message m with type ct_past (cannot bypass other messages)

Fig. 12.25 Message m with type marker

• type(m) = marker. In this case, m can neither be bypassed by messages m′ sent
after it, nor bypass messages m′ sent before it. This constraint is depicted in
Fig. 12.25. (This type of message has been implicitly used in Sect. 6.6, devoted
to the determination of a consistent global state of a distributed computation.)
Formally, if type(m) = marker, we have for any other message m′ sent by
pi to pj :

((s(m) →ev s(m′)) ⇒ (del(m) →ev del(m′)))
∧ ((s(m′) →ev s(m)) ⇒ (del(m′) →ev del(m))).

• type(m) = ordinary. In this case, m imposes no delivery constraints on the


other messages sent by pi to pj .
It is easy to see that if the type of all the messages sent by pi to pj is marker, the
channel behaves as a FIFO channel. The same is true if all the messages are typed
ct_future, or if all of them are typed ct_past.

12.5.2 A General Algorithm Implementing These Properties

This section presents an algorithm which ensures that the messages sent by pi to pj
are delivered according to the constraints defined by their types.

A Simple Algorithm An algorithm that builds a FIFO channel on top of a non-


FIFO channel can be used. Considering that (whatever their actual types) all the
messages are typed ordinary, such an algorithm ensures that each message m
behaves as if it was typed marker. As this type is stronger than the other types, all
the constraints defined by the types of all the messages are trivially satisfied.

operation send(m, type(m)) by pi to pj is
(1) sni ← sni + 1;
(2) send (m, sni ) to pj .

when (m, sn) is received by pj from pi do
(3) if (last_snj + 1 ≠ sn)
(4) then pendingj ← pendingj ∪ {(m, sn)}
(5) else deliver m; last_snj ← last_snj + 1;
(6) while (∃ (m′, sn′) ∈ pendingj : sn′ = last_snj + 1)
(7) do deliver m′; withdraw (m′, sn′) from pendingj ; last_snj ← last_snj + 1
(8) end while
(9) end if.

Fig. 12.26 Building a first in first out channel

Such an algorithm is described in Fig. 12.26. The sender pi manages a local vari-
able sni (initialized to 0), that it uses to associate a sequence number with each
message. The receiver process pj manages two local variables: last_snj (initialized
to 0) contains the sequence number of the last message of pi that it has delivered;
pendingj is a set (initially empty) which contains the messages (with their sequence
numbers) received and not yet delivered by pj . The sequence numbers allow pj
to deliver the messages in their sending order.
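
A Python rendering of the receiver side makes the reordering explicit (a minimal sketch; the deliver callback stands for the application layer):

class FifoReceiver:
    def __init__(self, deliver):
        self.last_sn = 0          # sequence number of the last delivered message
        self.pending = {}         # sn -> m, messages received out of order
        self.deliver = deliver

    def on_receive(self, m, sn):
        if sn != self.last_sn + 1:
            self.pending[sn] = m                       # lines 3-4: not the next one
        else:
            self.deliver(m)                            # line 5
            self.last_sn = sn
            while self.last_sn + 1 in self.pending:    # lines 6-8: flush the gap-free prefix
                self.last_sn += 1
                self.deliver(self.pending.pop(self.last_sn))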

A Genuine Algorithm The previous algorithm is stronger than necessary: What-


ever the type of each message, it forces all of them to behave as if they were typed
marker. This section presents a genuine algorithm, i.e., an algorithm which im-
poses to messages only the constraints defined by their type.
To that end, in addition to sni (managed by pi ) and pendingj (managed by pj ),
the sender pi and the receiver pj manage the following local variables:
• no_bypassi is a local variable of pi that contains the sequence number of the last
message which has been sent and cannot be bypassed by the messages that will
be sent in the future. It is initialized to 0, and updated when pi sends a message
whose type is ct_future or marker (this is because, due to the definition of
these types, such messages cannot be bypassed by messages sent in the future).
• deliv_snj is a set, local to pj , that contains the sequence numbers of all the messages
that pj has already delivered. Its initial value is {0} (the sequence numbers start
at 1). (As the messages are not necessarily delivered in their sending order, a
simple counter such as last_snj is no longer sufficient to register which messages
have been delivered.)
The genuine algorithm is described in Fig. 12.27. When pi wants to send a mes-
sage m to pj , it associates the next sequence number sni with m, and sends the
message (m, sni , type(m), barrier) to pj . The value barrier is a piece of control data, which
is the sequence number of the last message m′ that m cannot bypass. If type(m) is
ct_past or marker, due to the definition of these types, m′ is the last message
that pi sent before m (lines 3 and 4). If type(m) is ct_future or ordinary,
barrier is set to the value of no_bypassi (lines 2 and 5). Moreover, if type(m) is

operation send(m, type(m)) by pi to pj is
(1) sni ← sni + 1;
(2) case (type(m) = ct_future) then barrier ← no_bypassi ; no_bypassi ← sni
(3) (type(m) = marker) then barrier ← sni − 1; no_bypassi ← sni
(4) (type(m) = ct_past) then barrier ← sni − 1
(5) (type(m) = ordinary) then barrier ← no_bypassi
(6) end case;
(7) send (m, sni , type(m), barrier) to pj .

when (m, sn, type, barrier) is received from pi do
(8) if [(type ∈ {ct_past, marker}) ∧ ({1, . . . , barrier} ⊆ deliv_snj )]
∨ [(type ∈ {ordinary, ct_future}) ∧ (barrier ∈ deliv_snj )]
(9) then deliver m; deliv_snj ← deliv_snj ∪ {sn}
(10) while (∃ (m′, sn′, type′, barrier′) ∈ pendingj such that DC(m′))
(11) do deliver m′; deliv_snj ← deliv_snj ∪ {sn′};
(12) withdraw (m′, sn′, type′, barrier′) from pendingj
(13) end while
(14) else add (m, sn, type, barrier) to pendingj
(15) end if.

Fig. 12.27 Message delivery according to message types

ct_future or marker, the local variable no_bypassi is set to sni , as no message


sent in the future will be allowed to bypass m (whose sequence number is sni ).
When pj receives an application message (m, sn, type, barrier), it checks the
delivery condition (line 8). If the condition is false, pj stores (m, sn, type, barrier)
in pendingj (line 14). If the condition is true, pj delivers the application message
m, and adds its sequence number to deliv_snj (line 9). Moreover, it also delivers the
messages in pendingj whose delivery condition has became true due to the delivery
of m, or—transitively—the delivery of other messages (lines 10–13).

The Delivery Condition Let (m, sn, type, barrier) be an algorithm message re-
ceived by pj . The delivery condition DC(m) associated with m depends on the type
of m.
• If type(m) is ct_past or marker, m has to be delivered after all the messages
sent before it, which means that we need to have {1, . . . , barrier} ⊆ deliv_snj .
• If type(m) is ct_future or ordinary, m can be delivered before messages
sent before it. Hence, the only constraint is that its delivery must not violate the
requirements imposed by the type of other messages. But these requirements are
captured by the message parameter barrier, which states that the delivery of the
message whose sequence number is barrier has to occur before the delivery of
m. Hence, for these message types, the delivery condition is barrier ∈ deliv_snj .
An example is depicted in Fig. 12.28. The sequence numbers of m1 , m2 ,
m3 , and m4 , are 15, 16, 17, and 18, respectively. Before the sending of m1 ,
no_bypassi = 10. These messages can be correctly delivered in the order m2 ,
m4 , m3 , m1 .

Fig. 12.28 Delivery of messages typed ordinary and ct_future

To summarize, the delivery condition DC(m) is:

DC(m) ≡ [(type(m) ∈ {ct_past, marker}) ∧ ({1, . . . , barrier} ⊆ deliv_snj )]
∨ [(type(m) ∈ {ordinary, ct_future}) ∧ (barrier ∈ deliv_snj )].
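
The two-branch condition translates directly into code. The following Python sketch implements the receiver side of Fig. 12.27 (the deliver callback is an assumption of the sketch):

CT_FUTURE, CT_PAST, MARKER, ORDINARY = "ct_future", "ct_past", "marker", "ordinary"

def dc(typ, barrier, deliv_sn):
    # DC(m) as summarized above; deliv_sn is the set of delivered sequence numbers
    if typ in (CT_PAST, MARKER):
        return set(range(1, barrier + 1)) <= deliv_sn
    return barrier in deliv_sn          # ordinary or ct_future

class TypedReceiver:
    def __init__(self, deliver):
        self.deliv_sn = {0}             # sequence numbers start at 1
        self.pending = []
        self.deliver = deliver

    def on_receive(self, m, sn, typ, barrier):
        if dc(typ, barrier, self.deliv_sn):           # line 8
            self.deliver(m)                           # line 9
            self.deliv_sn.add(sn)
            self._flush()                             # lines 10-13
        else:
            self.pending.append((m, sn, typ, barrier))   # line 14

    def _flush(self):
        progress = True
        while progress:
            progress = False
            for entry in list(self.pending):
                m, sn, typ, barrier = entry
                if dc(typ, barrier, self.deliv_sn):
                    self.pending.remove(entry)
                    self.deliver(m)
                    self.deliv_sn.add(sn)
                    progress = True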

12.6 Summary
The aim of this chapter was to present two communication abstractions, namely,
the causal message delivery abstraction and the total order broadcast abstraction.
These abstractions have been defined and algorithms implementing them have been
described. Variants suited to bounded lifetime messages and synchronous systems
have also been presented. Finally, as an exercise, the chapter has investigated order-
ing properties which can be imposed on messages sent on a channel.

12.7 Bibliographic Notes


• The notion of message causal ordering is due to K.P. Birman and T.A. Joseph [53].
This notion was then extended to overlapping groups of processes and multicast
communication in [55].
Developments on the notion of causal order can be found in [88, 90, 217,
254, 340].
• The point-to-point causal message ordering described in Sect. 12.2 is due to
M. Raynal, A. Schiper, and S. Toueg [324].
Other causal order algorithms can be found in [255, 296, 336].
• Causal order-based algorithms which compute a consistent snapshot of a dis-
tributed computation are described in [1, 14].
• The technique to reduce the size of control information presented in Sect. 12.2.3
is due to F. Mattern.
• The causal broadcast algorithm of Sect. 12.3 is from [55, 324].

• The notion of a causal barrier and the associated causal broadcast algorithm of
Sect. 12.3.2 are due to R. Baldoni, R. Prakash, M. Raynal, and M. Singhal [39].
This algorithm was extended to mobile environments in [298].
• The notion of bounded lifetime messages and the associated causal order algo-
rithm presented in Sect. 12.3.3 are due to R. Baldoni, A. Mostéfaoui, and M. Ray-
nal [38].
• The notion of total order broadcast was introduced in many systems (e.g., [84]
for an early reference).
• The client/server algorithm presented in Sect. 12.4.3 is due to D. Skeen [353].
• The total order algorithm for synchronous systems presented in Sect. 12.4.4 is
a simplified version of a fault-tolerant algorithm due to F. Cristian, H. Aghili,
R. Strong, and D. Dolev [102].
• State machine replication was introduced by L. Lamport in [226]. A general pre-
sentation of state machine replication can be found in [339].
• The notions of the message types ordinary, marker, controlling the past, and con-
trolling the future, are due to M. Ahuja [12]. The genuine algorithm presented in
Sect. 12.5.2 is due to M. Ahuja and M. Raynal [13].
• A characterization of message ordering specifications and algorithms can be
found in [274]. The interconnection of systems with different message delivery
guarantees is addressed in [15].
• While this book is devoted to algorithms in reliable message-passing systems,
the reader will find algorithms that implement causal message broadcast in asyn-
chronous message-passing systems in [316], and algorithms that implement total
order broadcast in these systems in [24, 67, 242, 316].

12.8 Exercises and Problems

1. Prove that the empty interval predicate stated in Sect. 12.1.2 is a characterization
of causal message delivery.
2. Prove that the causal order broadcast algorithm described in Fig. 12.10 is correct.
Solution in [39].
3. When considering the notion of a causal barrier introduced in Sect. 12.3.2, show
that any two messages that belong simultaneously to causal_barrieri are inde-
pendent (i.e., the corresponding invocations of co_broadcast() are not causally
related).
4. Modify the physical time-free algorithm of Fig. 12.4 so that it implements causal
message delivery in a system where the application messages have a bounded
lifetime Δ. Prove then that the resulting algorithm is correct.
Solution in [38].
5. Let us consider an asynchronous system where the processes are structured
into (possibly overlapping) groups. As an example, when considering five pro-
cesses, a possible structuring into four groups is G1 = {p1 , p2 , p4 , p5 }, G2 =
{p2 , p3 , p4 }, G3 = {p3 , p4 , p5 }, and G4 = {p1 , p5 }.

When a process sends a message m, it sends m to all the processes of a group


to which it belongs. As an example, when p2 sends a message m, it sends m to
the group G1 or to the group G2 . The sending of a message to a group is called
multicast.
A simple causal multicast algorithm consists in using an underlying causal
broadcast algorithm in which each process discards all the messages that have
been multicast in a group to which it does not belong. While this solution works,
it is not genuine in the sense that each process receives all messages.
Design a causal multicast algorithm in which a message sent to a group is sent
only to the processes of this group.
Solution in [55].
6. As in the previous problem, let us consider an asynchronous system in which the
processes are structured into (possibly overlapping) groups. Design a genuine
total order multicast algorithm (i.e., an algorithm in which a message sent to a
group is sent only to the processes of this group).
Solution in [55, 135]. (The algorithms described in these papers consider systems
in which processes may crash.)
7. Let us consider the coordinator-based total order broadcast algorithm presented
in Sect. 12.4.2.
(a) Prove that this algorithm implements the strong total order broadcast abstrac-
tion.
(b) Does this algorithm implement the strong total order broadcast abstraction
when all sequence numbers are suppressed but channels to and from the
coordinator process are FIFO?
(c) Let us suppress the local variable lsni of each process pi and replace each
algorithm message LTO_BR(m, lsni ) by LTO_BR(m). Hence, the only se-
quence numbers used in this modified algorithm are the global sequence
numbers generated by the coordinator process pcoord . Show that this modi-
fied algorithm implements the weak total order broadcast abstraction.
8. Let us consider a synchronous system where the global clock CLOCK is im-
plemented with a physical clock ph_clocki per process pi . Moreover, the syn-
chronization of these physical clocks is such that their drift is bounded by ε.
This means that we always have |ph_clocki − ph_clockj | ≤ ε for any pair of
processes pi and pj .
Modify the total order algorithm described in Fig. 12.22 so that it works with
the previous physical clocks.
9. Let us consider a channel connecting the processes pi and pj , and the mes-
sage types ct_future, ct_past, and marker, introduced in Sect. 12.5.
Show that it is impossible to implement the message type marker from
messages typed ct_future and ct_past. (At the application level, pi can
send an arbitrary number of messages typed ct_future or ct_past.)
Chapter 13
Rendezvous (Synchronous) Communication

While the previous chapter was devoted to communication abstractions on message


ordering, this chapter is on synchronous communication (also called logically in-
stantaneous communication, or rendezvous, or interaction). This abstraction adds
synchronization to communication. More precisely, it requires that, for a message
to be sent by a process, the receiver has to be ready to receive it. From an external
observer's point of view, the message transmission looks instantaneous: The sending and
the reception of a message appear as a single event (and the sense of the commu-
nication could have been in the other direction). From an operational point of view,
we have the following: For each pair of processes, the first process that wants to
communicate—be it the sender or the receiver—has to wait until the other process
is ready to communicate.
This chapter first defines synchronous communication and introduces a charac-
terization based on a specific message pattern called a crown. It then presents sev-
eral implementations of this communication abstraction, each suited to a specific
context. It also describes implementations for real-time rendezvous in the context
of synchronous message-passing systems. In this case, each process is required to
associate a deadline with each of its rendezvous.

Keywords Asynchronous system · Client–server hierarchy ·


Communication initiative · Communicating sequential processes · Crown ·
Deadline-constrained interaction · Deterministic vs. nondeterministic context ·
Logically instantaneous communication · Planned vs. forced interaction ·
Rendezvous · Multiparty interaction · Synchronous communication ·
Synchronous system · Token

13.1 The Synchronous Communication Abstraction

13.1.1 Definition

Underlying Intuition When considering an asynchronous message-passing sys-


tem, two events are associated with each message, namely its send event and its
receive event. This captures the fact that a message takes time to transit from its
sender to its destination process, and this duration can be arbitrary.


Fig. 13.1 Synchronous communication: messages as “points” instead of “intervals”

The intuition that underlies the definition of the synchronous communication


abstraction is to consider messages as “points” instead of “time intervals” whose
length is arbitrary. When considering messages as points (the send and receive
events of each message being pieced together into a single point), the set of mes-
sages is structured as a partial order. As we will see below, this allows us to reason
on distributed objects (each object being managed by a distinct process) as if they
were all kept in a shared common memory. Hence, while it reduces asynchrony,
synchronous communication makes reasoning and program analysis easier.
A simple example that shows how synchronous communication reduces mes-
sages as time intervals into messages as points is described in the right part of
Fig. 13.1. From a space-time diagram point of view, reducing messages to points
amounts to considering the transit time of each message as an arrow of zero dura-
tion (from an omniscient observer’s point of view), i.e., as a vertical arrow. Messages
considered as intervals are described on the left part of Fig. 13.1. It is easy to see that
this communication pattern does not comply with synchronous communication: if
the transfer of m2 and m3 can be depicted as vertical arrows, that of m1 cannot. This
is because the sending of m1 has to appear before the point m2 , which appears be-
fore the point m3 , which in turn appears before the reception of m1 . Such a schedule
of events, which prevents a message pattern from being synchronous, is formalized
below in Sect. 13.1.3.

Sense of Message Transfer As shown by the execution on the right of Fig. 13.1,
it is easy to see that, when considering synchronous communication, the sense of
direction for message transfer is irrelevant.
The fact that any of the messages m1 , m2 , or m3 would be transmitted in the other
direction, does not change the fact that message transfers remain points when con-
sidering synchronous communication. We can even assume such a point abstracts
the fact that two messages are simultaneously exchanged, one in each direction.
Hence, the notions of sender and receiver are not central in synchronous communi-
cation.

Definition The synchronous communication abstraction provides the processes


with two operations, denoted synchr_send() and synchr_deliver(), which allow
them to send and receive messages, respectively. As in the previous chapter, we say

“a process synchr_sends or synchr_delivers a message”, and assume (without loss of


generality) that all the application messages are different. These operations satisfy
the following properties, where the set H of events under consideration contains the
events at the program level. The following notations are used: sy_s(m) and sy_del(m)
denote the events associated with the synchronous send and synchronous delivery
of m, respectively.
• Validity. If a process pi synchr_delivers a message from a process pj , then pj
has synchr_sent this message.
• Integrity. No message is synchr_delivered more than once.
• Synchrony. Let D be a dating function from the set of events H (program level)
into the scalar time domain (set of natural numbers). We have:
– For any two events e1 and e2 produced by the same process:
(e1 →ev e2 ) ⇒ (D(e1 ) < D(e2 )).
– For any message m: D(sy_s(m)) = D(sy_del(m)).
• Termination. Assuming that each process executes synchr_deliver() enough
times, every message that was synchr_sent is synchr_delivered.
The validity property (neither creation nor corruption of messages), integrity
property (no duplication), and termination property are the classical properties as-
sociated with message communication abstractions. The synchrony property states
that there is a notion of time that increases inside each process taken individually, and,
for each message m, associates the same integer date with its synchr_send and syn-
chr_deliver events. This date, which is the "time point" associated with m, expresses
the fact that m has been transmitted instantaneously when considering the time
frame defined by the dating function D().
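
As an illustration of the synchrony property, the following Python sketch implements a rendezvous channel between two threads standing for processes: synchr_send() returns only after the matching synchr_deliver() has taken the message, so an external observer can assign the same date to both events. The class and its two semaphores are assumptions of this sketch, not part of the abstraction's definition.

import threading

class RendezvousChannel:
    def __init__(self):
        self._slot = None
        self._senders = threading.Lock()          # serializes concurrent senders
        self._offered = threading.Semaphore(0)    # a message has been offered
        self._taken = threading.Semaphore(0)      # the receiver has taken it

    def synchr_send(self, m):
        with self._senders:
            self._slot = m
            self._offered.release()   # offer the message
            self._taken.acquire()     # block until the matching synchr_deliver()

    def synchr_deliver(self):
        self._offered.acquire()       # block until a message is offered
        m = self._slot
        self._taken.release()         # the rendezvous point: unblock the sender
        return m

With two threads t1 and t2 sharing a channel ch, t1 executing ch.synchr_send(v) blocks until t2 invokes ch.synchr_deliver(), which matches the operational description given above: the first process arriving at the rendezvous waits for the other.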

13.1.2 An Example of Use

Let us consider a simple application in which two processes p1 and p2 share two
objects, each implemented by a server process. These objects are two FIFO queues
Q1 and Q2 , implemented by the processes q1 and q2 , respectively.
The operations on a queue object Q are denoted Q.enqueue(x) (where x is the
item to enqueue), and Q.dequeue(), which returns the item at the head of the queue
or ⊥ if the queue is empty. The invocations of Qj .enqueue(x) and Qj .dequeue()
by a process are translated as described in Table 13.1. (Due to the fact that commu-
nications are synchronous, the value returned by Qj .dequeue() can be returned by
the corresponding synchronous invocation.)
Let us consider the following sequences of invocations by p1 and p2 :

p1 : . . . synchr_send(enqueue, a) to q1 ; synchr_send(enqueue, b) to q2 ; ...


p2 : . . . synchr_send(enqueue, c) to q2 ; synchr_send(enqueue, d) to q1 ; ...

Table 13.1 Operations as messages

Queue operation     Translated into
Qj.enqueue(x)       synchr_send(enqueue, x) to qj
Qj.dequeue()        synchr_send(dequeue) to qj
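
In a host language, this translation can be sketched as follows (a minimal Python sketch; the wrapper names are ours, and synchr_send() is assumed to be provided by the synchronous communication layer, the rendezvous allowing the dequeue invocation to hand back the server's return value):

# Client-side wrappers for the translation of Table 13.1 (our own naming).
# Because the communication is a rendezvous, the dequeue invocation can
# directly return the value handed over by the server process q_j.

def enqueue(qj, x, synchr_send):
    synchr_send(("enqueue", x), to=qj)

def dequeue(qj, synchr_send):
    # Returns the item at the head of the queue, or None (standing for
    # the ⊥ value of the text) if the queue is empty.
    return synchr_send(("dequeue",), to=qj)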

In a context where the communications are asynchronous (i.e., when
synchr_send() and synchr_deliver() are replaced by send() and receive(),
respectively), the message pattern depicted in Fig. 13.2 can occur, where the label
associated with a message is the value it carries.

Fig. 13.2 When communications are not synchronous

When all messages have been received, Q1 contains d followed by a, while
Q2 contains b followed by c, as shown on the right part of the figure. But this
is inconsistent with the order of the invocations in p1 and p2. As Q2 contains b
followed by c, it follows that the invocation of send(enqueue, a) by p1 occurred
before the invocation of send(enqueue, d) by p2. But, due to the asynchrony of the
channels, the server q1 receives first d and then a.
Synchronous communication solves this inconsistency problem. Several scenarios
can happen, but any scenario that happens guarantees that the queues are consistent
with the order in which the processes have sent the messages. Figure 13.3
describes one of the three possible consistent scenarios. The three scenarios differ
according to the speed of the processes p1 and p2. In the scenario of the figure, p1
and p2 first enqueue a in Q1 and c in Q2, respectively. Then p1 enqueues b in Q2
while p2 enqueues d in Q1. (In another possible scenario Q1 contains a followed
by d, while Q2 contains b followed by c. This scenario is the case where p1
accesses each object before p2. In the last possible scenario Q1 contains d followed
by a, while Q2 contains c followed by b. This scenario is the case where p2
accesses each object before p1.)

Fig. 13.3 Accessing objects with synchronous communication

13.1.3 A Message Pattern-Based Characterization

Synchronous Communication Is Strictly Stronger than Causal Order To
show that synchronous communication is stronger than causal order, let us consider
two messages m1 and m2, sent to the same process, which are such that the event
sy_s(m1) causally precedes the event sy_s(m2). We have the following.
• D(sy_s(m1)) = D(sy_del(m1)) (from communication synchrony).
• D(sy_s(m2)) = D(sy_del(m2)) (from communication synchrony).
• D(sy_s(m1)) < D(sy_s(m2)) (due to sy_s(m1) −→ev sy_s(m2)).
• D(sy_del(m1)) < D(sy_del(m2)), which follows from the previous items.
Consequently, as the function D() is strictly increasing inside a process, and any
two events of a process are ordered, we necessarily have
sy_del(m1) −→ev sy_del(m2).
To show that synchronous communication is strictly stronger than causal order, let
us observe that the message pattern described in the left part of Fig. 13.1 satisfies
the causal message delivery order, while it does not satisfy the synchronous
communication property.

The Crown Structure As we are about to see, the notion of a crown allows for a
simple characterization of synchronous communication. A crown (of size k ≥ 2) is
a sequence of messages m1, m2, . . . , mk such that we have

sy_s(m1) −→ev sy_del(m2),
sy_s(m2) −→ev sy_del(m3),
. . . ,
sy_s(mk−1) −→ev sy_del(mk),
sy_s(mk) −→ev sy_del(m1).
Examples of crowns are depicted in Fig. 13.4. On the left side, the event sy_s(m1)
causally precedes the event sy_del(m2), and the event sy_s(m2) causally precedes
the event sy_del(m1), creating a crown of size k = 2 made up of the messages m1
and m2. The right side of the figure describes a crown of size k = 3. Hence, a crown
is a specific message pattern whose interest lies in the theorem that follows.

Fig. 13.4 A crown of size k = 2 (left) and a crown of size k = 3 (right)

Theorem 18 The communication pattern of a distributed execution satisfies the
synchronous communication property if and only if there is no crown.

Proof Let us first show that if the communication pattern satisfies the synchronous
communication property, there is no crown. To that end, let us assume (by
contradiction) that the communications are synchronous and there is a crown.
Hence:
• There is a sequence of k ≥ 2 messages m1, m2, . . . , mk such that
sy_s(m1) −→ev sy_del(m2), . . . , sy_s(mk) −→ev sy_del(m1). It follows that,
∀x ∈ {1, . . . , k − 1}, we have D(sy_s(mx)) < D(sy_del(mx+1)), and
D(sy_s(mk)) < D(sy_del(m1)).
• As the communications are synchronous, we have ∀x ∈ {1, . . . , k}:
D(sy_s(mx)) = D(sy_del(mx)).
Combining the two previous items, we obtain

D(sy_s(m1)) < D(sy_del(m2)) = D(sy_s(m2)),
D(sy_s(m2)) < D(sy_del(m3)) = D(sy_s(m3)), etc., until
D(sy_del(mk)) = D(sy_s(mk)) < D(sy_del(m1)) = D(sy_s(m1)),

i.e., D(sy_s(m1)) < D(sy_s(m1)), which is a contradiction.
To show that the communication pattern is synchronous if there is no crown, let
us consider the following directed graph G. Its vertices are the application messages
exchanged by the processes, and there is an edge from a message m to a message m′
if one of the following patterns occurs (Fig. 13.5):
• sy_s(m) −→ev sy_s(m′), or
• sy_s(m) −→ev sy_del(m′), or
• sy_del(m) −→ev sy_s(m′), or
• sy_del(m) −→ev sy_del(m′).

Fig. 13.5 Four message patterns

It is easy to see that each case implies sy_s(m) −→ev sy_del(m′), which means
that a directed edge (m, m′) belongs to G if and only if sy_s(m) −→ev sy_del(m′).
As there is no crown, it follows that G is acyclic. G can consequently be
topologically sorted, and such a topological sort defines a dating function D()
trivially satisfying the properties defining synchronous communication. □
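
The constructive direction of this proof can be turned into a direct test of a finite execution trace: build the graph G and try to topologically sort it. The sketch below is ours (not the book's); it assumes a function causally_precedes(e1, e2) encoding the relation e1 −→ev e2 is available (computed, for instance, from the causal past of each event).

from graphlib import TopologicalSorter, CycleError

def synchronous_dating(messages, causally_precedes):
    # Build G: an edge (m, m2) iff sy_s(m) -->ev sy_del(m2); events are
    # encoded as pairs ("sy_s", m) and ("sy_del", m).
    preds = {m: set() for m in messages}
    for m in messages:
        for m2 in messages:
            if m != m2 and causally_precedes(("sy_s", m), ("sy_del", m2)):
                preds[m2].add(m)        # m must be dated before m2
    try:
        order = list(TopologicalSorter(preds).static_order())
    except CycleError:                  # a cycle of G is exactly a crown
        return None
    # Date the two events of the x-th message with the integer x.
    return {m: x for x, m in enumerate(order)}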

13.1.4 Types of Algorithms Implementing Synchronous Communications

An algorithm implementing synchronous communications has to prevent crowns
from forming. This chapter considers three distinct approaches to implement such
a prevention. The terms interaction and rendezvous are used as synonyms of
synchronous communication.
• A first approach consists in assuming that the application programs are well
written (there is no deadlock, etc.). To that end, a nondeterministic construct is
offered to users. Such a construct allows a process to list several synchronous
communication statements such that one of them will be selected at run time
according to the synchronous communications invoked by the other processes.
This approach, called nondeterministic planned interaction, is addressed in
Sect. 13.2.
• A second approach consists in considering that a process is always ready to
synchr_deliver a message. Hence, the nondeterministic construct is implicit. This
approach, called nondeterministic forced interaction, is addressed in Sect. 13.3.
• A third approach consists in adding a deadline to each invocation of a
synchronous communication operation, in such a way that the communication
occurs before the deadline or not at all. Of course, this approach is meaningful
only in synchronous systems, i.e., systems where the processes have a common
notion of physical time and the message transfer duration is bounded. This
approach, which is called deadline-constrained interaction, is addressed in
Sect. 13.4.

13.2 Algorithms for Nondeterministic Planned Interactions

This section considers planned interactions. The term planned refers to the fact that
each invocation of a communication operation mentions the identity of the other
process. While this is always the case for a send operation, here each invocation
of synchr_deliver() also states the identity of the process from which a message
has to be received.

13.2.1 Deterministic and Nondeterministic Communication Contexts

Deterministic Context The invocation of a synchronous communication
operation by a process pi can appear in a classical deterministic context such as

statements_1; synchr_send(m) to px; statements_2,

which means that the process pi first executes the set of statements defined by
statements_1, and then invokes the synchronous operation synchr_send(m) to px.
It remains blocked until px invokes the matching synchronous operation
synchr_deliver(v) from pi. When this occurs, the value of m is copied into the local
variable v of px, and then both pi and px continue their sequential execution. As far
as pi is concerned, it executes the set of statements defined by statements_2.

Nondeterministic Construct A process can also invoke communication
operations in a nondeterministic context. A nondeterministic construct is used to
that end. An example of such a construct is as follows:

begin nondeterministic context
     synchr_send(m1) to px then statements_1
or   synchr_send(m2) to py then statements_2
or   synchr_deliver(m) from pz then statements_3
end nondeterministic context.
The meaning of this construct is the following. It states that the process pi wants
to execute one of the three synchronous operations. The choice is nondeterministic
(it will actually depend on the speed of processes and implementation messages).
Once one of these invocations has succeeded, pi executes the sequence of statements
associated with the corresponding invocation.

Associated Properties The fact that several synchronous communications may
appear in a nondeterministic construct requires the statement of properties
associated with this construct. These properties are the following.
• Safety. If a process enters a nondeterministic construct, it executes at most one
of the synchronous communications listed in the construct. Moreover, if this
synchronous communication is with process pj, this process has invoked a
matching synchronous operation.
• Liveness. If both pi and pj have invoked matching synchronous communication
operations, and none of them succeeds in another synchronous communication,
then the synchronous communication between pi and pj succeeds.
The safety property is a mutual exclusion property: A process can be engaged in
at most one synchronous communication at a time. The liveness property states that,
despite nondeterminism, if synchronous communications are possible, then one of
them will occur (hence, nondeterminism cannot remain pending forever: it has to be
solved).

13.2.2 An Asymmetric (Static) Client–Server Implementation

The algorithm presented in this section is due to A. Silberschatz (1979).



Basic Idea: An Underlying Client–Server Hierarchy Given a synchronous
communication (rendezvous) between two processes pi and pj, the idea is to
associate a client behavior with one of them and a server behavior with the other
one. These associations are independent of the sense of the transfer. A client
behavior means that the corresponding process has the initiative of the rendezvous,
while a server behavior means that the corresponding process is waiting for the
rendezvous.
The idea is to associate a client behavior with all the rendezvous invocations
appearing in a deterministic context, and a server behavior with all the invocations
appearing in a nondeterministic context. Let us observe that, if needed, it is
always possible to transform a deterministic context into a nondeterministic one
(while the opposite is not possible). As an example, the deterministic invocation
"synchr_send(m) to px" can be replaced by
begin nondeterministic context
synchr_send(m) to px then skip
end nondeterministic context,
to force it to be implemented with a server behavior. This can be done at compile
time.
The limit of this implementation of a rendezvous consequently lies in the fact
that it does not allow both matching invocations of a rendezvous to appear in
a nondeterministic context.
Let us define the implementation rendezvous graph of a distributed program as
follows. This graph is a directed graph which captures the client–server relation
imposed on the processes by their rendezvous invocations. Its vertices are the
processes, and there is an edge from a process pi to a process pj if (a) there is a
rendezvous involving pi and pj, and (b) the client behavior is associated with pi
while the server behavior is associated with pj. To prevent deadlock and ensure
liveness, the algorithm implementing rendezvous that follows requires that this
graph be acyclic. (Let us observe that this graph can be computed at compile time.)
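
As this graph is known from the program text, the acyclicity requirement can be checked before execution. A minimal Python sketch follows (the edge encoding is ours):

from graphlib import TopologicalSorter, CycleError

# edges is a set of pairs (i, j) meaning "p_i behaves as a client and
# p_j as a server for some rendezvous of the program".

def rendezvous_graph_is_acyclic(processes, edges):
    preds = {p: {i for (i, j) in edges if j == p} for p in processes}
    try:
        TopologicalSorter(preds).prepare()  # raises CycleError on a cycle
        return True
    except CycleError:
        return False

print(rendezvous_graph_is_acyclic({1, 2, 3}, {(1, 2), (2, 3)}))  # True
print(rendezvous_graph_is_acyclic({1, 2}, {(1, 2), (2, 1)}))     # False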

Local Variables Each process pi manages the following local variables:
• my_clienti is a set containing the identities of the processes for which pi behaves
as a server.
• bufferi is a variable in which pi deposits the message it wants to send, or retrieves
the message it is about to synchr_deliver.
• may_readi[my_clienti] is an array of Booleans, initialized to [false, . . . , false];
may_readi[j] = true (where j ∈ my_clienti) means that the "client" pj allows
the "server" pi to read the content of its buffer bufferj. This variable is used in
the rendezvous where the client pj is the sender and the server pi the receiver.
• may_writei[my_clienti] is an array of Booleans, initialized to [false, . . . , false];
may_writei[j] = true means that the "client" pj allows the "server" pi to write
into its local buffer bufferj. This variable is used in the rendezvous where the
client pj is the receiver and the server pi the sender.
• end_rdvi is a Boolean (initialized to false), which is set to the value true to signal
the end of the rendezvous.

Process pi (client):
operation synch_send(m) to pj is
(C1) bufferi ← m;
(C2) signal(j, may_read[i]);
(C3) wait(end_rdvi);
(C4) end_rdvi ← false.

Process pj (server):
operation synch_del(x) from pi is
(S1) wait(may_readj[i]);
(S2) x ← obtain(i);
(S3) may_readj[i] ← false;
(S4) signal(i, end_rdv).

Fig. 13.6 Implementation of a rendezvous when the client is the sender

Underlying Communication The processes send and receive messages through a
fully connected underlying asynchronous network. To have a modular presentation,
the following intermediate communication operations are defined:
• deposit(i, a) allows the invoking process pj to deposit the content of a in bufferi.
This operation can easily be realized with a single message: pj sends the
message STORE(a) to pi, and, when it receives it, pi deposits a into bufferi.
• obtain(i) allows the invoking process pj to obtain the content of bufferi.
This operation can easily be realized with two messages: pj sends a request
message REQ() to pi, which sends by return to pj the message ANSWER(bufferi).
• signal(i, x) allows the invoking process pj to set to true the Boolean local
variable xi of process pi.
This operation can easily be realized with a single message: pj sends the
message SIGNAL(x) to pi, and, when it receives it, pi assigns the value true to its
local Boolean identified by x.
As the messages STORE () and SIGNAL () sent by a process pj to a process pi
have to be received in their sending order, the channels are required to be FIFO. If
they are not, sequence numbers have to be associated with messages.

Implementation When the Sender pi Is the Client The corresponding
implementation for a pair of processes (pi, pj) is described in Fig. 13.6. As
i ∈ my_clientj, the invocations of "synch_send(m) to pj" by pi always occur in a
deterministic context, while the matching invocations "synch_del(x) from pi" issued
by pj always occur in a nondeterministic context.
When pi invokes “synch_send(m) to pj ”, pi deposits the message m in its local
buffer (line C1), and sends a signal to pj (line C2) indicating that pj is allowed
to read its local buffer. Finally, pi awaits a signal indicating that the rendezvous is
terminated (line C3).
When pj invokes "synch_del(x) from pi", pj waits until pi allows it to read the
content of its local buffer (line S1). Then, when pj receives the corresponding
signal, it reads the value saved in bufferi (line S2). Finally, it resets may_readj[i] to
its initial value (line S3), and sends a signal to pi indicating the end of the
rendezvous (line S4).
It is easy to see that, due to the signal/wait pattern used by the client pi and
the opposite wait/signal pattern used by the server pj, no deadlock can occur, and
there is a physical time at which both processes are executing their synchronous
communication operations.
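
The signal/wait skeleton of Fig. 13.6 can be experimented with on a single machine. In the following Python sketch (ours, not the book's code), the two processes are emulated by threads, the signal() operation by setting a shared event, and obtain() by a direct read of the client's buffer:

import threading

buffer_i = [None]                 # buffer_i of the client p_i
may_read = threading.Event()      # may_read_j[i] at the server p_j
end_rdv = threading.Event()       # end_rdv_i at the client p_i

def synch_send(m):                # client p_i
    buffer_i[0] = m               # (C1) deposit m in the local buffer
    may_read.set()                # (C2) signal(j, may_read[i])
    end_rdv.wait()                # (C3) wait(end_rdv_i)
    end_rdv.clear()               # (C4) reset end_rdv_i

def synch_del():                  # server p_j
    may_read.wait()               # (S1) wait(may_read_j[i])
    x = buffer_i[0]               # (S2) obtain(i)
    may_read.clear()              # (S3) reset may_read_j[i]
    end_rdv.set()                 # (S4) signal(i, end_rdv)
    return x

server = threading.Thread(target=lambda: print(synch_del()))
server.start()
synch_send("hello")               # the rendezvous occurs; "hello" is printed
server.join()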

Process pj (server):
operation synch_send(m) to pi is
(S1) wait(may_writej[i]);
(S2) deposit(i, m);
(S3) may_writej[i] ← false;
(S4) signal(i, end_rdv).

Process pi (client):
operation synch_del(x) from pj is
(C1) signal(j, may_write[i]);
(C2) wait(end_rdvi);
(C3) x ← bufferi;
(C4) end_rdvi ← false.

Fig. 13.7 Implementation of a rendezvous when the client is the receiver


Implementation When the Sender pj Is the Server This implementation,
described in Fig. 13.7, is similar to the previous one. The message control pattern
is exactly the same; the only thing that differs is the sense of the transfer at the
synchronous communication level.

Solving Nondeterminism To complete the description of the implementation,
the way the nondeterministic construct is implemented has to be described. To
that end, let us consider the nondeterministic construct described in Sect. 13.2.1,
which involves the following three invocations: "synchr_send(m1) to px",
"synchr_send(m2) to py", and "synchr_deliver(x) from pz".
Let pj be the corresponding invoking process. This process is a server for px,
py, and pz, and we consequently have x, y, z ∈ my_clientj. Moreover, each of the
matching invocations issued by these three processes is in a deterministic context.
When pj enters this nondeterministic construct, it executes the following sequence
of statements:
wait(may_writej [x] ∨ may_writej [y] ∨ may_readj [z]);
among these Booleans, select one equal to true;
let i ∈ {x, y, z} be the corresponding process identity, and
let may_xxxj[i] be the corresponding Boolean;
if (may_xxxj [i] is may_readj [i])
then execute the lines S2, S3, and S4 of Fig. 13.6
else execute the lines S2, S3, and S4 of Fig. 13.7
end if.
Thanks to the modular decomposition of the implementation, the previous code
solving nondeterministic choices is particularly simple.

13.2.3 An Asymmetric Token-Based Implementation

The implementation of synchronous communication described in this section
allows any combination of deterministic and nondeterministic matching invocations
of the operations synchr_send() and synchr_del(). More precisely, any two matching
invocations are allowed to appear both in a deterministic context, or both in a
nondeterministic context, or (as in the previous section) one in a deterministic
context and the other one in a nondeterministic context. Hence, this section focuses
only on the control needed to realize an interaction, and not on the fact that a single
message, or a message in each direction, is transmitted during the rendezvous. It is
even possible that no message at all is transmitted; in this case, the interaction boils
down to a pure rendezvous synchronization mechanism. The corresponding
algorithm is due to R. Bagrodia (1989).

This section uses the term interaction, which has to be considered as a synonym
of rendezvous or synchronous communication.

Fig. 13.8 A token-based mechanism to implement an interaction

Underlying Idea: Associate a Token with Each Interaction A token is
associated with each possible interaction. Considering any pair of processes pi
and pj, the token TOKEN({i, j}) is introduced to allow them to realize a rendezvous.
The process that currently has the token (e.g., pi) has the initiative to ask the
other process (e.g., pj) if pj agrees to perform an interaction with it. To that end,
pi sends the token to pj. When pj receives the token, several cases are possible.
• If pj agrees with this interaction request, it sends the token back to pi and the
interaction succeeds. This scenario is depicted on the left side of Fig. 13.8 (in
which the second parameter in the token is the identity of the process that initiated
the interaction).
• If pj is not interested in an interaction with pi , it sends back to pi a message
NO () and keeps the token. This means that the future request for an interaction
involving pi and pj will be initiated by pj . This scenario is depicted on the right
side of Fig. 13.8.
• If pj is interested in an interaction with pi , but is waiting for an answer from a
process pk (to which it previously sent TOKEN ({j, k})), it will send an answer to
pi according to the answer it receives from pk . The implementation has to ensure
that there is neither a deadlock (with a process waiting forever for an answer), nor
a livelock (with processes always sending NO () while an interaction is possible).

Preventing Deadlock and Livelock Let us consider the scenario described in
Fig. 13.9. Process pi, which has the token for initiating interactions with pj, sends
this token to pj, and starts waiting for an answer. The same occurs for pj, which
sent TOKEN({j, k}, j) to pk, and for pk, which sent TOKEN({i, k}, k) to pi. Typically,
if none of them answers, there is a deadlock, and if all of them send the answer
NO(), the liveness property is compromised, even though an interaction is possible.

Fig. 13.9 Deadlock and livelock prevention in interaction implementation

Solving this issue requires us to break the symmetry among the three messages
TOKEN(). As seen in Sect. 11.2.2 when discussing resource allocation, one way to
do that consists in defining a total order on the request messages TOKEN(). Such
a total order can be obtained from the identities of the processes. As an example,
assuming j < i < k, let us consider the process pi when it receives the message
TOKEN({i, k}, k). Its behavior is governed by the following rule:

if (i < k) then delay the answer to pk
           else send NO() to pk
end if,

i.e., pi has priority (in the sense that it does not answer by return) if its identity is
smaller than that of the process that sent the request.
When considering the execution of Fig. 13.9, pi delays its answer to pk, pj
delays its answer to pi, but pk sends the answer NO() to pj. When pj receives this
message, it can tell pi that it accepts its interaction, and finally, when pi learns it, pi
sends NO() to pk. (As seen in Sect. 11.2.2, this identity-based "priority" rule prevents
deadlocks from occurring and ensures interaction liveness.) Considering examples
with more than three processes would show that this priority scheme allows
independent interactions to be executed concurrently.
When considering Fig. 13.9, the crown that appears involves implementation
messages. The priority rule prevents the application level from inheriting this
implementation-level crown.
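
The priority rule itself is a pure function of the two identities involved. The tiny Python sketch below (our own framing, not the book's code) makes the asymmetry explicit:

# p_i, already engaged, receives TOKEN({i, k}, k) from p_k and decides.

def on_conflicting_token(i, k):
    if i < k:
        return "delay the answer to p_k"   # p_i has priority
    return "send NO() to p_k"

# With j < i < k as in Fig. 13.9: p_i delays p_k, p_j delays p_i, and
# p_k answers NO() to p_j, which breaks the cycle.
print(on_conflicting_token(2, 3))  # i < k: delay
print(on_conflicting_token(3, 1))  # i > k: NO()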

Local Variables at a Process To implement the previous principles, each process
manages the following local variables:
• statei describes the local interaction state of pi. Its value domain is {out,
interested, engaged}. statei = out means that pi is not currently interested in a
rendezvous; statei = engaged means that pi is interested and waiting for an
answer; statei = interested means that, while pi is interested in an interaction, it is
not currently waiting for an answer.
• interactionsi is a set containing the identities of the processes currently proposed
by pi for an interaction. When it invokes an interaction with a process pj in
a deterministic context, pi sets interactionsi to {j}. When it enters a
nondeterministic construct containing invocations of rendezvous with pj, pk, pℓ,
etc., it sets interactionsi to {j, k, ℓ, . . .}.

when pi enters a (deterministic or nondeterministic) rendezvous context do
(1) interactionsi ← {ids of the candidate processes defined in the rendezvous context};
(2) statei ← interested;
(3) if (∃ j ∈ interactionsi such that TOKEN({i, j}) ∈ tokensi)
(4)    then withdraw TOKEN({i, j}) from tokensi;
(5)         send TOKEN({i, j}, i) to pj;
(6)         statei ← engaged;
(7)         delayedi ← ⊥
(8) end if.

when TOKEN({i, j}, x) is received from pj do % x ∈ {i, j} %
(9)  if (statei = out) ∨ (j ∉ interactionsi)
(10)    then send NO() to pj; add TOKEN({i, j}) to tokensi
(11)    else if (statei = interested)
(12)      then send TOKEN({i, j}, x) to pj; statei ← out
(13)      else % statei = engaged %
(14)        case (x > i) ∧ (delayedi = ⊥)
(15)               then delayedi ← TOKEN({i, j}, x)
(16)             (x < i) ∨ [(x > i) ∧ (delayedi ≠ ⊥)]
(17)               then send NO() to pj; add TOKEN({i, j}) to tokensi
(18)             (x = i) then add TOKEN({i, j}) to tokensi;
(19)               if (delayedi ≠ ⊥)
(20)                  then let delayedi = TOKEN({i, k}, k);
(21)                       send NO() to pk;
(22)                       add TOKEN({i, k}) to tokensi
(23)               end if
(24)               statei ← out
(25)        end case
(26)    end if
(27) end if.

when NO() is received from pj do % statei = engaged %
(28) if (delayedi ≠ ⊥) then let delayedi = TOKEN({i, k}, k);
(29)                        send TOKEN({i, k}, k) to pk; statei ← out
(30)                   else same as lines 2–8
(31) end if.

Fig. 13.10 A general token-based implementation for planned interactions (rendezvous)

• tokensi is a set containing the tokens currently owned by pi (let us recall that the
token TOKEN({i, j}) allows its current owner—pi or pj—to send a request for
an interaction with the other process).
Initially, each interaction token is placed at one of the processes associated
with it.
• delayedi is a variable which contains the token for which pi has delayed sending
an answer; delayedi = ⊥ if no answer is delayed.

Behavior of a Process The algorithm executed by a process pi is described in
Fig. 13.10. When it enters a deterministic or nondeterministic communication
context (as defined in Sect. 13.2.1), a process pi becomes interested (line 2), and
initializes its set interactionsi to the identities of the processes which it has defined
as candidates for a synchronous communication (line 1). Then, if it has the token
for one of these interactions (line 3), it selects one of them (line 4), and sends the
associated token to the corresponding process pj (line 5). It then becomes engaged
(line 6), and sets delayedi to ⊥ (line 7). If the predicate of line 3 is false, pi remains
in the local state interested.
The core of the algorithm is the reception of a message TOKEN ({i, j }, x) that
a process pi receives from a process pj . As x is the identity of the process that
initiated the interaction, we necessarily have x = i or x = j . When it receives such
a message, the behavior of pi depends on its state.
• If it is not interested in interactions or, while interested, j ∉ interactionsi, pi
sends by return the message NO() to pj and saves TOKEN({i, j}) in tokensi
(lines 9–10). In that way, pi will have the initiative for the next interaction
involving pi and pj, and during that time pj will no longer try to establish a
rendezvous with it.
• If j ∈ interactionsi , and pi has no pending request (i.e., statei = interested), it
commits the interaction with pj by sending back the message TOKEN ({i, j }, j )
(lines 11–12).
• Otherwise, we have j ∈ interactionsi and statei = engaged. Hence, pi has sent
a message TOKEN ({i, k}, i) for an interaction with a process pk and has not yet
received an answer. According to the previous discussion on deadlock prevention,
there are three cases.
– If x > i and pi has not yet delayed an answer, it delays the answer to pj
(lines 14–15).
– If x < i, or x > i and pi has already delayed an answer, it sends the answer
NO () to pj . In that way pj will be able to try other interactions (lines 16–17).
Let us remark that, if delayedi ≠ ⊥, pi will commit an interaction: either
the interaction with the process pk from which pi is waiting for an answer to
the message TOKEN({i, k}, i) it previously sent, or with the process pℓ waiting
for its answer (pℓ is such that delayedi = TOKEN({ℓ, i}, ℓ)).
– If x = i, by returning the message TOKEN({i, j}, i) to pi, its sender pj commits
the interaction. In that case, before moving to the state out (line 24), pi stores
TOKEN({i, j}) in tokensi (hence, pi will have the initiative for the next
interaction with pj), and sends the answer NO() to the delayed process pk, if any
(lines 18–24).
Finally, when pi receives the answer NO() from a process pj (which means that
it previously became engaged by sending the request TOKEN({i, j}, i) to pj), the
behavior of pi depends on the value of delayedi. If delayedi ≠ ⊥, pi has delayed
its answer to the process pk such that delayedi = TOKEN({k, i}, k). In that case, it
commits the interaction with pk (lines 28–29). Otherwise, it moves to the local state
interested and tries to establish an interaction with another process in interactionsi
for which it has the token (lines 2–8).

Properties Due to the round-trip of a message TOKEN({i, j}, i), it is easy to see
that a process can participate in at most one interaction at a time.
Each request (sending of a message TOKEN({i, j}, i)) gives rise to exactly one
answer (the same message echoed by the receiver, or a message NO()). Moreover,
due to the total order on process identities, no process can indefinitely delay another
one. Let us also notice that, if a process pi is such that k ∈ interactionsi and
TOKEN({i, k}) ∈ tokensi, pi will send the request TOKEN({i, k}, i) to pk (if it does
not commit another interaction before).
Let us consider the following directed graph. Its vertices are the processes, and
there is an edge from pi to pj if delayedj = TOKEN({j, i}, i) (i.e., pj is delaying
the sending of an answer to pi ). This graph, whose structure evolves dynamically,
is always acyclic. This follows from the fact that an edge can go from a process
pi to a process pj only if i > j . As the processes that are sink nodes of the graph
eventually send an answer to the processes they delay, it follows (by induction) that
no edge can last forever.
The liveness property follows from the previous observations, namely, if (a) two
processes pi and pj are such that i ∈ interactionsj and j ∈ interactionsi , and
(b) none of them commits another interaction, then pi and pj will commit their
common interaction.

13.3 An Algorithm for Nondeterministic Forced Interactions

13.3.1 Nondeterministic Forced Interactions

To prevent crown patterns from forming (at the level of application messages),
processes sometimes have to deliver a message instead of sending one. In the
context of the previous section, this is planned at the programming level and, to
that end, processes are allowed to explicitly use a nondeterministic communication
construct. As we have seen, this construct allows the communication operation
which will prevent crown formation to be selected at run time.
This section considers a different approach in which there is no nondeterministic
choice planned by the programmer, and all the invocations of synchr_send() appear
in a deterministic context. To prevent crowns from forming, a process can be forced
at any time to deliver a message or to delay the sending of a message, hence the
name forced interaction.

13.3.2 A Simple Algorithm

Principle As no algorithm implementing a rendezvous can be completely
symmetric, a way to solve conflicts is to rely on the identities of the processes (as
done in Sect. 13.2.3). An interaction (rendezvous) involving pi and pj is
conceptually controlled by the process pmax(i,j). If the process pmin(i,j) wants to
synchr_send a message to pmax(i,j), it has to ask pmax(i,j) to manage the interaction.
This algorithm is due to V.V. Murty and V.K. Garg (1997).

operation synchr_send(m) to pj is
(1) if (i > j) then wait(¬ engagedi);
(2)                 send MSG(m) to pj; engagedi ← true
(3)            else bufferi[j] ← m; send REQUEST() to pj
(4) end if.

when MSG(m) is received from pj do
(5) if (i > j) then synchr_delivery of m; engagedi ← false
(6)            else wait(¬ engagedi);
(7)                 synchr_delivery of m; send ACK() to pj
(8) end if.

when ACK() is received from pj do
(9) engagedi ← false.

when REQUEST() is received from pj do % j < i %
(10) wait(¬ engagedi);
(11) send PROCEED() to pj; engagedi ← true.

when PROCEED() is received from pj do % j > i %
(12) wait(¬ engagedi);
(13) send MSG(bufferi[j]) to pj.
Fig. 13.11 An algorithm for forced interactions (rendezvous)

The Algorithm Each process pi manages a local Boolean variable engagedi,
initially equal to false. This variable is set to true when pi is managing a
synchronous communication with a process pj such that i > j. The aim of the
variables engagedi is to prevent the processes from being involved in a cycle that
would prevent liveness.
To present the algorithm we consider two cases according to the values of i and
j , where pi is the sender and pj the receiver.
• i > j . In that case the sender has the initiative of the interaction. It can send the
message m only if it is not currently engaged with another process. It sends then
MSG (m) to pj and becomes engaged (lines 1–2).
When pj receives MSG (m) from pi , it waits until it is no longer engaged in an-
other rendezvous (line 6). Then it synchr_delivers the message m, and sends the
message ACK () to pi (line 7). When pi receives this message, it learns that the
synchronous communication has terminated and consequently resets engagedi to
false (line 9). The corresponding pattern of messages exchanged at the implemen-
tation level is described in Fig. 13.12.
• i < j . In this case, pi has to ask pj to manage the interaction. To that end, it
sends the message REQUEST () to pj (line 3). Let us notice that pi is not yet en-
gaged in a synchronous communication. When pj receives this message, it waits
until it is no longer engaged in another interaction (line 10). When this occurs,
pj sends to pi a message PROCEED () and becomes engaged with pi (line 11).
When it receives this message, pi sends the application message (which has been
previously saved in bufferi [j ]) and terminates locally the interaction (line 12). Fi-
352 13 Rendezvous (Synchronous) Communication

Fig. 13.12
Forced interaction:
message pattern when i > j

Fig. 13.13 Forced interaction: message pattern when i < j

nally, the reception of this message MSG (m) by pj entails the synchr_delivery of
m and the end of the interaction on pj ’s side (line 5). The corresponding pattern
of messages exchanged at the implementation level is described in Fig. 13.13. As
in the previous figure, this figure indicates the logical time that can be associated
with the interaction (from an application level point of view).

13.3.3 Proof of the Algorithm

This section shows that, when controlled by the previous algorithm, the pattern of
application messages satisfies the synchrony property, and all application messages
are synchr_delivered.

Lemma 11 The communication pattern of the application messages generated by
the algorithm satisfies the synchrony property.

Proof To show that the algorithm ensures that the application messages synchr_sent
and synchr_delivered satisfy the synchrony property, we show that there is no crown
at the application level. It follows then from Theorem 18 that the synchrony property
is satisfied.
The proof is made up of two parts. We first show that, if the computation has a
crown of size k > 2, it also has a crown of size 2 (Part I). Then, we show that there
is no crown of size 2 (Part II). As in Sect. 13.1, let sy_s(m) and sy_del(m) be the
events associated with the synchronous send and the synchronous delivery of the
application message m, respectively.

Proof of Part I. Let m1, m2, . . . , mk be the sequence of messages involved in a
crown of size k > 2, and let mx be a message of this crown synchr_sent by a process
pi to a process pj. We consider two cases.
• i > j. Due to the crown, we have sy_s(mx) −→ev sy_del(mx+1). As, after it has
sent the implementation message MSG(mx) to pj (line 2), pi remains engaged
with pj until it receives an implementation message ACK() from it (line 9), it
follows that we have sy_del(mx) −→ev sy_del(mx+1). As (due to the crown) we
also have sy_s(mx−1) −→ev sy_del(mx), it follows that
sy_s(mx−1) −→ev sy_del(mx+1), and there is consequently a crown of size k − 1.
• i < j. The reasoning is similar to the previous case. Due to the crown, we have
sy_s(mx−1) −→ev sy_del(mx). As, after it has sent the message PROCEED() to
pi (line 11), pj remains engaged with pi until it receives MSG(mx) from it
(line 5), it follows that we have sy_s(mx−1) −→ev sy_s(mx). Combining this with
sy_s(mx) −→ev sy_del(mx+1), we obtain sy_s(mx−1) −→ev sy_del(mx+1), and
obtain a crown whose size is k − 1.
Proof of Part II. Let us assume by contradiction that there is a crown of size 2.
Hence, there are two messages m1 and m2 such that sy_s(m1) −→ev sy_del(m2) and
sy_s(m2) −→ev sy_del(m1). (Let us recall that these relations can involve causal
paths, as shown in the crown of size 2 depicted at the left side of Fig. 13.4.) Let pi
be the process that synchr_sent m1 to pj, and pi′ be the process that synchr_sent m2
to pj′. There are two cases.
• m1 is such that i > j, or m2 is such that i′ > j′. Without loss of generality, we
consider i > j. It follows from the previous reasoning (first item of Part I) that
sy_del(m1) −→ev sy_del(m2).
– If i′ > j′, we obtain with the same reasoning sy_del(m2) −→ev sy_del(m1).
– If i′ < j′, we obtain with a similar reasoning ¬(sy_del(m1) −→ev sy_del(m2)).
Both cases contradict sy_del(m1) −→ev sy_del(m2).
• m1 and m2 are such that i < j and i′ < j′. By a reasoning similar to that used in
the second item of Part I, we obtain sy_s(m1) −→ev sy_s(m2) and
sy_s(m2) −→ev sy_s(m1), a contradiction, which concludes the proof of the
lemma. □

Lemma 12 Each invocation of synchr_send() terminates, and the corresponding
message is synchr_delivered.

Proof To prove this lemma, we show that if a process pi invokes synchr_send(), it
eventually proceeds to the state ¬ engagedi.
Let us observe that process p1 is never engaged (due to its smallest identity, it
never executes line 2 or line 11, and consequently we always have ¬ engaged1). It
follows that p1 never receives a message REQUEST() or ACK(). Moreover, when
it receives a message MSG(m) or PROCEED(), it always sends back an answer
(lines 6–7 and lines 12–13).

The rest of the proof is by induction. Let us assume that each process
p1, p2, . . . , pk eventually moves to the state ¬ engaged and answers the messages
it receives. Let us observe that pk+1 can become engaged only when it sends a
message MSG() to a process with a smaller identity (line 2), or when it sends a
message PROCEED() to a process with a smaller identity (line 11). As these
processes will answer the messages from pk+1 (induction assumption), it follows
that pk+1 is eventually such that ¬ engagedk+1, and will consequently answer the
messages it receives. Hence, no process remains blocked forever, and all the
invocations of synchr_send() terminate.
The fact that the corresponding messages are synchr_delivered follows trivially
from lines 5 and 7. □

Theorem 19 The algorithm of Fig. 13.11 ensures that the message communication
pattern satisfies the synchrony property, no process blocks forever, and each message
that is synchr_sent is synchr_delivered.

Proof The proof follows from Lemmas 11 and 12. □

13.4 Rendezvous with Deadlines in Synchronous Systems

This section introduces rendezvous with deadlines. As noticed in Sect. 13.1.4, this
is possible only in synchronous distributed systems. This is due to the fact that, in
asynchronous distributed systems, there is no notion of physical time accessible to
the processes. This section first presents definitions, and then algorithms
implementing rendezvous with deadlines. These algorithms are due to I. Lee and
S.B. Davidson (1987).

13.4.1 Synchronous Systems and Rendezvous with Deadline

Synchronous Systems As processing times are negligible when compared to
message transfer durations, they are considered as having a zero duration.
Moreover, there is an upper bound, denoted δ, on the transit time of all the messages
that are sent through the underlying network, where the transit time of a message is
measured as the duration that elapses between its sending and its reception for
processing by the destination process. (The time spent by a message in input/output
buffers belongs to its transit time.) This bound holds whatever the process that
measures it.
Each process pi has a local physical clock, denoted ph_clocki, that it can read to
obtain the current date. These clocks are synchronized in such a way that, at any
time, the difference between any two clocks is upper bounded by θ.
The previous assumptions on δ and θ mean that if, at time τ measured on p's
local clock, a process p sends a message m to a process q, then m will be received
by q by time τ + δ + θ measured on q's local clock.

If there is a common physical clock that can be read by all processes, the local
clocks can be replaced by this common clock, and then θ = 0.

Rendezvous with Deadline The rendezvous with deadline abstraction provides
the processes with two operations, denoted timed_send() and timed_receive(). Both
the sender p, when it invokes timed_send(m, deadline1), and the receiver q, when
it invokes timed_receive(x, deadline2), specify a deadline for the rendezvous (x is
the local variable of q in which the received message is deposited if the rendezvous
succeeds).
If the rendezvous cannot happen before their deadlines, both obtain the control
value timeout. If the rendezvous succeeds before their deadlines, both obtain the
value commit. It is important to notice that both processes receive the same result:
the rendezvous is successful for both (and then m has been transmitted), or
unsuccessful for both (and then m has not been transmitted).

Temporal Scope Construct The following language construct, called a temporal
scope, is used to ease the presentation of the algorithms. Its simplest form is the
following:

within deadline do
    when a message is received do statements
    at deadline occurrence return(timeout)
end within.

Its meaning is the following. Let τ be the time (measured by ph_clocki) at which
a process pi enters a within . . . end within construct. The time spent in this
construct is upper bounded by the date τ + deadline. During this period of time, if
pi receives a message, it executes the associated code, denoted "statements", which
can contain invocations of return(commit) or return(timeout). If it executes
such a return(), pi exits the construct with the corresponding result. If return()
has not been invoked by time τ + deadline, pi exits the construct with the result
timeout.
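
In an implementation language, the temporal scope construct can be emulated with a timed blocking receive. A rough Python sketch follows (ours, not the book's; dates are seconds on the local monotonic clock, and the handler plays the role of the "when a message is received" clause):

import queue
import time

def within(deadline, inbox, handler):
    # Wait for messages until the absolute date `deadline`; `handler`
    # may return "commit" or "timeout" to leave the scope (like the
    # return() statements of the construct), or None to keep waiting.
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return "timeout"          # at deadline occurrence
        try:
            msg = inbox.get(timeout=remaining)
        except queue.Empty:
            return "timeout"          # at deadline occurrence
        result = handler(msg)
        if result is not None:
            return result

inbox = queue.Queue()
inbox.put("a message")
print(within(time.monotonic() + 1.0, inbox, lambda m: "commit"))  # commit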

13.4.2 Rendezvous with Deadline Between Two Processes

Let us consider the base case of two processes p and q such that p invokes
timed_send(m, deadlinep ) and q invokes timed_receive(x, deadlineq ). Moreover,
let τp be the time, measured at its local clock, at which p starts its invocation of
timed_send(m, deadlinep ). Similarly, let τq be the time, measured at its local clock,
at which q starts its invocation of timed_receive(x, deadlineq ).

A Predicate to Commit or Abort a Rendezvous Let us observe that if p sends a
message to q at a time τp (measured on its clock) such that τp + δ + θ ≤ deadlineq,
then q will receive this message before its deadline. Similarly, if q sends a message
to p at a time τq (measured on its clock) such that τq + δ + θ ≤ deadlinep, then p
will receive this message before its deadline. Hence, when true, the predicate

(τp + δ + θ ≤ deadlineq) ∧ (τq + δ + θ ≤ deadlinep)

states that, whatever the actual speed of the messages, the rendezvous can be
successful (see Fig. 13.14). If the predicate is false, the rendezvous can succeed or
fail, depending on the speed of the messages and on the deadline values. If
τp + δ + θ > deadlineq, the message sent by p to q may arrive before deadlineq,
but it may also arrive later. There is no way to know this from the values of τp and
deadlineq. (And similarly in the other direction.) Hence, the previous predicate is
the weakest predicate (based only on τp, τq, deadlinep, and deadlineq) which, when
true, safely ensures that it is impossible for the rendezvous not to occur.

Fig. 13.14 When the rendezvous must be successful (two-process symmetric algorithm)

In order that both p and q can compute this predicate, the message sent by p to q
has to contain τp and deadlinep, and the message sent by q to p has to contain τq
and deadlineq. This is the principle on which the rendezvous with deadlines between
two processes is based.
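
Written as a pure function, the predicate is a one-liner (a minimal Python sketch with our own names; all dates are expressed in the same synchronized time frame):

def rendezvous_must_succeed(tau_p, deadline_p, tau_q, deadline_q,
                            delta, theta):
    # A control message sent at local time tau_x is received by time
    # tau_x + delta + theta on the partner's clock.
    return (tau_p + delta + theta <= deadline_q) and \
           (tau_q + delta + theta <= deadline_p)

# With delta = 5 and theta = 1:
print(rendezvous_must_succeed(10, 20, 11, 22, 5, 1))  # True
print(rendezvous_must_succeed(10, 15, 11, 22, 5, 1))  # False: 11 + 6 > 15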

The Rendezvous Algorithm The corresponding algorithms implementing
timed_send() and timed_receive() are described in Fig. 13.15. Their code is a simple
translation of the previous discussion in terms of the within temporal scope
construct.
lation of the previous discussion in terms of the within temporal scope construct.
In addition to the application message m, the sender p sends to q the date of
its starting time τp and its deadline deadlinep (line 2). Process p then enters the
temporal scope (line 3), and waits for a message from q. If no message is received
by p's deadline, the rendezvous has failed and p returns timeout (line 9). If p
receives a message from q (line 4) before its deadline, it computes the value of the
success predicate (line 5), and returns the value commit if the rendezvous cannot
"not occur" (line 6), or timeout otherwise.
The code of the operation timed_receive() is the same as that of the operation
timed_send() (with an additional store of m into the local variable x if the
rendezvous succeeds, line 16).

Theorem 20 The algorithm described in Fig. 13.15 ensures that both processes
return the same value. Moreover, this value is commit if and only if the predicate
(τp + δ + θ ≤ deadlineq) ∧ (τq + δ + θ ≤ deadlinep) is satisfied, and then the
message m is delivered by q.

operation timed_send(m, deadlinep) to q is
(1) τp ← ph_clockp;
(2) send MSG(m, τp, deadlinep) to q;
(3) within deadlinep do
(4)   when READY(τq, deadlineq) is received from q do
(5)     if (τp + δ + θ ≤ deadlineq) ∧ (τq + δ + θ ≤ deadlinep)
(6)       then return(commit)
(7)       else return(timeout)
(8)     end if
(9)   at deadline occurrence return(timeout)
(10) end within.

operation timed_receive(x, deadlineq) from p is
(11) τq ← ph_clockq;
(12) send READY(τq, deadlineq) to p;
(13) within deadlineq do
(14)   when MSG(m, τp, deadlinep) is received from p do
(15)     if (τp + δ + θ ≤ deadlineq) ∧ (τq + δ + θ ≤ deadlinep)
(16)       then x ← m; return(commit)
(17)       else return(timeout)
(18)     end if
(19)   at deadline occurrence return(timeout)
(20) end within.

Fig. 13.15 Real-time rendezvous between two processes p and q

Proof Let us observe that if each process receives a control message from the other
process, they necessarily return the same result (commit or timeout), because
they compute the same predicate on the same values. Moreover, if no process
receives a message by its deadline, it follows from the temporal scope construct that
they both return timeout.
If a process receives a control message by its deadline while the other does not,
we have the following, where, without loss of generality, p is the process that does
not receive a message by its deadline. Hence p returns timeout. As it has not
received the message READY(τq, deadlineq) by its deadline, we have
τq + δ + θ > deadlinep. It follows that, when evaluated by q, the predicate will be
false, and q will return timeout (line 17).
Finally, it follows from lines 15–16 that q delivers the application message m if
and only if the predicate is satisfied. □

A Simple Improvement If the message READY() (resp., MSG()) has already
arrived when the sender (resp., receiver) invokes its operation, this process can
immediately evaluate the predicate and return timeout if the predicate is false. In
that case, it may also save the sending of the message MSG() (resp., READY()). If
the predicate is true, it executes its algorithm as described in Fig. 13.15.
It follows that, whatever its fate (commit or timeout), a real-time rendezvous
between two processes requires one or two implementation messages.

Fig. 13.16 When the rendezvous must be successful (asymmetric algorithm)

13.4.3 Introducing Nondeterministic Choice

In some applications, a sender process may want to propose a rendezvous with
deadline to any receiver process of a predefined set of processes, or a receiver may
want to propose a rendezvous to any sender of a predefined set of processes. The
first rendezvous that is possible is then committed. We consider here the second
case (multiple senders and one receiver). Each of the senders and the receiver
defines its own deadline. Let p1, p2, . . . , pn denote the senders and q denote the
receiver.

Principle of the Solution: A Predicate for the Receiver As the invocation of
timed_send(m, deadlinei) appears in a deterministic context, the implementation of
this operation is close to that of Fig. 13.15, namely, the sender pi sends the message
MSG(m, deadlinei) to q.
On the other hand, differently from the two-process case, the receiver process
q now plays a particular role, namely, it has to select a single sender from the n
senders. Moreover, the rendezvous is allowed to fail only if no rendezvous is
possible, which means that q has to select one of them if several rendezvous are
possible.
Let μq be the date (measured with its physical clock) at which the receiver q
receives the message MSG(m, deadlinei) from the sender pi. If the rendezvous is
possible with pi, q sends back the message COMMIT(). Otherwise it sends no
message to pi, which will obtain the value timeout when its deadline is reached.
Hence, as before, the core of the algorithm is the predicate used by the receiver
to answer (positively) or not answer a sender. Let us observe that μq ≤ deadlineq is
a necessary requirement to answer, but it is not sufficient to ensure that both q and
pi take the same decision concerning their rendezvous (commit or timeout).
To that end, similarly to the two-process case, and assuming that the message
COMMIT() sent by q to pi is sent at time μq, it is required that this message
COMMIT() be received by pi before its deadline, i.e.,

μq + δ + θ ≤ deadlinei.
This is illustrated in Fig. 13.16. If this predicate is false, there is no certainty that the
control message it is about to send to pi will arrive at pi before its deadline. Hence,
if the predicate is false, q does not send the message COMMIT () to pi .

operation timed_send(m, deadlinei) to q is % invoked by pi %
(1) send MSG(m, deadlinei) to q;
(2) within deadlinei do
(3)   when COMMIT() is received from q do return(commit)
(4)   at deadline occurrence return(timeout)
(5) end within.

operation timed_receive(x, deadlineq) is % invoked by q %
(6) within deadlineq do
(7)   when MSG(m, deadlinei) is received from pi do
(8)     if (ph_clockq + δ + θ ≤ deadlinei)
(9)       then send COMMIT() to pi;
(10)           x ← m; return(commit)
(11)    end if
(12)  at deadline occurrence return(timeout)
(13) end within.
Fig. 13.17 Nondeterministic rendezvous with deadline

As we can see, differently from the two-process case, where the algorithms
executed by the sender and the receiver are symmetric, the principle is now based
on an asymmetric relation: the sender (which is in a deterministic context) acts as a
client, while the receiver (which is in a nondeterministic context) acts as a server.
Despite this asymmetry, the aim is the same as before: allow the processes to take a
consistent decision concerning their rendezvous.

The Asymmetric Algorithm for Nondeterministic Choice The algorithm is
described in Fig. 13.17. When a process pi invokes timed_send(m, deadlinei), it
sends the implementation message MSG(m, deadlinei) to q (line 1), and enters a
temporal scope construct whose ending date is upper bounded by deadlinei
(line 2). If, while
scope construct whose ending date is upper bounded by deadlinei (line 2). If, while
it is in its temporal scope, pi receives a message from q, its rendezvous with q is
committed and it returns the value commit (line 3). If it receives no message from
q by its deadline, the rendezvous failed and pi returns the value timeout (line 4).
When it invokes timed_receive(x, deadlineq), the receiver q enters a temporal
scope construct upper bounded by its deadline (line 6). Then it starts waiting. If,
during this period, it receives a message MSG(m, deadlinei) from a sender pi such
that the predicate ph_clockq + δ + θ ≤ deadlinei is satisfied, it commits the
rendezvous with pi by sending back to pi the message COMMIT() (lines 7–11). If no
message satisfying this predicate is received by its deadline, no rendezvous can be
committed and, consequently, q returns the value timeout (line 12).

Remark This asymmetric algorithm can be used when there is a single sender.
We then obtain an asymmetric algorithm for a rendezvous between two processes
(see Problem 5).

13.4.4 n-Way Rendezvous with Deadline

The Problem In some real-time applications, processes need to have a
rendezvous with deadline involving all of them. This can be seen as a decision
problem in which all the processes must agree on the same output (commit or
timeout). Moreover, if the output is commit, each process must have received
the messages sent by the other processes. This section presents an algorithm solving
n-way rendezvous with deadline (where n is the number of processes). This
algorithm can be modified to work with predefined subsets of processes.
The operation offered to the processes for such a multirendezvous is denoted
multi_rdv(m, deadline), where m is the message sent by the invoking process pi,
1 ≤ i ≤ n, and deadline is its deadline for the global rendezvous.

The Predicate As each process is now simultaneously a sender and a receiver, the
algorithm is symmetric, in the sense that all the processes execute the same code.
Actually, this algorithm is a simple extension of the symmetric algorithm for two
processes described in Fig. 13.15.
Hence, when a process pi invokes multi_rdv(mi, deadlinei), it sends the message
MSG(mi, τi, deadlinei) to each other process (where, as before, τi is the sending
date of this message). Let us consider a process pi that has received such a message
from each other process. As the rendezvous is global, the two-process predicate
(τp + δ + θ ≤ deadlineq) ∧ (τq + δ + θ ≤ deadlinep) has to be replaced by a
predicate involving all pairs of processes. We consequently obtain the generalized
predicate

∀(k, ℓ): τk + δ + θ ≤ deadlineℓ.
Instead of considering all pairs (τk, deadlineℓ), this predicate can be refined as
follows:
• The n starting dates τ1, . . . , τn are replaced by a single one, namely the worst one
from the predicate point of view, i.e., the latest sending time τ = max({τx}1≤x≤n).
• Similarly, the n deadlines can be replaced by the worst one from the predicate
point of view, i.e., the earliest deadline deadline = min({deadlinex}1≤x≤n).
The resulting predicate is consequently

τ + δ + θ ≤ deadline.
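
In code, the refined predicate reduces to a single comparison (a minimal Python sketch with our own names):

def multi_rdv_predicate(taus, deadlines, delta, theta):
    tau = max(taus)                 # latest sending date
    deadline = min(deadlines)       # earliest deadline
    return tau + delta + theta <= deadline

# With delta = 5 and theta = 1:
print(multi_rdv_predicate([10, 11, 12], [30, 25, 28], 5, 1))  # True: 18 <= 25
print(multi_rdv_predicate([10, 11, 20], [30, 25, 28], 5, 1))  # False: 26 > 25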

The Algorithm As indicated, the resulting algorithm is a simple extension of the
two-process symmetric algorithm of Fig. 13.15. It is described in Fig. 13.18.
The local variable reci is used to count the number of messages received by pi. If
the deadline occurs before reci = n − 1, one or more processes are late with respect
to pi's deadline and, consequently, the multirendezvous cannot occur. In this case,
pi returns the value timeout (line 13). Otherwise, pi has received a message from
each other process before its deadline. In this case, pi returns commit or timeout
according to the value of the predicate (lines 9–12). The proof that this algorithm is
correct is a straightforward generalization of the proof of Theorem 20.

operation multi_rdv(m, deadlinei ) is % invoked by pi %


(1) reci ← 0; τi ← ph_clocki :
(2) for each j ∈ {1, . . . , n} \ {i} do send MSG (m, τi , deadlinei ) to pj end for;
(3) within deadlinei do
(4) while (reci < n − 1) do
(5) when MSG (mj , τj , deadlinej ) is received from pj do reci ← reci + 1
(6) end while;
(7) let τ = max({τx }1≤x≤n );
(8) let deadline = min({deadlinex }1≤x≤n );
(9) if (τ + δ + θ ≤ deadline)
(10) then return(commit)
(11) else return(timeout)
(12) end if
(13) at deadline occurrence return(timeout)
(14) end within.

Fig. 13.18 Multirendezvous with deadline

Observing that, for any i, it is not necessary that the constraint τi + δ + θ ≤ deadlinei be satisfied, an improved predicate, which allows more rendezvous to succeed, can be designed. This is the topic of Problem 7.

13.5 Summary
This chapter was on synchronous communication. This type of communication syn-
chronizes the sender and the receiver, and is consequently also called rendezvous
or interaction. The chapter first gave a precise meaning to what is called synchronous communication, and presented a characterization based on a specific
message pattern called a crown.
The chapter then presented several algorithms implementing synchronous com-
munication, each appropriate to a specific context: planned or forced interactions in fully asynchronous systems, and rendezvous with deadline suited to
synchronous systems.

13.6 Bibliographic Notes


• The notion of a synchronous communication originated in the work of P. Brinch
Hansen [63] (who introduced the notion of distributed processes), and C.A.R.
Hoare [186] (who introduced the notion of communicating sequential processes,
thereafter known under the name CSP).
The CSP language constructs intimately relate synchronous communication to the notions of guarded command and nondeterminism introduced by E.W. Dijkstra [113].
The notion of remote procedure call developed in some distributed sys-
tems [54] is strongly related to synchronous communication.

• The definition of synchronous communication based on logical scalar clocks


(as used in this book) is due to V.K. Garg [149, 274]. The example used in
Sect. 13.1.2 is from [149].
• The characterization of synchronous communication based on the absence of the
message pattern called a crown is due to B. Charron-Bost, G. Tel, and F. Mat-
tern [88].
Another characterization based on the acyclicity of a message graph is given
by T. Soneoka and T. Ibaraki in [356].
• The asymmetric algorithm based on an acyclic client–server relation among the
processes, presented in Sect. 13.2.2, is due to A. Silberschatz [343].
• The asymmetric token-based algorithm presented in Sect. 13.2.3 is due to
R. Bagrodia [31].
• The algorithm for forced (message delivery) interactions presented in Sect. 13.3
is due to V.V. Murty and V.K. Garg [273].
• Other algorithms implementing synchronous rendezvous in asynchronous sys-
tems, and analyses of the concepts of synchrony and nondeterminism associated
with synchronous communications, can be found in many papers (e.g., [30, 51,
57, 58, 66, 80, 94, 131, 133, 354]).
• The notion of multiparty interaction in asynchronous systems is addressed in
[85, 121].
• The notion of rendezvous with deadline and the algorithms presented in Sect. 13.4
are due to I. Lee and S.B. Davidson [233].
• Other notions of synchronous communication have been introduced, mainly in
the domain of distributed simulation [140, 199, 263, 329].

13.7 Exercises and Problems


1. Let us recall that a sequential observation Ŝ = (H, →to) of a distributed execution Ĥ = (H, →ev) is a topological sort of the partial order Ĥ, i.e., the total order →to respects the partial order →ev (see Chap. 6).
Show that the communication pattern of a distributed execution Ĥ satisfies the synchronous communication property if and only if Ĥ has a sequential observation Ŝ = (H, →to) in which, for each message m, the receive event immediately follows the send event (i.e., both events can be packed to form a single event).
2. Enrich the algorithm presented in Sect. 13.2.2 so that:
• During each interaction, a message is transmitted in each direction.
• Some interactions allow a message in each direction to be transmitted, while
other interactions allow only one message to be transmitted (as in Sect. 13.2.2).
3. Let us consider the token-based implementation for planned interactions de-
scribed in Fig. 13.10. Given two processes pi and pj , determine an upper bound
on the number of messages exchanged by these processes to commit a ren-
dezvous or discover that it cannot be realized.
Solution in [31].

Fig. 13.19 Comparing two date patterns for rendezvous with deadline

4. Design an algorithm implementing a rendezvous in which each message is sent


to a set of destination processes. The communication operations offered to pro-
cesses are “synchr_send(m) to dest(m)” and “synchr_del()”, where dest(m) is
the set of processes to which m is sent, and synchr_del() allows the invoking
process to deliver a message sent by any process. The set dest(m) can be differ-
ent for each message m. Moreover, the communication context is that of forced
interactions (see Sect. 13.3).
Everything has to appear in such a way that, for each message m, the same
logical date can be associated with its send event and its delivery events by the
processes of dest(m) (synchrony property stated in Sect. 13.1.1).
Solution in [272].
5. Let us consider rendezvous with deadline between two processes. Compare the
respective advantages of the symmetric algorithm of Fig. 13.15 and the asym-
metric algorithm of Fig. 13.17 (instantiated with a single sender). To that end, it
is interesting to compare them in the two date patterns described in Fig. 13.19,
where p is the sender and q the receiver.
6. Let us consider rendezvous with deadline in a synchronous system where the maximal drift of a local clock with respect to real time is bounded by ρ. This physical time setting was introduced in Sect. 9.5.5. Let Δ be a real-time duration. This quantity of time is measured by a process pi, with its physical clock, as a value Δi such that
Δ(1 − ρ) ≤ Δi ≤ Δ(1 + ρ).
The left side corresponds to the case where the physical clock of pi is slow, while the right side corresponds to the case where it is fast. Hence, if the clock of one process is always fast while the clock of another process is always slow, the difference between these two clocks can become arbitrarily large.
Modify the algorithms described in Figs. 13.15, 13.17, and 13.18 so that they are based on the assumption ρ (instead of the assumption θ, which assumes an underlying clock synchronization algorithm). What do you notice for the modified algorithm from Fig. 13.17?
Solution in [233].
7. Let us consider the algorithm for multirendezvous with deadline presented in
Sect. 13.4.4. As noticed at the end of this section, it is not necessary that, for any
i, the constraint τi + δ + θ ≤ deadlinei be satisfied.

To show this, in addition to τ and deadline, let us define two new values as follows: τ′ is the second greatest sending time of a message MSG(m, τi, deadlinei) sent at line 2, and deadline′ is the second smallest deadline value. These values can be computed by a process at line 7 and line 8. Moreover, let same_proc(τ, deadline) be a predicate whose value is true if and only if the values τ and deadline come from the same process. Show that the predicate (τ + δ + θ ≤ deadline) used at line 9 can be replaced by the following weaker predicate:
(same_proc(τ, deadline) ∧ (τ′ + δ + θ ≤ deadline) ∧ (τ + δ + θ ≤ deadline′))
∨ (¬same_proc(τ, deadline) ∧ (τ + δ + θ ≤ deadline)).
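As an illustration only (not the requested proof), this weaker predicate can be transcribed in Python to test it on small examples; the function and parameter names are hypothetical:

def weaker_predicate(taus, deadlines, delta, theta):
    """Evaluate the weaker multirendezvous predicate.

    taus[i] and deadlines[i] are the values sent by process i."""
    n = len(taus)
    k_tau = max(range(n), key=lambda i: taus[i])       # process with the latest tau
    k_dl = min(range(n), key=lambda i: deadlines[i])   # process with the earliest deadline
    tau, deadline = taus[k_tau], deadlines[k_dl]
    tau2 = max(taus[i] for i in range(n) if i != k_tau)            # second greatest tau
    deadline2 = min(deadlines[i] for i in range(n) if i != k_dl)   # second smallest deadline
    if k_tau == k_dl:                                  # same_proc(tau, deadline)
        return (tau2 + delta + theta <= deadline) and (tau + delta + theta <= deadline2)
    return tau + delta + theta <= deadline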
Part V
Detection of Properties
on Distributed Executions

The two previous parts of the book were on the enrichment of the system to pro-
vide processes with high-level operations. Part III was on the definition and the
implementation of operations suited to the consistent use of shared resources, while
Part IV introduced communication abstractions with specific ordering properties.
In both cases, the aim is to allow application programmers to concentrate on their
problems and not on the way some operations have to be implemented.
This part of the book is devoted to the observation of distributed computations.
Solving an observation problem consists in superimposing, on a computation, a distributed algorithm that records appropriate information on this computation in order to detect whether it satisfies some property. The nature of the recorded information is, of course, related to the property one is interested in detecting.
Two detection problems are investigated in this part of the book: the detection of
the termination of a distributed execution (Chap. 14), and the detection of deadlocks
(Chap. 15). Both properties “the computation has terminated” and “there is dead-
lock” are stable properties, i.e., once satisfied they remain satisfied in any future
state of the computation.

Remark Other property detection problems concern the detection of unstable


properties such as the conjunction of local predicates or the detection of proper-
ties on execution flows. Their detection is a more advanced topic not covered in this
book. The interested reader is invited to consult the following (non-exhaustive) list
of references [73, 89, 98, 137, 139, 153, 154, 192, 193, 378].
Chapter 14
Distributed Termination Detection

This chapter is on the detection of the termination of a distributed computation. This


problem was posed and solved for the first time in the early 1980s independently by
E.W. Dijkstra and C.S. Scholten (1980) and N. Francez (1980). This is a non-trivial
problem. While, in sequential computing, the termination of the only process indi-
cates that the computation has terminated, this is no longer true in distributed com-
puting. Even if we were able to observe simultaneously all the processes, observing
all of them passive could not allow us to conclude that the distributed execution has
terminated. This is because some messages can still be in transit, which will reac-
tivate their destination processes when they arrive, and these re-activations will, in
turn, entail the sending of new messages, etc.
This chapter presents several models of asynchronous computations and obser-
vation/detection algorithms suited to termination detection in each of them. As in
other chapters, the underlying channels are not required to be FIFO. Moreover, while
channels are bidirectional, the term “output” channels (resp., “input” channels) is
used when considering message send (resp., message reception).

Keywords AND receive · Asynchronous system · Atomic model · Counting ·


Diffusing computation · Distributed iteration · Global state ·
k-out-of-n receive statement · Loop invariant ·
Message arrival vs. message reception · Network traversal ·
Non-deterministic statement · OR receive statement · Reasoned construction ·
Receive statement · Ring · Spanning tree · Stable property · Termination detection ·
Wave

14.1 The Distributed Termination Detection Problem

14.1.1 Process and Channel States

As in previous chapters, p1 , . . . , pn denote the asynchronous processes. More-


over, we assume that the underlying asynchronous communication network is fully
connected.


Fig. 14.1 Process states for termination detection

Process States Each process pi has a local variable denoted statei , the value of
which is active or passive. Initially, some processes are in the active state, while the
others are in the passive state. Moreover, at least one process is assumed to be in
the active state. As far as the behavior of a process pi is concerned, we have the
following (Fig. 14.1):
• When statei = active:
– pi can execute local computations and send messages to the other processes.
– pi can invoke a message reception. In that case, the value of statei automati-
cally changes from active to passive, and pi starts waiting for a message.
• When pi receives a message (we then have statei = passive), the value of statei automatically changes from passive to active.
It is assumed that there is an invocation of a message reception for every message
that is sent. (This assumption will be removed in Sect. 14.5.)

Channel States While each channel is bidirectional at the network level, it is


assumed that at the application level the channels are unidirectional. Given two pro-
cesses pi and pj , this allows us, from an observer’s point of view, to distinguish the
state of the channel from pi to pj from the state of the channel from pj to pi .
The abstract variable empty(i,j ) is used to describe the state of the channel from
pi to pj . It is a Boolean variable whose value is true if and only if there is no
application message in transit from pi to pj . Differently from the local variables statei ,
the variables empty(i,j ) are “abstract” because no process can instantaneously know
their value (the sender does not know when messages arrive, and the receiver does
not know when they are sent).

14.1.2 Termination Predicate

The Predicate Intuitively, a distributed computation is terminated if there is a


time at which, simultaneously, all the processes are passive and all the channels are
empty.
More formally, let us consider an external time reference (real-time, which is not
known by the processes); τ being any date in this time framework, let stateτi and

emptyτ(i,j ) denote the value of statei and empty(i,j ) at time τ , respectively. C being a
distributed execution, let us define the predicate TERM(C, τ) as follows:
TERM(C, τ) ≡ (∀i: stateτi = passive) ∧ (∀i, j: emptyτ(i,j)).
This predicate captures the fact that, at time τ , C has terminated. Finally, the predi-
cate that captures the termination of C, denoted TERM(C), is defined as follows:
TERM(C) ≡ (∃τ: TERM(C, τ)).
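As a minimal sketch (not part of the book's algorithms), this predicate can be evaluated by an omniscient external observer that sees a global state; the dictionaries state and in_transit are illustrative assumptions:

def term(state, in_transit):
    """TERM(C, tau) evaluated on a snapshot taken at time tau.

    state      -- dict: process id -> 'active' or 'passive'
    in_transit -- dict: (sender, receiver) -> number of messages in transit
    Returns True iff all processes are passive and all channels are empty."""
    return (all(s == 'passive' for s in state.values())
            and all(n == 0 for n in in_transit.values()))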
As already seen, a stable property is a property that, once true, remains true forever.

Theorem 21 TERM(C) is a stable property.

Proof Assuming that TERM(C, τ) is satisfied, let τ′ ≥ τ. As, at time τ, no process is active and all channels are empty, it follows from the rules governing the behavior of the processes that no process can be re-activated by a message reception. Consequently, as only active processes can send messages, no new message will be sent after τ. Hence TERM(C, τ) ⇒ TERM(C, τ′). □

14.1.3 The Termination Detection Problem

Given a distributed computation C, the termination detection problem consists in


designing an algorithm A that satisfies the two following properties:
• Safety. If, at time τ, the algorithm A claims that C has terminated, then TERM(C, τ) is true (i.e., the computation has terminated at some time τ′ ≤ τ).
• Liveness. If C terminates at time τ (i.e., TERM(C, τ) ∧ (∀τ′ < τ: ¬TERM(C, τ′))), then there is a time τ′′ at which A will claim that the computation has terminated.
As usual, the safety property is on consistency: An algorithm is not allowed to
claim termination before it occurs. But safety alone is not sufficient: an algorithm that never claims termination is trivially safe. This is prevented by the liveness property.

14.1.4 Types and Structure of Termination Detection Algorithms

The algorithms detecting the termination of distributed computations are structured


in two parts (see Fig. 14.2):
• At each process, a local observer module that observes the corresponding appli-
cation process. Basically, the local observer at a process pi manages the local
variable statei and a few other control variables related to the state and the com-
munications of pi (e.g., number of messages sent and received by pi ).

Fig. 14.2 Global structure of the observation modules

• A distributed algorithm (with a module per process) which allows the local ob-
servers to cooperate in order to detect the termination.
The termination detection algorithms differ on two points: (a) the assumptions
they do on the behavior of the computation, and (b) the way they cooperate.
As far as notations are concerned, the local observer of a process pi is denoted
obsi .

Remark As termination is a stable property, it is possible to use a stable property


detection algorithm based on global states (such as the one presented in Sect. 6.5.3)
to detect termination.
The design of specific termination detection algorithms is motivated by efficiency and by the fact that termination detection algorithms are not restricted to using consistent global states: they may rely on both consistent and inconsistent global states.

14.2 Termination Detection in the Asynchronous Atomic Model

14.2.1 The Atomic Model

The atomic model is a simplified model in which only messages take time to travel from their senders to their destination processes. Message processing is done atomically, in the sense that it appears to take no time. Hence, whenever it can be observed, a process is passive.
A time-diagram of an execution in this model is represented in Fig. 14.3. Message
processing is denoted by black dots. Initially, p2 sends a message m1 to p1 , while p1
and p3 are waiting for messages. When it receives m1 , p1 processes it and sends
two messages, m2 to p3 and m3 to p2 , etc. The processing of the messages m4 ,
m7 , and m8 entails no message sending, and consequently (from a global observer
point of view), the distributed computation has terminated after the reception and
the processing of m8 by p2 .

Fig. 14.3 An execution in the asynchronous atomic model

Fig. 14.4 One visit is not sufficient

14.2.2 The Four-Counter Algorithm

An Inquiry-Based Principle The idea that underlies this algorithm is fairly sim-
ple. On the observation side, each process pi is required to count, in reci , the number of messages it has received, and, in senti , the number of messages it has sent.
To simplify the presentation we consider that there is an additional control pro-
cess denoted observer. This process is in charge of the detection of the termination.
It sends an inquiry message to each process pi , which answers by returning its pair
of values (senti , reci ). When the observer has all the pairs, it computes the total number S of messages which have been sent, and the total number R of messages which have been received. If S ≠ R, it starts its next inquiry.
Unfortunately, as shown by the counterexample presented in Fig. 14.4, the ob-
server cannot conclude from S = R that the computation has terminated. This is
due to the asynchrony of communication. The inquiry messages REQUEST () are
not received at the same time by the application processes, and p1 sends back the
message ANSWER (0, 1), p2 sends back the message ANSWER (1, 0), and p3 sends
back the message ANSWER (0, 0). We then have S = 1 and R = 1, while the com-
putation is not yet terminated. Due to the asynchrony among the reception of the
messages REQUEST () at different processes, the final counting erroneously asso-
ciates the sending of m1 with the reception of m2 . Of course, it could be possible to
replace the counting mechanism by recording the identity of all the messages sent
and received and then compare sets instead of counters, but this would be too costly
and, as we are about to see, there is a simpler counter-based solution.

% local observation at each process pi %


when a message m is received from a process pj do
(1) reci ← reci + 1;
(2) let x = number of messages sent due to the processing of m;
(3) senti ← senti + x.

% interaction of pi ’s local observer with the centralized observer %


when a message REQUEST () is received from observer do
(4) send ANSWER (senti , reci ) to observer.

detection task observer is % behavior of the centralized observer %


(5) S1 ← 0; R1 ← 0;
(6) repeat forever
(7) for each i ∈ {1, . . . , n} do send REQUEST () to pi end for;
(8) wait (an answer message ANSWER (senti , reci ) received from each pi );
(9) S2 ← Σ1≤i≤n senti ; R2 ← Σ1≤i≤n reci ;
(10) if (R1 = S2) then claim termination; exit loop
(11) else S1 ← S2; R1 ← R2
(12) end if
(13) end repeat.

Fig. 14.5 The four-counter algorithm for termination detection

An Algorithm Based on Sequential Inquiries A solution, due to F. Mattern


(1987), consists in replacing an inquiry by two consecutive inquiries. The resulting
algorithm is described in Fig. 14.5.
When an application process pi receives a message it processes the message
and accordingly increases reci and senti (lines 1–3). Moreover, it sends back the message ANSWER (senti , reci ) (line 4) when it receives a request from the observer.
The observer sends a request message to each application process (line 7) and
waits for the corresponding answers (line 8). It then computes the results of its
inquiry and stores them in S2 and R2 (line 9). S1 and R1 denote the values of its
previous inquiry (lines 5 and 11). If S2 = R1, it claims termination. Otherwise, it
starts a new inquiry. When considering two consecutive inquiries, let us observe that,
when true, the termination predicate states that the number of messages perceived as
sent by the second inquiry is equal to the number of messages perceived as received
by the first inquiry. As we are about to see in the proof of the next theorem, the important point lies in the fact that inquiries are sequential, and this allows us to cope with the fact that each inquiry is asynchronous.
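A compact Python sketch of the centralized observer's behavior may help; query_all is an assumed primitive that performs one (asynchronous) inquiry and returns the pairs (sent_i, rec_i) collected from all processes:

def four_counter_observer(query_all):
    """Sketch of the observer of the four-counter algorithm.

    query_all() performs one inquiry and returns a list of (sent_i, rec_i)
    pairs, one per process. Termination is claimed when the messages counted
    as sent by the current inquiry equal the messages counted as received by
    the *previous* (sequential) inquiry."""
    pairs = query_all()                  # first inquiry
    S1, R1 = sum(s for s, _ in pairs), sum(r for _, r in pairs)
    while True:
        pairs = query_all()              # next inquiry, launched after the previous one
        S2, R2 = sum(s for s, _ in pairs), sum(r for _, r in pairs)
        if S2 == R1:                     # safe: implies S' = R' between the two inquiries
            return                       # claim termination
        S1, R1 = S2, R2                  # S1 is kept only to mirror the figure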

Theorem 22 Assuming that the application executes in the asynchronous atomic


model, the four-counter algorithm satisfies the safety and liveness properties defining
the termination detection problem.

Proof Let us first prove the liveness property. If the computation terminates, there is a time τ after which no message is sent and all the messages previously sent have been received. Let us consider the first two inquiries launched by the observer

Fig. 14.6 Two consecutive inquiries

after τ . It follows from the previous observation that S1 = R1 = S2 = R2, and the
algorithm detects termination.
Proving the safety property consists in showing that, if termination is claimed,
then the computation has terminated. To that end, let sentτi be the value of senti at time τ. As counters do not decrease, we trivially have (τ ≤ τ′) ⇒ (sentτi ≤ sentτ′i), and the same holds for the counters reci .
Let us consider Fig. 14.6, which abstracts two consecutive inquiries launched by
the observer. The first inquiry obtains the pair of values (S1, R1), and the second
inquiry obtains the pair (S2, R2). As the second inquiry is launched after the results
of the previous one have been computed, they are sequential. Hence, there is a time, between the end of the first inquiry and the beginning of the second one, at which the number of messages sent is S′ and the number of messages received is R′ (these values are known neither by the processes nor by the observer). We have the following.
• S1 ≤ S′ ≤ S2 and R1 ≤ R′ ≤ R2 (from the previous observation on the nondecreasing property of the counters senti and reci ).
• If S2 = R1, we obtain (from the previous item) S′ ≤ S2 = R1 ≤ R′.
• S′ ≥ R′ (a message can be received only after it has been sent).
• It follows from the two previous items that (S2 = R1) ⇒ (S′ = R′).
Hence, if the predicate is satisfied, the computation was terminated before the second inquiry was launched, which concludes the proof of the theorem. □

14.2.3 The Counting Vector Algorithm

Principle The principle that underlies this detection algorithm, which is due to
F. Mattern (1987), consists in using a token that navigates the network of pro-
cesses. This token carries a vector msg_count[1..n] such that, from its point of
view, msg_count[i] is the number of messages sent to pi minus the number of mes-
sages received by pi . When a process receives the token and the token is such that
msg_count = [0, . . . , 0], it claims termination.

% local observation at each process pi %


when a message m is received from a process pj do
(1) msgi [i] ← msgi [i] − 1;
(2) let senti [k] = number of messages sent to pk due to the processing of m;
(3) for each k ∈ {1, . . . , n} \ {i} do msgi [k] ← msgi [k] + senti [k] end for.

local detection task at process pi % cooperation among local observers %


(4) repeat forever
(5) wait (TOKEN (msg_count));
(6) for each k ∈ {1, . . . , n} do
(7) msg_count[k] ← msg_count[k] + msgi [k];
(8) msgi [k] ← 0
(9) end for;
(10) if (all processes have been visited at least once) ∧ (msg_count = [0, . . . , 0])
(11) then claim termination; exit loop
(12) else send TOKEN (msg_count[1..n]) to next process on the ring
(13) end if
(14) end repeat.

Fig. 14.7 The counting vector algorithm for termination detection

Algorithm The algorithm is described in Fig. 14.7. It assumes that a process


does not send messages to itself. Moreover, in order that all processes are visited
by the token, it also assumes that the token moves along a ring including all the
processes. (According to Fig. 14.2, this is the way the local observers cooperate to
detect termination.)
Each process pi manages a local vector msgi [1..n] (initialized to [0, . . . , 0]) such
that msgi [i] is decreased by 1 each time pi receives a message, and msgi [j ] is
increased by 1 each time pi sends a message to pj (lines 1–3). Moreover, when the
token is at pi , msgi [1..n] is added to the vector msg_count[1..n], and reset to its
initial value (lines 6–9).
As previously, let xτ denote the value of x at time τ. Let nb_msgτi denote the number of messages in transit to the process pi at time τ. The algorithm maintains the following invariant:
∀i: msg_countτ [i] + Σ1≤k≤n msgτk [i] = nb_msgτi .

After a process pi has received the token and updated its value, it checks the
predicate (msg_count = [0, . . . , 0]) if the token has visited each process at least once
(line 10). If the predicate is satisfied, pi claims termination (line 11). Otherwise, it
propagates the token to the next process (line 12).
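The token-handling step (lines 5–13 of Fig. 14.7) can be sketched in Python as follows; the tracking of the all-visited condition is assumed to be done elsewhere (e.g., by counting token hops), and the names are illustrative:

def on_token_visit(msg_count, msg_i, all_visited):
    """One token visit (sketch).

    msg_count   -- vector carried by the token, one entry per process
    msg_i       -- local vector of the visited process (reset after the merge)
    all_visited -- True iff the token has already visited every process once
    Returns True when the visited process can claim termination."""
    for k in range(len(msg_count)):
        msg_count[k] += msg_i[k]     # add the local knowledge to the token
        msg_i[k] = 0                 # reset the local vector
    return all_visited and all(c == 0 for c in msg_count)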

Example An example illustrating the algorithm is described in Fig. 14.8. The


visit of the token at each process is indicated with a rectangle. The order in which
the token visits the processes is indicated by integers representing “real time”.
The value of the array msg_count[1..n] given below is its value after it has been updated by its current owner (i.e., after line 9). We have the following.

Fig. 14.8 The counting vector algorithm at work

• At time τ = 1: msg_count1 [1..n] = [0, 1, 0, 1].


• At time τ = 2: msg_count2 [1..n] = [1, 0, 0, 2].
• At time τ = 3: msg_count3 [1..n] = [1, 1, −1, 2].
• At time τ = 4: msg_count4 [1..n] = [2, 1, −1, 0].
• At time τ = 5: msg_count5 [1..n] = [0, 1, 0, 0].
• At time τ = 6: msg_count6 [1..n] = [0, 0, 0, 0].
msg_count2 [1..n] = [1, 0, 0, 2] because, to the token’s knowledge, a message
(denoted m) is in transit to p1 , no messages are in transit to p2 and p3 , and a message
is in transit to p4 . msg_count4 [1..n] = [2, 1, −1, 0] because, to the token’s knowl-
edge, two messages are recorded as being in transit to p1 , a message is in transit
to p2 , a message (namely m′) has been received by p3 but this message is not yet
recorded as sent (hence the value −1), and no messages are in transit to p4 .
While the computation has terminated when the token visits p1 at time τ = 5, the content of msg_count5 does not allow p1 to claim termination (the token has not yet seen the reception of the message sent to p2 ). Differently, at time τ = 6, the content of msg_count6 allows p2 to detect termination.

Termination Detection, Global States, and Cuts The notions of a cut and of a
global state were introduced in Sects. 6.1.3, and 6.3.1, respectively. Moreover, we
have seen that these notions are equivalent.
Actually, the value of the vector msg_count defines a global state. The global
state Σ2 associated with msg_count2 and the global state Σ4 associated with
msg_count4 are represented on the figure.
As we can see, the global states associated with vectors all of whose values are
not negative are consistent, while the ones associated with vectors in which some
values are negative are inconsistent. This comes from the fact that a negative value
in msg_count[i] means that pi is seen by the token as having received messages,
but their corresponding sendings have not yet been recorded by the token (as is
the case for msg_count3 and msg_count4 , which do not record the sending of m′).
The positive entries of a vector with only non-negative values indicate how many
messages have been sent and are not yet received by the corresponding process (as

an example, in the consistent global state associated with msg_count5 , the message sent to p2 is in transit).

14.2.4 The Four-Counter Algorithm vs.


the Counting Vector Algorithm

Similarity In both detection algorithms, no application message is overloaded


with control information. Observation messages and application messages are inde-
pendent.

Difference The four-counter algorithm requires an additional observer process.


Moreover, after the computation has terminated, the additional observer process
may have to issue up to two consecutive inquiries before claiming termination.
The counting vector algorithm does not require an additional process, but needs
an underlying structure that allows the token to visit all the processes. This structure
is not required to be a ring. Between two visits to a given process pi , the token may
visits any other process a finite number of times. Moreover, after termination has oc-
curred, the token can claim it during the first visit (of all the processes), which starts
after the computation has terminated. This is, for example, the case in Fig. 14.8.
Other features of this algorithm are addressed in Problem 1.

14.3 Termination Detection in Diffusing Computations

This section and the following ones consider the base asynchronous model in which
a process can be active or passive as defined in Sect. 14.1.1.

14.3.1 The Notion of a Diffusing Computation

Some applications are structured in such a way that initially a single process is
active. Without loss of generality, let p1 be this process (which is sometimes called
the environment).
Process p1 can send messages to other processes, which then become active. Let pi be such a process. It can, in turn, send messages to other processes, etc. (hence the
name diffusing computation). After a process has executed some local computation
and possibly sent messages to other processes, it becomes passive. It can become
active again if it receives a new message. The aim is here to design a termination
detection algorithm suited to diffusing computations.

14.3.2 A Detection Algorithm Suited to Diffusing Computations

Principle: Use a Spanning Tree As processes become active when they receive
messages, the idea is to capture the activity of the application with a spanning
tree which evolves dynamically. A process that is not in the tree enters the tree
when it becomes active, i.e., when it receives a message. A process leaves the tree
when there is no more activity in the application that depends on it. This princi-
ple and the corresponding detection algorithm are due to E.W. Dijkstra and C.S.
Scholten (1980).

How to Implement the Previous Idea Initially, only p1 is in the tree. Then, we
have the following. Let us first consider a process pi that receives a message from
a process pj . As we will see, a process that sends a message necessarily belongs to
the tree. Moreover, due to the rules governing the behavior of a process, a process
is always passive when it receives a message (see Fig. 14.1).
• If pi is not in the tree, it becomes active and enters the tree. To that end, it defines
the sender pj of the message as its parent in the tree. Hence, each process pi
manages a local variable parenti . Initially, parenti = ⊥ at any process pi , except
for p1 , for which we have parent1 = 1.
• If pi is in the tree when it receives the message from pj , it becomes active and,
as it does not need to enter the tree, it sends by return to pj a message ACK () so
that pj does not consider it as one of its children.
Let us now introduce a predicate that allows a process to leave the tree. First, the
process has to be passive. But this is not sufficient. If a process pi sent messages,
these messages created activity, and it is possible that, while pi is passive, the ac-
tivated processes are still active (and may have activated other processes, etc.). To
solve this issue, each process pi manages a local variable, called deficiti , which
counts the number of messages sent by pi minus the number of acknowledgment
messages it has received.
• If deficiti > 0, some of the messages sent by pi have not yet been acknowledged.
Hence, there is possibly some activity in the subtree rooted at pi . Consequently,
pi conservatively remains in the tree.
• If deficiti = 0, all the messages that pi has sent have been acknowledged, and
consequently no more activity depends on pi . In this case, it can leave the tree,
and does this by sending the message ACK () to its parent.
Hence, when considering a process pi which is in the tree (i.e., such that parenti ≠ ⊥), the local predicate that allows it to leave the tree is
(statei = passive) ∧ (deficiti = 0).
Finally, the environment process p1 concludes that the diffusing computation has
terminated when its local predicate deficit1 = 0 becomes satisfied.

Fig. 14.9 Termination detection of a diffusing computation

when pi starts waiting for a message do
(1) statei ← passive;
(2) let k = parenti ;
(3) if (deficiti = 0)
(4) then send ACK () to pk ; parenti ← ⊥
(5) end if.

when pi executes send(m) to pj do


% (statei = active) ∧ (parenti ≠ ⊥) %
(6) deficiti ← deficiti + 1.

when pi receives a message m from pj do


(7) statei ← active;
(8) if (parenti = ⊥) then parenti ← j
(9) else send ACK () to pj
(10) end if.

when ACK () is received from pj do


(11) deficiti ← deficiti − 1;
(12) let k = parenti ;
(13) if (deficiti = 0) ∧ (statei = passive)
(14) then send ACK () to pk ; parenti ← ⊥
(15) end if.

The Algorithm: Local Observation and Cooperation Between Observers The


algorithm is described in Fig. 14.9. In this algorithm the local observation of a pro-
cess and the cooperation among the local observers are intimately related.
When a process pi starts waiting for a message, it becomes passive and leaves the
tree if all its messages have been acknowledged, i.e., it is not the root of a subtree in
which some processes are possibly still active (lines 1–5). On the observation side,
the local deficit is increased each time pi sends a message (line 6). The process pi
is then necessarily active and belongs to the tree.
When a process pi receives a message m, it becomes active (line 7), and enters
the tree if it was not there (line 8). If pi was already in the tree, it sends by return a message ACK () to the sender of m (line 9), informing the sender that it was already in the tree and that its activity does not depend on it.
Finally, when a process receives a message ACK (), it leaves the tree (line 14)
if it is passive and is not the root of a subtree in which some processes are active
(line 13).
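A compact Python sketch of this local observer follows (a sketch, not the book's code); send_ack stands for an assumed primitive sending an ACK() message to a given process, and None encodes ⊥:

class DSObserver:
    """Local observer of the Dijkstra-Scholten algorithm (sketch).

    parent is None (the ⊥ value) when the process is not in the tree;
    the root (environment) process keeps itself as its own parent."""

    def __init__(self, is_root, send_ack):
        self.parent = 'self' if is_root else None
        self.deficit = 0                     # messages sent and not yet acknowledged
        self.passive = not is_root
        self.send_ack = send_ack             # assumed callback: send ACK() to a process

    def on_send(self):                       # an application message is sent
        self.deficit += 1

    def on_receive(self, sender):            # an application message is received
        self.passive = False
        if self.parent is None:
            self.parent = sender             # enter the tree
        else:
            self.send_ack(sender)            # already in the tree: acknowledge at once

    def on_wait(self):                       # the process starts waiting: it is passive
        self.passive = True
        self._try_leave()

    def on_ack(self):                        # an ACK() is received
        self.deficit -= 1
        self._try_leave()

    def _try_leave(self):                    # leave the tree when passive with deficit 0
        if self.passive and self.deficit == 0 and self.parent not in (None, 'self'):
            self.send_ack(self.parent)
            self.parent = None

    def root_terminated(self):               # termination test for the root process
        return self.parent == 'self' and self.passive and self.deficit == 0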

14.4 A General Termination Detection Algorithm

This section presents a reasoned construction of a general termination detection al-


gorithm. This algorithm is general in the sense that it assumes neither that processing
times have a zero duration, nor that the observed computation is a diffusing com-
putation (it allows any number of processes to be initially active). This algorithm is

also generic in the sense that it uses an abstraction called a wave, which can be im-
plemented in many different ways, each providing a specific instance of the general
algorithm. This reasoned construction is due to J.-M. Hélary and M. Raynal (1991).

14.4.1 Wave and Sequence of Waves

Definition A wave is a control flow that is launched by a single process, visits


each process once, and returns to the process that activated it. Hence, the notion of a
wave initiator is associated with a wave. As a wave starts from a process and returns
to that process, it can be used to disseminate information to each process and to return information from each process to the wave initiator.
As an example, each inquiry launched by the observer process in the four-counter
algorithm (Sect. 14.2.2) is a wave. More generally, any distributed algorithm imple-
menting a network traversal with feedback (such as the ones presented in Chap. 1)
implements a wave.
A wave provides the processes with four operations, denoted start_wave(),
end_wave(), forward_wave(), and return_wave(). Assuming the initiator is pα , let
swxα and ewxα denote the events corresponding to the execution of its xth invoca-
tion of start_wave() and end_wave(), respectively. Similarly, given any process pi ,
i ≠ α, let fwxi and rwxi denote the events corresponding to the execution of the xth
invocation of forward_wave() and return_wave() by pi , respectively.
The control flow associated with a sequence of waves initiated by pα is defined by the following properties (where "→ev" is the causal order relation on events defined in Sect. 6.1.2).
• Process visit. ∀x: ∀i ≠ α: swxα →ev fwxi →ev rwxi →ev ewxα . This property states that each process is visited by a wave before this wave returns to its initiator.
• Sequence of waves. ∀x: ewxα →ev swx+1α . This property states that an initiator cannot launch the next wave before the previous one has returned to it.
It is easy to see that the sequence of inquiries used in the four-counter algorithm
(Fig. 14.5) is a sequence of waves.

Implementing a Wave In the following, two implementations of a wave are pre-


sented. They are canonical in the sense that they are based on two canonical structures: a ring and a spanning tree. It is assumed that, in each wave, (a) the
values returned to the initiator by the processes are Boolean values, and (b) the
initiator is interested in the "and" of these values. (Adapting the implementation to other types of values, and to operations more sophisticated than the "and", can easily be done.)
These implementations consider that the role of pα is only to control the waves,
i.e., pα is an additional process which is not observed. Hence, an application process

Fig. 14.10 Ring-based implementation of a wave

operation start_wave() is % invoked by pα %
(1) send TOKEN (true) to nextα .

operation forward_wave() is % invoked by pi , i ≠ α %


(2) wait (TOKEN (r)); xi ← r.

operation return_wave(b) is % invoked by pi , i ≠ α %


(3) send TOKEN (xi ∧ b) to nexti .

operation end_wave(res) is % invoked by pα %


(4) wait (TOKEN (r)); res ← r.

is denoted pi , where i ∈ {1, . . . , n} and α ∉ {1, . . . , n}. Modifying the wave imple-
mentations so that pα is also an application process (i.e., a process that is observed)
is easy. (This is the topic of Problem 3.)
Let us observe that it is not required that all the waves of a sequence of waves
are implemented the same way. It is possible that some waves are implemented in
one way (e.g., ring-based implementation), while the other waves are implemented
differently (e.g., tree-based implementation).

Implementing a Wave on Top of a Ring Let us assume an underlying unidirec-


tional ring including each process exactly once (the construction of such a ring has
been addressed in Sect. 1.4.2). This communication structure allows for a very sim-
ple implementation of a wave. The successor on the ring of a process pi is denoted
nexti .
The control flow is represented by the progress of a message on the ring. This
message, denoted TOKEN (), carries a Boolean whose value is the “and” of the values
supplied by the processes it has already visited. The corresponding implementation
(which is trivial) is described in Fig. 14.10.
When it launches a new wave, the initiator sends the message TOKEN (true) to its
successor (line 1). Then, when a process pi is visited by the wave, it stores the current value r carried by the token in a local variable xi (line 2). To forward the wave,
it sends the message TOKEN (xi ∧ b) to its successor, where b is its local contribu-
tion to the wave, and xi is the contribution of the processes which have been already
visited by the wave (line 3). Finally, when the initiator invokes end_wave(res), it
waits for the token and deposits its value in a local variable res.
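For concreteness, here is a minimal Python sketch of this ring-based wave, simulated sequentially with queues standing for channels; the process and channel names are illustrative assumptions:

import queue

# chan[i] is the input channel of process i; the ring is 0 -> 1 -> ... -> n-1 -> 0
n = 4
chan = [queue.Queue() for _ in range(n)]

def start_wave(alpha=0):                 # invoked by the initiator p_alpha
    chan[(alpha + 1) % n].put(True)      # send TOKEN(true) to its successor

def forward_wave(i):                     # invoked by p_i, i != alpha
    return chan[i].get()                 # wait for TOKEN(r); x_i <- r

def return_wave(i, x_i, b):              # b: local contribution of p_i
    chan[(i + 1) % n].put(x_i and b)     # send TOKEN(x_i and b) to the successor

def end_wave(alpha=0):                   # invoked by p_alpha
    return chan[alpha].get()             # the "and" of all the contributions

# One wave in which every process contributes True except p_2:
start_wave()
for i in range(1, n):
    x = forward_wave(i)
    return_wave(i, x, i != 2)
print(end_wave())                        # prints False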

Implementing a Wave with a Spanning Tree Let us assume a spanning tree


rooted at the initiator process pα (several algorithms building spanning trees were
presented in Chap. 1). Let parenti and childreni denote the parent of pi and the set
of its children in this tree, respectively (if pi is a leaf, childreni = ∅).
If the spanning tree is a star centered at pα , the implementation reduces to the
use of REQUEST () and ANSWER () messages, similarly to what is done in Fig. 14.5.
The tree-based implementation of a wave is described in Fig. 14.11. When the
initiator pα invokes start_wave(), it sends a message GO () to each of its children

operation start_wave() is % invoked by pα %


(1) for each j ∈ childrenα do send GO () to pj end for.

operation forward_wave() is % invoked by pi , i ≠ α %


(2) wait (GO () from pparenti );
(3) for each j ∈ childreni do send GO () to pj end for.

operation return_wave(b) is % invoked by pi , i ≠ α %


(4) if (childreni ≠ ∅)
(5) then wait (BACK (vj ) has been received from each pj , j ∈ childreni );
(6) xi ← (∧j∈childreni vj )
(7) else xi ← true
(8) end if;
(9) send BACK (xi ∧ b) to pparenti .

operation end_wave(res) is % invoked by pα %
(10) wait (BACK (vj ) has been received from each pj , j ∈ childrenα );
(11) res ← (∧j∈childrenα vj ).

Fig. 14.11 Spanning tree-based implementation of a wave

(line 1) and, when a process pi is visited by the wave (i.e., receives a message
GO ()), it forwards the message GO () to each of its own children (lines 2–3).
Then, when a process pi invokes return_wave(b), where b is its contribution to the final result, it waits for the contributions of the processes belonging to the subtree of which it is the root (lines 4–8). When it has received all of these contributions, it sends to its parent the whole contribution of this subtree (line 9).
Finally, when pα (which is the root of the spanning tree) has received the contribu-
tion of all its children, it deposits the Boolean result in its local variable res.

14.4.2 A Reasoned Construction

A Reasoned Construction: Step 1 (Definition of a Locally Evaluable Unsta-


ble Predicate) Let us recall that (as seen in Sect. 14.1.2), a computation C is
terminated at time τ if the predicate TERM(C, τ ) is true, where TERM(C, τ ) ≡
(∀i: stateτi = passive) ∧ (∀i, j : emptyτ(i,j ) ).
While no process can instantaneously know the value of the predicate ∧j≠i empty(i,j ) , it is possible to replace it by a stronger predicate, by associating
an acknowledgment with each application message. (Differently from the acknowl-
edgments used in the specific algorithm designed for diffusing computations, such
acknowledgment messages are systematically sent by return when application mes-
sages are received.)
Let deficiti be a local control variable of pi which counts the number of applica-
tion messages sent by pi and for which pi has not yet received the corresponding
acknowledgment. Hence, pi can locally evaluate the following unstable predicate
idlei ≡ (statei = passive) ∧ (deficiti = 0).


Fig. 14.12 Why (∧1≤i≤n idlexi ) ⇒ TERM(C, ταx ) is not true


This predicate is safe in the sense that idlei ⇒ (statei = passive) ∧ (∧j≠i empty(i,j ) ).
It allows consequently each process pi to know the state of the channels from it to
the other processes. Let us nevertheless observe that this predicate is unstable. This
is because, if pi receives a message while idlei is satisfied, statei becomes equal to
active, and idlei becomes consequently false.

A Reasoned Construction: Step 2 (Strengthening the Predicate) Let τix be


the time (from an external observer point of view) at which a process pi , which is
visited by the xth wave, starts the invocation of return_wave(), and ταx be the time
at which pα terminates its xth invocation of end_wave(). Let idlexi be the value of
the predicate idlei at time τix .
Unfortunately, we do not have (∧1≤i≤n idlexi ) ⇒ TERM(C, ταx ). This is due to
the fact that the instants at which the predicates idlei are evaluated and the corre-
sponding process pi invokes return_wave() are independent from each other. This
is depicted in Fig. 14.12. Assuming the current wave is the xth, it is easy to see that,
despite the fact that both idlexi and idlexj are true, pα cannot conclude at time ταx that
the computation has terminated.

Continuously Passive Process This counterexample shows that termination de-


tection cannot boil down to a simple collection of values. The visits of the wave to
processes have to be coordinated, and the activity period of each process has to be
recorded in one way or another. To that end, let us strengthen the local predicate idlexi with the predicate ct_passxi defined as follows:
ct_passxi ≡ pi remained continuously passive between τix and ταx .
When satisfied, this predicate means that pi has not been reactivated between the time the xth wave left it and the time this wave returned to pα . The important point is that we have the
following:
(∧1≤i≤n (idlexi ∧ ct_passxi )) ⇒ TERM(C, ταx ).

This is a consequence of the following observation: idlexi ∧ ct_passxi implies that pi remained passive during the time interval [τix , ταx ], and consequently its outgoing channels remained empty in this interval. As this is true for all the processes,

we conclude that, at time ταx , all the processes are passive and all the channels are
empty.
Unfortunately, as a process pi does not know the time ταx , it cannot evaluate the
predicate ct_passxi . As previously, a simple way to solve this problem consists in
strengthening the predicate ct_passxi in order to obtain a predicate locally evaluable
by pi . As the waves are issued sequentially by pα , such a strengthening is easy. As
ταx < τix+1 , ταx (which is not known by pi ) is replaced by τix+1 (which is known by
pi ). Hence, let us replace ct_passxi by

sct_passi[x,x+1] ≡ pi remained continuously passive between τix and τix+1 .


We trivially have
(∧1≤i≤n (idlexi ∧ sct_passi[x,x+1] )) ⇒ TERM(C, ταx ).

A Reasoned Construction: Step 3 (Loop Invariant and Progress Condition)



The predicate ∧1≤i≤n (idlexi ∧ sct_passi[x,x+1] ) involves two consecutive waves,
and each process is involved in the evaluation of two local predicates, namely, idlexi
and sct_passi[x,x+1] . As a sequence of waves is nothing else than a distributed it-
eration controlled by the initiator process pα , and, due to the application, there is
no guarantee that a process remains continuously passive between two consecutive
waves, this directs us to consider

• ∧1≤i≤n idlexi as the loop invariant, and

• ∧1≤i≤n sct_passi[x,x+1] as the progress condition associated with the loop.

A Reasoned Construction: Step 4 (Structure the Algorithm) A local observer


is associated with each process pi . In addition to the local variables statei and
deficiti , it manages the Boolean variable cont_passivei . As far as initial values are
concerned, we have the following: statei is initialized to active or passive, according to whether pi is initially active or passive; deficiti is initialized to 0; cont_passivei is initialized to true if and only if pi is initially passive.
The algorithm is described in Fig. 14.13. The local observation of a process pi is
described at lines 1–6. The new point with respect to the algorithms that have been
previously presented is line 4. When pi receives a message, in addition to becoming active, its Boolean variable cont_passivei is set to false.

A Reasoned Construction: Step 5 (Guaranteeing the Loop Invariant) The


loop invariant and the progress condition associated with the sequence of waves
are implemented at lines 7–16, which describe the cooperation among the local
observers.
After it has been visited by a wave (invocation of forward_wave() at line 12), a
process waits until its local predicate idlei is satisfied (line 13). When this occurs it
invokes return_wave(b) (line 15) where b is the current value of cont_passivei .

Fig. 14.13 A general algorithm for termination detection

when pi starts waiting for a message do
(1) statei ← passive.

when pi executes send(m) to pj do
(2) deficiti ← deficiti + 1.

when pi receives a message from pj do


(3) statei ← active;
(4) cont_passivei ← false;
(5) send ACK () to obsj .

when a message ACK () is received do


(6) deficiti ← deficiti − 1.

local task at pi for the cooperation between observers is


(7) if (i = α)
(8) then repeat
(9) start_wave(); end_wave(res)
(10) until (res) end repeat;
(11) claim termination
(12) else forward_wave();
(13) wait ((statei = passive) ∧ (deficiti = 0));
(14) b ← cont_passivei ; cont_passivei ← true;
(15) return_wave(b)
(16) end if.

The important point is that, thanks to the wait statement, the predicate idlei is true at time τix . As this is true at every process, we have ∧1≤i≤n idlexi , i.e., the loop
invariant is satisfied. For idlei to be true at time τix , it is assumed that the appli-
cation process is frozen (i.e., does not execute) from the time idlei becomes true
(line 13) until τix , i.e., until return_wave(b) starts to be executed (line 15). This is
depicted in Fig. 14.14. This freezing is introduced to simplify the presentation; it is not mandatory (as shown in Problem 4).

A Reasoned Construction: Step 6 (Ensuring Progress) At time τix , pi resets cont_passivei to true (line 14), and will transmit its value b to the (x + 1)th wave when it invokes return_wave(b) during this wave. Thanks to the "and" on the values
collected by the (x + 1)th wave, the Boolean value res obtained by the initiator pα

Fig. 14.14 Atomicity associated with τix



at the end of the (x + 1)th wave is such that res = ∧1≤i≤n sct_passi[x,x+1] . If this Boolean is true, pα claims termination (lines 10–11).
As we then have ∧1≤i≤n idlexi (due to the loop invariant) and res = ∧1≤i≤n sct_passi[x,x+1] , it follows that TERM(C, ταx ) is true, and consequently the claim is correct. Hence, if the computation terminates, its termination is detected. Moreover, the fact that there is no false claim follows from the use of the Boolean variables cont_passivei , which implement the predicates sct_passi[x,x+1] .
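Under the same freezing assumption, the cooperation part of the algorithm (lines 7–16 of Fig. 14.13) can be sketched in Python; wave is an assumed object offering the four wave operations, and obs is assumed to expose a threading.Condition named cond protecting the fields passive, deficit, and cont_passive:

def initiator_loop(wave):
    # Behavior of p_alpha (lines 7-11): repeat waves until res is true.
    while True:
        wave.start_wave()
        res = wave.end_wave()        # "and" of the Booleans returned by all processes
        if res:                      # every process stayed continuously passive
            print("termination detected")
            return

def non_initiator_step(wave, obs):
    # Behavior of p_i, i != alpha (lines 12-15), for one wave.
    wave.forward_wave()
    with obs.cond:                   # obs.cond: an assumed threading.Condition
        # wait until idle_i = (state_i = passive) and (deficit_i = 0)
        obs.cond.wait_for(lambda: obs.passive and obs.deficit == 0)
        b = obs.cont_passive         # was p_i continuously passive since the last wave?
        obs.cont_passive = True
        wave.return_wave(b)          # the process stays frozen until the wave leaves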

14.5 Termination Detection in a Very General Distributed Model

14.5.1 Model and Nondeterministic Atomic Receive Statement

Motivation In the classical asynchronous model (considered previously) each


receive statement involves a single message, the sender of which is any process.
This basic receive statement is not versatile enough to account for specific cases where, for example, a process pi wants to atomically receive either two messages, one from a given process pj and the other from a given process pk , or a single message from a given process pℓ . This means that, when such a receive statement terminates, pi has consumed atomically either two messages (one from pj and one from pk ), or a single message (from pℓ ).

Message Arrival vs. Message Reception As suggested by the previous example,


the definition of powerful nondeterministic reception statements needs the introduc-
tion of the notion of an arrived and not yet consumed message. Hence, a channel
is now made up of two parts: a network component (that a message uses to attain
its destination) and a buffer component (where a message that has arrived is stored
until its destination process receives and consumes it).
There is consequently a notion of message arrival which is different from the no-
tion of a message reception (these notions are merged in the classical asynchronous
communication model). It follows that the underlying communication model is com-
posed of a communication network plus an input buffer for each pair of processes (pj , pi ), where the messages arrived from pj and not yet received by pi are saved.
This means that there is a part of each input channel, namely the buffer compo-
nent, which is now visible to the observer obsi associated with each process pi .
Differently, the content of the network component of the channels remains locally
invisible.
A simple example is depicted in Fig. 14.15. The messages m1 (sent by pj ) and
m4 (sent by pk ) have arrived, but have not yet been received. They are stored in
input buffers. The messages m2 , m3 , and m5 are still on their way to pi .

Dependency Set Providing versatile message reception requires us to define a


generalized nondeterministic reception statement. To that end, the notion of a de-
pendency set is first introduced.

Fig. 14.15 Structure of the channels to pi

When it enters a receive statement, a process pi specifies a set of process iden-


tities, denoted dep_seti , composed of the processes from which it starts waiting for
messages. Let us remark that this allows a message, which has been sent, to never
be consumed by its destination process (because the sender never appears in the de-
pendency set of its destination process). Several patterns of dependency sets can be
defined, each one defining a specific message reception model.

AND Model In this reception pattern, a receive statement has the following struc-
ture:

receive message from (pa and . . . and px ).

We have then dep_seti = {a, . . . , x}. When invoked by pi , this statement terminates
when a message from each process pj , such that j ∈ dep_seti , has arrived at pi .
The reception statement then withdraws these messages from their input buffers and
returns them to pi . Hence, this message reception statement allows for the atomic
reception of several messages from distinct senders.

OR Model The receive statement of the OR pattern has the following structure:

receive message from (pa or . . . or px ).

As previously, the dependency set is dep_seti = {a, . . . , x}, but its meaning is dif-
ferent. This receive statement terminates when a message from a process pj , such
that j ∈ dep_seti , has arrived at pi . If messages from several processes in dep_seti
have arrived, a single one is withdrawn from its input buffer and received and con-
sumed by pi . Hence, the “or” is an exclusive “or”. This is consequently a simple
nondeterministic receive statement.

OR/AND Model This reception pattern is a combination of the two previous


ones. The receive statement has the following form.
receive message from dp1 or . . . or dpx ,
where each dpy , 1 ≤ y ≤ x, is a set of processes, and (as before) the "or" is an exclusive "or". Moreover, dep_seti = ∪1≤y≤x dpy .
This statement terminates as soon as there is a set dpy such that a message from
each process pj , with j ∈ dpy , has arrived at pi . This receive statement consumes
only the messages sent by these processes. It is a straightforward generalization of
the AND and OR reception models.

Basic k-out-of-m Model The receive statement of the k-out-of-m pattern has the
following structure:
receive ki messages from dep_seti ,
where dep_seti is a set of process identities. It is assumed that 1 ≤ ki ≤ |dep_seti |. This statement terminates when messages from ki distinct processes belonging to dep_seti have arrived at pi .

Disjunctive k-out-of-m Model This receive statement generalizes the previous


one.
receive (ki1 messages from dp1 ) or
(ki2 messages from dp2 ) or . . . or
(kix messages from dpx ).
We have dep_seti = ∪1≤y≤x dpy . This statement terminates as soon as there is a set dpy such that kiy messages have arrived from distinct processes belonging to dpy .
The disjunctive k-out-of-m model is fairly general. When x = 1, we obtain the
basic k-out-of-m model. When x = 1 and ki1 = |dp1 | = |dep_seti |, we obtain the
AND model. When x ≥ 1 and |dpy | = 1, 1 ≤ y ≤ x, we obtain the OR model.

Unspecified Message Reception It is possible that a message that has arrived at


a process might never be consumed. This depends on the message pattern specified
by processes when they invoke a reception statement. In some cases, this may
be due to an erroneous distributed program, but termination detection also has
to work for the executions generated by such programs.
Interestingly, the algorithms developed in this section not only detect the termi-
nation of such executions, but allow also for the reporting of messages arrived and
not received.

14.5.2 The Predicate fulfilled()

A predicate, denoted fulfilled(), is introduced to abstract the activation condition of


a process that has invoked a receive statement. This predicate is used to know if the
corresponding process can be re-activated.

Let A be a set of process identities such that a message has arrived from each
of these processes and these messages have not yet been consumed (hence, they are
still in their input buffers). fulfilled(A) is satisfied if and only if these messages are
sufficient to reactivate the invoking process pi .
By definition fulfilled(∅) is equal to false. It is moreover assumed that the predi-
cate fulfilled() is monotonic, i.e.,
    
(A ⊆ A′) ⇒ (fulfilled(A) ⇒ fulfilled(A′)).

(Monotonicity states only that, if a process can be reactivated with the messages
from a set of processes A, it can also be reactivated with the messages from a bigger
set of processes A′, i.e., such that A ⊆ A′.)
The result of an invocation of fulfilled(A) depends on the type of receive state-
ment issued by the corresponding process pi . As a few examples, we have the fol-
lowing:
• In the AND model: fulfilled(A) ≡ (dep_seti ⊆ A).
• In the OR model: fulfilled(A) ≡ (A ∩ dep_seti = ∅).
• In the OR/AND model: fulfilled(A) ≡ (∃dpy : dpy ⊆ A).
• In the k-out-of-m model: fulfilled(A) ≡ (|A ∩ dep_seti | ≥ ki ).
• In the disjunctive k-out-of-m model: fulfilled(A) ≡ (∃y: |A ∩ dpy | ≥ k_i^y).
A process pi stops executing because it is blocked on a receive statement, or be-
cause it has attained its “end” statement. This case is modeled as a receive statement
in which dep_seti = ∅. As processes can invoke distinct receive statements at dif-
ferent times, the indexed predicate fulfilledi () is used to denote the corresponding
predicate as far as pi is concerned.
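As an illustration, the reactivation predicates listed above can be encoded as follows in Python (a minimal sketch: the function names and the use of Python sets to represent dependency sets are presentation choices, not part of the model):

    # A is the set of identities of processes whose messages have arrived
    # at p_i and have not yet been consumed; fulfilled(empty set) = false
    # is ensured in each case (k is assumed to satisfy k >= 1).

    def fulfilled_and(A, dep_set):
        # AND model: a message from every process in dep_set has arrived
        return len(dep_set) > 0 and dep_set <= A

    def fulfilled_or(A, dep_set):
        # OR model: a message from at least one process in dep_set has arrived
        return len(A & dep_set) > 0

    def fulfilled_or_and(A, dps):
        # OR/AND model: dps is the list of sets dp_1, ..., dp_x
        return any(len(dp) > 0 and dp <= A for dp in dps)

    def fulfilled_k_out_of_m(A, dep_set, k):
        # basic k-out-of-m model: messages from k distinct processes
        return len(A & dep_set) >= k

    def fulfilled_disjunctive(A, dps_ks):
        # disjunctive k-out-of-m model: dps_ks is a list of pairs (dp_y, k_y)
        return any(len(A & dp) >= k for dp, k in dps_ks)

Each of these functions is monotonic in A, as required: enlarging A can only change the result from false to true.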

14.5.3 Static vs. Dynamic Termination: Definition

Channel Predicates and Process Sets Let us define the following predicates
and abstract variables that will be used to define two notions of termination of a
computation in the very general distributed model.
• empty(j, i) is a Boolean predicate (already introduced) which is true if and only
if the network component of the channel from pj to pi is empty.
• arrivedi (j ) is a Boolean predicate which is true if and only if the buffer compo-
nent of the channel from pj to pi is not empty.
• arr_fromi = {j | arrivedi (j )} (the set of processes from which messages have
arrived at pi but have not yet been received—i.e., consumed—by pi ).
• NEi = {j | ¬empty(j, i)} (the set of processes that sent messages to pi , and these
messages have not yet arrived at pi ).
When looking at Fig. 14.15, we have arr_fromi = {j, k}, and NEi = {j, ℓ}.

Static Termination A distributed computation C is statically terminated at some


time τ if the following predicate, denoted S_TERM(C, τ ), is satisfied. The variables
statei have the same meaning as before (and xx^τ denotes the value of xx at time τ ).
S_TERM(C, τ ) ≡ ∀i: (state_i^τ = passive) ∧ (NE_i^τ = ∅) ∧ ¬fulfilled_i(arr_from_i^τ).
This predicate captures the fact that, at time τ , (a) all the processes are passive,
(b) no application message is in transit in the network component, and (c) for each
process pi and according to its receive statement, the messages arrived and not
yet consumed (if any) cannot reactivate it. Similarly to the definition of TERM(C)
introduced in Sect. 14.1.2, S_TERM(C) is defined as follows:
 
S_TERM(C) ≡ (∃τ : S_TERM(C, τ )).
This definition of termination is focused on the states of processes and channels. It
is easy to show that S_TERM(C) is a stable property.

Dynamic Termination A distributed computation C is dynamically terminated


at some time τ if the following predicate, denoted D_TERM(C, τ ), is satisfied:
D_TERM(C, τ ) ≡ ∀i: (state_i^τ = passive) ∧ ¬fulfilled_i(NE_i^τ ∪ arr_from_i^τ).
This predicate captures the fact that, at time τ , (a) all the processes are passive,
and (b) no process pi can be reactivated when considering both the messages arrived
and not yet consumed and the messages in transit to it. The important difference
with the predicate S_TERM(C, τ ) lies in the fact that it allows for early detection,
namely, termination can occur and be detected even if some messages have not yet
arrived at their destination processes. This definition is based on the real (statei =
passive) or potential (fulfilledi (NEi ∪ arr_fromi )) activity of processes and not on
the fact that channels are or are not empty. As previously, let us define D_TERM(C)
as follows:
 
D_TERM(C) ≡ (∃τ : D_TERM(C, τ )).
As all messages eventually arrive at their destination, it is easy to see that
  
D_TERM(C, τ ) ⇒ (∃τ′ ≥ τ : S_TERM(C, τ′)).
In that sense, dynamic termination allows for “early” detection, while static termi-
nation allows only for “late” detection.
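The gap between the two definitions can be checked on a toy global state. Here is a minimal sketch, assuming dictionary encodings of statei , NEi , and arr_fromi , of a state that is dynamically but not yet statically terminated:

    # S_TERM and D_TERM evaluated on a snapshot of a computation.

    def s_term(states, NE, arr_from, fulfilled):
        return all(states[i] == 'passive' and not NE[i]
                   and not fulfilled[i](arr_from[i]) for i in states)

    def d_term(states, NE, arr_from, fulfilled):
        return all(states[i] == 'passive'
                   and not fulfilled[i](NE[i] | arr_from[i]) for i in states)

    def and_fulfilled(dep_set):  # AND model reactivation predicate
        return lambda A: len(dep_set) > 0 and dep_set <= A

    # p1 waits (AND model) for a message from p2 only; a message from p3
    # is still in transit to p1: D_TERM already holds, S_TERM does not.
    states = {1: 'passive', 2: 'passive', 3: 'passive'}
    NE = {1: {3}, 2: set(), 3: set()}
    arr_from = {1: set(), 2: set(), 3: set()}
    fulfilled = {1: and_fulfilled({2}),
                 2: and_fulfilled(set()), 3: and_fulfilled(set())}

    assert d_term(states, NE, arr_from, fulfilled)
    assert not s_term(states, NE, arr_from, fulfilled)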

Static/Dynamic Termination vs. Classical Termination In the classical model


for termination detection, each receive statement is for a single message from any
process (hence there is no unspecified reception), and message arrival and message
reception are merged (hence there is no notion of input buffer). Moreover, we have
fulfilled(A) ≡ (A = ∅). It follows that TERM(C, τ ) can be rewritten as follows:
    
TERM(C, τ ) ≡ ∀i: (state_i^τ = passive) ∧ (NE_i^τ = ∅).
The static and dynamic algorithms that follow are due to J. Brzezinski, J.-M. Hélary,
and M. Raynal (1993).

14.5.4 Detection of Static Termination

The structure of the algorithm detecting the static termination of a computation C


(predicate S_TERM(C)) is the same as that of the algorithm in Fig. 14.13. A differ-
ence lies in the fact that the local observer obsi of process pi has now to consider
the arrival of application messages instead of their reception.

Local Variables and Locally Evaluable Predicate As the set NEi appears in
S_TERM(C, τ ), a value of it must consequently be computed or approximated. As a
local observer obsi cannot compute the state of its input channels without freezing
(momentarily blocking) sender processes, it instead computes the state of its output
channels. More precisely, as done in previous algorithms, a local observer can learn
the state of the network component of its output channels by associating an acknowl-
edgment with each application message. When a message from a process pi arrives
at a process pj , the local observer obsj sends by return an acknowledgment to the
local observer obsi associated with pi . As previously, let deficiti be the number of
messages sent by pi for which obsi has not yet received an acknowledgment. When
satisfied, the predicate deficiti = 0 allows the local observer obsi to know that
all the messages sent by pi have arrived at their destination processes. As we will see in
the proof of the algorithm, replacing NEi by deficiti = 0 allows for a safe detection
of static termination.
To know the state of pi ’s input buffers, the local observer obsi can use the set
arr_fromi (which contains the identities of the processes pj such that there is a
message from pj in the input buffer of pi ). The content of such a set arr_fromi
can be locally computed by the local observer obsi . This is because this observer
(a) sends back an acknowledgment each time a message arrives, and (b) observes
the reception of messages by pi (i.e., when messages are withdrawn from their input
buffers to be consumed by pi ).
Finally, the predicate (which is not locally evaluable)
 
(statei = passive) ∧ (NEi = ∅) ∧ ¬fulfilledi (arr_fromi )
is replaced by the locally evaluable predicate
 
(statei = passive) ∧ (deficiti = 0) ∧ ¬fulfilledi (arr_fromi ) .
In addition to the previous local variables, the algorithm uses also the Boolean
variable cont_passivei , the role of which is exactly the same as in Fig. 14.13.

Static Termination Detection Algorithm The algorithm (which is close to the


algorithm of Fig. 14.13) is described in Fig. 14.16. Lines 1–7 describe the obser-
vation part of the observer obsi (at process pi ). When a process enters a receive
statement, the dependency set and the reactivation condition (fulfilledi ()) associated
with this receive statement are defined. When, according to messages arrived and
not yet consumed, the predicate fulfilledi (arr_fromi ) becomes true (lines 5–6), the
process pi becomes active and (as in Fig. 14.13) the Boolean cont_passivei is set to
false. As already indicated, the role of cont_passivei is the same as in Fig. 14.13.

when pi enters a receive statement do


(1) compute dep_seti and fulfilledi () from the receive statement;
(2) statei ← passive.

when pi executes send(m) to pj do


(3) deficiti ← deficiti + 1.

when a message from pj arrives do


(4) send ACK () to obsj .

when fulfilledi (arr_fromi ) becomes satisfied do


(5) statei ← active;
(6) cont_passivei ← false.

when a message ACK () is received do


(7) deficiti ← deficiti − 1.

local task at pi for the cooperation between observers is


(8) if (i = α)
(9) then repeat
(10) for each j ∈ {1, . . . , n} do send REQUEST () to pj end for;
(11) wait (ANSWER (bj ) received from each obsj , j ∈ {1, . . . , n});
(12) res ← ∧_{1≤j≤n} bj
(13) until (res) end repeat;
(14) claim termination
(15) else wait (message REQUEST () from obsα );
(16) wait ((statei = passive) ∧ (deficiti = 0) ∧ (¬fulfilledi (arr_fromi )));
(17) bi ← cont_passivei ; cont_passivei ← true;
(18) send ANSWER (bi ) to obsα
(19) end if.

Fig. 14.16 An algorithm for static termination detection

The cooperation among the local observers is described at lines 8–19. To simplify
the presentation, we consider that each wave is implemented by a star which is
(a) centered at a process obsα (which does not participate in the computation), and
(b) implemented by messages REQUEST () and ANSWER () (as done in Fig. 14.5).
When obsα launches a new wave it sends a message REQUEST () to each process
observer obsi (line 10). Then, it waits for a message ANSWER (bi ) from each of them
(line 11), and claims termination if the conjunction b1 ∧ · · · ∧ bn is true (lines 12–
14).
When a local observer obsi is visited by a new wave (line 15), it waits until its
local predicate (statei = passive) ∧ (deficiti = 0) ∧ (¬fulfilledi (arr_fromi )) becomes
true. When this occurs, it sends the current value of cont_passivei to obsα and resets
this variable to the value true (lines 17–18).
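The following single-threaded Python sketch mirrors this wave logic; the Observer class, the choice of the AND model as reactivation predicate, and the assumption that the observed computation is already quiescent when visit() is called are simplifications (real observers exchange REQUEST () and ANSWER () messages asynchronously):

    from dataclasses import dataclass, field

    @dataclass
    class Observer:
        state: str = 'passive'            # state_i
        deficit: int = 0                  # messages sent, not yet acknowledged
        arr_from: set = field(default_factory=set)
        dep_set: set = field(default_factory=set)
        cont_passive: bool = True

        def fulfilled(self, A):
            # the AND model is taken as an example reactivation predicate
            return len(self.dep_set) > 0 and self.dep_set <= A

        def visit(self):
            # lines 15-18: this sketch assumes the predicate of line 16 holds
            assert (self.state == 'passive' and self.deficit == 0
                    and not self.fulfilled(self.arr_from))
            b, self.cont_passive = self.cont_passive, True
            return b

    def detect_static_termination(observers):
        # coordinator obs_alpha, lines 8-14: one wave per loop iteration
        while not all([obs.visit() for obs in observers]):
            pass  # some p_i was active since the previous wave: new wave

    detect_static_termination([Observer(), Observer(), Observer()])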

Theorem 23 Assuming generalized receive statements, the algorithm described in


Fig. 14.16 satisfies the safety and liveness properties defining the static termination
detection problem.

Fig. 14.17 Definition of time instants for the safety of static termination

Proof Proof of the liveness property. We have to show that, if the computation C is
statically terminated, the algorithm eventually claims it.
If the computation is statically terminated at time τ , we have S_TERM(C, τ )
(definition). As static termination is a stable property, it follows that all the pro-
cesses pi are continuously passive, their re-activation conditions are not satisfied
(i.e., ¬fulfilledi (arr_fromi ) at each pi ), and the network component of each chan-
nel is empty. Moreover, it follows from the acknowledgment mechanism that there
is a time τ′ ≥ τ such that, from τ′, the predicate deficiti = 0 is always satisfied at
each process pi .
Let the wave launched by obsα after τ′ be the xth wave. It follows from the
predicate at line 16 that neither the xth wave, nor any future wave, will be delayed
at this line. Moreover, while the value bi returned by obsi to the xth wave can be
true or false, the value bi returned to the (x + 1)th wave is necessarily the value true
(this is because the xth wave set the local variables cont_passivei to true, and none
of them has been modified thereafter).
It follows that (if not yet satisfied for the xth wave) the (x + 1)th wave will be
such that bi = true for each i. The observer obsα will consequently claim termination
as soon as it has received all the answers for the (x + 1)th wave.
Proof of the safety property. We have to show that, if the algorithm claims termi-
nation at time τ′, there is a time τ ≤ τ′ such that S_TERM(C, τ ).
To that end, let τ be a time such that, for each i, we have τ_i^x < τ_α^x < τ < τ_i^{x+1} <
τ_α^{x+1}, where τ_i^x is the time at which obsi sends its answer and locally terminates the
visit of the xth wave, and τ_α^x is the time at which obsα terminates its wait statement
of the xth wave (line 11). The time instants τ_i^{x+1} and τ_α^{x+1} are defined similarly for
the (x + 1)th wave. These time instants are depicted in Fig. 14.17.
The proof consists in showing that, if the algorithm claims termination at the end
of the (x + 1)th wave (i.e., just after time τ_α^{x+1}), then the computation was statically
terminated at time τ (i.e., S_TERM(C, τ ) is true). The proof is decomposed into
three parts.
• Proof that, for each i, state_i^τ = passive.
It follows from the management of the local variables cont_passivei (lines 6
and 17) that, if obsα claims termination just after time τ_α^{x+1}, each process pi has
been continuously passive between τ_i^x and τ_i^{x+1}. As τ_i^x < τ < τ_i^{x+1}, we conclude
that we have state_i^τ = passive.
• Proof that, for each i, NE_i^τ = ∅.
Due to the algorithm, each wave is delayed at an observer obsi (at line 16,
before the send of an answer message) until all messages sent by pi have been
acknowledged (predicate deficiti = 0). On the other hand, no process pi sent messages
between τ_i^x and τ_i^{x+1} (because it was continuously passive during the time
interval [τ_i^x, τ_i^{x+1}], see the previous item). It follows from these two observations
that, at time τ , there is no message sent by pi and not yet arrived at its destination.
As this is true for all the processes, it follows that we have NE_i^τ = ∅.
• Proof that, for each i, ¬fulfilled_i(arr_from_i^τ).
As previously, a wave is delayed at an observer obsi until ¬fulfilledi (arr_fromi ).
Hence, as the algorithm claims termination just after τ_α^{x+1}, it follows
that, for any i, we have ¬fulfilled_i(arr_from_i^{x+1}), where arr_from_i^{x+1} denotes
the value of arr_fromi at time τ_i^{x+1} (line 16).
As any process pi has been continuously passive in the time interval
[τ_i^x, τ_i^{x+1}], it consumed no message during that period. Consequently the set
arr_fromi can only increase during this period, i.e., we have arr_from_i^τ ⊆
arr_from_i^{x+1}. It follows from the monotonicity of the predicate fulfilledi () that
¬fulfilled_i(arr_from_i^{x+1}) ⇒ ¬fulfilled_i(arr_from_i^τ). Hence, for any i we have
¬fulfilled_i(arr_from_i^τ), which concludes the proof of the theorem.


Cost of the Algorithm Each application message entails the sending of an ac-
knowledgment. A wave requires two types of messages, n request messages which
carry no value, and n answers which carry a Boolean. Moreover, after the compu-
tation has terminated, one or two waves are necessary to detect termination. Hence,
4n control messages are necessary in the worst case. If, instead of a star, a ring is
used to implement waves, this number reduces to 2n.

14.5.5 Detection of Dynamic Termination

Local Data Structures To detect dynamic termination of a distributed computa-


tion C, i.e., the existence of a time τ such that the predicate
     
D_TERM(C, τ ) ≡ ∀i: (state_i^τ = passive) ∧ ¬fulfilled_i(NE_i^τ ∪ arr_from_i^τ)
is satisfied, for each process pi , some knowledge on which messages are currently
in transit to pi is required. To that end, each observer obsi (in addition to statei and
arr_fromi ) and the observer obsα are endowed with the following local variables.
• senti [1..n] is an array, managed by obsi , such that senti [j ] contains the number
of messages which, since the beginning of the execution, have been sent by pi
to pj .

Fig. 14.18 Cooperation between local observers

• arri [1..n] is a second array, managed by obsi , such that arri [j ] contains the number
of messages which, since the beginning of the execution, have been sent by
pj and have arrived at pi .
• ap_sentα [1..n, 1..n] is a matrix, managed by obsα such that ap_sentα [i, j ] con-
tains the number of messages that, from obsα ’s point of view, have been sent by pi
to pj . As obsα can learn new information only when it receives control messages
from the local observers obsi , ap_sentα [i, j ] represents an approximate knowl-
edge (hence the identifier prefix ap). The way this array is used is described in
Fig. 14.18.

Dynamic Termination Detection Algorithm The algorithm is described in


Fig. 14.19. The local observation part of obsi (lines 1–6) is close to that of the
static termination detection algorithm. The difference is the fact that the acknowl-
edgment messages are replaced by the counting of messages sent and arrived, on a
per-process basis.
The cooperation among obsα and the local observers obsi is different from that
used in static termination detection (Fig. 14.18).
• First, the waves are used to provide obsα with the most recent values of the num-
ber of messages which have been sent by each process pi .
More precisely, when a local observer answers a wave (line 19), it sends a
control message including the current value of senti [1..n], so that, when obsα
receives this message, it can update accordingly the row i of its local matrix
ap_sentα to a more up-to-date value (lines 10–11). In that way, obsα is able to
learn more recent information on the communication generated by the observed
computation.
• The waves are also used by obsα to provide each local observer obsi with more
recent information on the number of messages which have been sent to process
pi (line 9 on obsα ’s side, and lines 15–16 on the side of each observer obsi ).

when pi enters a receive statement do


(1) compute dep_seti and fulfilledi () from the receive statement;
(2) statei ← passive.

when pi executes send(m) to pj do


(3) senti [j ] ← senti [j ] + 1.

when a message from pj arrives do


(4) arri [j ] ← arri [j ] + 1.

when fulfilledi (arr_fromi ) becomes satisfied do


(5) statei ← active;
(6) cont_passivei ← false.

local task at pi for the cooperation between observers is


(7) if (i = α)
(8) then repeat
(9) for each j ∈ {1, . . . , n} do send REQUEST (ap_sentα [1..n, j ]) to pj end for;
(10) wait (ANSWER (bj , stj [1..n]) received from each obsj , j ∈ {1, . . . , n});
(11) for each j ∈ {1, . . . , n} do ap_sentα [j, 1..n] ← stj [1..n] end for;
(12) res ← ∧_{1≤j≤n} bj
(13) until (res) end repeat;
(14) claim termination
(15) else wait (message REQUEST (ap_st[1..n]) from obsα );
(16) ap_nei ← { j | ap_st[j ] > arri [j ] };
(17) bi ← cont_passivei ∧ (¬fulfilledi (arr_fromi ∪ ap_nei ));
(18) cont_passivei ← (statei = passive);
(19) send ANSWER (bi , senti [1..n]) to obsα
(20) end if.

Fig. 14.19 An algorithm for dynamic termination detection

• As a local observer obsi does not know the value of NEi , it cannot compute
the value of the predicate ¬fulfilledi (NEi ∪ arr_fromi ) (which characterizes dy-
namic termination). To cope with this issue, obsi computes the set ap_nei = {j |
ap_st[j ] > arri [j ]} (line 16), which is an approximation of NEi (maybe more
messages have been sent to pi than the ones whose sending is recorded in the
array ap_st[1..n] sent by obsα ).
The observer obsi can now compute the value of the local predicate ¬fulfilledi
(ap_nei ∪ arr_fromi ) (line 17). Then, obsi sends to obsα (line 19) a Boolean
bi which is true (line 17) if and only if (a) pi has been continuously passive
since the previous wave (as in static detection termination), and (b) the predicate
fulfilledi (ap_nei ∪ arr_fromi ) is false (i.e., pi cannot be reactivated with the mes-
sages that have arrived and are not yet consumed, and a subset of the messages
which are potentially in transit to it).
Moreover, after bi has been computed, obsi resets its local variable cont_
passivei to the Boolean value (statei = passive). This is due to the fact that, dif-
ferently from the previous termination detection algorithms, the waves are not

delayed by the local observers in dynamic termination detection. This is required


to ensure “as early as possible” termination detection.
Finally, when there is a wave such that the Boolean value
∧_{1≤i≤n} ¬fulfilledi (arr_fromi ∪ ap_nei )
is true, obsα claims dynamic termination. If the value is false, it starts a new wave.
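The answer of a local observer (lines 15–19 of Fig. 14.19) can be sketched as follows; the DynObserver container and its fields are illustrative stand-ins for obsi ’s local context:

    class DynObserver:
        def __init__(self, n, fulfilled):
            self.state = 'passive'
            self.cont_passive = True
            self.arr = [0] * n           # arr_i[1..n]
            self.sent = [0] * n          # sent_i[1..n]
            self.arr_from = set()
            self.fulfilled = fulfilled   # reactivation predicate of p_i

    def answer_wave(obs, ap_st):
        # ap_ne_i: processes that, according to obs_alpha's (possibly stale)
        # knowledge, still have messages in transit to p_i; always a subset of NE_i
        ap_ne = {j for j in range(len(ap_st)) if ap_st[j] > obs.arr[j]}
        b = obs.cont_passive and not obs.fulfilled(obs.arr_from | ap_ne)
        obs.cont_passive = (obs.state == 'passive')   # the wave is not delayed
        return b, list(obs.sent)                      # ANSWER (b_i, sent_i[1..n])

    obs = DynObserver(3, lambda A: False)   # p_i cannot be reactivated
    b, st = answer_wave(obs, [0, 0, 0])
    assert b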
The proof of this algorithm is close to that of static termination. It uses both the
monotonicity of the predicate fulfilledi () and the monotonicity of the counters (see
Problem 6).

Cost of the Algorithm After dynamic termination has occurred, two waves are
necessary to detect it, in the worst case. No acknowledgment is used, but a wave
requires 2n messages, each carrying an array of unbounded integers (freezing the
observed computation would allow us to reset the counters). The algorithm does not
require the channels to be FIFO (neither for application, nor for control messages).
But as waves are sequential, the channels from obsα to each obsi , and the chan-
nels from every obsi to obsα , behave as FIFO channels for the control messages
REQUEST () and control messages ANSWER (), respectively. It follows that, instead
of transmitting the value of a counter, it is possible to transmit only the difference
between its current value and the sum of the differences already sent.
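This counter-compression trick can be sketched as follows (an illustrative packaging; the only assumption is the FIFO behavior of the control channels noted above):

    # Each observer sends only the increment of its counters since its
    # previous ANSWER; obs_alpha re-accumulates the successive increments.

    class DeltaEncoder:
        def __init__(self, n):
            self.already_sent = [0] * n
        def encode(self, counters):
            delta = [c - a for c, a in zip(counters, self.already_sent)]
            self.already_sent = list(counters)
            return delta

    class DeltaDecoder:
        def __init__(self, n):
            self.total = [0] * n
        def decode(self, delta):
            self.total = [t + d for t, d in zip(self.total, delta)]
            return list(self.total)

    enc, dec = DeltaEncoder(2), DeltaDecoder(2)
    assert dec.decode(enc.encode([3, 1])) == [3, 1]  # first ANSWER: full values
    assert dec.decode(enc.encode([5, 1])) == [5, 1]  # second ANSWER carries [2, 0]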

14.6 Summary
A distributed computation has terminated when all processes are passive and all
channels are empty. This defines a stable property (once terminated, a computa-
tion remains terminated forever) which has to be detected, in order that the system
can reallocate local and global resources used by the processes (e.g., local memory
space).
This chapter presented several distributed algorithms which detect the termina-
tion of a distributed computation. Algorithms suited to specific models (such as the
asynchronous atomic model), and algorithms suited to specific types of computa-
tions (such as diffusing computations) were first presented. Then the chapter has
considered more general algorithms. In this context it presented a reasoned con-
struction of a very general termination detection algorithm. It also introduced a very
general distributed model which allows for very versatile receive statements, and
presented two termination detection algorithms suited to this distributed computa-
tion model.

14.7 Bibliographic Notes


• The distributed termination detection problem was simultaneously introduced by
E.W. Dijkstra and C.S. Scholten [115], and N. Francez [130].

• The asynchronous atomic model was introduced by F. Mattern in [249], who pro-
posed the four-counter algorithm (Fig. 14.5), and the counting vector algorithm
(Fig. 14.7) to detect termination in this distributed computing model.
• The notion of diffusing computations was introduced by E.W. Dijkstra and C.S.
Scholten [115], who proposed the algorithm described in Fig. 14.9 to detect the
termination of such computations.
• The reasoned construction presented in Sect. 14.4, and the associated general
termination detection algorithm presented in Fig. 14.13 are due to J.-M. Hélary
and M. Raynal [178, 180].
• The wave concept is investigated in [82, 180, 319, 365].
• The notion of freezing in termination detection algorithms is studied in [132].
• The problem of detecting the termination of a distributed computation in a very
general asynchronous model, and the associated detection algorithms, are due to
J. Brzezinski, J.-M. Hélary, and M. Raynal [64].
• The termination detection problem has given rise to an abundant literature and
many algorithms. Some are designed for the synchronous communication model
(e.g., [114, 265]); some are based on snapshots [190]; some are based on roughly
synchronized clocks [257]; some are based on prime numbers [309] (where the
uniqueness of the prime number factorization of an integer is used to ensure con-
sistent observations of a global state); some are obtained from a methodological
construction (e.g., [79, 342]); and some are based on the notion of credit dis-
tribution [251]. Other investigations of the termination detection problem and
algorithms can be found in [74, 169, 191, 221, 252, 261, 302, 330] to cite a few.
• Termination detection in asynchronous systems where processes may crash is
addressed in [168].
• Relations (in both directions) between termination detection and garbage collec-
tion are investigated in [366, 367].

14.8 Exercises and Problems


1. Let us consider the counting vector algorithm described in Fig. 14.7.
• Why does the termination predicate of line 10 need to include the fact that
each process has been visited at least once?
• When the local observer at process pi receives the token and is such that
msg_count[i] > 0, it knows that messages sent to pi have not yet arrived.
Modify the algorithm so that the progression of the token is blocked at pi
until msg_count[i] = 0.
2. Let us assume that the underlying system satisfies the following additional syn-
chrony property: the transmission delay of each message is upper bounded by a
constant δ. Moreover, each process has a local clock that allows it to measure du-
rations. These clocks are not synchronized but measure correctly the duration δ.
Modify the termination detection algorithm suited to diffusing computations
described in Fig. 14.9 so that it benefits from this synchrony assumption.

Fig. 14.20 Example of a monotonous distributed computation

Hint. When a process enters the tree, it sends a message ACK (in) to its parent and
sends no message if it is already in the tree. It will send to its parent a message
ACK (out) when it leaves the tree.
Solution in [308].
3. Modify the algorithms implementing a wave and the generic termination detec-
tion algorithm described in Figs. 14.10, 14.11, and 14.13, respectively, so that
the initiator pα is also an application process (which has consequently to be ob-
served).
4. The termination detection algorithm presented in Fig. 14.13 assumes that the
application process pi is frozen (Fig. 14.14) from the time the predicate idlei
becomes true (line 13) until obsi starts the invocation of return_wave(b) (line 15).
Considering the xth wave, let us define τ_i^x as the time at which idlei becomes
true at line 13, and let us replace line 14 by
b ← cont_passivei ; cont_passivei ← (statei = passive).
Do these modifications allow for the suppression of the freezing of the appli-
cation process pi ?
5. Let us consider the family of distributed applications in which each process has
a local variable xi whose value domain is a (finite or infinite) ordered set, and
each process follows the following rules:
• Each message m sent by pi carries a value cm ≥ xi .
• When it modifies the value of xi , a process pi can only increase it.
• When it receives a message m, which carries the value cm , pi can modify xi
to a value equal to or greater than min(xi , cm ).
These computations are called monotonous computations. An example of a
computation following these rules is given in Fig. 14.20, where the value domain
is the set of positive integers. The value cm associated with a message m is in-
dicated with the corresponding arrow, and a value v on the axis of a process pi
indicates that its local variable xi is updated to that value, e.g., x3 = 1 at time τ1 ,
and x1 = 5 at time τ2 .

At some time τ , the values of the local variables xi and the values c carried
by the messages define the current global state of the application. Let F (τ ) be
the smallest of these values. Two examples are given in Fig. 14.20.
Distributed simulation programs are an example of programs where the pro-
cesses follow the previous rules (introductory surveys on distributed simulation
can be found in [140, 263]). The variables xi are the local simulation times at
each process, and the function F (τ ) defines the global simulation time.
• Show that (τ1 ≤ τ2 ) ⇒ [F (τ1 ) ≤ F (τ2 )].
• An observation algorithm for such a computation is an algorithm that com-
putes approximations of the value of F (τ ) such that
– Safety. If at time τ the algorithm returns the value z, then z ≤ F (τ ).
– Liveness. For any τ , if observations are launched after τ , there is a time after
which all the observations return values greater than or equal to z = F (τ ).
When considering distributed simulation programs, the safety property states a
correct lower bound on the global simulation time, while the liveness property
states that if the global simulation progresses then this progress can eventually
be observed.
Adapt the algorithm of Fig. 14.13 so that it computes F (τ ) and satisfies the
previous safety and liveness properties.
• Considering the value domain {active, passive} with the total order active <
passive, show that the problem of detecting the termination of a distributed
computation consists in repeatedly computing F (τ ) until F (τ ) = passive.
Solutions to all the previous questions in [87, 179].
6. Prove the dynamic termination detection algorithm described in Fig. 14.19. (This
proof is close to that of the static termination detection algorithm presented in
Fig. 14.16. In addition to the monotonicity of the predicate fulfilledi (), the fact
that the counters senti [j ], arri [j ] and ap_sentα [i, j ] are monotonically increas-
ing has to be used.)
Solution in [64].
Chapter 15
Distributed Deadlock Detection

This chapter addresses the deadlock detection problem. After having introduced the
AND deadlock model and the OR deadlock model, it presents distributed algorithms
that detect their occurrence. Let us recall that the property “there is a deadlock” is
a stable property (once deadlocked, a set of processes remain deadlocked until an
external agent—the underlying system—resolves it). Hence, as seen in Sect. 6.5,
algorithms computing global states of a computation can be used to detect dead-
locks. Differently, the algorithms presented in this chapter are specific to deadlock
detection. For simplicity, they all assume FIFO channels.

Keywords AND communication model · Cycle · Deadlock · Deadlock detection ·


Knot · One-at-a-time model · OR communication model · Probe-based algorithm ·
Resource vs. message · Stable property · Wait-for graph

15.1 The Deadlock Detection Problem

15.1.1 Wait-For Graph (WFG)

Waiting for Resources A process pi becomes blocked when it starts waiting for a
resource currently owned by another process pj . This introduces a waiting relation
between pi and pj . It is possible that a process pi needs several resources simul-
taneously. If these resources are currently owned by several processes pj , pk , etc.,
the progress of pi depends on each of these processes.

Waiting for Messages As seen in the previous chapter (Sect. 14.5), a receive state-
ment can be on a message from a specific sender, or on a message from any sender
of a specific set of senders. The specific sender, or set of possible senders, is defined
in the receive statement. (While more general receive statements were introduced in
Sect. 14.5, this chapter considers only the case of a receive statement with a specific
sender or a specific set of senders.)
The important point is that, when a process pi enters a receive statement, it starts
waiting for a message from a predefined process pj , or from any process from a
predefined set of processes.


Fig. 15.1 Examples of wait-for graphs

Resource vs. Message After it has been used, a resource has to be released so that
it can be used by another process. Let us observe that a message can be seen as a
resource that is dynamically created by its sender and consumed by its destination
processes. A receiver has to wait until a message is received. In that sense, a message
can be seen as a consumable resource (the notion of release being then meaningless).
While a resource is shared among several processes and a message involves only
its sender and its receiver, there is no conceptual difference between a process which
is waiting for a resource and a process which is waiting for a message. In both cases,
another process has to produce an action (release a resource, or send a message) in
order that the waiting process be allowed to progress.
It follows that there is no fundamental difference between detection algorithms
for deadlocks due to resources and detection algorithms for deadlocks due to mes-
sages.

Wait-For Graph The waiting relation between processes can be represented by


a directed graph called a wait-for graph (WFG). Its vertices are the processes, and
an edge from pi to pj means that pi is blocked by pj (or, equivalently, that pj
blocks pi ). This graph is not a static graph. Its structure evolves according to the
behavior of the processes (that acquire and release resources, or send and receive
messages).
A wait-for graph is a conceptual tool that allows us to capture blocking relations
and reason on them. Deadlock detection algorithms are not required to explicitly
build this graph.
Two examples of a wait-for graph are depicted in Fig. 15.1. (The graph on the
right side is the same as the graph on the left side with an additional edge from p5
to p2 .) A given graph can be considered as capturing the blocking relations either in
the resource model or in the communication model.
The arrow from p3 to p1 means that p3 is blocked by p1 . If the graph is a resource
wait-for graph, p3 is waiting for a resource owned by p1 , and if the graph is a
communication wait-for graph, it means that p3 is waiting for a message from p1 .
A process can be blocked by several other processes. This is the case of p1 , which is
blocked by p2 and p5 . The meaning of the corresponding edges is explained below
in Sect. 15.1.2.

Dependency Set Given a wait-for graph at some time τ , the dependency set of
a process pi , denoted dep_seti (as in the previous chapter), is the set of all the
processes pj such that the directed edge (pi , pj ) belongs to the graph at time τ . If

a process is not waiting for a resource (or a message, according to the model), its
dependency set is empty. As the wait-for graph evolves with time, the dependency
sets are not static sets.
When looking at Fig. 15.1, we have dep_set1 = {2, 5} in both graphs, dep_set5
= ∅ in the graph on the left and dep_set5 = {2} in the graph on the right.

15.1.2 AND and OR Models Associated with Deadlock

AND (Resource or Communication) Model In the AND waiting model a pro-


cess pi which is blocked will be allowed to progress only when
• AND communication model:
It has received a message from each process that belongs to its current depen-
dency set.
• AND resource model:
It has obtained the resources which are currently owned by the processes in its
dependency set.
Let us observe that, while it is waiting for resources, the dependency set of
a process pi might change with time. This is due to the fact that, while pi is
waiting, resources that pi wants to acquire can be granted to other processes. The
wait-for graph and the dependency set of pi are then modified accordingly.

OR (Resource or Communication) Model In the OR waiting model, a process


is allowed to progress when
• OR communication model:
It has received a message from a process in its dependency set.
• OR resource model:
It has obtained a resource from a process in its dependency set.

One-at-a-Time (Resource or Communication) Model The one-at-a-time model


captures the case where at any time a process is allowed to wait for at most one
resource (resource model), or one message from a predetermined process specified
in the receive statement (communication model).
Let us observe that both the AND model and the OR model include the one-at-a-
time model as their simplest instance. It is actually their intersection.

15.1.3 Deadlock in the AND Model

In the AND model, the progress of a process is stopped until it has received a mes-
sage (or been granted a resource) by each process in its current dependency set. This
means that a process is deadlocked when it belongs to a cycle of the current wait-for
graph, or belongs to a path leading to a cycle of this graph.
Let us consider the wait-for graph on the left of Fig. 15.1. The processes p1 , p2 ,
and p3 are deadlocked because, transitively, each of them depends on itself: p2 is

blocked by p3 , which is blocked by p1 , which in turn is blocked by p2 , hence each of


p3 , p1 , and p2 is transitively blocked by itself. Process p4 is blocked by p2 (which
is deadlocked) but, as it does not depend transitively on itself, it is not deadlocked
according to the definition of “deadlocked process” (killing p4 will not suppress the
deadlock, while killing any of p3 , p1 , or p2 , suppresses the deadlock). Finally, p5
and the other processes (if any) are not blocked.
Hence, when considering the wait-for graph in the AND model, the occurrence of
a deadlock is characterized by the presence of a cycle. Consequently, for an external
observer that would always have an instantaneous view of the current wait-for graph,
detecting a deadlock in the AND model, would consist in detecting a cycle in this
graph.

15.1.4 Deadlock in the OR Model

In the OR model, a process is re-activated as soon as it receives a message from (or


is granted a resource previously owned by) one of the processes in its dependency
set. It follows that the existence of a cycle no longer reveals a deadlock.
Let us consider again the process p1 in the wait-for graph on the left of Fig. 15.1.
As it is waiting for a message (resource) either from p2 or from p5 , and p5 is not
blocked, it is possible that p5 (before terminating its local computation) unblocks
it in the future. Hence, despite the presence of a cycle in the wait-for graph, there is
no deadlock. Let us now consider the wait-for graph on the right of the figure. Each of
p1 , p2 , p3 , and p5 , is blocked, and none of them can allow another one to progress
in the future. This is because the progress of any of these processes depends on the
progress of the others (moreover, p4 is also deadlocked by transitivity). As we can
see, such a graph structure is a knot. Let us recall (see Sect. 2.3) that a knot in a
directed graph is a set S of vertices such that (a) there is a directed path from any
vertex of S to any other vertex of S, and (b) there is no outgoing edge from a vertex
in S to a vertex which is not in S (see Fig. 2.14).
Hence, when considering the wait-for graph in the OR model, the occurrence of
a deadlock is characterized by the presence of a knot. Consequently, for an external
observer that would always have an instantaneous view of the current wait-for graph,
detecting a deadlock in the OR model would consist in detecting a knot in this graph.
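For such an omniscient observer, the two characterizations reduce to simple graph checks. The following Python sketch runs them on the two graphs of Fig. 15.1 (the dictionary encoding of the wait-for graph is an illustrative choice):

    def reachable(wfg, p):
        # set of processes transitively reachable from p along wait-for edges
        seen, stack = set(), list(wfg[p])
        while stack:
            q = stack.pop()
            if q not in seen:
                seen.add(q)
                stack.extend(wfg[q])
        return seen

    def in_cycle(wfg, p):
        # AND model: p is deadlocked if it transitively blocks itself
        return p in reachable(wfg, p)

    def in_knot(wfg, p):
        # OR model: p belongs to a knot if every process it can reach can
        # reach it back (a process on a path leading to a knot, such as p4
        # in the right graph, is deadlocked by transitivity only)
        reach = reachable(wfg, p)
        return bool(wfg[p]) and all(p in reachable(wfg, q) for q in reach)

    left = {1: {2, 5}, 2: {3}, 3: {1}, 4: {2}, 5: set()}   # left graph
    right = {1: {2, 5}, 2: {3}, 3: {1}, 4: {2}, 5: {2}}    # edge (p5, p2) added
    assert in_cycle(left, 1) and not in_knot(left, 1)
    assert in_knot(right, 1)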

15.1.5 The Deadlock Detection Problem

As for the other problems (e.g., resource allocation, communication, termination de-
tection), the deadlock detection problem is defined by safety and liveness properties
that any of its solutions has to satisfy. These properties are the following:
• Safety. If, at time τ , an observer claims that there is a deadlock, the process it is
associated with is deadlocked at time τ .

• Liveness. If a deadlock occurs at time τ , there is a finite time τ′ ≥ τ such that,
at time τ′, the observer of at least one of the deadlocked processes claims that its
associated process is deadlocked.
The safety property states that the detection is consistent if there is no “false”
deadlock detection (i.e., claim of a deadlock while there is none). The liveness prop-
erty states that a deadlock will be discovered by at least one observer associated with
a deadlocked process.

Remark When comparing deadlock detection algorithms with algorithms that al-
low a process to know if it belongs to a cycle, or a knot, of a given communication
graph (see the algorithms presented in Sect. 2.3), the additional difficulty lies in
the fact that the graphs explored by deadlock detection algorithms consider are dy-
namic. The waiting relation depends on the computation itself, and consequently the
wait-for graph evolves with time.
Moreover, while the abstract wait-for graph is modified instantaneously from an
external observer’s point of view (i.e., when reasoning to obtain a characterization
of deadlocks in terms of properties on a graph), at the operational level messages
take time to inform processes about changes in the waiting relation. This creates
an uncertainty on the view of the waiting relation as perceived by each process.
The main issue of deadlock detection algorithms is to provide processes with an
approximate view that allows them to correctly detect deadlocks that occur.

15.1.6 Structure of Deadlock Detection Algorithms

To detect the possible occurrence of deadlocks, an observer obsi is associated with


each application process pi . As in termination detection, each observer has to locally
observe the behavior of the process it is associated with, and observers have to
cooperate among themselves to detect deadlocks. The global structure is the same
as that described in Fig. 14.2.
A main difference between termination detection and deadlock detection lies in
the fact that termination detection involves all the processes (and consequently all
the observers have to permanently cooperate), while deadlock detection involves
only a set of processes which is not known in advance (and consequently only the
observers associated with blocked processes are required to cooperate).

15.2 Deadlock Detection in the One-at-a-Time Model


This section presents a simple algorithm that detects deadlock in the one-at-a-time
model. Let us recall that, in this model, a process becomes blocked because it is
waiting for a resource currently owned by a process pk (resource model), or be-
cause it is waiting for a message from a predetermined process pk (communication
model). This algorithm is due to D.P. Mitchell and M.J. Merritt (1984).

15.2.1 Principle and Local Variables

Principle As we have seen, a deadlock in the one-at-a-time model corresponds


to a cycle in the wait-for graph. The idea of the algorithm is to allow exactly one
process in a cycle to detect the corresponding deadlock (if needed this process can
then inform the other processes).
To that end, when a process observer suspects a deadlock, it suspects a cycle has
formed in the wait-for graph. It then launches an election on this “assumed” cycle.
(The election problem and election algorithms have been presented in Chap. 4.) If
there is a cycle, one process in the cycle will win the election, proving thereby a
deadlock occurrence.
The control messages used by the observers to discover a deadlock occurrence
are sent along the edges of the wait-for graph but in the reverse order, i.e., if a
process pi is blocked by a process pj , the observer obsj sends messages to the
observer obsi .

Local Variables Two local control variables, denoted pubi (for public) and privi
(for private) are associated with each process pi . These variables are integers which
can only increase. Moreover, the values of the private local variables are always
distinct from one another, and the local public variables can be read by the other
processes.
A simple way to ensure that no two private variables (privi and privj ) have the
same value consists in considering that the integer implementing a private variable
privi is obtained from the concatenation of an integer with the identity of the corre-
sponding process pi . The reading of the local variable pubj by an observer obsi can
be easily realized by a query/response mechanism involving obsi and obsj .
Hence, the behavior of each observer obsi consists in an appropriate management
of the pair (privi , pubi ) so that, if a cycle appears in the abstract wait-for graph, a
single process in a cycle detects the cycle, and the processes which are not involved
in a cycle never claim a deadlock.

15.2.2 A Detection Algorithm

When a process pi becomes blocked because of a process pj , its local observer


obsi (a) redefines the pair of its local variables (privi , pubi ) so that both variables
become equal (privi = pubi ) and greater than pubj , and (b) sends its new value pubi
to the observers of the processes blocked by pi .
The value pubi is then propagated from observer to observer along the reverse
edges of the wait-for graph. If pubi returns to obsi , there is a cycle in the wait-for
graph. In order that a single observer on a cycle detects the cycle, the greatest value
pubi that is propagated along a cycle succeeds in completing a full turn on the cycle
(as done in ring-based election algorithms).
The behavior of an observer obsi is consequently defined by the four following
rules, where greater(a, b) returns a value that is greater than both a and b.

• R1 (Blocking rule).
When pi becomes blocked due to pj , obsi resets the values of its local vari-
ables privi and pubi such that

privi = pubi = v, where v = greater(pubi , pubj ).

• R2 (Propagation rule).
When pi is blocked by pj , obsi repeatedly reads the value of pubj , and exe-
cutes
if (pubi < pubj ) then pubi ← pubj end if.
This means that, while pi is blocked by pj , if obsi discovers that the public value
pubj is greater than its own public value pubi , it propagates the value of pubj
by assigning it to pubi . Assuming that pubk is the greatest public value in a path
of the wait-for graph, this rule ensures that pubk is propagated from pk to the
process pℓ blocked by pk , then from pℓ to the process blocked by pℓ , etc.
• R3 (Activation rule).
Let pj , pk , etc., be the processes blocked by a process pi . When pi unblocks
one of them (e.g., pj ), obsi informs the observers of the other processes so that
they re-execute rule R1. (This is due to the fact that these processes are now
blocked by pj .)
• R4 (Detect rule).
If after it has read the value of pubj (where pj is the process that blocks pi ),
obsi discovers that
privi = pubj ,
it claims that there is a cycle in the wait-for graph, and it is consequently involved
in a deadlock.
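A minimal Python sketch of rules R1, R2, and R4 follows; pairing a counter with the process identity and comparing the pairs lexicographically is one standard way to implement greater() and the required uniqueness of private values (rule R3, which only re-triggers R1 at the remaining blocked processes, is omitted):

    class Node:
        def __init__(self, i):
            self.i = i
            self.priv = self.pub = (0, i)   # (counter, identity) pairs are unique

        def block_on(self, other):          # rule R1: p_i becomes blocked
            v = (max(self.pub, other.pub)[0] + 1, self.i)  # greater(pub_i, pub_j)
            self.priv = self.pub = v

        def propagate_from(self, other):    # rule R2: p_i is blocked by 'other'
            if self.pub < other.pub:
                self.pub = other.pub

        def detects_deadlock(self, blocker):  # rule R4: priv_i == pub of blocker
            return self.priv == blocker.pub

    # p1 blocked by p2, p2 by p3, p3 by p1: the last R1 (at p3) forges the
    # largest value, which R2 carries around the cycle back to p3.
    p1, p2, p3 = Node(1), Node(2), Node(3)
    p1.block_on(p2); p2.block_on(p3); p3.block_on(p1)
    p2.propagate_from(p3); p1.propagate_from(p2)
    assert p3.detects_deadlock(p1)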

15.2.3 Proof of the Algorithm

Theorem 24 If a deadlock occurs, eventually a process involved in the associated


cycle detects it. Moreover, no observer claims a deadlock if its associated process is
not deadlocked.

Proof Proof of the liveness property. We have to prove that, if a deadlock occurs,
a process involved in the associated cycle will detect it. Let us assume that there
is a cycle in the wait-for graph. Let us observe that, as a deadlock defines a stable
property, this cycle lasts forever.
It follows from rule R1 that, when it becomes blocked, each process px belonging
to this cycle sets privx and pubx to a value greater than its previous value of pubx
and greater than the value puby of the process py that blocks it. Moreover, due to
the definition of private values, no other process pz sets privz and pubz to the same

value as px . It follows that there is a process (say pi ) whose value pubi = v is the
greatest among the values pubx computed by the processes belonging to the cycle.
When the processes of the cycle execute rule R2, the value of v is propagated,
in the opposite direction, along the edges of the wait-for graph. Hence, the process
pj that blocks pi is eventually such that pubj = v. There is consequently a finite
time after which pi reads v from pubj and concludes that there is a cycle (rule R4).
Moreover, as v = privi , v is a value that has necessarily been forged by pi (no other
process can have forged it). It follows that deadlock is claimed by a single process,
which is a process involved in the corresponding cycle of the wait-for graph.
Proof of the safety property. Let us first observe that ∀x : privx ≤ pubx (this is
initially true, and is kept invariant by the rules R1 and R2 executed thereafter). It
follows from this invariant and R1 that, if px is blocked, (privx < pubx ) ⇒ px has
executed R2.
Let us assume that a process pi , blocked by process pj1 , is such that privi =
pubj1 = v. We have to show that pi belongs to a cycle of the wait-for graph. This is
proved in three steps.
• It follows from the rule R2 that the value v (which has been forged by obsi )
has been propagated from pi to a process blocked by pi , etc., until pj1 . Hence,
there is a set of processes pi , pjk , pjk−1 , . . . , pj1 , such that the edges (pjk , pi ),
(pjk−1 , pjk ), . . . , (pj1 , pi ) have been, at some time, edges of the wait-for graph.
The next two items show that all these edges exist simultaneously when pi claims
deadlock.
• The process pi remained continuously blocked since the time it computed the
value v. If it had become active and then blocked again, it would have executed
again R1, and (due to the invariant privi ≤ pubi and the function greater()) we
would consequently have privi = v′ > v.
• All the other processes pjk , pjk−1 , . . . , pj1 remained continuously blocked since
the time they have forwarded v. This follows from the following observation. Let
us assume (by contradiction) that one of these processes, say pjy , became ac-
tive after having transmitted v. This process has been unblocked by pjy−1 , which
has transmitted v before being unblocked by pjy−2 , which in turn, etc., until pj1
which has transmitted v to pi before being unblocked by pi . It follows that pi
has not been continuously passive since the time it computed the value v, which
contradicts the previous item, and completes the proof. 

15.3 Deadlock Detection in the AND Communication Model

As already indicated, in the AND model, a process is blocked by several other pro-
cesses and each of them has to release a resource or send it a message in order to
allow it to progress. This section presents a relatively simple deadlock detection
algorithm for the communication AND model.

15.3.1 Model and Principle of the Algorithm

Model with Input Buffers As in the previous chapter, let statei be a control vari-
able whose value domain is {active, passive}; this variable is such that statei =
passive when pi is blocked waiting for messages from a predefined set of processes
whose identities define the set dep_seti .
As a process pi consumes simultaneously a message from each process in
dep_seti (and proceeds then to the state statei = active), it is possible that messages
from processes in dep_seti have arrived and cannot be received and consumed by
pi because there is not yet a message from each process in dep_seti . To take this
into account, the communication model described in Fig. 14.15 is considered. This
model allows a process pi (or more precisely its observer obsi ) to look into its input
buffers to know if messages have arrived and are ready to be consumed.

Principle of the Algorithm Let pi be a process that is blocked (hence, statei =


passive), and arr_fromi be a set containing the identities of the processes from which
messages have arrived at pi and have not yet been consumed. This means that these
messages can help unblock pi .
When it suspects that pi is deadlocked, its observer obsi sends a control mes-
sage PROBE () to each process pj in dep_seti \ arr_fromi . When such a process
pj receives this message, it discards the message if it is active. If it is passive,
its observer obsj forwards the message PROBE () to the observers obsk such that
k ∈ dep_setj \ arr_fromj , and so on. If the observer obsi that initiated the detection
receives a message PROBE (), there is a cycle in the wait-for graph, and pi belongs
to a set of deadlocked processes.

15.3.2 A Detection Algorithm

Local Variables at a Process pi In addition to the control variables statei ,


dep_seti , and arr_fromi (which is managed as in Sect. 14.5), a local observer man-
ages the three following arrays, all initialized to [0, . . . , 0]:
• sni [1..n] is an array of sequence numbers; sni [i] is used by obsi to identify its
successive messages PROBE (), while sni [k] contains the highest sequence num-
ber received by obsi in a message PROBE () whose sending has been initiated by
pk .
• senti [1..n] is an array of integers such that senti [j ] counts the number of mes-
sages sent by pi to pj .
• arri [1..n] is an array of integers such that arri [j ] counts the number of messages
sent by pj that have arrived at pi (it is possible that, while they have arrived at
pi , some of these messages have not yet been consumed by pi ).

when pi enters a receive statement do


(1) compute dep_seti from the receive statement;
(2) statei ← passive.

when pi executes send(m) to pj do


(3) senti [j ] ← senti [j ] + 1.

when a message from pj arrives do


(4) arri [j ] ← arri [j ] + 1.

when arr_fromi ⊇ dep_seti do


(5) messages are withdrawn from their input buffers and given to pi ;
(6) statei ← active.

when obsi suspects that pi is deadlocked do


% we have then statei = passive and arr_fromi ⊉ dep_seti %
(7) sni [i] ← sni [i] + 1;
(8) for each j ∈ dep_seti \ arr_fromi
(9) do send PROBE (i, sni [i], i, arri [j ]) to obsj
(10) end for.

when PROBE (k, seqnb, j, arrived) is received from obsj do


(11) if (statei = passive) ∧ (senti [j ] = arrived) then
(12) if (k = i) then claim pi is deadlocked
(13) else if (seqnb > sni [k]) then
(14) sni [k] ← seqnb;
(15) for each ℓ ∈ dep_seti \ arr_fromi
(16) do send PROBE (k, seqnb, i, arri [ℓ]) to obsℓ
(17) end for
(18) end if
(19) end if
(20) end if.

Fig. 15.2 An algorithm for deadlock detection in the AND communication model

The Algorithm The algorithm is described in Fig. 15.2. All the sequences of state-
ments prefixed by when are executed atomically. Lines 1–6 describe the behavior
of obsi as far as the observation of pi is concerned.
When obsi suspects pi to be deadlocked, we have statei = passive and
arr_fromi ⊉ dep_seti . This suspicion can be activated by an internal timer, or any
predicate internal to obsi . When this occurs, obsi increases sni [i] (line 7), and sends
a message PROBE () to each observer obsj associated with a process pj from which
it is waiting for a message (lines 8–10).
A probe message is as follows: PROBE (k, seqnb, j, arrived). Let us consider an
observer obsi that receives such a message. The pair (k, seqnb) means that this
probe has been initiated by obsk and is the seqnbth probe launched by obsk ;
j is the identity of the sender of the message (this parameter could be saved, as a
receiver knows which observer sent this message; it appears as a message parameter
for clarity). Finally, the integer arrived is the number of messages that have been
sent by pi to pj and have arrived at pj .

Fig. 15.3 Determining in-transit messages

Fig. 15.4 PROBE () messages sent along a cycle (with no application messages in transit)

When obsi receives a message PROBE (k, seqnb, j, arrived), it discards it


(line 11), if pi is active or there were messages in transit from pi to pj when obsj
sent this probe message. The latter condition is captured by the predicate senti [j ] ≠ arrived
(let us recall that arrived = arrj [i] when obsj sent this message, and channels are
assumed to be FIFO, see Fig. 15.3).
If pi is passive and there is no message in transit from pi to pj , obsi checks
first if it is the initiator of this probe (predicate k = i). If it is, it declares that
pi is deadlocked (line 12). If k ≠ i (i.e., the initiator of the probe identified
(k, seqnb) is not obsi ), and this is a new probe launched by obsk (line 13), obsi
updates sni [k] (line 14) and propagates the probe by sending a control message
PROBE (k, seqnb, i, arri [ℓ]) to each observer obsℓ such that pi is waiting for a
message from pℓ .
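To make the preceding description concrete, the following is a minimal Python sketch (not the book's pseudocode) of the probe handling of Fig. 15.2 for a single observer obsi. The helper send_to(j, msg), and the way dep_set and arr_from are maintained, are assumptions of the sketch.

class AndObserver:
    def __init__(self, i, n, send_to):
        self.i = i
        self.state = "active"          # state_i
        self.sn = [0] * (n + 1)        # sn_i[1..n]
        self.sent = [0] * (n + 1)      # sent_i[1..n]
        self.arr = [0] * (n + 1)       # arr_i[1..n]
        self.dep_set = set()           # dep_set_i
        self.arr_from = set()          # senders with arrived, unconsumed messages
        self.send_to = send_to         # assumed transport: send_to(j, msg)

    def suspect(self):
        # lines 7-10: launch a probe toward each process p_i still waits for
        self.sn[self.i] += 1
        for j in self.dep_set - self.arr_from:
            self.send_to(j, ("PROBE", self.i, self.sn[self.i], self.i, self.arr[j]))

    def on_probe(self, k, seqnb, j, arrived):
        # lines 11-20: discard the probe if p_i is active, or if messages from
        # p_i to p_j were in transit when obs_j sent it (sent_i[j] != arrived)
        if self.state != "passive" or self.sent[j] != arrived:
            return
        if k == self.i:
            print(f"process {self.i} is deadlocked")   # line 12
        elif seqnb > self.sn[k]:                        # lines 13-17
            self.sn[k] = seqnb
            for l in self.dep_set - self.arr_from:
                self.send_to(l, ("PROBE", k, seqnb, self.i, self.arr[l]))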

15.3.3 Proof of the Algorithm

Theorem 25 If a deadlock occurs, eventually a process involved in the associated
cycle detects it. Moreover, no observer claims a deadlock if its associated process is
not deadlocked.

Proof Proof of the liveness property. Let us assume that a process pi is deadlocked.
This means that (a) there is a cycle of processes pj1 , pj2 , . . . , pjk , pj1 , where i =
j1 , j2 ∈ dep_setj1 , j3 ∈ dep_setj2 , etc., j1 ∈ dep_setjk , (b) there is no application
message in transit from pj2 to pj1 , from pj3 to pj2 , etc., and from pj1 to pjk , and
(c) none of these processes can be re-activated from the content of its input buffer.
Let us consider the observer obsi , which launches a probe after the cycle
has been formed (see Fig. 15.4, where k = 3). The observer obsi sends the
message PROBE (i, sn, i, arri [j2 ]) to obsj2 (line 9). As the channel from pj2 to

Fig. 15.5 Time instants in the proof of the safety property

pj1 is empty of application messages, the first time obsj2 receives the mes-
sage PROBE (i, sn, i, arri [j2 ]), we have sentj2 [i] = arri [j2 ], and obsj2 executes
lines 14–17. Consequently obsj2 sends the message PROBE (i, sn, j2 , arrj2 [j3 ]) to
obsj3 . And so on, until obsjk , which receives the message PROBE (i, sn, jk−1 , arrjk−1 [jk ])
and sends the message PROBE (i, sn, jk , arrjk [i]) to obsi . As the channel from pi to
pjk is empty, when obsi receives this message we have senti [jk ] = arrjk [i]. Hence,
obsi executes line 12 and claims that pi is deadlocked, which proves the
liveness property.
Proof of the safety property. Let us consider an observer obsi that claims that pi
is deadlocked. We have to show that pi belongs to a cycle pi = pj1 , pj2 , . . . , pjk ,
pj1 , such that there is a time at which simultaneously (a) j2 ∈ dep_setj1 and there is
no message in transit from pj2 to pj1 , (b) j3 ∈ dep_setj2 and there is no message in
transit from pj3 to pj2 , (c) etc. until process pj1 such that j1 ∈ dep_setjk and there
is no message in transit from pj1 to pjk .
As obsj1 claims that pj1 is deadlocked (line 12), there is a time τ at which obsj1
received a message PROBE (j1 , sn, jk , arrjk [j1 ]) from some observer obsjk . Process
pjk was passive when obsjk sent this message at some time τk < τ . Moreover,
as obsj1 did not discard this message when it received it (predicate sentj1 [jk ] =
arrived = arrjk [j1 ], line 11), the channel from pj1 to pjk did not contain applica-
tion messages between τk and τ . It follows that pjk remained continuously passive
from the time τk to time τ . (See Fig. 15.5. The fact that there is no message in transit
from one process to another is indicated by a crossed-out arrow.)
The same observation applies to obsjk . This local observer received at time
τk a control message PROBE (j1 , sn, jk−1 , arrjk−1 [jk ]), which was sent by an ob-
server obsjk−1 at time τk−1 . As obsjk did not discard this message, we have
sentjk [jk−1 ] = arrjk−1 [jk ] from which it follows that the channel from pjk to
pjk−1 did not contain application messages between τk−1 and τk . Moreover, as
jk ∈ dep_setjk−1 \ arr_fromjk−1 , and pjk remained continuously passive from time
τk to time τ , it follows that pjk−1 remained continuously passive from time τk−1
to time τ . This reasoning can be repeated until the sending by pj1 of the message
PROBE (j1 , sn, j1 , arrj1 [j2 ]), at time τ1 , from which we conclude that pj1 remained
continuously passive from time τ1 to time τ .

It follows that (a) the processes pj1 , . . . , pjk are passive at time τ , (b) the channel
from pj1 to pjk , the channel from pjk to pjk−1 , etc., until the channel from pj2 to
pj1 are empty at time τ , and (c) none of these processes can be re-activated from the
messages in its input buffer. Consequently these processes are deadlocked (which
means that the cycle is a cycle of the wait-for graph), which concludes the proof of
the safety property. 

15.4 Deadlock Detection in the OR Communication Model

As seen in Sect. 15.1, in the OR communication model, each receive statement spec-
ifies a set of processes, and the invoking process pi waits for a message from any of
these processes. This set of processes is the current dependence set of pi , denoted
dep_seti . As soon as a message from a process of dep_seti has arrived, pi stops wait-
ing and consumes it. This section presents an algorithm which detects deadlocks in
this model. This algorithm is due to K.M. Chandy, J. Misra, and L.M. Haas (1983).
This algorithm considers the following slightly modified definition for a set of
deadlocked processes, namely, a set D of processes is deadlocked if (a) all the pro-
cesses of D are passive, (b) the dependency set of each of them is a subset of D, and
(c) for each pair of processes {pi , pj } ⊆ D such that j ∈ dep_seti , there is no mes-
sage in transit from pj to pi . When compared to the definition given in Sect. 15.1.4,
this definition includes the processes blocked by processes belonging to a knot of
deadlocked processes. When considering the wait-for graph on the left of Fig. 15.1,
this definition considers that p4 is deadlocked, while the definition of Sect. 15.1.4
considers it is blocked by a deadlocked process (p2 ).
This algorithm assumes also that a process pi is passive only when it is waiting
for a message from a process belonging to its current dependency set dep_seti ,
which means that processes do not terminate. The case where a process locally
terminates (i.e., attains an “end” statement, after which it remains forever passive)
is addressed in Problem 4.

15.4.1 Principle

Network Traversal with Feedback When the observer obsi associated with a
process pi suspects that pi is involved in a deadlock, it launches a parallel network
traversal with feedback on the edges of the wait-for graph of which it is the origin, i.e.,
on the channels from pi to pj such that j ∈ dep_seti . (A network traversal algorithm
with feedback that builds a spanning tree has been presented in Sect. 1.2.4.) If such
a process pj is active, it discards the message and, consequently, stops the network
traversal. If it is itself blocked, it propagates the network traversal to the processes
which currently define its set dep_setj . And so on. These control messages are used
to build a spanning tree rooted at pi .

Fig. 15.6 A directed communication graph

If the network traversal with feedback terminates (i.e., obsi receives an answer
from each process pj in dep_seti ), the aim is to allow obsi to conclude that pi is
deadlocked. If an observer obsj has locally stopped the progress of the network
traversal launched by obsi , pj is active and may send in the future a message to
the process pk that sent it a network traversal message. If re-activated, this process
pk may in turn re-activate a process pℓ such that ℓ ∈ dep_setk , etc. This chain of
process re-activations can end in the re-activation of pi .

The Difficulty in the Observation of a Distributed Computation The network


traversal algorithms presented in Chap. 1 consider an underlying static network.
Differently, the network defined by the wait relation (wait-for graph) is not static:
it is modified by the computation. Hence the following problem: Can a network
traversal algorithm designed for a static graph be used to do a consistent distributed
observation of a graph that can be dynamically modified during its observation?

Network Traversal with Feedback on a Directed Static Communication Graph


Let QUERY be the type of messages that implement the propagation of a network
traversal on the wait-for graph, and ANSWER the type of the messages that imple-
ment the associated feedback. (These messages are typed GO and BACK, respec-
tively, in the network traversal with feedback algorithm described in Fig. 1.7).
Let us consider a four-process computation such that, at the application level,
the communication graph is the directed graph described in Fig. 15.6. At the un-
derlying level, the channels are bidirectional for the control messages exchanged
by the process observers. Let us consider the case where the local observer obs1
launches a network traversal with feedback. This network traversal is described in
Fig. 15.7, where the subscript (x, y) attached to a message means that this message
is sent by obsx to obsy . First obs1 sends a message QUERY () to both obs2 and obs3
(messages QUERY 1,2 () and QUERY 1,3 ()), then obs2 forwards the network traversal

Fig. 15.7 Network traversal with feedback on a static graph



Fig. 15.8 Modification in a wait-for graph

to obs4 (message QUERY 2,4 ()), while obs3 forwards it to obs1 and obs4 (messages
QUERY 3,1 () and QUERY 3,4 ()). As it cannot extend the traversal, obs4 sends back a
message ANSWER () each time it receives a query (messages ANSWER 4,2 () and AN -
SWER 4,3 ()). As it has already been visited by the network traversal, obs1 sends by
return the message ANSWER 1,3 () when it receives QUERY 3,1 (). When each of obs2
and obs3 has received all the answers matching its queries, it sends a message AN -
SWER () to obs1 . Finally, when obs1 has received the answers from obs2 and obs3 ,
the network traversal with feedback terminates.

Network Traversal with Feedback on a Directed Dynamic Communication


Graph Let us now consider that at time τ1 , the directed graph on the left of
Fig. 15.8 is the current wait-for graph of the corresponding four-process compu-
tation. This means that, at τ1 , dep_set1 = {2, 3}, dep_set2 = {4}, dep_set3 = {1, 4},
and dep_set4 = ∅ (hence p1 , p2 and p3 are blocked, while p4 is active). Moreover,
the channels are empty of application messages.
Let us consider the following scenario. Process p4 first sends a message m to p2
and then starts waiting for a message from p2 . The message m will re-activate p2 ,
and the wait-for graph (which is a conceptual tool) is instantaneously modified. This
is depicted in Fig. 15.8, where the directed graph on the left side is the wait-for graph at
time τ1 , i.e., just before p4 sends a message to p2 and starts waiting for a message
from this process, while the graph on the right side is the wait-for graph at time τ2 ,
just after p4 has executed these statements.
After p2 has been re-activated by the application message m, it sends a message
m′ to p4 and starts waiting for a message from this process. Let τ3 be the time instant
just after this has been done. The corresponding modification of the wait-for graph
is indicated at the bottom of Fig. 15.8.
Let us finally consider that obs1 launches a network traversal with feedback at
time τ1 . This network traversal, which is depicted in Fig. 15.9, is exactly the same
as that described in Fig. 15.7. Let us recall that, as indicated previously, an observer

Fig. 15.9 Inconsistent observation of a dynamic wait-for graph

obsi propagates a network traversal only if pi is passive. In the scenario which is
described, despite the application messages m and m′ exchanged by p4 and p2 ,
the control messages are received by any observer obsi while the process pi it is
associated with is passive. Hence, the network traversal with feedback terminates,
and obs1 concludes erroneously that p1 is involved in a deadlock.

Observe if Processes Have Been Continuously Passive The erroneous observa-


tion comes from the fact that there is process activity in the back of the network
traversal, and the network traversal misses it. Let us observe that the FIFO prop-
erty of the channels is not sufficient to solve the problem (in Fig. 15.9 all channels
behave as FIFO channels).
A simple way to solve this problem consists in requiring that each observer obsi
observes if its process pi remained continuously passive between the time it was
visited by the network traversal (reception of the first message QUERY () from an
observer obsj ) and the time it sent a message ANSWER () to this observer obsj . As
demonstrated by the algorithm that follows, this observation Boolean value plus the
fact that channels are FIFO allows for a consistent distributed observation of the
computation.
As far as channels are concerned, let us notice that, if channels were not FIFO, an
erroneous observation could occur, despite the previous Boolean values. To see this,
it is sufficient to consider the case where, in Fig. 15.9, the message m sent by p4 to
p2 arrives after the control message ANSWER () sent by obs4 to obs2 .

15.4.2 A Detection Algorithm

Local Variables As in previous algorithms, each observer obsi manages a local


variable statei that describes the current state of pi (active or passive). The network
whose traversal is launched by obsi is dynamically defined according to the current

values of the sets dep_setj . (By definition dep_setj = ∅ if pj is active.) The ver-
tices of the corresponding directed graph are the observers in the set DPi , which is
recursively defined as follows

DPi = {i} ∪ ( ⋃x∈DPi dep_setx ),

and there is an edge from obsx to obsy if y ∈ dep_setx .
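As DPi is defined recursively, it can be computed as a least fixpoint of the dependence relation. The following small Python sketch (where the map dep_set is an assumed input) illustrates this computation on the wait-for graph on the left of Fig. 15.8.

def compute_dp(i, dep_set):
    # least fixpoint: start from {i} and repeatedly add the dependence
    # sets of the vertices already reached
    dp = {i}
    changed = True
    while changed:
        changed = False
        for x in list(dp):
            new = dep_set.get(x, set()) - dp
            if new:
                dp |= new
                changed = True
    return dp

dep_set = {1: {2, 3}, 2: {4}, 3: {1, 4}, 4: set()}
print(compute_dp(1, dep_set))   # prints {1, 2, 3, 4}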


To implement network traversals while ensuring a correct observation, each ob-
server obsi manages the following local arrays whose entry k concerns the last net-
work traversal initiated by obsk .
• sni [1..n] is an array of sequence numbers, whose meaning is the same as in the
algorithm of Fig. 15.2, namely, sni [j ] is the sequence number associated with the
last network traversal initiated by obsj and known by obsi . Initially, sni [1..n] =
[0, . . . , 0].
The last network traversal initiated by an observer obsi is consequently iden-
tified by the pair (i, sni [i]).
• parenti [1..n] is an array such that parenti [j ] contains the parent of obsi with respect
to the last network traversal launched by obsj and known by obsi .
• expected_aswri [1..n] is an array of non-negative integers such that expected_
aswri [j ] is the number of messages ANSWER () that obsi has still to receive
before sending an answer to its parent, when considering the network traversal
identified by the pair (j, sni [j ]).
The pair of local variables (parenti [j ], expected_aswri [j ]) is related to the last
network traversal initiated by obsj . The set of variables parenti [j ] implements
a spanning tree rooted at obsj , which is rebuilt during each network traversal
launched by obsj .
• cont_passivei [1..n] is an array of Boolean values. For any j , cont_passivei [j ] is
used to register the fact that during the last visit of a network traversal initiated
by obsj , pi did or did not remain continuously passive. This last visit started at
the last modification of sni [j ].

The Algorithm The deadlock detection algorithm for the OR communication


model is described in Fig. 15.10. As in the previous algorithm, the sequences of
statements prefixed by when are executed atomically.
When a process pi enters a receive statement, obsi computes the associated set
dep_seti and pi becomes passive if there is no message from a process that belongs
to dep_seti , which has already arrived and can be consumed (lines 1–3). If there is
such a message, pi consumes it and continues its execution.
When a message arrives from a process pj and j ∈ dep_seti , pj is re-activated,
and the Boolean array cont_passivei [1..n] is reset to [false, . . . , false] (lines 6–10).
When obsi suspects that pi is deadlocked (we have then necessarily statei =
passive), it increases sni [i] and sends a query—identified as (i, sni [i])—to the ob-
servers associated with the processes pj such that j ∈ dep_seti (because only a
message from one of these processes can re-activate pi ). It also assigns the value

when pi enters a receive statement do


(1) compute dep_seti from the receive statement;
(2) if (∀pj such that j ∈ dep_seti :
there is no message arrived from pj and not yet consumed)
(3) then statei ← passive
(4) else consume a message received from a process in dep_seti
(5) end if.

when a message arrives from pj do


(6) if (statei = passive) ∧ (j ∈ dep_seti )
(7) then for each j ∈ {1, . . . , n} do cont_passivei [j ] ← false end for;
(8) statei ← active
(9) else keep the message in input buffer
(10) end if.

when obsi suspects that pi is deadlocked do % we have then statei = passive %


(11) sni [i] ← sni [i] + 1;
(12) for each j ∈ dep_seti do send QUERY (i, sni [i]) to obsj end for;
(13) expected_aswri [i] ← |dep_seti |;
(14) cont_passivei [i] ← true.

when QUERY (k, seqnb) is received from obsj do


(15) if (statei = passive)
(16) then if (seqnb > sni [k])
(17) then sni [k] ← seqnb; parenti [k] ← j ;
(18) for each j ∈ dep_seti do send QUERY (k, seqnb) to obsj end for;
(19) expected_aswri [k] ← |dep_seti |;
(20) cont_passivei [k] ← true
(21) else if (cont_passivei [k]) ∧ (seqnb = sni [k])
(22) then send ANSWER (k, seqnb) to obsj
(23) end if
(24) end if
(25) end if.

when ANSWER (k, seqnb) is received from obsj do


(26) if ((seqnb = sni [k]) ∧ cont_passivei [k])
(27) then expected_aswri [k] ← expected_aswri [k] − 1;
(28) if (expected_aswri [k] = 0)
(29) then if (k = i) then claim pi is deadlocked
(30) else let x = parenti [k]; send ANSWER (k, seqnb) to obsx
(31) end if
(32) end if
(33) end if.

Fig. 15.10 An algorithm for deadlock detection in the OR communication model

|dep_seti | to expected_aswri [i], and (because it starts a new observation period of
pi ) it sets cont_passivei [i] to true (lines 11–14).
When obsi receives a message QUERY (k, seqnb) from obsj it discards the mes-
sage if pi is active (line 15). Hence, the network traversal launched by obsk and

identified (k, seqnb) will not terminate, and consequently this network traversal will
not allow obsk to claim that pi is deadlocked. If pi is passive, there are two cases:
• If seqnb > sni [k], obsi discovers that this query concerns a new network traversal
launched by obsk . Consequently, it updates sni [k] and defines obsj as its parent
in this network traversal (line 17). It then extends the network traversal by prop-
agating the query message it has received to each observer of dep_seti (line 18),
updates expected_aswri [k] accordingly (line 19), and starts a new observation pe-
riod of pi (with respect to this network traversal) by setting cont_passivei [k] to
true (line 20).
• If seqnb ≤ sni [k], obsi stops the network traversal if it is an old one (seqnb <
sni [k]), or if pi has been re-activated since the beginning of the observation period
(which started at the first reception of a message QUERY (k, seqnb)).
If the query concerns the last network traversal launched by pk and pi has
not been re-activated since the start of the local observation period associated
with this network traversal, obsi sends by return the message ANSWER (k, seqnb)
to obsj (line 22). This is needed to allow the network traversal to return to its
initiator (if no other observer stops it).
Finally, when an observer obsi receives a message ANSWER (k, seqnb) it dis-
cards the message if this message is related to an old traversal, or if pi did not
remain continuously passive since the beginning of this network traversal. Hence,
if (seqnb = sni [k]) ∧ cont_passivei [k], obsi first decreases expected_aswri [k]
(line 27). Then, if expected_aswri [k] = 0, the network traversal with feedback can
leave obsi . To that end, if k ≠ i, obsi forwards the message ANSWER (k, seqnb) to
its parent in the tree built by the first message QUERY (k, seqnb) it has received. If
k = i, the network traversal has returned to obsi , which claims that pi is deadlocked.
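The following Python sketch condenses the query/answer handling of Fig. 15.10 for one observer obsi; the helper send_to(j, msg) and the surrounding transport are assumptions of the sketch, not part of the algorithm.

class OrObserver:
    def __init__(self, i, n, send_to):
        self.i = i
        self.state = "passive"
        self.dep_set = set()
        self.sn = [0] * (n + 1)                # sn_i[1..n]
        self.parent = [None] * (n + 1)         # parent_i[1..n]
        self.expected = [0] * (n + 1)          # expected_aswr_i[1..n]
        self.cont_passive = [False] * (n + 1)  # cont_passive_i[1..n]
        self.send_to = send_to

    def on_query(self, k, seqnb, j):
        if self.state != "passive":
            return                                        # line 15: discard
        if seqnb > self.sn[k]:                            # new traversal, lines 16-20
            self.sn[k], self.parent[k] = seqnb, j
            for x in self.dep_set:
                self.send_to(x, ("QUERY", k, seqnb, self.i))
            self.expected[k] = len(self.dep_set)
            self.cont_passive[k] = True
        elif seqnb == self.sn[k] and self.cont_passive[k]:
            self.send_to(j, ("ANSWER", k, seqnb, self.i))  # lines 21-22

    def on_answer(self, k, seqnb):
        if seqnb != self.sn[k] or not self.cont_passive[k]:
            return                                        # line 26: discard
        self.expected[k] -= 1                             # line 27
        if self.expected[k] == 0:                         # lines 28-31
            if k == self.i:
                print(f"process {self.i} is deadlocked")
            else:
                self.send_to(self.parent[k], ("ANSWER", k, seqnb, self.i))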

15.4.3 Proof of the Algorithm

Theorem 26 If a deadlock occurs, eventually the observer of a deadlocked process
detects it. Moreover, no observer claims a deadlock if its associated process is not
deadlocked.

Proof Proof of the liveness property. Let D be a set of deadlocked processes, and pi
be one of these processes. Let us assume that, after the deadlock has occurred, obsi
launches a deadlock detection. We have to show that obsi eventually claims that pi
is deadlocked.
As there is deadlock, there is no message in transit among the pairs of processes
of D such that i, j ∈ D and j ∈ dep_seti . Let (i, sn) be the identity of the network
traversal with feedback launched by obsi after the deadlock occurred. The observer
obsi sends a message QUERY (i, sn) to each observer obsj , such that j ∈ dep_seti .
When such an observer receives this message it sets cont_passivej [i] to true, and
forwards QUERY (i, sn) to all the observers obsk such that k ∈ dep_setj , etc. As there

Fig. 15.11 Activation pattern for the safety proof

is no message in transit among the processes of D, and all the processes in D are
passive, all the Boolean variables cont_passivex [i] of the processes in D are set to
true and thereafter remain true forever. It follows that no message QUERY (i, sn) or
ANSWER (i, sn) is discarded. Consequently, no observer stops the progress of the
network traversal which returns at obsi and terminates, which concludes the proof
of the liveness of the detection algorithm.
Proof of the safety property. We have to show that no observer obsi claims that
pi is deadlocked while it is not. Figure 15.11 is used to illustrate the proof.

Claim C Let obsi be an observer that sends ANSWER () to obsk where parenti = k.
Let us assume that, after this answer has been sent, pi sends a message m to pk
at time τi . Then, there is an observer obsj such that j ∈ dep_seti , which received
QUERY () from obsi and, subsequently, obsj sent an answer to obsi at some time τja
and pj became active at some time τj such that τja < τj < τi .
Proof of the Claim C In order for obsi to send ANSWER () to its parent, it needs to
have received an answer from each observer obsj such that j ∈ dep_seti (lines 19,
27–28, and 30). Moreover, for pi to become active and send a message m to pk ,
it needs to have received an application message m′ from a process belonging to
dep_seti (lines 6 and 8). Let pj be this process.
As the channels are FIFO, and obsj sent a message ANSWER () to obsi , m′ is
not an old message still in transit; it has necessarily been sent after obsj sent the
answer message to obsi . It follows that m′ is received after the answer message and
we consequently have τja < τj . Finally, as the sending of m′ by pj occurs before
its reception by pi (which occurs after obsi sent an answer to obsk ), it follows that
pj becomes active at some time τj such that τj < τi , and we obtain τja < τj < τi .
(End of the proof of the claim.)

Let D be the set of processes involved in the network traversal with feedback ini-
tiated by an observer obsz which executes line 29 and claims that pz is deadlocked.
As the network traversal returns to its initiator obsz , it follows that every observer

in D has received a matching answer for each query it has sent. Hence, every ob-
server sent an answer message to its parent in the spanning tree built by the network
traversal. Moreover, there is at least one cycle in the set D (otherwise, the network
traversal will not have returned to obsz ).
Let us consider any of these cycles after obsz has claimed a deadlock at some
time τ . This cycle includes necessarily a process pi that, when it receives a query
from a process pk of D defines pk as its parent (line 17). Let us assume that such
a process pi is re-activated at some time τi > τ . It follows from the Claim C that
there is a process pj , such that j ∈ dep_seti and pj sent to pi a message m such that
τja < τj < τi . When considering the pair (pj , obsj ), and applying again Claim C, it
follows that there is a process pℓ such that ℓ ∈ dep_setj and pℓ sent to pj a message
m′ such that τℓa < τℓ < τj . Hence, τℓ < τi .
Let pi , pj , pℓ , . . . , px , pi be the cycle. Using inductively the previous argument,
it follows that i ∈ dep_setx and pi sent to px a message mx at some time τi′ such
that τi′ < τx . Hence, we obtain τi′ < τx < · · · < τℓ < τj < τi . But as (a) pi remained
continuously passive between the reception of QUERY () from px and the sending
of the matching message ANSWER (), and (b) the channel from pi to px is FIFO, it
follows that τi′ > τi , a contradiction.
As the previous reasoning holds for any cycle in the set D, it follows that, for any
pair {py , py′ } ⊆ D such that y ∈ dep_sety′ , there is no message in transit from py to
py′ at time τ , which concludes the proof of the safety property. 

15.5 Summary
A process becomes deadlocked when, while blocked, its progress depends transi-
tively on itself. This chapter has presented two types of communication (or resource)
models, which are the most often encountered. In the AND model, a process waits
for messages from several processes (or for resources currently owned by other
processes). In the OR model, it waits for one message (or resource) from several
possible senders (several distinct resources). After an analysis of the deadlock phe-
nomenon and its capture by the notion of a wait-for graph, the chapter presented
three deadlock detection algorithms. The first one is suited to the one-at-a-time
model, which is the simplest instance of both the AND and OR models, namely,
a process waits for a single message from a given sender (or a single resource) at a
time. The second algorithm allows for the detection of deadlocks in the AND com-
munication model, while the third one allows for the detection of deadlocks in the
OR communication model.

15.6 Bibliographic Notes


• The deadlock problem originated in resource allocation when the very first op-
erating systems were designed and implemented in the 1960s and early 1970s

(e.g., [97, 160, 165, 166, 188]). The deadlock problem has also been intensively
studied in database systems (e.g., [218, 259, 287, 331]).
• The advent of distributed systems, distributed computing, and graph-based algo-
rithms has given rise to a new impetus for deadlock detection, and many new
algorithms have been designed (e.g., [61, 169, 216, 278] to cite a few).
• Algorithms to detect cycles and knots in static graphs are presented in [248, 264].
• The deadlock detection algorithm for the one-at-a-time model presented in
Sect. 15.2 is due to D.L. Mitchell and M. Merritt [266].
• The deadlock detection algorithm for the AND communication model presented
in Sect. 15.3 is new. Another detection algorithm suited to this model is presented
in [260], and a deadlock avoidance algorithm for the AND model is described
in [389].
• The deadlock detection algorithm for the OR communication model presented in
Sect. 15.4 is due to K.M. Chandy, J. Misra, and L.M. Haas [81]. This algorithm is
based on the notion of a diffusing computation introduced by E.W. Dijkstra and
C.S. Scholten in [115]. Another algorithm for the OR model is described in [379].
• A deadlock detection algorithm for a very general communication model includ-
ing the AND model, the OR model, the k-out-of-n model, and their combination
is described and proved correct in [65].
• Proof techniques for deadlock absence in networks of processes are addressed
in [76] and an invariant-based verification method for a deadlock detection algo-
rithm is investigated in [215].
• Deadlock detection algorithms for synchronous systems are described in
[308, 391].
• Deadlock detection in transaction systems is addressed in [163].
• Introductory surveys on distributed deadlock detection can be found in [206, 346].

15.7 Exercises and Problems

1. Considering the deadlock detection algorithm presented in Sect. 15.2, let us re-
place the detection predicate privi = pubj by pubi = pubj , in order to eliminate
the local control variable privi .
Is this predicate correct to ensure that each deadlock is detected, and is it de-
tected by a single process, which is a process belonging to a cycle? To show the
answer is “yes”, a proof has to be designed. To show the answer is “no”, a coun-
terexample has to be produced. (To answer this question, one can investigate the
particular case of the wait-for graph described in Fig. 15.12.)
2. Adapt the deadlock detection algorithm suited to the AND communication model
presented in Sect. 15.3 to obtain an algorithm that works for the AND resource
model.
Solution in [81].
3. Let us consider the general OR/AND receive statement defined in the previ-
ous chapter devoted to termination detection (Sect. 14.5). Using the predicate

Fig. 15.12 Another example of a wait-for graph

fulfilled(), design an algorithm which detects communication deadlocks in this


very general communication model.
(Hint: the solution consists in an appropriate generalization of the termination
detection algorithm presented in Fig. 14.16.)
Solution in [65].
4. Considering the deadlock detection algorithm for the OR communication model
described in Fig. 15.10, let us assume that, when a process pi attains its “end”
statement, obsi (a) sets dep_seti to ∅, and (b) for any pair (k, sn), sends system-
atically by return ANSWER (k, sn) each time it receives a message QUERY (k, sn).
In that way, a locally terminated process never stops a network traversal with
feedback.
Does this extension leave the detection algorithm correct? When a deadlock
is detected by an observer obsk , is it possible for obsk to know which are the
locally terminated processes involved in the deadlock?
Part VI
Distributed Shared Memory

A distributed shared memory is an abstraction that hides the details of communicat-


ing by sending and receiving messages through a network. The processes cooperate
to a common goal by using shared objects (also called concurrent objects). The most
famous of these objects is the read/write register, which gives the illusion that the
processes access a classical shared memory. Other concurrent objects are the usual
objects such as queues, stacks, files, etc.
This part of the book is devoted to the implementation of a shared memory on
top of a message-passing system. To that end it investigates two consistency con-
ditions which can be associated with shared objects, namely, atomicity (also called
linearizability), and sequential consistency. For a given object, or a set of objects,
a consistency condition states which of its executions are the correct ones. As an
example, for a read/write shared register, it states which are the values that must be
returned by the invocations of the read operation.

This part of the book is made up of two chapters. After having presented the
general problem of building a shared memory on top of a message-passing system,
Chap. 16 addresses the atomicity (linearizability) consistency condition. It defines
it, presents its main composability property, and describes distributed algorithms
that implement it. Then, Chap. 17 considers the sequential consistency condition,
explains its fundamental difference with respect to atomicity, and presents several
implementations of it.
Chapter 16
Atomic Consistency (Linearizability)

This chapter is on the strongest consistency condition for concurrent objects. This
condition is called atomicity when considering shared registers, and linearizability
when considering more sophisticated types of objects. In the following, these two
terms are considered as synonyms.
The chapter first introduces the notion of a distributed shared memory. It then de-
fines formally the atomicity concept, and presents its main composability property,
and several implementations on top of a message-passing system.

Keywords Atomicity · Composability · Concurrent object ·


Consistency condition · Distributed shared memory · Invalidation vs. update ·
Linearizability · Linearization point · Local property · Manager process ·
Object operation · Partial order on operations · Read/write register · Real time ·
Sequential specification · Server process · Shared memory abstraction ·
Total order broadcast abstraction

16.1 The Concept of a Distributed Shared Memory


Concurrent Objects and Sequential Specification An object is defined by a set
of operations and a specification that defines the meaning of these operations. A con-
current object is an object which can be accessed (possibly concurrently) by several
processes.
As an example, an unbounded stack S is defined by two operations denoted
S.push() and S.pop(), and the following specification. An invocation of S.push(v)
adds the value v to the stack object, and an invocation of S.pop() withdraws from
the stack and returns the last value that has been added to the stack. If the stack is
empty, S.pop() returns a predefined default value ⊥ (this default value is a value
that cannot be added to the stack).
As we can see, this specification uses the term “last” and consequently assumes
that the invocations of S.push() and S.pop() are totally ordered. Hence, it implicitly
refers to some notion of time. More precisely, it is a sequential specification, i.e., a
specification which defines the correct behaviors of an object by describing all the
sequences of operation executions which are allowed.
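As an illustration, this sequential specification of the unbounded stack can be written as a few lines of executable Python (with None standing for the default value ⊥):

class Stack:
    def __init__(self):
        self._items = []
    def push(self, v):
        self._items.append(v)          # add v to the stack
    def pop(self):
        # withdraw and return the last value added; ⊥ (None) if empty
        return self._items.pop() if self._items else None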


Fig. 16.1 Structure of a distributed shared memory

All the concurrent objects considered in the following are assumed to be defined
by a sequential specification.

Operations of a Register A register R is an object which can be accessed by two


operations denoted R.read() and R.write(). Intuitively, a register is atomic if (a)
each operation appears as if it has been executed instantaneously between its start
event and its end event, (b) no two writes appear as being executed simultaneously,
and (c) each read returns the value written by the last write which precedes it.
A formal specification of an atomic register is given below. A formal specifica-
tion of a sequentially consistent register will be given in the next chapter. These
definitions differ mainly in the underlying notion of time they use. Atomicity in-
volves real time, while sequential consistency involves a logical time.

Shared Memory A shared memory is a set of concurrent objects. At the basic


level of a centralized system these objects are the primitive read/write registers pro-
vided by the hardware. At a higher abstraction level, a shared memory can be made
up of more sophisticated objects such as shared queues, stacks, sets, etc.

Distributed Shared Memory System A distributed shared memory system is


a distributed algorithm that implements a shared memory abstraction on top of a
message-passing system. To that end, the algorithm uses the local memories of the
processes and the underlying message-passing system. The local memories are used
to store the physical representation of the objects, and messages are used by the
processes to cooperate in order to implement the operations on the objects so that
their specification is never violated.
The structure of a distributed shared memory is represented in Fig. 16.1. As
already indicated, according to the abstraction level which is realized, the shared
memory can be composed of read/write registers, or objects of a higher abstraction
level.

Fig. 16.2 Register: What values can be returned by read operations?

16.2 The Atomicity Consistency Condition

16.2.1 What Is the Issue?

Let us consider Fig. 16.2, which represents a computation involving two processes
accessing a shared register R. The process pw issues write operations, while the
process pr issues read operations (the notation R.read() ← v means that the value
returned by the corresponding read is v).
The question is: Which values v and v′ can be returned for this register execution
to be correct? As an example, do v = 0 and v′ = 2 define a correct execution? Or do
v = 2 and v′ = 1 define a correct execution? Are several correct executions possible
or is a single one possible?
The aim of a consistency condition is to answer this question. Whatever the ob-
ject (register, queue, etc.), there are several meaningful answers to this question, and
atomicity is one of them.

16.2.2 An Execution Is a Partial Order on Operations

By a slight abuse of language, we use the term “operation” also for “execution of
an operation on an object”. Let OP be the set of all the operations issued by the
processes.
A computation of a set of processes accessing a set of concurrent objects is a
partial order on the set of operations issued by the processes. This partial order,
denoted OP̂ = (OP, −→op), is defined as follows. Let op1 be any operation issued by
a process pi , and op2 be any operation issued by a process pj ; op1 is on object X,
while op2 is on object Y (possibly i = j or X = Y ). op1 −→op op2 if op1 terminated
before op2 started.
The projection of −→op on the operations issued by a process is called the process
order relation. As each process pi is sequential, the process order relation defines n
total orders (one per process). When op1 and op2 are operations on the same object
X, the projection of −→op on the operations on X is called the object order relation.
Two operations which are not ordered by −→op are said to be concurrent or over-
lapping. Otherwise they are non-overlapping.
The relation −→op associated with the computation described in Fig. 16.2 is de-
picted in Fig. 16.3. The first read operation issued by pr is concurrent with the two

Fig. 16.3 The relation −→op of the computation described in Fig. 16.2

last write operations issued by pw , while its last read operation is concurrent only
with the last write operation.

Remark Let us notice that this definition generalizes the definition of a message-
passing execution given in Sect. 6.1.2. A message-passing system is a system where
any directed pair of processes (pi , pj ) communicate by pi depositing (sending)
values (messages) in an object (the channel from pi to pj ), and pj withdrawing
(receiving) values from this object. Moreover, the inescapable transit time of each
message is captured by the fact that a value is always withdrawn after it has been
deposited.

Sequential Computation, Equivalent Computations A computation OP̂ is se-
quential if “−→op” is a total order.
Let α be any object X or any process pi . OP̂|α (OP̂ at α) denotes the projection
of OP̂ on α (i.e., the partial order involving only the operations accessing the object
α if α is an object, or issued by α if α is a process). As each process pi is sequential,
let us observe that OP̂|pi is a total order (the trace of the operations issued by pi ).
Two computations OP̂1 and OP̂2 are equivalent if, for any process pi , OP̂1 |pi
is the same as OP̂2 |pi . This means that, when they are equivalent, no process can
distinguish OP̂1 from OP̂2 .

Legality A sequential history is legal if it meets the sequential specification of all
its objects. This means that, for any object X, OP̂|X (i.e., the sequence of all the
operations accessing X) is a sequence that belongs to the specification of X.
If X is a register, this means that no read returns an overwritten value. If X is a
stack, this means that each invocation of pop() returns the last value that has been
added to the stack (“last” with respect to the order defined by the sequence OP̂|X),
etc.

16.2.3 Atomicity: Formal Definition

Atomic Computation A computation OP̂ is atomic (or linearizable) if there is a
sequential computation Ŝ such that
• OP̂ and Ŝ are equivalent (i.e., no process can distinguish OP̂ and Ŝ),
• Ŝ is legal (i.e., the specification of each object is respected), and

• The total order defined by Ŝ respects the partial order defined by OP̂ (i.e., what-
ever the processes that invoke them and the objects they access, any two oper-
ations ordered in OP̂ are ordered the same way in Ŝ; let us recall that the order of
non-overlapping operations in OP̂ captures their real-time order).
This definition means that, for OP̂ to be atomic, everything has to appear as if the
computation was (a) sequential and legal with respect to each object X (X behaves
as described by its sequential specification), and (b) in agreement with “real time”
(if an operation op1 terminates before another operation op2 starts, then op1 has to
appear before op2 in Ŝ).
Such an Ŝ is a computation that could have been obtained by executing all the
operations, one after the other, while respecting the occurrence order of all the non-
overlapping operations.

Atomic Register A register X is atomic (or behaves atomically) in a computation
OP̂ if the computation OP̂|X is atomic.
It follows from the definition of an atomic computation OP̂, and the definition
of legality, that the sequential execution Ŝ is such that, for any object X, Ŝ|X is
legal. Hence, if a computation OP̂ is atomic, so are all the objects involved in this
computation.

Linearization and Linearization Point If a computation OP̂ is linearizable, a se-
quential execution such as the previous sequence Ŝ is called a linearization of the
computation OP̂.
The very existence of a linearization Ŝ means that, from an external observer's
point of view, each operation could have been executed at a point of the time line
that lies between the time this operation starts and the time it ends. Such a point is
called the linearization point of the corresponding operation.
Let us notice that proving that an algorithm implements atomic consistency
amounts to identifying a linearization point for each of its operations, i.e., time
instants that respect the occurrence order of non-overlapping operations and are in
agreement with the sequential specification of each object.
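To make the definition executable, the following brute-force Python sketch searches for a linearization of a small register history; each operation is encoded as a tuple (start, end, kind, value), an encoding chosen for this sketch only. The search is exponential and is meant as an illustration of the definition, not as a practical checker.

from itertools import permutations

def is_linearizable(ops, initial=0):
    for seq in permutations(range(len(ops))):
        # respect real time: an operation placed later in the sequence must
        # not have terminated before an earlier one started
        if not all(not (ops[seq[b]][1] < ops[seq[a]][0])
                   for a in range(len(seq)) for b in range(a + 1, len(seq))):
            continue
        # legality: every read returns the most recently written value
        current, legal = initial, True
        for idx in seq:
            _, _, kind, value = ops[idx]
            if kind == "write":
                current = value
            elif value != current:
                legal = False
                break
        if legal:
            return True
    return False

ops = [(0, 1, "write", 1), (2, 4, "write", 2), (3, 5, "read", 2)]
print(is_linearizable(ops))   # prints True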

An Example An execution of a read/write register accessed by three processes is


described in Fig. 16.4. As we can see, the executions of the second read by pi , the
second write by pj , and the write by pk are overlapping. The line at the bottom of
the figure represents the time line of an omniscient external observer. Two dotted
arrows are associated with each operation. They meet at a point of the omniscient
external observer’s time line at which the corresponding operation could have been
instantaneously executed.
The linearization points represented by bullets on the observer’s time line de-
scribe a sequence Ŝ that (a) respects the occurrence order of non-overlapping op-
erations, and (b) belongs to the sequential specification of a register. It follows that,
in this execution, the register behaves as an atomic register.
Another computation is described in Fig. 16.5. It differs from the previous
one in the fact that the concurrent (overlapping) operations R.write(2) by pj and

Fig. 16.4 An execution of an atomic register

Fig. 16.5 Another execution of an atomic register

R.write(3) by pk are ordered in the reverse order. If the read operations issued by
pi and pj return the value 3, the register is atomic. If one of them returns another
value, it is not.
Let us finally notice that, as the second read by pi , the operation R.write(2) by
pj , and the operation R.write(3) by pk are all concurrent, it is possible that the
execution be such that the second read by pi appears as being linearized between
R.write(2) by pj and R.write(3) by pk . In that case, for R to be atomic, the second
read by pi has to return the value 2.

16.3 Atomic Objects Compose for Free

The Notion of a Local Property Let P be any property defined on a set of ob-
jects. The property P is said to be local if the set of objects as a whole satisfies P
whenever each object taken separately satisfies P .

Locality is an important concept that promotes modularity. Let us consider some


local property P . To prove that an entire set of objects satisfy P , we have only to en-
sure that each object, independently from the others, satisfies P . As a consequence,
the property P can be implemented for each object independently of the way it im-
plemented for the other objects. At one extreme, it is even possible to design an
implementation where each object has its own algorithm implementing P . At an-
other extreme, all the objects (whatever their type) might use the same algorithm to
implement P (each object using its own instance of the algorithm).

Atomicity Is a Local Property The following theorem, which is due to M. Her-


lihy and J. Wing (1990), shows that atomicity is a local property. Intuitively, the fact
that atomicity is local comes from the fact that it involves the real-time occurrence
order on non-overlapping operations whatever the objects and the processes con-
cerned by these operations. This point appears clearly in the proof of the theorem.

Theorem 27 A computation OP̂ is atomic (linearizable) if and only if each object
X involved in OP̂ is atomic (i.e., OP̂|X is atomic/linearizable).

Proof The “⇒” direction (only if) is an immediate consequence of the definition
of atomicity: If OP̂ is linearizable then, for each object X involved in OP̂, OP̂|X is
linearizable. So, the rest of the proof is restricted to the “⇐” direction.
Given an object X, let ŜX be a linearization of OP̂|X. It follows from the defi-
nition of atomicity that ŜX defines a total order on the operations involving X. Let
→X denote this total order. We construct an order relation → defined on the whole
set of operations of OP̂ as follows:
1. For each object X: →X ⊆ →,
2. −→op ⊆ →.
Basically, “→” totally orders all operations on the same object X, according to
→X (first item), while preserving −→op, i.e., the real-time occurrence order on the
operations (second item).

Claim “→ is acyclic”. This claim means that → defines a partial order on the set
of all the operations of OP̂.

Assuming this claim (see its proof below), it is thus possible to construct a se-
quential history Ŝ including all operations of OP̂ and respecting →. We trivially
have → ⊆ →S , where →S is the total order (on the operations) defined from Ŝ. We
have the three following conditions: (1) OP̂ and Ŝ are equivalent (they contain the
same operations, and the operations of each process are ordered the same way in OP̂
and Ŝ), (2) Ŝ is sequential (by construction) and legal (due to the first item stated
above), and (3) −→op ⊆ →S (due to the second item stated above and the relation
inclusion → ⊆ →S ). It follows that OP̂ is linearizable.

Proof of the Claim We show (by contradiction) that → is acyclic. Assume first that
→ induces a cycle involving the operations on a single object X. Indeed, as →X is
a total order, in particular transitive, there must be two operations opi and opj on X
such that opi →X opj and opj −→op opi .
• As opi →X opj and X is linearizable (respects object order, i.e., respects real-
time order), it follows that opi started before opj terminated (otherwise, we would
have opj →X opi ). Let us denote this as start[opi ] < term[opj ].
• Similarly, it follows from the definition of −→op that opj −→op opi ⇒ term[opj ] <
start[opi ].
These two items contradict each other, from which we conclude that, if there is a
cycle in →, it cannot come from two operations opi and opj , and an object X such
that opi →X opj and opj −→op opi .
It follows that any cycle must involve at least two objects. To obtain a contra-
diction we show that, in this case, a cycle in → implies a cycle in −→op (which is
acyclic). Let us examine the way the cycle could be obtained. If two consecutive
edges of the cycle are due to just some →X or just −→op, then the cycle can be short-
ened, as any of these relations is transitive. Moreover, opi →X opj →Y opk is not
possible for X ≠ Y , as each operation is on only one object (opi →X opj →Y opk
would imply that opj is on both X and Y ). So let us consider any sequence of edges
of the cycle such that: op1 −→op op2 →X op3 −→op op4 . We have:
• op1 −→op op2 ⇒ term[op1 ] < start[op2 ] (definition of −→op).
• op2 →X op3 ⇒ start[op2 ] < term[op3 ] (as X is linearizable).
• op3 −→op op4 ⇒ term[op3 ] < start[op4 ] (definition of −→op).
Combining these statements, we obtain term[op1 ] < start[op4 ], from which we can
conclude that op1 −→op op4 . It follows that any cycle in → can be reduced to a cycle
in −→op, which is a contradiction as −→op is an irreflexive partial order. (End of the
proof of the claim.) 

The Benefit of Locality Considering an execution of a set of processes that ac-


cess concurrently a set of objects, atomicity allows the programmer to reason as if
all the operations issued by the processes on the objects were executed one after the
other. The previous theorem is fundamental. It states that, to reason about sequen-
tial processes that access concurrent atomic objects, one can reason on each object
independently, without losing the atomicity property of the whole computation.

An Example Locality means that atomic objects compose for free. As an exam-
ple, let us consider two atomic queue objects Q1 and Q2, each with its own im-
plementation I 1 and I 2, respectively (hence, the implementations can use different
algorithms).
Let us define the object Q as the composition of Q1 and Q2 defined as follows
(Fig. 16.6). Q provides processes with the four following operations Q.enq1(),

Fig. 16.6 Atomicity allows objects to compose for free

Q.deq1(), Q.enq2(), and Q.deq2(), whose effect is the same as Q1.enq(),


Q1.deq(), Q2.enq() and Q2.deq(), respectively.
Thanks to locality, an implementation of Q consists simply in piecing together
I 1 and I 2 without any modification to their code. As we will see in the next chapter,
sequential consistency is not a local property. Hence, the previous object composi-
tion property is no longer true for sequential consistency.
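In code, such a composition amounts to plain delegation. A minimal sketch, where q1 and q2 stand for any two independently implemented atomic queues, could look as follows; no line of I1 or I2 is modified.

class ComposedQueue:
    def __init__(self, q1, q2):
        self.q1, self.q2 = q1, q2   # any two atomic queue implementations

    def enq1(self, v): return self.q1.enq(v)
    def deq1(self):    return self.q1.deq()
    def enq2(self, v): return self.q2.enq(v)
    def deq2(self):    return self.q2.deq()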

16.4 Message-Passing Implementations of Atomicity


As atomicity requires that non-overlapping operations be executed in their “real-
time” occurrence order, its implementations require a strong cooperation among
processes for the execution of each operation. This cooperation can be realized with
an appropriate underlying communication abstraction (such as total order broadcast),
or the association of a “server” process with each atomic object. Such implementa-
tions are described in this section.

16.4.1 Atomicity Based on a Total Order Broadcast Abstraction

Principle The total order broadcast abstraction has been introduced in Sect. 7.1.4
(where we also presented an implementation of it based on scalar clocks), and in
Sect. 12.4 (where coordinator-based and token-based implementations of it have
been described and proved correct).
This abstraction provides the processes with two operations, denoted to_
broadcast() and to_deliver(), which allow them to broadcast messages and deliver
these messages in the very same order. Let us recall that we then say that a process
to-broadcasts and to-delivers a message.
An algorithm based on this communication abstraction, that implements an
atomic object X, can be easily designed. Each process pi maintains a copy xi of
the object X, and each time pi invokes an operation X.oper(), it to-broadcasts a
message describing this operation and waits until it to-delivers this message.

operation X.oper (param) is


(1) resulti ← ⊥;
(2) to_broadcast OPERATION (i, oper , param);
(3) wait (resulti = ⊥);
(4) return(resulti ).

when a message OPERATION (j, oper , param) is to-delivered do


(5) r ← xi .oper (param);
(6) if (i = j ) then resulti ← r end if.

Fig. 16.7 From total order broadcast to atomicity

The Algorithm The corresponding algorithm is described in Fig. 16.7. Let


oper1 (), . . . , operm () be the operations associated with the object X. If X is a
read/write register, those are read() and write(). If X is a stack, they are push()
and pop(), etc. It is assumed that each operation returns a result (which can be a
default value for operations which have no data result such as write() or push()).
Moreover, ⊥ is a default control value which cannot be returned by an operation.
When a process pi invokes X.oper (param), it to-broadcasts a message OPERA -
TION () carrying the name of the operation, its input parameter, and the identity of
the invoking process (line 2). Then, pi is blocked until its operation has been applied
to its local copy xi of the object X (line 3). When this occurs, we have resulti = ⊥,
and pi returns this value (line 4).
When pi receives a message OPERATION (), it first applies the corresponding
operation to its local copy xi (line 5). Moreover, if it is the process that invoked this
operation, it saves its result in resulti .
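Assuming an underlying layer that provides to_broadcast(msg) and invokes on_to_deliver(msg) in the very same order at every process, the algorithm of Fig. 16.7 can be sketched in Python as follows (the event-based wait is an implementation choice of the sketch, standing for the wait statement of line 3).

import threading

class TotalOrderObject:
    def __init__(self, i, local_copy, to_broadcast):
        self.i = i
        self.x = local_copy              # x_i, the local copy of X
        self.to_broadcast = to_broadcast # assumed total order broadcast
        self.result = None               # result_i (None plays the role of ⊥)
        self.done = threading.Event()

    def invoke(self, oper, param):       # X.oper(param), lines 1-4
        self.done.clear()
        self.to_broadcast(("OPERATION", self.i, oper, param))
        self.done.wait()                 # wait until result_i is set
        return self.result

    def on_to_deliver(self, msg):        # lines 5-6, same order at all processes
        _, j, oper, param = msg
        r = getattr(self.x, oper)(param) # apply the operation to the local copy
        if j == self.i:
            self.result = r
            self.done.set()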

Theorem 28 The object X implemented by the algorithm described in Fig. 16.7 is


atomic.

Proof Let OP̂ be an execution involving the object X. Let us first observe that
the total order broadcast abstraction defines a total order on all the operations
invoked by the processes. This total order defines a sequence Ŝ which is trivially
(a) equivalent to OP̂, (b) respects its partial order on operations, and (c) is legal.
As this sequence Ŝ of operations is applied to each copy xi of X, each copy is a
correct implementation of X. 

Remark on the Locality Property of Atomicity While the previous implemen-


tation has considered a single object X, it works for any number of objects. This is
due to the fact that the atomicity consistency condition is a local property. Hence,
if atomicity is implemented
• with the to-broadcast abstraction for some objects (possibly, each object being
implemented with a specific implementation of to-broadcast),
• with different approaches (such as those presented below) for the other objects,
then each object behaves atomically, and consequently the whole execution is
atomic (Theorem 27).

Fig. 16.8 Why read operations have to be to-broadcast

The Case of Operations Which Do Not Modify the Objects Let us consider the
particular case where the object is a read/write register R. One could wonder why,
as the read operations do not modify the object and each process pi has a local copy
xi of R, it is necessary to to-broadcast the invocations of read().
To answer this question, let us consider Fig. 16.8. The register R is initialized
to the value 0. Process p1 invokes R.write(1), and consequently issues an under-
lying to-broadcast of OPERATION (1, write , 1). When a process pi to-delivers this
message, it assigns the new value 1 to xi . After it has to-delivered the message OP -
ERATION (1, write , 1), p3 invokes R.read() and, as x3 = 1, it returns the value 1.
Differently, the message OPERATION (1, write , 1) sent by p1 to p2 is slow, and p2
invokes R.read() before this message has been to-delivered. This read consequently
returns the current value of x2 , i.e., the initial value 0.
As the invocation of R.read() by p3 terminates before the invocation of R.read()
by p2 starts, we have (R.read() by p3 ) −→op (R.read() by p2 ), and consequently the
read by p3 has to be ordered (in Ŝ) before the read by p2 . But then, the sequence
Ŝ cannot be legal. This is because the read by p3 obtains the new value, while the
read by p2 (which occurs later with respect to real time, as formally captured by
−→op) obtains the initial value, which has been overwritten (as witnessed by the read
of p3 ).
Preventing such incorrect executions requires all operations to be totally ordered
with the to-broadcast abstraction. Hence, when implementing atomicity with to-
broadcast, even the operations which do not modify the object have to participate in
the to-broadcast.

16.4.2 Atomicity of Read/Write Objects Based on Server Processes

A way to implement atomic read/write registers without using an underlying total


order broadcast consists in associating a server process (also called manager) with
a set of registers. At one extreme, it is possible that a single server manages all the
registers, and at the other extreme, it is possible to have an independent server per

Fig. 16.9 Invalidation-based implementation of atomicity: message flow

register. This is a consequence of the fact that atomicity is a local property: If X and
Y are atomic, their implementations can be independent (i.e., the manager of X and
the manager of Y never need to cooperate).
Hence, without loss of generality, we consider in the following that there is a
single register X managed by a single server, denoted pX . In addition to pX , each
process pi has a local copy xi of X. The local copy at pX is sometimes called
primary copy. To simplify the presentation (and again without loss of generality)
we consider that the role of the server pX is only to manage X (i.e., it does not itself invoke operations on X).
The role of pX is to ensure atomicity by managing the local copies so that (a)
each read operation returns a correct value, and (b)—when possible—read opera-
tions are local, i.e., do not need to send or receive messages. To attain this goal, two
approaches are possible.
• Invalidation. In this case, at each write of X, the manager pX invalidates all the
local copies of X.
• Update. In this case, at each write of X, the manager pX updates all the local
copies of X.
The next two sections develop each of these approaches.

16.4.3 Atomicity Based on a Server Process and Copy Invalidation

Principle and Flow of Messages The principle of an invalidation-based implementation of atomicity is described in Fig. 16.9.

Fig. 16.9 Invalidation-based implementation of atomicity: message flow
When a process pi invokes X.write(v), it sends a request message (WRITE _
REQ (v)) to the manager of X, which updates its local copy and forwards an invali-
dation message to the processes which have a copy of X. When a process receives
this message, it invalidates its copy and sends by return to pX an acknowledgment (message ACK _ INV ()). When it has received all the acknowledgments, pX informs

pi that there are no more copies of the previous value and consequently the write
operation can terminate (message WRITE _ ACK ()).
Moreover, during the period starting at the reception of the message WRITE _
REQ () and finishing at the sending of the corresponding WRITE _ ACK (), pX be-
comes locked with respect to X, which means that it delays the processing of all the messages WRITE _ REQ () and READ _ REQ (). These messages can be processed only
when pX is not locked.
Finally, when the manager pX receives a message READ _ REQ () from a process
pj , it sends by return the current copy of X to pj . When pj receives it, it writes it
in its copy xj and, if it reads again X, it uses this value until it is invalidated. Hence,
some read operations need two messages, while others are purely local.
It is easy to see that the resulting register object X is atomic. All the write op-
erations are totally ordered by pX , and each read operation obtains the last written
value. The fact that the meaning of “last” is with respect to real time (as captured by −→op) follows from the following observation: When pX stops being locked (time τ in Fig. 16.9), only the current writer and itself have the last value of X.

The Algorithm The invalidation-based algorithm is described in Fig. 16.10. This code is the direct operational translation of the previous principle. The manager pX is such that xX is initialized to the initial value of X.
The local array hlvX [1..n] (for hold last value) is such that hlvX [i] is true if and
only if pi has a copy of the last value that has been written. The initialization of
this Boolean array is such that hlvX [i] is true if xi is initially equal to xX . If hlvX [i]
is initially false, xi = ⊥, which locally means that pi does not have the last value
of X.
The messages WRITE _ REQ () and READ _ REQ () remain in their input buffer until
they can be processed. It is assumed that every message is eventually processed (as
pX is the only process which receives these messages, this is easy to ensure).
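
As an illustration, here is a compact single-threaded Python sketch of this invalidation scheme. Direct method calls stand in for the WRITE_REQ()/INV()/ACK_INV()/WRITE_ACK() exchanges, so the locking is implicit (calls are processed one at a time); skipping the writer itself in the invalidation loop is a further simplifying assumption, and all names are illustrative rather than the book's code:

BOT = None                            # stands for the default value ⊥

class Manager:                        # plays the role of p_X
    def __init__(self, n, x0):
        self.x = x0                   # the primary copy x_X
        self.hlv = [True] * n         # hlv_X[i]: p_i holds the last value
        self.procs = []               # the processes p_1..p_n

    def write_req(self, i, v):        # lines 15-21 of Fig. 16.10, collapsed
        for j, pj in enumerate(self.procs):
            if self.hlv[j] and j != i:     # INV()/ACK_INV() round
                pj.invalidate()            # (skipping p_i is a simplification)
        self.x = v
        self.hlv = [False] * len(self.hlv)
        self.hlv[i] = True            # then WRITE_ACK() is sent back to p_i

    def read_req(self, i):            # lines 12-14 of Fig. 16.10
        self.hlv[i] = True
        return self.x                 # READ_ACK(x_X)

class Process:                        # an application process p_i
    def __init__(self, i, mgr, x0):
        self.i, self.mgr, self.x = i, mgr, x0
        mgr.procs.append(self)

    def invalidate(self):             # reception of INV()
        self.x = BOT

    def write(self, v):               # lines 1-3
        self.mgr.write_req(self.i, v)
        self.x = v

    def read(self):                   # lines 4-9
        if self.x is BOT:
            self.x = self.mgr.read_req(self.i)
        return self.x

mgr = Manager(3, 0)
p = [Process(i, mgr, 0) for i in range(3)]
p[0].write(5)
assert p[1].read() == 5 and p[2].read() == 5
p[1].write(7)                         # invalidates the copies at p0 and p2
assert p[0].read() == 7 and p[2].read() == 7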

16.4.4 Introducing the Notion of an Owner Process

The Notion of an Owner Process It is possible to enrich the previous algorithm by introducing the notion of a process owner. A process starts being the owner of a
register when it terminates a write on this register and stops being the owner when
this object is written or read by another process. This means that, when a process pi
owns a register X, it can issue several writes on X without contacting the manager
of X. Only its last write is “meaningful” in the sense that this write overwrites the
previous writes it has issued since it started being the current owner.
As we can see, as it is related to the pattern of read and write operations issued
by the processes, this ownership notion is essentially dynamic. As soon as a process
reads the object, there is no more owner until the next write operation. It is assumed
that, initially, there is no owner process and only the manager pX has the initial
value of X.

========= on the side of process pi , 1 ≤ i ≤ n =========


operation X.write(v) is
(1) xi ← v;
(2) send WRITE _ REQ (v) to pX ;
(3) wait WRITE _ ACK () from pX .

operation X.read() is
(4) if (xi = ⊥)
(5) then send READ _ REQ () to pX ;
(6) wait READ _ ACK (v) from pX ;
(7) xi ← v
(8) end if;
(9) return(xi ).

when a message INV () is received from pX do


(10) xi ← ⊥;
(11) send ACK _ INV () to pX .

========= on the side of the server pX ===============


when a message READ _ REQ () is received from pi do
(12) wait (¬lockedX );
(13) send READ _ ACK (xX ) to pi ;
(14) hlvX [i] ← true.

when a message WRITE _ REQ (v) is received from pi do


(15) wait (¬lockedX );
(16) lockedX ← true; xX ← v;
(17) for each j such that hlvX [j ] do send INV () to pj end for;
(18) wait (ACK _ INV () received from each j such that hlvX [j ]);
(19) hlvX [1..n] ← [false, . . . , false]; hlvX [i] ← true;
(20) send WRITE _ ACK () to pi ;
(21) lockedX ← false.

Fig. 16.10 Invalidation-based implementation of atomicity: algorithm

The Basic Pattern The basic pattern associated with the ownership notion is the
following one:
• A process pi writes X: It becomes the owner, which entails the invalidation of
all the copies of the previous value of X. Moreover, pi can continue to update its
local copy of X, without informing pX , until another process pj invokes a read
or a write operation.
• If the operation issued by pj is a write, it becomes the new owner, and the situa-
tion is as in the previous item.
• If the operation issued by pj is a read, the current owner pi is demoted: it is no longer the owner of X, and it is downgraded from the writing/reading mode to the reading-only mode. It can continue reading its copy xi (without passing
through the manager pX ), but its next write operation will have to be managed
by pX . Moreover, pi has to send the current value of xi to pX , and from now on
pX knows the last value of X.

operation X.write(v) is
(1) if (¬ owneri )
(2) then send WRITE _ REQ (v) to pX ;
(3) wait WRITE _ ACK () from pX ;
(4) owneri ← true
(5) end if;
(6) xi ← v.

operation X.read() is
(7) if (xi = ⊥)
(8) then send READ _ REQ () to pX ;
(9) wait READ _ ACK (v) from pX ;
(10) xi ← v
(11) end if;
(12) return(xi ).

when a message DOWNGRADE _ REQ (type) is received from pX do


(13) owneri ← false;
(14) if (type = w) then xi ← ⊥ end if;
(15) send DOWNGRADE _ ACK (xi ) to pX .

Fig. 16.11 Invalidation and owner-based implementation of atomicity (code of pi )

If another process pk wants to read X, it can then obtain from pX the last value of X, and keep reading its new copy until a new write operation invalidates all copies.
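
The following toy Python sketch illustrates this pattern; its names and its way of counting messages are assumptions made for the example, not the algorithm of the next figures:

class OwnedRegister:
    def __init__(self):
        self.owner = None            # identity of the current owner, or None
        self.value = 0               # last value known by the manager p_X
        self.local = {}              # local copies; an absent key models ⊥
        self.msgs = 0                # count of messages "sent"

    def write(self, i, v):
        if self.owner != i:          # must contact the manager: REQ + ACK
            self.msgs += 2
            self.local.clear()       # the previous copies are invalidated
            self.owner = i
        self.local[i] = v            # the owner writes locally, no message

    def read(self, i):
        if i not in self.local:      # READ_REQ/READ_ACK to the manager
            self.msgs += 2
            if self.owner is not None:   # DOWNGRADE_REQ/DOWNGRADE_ACK
                self.msgs += 2
                self.value = self.local[self.owner]
                self.owner = None
            self.local[i] = self.value
        return self.local[i]

R = OwnedRegister()
R.write(1, 10); R.write(1, 11); R.write(1, 12)
assert R.msgs == 2               # only the first write contacted the manager
assert R.read(2) == 12
assert R.msgs == 2 + 4           # READ_REQ/ACK plus DOWNGRADE_REQ/ACK

Only the last of the owner's three writes mattered, and the read by another process pays the downgrading sequence, as in the four-message worst case discussed below.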

The Read and Write Operations These operations are described in Fig. 16.11.
Each process pi manages an additional control variable, denoted owneri , which is
true if and only if pi is the current owner of X.
If, when pi invokes X.write(v), pi is the current owner of X, it has only to update its local copy xi (lines 1 and 6). Otherwise, it sends a message WRITE _ REQ (v) to the manager pX so that pX downgrades the previous owner, if any (line 2). When this downgrading has been done (line 3), pi becomes the new owner (line 4), writes v into its local copy xi , and the write terminates.
The algorithm implementing the read operation is similar to the one implement-
ing the write operation. If there is a local copy of X (xi ≠ ⊥), pi returns it. Other-
wise, it sends a message READ _ REQ () to pX in order to obtain the last value written
into X. This behavior is described by lines 7–12.
The lines 13–15 are related to the management of the ownership of X. When the manager pX discovers that pi is no longer the owner of X, pX sends to pi a message DOWNGRADE _ REQ (type), where type = w if the downgrading is due to a write operation, and type = r if it is due to a read operation. Hence, when it receives
DOWNGRADE _ REQ (type), pi first sets owneri to false. Then it sends by return to
pX an acknowledgment carrying its value of xi (which is the last value that has been
written if type = r).

when a message WRITE _ REQ (v) is received from pi do


(16) wait (¬lockedX );
(17) lockedX ← true;
(18) let send_to = {j such that hlvX [j ]} \ {i};
(19) for each j ∈ send_to do send DOWNGRADE _ REQ (w) to pj end for;
(20) wait (DOWNGRADE _ ACK () received from each pj such that j ∈ send_to );
(21) xX ← v; ownerX ← i;
(22) hlvX [1..n] ← [false, . . . , false]; hlvX [i] ← true;
(23) send WRITE _ ACK () to pi ;
(24) lockedX ← false.

when a message READ _ REQ () is received from pi do


(25) wait (¬lockedX );
(26) lockedX ← true;
(27) if (ownerX ≠ ⊥)
(28) let k = ownerX ;
(29) send DOWNGRADE _ REQ (r) to pk ;
(30) wait (DOWNGRADE _ ACK (v) received from pk );
(31) xX ← v; ownerX ← ⊥;
(32) end if;
(33) hlvX [i] ← true;
(34) send READ _ ACK (xX ) to pi ;
(35) lockedX ← false.

Fig. 16.12 Invalidation and owner-based implementation of atomicity (code of the manager pX )

Behavior of the Manager pX The process pX manages an additional variable ownerX , which contains the identity of the current owner of X. If there is no current owner, then ownerX = ⊥. The behavior of pX is described in Fig. 16.12. As in the
previous algorithm, the Boolean lockedX is used to ensure that pX processes one
write or read operation at a time.
When pX receives a message WRITE _ REQ (v) from a process pi , it first sends a
message DOWNGRADE _ REQ (w) to all the processes that have a copy of X, so that
they invalidate their copies of X (lines 19–20). Moreover, as the invalidation is due
to a write operation, the parameter type is set to w (because, in this case, no value
has to be returned). Then, pX updates its local context, namely, xX , ownerX , and hlvX [1..n] (lines 21–22), before returning an acknowledgment to pi , indicating the
write has terminated (line 23).
Let us notice that, when ownerX = ⊥, the local variable xX of the manager pX
has the last value of X. Hence, when pX receives a message READ _ REQ () from
a process pi , it sends by return to pi the value of xX if ownerX = ⊥, and updates
hlvX [i] (lines 27 and 33–34). Differently, if ownerX ≠ ⊥, pX has first to downgrade
the current owner from the writing mode to the reading mode and obtains from it
the last value of X (lines 27–32).
It is easy to see that the number of control messages (involved in read and
write operations), which are sent consecutively, varies from 0 (the last value is
in the local variable xi of the invoking process) up to 4 (namely, the sequence

Fig. 16.13 Update-based implementation of atomicity

WRITE _ REQ () – or READ _ REQ () –, DOWNGRADE _ REQ (), DOWNGRADE _ ACK (),
and WRITE _ ACK () – or READ _ ACK () –).

16.4.5 Atomicity Based on a Server Process and Copy Update

Principle The idea is very similar to the one used in the invalidation approach. It
differs in the fact that, when the manager pX learns a new value, instead of invali-
dating copies, it forwards the new value to the other processes.
To illustrate this principle, let us consider Fig. 16.13. When pX receives a mes-
sage WRITE _ REQ (v), it forwards the value v to all the other processes, and (as
previously) becomes locked until it has received all the corresponding acknowledg-
ments, which means that it processes sequentially all write requests. When a process
pi receives a message UPDATE & LOCK (v), it updates xi to v, sends an acknowledg-
ment to pX , and becomes locked. When pX has received all the acknowledgments,
it knows that all processes have the new value v. This time is denoted τ on the
figure. When this occurs, pX sends a message to all the processes to inform them
that the write has terminated. When a process pi receives this message, it becomes
unlocked, which means that it can again issue read or write operations.
It follows that, differently from the invalidation approach, all reads are purely local in the update approach (they require neither sending nor receiving messages).

The Algorithm The code of the algorithm, described in Fig. 16.14, is similar to
the previous one. This algorithm considers that there are several atomic objects X,
Y , etc., each managed by its own server pX , pY , etc. The local Boolean variable
lockedi [X] is used by pi to control the accesses of pi to X.
For each object X, the cooperation between the server pX and the application
processes pi , pj , etc., is locally managed by the Boolean variables lockedi [X],
lockedj [X], etc. Thanks to this cooperation, the server pX of each object X guar-
antees that X behaves atomically. As proved in Theorem 27, the servers pX , pY , etc., do not have to coordinate to ensure that the whole execution is atomic.

========= on the side of process pi , 1 ≤ i ≤ n =========


operation X.write(v) is
(1) wait (¬lockedi [X]);
(2) xi ← v;
(3) send WRITE _ REQ (v) to pX ;
(4) wait WRITE _ END () from pX .

operation X.read() is
(5) wait (¬lockedi [X]);
(6) return(xi ).

when a message UPDATE & LOCK (v) is received from pX do


(7) xi ← v;
(8) lockedi [X] ← true;
(9) send U & L _ ACK () to pX .

when a message WRITE _ END () is received from pX do


(10) lockedi [X] ← false.

========= on the side of the server pX ===============


when a message WRITE _ REQ (v) is received from pi do
(11) wait (¬lockedX );
(12) lockedX ← true;
(13) for each j ∈ {1, . . . , n} \ {i} do send UPDATE & LOCK (v) to pj end for;
(14) wait (U & L _ ACK () received from each j ∈ {1, . . . , n} \ {i});
(15) for each j ∈ {1, . . . , n} do send WRITE _ END () to pj end for;
(16) lockedX ← false.

Fig. 16.14 Update-based algorithm implementing atomicity

It is easy to see that, as in the invalidation-based algorithm, pX serializes all write operations, and that no read operation obtains an overwritten value.
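
As a minimal illustration of the update approach, the following Python sketch collapses the UPDATE&LOCK()/U&L_ACK()/WRITE_END() exchanges of Fig. 16.14 into synchronous method calls (an assumption that makes the explicit locking unnecessary); the names are illustrative:

class UpdateManager:                    # plays the role of p_X
    def __init__(self, copies):
        self.copies = copies            # the local copies x_1..x_n

    def write_req(self, i, v):          # lines 11-16 of Fig. 16.14, collapsed
        for j in range(len(self.copies)):
            if j != i:
                self.copies[j] = v      # UPDATE&LOCK(v) / U&L_ACK()
        # every copy now holds v; WRITE_END() then unlocks all the processes

copies = [0, 0, 0]
mgr = UpdateManager(copies)

def write(i, v):                        # lines 1-4: update locally, then p_X
    copies[i] = v
    mgr.write_req(i, v)

def read(i):                            # lines 5-6: purely local
    return copies[i]

write(0, 42)
assert all(read(i) == 42 for i in range(3))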

16.5 Summary
This chapter first introduced the concept of a distributed shared memory. It has then
presented, both from intuitive and formal points of view, the atomicity consistency
condition (also called linearizability). It then showed that atomicity is a consistency
condition that allows objects to be composed for free (a set of objects is atomic
if and only if each object is atomic). Then the chapter presented three message-
passing algorithms that implement atomic objects: a first one based on a total order
broadcast abstraction, a second one based on an invalidation technique, and a third
one based on an update technique.

16.6 Bibliographic Notes


• Atomic consistency is the implicit consistency condition used for von Neumann
machines. A formalization and an associated theory of atomicity are given by L.

Lamport in [228, 229]. An axiomatic presentation of atomicity for asynchronous hardware is given by J. Misra in [262].
• The notion of linearizability, which generalizes atomicity to any type of object
defined by a sequential specification, is due to M. Herlihy and J. Wing [183]. The
theorem stating that atomic objects compose for free (i.e., atomicity is a local
property, Theorem 27) is due to them.
• The invalidation-based algorithm that implements atomic consistency presented
in Sect. 16.4.3 is a simplified version of an algorithm managing a distributed
shared memory, which is due to K. Li and P. Hudak [234]. An early introductory
survey on virtual memory systems can be found in [106].
• The update-based algorithm presented in Sect. 16.4.5, which implements atomic
consistency, is from [32] (where it is used to implement data consistency in a
programming language dedicated to distributed systems).
• Consistency conditions for objects have given rise to many proposals. They are
connected to cache consistency protocols [5, 72, 119]. The interested reader will
find introductory surveys on consistency conditions in [3, 4, 286, 301, 323].
• Numerous consistency conditions weaker than atomicity have been investigated:
serializability [289], causal consistency [10], hybrid consistency [136], tuples
space [69], normality [151], non-sequential consistency conditions [21], slow
memory [194], lazy release consistency [204], timed consistency [371, 372],
pipelined RAM (PRAM) consistency [237], to cite a few.
• Generic algorithms which implement several consistency conditions are de-
scribed in [200, 212].
• Sequential consistency, which was introduced by L. Lamport [227], is (with atom-
icity) one of the most important consistency conditions. It is addressed in the next
chapter. Its connection with linearizability is investigated in [23, 313].
• Analysis of strong consistency conditions and their implementations can be found
in [157, 256].
• An object operation is polyadic if it is on several objects at the same time. As
an example, the queue operation (Q, Q′ ).add(), which adds Q′ at the end of Q,
is polyadic. Investigation of consistency conditions for objects whose operations
are polyadic can be found in [267, 326].

16.7 Exercises and Problems

1. Design an algorithm that implements read/write atomic objects, and uses update
(instead of invalidation) and the process ownership notion.
2. Extend the algorithm described in Figs. 16.11 and 16.12, which implements
atomicity with invalidation and the ownership notion, to obtain an algorithm that
implements a distributed shared memory in which each atomic object is a page
of a classical shared virtual memory.
Solution in [234].

3. When considering the update-based implementation of atomicity described in Fig. 16.14, describe a scenario where a process pi , which has issued a write operation, becomes locked because of a write operation issued by another process pj . Which write operation is ordered first?
4. Let us consider a shared memory computation ÔP = (OP, −→op). Does the problem of checking if ÔP is atomically consistent belong to the complexity class P or NP?
Chapter 17
Sequential Consistency

This chapter is on sequential consistency, a consistency condition for distributed shared memory, which is weaker than atomicity (linearizability). After having defined sequential consistency, this chapter shows that it is not a local property. Then, it presents two theorems which are of fundamental importance when one has to implement sequential consistency on top of asynchronous message-passing systems. Finally, the chapter presents and proves correct several distributed algorithms that implement sequential consistency. Sequential consistency was introduced by L. Lamport (1979).

Keywords Causal consistency · Concurrent object · Consistency condition · Distributed shared memory · Invalidation · Logical time · Manager process · OO constraint · Partial order on operations · Read/write register · Sequential consistency · Server processes · Shared memory abstraction · Total order broadcast abstraction · WW constraint

17.1 Sequential Consistency

17.1.1 Definition

Intuitive Definition Let us consider an execution of a set of processes accessing a set of concurrent objects. The resulting computation ÔP is sequentially consistent if it could have been produced (with the help of a scheduler) by executing it on a monoprocessor system. This means that, in a sequentially consistent execution, the operations of all the processes appear as if they have been executed in some sequential order, and the operations of each process appear in this total ordering in the order specified by its program.

Formal Definition: Sequentially Consistent Computation The formalism used below is the one that was introduced in Sect. 16.2.2, where it was shown that an execution of processes accessing concurrent objects is a partial order on the operations issued by these processes. The definition of sequential consistency is based on the “process order” relation that provides us with a formal statement that, each process being sequential, the operations it issued are totally ordered.


Fig. 17.1 A sequentially consistent computation (which is not atomic)

A computation ÔP is sequentially consistent if there exists a sequential computation Ŝ such that:
• ÔP and Ŝ are equivalent (i.e., no process can distinguish ÔP and Ŝ), and
• Ŝ is legal (the specification of each object is respected).
These two items are the same as for atomicity (which includes a third one related to real-time order). Atomicity is sequential consistency plus the fact that the operations that do not overlap (whatever the processes that issued them and the objects they access) are ordered in Ŝ according to their real-time occurrence order. Trivially, any computation that is atomic is also sequentially consistent, while the reverse is not true.
This is illustrated in the example depicted in Fig. 17.1. The computation described in this figure involves two processes and two registers, R and R′. It is sequentially consistent because there is a sequence Ŝ that (a) respects process order for each process, (b) respects the sequential specification of each register, and (c) is composed of the same operations (parameters and results) as the “real” computation. The sequence Ŝ consists of the execution of pj followed by that of pi , namely

R.writej (1), R′.writej (5), R.readj () → 1, R.writei (2), R′.readi () → 5, R.readi () → 2.

As we can see, this “witness” execution Ŝ does not respect the real time occurrence
order of the operations from different processes. Hence, it is not atomic.
The dotted arrow depicts what is called the read-from relation when the shared
objects are read/write registers. It indicates, for each read operation, the write
operation that wrote the value read. The read-from relation is the analog of the
send/receive relation in message-passing systems, with two main differences: (1) not
all values written are read, and (2) a value written can be read by several processes
and several times by the same process.
An example of a computation which is not sequentially consistent is described
in Fig. 17.2. Despite the fact that each value that is returned by a read operation has
been previously written, it is not possible to build a sequence Ŝ which respects
both process order and the sequential specification of each register.
Actually, determining whether a computation made up of processes accessing
concurrent read/write registers is sequentially consistent is an NP-complete prob-
lem (Taylor, 1983). This result rules out the design of efficient algorithms that would

Fig. 17.2 A computation which is not sequentially consistent

Fig. 17.3 A sequentially consistent queue

implement sequential consistency and no more. As we will see, efficient algorithms for sequential consistency implement more than this condition. (As shown by the simple algorithms implementing atomicity, which is stronger than sequential consistency.)

Formal Definition: Sequentially Consistent Object Given a computation ÔP and an object X, this object is sequentially consistent in ÔP if the computation ÔP|X is sequentially consistent. (Let us recall that ÔP|X is ÔP from which all the operations which are not on X have been suppressed.)

17.1.2 Sequential Consistency Is Not a Local Property

While atomic consistency is a local property, and consequently atomic objects com-
pose for free (see Sect. 16.3), this is no longer the case for sequentially consistent
objects. The following counter-example proves this claim.
Let us consider a computation with two processes accessing queues. Each queue
is accessed by the usual operations denoted enqueue(), which adds an item at the
head of the queue, and dequeue(), which withdraws from the queue the item at its
tail and returns it (⊥ is returned if the queue is empty).
The computation described in Fig. 17.3, which involves one queue denoted Q, is sequentially consistent. This follows from the fact that the sequence

Ŝ = Q.enqueuej (b), Q.dequeuej () → b, Q.enqueuei (a)
is legal and respects the order of the operations in each process. As this computation
involves a single object, trivially Q is sequentially consistent.
Let us now consider the computation described in Fig. 17.4, which involves two queues, Q and Q′. It is easy to see that each queue is sequentially consistent. The existence of the previous sequence Ŝ proves it for Q, and the existence of the following sequence proves it for Q′:

Ŝ′ = Q′.enqueuei (a′ ), Q′.dequeuei () → a′ , Q′.enqueuej (b′ ).

Fig. 17.4 Sequential consistency is not a local property

But the computation as a whole is not sequentially consistent: it is not possible to order all the operations in such a way that the corresponding sequence SEQ would be such that (a) the operations appear in their process order, and (b) both Q and Q′ satisfy their sequential specification. This is because, to produce such a sequence, we have to order first either Q.enqueuei (a) or Q′.enqueuei (a′ ), e.g., Q.enqueuei (a). Then, whatever the second operation placed in the sequence, we need to have Q.dequeuej () → a, instead of Q.dequeuej () → b, for Q to behave correctly.
It follows that, even if each object is sequentially consistent, the whole compu-
tation is not, i.e., sequential consistency is not a local property. From an implemen-
tation point of view, this will clearly appear in the implementation where a specific
manager is associated with each object (see Sect. 17.4). The managers are required
to cooperate to ensure that the computation is sequentially consistent. (As we have
seen in the previous chapter, atomic consistency does not need this cooperation.)

17.1.3 Partial Order for Sequential Consistency

Partial Order on Read and Write Operations As we have seen, differently from atomicity (linearizability), sequential consistency does not refer to the real-time order on the operations issued by processes.
Let us consider the case where the objects are read/write registers. Moreover, to
simplify the presentation, and without loss of generality, it is assumed that there is
an initial write on each register, and all the values written into a register are different.
Let wj (X, a) denote the write of a into X by pj , and ri (X)a denote a read of X by pi which returns the value a. When X and a are not relevant, these operations are denoted wi () and ri (). As already indicated, ri (X)a reads from wj (X, a). The set of all the “read from” pairs is captured by the relation denoted −→rf , i.e., if ri (X)a reads from wj (X, a), we have wj (X, a) −→rf ri (X)a.
In such a context, the relation −→po defining a read/write computation is defined as follows: op1 and op2 being two operations on registers, op1 −→po op2 if
• op1 and op2 have been issued by the same process with op1 first (process order relation), or
• op1 = wj (X, a) and op2 = ri (X)a (read from relation), or
• ∃ op such that op1 −→po op and op −→po op2 (transitivity).
Let us recall that two operations op1 and op2 such that ¬(op1 −→po op2 ) ∧ ¬(op2 −→po op1 ) are said to be concurrent or independent.

A Definition of Legality Customized for Read/Write Registers The notion of legality was introduced in the previous chapter to capture the fact that a sequential computation satisfies the sequential specification of the objects it accesses. As we consider here read/write objects (registers), the notion of legality is extended to any computation (not only sequential computations) and reformulated in terms of read/write registers.
A read/write-based computation ÔP is legal if, for each read operation ri (X)a, we have:
• ∃ wj (X, a), and
• ∄ wk (X, b) such that [wj (X, a) −→op wk (X, b)] ∧ [wk (X, b) −→op ri (X)a].
The first item states that any value returned by a read operation has previously been written, while the second item states that no read operation returns an overwritten value.
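
For the special case of a totally ordered (hence sequential) computation, this definition reduces to a simple check. The following Python sketch is illustrative only; the encoding of a history as a list of (kind, process, register, value) tuples is an assumption made for the example:

def is_legal(seq):
    last = {}                           # last value written into each register
    for kind, _, X, v in seq:
        if kind == "w":
            last[X] = v                 # the write overwrites previous values
        elif last.get(X) != v:          # a read of an unwritten or
            return False                # overwritten value is not legal
    return True

assert is_legal([("w", 1, "X", 1), ("r", 2, "X", 1)])
assert not is_legal([("w", 1, "X", 1), ("w", 2, "X", 2), ("r", 1, "X", 1)])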

17.1.4 Two Theorems for Sequentially Consistent Read/Write Registers

Considering processes accessing concurrent objects which are read/write registers, this section states two theorems which are important when designing algorithms which implement sequential consistency. These theorems are due to M. Mizuno, M. Raynal, and J.Z. Zhou (1994).

Two Additional Constraints Since checking sequential consistency is an NP-complete problem, the idea is to find additional constraints which can be (relatively easily) implemented and simplify the design of algorithms ensuring sequential consistency. Two such constraints have been identified.
• Write/write (WW) constraint. All write operations are totally ordered.
• Object-ordered (OO) constraint. Conflicting operations on each object are totally ordered. (A write operation conflicts with any read operation and any other write operation, while a read operation conflicts with any write operation.)
It is important to see that the constraint WW imposes an order on all write operations whatever the objects they are on, while the constraint OO imposes an order on each object taken individually.
When considering an execution that satisfies the WW constraint, the total order relation created by WW is denoted −→ww . Similarly, the order relation created by the constraint OO on an object X is denoted −→oo(X) .

Theorem 29 Let ÔP be a computation that satisfies the WW constraint. If ÔP is legal, it is sequentially consistent.

Fig. 17.5 Part of the graph G used in the proof of Theorem 29

Proof Assuming that ÔP is legal and satisfies the constraint WW, let us consider the directed graph denoted G and defined as follows. Its vertices are the operations, and there is an edge from op1 to op2 if (a) op1 −→ww op2 , or (b) op1 −→rf op2 , or (c) op1 and op2 have been issued by the same process with op1 first (process order at each pi ). As ÔP is acyclic, so is G. The proof consists of two steps.
First step. For each object X, we add edges to G so that all (conflicting) opera-
tions on X are ordered, and these additions preserve the acyclicity and legality of
the modified graph G.
Let us consider a register X. As all write operations on X are totally ordered, we have only to order the read operations on X with respect to the write operations on X. Let ri (X)a be a read operation. As ri (X)a is legal (assumption), there exists wj (X)a such that wj (X)a −→rf ri (X)a. (See Fig. 17.5, where the label associated with an edge explains it.)
Let wk (X) be any write operation such that wj (X)a −→ww wk (X). For any such wk (X), let us add an edge from ri (X)a to wk (X) (dotted edge in the figure). This addition cannot create a cycle because, due to the legality of G, there is no path from wk (X) to ri (X)a. Let us now show that this addition preserves the legality of G.
Legality can be violated if adding an edge creates a path from some write operation wx to some read operation ry . Hence, let us assume that the addition of the edge from ri (X)a to wk (X) creates a new path from wx to ry . It follows that, before adding the edge from ri (X)a to wk (X), there was a path from wx to ri (X)a and a path from wk (X) to ry . Due to the relation −→ww , two cases have to be analyzed.
• wx = wk (X) or wx −→ww wk (X). In this case, there is a path from wx to ry which does not use the edge from ri (X)a to wk (X). This path goes from wx to wk (X) and then from wk (X) to ry , and (by induction on the previous edges added to G) none of these paths violate legality.
• wk (X) −→ww wx . This case implies that there is a path from wk (X) to ri (X)a (this path goes from wk (X) to wx and then from wx to ri (X)a). But this path violates the assumption stating that the graph G built before adding the edge from ri (X)a to wk (X) is legal. Hence, this case cannot occur.

It follows that the addition of the edge from ri (X)a to wk (X) does not violate the legality of the updated graph G.
Second step. The previous step is repeated until, for each object X, all read operations on X are ordered with respect to all write operations. When this is done, let us consider any topological sort Ŝ of the resulting acyclic graph. It is easy to see that Ŝ preserves process order and legality (i.e., no process reads an overwritten value). (Reminder: a topological sort of a directed acyclic graph is a sequence of all its vertices, which preserves their partial ordering.) □

Theorem 30 Let ÔP be a computation that satisfies the OO constraint. If ÔP is legal, it is sequentially consistent.

The proof of this theorem is similar to the proof of the previous theorem. It is left
to the reader.

17.1.5 From Theorems to Algorithms

Theorems 29 and 30 provide us with a simple methodology and two approaches to design algorithms implementing sequential consistency.
On the approach side, an algorithm is based on the WW constraint or the OO con-
straint. On the methodology side, an algorithm has first to ensure the WW constraint
or the OO constraint, and then to only guarantee the legality of the read operations.
This modularity favors understanding, and simplifies both the design and the proof of the algorithms.
The algorithms implementing sequential consistency, which are described in the
rest of this chapter, follow this modular approach.

17.2 Sequential Consistency from Total Order Broadcast

The common principle on which the algorithms described in this section rely, which is due to H. Attiya and J.L. Welch (1994), consists in using the total order broadcast abstraction to implement the WW constraint. Due to Theorem 29, these algorithms have then to ensure only that the computation is legal.

17.2.1 A Fast Read Algorithm for Read/Write Objects

Total Order on Write Operations This algorithm uses the total order broadcast
abstraction to order all write operations. Moreover, as we are about to see, the legal-
ity of the read operations is obtained for free with a reading of the local copy of the
register X at pi , hence the name fast read algorithm.

Fig. 17.6 Fast read algorithm implementing sequential consistency (code for pi )

operation X.write(v) is
(1) to_broadcast SEQ _ CONS (i, X, v);
(2) receivedi ← false; wait (receivedi );
(3) return().

operation X.read() is
(4) return(xi ).

when SEQ _ CONS (j, Y, v) is to-delivered do
(5) yi ← v;
(6) if (j = i) then receivedi ← true end if.

The Fast Read Algorithm Each process pi maintains a copy xi of each read/write register X. When process pi invokes X.write(v) it uses an underlying to-broadcast algorithm to send the value v to all the processes (including itself). It then waits until it has to-delivered its own message. When process pi invokes X.read(), it returns the current value of its local copy xi of X. The text of the algorithm is described in Fig. 17.6.
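
To make the algorithm concrete, here is a small Python sketch in which the to-broadcast abstraction is simulated by a shared append-only list defining the unique to-delivery order (an assumption standing in for a real total order broadcast implementation); a process lagging behind the list models delayed deliveries:

to_sequence = []                       # the unique to-delivery order

class FastReadProcess:
    def __init__(self, i, n_regs):
        self.i = i
        self.x = [0] * n_regs          # local copies of all the registers
        self.delivered = 0             # prefix of to_sequence already applied

    def catch_up(self, until):         # to-deliver the pending messages
        while self.delivered < until:
            j, X, v = to_sequence[self.delivered]
            self.x[X] = v              # lines 5-6
            self.delivered += 1

    def write(self, X, v):             # lines 1-3: blocks until own delivery
        to_sequence.append((self.i, X, v))
        self.catch_up(len(to_sequence))

    def read(self, X):                 # line 4: fast, purely local
        return self.x[X]

p1, p2 = FastReadProcess(1, 2), FastReadProcess(2, 2)
p1.write(0, 5)
assert p1.read(0) == 5 and p2.read(0) == 0   # p2 still sees the old value
p2.catch_up(len(to_sequence))                # p2 eventually to-delivers it
assert p2.read(0) == 5

The stale read by p2 is allowed by sequential consistency (it is simply ordered before the write in Ŝ), while atomicity would forbid it.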

Theorem 31 The fast read algorithm implements sequential consistency.

Proof Let us first observe that, thanks to the to-broadcast of all the values which are written, the algorithm satisfies the WW constraint. Hence, due to Theorem 29, it only remains to show that no read operation obtains an overwritten value.
To that end, let us construct a sequence Ŝ by enriching the total order on write operations (−→ww ) as follows. Let SEQ _ CONS (j, X, v) and SEQ _ CONS (k, Y, v′ ) be the messages associated with any two write operations which are consecutive in −→ww . Due to the to-broadcast abstraction, any process to-delivers first SEQ _ CONS (j, X, v) and then SEQ _ CONS (k, Y, v′ ). For any process pi let us add (while respecting the process order defined by pi ) all the read operations issued by pi between the time it has to-delivered SEQ _ CONS (j, X, v) and the time it has to-delivered SEQ _ CONS (k, Y, v′ ). It follows from the algorithm that all these read operations obtain the last value written in each register X, Y , etc., where the meaning of last is with respect to the total order −→ww . It follows that, with respect to this total order, no read operation obtains an overwritten value, which concludes the proof of the theorem. □

Remark This implementation of sequential consistency shows an important difference between this consistency condition and atomicity. Both consistency conditions rely on a time notion to order the read and write operations. This time notion has to be in agreement with both process order and object order for atomicity, while it has to be in agreement only with each process order (taken individually) for sequential consistency.
The previous algorithm, which is based on a total order on all the write operations, does more than required: it would be sufficient to respect only process order

and ensure legality. Ordering all write operations allows for a simpler algorithm,
which ensures more than sequential consistency but less than atomicity.

17.2.2 A Fast Write Algorithm for Read/Write Objects

Non-blocking Write Operations It appears that, instead of forcing a write operation issued by a process pi to locally terminate only when pi has to-delivered the corresponding SEQ _ CONS () message, it is possible to have a fast write implementation in which write operations are never blocked. The synchronization price to obtain sequential consistency has then to be paid by the read operations.
The corresponding fast write algorithm and the previous fast read algorithm are dual algorithms. This duality offers a choice when one has to implement sequentially consistent applications. The fast write algorithm is more appropriate for write-intensive applications, while the fast read algorithm is more appropriate for read-intensive applications.

The Fast Write Algorithm As previously, each process pi maintains a copy xi of each read/write register X. Moreover, each process pi maintains a count of the number of messages SEQ _ CONS () it has to-broadcast and that are not yet to-delivered. This value is kept in the local variable nb_writei (which is initialized to 0). A read invoked by pi is allowed to terminate only when pi has to-delivered all the messages it has sent. When this occurs, the values written by pi are in its past, and consequently (as in the fast read algorithm) pi sees all its writes.
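
The counter mechanism can be sketched as follows, reusing the shared-list simulation of to-broadcast introduced above (still an assumption made for the example); the line numbers in the comments refer to Fig. 17.7 below:

to_sequence = []                        # simulated to-delivery order, as above

class FastWriteProcess:
    def __init__(self, i, n_regs):
        self.i, self.x = i, [0] * n_regs
        self.delivered, self.nb_write = 0, 0

    def deliver_one(self):              # to-delivery of the next message
        j, X, v = to_sequence[self.delivered]
        self.x[X] = v                   # line 6
        self.delivered += 1
        if j == self.i:
            self.nb_write -= 1          # line 7

    def write(self, X, v):              # lines 1-3: never blocks
        self.nb_write += 1
        to_sequence.append((self.i, X, v))

    def read(self, X):                  # lines 4-5: wait (nb_write_i = 0)
        while self.nb_write != 0:
            self.deliver_one()
        return self.x[X]

p = FastWriteProcess(1, 1)
p.write(0, 7)                    # returns immediately
assert p.read(0) == 7            # the read first to-delivers the pending write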

Theorem 32 The fast write algorithm implements sequential consistency.

Proof As before, due to Theorem 29, we have only to show that no read operation obtains an overwritten value. To that end, let ri (X)a be a read operation and wj (X, a) be the corresponding write operation. We have to show that

∄ wk (X, b) such that [wj (X, a) −→op wk (X, b)] ∧ [wk (X, b) −→op ri (X)a].

Let us assume by contradiction that such an operation wk (X, b) exists. There are two cases.
• k = i, i.e., wk (X, b) and ri (X)a have been issued by the same process pi . It follows from the read algorithm that nb_writei = 0 when ri (X)a is executed. As wk (X, b) −→op ri (X)a, this means that xi has been updated to the value b and, due to total order broadcast, this occurs after xi has been assigned the value a. It follows that pi cannot return the value a, a contradiction.
• k ≠ i, i.e., wk (X, b) and ri (X)a have been issued by different processes. Since wk (X, b) −→op ri (X)a and a ≠ b, there is an operation operi () (issued by pi ) such that wk (X, b) −→op operi () −→op ri (X)a (otherwise, pi would have read b). There are two subcases according to whether operi () is a read or a write operation.

Fig. 17.7 Fast write algorithm implementing sequential consistency (code for pi )

operation X.write(v) is
(1) nb_writei ← nb_writei + 1;
(2) to_broadcast SEQ _ CONS (i, X, v);
(3) return().

operation X.read() is
(4) wait (nb_writei = 0);
(5) return(xi ).

when SEQ _ CONS (j, Y, v) is to-delivered do
(6) yi ← v;
(7) if (j = i) then nb_writei ← nb_writei − 1 end if.

– operi () is a write operation. In this case, due to the variable nb_writei , pi has to wait for this write to locally terminate before executing ri (X)a. As wk (X, b) −→op operi (), it follows from the total order broadcast that xi has been assigned the value b before operi () terminates, and due to wj (X, a) −→op wk (X, b), this value overwrites a. Hence, ri (X)a cannot return a, which is a contradiction.
– operi () is a read operation. As wk (X, b) −→op operi (), if operi () is a read of X, it cannot return a, a contradiction. Otherwise operi () is a read of another object Y . But this implies again that xi has been updated to b before being read by pi , a contradiction as pi returns a. □

17.2.3 A Fast Enqueue Algorithm for Queue Objects

An interesting property of the previous total order-based fast read and fast write
algorithms lies in the fact that their skeleton (namely, total order broadcast and a
fast operation) can be used to design an algorithm implementing a fast enqueue
sequentially consistent queue. Such an implementation is presented in Fig. 17.8.
The algorithm implementing the operation Q.enqueue(v) is similar to that of the
fast write algorithm of Fig. 17.7, while the algorithm implementing the operation
Q.dequeue() is similar to that of the corresponding read algorithm. The algorithm assumes that the default value ⊥ can neither be enqueued nor represent the empty queue.

17.3 Sequential Consistency from a Single Server

17.3.1 The Single Server Is a Process

Principle Another way to ensure the WW constraint consists in using a dedicated server process which plays the role of the whole shared memory. Let psm be this

Fig. 17.8 Fast enqueue algorithm implementing a sequentially consistent queue (code for pi )

operation Q.enqueue(v) is
(1) to_broadcast SEQ _ CONS (i, Q, enq, v);
(2) return().

operation Q.dequeue() is
(3) resulti ← ⊥;
(4) to_broadcast SEQ _ CONS (i, Q, deq, −);
(5) wait (resulti ≠ ⊥);
(6) return(resulti ).

when SEQ _ CONS (j, Y, op, v) is to-delivered do
(7) if (op = enq)
(8) then enqueue v at the head of qi
(9) else r ← value dequeued from the tail of qi
(10) if (i = j ) then resulti ← r end if
(11) end if.

process. It manages all the objects. Moreover, each process pi manages a copy of
each object, and its object copies behave as a cache memory. As previously, for any
object Y (capital letter), yi (lowercase letter) denotes its local copy at pi .
When it invokes X.write(a), a process contacts psm , which thereby defines a total order on all the write operations. It also contacts psm when it invokes a read operation Y.read() and its local copy of Y is not up to date (i.e., yi = ⊥).
The object manager process psm keeps track of which processes have the last
value of each object. In that way, when psm answers a request from a process pi ,
psm can inform pi on which of its local copies are no longer up to date.

Local Variables Managed by psm In addition to a copy xsm of each read/write register object X, the manager psm maintains a Boolean array hlvsm [1..n, [X, Y, . . .]] whose meaning is the following:

hlvsm [i, X] ≡ (pi has the last value of X).

The Algorithm The algorithm is described in Fig. 17.9. When pi invokes X.write(v) it sends a request message to psm and waits for an acknowledgment (lines 1–2). The acknowledgment message WRITE _ ACK () informs pi on which of its object copies are obsolete. Consequently, when it receives this acknowledgment, pi invalidates them (line 3), updates xi (line 4), and terminates the write operation.
When pi invokes X.read(), it returns the value of its current copy (line 11) if this copy is up to date, which is captured by the predicate xi ≠ ⊥ (line 5). If this copy is not up to date (xi = ⊥), pi behaves as in the write operation (lines 6–9 are nearly the same as lines 1–4). It contacts the manager psm to obtain the last value of X and, as already indicated, psm benefits from the acknowledgment message to inform pi of which of its object copies do not contain the last written value.
The behavior of psm is simple. When it receives a message WRITE _ REQ (X, v) from a process pi , it saves the value v in xsm (line 12), and updates accordingly the entries of the array hlvsm (lines 13–14). It then computes the set (inval) of objects for

Fig. 17.9 Read/write sequentially consistent registers from a central manager

=============== code of pi =============
operation X.write(v) is
(1) send WRITE _ REQ (X, v) to psm ;
(2) wait (WRITE _ ACK (inval) from psm );
(3) for each Y ∈ inval do yi ← ⊥ end for;
(4) xi ← v.

operation X.read() is
(5) if (xi = ⊥)
(6) then send READ _ REQ (X) to psm ;
(7) wait (READ _ ACK (inval, v) from psm );
(8) for each Y ∈ inval do yi ← ⊥ end for;
(9) xi ← v
(10) end if;
(11) return(xi ).

============== code of psm =============
when WRITE _ REQ (X, v) is received from pi do
(12) xsm ← v;
(13) hlvsm [1..n, X] ← [false, . . . , false];
(14) hlvsm [i, X] ← true;
(15) inval ← {Y | ¬ hlvsm [i, Y ]};
(16) send WRITE _ ACK (inval) to pi .

when READ _ REQ (X) is received from pi do
(17) hlvsm [i, X] ← true;
(18) inval ← {Y | ¬ hlvsm [i, Y ]};
(19) send READ _ ACK (inval, xsm ) to pi .

which pi does not have the last value, and sends this set to pi (lines 15–16). The behavior of psm when it receives READ _ REQ (X) is similar.
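
The following Python sketch mirrors this protocol, with method calls standing in for the request and acknowledgment messages (a simplifying assumption); the names are illustrative:

BOT = None                                   # stands for ⊥

class CentralManager:                        # plays the role of p_sm
    def __init__(self, n, objs):
        self.x = dict(objs)                  # x_sm: one value per object
        self.hlv = {X: [True] * n for X in self.x}

    def inval(self, i):                      # lines 15 and 18
        return {X for X in self.x if not self.hlv[X][i]}

    def write_req(self, i, X, v):            # lines 12-16
        self.x[X] = v
        self.hlv[X] = [False] * len(self.hlv[X])
        self.hlv[X][i] = True
        return self.inval(i)

    def read_req(self, i, X):                # lines 17-19
        self.hlv[X][i] = True
        return self.inval(i), self.x[X]

class Client:                                # an application process p_i
    def __init__(self, i, mgr):
        self.i, self.mgr = i, mgr
        self.x = dict(mgr.x)                 # initial copies are valid

    def write(self, X, v):                   # lines 1-4
        for Y in self.mgr.write_req(self.i, X, v):
            self.x[Y] = BOT
        self.x[X] = v

    def read(self, X):                       # lines 5-11
        if self.x[X] is BOT:
            inval, v = self.mgr.read_req(self.i, X)
            for Y in inval:
                self.x[Y] = BOT
            self.x[X] = v
        return self.x[X]

mgr = CentralManager(2, {"X": 0, "Y": 0})
p0, p1 = Client(0, mgr), Client(1, mgr)
p0.write("X", 1)
assert p1.read("X") == 0      # stale but legal: ordered before the write
p1.write("Y", 2)              # contacting p_sm reveals the stale copy of X
assert p1.x["X"] is BOT and p1.read("X") == 1

The first read by p1 legally returns the old value of X: the stale copy is invalidated only when p1 next contacts psm, which is exactly what makes the algorithm sequentially consistent rather than atomic.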

Theorem 33 The algorithm described in Fig. 17.9 implements a sequentially consistent shared memory made up of read/write registers.

Proof Let us consider a computation of a set of processes cooperating through read/write registers whose accesses are all under the control of the algorithm of Fig. 17.9. We have to show that there is an equivalent sequential execution which is legal.
As the manager process psm trivially orders all the write operations, the con-
straint WW is satisfied. It follows from Theorem 29 that we have only to show that
the computation is legal, i.e., no read operation of a process obtains an overwritten
value.
To that end, considering a process pi , let op1 be one of its (read or write) op-
erations which entails communication with psm , and op2 its next (read or write)
operation entailing again communication with psm . It follows from the algorithm
that, between op1 and op2 , pi has executed only read operations, and these opera-
tions were local (they entailed no communication with psm ). Moreover, the value of
a register X read by pi between op1 and op2 is the value of xi when op1 terminates,

Fig. 17.10 Pattern of read/write accesses used in the proof of Theorem 33

which is the value of xsm when psm sent an acknowledgment to pi to terminate op1 .
This is illustrated in Fig. 17.10.
It follows that all the read operations of pi that occurred between op1 and op2
appear as having occurred after op1 and before the first operation processed by psm
after op1 (these read operations are inside an ellipsis and an arrow shows the point
at which—from the “read from” relation point of view—these operations seem to
have logically occurred). Hence, no read operation returns an overwritten value, and
the computation is legal, which concludes the proof of the theorem. 

17.3.2 The Single Server Is a Navigating Token

Ensuring the WW Constraint and the Legality of Read Operations The pro-
cess psm used in the previous algorithm is static and each pi sends (receives) mes-
sages to (from) it. The key point is that psm orders all write operations. Another
way to order all write operations consists in using a dynamic approach, namely a
navigating token (see Chap. 5). To write a value in a register, a process has first to
acquire the token. As there is a single token, this generates a total order on all the
write operations.
The legality of read operations can be ensured by requiring the token to carry
the same Boolean array hlv[1..n, [X, Y, . . .]] as the one used in the algorithm of
Fig. 17.9. Moreover, as the current owner of the token is the only process that can
read and write this array, the current owner can use it to provide the process pj to
which it sends the token with up to date values.

The Token-Based Algorithm The resulting algorithm is described in Fig. 17.11. Like the fast read algorithm (Fig. 17.6) and the fast write algorithm (Fig. 17.7) based on the total order broadcast abstraction, it never invalidates object copies at a process. As only the write operations require the token, it is consequently a fast read algorithm.
To execute a write operation on a register, the invoking process pi needs the token (line 1). When pi has obtained the token, it first updates its object copies which have been modified since the last visit of the token. The corresponding pairs (Y, w) are recorded in the set new_values carried by the token (lines 2–3). Then, pi

operation X.write(v) is
(1) acquire_token();
(2) let (hlv, new_values) be the pair carried by the token;
(3) for each (Y, w) ∈ new_values do yi ← w; hlv[i, Y ] ← true end for;
(4) hlv[1..n, X] ← [false, . . . , false];
(5) xi ← v; hlv[i, X] ← true;
(6) let pj be the process to which the token is sent;
(7) let new_values = {(Y, yi ) | ¬hlv[j, Y ]};
(8) add the pair (hlv, new_values) to the token;
(9) release_token() % for pj %.

operation X.read() is
(10) return(xi ).

Fig. 17.11 Token-based sequentially consistent shared memory (code for pi )

writes the new value of X into xi and updates accordingly hlv[i, X] and the other entries hlv[j, X] such that j ≠ i (lines 4–5).
On the token side, pi then computes, with the help of the vector hlv[j, [X, Y, . . .]], the set of pairs (Y, yi ) for which the next process pj that will have the token does not have the last values (lines 6–7). After it has added this set of pairs to the token (line 8), pi releases the token, which is sent to pj (line 9). Finally, the read operations are purely local (line 10).
Let us notice that, when it has the token, a process can issue several write oper-
ations on the same or several registers. The only requirement is that a process that
wants to write must eventually acquire the token.

On the Moves of the Token The algorithm assumes that when a process releases
the token, it knows the next user of the token. This can be easily implemented by
having the token move on a ring. In this case, the token acts as an object used to
update the local memories of the processes with the last values written. Between
two consecutive visits of the token, a process reads its local copies. The consistency
argument is the same as the one depicted in Fig. 17.10.
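
A toy Python simulation of this ring-based scheme follows; the single register, the fixed ring order, and the simplified token contents are assumptions made for brevity (the line numbers refer to Fig. 17.11):

n = 3
x = [0] * n                      # local copies; reads are always local (line 10)
hlv = [True] * n                 # carried by the token: who holds the last value
new_values = {}                  # values carried by the token for its next holder

def token_write(i, v):
    # acquire_token(): first apply the value the token carries for p_i (line 3)
    if i in new_values:
        x[i] = new_values[i]
        hlv[i] = True
    # write locally and become the only holder of the last value (lines 4-5)
    for k in range(n):
        hlv[k] = False
    x[i] = v
    hlv[i] = True
    # prepare the token for the next holder p_j on the ring (lines 6-8)
    j = (i + 1) % n
    new_values.clear()
    if not hlv[j]:
        new_values[j] = v

def read(i):
    return x[i]

token_write(0, 8)                     # p0 writes and passes the token to p1
assert read(0) == 8 and read(1) == 0  # p1's stale read is legal under SC
token_write(1, 9)                     # p1 first learns 8, then overwrites with 9
assert read(1) == 9 and read(2) == 0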

17.4 Sequential Consistency with a Server per Object

While the previous section presented algorithms based on the WW constraint, this section presents an algorithm based on the OO constraint.

17.4.1 Structural View

The structural view is described in Fig. 17.12. There is a manager process pX per
object X, and the set of managers {pA , . . . , pZ } implements, in a distributed way,
the whole shared memory.

Fig. 17.12 Architectural view associated with the OO constraint

Fig. 17.13 Why the object managers must cooperate

There is a bidirectional channel for every pair made up of an application process and a manager process. This is because we are interested in a genuine OO-based implementation of sequential consistency: at the application level, there is no notion of a channel and the application processes are assumed to communicate only by reading and writing shared registers.
Moreover, there is a channel connecting any pair of manager processes. As we
will see, this is required because, differently from atomicity, sequential consistency
is not a local property. The object managers have to cooperate to ensure that there
is a legal sequential sequence of operations that is equivalent to the real compu-
tation. Such a cooperation exists but does not appear explicitly in the algorithm
based on a single process psm managing all the objects (such as the one presented
in Sect. 17.3.1).

17.4.2 The Object Managers Must Cooperate

Let us consider the computation described in Fig. 17.13, where X and Y are initial-
ized to 0, in which the last read by p2 is missing. As it is equivalent to the following
legal sequence Ŝ,

Y.write2 (1), X.write2 (2), X.read1 () → 2, Y.write1 (3), X.read1 () → 2, X.write2 (4),

this computation is sequentially consistent.



Fig. 17.14 Sequential consistency with a manager per object: process side

operation X.write(v) is
(1) valid ← {Y | yi ≠ ⊥};
(2) send WRITE _ REQ (X, v, valid) to pX ;
(3) wait (WRITE _ ACK (inval) from pX );
(4) for each Y ∈ inval do yi ← ⊥ end for;
(5) xi ← v.

operation X.read() is
(6) if (xi = ⊥)
(7) then valid ← {Y | yi ≠ ⊥};
(8) send READ _ REQ (X, valid) to pX ;
(9) wait (READ _ ACK (inval, v) from pX );
(10) for each Y ∈ inval do yi ← ⊥ end for;
(11) xi ← v
(12) end if;
(13) return(xi ).

The full computation (including Y.read2 () → 1) is not sequentially consistent: There is no way to order all the operations while respecting both the process order relation and the sequential specification of both X and Y . This is due to the fact that, as the second read of p1 returns 2, the operation X.write2 (4) has to appear after it in a sequence (otherwise p1 could not have obtained the value 2). But then this write has to appear after the operation Y.write1 (3) issued by p1 , and consequently the read by p2 should return 3 for the computation to be sequentially consistent.
This means that, when p2 executes X.write2 (4), its local copy of Y , namely y2 = 1, no longer ensures the legality of its next read of this object Y . The manager of X has to cooperate with the manager of Y to discover it, so that p2 invalidates its local copy of Y .

17.4.3 An Algorithm Based on the OO Constraint

An algorithm based on the OO constraint is described in Fig. 17.14 (application process part) and Fig. 17.15 (object manager part).

On an Application Process Side The algorithms implementing the operations X.write(v) and X.read() are nearly the same as those of Fig. 17.9, which are for a single object (or a system where all the objects are managed by the same process psm ).
The only difference lies in the fact that a process pi adds to its request message (WRITE _ REQ () sent at line 2, and READ _ REQ () sent at line 8) the set of objects for which its local copies are valid (i.e., reading them before the current read or write operation would not violate sequential consistency). This information helps the manager of X (the object currently read or written by pi ) determine which of the local copies of pi must be invalidated because their future reads would not be legal with their previous values.

when WRITE _ REQ (X, v, valid) is received from pi do
(14) xX ← v;
(15) hlvX [1..n] ← [false, . . . , false];
(16) hlvX [i] ← true;
(17) inval ← check_legality(X, i, valid);
(18) send WRITE _ ACK (inval) to pi .

when READ _ REQ (X, valid) is received from pi do
(19) hlvX [i] ← true;
(20) inval ← check_legality(X, i, valid);
(21) send READ _ ACK (inval, xX ) to pi .

procedure check_legality(X, i, valid) is
(22) inval_set ← ∅;
(23) for each Y ∈ valid \ {X} do send VALID _ REQ (i) to pY end for;
(24) for each Y ∈ valid \ {X} do
(25) receive VALID _ ACK (rep) from pY ;
(26) if rep = no then inval_set ← inval_set ∪ {Y } end if
(27) end for;
(28) return(inval_set).

when VALID _ REQ (j ) is received from pY do
(29) if ((pX is processing a write request) ∨ ¬ hlvX [j ])
(30) then r ← no else r ← yes
(31) end if;
(32) send VALID _ ACK (r) to pY .

Fig. 17.15 Sequential consistency with a manager per object: manager side

On an Object Manager Side The behavior of the process pX managing the object X is described in Fig. 17.15. This manager maintains a Boolean vector hlvX [1..n] such that hlvX [i] is true if and only if pi has the last value of X.
hlvX [1..n] such that hlvX [i] is true if and only if pi has the last value of X.
The code of the algorithms associated with the reception of a message WRITE _
REQ (X, v, valid) or the reception of a message READ _ REQ (X, valid) is similar to its
counterpart of Fig. 17.9 (where all objects are managed by the same process psm ).
The difference lies in the way the legality of the objects is ensured. The matrix
hlvsm [1..n, [X, Y, . . .]] used by the centralized manager psm is now distributed on
all the object managers as follows: The vector hlvsm [1..n, X] is now managed only
by pX , which records it in its local vector hlvX [1..n].
The read or write operation currently invoked by pi may entail the invalidation of
its local copies of some objects, e.g., the object Y. This means that the computation,
by pX, of the set inval associated with this read or write operation requires that
pX cooperate with the other object managers. This cooperation is realized by the
local procedure denoted check_legality(X, i, valid).
As only the objects belonging to the set valid transmitted by pi can be
invalidated, pX sends a message VALID_REQ(i) to each manager pY such that
Y ∈ valid \ {X} (line 23). It then waits for an answer to each of these messages
(line 24), each answer carrying no or yes. The answer yes means that pY claims
that the current copy of Y at pi is still valid (its future reads by pi will still be
legal). The answer no means that the copy of Y at pi must be invalidated.

Fig. 17.16 Cooperation between managers is required by the OO constraint
The behavior of a manager pY, when it receives a message VALID_REQ(i) from
a manager pX, corresponds to lines 29–32 of Fig. 17.15 (this cooperation pattern
is illustrated in Fig. 17.16). When this occurs, pY (conservatively) asks pX to
invalidate the copy of Y at pi if it knows that this copy is no longer up to date.
This is the case if ¬hlvY[i] holds, or if pY is currently processing a write (which,
by construction, is a write of Y). In both cases, pY answers no. Otherwise, the
copy of Y at pi is still valid, and pY answers yes to pX.
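The following Python sketch mirrors this manager-side behavior (Fig. 17.15) under the same hypothetical channel assumptions as the process-side sketch above; peer[Y] is a channel to the manager pY, and the tuples returned by the handlers are assumed to be sent back to the requesting process by an enclosing receive loop, which is not shown.

class Manager:
    def __init__(self, n, X, peers):
        self.X = X
        self.value = None             # x_X: the current value of the object X
        self.hlv = [False] * n        # hlv_X[i]: p_i has the last value of X
        self.writing = False          # is a write request being processed?
        self.peer = peers

    def check_legality(self, i, valid):
        # lines 22-28: ask each manager p_Y, Y in valid \ {X}, whether p_i's
        # copy of Y is still valid, and collect the copies to invalidate
        inval = set()
        for Y in valid - {self.X}:
            self.peer[Y].send(("VALID_REQ", i))
        for Y in valid - {self.X}:
            (_, rep) = self.peer[Y].recv()
            if rep == "no":
                inval.add(Y)
        return inval

    def on_write_req(self, i, v, valid):          # lines 14-18
        self.writing = True
        self.value = v
        self.hlv = [False] * len(self.hlv)
        self.hlv[i] = True
        inval = self.check_legality(i, valid)
        self.writing = False
        return ("WRITE_ACK", inval)

    def on_read_req(self, i, valid):              # lines 19-21
        self.hlv[i] = True
        inval = self.check_legality(i, valid)
        return ("READ_ACK", inval, self.value)

    def on_valid_req(self, j):                    # lines 29-32
        r = "no" if (self.writing or not self.hlv[j]) else "yes"
        return ("VALID_ACK", r)

A real implementation must additionally let a manager answer VALID_REQ messages while it is itself blocked in check_legality (e.g., by handling them in a separate thread); this scheduling issue, which the sketch ignores, is precisely the inter-manager cooperation cost discussed below.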

Sequential Consistency vs. Atomicity/Linearizability The previous OO-based
implementation clearly shows the price that has to be paid to implement sequential
consistency: when each object is managed by a single process, the object
managers must cooperate to guarantee a correct implementation.
As seen in Sect. 16.4.3, this was not the case in the algorithm implementing atom-
icity (see the algorithm in Fig. 16.10). This has a very simple explanation: atomicity
(linearizability) is a local property (Theorem 27), while sequential consistency is
not (see Sect. 17.1.2).

17.5 A Weaker Consistency Condition: Causal Consistency

17.5.1 Definition

Underlying Intuition Causal consistency is a consistency condition for read/write
objects which is strictly weaker than sequential consistency. Its essence is to capture
only the causality relation defined by the read-from relation. To that end, it does not
require that all the processes agree on the very same legal sequential history Ŝ.
Two processes are not required to agree on the write operations which are concurrent;
they can see them in different orders. Read operations are, however, required
to be legal.

Fig. 17.17 An example of a causally consistent computation

Remark The notion of a sequential observation of a distributed message-passing
computation was introduced in Sect. 6.3.3. A sequential observation of a distributed
execution is a sequence including all its events, respecting their partial ordering.
Hence, distinct observations differ in the way they order concurrent events.
The notion of causal consistency is a similar notion applied to read/write objects.

Definition The set of operations that may affect a process pi are its read and
write operations, plus the write operations issued by the other processes. Given a
computation ÔP, let ÔPi be the computation from which all the read operations not
issued by pi have been removed.
A computation ÔP is causally consistent if, for any process pi, there is a legal
sequential computation Ĥi that is equivalent to ÔPi.
While this means that, for each process pi, ÔPi is sequentially consistent, it does
not mean that ÔP is sequentially consistent. (This type of consistency condition is
oriented toward cooperative work.)

Example 1 Let us first consider the computation described in Fig. 17.2. The following
sequential computations Ĥi and Ĥj can be associated with pi and pj, respectively:

Ĥi ≡ R′.writej(0), R′.readi(0), R′.writej(2), R′.readi(2),
Ĥj ≡ R.writei(0), R.readi(0), R.writej(1), R.readi(1).

As both Ĥi and Ĥj are legal, the computation is causally consistent.

Example 2 A second example is described in Fig. 17.17, where it is assumed
that the write operations are independent. It follows that p1 and p2 can order these
write operations in any order, and consequently, as the sequences

Ĥ1 ≡ R.write4(1), R.write3(2), R.read1() → 2, and
Ĥ2 ≡ R.write3(2), R.write4(1), R.read2() → 1

are legal, the computation is causally consistent.

Fig. 17.18 Another example of a causally consistent computation

Let us now assume that R.write4(1) −→op R.write3(2). In this case, both Ĥ1
and Ĥ2 must order R.write4(1) before R.write3(2), and for the computation to be
causally consistent, both read operations have to return the value 2. Let us observe
that, in this case, this computation is also atomic.

Example 3 A third example is described in Fig. 17.18, where the computation
is made up of three processes accessing the registers R and R′. Moreover, as before,
the operations R.write1(1) and R.write2(2) are concurrent.
If a = 2 (the value returned by the read of R by p3), the computation is causally
consistent. This is due to the following sequences, which (a) respect the relation
−→op and (b) are legal:

Ĥ1 ≡ R.write1(1), R.write2(2), R.read1() → 2, R′.write1(3),
Ĥ2 ≡ R.write2(2), and
Ĥ3 ≡ R.write1(1), R.write2(2), R′.write1(3), R′.read3() → 3, R.read3() → 2.

Let us notice that, in this case, the computation is also sequentially consistent.
If a = 1, the computation is still causally consistent, but it is no longer sequentially
consistent. In this case, p3 orders the concurrent write operations on R in the reverse
order, i.e., we have

Ĥ3 ≡ R.write2(2), R.write1(1), R′.write1(3), R′.read3() → 3, R.read3() → 1.

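The notion of legality used in all these examples can be captured by a few lines of code. The following Python sketch (the tuple encoding of operations is ours, not the book's) checks that every read of an object returns the value written by the closest preceding write on that object, and verifies the two orders in which p3 may see the concurrent writes of R.

def is_legal(seq):
    last = {}                                   # object -> last written value
    for (kind, obj, _proc, value) in seq:
        if kind == "write":
            last[obj] = value
        elif last.get(obj) != value:            # a read must return the last write
            return False
    return True

# operations are ("write"/"read", object, process, value) tuples
H3_if_a_is_2 = [("write", "R", 1, 1), ("write", "R", 2, 2), ("write", "R'", 1, 3),
                ("read", "R'", 3, 3), ("read", "R", 3, 2)]
H3_if_a_is_1 = [("write", "R", 2, 2), ("write", "R", 1, 1), ("write", "R'", 1, 3),
                ("read", "R'", 3, 3), ("read", "R", 3, 1)]
assert is_legal(H3_if_a_is_2) and is_legal(H3_if_a_is_1)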
17.5.2 A Simple Algorithm

Causal Consistency and Causal Message Delivery It is easy to see that, for
read/write registers, causal consistency is analogous to causal broadcast on message
deliveries. The notion of causal message delivery was introduced in Sect. 12.1, and
its broadcast instance was developed in Sect. 12.3. It states that no message m can
be delivered to a process before the messages broadcast in the causal past of m.

operation X.write(v) is
(1) co_broadcast CAUSAL_CONS(X, v);
(2) xi ← v.

operation X.read() is
(3) return(xi).

when CAUSAL_CONS(Y, v) is co-delivered do
(4) yi ← v.

Fig. 17.19 A simple algorithm implementing causal consistency

For causal consistency, the causal past is defined with respect to the relation −→op:
a write operation corresponds to the broadcast of a message, while a read operation
corresponds to a message reception (with the difference that a written value may
never be read, and the same value can be read several times).

A Simple Algorithm It follows from the previous discussion that a simple way
to implement causal consistency lies in using an underlying causal broadcast algo-
rithm. Several algorithms were presented in Sect. 12.3; they provide the processes
with the operations co_broadcast() and co_deliver(). We consider here the causal
broadcast algorithm presented in Fig. 12.10.
The algorithm is described in Fig. 17.19. Each process manages a copy xi of
every register object X. It is easy to see that both the read operation and the write
operation are fast. This is due to the fact that, as causality involves only the causal
past, no process coordination is required.
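The structure of this algorithm is easy to transcribe. The following Python sketch assumes a hypothetical causal broadcast object cb offering a co_broadcast(msg) operation and an on_deliver upcall invoked in causal delivery order (standing for the algorithms of Sect. 12.3); it is an illustration of Fig. 17.19, with the line numbers of the figure recalled in the comments.

class CausalMemory:
    def __init__(self, cb, objects):
        self.cb = cb
        self.copy = {X: 0 for X in objects}   # local copies, initialized to 0
        cb.on_deliver = self.co_deliver       # register the delivery upcall

    def write(self, X, v):
        # co_broadcast is assumed not to deliver the message back to its
        # sender: line 2 plays that role locally
        self.cb.co_broadcast(("CAUSAL_CONS", X, v))   # line 1
        self.copy[X] = v                              # line 2 (fast: no waiting)

    def read(self, X):
        return self.copy[X]                           # line 3 (fast)

    def co_deliver(self, msg):
        (_, Y, v) = msg                               # line 4
        self.copy[Y] = v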

17.5.3 The Case of a Single Object

A Simple Algorithm This section presents a very simple algorithm that imple-
ments causal consistency when there is a single object X. This algorithm is based
on scalar clocks (these logical clocks were introduced in Sect. 7.1.1).
The algorithm is described in Fig. 17.20. The scalar clock of pi is denoted clocki
and is initialized to 0.

operation X.write(v) is
(1) clocki ← clocki + 1;
(2) xi ← v;
(3) for each j ∈ {1, . . . , n} \ {i} do send CAUSAL_CONS(v, clocki) to pj end for.

operation X.read() is
(4) return(xi).

when CAUSAL_CONS(v, lw_date) is delivered do
(5) if (lw_date > clocki) then clocki ← lw_date; xi ← v end if.

Fig. 17.20 Causal consistency for a single object



As before, the read and write operations are fast. When a process invokes a write
operation, it increases its local clock, associates the corresponding date with the write,
and sends the message CAUSAL_CONS(v, clocki) to all the other processes. The
scalar clocks establish a total order on the write operations that are causally
dependent.
Write operations with the same date lw_date are concurrent. Only the first of
them that is received by a process pi is taken into account by pi. The algorithm
considers that, from the receiver pi's point of view, the other ones are overwritten
by the first one. Hence, two processes pi and pj can order concurrent write
operations differently.
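The following Python sketch transcribes Fig. 17.20 for one process pi, assuming a hypothetical broadcast(msg) function that sends msg to all the other processes and an on_deliver upcall invoked when a message arrives.

class SingleObjectCausal:
    def __init__(self, broadcast):
        self.clock = 0          # scalar clock clock_i of p_i
        self.x = 0              # local copy x_i of the single object X
        self.broadcast = broadcast

    def write(self, v):
        self.clock += 1                                   # line 1
        self.x = v                                        # line 2
        self.broadcast(("CAUSAL_CONS", v, self.clock))    # line 3

    def read(self):
        return self.x                                     # line 4

    def on_deliver(self, v, lw_date):
        # line 5: keep only values with a date more recent than the local
        # clock; among concurrent writes (equal dates), the first one
        # received wins, the other ones being seen as overwritten
        if lw_date > self.clock:
            self.clock = lw_date
            self.x = v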

From Causal Consistency to Sequential Consistency This algorithm can be
easily enriched if one wants to obtain sequential consistency instead of causal
consistency. To that end, logical clocks have to be replaced by timestamps (see
Sect. 7.1.2). Moreover, each process pi manages two additional local variables,
denoted last_writeri and lw_datei. To pi's knowledge, last_writeri contains the
identity of the last process that has written into X, and lw_datei contains the date
associated with this write. The pair ⟨lw_datei, last_writeri⟩ constitutes the timestamp
of the last write of X known by pi. The modifications of the algorithm are as
follows:
• Instead of sending the message CAUSAL_CONS(v, clocki) (line 3), a process pi
now sends the message CAUSAL_CONS(v, clocki, i).
• When it receives a message CAUSAL_CONS(v, lw_date, j), a process pi takes
it into account only if ⟨lw_date, j⟩ > ⟨lw_datei, last_writeri⟩. It then updates xi
to v and ⟨lw_datei, last_writeri⟩ to ⟨lw_date, j⟩.
In that way, the timestamps define a total order on all the write operations, and
no two processes order two write operations differently. The writes discarded by a
process correspond to values that it sees as overwritten.
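Extending the SingleObjectCausal sketch above, the timestamp-based modification amounts to comparing pairs ⟨date, process identity⟩ lexicographically (which is exactly how Python compares tuples). The clock update on delivery follows the usual timestamp rule of Sect. 7.1.2 and is an assumption of this sketch.

class SingleObjectSeqCons(SingleObjectCausal):
    def __init__(self, broadcast, pid):
        super().__init__(broadcast)
        self.pid = pid
        self.last_ts = (0, 0)   # the pair ⟨lw_date_i, last_writer_i⟩

    def write(self, v):
        self.clock += 1
        self.x = v
        self.last_ts = (self.clock, self.pid)
        self.broadcast(("CAUSAL_CONS", v, self.clock, self.pid))

    def on_deliver(self, v, lw_date, j):
        if (lw_date, j) > self.last_ts:       # lexicographic comparison
            # keep the local clock ahead of any date seen (assumed detail,
            # as with the timestamps of Sect. 7.1.2)
            self.clock = max(self.clock, lw_date)
            self.last_ts = (lw_date, j)
            self.x = v

Because all the processes now apply exactly the writes whose timestamps increase in one and the same total order, no two of them order two write operations differently.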

17.6 A Hierarchy of Consistency Conditions


Atomicity (linearizability), sequential consistency, causal consistency, and FIFO
consistency (presented in Problem 5) define a strict hierarchy of consistency
conditions. This hierarchy is described in Fig. 17.21.

17.7 Summary
This chapter introduced sequential consistency and causal consistency, and pre-
sented several algorithms which implement them. As far as sequential consistency
is concerned, it presented two properties (denoted WW and OO) which simplify the
design of algorithms implementing this consistency condition.

Fig. 17.21 Hierarchy of consistency conditions

17.8 Bibliographic Notes


• Sequential consistency was introduced by L. Lamport [227].
• Analysis of sequential consistency and its relation with atomicity can be found
in [23, 313]. It is shown in [364] that checking if a computation is sequentially
consistent is an NP-complete problem.
• Detection of violation of sequential consistency is addressed in [156].
• A lower bound on the time cost to ensure sequential consistency of read/write
registers is proved in [23]. Considering that the transit time of each message is
upper bounded by δ, it is shown that, whatever the algorithm, the sum of the
delays for a read operation and a write operation is at least δ. The fast read (resp.,
write) algorithm corresponds to the case where the read (resp., write) operation
costs 0 time units, while the write (resp., read) operation can cost up to δ time
units.
• Both the WW constraint and the OO constraint, Theorems 29 and 30, are due to
M. Mizuno, M. Raynal, and J.Z. Zhou [269]. The constraint-based approach is
due to T. Ibaraki, T. Kameda, and T. Minoura [196].
• The three algorithms based on the total order broadcast abstraction (fast read
algorithm, Sect. 17.2.1, fast write algorithm, Sect. 17.2.2, and fast enqueue algo-
rithm, Sect. 17.2.3) are due to H. Attiya and J.L. Welch [23]. Formal proofs of
these algorithms can be found in that paper.
• The algorithm described in Sect. 17.3.1 (sequential consistency with a single
server for all the objects) and the algorithm described in Sect. 17.4 (sequential
consistency with a server per object) are due to M. Mizuno, M. Raynal, and J.Z.
Zhou [269].
• The token-based algorithm implementing sequential consistency is due to M.
Raynal [314].
• An optimistic algorithm for sequential consistency is described in [268]. Other
algorithms implementing sequential consistency can be found in [320, 322].
• The implementation of sequential consistency when objects are accessed by mul-
tiobject operations is addressed in [326].
• Causal consistency was introduced by M. Ahamad, G. Neiger, J.E. Burns, P.W.
Hutto, and P. Kohli [10].
• Algorithms implementing causal consistency can be found in [10, 11, 92, 318].
An algorithm which allows processes to dynamically switch from sequential con-
sistency to causal consistency, and vice-versa, is described in [369].

• Other consistency conditions weaker than sequential consistency have been pro-
posed, e.g., slow memory [194], lazy release consistency [204], and PRAM con-
sistency [237] (to cite a few). See [323] for a short introductory survey.
• Normality is a consistency condition which considers process order and object
order [151]. When operations are on a single object, it is equivalent to sequential
consistency.
• Very general algorithms, which can be instantiated with one of many consistency
conditions, are described in [200, 212].

17.9 Exercises and Problems


1. Considering a concurrent stack defined by the classical operations push() and
pop(), describe an execution where the stack behaves atomically. Design then an
execution where the stack is sequentially consistent but not atomic.
2. Give a proof of Theorem 30.
3. Prove that the fast enqueue algorithm described in Fig. 17.8 implements a se-
quentially consistent queue.
4. Considering the fast write algorithm described in Sect. 17.2.2, let us replace
the variable nb_writei by an array nb_writei[X, Y, Z, . . .], where nb_writei[X]
counts the number of writes into X issued by pi and not yet to-delivered to pi.
When pi invokes X.read(), is it possible to replace the predicate
(nb_writei = 0) used at line 4 by the predicate (nb_writei[X] = 0)? If the answer
is “yes”, prove the corresponding algorithm. If the answer is “no”, give a
counterexample.
5. FIFO consistency requires that the write operations issued by each process pi are
seen by each process in the order in which they have been issued by pi . There is
no constraint on the write operations issued by different processes.
Describe an algorithm implementing FIFO consistency. Show that this con-
sistency condition is weaker than causal consistency.
6. Show that causal consistency plus the WW constraint provide sequential consis-
tency.
Solution in [322].
7. Extend the algorithm described in Sect. 17.3.1 so that it benefits from the notion
of an owner process introduced in Sect. 16.4.4. (A process pi owns an object X
from the time it writes it until the next read or write operation applied to X by
another process. As, during this period, no other process accesses X, pi can
write X several times without having to inform the manager process.)
8. Extend the algorithms described in Fig. 17.14 (application process algorithm) and
Fig. 17.15 (manager algorithm) so that they benefit from the notion of an owner
process.
Solution in [313].
9. Design an algorithm that implements causal consistency, where each process
manages copies of only a subset of the objects.
Solution in [318].
Afterword

The Aim of This Book

The practice of sequential computing has greatly benefited from the results of the
theory of sequential computing that were captured in the study of formal languages
and automata theory. Everyone knows what can be computed (computability) and
what can be computed efficiently (complexity). All these results constitute the foun-
dations of sequential computing, which, thanks to them, has become a science.
These theoretical results and algorithmic principles have been described in many
books from which students can learn basic results, algorithms, and principles of se-
quential computing (e.g., [99, 107, 148, 189, 205, 219, 258, 270, 351] to cite a few).
Since Lamport’s seminal paper “Time, clocks, and the ordering of events in a dis-
tributed system”, which appeared in 1978 [226], distributed computing is no longer
a set of tricks or recipes, but a domain of computing science with its own concepts,
methods, and applications. The world is distributed, and today most applications
are distributed. This means that message-passing algorithms are now an
important part of any computing science or computing engineering curriculum.
Thanks to appropriate curricula—and good associated books—students have a
good background in the theory and practice of sequential computing. In the same
spirit, an aim of this book is to try to provide them with an appropriate background
when they have to solve distributed computing problems.
Technology is what makes everyday life easier. Science is what allows us to
transcend it, and capture the deep nature of the objects we are manipulating. To that
end, it provides us with the right concepts to master and understand what we are
doing. Considering failure-free asynchronous distributed computing, an ambition of
this book is to be a step in this direction.


Most Important Concepts, Notions, and Mechanisms Presented in This Book

Chapter 1: Asynchronous/synchronous system, breadth-first traversal, broadcast,
convergecast, depth-first traversal, distributed algorithm, forward/discard principle,
initial knowledge, local algorithm, parallel traversal, spanning tree, unidirectional
logical ring.
Chapter 2: Distributed graph algorithm, cycle detection, graph coloring, knot de-
tection, maximal independent set, problem reduction, shortest path computation.
Chapter 3: Cut vertex, de Bruijn’s graph, determination of cut vertices, global func-
tion, message filtering, regular communication graph, round-based framework.
Chapter 4: Anonymous network, election, message complexity, process identity,
ring network, time complexity, unidirectional versus bidirectional ring.
Chapter 5: Adaptive algorithm, distributed queuing, edge/link reversal, mobile ob-
ject, mutual exclusion, network navigation, object consistency, routing, scalability,
spanning tree, starvation-freedom, token.
Chapter 6: Event, causal dependence relation, causal future, causal path, causal
past, concurrent (independent) events, causal precedence relation, consistent global
state, cut, global state, happened before relation, lattice of global states, observa-
tion, marker message, nondeterminism, partial order on events, partial order on
local states, process history, process local state, sequential observation.
Chapter 7: Adaptive communication layer, approximate causality relation, causal
precedence, causality tracking, conjunction of stable local predicates, detection of
a global state property, discarding old data, Hasse diagram, immediate predecessor,
linear (scalar) time (clock), logical time, matrix time (clock), message stability,
partial (total) order, relevant event, k-restricted vector clock, sequential observa-
tion, size of a vector clock, timestamp, time propagation, total order broadcast,
vector time (clock).
Chapter 8: Causal path, causal precedence, communication-induced checkpointing,
interval (of events), local checkpoint, forced local checkpoint, global checkpoint,
hidden dependency, recovery, rollback-dependency trackability, scalar clock, spon-
taneous local checkpoint, uncoordinated checkpoint, useless checkpoint, vector
clock, Z-dependence, zigzag cycle, zigzag pattern, zigzag path, zigzag prevention.
Chapter 9: Asynchronous system, bounded delay network, complexity, graph cov-
ering structure, physical clock drift, pulse-based programming, synchronizer, syn-
chronous algorithm.
Chapter 10: Adaptive algorithm, arbiter permission, bounded algorithm, deadlock-freedom,
directed acyclic graph, extended mutex, grid quorum, individual permission,
liveness property, mutual exclusion (mutex), preemption, quorum,
readers/writers problem, safety property, starvation-freedom, timestamp, vote.

Chapter 11: Conflict graph, deadlock prevention, graph coloring, incremental re-
quests, k-out-of-M problem, permission, resource allocation, resource graph, re-
source type, resource instance, simultaneous requests, static/dynamic (resource)
session, timestamp, total order, waiting chain, wait-for graph.
Chapter 12: Asynchronous system, bounded lifetime message, causal barrier,
causal broadcast, causal message delivery order, circulating token, client/server
broadcast, coordinator process, delivery condition, first in first out (FIFO) channel,
order properties on a channel, size of control information, synchronous system.
Chapter 13: Asynchronous system, client-server hierarchy, communication initia-
tive, communicating sequential processes, crown, deadline-constrained interac-
tion, deterministic vs. nondeterministic context, logically instantaneous commu-
nication, planned vs. forced interaction, rendezvous, multiparty interaction, syn-
chronous communication, synchronous system, token.
Chapter 14: AND receive, asynchronous system, atomic model, counting, diffusing
computation, distributed iteration, global state, k-out-of-n receive statement, loop
invariant, message arrival vs. message reception, network traversal, nondetermin-
istic statement, OR receive statement, reasoned construction, receive statement,
ring, spanning tree, stable property, termination detection, wave.
Chapter 15: AND communication model, cycle, deadlock, deadlock detection,
knot, one-at-a-time model, OR communication model, probe-based algorithm, re-
source vs. message, stable property, wait-for graph.
Chapter 16: Atomicity, composability, concurrent object, consistency condition,
distributed shared memory, invalidation vs. update, linearizability, linearization
point, local property, manager process, object operation, partial order on opera-
tions, read/write register, real time, sequential specification, server process, shared
memory abstraction, total order broadcast abstraction.
Chapter 17: Causal consistency, concurrent object, consistency condition, dis-
tributed shared memory, invalidation, logical time, manager process, OO con-
straint, partial order on operations, read/write register, sequential consistency,
server processes, shared memory abstraction, total order broadcast abstraction,
WW constraint.

How to Use This Book


This section presents two courses on distributed computing which can benefit from
the concepts, algorithms and principles presented in this book. Each course is a
one-semester course, and they are designed to be sequential (a full year at the un-
dergraduate level, or split, with the first course at the undergraduate level and the
second at the beginning of the graduate level).
• A first one-semester course on distributed computing could first focus on Part I,
which is devoted to graph algorithms. Then, the course could address (a) dis-
tributed mutual exclusion (Chap. 10), (b) causal message delivery and total order

broadcast (Chap. 12), and (c) distributed termination detection (Chap. 14), if time
permits.
The spirit of this course is to be an introductory course, giving students a cor-
rect intuition of what distributed algorithms are (they are not simple “extensions”
of sequential algorithms), and showing them that there are problems which are spe-
cific to distributed computing.
• A second one-semester course on distributed computing could first address the
concept of a global state (Chap. 6). The aim is here to give the student a precise
view of what a distributed execution is and introduce the notion of a global state.
Then, the course could develop and illustrate the different notions of logical times
(Chap. 7).
Distributed checkpointing (Chap. 8), synchronizers (Chap. 9), resource alloca-
tion (Chap. 11), rendezvous communication (Chap. 13), and deadlock detection
(Chap. 15), can be used to illustrate the previous notions.
Finally, the meaning and the implementation of a distributed shared memory
(Part VI) could be presented to introduce the notion of a consistency condition,
which is a fundamental notion of distributed computing.
Of course, this book can also be used by engineers and researchers who work
on distributed applications to better understand the concepts and mechanisms that
underlie their work.

From Failure-Free Systems to Failure-Prone Systems


This book was devoted to algorithms for failure-free asynchronous distributed ap-
plications and systems. Once the fundamental notions, concepts, and algorithms
of failure-free distributed computing are mastered, one can focus on more spe-
cific topics of failure-prone distributed systems. In such a context, the com-
bined effect of asynchrony and failures creates uncertainty that algorithms have
to cope with. The reader interested in the net effect of asynchrony and failure
on the design of distributed algorithms is invited to consult the following books:
[24, 67, 150, 155, 219, 242, 315, 316] (to cite a few).

A Series of Books
This book completes a series of four books, written by the author, devoted to concur-
rent and distributed computing [315–317]. More precisely, we have the following.
• As has been seen, this book is on elementary distributed computing for failure-
free asynchronous systems.
• The book [317] is on algorithms in asynchronous shared memory systems where
processes can commit crash failures. It focuses on the construction of reliable
concurrent objects in the presence of process crashes.

• The book [316] is on asynchronous message-passing systems where processes are
prone to crash failures. It presents communication and agreement abstractions
for fault-tolerant asynchronous distributed systems. Failure detectors are used to
circumvent impossibility results encountered in pure asynchronous systems.
• The book [315] is on synchronous message-passing systems, where the processes
are prone to crash failures, omission failures, or Byzantine failures. It focuses on
the following distributed agreement problems: consensus, interactive consistency,
and non-blocking atomic commit.

Enseigner, c’est réfléchir à voix haute devant les étudiants.


Henri-Léon Lebesgue (1875–1941)
Make everything as simple as possible, but not simpler.
Albert Einstein (1879–1955)
References

1. A. Acharya, B.R. Badrinath, Recording distributed snapshot based on causal order of mes-
sage delivery. Inf. Process. Lett. 44, 317–321 (1992)
2. A. Acharya, B.R. Badrinath, Checkpointing distributed application on mobile computers,
in 3rd Int’l Conference on Parallel and Distributed Information Systems (IEEE Press, New
York, 1994), pp. 73–80
3. S. Adve, K. Gharachorloo, Shared memory consistency models. IEEE Comput. 29(12), 66–
76 (1996)
4. S. Adve, M.D. Hill, A unified formalization of four shared memory models. IEEE Trans. Par-
allel Distrib. Syst. 4(6), 613–624 (1993)
5. Y. Afek, G.M. Brown, M. Merritt, Lazy caching. ACM Trans. Program. Lang. Syst. 15(1),
182–205 (1993)
6. A. Agarwal, V.K. Garg, Efficient dependency tracking for relevant events in concurrent sys-
tems. Distrib. Comput. 19(3), 163–183 (2007)
7. D. Agrawal, A. El Abbadi, An efficient and fault-tolerant solution for distributed mutual
exclusion. ACM Trans. Comput. Syst. 9(1), 1–20 (1991)
8. D. Agrawal, A. Malpini, Efficient dissemination of information in computer networks. Com-
put. J. 34(6), 534–541 (1991)
9. M. Ahamad, M.H. Ammar, S.Y. Cheung, Multidimensional voting. ACM Trans. Comput.
Syst. 9(4), 399–431 (1991)
10. M. Ahamad, G. Neiger, J.E. Burns, P.W. Hutto, P. Kohli, Causal memory: definitions, imple-
mentation and programming. Distrib. Comput. 9, 37–49 (1995)
11. M. Ahamad, M. Raynal, G. Thia-Kime, An adaptive protocol for implementing causally con-
sistent distributed services, in Proc. 18th Int’l Conference on Distributed Computing Systems
(ICDCS’98) (IEEE Press, New York, 1998), pp. 86–93
12. M. Ahuja, Flush primitives for asynchronous distributed systems. Inf. Process. Lett. 34, 5–12
(1990)
13. M. Ahuja, M. Raynal, An implementation of global flush primitives using counters. Parallel
Process. Lett. 5(2), 171–178 (1995)
14. S. Alagar, S. Venkatesan, An optimal algorithm for distributed snapshots with message causal
ordering. Inf. Process. Lett. 50, 310–316 (1994)
15. A. Alvarez, S. Arévalo, V. Cholvi, A. Fernández, E. Jiménez, On the interconnection of
message passing systems. Inf. Process. Lett. 105(6), 249–254 (2008)
16. L. Alvisi, K. Marzullo, Message logging: pessimistic, optimistic, and causal. IEEE Trans.
Softw. Eng. 24(2), 149–159 (1998)
17. E. Anceaume, J.-M. Hélary, M. Raynal, Tracking immediate predecessors in distributed com-
putations, in Proc. 14th Annual ACM Symposium on Parallel Algorithms and Architectures
(SPAA’02) (ACM Press, New York, 2002), pp. 210–219


18. E. Anceaume, J.-M. Hélary, M. Raynal, A note on the determination of the immediate pre-
decessors in a distributed computation. Int. J. Found. Comput. Sci. 13(6), 865–872 (2002)
19. D. Angluin, Local and global properties in networks of processors, in Proc. 12th ACM Sym-
posium on Theory of Computation (STOC’81) (ACM Press, New York, 1981), pp. 82–93
20. I. Arrieta, F. Fariña, J.-R. Mendívil, M. Raynal, Leader election: from Higham-Przytycka’s
algorithm to a gracefully degrading algorithm, in Proc. 6th Int’l Conference on Complex,
Intelligent, and Software Intensive Systems (CISIS’12) (IEEE Press, New York, 2012), pp.
225–232
21. H. Attiya, S. Chaudhuri, R. Friedman, J.L. Welch, Non-sequential consistency conditions
for shared memory, in Proc. 5th ACM Symposium on Parallel Algorithms and Architectures
(SPAA’93) (ACM Press, New York, 1993), pp. 241–250
22. H. Attiya, M. Snir, M. Warmuth, Computing on an anonymous ring. J. ACM 35(4), 845–876
(1988)
23. H. Attiya, J.L. Welch, Sequential consistency versus linearizability. ACM Trans. Comput.
Syst. 12(2), 91–122 (1994)
24. H. Attiya, J.L. Welch, Distributed Computing: Fundamentals, Simulations and Advanced
Topics, 2nd edn. (Wiley-Interscience, New York, 2004). 414 pages. ISBN 0-471-45324-2
25. B. Awerbuch, A new distributed depth-first search algorithm. Inf. Process. Lett. 20(3), 147–
150 (1985)
26. B. Awerbuch, Reducing complexities of the distributed max-flow and breadth-first algorithms
by means of network synchronization. Networks 15, 425–437 (1985)
27. B. Awerbuch, Complexity of network synchronization. J. ACM 32(4), 804–823 (1985)
28. O. Babaoğlu, E. Fromentin, M. Raynal, A unified framework for the specification and the
run-time detection of dynamic properties in distributed executions. J. Syst. Softw. 33, 287–
298 (1996)
29. O. Babaoğlu, K. Marzullo, Consistent global states of distributed systems: fundamental con-
cepts and mechanisms, in Distributed Systems (ACM/Addison-Wesley Press, New York,
1993), pp. 55–93. Chap. 4
30. R. Bagrodia, Process synchronization: design and performance evaluation for distributed al-
gorithms. IEEE Trans. Softw. Eng. SE15(9), 1053–1065 (1989)
31. R. Bagrodia, Synchronization of asynchronous processes in CSP. ACM Trans. Program.
Lang. Syst. 11(4), 585–597 (1989)
32. H.E. Bal, F. Kaashoek, A. Tanenbaum, Orca: a language for parallel programming of dis-
tributed systems. IEEE Trans. Softw. Eng. 18(3), 180–205 (1992)
33. R. Baldoni, J.M. Hélary, A. Mostéfaoui, M. Raynal, Impossibility of scalar clock-based
communication-induced checkpointing protocols ensuring the RDT property. Inf. Process.
Lett. 80(2), 105–111 (2001)
34. R. Baldoni, J.M. Hélary, A. Mostéfaoui, M. Raynal, A communication-induced checkpoint-
ing protocol that ensures rollback-dependency trackability, in Proc. 27th IEEE Symposium
on Fault-Tolerant Computing (FTCS-27) (IEEE Press, New York, 1997), pp. 68–77
35. R. Baldoni, J.-M. Hélary, M. Raynal, Consistent records in asynchronous computations. Acta
Inform. 35(6), 441–455 (1998)
36. R. Baldoni, J.M. Hélary, M. Raynal, Rollback-dependency trackability: a minimal character-
ization and its protocol. Inf. Comput. 165(2), 144–173 (2001)
37. R. Baldoni, G. Melideo, k-dependency vectors: a scalable causality-tracking protocol, in
Proc. 11th Euromicro Workshop on Parallel, Distributed and Network-Based Processing
(PDP’03) (2003), pp. 219–226
38. R. Baldoni, A. Mostéfaoui, M. Raynal, Causal delivery of messages with real-time data in
unreliable networks. Real-Time Syst. 10(3), 245–262 (1996)
39. R. Baldoni, R. Prakash, M. Raynal, M. Singhal, Efficient delta-causal broadcasting. Comput.
Syst. Sci. Eng. 13(5), 263–270 (1998)
40. R. Baldoni, M. Raynal, Fundamentals of distributed computing: a practical tour of vector
clock systems. IEEE Distrib. Syst. Online 3(2), 1–18 (2002)

41. D. Barbara, H. Garcia Molina, Mutual exclusion in partitioned distributed systems. Distrib.
Comput. 1(2), 119–132 (1986)
42. D. Barbara, H. Garcia Molina, A. Spauster, Increasing availability under mutual exclu-
sion constraints with dynamic vote assignments. ACM Trans. Comput. Syst. 7(7), 394–426
(1989)
43. L. Barenboim, M. Elkin, Deterministic distributed vertex coloring in polylogarithmic time.
J. ACM 58(5), 23 (2011), 25 pages
44. R. Bellman, Dynamic Programming (Princeton University Press, Princeton, 1957)
45. J.-C. Bermond, C. Delorme, J.-J. Quisquater, Strategies for interconnection networks: some
methods from graph theory. J. Parallel Distrib. Comput. 3(4), 433–449 (1986)
46. J.-C. Bermond, J.-C. König, General and efficient decentralized consensus protocols II, in
Proc. Int’l Workshop on Parallel and Distributed Algorithms, ed. by M. Cosnard, P. Quinton,
M. Raynal, Y. Robert (North-Holland, Amsterdam, 1989), pp. 199–210
47. J.-C. Bermond, J.-C. König, Un protocole distribué pour la 2-connexité. TSI. Tech. Sci. In-
form. 10(4), 269–274 (1991)
48. J.-C. Bermond, J.-C. König, M. Raynal, General and efficient decentralized consensus pro-
tocols, in Proc. 2nd Int’l Workshop on Distributed Algorithms (WDAG’87). LNCS, vol. 312
(Springer, Berlin, 1987), pp. 41–56
49. J.-C. Bermond, C. Peyrat, de Bruijn and Kautz networks: a competitor for the hypercube? in
Proc. Int’l Conference on Hypercube and Distributed Computers (North-Holland, Amster-
dam, 1989), pp. 279–284
50. J.M. Bernabéu-Aubán, M. Ahamad, Applying a path-compression technique to obtain an
effective distributed mutual exclusion algorithm, in Proc. 3rd Int’l Workshop on Distributed
Algorithms (WDAG’89). LNCS, vol. 392 (Springer, Berlin, 1989), pp. 33–44
51. A.J. Bernstein, Output guards and non-determinism in “communicating sequential pro-
cesses”. ACM Trans. Program. Lang. Syst. 2(2), 234–238 (1980)
52. B.K. Bhargava, S.-R. Lian, Independent checkpointing and concurrent rollback for recovery
in distributed systems: an optimistic approach, in Proc. 7th IEEE Symposium on Reliable
Distributed Systems (SRDS’88) (IEEE Press, New York, 1988), pp. 3–12
53. K. Birman, T. Joseph, Reliable communication in the presence of failures. ACM Trans. Com-
put. Syst. 5(1), 47–76 (1987)
54. A.D. Birell, B.J. Nelson, Implementing remote procedure calls. ACM Trans. Comput. Syst.
3, 39–59 (1984)
55. K.P. Birman, A. Schiper, P. Stephenson, Lightweight causal and atomic group multicast.
ACM Trans. Comput. Syst. 9(3), 272–314 (1991)
56. H.L. Bodlaender, Some lower bound results for decentralized extrema finding in ring of
processors. J. Comput. Syst. Sci. 42, 97–118 (1991)
57. L. Bougé, Repeated snapshots in distributed systems with synchronous communications and
their implementation in CSP. Theor. Comput. Sci. 49, 145–169 (1987)
58. L. Bougé, N. Francez, A compositional approach to super-imposition, in Proc. 15th Annual
ACM Symposium on Principles of Programming Languages (POPL’88) (ACM Press, New
York, 1988), pp. 240–249
59. A. Boukerche, C. Tropper, A distributed graph algorithm for the detection of local cycles and
knots. IEEE Trans. Parallel Distrib. Syst. 9(8), 748–757 (1998)
60. Ch. Boulinier, F. Petit, V. Villain, Synchronous vs asynchronous unison. Algorithmica 51(1),
61–80 (2008)
61. G. Bracha, S. Toueg, Distributed deadlock detection. Distrib. Comput. 2(3), 127–138 (1987)
62. D. Briatico, A. Ciuffoletti, L.A. Simoncini, Distributed domino-effect free recovery algo-
rithm, in 4th IEEE Symposium on Reliability in Distributed Software and Database Systems
(IEEE Press, New York, 1984), pp. 207–215
63. P. Brinch Hansen, Distributed processes: a concurrent programming concept. Commun.
ACM 21(11), 934–941 (1978)

64. J. Brzezinski, J.-M. Hélary, M. Raynal, Termination detection in a very general distributed
computing model, in Proc. 13th IEEE Int’l Conference on Distributed Computing Systems
(ICDCS’93) (IEEE Press, New York, 1993), pp. 374–381
65. J. Brzezinski, J.-M. Hélary, M. Raynal, M. Singhal, Deadlock models and a general algo-
rithm for distributed deadlock detection. J. Parallel Distrib. Comput. 31(2), 112–125 (1995)
(Erratum printed in Journal of Parallel and Distributed Computing, 32(2), 232 (1996))
66. G.N. Buckley, A. Silberschatz, An effective implementation for the generalized input-output
construct of CSP. ACM Trans. Program. Lang. Syst. 5(2), 223–235 (1983)
67. C. Cachin, R. Guerraoui, L. Rodrigues, Introduction to Reliable and Secure Distributed Pro-
gramming, 2nd edn. (Springer, Berlin, 2012), 367 pages. ISBN 978-3-642-15259-7
68. G. Cao, M. Singhal, A delay-optimal quorum-based mutual exclusion algorithm for dis-
tributed systems. IEEE Trans. Parallel Distrib. Syst. 12(12), 1256–1268 (2001)
69. N. Carriero, D. Gelernter, T.G. Mattson, A.H. Sherman, The Linda alternative to message-
passing systems. Parallel Comput. 20(4), 633–655 (1994)
70. O. Carvalho, G. Roucairol, On the distribution of an assertion, in Proc. First ACM Symposium
on Principles of Distributed Computing (PODC’1982) (ACM Press, New York, 1982), pp.
18–20
71. O. Carvalho, G. Roucairol, On mutual exclusion in computer networks. Commun. ACM
26(2), 146–147 (1983)
72. L.M. Censier, P. Feautrier, A new solution to coherence problems in multicache systems.
IEEE Trans. Comput. C-27(12), 1112–1118 (1978)
73. P. Chandra, A.K. Kshemkalyani, Causality-based predicate detection across space and time.
IEEE Trans. Comput. 54(11), 1438–1453 (2005)
74. S. Chandrasekaran, S. Venkatesan, A message-optimal algorithm for distributed termination
detection. J. Parallel Distrib. Comput. 8(3), 245–252 (1990)
75. K.M. Chandy, L. Lamport, Distributed snapshots: determining global states of distributed
systems. ACM Trans. Comput. Syst. 3(1), 63–75 (1985)
76. K.M. Chandy, J. Misra, Deadlock absence proof for networks of communicating processes.
Inf. Process. Lett. 9(4), 185–189 (1979)
77. K.M. Chandy, J. Misra, Distributed computation on graphs: shortest path algorithms. Com-
mun. ACM 25(11), 833–837 (1982)
78. K.M. Chandy, J. Misra, The drinking philosophers problem. ACM Trans. Program. Lang.
Syst. 6(4), 632–646 (1984)
79. K.M. Chandy, J. Misra, An example of stepwise refinement of distributed programs: quies-
cence detection. ACM Trans. Program. Lang. Syst. 8(3), 326–343 (1986)
80. K.M. Chandy, J. Misra, Parallel Program Design (Addison-Wesley, Reading, 1988), 516
pages
81. K.M. Chandy, J. Misra, L.M. Haas, Distributed deadlock detection. ACM Trans. Comput.
Syst. 1(2), 144–156 (1983)
82. E.J.H. Chang, Echo algorithms: depth-first algorithms on graphs. IEEE Trans. Softw. Eng.
SE-8(4), 391–402 (1982)
83. E.J.H. Chang, R. Roberts, An improved algorithm for decentralized extrema finding in cir-
cular configurations of processes. Commun. ACM 22(5), 281–283 (1979)
84. J.-M. Chang, N.F. Maxemchuck, Reliable broadcast protocols. ACM Trans. Comput. Syst.
2(3), 251–273 (1984)
85. A. Charlesworth, The multiway rendezvous. ACM Trans. Program. Lang. Syst. 9, 350–366
(1987)
86. B. Charron-Bost, Concerning the size of logical clocks in distributed systems. Inf. Process.
Lett. 39, 11–16 (1991)
87. B. Charron-Bost, G. Tel, Calcul approché de la borne inférieure de valeurs réparties. Inform.
Théor. Appl. 31(4), 305–330 (1997)
88. B. Charron-Bost, G. Tel, F. Mattern, Synchronous, asynchronous, and causally ordered com-
munications. Distrib. Comput. 9(4), 173–191 (1996)

89. C. Chase, V.K. Garg, Detection of global predicates: techniques and their limitations. Distrib.
Comput. 11(4), 191–201 (1998)
90. D.R. Cheriton, D. Skeen, Understanding the limitations of causally and totally ordered com-
munication, in Proc. 14th ACM Symposium on Operating System Principles (SOSP’93)
(ACM Press, New York, 1993), pp. 44–57
91. T.-Y. Cheung, Graph traversal techniques and the maximum flow problem in distributed com-
putation. IEEE Trans. Softw. Eng. SE-9(4), 504–512 (1983)
92. V. Cholvi, A. Fernández, E. Jiménez, P. Manzano, M. Raynal, A methodological construction
of an efficient sequentially consistent distributed shared memory. Comput. J. 53(9), 1523–
1534 (2010)
93. C.T. Chou, I. Cidon, I. Gopal, S. Zaks, Synchronizing asynchronous bounded delays net-
works, in Proc. 2nd Int’l Workshop on Distributed Algorithms (WDAG’87). LNCS, vol. 312
(Springer, Berlin, 1987), pp. 212–218
94. M. Choy, A.K. Singh, Efficient implementation of synchronous communication over asyn-
chronous networks. J. Parallel Distrib. Comput. 26, 166–180 (1995)
95. I. Cidon, Yet another distributed depth-first search algorithm. Inf. Process. Lett. 26(6), 301–
305 (1988)
96. I. Cidon, An efficient knot detection algorithm. IEEE Trans. Softw. Eng. 15(5), 644–649
(1989)
97. E.G. Coffman Jr., M.J. Elphick, A. Shoshani, System deadlocks. ACM Comput. Surv. 3(2),
67–78 (1971)
98. R. Cooper, K. Marzullo, Consistent detection of global predicates, in Proc. ACM/ONR Work-
shop on Parallel and Distributed Debugging (ACM Press, New York, 1991), pp. 163–173
99. Th.H. Cormen, Ch.E. Leiserson, R.L. Rivest, Introduction to Algorithms (The MIT Press,
Cambridge, 1998), 1028 pages
100. J.-M. Couvreur, N. Francez, M. Gouda, Asynchronous unison, in Proc. 12th IEEE Int’l Con-
ference on Distributed Computing Systems (ICDCS’92) (IEEE Press, New York, 1992), pp.
486–493
101. F. Cristian, Probabilistic clock synchronization. Distrib. Comput. 3(3), 146–158 (1989)
102. F. Cristian, H. Aghili, R. Strong, D. Dolev, Atomic broadcast: from simple message diffusion
to Byzantine agreement. Inf. Comput. 118(1), 158–179 (1995)
103. F. Cristian, F. Jahanian, A timestamping-based checkpointing protocol for long-lived dis-
tributed computations, in Proc. 10th IEEE Symposium on Reliable Distributed Systems
(SRDS’91) (IEEE Press, New York, 1991), pp. 12–20
104. O.P. Damani, Y.-M. Wang, V.K. Garg, Distributed recovery with k-optimistic logging. J. Par-
allel Distrib. Comput. 63(12), 1193–1218 (2003)
105. M.J. Demmer, M. Herlihy, The arrow distributed directory protocol, in Proc. 12th Int’l Sym-
posium on Distributed Computing (DISC’98). LNCS, vol. 1499 (Springer, Berlin, 1998), pp.
119–133
106. P.J. Denning, Virtual memory. ACM Comput. Surv. 2(3), 153–189 (1970)
107. P.J. Denning, J.B. Dennis, J.E. Qualitz, Machines, Languages and Computation (Prentice
Hall, New York, 1978), 612 pages
108. Cl. Diehl, Cl. Jard, Interval approximations of message causality in distributed executions,
in Proc. 9th Annual Symposium on Theoretical Aspects of Computer Science (STACS’92).
LNCS, vol. 577 (Springer, Berlin, 1992), pp. 363–374
109. E.W. Dijkstra, Solution of a problem in concurrent programming control. Commun. ACM
8(9), 569 (1965)
110. E.W. Dijkstra, The structure of “THE” multiprogramming system. Commun. ACM 11(5),
341–346 (1968)
111. E.W. Dijkstra, Hierarchical ordering of sequential processes. Acta Inform. 1, 115–138
(1971)
112. E.W. Dijkstra, Self stabilizing systems in spite of distributed control. Commun. ACM 17,
643–644 (1974)

113. E.W. Dijkstra, Guarded commands, nondeterminacy, and formal derivation of programs.
Commun. ACM 18(8), 453–457 (1975)
114. E.W. Dijkstra, W.H.J. Feijen, A.J.M. van Gasteren, Derivation of a termination detection
algorithm for distributed computations. Inf. Process. Lett. 16(5), 217–219 (1983)
115. E.W. Dijkstra, C.S. Scholten, Termination detection for diffusing computations. Inf. Pro-
cess. Lett. 11(1), 1–4 (1980)
116. D. Dolev, J.Y. Halpern, H.R. Strong, On the possibility and impossibility of achieving clock
synchronization. J. Comput. Syst. Sci. 33(2), 230–250 (1986)
117. D. Dolev, M. Klawe, M. Rodeh, An O(n log n) unidirectional distributed algorithm for ex-
trema finding in a circle. J. Algorithms 3, 245–260 (1982)
118. S. Dolev, Self-Stabilization (The MIT Press, Cambridge, 2000), 197 pages
119. M. Dubois, C. Scheurich, Memory access dependencies in shared memory multiprocessors.
IEEE Trans. Softw. Eng. 16(6), 660–673 (1990)
120. E.N. Elnozahy, L. Alvisi, Y.-M. Wang, D.B. Johnson, A survey of rollback-recovery proto-
cols in message-passing systems. ACM Comput. Surv. 34(3), 375–408 (2002)
121. E. Evangelist, N. Francez, S. Katz, Multiparty interactions for interprocess communication
and synchronization. IEEE Trans. Softw. Eng. 15(11), 1417–1426 (1989)
122. S. Even, Graph Algorithms, 2nd edn. (Cambridge University Press, Cambridge, 2011), 202
pages (edited by G. Even)
123. A. Fekete, N.A. Lynch, L. Shrira, A modular proof of correctness for a network synchro-
nizer, in Proc. 2nd Int’l Workshop on Distributed Algorithms (WDAG’87). LNCS, vol. 312
(Springer, Berlin, 1987), pp. 219–256
124. C.J. Fidge, Timestamps in message-passing systems that preserve the partial ordering, in Proc.
11th Australian Computing Conference (1988), pp. 56–66
125. C.J. Fidge, Logical time in distributed computing systems. IEEE Comput. 24(8), 28–33
(1991)
126. C.J. Fidge, Limitation of vector timestamps for reconstructing distributed computations. Inf.
Process. Lett. 68, 87–91 (1998)
127. M.J. Fischer, A. Michael, Sacrificing serializability to attain high availability of data, in Proc.
First ACM Symposium on Principles of Database Systems (PODS’82) (ACM Press, New
York, 1982), pp. 70–75
128. R.W. Floyd, Algorithm 97: shortest path. Commun. ACM 5(6), 345 (1962)
129. J. Fowler, W. Zwaenepoel, Causal distributed breakpoints, in Proc. 10th Int’l IEEE Con-
ference on Distributed Computing Systems (ICDCS’90) (IEEE Press, New York, 1990), pp.
134–141
130. N. Francez, Distributed termination. ACM Trans. Program. Lang. Syst. 2(1), 42–55 (1980)
131. N. Francez, B. Halpern, G. Taubenfeld, Script: a communication abstraction mechanism. Sci.
Comput. Program. 6(1), 35–88 (1986)
132. N. Francez, M. Rodeh, Achieving distributed termination without freezing. IEEE Trans.
Softw. Eng. 8(3), 287–292 (1982)
133. N. Francez, S. Yemini, Symmetric intertask communication. ACM Trans. Program. Lang.
Syst. 7(4), 622–636 (1985)
134. W.R. Franklin, On an improved algorithm for decentralized extrema-finding in circular con-
figurations of processors. Commun. ACM 25(5), 336–337 (1982)
135. U. Fridzke, P. Ingels, A. Mostéfaoui, M. Raynal, Fault-tolerant consensus-based total order
multicast. IEEE Trans. Parallel Distrib. Syst. 13(2), 147–157 (2002)
136. R. Friedman, Implementing hybrid consistency with high-level synchronization operations,
in Proc. 12th Annual ACM Symposium on Principles of Distributed Computing (PODC’93)
(ACM Press, New York, 1993), pp. 229–240
137. E. Fromentin, Cl. Jard, G.-V. Jourdan, M. Raynal, On-the-fly analysis of distributed compu-
tations. Inf. Process. Lett. 54(5), 267–274 (1995)
138. E. Fromentin, M. Raynal, Shared global states in distributed computations. J. Comput. Syst.
Sci. 55(3), 522–528 (1997)

139. E. Fromentin, M. Raynal, V.K. Garg, A.I. Tomlinson, On the fly testing of regular patterns in
distributed computations, in Proc. Int’l Conference on Parallel Processing (ICPP’94) (1994),
pp. 73–76
140. R. Fujimoto, Parallel discrete event simulation. Commun. ACM 33(10), 31–53 (1990)
141. E. Gafni, D. Bertsekas, Distributed algorithms for generating loop-free routes in networks
with frequently changing topologies. IEEE Trans. Commun. C-29(1), 11–18 (1981)
142. R.G. Gallager, Distributed minimum hop algorithms. Tech Report LIDS 1175, MIT, 1982
143. R.G. Gallager, P.A. Humblet, P.M. Spira, A distributed algorithm for minimum-weight span-
ning trees. ACM Trans. Program. Lang. Syst. 5(1), 66–77 (1983)
144. I.C. Garcia, E. Buzato, Progressive construction of consistent global checkpoints, in Proc.
19th Int’l Conference on Distributed Computing Systems (ICDCS’99) (IEEE Press, New
York, 1999), pp. 55–62
145. I.C. Garcia, L.E. Buzato, On the minimal characterization of the rollback-dependency tracka-
bility property, in Proc. 21st Int’l Conference on Distributed Computing Systems (ICDCS’01)
(IEEE Press, New York, 2001), pp. 342–349
146. I.C. Garcia, L.E. Buzato, An efficient checkpointing protocol for the minimal characteri-
zation of operational rollback-dependency trackability, in Proc. 23rd Int’l Symposium on
Reliable Distributed Systems (SRDS’04) (IEEE Press, New York, 2004), pp. 126–135
147. H. Garcia Molina, D. Barbara, How to assign votes in a distributed system. J. ACM 32(4),
841–860 (1985)
148. M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-
Completeness (Freeman, New York, 1979), 340 pages
149. V.K. Garg, Principles of Distributed Systems (Kluwer Academic, Dordrecht, 1996), 274
pages
150. V.K. Garg, Elements of Distributed Computing (Wiley-Interscience, New York, 2002), 423
pages
151. V.K. Garg, M. Raynal, Normality: a consistency condition for concurrent objects. Parallel
Process. Lett. 9(1), 123–134 (1999)
152. V.K. Garg, S. Skawratananond, N. Mittal, Timestamping messages and events in a distributed
system using synchronous communication. Distrib. Comput. 19(5–6), 387–402 (2007)
153. V.K. Garg, B. Waldecker, Detection of weak unstable predicates in distributed programs.
IEEE Trans. Parallel Distrib. Syst. 5(3), 299–307 (1994)
154. V.K. Garg, B. Waldecker, Detection of strong unstable predicates in distributed programs.
IEEE Trans. Parallel Distrib. Syst. 7(12), 1323–1333 (1996)
155. Ch. Georgiou, A. Shvartsman, Do-All Computing in Distributed Systems: Cooperation in the
Presence of Adversity (Springer, Berlin, 2008), 219 pages. ISBN 978-0-387-69045-2
156. K. Gharachorloo, P. Gibbons, Detecting violations of sequential consistency, in Proc. 3rd
ACM Symposium on Parallel Algorithms and Architectures (SPAA’91) (ACM Press, New
York, 1991), pp. 316–326
157. K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, J.L. Hennessy, Memory
consistency and event ordering in scalable shared memory multiprocessors, in Proc. 17th
ACM Int’l Symposium on Computer Architecture (ISCA’90) (1990), pp. 15–26
158. A. Gibbons, Algorithmic Graph Theory (Cambridge University Press, Cambridge, 1985),
260 pages
159. D.K. Gifford, Weighted voting for replicated data, in Proc. 7th ACM Symposium on Operat-
ing System Principles (SOSP’79) (ACM Press, New York, 1979), pp. 150–172
160. V.D. Gligor, S.H. Shattuck, Deadlock detection in distributed systems. IEEE Trans. Softw.
Eng. SE-6(5), 435–440 (1980)
161. A.P. Goldberg, A. Gopal, A. Lowry, R. Strom, Restoring consistent global states of dis-
tributed computations, in Proc. ACM/ONR Workshop on Parallel and Distributed Debugging
(ACM Press, New York, 1991), pp. 144–156
162. M. Gouda, T. Herman, Stabilizing unison. Inf. Process. Lett. 35(4), 171–175 (1990)
163. J. Gray, A. Reuter, Transaction Processing: Concepts and Techniques (Morgan Kaufmann,
San Mateo, 1993), 1060 pages. ISBN 1-55860-190-2

164. J.L. Gross, J. Yellen (eds.), Graph Theory (CRC Press, Boca Raton, 2004), 1167 pages
165. A.N. Habermann, Prevention of system deadlocks. Commun. ACM 12(7), 373–377 (1969)
166. J.W. Havender, Avoiding deadlocks in multitasking systems. IBM Syst. J. 13(3), 168–192
(1971)
167. J.-M. Hélary, Observing global states of asynchronous distributed applications, in Proc. 3rd
Int’l Workshop on Distributed Algorithms (WDAG’87). LNCS, vol. 392 (Springer, Berlin,
1987), pp. 124–135
168. J.-M. Hélary, M. Hurfin, A. Mostéfaoui, M. Raynal, F. Tronel, Computing global functions in
asynchronous distributed systems with perfect failure detectors. IEEE Trans. Parallel Distrib.
Syst. 11(9), 897–909 (2000)
169. J.-M. Hélary, C. Jard, N. Plouzeau, M. Raynal, Detection of stable properties in distributed
systems, in Proc. 6th ACM Symposium on Principles of Distributed Computing (PODC’87)
(ACM Press, New York, 1987), pp. 125–136
170. J.-M. Hélary, A. Maddi, M. Raynal, Controlling information transfers in distributed applica-
tions, application to deadlock detection, in Proc. Int’l IFIP WG 10.3 Conference on Parallel
Processing (North-Holland, Amsterdam, 1987), pp. 85–92
171. J.-M. Hélary, A. Mostéfaoui, R.H.B. Netzer, M. Raynal, Communication-based prevention
of useless checkpoints in distributed computations. Distrib. Comput. 13(1), 29–43 (2000)
172. J.-M. Hélary, A. Mostéfaoui, M. Raynal, A general scheme for token and tree-based dis-
tributed mutual exclusion algorithms. IEEE Trans. Parallel Distrib. Syst. 5(11), 1185–1196
(1994)
173. J.-M. Hélary, A. Mostéfaoui, M. Raynal, Communication-induced determination of consis-
tent snapshots. IEEE Trans. Parallel Distrib. Syst. 10(9), 865–877 (1999)
174. J.-M. Hélary, A. Mostéfaoui, M. Raynal, Interval consistency of asynchronous distributed
computations. J. Comput. Syst. Sci. 64(2), 329–349 (2002)
175. J.-M. Hélary, R.H.B. Netzer, M. Raynal, Consistency criteria for distributed checkpoints.
IEEE Trans. Softw. Eng. 25(2), 274–281 (1999)
176. J.-M. Hélary, N. Plouzeau, M. Raynal, A distributed algorithm for mutual exclusion in arbi-
trary networks. Comput. J. 31(4), 289–295 (1988)
177. J.-M. Hélary, M. Raynal, Depth-first traversal and virtual ring construction in distributed
systems, in Proc. IFIP WG 10.3 Conference on Parallel Processing (North-Holland, Ams-
terdam, 1988), pp. 333–346
178. J.-M. Hélary, M. Raynal, Vers la construction raisonnée d’algorithmes répartis: le cas de la
terminaison. TSI. Tech. Sci. Inform. 10(3), 203–209 (1991)
179. J.-M. Hélary, M. Raynal, Synchronization and Control of Distributed Systems and Programs
(Wiley, New York, 1991), 160 pages
180. J.-M. Hélary, M. Raynal, Towards the construction of distributed detection programs with an
application to distributed termination. Distrib. Comput. 7(3), 137–147 (1994)
181. J.-M. Hélary, M. Raynal, G. Melideo, R. Baldoni, Efficient causality-tracking timestamping.
IEEE Trans. Knowl. Data Eng. 15(5), 1239–1250 (2003)
182. M. Herlihy, F. Kuhn, S. Tirthapura, R. Wattenhofer, Dynamic analysis of the arrow distributed
protocol. Theory Comput. Syst. 39(6), 875–901 (2006)
183. M. Herlihy, J. Wing, Linearizability: a correctness condition for concurrent objects. ACM
Trans. Program. Lang. Syst. 12(3), 463–492 (1990)
184. L. Higham, T. Przytycka, A simple efficient algorithm for maximum finding on rings. Inf.
Process. Lett. 58(6), 319–324 (1996)
185. D.S. Hirschberg, J.B. Sinclair, Decentralized extrema finding in circular configuration of
processors. Commun. ACM 23, 627–628 (1980)
186. C.A.R. Hoare, Communicating sequential processes. Commun. ACM 21(8), 666–677 (1978)
187. W. Hohberg, How to find biconnected components in distributed networks. J. Parallel Distrib.
Comput. 9(4), 374–386 (1990)
188. R.C. Holt, Comments on prevention of system deadlocks. Commun. ACM 14(1), 36–38
(1971)
189. J.E. Hopcroft, R. Motwani, J.D. Ullman, Introduction to Automata Theory, Languages and
Computation, 2nd edn. (Addison-Wesley, Reading, 2001), 521 pages
190. S.-T. Huang, Termination detection by using distributed snapshots. Inf. Process. Lett. 32(3),
113–119 (1989)
191. S.-T. Huang, Detecting termination of distributed computations by external agents, in Proc.
9th IEEE Int’l Conference on Distributed Computing Systems (ICDCS’89) (IEEE Press, New
York, 1989), pp. 79–84
192. M. Hurfin, N. Plouzeau, M. Raynal, Detecting atomic sequences of predicates in distributed
computations. SIGPLAN Not. 28(12), 32–42 (1993). Proc. ACM/ONR Workshop on Parallel
and Distributed Debugging
193. M. Hurfin, M. Mizuno, M. Raynal, S. Singhal, Efficient distributed detection of conjunctions
of local predicates. IEEE Trans. Softw. Eng. 24(8), 664–677 (1998)
194. P. Hutto, M. Ahamad, Slow memory: weakening consistency to enhance concurrency in dis-
tributed shared memories, in Proc. 10th IEEE Int’l Conference on Distributed Computing
Systems (ICDCS’90) (IEEE Press, New York, 1990), pp. 302–311
195. T. Ibaraki, T. Kameda, A theory of coteries: mutual exclusion in distributed systems. IEEE
Trans. Parallel Distrib. Syst. 4(7), 779–794 (1993)
196. T. Ibaraki, T. Kameda, T. Minoura, Serializability with constraints. ACM Trans. Database
Syst. 12(3), 429–452 (1987)
197. R. Ingram, P. Shields, J.E. Walter, J.L. Welch, An asynchronous leader election algorithm for
dynamic networks, in Proc. 23rd Int’l IEEE Parallel and Distributed Processing Symposium
(IPDPS’09) (IEEE Press, New York, 2009), pp. 1–12
198. Cl. Jard, G.-V. Jourdan, Incremental transitive dependency tracking in distributed computa-
tions. Parallel Process. Lett. 6(3), 427–435 (1996)
199. D.R. Jefferson, Virtual time. ACM Trans. Program. Lang. Syst. 7(3), 404–425 (1985)
200. E. Jiménez, A. Fernández, V. Cholvi, A parameterized algorithm that implements sequential,
causal, and cache memory consistencies. J. Syst. Softw. 81(1), 120–131 (2008)
201. Ö. Johansson, Simple distributed (Δ + 1)-coloring of graphs. Inf. Process. Lett. 70(5), 229–
232 (1999)
202. D.B. Johnson, W. Zwaenepoel, Recovery in distributed systems using optimistic message
logging and checkpointing. J. Algorithms 11(3), 462–491 (1990)
203. S. Kanchi, D. Vineyard, An optimal distributed algorithm for all-pairs shortest-path. Int. J.
Inf. Theories Appl. 11(2), 141–146 (2004)
204. P. Keleher, A.L. Cox, W. Zwaenepoel, Lazy release consistency for software distributed
shared memory, in Proc. 19th ACM Int’l Symposium on Computer Architecture (ISCA’92),
(1992), pp. 13–21
205. J. Kleinberg, E. Tardos, Algorithm Design (Addison-Wesley, Reading, 2005), 838 pages
206. E. Knapp, Deadlock detection in distributed databases. ACM Comput. Surv. 19(4), 303–328
(1987)
207. R. Koo, S. Toueg, Checkpointing and rollback-recovery for distributed systems. IEEE Trans.
Softw. Eng. 13(1), 23–31 (1987)
208. E. Korach, S. Moran, S. Zaks, Tight lower and upper bounds for some distributed algorithms
for a complete network of processors, in Proc. 3rd ACM Symposium on Principles of Dis-
tributed Computing (PODC’84) (ACM Press, New York, 1984), pp. 199–207
209. E. Korach, S. Moran, S. Zaks, The optimality of distributive constructions of minimum
weight and degree restricted spanning tree in complete networks of processes. SIAM J. Com-
put. 16(2), 231–236 (1987)
210. E. Korach, D. Rotem, N. Santoro, Distributed algorithms for finding centers and medians in
networks. ACM Trans. Program. Lang. Syst. 6(3), 380–401 (1984)
211. E. Korach, G. Tel, S. Zaks, Optimal synchronization of ABD networks, in Proc. Int’l Con-
ference on Concurrency. LNCS, vol. 335 (Springer, Berlin, 1988), pp. 353–367
212. R. Kordale, M. Ahamad, A scalable technique for implementing multiple consistency lev-
els for distributed objects, in Proc. 16th IEEE Int’l Conference on Distributed Computing
Systems (ICDCS’96) (IEEE Press, New York, 1996), pp. 369–376
213. A.D. Kshemkalyani, Fast and message-efficient global snapshot algorithms for large-scale
distributed systems. IEEE Trans. Parallel Distrib. Syst. 21(9), 1281–1289 (2010)
214. A.D. Kshemkalyani, M. Raynal, M. Singhal, Global snapshots of a distributed system. Dis-
trib. Syst. Eng. 2(4), 224–233 (1995)
215. A.D. Kshemkalyani, M. Singhal, Invariant-based verification of a distributed deadlock de-
tection algorithm. IEEE Trans. Softw. Eng. 17(8), 789–799 (1991)
216. A.D. Kshemkalyani, M. Singhal, Efficient detection and resolution of generalized distributed
deadlocks. IEEE Trans. Softw. Eng. 20(1), 43–54 (1994)
217. A.D. Kshemkalyani, M. Singhal, Necessary and sufficient conditions on information for
causal message ordering and their optimal implementation. Distrib. Comput. 11(2), 91–111
(1998)
218. A.D. Kshemkalyani, M. Singhal, A one-phase algorithm to detect distributed deadlocks in
replicated databases. IEEE Trans. Knowl. Data Eng. 11(6), 880–895 (1999)
219. A.D. Kshemkalyani, M. Singhal, Distributed Computing: Principles, Algorithms and Sys-
tems (Cambridge University Press, Cambridge, 2008), 736 pages
220. A.D. Kshemkalyani, M. Singhal, Efficient distributed snapshots in an anonymous asyn-
chronous message-passing system. J. Parallel Distrib. Comput. 73, 621–629 (2013)
221. T.-H. Lai, Termination detection for dynamically distributed systems with non-first-in-first-
out communication. J. Parallel Distrib. Comput. 3(4), 577–599 (1986)
222. T.H. Lai, T.H. Yang, On distributed snapshots. Inf. Process. Lett. 25, 153–158 (1987)
223. T.V. Lakshman, A.K. Agrawala, Efficient decentralized consensus protocols. IEEE Trans.
Softw. Eng. SE-12(5), 600–607 (1986)
224. K.B. Lakshmanan, N. Meenakshi, K. Thulasiraman, A time-optimal message-efficient dis-
tributed algorithm for depth-first search. Inf. Process. Lett. 25, 103–109 (1987)
225. K.B. Lakshmanan, K. Thulasiraman, On the use of synchronizers for asynchronous com-
munication networks, in Proc. 2nd Int’l Workshop on Distributed Algorithms (WDAG’87).
LNCS, vol. 312 (Springer, Berlin, 1987), pp. 257–267
226. L. Lamport, Time, clocks, and the ordering of events in a distributed system. Commun. ACM
21(7), 558–565 (1978)
227. L. Lamport, How to make a multiprocessor computer that correctly executes multiprocess
programs. IEEE Trans. Comput. C-28(9), 690–691 (1979)
228. L. Lamport, On inter-process communications, part I: basic formalism. Distrib. Comput.
1(2), 77–85 (1986)
229. L. Lamport, On inter-process communications, part II: algorithms. Distrib. Comput. 1(2),
86–101 (1986)
230. L. Lamport, P.M. Melliar-Smith, Synchronizing clocks in the presence of faults. J. ACM
32(1), 52–78 (1985)
231. Y. Lavallée, G. Roucairol, A fully distributed minimal spanning tree algorithm. Inf. Process.
Lett. 23(2), 55–62 (1986)
232. G. Le Lann, Distributed systems: towards a formal approach, in IFIP World Congress,
(1977), pp. 155–160
233. I. Lee, S.B. Davidson, Adding time to synchronous processes. IEEE Trans. Comput. C-36(8),
941–948 (1987)
234. K. Li, P. Hudak, Memory coherence in shared virtual memory systems. ACM Trans. Com-
put. Syst. 7(4), 321–359 (1989)
235. T.F. Li, Th. Radhakrishnan, K. Venkatesh, Global state detection in non-FIFO networks, in
Proc. 7th Int’l Conference on Distributed Computing Systems (ICDCS’87) (IEEE Press, New
York, 1987), pp. 364–370
236. N. Linial, Locality in distributed graph algorithms. SIAM J. Comput. 21(1), 193–201 (1992)
237. R.J. Lipton, J.S. Sandberg, PRAM: a scalable shared memory. Tech Report CS-TR-180-88,
Princeton University, 1988
238. B. Liskov, R. Ladin, Highly available distributed services and fault-tolerant distributed
garbage collection, in Proc. 5th ACM Symposium on Principles of Distributed Computing
(PODC’86) (ACM Press, New York, 1986), pp. 29–39
239. S. Lodha, A.D. Kshemkalyani, A fair distributed mutual exclusion algorithm. IEEE Trans.
Parallel Distrib. Syst. 11(6), 537–549 (2000)
240. M. Luby, A simple parallel algorithm for the maximal independent set problem. SIAM J.
Comput. 15(4), 1036–1053 (1986)
241. N.A. Lynch, Upper bounds for static resource allocation in a distributed system. J. Comput.
Syst. Sci. 23(2), 254–278 (1981)
242. N.A. Lynch, Distributed Algorithms (Morgan Kaufmann, San Francisco, 1996), 872 pages
243. M. Maekawa, A √N algorithm for mutual exclusion in decentralized systems. ACM Trans.
Comput. Syst. 3(2), 145–159 (1985)
244. N. Malpani, J.L. Welch, N. Vaidya, Leader election algorithms for mobile ad hoc networks,
in Proc. 4th Int’l ACM Workshop on Discrete Algorithms and Methods for Mobile Computing
and Communications (DIAL-M’00) (ACM Press, New York, 2000), pp. 96–103
245. Y. Manabe, R. Baldoni, M. Raynal, S. Aoyagi, k-arbiter: a safe and general scheme for h-out
of-k mutual exclusion. Theor. Comput. Sci. 193(1–2), 97–112 (1998)
246. D. Manivannan, R.H.B. Netzer, M. Singhal, Finding consistent global checkpoints in a dis-
tributed computation. IEEE Trans. Parallel Distrib. Syst. 8(6), 623–627 (1997)
247. D. Manivannan, M. Singhal, A low overhead recovery technique using quasi-synchronous
checkpointing, in Proc. 16th IEEE Int’l Conference on Distributed Computing Systems
(ICDCS’96) (IEEE Press, New York, 1996), pp. 100–107
248. D. Manivannan, M. Singhal, An efficient distributed algorithm for detection of knots and
cycles in a distributed graph. IEEE Trans. Parallel Distrib. Syst. 14(10), 961–972 (2003)
249. F. Mattern, Algorithms for distributed termination detection. Distrib. Comput. 2(3), 161–175
(1987)
250. F. Mattern, Virtual time and global states of distributed systems, in Proc. Parallel and Dis-
tributed Algorithms Conference, ed. by M. Cosnard, P. Quinton, M. Raynal, Y. Robert (North-
Holland, Amsterdam, 1988), pp. 215–226
251. F. Mattern, Global quiescence detection based on credit distribution and recovery. Inf. Pro-
cess. Lett. 30(4), 195–200 (1989)
252. F. Mattern, An efficient distributed termination test. Inf. Process. Lett. 31(4), 203–208 (1989)
253. F. Mattern, Efficient algorithms for distributed snapshots and global virtual time approxima-
tion. J. Parallel Distrib. Comput. 18, 423–434 (1993)
254. F. Mattern, Distributed algorithms and causally consistent observations, in Proc. 16th Int’l
Conference on Application and Theory of Petri Nets, (Invited Paper). LNCS, vol. 935
(Springer, Berlin, 1995), pp. 21–22
255. F. Mattern, S. Fünfrocken, A non-blocking lightweight implementation of causal order mes-
sage delivery, in Proc. Int’l Dagstuhl Workshop on Theory and Practice in Distributed Sys-
tems. LNCS, vol. 938 (Springer, Berlin, 1995), pp. 197–213
256. M. Mavronicolas, D. Roth, Efficient, strong consistent implementations of shared memory, in
Proc. 6th Int’l Workshop on Distributed Algorithms (WDAG’92). LNCS, vol. 647 (Springer,
Berlin, 1992), pp. 346–361
257. J. Mayo, Ph. Kearns, Efficient distributed termination detection with roughly synchronized
clocks. Inf. Process. Lett. 52(2), 105–108 (1994)
258. K. Mehlhorn, P. Sanders, Algorithms and Data Structures (Springer, Berlin, 2008), 300
pages
259. D. Menasce, R. Muntz, Locking and deadlock detection in distributed database. IEEE Trans.
Softw. Eng. SE-5(3), 195–202 (1979)
260. J.R. Mendívil, F. Fariña, C.F. Garitagoitia, C.F. Alastruey, J.M. Barnabeu-Auban, A dis-
tributed deadlock resolution algorithm for the AND model. IEEE Trans. Parallel Distrib.
Syst. 10(5), 433–447 (1999)
261. J. Misra, Detecting termination of distributed computations using markers, in Proc. 2nd ACM
Symposium on Principles of Distributed Computing (PODC’83) (ACM Press, New York,
1983), pp. 290–294
262. J. Misra, Axioms for memory access in asynchronous hardware systems. ACM Trans. Pro-
gram. Lang. Syst. 8(1), 142–153 (1986)
263. J. Misra, Distributed discrete event simulation. ACM Comput. Surv. 18(1), 39–65 (1986)
264. J. Misra, K.M. Chandy, A distributed graph algorithm: knot detection. ACM Trans. Program.
Lang. Syst. 4(4), 678–686 (1982)
265. J. Misra, K.M. Chandy, Termination detection of diffusing computations in communicating
sequential processes. ACM Trans. Program. Lang. Syst. 4(1), 37–43 (1982)
266. D.P. Mitchell, M. Merritt, A distributed algorithm for deadlock detection and resolution, in
Proc. 3rd ACM Symposium on Principles of Distributed Computing (PODC’84) (ACM Press,
New York, 1984), pp. 282–284
267. N. Mittal, V.K. Garg, Consistency conditions for multi-objects operations, in Proc. 18th IEEE
Int’l Conference on Distributed Computing Systems (ICDCS’98) (IEEE Press, New York,
1998), pp. 582–589
268. M. Mizuno, M.L. Nielsen, M. Raynal, An optimistic protocol for a linearizable distributed
shared memory service. Parallel Process. Lett. 6(2), 265–278 (1996)
269. M. Mizuno, M. Raynal, J.Z. Zhou, Sequential consistency in distributed systems, in Int’l
Dagstuhl Workshop on the Theory and Practice in Distributed Systems. LNCS, vol. 938
(Springer, Berlin, 1994), pp. 224–241
270. B. Moret, The Theory of Computation (Addison-Wesley, Reading, 1998), 453 pages
271. A. Mostéfaoui, M. Raynal, Efficient message logging for uncoordinated checkpointing pro-
tocols, in Proc. 2nd European Dependable Computing Conference (EDCC’96). LNCS,
vol. 1150 (Springer, Berlin, 1996), pp. 353–364
272. A. Mostéfaoui, M. Raynal, P. Veríssimo, Logically instantaneous communication on top of
distributed memory parallel machines, in Proc. 5th Int’l Conference on Parallel Computing
Technologies (PACT’99). LNCS, vol. 1662 (Springer, Berlin, 1999), pp. 258–270
273. V.V. Murty, V.K. Garg, An algorithm to guarantee synchronous ordering of messages, in
Proc. 2nd Int’l IEEE Symposium on Autonomous Decentralized Systems (IEEE Press, New
York, 1995), pp. 208–214
274. V.V. Murty, V.K. Garg, Characterization of message ordering specifications and protocols, in
Proc. 17th Int’l Conference on Distributed Computing Systems (ICDCS’97) (IEEE Press, New
York, 1997), pp. 492–499
275. M. Naimi, M. Trehel, An improvement of the log n distributed algorithm for mutual exclu-
sion, in Proc. 7th Int’l IEEE Conference on Distributed Computing Systems (ICDCS’87)
(IEEE Press, New York, 1987), pp. 371–375
276. M. Naimi, M. Trehel, A. Arnold, A log(n) distributed mutual exclusion algorithm based on
path reversal. J. Parallel Distrib. Comput. 34(1), 1–13 (1996)
277. M. Naor, A. Wool, The load, capacity and availability of quorum systems. SIAM J. Comput.
27(2), 423–447 (1998)
278. N. Natarajan, A distributed scheme for detecting communication deadlocks. IEEE Trans.
Softw. Eng. 12(4), 531–537 (1986)
279. M.L. Neilsen, M. Mizuno, A DAG-based algorithm for distributed mutual exclusion, in Proc.
11th Int’l IEEE Conference on Distributed Computing Systems (ICDCS’91) (IEEE
Press, New York, 1991), pp. 354–360
280. M.L. Neilsen, M. Mizuno, Nondominated k-coteries for multiple mutual exclusion. Inf. Pro-
cess. Lett. 50(5), 247–252 (1994)
281. M.L. Neilsen, M. Mizuno, M. Raynal, A general method to define quorums, in Proc. 12th
Int’l IEEE Conference on Distributed Computing Systems (ICDCS’92) (IEEE Press, New
York, 1992), pp. 657–664
282. M. Nesterenko, M. Mizuno, A quorum-based self-stabilizing distributed mutual exclusion
algorithm. J. Parallel Distrib. Comput. 62(2), 284–305 (2002)
283. R.H.B. Netzer, J. Xu, Necessary and sufficient conditions for consistent global snapshots.
IEEE Trans. Parallel Distrib. Syst. 6(2), 165–169 (1995)
284. N. Neves, W.K. Fuchs, Adaptive recovery for mobile environments. Commun. ACM 40(1),
68–74 (1997)
285. S. Nishio, K.F. Li, F.G. Manning, A resilient distributed mutual exclusion algorithm for com-
puter networks. IEEE Trans. Parallel Distrib. Syst. 1(3), 344–356 (1990)
286. B. Nitzberg, V. Lo, Distributed shared memory: a survey of issues and algorithms. IEEE
Comput. 24(8), 52–60 (1991)
287. R. Obermarck, Distributed deadlock detection algorithm. ACM Trans. Database Syst. 7(2),
197–208 (1982)
288. J.K. Pachl, E. Korach, D. Rotem, Lower bounds for distributed maximum-finding algorithms.
J. ACM 31(4), 905–918 (1984)
289. Ch.H. Papadimitriou, The serializability of concurrent database updates. J. ACM 26(4), 631–
653 (1979)
290. D.S. Parker, G.L. Popek, G. Rudisin, L. Stoughton, B.J. Walker, E. Walton, J.M. Chow, D.A.
Edwards, S. Kiser, C.S. Kline, Detection of mutual inconsistency in distributed systems.
IEEE Trans. Softw. Eng. SE-9(3), 240–246 (1983)
291. B. Patt-Shamir, S. Rajsbaum, A theory of clock synchronization, in Proc. 26th Annual ACM
Symposium on Theory of Computing (STOC’94) (ACM Press, New York, 1994), pp. 810–
819
292. D. Peleg, Distributed Computing: A Locally-Sensitive Approach. SIAM Monographs on Dis-
crete Mathematics and Applications (2000), 343 pages
293. D. Peleg, J.D. Ullman, An optimal synchronizer for the hypercube. SIAM J. Comput. 18,
740–747 (1989)
294. D. Peleg, A. Wool, Crumbling walls: a class of practical and efficient quorum systems. Dis-
trib. Comput. 10(2), 87–97 (1997)
295. G.L. Peterson, An O(n log n) unidirectional algorithm for the circular extrema problem.
ACM Trans. Program. Lang. Syst. 4(4), 758–762 (1982)
296. L.L. Peterson, N.C. Bucholz, R.D. Schlichting, Preserving and using context information in
interprocess communication. ACM Trans. Comput. Syst. 7(3), 217–246 (1989)
297. S.E. Pomares Hernandez, J.R. Perez Cruz, M. Raynal, From the happened before relation to
the causal ordered set abstraction. J. Parallel Distrib. Comput. 72, 791–795 (2012)
298. R. Prakash, M. Raynal, M. Singhal, An adaptive causal ordering algorithm suited to mobile
computing environments. J. Parallel Distrib. Comput. 41(1), 190–204 (1997)
299. R. Prakash, M. Singhal, Low-cost checkpointing and failure recovery in mobile computing
systems. IEEE Trans. Parallel Distrib. Syst. 7(10), 1035–1048 (1996)
300. R. Prakash, M. Singhal, Dependency sequences and hierarchical clocks: efficient alternatives
to vector clocks for mobile computing systems. Wirel. Netw. 3(5), 349–360 (1997)
301. J. Protic, M. Tomasevic, Distributed shared memory: concepts and systems. IEEE Concurr.
4(2), 63–79 (1996)
302. S.P. Rana, A distributed solution of the distributed termination problem. Inf. Process. Lett.
17(1), 43–46 (1983)
303. B. Randell, System structure for software fault-tolerance. IEEE Trans. Softw. Eng. SE-1(2),
220–232 (1975)
304. K. Raymond, A tree-based algorithm for distributed mutual exclusion. ACM Trans. Comput.
Syst. 7(1), 61–77 (1989)
305. K. Raymond, A distributed algorithm for multiple entries to a critical section. Inf. Process.
Lett. 30(4), 189–193 (1989)
306. M. Raynal, Algorithms for Mutual Exclusion (The MIT Press, Cambridge, 1986), 107 pages.
ISBN 0-262-18119-3
307. M. Raynal, A distributed algorithm to prevent mutual drift between n logical clocks. Inf.
Process. Lett. 24, 199–202 (1987)
308. M. Raynal, Networks and Distributed Computation: Concepts, Tools and Algorithms (The
MIT Press, Cambridge, 1987), 168 pages. ISBN 0-262-18130-4
309. M. Raynal, Prime numbers as a tool to design distributed algorithms. Inf. Process. Lett. 33,
53–58 (1989)
310. M. Raynal, A simple taxonomy of distributed mutual exclusion algorithms. Oper. Syst. Rev.
25(2), 47–50 (1991)
311. M. Raynal, A distributed solution to the k-out of-M resource allocation problem, in Proc.
Int’l Conference on Computing and Information. LNCS, vol. 497 (Springer, Berlin, 1991),
pp. 509–518
312. M. Raynal, Illustrating the use of vector clocks in property detection: an example and a
counter-example, in Proc. 5th European Conference on Parallelism (EUROPAR’99). LNCS,
vol. 1685 (Springer, Berlin, 1999), pp. 806–814
313. M. Raynal, Sequential consistency as lazy linearizability, in Proc. 14th ACM Symposium on
Parallel Algorithms and Architectures (SPAA’02) (ACM Press, New York, 2002), pp. 151–
152
314. M. Raynal, Token-based sequential consistency. Comput. Syst. Sci. Eng. 17(6), 359–365
(2002)
315. M. Raynal, Fault-Tolerant Agreement in Synchronous Distributed Systems (Morgan & Clay-
pool, San Francisco, 2010), 167 pages. ISBN 9781608455256
316. M. Raynal, Communication and Agreement Abstractions for Fault-Tolerant Asynchronous
Distributed Systems (Morgan & Claypool, San Francisco, 2010), 251 pages. ISBN
9781608452934
317. M. Raynal, Concurrent Programming: Algorithms, Principles, and Foundations (Springer,
Berlin, 2012), 500 pages. ISBN 978-3-642-32026-2
318. M. Raynal, M. Ahamad, Exploiting write semantics in implementing partially replicated
causal objects, in Proc. 6th EUROMICRO Conference on Parallel and Distributed Processing
(PDP’98) (IEEE Press, New York, 1998), pp. 157–163
319. M. Raynal, J.-M. Hélary, Synchronization and Control of Distributed Systems and Programs.
Wiley Series in Parallel Computing (1991), 126 pages. ISBN 0-471-92453-9
320. M. Raynal, M. Roy, C. Tutu, A simple protocol offering both atomic consistent read op-
erations and sequentially consistent read operations, in Proc. 19th Int’l Conference on Ad-
vanced Information Networking and Applications (AINA’05) (IEEE Press, New York, 2005),
pp. 961–966
321. M. Raynal, G. Rubino, An algorithm to detect token loss on a logical ring and to regenerate
lost tokens, in Int’l Conference on Parallel Processing and Applications (North-Holland,
Amsterdam, 1987), pp. 457–467
322. M. Raynal, A. Schiper, From causal consistency to sequential consistency in shared memory
systems, in Proc. 15th Int’l Conference on Foundations of Software Technology and The-
oretical Computer Science (FST&TCS’95). LNCS, vol. 1026 (Springer, Berlin, 1995), pp.
180–194
323. M. Raynal, A. Schiper, A suite of formal definitions for consistency criteria in distributed
shared memories, in Proc. 9th Int’l IEEE Conference on Parallel and Distributed Computing
Systems (PDCS’96) (IEEE Press, New York, 1996), pp. 125–131
324. M. Raynal, A. Schiper, S. Toueg, The causal ordering abstraction and a simple way to im-
plement it. Inf. Process. Lett. 39(6), 343–350 (1991)
325. M. Raynal, M. Singhal, Logical time: capturing causality in distributed systems. IEEE Com-
put. 29(2), 49–57 (1996)
326. M. Raynal, K. Vidyasankar, A distributed implementation of sequential consistency with
multi-object operations, in Proc. 24th IEEE Int’l Conference on Distributed Computing Sys-
tems (ICDCS’04) (IEEE Press, New York, 2004), pp. 544–551
327. G. Ricart, A.K. Agrawala, An optimal algorithm for mutual exclusion in computer networks.
Commun. ACM 24(1), 9–17 (1981)
328. G. Ricart, A.K. Agrawala, Author response to “on mutual exclusion in computer networks”
by Carvalho and Roucairol. Commun. ACM 26(2), 147–148 (1983)
329. R. Righter, J.C. Walrand, Distributed simulation of discrete event systems. Proc. IEEE 77(1),
99–113 (1989)
330. S. Ronn, H. Saikkonen, Distributed termination detection with counters. Inf. Process. Lett.
34(5), 223–227 (1990)
331. D.J. Rosenkrantz, R.E. Stearns, P.M. Lewis, System level concurrency control in distributed
databases. ACM Trans. Database Syst. 3(2), 178–198 (1978)
332. D.L. Russell, State restoration in systems of communicating processes. IEEE Trans. Softw.
Eng. SE-6(2), 183–194 (1980)
333. B. Sanders, The information structure of distributed mutual exclusion algorithms. ACM
Trans. Comput. Syst. 5(3), 284–299 (1987)
334. S.K. Sarin, N.A. Lynch, Discarding obsolete information in a replicated database system.
IEEE Trans. Softw. Eng. 13(1), 39–46 (1987)
335. N. Santoro, Design and Analysis of Distributed Algorithms (Wiley, New York, 2007), 589
pages
336. A. Schiper, J. Eggli, A. Sandoz, A new algorithm to implement causal ordering, in Proc. 3rd
Int’l Workshop on Distributed Algorithms (WDAG’89). LNCS, vol. 392 (Springer, Berlin,
1989), pp. 219–232
337. R. Schmid, I.C. Garcia, F. Pedone, L.E. Buzato, Optimal asynchronous garbage collection
for RDT checkpointing protocols, in Proc. 25th Int’l Conference on Distributed Computing
Systems (ICDCS’05) (IEEE Press, New York, 2005), pp. 167–176
338. F. Schmuck, The use of efficient broadcast in asynchronous distributed systems. Doctoral
Dissertation, Tech. Report TR88-928, Dept of Computer Science, Cornell University, 124
pages, 1988
339. F.B. Schneider, Implementing fault-tolerant services using the state machine approach. ACM
Comput. Surv. 22(4), 299–319 (1990)
340. R. Schwarz, F. Mattern, Detecting causal relationships in distributed computations: in search
of the Holy Grail. Distrib. Comput. 7, 149–174 (1994)
341. A. Segall, Distributed network protocols. IEEE Trans. Inf. Theory 29(1), 23–35 (1983)
342. N. Shavit, N. Francez, A new approach to detection of locally indicative stability, in 13th
Int’l Colloquium on Automata, Languages and Programming (ICALP’86). LNCS, vol. 226
(Springer, Berlin, 1986), pp. 344–358
343. A. Silberschatz, Synchronization and communication in distributed systems. IEEE Trans.
Softw. Eng. SE-5(6), 542–546 (1979)
344. L.M. Silva, J.G. Silva, Global checkpoints for distributed programs, in Proc. 11th Symposium
on Reliable Distributed Systems (SRDS’92) (IEEE Press, New York, 1992), pp. 155–162
345. M. Singhal, A heuristically-aided algorithm for mutual exclusion in distributed systems.
IEEE Trans. Comput. 38(5), 651–662 (1989)
346. M. Singhal, Deadlock detection in distributed systems. IEEE Comput. 22(11), 37–48 (1989)
347. M. Singhal, A class of deadlock-free Maekawa-type algorithms for mutual exclusion in dis-
tributed systems. Distrib. Comput. 4(3), 131–138 (1991)
348. M. Singhal, A dynamic information-structure mutual exclusion algorithm for distributed sys-
tems. IEEE Trans. Parallel Distrib. Syst. 3(1), 121–125 (1992)
349. M. Singhal, A taxonomy of distributed mutual exclusion. J. Parallel Distrib. Comput. 18(1),
94–101 (1993)
350. M. Singhal, A.D. Kshemkalyani, An efficient implementation of vector clocks. Inf. Process.
Lett. 43, 47–52 (1992)
351. M. Sipser, Introduction to the Theory of Computation (PWS, Boston, 1996), 396 pages
352. A.P. Sistla, J.L. Welch, Efficient distributed recovery using message logging, in Proc. 8th
ACM Symposium on Principles of Distributed Computing (PODC’89) (ACM Press, New
York, 1989), pp. 223–238
353. D. Skeen, A quorum-based commit protocol, in Proc. 6th Berkeley Workshop on Distributed
Data Management and Computer Networks (1982), pp. 69–80
354. J.L.A. van de Snepscheut, Synchronous communication between asynchronous components.
Inf. Process. Lett. 13(3), 127–130 (1981)
355. J.L.A. van de Snepscheut, Fair mutual exclusion on a graph of processes. Distrib. Comput.
2(2), 113–115 (1987)
356. T. Soneoka, T. Ibaraki, Logically instantaneous message passing in asynchronous distributed
systems. IEEE Trans. Comput. 43(5), 513–527 (1994)
357. M. Spezialetti, Ph. Kearns, Efficient distributed snapshots, in Proc. 6th Int’l Conference on
Distributed Computing Systems (ICDCS’86) (IEEE Press, New York, 1986), pp. 382–388
358. T.K. Srikanth, S. Toueg, Optimal clock synchronization. J. ACM 34(3), 626–645 (1987)
359. M. van Steen, Graph Theory and Complex Networks: An Introduction (2011), 285 pages.
ISBN 978-90-815406-1-2
360. R.E. Strom, S. Yemini, Optimistic recovery in distributed systems. ACM Trans. Comput.
Syst. 3(3), 204–226 (1985)
361. I. Suzuki, T. Kasami, A distributed mutual exclusion algorithm. ACM Trans. Comput. Syst.
3(4), 344–349 (1985)
362. G. Taubenfeld, Synchronization Algorithms and Concurrent Programming (Pearson Prentice-
Hall, Upper Saddle River, 2006), 423 pages. ISBN 0-131-97259-6
363. K. Taylor, The role of inhibition in asynchronous consistent-cut protocols, in Proc. 3rd Int’l
Workshop on Distributed Algorithms (WDAG’89). LNCS, vol. 392 (Springer, Berlin, 1989),
pp. 280–291
364. R.N. Taylor, Complexity of analyzing the synchronization structure of concurrent programs.
Acta Inform. 19(1), 57–84 (1983)
365. G. Tel, Introduction to Distributed Algorithms, 2nd edn. (Cambridge University Press, Cam-
bridge, 2000), 596 pages. ISBN 0-521-79483-8
366. G. Tel, F. Mattern, The derivation of distributed termination detection algorithms from
garbage collection schemes. ACM Trans. Program. Lang. Syst. 15(1), 1–35 (1993)
367. G. Tel, R.B. Tan, J. van Leeuwen, The derivation of graph-marking algorithms from dis-
tributed termination detection protocols. Sci. Comput. Program. 10(1), 107–137 (1988)
368. R.H. Thomas, A majority consensus approach to concurrency control for multiple copy
databases. ACM Trans. Database Syst. 4(2), 180–209 (1979)
369. O. Theel, M. Raynal, Static and dynamic adaptation of transactional consistency, in Proc.
30th Hawaii Int’l Conference on System Sciences (HICSS-30), vol. I (1997), pp. 533–542
370. F.J. Torres-Rojas, M. Ahamad, Plausible clocks: constant size logical clocks for distributed
systems. Distrib. Comput. 12(4), 179–195 (1999)
371. F. Torres-Rojas, M. Ahamad, M. Raynal, Lifetime-based consistency protocols for dis-
tributed objects, in Proc. 12th Int’l Symposium on Distributed Computing (DISC’98). LNCS,
vol. 1499 (Springer, Berlin, 1998), pp. 378–392
372. F. Torres-Rojas, M. Ahamad, M. Raynal, Timed consistency for shared distributed objects,
in Proc. 18th Annual ACM Symposium on Principles of Distributed Computing (PODC’99)
(ACM Press, New York, 1999), pp. 163–172
373. S. Toueg, An all-pairs shortest paths distributed algorithm. IBM Technical Report RC 8327,
1980
374. M. Trehel, M. Naimi, Un algorithme distribué d’exclusion mutuelle en log n. TSI. Tech. Sci.
Inform. 6(2), 141–150 (1987)
375. J. Tsai, S.-Y. Kuo, Y.-M. Wang, Theoretical analysis for communication-induced check-
pointing protocols with rollback-dependency trackability. IEEE Trans. Parallel Distrib. Syst.
9(10), 963–971 (1998)
376. J. Tsai, Y.-M. Wang, S.-Y. Kuo, Evaluations of domino-free communication-induced check-
pointing protocols. Inf. Process. Lett. 69(1), 31–37 (1999)
377. S. Venkatesan, Message optimal incremental snapshots, in Proc. 9th Int’l Conference on
Distributed Computing Systems (ICDCS’89) (IEEE Press, New York, 1989), pp. 53–60
378. S. Venkatesan, B. Dathan, Testing and debugging distributed programs using global predi-
cates. IEEE Trans. Softw. Eng. 21(2), 163–177 (1995)
379. J. Villadangos, F. Fariña, J.R. Mendívil, C.F. Garitagoitia, A. Cordoba, A safe algorithm for
resolving OR deadlocks. IEEE Trans. Softw. Eng. 29(7), 608–622 (2003)
380. M. Vukolić, Quorum Systems with Applications to Storage and Consensus (Morgan & Clay-
pool, San Francisco, 2012), 130 pages. ISBN 978-1-60845-683-3
381. Y.-M. Wang, Consistent global checkpoints that contain a given set of local checkpoints.
IEEE Trans. Comput. 46(4), 456–468 (1997)
382. Y.-M. Wang, P.Y. Chung, I.J. Lin, W.K. Fuchs, Checkpointing space reclamation for uncoor-
dinated checkpointing in message-passing systems. IEEE Trans. Parallel Distrib. Syst. 6(5),
546–554 (1995)
383. Y.-M. Wang, W.K. Fuchs, Optimistic message logging for independent checkpointing
in message-passing systems, in Proc. 11th Symposium on Reliable Distributed Systems
(SRDS’92) (IEEE Press, New York, 1992), pp. 147–154
384. S. Warshall, A theorem on Boolean matrices. J. ACM 9(1), 11–12 (1962)
385. J.L. Welch, Simulating synchronous processors. Inf. Comput. 74, 159–171 (1987)
386. J.L. Welch, N.A. Lynch, A new fault-tolerance algorithm for clock synchronization. Inf.
Comput. 77(1), 1–36 (1988)
387. J.L. Welch, N.A. Lynch, A modular drinking philosophers algorithm. Distrib. Comput. 6(4),
233–244 (1993)
388. J.L. Welch, J.E. Walter, Link Reversal Algorithms (Morgan & Claypool, San Francisco,
2011), 93 pages. ISBN 9781608450411
389. H. Wu, W.-N. Chin, J. Jaffar, An efficient distributed deadlock avoidance algorithm for the
AND model. IEEE Trans. Softw. Eng. 28(1), 18–29 (2002)
390. G.T.J. Wuu, A.J. Bernstein, Efficient solutions to the replicated log and dictionary problems,
in Proc. 3rd Annual ACM Symposium on Principles of Distributed Computing (PODC’84)
(ACM Press, New York, 1984), pp. 233–242
391. G.T.J. Wuu, A.J. Bernstein, False deadlock detection in distributed systems. IEEE Trans.
Softw. Eng. SE-11(8), 820–821 (1985)
392. M. Yamashita, T. Kameda, Computing on anonymous networks, part I: characterizing the
solvable cases. IEEE Trans. Parallel Distrib. Syst. 7(1), 69–89 (1996)
393. M. Yamashita, T. Kameda, Computing on anonymous networks, part II: decision and mem-
bership problems. IEEE Trans. Parallel Distrib. Syst. 7(1), 90–96 (1996)
394. L.-Y. Yen, T.-L. Huang, Resetting vector clocks in distributed systems. J. Parallel Distrib.
Comput. 43, 15–20 (1997)
395. Y. Zhu, C.-T. Cheung, A new distributed breadth-first search algorithm. Inf. Process. Lett.
25(5), 329–333 (1987)
Index

A
Abstraction
  Checkpointing, 196
Adaptive algorithm, 108
  Bounded mutual exclusion, 259
  Causal order, 312
  Mutual exclusion, 256
AND model
  Deadlock detection, 403
  Receive statement, 386
Anonymous systems, 78
Antiquorum, 267
Arbiter process, 264
Asynchronous atomic model, 370
Asynchronous system, 5
Atomicity
  Definition, 430
  From copy invalidation, 438
  From copy update, 443
  From server processes, 437
  From total order broadcast, 435
  Is a local property, 433
  Linearization point, 431
B
Bounded delay network
  Definition, 236
  Local clock drift, 240
Breadth-first spanning-tree
  Built with centralized control, 20
  Built without centralized control, 17
Broadcast
  Definition, 9
  On a rooted spanning tree, 10
C
Causal future, 124
Causal order
  Bounded lifetime message, 317
  Broadcast, 313
  Broadcast causal barrier, 315
  Causality-based characterization, 305
  Definition, 304
  Point-to-point, 306
  Point-to-point delivery condition, 307
  Reduce the size of control information, 310
Causal past, 124
Causal path, 123
Causal precedence, 123
Center of graph, 60
Channel, 4
  FIFO, 4
  FIFO in leader election, 89
  Four delivery properties, 328
  State, 132, 368
Checkpoint and communication pattern, 189
Checkpointing
  Classes of algorithms, 198
  Consistent abstractions, 196
  Domino effect, 196
  Recovery algorithm, 214
  Rollback-dependency trackability, 197
  Stable storage, 211
  Uncoordinated, 211
  Z-cycle-freedom, 196
Communication
  Deterministic context, 341
  Nondeterministic context, 342
Communication graph, 6
Concurrency set, 124
Conflict graph, 286
Consistency condition, 425
  Atomicity, 430
  Causal consistency, 464
  FIFO consistency, 470
  Hierarchy, 468
  Sequential consistency, 447
  What is the issue, 429
Continuously passive process
  In deadlock detection, 416
  In termination detection, 382
Convergecast
  Definition, 9
  On a rooted spanning tree, 10
Crown structure, 339
Crumbling wall, 268
Cut
  Consistent, 125
  Definition, 125
Cut vertex of graph
  Definition, 60, 66
  Determination, 67
D
De Bruijn graph
  Computing, 74
  Definition, 73
Deadlock detection
  AND model, 403
  Definition, 404
  Dependency set, 402
  One-at-a-time model, 403
  OR model, 404
  Resource vs. message, 402
  Structure of algorithms, 405
  Wait-for graph, 402
  What is the difficulty, 414
Deadlock detection in the AND model, 408
Deadlock detection in the one-at-a-time model, 405
Deadlock detection in the OR model, 413
Degree of a vertex, 42
Delivery condition
  Associated with a channel, 331
  Causal barrier, 315
  For bounded lifetime messages, 319
  For message causal order, 307
Dependency set
  Deadlock detection, 402
  Termination detection, 385
Depth-first network traversal, 24
Diameter of a graph, 60
Diffusing computation
  Definition, 376
  Termination detection, 377
Distributed algorithm, 4
  Asynchronous, 5
  Synchronous, 4
Distributed computation: definition, 123
Distributed cycle detection, 50
Distributed knot detection, 50
Distributed shared memory, 427
Distributed shortest paths
  Bellman–Ford, 35
  Floyd–Warshall, 38
Domino effect, 196
Drinking philosophers, 295
Dynamic programming principle, 36
Dynamic termination
  Detection algorithm, 394
  Locally evaluable predicate, 393
E
Eccentricity of a vertex, 60
Event
  Definition, 122
  Immediate predecessor, 170
  Partial order, 123
  Relevant event, 170
F
Fast read algorithm (sequential consistency), 453, 456, 459
Fast write algorithm (sequential consistency), 455
Finite projective planes, 266
Flooding algorithm, 10
Forward/discard principle, 6
Fully connected network, 4
G
Global checkpoint, 189, see Global state
Global function, 59
Global state
  Consistency, 129, 134
  Consistency wrt. channel states, 133
  Definition, 129
  Detection of a conjunction of local predicates, 166
  In termination detection, 375
  Including channel states, 132
  Lattice structure, 129
  On-the-fly determination, 135
  Reachability, 129
  vs. cut, 134
Global state computation
Graph algorithms
  Distributed cycle detection, 50
  Distributed knot detection, 50
  Distributed shortest paths, 35
  Distributed vertex coloring, 42
  Maximal independent set, 46
Graph topology vs. round numbers, 72
Grid quorum, 267
H
Happened before relation, 123
I
Immediate predecessor
  Event, 170
  Tracking problem, 170
Incremental requests, 287
Interaction, see Rendezvous communication
Invariance wrt. real time, 125
K
k-out-of-M problem
  Case k = 1, 278
  Definition, 277
  General case, 280
k-out-of-M receive statement, 387
L
Lattice of global states, 129
Leader election
  Impossibility in anonymous rings, 78
  In bidirectional rings, 83
  In unidirectional rings, 79
  Optimality in unidirectional rings, 86
  Problem definition, 77
Linear time
  Basic algorithm, 150
  Definition, 150
  Properties, 151
  wrt. sequential observation, 152
Linearizability, see Atomicity
Linearization point, 431
Liveness property
  Deadlock detection, 405
  Mutual exclusion, 248
  Navigating object, 94
  Observation of a monotonous computation, 399
  Rendezvous communication, 337
  Termination detection, 369
  Total order broadcast, 321
Local checkpoint, 189, see Local state
  Forced vs. spontaneous, 198
  Useless, 195
Local clock drift, 240
Local property, 432
  Sequential consistency is not a, 449
Local state
  Definition, 127
Logical instantaneity, see Rendezvous communication
Logical ring construction, 27
Loop invariant and progress condition in termination detection, 383
M
Matrix clock
  Basic algorithm, 183
  Definition, 182
  Properties, 184
Matrix time, 182
Maximal independent set, 46
Message, 4
  Arrival vs. reception, 385
  As a point, 336
  As an interval, 336
  Bounded lifetime, 317
  Causal order, 304
  Crown, 339
  Delivery condition, 307
  Filtering, 69
  In transit, 133
  Internal vs. external, 7
  Logging, 211
  Logical instantaneity, 335
  Marker, 138
  Orphan, 133
  Relation, 123
  Rendezvous, 336
  Sense of transfer, 336
  Stability, 155
  Unspecified reception, 387
Message complexity
  Leader election, 81, 85, 89
Message delivery: hierarchy, 306
Message logging, 211
Mobile object, 93
Monotonous computation
  Definition, 398
  Observation, 399
Multicast, 9
Mutex, 247
Mutual exclusion
  Adaptive algorithm, 256
  Based on a token, 94, 249
  Based on arbiter permissions, 264
  Based on individual permissions, 249
  Bounded adaptive algorithm, 259
  Definition, 247
  Process states, 248
  To readers/writers, 255
  vs. election, 248
  vs. total order broadcast, 323
  With multiple entries, 278
  wrt. neighbors, 293
N
Navigating token for sequential consistency, 459
Navigation algorithm
  Adaptive, 108
  Based on a complete network, 96
  Based on a spanning tree, 100
Network
  Bounded delay, 236
  Fully connected, 4
  Object navigation, 93
  Ring, 4
  Tree, 4
Network traversal
  Breadth-first, 16, 17, 20
  Depth-first, 24
  In deadlock detection, 413
  Synchronous breadth-first, 220
Nondeterminism, 136
  Communication, 342
  Communication choice, 358
Nondeterministic receive statement, 385
  Solving, 345
O
Object computation
  Equivalent computations, 430
  Legal computation, 430
  Partial order, 429
OO constraint, 451
OR model
  Deadlock detection, 403
  Receive statement, 386
Owner process, 439
P
Partial order
  On events, 123
  On local states, 127
  On object operations, 429
Peripheral vertex of graph, 60
Permission-based algorithms, 249
  Arbiter permission, 249
  Individual permission, 249
  Preemption, 270
  Quorum-based permission, 268
Port, 9
Preemption (permission), 270
Process, 3
  Arbiter, 264
  Continuously passive, 382, 416
  History, 122
  Initial knowledge, 5
  Notion of a proxy, 101
  Owner, 439
  Safe, 224
  State, 368
Proxy process, 101
Pulse model, 220
Q
Quorum
  Antiquorum, 267
  Construction, 266
  Crumbling wall, 268
  Definition, 265
  Grid, 267
  Vote-based, 268
R
Radius of a graph, 60
Receive statement
  AND model, 386
  Disjunctive k-out-of-m model, 387
  k-out-of-m model, 387
  OR model, 386
Regular graph, 72
  Computing on a De Bruijn graph, 74
  De Bruijn graph, 73
Relevant event, 170
Rendezvous communication
  Client-server algorithm, 342
  Crown-based characterization, 339
  Definition, 336
  n-way with deadlines, 360
  Nondeterministic forced interactions, 350
  Nondeterministic planned, 341
  Token-based algorithm, 345
  With deadline, 354
  wrt. causal order, 338
Resource allocation, 277
  Conflict graph, 286
  Dynamic session, 293
  Graph coloring, 291
  Incremental requests, 287
  Reduce waiting chains, 290
  Resources with several instances, 295
  Static session, 292
Resources with a single instance, 286
Restricted vector clock, 181
Ring network, 4
Rollback-dependency trackability
  Algorithms, 203
  BHMR predicate, 208
  Definition, 197
  FDAS strategy, 206
Rooted spanning tree
  Breadth-first construction, 16
  Construction, 12
  Definition, 11
  For broadcast, 10
  For convergecast, 10
Round numbers vs. graph topology, 72
Round-based algorithm
  As a distributed iteration, 65
  Global function computation, 61
  Maximal independent set, 46
  Shortest paths, 35
  Vertex coloring, 43
Routing tables
  From a global function, 62
  From Bellman–Ford, 35
  From Floyd–Warshall, 38
Rubber band transformation, 143
S
Safety property
  Deadlock detection, 404
  Mutual exclusion, 248
  Navigating object, 94
  Observation of a monotonous computation, 399
  Rendezvous communication, 337
  Termination detection, 369
  Total order broadcast, 320
Scalar time, see Linear time
Sequential consistency
  Based on the OO-constraint, 462
  Definition, 447
  Fast enqueue, 456
  Fast read algorithm, 453
  Fast write algorithm, 455
  From a navigating token, 459
  From a server per object, 460
  From a single server, 456
  From total order broadcast, 453
  Is not a local property, 449
  Object managers must cooperate, 461
  OO constraint, 451
  Partial order for read/write objects, 450
  Two theorems, 451
  WW constraint, 451
Sequential observation
  Definition, 131
Snapshot, 121
Space/time diagram
  Equivalent executions, 125
  Synchronous vs. asynchronous, 5
Spanner, 234
Stable property
  Computation, 137
  Definition, 137
Static termination
  Detection algorithm, 390
  Locally evaluable predicate, 390
Superimposition, 141
Synchronizer
  Basic principle, 223
  Definition, 222
  Notion of a safe process, 224
  Synchronizer α, 224
  Synchronizer β (tree-based), 227
  Synchronizer δ (spanner-based), 234
  Synchronizer γ (overlay-based), 230
  Synchronizer μ, 239
  Synchronizer λ, 238
Synchronous breadth-first traversal, 220
Synchronous communication, see Rendezvous communication
Synchronous system, 4, 219
  For rendezvous with deadlines, 354
  Pulse model, 220
T
Termination
  Graph computation algorithm, 8
  Local vs. global, 23
  Predicate, 368
  Shortest path algorithm, 37
Termination detection
  Atomic model, 370
  Dynamic termination, 389
  Four-counter algorithm, 371
  General model, 378
  General predicate fulfilled(), 387
  Global states and cuts, 375
  In diffusing computation, 376
  Loop invariant and progress condition, 383
  Problem, 369
  Reasoned detection in a general model, 381
  Static termination, 389
  Static vs. dynamic termination, 388
  Static/dynamic vs. classical termination, 389
  Type of algorithms, 369
  Vector counting algorithm, 373
  Very general model, 385
Timestamp
  Definition, 152
  wrt. sequential observation, 152
Token, 94
Total order broadcast
  Based on inquiries, 324
  Circulating token, 322
  Coordinator-based, 322
  Definition, 154, 320
  For sequential consistency, 453
  In a synchronous system, 326
  Informal definition, 153
  Strong, 320
  Timestamp-based implementation, 156
  To implement atomicity, 435
  vs. mutual exclusion, 323
  Weak, 321
Tree invariant, 101
Tree network, 4
U
Uncoordinated checkpointing, 211
Unspecified message reception, 387
V
Vector clock
  Adaptive communication layer, 180
  Approximation, 181
  Basic algorithm, 160
  Definition, 160
  Efficient implementation, 176
  k-restricted, 181
  Lower bound on the size, 174
  Properties, 162
Vector time
  Definition, 159
  Detection of a conjunction of local predicates, 166
  Development, 163
  wrt. global states, 165
Vertex coloring, 42
Vote, 268
W
Wait-for graph, 402
Waiting
  Due to messages, 401
  Due to resources, 401
Wave and sequence of waves
  Definition, 379
  Ring-based implementation, 380
  Tree-based implementation, 380
Wave-based algorithm
  Spanning tree construction, 17
  Termination detection, 381, 390, 394
WW constraint, 451
Z
Z-cycle-freedom
  Algorithms, 201
  Dating system, 199
  Definition, 196
  Notion of an optimal algorithm, 203
  Operational characterization, 199
Z-dependency relation, 190
Zigzag path, 190
Zigzag pattern, 191