DC - Unit 4 Latest


UNIT IV – CONSENSUS AND RECOVERY

Consensus and Agreement Algorithms: Problem Definition – Overview of Results – Agreement in a Failure-Free System (Synchronous and Asynchronous) – Agreement in Synchronous Systems with Failures; Checkpointing and Rollback Recovery: Introduction – Background and Definitions – Issues in Failure Recovery – Checkpoint-based Recovery – Coordinated Checkpointing Algorithm – Algorithm for Asynchronous Checkpointing and Recovery
Consensus and Agreement Algorithms
Introduction
 Consensus
Consensus is the form of coordination in which processes exchange information and negotiate with one another to reach a common understanding or agreement before taking application-specific actions.

Example: the commit decision in database systems, wherein the processes collectively decide whether to commit or abort a transaction that they participate in.
Problem Definition
• Assumptions underlying our study of agreement algorithms:
1. Failure Models
• Among the n processes in the system, at most f processes can be faulty

• Behavior of faulty process depends upon the failure model assumed

• Fail-stop model: a process may crash in the middle of a step, which could be the
execution of a local operation or processing of a message for a send or receive
event.
• Byzantine failure model: a process may behave arbitrarily

• Choice of failure model depends upon the feasibility and complexity of solving
consensus.
Problem Definition
2. Synchronous/Asynchronous communication:

• If a failure-prone process chooses to send a message to process Pi but fails, then Pi cannot detect the non-arrival of the message in an asynchronous system, because this scenario is indistinguishable from the message merely taking a very long time in transit.
• In a synchronous system, however, the non-arrival of an expected message can be recognized by the intended recipient at the end of the round.

3. Network connectivity:
• The system has full logical connectivity, i.e., each process can communicate with
any other by direct message passing.

4. Sender identification:
• A process that receives a message always knows the identity of the sender
process.
Problem Definition
• When multiple messages are expected from the same sender in a single round, we implicitly assume a scheduling algorithm that sends these messages in sub-rounds, so that each message sent within the round can be uniquely identified.

5. Channel reliability:
• The channels are reliable, and only the processes may fail

6. Authenticated vs. non-authenticated messages:

• With unauthenticated messages, when a faulty process relays a message to other processes: (i) it can forge the message and claim that it was received from another process, and (ii) it can tamper with the contents of a received message before relaying it.
Problem Definition
• An unauthenticated message is also called an oral message or an unsigned
message.
• When a process receives a message, it has no way to verify its authenticity.
Solution: Using techniques such as digital signatures, it is easier to solve the
agreement problem because if some process forges a message or
tampers with the contents of a received message before relaying it,
the recipient can detect the forgery or tampering.

7. Agreement variable:
• The agreement variable may be boolean or multi-valued, and need not be an
integer.
Problem Definition
Case study: Difficulty of reaching agreement

• Inspired by the long wars fought by the Byzantine Empire in the Middle Ages
• Four camps of the attacking army, each commanded by a general, are camped
around the fort of Byzantium.
• They can succeed in attacking only if they attack simultaneously. Hence, they
need to reach agreement on the time of attack.

• The only way they can communicate is to send messengers among themselves.
The messengers model the messages.

• An asynchronous system is modeled by messengers taking an unbounded time to travel between two camps.
Problem Definition
Case study: Difficulty of reaching agreement

• A lost message is modeled by a messenger being captured by the enemy.

• A Byzantine process is modeled by a general being a traitor.

• The traitor will attempt to subvert the agreement-reaching mechanism by giving misleading information to the other generals.

• A traitor may inform one general to attack at 10am, and inform the other
generals to attack at noon. Or he may not send a message at all to some general.
Likewise, he may tamper with messages he gets from other generals, before
relaying those messages.
Problem Definition

• Four generals are shown, and a consensus decision is to be reached about a boolean value.
• The various generals convey potentially misleading values of the decision variable to the other generals, which results in confusion.
Problem Definition

• Under such Byzantine behavior, the challenge is to determine whether it is possible to reach agreement, and if so, under what conditions.
• If agreement is reachable, then protocols to reach it need to be devised.

(A) The Byzantine Agreement and Other Problems

(i) The Byzantine Agreement Problem

• The Byzantine agreement problem requires a designated process, called the source process, which has an initial value, to reach agreement with the other processes about its initial value, subject to the following conditions.
Problem Definition
• Agreement: All non-faulty processes must agree on the same value.

• Validity: If the source process is non-faulty, then the agreed upon value by all
the non-faulty processes must be the same as the initial value of the source.

• Termination: Each non-faulty process must eventually decide on a value.

(ii) The Consensus Problem


• The Consensus problem differs from the Byzantine Agreement problem in that
each process has an initial value and all the correct processes must agree on a
single value.
Problem Definition
• Agreement: All non-faulty processes must agree on the same (single) value.

• Validity: If all the non-faulty processes have the same initial value, then the
agreed upon value by all the non-faulty processes must be that same value.

• Termination: Each non-faulty process must eventually decide on a value.

(iii) The Interactive Consistency Problem


• The Interactive Consistency problem differs from the Byzantine agreement
problem in that each process has an initial value, and all the correct processes
must agree upon a set of values, one value for each process.
Problem Definition

• Agreement: All non-faulty processes must agree on the same array of values
A[v1 . . . vn].

• Validity: If process i is non-faulty and its initial value is vi, then all non-faulty
processes agree on vi as the ith element of the array A. If process j is faulty, then
the non-faulty processes can agree on any value for A[j].

• Termination: Each non-faulty process must eventually decide on the array A.


Overview of results
• In a failure-free system, consensus is attainable.
• In a synchronous system, common knowledge of the consensus value is attainable.
• In an asynchronous system, concurrent common knowledge of the consensus value is attainable.
Overview of results
Agreement in a Failure-Free System (Synchronous or Asynchronous)
• In a failure-free system, consensus can be reached by
1. collecting information from the different processes
2. arriving at a decision
3. distributing this decision in the system

• A distributed mechanism would have each process broadcast its values to the others, and each process computes the same function on the values received.
• The decision can be reached by using an application-specific function – some simple examples being the majority, max, and min functions.
Agreement in a Failure-Free System (Synchronous or Asynchronous)
• Algorithms to collect the initial values and then distribute the decision may be
based on the token circulation on a logical ring, or the three-phase tree-based
broadcast-convergecast-broadcast, or direct communication with all nodes.
1. In a synchronous system, this can be done simply in a constant number of
rounds
2. In an asynchronous system, consensus can similarly be reached in a constant
number of message hops.

• Reaching agreement is straightforward in a failure-free system. Hence, we focus on failure-prone systems.
Agreement in (Message-Passing) Synchronous Systems with Failures
1. Consensus Algorithm for Crash Failures (Synchronous System)

2. Consensus Algorithms for Byzantine Failures (Synchronous System)

(A) Upper Bound on Byzantine Processes

(B) Byzantine Agreement Tree Algorithm: Exponential (Synchronous System)


(i) Recursive formulation
(ii) Iterative formulation

3. Phase-King Algorithm for Consensus: Polynomial (Synchronous System)


Agreement in (Message-Passing) Synchronous Systems with Failures
1. Consensus Algorithm for Crash Failures (Synchronous System)
Agreement in (Message-Passing) Synchronous Systems with Failures

• The agreement condition is satisfied because in the f + 1 rounds, there must be at least one round in which no process failed.
• In this round, say round r, all the processes that have not failed so far succeed in broadcasting their values, and all these processes take the minimum of the values broadcast and received in that round.
• Thus, the local values at the end of round r are the same, say x_i^r, for all non-failed processes. In further rounds, only this value may be sent by each process at most once, and no process i will update its value x_i^r.
Agreement in (Message-Passing) Synchronous Systems with Failures
• The validity condition is satisfied because processes do not send fictitious values
in this failure model. (Thus, a process that crashes has sent only correct values
until the crash). For all i, if the initial value is identical, then the only value sent
by any process is that identical value which is the value agreed upon as per the
agreement condition

• The termination condition is seen to be self-evidently satisfied.

Complexity:
• The number of messages is at most O(n²) in each round, and the total number of messages is O((f + 1) · n²).
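The pseudo-code of this crash-failure consensus algorithm appears only as a figure in the original slides. Below is a minimal round-based simulation sketch of the idea (each process broadcasts newly learned values for f + 1 rounds and then decides on the minimum). The function and variable names, and the simplified crash model in which a crashed process simply stays silent from its crash round onward, are illustrative assumptions and not part of the source.

def crash_consensus(initial_values, f, crash_round=None):
    """initial_values: dict pid -> value; f: max number of crash failures;
    crash_round: dict pid -> round from which that pid is silent (simplified crash model)."""
    crash_round = crash_round or {}
    known = {p: {v} for p, v in initial_values.items()}   # values each process has seen so far
    sent = {p: set() for p in initial_values}             # values each process has already broadcast

    for r in range(1, f + 2):                              # rounds 1 .. f + 1
        deliveries = []
        for p in initial_values:                           # non-crashed processes broadcast new values
            if crash_round.get(p, f + 2) <= r:
                continue                                    # crashed process: stays silent (simplification)
            new_vals = known[p] - sent[p]
            sent[p] |= new_vals
            deliveries.append(new_vals)
        for p in initial_values:                            # every non-crashed process receives all broadcasts
            if crash_round.get(p, f + 2) > r:
                for vals in deliveries:
                    known[p] |= vals
    # each surviving process decides on the minimum value it has seen
    return {p: min(known[p]) for p in initial_values if crash_round.get(p, f + 2) > f + 1}

# Example: four processes, at most f = 1 crash; process 1 crashes in round 1.
print(crash_consensus({0: 7, 1: 3, 2: 5, 3: 9}, f=1, crash_round={1: 1}))

In the example, all surviving processes decide the same value, illustrating the agreement condition argued above.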
Agreement in (Message-Passing) Synchronous Systems with Failures
2. Consensus Algorithms for Byzantine Failures (Synchronous System)
(A) Upper Bound on Byzantine Processes
Agreement in (Message-Passing) Synchronous Systems with Failures
Proof:
With n processes, of which f ≥ n/3 may be faulty, the Byzantine agreement problem cannot be solved.

• Let Z(3, 1) denote the Byzantine agreement problem for parameters n = 3 and f =
1.

• Let Z(n ≤ 3f, f) denote the Byzantine agreement problem for parameters n(≤ 3f)
and f.

• A reduction from Z(3, 1) to Z(n ≤ 3f, f) needs to be shown, i.e., if Z(n ≤ 3f, f) is
solvable, then Z(3, 1) is also solvable. After showing this reduction, we can argue
that as Z(3, 1) is not solvable, Z(n ≤ 3f, f) is also not solvable.
Agreement in (Message-Passing) Synchronous Systems with Failures
The main idea of the reduction argument

• In Z(n ≤ 3f, f), partition the n processes into three sets S1, S2, S3, each of size ≤
n/3.

• In Z(3, 1), each of the three processes P1, P2, P3 simulates the actions of the
corresponding set S1, S2, S3 in Z(n ≤ 3f, f).

• If one process is faulty in Z(3, 1), then at most f, where f ≤ n/3, processes are
faulty in Z(n, f).

• In the simulation, a correct process in Z(3, 1) simulates a group of up to n/3 correct processes in Z(n, f).
Agreement in (Message-Passing) Synchronous Systems with Failures
(B) Byzantine Agreement Tree Algorithm: Exponential (Synchronous System)
(i) Recursive formulation (Oral Message algorithm, also known as the Lamport–Shostak–Pease algorithm)

Number of messages for agreement on one value (for n = 4, f = 1): (n − 1) + (n − 1)(n − 2) = (4 − 1) + (4 − 1)(4 − 2) = 3 + 6 = 9 messages.
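Since the recursive pseudo-code itself is not reproduced in these slides, here is a compact simulation sketch of the recursive Oral Message algorithm OM(f). The Python names and the particular way a traitorous commander misbehaves (sending alternating values by destination) are illustrative assumptions; they only serve to show the recursion and the majority voting.

from collections import Counter

def om(commander, lieutenants, value, f, traitors):
    """Return a dict mapping each lieutenant to the value it decides for this commander."""
    def sent_value(to, v):
        # a traitorous commander may send arbitrary values; here it alternates by destination id
        return v if commander not in traitors else int(to % 2 == 0)

    if f == 0:
        return {p: sent_value(p, value) for p in lieutenants}

    # step 1: the commander sends its value to every lieutenant
    received = {p: sent_value(p, value) for p in lieutenants}

    # step 2: each lieutenant relays the value it received to the others via OM(f-1)
    relayed = {p: {} for p in lieutenants}
    for q in lieutenants:
        others = [p for p in lieutenants if p != q]
        sub = om(q, others, received[q], f - 1, traitors)
        for p in others:
            relayed[p][q] = sub[p]

    # step 3: each lieutenant decides by majority over its own value and the relayed values
    return {p: Counter([received[p]] + list(relayed[p].values())).most_common(1)[0][0]
            for p in lieutenants}

# Example: n = 4, f = 1; commander 0 is loyal with value 1, lieutenant 3 is a traitor.
print(om(commander=0, lieutenants=[1, 2, 3], value=1, f=1, traitors={3}))

For n = 4 and f = 1 this recursion generates exactly the (n − 1) + (n − 1)(n − 2) = 9 messages counted above, and the loyal lieutenants decide the loyal commander's value.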
Agreement in (Message-Passing) Synchronous Systems with Failures
(B) Byzantine Agreement Tree Algorithm: Exponential (Synchronous System)
(ii) Iterative formulation: iterative version of the high-level recursive algorithm
• Lines (2a)–(2e) correspond to the unfolding action of the recursive pseudo-code.
• Lines (2f)–(2h) correspond to the folding of the recursive pseudo-code.
• Two operations are defined on the list L:
- head(L) is the first member of the list L.
- tail(L) is the list L after removing its first member.
Agreement in (Message-Passing) Synchronous Systems with Failures
• Each process maintains a tree of boolean variables
• The tree data structure is used as follows:
Agreement in (Message-Passing) Synchronous Systems with Failures
• Once the entire tree is filled from root to leaves, the actions in the folding of
the recursion are simulated in lines (2f)-(2h) of the iterative version,
proceeding from the leaves up to the root of the tree.

• These actions are crucial – they entail taking the majority of the values at
each level of the tree.

• The final value of the root is the agreement value, which will be the same at
all processes.
Agreement in (Message-Passing) Synchronous Systems with Failures
(B) Byzantine Agreement Tree Algorithm: Exponential (Synchronous System)
(ii) Iterative formulation – Algorithm
Agreement in (Message-Passing) Synchronous Systems with Failures
Correctness of the Byzantine Agreement Algorithm
• Loyal commander: Given f and x, if the commander process is loyal, then
Oral_Msg(x) is correct if there are at least 2f + x processes.
• No assumption about commander: Given f, Oral_Msg(x) is correct if x ≥ f and
there are a total of 3x + 1 or more processes.
Agreement in (Message-Passing) Synchronous Systems with Failures

Complexity

The algorithm requires f + 1 rounds, an exponential amount of local memory, and (n − 1) + (n − 1)(n − 2) + · · · + (n − 1)(n − 2) · · · (n − f − 1) messages.
Agreement in (Message-Passing) Synchronous Systems with Failures
(C) Phase-King Algorithm for Consensus: Polynomial (Synchronous System)

• Proposed by Berman and Garay

• The phase-king algorithm solves the consensus problem using f + 1 phases, and a
polynomial number of messages

• It can tolerate only f < ⌈n/4⌉ malicious processes.

• The algorithm is so called because it operates in f + 1 phases, each with two rounds, and a unique process plays an asymmetrical role as a leader in each of these phases.
Agreement in (Message-Passing) Synchronous Systems with Failures
• Each phase has two rounds:
• First round: each process sends its estimate to all other processes.
• Second round: the phase king arrives at an estimate based on the values it received in the first round and broadcasts its new estimate to all others.
The message pattern
Agreement in (Message-Passing) Synchronous Systems with Failures
• The phase king algorithm
Agreement in (Message-Passing) Synchronous Systems with Failures
• Correctness
Agreement in (Message-Passing) Synchronous Systems with Failures

• Complexity

The algorithm requires f + 1 phases with two sub-rounds each, and (f + 1)[(n − 1)(n + 1)] messages.
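The phase-king pseudo-code and its correctness argument appear only as figures in the original slides. The sketch below simulates the two rounds of each phase for boolean values, assuming n > 4f. The fault model (a Byzantine process sending a fixed wrong value) and all names are illustrative assumptions, not the source's pseudo-code.

def phase_king(values, byzantine, n, f):
    """values: dict pid -> 0/1 estimate; byzantine: set of faulty pids; requires n > 4f."""
    est = dict(values)
    for phase in range(f + 1):
        king = phase                                    # a distinct king in each phase
        # round 1: every process broadcasts its estimate (faulty processes may lie)
        received = {i: {j: (j % 2 if j in byzantine else est[j]) for j in range(n)}
                    for i in range(n)}
        majority, mult = {}, {}
        for i in range(n):
            ones = sum(received[i].values())
            majority[i] = 1 if ones > n // 2 else 0
            mult[i] = max(ones, n - ones)               # multiplicity of the majority value
        # round 2: the phase king broadcasts its majority value as a tiebreaker
        tiebreak = 1 if king in byzantine else majority[king]
        for i in range(n):
            if i in byzantine:
                continue
            est[i] = majority[i] if mult[i] > n / 2 + f else tiebreak
    return {i: est[i] for i in range(n) if i not in byzantine}

# Example: n = 5, f = 1 (so n > 4f); process 4 is Byzantine.
print(phase_king({0: 1, 1: 0, 2: 1, 3: 1, 4: 0}, byzantine={4}, n=5, f=1))

Because there are f + 1 phases and at most f faulty processes, at least one phase has a correct king, after which all correct processes hold the same estimate and the threshold test keeps them in agreement.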
Checkpointing and Rollback Recovery
Introduction
• Distributed systems are not inherently fault-tolerant, and the vast computing potential of these systems is often hampered by their susceptibility to failures.

• Many techniques, such as transactions, group communication, and rollback recovery, have been developed to add reliability and high availability to distributed systems.

Rollback recovery protocols

• restore the system to a consistent state after a failure.

• achieve fault tolerance by periodically saving the state of a process during failure-free execution, and restarting from a saved state upon a failure to reduce the amount of lost work.
Introduction
• The saved state is called a checkpoint, and the procedure of restarting from a previously checkpointed state is called rollback recovery.

• A checkpoint can be saved on either the stable storage or the volatile storage
depending on the failure scenarios to be tolerated

• In distributed systems, rollback recovery is complicated because messages induce inter-process dependencies during failure-free operation.

• Rollback propagation: upon a failure of one or more processes in a system, these dependencies may force some of the processes that did not fail to roll back, creating what is commonly called rollback propagation.
Introduction
• Domino effect :
consider the situation where the sender of a message m rolls back to a state that precedes the sending of m. The receiver of m must also roll back to a state that
precedes m’s receipt; otherwise, the states of the two processes would be
inconsistent because they would show that message m was received without
being sent, which is impossible in any correct failure-free execution. This
phenomenon of cascaded rollback is called the domino effect.

• Independent or uncoordinated checkpointing:


If each participating process takes its checkpoints independently, then the system
is susceptible to the domino effect. This approach is called independent or
uncoordinated checkpointing.
Introduction
• Techniques that avoid Domino effect :

(a) Coordinated checkpointing rollback recovery: processes coordinate their checkpoints to form a system-wide consistent state.

(b) Communication-induced checkpointing rollback recovery: forces each process to take checkpoints based on information piggybacked on the application messages it receives from other processes.

(c) Log-based rollback recovery: combines checkpointing with logging of nondeterministic events.
Background and Definitions
1. System model
2. Local checkpoint
3. Consistent system states
4. Interactions with the outside world
5. Different types of messages
(a) In-Transit messages
(b) Lost messages
(c) Delayed messages
(d) Orphan messages
(e) Duplicate messages
Background and Definitions
1. System model

• A distributed system consists of a fixed number of processes P1, P2,..., PN that


communicate only through messages.
• Processes cooperate to execute a distributed application and interact with the
outside world by receiving and sending input and output messages, respectively
Background and Definitions
• Some protocols assume that the communication subsystem delivers messages
reliably, in First-In-First-Out (FIFO) order, while other protocols assume that the
communication subsystem can lose, duplicate, or reorder messages.

• A system recovers correctly if its internal state is consistent with the observable
behavior of the system before the failure

2. Local checkpoint
• A local checkpoint is a snapshot of the state of the process at a given instance
and the event of recording the state of a process is called local checkpointing.

• The contents of a checkpoint depend upon the application context and the
checkpointing method being used.
Background and Definitions
• Depending upon the checkpointing method used, a process may keep several
local checkpoints or just a single checkpoint at any time

• We assume that a process stores all local checkpoints on the stable storage so
that they are available even if the process crashes

• We also assume that a process is able to roll back to any of its existing local
checkpoints and thus restore to and restart from the corresponding state.
Background and Definitions
3. Consistent and Inconsistent system states
• A consistent system state is one in which, if a process’s state reflects a message receipt, then the state of the corresponding sender reflects the sending of that message.

• The figure shows message m1 to have been sent but not yet received, but that is all right.

• The state is consistent because it represents a situation in which, for every message that has been received, there is a corresponding message send event.
Background and Definitions
Inconsistent system state

• The state is inconsistent because process P2 is shown to have received m2, but the state of process P1 does not reflect having sent it.

• Inconsistent states occur because of failures.


Background and Definitions
4. Interactions with the outside world (OWP)
• A distributed application often interacts with the outside world to receive input
data or deliver the outcome of a computation

• If a failure occurs, the outside world cannot be expected to roll back.

• For example, a printer cannot roll back the effects of printing a character, and an
automatic teller machine cannot recover the money that it dispensed to a
customer

• Output Commit
Before sending output to the OWP, the system must ensure that the state from
which the output is sent will be recovered despite any future failure. This is
commonly called the output commit problem.
Background and Definitions
• Input messages
• input messages that a system receives from the OWP may not be reproducible
during recovery, because it may not be possible for the outside world to
regenerate them.

• recovery protocols must arrange to save these input messages so that they can
be retrieved when needed for execution replay after a failure.

• A common approach is to save each input message on the stable storage before
allowing the application program to process it.
Background and Definitions
5. Different types of messages
Background and Definitions
5. Different types of messages
a. In-transit messages: messages that have been sent but not yet received.
b. Lost messages: messages whose send is done but whose receive is undone due to rollback.
c. Delayed messages: messages whose receive is not recorded because the receiving process was either down or the message arrived after rollback.
d. Orphan messages: messages with receive recorded but send not recorded; these do not arise if processes roll back to a consistent global state.
Background and Definitions
e. Duplicate messages: due to message logging and replaying during process recovery, some messages are sent repeatedly.
Issues in Failure Recovery
Overlapping failures

• A process Pj that begins rollback/recovery in response to the failure of a process Pi can itself fail and develop amnesia with respect to process Pi’s failure; that is, process Pj can act in a fashion that exhibits ignorance of process Pi’s failure.

• If overlapping failures are to be tolerated, a mechanism must be introduced to deal with amnesia and the resulting inconsistencies.
Checkpoint Based Recovery
1. Uncoordinated checkpointing

2. Coordinated checkpointing
(a) Blocking coordinated checkpointing
(b) Non-blocking coordinated checkpointing

3. Impossibility of Min Process Non-blocking Checkpointing

4. Communication-Induced Checkpointing
(a) Model-based Checkpointing
(b) Index based checkpointing
Checkpoint Based Recovery
1. Uncoordinated checkpointing
• Each process has autonomy in deciding when to take checkpoints.

• Eliminates synchronization overhead, as there is no need for coordination between processes.

• Autonomy in taking checkpoints also allows each process to select appropriate checkpoint positions.

Drawbacks:
1. There is the possibility of the domino effect during a recovery, which may cause the loss of a large amount of useful work.

Checkpoint Based Recovery

2. Recovery from a failure is slow because processes need to iterate to find a consistent set of checkpoints.

3. Useless checkpoints:

(a) Since no coordination is done at the time a checkpoint is taken, checkpoints taken by a process may be useless.

(b) Useless checkpoints are undesirable because they incur overhead and do not contribute to advancing the recovery line.
Checkpoint Based Recovery

4. Forces each process to maintain multiple checkpoints, and to periodically invoke a garbage collection algorithm to reclaim the checkpoints that are no longer required.

5. It is not suitable for applications with frequent output commits, because these require global coordination to compute the recovery line.
Checkpoint Based Recovery

How to determine a consistent global checkpoint?

• As each process takes checkpoints independently, we need to determine a consistent global checkpoint to roll back to when a failure occurs.

• In order to determine a consistent global checkpoint during recovery, the processes record the dependencies among their checkpoints caused by message exchange during failure-free operation.

• The steps followed for consistent global checkpoint recovery are as follows:


Checkpoint Based Recovery
1. When a failure occurs, the recovering process initiates rollback by broadcasting a
dependency request message to collect all the dependency information
maintained by each process.

2. When a process receives this message, it stops its execution and replies with
the dependency information saved on the stable storage

3. The initiator then calculates the recovery line based on the global dependency
information and broadcasts a rollback request message containing the
recovery line.

4. Upon receiving this message, a process whose current state belongs to the
recovery line simply resumes execution; otherwise, it rolls back to an earlier
checkpoint as indicated by the recovery line.
Checkpoint Based Recovery
The direct dependency tracking technique shown in the below diagram is
commonly used in uncoordinated checkpointing.
Checkpoint Based Recovery
• Let ci,x be the xth checkpoint of process Pi, where i is the process id and x is the checkpoint index.

• Let Ii,x denote the checkpoint interval (or simply interval) between checkpoints ci,x−1 and ci,x.

• When process Pi at interval Ii,x sends a message m to Pj, it piggybacks the pair (i, x) on m.

• When Pj receives m during interval Ij,y, it records the dependency from Ii,x to Ij,y, which is later saved onto stable storage when Pj takes checkpoint cj,y.
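A minimal sketch of this direct dependency tracking, under the piggybacking rule just described. The class and field names are illustrative assumptions, and stable storage is modelled as an in-memory dictionary.

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.ckpt_index = 0            # x: index of the latest checkpoint c_pid,x
        self.deps = set()              # dependencies recorded in the current interval
        self.stable = {}               # "stable storage": checkpoint index -> dependency set

    def send(self, msg, dest):
        # piggyback (i, x): sender id and current interval index I_i,x
        dest.receive(msg, sender=self.pid, sender_interval=self.ckpt_index + 1)

    def receive(self, msg, sender, sender_interval):
        # record the dependency from I_sender,x to the receiver's current interval
        self.deps.add((sender, sender_interval))

    def take_checkpoint(self):
        self.ckpt_index += 1
        self.stable[self.ckpt_index] = set(self.deps)   # saved together with checkpoint c_pid,x
        self.deps = set()

p1, p2 = Process(1), Process(2)
p1.send("m", p2)            # P2's current interval now depends on I_1,1
p2.take_checkpoint()        # the dependency is saved with checkpoint c_2,1
print(p2.stable)            # {1: {(1, 1)}}

During recovery, the initiator would collect these saved dependency sets from all processes to compute the recovery line, as listed in the steps above.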
Checkpoint Based Recovery
2. Coordinated checkpointing
• processes coordinate in checkpointing activities so that all local checkpoints
form a consistent global state

• Benefits are
 simplifies recovery
 avoids domino effect
 each process restarts from its recent checkpoint

• Requires each process to maintain only one checkpoint on the stable storage,
reducing the storage overhead and eliminating the need for garbage
collection.
Checkpoint Based Recovery

• The main disadvantages of this method are:

 Large latency is involved in committing output, as a global checkpoint is needed before a message is sent to the OWP.
 Delays and overhead are incurred with each new global checkpoint.

• For the checkpoints to be synchronized, the clocks would have to be synchronized so that all processes agree on the instants of time at which they take checkpoints. However, exact clock synchronization cannot be achieved in a distributed system.

• Solution:
checkpoint consistency can be achieved without synchronizing clocks by
Checkpoint Based Recovery
(a) Blocking the message sending for the running duration of the protocol
(b) Piggybacking checkpoint indices on messages to avoid blocking.

(i) Blocking Coordinated Checkpointing

Approach :
coordinated checkpointing involves blocking communications while the
checkpointing protocol executes

Procedure :
• After a process takes a local checkpoint, to prevent orphan messages, it remains
blocked until the entire checkpointing activity is complete.
Checkpoint Based Recovery
• The coordinator takes a checkpoint and broadcasts a request message to all
processes, asking them to take a checkpoint.

• When a process receives this message


 it stops its execution
 flushes all the communication channels
 takes a tentative checkpoint and
 sends an acknowledgment message back to the coordinator.

• After the coordinator receives acknowledgments from all processes, it


broadcasts a commit message that completes the two-phase checkpointing
protocol.
Checkpoint Based Recovery

• After receiving the commit message, a process
 removes the old permanent checkpoint,
 atomically makes the tentative checkpoint permanent, and
 resumes its execution and exchange of messages with other processes.

• A problem with this approach is that the computation is blocked during the checkpointing; therefore, non-blocking checkpointing schemes are preferable.
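A minimal sketch of the two-phase (request/ack, then commit) blocking protocol just described. The class names and the synchronous call structure are illustrative assumptions; a real implementation would exchange asynchronous messages and flush the communication channels before taking the tentative checkpoint.

class Participant:
    def __init__(self, pid):
        self.pid = pid
        self.permanent = None
        self.tentative = None
        self.blocked = False

    def on_checkpoint_request(self):
        self.blocked = True                       # stop sending application messages
        self.tentative = f"state-of-P{self.pid}"  # take a tentative checkpoint
        return "ack"

    def on_commit(self):
        self.permanent = self.tentative           # make the tentative checkpoint permanent
        self.tentative = None
        self.blocked = False                      # resume execution and message exchange

class Coordinator(Participant):
    def run(self, participants):
        self.on_checkpoint_request()              # the coordinator checkpoints itself first
        acks = [p.on_checkpoint_request() for p in participants]   # phase 1: request + ack
        if all(a == "ack" for a in acks):
            self.on_commit()
            for p in participants:                # phase 2: commit
                p.on_commit()

procs = [Participant(1), Participant(2)]
Coordinator(0).run(procs)
print([p.permanent for p in procs])   # ['state-of-P1', 'state-of-P2']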
Checkpoint Based Recovery
(ii) Non Blocking Coordinated Checkpointing
• In this approach processes need not stop their execution while taking
checkpoints.

• A fundamental problem is that application messages are not blocked in this form of coordinated checkpointing, due to which the global checkpoint might become inconsistent when a message sent after the sender's checkpoint is received before the receiver takes its checkpoint.
Checkpoint Based Recovery

• Message m is sent by P0 after receiving a checkpoint request from the checkpoint coordinator.
• Assume m reaches P1 before the checkpoint request.
• This situation results in an inconsistent checkpoint, since checkpoint c1,x shows the receipt of message m from P0, while checkpoint c0,x does not show m being sent from P0.
Checkpoint Based Recovery
Solution 1: with FIFO Channel
The problem can be avoided by preceding the first post-checkpoint message on each channel by a checkpoint request, forcing each process to take a checkpoint before receiving the first post-checkpoint message.
Checkpoint Based Recovery
Solution 2: with Non-FIFO Channel
• Two approaches are used

Approach 1:
• uses the idea of snapshot algorithm of Chandy and Lamport in which
markers play the role of the checkpoint-request messages.

• In this algorithm, the initiator takes a checkpoint and sends a marker (a


checkpoint request) on all outgoing channels.

• The marker can be piggybacked on every post-checkpoint message.


Checkpoint Based Recovery

• Each process takes a checkpoint upon receiving the first marker and sends the
marker on all outgoing channels before sending any application message

Approach 2:

Checkpoint indices are piggybacked on messages, and a checkpoint is triggered when the receiver's local checkpoint index is lower than the piggybacked checkpoint index.
Checkpoint Based Recovery
3. Impossibility of Min Process Non-blocking Checkpointing
• A min-process, non-blocking checkpointing algorithm is one that
 forces only a minimum number of processes to take a new checkpoint
 Does not force any process to suspend its computation

• min-process checkpointing algorithm consists of two phases


Phase 1:
• the checkpoint initiator identifies all processes with which it has communicated
since the last checkpoint and sends them a request

• Upon receiving the request, each process in turn identifies all processes it has
communicated with since the last checkpoints and sends them a request, and so
on, until no more processes can be identified.
Checkpoint Based Recovery
Phase 2:
• all processes identified in the first phase take a checkpoint
• The result is a consistent checkpoint that involves only the participating
processes.
• In this protocol, after a process takes a checkpoint, it cannot send any message
until the second phase terminates successfully, although receiving a message
after the checkpoint has been taken is allowable.

• Based on the concept called ’Z-dependency’, Cao and Singhal proved that there
does not exist a non-blocking algorithm that allows a minimum number of
processes to take their checkpoints.
Checkpoint Based Recovery
4. Communication-Induced Checkpointing
• It is another way to avoid domino effect while allowing processes to take some of
their checkpoints independently.

• Processes may be forced to take additional checkpoints


• Advantage:
 Minimizes or eliminates useless checkpoints
 Processes can act independently

• Two types
1. Autonomous checkpoints: The checkpoints that a process takes independently

2. Forced checkpoints : The checkpoints which processes are forced to take


Checkpoint Based Recovery

• Communication-induced checkpointing piggybacks protocol-related information on each application message.

• The receiver of each application message uses the piggybacked information to determine if it has to take a forced checkpoint to advance the global recovery line.

• The forced checkpoint must be taken before the application may process the contents of the message.
Checkpoint Based Recovery
• Two types of communication-induced checkpointing
1. Model-based checkpointing
2. Index-based checkpointing

Model-based checkpointing
• Model-based checkpointing prevents checkpoints that could result in inconsistent
states among the existing checkpoints and the communication pattern related to
it.
• Communication pattern causing inconsistency is prevented by forced checkpoints.

• No control messages are exchanged among the processes during normal operation; all information necessary to execute the protocol is piggybacked on application messages.
Checkpoint Based Recovery
• There are several domino-effect-free checkpoint and communication models.

1. MRS (mark, send, and receive) model


• Proposed by Russell
• avoids the domino effect by ensuring that within every checkpoint interval all
message receiving events precede all message-sending events.

2. Rollback prevention:
• proposed by Wu and Fuchs
• In this approach, the domino effect is avoided by taking a checkpoint immediately after every message-sending event.
Checkpoint Based Recovery
• Avoids rollback propagation

Index-Based Checkpointing

• Index-based communication-induced checkpointing assigns monotonically increasing indexes to checkpoints, such that the checkpoints having the same index at different processes form a consistent state.

• Inconsistency between checkpoints of the same index can be avoided in a lazy fashion if indexes are piggybacked on application messages to help receivers decide when they should take a forced checkpoint.
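A minimal sketch of the index-based forced-checkpoint rule: the receiver takes a forced checkpoint before processing a message that carries a higher index than its own latest checkpoint. The names, and the choice of adopting the piggybacked index directly as the forced checkpoint's index, are illustrative assumptions.

class IndexedProcess:
    def __init__(self, pid):
        self.pid = pid
        self.index = 0                       # index of the latest local checkpoint

    def take_checkpoint(self):
        self.index += 1                      # autonomous (basic) checkpoint
        print(f"P{self.pid}: autonomous checkpoint, index {self.index}")

    def send(self, msg, dest):
        dest.receive(msg, piggybacked_index=self.index)

    def receive(self, msg, piggybacked_index):
        if piggybacked_index > self.index:   # my latest checkpoint lags behind the sender's
            self.index = piggybacked_index   # forced checkpoint, taken before processing msg
            print(f"P{self.pid}: forced checkpoint, index {self.index}")
        # the application may now process msg

p, q = IndexedProcess(1), IndexedProcess(2)
p.take_checkpoint()      # P1 takes an autonomous checkpoint (index 1)
p.send("m", q)           # P2 is forced to checkpoint with index 1 before processing m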
Koo-Toueg Coordinated Checkpointing Algorithm
• Koo and Toueg coordinated checkpointing and recovery technique takes
o a consistent set of checkpoints
o avoids the domino effect and livelock problems during the recovery

• Includes 2 parts:
o the checkpointing algorithm
o the recovery algorithm

• Algorithm makes the following assumptions about the distributed system:


o Processes communicate by exchanging messages through communication
channels.
Koo-Toueg Coordinated Checkpointing Algorithm
o Communication channels are FIFO
o It is assumed that end-to-end protocols exist to cope with message loss due to

rollback recovery and communication failure.


o Communication failures do not partition the network.
o The checkpointing algorithm takes two kinds of checkpoints on the stable storage:

 A permanent checkpoint is a local checkpoint at a process and is a part of a consistent global checkpoint.

 A tentative checkpoint is a temporary checkpoint that is made a permanent checkpoint on the successful termination of the checkpoint algorithm.
Koo-Toueg Coordinated Checkpointing Algorithm
The algorithm has two phases
Koo-Toueg Coordinated Checkpointing Algorithm
Correctness: A set of permanent checkpoints taken by this algorithm is
consistent for two reasons
i. Either all or none of the processes take a permanent checkpoint.
ii. No process sends a message after taking a tentative checkpoint.

• Thus a situation will not arise where there is a record of a message being
received but there is no record of sending it.

An Optimization

• The above protocol may cause a process to take a checkpoint even when it is not
necessary for consistency
Koo-Toueg Coordinated Checkpointing Algorithm
• Since taking a checkpoint is an expensive operation, we must avoid taking
checkpoints if it is not necessary
Koo-Toueg Coordinated Checkpointing Algorithm
The Rollback Recovery Algorithm

• The rollback recovery algorithm restores the system state to a consistent state
after a failure.

• The rollback recovery algorithm assumes that


 a single process invokes the algorithm.
 the checkpoint and the rollback recovery algorithms are not invoked
concurrently.

• The rollback recovery algorithm has two phases.


Koo-Toueg Coordinated Checkpointing Algorithm
The Rollback Recovery Algorithm
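The pseudo-code of the two-phase rollback recovery algorithm appears only as a figure in the original slides. Below is a hedged sketch of that structure (phase 1: ask every process whether it is willing to restart from its last permanent checkpoint; phase 2: propagate the decision). All names and the synchronous call structure are illustrative assumptions, not the source's pseudo-code.

class RecoverableProcess:
    def __init__(self, pid, permanent_checkpoint):
        self.pid = pid
        self.permanent = permanent_checkpoint
        self.state = "running-state"
        self.in_recovery = False

    def on_prepare_rollback(self):
        if self.in_recovery:                 # already involved in another recovery instance
            return "no"
        self.in_recovery = True              # stop sending application messages
        return "yes"

    def on_decision(self, decision):
        if decision == "roll back":
            self.state = self.permanent      # restore the latest permanent checkpoint
        self.in_recovery = False             # resume normal execution

def recover(initiator, others):
    replies = [p.on_prepare_rollback() for p in [initiator] + others]   # phase 1
    decision = "roll back" if all(r == "yes" for r in replies) else "abort"
    for p in [initiator] + others:                                      # phase 2
        p.on_decision(decision)

ps = [RecoverableProcess(i, f"ckpt-of-P{i}") for i in range(3)]
recover(ps[0], ps[1:])
print([p.state for p in ps])   # all processes restored to their permanent checkpoints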
Koo-Toueg Coordinated Checkpointing Algorithm

Correctness
All processes restart from an appropriate state because if processes decide to
restart, then they resume execution from a consistent state. (the checkpointing
algorithm takes a consistent set of checkpoints).

An Optimization
The above recovery protocol causes all processes to roll back irrespective of
whether a process needs to roll back or not.
Juang and Venkatesan Algorithm for Asynchronous Checkpointing and Recovery
The algorithm makes the following assumptions about the underlying system
• The communication channels are reliable
• Messages are delivered in FIFO order
• The communication channels have infinite buffers
• The message transmission delay is arbitrary, but finite

• The underlying computation is assumed to be event-driven


• Two types of log storage are maintained, namely a volatile log and a stable log:
Volatile log – fast, but lost on a failure
Stable log – slow, but permanent
The contents of the volatile log are periodically flushed to the stable log.
Juang and Venkatesan Algorithm for Asynchronous Checkpointing and Recovery
• When a message arrives at a process:
1. the process reads the message;
2. modifies its current state S to a new state S';
3. sends messages to other processes.

Checkpointing Algorithm
• After executing an event, a processor records a triplet {s, m, msgs_sent} in its
volatile storage
o s is the state of the processor before the event
o m is the message (including the identity of the sender of m, denoted as
m.sender) received
o msgs_sent is the set of messages sent by the processor during the event.
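A minimal sketch of this per-event logging; the class name, the compute callback, and the in-memory stand-ins for volatile and stable storage are illustrative assumptions.

class JVProcessor:
    def __init__(self, pid, initial_state):
        self.pid = pid
        self.state = initial_state
        self.volatile_log = []
        self.stable_log = []

    def handle(self, m, sender, compute):
        s = self.state                                     # state before the event
        self.state, msgs_sent = compute(self.state, m)     # the event: new state plus messages sent
        self.volatile_log.append({"s": s, "m": (sender, m), "msgs_sent": msgs_sent})
        return msgs_sent

    def flush_to_stable(self):                             # periodic local checkpointing
        self.stable_log.extend(self.volatile_log)
        self.volatile_log.clear()

p = JVProcessor(pid=1, initial_state=0)
p.handle(m=5, sender=2, compute=lambda state, m: (state + m, [("ack", 2)]))
p.flush_to_stable()
print(p.stable_log)     # [{'s': 0, 'm': (2, 5), 'msgs_sent': [('ack', 2)]}]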
Juang and Venkatesan Algorithm for Asynchronous Checkpointing and Recovery

• A local checkpoint at a process consists of the record of an event occurring at the process.

• Periodically, a process independently saves the contents of the volatile log in the stable storage and clears the volatile log.
Recovery Algorithm
Juang and Venkatesan Algorithm for Asynchronous Checkpointing and Recovery

• Since checkpointing is asynchronous, the main issue in recovery is to find a consistent set of checkpoints to which the system can be restored.

• The recovery algorithm achieves this by making each processor keep track of both the number of messages it has sent to other processors and the number of messages it has received from other processors.

• Recovery may involve several iterations of rollbacks by processes.

• Whenever a process rolls back, it is necessary for all other processes to find out if any message sent by the rolled-back process has become an orphan message.

• Orphan messages are discovered by comparing the number of messages sent to and received from neighboring processors.
Juang and Venkatesan Algorithm for Asynchronous Checkpointing and Recovery

• If RCVDj←i(CkPtj) > SENTi→j(CkPti), i.e., processor pj has recorded the receipt of more messages from pi than pi has recorded sending to pj, then one or more messages at processor pj are orphan messages.

• In this case, processor pj must roll back to a state where the number of messages received agrees with the number of messages sent.
• When a process fails, it:
1. rolls back to its latest checkpoint;
2. computes its SENT counts from that checkpoint and transmits them to the other processes;
3. receives the SENT counts of the other processes;
4. rolls back, if necessary, to the latest checkpoint at which its received counts do not exceed the corresponding SENT counts; otherwise it stays at its most recent checkpoint.
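A sketch of the orphan-detection rule used during recovery: process pj must roll back until the messages it has recorded as received from pi do not exceed what pi's restored state says it sent. The data layout and names are illustrative assumptions.

def find_rollback_point(log_j, sent_by_i):
    """log_j: p_j's logged events, oldest first, each carrying rcvd_from_i = number of
    messages received from p_i up to that event.
    sent_by_i: SENT_{i->j} according to the state p_i has rolled back to."""
    # keep the latest event of p_j whose receive count does not exceed sent_by_i
    for event in reversed(log_j):
        if event["rcvd_from_i"] <= sent_by_i:
            return event
    return None   # otherwise roll back to the initial state

# p_j recorded 0, 1, then 2 messages from p_i; p_i's restored state says it sent only 1.
log_j = [{"name": "e_j0", "rcvd_from_i": 0},
         {"name": "e_j1", "rcvd_from_i": 1},
         {"name": "e_j2", "rcvd_from_i": 2}]
print(find_rollback_point(log_j, sent_by_i=1))   # p_j rolls back to e_j1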
Juang and Venkatesan Algorithm for Asynchronous Checkpointing and Recovery
• Suppose processor Y crashes at the point indicated and rolls back to a state
corresponding to checkpoint ey1.
• According to this state, Y has sent only one message to X; however, according
to X’s current state (ex2), X has received two messages from Y. Therefore, X must
roll back to a state preceding ex2 to be consistent with Y’s state.

• We note that if X rolls back to checkpoint ex1, then it will be consistent with Y's state ey1.
• Likewise, processor Z must roll back to ez2 to be consistent with Y's state ey1.
Juang and Venkatesan Algorithm for Asynchronous Checkpointing and Recovery
Algorithm
Juang and Venkatesan Algorithm for Asynchronous Checkpointing and Recovery
Example for the Algorithm
