Lecture 1. Message passing & Snapshots
A distributed system is an interconnected collection of autonomous processes. Motivations
of using distributed systems are recourse sharing, parallelization, multicore programming,
etc.
There are three major differences between uniprocessor systems (with 1 processor) and
distributed systems:
1. Lack of knowledge of the global state: A process has no up-to-date knowledge on the
local states of other processes. Things like termination- and deadlock detection
become issues.
2. Lack of a global timeframe: Multiple processes may execute operations in a different
order.
3. Nondeterminism: Multiple executions of the same program may give different results.
In distributed computing, there are two main communication paradigms: message passing
and shared memory. Communication can happen synchronous or asynchronous.
Asynchronous communication means that sending and receiving of a message are
independent events. In case of synchronous communication, sending and receiving of a
message are coordinated to form a single event.
Most of the time, we will assume that there are communication protocols in place that
detect and correct flaws during the transmission of information.
Unless stated otherwise, we assume:
- A strongly connected network: every node can reach any other node in the network.
- Each process knows only its neighbors (local knowledge).
- Asynchronous message passing communication.
- Channels never lose, duplicate, or garble any messages.
- The delay of messages in channels is arbitrary but finite.
- Channels can be non-FIFO, meaning that messages can overtake each other.
- A stable network, with no nodes leaving or crashing.
- Each process has a unique (hardware) ID.
Channels can be unidirectional (messages can only flow in one direction) or bidirectional
(messages can flow in two directions, back and forth). Algorithms for directed networks are
more general than for undirected ones. Undirected networks can do anything that directed
ones can, but not the other way around.
The resource consumption of an execution of a distributed algorithm can be considered in
several ways:
1. Message complexity: Total number of messages exchanged.
2. Bit complexity: Total number of bits exchanged (only interesting if messages can be
very long).
3. Time complexity: Amount of time consumed. However, the computation varies from
the uniprocessor one:
o Event processing takes no time (making uniprocessor executions instant)
o A message is received at most 1 time-unit after it is sent.
, 4. Space complexity: Amount of memory needed for the process. Note that here, the
definition is equal to that of uniprocessor systems.
We consider worst- or average-case complexity.
Big 𝑶 notation
𝑓 = 𝑂(𝑔) if, for some 𝐶 > 0, 𝑓(𝑛) ≤ 𝐶 ⋅ 𝑔(𝑛) for all 𝑛 ∈ ℕ. 𝑔 is a rough upper bound for 𝑓.
𝑓 = Θ(𝑔) if 𝑓 = 𝑂(𝑔) and 𝑔 = 𝑂(𝑓). 𝑔 is a precise bound for 𝑓.
For example, 𝑛2 ∈ 𝑂(𝑛3 ) but 𝑛2 ∉ Θ(𝑛3 ).
Formal framework
Let’s look at a formal framework for describing distributed systems, and introduce some
terminology:
- The global state of a distributed system is called a configuration. The global state of a
program consists of the local states of all the nodes (called state), and the messages
travelling between nodes.
- The configuration evolves in discrete steps, called transitions. A transition occurs
when, for example, the local state in any of the nodes change. Hence, a transition is
always associated to an event at one of the processes.
- A transition system consists of all the possible executions that a distributed algorithm
can take:
- A configuration is terminal if 𝛾 → 𝛿 for no 𝛿 ∈ 𝐶. In other words, there are no
configurations to transition to.
- An execution is a sequence 𝛾0 𝛾1 … of configurations that is either infinite or ends in a
terminal configuration.
- A configuration 𝛿 is reachable if there is a sequence of transitions to reach it.
- A process can perform internal, send, and receive events.
- A process is an initiator if its first event is an internal or send event.
- An algorithm is centralized if there is exactly one initiator. A decentralized algorithm
can have multiple initiators.
An assertion is a predicate on the configurations of an algorithm. An assertion is a:
- Safety property if it’s true in each configuration of each execution of the algorithm.
E.g., something bad will never happen. Note that these properties only hold on the
reachable configurations.
- Liveness property if it’s true in some configuration of the algorithm. E.g., something
good will eventually happen.
Each invariant is a safety property. An assertion 𝑃 is an invariant if:
,Note that invariants must hold in any configuration, not only the reachable ones. It’s possible
that a safety property 𝑃 holds in any reachable configurations, but there exist unreachable
states where 𝑃 doesn’t hold. In that case, 𝑃 is a safety property but not an invariant.
Causal relations are relations between two events that must happen in a certain other.
This relation is transitive. If 𝑎 happens before 𝑏, and 𝑏 happens before 𝑐, then 𝑎 happens
before 𝑐.
If neither 𝑎 ≼ 𝑏 nor 𝑏 ≼ 𝑎, then 𝑎 and 𝑏 are concurrent. A permutation of concurrent events
in an execution does not affect the result of the execution. These permutations together
form a computation.
All executions of a computation start in the same initial configuration. And if they are finite,
they all end in the same terminal configuration.
Consider the finite execution 𝑎𝑏𝑐. Let 𝑎 ≺ 𝑏 be the only causal relationship. Then,
𝑎𝑏𝑐, 𝑐𝑎𝑏, 𝑎𝑐𝑏 are in the same computation.
A logical clock 𝑪 maps occurrences of events in a computation to a partially ordered set such
that 𝑎 ≺ 𝑏 ⇒ 𝐶(𝑎) < 𝐶(𝑏). Lamport's clock 𝐿𝐶 assigns to each event 𝑎 the length 𝑘 of a
longest causality chain 𝑎1 ≺ ⋯ ≺ 𝑎𝑘 = 𝑎.
A vector clock 𝑽𝑪 maps each event in a computation to a length 𝑛 tuple ℕ𝑁 such that 𝑎 ≺
𝑏 ⇔ 𝑉𝐶(𝑎) < 𝑉𝐶(𝑏). We can order ℕ𝑁 by:
𝑉𝐶(𝑎) = (𝑘0 , … , 𝐾𝑁−1 ) where each 𝑘𝑖 is the length of a longest causality chain 𝑎1𝑖 ≺ ⋯ ≺
𝑎𝑘𝑖 𝑖 of events at process 𝑝𝑖 with 𝑎𝑘𝑖 𝑖 ≼ 𝑎.
VC is the LC, but for processes.
, Snapshots
A snapshot of an execution of a distributed algorithm aims to return a configuration of this
execution. They can be used for e.g., restarting after failure, debugging, or determining
stable properties like deadlocks or garbage.
There are many ways to take snapshots, but the challenge is to not freeze the execution.
We distinguish basic messages of the underlying distributed algorithm and control messages
of the snapshot algorithm.
Note that a snapshot is only meaningful if it is a configuration of an execution that could
have occurred in the same computation as the actual execution. For each basis message 𝑚,
sender 𝑝 and receiver 𝑞 must agree on whether the sending of 𝑚 is pre- or post-snapshot.
We need to avoid the following situations:
The Chandy-Lamport algorithm can take snapshots. It assumes a directed network with
FIFO-channels. Initiators and noninitiators that receive a control message (marker) for the
time:
- Take a local snapshot of their sate, and
- Send the marker message to all their outgoing channels.
Process 𝑞 computes as channel state 𝑝𝑞 the basic messages that it receives via 𝑝𝑞 after
taking its local snapshot and before receiving the marker message from 𝑝. This produces a
meaningful snapshot because the channels are FIFO.
This algorithm could give a snapshot that never actually occurred in the actual execution.
However, if some event 𝑎 is not causally before some other event 𝑏, we could have
rearranged the timing of the events such that they happened in different orders, related to
the marker messages.
If we ditch the assumption that channels are FIFO, marker messages can be overtaken,
possibly reaching inconsistent snapshots. This can be arranged by adding an extra bit, which
carries information about if the sending node is pre- or post-snapshot. This is called
piggybacking and is part of the Lai-Yang algorithm.