Topic: reliability of distributed systems

topics > computer science > Group: distributed systems

digital communication
exception handling

communication protocols
database consistency and reliability
database transactions
defensive programming
distributed system security
error checking in robot programming
error safe systems
implementing distributed systems and applications
key distribution
log-structured rollback-recovery
optimistic update for concurrency control
proving concurrent programs
reliable broadcast
reliable communication
replicated data
resourceful, redundant systems for reliability
security leaks and weaknesses
security of remotely executed code
software maintenance and testing of distributed systems
specification and design of distributed systems
synchronized processing


A distributed system must allow for failure. Components, sites, and communication links may fail. Failure may be intermittent. A network may be temporarily partitioned into subnets.

End-to-end reliability allows for individual links and components to fail. Service availability should be independent of local failures. Checkpoints allow for restart. A system may be self-stabilizing. Cascading failures are particularly worrisome.

Services need to initialize and terminate cleanly. Out-of-service components need to be removed from the system. Timestamps help distinguish short-term from long-term failure. (cbb 2/07)

Subtopic: end-to-end reliability up

Quote: the file transfer program must ensure end-to-end reliability; the communication channel is one link in a long chain [»saltJH11_1984]
Quote: while end-to-end error recovery is sufficient, it requires a cheap test for success and may cause severe performance problems under heavy load [»lampBW10_1983]
Quote: fate-sharing--can loss state about an entity only if lose the entity itself; e.g., store synchronization information with the hosts and not the gateways, trust the host instead of the network

Subtopic: site failures up

Quote: distributed systems must cope with site and communication failures [»filmRE_1982]
Quote: in Xerox's internet, sites can be down for hours or even days [»demeA8_1987]
Quote: open systems must be designed so that operating components take over from failed components during repair [»hewiC7_1986]
Quote: Amoeba duplicates the directory server; can bring one down, install new software, and the update the other server [»taneAS12_1990]
Quote: Grapevine is invariant to failures of single servers [»schrMD2_1984]
Quote: Grapevine made a mistake in not informing users of site failures; experienced users will want to notify someone if it lasts too long [»schrMD2_1984]

Subtopic: denial of service up

Quote: survey of denial of service attacks, Internet vulnerabilities, defense mechanisms, and countermeasures [»pengT4_2007]

Subtopic: component failure up

Quote: Google's scalable distributed file system was designed for frequent component failure and huge, append-only files [»gherS10_2003]

Subtopic: software failures up

Quote: hardware only brings down a single site; but software can bring down the system [»birrAD4_1982]
Quote: interconnected, computer-based systems are not independent; failures in one system can cause problems for other systems [»parnDL8_2007]

Subtopic: termination up

Quote: detecting termination of distributed computations is equivalent to a garbage collection problem; can use to derive new termination algorithms [»telG1_1993]
Quote: both termination and garbage collection satisfy safety (if marked, it is terminated/garbage) and liveness (will mark if terminated/garbage) [»telG1_1993]
Quote: any distributed termination detection algorithm can solve the distributed, isolated train detection problem for garbage collection [»lowrMC12_2002]

Subtopic: tight coupling up

Quote: system accidents naturally occur in systems with interactive complexity and tight coupling; e.g., nuclear power plants [»perrC_1984]

Subtopic: automatic messages up

Quote: automatic message generation can lead to chain reactions; prevent by aging, time-outs, or absorption; e.g., cyclic forwarding [»manbU10_1990]
Quote: use absorption to prevent chain reactions of automatic messages; use the originator's id and a table of all previously sent ids; no duplicates [»manbU10_1990]

Subtopic: checkpoint, rollback-recovery up

Quote: survey of automatic, rollback-recovery from checkpoints of message-passing systems [»elnoEN9_2002]
Quote: message-passing systems may propagate rollback recovery because each message creates a dependency between sender and receiver; can domino to starting point [»elnoEN9_2002]
Quote: signals and interrupts do not work with many log-based, rollback-recovery protocols; they require piecewise determinism between messages [»slyeJH10_1998]
Quote: distributed systems with backward error recovery use atomic actions or conversations with checkpoints; these are equivalent [»shriSK7_1993]
Quote: coordinate checkpoints by embedded conversations; defines consistent global states for recovery [»shriSK7_1993]

Subtopic: recovery w/o checkpoint up

Quote: optimistic recovery for distributed system via repeatable, message-driven recovery units; reconstruct consistent state after failure [»stroRE8_1985]
Quote: roll-back initiates a state interval of a recovery unit; messages between recovery units define a partial order of state intervals [»stroRE8_1985]

Subtopic: self-stabilizing up

Quote: self-stabilizing algorithm for leader election under dynamic topologies [»schnM3_1993]
Quote: self-stabilizing, token-based system with no race conditions; if multiple tokens, guaranteed to remove all but one from system [»browGM6_1989]

Subtopic: global registry up

Quote: maintain a distributed system with a global registry of software modules; unique identifier for name, version, security interface, and performance parameters; modules exchange ids [»fedaA3_1997]
Quote: authentication and key distribution must be extensible to large internetworks of many domains [»jansP4_1997]

Subtopic: stateless up

Quote: REST interactions are stateless; they transfer representations of identified resources

Subtopic: message queue up

Quote: input messages stored in a queue, only removed when completely processed [»giffDK12_1985]

Subtopic: timestamps up

Quote: reinitialize connection table for at-most-once messages with the timestamp of the latest message that could have existed before the crash [»liskB9_1989]
Quote: use exponential speedup for heartbeat protocol; terminates remaining processes on process failure [»goudMG5_1998]

Subtopic: temporally stable IDs up

Quote: Thoth depended on T-stability (ids not reused in <T seconds); difficult to implement in a distributed environment

Related Topics up

Group: digital communication   (11 topics, 296 quotes)
Group: debugging   (10 topics, 333 quotes)
Group: exception handling   (12 topics, 314 quotes)
Group: security   (23 topics, 874 quotes)

Topic: backtracking (30 items)
Topic: communication protocols (62 items)
Topic: database consistency and reliability (15 items)
Topic: database transactions (27 items)
Topic: defensive programming (22 items)
Topic: distributed system security (17 items)
Topic: error checking in robot programming (6 items)
Topic: error safe systems (76 items)
Topic: implementing distributed systems and applications (41 items)
Topic: key distribution (35 items)
Topic: log-structured rollback-recovery (13 items)
Topic: optimistic update for concurrency control (35 items)
Topic: proving concurrent programs (37 items)
Topic: reliable broadcast (29 items)
Topic: reliable communication (29 items)
Topic: replicated data (51 items)
Topic: resourceful, redundant systems for reliability (38 items)
Topic: security leaks and weaknesses (67 items)
Topic: security of remotely executed code (24 items)
Topic: software maintenance and testing of distributed systems (16 items)
Topic: specification and design of distributed systems (14 items)
Topic: synchronized processing
(35 items)

Updated barberCB 6/05
Copyright © 2002-2008 by C. Bradford Barber. All rights reserved.
Thesa is a trademark of C. Bradford Barber.