Topic: reliability of distributed systems

Summary

A distributed system must allow for failure. Components, sites, and communication links may fail. Failure may be intermittent. A network may be temporarily partitioned into subnets.
End-to-end reliability allows for individual links and components to fail. Service availability should be independent of local failures. Checkpoints allow for restart. A system may be self-stabilizing. Cascading failures are particularly worrisome.
Services need to initialize and terminate cleanly. Out-of-service components need to be removed from the system. Timestamps help distinguish short-term from long-term failure. (cbb 2/07)

Subtopic: end-to-end reliability

Quote: the file transfer program must ensure end-to-end reliability; the communication channel is one link in a long chain [»saltJH11_1984]
Quote: while end-to-end error recovery is sufficient, it requires a cheap test for success and may cause severe performance problems under heavy load [»lampBW10_1983]
Quote: fate-sharing--can lose state about an entity only if the entity itself is lost; e.g., store synchronization information with the hosts and not the gateways, trust the host instead of the network
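
Example: the end-to-end check described in the quotes above can be sketched as follows. The sender computes a checksum, hands the data to an unreliable channel, and retries until the checksum of what arrived matches. The names `transfer_end_to_end` and `make_flaky_channel` are hypothetical, and the channel is simulated in memory. The SHA-256 comparison stands in for the cheap test for success. This is a minimal sketch under those assumptions, not the method of any cited paper.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Cheap end-to-end test for success: a digest of the whole payload."""
    return hashlib.sha256(data).hexdigest()


def make_flaky_channel(bad_sends: int):
    """Hypothetical channel that corrupts the first `bad_sends` transfers.

    Assumes non-empty payloads (it flips the bits of the last byte).
    """
    calls = {"count": 0}

    def send(data: bytes) -> bytes:
        calls["count"] += 1
        if calls["count"] <= bad_sends:
            return data[:-1] + bytes([data[-1] ^ 0xFF])
        return data

    return send


def transfer_end_to_end(data: bytes, send, max_attempts: int = 5):
    """Retry the whole transfer until the end-to-end checksum matches.

    Returns (received_bytes, attempts_used); raises IOError if every
    attempt fails the check.
    """
    expected = checksum(data)
    for attempt in range(1, max_attempts + 1):
        received = send(data)
        if checksum(received) == expected:
            return received, attempt
    raise IOError(f"transfer failed after {max_attempts} attempts")
```

Each failed check repeats the entire transfer, which is why the retry cost grows when the channel is unreliable or heavily loaded.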

Subtopic: site failures

Quote: distributed systems must cope with site and communication failures [»filmRE_1982]
Quote: in Xerox's internet, sites can be down for hours or even days [»demeA8_1987]
Quote: open systems must be designed so that operating components take over from failed components during repair [»hewiC7_1986]
Quote: Amoeba duplicates the directory server; can bring one down, install new software, and then update the other server [»taneAS12_1990]
Quote: Grapevine is invariant to failures of single servers [»schrMD2_1984]
Quote: Grapevine made a mistake in not informing users of site failures; experienced users will want to notify someone if it lasts too long [»schrMD2_1984]

Subtopic: denial of service

Quote: survey of denial of service attacks, Internet vulnerabilities, defense mechanisms, and countermeasures [»pengT4_2007]

Subtopic: component failure

Quote: Google's scalable distributed file system was designed for frequent component failure and huge, append-only files [»gherS10_2003]

Subtopic: software failures

Quote: hardware only brings down a single site; but software can bring down the system [»birrAD4_1982]
Quote: interconnected, computer-based systems are not independent; failures in one system can cause problems for other systems [»parnDL8_2007]

Subtopic: termination

Quote: detecting termination of distributed computations is equivalent to a garbage collection problem; can use to derive new termination algorithms [»telG1_1993]
Quote: both termination and garbage collection satisfy safety (if marked, it is terminated/garbage) and liveness (will mark if terminated/garbage) [»telG1_1993]
Quote: any distributed termination detection algorithm can solve the distributed, isolated train detection problem for garbage collection [»lowrMC12_2002]

Subtopic: tight coupling

Quote: system accidents naturally occur in systems with interactive complexity and tight coupling; e.g., nuclear power plants [»perrC_1984]

Subtopic: automatic messages

Quote: automatic message generation can lead to chain reactions; prevent by aging, time-outs, or absorption; e.g., cyclic forwarding [»manbU10_1990]
Quote: use absorption to prevent chain reactions of automatic messages; use the originator's id and a table of all previously sent ids; no duplicates [»manbU10_1990]
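
Example: absorption, as described in the quotes above, can be sketched with three nodes that forward every message in a cycle. Each message carries the originator's id plus a sequence number, and each node keeps a table of ids it has already handled. The `Node` class and its fields are hypothetical names for illustration; a real system would also bound the size of the table.

```python
class Node:
    """Hypothetical node that auto-forwards every message it receives."""

    def __init__(self, name):
        self.name = name
        self.forward_to = []   # peers that get an automatic copy
        self.seen = set()      # table of message ids already handled
        self.delivered = []    # message bodies delivered locally

    def receive(self, msg_id, body):
        # Absorption: a message whose id is already in the table is
        # dropped, which breaks forwarding cycles.
        if msg_id in self.seen:
            return
        self.seen.add(msg_id)
        self.delivered.append(body)
        for peer in self.forward_to:
            peer.receive(msg_id, body)


# Cyclic forwarding: a -> b -> c -> a
a, b, c = Node("a"), Node("b"), Node("c")
a.forward_to = [b]
b.forward_to = [c]
c.forward_to = [a]

# The id combines the originator's name with a per-originator sequence number.
a.receive(("a", 1), "hello")
```

Without the `seen` check, the call above would recurse forever; with it, each node delivers the message exactly once.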

Subtopic: checkpoint, rollback-recovery

Quote: survey of automatic, rollback-recovery from checkpoints of message-passing systems [»elnoEN9_2002]
Quote: message-passing systems may propagate rollback recovery because each message creates a dependency between sender and receiver; can domino to starting point [»elnoEN9_2002]
Quote: signals and interrupts do not work with many log-based, rollback-recovery protocols; they require piecewise determinism between messages [»slyeJH10_1998]
Quote: distributed systems with backward error recovery use atomic actions or conversations with checkpoints; these are equivalent [»shriSK7_1993]
Quote: coordinate checkpoints by embedded conversations; defines consistent global states for recovery [»shriSK7_1993]

Subtopic: recovery w/o checkpoint

Quote: optimistic recovery for distributed system via repeatable, message-driven recovery units; reconstruct consistent state after failure [»stroRE8_1985]
Quote: roll-back initiates a state interval of a recovery unit; messages between recovery units define a partial order of state intervals [»stroRE8_1985]

Subtopic: self-stabilizing

Quote: self-stabilizing algorithm for leader election under dynamic topologies [»schnM3_1993]
Quote: self-stabilizing, token-based system with no race conditions; if multiple tokens, guaranteed to remove all but one from system [»browGM6_1989]
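
Example: the idea of a token system that removes surplus tokens by itself can be illustrated with Dijkstra's classic K-state token ring. This is a substitute, well-known algorithm, not the one from the cited papers. The function names are hypothetical, the scheduler is simulated by picking one privileged machine at random per step, and `k` is assumed to be at least the number of machines.

```python
import random


def privileged(states):
    """Return the indices of machines that currently hold a token."""
    n = len(states)
    holders = []
    for i in range(n):
        if i == 0:
            # Machine 0 holds a token when it equals its predecessor,
            # which is the last machine in the ring.
            if states[0] == states[n - 1]:
                holders.append(0)
        elif states[i] != states[i - 1]:
            # Every other machine holds a token when it differs
            # from its predecessor.
            holders.append(i)
    return holders


def step(states, k, i):
    """Let privileged machine i make its move."""
    if i == 0:
        states[0] = (states[0] + 1) % k
    else:
        states[i] = states[i - 1]


def stabilize(states, k, rng, max_steps=10_000):
    """Run from an arbitrary state until exactly one token remains.

    Returns the number of moves taken.
    """
    steps = 0
    while len(privileged(states)) > 1:
        step(states, k, rng.choice(privileged(states)))
        steps += 1
        if steps > max_steps:
            raise RuntimeError("did not stabilize; is k >= len(states)?")
    return steps
```

Starting from any assignment of states, including one with several tokens, the ring reaches a state with a single token and keeps exactly one token from then on.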

Subtopic: global registry

Quote: maintain a distributed system with a global registry of software modules; unique identifier for name, version, security interface, and performance parameters; modules exchange ids [»fedaA3_1997]
Quote: authentication and key distribution must be extensible to large internetworks of many domains [»jansP4_1997]

Subtopic: stateless

Quote: REST interactions are stateless; they transfer representations of identified resources

Subtopic: message queue

Quote: input messages stored in a queue, only removed when completely processed [»giffDK12_1985]
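
Example: the queue discipline in the quote above can be sketched as a peek-then-remove loop. The message is removed only after the handler returns without error. `InputQueue` and `process_next` are hypothetical names, and the queue lives in memory here; a real system would presumably keep it on stable storage.

```python
from collections import deque


class InputQueue:
    """Hypothetical input queue that keeps a message until it is fully processed."""

    def __init__(self):
        self._pending = deque()

    def put(self, msg):
        self._pending.append(msg)

    def __len__(self):
        return len(self._pending)

    def process_next(self, handler):
        """Process the oldest message; return False if the queue is empty.

        If handler raises, the message stays at the head of the queue
        and will be retried on the next call.
        """
        if not self._pending:
            return False
        msg = self._pending[0]     # peek, do not remove yet
        handler(msg)               # may raise
        self._pending.popleft()    # remove only after success
        return True
```

Because a failure in the middle of processing leaves the message queued, a handler may see the same message more than once and should tolerate that.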

Subtopic: timestamps

Quote: reinitialize connection table for at-most-once messages with the timestamp of the latest message that could have existed before the crash [»liskB9_1989]
Quote: use exponential speedup for heartbeat protocol; terminates remaining processes on process failure [»goudMG5_1998]
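
Example: the first quote in this subtopic can be illustrated with a simplified receiver for at-most-once delivery. It keeps the last accepted timestamp per sender, and after a crash it is reinitialized with a bound no smaller than any timestamp it could have accepted earlier. The class name, the `crash_bound` parameter, and the `forget` method are hypothetical; how the bound is obtained after a crash is assumed, not taken from the cited paper.

```python
class AtMostOnceReceiver:
    """Simplified receiver that accepts each (sender, timestamp) at most once."""

    def __init__(self, crash_bound=0.0):
        # After a crash the connection table is empty, so every sender
        # starts from crash_bound: the timestamp of the latest message
        # that could have existed before the crash (assumed known).
        self.upper = crash_bound
        self.last_seen = {}  # connection table: sender -> last accepted timestamp

    def accept(self, sender, timestamp):
        """Return True if the message is new and should be delivered."""
        floor = self.last_seen.get(sender, self.upper)
        if timestamp <= floor:
            return False  # duplicate or too old to tell: reject
        self.last_seen[sender] = timestamp
        return True

    def forget(self, sender):
        """Drop a table entry while keeping the at-most-once guarantee."""
        timestamp = self.last_seen.pop(sender, None)
        if timestamp is not None:
            self.upper = max(self.upper, timestamp)
```

Rejecting everything at or below the bound may discard a few messages that were never delivered, but it never delivers a message twice.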

Subtopic: temporally stable IDs

Quote: Thoth depended on T-stability (ids not reused in <T seconds); difficult to implement in a distributed environment [»cherDR3_1988]

Related Topics

Group: digital communication (11 topics, 296 quotes)
Group: debugging (10 topics, 333 quotes)
Group: exception handling (12 topics, 314 quotes)
Group: security (23 topics, 874 quotes)
Topic: backtracking (30 items)
Topic: communication protocols (62 items)
Topic: database consistency and reliability (15 items)
Topic: database transactions (27 items)
Topic: defensive programming (22 items)
Topic: distributed system security (17 items)
Topic: error checking in robot programming (6 items)
Topic: error safe systems (76 items)
Topic: implementing distributed systems and applications (41 items)
Topic: key distribution (35 items)
Topic: log-structured rollback-recovery (13 items)
Topic: optimistic update for concurrency control (35 items)
Topic: proving concurrent programs (37 items)
Topic: reliable broadcast (29 items)
Topic: reliable communication (29 items)
Topic: replicated data (51 items)
Topic: resourceful, redundant systems for reliability (38 items)
Topic: security leaks and weaknesses (67 items)
Topic: security of remotely executed code (24 items)
Topic: software maintenance and testing of distributed systems (16 items)
Topic: specification and design of distributed systems (14 items)
Topic: synchronized processing (35 items)