Summary
A distributed system must allow for failure. Components, sites, and communication links may fail. Failure may be intermittent. A network may be temporarily partitioned into subnets.
End-to-end reliability allows for individual links and components to fail. Service availability should be independent of local failures. Checkpoints allow for restart. A system may be self-stabilizing. Cascading failures are particularly worrisome.
Services need to initialize and terminate cleanly. Out-of-service components need to be removed from the system. Timestamps help distinguish short-term from long-term failure. (cbb 2/07)
Subtopic: end-to-end reliability
Quote: the file transfer program must ensure end-to-end reliability; the communication channel is one link in a long chain [»saltJH11_1984]
Quote: while end-to-end error recovery is sufficient, it requires a cheap test for success and may cause severe performance problems under heavy load [»lampBW10_1983]
Quote: fate-sharing: state about an entity may be lost only if the entity itself is lost; e.g., store synchronization information with the hosts and not the gateways, trust the host instead of the network
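Example: a minimal sketch of an end-to-end check for a file transfer, assuming a SHA-256 digest serves as the cheap test for success. The names unreliable_channel and transfer are hypothetical and do not come from the cited papers; the channel only simulates a link that corrupts data once.

```python
import hashlib


def unreliable_channel(data: bytes, attempt: int) -> bytes:
    """Hypothetical link: corrupts the first byte on the first attempt only."""
    if attempt == 0 and data:
        return bytes([data[0] ^ 0xFF]) + data[1:]
    return data


def transfer(data: bytes, channel=unreliable_channel, max_attempts: int = 3) -> bytes:
    """Resend until the digest seen at the destination matches the source."""
    expected = hashlib.sha256(data).hexdigest()  # computed at the source endpoint
    for attempt in range(max_attempts):
        received = channel(data, attempt)
        # The test runs at the destination endpoint, not on any single link.
        if hashlib.sha256(received).hexdigest() == expected:
            return received
    raise IOError("transfer failed after %d attempts" % max_attempts)
```

The retry loop is also where the performance concern shows up: every failed check costs a full retransmission, which is why link-level reliability can still be worthwhile as an optimization.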
Subtopic: site failures
Quote: distributed systems must cope with site and communication failures [»filmRE_1982]
Quote: in Xerox's internet, sites can be down for hours or even days [»demeA8_1987]
Quote: open systems must be designed so that operating components take over from failed components during repair [»hewiC7_1986]
Quote: Amoeba duplicates the directory server; can bring one down, install new software, and then update the other server [»taneAS12_1990]
Quote: Grapevine is invariant to failures of single servers [»schrMD2_1984]
Quote: Grapevine made a mistake in not informing users of site failures; experienced users will want to notify someone if it lasts too long [»schrMD2_1984]
Subtopic: denial of service
Quote: survey of denial of service attacks, Internet vulnerabilities, defense mechanisms, and countermeasures [»pengT4_2007]
Subtopic: component failure
Quote: Google's scalable distributed file system was designed for frequent component failure and huge, append-only files [»gherS10_2003]
Subtopic: software failures
Quote: hardware only brings down a single site; but software can bring down the system [»birrAD4_1982]
Quote: interconnected, computer-based systems are not independent; failures in one system can cause problems for other systems [»parnDL8_2007]
Subtopic: termination
Quote: detecting termination of distributed computations is equivalent to a garbage collection problem; can use to derive new termination algorithms [»telG1_1993]
Quote: both termination and garbage collection satisfy safety (if marked, it is terminated/garbage) and liveness (will mark if terminated/garbage) [»telG1_1993]
Quote: any distributed termination detection algorithm can solve the distributed, isolated train detection problem for garbage collection [»lowrMC12_2002]
Subtopic: tight coupling
Quote: system accidents naturally occur in systems with interactive complexity and tight coupling; e.g., nuclear power plants [»perrC_1984]
Subtopic: automatic messages
Quote: automatic message generation can lead to chain reactions; prevent by aging, time-outs, or absorption; e.g., cyclic forwarding [»manbU10_1990]
Quote: use absorption to prevent chain reactions of automatic messages; use the originator's id and a table of all previously sent ids; no duplicates [»manbU10_1990]
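Example: a minimal sketch of absorption for automatically forwarded messages, assuming each message carries its originator's id and a per-originator message id. The Node class and its fields are hypothetical names for illustration, not the design in [»manbU10_1990].

```python
class Node:
    """A site that automatically forwards every new message to its peers."""

    def __init__(self, name):
        self.name = name
        self.seen = set()      # table of (originator, message id) already handled
        self.forward_to = []   # peers that receive automatic forwards
        self.delivered = []    # message bodies delivered locally

    def receive(self, originator, msg_id, body):
        key = (originator, msg_id)
        if key in self.seen:
            return             # absorb the duplicate; the chain reaction stops here
        self.seen.add(key)
        self.delivered.append(body)
        for peer in self.forward_to:
            peer.receive(originator, msg_id, body)
```

With two nodes that forward to each other, a single message is delivered once at each node and then absorbed, so cyclic forwarding terminates.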
Subtopic: checkpoint, rollback-recovery
Quote: survey of automatic, rollback-recovery from checkpoints of message-passing systems [»elnoEN9_2002]
Quote: message-passing systems may propagate rollback recovery because each message creates a dependency between sender and receiver; can domino to starting point [»elnoEN9_2002]
Quote: signals and interrupts do not work with many log-based, rollback-recovery protocols; they require piecewise determinism between messages [»slyeJH10_1998]
Quote: distributed systems with backward error recovery use atomic actions or conversations with checkpoints; these are equivalent [»shriSK7_1993]
Quote: coordinate checkpoints by embedded conversations; defines consistent global states for recovery [»shriSK7_1993]
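Example: a minimal sketch of how rollback can propagate with uncoordinated checkpoints, assuming each process has its own local clock, a checkpoint at local time 0, and messages recorded as (sender, send time, receiver, receive time). The function recovery_line and this data layout are assumptions for illustration, not a protocol from the cited surveys.

```python
import math


def recovery_line(checkpoints, messages, failed):
    """Return the local time each process is restored to (math.inf = no rollback).

    checkpoints: {process: [local checkpoint times, including 0]}
    messages:    [(sender, sent_at, receiver, received_at)], received_at > 0
    """
    rollback = {p: math.inf for p in checkpoints}
    rollback[failed] = max(checkpoints[failed])  # failed process restarts here
    changed = True
    while changed:
        changed = False
        for sender, sent_at, receiver, received_at in messages:
            send_undone = sent_at > rollback[sender]
            receive_kept = received_at <= rollback[receiver]
            if send_undone and receive_kept:
                # Orphan message: the receiver must forget it as well.
                rollback[receiver] = max(
                    c for c in checkpoints[receiver] if c < received_at
                )
                changed = True
    return rollback


# Staggered checkpoints: every checkpoint is invalidated by a later message.
line = recovery_line(
    {"P": [0, 4], "Q": [0, 3]},
    [("Q", 1, "P", 2), ("P", 5, "Q", 2)],
    failed="P",
)
```

In this example the failure of P forces Q back past its checkpoint, which in turn forces P back further, so both processes end at their initial state. Coordinated checkpoints avoid this by ensuring that no message crosses the recovery line.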
Subtopic: recovery w/o checkpoint
Quote: optimistic recovery for distributed system via repeatable, message-driven recovery units; reconstruct consistent state after failure [»stroRE8_1985]
Quote: roll-back initiates a state interval of a recovery unit; messages between recovery units define a partial order of state intervals [»stroRE8_1985]
Subtopic: self-stabilizing
Quote: self-stabilizing algorithm for leader election under dynamic topologies [»schnM3_1993]
Quote: self-stabilizing, token-based system with no race conditions; if multiple tokens, guaranteed to remove all but one from system [»browGM6_1989]
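Example: a minimal sketch of a self-stabilizing token ring. This is Dijkstra's classic K-state ring, swapped in for illustration; it is not the algorithm of [»browGM6_1989]. It assumes k is at least the number of machines and that one privileged machine moves at a time. The function names and the random scheduler are assumptions.

```python
import random


def privileged(states, i):
    """Machine i holds a token (privilege) in Dijkstra's K-state ring."""
    if i == 0:
        return states[0] == states[-1]
    return states[i] != states[i - 1]


def step(states, i, k):
    """Let privileged machine i make its move."""
    if i == 0:
        states[0] = (states[0] + 1) % k
    else:
        states[i] = states[i - 1]


def stabilize(states, k, steps, seed=0):
    """Run a random scheduler for the given number of steps; return token holders."""
    rng = random.Random(seed)
    for _ in range(steps):
        holders = [i for i in range(len(states)) if privileged(states, i)]
        step(states, rng.choice(holders), k)  # at least one holder always exists
    return [i for i in range(len(states)) if privileged(states, i)]


# Start from an arbitrary state that has several tokens.
holders = stabilize([3, 1, 4, 1, 5], k=6, steps=500)
```

From any starting state the ring converges to exactly one token, and each later move passes that token on without creating a new one.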
Subtopic: global registry
Quote: maintain a distributed system with a global registry of software modules; unique identifier for name, version, security interface, and performance parameters; modules exchange ids [»fedaA3_1997]
Quote: authentication and key distribution must be extensible to large internetworks of many domains [»jansP4_1997]
Subtopic: stateless
Quote: REST interactions are stateless; they transfer representations of identified resources
Subtopic: message queue
Quote: input messages stored in a queue, only removed when completely processed [»giffDK12_1985]
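Example: a minimal sketch of an input queue that removes a message only after its handler has finished, assuming an in-memory queue and a handler that signals failure by raising. The InputQueue class is a hypothetical name, not the mechanism described in [»giffDK12_1985].

```python
from collections import deque


class InputQueue:
    """Input messages stay queued until they are completely processed."""

    def __init__(self):
        self._items = deque()

    def put(self, message):
        self._items.append(message)

    def process_next(self, handler):
        """Run handler on the oldest message; remove it only if handler returns."""
        if not self._items:
            return False
        handler(self._items[0])   # may raise; the message then stays queued
        self._items.popleft()
        return True

    def __len__(self):
        return len(self._items)
```

Because a message whose handler failed part-way is processed again later, handlers under this scheme should tolerate being run more than once on the same message.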
Subtopic: timestamps
Quote: reinitialize connection table for at-most-once messages with the timestamp of the latest message that could have existed before the crash [»liskB9_1989]
Quote: use exponential speedup for heartbeat protocol; terminates remaining processes on process failure [»goudMG5_1998]
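Example: a simplified sketch of the connection-table quote from [»liskB9_1989] in this subtopic, assuming each sender stamps its messages with increasing timestamps. The Receiver class and the restart_bound parameter are hypothetical; the protocol in the cited paper has details that this sketch omits.

```python
class Receiver:
    """Accept each timestamped message at most once, even across a crash."""

    def __init__(self, restart_bound=0.0):
        # After a crash the table is empty, so restart_bound stands in for it:
        # the latest timestamp that could have been accepted before the crash.
        self.lower_bound = restart_bound
        self.last_seen = {}   # connection table: sender -> latest accepted timestamp

    def accept(self, sender, timestamp):
        newest = self.last_seen.get(sender, self.lower_bound)
        if timestamp <= newest:
            return False      # possible duplicate; reject rather than risk a replay
        self.last_seen[sender] = timestamp
        return True
```

The bound is conservative: after a restart, some messages that were never delivered may be rejected because their timestamps fall below it, but no message is accepted twice.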
Subtopic: temporally stable IDs
Quote: Thoth depended on T-stability (ids not reused in <T seconds); difficult to implement in a distributed environment [»cherDR3_1988]
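Example: a minimal single-machine sketch of T-stable ids, assuming one allocator with one clock. The IdAllocator class is hypothetical and is not Thoth's implementation. The distributed case is harder, as the quote notes, because no single clock or free list exists.

```python
import heapq


class IdAllocator:
    """Hand out ids that are never reused within t_seconds of their release."""

    def __init__(self, t_seconds):
        self.t = t_seconds
        self.next_fresh = 0
        self.released = []   # min-heap of (release time, id)

    def allocate(self, now):
        # Reuse the oldest released id only after its quarantine has expired.
        if self.released and now - self.released[0][0] >= self.t:
            return heapq.heappop(self.released)[1]
        fresh = self.next_fresh
        self.next_fresh += 1
        return fresh

    def release(self, ident, now):
        heapq.heappush(self.released, (now, ident))
```

A stale reference to a released id therefore cannot reach a new owner for at least T seconds, which gives holders of old ids time to notice that the original owner is gone.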
Related Topics
Group: digital communication (11 topics, 296 quotes)
Group: debugging (10 topics, 333 quotes)
Group: exception handling (12 topics, 314 quotes)
Group: security (23 topics, 874 quotes)
Topic: backtracking (30 items)
Topic: communication protocols (62 items)
Topic: database consistency and reliability (15 items)
Topic: database transactions (27 items)
Topic: defensive programming (22 items)
Topic: distributed system security (17 items)
Topic: error checking in robot programming (6 items)
Topic: error safe systems (76 items)
Topic: implementing distributed systems and applications (41 items)
Topic: key distribution (35 items)
Topic: log-structured rollback-recovery (13 items)
Topic: optimistic update for concurrency control (35 items)
Topic: proving concurrent programs (37 items)
Topic: reliable broadcast (29 items)
Topic: reliable communication (29 items)
Topic: replicated data (51 items)
Topic: resourceful, redundant systems for reliability (38 items)
Topic: security leaks and weaknesses (67 items)
Topic: security of remotely executed code (24 items)
Topic: software maintenance and testing of distributed systems (16 items)
Topic: specification and design of distributed systems (14 items)
Topic: synchronized processing (35 items)