Group:  digital communication
 Group:  debugging
 Group:  exception handling
 Group:  security
Topic:  backtracking
 Topic:  communication protocols
 Topic:  database consistency and reliability
 Topic:  database transactions
 Topic:  defensive programming
 Topic:  distributed system security
 Topic:  error checking in robot programming
 Topic:  error safe systems
 Topic:  implementing distributed systems and applications
 Topic:  key distribution
 Topic:  log-structured rollback-recovery
 Topic:  optimistic update for concurrency control
 Topic:  proving concurrent programs
 Topic:  reliable broadcast
 Topic:  reliable communication
 Topic:  replicated data
 Topic:  resourceful, redundant systems for reliability
 Topic:  security leaks and weaknesses
 Topic:  security of remotely executed code
 Topic:  software maintenance and testing of distributed systems
 Topic:  specification and design of distributed systems
 Topic:  synchronized processing 
 
 |  | 
Summary
   A distributed system must allow for failure.  Components, sites, and communication links may fail.  Failure may be intermittent.  A network may be temporarily partitioned into subnets. 
End-to-end reliability allows for individual links and components to fail.  Service availability should be independent of local failures.  Checkpoints allow for restart.  A system may be self-stabilizing.  Cascading failures are particularly worrisome.
 Services need to initialize and terminate cleanly.  Out-of-service components need to be removed from the system. Timestamps help distinguish short-term from long-term failure. (cbb 2/07) 
  
Subtopic: end-to-end reliability  
| Quote: the file transfer program must ensure end-to-end reliability; the communication channel is one link in a long chain  [»saltJH11_1984]
 |  | Quote: while end-to-end error recovery is sufficient, it requires a cheap test for success and may cause severe performance problems under heavy load [»lampBW10_1983]
 |  | Quote: fate-sharing--can loss state about an entity only if lose the entity itself; e.g., store synchronization information with the hosts and not the gateways, trust the host instead of the network
 |    Subtopic: site failures  
| Quote: distributed systems must cope with site and communication failures [»filmRE_1982]
 |  | Quote: in Xerox's internet, sites can be down for hours or even days [»demeA8_1987]
 |  | Quote: open systems must be designed so that operating components take over from failed components during repair [»hewiC7_1986]
 |  | Quote: Amoeba duplicates the directory server; can bring one down, install new software, and the update the other server [»taneAS12_1990]
 |  | Quote: Grapevine is invariant to failures of single servers [»schrMD2_1984]
 |  | Quote: Grapevine made a mistake in not informing users of site failures; experienced users will want to notify someone if it lasts too long [»schrMD2_1984]
 |    Subtopic: denial of service  
| Quote: survey of denial of service attacks, Internet vulnerabilities, defense mechanisms, and countermeasures [»pengT4_2007]
 |    Subtopic: component failure  
| Quote: Google's scalable distributed file system was designed for frequent component failure and huge, append-only files [»gherS10_2003]
 |    Subtopic: software failures  
| Quote: hardware only brings down a single site; but software can bring down the system [»birrAD4_1982]
 |  | Quote: interconnected, computer-based systems are not independent; failures in one system can cause problems for other systems [»parnDL8_2007]
 |    Subtopic: termination  
| Quote: detecting termination of distributed computations is equivalent to a garbage collection problem; can use to derive new termination algorithms [»telG1_1993]
 |  | Quote: both termination and garbage collection satisfy safety (if marked, it is terminated/garbage) and liveness (will mark if terminated/garbage) [»telG1_1993]
 |  | Quote: any distributed termination detection algorithm can solve the distributed, isolated train detection problem for garbage collection [»lowrMC12_2002]
 |    Subtopic: tight coupling  
| Quote: system accidents naturally occur in systems with interactive complexity and tight coupling; e.g., nuclear power plants [»perrC_1984]
 |    Subtopic: automatic messages  
| Quote: automatic message generation can lead to chain reactions; prevent by aging, time-outs, or absorption; e.g., cyclic forwarding [»manbU10_1990]
 |  | Quote: use absorption to prevent chain reactions of automatic messages; use the originator's id and a table of all previously sent ids; no duplicates [»manbU10_1990]
 |    Subtopic: checkpoint, rollback-recovery  
| Quote: survey of automatic, rollback-recovery from checkpoints of message-passing systems [»elnoEN9_2002]
 |  | Quote: message-passing systems may propagate rollback recovery because each message creates a dependency between sender and receiver; can domino to starting point [»elnoEN9_2002]
 |  | Quote: signals and interrupts do not work with many log-based, rollback-recovery protocols; they require piecewise determinism between messages [»slyeJH10_1998]
 |  | Quote: distributed systems with backward error recovery use atomic actions or conversations with checkpoints; these are equivalent [»shriSK7_1993]
 |  | Quote: coordinate checkpoints by embedded conversations; defines consistent global states for recovery [»shriSK7_1993]
 |    Subtopic: recovery w/o checkpoint  
| Quote: optimistic recovery for distributed system via repeatable, message-driven recovery units; reconstruct consistent state after failure [»stroRE8_1985]
 |  | Quote: roll-back initiates a state interval of a recovery unit; messages between recovery units define a partial order of state intervals [»stroRE8_1985]
 |    Subtopic: self-stabilizing  
| Quote: self-stabilizing algorithm for leader election under dynamic topologies [»schnM3_1993]
 |  | Quote: self-stabilizing, token-based system with no race conditions; if multiple tokens, guaranteed to remove all but one from system [»browGM6_1989]
 |    Subtopic: global registry  
| Quote: maintain a distributed system with a global registry of software modules; unique identifier for name, version, security interface, and performance parameters; modules exchange ids [»fedaA3_1997]
 |  | Quote: authentication and key distribution must be extensible to large internetworks of many domains [»jansP4_1997]
 |    Subtopic: stateless  
| Quote: REST interactions are stateless; they transfer representations of identified resources
 |    Subtopic: message queue  
| Quote: input messages stored in a queue, only removed when completely processed [»giffDK12_1985]
 |    Subtopic: timestamps  
| Quote: reinitialize connection table for at-most-once messages with the timestamp of the latest message that could have existed before the crash [»liskB9_1989]
 |  | Quote: use exponential speedup for heartbeat protocol; terminates remaining processes on process failure [»goudMG5_1998]
 |    Subtopic: temporally stable IDs  
Quote: Thoth depended on T-stability (ids not reused in <T seconds); difficult to implement in a distributed environment  [»cherDR3_1988]
 |    
 Related Topics  
 Group: digital communication   (11 topics, 296 quotes)
 Group: debugging   (10 topics, 333 quotes)
 Group: exception handling   (12 topics, 314 quotes)
 Group: security   (23 topics, 874 quotes)
Topic: backtracking (30 items)
 Topic: communication protocols (62 items)
 Topic: database consistency and reliability (15 items)
 Topic: database transactions (27 items)
 Topic: defensive programming (22 items)
 Topic: distributed system security (17 items)
 Topic: error checking in robot programming (6 items)
 Topic: error safe systems (76 items)
 Topic: implementing distributed systems and applications (41 items)
 Topic: key distribution (35 items)
 Topic: log-structured rollback-recovery (13 items)
 Topic: optimistic update for concurrency control (35 items)
 Topic: proving concurrent programs (37 items)
 Topic: reliable broadcast (29 items)
 Topic: reliable communication (29 items)
 Topic: replicated data (51 items)
 Topic: resourceful, redundant systems for reliability (38 items)
 Topic: security leaks and weaknesses (67 items)
 Topic: security of remotely executed code (24 items)
 Topic: software maintenance and testing of distributed systems (16 items)
 Topic: specification and design of distributed systems (14 items)
 Topic: synchronized processing  (35 items)
 
 
 |