The goal of adaptive fault tolerance(AFT) is to expand the envelope of dependable system operation in distributed, real-time systems. Such systems often experience substantial run-time changes in the types and distributions of faults, in the availability of resources, in data distribution, and in users' requirements for dependability and performance. Preliminary examples, such as Adaptable Distributed Recovery Blocks (Kim) and distributed crash recovery, illustrate how adaptive fault tolerance can provide useful tradeoffs among service properties such as error-recovery latency, throughput, and precision, over a wide range of operating conditions. A general methodology for AFT system design must address issues of (1) rapid, incremental diagnosis/estimation of environmental and internal state, (2) safe and effective control, and (3) efficient, parametric or multimode fault-tolerant implementations. A major challenge is to achieve the additional flexibility without excessive complexity, both for performance and reliability concerns. Reflective architecture, a form of meta-design, is an attractive framework for AFT system design and for adaptive systems in general. It provides for the monitoring and redefinition of system behavior in a hierarchical manner that may be integrated with conventional "uses"-based hierarchical design.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.