Ricky W. Butler scite author profile

The infeasibility of quantifying the reliability of life-critical real-time software

This paper affirms that the quantification of life-critical software reliability is infeasible using statistical methods, whether these methods are applied to standard software or fault-tolerant software. The classical methods of estimating reliability are shown to lead to exorbitant amounts of testing when applied to life-critical software. Reliability growth models are examined and also shown to be incapable of overcoming the need for excessive amounts of testing. The key assumption of software fault tolerance, that separately programmed versions fail independently, is shown to be problematic. This assumption cannot be justified by experimentation in the ultrareliability region, and subjective arguments in its favor are not sufficiently strong to justify it as an axiom. Also, the implications of the recent multiversion software experiments support this affirmation.
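The scale of testing the abstract refers to can be illustrated with a rough calculation. The sketch below is not taken from the paper: it assumes a constant failure rate, failure-free testing, and an illustrative target of 1e-9 failures per hour at 99% confidence. The function name and both figures are assumptions chosen for the example.

```python
import math

HOURS_PER_YEAR = 24 * 365.25


def failure_free_test_hours(max_failure_rate, confidence):
    """Hours of failure-free testing needed to claim, at the given confidence,
    that the failure rate is below max_failure_rate (failures per hour).

    Assumes a constant failure rate, so P(no failure in t hours) = exp(-rate * t).
    Solving exp(-rate * t) <= 1 - confidence for t gives the bound below.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    if max_failure_rate <= 0.0:
        raise ValueError("max_failure_rate must be positive")
    return -math.log(1.0 - confidence) / max_failure_rate


# Assumed, illustrative requirement (not stated in the abstract).
target_rate = 1e-9
hours = failure_free_test_hours(target_rate, 0.99)
years = hours / HOURS_PER_YEAR
print(f"{hours:.3e} test hours, about {years:,.0f} years on a single test unit")
```

Running many test units in parallel divides the calendar time by the number of units, but under these assumptions the total remains several billion unit-hours.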

Fault-tolerant clock synchronization in distributed systems

Ramanathan

1990

Digital computers have become essential to critical real-time applications such as aerospace systems, life support systems, nuclear power plants, drive-by-wire systems, and computer-integrated manufacturing systems. Common to all these applications is the demand for maximum reliability and high performance from computer controllers. This requirement is necessarily stringent because a single controller failure in these applications can lead to disaster. For example, the allowable probability of failure for a commercial aircraft is specified to be less than per 10-hour mission because a controller failure during flight could result in a crash. Because of such stringent requirements, traditional methods for design and validation of computer controllers are often inadequate. Ad hoc techniques that appear sound under a careful failure-modes-and-effects analysis are often susceptible to certain subtle failure modes. The clock synchronization problem shown in Figure 1 is a classic example. The figure shows a three-node system in which each node has its own clock. The clocks are synchronized by adjusting each to the median of the three clock values. In Figure 1b, the faulty clock B reports incorrectly to clocks A and C. As a result, clocks A and C do not make any corrections because both behave as if they are the median clock. Lamport and Melliar-Smith were the first to study the three-clock synchronization problem in the presence of arbitrary fault behavior. They coined the term Byzantine fault to refer to the fault model in which a faulty clock can exhibit arbitrary behavior including, but not limited to, misrepresenting its value to other clocks in the system. They showed that in the presence of Byzantine faults, no algorithm can guarantee synchronization of the nonfaulty clocks in a three-node system. They also showed that 3m + 1 clocks are sufficient to ensure synchronization of the nonfaulty clocks in the presence of m Byzantine faults.
This condition later proved not only sufficient but also necessary for ensuring synchronization in the presence of Byzantine faults. Since the initial study by Lamport and Melliar-Smith, the problem of clock synchronization in the presence of Byzantine faults has been studied extensively by several other researchers. All this attention is mainly be…
October 1990, © 1990 IEEE
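The three-clock failure described above can be reproduced in a few lines. This is a minimal sketch with made-up clock readings, not code from the article: the function names and all numeric values are assumptions, and the faulty clock B is modeled simply as reporting a different value to each neighbor.

```python
from statistics import median


def median_adjust(own_reading, reported_readings):
    """Value a node adopts: the median of its own reading and the readings
    reported to it by the other clocks."""
    return median([own_reading, *reported_readings])


def min_clocks(m):
    """Number of clocks stated in the text as sufficient (and later shown
    necessary) to tolerate m Byzantine faults."""
    return 3 * m + 1


# Hypothetical readings of the two nonfaulty clocks.
a, c = 100.0, 110.0

# Fault-free case: B reports the same value to everyone, and both A and C
# converge on the common median.
honest_b = 105.0
honest_a = median_adjust(a, [honest_b, c])
honest_c = median_adjust(c, [a, honest_b])

# Byzantine case: B tells A a value below A's reading and tells C a value
# above C's reading, so each of them sees itself as the median.
b_to_a, b_to_c = 90.0, 120.0
new_a = median_adjust(a, [b_to_a, c])
new_c = median_adjust(c, [a, b_to_c])

print("fault-free:", honest_a, honest_c)
print("Byzantine: ", new_a, new_c)
print("clocks needed for one Byzantine fault:", min_clocks(1))
```

In the Byzantine case neither nonfaulty clock moves, so the skew between A and C is never corrected, which is the behavior the figure is meant to show.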

A formal methods approach to the analysis of mode confusion

Butler, Miller, Potts, et al.

The infeasibility of experimental quantification of life-critical software reliability

Butler, Finelli

1991

This paper affirms that quantification of life-critical software reliability is infeasible using statistical methods, whether applied to standard software or fault-tolerant software. The key assumption of software fault tolerance, that separately programmed versions fail independently, is shown to be problematic. This assumption cannot be justified by experimentation in the ultrareliability region, and subjective arguments in its favor are not sufficiently strong to justify it as an axiom. Also, the implications of the recent multiversion software experiments support this affirmation.
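Why the independence assumption carries so much weight can be seen in a small numeric sketch. The model below is a hypothetical illustration, not the paper's analysis: it assumes a three-version majority-voted system and a made-up mix of "hard" and "easy" inputs, with versions failing independently only once the input difficulty is fixed. All names and probabilities are assumptions.

```python
def majority_failure_prob(p):
    """P(at least 2 of 3 versions fail on an input) when each version fails
    independently with probability p."""
    return 3 * p**2 * (1 - p) + p**3


# Hypothetical input mix: 1% of inputs are "hard" and trip each version with
# probability 0.1; the remaining inputs never cause a failure.
hard_fraction, p_hard, p_easy = 0.01, 0.1, 0.0

# Per-version failure probability averaged over all inputs.
marginal = hard_fraction * p_hard + (1 - hard_fraction) * p_easy

# Prediction if versions are assumed to fail independently across all inputs.
independent = majority_failure_prob(marginal)

# Prediction when failures cluster on the hard inputs (same marginal rate).
correlated = (hard_fraction * majority_failure_prob(p_hard)
              + (1 - hard_fraction) * majority_failure_prob(p_easy))

print(f"per-version failure probability: {marginal:.1e}")
print(f"system failure, independence assumed: {independent:.3e}")
print(f"system failure, failures clustered:   {correlated:.3e}")
```

With these made-up numbers the per-version failure probability is identical in both cases, yet the clustered model predicts a system failure probability nearly two orders of magnitude higher than the independence calculation.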

The SURE approach to reliability analysis

Butler

1992

IEEE Trans. Rel.

Advance state of the art
Special math needed for explanation: Markov concepts
Special math needed to use results: Elementary probability
Results useful to: Reliability analysts and theoreticians
Summary & Conclusions: The SURE computer program is a reliability-analysis tool for ultrareliable computer-system architectures. SURE is based on computational methods developed at the NASA Langley Research Center. These methods provide an efficient means for computing reasonably accurate upper and lower bounds for the death state probabilities of a large class of semi-Markov models. Once a semi-Markov model is described using a simple input language, SURE automatically computes the upper and lower bounds on the probability of system failure. A parameter of the model can be specified as a variable over a range of values, thus directing SURE to perform a sensitivity analysis automatically. This feature, along with the speed of the program, makes it an especially useful design tool. SURE is a flexible, user-friendly reliability-analysis tool. The program provides a rapid computational capability for semi-Markov models useful for describing the fault-handling behavior of fault-tolerant computer systems. The only modeling restriction imposed by the program is that the non-exponential recovery transitions must be fast in comparison to the mission time, a desirable attribute of all fault-tolerant systems. The SURE reliability-analysis method uses a fast bounding theorem based on means and variances; the method yields upper and lower bounds on the probability of system failure. The upper and lower bounds are typically within 5 percent of each other. Techniques have been developed to enable SURE to solve models with loops and calculate the operational-state probabilities. The computation is extremely fast, and large state spaces can be directly solved; a pruning technique enables SURE to process extremely large models.
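To give a feel for the kind of model and sensitivity sweep described above, here is a toy calculation. It does not use SURE's input language or its bounding theorem: it assumes a hypothetical triplex system with exponential recovery, treats the recovered state as safe, and evaluates the death-state probability in closed form. The rates, mission time, and function names are all assumptions.

```python
import math


def two_step_cdf(a, b, t):
    """P(X + Y <= t) for independent X ~ Exp(a) and Y ~ Exp(b), with a != b.
    Written with expm1 to avoid cancellation when a * t is tiny."""
    return (-b * math.expm1(-a * t) + a * math.expm1(-b * t)) / (b - a)


def toy_failure_probability(lam, delta, mission_time):
    """Death-state probability for an assumed three-state model:
    all good -> (rate 3*lam) -> one fault, recovery in progress;
    from there a second fault (rate 2*lam) races recovery (rate delta).
    Losing the race is system failure; failures after recovery are ignored.
    """
    first_fault_rate = 3 * lam
    leave_recovery_rate = 2 * lam + delta
    p_second_fault_wins = 2 * lam / leave_recovery_rate
    return p_second_fault_wins * two_step_cdf(
        first_fault_rate, leave_recovery_rate, mission_time)


# Sensitivity sweep over the per-hour fault rate (hypothetical values),
# with recovery at 1e4 per hour and a 10-hour mission.
sweep = {lam: toy_failure_probability(lam, 1e4, 10.0)
         for lam in (1e-6, 1e-5, 1e-4)}
for lam, p_fail in sweep.items():
    print(f"lambda = {lam:.0e} per hour -> P(failure) = {p_fail:.3e}")
```

Because recovery is much faster than the mission time, the result is close to the product of the first-fault probability and the chance that a second fault beats recovery, which is the regime the abstract says SURE is designed for.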
