The PSTR/SNS scheme for real-time fault tolerance via active object replication and network surveillance

Kim, K.H.; Subbaraman, C.

doi:10.1109/69.842258

Cited by 16 publications

(6 citation statements)

References 14 publications

“…Real-time fault-tolerant systems: IFLOW [18] and MEAD [19] use fault-prediction techniques to reduce fault detection and client failover time to change the frequency of backup replica state synchronization to minimize state synchronization during failure recovery, and by determining the possibility of a primary replica failure and redirecting clients to alternate servers before failures occur, respectively. The Time-triggered Message-triggered Objects (TMO) project [20] considers replication schemes such as the primary-shadow TMO replication (PSTR) scheme, for which recovery time bounds can be quantitatively established, and real-time fault tolerance guarantees can be provided to applications. FC-ORB [21] is a real-time Object Request Broker (ORB) middleware that employs end-to-end utilization control to handle fluctuations in application workload and system resources by enforcing desired CPU utilization bounds on multiple processors by adapting the rates of end-to-end tasks within user-specified ranges.…”

Section: B Evaluating SafeMAT-induced Failover Overhead Times (mentioning)

confidence: 99%

Reliable Distributed Real-Time and Embedded Systems through Safe Middleware Adaptation

Dabholkar, Dubey, Gokhale, et al. 2012

2012 IEEE 31st Symposium on Reliable Distributed Systems

Abstract. Distributed real-time and embedded (DRE) systems are a class of real-time systems formed through a composition of predominantly legacy, closed and statically scheduled real-time subsystems, which comprise over-provisioned resources to deal with worst-case failure scenarios. The formation of the systems of systems leads to a new range of faults that manifest at different granularities for which no statically defined fault tolerance scheme applies. Thus, dynamic and adaptive fault tolerance mechanisms are needed which must execute within the available resources without compromising the safety and timeliness of existing real-time tasks in the individual subsystems. To address these requirements, this paper describes a middleware solution called Safe Middleware Adaptation for Real-Time Fault Tolerance (SafeMAT), which opportunistically leverages the available slack in the over-provisioned resources of individual subsystems. SafeMAT comprises three primary artifacts: (1) a flexible and configurable distributed, runtime resource monitoring framework that can pinpoint in real-time the available slack in the system that is used in making dynamic and adaptive fault tolerance decisions; (2) a safe and resource-aware dynamic failure adaptation algorithm that enables efficient recovery from different granularities of failures within the available slack in the execution schedule while ensuring real-time constraints are not violated and resources are not overloaded; and (3) a framework that empirically validates the correctness of the dynamic mechanisms and the safety of the DRE system. Experimental results evaluating SafeMAT on an avionics application indicate that SafeMAT incurs only 9-15% runtime failover and 2-6% processor utilization overheads at runtime, thereby providing safe and predictable failure adaptability in real-time.

“…MEAD [17] and its proactive recovery strategy for distributed CORBA applications can minimize the recovery time for DRE systems. The Time-triggered Message-triggered Objects (TMO) project [9] considers replication schemes such as the primary-shadow TMO replication (PSTR) scheme, for which recovery time bounds can be quantitatively established, and real-time fault tolerance guarantees can be provided to applications. DARX [11] provides adaptive fault-tolerance for multi-agent software platforms by dynamically changing replication styles in response to changing resource availabilities and application performance.…”

Section: Related Work (mentioning)

confidence: 99%

Towards Middleware for Fault-Tolerance in Distributed Real-Time and Embedded Systems

Balasubramanian, Gokhale, Schmidt, et al. 2008

Distributed Applications and Interoperable Systems

Abstract. Distributed real-time and embedded (DRE) systems often require support for multiple simultaneous quality of service (QoS) properties, such as real-timeliness and fault tolerance, that operate within resource constrained environments. These resource constraints motivate the need for a lightweight middleware infrastructure, while the need for simultaneous QoS properties requires the middleware to provide fault tolerance capabilities that respect time-critical needs of DRE systems. Conventional middleware solutions, such as Fault-tolerant CORBA (FT-CORBA) and Continuous Availability API for J2EE, have limited utility for DRE systems because they are heavyweight (e.g., the complexity of their feature-rich fault tolerance capabilities consumes excessive runtime resources), yet incomplete (e.g., they lack mechanisms that enable fault tolerance while maintaining real-time predictability). This paper provides three contributions to the development and standardization of lightweight real-time and fault-tolerant middleware for DRE systems. First, we discuss the challenges in realizing real-time fault-tolerant solutions for DRE systems using contemporary middleware. Second, we describe recent progress towards standardizing a CORBA lightweight fault-tolerance specification for DRE systems. Third, we present the architecture of FLARe, which is a prototype based on the OMG real-time fault-tolerant CORBA middleware standardization efforts that is lightweight (e.g., leverages only those server- and client-side mechanisms required for real-time systems) and predictable (e.g., provides fault-tolerant mechanisms that respect time-critical performance needs of DRE systems).

“…Time testing, referred to as heartbeating [3,9,10], can be used to check if a component or system is anomalous, but it may fail to locate where an anomaly is in a component or system. The detection mechanism depending on exceptions [7] may not handle unanticipated, state-dependent anomalies.…”

Section: Introduction (mentioning)

confidence: 99%

“…Several approaches to anomaly detection for dependable systems have been suggested in [3,8-10,7,4], which may provide partial solutions from the perspective of quality factors of the mechanisms for anomaly detection: speed and accuracy. Time testing, referred to as heartbeating [3,9,10], can be used to check if a component or system is anomalous, but it may fail to locate where an anomaly is in a component or system.…”

Section: Introduction (mentioning)

confidence: 99%

Detection of anomalies in software architecture with connectors

Shin, Paniagua, et al. 2006

Science of Computer Programming

This paper describes an approach to detecting anomalies in a software architectural style that is structured with components and connectors between the components. Each component is designed with tasks (concurrent or active objects), connectors between tasks, and passive objects accessed by tasks. Anomalies in the software architecture are detected twofold by each Component Monitor, which supervises objects in a component, and by a System Monitor, which monitors message communications between components. The monitors encapsulate the specifications of objects being monitored, which are represented using statecharts. The execution of statecharts in the monitors depends on notification messages from connectors between tasks, passive objects accessed by tasks in a component, and connectors between components.
