Software-Based Fault Recovery via Adaptive Diversity for COTS Multi-Core Processors

Höller, Andrea; Rauter, Tobias; Iber, Johannes; Macher, Georg; Kreiner, Christian

doi:10.48550/arxiv.1511.03528

Cited by 3 publications

(5 citation statements)

References 0 publications


“…As prior thread-level FT implementations [19], [20], [28] are based upon fundamentally different concepts, only address transient faults within a very limited scope, and are deeply embedded into proprietary OS, their fault-coverage and performance can not be directly compared. However, the measured performance overhead does fall within the same range as measured in [19], and we also observe comparable average-case performance.…”

Section: Discussion and Outlook (mentioning)

confidence: 99%

“…Most implement checkpoint & rollback or restart, which makes them unsuitable for spacecraft command & control applications [24], others ignore fault-detection [25], [26], or require external, infallible fault detection entities with deep knowledge about application-intrinsics [27] but no concept of how this could be obtained. Often, faults are assumed to be isolated, side-effect free and local to an application [28] and/or transient [19], [20], [25], which voids their effectiveness for space applications. Many prior concepts entail high performance- [29], resource-overhead [30], [31], or impose severe design constraints on applications and the OS [18], [19].…”

Section: Related Work (mentioning)

confidence: 99%

“…[19], [20], [28] implement voting through OS invasive measures, can not handle multi-threaded applications and consider the OS and stored program code to be fault free. [21] requires no modifications to the application software whatsoever, but can only assure availability in a networked application architecture.…”

Section: Related Work (mentioning)

confidence: 99%


Bringing Fault-Tolerant GigaHertz-Computing to Space: A Multi-stage Software-Side Fault-Tolerance Approach for Miniaturized Spacecraft

Fuchs, Stefanov, Murillo, et al. 2017

2017 IEEE 26th Asian Test Symposium (ATS)


Modern embedded technology is a driving factor in satellite miniaturization, contributing to a massive boom in satellite launches and a rapidly evolving new space industry. Miniaturized satellites, however, suffer from low reliability, as traditional hardware-based fault-tolerance (FT) concepts are ineffective for on-board computers (OBCs) utilizing modern systems-on-a-chip (SoC). Therefore, larger satellites continue to rely on proven processors with large feature sizes. Software-based concepts have largely been ignored by the space industry as they were researched only in theory, and have not yet reached the level of maturity necessary for implementation. We present the first integral, real-world solution to enable fault-tolerant general-purpose computing with modern multiprocessor-SoCs (MPSoCs) for spaceflight, thereby enabling their use in future high-priority space missions. The presented multi-stage approach consists of three FT stages, combining coarse-grained thread-level distributed self-validation, FPGA reconfiguration, and mixed criticality to assure long-term FT and excellent scalability for both resource-constrained and critical high-priority space missions. Early benchmark results indicate a drastic performance increase over state-of-the-art radiation-hard OBC designs and considerably lower software and hardware development costs. This approach was developed for a 4-year European Space Agency (ESA) project, and we are implementing a tiled MPSoC prototype jointly with two industrial partners.




“…Thread-level coarse-grain lockstep of weakly coupled cores instead supports general purpose computing, and in the past, has already been used for high availability, non-stop service, and error resilience concepts. However, in prior research, faults are usually assumed to be isolated, side effect free, and local to an individual application thread [19] or transient [20], [21], entailing high performance [22] or resource overhead [23], [24]. More advanced proof-of-concepts [20], [25], however, attempt to address these limitations, and even show a modest performance overhead between 3% and 25%, but utilize checkpoint & rollback or restart mechanics [20], which make them unsuitable for spacecraft command & control applications.…”

Section: Background and Related Work (mentioning)

confidence: 99%

Fault-Tolerant Nanosatellite Computing on a Budget

Fuchs, Murillo, Plaat, et al. 2019

Preprint


We present an on-board computer architecture designed for small satellites (<50kg), which exploits software-fault-tolerance to achieve strong fault coverage with commodity hardware. Micro- and nanosatellites have become popular platforms for a variety of commercial and scientific applications, but today are considered suitable mainly for short and low-priority space missions due to their low reliability. In part, this can be attributed to their reliance upon cheap, low-feature size, COTS components originally designed for embedded and mobile-market applications, for which traditional hardware-voting concepts are ineffective. Software-fault-tolerance has been shown to be effective for such systems, but has largely been ignored by the space industry due to low maturity, as most concepts have only been researched in theory. In practice, designers of payload instruments and miniaturized satellites are usually forced to sacrifice reliability in favor of delivering the level of performance necessary for cutting-edge science and innovative commercial applications. Thus, we developed a set of software measures facilitating fault tolerance based upon thread-level coarse-grain lockstep, which we validated through fault-injection. To offer strong long-term fault coverage, our architecture is implemented as tiled MPSoC on an FPGA, utilizing partial reconfiguration, as well as mixed criticality. This architecture can satisfy the high performance requirements of current and future scientific and commercial space missions at very low cost, while offering the strong fault-coverage guarantees necessary for platform control even for missions with a long duration. This architecture was developed for a 4-year ESA project. Together with two industrial partners, we are developing a prototype to then undergo radiation testing.


“…Coarse-grain lockstep of weakly coupled cores can do just that, and in the past has already been used for high availability, non-stop service, and error resilience concepts. However, in prior research, faults are usually assumed to be isolated, side effect free and local to an individual application thread [19] or transient [20], [21], and entail high performance [22] or resource overhead [23], [24]. More advanced proof-of-concepts [20], [25], however, attempt to address these limitations, and even show a modest performance overhead between 3% and 25%, but utilize checkpoint & rollback or restart mechanics [20], which make them unsuitable for spacecraft command & control applications.…”

Section: Related Work (mentioning)

confidence: 99%

Dynamic Fault Tolerance Through Resource Pooling

Fuchs, Murillo, Plaat, et al. 2018

2018 NASA/ESA Conference on Adaptive Hardware and Systems (AHS)


Miniaturized satellites are currently not considered suitable for critical, high-priority, and complex multi-phased missions, due to their low reliability. As hardware-side fault tolerance (FT) solutions designed for larger spacecraft can not be adopted aboard very small satellites due to budget, energy, and size constraints, we developed a hybrid FT-approach based upon only COTS components, commodity processor cores, library IP, and standard software. This approach facilitates fault detection, isolation, and recovery in software, and utilizes fault-coverage techniques across the embedded stack within a multiprocessor system-on-chip (MPSoC). This allows our FPGA-based proof-of-concept implementation to deliver strong fault-coverage even for missions with a long duration, but also to adapt to varying performance requirements during the mission. The operator of a spacecraft utilizing this approach can define performance profiles, which allow an on-board computer (OBC) to trade between processing capacity, fault coverage, and energy consumption using simple heuristics. The software-side FT approach developed also offers advantages if deployed aboard larger spacecraft through spare resource pooling, enabling an OBC to more efficiently handle permanent faults. This FT approach in part mimics a critical biological system's ability to tolerate faults, adapt to permanent failure, and enables graceful aging of an MPSoC.

