Exploiting Idle Hardware to Provide Low Overhead Fault Tolerance for VLIW Processors

Sartor, Anderson L.; Lorenzon, Arthur F.; Carro, Luigi; Kastensmidt, Fernanda Lima; Wong, Stephan; Beck, Antonio Carlos Schneider

doi:10.1145/3001935

Cited by 13 publications

(9 citation statements)

References 41 publications


“…Hardware-based approaches duplicate the instructions at runtime using specific hardware using the compiler's result. To do so, coupling of the VLIW pipelines is applied [4], [13]. When the duplicated instructions do not fit in the current bundle, an additional time slot is added.…”

Section: Related Work (mentioning)

confidence: 99%

Run-time Instruction Replication for permanent and soft error mitigation in VLIW processors

Psiakis

Kritikakou

Sentieys

2017

2017 15th IEEE International New Circuits and Systems Conference (NEWCAS)


Error occurrence in embedded systems has significantly increased. Although inherent resource redundancy exist in processors, such as in Very Long Instruction Word (VLIW) processors, it is not always used due to low application's Instruction Level Parallelism (ILP). Approaches benefit the additional resources to provide fault tolerance. When permanent and soft errors coexist, spare units have to be used or the executed program has to be modified through self-repair or by using several stored versions. However, these solutions introduce high area overhead for the additional resources, time overhead for the execution of the repair algorithm and storage overhead of the multiversioning. To address these limitations, a hardware mechanism is proposed which at run-time replicates the instructions and schedules them at the idle slots considering the resource constraints. If a resource becomes faulty, the proposed approach efficiently rebinds both the original and replicated instructions during execution. In this way, the area overhead is reduced, as no spare resources are used, whereas time and storage overhead are not required. Results show up to 49% performance gain over existing techniques.
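The replication scheme described in this abstract and in the citation statement above can be illustrated with a minimal sketch. It is not the authors' hardware mechanism. It assumes a bundle is a fixed-width list of slots, that every functional unit can execute every operation, and that a replica is marked with a trailing apostrophe. The names `replicate_into_idle_slots` and `faulty` are hypothetical. Each operation is paired with a replica, the pairs are packed into the usable slots, and an extra bundle (time slot) is added when they do not fit.

```python
from typing import FrozenSet, List, Optional

Bundle = List[Optional[str]]  # one entry per issue slot; None marks an idle slot


def replicate_into_idle_slots(
    bundle: Bundle, faulty: FrozenSet[int] = frozenset()
) -> List[Bundle]:
    """Duplicate every operation of `bundle` and rebind originals and replicas
    to non-faulty slots, spilling into extra bundles when they do not fit.

    Simplifying assumption (not from the source): all slots are homogeneous,
    so any operation may be bound to any usable slot.
    """
    width = len(bundle)
    usable = [i for i in range(width) if i not in faulty]
    if not usable:
        raise ValueError("no usable issue slot left")

    # Queue each original followed directly by its replica.
    pending: List[str] = []
    for op in bundle:
        if op is not None:
            pending.extend([op, op + "'"])

    if not pending:
        return [list(bundle)]  # nothing to replicate

    result: List[Bundle] = []
    while pending:
        slots: Bundle = [None] * width
        for i in usable:
            if not pending:
                break
            slots[i] = pending.pop(0)
        result.append(slots)  # each extra bundle costs one additional time slot
    return result


if __name__ == "__main__":
    # Low ILP: the replica fits in an idle slot of the same bundle.
    print(replicate_into_idle_slots(["add", None, None, None]))
    # Higher ILP: replicas do not fit, so one more bundle is added.
    print(replicate_into_idle_slots(["add", "mul", "ld", None]))
    # Slot 0 is permanently faulty: everything is rebound to slots 1..3.
    print(replicate_into_idle_slots(["add", None, None, None], frozenset({0})))
```

The sketch leaves out result comparison and heterogeneous functional units, both of which a real design would need.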


“…First concepts involving coarse-grained lockstepping are promising [18]-[20], but do not address the specific challenges to FT in space [21]. FT using thread-level very-long-instruction word architectures [22], [23] has also been explored, though the approach still requires pipeline-level voters in hardware. Most implement checkpoint & rollback or restart, which makes them unsuitable for spacecraft command & control applications [24], others ignore fault-detection [25], [26], or require external, infallible fault detection entities with deep knowledge about application-intrinsics [27] but no concept of how this could be obtained.…”

Section: Related Work (mentioning)

confidence: 99%

Bringing Fault-Tolerant GigaHertz-Computing to Space: A Multi-stage Software-Side Fault-Tolerance Approach for Miniaturized Spacecraft

Fuchs

Stefanov

Murillo

et al. 2017

2017 IEEE 26th Asian Test Symposium (ATS)


Modern embedded technology is a driving factor in satellite miniaturization, contributing to a massive boom in satellite launches and a rapidly evolving new space industry. Miniaturized satellites, however, suffer from low reliability, as traditional hardware-based fault-tolerance (FT) concepts are ineffective for on-board computers (OBCs) utilizing modern systems-on-a-chip (SoC). Therefore, larger satellites continue to rely on proven processors with large feature sizes. Software-based concepts have largely been ignored by the space industry as they were researched only in theory, and have not yet reached the level of maturity necessary for implementation. We present the first integral, real-world solution to enable fault-tolerant general-purpose computing with modern multiprocessor-SoCs (MPSoCs) for spaceflight, thereby enabling their use in future high-priority space missions. The presented multi-stage approach consists of three FT stages, combining coarse-grained thread-level distributed self-validation, FPGA reconfiguration, and mixed criticality to assure long-term FT and excellent scalability for both resource constrained and critical high-priority space missions. Early benchmark results indicate a drastic performance increase over state-of-the-art radiation-hard OBC designs and considerably lower software- and hardware development costs. This approach was developed for a 4-year European Space Agency (ESA) project, and we are implementing a tiled MPSoC prototype jointly with two industrial partners.


“…These solutions are noninvasive, flexible, and have been proven in GPGPUs [8], but can be very costly in terms of performance [9]. In [10], the authors developed fault-tolerance solutions for parallel processors by adjusting the instruction-level parallelism, increasing the reliability at the cost of workload performance. On the other hand, authors in [11] propose a reduced precision Duplication with Comparison (DWC) approach to increase the reliability in GPUs by replicating instructions and operating them in execution units at different precision, so obtaining redundancy at zero cost, but degrading performance and output precision.…”

Section: Introduction (mentioning)

confidence: 99%

DYRE: a DYnamic REconfigurable solution to increase GPGPU’s reliability

et al. 2021


General-purpose graphics processing units (GPGPUs) are extensively used in high-performance computing. However, it is well known that these devices’ reliability may be limited by the rising of faults at the hardware level. This work introduces a flexible solution to detect and mitigate permanent faults affecting the execution units in these parallel devices. The proposed solution is based on adding some spare modules to perform two in-field operations: detecting and mitigating faults. The solution takes advantage of the regularity of the execution units in the device to avoid significant design changes and reduce the overhead. The proposed solution was evaluated in terms of reliability improvement and area, performance, and power overhead costs. For this purpose, we resorted to a micro-architectural open-source GPGPU model (FlexGripPlus). Experimental results show that the proposed solution can extend the reliability by up to 57%, with overhead costs lower than 2% and 8% in area and power, respectively.
