Abstract:Because of technology scaling, the soft error rate has been increasing in digital circuits, which affects system reliability. Therefore, modern processors, including VLIW architectures, must have means to mitigate such effects to guarantee reliable computing. In this scenario, our work proposes three low overhead fault tolerance approaches based on instruction duplication with zero latency detection, which uses a rollback mechanism to correct soft errors in the pipelanes of a configurable VLIW processor. The f… Show more
“…Hardware-based approaches duplicate the instructions at runtime using specific hardware using the compiler's result. To do so, coupling of the VLIW pipelines is applied [4], [13]. When the duplicated instructions do not fit in the current bundle, an additional time slot is added.…”
Error occurrence in embedded systems has significantly increased. Although inherent resource redundancy exist in processors, such as in Very Long Instruction Word (VLIW) processors, it is not always used due to low application's Instruction Level Parallelism (ILP). Approaches benefit the additional resources to provide fault tolerance. When permanent and soft errors coexist, spare units have to be used or the executed program has to be modified through self-repair or by using several stored versions. However, these solutions introduce high area overhead for the additional resources, time overhead for the execution of the repair algorithm and storage overhead of the multiversioning. To address these limitations, a hardware mechanism is proposed which at run-time replicates the instructions and schedules them at the idle slots considering the resource constraints. If a resource becomes faulty, the proposed approach efficiently rebinds both the original and replicated instructions during execution. In this way, the area overhead is reduced, as no spare resources are used, whereas time and storage overhead are not required. Results show up to 49% performance gain over existing techniques.
“…Hardware-based approaches duplicate the instructions at runtime using specific hardware using the compiler's result. To do so, coupling of the VLIW pipelines is applied [4], [13]. When the duplicated instructions do not fit in the current bundle, an additional time slot is added.…”
Error occurrence in embedded systems has significantly increased. Although inherent resource redundancy exist in processors, such as in Very Long Instruction Word (VLIW) processors, it is not always used due to low application's Instruction Level Parallelism (ILP). Approaches benefit the additional resources to provide fault tolerance. When permanent and soft errors coexist, spare units have to be used or the executed program has to be modified through self-repair or by using several stored versions. However, these solutions introduce high area overhead for the additional resources, time overhead for the execution of the repair algorithm and storage overhead of the multiversioning. To address these limitations, a hardware mechanism is proposed which at run-time replicates the instructions and schedules them at the idle slots considering the resource constraints. If a resource becomes faulty, the proposed approach efficiently rebinds both the original and replicated instructions during execution. In this way, the area overhead is reduced, as no spare resources are used, whereas time and storage overhead are not required. Results show up to 49% performance gain over existing techniques.
“…First concepts involving coarse-grained lockstepping are promising [18]- [20], but do not address the specific challenges to FT in space [21]. FT using thread-level very-long-instruction word architectures [22], [23] has also been explored, though the approach still requires pipelinelevel voters in hardware. Most implement checkpoint & rollback or restart, which makes them unsuitable for spacecraft command & control applications [24], others ignore fault-detection [25], [26], or require external, infallible fault detection entities with deep knowledge about application-intrinsics [27] but no concept of how this could be obtained.…”
Modern embedded technology is a driving factor in satellite miniaturization, contributing to a massive boom in satellite launches and a rapidly evolving new space industry. Miniaturized satellites, however, suffer from low reliability, as traditional hardware-based fault-tolerance (FT) concepts are ineffective for on-board computers (OBCs) utilizing modern systems-on-a-chip (SoC). Therefore, larger satellites continue to rely on proven processors with large feature sizes. Software-based concepts have largely been ignored by the space industry as they were researched only in theory, and have not yet reached the level of maturity necessary for implementation. We present the first integral, real-world solution to enable fault-tolerant general-purpose computing with modern multiprocessor-SoCs (MPSoCs) for spaceflight, thereby enabling their use in future high-priority space missions. The presented multi-stage approach consists of three FT stages, combining coarse-grained thread-level distributed self-validation, FPGA reconfiguration, and mixed criticality to assure long-term FT and excellent scalability for both resource constrained and critical high-priority space missions. Early benchmark results indicate a drastic performance increase over state-of-the-art radiation-hard OBC designs and considerably lower software-and hardware development costs. This approach was developed for a 4-year European Space Agency (ESA) project, and we are implementing a tiled MPSoC prototype jointly with two industrial partners.
“…These solutions are noninvasive, flexible, and have been proven in GPGPUs [8], but can be very costly in terms of performance [9]. In [10], the authors developed fault-tolerance solutions for parallel processors by adjusting the instruction-level parallelism, increasing the reliability at the cost of workload performance. On the other hand, authors in [11] propose a reduced precision Duplication with Comparison (DWC) approach to increase the reliability in GPUs by replicating instructions and operating them in execution units at different precision, so obtaining redundancy at zero cost, but degrading performance and output precision.…”
General-purpose graphics processing units (GPGPUs) are extensively used in high-performance computing. However, it is well known that these devices’ reliability may be limited by the rising of faults at the hardware level. This work introduces a flexible solution to detect and mitigate permanent faults affecting the execution units in these parallel devices. The proposed solution is based on adding some spare modules to perform two in-field operations: detecting and mitigating faults. The solution takes advantage of the regularity of the execution units in the device to avoid significant design changes and reduce the overhead. The proposed solution was evaluated in terms of reliability improvement and area, performance, and power overhead costs. For this purpose, we resorted to a micro-architectural open-source GPGPU model (FlexGripPlus). Experimental results show that the proposed solution can extend the reliability by up to 57%, with overhead costs lower than 2% and 8% in area and power, respectively.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.