Improving Selective Fault Tolerance in GPU Register Files by Relaxing Application Accuracy

Goncalves, Marcio M.; Lamb, Ivan Peter; Rech, Paolo; Brum, Raphael Martins; Azambuja, Jose Rodrigo

doi:10.1109/tns.2020.2982162

Cited by 10 publications

(4 citation statements)

References 21 publications

“…-Detection with redundancy without diversity [163,6,191,178,199,57,73,138,99,164,34,170,120,140,77,127,139] and with diversity [13,14,8,12] -Detection and/or correction with coding (e.g., ECC) and checkers [57,120,169,139,108,124,125,140,132] -Recovery with re-execution or checkpoints [71,175,180,135,115,124] -Mitigation with shielding and reconfiguration [153,127,91,193] • Application-dependent:…”

Section: Random Hw Failures (mentioning)

confidence: 99%

“…Processing Units Due to the limited public documentation, controllability and observability of SM architectures, few works target the reliability of SMs explicitly. In general, works attempt to improve SMs (and other components simultaneously) employing different software-based redundancy techniques for the whole SM, its cores only, or parts of those cores (e.g., pipeline registers [163]) using available underutilized resources in order to reduce the computing overhead [6,191,178,199,57,73]. Some authors combine software redundancy with diversity to mitigate common cause failures by, for instance, making redundant threads execute with some staggering in different cores [13,14,8].…”

Section: Components (mentioning)

confidence: 99%

GPU Devices for Safety-Critical Systems: A Survey

et al. 2022

Graphics Processing Unit (GPU) devices and their associated software programming languages and frameworks can deliver the computing performance required to facilitate the development of next-generation high-performance safety-critical systems such as autonomous driving systems. However, the integration of complex, parallel and computationally demanding software functions with different safety-criticality levels on GPU devices with shared hardware resources contributes to several safety certification challenges. This survey categorizes and provides an overview of research contributions that address GPU devices’ random hardware failures, systematic failures and independence of execution.

“…Firstly, the overhead assessment determines the cost in terms of hardware, power, and performance of the DYRE architecture. For this purpose, the DYRE architecture is compared against the original design, DDWC, which is based only on fault-detection [7], and with BISR, which is based only on fault mitigation [9]. The original GPGPU and the three fault-tolerance mechanisms were synthesized using the Design Compiler tool using the 15 nm Nand gate Open-cell library and one clock of 500 MHz.…”

Section: Experimental Evaluation (mentioning)

confidence: 99%

“…The software solutions rely on modified versions of the application code to harden and mitigate fault effects [7]. These solutions are noninvasive, flexible, and have been proven in GPGPUs [8], but can be very costly in terms of performance [9].…”

Section: Introduction (mentioning)

confidence: 99%

DYRE: a DYnamic REconfigurable solution to increase GPGPU’s reliability

et al. 2021

General-purpose graphics processing units (GPGPUs) are extensively used in high-performance computing. However, it is well known that these devices’ reliability may be limited by the rising of faults at the hardware level. This work introduces a flexible solution to detect and mitigate permanent faults affecting the execution units in these parallel devices. The proposed solution is based on adding some spare modules to perform two in-field operations: detecting and mitigating faults. The solution takes advantage of the regularity of the execution units in the device to avoid significant design changes and reduce the overhead. The proposed solution was evaluated in terms of reliability improvement and area, performance, and power overhead costs. For this purpose, we resorted to a micro-architectural open-source GPGPU model (FlexGripPlus). Experimental results show that the proposed solution can extend the reliability by up to 57%, with overhead costs lower than 2% and 8% in area and power, respectively.
