2020
DOI: 10.1109/tns.2020.2982162

Improving Selective Fault Tolerance in GPU Register Files by Relaxing Application Accuracy

Cited by 10 publications (4 citation statements)
References 21 publications
“…-Detection with redundancy without diversity [163,6,191,178,199,57,73,138,99,164,34,170,120,140,77,127,139] and with diversity [13,14,8,12] -Detection and/or correction with coding (e.g., ECC) and checkers [57,120,169,139,108,124,125,140,132] -Recovery with re-execution or checkpoints [71,175,180,135,115,124] -Mitigation with shielding and reconfiguration [153,127,91,193] • Application-dependent:…”
Section: Random Hw Failures (mentioning)
confidence: 99%
“…Processing Units Due to the limited public documentation, controllability and observability of SM architectures, few works target the reliability of SMs explicitly. In general, works attempt to improve SMs (and other components simultaneously) employing different software-based redundancy techniques for the whole SM, its cores only, or parts of those cores (e.g., pipeline registers [163]) using available underutilized resources in order to reduce the computing overhead [6,191,178,199,57,73]. Some authors combine software redundancy with diversity to mitigate common cause failures by, for instance, making redundant threads execute with some staggering in different cores [13,14,8].…”
Section: Components (mentioning)
confidence: 99%
“…Firstly, the overhead assessment determines the cost in terms of hardware, power, and performance of the DYRE architecture. For this purpose, the DYRE architecture is compared against the original design, DDWC, which is based only on fault-detection [7], and with BISR, which is based only on fault mitigation [9]. The original GPGPU and the three fault-tolerance mechanisms were synthesized using the Design Compiler tool using the 15 nm Nand gate Open-cell library and one clock of 500 MHz.…”
Section: Experimental Evaluation (mentioning)
confidence: 99%
“…The software solutions rely on modified versions of the application code to harden and mitigate fault effects [7]. These solutions are noninvasive, flexible, and have been proven in GPGPUs [8], but can be very costly in terms of performance [9].…”
Section: Introduction (mentioning)
confidence: 99%