2013
DOI: 10.1016/j.jocs.2013.05.002

On-line soft error correction in matrix–matrix multiplication

Cited by 19 publications (18 citation statements)
References 9 publications
“…Application-level fault tolerance mechanisms, such as algorithm-based fault tolerance [27], [28], [29], are extensively studied as a means to increase application resilience to transient faults on data objects. However, those mechanisms can come with big performance and energy overheads (e.g., 35% performance loss in [3]).…”
Section: Case Study (mentioning)
confidence: 99%
“…Existing techniques that can ensure reliability to SDCs comprise two categories: (i) algorithm-based fault tolerance (ABFT), i.e., methods using checksums specifically tailored to the algorithm under consideration, that can reliably detect (and possibly correct) up to a limited number of SDCs [13], [17], [19], [25], [39], [46], [47], [60]; (ii) systems with dual modular redundancy (DMR), where all non-coinciding SDCs can be detected if the same operation is duplicated in two separate processors (or threads) that cross-validate their results [21], but SDCs cannot be corrected without using triple modular redundancy (TMR) [23].…”
Section: A Summary Of Prior Work (mentioning)
confidence: 99%
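
The DMR/TMR distinction in the statement above can be illustrated with a minimal sketch. It is not taken from any of the cited papers: the function names `dmr_check` and `tmr_vote`, the dot-product workload, and the simulated bit flip are all assumptions made for illustration.

```python
from collections import Counter


def dmr_check(r1, r2):
    """DMR: compare two replicas. A mismatch reveals an error,
    but not which of the two results is the wrong one."""
    return r1 == r2


def tmr_vote(r1, r2, r3):
    """TMR: majority vote over three replicas, which corrects
    a single faulty replica."""
    value, count = Counter([r1, r2, r3]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: more than one replica disagrees")
    return value


def dot(x, y):
    """Toy workload standing in for the duplicated operation."""
    return sum(a * b for a, b in zip(x, y))


x, y = [1, 2, 3], [4, 5, 6]
good = dot(x, y)
bad = good ^ (1 << 4)  # hypothetical silent bit flip in one replica's result

dmr_ok = dmr_check(good, bad)           # error detected, cannot be corrected
tmr_result = tmr_vote(good, bad, good)  # error outvoted by the two good replicas
```

With two replicas the mismatch is visible but unresolvable; the third replica is what allows the faulty value to be outvoted.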
“…All ABFT methods specifically tailored for GEMM computations [13], [17], [25], [60], [61] append the input subblocks with (redundant) checksum vectors (rows or columns), denoted by a_c, b_r in Figure 1(b) and highlighted in color.…”
Section: A Algorithm-based Fault Tolerance (mentioning)
confidence: 99%
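
The checksum-appending idea in the statement above can be sketched as follows. This is a textbook-style illustration under stated assumptions, not the method of any specific cited paper: the helper names are hypothetical, integer matrices are used so that checksum comparisons are exact (floating-point code would need a rounding tolerance), and only a single corrupted data element is handled, not a corrupted checksum entry.

```python
def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def encode_columns(A):
    """Append a checksum row holding the sum of each column of A."""
    return A + [[sum(col) for col in zip(*A)]]


def encode_rows(B):
    """Append a checksum column holding the sum of each row of B."""
    return [row + [sum(row)] for row in B]


def locate_error(Cf):
    """Return (i, j, delta) for one corrupted data element of the
    full-checksum product Cf, or None if all checksums agree."""
    m, n = len(Cf) - 1, len(Cf[0]) - 1
    bad_rows = [i for i in range(m) if sum(Cf[i][:n]) != Cf[i][n]]
    bad_cols = [j for j in range(n)
                if sum(Cf[i][j] for i in range(m)) != Cf[m][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        delta = sum(Cf[i][:n]) - Cf[i][n]  # amount by which Cf[i][j] is off
        return i, j, delta
    raise ValueError("pattern not correctable with one checksum row/column")


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# The product of the encoded inputs carries row and column checksums of A*B.
Cf = matmul(encode_columns(A), encode_rows(B))

Cf[1][0] += 7                 # hypothetical soft error in one result element
located = locate_error(Cf)    # intersection of the failing row and column
i, j, delta = located
Cf[i][j] -= delta             # subtract the discrepancy to correct it

C = [row[:-1] for row in Cf[:-1]]  # strip checksums to recover A*B
```

The failing row check and the failing column check intersect at the corrupted element, and the size of the row discrepancy gives the correction.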
“…Some researchers went in another direction in order to tolerate more faults effectively. Realizing that the offline approach taken by the traditional ABFT techniques have to face catastrophic error propagation at the end, researchers attempted to adapt checksum schemes for online error detection and correction [5,25,24]. The idea is that online ABFT catches errors early on when they are not propagated far away, therefore making it easier to correct.…”
Section: Related Work (mentioning)
confidence: 99%
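
The "catch errors early" idea in the statement above can be sketched with an outer-product formulation, in which the checksum relation holds after every partial update and can therefore be verified during the computation instead of only at the end. This is an illustrative sketch, not the algorithm of the cited papers: the function names, the `check_every` parameter, and the `fault` injection tuple are assumptions, and integers are used so that comparisons are exact.

```python
def mismatches(Cf):
    """Indices of data rows and data columns whose checksum does not match."""
    m, n = len(Cf) - 1, len(Cf[0]) - 1
    rows = [i for i in range(m) if sum(Cf[i][:n]) != Cf[i][n]]
    cols = [j for j in range(n)
            if sum(Cf[i][j] for i in range(m)) != Cf[m][j]]
    return rows, cols


def online_abft_gemm(A, B, check_every=1, fault=None):
    """Compute A*B as a sum of outer products, verifying checksums
    every `check_every` steps and correcting a single bad element.

    fault: optional hypothetical (step, i, j, delta) injected for demonstration.
    Returns (C, corrections).
    """
    m, steps, n = len(A), len(B), len(B[0])
    Ac = A + [[sum(col) for col in zip(*A)]]   # column-checksum row appended
    Br = [row + [sum(row)] for row in B]       # row-checksum column appended
    Cf = [[0] * (n + 1) for _ in range(m + 1)]
    corrections = []
    for s in range(steps):
        # Rank-1 update: column s of Ac times row s of Br.
        for i in range(m + 1):
            for j in range(n + 1):
                Cf[i][j] += Ac[i][s] * Br[s][j]
        if fault is not None and fault[0] == s:
            _, fi, fj, fdelta = fault
            Cf[fi][fj] += fdelta               # simulated soft error
        if (s + 1) % check_every == 0 or s == steps - 1:
            rows, cols = mismatches(Cf)
            if len(rows) == 1 and len(cols) == 1:
                i, j = rows[0], cols[0]
                delta = sum(Cf[i][:n]) - Cf[i][n]
                Cf[i][j] -= delta              # correct before continuing
                corrections.append((s, i, j, delta))
            elif rows or cols:
                raise RuntimeError("uncorrectable pattern at step %d" % s)
    return [row[:n] for row in Cf[:m]], corrections


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, corrections = online_abft_gemm(A, B, check_every=1, fault=(0, 0, 1, 100))
```

A smaller `check_every` means more frequent checks and more overhead, but it limits how many errors can accumulate between checks, which matters because a single checksum row and column can only correct one bad element at a time.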
“…The fault model is a deciding factor in the design of ABFT codes and adaption to the associated algorithm. However the fault models used in existing ABFT research are either too abstract [8,14,10] or too simplistic [5,24,25] limiting their use where the architectural fault models do not fit. In this work we rethink the fault model and explore the challenges if we use a comprehensive architectural fault model that allows both logic/arithmetic faults and storage faults in main memory, on-chip memory, and other datapaths.…”
Section: Introduction (mentioning)
confidence: 99%