2009
DOI: 10.1016/j.jpdc.2008.12.002

Algorithm-based fault tolerance applied to high performance computing

Cited by 186 publications (147 citation statements)
References 23 publications
“…With checkpointing, several application-specific detectors can be used to avoid replication and increase performance in failure-free executions. Two well-known examples are Algorithm-Based Fault Tolerance (ABFT) [38,39,40] and silent error detectors based on domain-specific data analytics [41,42,43].…”
Section: Soft and Silent Errors
mentioning
confidence: 99%
“…Strong scalability involves doubling the number of nodes but maintaining the constant configuration size for the chosen scientific code, while weak scalability occurs when the number of nodes is doubled and the problem size is also concurrently increased (Bosilca, Delmas, Dongarra, & Langou, 2009; Varma, Wang, Mueller, Engelmann, & Scott, 2006). Using DL_POLY_2.18, the strong scalability of the model was studied in two different ways, namely, by using a small and large system (Tang, 2007).…”
Section: Design Of Computational Experiments
mentioning
confidence: 99%
“…In the small system, DL_POLY_2 was employed for a simulation comprising 8640 atoms (that is Sodium (Na) = 960, Potassium (K) = 960, Silicon (Si) = 1920, Oxygen (O) = 4800) (Bosilca et al, 2009). The chemical system of atoms were simulated in a cubic box of size 48.358 angstroms for each of its X, Y and Z axes.…”
Section: Design Of Computational Experiments
mentioning
confidence: 99%
“…Indeed, application-specific information enables ad hoc solutions, which dramatically decrease the cost of error detection. Algorithm-based fault tolerance (ABFT) [29,30,31] is a well-known technique, which uses checksums to detect up to a certain number of errors in linear algebra kernels. Unfortunately, ABFT can only protect datasets in linear algebra kernels, and it must be implemented for each different kernel, which incurs a large amount of work for large HPC applications.…”
Section: Related Work
mentioning
confidence: 99%
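
The citation statement above describes ABFT as using checksums to detect errors in linear algebra kernels. As a rough illustration only, the following sketch shows the textbook checksum scheme for matrix multiplication. It is not taken from the cited paper or from any of the citing publications. The function names (`with_column_checksum`, `with_row_checksum`, `find_errors`) are hypothetical, and integer matrices are assumed so that checksums can be compared exactly, without a rounding tolerance.

```python
def matmul(a, b):
    """Plain triple-loop matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append one extra column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]


def find_errors(c_full):
    """Return (bad_rows, bad_cols): data rows and columns of the
    checksummed product whose sums disagree with the stored checksums."""
    n = len(c_full) - 1      # number of data rows
    m = len(c_full[0]) - 1   # number of data columns
    bad_rows = [i for i in range(n) if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    return bad_rows, bad_cols


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

# The product of the checksummed inputs carries checksums of a * b.
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
clean_report = find_errors(c_full)

# Simulate a silent error in one data element (assumed fault model).
c_full[1][0] += 7
faulty_report = find_errors(c_full)

# A single bad element sits at the intersection of the flagged row and
# column, and can be rebuilt from the row checksum.
i, j = faulty_report[0][0], faulty_report[1][0]
others = sum(c_full[i][k] for k in range(len(c_full[0]) - 1) if k != j)
c_full[i][j] = c_full[i][-1] - others
repaired_report = find_errors(c_full)
```

The sketch also hints at the limitation the statement mentions: the checksum relation holds because matrix multiplication is linear, so a different kernel would need its own encoding and verification step.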