2016
DOI: 10.1177/1094342015628056

Complex scientific applications made fault-tolerant with the sparse grid combination technique

Abstract: Ultra-large-scale simulations via solving partial differential equations (PDEs) require very large computational systems for their timely solution. Studies have shown that the rate of failure grows with the system size, and these trends are likely to worsen in future machines. Thus, as systems, and the problems solved on them, continue to grow, the ability to survive failures is becoming a critical aspect of algorithm development. The sparse grid combination technique (SGCT), which is a cost-effective method for solving …

Cited by 21 publications
(15 citation statements)
references
References 36 publications
“…There already exist some fault-tolerant scientific applications. For example Ali et al (2016) implemented a fault-tolerant numeric linear equation and partial equation solver. Obersteiner et al (2017) extended a plasma simulation, Laguna et al (2016) a molecular dynamics simulation, and Engelmann and Geist (2003) a Fast Fourier Transformation that gracefully handle hardware faults.…”
Section: Related Work
confidence: 99%
“…Researchers have already used ULFM for other scientific software (Ali et al, 2016; Engelmann and Geist, 2003; Kohl et al, 2017; Laguna et al, 2016; Obersteiner et al, 2017). ULFM reports failures by returning an error on at least one rank which participated in the failed communication.…”
Section: The New MPI Standard and User Level
confidence: 99%
“…The operation MPIX_Comm_failure_ack enables users to acknowledge all locally notified failures in the communication context. When using unnamed communications, this routine provides the application a way to resume any-source operations, as long as the list of failed processes does not change.…”
Section: Failure Notification
confidence: 99%
“…Ali et al [1] focus on non-shrinking recovery of PDE-based applications, re-spawning replacement processes on the same node, when they are still available, or otherwise in pre-allocated spare-nodes. As in their shrinking proposal [47], they use the SGCT Algorithm-Based Fault Tolerance (ABFT) strategy to approximate recovery of multiple failures, rather than the exact recovery through checkpointing.…”
Section: Non-shrinking Solutions
confidence: 99%
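
For orientation, the combination idea behind the SGCT mentioned in the statement above can be sketched with the classical two-dimensional combination formula, in which component solutions on grids with level indices (i, j) are added with coefficient +1 when i + j = n + 1 and subtracted with coefficient -1 when i + j = n. This is only a minimal sketch of that textbook formula, not the fault-tolerant variant used by Ali et al.; the function names and the `component_solution` callback are hypothetical and chosen for illustration.

```python
def combination_coefficients(level):
    """Classical 2D combination coefficients for a given level n >= 1.

    Returns a dict mapping component-grid level indices (i, j) to +1 or -1.
    Assumption: level indices start at 1 in both dimensions.
    """
    if level < 1:
        raise ValueError("level must be at least 1")
    coeffs = {}
    # Upper diagonal: i + j = level + 1, coefficient +1.
    for i in range(1, level + 1):
        coeffs[(i, level + 1 - i)] = 1
    # Lower diagonal: i + j = level, coefficient -1.
    for i in range(1, level):
        coeffs[(i, level - i)] = -1
    return coeffs


def combine(level, component_solution, x, y):
    """Evaluate the combined solution at the point (x, y).

    component_solution(i, j, x, y) is a hypothetical callback that returns
    the value at (x, y) of the solution computed on component grid (i, j).
    """
    return sum(
        c * component_solution(i, j, x, y)
        for (i, j), c in combination_coefficients(level).items()
    )
```

Because the coefficients always sum to one, any quantity that every component grid reproduces exactly is also reproduced exactly by the combination. Approximate recovery in the sense of the statement above would involve choosing a different coefficient set when some component grids are lost, which this sketch does not attempt.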