2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014
DOI: 10.1109/ipdpsw.2014.132

Application Level Fault Recovery: Using Fault-Tolerant Open MPI in a PDE Solver

Cited by 24 publications
(22 citation statements)
References 9 publications
“…It is tolerated as follows. Before the SGCT algorithm is applied, the loss of some processes in P_i is detected using ULFM MPI (see [16] for details). Replacement processes are then created (with the same process grid size as P_i) on the same node when node failure is not happened.…”
Section: A. SGCT Algorithm Overview and Process Organization (mentioning)
confidence: 99%
“…In order to tolerate these failures, the faulty communicator is reconstructed (line 15, details are in [16]). Then the fault-tolerant SGCT is applied using the communicator W, with the combined solution u_I^c being used to re-initialize the sub-grid solutions {u_i} (lines 16-17).…”
Section: B. Integration of the SGCT Algorithm into Higher-dimensional (mentioning)
confidence: 99%
“…Although generally exhibiting excellent performance and resiliency, ABFT requires that the algorithm is innately able to incorporate fault tolerance and therefore might be a less generalist approach. In ABFT applications that require the restoration of a full set of processes [1], the recovery procedure for the MPI layer actually has strikingly similar requirement to the deployment of coordinated checkpoint with in-place restart. Unlike checkpoint/restart, this is not only an optimization, but a hard requirement.…”
Section: Forward Recovery With Complex Patterns (mentioning)
confidence: 99%
“…High Performance Computing, as observed by the Top 500 ranking, has exhibited a constant progression of the computing power by a factor of two every 18 months for the last 15 years and the pace of progress has been only slightly disturbed by the financial turmoil in 2008. Following the long-term trend, the Exaflops milestone should be reached as soon as 2022.…”
Section: Introduction (mentioning)
confidence: 99%