Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors

Benoît, Anne; Cavelan, Aurélien; Ciorba, Florina M.; Le Fèvre, Valentin; Robert, Yves

doi:10.15803/ijnc.9.1_2

Cited by 7 publications

(3 citation statements)

References 41 publications

“…Fault tolerant protocols for other parallel programming models, such as PGAS [20] have been also explored. The combination of checkpointing the output of tasks and replicating for application-specific detection is explored in [2] for a linear workflow context, in the presence of both fail-stop and silent faults. Finally, in a recent study, the authors of [21] explore the combination of replication with checkpointing for fail-stop errors, and compute the optimal checkpoint interval for this approach.…”

Section: Background and Related Work (mentioning)

confidence: 99%

“…SEDAR can detect and recover from all transient faults that cause SDC and TOE (Time Out Errors). Three different ways are provided by SEDAR so it can achieve full silent error coverage: (1) only detection with notification; (2) recovery based on multiple system-level checkpoints; and (3) recovery utilizing a single safe application-level checkpoint. Each of these alternatives has particular features and provides a different cost-performance trade-off.…”

Section: Introduction (mentioning)

confidence: 99%

“…In the area of High-Performance Computing (HPC), parallel systems continue increasing the number of components to improve their performance and, as a consequence, ensuring their reliability has become a critical issue. Nowadays, fault rates involve just a few hours on modern platforms [1] but it is forecasted that large parallel applications will have to manage fault rates of barely some minutes in exascale supercomputers [2]. In that sense, these applications require some help to progress efficiently.…”

Section: Introduction (mentioning)

confidence: 99%

Soft errors detection and automatic recovery based on replication combined with different levels of checkpointing

Montezanti, Rucci, Giusti, et al. 2020

Future Generation Computer Systems

Handling faults is a growing concern in HPC. In future exascale systems, it is projected that silent undetected errors will occur several times a day, increasing the occurrence of corrupted results. In this article, we propose SEDAR, which is a methodology that improves system reliability against transient faults when running parallel message-passing applications. Our approach, based on process replication for detection, combined with different levels of checkpointing for automatic recovery, has the goal of helping users of scientific applications to obtain executions with correct results. SEDAR is structured in three levels: (1) only detection and safestop with notification; (2) recovery based on multiple system-level checkpoints; and (3) recovery based on a single valid user-level checkpoint. As each of these variants supplies a particular coverage but involves limitations and implementation costs, SEDAR can be adapted to the needs of the system. In this work, a description of the methodology is presented and the temporal behavior of employing each SEDAR strategy is mathematically described, both in the absence and presence of faults. A model that considers all the fault scenarios on a test application is introduced to show the validity of the detection and recovery mechanisms. An overhead evaluation of each variant is performed with applications involving different communication patterns; this is also used to extract guidelines about when it is beneficial to employ each SEDAR protection level. As a result, we show its efficacy and viability to tolerate transient faults in target HPC environments.
