Checkpointing vs. Supervision Resilience Approaches for Dynamic Independent Tasks

Posner, Jonas; Reitz, Lukas; Fohry, Claudia

doi:10.1109/ipdpsw52791.2021.00089

Cited by 3 publications

(1 citation statement)

References 40 publications

“…A recent study has compared the algorithms from Kestor et al [7] and Posner et al [8] in DIT, to which the NFJ algorithm was transferred [21]. The study reported overheads below 1% for both algorithms, with those of the NFJ-specific algorithm [7] being lower in failure-free cases, and those of the checkpointing algorithm [8] being lower during recovery.…”

Section: Introduction (mentioning)

confidence: 99%

Checkpointing and Localized Recovery for Nested Fork-Join Programs

Fohry

2021

Preprint

Self Cite

While checkpointing is typically combined with a restart of the whole application, localized recovery permits all but the affected processes to continue. In task-based cluster programming, for instance, the application can then be finished on the intact nodes, and the lost tasks can be reassigned. This extended abstract suggests adapting a checkpointing and localized recovery technique, originally developed for independent tasks, to nested fork-join programs. We consider a Cilk-like work stealing scheme with work-first policy in a distributed memory setting, and describe the required algorithmic changes. The original technique has checkpointing overheads below 1% and negligible recovery costs; we expect the new algorithm to achieve similar performance.
