Checkpointing Process Groups in a Grid Environment

Mehnert-Spahn, John; Schöttner, Michael; Morin, Christine

doi:10.1109/pdcat.2008.14

Cited by 2 publications

(3 citation statements)

References 16 publications

“…Since no separate UNIX session or UNIX process group is initiated at job submission, BLCR cannot use these two process group semantics. More information can be found under [18].…”

Section: Process Groups (mentioning)

confidence: 99%

The Architecture of the XtreemOS Grid Checkpointing Service

Mehnert-Spahn, Ropars, Schoettner, et al. 2009

Lecture Notes in Computer Science

Self Cite

“…The kernel is still able to distinguish equally named identifiers used across multiple applications, by isolating resource groups and mapping each identifier to a unique one at kernel level. Integration of state-of-the-art lightweight virtualization mechanisms provided by mainline Linux into XtreemOS is in progress, see also [18].…”

Section: Resource Conflicts (mentioning)

confidence: 99%

“…Thus it provides a more flexible solution with advantages of both reduced runtime overhead and localized recovery effect. Recent reports have shown feasibility as well as efficiency of applying group checkpoints to certain MPI programs [13] and in specific grid computing environments [14].…”

Section: Introduction (mentioning)

confidence: 99%

A Synchronization-Induced Checkpoint Protocol for Group-Synchronous Parallel Programs

Wei, Goswami 2012

2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies

Group checkpointing is a fix between global checkpointing and log-based recovery. It features both reduced runtime overhead and localized recovery effect for improving the fault-tolerance performance of large-scale distributed systems. However, parallel programs cannot efficiently benefit from this strategy, as they often involve synchronous or semisynchronous interactions that incur extra idling delays between processes as well as between process groups. This paper presents an analytical study on such delays and the corresponding delay optimization strategies. Observing that certain parallel programs exhibit patterns of "synchronization groups", we develop a Synchronization-Induced Checkpoint protocol that manages checkpoints around such groups. The protocol keeps advantages of ordinary group checkpointing, and meanwhile minimizes the costs of synchronization-induced delays. We also broadly categorize the known synchronization patterns and establish their relations with suitable checkpoint strategies for parallel programs. (Abstract)
