Daniel Marques scite author profile

Because of increasing hardware and software complexity, the running time of many computational science applications now exceeds the mean-time-to-failure of high-performance computing platforms. Therefore, computational science applications need to tolerate hardware failures. In this paper, we focus on the stopping failure model, in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated non-blocking checkpointing, and that existing fault-tolerance protocols in the literature are not suitable for implementing this approach. In this paper, we present a suitable protocol and show how it can be used with a precompiler that instruments C/MPI programs to save application and MPI library state. An advantage of our approach is that it is independent of the MPI implementation. We present experimental results that argue that the overhead of using our system can be small.

Implementation and Evaluation of a Scalable Application-Level Checkpoint-Recovery Scheme for MPI Programs

Schulz, Bronevetsky, Fernandes, et al.

The running times of many computational science applications are much longer than the mean-time-to-failure of current high-performance computing platforms. Therefore, to run to completion, these applications must tolerate hardware failures. Checkpoint-and-restart (CPR) is the most commonly used scheme for accomplishing this: the state of the computation is saved periodically on stable storage, and when a hardware failure is detected, the computation is restarted from the most recently saved state. Most automatic CPR schemes in the literature can be classified as blocking, system-level checkpointing schemes because they take core-dump style snapshots of the computational state when all the processes are blocked at global barriers in the program. Unfortunately, a system that implements this style of checkpointing is tied to a particular platform; in addition, it cannot be used if there are no global barriers in the program. In our research project, we are exploring an alternative called non-blocking application-level checkpointing. In our approach, programs are transformed by a pre-processor so that they become self-checkpointing and self-restartable on any platform; there is also no assumption about the existence of global barriers in the code. In this paper, we describe our implementation of non-blocking application-level checkpointing. We present experimental results on both a Windows cluster and the Lemieux system at the Pittsburgh Supercomputing Center, and argue that these results demonstrate both the platform-independence and the scalability of our approach.
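
The checkpoint-and-restart cycle described in this abstract can be illustrated with a minimal single-process sketch in Python. It is not the authors' system, which transforms C/MPI programs with a pre-processor. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the simulated failure are all assumptions chosen for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to storage atomically: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # A crash before this line leaves the previous checkpoint intact.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the most recently saved state, or None if none exists."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` is a hypothetical hook that simulates a hardware failure
    by raising an exception when that step is reached.
    """
    # Resume from the last checkpoint if one exists, else start fresh.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

Calling `run` again with the same `path` after a failure resumes from the last saved step rather than from step 0, so only the work done since the most recent checkpoint is repeated.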

Optimizing Checkpoint Sizes in the C3 System

Marques, Bronevetsky, Fernandes, et al.

Compiler-Enhanced Incremental Checkpointing

Bronevetsky, Marques, Pingali, et al. 2008

Compiler-enhanced incremental checkpointing for OpenMP applications

Bronevetsky, Marques, Pingali, et al. 2008

As modern supercomputing systems reach the petaflop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures, enabling applications to periodically save their state and restart computation after a failure. Although a variety of automated system-level checkpointing solutions are currently available to HPC users, manual application-level checkpointing remains more popular due to its superior performance. This paper improves the performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis, which works with both sequential and OpenMP applications, reduces checkpoint sizes by as much as 80% and enables asynchronous checkpointing.
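
The idea of incremental checkpointing, saving only what changed since the previous checkpoint, can be sketched as follows. Note the substitution: the paper uses a compiler analysis to decide what needs saving, whereas this sketch detects changes at run time by comparing content hashes, which is a simpler and different mechanism. The class name `IncrementalCheckpointer` and its methods are hypothetical.

```python
import hashlib
import pickle


class IncrementalCheckpointer:
    """Saves only the named variables whose contents changed since the
    previous checkpoint (run-time hashing, not compiler analysis)."""

    def __init__(self):
        self._digests = {}    # variable name -> hash at last checkpoint
        self.increments = []  # one dict per checkpoint: name -> pickled bytes

    def checkpoint(self, variables):
        """Record the changed variables and return their names, sorted."""
        delta = {}
        for name, value in variables.items():
            blob = pickle.dumps(value)
            digest = hashlib.sha256(blob).hexdigest()
            if self._digests.get(name) != digest:
                delta[name] = blob
                self._digests[name] = digest
        self.increments.append(delta)
        return sorted(delta)

    def restore(self):
        """Rebuild the full state by replaying increments oldest to newest."""
        state = {}
        for delta in self.increments:
            for name, blob in delta.items():
                state[name] = pickle.loads(blob)
        return state
```

In this sketch an unchanged variable costs one hash computation per checkpoint but no storage, which is where the reduction in checkpoint size would come from. A compiler analysis could avoid even the hashing cost by establishing statically which variables may have been written.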

Collective operations in application-level fault-tolerant MPI

Bronevetsky, Marques, Pingali, et al. 2003

Fault-tolerance is becoming a critical issue on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually has too much overhead. In practice, programmers do manual checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs without global barriers. In an earlier paper, we presented a distributed checkpoint coordination protocol which handles MPI's point-to-point constructs, while dealing with the unique challenges of application-level checkpointing. The protocol is implemented by a thin software layer that sits between the application program and the MPI library, so it does not require any modifications to the MPI library. However, it did not handle collective communication, which is a very important part of MPI. In this paper, we extend the protocol to handle MPI's collective communication constructs. We also present experimental results that show that the overhead introduced by the protocol for collective operations is small.
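
The structure of a thin layer that sits between the application and the message-passing library can be sketched as a wrapper exposing the same `send`/`recv` calls. This is an illustration only and not the authors' protocol: `InMemoryTransport` is a hypothetical stand-in for an MPI library, and tagging each message with a checkpoint epoch is an assumption about the kind of bookkeeping such a layer might perform.

```python
from collections import deque


class InMemoryTransport:
    """Hypothetical stand-in for a message-passing library."""

    def __init__(self):
        self._queues = {}  # destination rank -> queue of messages

    def send(self, dest, payload):
        self._queues.setdefault(dest, deque()).append(payload)

    def recv(self, rank):
        return self._queues[rank].popleft()


class CheckpointLayer:
    """Wrapper with the same send/recv shape as the transport.

    It tags each outgoing message with the sender's checkpoint epoch and
    strips the tag on receipt, so the underlying library is unmodified.
    """

    def __init__(self, transport, rank):
        self._transport = transport
        self._rank = rank
        self.epoch = 0         # number of local checkpoints taken so far
        self.cross_epoch = []  # payloads whose sender was in another epoch

    def take_checkpoint(self):
        self.epoch += 1

    def send(self, dest, payload):
        self._transport.send(dest, (self.epoch, payload))

    def recv(self):
        sender_epoch, payload = self._transport.recv(self._rank)
        if sender_epoch != self.epoch:
            # The message crossed a checkpoint boundary; record it so a
            # recovery procedure could treat it specially.
            self.cross_epoch.append(payload)
        return payload
```

Because the application only ever calls the wrapper, the wrapped library can be swapped for another implementation without changing either the application or the layer's logic, which mirrors the implementation-independence the abstract claims.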

C3: A System for Automating Application-Level Checkpointing of MPI Programs

Bronevetsky, Marques, Pingali, et al. 2004

Fault-tolerance is becoming necessary on high-performance platforms. Checkpointing techniques make programs fault-tolerant by saving their state periodically and restoring this state after failure. System-level checkpointing saves the state of the entire machine on stable storage, but this usually has too much overhead. In practice, programmers do manual application-level checkpointing by writing code to (i) save the values of key program variables at critical points in the program, and (ii) restore the entire computational state from these values during recovery. However, this can be difficult to do in general MPI programs. In [2, 3] we presented a distributed checkpoint coordination protocol which handles MPI's point-to-point and collective constructs, while dealing with the unique challenges of application-level checkpointing. We have implemented our protocols as part of a thin software layer that sits between the application program and the MPI library, so it does not require any modifications to the MPI library. This thin layer is used by C3 (the Cornell Checkpoint (pre-)Compiler), a tool that automatically converts an MPI application into an equivalent fault-tolerant version. In this paper, we summarize our work on this system to date. We also present experimental results that show that the overhead introduced by the protocols is small, and we discuss a number of future areas of research.

Daniel Marques

Automated application-level checkpointing of MPI programs
