Keita Teranishi scite author profile

Application resilience is a key challenge that has to be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures as compared to the more traditional approach where the job is terminated and restarted from the last checkpoint. In this paper, we explore how local recovery can be used for certain classes of applications to further reduce overheads due to resilience. Specifically, we develop programming support and scalable runtime mechanisms to enable online and transparent local recovery for stencil-based parallel applications on current leadership-class systems. We also show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution. We integrate these mechanisms with the S3D combustion simulation, and experimentally demonstrate (using the Titan Cray-XK7 system at ORNL) the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance, at scales up to 262,144 cores.

RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading

Hukerikar

Teranishi

Diniz

et al. 2017

Int J Parallel Prog

In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. However, the use of complete redundancy incurs significant overhead to the application performance. In this paper we present RedThreads, an interface that provides application-level fault detection and correction based on RMT, but applies the thread-level redundancy adaptively. We describe the RedThreads syntax and semantics, and the supporting compiler infrastructure and runtime system. Our approach enables

Practical scalable consensus for pseudo-synchronous distributed systems

Hérault

Bouteiller

Bosilca

et al. 2015

The ability to consistently handle faults in a distributed environment requires, among a small set of basic routines, an agreement algorithm allowing surviving entities to reach a consensual decision between a bounded set of volatile resources. This paper presents an algorithm that implements an Early Returning Agreement (ERA) in pseudo-synchronous systems, which optimistically allows a process to resume its activity while guaranteeing strong progress. We prove the correctness of our ERA algorithm, and expose its logarithmic behavior, which is an extremely desirable property for any algorithm which targets future exascale platforms. We detail a practical implementation of this consensus algorithm in the context of an MPI library, and evaluate both its efficiency and scalability through a set of benchmarks and two fault tolerant scientific applications.

Versioned Distributed Arrays for Resilience in Scientific Applications: Global View Resilience

Chien

Balaji

Beckman

et al. 2015

Procedia Computer Science

A latency tolerant hybrid sparse solver using incomplete Cholesky factorization

Raghavan

Teranishi

2003

Numerical Linear Algebra App

Consider the solution of large sparse symmetric positive definite linear systems using the preconditioned conjugate gradient method. On sequential architectures, incomplete Cholesky factorizations provide effective preconditioning for systems from a variety of application domains, some of which may have widely differing preconditioning requirements. However, incomplete factorization based preconditioners are not considered suitable for multiprocessors. This is primarily because the triangular solution step required to apply the preconditioner (at each iteration) does not scale well due to the large latency of inter-processor communication. We propose a new approach to overcome this performance bottleneck by coupling incomplete factorization with a selective inversion scheme to replace triangular solutions by scalable matrix-vector multiplications. We discuss our algorithm, analyze its communication latency for model sparse linear systems, and provide empirical results on its performance and scalability.

Modeling and Simulating Multiple Failure Masking Enabled by Local Recovery for Stencil-Based Applications at Extreme Scales

Gamell

Teranishi

Mayo

et al. 2017

IEEE Trans. Parallel Distrib. Syst.

An evaluation of lazy fault detection based on Adaptive Redundant Multithreading

Hukerikar

Teranishi

Diniz

et al. 2014

Keita Teranishi

Toward Local Failure Local Recovery Resilience Model using MPI-ULFM

Local recovery and failure masking for stencil-based applications at extreme scales
