We present here a report produced by a workshop on 'Addressing failures in exascale computing' held in Park City, Utah, 4-11 August 2012. The charter of this workshop was to establish a common taxonomy for resilience across all levels of a computing system; to discuss existing knowledge on resilience across the various hardware and software layers of an exascale system; and to build on those results by examining potential solutions from both a hardware and a software perspective, focusing on a combined approach. The workshop brought together participants with expertise in applications, system software, and hardware; they came from industry, government, and academia, and their interests ranged from theory to implementation. This combination allowed broad and comprehensive discussions and led to this document, which summarizes and builds on those discussions.
Considerable work has been done on providing fault tolerance capabilities for different software components on large-scale high-end computing systems. Thus far, however, these fault-tolerant components have worked in isolation from one another, rarely sharing information about the faults they encounter. Such lack of system-wide fault tolerance is emerging as one of the biggest problems on leadership-class systems. In this paper, we propose a coordinated infrastructure, named CIFTS, that enables system software components to share fault information with one another and adapt to faults in a holistic manner. Central to the CIFTS infrastructure is a Fault Tolerance Backplane (FTB) that enables fault notification and awareness throughout the software stack, including fault-aware libraries, middleware, and applications. We present details of the CIFTS infrastructure and the interface specification that has allowed various software programs, including MPICH2, MVAPICH, Open MPI, and PVFS, to plug into the CIFTS infrastructure. Further, through a detailed evaluation we demonstrate the nonintrusive, low-overhead nature of CIFTS, which lets applications run with minimal performance degradation.
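To make the publish/subscribe idea behind a fault-notification backplane concrete, the following is a minimal in-process sketch in C. It is emphatically not the actual FTB client API (which uses connection handles and a richer event schema, and runs as a distributed service); the types and names here (fault_event_t, backplane_publish, backplane_subscribe, the "ftb." event-space strings) are illustrative assumptions showing how one component, such as an MPI library, might announce a fault and how another, such as a scheduler, might react.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical event record; the real FTB defines richer event schemas. */
    typedef struct {
        char event_space[64];   /* e.g. "ftb.mpi.mpich2" */
        char event_name[64];    /* e.g. "NODE_FAILURE" */
        char payload[128];      /* component-specific details */
    } fault_event_t;

    typedef void (*fault_callback_t)(const fault_event_t *ev);

    /* A subscription pairs an event-space prefix with a handler. */
    typedef struct {
        char prefix[64];
        fault_callback_t handler;
    } subscription_t;

    #define MAX_SUBS 16
    static subscription_t subs[MAX_SUBS];
    static int num_subs = 0;

    /* Register interest in all events whose event space starts with `prefix`. */
    static void backplane_subscribe(const char *prefix, fault_callback_t cb) {
        if (num_subs < MAX_SUBS) {
            strncpy(subs[num_subs].prefix, prefix, sizeof subs[num_subs].prefix - 1);
            subs[num_subs].handler = cb;
            num_subs++;
        }
    }

    /* Publish an event; the backplane fans it out to matching subscribers. */
    static void backplane_publish(const fault_event_t *ev) {
        for (int i = 0; i < num_subs; i++)
            if (strncmp(ev->event_space, subs[i].prefix, strlen(subs[i].prefix)) == 0)
                subs[i].handler(ev);
    }

    /* Example subscriber: a job scheduler reacting to node failures. */
    static void scheduler_on_fault(const fault_event_t *ev) {
        printf("[scheduler] saw %s from %s: %s\n",
               ev->event_name, ev->event_space, ev->payload);
    }

    int main(void) {
        backplane_subscribe("ftb.", scheduler_on_fault);

        fault_event_t ev = { "ftb.mpi.mpich2", "NODE_FAILURE", "rank 42 unreachable" };
        backplane_publish(&ev);   /* e.g. the MPI library reporting a dead node */
        return 0;
    }

In the real infrastructure the backplane is a system-wide service, so publication and delivery cross process and node boundaries rather than being a local function call; the point of the sketch is only the decoupling, where publishers need not know which components consume their fault events.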
For parallel applications running on high-end computing systems, which processes of an application get launched on which processing cores is typically determined at application launch time, without any information about the application's characteristics. As high-end computing systems continue to grow in scale, however, this approach is becoming increasingly infeasible for achieving the best performance. For example, on systems such as IBM Blue Gene and Cray XT that rely on flat 3D torus networks, process communication often involves network sharing, even for highly scalable applications. This causes the overall application performance to depend heavily on how processes are mapped onto the network. In this paper, we first analyze the impact of different process mappings on application performance on a massive Blue Gene/P system. We then match this analysis with application communication patterns, which we allow applications to describe before they are launched. The underlying process management system can use this combined information, together with the hardware characteristics of the system, to determine the best mapping for the application. Our experiments study the performance of different communication patterns, including 2D and 3D nearest-neighbor communication and structured Cartesian grid communication. Our studies, which scale up to 131,072 cores of the largest BG/P system in the United States (using 80% of the total system size), demonstrate that different process mappings can show significant differences in overall performance, especially at scale. For example, we show that this difference can be as much as 30% for P3DFFT and up to twofold for HALO. Through our proposed model, however, such differences in performance can be avoided so that the best possible performance is always achieved.
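The cost intuition behind mapping-aware process placement can be illustrated with a small self-contained sketch in C. It is not the Blue Gene/P mapping interface or the paper's model; the torus dimensions, the 4-point-stencil pattern, and the two candidate mappings (a linear fill versus a random, placement-oblivious assignment) are illustrative assumptions. The sketch scores each mapping by the average number of torus hops per communicating neighbor pair.

    #include <stdio.h>
    #include <stdlib.h>

    #define TX 8
    #define TY 8
    #define TZ 8
    #define NP (TX * TY * TZ)   /* one process per torus node */
    #define GX 32               /* logical 2D process grid: GX * GY == NP */
    #define GY 16

    /* Shortest distance between two coordinates in one torus dimension. */
    static int tdist(int a, int b, int dim) {
        int d = abs(a - b);
        return d < dim - d ? d : dim - d;
    }

    /* Torus hop count between the nodes hosting ranks r1 and r2,
       where map[r] is the linear node index assigned to rank r. */
    static int hops(const int *map, int r1, int r2) {
        int n1 = map[r1], n2 = map[r2];
        return tdist(n1 % TX, n2 % TX, TX)
             + tdist((n1 / TX) % TY, (n2 / TX) % TY, TY)
             + tdist(n1 / (TX * TY), n2 / (TX * TY), TZ);
    }

    /* Average hops over all 4-point-stencil neighbor pairs (periodic grid). */
    static double avg_hops(const int *map) {
        long total = 0, pairs = 0;
        for (int y = 0; y < GY; y++)
            for (int x = 0; x < GX; x++) {
                int r = y * GX + x;
                total += hops(map, r, y * GX + (x + 1) % GX);    /* +x neighbor */
                total += hops(map, r, ((y + 1) % GY) * GX + x);  /* +y neighbor */
                pairs += 2;
            }
        return (double)total / pairs;
    }

    int main(void) {
        int linear[NP], shuffled[NP];
        for (int i = 0; i < NP; i++) linear[i] = shuffled[i] = i;

        /* Fisher-Yates shuffle stands in for a placement-oblivious mapping. */
        srand(1);
        for (int i = NP - 1; i > 0; i--) {
            int j = rand() % (i + 1), t = shuffled[i];
            shuffled[i] = shuffled[j]; shuffled[j] = t;
        }

        printf("linear mapping : %.2f hops/pair\n", avg_hops(linear));
        printf("random mapping : %.2f hops/pair\n", avg_hops(shuffled));
        return 0;
    }

Under these toy dimensions the linear fill already cuts the average hop count well below that of the oblivious placement, and a mapping that knows the declared communication pattern can do better still by aligning the logical grid axes with the torus axes; more hops per pair means more shared links and thus more contention, which is the effect the paper measures at scale.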