Ana Gainaru scite author profile

A significant percentage of the computing capacity of large-scale platforms is wasted because of interference incurred by multiple applications that access a shared parallel file system concurrently. One solution to handling I/O bursts in large-scale HPC systems is to absorb them at an intermediate storage layer consisting of burst buffers. However, our analysis of Argonne's Mira system shows that burst buffers cannot prevent congestion at all times. Consequently, I/O performance is dramatically degraded, showing in some cases a decrease in I/O throughput of 67%. In this paper, we analyze the effects of interference on application I/O bandwidth and propose several scheduling techniques to mitigate congestion. We show through extensive experiments that our global I/O scheduler is able to reduce the effects of congestion, even on systems where burst buffers are used, and can increase the overall system throughput by up to 56%. We also show that it outperforms current Mira I/O schedulers.


Fault prediction under the microscope: A closer look into HPC systems

Gainaru, Cappello, Snir, et al. 2012


Modeling and tolerating heterogeneous failures in large parallel systems

Heien, Kondo, Gainaru, et al. 2011


Improving the Computing Efficiency of HPC Systems Using a Combination of Proactive and Preventive Checkpointing

Bouguerra, Gainaru, Bautista-Gomez, et al. 2013


Reducing Waste in Extreme Scale Systems through Introspective Analysis

Bautista-Gomez, Gainaru, Perarnau, et al. 2016


Taming of the Shrew: Modeling the Normal and Faulty Behaviour of Large-scale HPC Systems

Gainaru, Cappello, Kramer, 2012


Scheduling Parallel Tasks under Multiple Resources: List Scheduling vs. Pack Scheduling

Sun, Elghazi, Gainaru, et al. 2018


Scheduling in High-Performance Computing (HPC) has traditionally been centered around computing resources (e.g., processors/cores). The ever-growing amount of data produced by modern scientific applications is starting to drive novel architectures and new computing frameworks to support more efficient data processing, transfer and storage for future HPC systems. This trend towards data-driven computing demands that scheduling solutions also consider other resources (e.g., I/O, memory, cache) that can be shared amongst competing applications. In this paper, we study the problem of scheduling HPC applications while exploring the availability of multiple types of resources that could impact their performance. The goal is to minimize the overall execution time, or makespan, for a set of moldable tasks under multiple-resource constraints. Two scheduling paradigms, namely, list scheduling and pack scheduling, are compared through both theoretical analyses and experimental evaluations. Theoretically, we prove, for several algorithms falling in the two scheduling paradigms, tight approximation ratios that increase linearly with the number of resource types. As the complexity of direct solutions grows exponentially with the number of resource types, we also design a strategy to indirectly solve the problem via a transformation to a single-resource-type problem, which can significantly reduce the algorithms' running times without compromising their approximation ratios. Experiments conducted on Intel Knights Landing with two resource types (processor cores and high-bandwidth memory) and simulations designed on more resource types confirm the benefit of the transformation strategy and show that pack-based scheduling, despite having a worse theoretical bound, offers a practically promising and easy-to-implement solution, especially when more resource types need to be managed.


Reservation Strategies for Stochastic Jobs

Aupy, Gainaru, Honoré, et al. 2019



Ana Gainaru

Scheduling the I/O of HPC Applications Under Congestion
