A 'cool' way of improving the reliability of HPC machines

Sarood, Osman; Meneses, Esteban; Kalé, Laxmikant V.

doi:10.1145/2503210.2503228

Cited by 25 publications

(19 citation statements)

References 23 publications

“…We also plan to provide rich support for user priorities in PARM. Thermal behavior of CPUs can significantly affect the reliability of a machine [44] as well as the cooling costs of the data center [45]. We also plan to investigate the possibility of incorporating thermal constraints along with a strict power constraint in our scheduling scheme.…”

Section: Discussion (mentioning)

confidence: 99%

Maximizing Throughput of Overprovisioned HPC Data Centers Under a Strict Power Budget

Sarood

Langer

Gupta

et al. 2014

SC14: International Conference for High Performance Computing, Networking, Storage and Analysis

Self Cite

101


Abstract: Building future generation supercomputers while constraining their power consumption is one of the biggest challenges faced by the HPC community. For example, the US Department of Energy has set a goal of 20 MW for an exascale (10^18 flops) supercomputer. To realize this goal, a lot of research is being done to revolutionize hardware design to build power efficient computers and network interconnects. In this work, we propose a software-based online resource management system that leverages hardware facilitated capability to constrain the power consumption of each node in order to optimally allocate power and nodes to a job. Our scheme uses this hardware capability in conjunction with an adaptive runtime system that can dynamically change the resource configuration of a running job, allowing our resource manager to re-optimize allocation decisions to running jobs as new jobs arrive, or a running job terminates. We also propose a performance modeling scheme that estimates the essential power characteristics of a job at any scale. The proposed online resource manager uses these performance characteristics for making scheduling and resource allocation decisions that maximize the job throughput of the supercomputer under a given power budget. We demonstrate the benefits of our approach by using a mix of jobs with different power-response characteristics. We show that with a power budget of 4.75 MW, we can obtain up to 5.2X improvement in job throughput when compared with the SLURM scheduling policy that is power-unaware. We corroborate our results with real experiments on a relatively small scale cluster, in which we obtain a 1.7X improvement.


“…(2) To reduce the failure rate of the processors for reliability concerns [2]. (3) To avoid hardware interrupt triggered by some chips when the temperature exceeds a certain redline value, which can cause severe performance degradation and energy increase [25].…”

Section: Models and Objective (mentioning)

confidence: 99%

“…The objective is to minimize the makespan for a set of computation-intensive applications subject to a temperature threshold, which cannot be violated at any time during the execution. Indeed, such a threshold is imposed in many resource management systems for either energy reduction considerations or reliability concerns [2,3]. To tackle this problem, we introduce a novel notion, called thermal-aware load, to capture more precisely the loads of the servers under the thermal constraint.…”

Section: Introduction (mentioning)

confidence: 99%

Spatio-temporal thermal-aware scheduling for homogeneous high-performance computing datacenters

Sun

Stolf

Pierson

2017

Future Generation Computer Systems


“…To reduce this, the computer room air conditioning (CRAC) temperature can be set to a higher degree. However, higher room temperature can cause overheating of cores which reduces hardware reliability [9]. Dynamic Voltage Frequency Scaling (DVFS) is commonly used to prevent overheating by modulating chip frequency and voltage.…”

Section: Power Awareness (mentioning)

confidence: 99%

Parallel Programming with Migratable Objects: Charm++ in Practice

Acun

Gupta

Jain

et al. 2014

SC14: International Conference for High Performance Computing, Networking, Storage and Analysis

124


The advent of petascale computing has introduced new challenges (e.g. heterogeneity, system failure) for programming scalable parallel applications. Increased complexity and dynamism in science and engineering applications of today have further exacerbated the situation. Addressing these challenges requires more emphasis on concepts that were previously of secondary importance, including migratability, adaptivity, and runtime system introspection. In this paper, we leverage our experience with these concepts to demonstrate their applicability and efficacy for real world applications. Using the CHARM++ parallel programming framework, we present details on how these concepts can lead to development of applications that scale irrespective of the rough landscape of supercomputing technology. Empirical evaluation presented in this paper spans many miniapplications and real applications executed on modern supercomputers including Blue Gene/Q, Cray XE6, and Stampede.

