Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis 2013
DOI: 10.1145/2503210.2503228

A 'cool' way of improving the reliability of HPC machines

Abstract: Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has also been making progress…

Cited by 25 publications (19 citation statements)
References 23 publications
“…We also plan to provide rich support for user priorities in PARM. Thermal behavior of CPUs can significantly affect the reliability of a machine [44] as well as the cooling costs of the data center [45]. We also plan to investigate the possibility of incorporating thermal constraints along with a strict power constraint in our scheduling scheme.…”
Section: Discussion (mentioning)
confidence: 99%
“…(2) To reduce the failure rate of the processors for reliability concerns [2]. (3) To avoid hardware interrupt triggered by some chips when the temperature exceeds a certain redline value, which can cause severe performance degradation and energy increase [25].…”
Section: Models and Objective (mentioning)
confidence: 99%
“…The objective is to minimize the makespan for a set of computation-intensive applications subject to a temperature threshold, which cannot be violated at any time during the execution. Indeed, such a threshold is imposed in many resource management systems for either energy reduction considerations or reliability concerns [2,3]. To tackle this problem, we introduce a novel notion, called thermal-aware load, to capture more precisely the loads of the servers under the thermal constraint.…”
Section: Introduction (mentioning)
confidence: 99%
“…To reduce this, the computer room air conditioning (CRAC) temperature can be set to a higher degree. However, higher room temperature can cause overheating of cores which reduces hardware reliability [9]. Dynamic Voltage Frequency Scaling (DVFS) is commonly used to prevent overheating by modulating chip frequency and voltage.…”
Section: Power Awareness (mentioning)
confidence: 99%