2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)
DOI: 10.1109/ccgrid.2008.124
Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems

Abstract: The current trend in high performance computing is to aggregate ever larger numbers of processing and interconnection elements in order to achieve desired levels of computational power. This, however, also comes with a decrease in the Mean Time To Interrupt, because the elements comprising these systems are not becoming significantly more robust. There is substantial evidence that the Mean Time To Interrupt vs. number of processor elements involved is quite similar over a large number of platforms. In this pape…
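To make the scaling argument concrete: under the common simplifying assumption of independent, exponentially distributed per-node failures (an assumption of this sketch, not a claim taken from the abstract), the expected time to the first interrupt across N nodes is the per-node MTTI divided by N, so aggregating more elements directly shrinks the system-level MTTI. A minimal Python sketch:

```python
# Illustration only (not from the paper): system-level MTTI under the common
# simplifying assumption of independent, exponentially distributed per-node
# failures. The time to the first failure among N such nodes is exponential
# with N times the rate, so the system MTTI shrinks roughly as 1/N.

def system_mtti(node_mtti_hours: float, num_nodes: int) -> float:
    """Expected time to the first interrupt across `num_nodes` nodes."""
    return node_mtti_hours / num_nodes

if __name__ == "__main__":
    node_mtti = 5 * 365 * 24.0  # assume a 5-year per-node MTTI, in hours
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} nodes -> system MTTI ~ {system_mtti(node_mtti, n):.1f} h")
```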

Cited by 12 publications (8 citation statements); references 4 publications.
“…Using these techniques saves time relative to the case of over-aggressive checkpointing and still ensures progress. Work has also been done in the area of predictive analysis by Stearley and Oliner [9] on the Sisyphus project, which seeks to discover correlations of log-file events with both software- and hardware-related failures, and by the OVIS project [3], which looks for correlations of multi-variate hardware state behaviors with failures. The motivation for the latter work was that targeted checkpointing, based upon failure prediction, could dramatically increase the scalability of HPC applications and platforms, since checkpointing all state for the application would no longer be necessary and, additionally, state would only have to be saved for affected processes when they were deemed destined to fail.…”
Section: Related Work (mentioning)
confidence: 99%
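The targeted-checkpointing idea described in the excerpt above, saving state only for processes predicted to be at risk rather than checkpointing the whole application, can be sketched as follows. This is a hypothetical illustration: the predictor, metric names, and save_state hook are invented here and are not the OVIS or Sisyphus interfaces.

```python
# Hypothetical illustration of targeted checkpointing driven by a failure
# predictor: only processes whose predicted failure probability crosses a
# threshold get their state saved, instead of checkpointing the whole
# application. The predictor, metric names, and save_state hook are invented
# for this sketch and are not the OVIS or Sisyphus interfaces.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ProcessState:
    rank: int                        # MPI-style rank of the process
    node_metrics: Dict[str, float]   # e.g. board temperatures, voltages

def targeted_checkpoint(
    processes: List[ProcessState],
    predict_failure_prob: Callable[[Dict[str, float]], float],
    save_state: Callable[[int], None],
    threshold: float = 0.2,
) -> List[int]:
    """Checkpoint only the ranks whose predicted failure probability is high."""
    flagged = [p.rank for p in processes
               if predict_failure_prob(p.node_metrics) >= threshold]
    for rank in flagged:
        save_state(rank)             # persist state for at-risk processes only
    return flagged

if __name__ == "__main__":
    procs = [ProcessState(0, {"temp_c": 58.0}), ProcessState(1, {"temp_c": 81.0})]
    # Toy predictor: hotter boards are treated as riskier, purely for illustration.
    risky = targeted_checkpoint(
        procs,
        predict_failure_prob=lambda m: min(1.0, max(0.0, (m["temp_c"] - 60.0) / 40.0)),
        save_state=lambda rank: print(f"checkpointing rank {rank}"),
    )
    print("checkpointed ranks:", risky)
```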
“…Board, core or disk temperatures correlate to failures in some studies but not others [7,8]. Some systems collect hundreds of variables per node, analyze them on the fly, and save only the analysis results, thus preventing comprehensive forensics or reuse of their data [9,10]. Hsu and Poole in [11] detail the state of the art in power measurement and classify hardware monitoring methods from the node component level to the facility level.…”
Section: Related Work (mentioning)
confidence: 99%
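The "analyze on the fly, save only results" pattern mentioned in that excerpt can be illustrated with an online-statistics sketch: raw per-node samples are folded into running summaries and then discarded, which is exactly why comprehensive after-the-fact forensics on the raw data becomes impossible. This uses Welford's standard online mean/variance update; the sensor name is made up for the example.

```python
# Sketch of the "analyze on the fly, save only results" pattern: raw samples
# are folded into running statistics (Welford's online mean/variance) and then
# discarded, so only summaries remain afterwards. The sensor name is made up.

class RunningStats:
    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0                # sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

if __name__ == "__main__":
    cpu_temp = RunningStats()
    for sample in (61.0, 63.5, 62.2, 80.4):   # raw samples are not retained
        cpu_temp.update(sample)
    print(f"mean={cpu_temp.mean:.2f}  var={cpu_temp.variance():.2f}")
```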
“…Additional data. It has been shown in the last decade that system performance can be enhanced greatly if the dispatchers are aware of additional information regarding the current system status, such as energy and power consumption of the resources [37,2,5,6], resource failures [22,7], and the heating/cooling conditions [35,3]. The additional data component of AccaSim provides an interface for integrating such extra data into the system, which can then be utilized to develop and experiment with advanced dispatchers that are, for instance, energy- and power-aware, fault-resilient, and thermal-aware.…”
Section: AccaSim Architecture and Main Features (mentioning)
confidence: 99%
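A dispatcher that consumes such additional data might, for example, bias job placement away from nodes flagged as failure-prone. The following is a hypothetical sketch of that idea only; the function and its parameters are invented here and do not reflect AccaSim's actual interface.

```python
# Hypothetical sketch of a failure-aware placement decision that consumes
# "additional data" (per-node failure-risk scores) supplied by a monitoring
# source. This interface is invented for illustration and is not AccaSim's API.

from typing import Dict, List, Optional

def pick_nodes(
    free_nodes: List[str],
    nodes_needed: int,
    failure_risk: Dict[str, float],   # extra data: predicted risk per node
    max_risk: float = 0.1,
) -> Optional[List[str]]:
    """Prefer the lowest-risk free nodes; refuse placement if too few are safe."""
    safe = sorted(
        (n for n in free_nodes if failure_risk.get(n, 0.0) <= max_risk),
        key=lambda n: failure_risk.get(n, 0.0),
    )
    return safe[:nodes_needed] if len(safe) >= nodes_needed else None

if __name__ == "__main__":
    risks = {"n01": 0.02, "n02": 0.30, "n03": 0.05}
    print(pick_nodes(["n01", "n02", "n03"], 2, risks))   # ['n01', 'n03']
```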