2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)
DOI: 10.1109/ccgrid.2008.124
Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems

Abstract: The current trend in high performance computing is to aggregate ever larger numbers of processing and interconnection elements in order to achieve desired levels of computational power. This, however, also comes with a decrease in the Mean Time To Interrupt, because the elements comprising these systems are not becoming significantly more robust. There is substantial evidence that the Mean Time To Interrupt vs. number of processor elements involved is quite similar over a large number of platforms. In this pape…
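To make the scaling argument concrete: under the common simplifying assumption of independent, exponentially distributed per-node failures (an assumption of this sketch, not a claim taken from the abstract), the expected time to the first interrupt across N nodes is the per-node MTTI divided by N, so aggregating more elements directly shrinks the system-level MTTI. A minimal Python sketch:

```python
# Illustration only (not from the paper): system-level MTTI under the common
# simplifying assumption of independent, exponentially distributed per-node
# failures. The time to the first failure among N such nodes is exponential
# with N times the rate, so the system MTTI shrinks roughly as 1/N.

def system_mtti(node_mtti_hours: float, num_nodes: int) -> float:
    """Expected time to the first interrupt across `num_nodes` nodes."""
    return node_mtti_hours / num_nodes

if __name__ == "__main__":
    node_mtti = 5 * 365 * 24.0  # assume a 5-year per-node MTTI, in hours
    for n in (1_000, 10_000, 100_000):
        print(f"{n:>7} nodes -> system MTTI ~ {system_mtti(node_mtti, n):.1f} h")
```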

Cited by 12 publications (8 citation statements); references 4 publications.
“…Using these techniques saves time relative to the case of over-aggressive checkpointing and still ensures progress. Work has also been done in the area of predictive analysis by Stearley and Oliner [9] on the Sisyphus project, which seeks to discover correlations of log-file events with both software- and hardware-related failures, and by the OVIS project [3], which looks for correlations of multi-variate hardware state behaviors with failures. The motivation for the latter work was that targeted checkpointing, based upon failure prediction, could dramatically increase the scalability of HPC applications and platforms, since checkpointing all state for the application would no longer be necessary and, additionally, state would only have to be saved for affected processes when they were deemed destined to fail.…”
Section: Related Work (mentioning)
confidence: 99%
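The targeted-checkpointing idea described in the excerpt above, saving state only for processes predicted to be at risk rather than checkpointing the whole application, can be sketched as follows. This is a hypothetical illustration: the predictor, metric names, and save_state hook are invented here and are not the OVIS or Sisyphus interfaces.

```python
# Hypothetical illustration of targeted checkpointing driven by a failure
# predictor: only processes whose predicted failure probability crosses a
# threshold get their state saved, instead of checkpointing the whole
# application. The predictor, metric names, and save_state hook are invented
# for this sketch and are not the OVIS or Sisyphus interfaces.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ProcessState:
    rank: int                        # MPI-style rank of the process
    node_metrics: Dict[str, float]   # e.g. board temperatures, voltages

def targeted_checkpoint(
    processes: List[ProcessState],
    predict_failure_prob: Callable[[Dict[str, float]], float],
    save_state: Callable[[int], None],
    threshold: float = 0.2,
) -> List[int]:
    """Checkpoint only the ranks whose predicted failure probability is high."""
    flagged = [p.rank for p in processes
               if predict_failure_prob(p.node_metrics) >= threshold]
    for rank in flagged:
        save_state(rank)             # persist state for at-risk processes only
    return flagged

if __name__ == "__main__":
    procs = [ProcessState(0, {"temp_c": 58.0}), ProcessState(1, {"temp_c": 81.0})]
    # Toy predictor: hotter boards are treated as riskier, purely for illustration.
    risky = targeted_checkpoint(
        procs,
        predict_failure_prob=lambda m: min(1.0, max(0.0, (m["temp_c"] - 60.0) / 40.0)),
        save_state=lambda rank: print(f"checkpointing rank {rank}"),
    )
    print("checkpointed ranks:", risky)
```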
“…Board, core or disk temperatures correlate to failures in some studies but not others [7,8]. Some systems collect hundreds of variables per node, analyze them on the fly, and save only the analysis results, thus preventing comprehensive forensics or reuse of their data [9,10]. Hsu and Poole in [11] detail the state of the art in power measurement and classify hardware monitoring methods from the node component level to the facility level.…”
Section: Related Work (mentioning)
confidence: 99%
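The "analyze on the fly, save only results" pattern mentioned in that excerpt can be illustrated with an online-statistics sketch: raw per-node samples are folded into running summaries and then discarded, which is exactly why comprehensive after-the-fact forensics on the raw data becomes impossible. This uses Welford's standard online mean/variance update; the sensor name is made up for the example.

```python
# Sketch of the "analyze on the fly, save only results" pattern: raw samples
# are folded into running statistics (Welford's online mean/variance) and then
# discarded, so only summaries remain afterwards. The sensor name is made up.

class RunningStats:
    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0                # sum of squared deviations from the mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self) -> float:
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

if __name__ == "__main__":
    cpu_temp = RunningStats()
    for sample in (61.0, 63.5, 62.2, 80.4):   # raw samples are not retained
        cpu_temp.update(sample)
    print(f"mean={cpu_temp.mean:.2f}  var={cpu_temp.variance():.2f}")
```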
“…Additional data. It has been shown in the last decade that system performance can be enhanced greatly if the dispatchers are aware of additional information regarding the current system status, such as energy and power consumption of the resources [37,2,5,6], resource failures [22,7], and the heating/cooling conditions [35,3]. The additional data component of AccaSim provides an interface for integrating such extra data into the system, which can then be utilized to develop and experiment with advanced dispatchers that are, for instance, energy- and power-aware, fault-resilient, and thermal-aware.…”
Section: AccaSim Architecture and Main Features (mentioning)
confidence: 99%
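A dispatcher that consumes such additional data might, for example, bias job placement away from nodes flagged as failure-prone. The following is a hypothetical sketch of that idea only; the function and its parameters are invented here and do not reflect AccaSim's actual interface.

```python
# Hypothetical sketch of a failure-aware placement decision that consumes
# "additional data" (per-node failure-risk scores) supplied by a monitoring
# source. This interface is invented for illustration and is not AccaSim's API.

from typing import Dict, List, Optional

def pick_nodes(
    free_nodes: List[str],
    nodes_needed: int,
    failure_risk: Dict[str, float],   # extra data: predicted risk per node
    max_risk: float = 0.1,
) -> Optional[List[str]]:
    """Prefer the lowest-risk free nodes; refuse placement if too few are safe."""
    safe = sorted(
        (n for n in free_nodes if failure_risk.get(n, 0.0) <= max_risk),
        key=lambda n: failure_risk.get(n, 0.0),
    )
    return safe[:nodes_needed] if len(safe) >= nodes_needed else None

if __name__ == "__main__":
    risks = {"n01": 0.02, "n02": 0.30, "n03": 0.05}
    print(pick_nodes(["n01", "n02", "n03"], 2, risks))   # ['n01', 'n03']
```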