Large-Scale System Monitoring Experiences and Recommendations

Ahlgren, Ville; Andersson, Stefan; Brandt, James M.; Cardo, Nicholas P.; Chunduri, Sudheer; Enos, Jeremy; Fields, Parks; Gentile, Ann C.; Gerber, R.; Gienger, Michael; Greenseid, Joe; Greiner, Annette; Hadri, Bilel; He, Yun; Hoppe, Dennis; Kaila, Urpo; Kelly, K. J.; Klein, Mark; Kristiansen, Alex; Leak, Steve; Mason, Mike; Pedretti, Kevin; Piccinali, Jean-Guillaume; Repik, Jason; Rogers, Jim; Salminen, Susanna; Showerman, Mike; Whitney, Cary; Williams, Jim C.

doi:10.1109/cluster.2018.00069

Cited by 9 publications

(7 citation statements)

References 4 publications

“…The authors describe a series of technical challenges related to reliability, overhead and data consistency, as well as dive into system-specific issues, such as clock skew effects. Similarly, Ahlgren et al [2] collect a series of experiences associated with monitoring and its main usage scenarios from several HPC centers. Here, the authors highlight the fact that most data centers rely on similar data sources (e.g., CPU performance counters), but at the same time employ highly different collection and storage solutions, either due to a lack of standard solutions or due to vendor constraints.…”

Section: State of the Art (mentioning)

confidence: 99%

Operational Data Analytics in Practice: Experiences from Design to Deployment in Production HPC Environments

Netti, Ott, Guillen et al. 2021

Preprint

As HPC systems grow in complexity, efficient and manageable operation is increasingly critical. Many centers are thus starting to explore the use of Operational Data Analytics (ODA) techniques, which extract knowledge from massive amounts of monitoring data and use it for control and visualization purposes. As ODA is a multifaceted problem, much effort has gone into researching its separate aspects: however, accounts of production ODA experiences are still hard to come across. In this work we aim to bridge the gap between ODA research and production use by presenting our experiences with ODA in production, involving in particular the control of cooling infrastructures and visualization of job data on two HPC systems. We cover the entire development process, from design to deployment, highlighting our insights in an effort to drive the community forward. We rely on open-source tools, which make for a generic ODA framework suitable for most scenarios.

“…Approaches for error detection based on basic textual logging (eg, syslog) to numeric metric gathering (eg, performance counters) have been studied extensively in prior work [6, 51-54]. There has been extensive research in the analysis of large-scale system monitoring data, specifically RAS and console logs [7, 8, 55-57]. The Chopstix [58] system employs a probabilistic approach to monitoring, whereby a sketch of monitoring events efficiently characterizes a state, which can be used for identification and diagnostic purposes.…”

Section: Related Work (mentioning)

confidence: 99%

Application health monitoring for extreme‐scale resiliency using cooperative fault management

Agarwal, Naughton, Park et al. 2019

Concurrency and Computation

Resiliency is and will be a critical factor in determining scientific productivity on current and exascale supercomputers, and beyond. Applications oblivious to and incapable of handling transient soft and hard errors could waste supercomputing resources or, worse, yield misleading scientific insights. We introduce a novel application‐driven silent error detection and recovery strategy based on application health monitoring. Our methodology uses application output that follows known patterns as an indicator of an application's health, together with the knowledge that violations of these patterns could indicate faults. Information from system monitors that report hardware and software health status is used to corroborate faults. Collectively, this information is used by a fault coordinator agent to take preventive and corrective measures by applying computational steering to an application between checkpoints. This cooperative fault management system uses the Fault Tolerance Backplane as a communication channel. The benefits of this framework are demonstrated with two real application case studies, molecular dynamics and quantum chemistry simulations, on scalable clusters with simulated memory and I/O corruptions. The developed approach is general and can be easily applied to other applications.

“…Establishing the necessary framework for holistic and continuous monitoring of large-scale HPC systems and their infrastructure is extremely challenging in many ways [4,9].…”

Section: Monitoring Challenges (mentioning)

confidence: 99%

From Facility to Application Sensor Data: Modular, Continuous and Holistic Monitoring with DCDB

Netti, Mueller, Auweter et al. 2019

Preprint

Today's HPC installations are highly complex systems, and their complexity will only increase as we move to exascale and beyond. At each layer, from facilities to systems, from runtimes to applications, a wide range of tuning decisions must be made in order to achieve efficient operation. This, however, requires systematic and continuous monitoring of system and user data. While many insular solutions exist, a system for holistic and facility-wide monitoring is still lacking in the current HPC ecosystem. In this paper we introduce DCDB, a comprehensive monitoring system capable of integrating data from all system levels. It is designed as a modular and highly scalable framework based on a plugin infrastructure. All monitored data is aggregated at a distributed noSQL data store for analysis and cross-system correlation. We demonstrate the performance and scalability of DCDB, and describe two use cases in the area of energy management and characterization.
