2018 IEEE International Conference on Cluster Computing (CLUSTER) 2018
DOI: 10.1109/cluster.2018.00069

Large-Scale System Monitoring Experiences and Recommendations

Abstract: Monitoring of High Performance Computing (HPC) platforms is critical to successful operations, can provide insights into performance-impacting conditions, and can inform methodologies for improving science throughput. However, monitoring systems are not generally considered core capabilities in system requirements specifications nor in vendor development strategies. In this paper we present work performed at a number of large-scale HPC sites towards developing monitoring capabilities that fill current gaps in …

Cited by 9 publications (7 citation statements)
References 4 publications

“…The authors describe a series of technical challenges related to reliability, overhead and data consistency, as well as dive into system-specific issues, such as clock skew effects. Similarly, Ahlgren et al [2] collect a series of experiences associated with monitoring and its main usage scenarios from several HPC centers. Here, the authors highlight the fact that most data centers rely on similar data sources (e.g., CPU performance counters), but at the same time employ highly different collection and storage solutions, either due to a lack of standard solutions or due to vendor constraints.…”
Section: State of the Art (mentioning)
confidence: 99%
“…Approaches for error detection based on basic textual logging (eg, syslog) to numeric metric gathering (eg, performance counters) have been studied extensively in prior work [6, 51-54]. There has been extensive research in the analysis of large-scale system monitoring data, specifically RAS and console logs [7, 8, 55-57]. The Chopstix [58] system employs a probabilistic approach to monitoring, whereby a sketch of monitoring events efficiently characterizes a state, which can be used for identification and diagnostic purposes.…”
Section: Related Work (mentioning)
confidence: 99%
“…Establishing the necessary framework for holistic and continuous monitoring of large-scale HPC systems and their infrastructure is extremely challenging in many ways [4,9].…”
Section: Monitoring Challenges (mentioning)
confidence: 99%