2016
DOI: 10.1016/j.parco.2016.05.009

Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems

Highlights
• Monitoring can provide meaningful system and application profiling in production.
• Visual and analytical characterizations can inform usage and procurement decisions.
• Resource utilization scoring provides simple but informative characterizations.
• Continuous, synchronous, high-fidelity, whole-system monitoring is required.

Abstract: A detailed understanding of HPC applications' resource needs and their complex interactions with each other and HPC platform resources is critical to achieving scalabili…

citations
Cited by 10 publications
(3 citation statements)
references
References 8 publications
“…They use machine learning combined with system data, but primarily focus on diagnosing anomalies in compute node health and not performance of jobs. Agelastos et al [10] create a HPC system profiler to explain the performance variability of applications across different HPC systems.…”
Section: Related Work; citation type: mentioning
confidence: 99%
“…For example, Agelastos et al produced some similar works. In [19], the authors put the focus on the results obtained after doing profiling (at a system and application level) based on global monitoring in an HPC cluster. They collect metrics as we have done, and use a hierarchical model to transmit the data from the compute nodes to aggregating nodes.…”
Section: Related Work; citation type: mentioning
confidence: 99%
“…Profiling tools that work based on this model provide the most accurate data at high frequencies with a considerably high cost. However, they are more efficient at low frequencies but less informative and prone to detail loss [3]. Current approaches either lack the necessary efficiency to be utilized in production systems or support only post-mortem analysis that does not present online data about application events during the execution.…”
Section: Introduction; citation type: mentioning
confidence: 99%