Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis 2013
DOI: 10.1145/2503210.2503230

Enabling comprehensive data-driven system management for large computational facilities

Abstract: This paper presents a tool chain, based on the open source tool TACC_Stats, for systematic and comprehensive job level resource use measurement for large cluster computers, and its incorporation into XDMoD, a reporting and analytics framework for resource management that targets meeting the information needs of users, application developers, systems administrators, systems management and funding managers. Accounting, scheduler and event logs are integrated with system performance data from TACC_Stats. TACC_Stat…

Cited by 16 publications (6 citation statements)
References 11 publications
“…The Lustre filesystem is commonly used to provide high-speed data I/O on many HPC systems. However, I/O problems on Lustre has been widely reported [33]. To identify the dates when Lustre experienced I/O problem, we obtain the lists of partial correlated counters and partial correlated messages and identified cases of I/O problems on Lustre.…”
Section: Evaluation on HPC Systems (mentioning)
confidence: 99%
“…The figure clearly displays a substantial difference in resource utilization between the two jobs; Figure 7a shows inefficient use of the requested resources, with CPU User less than 3% for 24 cores; Figure 7b shows CPU User near 100% for 24 cores. The Job Viewer lets user support staff and end users examine their job data in detail, enabling improved identification and diagnosis of inefficient jobs, and better use of resources.…”
Section: B. Performance Metrics and Dimensions (mentioning)
confidence: 99%
“…The XD Metrics on Demand (XDMoD) tool provides stakeholders with ready access to data about utilization, performance, and quality of service for High Performance Computing (HPC) resources. [1]-[5] This comprehensive tool was originally developed to support resources for the National Science Foundation (NSF) XSEDE program; it was later open-sourced and made available to general HPC resources at universities, government laboratories, and commercial entities. [6] XDMoD enables users, managers, and operations staff to monitor, assess and maintain quality of service for their computational resources.…”
Section: Introduction (mentioning)
confidence: 99%
“…Yadwadkar et al [39] proposed to use the Support Vector Machine (SVM) [15] to proactively predict stragglers from cluster resource utilization counters. Browne et al [8] proposed a comprehensive resource management tool by combining data from event logs, schedulers, and performance counters. In addition, Chuah et al [13] proposed to link resource usage anomalies with system failures.…”
Section: Related Work (mentioning)
confidence: 99%