Enabling comprehensive data-driven system management for large computational facilities

Browne, James C.; DeLeon, Robert L.; Lü, Chao; Jones, Matthew D.; Gallo, Steven M.; Ghadersohi, Amin; Patra, Abani; Barth, William L.; Hammond, John L.; Furlani, Thomas R.; McLay, Robert

doi:10.1145/2503210.2503230

Cited by 16 publications

(6 citation statements)

References 11 publications


“…The Lustre filesystem is commonly used to provide high-speed data I/O on many HPC systems. However, I/O problems on Lustre has been widely reported [33]. To identify the dates when Lustre experienced I/O problem, we obtain the lists of partial correlated counters and partial correlated messages and identified cases of I/O problems on Lustre.…”

Section: Evaluation on HPC Systems (mentioning)

confidence: 99%

Failure Diagnosis for Cluster Systems using Partial Correlations

Chuah, Jhumka, Alt et al. 2021

2021 IEEE Intl Conf on Parallel & Distributed Processing With Applications, Big Data & Cloud Computing, Sustainable Com


Failures have expensive implications in HPC (High-Performance Computing) systems. Consequently, effective diagnosis of system failures is desired to help improve system reliability from both a remedial and preventive perspective. As HPC systems conduct extensive logging of resource usage and system events, parsing this data is an oft-advocated basis for failure diagnosis. However, the high levels of concurrency that exist in HPC systems cause system events to frequently interleave in time and, as such, certain interactions appear or become indirect, which will be missed by current failure diagnostics techniques. To help uncover such indirect interactions, in this paper, we develop a novel approach that leverages the concept of partial correlation. The novel failure diagnostics workflow, called IFADE, extracts partial correlation of resource use counters and partial correlation of system errors. As part of our contributions, we (a) compare our diagnostics approach with current ones, (b) identify two previously unknown causes of system failures, validated by system designers and (c) provide insights into Lustre I/O and segmentation faults. IFADE has been put in the public domain to support system administrators in failure diagnosis.
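The partial correlation mentioned in this abstract can be illustrated with a minimal sketch. It uses the textbook first-order formula r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)) and is not taken from IFADE itself. The three series and their interpretation as resource counters are hypothetical, constructed so that x and y are related only through z.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def partial_correlation(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical counters: z is a shared driver (e.g. overall load), while
# x and y each equal z plus their own independent perturbation.
z = [1, 2, 3, 4, 5, 6, 7, 8]
x = [2, 1, 2, 5, 6, 5, 6, 9]
y = [2, 3, 2, 3, 4, 5, 8, 9]

raw = pearson(x, y)                       # strong raw correlation
controlled = partial_correlation(x, y, z) # vanishes once z is controlled for
```

Here x and y look strongly correlated on their own, but the association disappears once the shared driver z is held fixed. Separating direct from indirect relationships in this way is the general idea behind partial correlation.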


“…The figure clearly displays a substantial difference in resource utilization between the two jobs; Figure 7a shows inefficient use of the requested resources, with CPU User less than 3% for 24 cores; Figure 7b shows CPU User near 100% for 24 cores. 5 The Job Viewer lets user support staff and end users examine their job data in detail, enabling improved identification and diagnosis of inefficient jobs, and better use of resources.…”

Section: B Performance Metrics and Dimensions (mentioning)

confidence: 99%

“…The XD Metrics on Demand (XDMoD) tool provides stakeholders with ready access to data about utilization, performance, and quality of service for High Performance Computing (HPC) resources. [1]-[5] This comprehensive tool was originally developed to support resources for the National Science Foundation (NSF) XSEDE program; it was later open-sourced and made available to general HPC resources at universities, government laboratories, and commercial entities. [6] XDMoD enables users, managers, and operations staff to monitor, assess and maintain quality of service for their computational resources.…”

Section: Introduction (mentioning)

confidence: 99%

Managing computational gateway resources with XDMoD

Sperhac, DeLeon, Furlani et al. 2019

Future Generation Computer Systems

Self Cite


“…Yadwadkar et al [39] proposed to use the Support Vector Machine (SVM) [15] to proactively predict stragglers from cluster resource utilization counters. Browne et al [8] proposed a comprehensive resource management tool by combining data from event logs, schedulers, and performance counters. In addition, Chuah et al [13] proposed to link resource usage anomalies with system failures.…”

Section: Related Work (mentioning)

confidence: 99%

Performance Analysis Tool for HPC and Big Data Applications on Scientific Clusters

Yoo, Koo, Cao et al. 2016

Conquering Big Data With High Performance Computing


Big data is prevalent in HPC computing. Many HPC projects rely on complex workflows to analyze terabytes or petabytes of data. These workflows often require running over thousands of CPU cores and performing simultaneous data accesses, data movements, and computation. It is challenging to analyze the performance involving terabytes or petabytes of workflow data or measurement data of the executions, from complex workflows over a large number of nodes and multiple parallel task executions. To help identify performance bottlenecks or debug the performance issues in large-scale scientific applications and scientific clusters, we have developed a performance analysis framework, using state-of-the-art open-source big data processing tools. Our tool can ingest system logs and application performance measurements to extract key performance features, and apply the most sophisticated statistical tools and data mining methods on the performance data. It utilizes an efficient data processing engine to allow users to interactively analyze a large amount of different types of logs and measurements. To illustrate the functionality of the big data analysis framework, we conduct case studies on the workflows from an astronomy project known as the Palomar Transient Factory (PTF) and the job logs from the genome analysis scientific cluster. Our study processed many terabytes of system logs and application performance measurements collected on the HPC systems at NERSC. The implementation of our tool is generic enough to be used for analyzing the performance of other HPC systems and Big Data workflows.
