SC20: International Conference for High Performance Computing, Networking, Storage and Analysis 2020
DOI: 10.1109/sc41405.2020.00069

Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems

Citations: cited by 8 publications (5 citation statements)
References: 41 publications
“…Due to their visual nature, these works are ODAV case studies. Some works cover ODAC experiences for specific purposes: Auweter et al [4] discuss their use of the LoadLeveler framework for CPU frequency tuning on the SuperMUC HPC system at the Leibniz Supercomputing Centre (LRZ), leading to 6% yearly energy cost savings, while Jha et al [29] describe their 2-year use of the Kaleidoscope tool on Blue Waters for live failure detection.…”
Section: State Of the Art; citation type: mentioning
confidence: 99%
“…Root Cause Analysis. A large body of work [13,31,45,47,57,59,86,100,111,114] provides promising examples that data-driven diagnostics help detect performance anomalies and analyze root causes. For example, Sieve [100] leverages Granger causality to correlate performance anomaly data series with particular metrics as potential root causes.…”
Section: Related Work; citation type: mentioning
confidence: 99%
“…Unfortunately, as shown in recent studies [4,15,16,20,23], many widely-deployed distributed systems cannot tolerate fail-slow faults. For example, Do et al show that slowing down one node in five scale-out distributed systems can lead to cascading performance failures [15].…”
Section: Introduction; citation type: mentioning
confidence: 99%
“…Recent efforts on combating fail-slow faults mainly focus on detecting performance cascading bugs [27] monitoring fail-slow runtime behavior [6,19,23,34], and troubleshooting performance anomalies [3,6,29]. While those works provide remedies to the manifestation of fail-slow faults, a more fundamental direction is to build distributed systems that are inherently fail-slow fault tolerant.…”
Section: Introduction; citation type: mentioning
confidence: 99%