Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems

Jha, Saurabh; Cui, Shengkun; Banerjee, Subho S.; Xu, Tianyin; Enos, Jeremy; Showerman, Mike; Kalbarczyk, Zbigniew; Iyer, Ravishankar K.

doi:10.1109/sc41405.2020.00069

Cited by 8 publications

(5 citation statements)

References 41 publications


“…Due to their visual nature, these works are ODAV case studies. Some works cover ODAC experiences for specific purposes: Auweter et al [4] discuss their use of the LoadLeveler framework for CPU frequency tuning on the SuperMUC HPC system at the Leibniz Supercomputing Centre (LRZ), leading to 6% yearly energy cost savings, while Jha et al [29] describe their 2-year use of the Kaleidoscope tool on Blue Waters for live failure detection.…”

Section: State of the Art (mentioning)

confidence: 99%

Operational Data Analytics in Practice: Experiences from Design to Deployment in Production HPC Environments

Netti, Ott, Guillen et al. 2021

Preprint


As HPC systems grow in complexity, efficient and manageable operation is increasingly critical. Many centers are thus starting to explore the use of Operational Data Analytics (ODA) techniques, which extract knowledge from massive amounts of monitoring data and use it for control and visualization purposes. As ODA is a multifaceted problem, much effort has gone into researching its separate aspects; however, accounts of production ODA experiences are still hard to come across. In this work we aim to bridge the gap between ODA research and production use by presenting our experiences with ODA in production, involving in particular the control of cooling infrastructures and visualization of job data on two HPC systems. We cover the entire development process, from design to deployment, highlighting our insights in an effort to drive the community forward. We rely on open-source tools, which make for a generic ODA framework suitable for most scenarios.


“…Root Cause Analysis. A large body of work [13,31,45,47,57,59,86,100,111,114] provides promising examples that data-driven diagnostics help detect performance anomalies and analyze root causes. For example, Sieve [100] leverages Granger causality to correlate performance anomaly data series with particular metrics as potential root causes.…”

Section: Related Work (mentioning)

confidence: 99%

FIRM: An Intelligent Fine-Grained Resource Management Framework for SLO-Oriented Microservices

Qiu, Banerjee, Jha et al. 2020

Preprint

Self Cite


Modern user-facing, latency-sensitive web services include numerous distributed, intercommunicating microservices that promise to simplify software development and operation. However, multiplexing compute-resources across microservices is still challenging in production because contention for shared resources can cause latency spikes that violate the service-level objectives (SLOs) of user requests. This paper presents FIRM, an intelligent fine-grained resource management framework for predictable sharing of resources across microservices to drive up overall utilization. FIRM leverages online telemetry data and machine-learning methods to adaptively (a) detect/localize microservices that cause SLO-violations, (b) identify low-level resources in contention, and (c) take actions to mitigate SLO-violations by dynamic reprovisioning. Experiments across four microservice benchmarks demonstrate that FIRM reduces SLO violations by up to 16.7× while reducing the overall requested CPU limit by up to 62.3%. Moreover, FIRM improves performance predictability by reducing tail latencies by up to 11.5×.


“…Unfortunately, as shown in recent studies [4,15,16,20,23], many widely-deployed distributed systems cannot tolerate fail-slow faults. For example, Do et al show that slowing down one node in five scale-out distributed systems can lead to cascading performance failures [15].…”

Section: Introduction (mentioning)

confidence: 99%

“…Recent efforts on combating fail-slow faults mainly focus on detecting performance cascading bugs [27], monitoring fail-slow runtime behavior [6,19,23,34], and troubleshooting performance anomalies [3,6,29]. While those works provide remedies to the manifestation of fail-slow faults, a more fundamental direction is to build distributed systems that are inherently fail-slow fault tolerant.…”

Section: Introduction (mentioning)

confidence: 99%

Fail-slow fault tolerance needs programming support

Yoo, Wang, Sinha et al. 2021

Proceedings of the Workshop on Hot Topics in Operating Systems

Self Cite


The need for fail-slow fault tolerance in modern distributed systems is highlighted by the increasingly reported fail-slow hardware/software components that lead to poor performance system-wide. We argue that fail-slow fault tolerance not only needs new distributed protocol designs, but also desires programming support for implementing and verifying fail-slow fault-tolerant code. Our observation is that the inability of tolerating fail-slow faults in existing distributed systems is often rooted in the implementations and is difficult to understand and debug. We designed the Dependably Fast Library (DepFast) for implementing fail-slow tolerant distributed systems. DepFast provides expressive interfaces for taking control of possible fail-slow points in the program to prevent unexpected slowness propagation once and for all. We use DepFast to implement a distributed replicated state machine (RSM) and show that it can tolerate various types of fail-slow faults that affect existing RSM implementations.
