Fault detection in GPU-enabled Cloud Systems – An Overview

Asadova, Farida; Kertész, Gábor; Lovas, Róbert; Szénási, Sándor

doi:10.1109/sami54271.2022.9780804

Cited by 1 publication

(1 citation statement)

References 52 publications

(60 reference statements)

“…Our approach extends the results of the major related works summarized (partly leveraging [28]) in several areas including the application of various deep learning methods (autoencoders, LSTMs and GNNs), the extensive use of formal modelling, and the active steering towards the suspicious situations of a cloud debugger.…”

Section: Related Work (mentioning)

confidence: 59%

Experiences With Deep Learning Enhanced Steering Mechanisms for Debugging of Fundamental Cloud Services

2023

Cloud architecture blueprints, or reference architectures, allow existing knowledge and best practices to be reused when creating new cloud-native solutions. Debugging reference architecture candidates (or their new versions) is therefore crucial, but it is also tedious and time-consuming because complex services are deployed in typically multi-tenant, non-deterministic environments. In debugging, testing, and maintenance scenarios, greater test coverage (and eventually improved reliability) may be achievable by modelling and verifying at least the most fundamental building blocks. The main objective of our work is to integrate stochastic modelling and verification techniques based on deep learning methods into the debugging cycle so that large state spaces can be handled more efficiently: the traversal of the state space is steered towards suspicious situations that may result in bugs in the actual system. For this purpose, the presented and illustrated approach combines (among others) Continuous Time Markov Chain (CTMC) modelling techniques with deep learning methods, including autoencoder, Long Short-Term Memory (LSTM) and Graph Neural Network (GNN) models. Our experiences are summarized using widespread cloud design patterns, including load balancing and service mesh topologies.
