Towards Detecting Patterns in Failure Logs of Large-Scale Distributed Systems

Gurumdimma, Nentawe; Jhumka, Arshad; Liakata, Maria; Chuah, Edward; Browne, James C.

doi:10.1109/ipdpsw.2015.109

Cited by 11 publications

(5 citation statements)

References 20 publications

“…The techniques presented in these works all fail to extract a representative log for clustering which is a desirable quality in terms of the explainability of the log error types. Similar work has shown that it is possible to achieve syntactic representatives [10,14,20]. Furthermore, past work has also utilised genetic algorithms to extract log templates [26].…”

Section: Related Work

mentioning

confidence: 98%

“…To do this, past research has focused on focused log comprehension [19,25] and log evolution tracking. The latter task has mainly been approached with heuristic-based systems [15,28] and unsupervised syntax-based systems [10][11][12][20]. These systems work well for small domains where data is structured, however, the unstructured and large nature of modern log data limit the success of these techniques for real world data.…”

mentioning

confidence: 99%

Log Summarisation for Defect Evolution Analysis

Dolga, Zmigrod, Silva et al. 2023

Proceedings of the 1st International Workshop on Software Defect Datasets

Log analysis and monitoring are essential aspects of software maintenance and defect identification. In particular, the temporal nature and vast size of log data lead to an interesting and important research question: How can logs be summarised and monitored over time? While this has been a fundamental topic of research in the software engineering community, work has typically focused on heuristic-, syntax-, or static-based methods. In this work, we suggest an online semantic-based clustering approach to error logs that dynamically updates the log clusters to enable monitoring code error life-cycles. We also introduce a novel metric to evaluate the performance of temporal log clusters. We test our system and evaluation metric with an industrial dataset and find that our solution outperforms similar systems. We hope that our work encourages further temporal exploration in defect datasets.

CCS Concepts: Theory of computation → Unsupervised learning and clustering; Software and its engineering → Software defect analysis.

“…Berrocal et al [30] proposed an effective approach for fault detection based on the Void Search (VS) algorithm, which is used primarily in astrophysics for finding areas of space that have a very low density of galaxies. The log entropy technique has also been employed for error detection within patterns in [31] and [32], since log entropy measures the changes in the frequency of log events to capture the system's behavior. Our approach differs from all the above methods, as we adopt stochastic gradient descent logistic regression to construct a sentiment lexicon (i.e.…”

Section: Related Work

mentioning

confidence: 99%

Sentiment Analysis based Error Detection for Large-Scale Systems

Alharthi, Jhumka et al. 2021

2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Self Cite

“…The application of this kind of models on Failure Prediction tasks is called Anomaly Detection. Some of the main works are on [28], which detects failure patterns by removing redundant data and then performing clustering on them, using large production data as validation; [29], where the authors correlate failure appearances with system metrics; [30], where Principal Component Analysis is applied to detect anomalies; [31], which also presents a tiered system for syslog mining: first, they represent the system's behavior as a mixture of Hidden Markov Models, then they train a prediction model that gives an anomaly score and, lastly, [32], where the authors apply eigen equation compression to build network and node-level features for the models, calculating then an anomaly score based on the probability distribution of the built features. The main conclusion I extract from this category of research is that there is no public, labeled dataset that serves as an anchor for the community, allowing scientists to compare their works.…”

Section: Distributed Systems

mentioning

confidence: 99%

“…A similar issue is present on [46], where the usage of a Dynamic Bayesian Network is proposed for reliability analysis on service oriented systems but only use a fixed observation window of 20 seconds. The observation window problem arises in several other works in this area, though not applied to Failure Prediction: [22] mentions how assuming a fixed window is not convenient for performing Root Cause Analysis and [28] studies the problem of learning patterns in failure logs of distributed systems, in which the time between failures is a key parameter to tune. I consider these issues to be close to my own problem.…”

Section: Supercomputers and High-scale Clusters

mentioning

confidence: 99%

Distributed Systems Failure Management Through Applied Machine Learning

González, Manuel

This thesis deals with the problem of managing failures on distributed systems, especially on computer networks and high performance computing clusters. Through it, I expose and analyze the importance of the problem and how its current research landscape, while extensive, is fragmented, isolated and takes too narrow an approach. In particular, there is a knowledge gap between academic and industrial problems, and the need for a human expert, with all of the problems that this entails, has been overlooked. Based on this situation, I take two real datasets, a public one, detailing errors that occurred on a supercomputer at Los Alamos, USA, and the … The results show that my proposals are able to achieve successful solutions with minimal human interaction, while also satisfying the technical requirements and limitations.

I can't start this section in any other way than recognizing the massive debt I owe to Juan Carlos Dueñas, my thesis advisor, boss and friend for the last five years. My life has changed in more ways than I could reflect here and I've learned and grown along these years. And I owe it all to you. A deep, sincere thanks. And I also could not continue without thanking my workmates (and friends
