2015 IEEE International Parallel and Distributed Processing Symposium Workshop 2015
DOI: 10.1109/ipdpsw.2015.109

Towards Detecting Patterns in Failure Logs of Large-Scale Distributed Systems

Cited by 11 publications (5 citation statements)
References 20 publications
“…The techniques presented in these works all fail to extract a representative log for clustering which is a desirable quality in terms of the explainability of the log error types. Similar work has shown that it is possible to achieve syntactic representatives [10,14,20]. Furthermore, past work has also utilised genetic algorithms to extract log templates [26].…”
Section: Related Work
mentioning
confidence: 98%
“…To do this, past research has focused on focused log comprehension [19,25] and log evolution tracking. The latter task has mainly been approached with heuristic-based systems [15,28] and unsupervised syntax-based systems [10][11][12]20]. These systems work well for small domains where data is structured, however, the unstructured and large nature of modern log data limit the success of these techniques for real world data.…”
mentioning
confidence: 99%
“…Berrocal et al [30] proposed an effective approach for fault detection based on the Void Search (VS) algorithm, which is used primarily in astrophysics for finding areas of space that have a very low density of galaxies. The log entropy technique has also been employed for error detection within patterns in [31] and [32], since log entropy measures the changes in the frequency of log events to capture the system's behavior. Our approach differs from all the above methods, as we adopt stochastic gradient descent logistic regression to construct a sentiment lexicon (i.e.…”
Section: Related Work
mentioning
confidence: 99%
“…The application of this kind of models on Failure Prediction tasks is called Anomaly Detection. Some of the main works are on [28], which detects failure patterns by removing redundant data and then performing clustering on them, using large production data as validation; [29], where the authors correlate failure appearances with system metrics; [30], where Principal Component Analysis is applied to detect anomalies; [31], which also presents a tiered system for syslog mining: first, they represent the system's behavior as a mixture of Hidden Markov Models, then they train a prediction model that gives an anomaly score and, lastly, [32], where the authors apply eigen equation compression to build network and node-level features for the models, calculating then an anomaly score based on the probability distribution of the built features. The main conclusion I extract from this category of research is that there is no public, labeled dataset that serves as an anchor for the community, allowing scientists to compare their works.…”
Section: Distributed Systems
mentioning
confidence: 99%
“…A similar issue is present on [46], where the usage of a Dynamic Bayesian Network is proposed for reliability analysis on service oriented systems but only use a fixed observation window of 20 seconds. The observation window problem arises in several other works in this area, though not applied to Failure Prediction: [22] mentions how assuming a fixed window is not convenient for performing Root Cause Analysis and [28] studies the problem of learning patterns in failure logs of distributed systems, in which the time between failures is a key parameter to tune. I consider these issues to be close to my own problem.…”
Section: Supercomputers and High-scale Clusters
mentioning
confidence: 99%