As stability testing execution logs can be very long, software engineers need help in locating anomalous events. We develop and evaluate two models for scoring individual log events for anomalousness, namely an N-Gram model and a Deep Learning model with LSTM (Long Short-Term Memory). Both are trained on normal log sequences only. We evaluate the models with long log sequences of Android stability testing at our case company and with short log sequences from the public HDFS (Hadoop Distributed File System) dataset. We evaluate next-event prediction accuracy and computational efficiency. The LSTM model is more accurate on stability testing logs (0.865 vs. the N-Gram model's 0.848), whereas on HDFS logs the N-Gram model is slightly more accurate (0.904 vs. 0.900). The N-Gram model has far superior computational efficiency compared to the deep model (4 to 13 seconds vs. 16 minutes to nearly 4 hours), making it the preferred choice for our case company. Scoring individual log events for anomalousness appears to be a good aid for root cause analysis of failing test cases, and our case company plans to add it to its online services. Despite the recent surge in using deep learning for software system anomaly detection, we found limited benefits in doing so. However, future work should consider whether our finding holds with different LSTM-model hyper-parameters, other datasets, and other deep-learning approaches that promise better accuracy and computational efficiency than LSTM-based models.
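The N-Gram approach described above can be sketched as a next-event probability model trained on normal sequences only: events that receive a low probability given their preceding context are flagged as anomalous. The class below is a minimal illustrative sketch with maximum-likelihood estimates and no smoothing; the paper's actual model and its smoothing details are not specified here, and all names are hypothetical.

```python
from collections import defaultdict

class NGramScorer:
    """Score log events by how surprising they are given the
    previous n-1 events. Trained on normal sequences only; a low
    probability for an observed next event suggests an anomaly."""

    def __init__(self, n=3):
        self.n = n
        # context tuple -> {next event -> count}
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequences):
        """Count (context, next-event) pairs over normal log sequences."""
        for seq in sequences:
            padded = ["<s>"] * (self.n - 1) + list(seq)
            for i in range(self.n - 1, len(padded)):
                ctx = tuple(padded[i - self.n + 1:i])
                self.counts[ctx][padded[i]] += 1

    def prob(self, context, event):
        """Maximum-likelihood P(event | last n-1 context events)."""
        ctx = tuple(context[-(self.n - 1):])
        total = sum(self.counts[ctx].values())
        if total == 0:
            return 0.0  # unseen context: maximally surprising
        return self.counts[ctx][event] / total
```

In practice a smoothing scheme would be needed for unseen contexts, but even this bare version shows why the approach is cheap: training and scoring are simple dictionary lookups, consistent with the runtimes reported above.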
In recent years, with the growth of online services and IoT devices, software log anomaly detection has become a significant concern for both academia and industry. However, at the time of writing, almost all contributions to the log anomaly detection task follow the same traditional architecture based on parsing, vectorizing, and classifying. This paper proposes OneLog, a new approach that uses a single large deep model instead of multiple small components. OneLog utilizes a character-based convolutional neural network (CNN) originating from traditional NLP tasks. This allows the model to take advantage of multiple datasets at once and of numbers and punctuation, which were removed in previous architectures. We evaluate OneLog using four open datasets: Hadoop Distributed File System (HDFS), BlueGene/L (BGL), Hadoop, and OpenStack. We evaluate our model with single- and multi-project datasets. Additionally, we evaluate robustness with synthetically evolved datasets and an ahead-of-time anomaly detection test that indicates the capability to predict anomalies before they occur. To the best of our knowledge, our multi-project model outperforms state-of-the-art methods on the HDFS, Hadoop, and BGL datasets, achieving F1 scores of 99.99, 99.99, and 99.98, respectively. However, OneLog's performance on OpenStack is unsatisfying, with an F1 score of only 21.18. Furthermore, OneLog's performance suffers very little from noise, showing F1 scores of 99.95, 99.92, and 99.98 on HDFS, Hadoop, and BGL. Our work demonstrates that character-level CNNs can successfully utilize multiple datasets to boost learning in log anomaly detection. Our F1 scores for HDFS, Hadoop, and BGL are almost too good to be true; thus, upon acceptance of this paper, we offer our code for further replication. Finally, future work is needed to investigate the poor performance on the OpenStack dataset.
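The key ingredient above, character-level input, can be illustrated with a simple one-hot encoder that keeps the digits and punctuation that token-based parsing pipelines typically discard. This is an illustrative sketch only, not OneLog's actual encoder; the alphabet, maximum length, and function name are assumptions.

```python
import string

# Hypothetical character alphabet: letters, digits, punctuation, space.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "
CHAR_TO_IDX = {c: i for i, c in enumerate(ALPHABET)}

def encode_log_line(line, max_len=128):
    """One-hot encode a raw log line at character level.

    Digits and punctuation (e.g. error codes like "404", separators
    like ":") are preserved as features rather than parsed away.
    Characters outside the alphabet map to an all-zero row; lines are
    truncated or zero-padded to max_len rows.
    """
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for pos, ch in enumerate(line.lower()[:max_len]):
        idx = CHAR_TO_IDX.get(ch)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix
```

A 1-D CNN would then slide convolution filters over the `max_len` axis of this matrix, letting the network learn its own tokenization instead of relying on a log parser.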
Detecting anomalies in software logs has become a notable concern for software engineers and maintainers, as such anomalies reflect anomalous software execution paths and states. This paper proposes a novel anomaly detection approach based on a Siamese network on top of recurrent neural networks (RNNs). Accordingly, we introduce a novel training-pair generation algorithm for the Siamese network that significantly reduces the number of generated training pairs while maintaining the F1 score. Additionally, we propose a hybrid model that combines the Siamese network with a traditional feedforward neural network to make end-to-end training possible, reducing the engineering effort of setting up a deep-learning-based log anomaly detector. Furthermore, we validate the approach on the Hadoop Distributed File System (HDFS), Blue Gene/L (BGL), and Hadoop map-reduce task log datasets. To the best of our knowledge, the proposed approach outperforms other methods on the same datasets, with F1 scores of 0.99, 0.99, and 0.94 on HDFS, BGL, and Hadoop, respectively, resulting in new state-of-the-art performance. To further evaluate the proposed method, we examine its robustness to log evolution by evaluating the model on synthetically evolved log sequences; we obtained an F1 score of 0.95 on the HDFS dataset at a noise ratio of 20%. Finally, we dive into some of the side benefits of the Siamese network: we introduce an unsupervised log evolution monitoring method alongside a visualization technique that facilitates model interpretability.
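The core of a Siamese setup like the one above is a loss over pairs of embeddings: same-class pairs are pulled together, different-class pairs pushed at least a margin apart. Below is the generic contrastive-loss formulation as a minimal sketch; the paper's exact loss, margin, and pairing strategy may differ, and the function name is an assumption.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_label, margin=1.0):
    """Generic contrastive loss for a Siamese network.

    If the pair shares a label, the loss grows with the Euclidean
    distance between embeddings (pulling them together). If not, the
    loss is zero once the pair is at least `margin` apart (pushing
    dissimilar pairs away).
    """
    d = np.linalg.norm(emb_a - emb_b)
    if same_label:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2
```

Because the loss is defined over pairs, the number of candidate training pairs grows quadratically with the dataset, which is why a pair-generation algorithm that prunes redundant pairs, as proposed above, matters for training cost.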