2020
DOI: 10.1145/3385187
Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform

Abstract: Many software services today are hosted on cloud computing platforms, such as Amazon EC2, due to many benefits like reduced operational costs. However, node failures in these platforms can impact the availability of their hosted services and potentially lead to large financial losses. Predicting node failures before they actually occur is crucial, as it enables DevOps engineers to minimize their impact by performing preventative actions. However, such predictions are hard due to many challenges like the enormo…

citations
Cited by 52 publications
(38 citation statements)
references
References 34 publications
“…In Table ??, a large number of studies adopted predictive models to predict bugs or faults in software systems. Failure prediction is critical to predictive maintenance since it has the ability to prevent maintenance costs and failure / bug occurrences [21,34,137,260,331]. Yilmaz and Porter [315] classify measured executions into successful and failed executions in order to apply the resulting models to systems with an unknown failure status.…”
Section: Software Testing
Citation type: mentioning
confidence: 99%
“…Besides, some black-box ML and DL models, like LSTM and MING-based models, is hard to explain the results. The lack of decision interpretability is not convenient for maintenance work of DevOps engineers [20]. Therefore this paper calculate the monitoring time series anomaly score in real-time, which is automatic feature engineering to build a random forest-based prediction model in the second layer.…”
Section: Literature Review
Citation type: mentioning
confidence: 99%
“…Cloud computing platforms generate a tremendous amount of data that is impossible to be analyzed manually. Recently, it has become increasingly common for organizations to use AIOps (Artificial Intelligence for IT Operations) to leverage such generated data to ensure the quality of service and high availability of cloud computing platforms [6,8,21,42,62,88]. AIOps leverages machine learning learners to construct machine learning models (hereafter AIOps models) with operations data collected from the cloud computing platforms (e.g., logs and alert signals) to enable quality assurance tasks such as predicting hard drive failures [42], job termination [22], service outages [88], and performance issues [44].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Recently, it has become increasingly common for organizations to use AIOps (Artificial Intelligence for IT Operations) to leverage such generated data to ensure the quality of service and high availability of cloud computing platforms [6,8,21,42,62,88]. AIOps leverages machine learning learners to construct machine learning models (hereafter AIOps models) with operations data collected from the cloud computing platforms (e.g., logs and alert signals) to enable quality assurance tasks such as predicting hard drive failures [42], job termination [22], service outages [88], and performance issues [44]. Note that we use the term "learner" to refer to a machine learning algorithm (e.g., Random Forest) and the term "model" to refer to a trained machine learning model (e.g., a Random Forest model trained on disk failure data).…”
Section: Introduction
Citation type: mentioning
confidence: 99%