Process-Oriented Non-intrusive Recovery for Sporadic Operations on Cloud

Fu, Min; Zhu, Liming; Weber, Ingo; Bass, Len; Liu, Anna; Xu, Xiwei

doi:10.1109/dsn.2016.17

Cited by 1 publication

(1 citation statement)

References 18 publications

“…Another important approach to mitigate failures is to implement fault containment strategies. Examples are i) interrupting a service as soon as a failure occurs (i.e., a fail-stop behavior), by turning high-severity failures, such as data losses, into lower-severity API exceptions that can be gracefully handled [5,57,71]; ii) notifying the cloud management system and operators about the failures through error logs, so that they can diagnose issues and undertake recovery actions, such as restoring a previous state checkpoint or backup [19,75]; iii) separating system components across different domains to prevent cascading failures across components [2,26,34].…”

Section: Introduction (mentioning)

confidence: 99%

How bad can a bug get? an empirical analysis of software failures in the OpenStack cloud computing platform

Cotroneo, Simone, Liguori, et al. 2019

Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of

Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can potentially lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the context of the widespread OpenStack cloud management system, by performing fault injection and by analyzing the impact of the resulting failures in terms of fail-stop behavior, failure detection through logging, and failure propagation across components. The analysis points out that most of the failures are not detected and reported in a timely manner; moreover, many of these failures can silently propagate over time and through components of the cloud management system, which calls for more thorough run-time checks and fault containment.
