Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2019)
DOI: 10.1145/3338906.3338916

How bad can a bug get? An empirical analysis of software failures in the OpenStack cloud computing platform

Abstract: Cloud management systems provide abstractions and APIs for programmatically configuring cloud infrastructures. Unfortunately, residual software bugs in these systems can potentially lead to high-severity failures, such as prolonged outages and data losses. In this paper, we investigate the impact of failures in the context of the widespread OpenStack cloud management system, by performing fault injection and by analyzing the impact of the resulting failures in terms of fail-stop behavior, failure detection through log…
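
The methodology summarized in the abstract injects faults into the cloud management system and then checks whether the resulting failures are fail-stop and whether they surface in logs. The sketch below is a minimal, hypothetical illustration of that idea in Python (OpenStack's implementation language); the `create_instance` stand-in and the `inject_fault` decorator are assumptions for illustration, not the fault injection tool used in the paper.

```python
import logging
import random

# Log to a file so that "failure detection through logs" can be checked afterwards.
logging.basicConfig(filename="cloud_mgmt.log", level=logging.INFO)
log = logging.getLogger("fault-injector")


class InjectedFault(Exception):
    """Stands in for a residual software bug triggered at runtime."""


def inject_fault(probability=0.3):
    """Decorator that makes the wrapped API call fail with the given probability."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                # Fail-stop behavior: the call raises immediately and the failure
                # is recorded in the log, instead of silently corrupting state.
                log.error("Injected fault in %s", func.__name__)
                raise InjectedFault(func.__name__)
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_fault(probability=0.5)
def create_instance(name):
    """Hypothetical stand-in for a 'create server' management API call."""
    log.info("Instance %s created", name)
    return {"name": name, "status": "ACTIVE"}


if __name__ == "__main__":
    for i in range(5):
        try:
            create_instance(f"vm-{i}")
        except InjectedFault as fault:
            print(f"fail-stop failure observed in {fault}")
```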

Cited by 54 publications (45 citation statements) · References 57 publications

“…Therefore, the Chaos Monkey randomly terminates VM instances and containers that run inside a production environment. The principles of the testing tool have inspired developers to implement similar tools for different technologies, for example, Kubernetes clusters [12], Azure Service Fabric [13], Docker [14], or private cloud infrastructures [15]. Following the above ideas, we have developed a tool for monkey testing our self-healing, trans-cloud application management platform.…”
Section: Monkey Testing
confidence: 99%
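
The excerpt above describes Chaos-Monkey-style tools that randomly kill running instances. A minimal, hypothetical sketch of that pattern for local Docker containers follows (it assumes a local Docker daemon and is not the tool described by the citing authors; a real deployment would select VMs or service instances through the platform's management API rather than the Docker CLI).

```python
import random
import subprocess


def running_containers():
    """Return the IDs of currently running Docker containers."""
    result = subprocess.run(
        ["docker", "ps", "-q"], capture_output=True, text=True, check=True
    )
    return result.stdout.split()


def chaos_step(kill_probability=0.2):
    """With some probability, terminate one randomly chosen container."""
    victims = running_containers()
    if victims and random.random() < kill_probability:
        victim = random.choice(victims)
        print(f"chaos: terminating container {victim}")
        subprocess.run(["docker", "kill", victim], check=True)
    else:
        print("chaos: no container terminated this round")


if __name__ == "__main__":
    chaos_step()
```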
“…As a result, depending on the actual deployment of multi-component applications, we may have different durations for their possible "instability periods" (viz., time periods during which some of their application components are left unstable). Instability periods are definitely an issue, as inconsistent answers may cause inconsistent states for an application [13]. Furthermore, unresponsiveness increases the latency in answering end-users, potentially causing client loss in the same way as under-provisioning does [3].…”
Section: Introduction
confidence: 99%
“…Furthermore, advancements in microprocessor manufacturing, which yield lower nodal capacitance and higher transistor density, cause an increase in the soft error rate [21], i.e., the probability of occurrence of a soft error. Software faults are also a threat to the dependability of cloud computing and virtualized systems, particularly software faults in the components required for virtualization (e.g., hypervisor, toolstack, privileged virtual machine) and cloud management [22].…”
Section: Related Work
confidence: 99%
“…If intermediate code compilation exists, byte code manipulation may also be a viable option (Sanches et al. 2011). In many cases, we are able to use abstract forms of the code (e.g., an abstract syntax tree) to inject a particular kind of fault (Cotroneo et al. 2019; Hajdu et al. 2020).…”
Section: Introduction
confidence: 99%
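
To illustrate the AST-based fault injection mentioned in the last excerpt, the following hedged Python sketch uses the standard `ast` module to rewrite a comparison operator, emulating an off-by-one fault; the `in_range` function and the specific mutation are hypothetical examples, not the approach of any particular cited tool.

```python
import ast


class WrongComparisonInjector(ast.NodeTransformer):
    """Rewrite '<' into '<=' to emulate an off-by-one comparison fault."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.LtE() if isinstance(op, ast.Lt) else op for op in node.ops]
        return node


source = """
def in_range(x, limit):
    return x < limit
"""

# Parse the original code, apply the mutation, and compile the faulty variant.
tree = ast.parse(source)
mutated = ast.fix_missing_locations(WrongComparisonInjector().visit(tree))

namespace = {}
exec(compile(mutated, filename="<mutated>", mode="exec"), namespace)

# The injected fault changes the boundary behavior: in_range(10, 10) now returns True.
print(namespace["in_range"](10, 10))
```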