Cloud-based services have surged in popularity in recent years. However, outages, i.e., severe incidents that always impact multiple services, can dramatically affect user experience and incur severe economic losses. Locating the root-cause service, i.e., the service that contains the root cause of the outage, is a crucial step in mitigating the impact of the outage. In current industrial practice, this is generally performed in a bootstrap manner and depends largely on human effort: the service that directly causes the outage is identified first, and the suspected root cause is then traced back manually from service to service during diagnosis until the actual root cause is found. Unfortunately, production cloud systems typically contain a large number of interdependent services, so such manual root cause analysis is often time-consuming and labor-intensive. In this work, we propose COT, the first outage triage approach that considers the global view of service correlations. COT mines the correlations among services from outage diagnosis data. After learning from historical outages, COT can accurately infer the root cause of emerging ones. We implement COT and evaluate it on a real-world dataset containing one year of data collected from Microsoft Azure, one of the representative cloud computing platforms in the world. Our experimental results show that COT can reach a triage accuracy of 82.1%~83.5%, which outperforms the state-of-the-art triage approach by 28.0%~29.7%.
Index Terms: cloud computing, root cause analysis, outage triage, machine learning
I. Introduction
Cloud computing has become increasingly popular in recent years. Many companies have migrated their services to various cloud computing platforms, e.g., Microsoft Azure, Amazon AWS, and Google Cloud. These platforms provide a variety of services to millions of users from all over the world every day.
Availability is one of the most critical concerns for cloud computing platforms, as it significantly influences both the user experience and the cloud providers' revenue. Although tremendous efforts have been devoted to maintaining high service availability [1]-[4], cloud computing platforms still encounter many incidents, i.e., unplanned interruptions of the services. These incidents, especially outages (i.e.,
The expansion of pervasive and ubiquitous computing, especially with the advancement of the Internet of Things and the Smart City concept, extends the novel means of criminality and of its investigation. We argue that current forms of investigation and discovery are not sufficient to limit injuries to persons and communities. Nonetheless, cybersecurity approaches within criminal justice, criminology, and workforce development, taken together, offer models that significantly benefit efforts to address public cybersecurity harms, yet they have been largely overlooked. This paper draws on an interdisciplinary lens to address cybersecurity, integrating criminal justice and workforce development and employing empowerment theory. Applying empowerment theory, this presentation demonstrates the effects of integrating cybersecurity and forensic practices into traditional law enforcement. The effects are positive as public safety will be needed to provide public safety and security in our hybrid technical world. Thus, this paper illustrates how we must, in essence, “democratize” cybersecurity through its distributed availability. We present means to achieve this, along with results from efforts to promote this integration through several coordinated, yet differently targeted, programs at one research university.