Towards intelligent incident management: why we need it and how we make it

Chen, Zhuangbin; Kang, Yu; Li, Liqun; Zhang, Xu; Zhang, Hongyu; Xu, Hui; Zhou, Yangfan; Li, Yang; Sun, Jeffrey; Xu, Zhangwei; Dang, Yingnong; Gao, Feng; Zhao, Pu; Qiao, Bo; Lin, Qingwei; Zhang, Dongmei; Lyu, Michael R.

doi:10.1145/3368089.3417055

Cited by 58 publications

(25 citation statements)

References 28 publications


“…The triage practice for both outages suffers from flooding alarm problems [5]. For example, during the impact of Outages , there are K incidents reported by 225 services from the affected region of Outages in total.…”

Section: A Two Real-world Cases

mentioning

confidence: 99%

“…Many dynamic dependencies are even implicit for engineers (e.g., asynchronous communication, virtual routers, virtual disks), and some services deployed on the same node may affect each other (e.g., monitoring services and functional services). When an outage occurs, massive noisy alerts might be reported, usually as incident tickets, due to the notorious flooding alarm problem in cloud computing platforms [5]. As a result, it is difficult to decide the root-cause service.…”

mentioning

confidence: 99%


Fast Outage Analysis of Large-Scale Production Clouds with Service Correlation Mining

Wang et al. 2021

2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE)


Cloud-based services are surging into popularity in recent years. However, outages, i.e., severe incidents that always impact multiple services, can dramatically affect user experience and incur severe economic losses. Locating the root-cause service, i.e., the service that contains the root cause of the outage, is a crucial step to mitigate the impact of the outage. In current industrial practice, this is generally performed in a bootstrap manner and largely depends on human efforts: the service that directly causes the outage is identified first, and the suspected root cause is traced back manually from service to service during diagnosis until the actual root cause is found. Unfortunately, production cloud systems typically contain a large number of interdependent services. Such a manual root cause analysis is often time-consuming and labor-intensive. In this work, we propose COT, the first outage triage approach that considers the global view of service correlations. COT mines the correlations among services from outage diagnosis data. After learning from historical outages, COT can infer the root cause of emerging ones accurately. We implement COT and evaluate it on a real-world dataset containing one year of data collected from Microsoft Azure, one of the representative cloud computing platforms in the world. Our experimental results show that COT can reach a triage accuracy of 82.1%~83.5%, which outperforms the state-of-the-art triage approach by 28.0%~29.7%.

Index Terms: cloud computing, root cause analysis, outage triage, machine learning

I. Introduction

Cloud computing has become increasingly popular in recent years. Many companies have migrated their services to various cloud computing platforms, e.g., Microsoft Azure, Amazon AWS, and Google Cloud. These platforms provide a variety of services to millions of users from all over the world every day. Availability is one of the most critical concerns for cloud computing platforms, influencing the user experience and the cloud providers' revenue significantly.

Although tremendous efforts have been devoted to maintaining high service availability [1]-[4], cloud computing platforms still encounter many incidents, i.e., unplanned interruptions of the services. These incidents, especially outages (i.e., …


“…Representative service failures include slow response, request timeout, service unavailability, etc., which could be caused by capacity issues, configuration errors, software bugs, hardware faults, etc. To quickly understand failure symptoms, a large number of monitors are configured to monitor the states of different services in a cloud system [2]. A monitor will render an incident when certain predefined conditions (e.g., "CPU utilization rate exceeds 80%") are met.…”

Section: B Cascading Effect Of Service Failures

mentioning

confidence: 99%
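
The monitor behavior described in the statement above can be illustrated with a minimal sketch. It assumes a monitor is just a name plus a predicate over one metrics sample; the `Monitor` class, the `evaluate` helper, the `cpu_util` metric key, and the service name are hypothetical and are not taken from the cited papers. Only the 80% CPU threshold comes from the quoted example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Monitor:
    """Hypothetical monitor: a name plus a predicate over one metrics sample."""
    name: str
    condition: Callable[[Dict[str, float]], bool]


def evaluate(monitors: List[Monitor], service: str,
             metrics: Dict[str, float]) -> List[str]:
    """Return one incident title per monitor whose condition holds."""
    return [f"{service}: {m.name}" for m in monitors if m.condition(metrics)]


# The 80% threshold is from the quoted example; the metric key is assumed.
cpu_monitor = Monitor(
    name="CPU utilization rate exceeds 80%",
    condition=lambda sample: sample.get("cpu_util", 0.0) > 80.0,
)

# A sample above the threshold renders one incident for the service.
incidents = evaluate([cpu_monitor], "vm-host-1", {"cpu_util": 91.5})
print(incidents)
```

With many such monitors attached to many services, a single underlying failure can satisfy several conditions at once, which is one way the incident volume described in these statements can arise.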

“…When a failure happens, system monitors will render a large number of incidents to capture different failure symptoms [2]-[4], which can help engineers quickly obtain a big picture of the failure and pinpoint the root cause. For example, "Special instance cannot be migrated" is a critical network failure in Virtual Private Cloud (VPC) service, and the incident "Tunnel…”

Section: Introduction

mentioning

confidence: 99%

“…bearing network pack loss" is a signal for this network failure, which is caused by the breakdown of a physical network card on the tunnel path. Due to the large scale and complexity of online service systems, the number of incidents is overwhelming the existing incident management systems [2], [4], [5]. When a service failure occurs, aggregating related incidents can greatly reduce the number of incidents that need to be investigated.…”

Section: Introduction

mentioning

confidence: 99%
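
To make concrete why aggregating related incidents reduces the number of items to investigate, here is a deliberately naive sketch that groups incidents by time proximity only. It is not the method of any cited paper; the `aggregate_by_time` function, the 300-second gap, and the sample incident titles are assumptions chosen for illustration.

```python
from typing import List, Tuple

# (timestamp in seconds, incident title); the layout is an assumption.
Incident = Tuple[float, str]


def aggregate_by_time(incidents: List[Incident],
                      gap: float = 300.0) -> List[List[Incident]]:
    """Group incidents so that consecutive members are at most `gap` seconds apart."""
    groups: List[List[Incident]] = []
    for inc in sorted(incidents):
        # Extend the current group if this incident is close to its last member.
        if groups and inc[0] - groups[-1][-1][0] <= gap:
            groups[-1].append(inc)
        else:
            groups.append([inc])
    return groups


# Hypothetical sample: two incidents close in time, one much later.
sample = [
    (0.0, "network packet loss"),
    (120.0, "request timeout"),
    (4000.0, "disk latency high"),
]
groups = aggregate_by_time(sample)
print(len(sample), "incidents ->", len(groups), "groups")
```

A time-only heuristic like this will merge unrelated incidents that happen to coincide, so it serves only as a baseline for thinking about approaches that also use topology or learned correlations.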


Continuous Incident Triage for Large-Scale Online Service Systems

Chen, Lin, et al. 2019

2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE)


As online service systems continue to grow in terms of complexity and volume, how service incidents are managed will significantly impact company revenue and user trust. Due to the cascading effect, cloud failures often come with an overwhelming number of incidents from dependent services and devices. To pursue efficient incident management, related incidents should be quickly aggregated to narrow down the problem scope. To this end, in this paper, we propose GRLIA, an incident aggregation framework based on graph representation learning over the cascading graph of cloud failures. A representation vector is learned for each unique type of incident in an unsupervised and unified manner, which is able to simultaneously encode the topological and temporal correlations among incidents. Thus, it can be easily employed for online incident aggregation. In particular, to learn the correlations more accurately, we try to recover the complete scope of failures' cascading impact by leveraging fine-grained system monitoring data, i.e., Key Performance Indicators (KPIs). The proposed framework is evaluated with real-world incident data collected from a large-scale online service system of Huawei Cloud. The experimental results demonstrate that GRLIA is effective and outperforms existing methods. Furthermore, our framework has been successfully deployed in industrial practice.


Making service continuity smarter with artificial intelligence: An approach and its evaluation

2023


Service continuity entails establishing an observable and explainable continuum between customer experience and service operations. Such a continuum is currently established manually, via service customer management operations (such as service incident management (IM)), often resulting in time‐consuming, human‐detrimental, and error‐prone activities. Conversely, artificial intelligence (AI) is rapidly emerging as an automated enabler towards handling the discontinuities in the aforementioned critical business tasks. Consequently, the emerging topic of AI‐driven incident management (AIIM) addresses practices and tools to resolve incidents by means of AI‐enabled organizational processes and methodologies. Our conjecture is that AIIM could reduce unplanned interruptions of service and let customers resume their work as quickly as possible. While several techniques were presented in the literature to automatically identify the problems described in incident tickets by customers, this article focuses on the qualitative analysis and feature extraction from the provided descriptions. When an incident ticket does not properly describe the problem, the analyst must ask the customer for additional details, which could require several long‐lasting interactions. This article proposes ACQUA, an AIIM approach to automatically assess the quality of ticket descriptions with the goals of removing the need for additional communications and guiding the customers to properly describe the incident. A preliminary evaluation of ACQUA was performed on a dataset provided by a large bank in Europe, showing promising results and a boost of 13% in ticket resolution times and connected service continuity.
