Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering 2020
DOI: 10.1145/3368089.3417055

Towards intelligent incident management: why we need it and how we make it

citations
Cited by 58 publications
(25 citation statements)
references
References 28 publications
“…The triage practice for both outages suffers from flooding alarm problems [5]. For example, during the impact of Outages , there are K incidents reported by 225 services from the affected region of Outages in total.…”
Section: A Two Real-world Cases
mentioning
confidence: 99%
“…Many dynamic dependencies are even implicit for engineers (e.g., asynchronous communication, virtual routers, virtual disks), and some services deployed on the same node may affect each other (e.g., monitoring services and functional services). When an outage occurs, massive noisy alerts might be reported, usually as incident tickets, due to the notorious flooding alarm problem in cloud computing platforms [5]. As a result, it is difficult to decide the root-cause service.…”
mentioning
confidence: 99%
“…Representative service failures include slow response, request timeout, service unavailability, etc., which could be caused by capacity issues, configuration errors, software bugs, hardware faults, etc. To quickly understand failure symptoms, a large number of monitors are configured to monitor the states of different services in a cloud system [2]. A monitor will render an incident when certain predefined conditions (e.g., "CPU utilization rate exceeds 80%") are met.…”
Section: B Cascading Effect Of Service Failures
mentioning
confidence: 99%
“…When a failure happens, system monitors will render a large number of incidents to capture different failure symptoms [2]-[4], which can help engineers quickly obtain a big picture of the failure and pinpoint the root cause. For example, "Special instance cannot be migrated" is a critical network failure in Virtual Private Cloud (VPC) service, and the incident "Tunnel…”
Section: Introduction
mentioning
confidence: 99%