2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) 2019
DOI: 10.1109/ase.2019.00042

Continuous Incident Triage for Large-Scale Online Service Systems

Abstract: As online service systems continue to grow in terms of complexity and volume, how service incidents are managed will significantly impact company revenue and user trust. Due to the cascading effect, cloud failures often come with an overwhelming number of incidents from dependent services and devices. To pursue efficient incident management, related incidents should be quickly aggregated to narrow down the problem scope. To this end, in this paper, we propose GRLIA, an incident aggregation framework based on g…

Cited by 63 publications (55 citation statements)
References 72 publications (79 reference statements)
“…It is worth noting that there is a body of work focuses on incident triage [8,9]. Chen et al [8] conducts a comprehensive empirical study of incident triage on 20 real-world online service systems.…”
Section: Incident Triage (mentioning)
confidence: 99%
“…This work also tries the practicability of bug triage approaches on incident triage problems, and it concludes that traditional bug triage methods need to be further improved to fit the context of incident triage. Chen et al [9] further propose DeepCT, a deep learning based approach which extracts features from historical discussions and thus solves the continuous incident triage problem. However, these work could only help back-end incident mitigation in largescale systems.…”
Section: Incident Triage (mentioning)
confidence: 99%
“…In recent years, deep learning (DL) techniques are rapidly developed and become one of the most popular techniques. Also, they are widely adopted in various domains in practice, such as autonomous driving cars [12], face recognition [59], speech recognition [29], aircraft collision avoidance systems [36], and software engineering [15,16,18,21,42,70]. Unfortunately, DL systems are also shown to be vulnerable to attacks and lack of robustness [40,67].…”
Section: Introduction (mentioning)
confidence: 99%
“…To reduce the influence of incidents and guarantee the quality of software services, there are two widely-used ways in both academia and industry [32,33], i.e., predicting the occurrence of an incident in advance so that engineers can take some proactive actions to prevent it [18,43] and mitigate the already happened incident as soon as possible [14,15]. Our work focuses on the first way since this way is able to directly avoid the occurrence of service unavailability rather than reduce the time of service unavailability.…”
Section: Introduction (mentioning)
confidence: 99%