Latent error prediction and fault localization for microservice applications by learning from system trace logs

Zhou, Xiang; Peng, Xin; Xie, Tao; Sun, Jun; Ji, Chao; Liu, Dewei; Xiang, Qilin; He, Chuan

doi:10.1145/3338906.3338961

Cited by 160 publications

(72 citation statements)

References 39 publications


“…For online service systems, alerts are a key data source for recording the anomalies generated from various system components. More specifically, monitoring systems continuously collect various data (e.g., metrics [46], logs [19,31], and traces [55,56]) from various service components, and engineers manually define many rules to check these monitoring data to ensure service availability. When a certain rule is violated, an alert would be generated to report the anomaly.…”

Section: Motivation and Problem Formulation 2.1 Background: Alert And I (mentioning)

confidence: 99%

“…Nowadays, online service systems, such as online shopping, Ebank, and search engines, have become an indispensable part in our daily life. Although tremendous efforts have been devoted to software service maintenance (e.g., collecting various monitoring data for a service system such as metrics [44,46,54], logs [19,31,51], traces [55], and alerts [29]), due to their large scale and complexity, incidents (i.e., unplanned interruption/outage to a service [2, 16,25]) are still inevitable, which could lead to system unavailability and huge economic loss [32]. For example, according to a recent survey [1], the average cost per hour of server downtime is between $301,000 and $400,000.…”

Section: Introduction (mentioning)

confidence: 99%

“…Similar to AirAlert, we also utilize lightweight alert data for prediction since alerts are more high-level and comprehensive. More specifically, alerts are generated to report anomalies from other monitoring data (e.g., metrics [46], logs [19,31], and traces [55]), and thus avoid processing massive logs or metrics. However, predicting incidents based on alert data in practice also faces several challenges as follows.…”

Section: Introduction (mentioning)

confidence: 99%


Real-time incident prediction for online service systems

Zhao, Chen, Zhou, et al. 2020

Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Softw


“…After parsing, the injector will use the 'chaosmonkey' to inject faults. In previous work [38] and [39], researchers collected 22 representative microservice faults and listed the detail description of these faults. For those faults that result in the malfunctioning of system services by raising errors or producing incorrect results, researchers regard them as functional faults.…”

Section: Chaos Engineering (mentioning)

confidence: 99%

A Framework of Virtual War Room and Matrix Sketch-Based Streaming Anomaly Detection for Microservice Systems

Chen 2020

IEEE Access


Recently, microservice has been a popular architecture to construct cloud-native systems. This novel architecture brings agility and accelerates the software development process significantly. However, it is not easy to manage and operate microservice systems due to their scale and complexity. Many approaches are proposed to automatically operate microservice systems such as anomaly detection. Nevertheless, those methods cannot be sufficiently validated and compared due to a lack of real microservice systems, which leads to the slow process of intelligent operation. These challenges inspire us to build a system named "VWR", a framework of Virtual War Room for operating microservice applications which allows users to simulate their microservice architectures with low overhead and inject multiple types of faults into the microservice system with chaos engineering. VWR can mimic user requests and record the end-to-end tracing data (i.e., service call chains) for each request in a way consistent with OpenTracing. With easily designed tests and the produced streaming tracing data, the users can validate the performance of their intelligent operation algorithms and improve the algorithms as needed. Besides, based on the streaming tracing data generated by VWR, we introduce a novel unsupervised anomaly detection algorithm based on Matrix Sketch and set it as a default intelligent operation algorithm in VWR. This algorithm can detect anomalies by analyzing high-dimensional performance data collected from a microservice system in a streaming manner. The experimental result in VWR shows that the matrix sketch based method can precisely detect anomalies in microservice systems and outperform some widely used anomaly detection methods such as isolation forest in some scenario. We believe more approaches on the intelligent operation of microservice systems can be constructed based on VWR.
INDEX TERMS Microservice, virtual war room, matrix sketch, anomaly detection, chaos engineering.


“…In a microservice system, each request may result in a series of distributed service invocations executed synchronously or asynchronously. A service can have several to thousands of instances dynamically created, destroyed, and managed by a microservice discovery service (e.g., the service discovery component of Docker swarm) [21,22]. For a microservice system, operation engineers and developers highly rely on trace analysis to understand architectures and diagnose various problems.…”

Section: Introduction (mentioning)

confidence: 99%