Fast Dimensional Analysis for Root Cause Investigation in a Large-Scale Service Environment

Lin, Fred; Muzumdar, Keyur; Laptev, Nikolay; Curelea, Mihai-Valentin; Lee, Seunghak; Sankar, Sriram

doi:10.1145/3392149

Cited by 15 publications

(6 citation statements)

References 12 publications

“…N, n the number of traces, the number of services m, c the type number of metrics, the collected number of each type metric Metric Anomaly Score: We use the mean µ_ik and standard deviation σ_ik of the service metrics to calculate service anomaly severity [20][21][22][23][24]. The µ_ik is the expected normal value and the σ_ik indicates that the metric deviates from the mean.…”

Section: Notation Definitions (mentioning)

confidence: 99%

Graph-Based Root Cause Localization in Microservice Systems with Protection Mechanisms

Zhang, Tian, Yang, et al. 2022

Preprint

Nowadays, the protection mechanisms are introduced into microservice systems to ensure the stable operation of services. However, existing approaches ignore the impact of protection mechanisms on the root cause localization of abnormal services. Specifically, the circuit breaking and rate limiting mechanisms can refuse service requests and thus change the way of anomaly propagation. Moreover, different service request frequencies and response time make service dependencies change dynamically, resulting in different probabilities of anomaly propagation among services. In this paper, we propose a novel framework named MicroGBPM to locate the root cause of abnormal services, which considers the impact of the protection mechanisms. We model anomaly propagation among services as a dynamically constructed service attributed graph with metrics and traces when a failure occurs. To eliminate the impact of the protection mechanisms, we design a two-stage dynamic calibration strategy to adjust the probability of anomaly propagation among services. Then we propose a random walking approach to calculate the root cause results by using the PageRank algorithm. The experimental results show that MicroGBPM improves the accuracy of root cause localization compared to other approaches in microservice systems with protection mechanisms.

“…One of the most important capabilities of the AI-powered data-analytics for IT operations (AIOps) is fully automated RCA ([Gartner Research, 2019]). Many AIOps platform vendors like IBM (see [IBM, 2021]), Facebook (see [Lin et al, 2020]), VMware (see , Marvasti et al, 2014a, Marvasti et al, 2014b, Marvasti et al, 2016, Harutyunyan et al, 2020b, Harutyunyan et al, 2020c), HPE (see [HPE, 2019]), BigPanda (see [BigPanda, 2020]), DataDog (see [Othmane A.-A., 2021]), Moogsoft (see [Sahil K., 2016]) and others ([Moogsoft, 2016]) have almost complete vision and solution for the domain-centric RCA described in Figure 1.…”

Section: Related Work (mentioning)

confidence: 99%

“…It has been in the focus of researchers for decades with diverse ideas including anomaly detection, event correlations, causal inference, correlation analysis, predictive models and many others (see [ABS Consulting et al, 2014, Zawawy et al, 2010, Chuah et al, 2010, Cai et al, 2019, Marvasti et al, 2014b] with references therein). Ideally, RCA should analyze all acquired monitoring datasets including logs (see , Harutyunyan et al, 2018a, Mi et al, 2012, Kostroš et al, 2014, Tak et al, 2016, Chuah et al, 2010, Bird et al, 2015, Zawawy et al, 2010, Michalski, 1983), traces (see [Suriadi et al, 2013, Lin et al, 2020]) and time series data (see [Jeyakumar et al, 2019, Pearl, 2009, Spirtes et al, 2000]) with possible correlations among them. Distributed tracing is the classical approach to application monitoring and diagnostics (see [Opentracing, 2019]).…”

Section: Related Work (mentioning)

confidence: 99%

“…Rule learning systems have wide applications including analysis of log-data and trace-data. Papers ([Suriadi et al, 2013, Lin et al, 2020]) consider analysis of log-data. Paper [Suriadi et al, 2013] compares RIPPER to C4.5 ([Quinlan, 2014]) for RCA on log data.…”

Section: Related Work (mentioning)

confidence: 99%

“…It uses implementation of both algorithms in WEKA ([Witten et al, 2005]) known as JRip and J48 for RIPPER and C4.5, respectively. Paper ([Lin et al, 2020]) considers RCA in a large-scale production environment based on structured logs. It explores application of the Apriori algorithm ([Agrawal et al, 1993]) with subsequent improvement with FP-Growth ([Han et al, 2004]).…”

Section: Related Work (mentioning)

confidence: 99%

Incident Management for Explainable and Automated Root Cause Analysis in Cloud Data Centers

Poghosyan, Harutyunyan, Grigoryan, et al. 2021

jucs

Effective root cause analysis (RCA) of performance issues in modern cloud environments remains a hard problem. Traditional RCA tracks complex issues by their signatures known as problem incidents. Common approaches to incident discovery rely mainly on expertise of users who define environment-specific set of alerts and target detection of problems through their occurrence in the monitoring system. Adequately modeling of all possible problem patterns for nowadays extremely sophisticated data center applications is a very complex task. It may result in alert/event storms including large numbers of non-indicative precautions. Thus, the crucial task for the incident-based RCA is reduction of redundant recommendations by prioritizing those events subject to importance/impact criteria or by deriving their meaningful groupings into separable situations. In this paper, we consider automation of incident discovery based on rule induction algorithms that retrieve conditions directly from monitoring datasets without consuming the system events. Rule-learning algorithms are very flexible and powerful for many regression and classification problems, with high-level explainability. Since annotated or labeled data sets are mostly unavailable in this area of technology, we discuss data self-labelling principles which allow transforming originally unsupervised learning tasks into classification problems with further application of rule induction methods to incident detection.

A Survey on Association Rule Mining for Enterprise Architecture Model Discovery

Pinheiro, Guerreiro, Mamede 2023

Bus Inf Syst Eng

Association Rule Mining (ARM) is a field of data mining (DM) that attempts to identify correlations among database items. It has been applied in various domains to discover patterns, provide insight into different topics, and build understandable, descriptive, and predictive models. On the one hand, Enterprise Architecture (EA) is a coherent set of principles, methods, and models suitable for designing organizational structures. It uses viewpoints derived from EA models to express different concerns about a company and its IT landscape, such as organizational hierarchies, processes, services, applications, and data. EA mining is the use of DM techniques to obtain EA models. This paper presents a literature review to identify the newest and most cited ARM algorithms and techniques suitable for EA mining that focus on automating the creation of EA models from existent data in application systems and services. It systematically identifies and maps fourteen candidate algorithms into four categories useful for EA mining: (i) General Frequent Pattern Mining, (ii) High Utility Pattern Mining, (iii) Parallel Pattern Mining, and (iv) Distribute Pattern Mining. Based on that, it discusses some possibilities and presents an exemplification with a prototype hypothesizing an ARM application for EA mining.
