2005 International Conference on Dependable Systems and Networks (DSN'05)
DOI: 10.1109/dsn.2005.44

Ensembles of Models for Automated Diagnosis of System Performance Problems

Abstract: Violations of service level objectives (SLO) in Internet services are urgent conditions requiring immediate attention. Previously we showed [1] that Tree-Augmented Bayesian Networks or TAN models are effective at identifying which low-level system properties were correlated to high-level SLO violations (the metric attribution problem) under stable workloads. In this paper we extend our approach to adapt to changing workloads and external disturbances by maintaining an ensemble of probabilistic models, adding n…

Cited by 97 publications
(70 citation statements)
References 16 publications
“…Evaluating large amounts of measured metrics by statistical methods and methods from artificial intelligence can be effectively utilized to improve enterprise systems' dependability by allowing fault detection [4,5] and the forecasting of the system's behavior [6,7]. Unfortunately, the dimensional problem of such approaches has not been sufficiently addressed by the community.…”
Section: Related Work (mentioning)
confidence: 99%
“…Cohen et al [8] and Zhang et al [28] use a similar approach for estimating performance models by means of statistical learning. They use tree-augmented Bayesian networks (TAN) to discover correlations between system metrics and service level objectives (SLO).…”
Section: Related Work (mentioning)
confidence: 99%
“…Note that these online models are trained only with the historical data portions free of anomalies. We leverage existing techniques [1], [9], [18] for the methodology used in this phase (see Section IV-B).…”
Section: Monitoring Engine (mentioning)
confidence: 99%
“…However, this technique only works for positive correlations. Zhang et al [18] leverage an ensemble of models to address variations in an underlying model (performance or correlation-based) that may change with time due to fluctuation in workload intensity or mix, software or hardware updates. Similarly, Cherkasova et al [15] identify an application change using two different models for a given time period, leading to higher accuracy.…”
Section: A. Related Work (mentioning)
confidence: 99%