TaskInsight: A Fine-Grained Performance Anomaly Detection and Problem Locating System

Zhang, Xiao; Meng, Fan Jing; Chen, Pengfei; Xu, Jie

doi:10.1109/cloud.2016.0136

Cited by 24 publications

(8 citation statements)

References 8 publications

“…If corresponding back-end server has enough available memory, then load balancer sends a signal to the agent in order to set a new thread pool size. However, memory utilization is a low level metric that cannot be used as an actual pointer to the overload condition, because some other resource may be the cause of bottleneck even on efficient memory utilization, hence it has been reported in [3] that resource level metrics are too coarse to locate the overload condition. Moreover, resource level metrics are not directly related to end user experience, thus memory-utilization metric does not provide reliable response-time outage [4].…”

Section: Related Work (mentioning)

confidence: 99%

The Double Edge Sword Based Distributed Executor Service

Bahadur, Umar, Ullah et al. 2022

Computer Systems Science and Engineering

Scalability is one of the most important quality attributes of software-intensive systems, because it sustains effective performance under large, fluctuating, and sometimes unpredictable workloads. To achieve scalability, the thread pool system (TPS), also known as an executor service, has been used extensively as a middleware service in software-intensive systems. TPS optimization, which means determining the optimal thread pool size dynamically at runtime, is a challenging problem. In a distributed TPS (DTPS), a further issue is load balancing between the available TPSs running on backend servers. Existing DTPSs become overloaded either because of an inappropriate TPS optimization strategy at the backend servers or because of a load balancing scheme that cannot quickly recover from an overload. Consequently, the performance of the software-intensive system suffers. In this paper, we therefore propose a new DTPS that follows collaborative round robin load balancing, which has the effect of a double-edged sword. On the one hand, it balances load effectively among the available TPSs in an overload situation through a fast overload recovery procedure that decelerates the load on the overloaded TPSs down to their capacities and shifts the remaining load towards other gracefully running TPSs. On the other hand, its robust load deceleration technique, applied to an overloaded TPS, sets an appropriate upper bound on the thread pool size: because the pool size in each TPS is kept equal to the request rate on it, the TPS is optimized dynamically. We evaluated the proposed system against state-of-the-art DTPSs using a client-server based simulator and found that our system outperformed them by sustaining smaller response times.

“…Chan et al [89] define the model with rules based on machine learning, Song et al [15] rely on manual models. Zhang et al [16] employ unsupervised clustering on black-box tasks to induce normal resource usage behavior patterns from historical data. Monni et al [17,18] develop a technique for energy-based anomaly detection using Restricted Boltzmann Machines (RBMs) [18] and acknowledge how their technique can be used to detect collective anomalies and failures in software systems.…”

Section: Detection Based On Normal Behavior Modeling (mentioning)

confidence: 99%

A Taxonomy of Techniques for SLO Failure Prediction in Software Systems

Grohmann, Herbst, Chalbani et al. 2020

Computers

Failure prediction is an important aspect of self-aware computing systems. Therefore, a multitude of different approaches has been proposed in the literature over the past few years. In this work, we propose a taxonomy for organizing works focusing on the prediction of Service Level Objective (SLO) failures. Our taxonomy classifies related work along the dimensions of the prediction target (e.g., anomaly detection, performance prediction, or failure prediction), the time horizon (e.g., detection or prediction, online or offline application), and the applied modeling type (e.g., time series forecasting, machine learning, or queueing theory). The classification is derived based on a systematic mapping of relevant papers in the area. Additionally, we give an overview of different techniques in each sub-group and address remaining challenges in order to guide future research.

“…However, most of them are computationally . By contrast, scalable algorithms have also been proposed to facilitate the anomaly detection in cloud computing, e.g., implementing a probabilistic approach to detect abnormal software systems [46], adopting Holt-Winters forecasting to identify a violation in application metrics [30], and implementing a clustering method to find the anomalous application threads [63]. A common issue of these scalable techniques is that they require low-level access to application level information, while our approach only targets general performance metrics that can be obtained via sampling the state of the system.…”

Section: Anomaly Detection (mentioning)

confidence: 99%

CloudDet: Interactive Visual Analysis of Anomalous Performances in Cloud Computing Systems

Wang, Yang, Wang et al. 2019

IEEE Trans. Visual. Comput. Graphics

Fig. 1. CloudDet facilitates the exploration of anomalous cloud computing performances through three levels of analysis: (a) anomaly ranking, (b) anomaly inspection, and (c) anomaly clustering. The figure showcases some exploration results with Bitbrains Datacenter traces data. Node (b1) contains both short and long term spikes, with no pattern in their occurrence times. Node (b2) shows a 12-hour periodic pattern for the performance metrics by observing the calendar chart in (a2), but encounters a spike in the process. Node (b3) shows many short-term and near-periodic spikes at the beginning and an abnormal long-term spike near the end. After collapsing the long-term one into a visual aggregation glyph in (b5), (b3) is updated and the latter temporal data "pop out", which shows a similar pattern as the beginning. Node (b4) shows a general periodic trend which is not apparent in (a4) by using the PCA analysis in (b6). Most of the nodes are clustered into three groups in (c), with each group displaying a rare but similar performance.

Abstract: Detecting and analyzing potential anomalous performances in cloud computing systems is essential for avoiding losses to customers and ensuring the efficient operation of the systems. To this end, a variety of automated techniques have been developed to identify anomalies in cloud computing performance. These techniques are usually adopted to track the performance metrics of the system (e.g., CPU, memory, and disk I/O), represented by a multivariate time series. However, given the complex characteristics of cloud computing data, the effectiveness of these automated methods is affected. Thus, substantial human judgment on the automated analysis results is required for anomaly interpretation. In this paper, we present a unified visual analytics system named CloudDet to interactively detect, inspect, and diagnose anomalies in cloud computing systems. A novel unsupervised anomaly detection algorithm is developed to identify anomalies based on the specific temporal patterns of the given metrics data (e.g., the periodic pattern), the results of which are visualized in our system to indicate the occurrences of anomalies. Rich visualization and interaction designs are used to help understand the anomalies in the spatial and temporal context. We demonstrate the effectiveness of CloudDet through a quantitative evaluation, two case studies with real-world data, and interviews with domain experts.
