Failure Prediction of Jobs in Compute Clouds: A Google Cluster Case Study

Chen, Xin; Lü, Chao; Pattabiraman, Karthik

doi:10.1109/issrew.2014.105

Cited by 52 publications

(16 citation statements)

References 14 publications

“…Higher priority indicates higher preference for resources. According to [1], 12 priorities can be grouped into five classes: gratis (0-1), batch (2-8), normal production (9), monitoring (10), and infrastructure (11). The number of killer tasks at "normal production" priority is 1,146, which coincides with the description that priority 9 is dominant in production priorities in [17].…”

Section: A Failure Frequency Analysis (supporting)

confidence: 53%

“…In contrast, we discover the resource usage pattern to recognize killer tasks and avoid resource wasting. In their recent work, they convert task attributes and mean resource usage as features, and apply recurrent neural network to predict task failures [11]. However, only average resource usage instead of time series data is used in their model.…”

Section: A Google Trace Analysis (mentioning)

confidence: 99%

“…Prior studies on failures of cloud computing systems focus on characterization of job failures [10], server failures [5] and failure prediction [11]. To the best of our knowledge, we are the first to recognize killer tasks and perform online recognition at their early stage.…”

Section: Introduction (mentioning)

confidence: 99%

Time Series Based Killer Task Online Recognition Service: A Google Cluster Case Study

Tang, Liu, Jia, et al. 2016

2016 IEEE Symposium on Service-Oriented System Engineering (SOSE)

To better understand task failures in cloud computing systems, we analyze the failure frequency of tasks in the Google cluster dataset and find what we call killer tasks, which suffer from long-term failures and repeated rescheduling. Killer tasks can be a major concern for cloud systems, as they cause unnecessary resource waste and a significant increase in scheduling workload. Hence there is a need for a service that lets cloud system operators recognize killer tasks in time. In this paper, we propose an online killer task recognition service based on resource usage time series, which can recognize killer tasks at a very early stage of their occurrence so that they can be handled appropriately instead of being rescheduled. The experimental results show that the proposed service achieves 93.6% accuracy in recognizing killer tasks, with an 87% timing advance and 86.6% resource saving for the cloud system on average.

“…There are existing research works in the literature that applied statistical, machine and deep learning methods using Google dataset for different prediction purposes such as workload, scheduling, and job/task failure prediction. Chen et al [12] studied main features of application job and task failures in cloud computing. Authors analyzed events and resource usages of the jobs and tasks to determine features related to the failures.…”

Section: A Task Failure Prediction (mentioning)

confidence: 99%

Proactive Failure-Aware Task Scheduling Framework for Cloud Computing

2021

Cloud computing is a widely adopted platform for executing tasks of different application types that belong to end users. In the cloud, an application task is prone to failure for several reasons, such as a software bug or exception, or a virtual or physical infrastructure failure. Cloud service providers are responsible for managing the availability of scheduled computing tasks in order to provide a high level of QoS for their customers. Protecting tasks against failure is challenging and nontrivial due to the dynamic, heterogeneous, and large distributed structure of the cloud environment. Existing works in the literature focus on task failure prediction and neglect the remedial (post) actions. In this work, we first study and analyze three publicly available large cluster datasets, from Google, Alibaba, and Trinity, to characterize task failure in cloud computing platforms. We then propose a failure-aware task scheduling framework that can predict the termination status of a given set of tasks at runtime and take the appropriate remedial actions. The framework uses two deep learning methods, Artificial and Convolutional Neural Networks (ANN and CNN), for different prediction purposes. In addition, we formalize the action selection problem as an Integer Linear Programming (ILP) model and propose a heuristic optimization solution that aims to minimize the failure probability of tasks and their resource usage. The results show that ANN and CNN can achieve prediction accuracy of up to 94% and 92%, respectively, on the Google dataset. Moreover, the framework can protect up to 40% of the tasks that are predicted to fail in the Alibaba dataset by taking the appropriate remedial actions, and hence save many of the cluster's resources, such as CPU and RAM.

Index Terms: Task failure prediction, deep learning, task scheduling, cloud computing.

“…However, they don't leverage a specific technique to conduct failure prediction. In their later work, they convert job attributes and mean resource usage as features, and apply recurrent neural network to predict job failures [15]. El-Sayed et al [16] characterize unsuccessful jobs and employ classification techniques to predict job failures.…”

Section: Related Work (mentioning)

confidence: 99%

Hunting Killer Tasks for Cloud System through Behavior Pattern Learning

Tang, Liu, Jia, et al. 2016

2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W)

Motivated by frequent failures in cloud computing systems, we analyze the failure frequency and continuity of tasks from the Google cloud cluster and find what we call killer tasks, which suffer from frequent failures and repeated rescheduling. Killer tasks can be a major concern in cloud systems, as they cause unnecessary resource waste and a significant increase in scheduling workload. In this paper, we investigate the characteristics and behavior patterns of killer tasks, then develop an approach to recognize killer tasks at a very early stage of their occurrence so that they can be addressed proactively instead of being rescheduled repeatedly. The empirical results show that our approach achieves 97% precision in recognizing killer tasks, with a maximum lead time of 1,164 minutes and 89% resource saving for the cloud system on average.
