2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID) 2008
DOI: 10.1109/ccgrid.2008.79

Fault Tolerance and Recovery of Scientific Workflows on Computational Grids

Abstract: In this paper, we describe the design and implementation of two mechanisms for fault-tolerance and recovery for complex scientific workflows on computational grids. We present our algorithms for over-provisioning and migration, which are our primary strategies for fault-tolerance. We consider application performance models, resource reliability models, network latency and bandwidth and queue wait times for batch-queues on compute resources for determining the correct fault-tolerance strategy. Our goal…

Cited by 56 publications (45 citation statements)
References 29 publications
“…Integrating Fault Tolerance with HEFT/DSH: Figure 1 describes the algorithm that we use to integrate the over-provisioning algorithm in [11] with HEFT. First, we sort the tasks in the DAG by its upward rank.…”
Section: B Scheduling Algorithms With Overprovisioning
mentioning
confidence: 99%
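The upward-rank ordering mentioned in this citation statement can be illustrated with a minimal sketch of the standard HEFT definition: a task's upward rank is its average computation cost plus the largest (communication cost + upward rank) over its successors, and tasks are scheduled in non-increasing rank order. This is not code from the citing paper or from [11]; the function names and the `succ`, `cost`, and `comm` inputs are hypothetical, and the integration with over-provisioning is not shown.

```python
from functools import lru_cache


def upward_ranks(succ, cost, comm):
    """Compute the HEFT upward rank of every task in a DAG.

    succ: dict mapping task -> list of successor tasks (hypothetical input shape)
    cost: dict mapping task -> average computation cost
    comm: dict mapping (task, successor) -> average communication cost
    """
    @lru_cache(maxsize=None)
    def rank(task):
        # Exit tasks have no successors, so the max term defaults to 0.
        tail = max(
            (comm.get((task, s), 0.0) + rank(s) for s in succ.get(task, ())),
            default=0.0,
        )
        return cost[task] + tail

    return {task: rank(task) for task in cost}


def heft_order(succ, cost, comm):
    """Return tasks sorted by non-increasing upward rank."""
    ranks = upward_ranks(succ, cost, comm)
    return sorted(ranks, key=lambda task: -ranks[task])
```

Sorting by non-increasing upward rank always places a task before its successors when costs are positive, which is why HEFT can use it as a scheduling order.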
“…Our goal is to find the smallest set of resources to replicate the given workflow task to satisfy these constraints. We use an efficient algorithm described by Kandaswamy et al [11] to find the smallest subset of resources that satisfies these constraints. In the cases when it is not possible to satisfy the success probability or deadline constraints, the over-provisioning algorithm returns all possible resource combinations tagged with the success probabilities for each resource set, so that a best-effort replicated set of resources can be chosen.…”
Section: B Scheduling Algorithms With Overprovisioning
mentioning
confidence: 99%
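The selection problem described in this citation statement can be sketched as follows. This is not the algorithm of Kandaswamy et al. [11], whose details are not given here; it is a simple illustration under loudly stated assumptions: resource failures are independent, each resource has a single reliability value and a single estimated completion time, and a replicated task succeeds if at least one replica finishes. All names (`smallest_replica_set`, `resources`, `min_success`, `deadline`) are hypothetical.

```python
from itertools import combinations
from math import prod


def success_probability(reliabilities):
    """Probability that at least one replica succeeds (independence assumed)."""
    return 1.0 - prod(1.0 - p for p in reliabilities)


def smallest_replica_set(resources, min_success, deadline):
    """Pick the fewest resources meeting a success-probability target.

    resources: dict name -> (reliability, estimated_completion_time)
    Returns (chosen_names, None) when the constraints can be met, otherwise
    (None, tagged), where tagged lists every combination of deadline-eligible
    resources with its success probability for a best-effort choice.
    """
    # Only resources expected to finish by the deadline are eligible.
    eligible = sorted(
        ((name, p) for name, (p, eta) in resources.items() if eta <= deadline),
        key=lambda item: item[1],
        reverse=True,
    )
    # Under independence, adding the most reliable resources first
    # reaches the target with the fewest replicas.
    chosen = []
    for name, p in eligible:
        chosen.append((name, p))
        if success_probability(q for _, q in chosen) >= min_success:
            return [n for n, _ in chosen], None
    # Target unreachable: tag every combination (exponential; sketch only).
    tagged = [
        ([n for n, _ in combo], success_probability(q for _, q in combo))
        for k in range(1, len(eligible) + 1)
        for combo in combinations(eligible, k)
    ]
    return None, tagged
```

The fallback branch mirrors the best-effort behaviour described in the statement, but enumerating all subsets is only practical for a handful of resources.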
“…a forecast verification code) can be converted into and registered as services within LEAD and thus added to workflows. In order to ensure quality of service for time-critical meteorological experiments, especially forecasts (see §6), LEAD developed a fault tolerance recovery system (Fowler et al 2008; Kandaswamy et al 2008) based upon concepts such as over-provisioning and job resubmission.…”
mentioning
confidence: 99%
“…[75] Most of the work in the area of scheduling has as its underlying paradigm the "task parallel" model of parallel computing. [55;76-81] In the Bag of Tasks (BoT) model the tasks are presumed to be independent and embarrassingly parallel.…”
Section: Strategies For Reliability
mentioning
confidence: 99%