Fault Tolerance and Recovery of Scientific Workflows on Computational Grids

Kandaswamy, Gopi; Mandal, Anirban; Reed, Daniel A.

doi:10.1109/ccgrid.2008.79

Cited by 56 publications

(45 citation statements)

References 29 publications

“…Integrating Fault Tolerance with HEFT/DSH: Figure 1 describes the algorithm that we use to integrate the over-provisioning algorithm in [11] with HEFT. First, we sort the tasks in the DAG by its upward rank.…”

Section: B Scheduling Algorithms With Overprovisioning

mentioning

confidence: 99%
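
The statement above mentions sorting DAG tasks by upward rank before scheduling. As a rough illustration only, the sketch below uses the standard HEFT definition of upward rank (a task's mean compute cost plus the largest "communication cost plus successor rank" over its successors). The function names `upward_ranks` and `heft_order` and the dictionary-based inputs are assumptions made for this example; they do not come from the cited paper, and the over-provisioning step is not shown.

```python
from functools import lru_cache


def upward_ranks(succ, cost, comm):
    """Compute the HEFT upward rank of every task.

    succ: dict mapping task -> list of successor tasks (hypothetical input shape)
    cost: dict mapping task -> mean computation cost
    comm: dict mapping (task, successor) -> mean communication cost
    """

    @lru_cache(maxsize=None)
    def rank(task):
        # An exit task has no successors, so its rank is just its own cost.
        tail = max(
            (comm.get((task, s), 0.0) + rank(s) for s in succ.get(task, [])),
            default=0.0,
        )
        return cost[task] + tail

    return {task: rank(task) for task in cost}


def heft_order(succ, cost, comm):
    """Return tasks sorted by decreasing upward rank (the HEFT priority order)."""
    ranks = upward_ranks(succ, cost, comm)
    return sorted(ranks, key=ranks.get, reverse=True)
```

With strictly positive task costs, a task always ranks above its successors, so this order respects the DAG's precedence constraints.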

“…Our goal is to find the smallest set of resources to replicate the given workflow task to satisfy these constraints. We use an efficient algorithm described by Kandaswamy et al [11] to find the smallest subset of resources that satisfies these constraints. In the cases when it is not possible to satisfy the success probability or deadline constraints, the over-provisioning algorithm returns all possible resource combinations tagged with the success probabilities for each resource set, so that a best-effort replicated set of resources can be chosen.…”

Section: B Scheduling Algorithms With Overprovisioning

mentioning

confidence: 99%
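
To make the idea in the statement above concrete, here is a minimal sketch of choosing the smallest replica set that meets a success-probability target. It is not the algorithm of Kandaswamy et al., whose details are not given here. It assumes each resource has a known probability of finishing the task by the deadline and that resources fail independently. Under those assumptions, taking resources in decreasing order of probability yields a smallest satisfying set. All names (`smallest_replica_set`, `tagged_combinations`, the `resources` dictionary) are hypothetical.

```python
from itertools import combinations
from math import prod


def success_probability(probs):
    """Probability that at least one replica succeeds, assuming independence."""
    return 1.0 - prod(1.0 - p for p in probs)


def smallest_replica_set(resources, target):
    """Return (names, probability) for a smallest set meeting target, else None.

    resources: dict mapping resource name -> probability that the resource
               completes the task by the deadline (assumed to be given).
    """
    ranked = sorted(resources, key=resources.get, reverse=True)
    chosen = []
    for name in ranked:
        chosen.append(name)
        p = success_probability(resources[n] for n in chosen)
        if p >= target:
            return chosen, p
    return None  # the target cannot be met even with every resource


def tagged_combinations(resources):
    """Best-effort fallback: every non-empty resource set tagged with its probability."""
    tagged = []
    names = sorted(resources)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            tagged.append((combo, success_probability(resources[n] for n in combo)))
    return tagged
```

When `smallest_replica_set` returns `None`, a caller could fall back to `tagged_combinations` and pick a best-effort set, which mirrors the behaviour described in the statement. Enumerating all combinations is exponential in the number of resources, so this fallback only suits small resource pools.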

“…Checkpointing-recovery techniques make it possible for the workflow to resume execution from the last checkpoint instead of restarting from the beginning, should a failure occurs. Over-provisioning [11] techniques replicate a task on more than one resources to increase the probability of successful execution. Although these techniques address the reliability challenges to some extent, to the best of our knowledge, no large-scale study has been done on how effective they are when coupled with workflow management and scheduling.…”

Section: Introduction

mentioning

confidence: 99%

Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids

Yang, Mandal, Koelbel, et al. 2009

2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid

Self Cite

More and more complex scientific workflows are now executed on computational grids. In addition to the challenges of managing and scheduling these workflows, additional reliability challenges arise because of the unreliable nature of large-scale grid infrastructure. Fault tolerance mechanisms like over-provisioning and checkpoint-recovery are used in current grid application management systems to address these reliability challenges. In this work, we propose new approaches that combine these fault tolerance techniques with existing workflow scheduling algorithms. We present a study on the effectiveness of the combined approaches by analyzing their impact on the reliability of workflow execution, workflow performance and resource usage under different reliability models, failure prediction accuracies and workflow application types.

“…a forecast verification code) can be converted into and registered as services within LEAD and thus added to workflows. In order to ensure quality of service for time-critical meteorological experiments, especially forecasts (see §6), LEAD developed a fault tolerance recovery system (Fowler et al 2008;Kandaswamy et al 2008) based upon concepts such as over-provisioning and job resubmission.…”

mentioning

confidence: 99%

Transforming the sensing and numerical prediction of high-impact local weather through dynamic adaptation

Droegemeier 2008

Phil. Trans. R. Soc. A.

Mesoscale weather, such as convective systems, intense local rainfall resulting in flash floods and lake effect snows, frequently is characterized by unpredictable rapid onset and evolution, heterogeneity and spatial and temporal intermittency. Ironically, most of the technologies used to observe the atmosphere, predict its evolution and compute, transmit or store information about it, operate in a static pre-scheduled framework that is fundamentally inconsistent with, and does not accommodate, the dynamic behaviour of mesoscale weather. As a result, today's weather technology is highly constrained and far from optimal when applied to any particular situation. This paper describes a new cyberinfrastructure framework, in which remote and in situ atmospheric sensors, data acquisition and storage systems, assimilation and prediction codes, data mining and visualization engines, and the information technology frameworks within which they operate, can change configuration automatically, in response to evolving weather. Such dynamic adaptation is designed to allow system components to achieve greater overall effectiveness, relative to their static counterparts, for any given situation. The associated service-oriented architecture, known as Linked Environments for Atmospheric Discovery (LEAD), makes advanced meteorological and cyber tools as easy to use as ordering a book on the web. LEAD has been applied in a variety of settings, including experimental forecasting by the US National Weather Service, and allows users to focus much more attention on the problem at hand and less on the nuances of data formats, communication protocols and job execution environments.

“…[75] Most of the work in the area of scheduling has as its underlying paradigm the "task parallel" model of parallel computing. [55;76-81] In the Bag of Tasks (BoT) model the tasks are presumed to be independent and embarrassingly parallel.…”

Section: Strategies For Reliability

mentioning

confidence: 99%

Enhancing reliability with Latin Square redundancy on desktop grids.

Johnson
