2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid 2009
DOI: 10.1109/ccgrid.2009.59

Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids

Abstract: More and more complex scientific workflows are now executed on computational grids. In addition to the challenges of managing and scheduling these workflows, additional reliability challenges arise because of the unreliable nature of large-scale grid infrastructure. Fault tolerance mechanisms like over-provisioning and checkpoint-recovery are used in current grid application management systems to address these reliability challenges. In this work, we propose new approaches that combine these fault tolerance te…

Cited by 39 publications (25 citation statements)
References 17 publications

“…In [16], Taverna uses repeated retry with increasing delay intervals between retries. Zhang et al [25] proposed combining over-provisioning and checkpoint-recovery with existing workflow scheduling algorithms. The work of others [24,21,15], proposed either a reactive-only or a proactive-only FT support.…”
Section: Related and Future Work (mentioning)
confidence: 99%
“…Therefore, cost-effective fault tolerance support for grid applications is critical. To date, FT mechanisms in grids are typically reactive, inflexible and/or de facto place significant burden on the application developers to manage faults themselves [8,16,25,24,21,15,16].…”
Section: Introduction (mentioning)
confidence: 99%
“…In the latter case the available information is aggregated to time series documenting the number of pending and finished tasks, which is crucial for the scalability of event based monitoring [17][18][19][20] and deriving scheduling strategies [21,22].…”
Section: Background and Literature Review (mentioning)
confidence: 99%
“…Several techniques have been developed to cope with the negative impact of job failures on the execution of scientific workflows. The most common technique is to retry the failed job [17]-[19]. However, retrying a clustered job can be expensive since completed tasks within the job usually need to be recomputed, thereby resource cycles are wasted.…”
Section: Introduction (mentioning)
confidence: 99%