2009 Fifth IEEE International Conference on E-Science 2009
DOI: 10.1109/e-science.2009.51

A New Fault Tolerance Heuristic for Scientific Workflows in Highly Distributed Environments Based on Resubmission Impact

Cited by 29 publications (24 citation statements)
References 15 publications
“…A fault-tolerant elastic scheduling algorithms for real-time tasks in clouds named FESTAL, that aimed for both fault tolerance and high resource utilization in clouds is proposed in [12]. A new heuristic called resubmission impact to handle the faults during the execution of SWf tasks in distributed systems is proposed in [3]. Most of the related approaches are based on the predictions of failure probability of a task on a resource in a certain time interval and also budget surpluses due to replication of tasks.…”
Section: Related Work (mentioning)
confidence: 99%
“…However, existing task clustering strategies have ignored the effect of task failures on clouds, despite their significant effect on the large-scale distributed systems such as grids and clouds [2]. The scientists usually require highly distributed systems to compute complex problems that can run for many days or even weeks [3]. If the system is a low fault-tolerant, then it can lose days or even weeks of computation time and it is intolerable for scientists.…”
Section: Introduction (mentioning)
confidence: 99%
“…To overcome problems resulting from unexpected job crashes and network interruptions, ASKALON is able to handle most of the common failures. Jobs and file transfers are resubmitted on failure and jobs might also be rescheduled to a different resource if transfers or jobs failed more than 5 times on a resource (Plankensteiner et al, 2009a). These features still exist in the cloud version but play a less important role as resources showed to be more reliable in the cloud case.…”
Section: Middleware Askalon (mentioning)
confidence: 99%
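The resubmit-then-reschedule behaviour described in the statement above can be sketched roughly as follows. This is only an illustration under stated assumptions, not ASKALON's implementation: the function `run_with_resubmission`, the `run_job` callable, and the resource names are hypothetical. Only the threshold of more than 5 failures per resource comes from the quoted text.

```python
from typing import Callable, Sequence, Tuple

# Threshold taken from the quoted statement ("failed more than 5 times").
MAX_FAILURES_PER_RESOURCE = 5


def run_with_resubmission(
    run_job: Callable[[str], bool],
    resources: Sequence[str],
    max_failures: int = MAX_FAILURES_PER_RESOURCE,
) -> Tuple[str, int]:
    """Resubmit a job on the same resource; move on once failures exceed max_failures.

    run_job(resource) is a hypothetical callable returning True on success.
    Returns (resource that succeeded, total number of attempts).
    Raises RuntimeError if the job fails on every resource.
    """
    attempts = 0
    for resource in resources:
        failures = 0
        # Keep resubmitting until the job has failed MORE than max_failures
        # times on this resource, then reschedule to the next one.
        while failures <= max_failures:
            attempts += 1
            if run_job(resource):
                return resource, attempts
            failures += 1
    raise RuntimeError("job failed on all resources")
```

With a job that always fails on a first resource and succeeds on a second, the sketch makes six attempts on the first resource before rescheduling, so the seventh attempt succeeds on the second resource.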
“…An important feature which is distributed over several components of ASKALON is the capability to handle faults in distributed systems. Resources or network connections might fail any time and mechanisms as described in Plankensteiner et al (2009a) are integrated in the execution engine Qin et al (2007) allowing workflows to finish even when parts of the system fail.…”
Section: Middleware Askalonmentioning
confidence: 99%