A New Fault Tolerance Heuristic for Scientific Workflows in Highly Distributed Environments Based on Resubmission Impact

Plankensteiner, Kassian; Prodan, Radu; Fahringer, Thomas

doi:10.1109/e-science.2009.51

Cited by 29 publications

(24 citation statements)

References 15 publications

“…A fault-tolerant elastic scheduling algorithms for real-time tasks in clouds named FESTAL, that aimed for both fault tolerance and high resource utilization in clouds is proposed in [12]. A new heuristic called resubmission impact to handle the faults during the execution of SWf tasks in distributed systems is proposed in [3]. Most of the related approaches are based on the predictions of failure probability of a task on a resource in a certain time interval and also budget surpluses due to replication of tasks.…”

Section: Related Work (mentioning)

confidence: 99%

“…However, existing task clustering strategies have ignored the effect of task failures on clouds, despite their significant effect on the large-scale distributed systems such as grids and clouds [2]. The scientists usually require highly distributed systems to compute complex problems that can run for many days or even weeks [3]. If the system is a low fault-tolerant, then it can lose days or even weeks of computation time and it is intolerable for scientists.…”

Section: Introduction (mentioning)

confidence: 99%

“…If one task fails, another replicated task will balance the workflow execution. This approach assures high level of fault-tolerance, if there are enough resources available [3].…”

Section: Introduction (mentioning)

confidence: 99%

Fault-Tolerant Scheduling for Scientific Workflows in Cloud Environments

Vinay; Kumar (2017)

2017 IEEE 7th International Advance Computing Conference (IACC)

Abstract: Executing clustered tasks has proven to be an efficient method to improve the computation of Scientific Workflows (SWf) on clouds. However, clustered tasks have a higher probability of suffering from failures than a single task. Therefore, fault tolerance in cloud computing is essential when running large-scale scientific applications. In this paper, a new heuristic called the Cluster based Heterogeneous Earliest Finish Time (C-HEFT) algorithm is proposed to enhance the scheduling and fault tolerance mechanism for SWf in highly distributed cloud environments. To mitigate the failure of clustered tasks, this algorithm uses the idle time of the provisioned resources to resubmit failed clustered tasks for successful execution of SWf. Experimental results show that the proposed algorithm has a convincing impact on SWf executions and also drastically reduces resource waste compared to existing task replication techniques. A trace-based simulation of five real SWf shows that this algorithm is able to sustain unexpected task failures with minimal cost and makespan.

“…To overcome problems resulting from unexpected job crashes and network interruptions, ASKALON is able to handle most of the common failures. Jobs and file transfers are resubmitted on failure and jobs might also be rescheduled to a different resource if transfers or jobs failed more than 5 times on a resource (Plankensteiner et al, 2009a). These features still exist in the cloud version but play a less important role as resources showed to be more reliable in the cloud case.…”

Section: Middleware Askalon (mentioning)

confidence: 99%
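The resubmission policy described in the statement above can be sketched roughly as follows. This is an illustrative sketch only, not ASKALON code: the function name `run_with_resubmission`, the `job` callable, and the resource names are hypothetical, and only the threshold of 5 failures per resource comes from the quoted text.

```python
from typing import Callable, Sequence, Tuple

# Threshold taken from the quoted statement; everything else is assumed.
MAX_FAILURES_PER_RESOURCE = 5


def run_with_resubmission(
    job: Callable[[str], bool],
    resources: Sequence[str],
    max_failures: int = MAX_FAILURES_PER_RESOURCE,
) -> Tuple[str, int]:
    """Resubmit `job` on a resource until it succeeds or has failed more
    than `max_failures` times there, then reschedule it to the next resource.

    `job` is a hypothetical callable that returns True on success.
    Returns the resource that succeeded and the total number of attempts.
    """
    attempts = 0
    for resource in resources:
        failures = 0
        # "failed more than 5 times" means up to max_failures + 1 attempts.
        while failures <= max_failures:
            attempts += 1
            if job(resource):
                return resource, attempts
            failures += 1
    raise RuntimeError("job failed on every available resource")
```

A real execution engine would also have to distinguish job failures from file-transfer failures and choose the next resource through its scheduler; the sketch simply walks a fixed list.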

“…An important feature which is distributed over several components of ASKALON is the capability to handle faults in distributed systems. Resources or network connections might fail any time and mechanisms as described in Plankensteiner et al (2009a) are integrated in the execution engine Qin et al (2007) allowing workflows to finish even when parts of the system fail.…”

Section: Middleware Askalon (mentioning)

confidence: 99%

Experiences with distributed computing for meteorological applications: grid computing and cloud computing

et al. 2015

Abstract. Experiences with three practical meteorological applications with different characteristics are used to highlight the core computer science aspects and applicability of distributed computing to meteorology. Through presenting cloud and grid computing this paper shows use case scenarios fitting a wide range of meteorological applications from operational to research studies. The paper concludes that distributed computing complements and extends existing high performance computing concepts and allows for simple, powerful and cost-effective access to computing capacity.

Scheduling Scientific Workflows to Meet Soft Deadlines in the Absence of Failure Models

Plankensteiner; Prodan; Fahringer (2010)

Euro-Par 2010 - Parallel Processing

Self Cite

Highly distributed systems such as Clouds and Grids are used to execute complex scientific workflow applications by researchers from various areas of science. While scientists rightfully expect efficient and reliable execution of their applications, current systems often cannot deliver the required Quality of Service. We propose a dynamic execution and scheduling heuristic able to schedule workflow applications with a high degree of fault tolerance, while taking into account soft deadlines. Experimental results show that our method meets soft deadlines in volatile highly distributed systems in the absence of historic failure trace data or complex failure models of the target system.
