Combined Fault Tolerance and Scheduling Techniques for Workflow Applications on Computational Grids

Yang, Zili; Mandal, Anirban; Koelbel, Charles; Cooper, Keith D.

doi:10.1109/ccgrid.2009.59

Cited by 39 publications

(25 citation statements)

References 17 publications

“…In [16], Taverna uses repeated retry with increasing delay intervals between retries. Zhang et al [25] proposed combining over-provisioning and checkpoint-recovery with existing workflow scheduling algorithms. The work of others [24,21,15], proposed either a reactive-only or a proactive-only FT support.…”

Section: Related and Future Work (mentioning)

confidence: 99%

Architecture-based fault tolerance support for grid applications

Yusuf, Schmidt, Peake (2011)

Proceedings of the Joint ACM SIGSOFT Conference -- QoSA and ACM SIGSOFT Symposium -- ISARCS on Quality of Software Architecture

Failure in long running grid applications is arguably inevitable and costly. Therefore, fault tolerance (FT) support for grid applications is needed. This paper evaluates an extension of our prior work on Recovery Aware Components (RAC), a component-based FT approach. Our extension utilizes the grid application architecture according to a small number of architectural classes. In this paper, we evaluate the MapReduce architecture only and analyze the reliability improvement MapReduce applications would gain by adopting the RAC approach. Our analysis shows that significant increases in reliability are possible at moderate extra cost. Obviously the cost of FT depends on the failure rate of the managed system, i.e., the system to be protected from faults, and the FT strategy chosen. Our work aims to give High Performance Computing (HPC) software architects the tools to control these factors for different grid application architectures.
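The citation statement above mentions retrying with increasing delay intervals between retries. A minimal Python sketch of that general idea follows. It is an illustration only, not Taverna's implementation: the function name, the parameters (`max_retries`, `initial_delay`, `backoff_factor`), and the multiplicative growth of the delay are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_increasing_delay(
    task: Callable[[], T],
    max_retries: int = 3,
    initial_delay: float = 1.0,
    backoff_factor: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, retrying on failure with a delay that grows each time.

    All names and defaults here are hypothetical; the delay is multiplied
    by `backoff_factor` after every failed attempt (an assumed policy).
    """
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    delay = initial_delay
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last failure
            sleep(delay)  # wait before the next attempt
            delay *= backoff_factor  # lengthen the next wait
            attempt += 1
```

The `sleep` parameter is injectable so that the waiting behaviour can be observed or replaced without actually pausing.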

“…Therefore, cost-effective fault tolerance support for grid applications is critical. To date, FT mechanisms in grids are typically reactive, inflexible and/or de facto place significant burden on the application developers to manage faults themselves [8,16,25,24,21,15,16].…”

Section: Introduction (mentioning)

confidence: 99%

Architecture-based fault tolerance support for grid applications

Yusuf, Schmidt, Peake (2011)

Proceedings of the Joint ACM SIGSOFT Conference -- QoSA and ACM SIGSOFT Symposium -- ISARCS on Quality of Software Architecture

“…In the latter case the available information is aggregated to time series documenting the number of pending and finished tasks, which is crucial for the scalability of event based monitoring [17][18][19][20] and deriving scheduling strategies [21,22].…”

Section: Background and Literature Review (mentioning)

confidence: 99%

Performance analysis of concurrent workflows

Kempa-Liehr (2015)

Journal of Big Data

Automated workflows are the key concept of big data pipelines in science, engineering and enterprise applications. The performance analysis of automated workflows is an important topic of the continuous improvement process and the foundation of designing new workflows. This paper introduces the concept of process evolution functions and event reduction policies, which allow for the time resolved visualization of an unlimited number of concurrent workflows by means of aggregated task views. The visualization allows for an intuitive approach to the performance analysis of concurrent workflows. The theoretical foundation of this approach is applicable for workflows represented by directed acyclic graphs. It is explained on the basis of a simple IO-workflow model, which is typically found for distributed resource management systems utilized for many-task computing.
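The aggregation of task events into time series of pending and finished task counts, as described in the citation statement above, can be sketched roughly as follows. This is a simplified illustration and not the paper's process evolution functions: the event format `(timestamp, kind)` and the kind labels `"submitted"` and `"finished"` are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical event format: (timestamp, kind), kind is "submitted" or "finished".
Event = Tuple[float, str]


def task_count_series(events: Iterable[Event]) -> List[Tuple[float, int, int]]:
    """Return (timestamp, pending, finished) after each distinct timestamp."""
    submitted_at: Counter = Counter()
    finished_at: Counter = Counter()
    for timestamp, kind in events:
        if kind == "submitted":
            submitted_at[timestamp] += 1
        elif kind == "finished":
            finished_at[timestamp] += 1
        else:
            raise ValueError(f"unknown event kind: {kind!r}")

    series = []
    pending = 0
    finished = 0
    # Walk the distinct timestamps in order, keeping running totals.
    for timestamp in sorted(set(submitted_at) | set(finished_at)):
        pending += submitted_at[timestamp] - finished_at[timestamp]
        finished += finished_at[timestamp]
        series.append((timestamp, pending, finished))
    return series
```

Because events are reduced to counts per timestamp, the size of the result depends on the number of distinct timestamps rather than on the number of tasks.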

“…Several techniques have been developed to cope with the negative impact of job failures on the execution of scientific workflows. The most common technique is to retry the failed job [17]-[19]. However, retrying a clustered job can be expensive since completed tasks within the job usually need to be recomputed, thereby resource cycles are wasted.…”

Section: Introduction (mentioning)

confidence: 99%

Dynamic and Fault-Tolerant Clustering for Scientific Workflows

Chen, Silva, Deelman, et al. (2016)

IEEE Trans. Cloud Comput.

Task clustering has proven to be an effective method to reduce execution overhead and to improve the computational granularity of scientific workflow tasks executing on distributed resources. However, a job composed of multiple tasks may have a higher risk of suffering from failures than a single task job. In this paper, we conduct a theoretical analysis of the impact of transient failures on the runtime performance of scientific workflow executions. We propose a general task failure modeling framework that uses a Maximum Likelihood estimation-based parameter estimation process to model workflow performance. We further propose 3 fault-tolerant clustering strategies to improve the runtime performance of workflow executions in faulty execution environments. Experimental results show that failures can have significant impact on executions where task clustering policies are not fault-tolerant, and that our solutions yield makespan improvements in such scenarios. In addition, we propose a dynamic task clustering strategy to optimize the workflow's makespan by dynamically adjusting the clustering granularity when failures arise. A trace-based simulation of five real workflows shows that our dynamic method is able to adapt to unexpected behaviors, and yields better makespans when compared to static methods.
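The cost of retrying a clustered job, noted in the citation statement above, can be illustrated with a deliberately simple model. It is not the failure model of the cited paper. The assumptions are: every task execution fails independently with probability `p_fail`, a clustered job succeeds only if all of its tasks succeed, and a failed clustered job reruns all of its tasks from scratch. The function names are hypothetical.

```python
def _check(num_tasks: int, p_fail: float) -> None:
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    if not 0.0 <= p_fail < 1.0:
        raise ValueError("p_fail must be in [0, 1)")


def expected_task_runs_unclustered(num_tasks: int, p_fail: float) -> float:
    """Expected task executions when each task is retried on its own.

    Each task needs 1 / (1 - p_fail) attempts on average (geometric).
    """
    _check(num_tasks, p_fail)
    return num_tasks / (1.0 - p_fail)


def expected_task_runs_clustered(num_tasks: int, p_fail: float) -> float:
    """Expected task executions when all tasks form one clustered job.

    Assumes every job attempt runs all tasks; the job succeeds with
    probability (1 - p_fail) ** num_tasks, so it needs the reciprocal
    of that many attempts on average.
    """
    _check(num_tasks, p_fail)
    return num_tasks / (1.0 - p_fail) ** num_tasks
```

Under these assumptions the two strategies coincide for a single task, and the clustered variant grows more expensive as the cluster size or the failure probability increases.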
