Improving reliability is one of the major concerns of scientific workflow scheduling in clouds. The ever-growing computational complexity and data size of workflows present challenges to fault-tolerant workflow scheduling. Therefore, it is essential to design a cost-effective fault-tolerant scheduling approach for large-scale workflows. In this paper, we propose a dynamic fault-tolerant workflow scheduling (DFTWS) approach with hybrid spatial and temporal re-execution schemes. First, DFTWS calculates the time attributes of tasks and identifies the critical path of workflow in advance. Then, DFTWS assigns appropriate virtual machine (VM) for each task according to the task urgency and budget quota in the phase of initial resource allocation. Finally, DFTWS performs online scheduling, which makes real-time fault-tolerant decisions based on failure type and task criticality throughout workflow execution. The proposed algorithm is evaluated on real-world workflows. Furthermore, the factors that affect the performance of DFTWS are analyzed. The experimental results demonstrate that DFTWS achieves a trade-off between high reliability and low cost objectives in cloud computing environments.
Cloud computing has attracted a growing number of enterprises to move their business to the cloud because of the associated operational and cost benefits. Improving availability is one of the major concerns of cloud application owners because modern applications generally comprise a large number of components and failures are common at scale. Fault tolerance enables an application to continue operating properly when failure occurs, but fault tolerance strategy is typically employed for the most important components because of financial concerns. Therefore, identifying important components has become a critical research issue. To address this problem, we propose a failure-sensitive structure-based component ranking approach (FSCRank), which integrates component failure impact and application structure information into component importance evaluation. An iterative ranking algorithm is developed according to the structural characteristics of cloud applications. The experimental results show that FSCRank outperforms the other two structure-based ranking algorithms for cloud applications. In addition, factors that affect application availability optimization are analyzed and summarized. The experimental results suggest that the availability of cloud applications can be greatly improved by implementing fault tolerance strategy for the important components identified by FSCRank.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.