Providing Fault-Tolerance in Unreliable Grid Systems Through Adaptive Checkpointing and Replication

Chtepen, Maria; Claeys, Filip; Dhoedt, Bart; De Turck, Filip; Vanrolleghem, Peter; Demeester, Piet

doi:10.1007/978-3-540-72584-8_60

Cited by 6 publications

(2 citation statements)

References 6 publications

(9 reference statements)


“…Jiang and Zhou [17] suggested a fault-tolerant algorithm for scheduling jobs by matching the resource's trust level and the user's security, the number of copies will be determined according to the security level of the network, which is variable. Chtepen et al [18] introduced a heuristic schedule that relies on replicating functions and rearranging unsuccessful tasks using real-time network state information rather than relying on scheduled job data.…”

Section: Related Work (mentioning)

confidence: 99%

Customizing the minimum number of replicas for achieving fault tolerance in a cloud/grid environment

S. Almhanna, A. Murshedi, Al-Turaihi et al. 2024. Bulletin EEI


Networks consist of numerous resources; it is crucial not to overlook fault tolerance and consider it during planning. This is because errors during implementation can result in wasted time and effort, thereby squandering these resources. One solution to address this issue effectively is to implement the task on multiple resources to minimize the occurrence of failed tasks. However, employing an unspecified or fixed number of resources can lead to the depletion of network resources and the overall failure of the network. Replication plays a pivotal role in enhancing data availability in distributed systems. By storing data in multiple locations, users can still access it even if some copies are unavailable due to site failure. Many replication-based algorithms utilize a predetermined number of iterations per function, which may consume excessive network resources, even if the ongoing task does not require such abundant resources. This paper proposes task replication as a viable mechanism for an efficient and fault-tolerant scheduling system. We introduce an algorithm that dynamically selects the optimal and minimal number of replicas based on the network's failure history. This approach aims to minimize the failure rate during task execution.
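The abstract does not spell out how the replica count is derived, so the following is only a minimal sketch of one way such a rule could look. It assumes replica failures are independent, estimates a per-attempt failure probability from a hypothetical boolean failure history using Laplace smoothing, and picks the smallest replica count whose joint failure probability falls under a target. The function names, the `target_success` threshold, and the `max_replicas` cap are illustrative assumptions, not taken from the paper.

```python
def estimate_failure_probability(history):
    """Estimate per-attempt failure probability from past outcomes.

    `history` is a hypothetical list of booleans (True = the attempt failed).
    Laplace smoothing keeps the estimate away from exactly 0 or 1 when the
    history is short; an empty history yields 0.5.
    """
    failures = sum(1 for failed in history if failed)
    return (failures + 1) / (len(history) + 2)


def min_replicas(p_fail, target_success=0.99, max_replicas=10):
    """Smallest n such that p_fail**n <= 1 - target_success, capped at max_replicas.

    Assumes replicas fail independently, which real grids may violate
    (for example, correlated site outages).
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be in [0, 1]")
    if not 0.0 < target_success < 1.0:
        raise ValueError("target_success must be in (0, 1)")
    if max_replicas < 1:
        raise ValueError("max_replicas must be at least 1")

    allowed_failure = 1.0 - target_success
    n = 1
    joint_failure = p_fail  # probability that all n replicas fail
    while joint_failure > allowed_failure and n < max_replicas:
        n += 1
        joint_failure *= p_fail
    return n


if __name__ == "__main__":
    # One failure in four recorded attempts (illustrative data only).
    p = estimate_failure_probability([True, False, False, False])
    print(p, min_replicas(p, target_success=0.99))
```

The cap matters for the concern the abstract raises: without it, a resource pool with a very poor failure history would request an unbounded number of replicas and exhaust the network.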


“…So, the resource failure rate (FR) is used in this paper to determine the checkpoint interval and the number of checkpoints instead of using the resource fault index. The scheduler or the broker of the present scheduling systems [5,10,13,14,16] selects resources according to the response time combined with the resource fault index to execute the job. If the selected resource is failed and it is the only available resource that can execute the job at that time, the job must wait for that resource to join the system again and become available.…”

Section: Introduction (mentioning)

confidence: 99%

A job checkpointing system for computational grids

Amoon, 2013. Open Computer Science


Fault tolerance is an important property in computational grids since the resources are geographically distributed. Job checkpointing is one of the most commonly used techniques for providing fault tolerance in computational grids. The efficiency of checkpointing depends on the choice of the checkpoint interval: an inappropriate interval can delay job execution. In this paper, a fault-tolerant scheduling system based on the checkpointing technique is presented and evaluated. When scheduling a job, the system uses both the average failure time and the failure rate of grid resources, combined with resource response time, to generate scheduling decisions. The system uses the failure rate of the assigned resources to calculate the checkpoint interval for each job. Extensive simulation experiments are conducted to quantify the performance of the proposed system. Experiments have shown that the proposed system can considerably improve throughput, turnaround time, grid load and failure tendency of computational grids.
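The abstract states that the checkpoint interval is computed from the failure rate of the assigned resources but does not give the formula. As an illustration only, the sketch below substitutes Young's classic first-order approximation, interval = sqrt(2 * C / lambda), where C is the cost of writing one checkpoint and lambda is the failure rate. This is a stand-in and not the cited system's method; the function and parameter names are hypothetical.

```python
import math


def young_checkpoint_interval(failure_rate, checkpoint_cost):
    """Young's first-order approximation of the checkpoint interval.

    failure_rate    -- failures per unit time (1 / mean time between failures)
    checkpoint_cost -- time needed to write one checkpoint, in the same unit

    The approximation is reasonable when checkpoint_cost is small compared
    with the mean time between failures.
    """
    if failure_rate <= 0 or checkpoint_cost <= 0:
        raise ValueError("failure_rate and checkpoint_cost must be positive")
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


def checkpoint_count(job_length, interval):
    """Number of checkpoints taken strictly before the job finishes."""
    if job_length <= 0 or interval <= 0:
        raise ValueError("job_length and interval must be positive")
    return max(0, math.ceil(job_length / interval) - 1)


if __name__ == "__main__":
    # Illustrative numbers: one failure every two hours on average,
    # nine seconds to write a checkpoint, a one-hour job.
    interval = young_checkpoint_interval(failure_rate=1 / 7200, checkpoint_cost=9.0)
    print(f"interval = {interval:.1f} s")
    print("checkpoints:", checkpoint_count(job_length=3600, interval=interval))
```

The sketch shows the trade-off the abstract alludes to: a higher failure rate shortens the interval and so increases the number of checkpoints, while a higher checkpoint cost lengthens it.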


Dynamic and Adaptive Fault Tolerant Scheduling With QoS Consideration in Computational Grid

Haider, Nazir. 2017. IEEE Access
