2007
DOI: 10.1007/978-3-540-72584-8_60

Providing Fault-Tolerance in Unreliable Grid Systems Through Adaptive Checkpointing and Replication

Abstract: As grids typically consist of autonomously managed subsystems with strongly varying resources, fault-tolerance forms an important aspect of the scheduling process of applications. Two well-known techniques for providing fault-tolerance in grids are periodic task checkpointing and replication. Both techniques mitigate the amount of work lost due to changing system availability but can introduce significant runtime overhead. The latter largely depends on the length of checkpointing interval and the chos…

Citations: cited by 6 publications (2 citation statements)
References: 6 publications (9 reference statements)
“…Jiang and Zhou [17] suggested a fault-tolerant algorithm for scheduling jobs by matching the resource's trust level and the user's security, the number of copies will be determined according to the security level of the network, which is variable. Chtepen et al [18] introduced a heuristic schedule that relies on replicating functions and rearranging unsuccessful tasks using real-time network state information rather than relying on scheduled job data.…”
Section: Related Work (mentioning)
confidence: 99%
“…So, the resource failure rate (FR) is used in this paper to determine the checkpoint interval and the number of checkpoints instead of using the resource fault index. The scheduler or the broker of the present scheduling systems [5,10,13,14,16] selects resources according to the response time combined with the resource fault index to execute the job. If the selected resource is failed and it is the only available resource that can execute the job at that time, the job must wait for that resource to join the system again and become available.…”
Section: Introduction (mentioning)
confidence: 99%