Abstract: Grid computing is a form of distributed computing used mainly to virtualize and utilize geographically distributed idle resources. A grid is a distributed computational and storage environment, often composed of heterogeneous, autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in the loss or delay of executing jobs. To ensure good performance, fault tolerance should be taken into account. Here we address fault tolerance in terms of resource failure. Commonly utilized …
“…In their work [10], Theresa et al propose two dynamic checkpoint strategies: Last Failure time-based Checkpoint Adaptation (LFCA) and Mean Failure time-based Checkpoint Adaptation (MFCA), which take into account the stability of the system and the probability of failure concerning individual resources.…”
Scientific workflows are data- and compute-intensive; thus, they may run for days or even weeks on parallel and distributed infrastructures such as grids, supercomputers, and clouds. In these high-performance computing infrastructures, the number of failures that can arise during scientific-workflow enactment can be high, so the use of fault-tolerance techniques is unavoidable. The most frequently used fault-tolerance technique is taking checkpoints from time to time; when a failure is detected, the last consistent state is restored. One of the most critical factors with a great impact on the effectiveness of the checkpointing method is the checkpointing interval. In this work, we propose a Static (Wsb) and an Adaptive (AWsb) Workflow Structure Based checkpointing algorithm. Our results showed that, compared to the optimal checkpointing strategy, the static algorithm may decrease the checkpointing overhead by as much as 33% without affecting the total processing time of workflow execution. The adaptive algorithm may further decrease this overhead while keeping the overall processing time at its necessary minimum.
“…In that work, by means of simulation, they propose an approach similar to ours, trying both to increase utilization and to meet job deadlines, but without providing any differentiation of QoS levels. These rescheduling techniques can also be used to provide fault tolerance, such as in the work presented in [52], where this is achieved by changing the frequency of the checkpointing process based on the current status and failure history of the resource, and by rescheduling jobs when those failures happen. However, they do not migrate jobs to increase utilization or QoS differentiation amongst users.…”
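The quoted passage describes adapting the checkpoint frequency to a resource's failure history. As a minimal sketch of that idea (the function name, scaling rule, and parameters are illustrative assumptions, not the mechanism of [52]), one simple policy scales a base interval with the resource's observed stability:

```python
def adapted_interval(base_interval, observed_mtbf, reference_mtbf):
    """Illustrative checkpoint-interval adaptation (hypothetical rule).

    base_interval  -- checkpoint interval (s) tuned for a reference resource
    observed_mtbf  -- mean time between failures measured on this resource (s)
    reference_mtbf -- MTBF assumed when base_interval was chosen (s)

    A resource failing twice as often as the reference checkpoints
    twice as frequently; a more stable one checkpoints less often.
    """
    return base_interval * (observed_mtbf / reference_mtbf)

# A resource with half the reference MTBF halves its interval:
interval = adapted_interval(600, observed_mtbf=3600, reference_mtbf=7200)
```

This is only one plausible adaptation rule; history-based schemes may also weight recent failures more heavily or bound the interval between fixed limits.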
From volunteer to trustable computing: Providing QoS-aware scheduling mechanisms for multigrid computing environments.
Future Generation Computer Systems,
Abstract: The exploitation of service-oriented technologies, such as Grid computing, is being boosted by the current service-oriented economy trend, leading to a growing need for Quality of Service (QoS) mechanisms. However, Grid computing was created to provide vast amounts of computational power in a best-effort way. Providing QoS guarantees is therefore a very difficult and complex task due to the distributed and heterogeneous nature of the resources, especially volunteer computing resources (e.g., desktop resources). The scope of this paper is to provide integrated multi-QoS support suitable for Grid computing environments made of both dedicated and volunteer resources, even taking advantage of that fact. The QoS is provided through SLAs by exploiting the different available scheduling mechanisms in a coordinated way and applying appropriate resource-usage optimization techniques. It is based on the differentiated use of reservations and scheduling-in-advance techniques, enhanced with the integration of rescheduling techniques that improve allocation decisions already made, achieving higher resource utilization while still ensuring the agreed QoS. As a result, our proposal enhances best-effort Grid environments by providing QoS-aware scheduling capabilities. This proposal has been validated by means of a set of experiments performed in a real Grid testbed. Results show how the proposed framework effectively harnesses the specific capabilities of the underlying resources to provide every user with the desired QoS level while, at the same time, optimizing resource usage.
“…The authors of [5,6,7,8] make assumptions or gather statistics about the failure distribution of individual resources or resource systems and, based on these values and calculations, adjust fault-tolerance mechanisms. For example, Theresa et al propose in their work [8] two dynamic checkpoint strategies: Last Failure time based Checkpoint Adaptation (LFCA) and Mean Failure time based Checkpoint Adaptation (MFCA), which take into account the stability of the system and the probability of failure concerning the individual resources. Young, as early as 1974, defined in [5] his formula for the optimum periodic checkpoint interval, which is based on the checkpointing cost and the mean time between failures (MTBF), under the assumption that failure intervals follow an exponential distribution.…”
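Young's formula mentioned above gives the optimum periodic checkpoint interval as the square root of twice the checkpoint cost times the MTBF. A short sketch of that calculation (the function name and sample values are illustrative, not from the cited papers):

```python
import math

def young_interval(checkpoint_cost, mtbf):
    """Young's (1974) first-order optimum periodic checkpoint interval.

    checkpoint_cost -- time to take one checkpoint (s)
    mtbf            -- mean time between failures (s)

    Assumes exponentially distributed failure intervals and
    checkpoint_cost much smaller than mtbf.
    """
    return math.sqrt(2 * checkpoint_cost * mtbf)

# e.g. a 60 s checkpoint against a 6-hour MTBF:
interval = young_interval(60, 6 * 3600)  # ≈ 1610 s, roughly 27 minutes
```

Note how the interval grows only with the square root of the MTBF: a 4x more reliable resource warrants checkpointing only half as often, which is why adaptive schemes like LFCA/MFCA revisit the interval as failure statistics change.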