Abstract: Grid computing is a form of distributed computing used mainly to virtualize and utilize geographically distributed idle resources. A grid is a distributed computational and storage environment, often composed of heterogeneous, autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in the loss or delay of executing jobs. To ensure good performance, fault tolerance should be taken into account. Here we address fault tolerance in terms of resource failure. Commonly utilized …
“…In their work [10], Theresa et al propose two dynamic checkpoint strategies: Last Failure time-based Checkpoint Adaptation (LFCA) and Mean Failure time-based Checkpoint Adaptation (MFCA), which take into account the stability of the system and the probability of failure concerning individual resources.…”
Scientific workflows are data- and compute-intensive; thus, they may run for days or even weeks on parallel and distributed infrastructures such as grids, supercomputers, and clouds. In these high-performance computing infrastructures, the number of failures that can arise during scientific-workflow enactment can be high, so the use of fault-tolerance techniques is unavoidable. The most frequently used fault-tolerance technique is taking checkpoints from time to time; when a failure is detected, the last consistent state is restored. One of the most critical factors with a great impact on the effectiveness of the checkpointing method is the checkpointing interval. In this work, we propose a Static (Wsb) and an Adaptive (AWsb) Workflow Structure Based checkpointing algorithm. Our results showed that, compared to the optimal checkpointing strategy, the static algorithm may decrease the checkpointing overhead by as much as 33% without affecting the total processing time of workflow execution. The adaptive algorithm may further decrease this overhead while keeping the overall processing time at its necessary minimum.
“…In that work, by means of simulation, they propose an approach similar to ours, trying both to increase utilization and to meet job deadlines, but without providing any differentiation of QoS levels. These rescheduling techniques can also be used to provide fault tolerance, such as in the work presented in [52], where this is achieved by changing the frequency of the checkpointing process based on the current status and failure history of the resource, and by rescheduling jobs when those failures happen. However, they do not migrate jobs to increase utilization or QoS differentiation amongst users.…”
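The quoted passage describes adapting the checkpoint frequency to a resource's failure history. As a minimal sketch of that idea (the function name, scaling rule, and parameters are illustrative assumptions, not the mechanism of [52]), one simple policy scales a base interval with the resource's observed stability:

```python
def adapted_interval(base_interval, observed_mtbf, reference_mtbf):
    """Illustrative checkpoint-interval adaptation (hypothetical rule).

    base_interval  -- checkpoint interval (s) tuned for a reference resource
    observed_mtbf  -- mean time between failures measured on this resource (s)
    reference_mtbf -- MTBF assumed when base_interval was chosen (s)

    A resource failing twice as often as the reference checkpoints
    twice as frequently; a more stable one checkpoints less often.
    """
    return base_interval * (observed_mtbf / reference_mtbf)

# A resource with half the reference MTBF halves its interval:
interval = adapted_interval(600, observed_mtbf=3600, reference_mtbf=7200)
```

This is only one plausible adaptation rule; history-based schemes may also weight recent failures more heavily or bound the interval between fixed limits.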
From volunteer to trustable computing: Providing QoS-aware scheduling mechanisms for multigrid computing environments.
Future Generation Computer Systems,
Abstract: The exploitation of service-oriented technologies, such as Grid computing, is being boosted by the current service-oriented economy trend, leading to a growing need for Quality of Service (QoS) mechanisms. However, Grid computing was created to provide vast amounts of computational power in a best-effort way. Providing QoS guarantees is therefore a very difficult and complex task due to the distributed and heterogeneous nature of the resources, especially volunteer computing resources (e.g., desktop resources). The scope of this paper is to provide integrated multi-QoS support suitable for Grid computing environments made of both dedicated and volunteer resources, even taking advantage of that fact. The QoS is provided through SLAs by exploiting the different available scheduling mechanisms in a coordinated way and applying appropriate resource-usage optimization techniques. It is based on the differentiated use of reservations and scheduling-in-advance techniques, enhanced with the integration of rescheduling techniques that improve allocation decisions already made, achieving higher resource utilization while still ensuring the agreed QoS. As a result, our proposal enhances best-effort Grid environments by providing QoS-aware scheduling capabilities. This proposal has been validated by means of a set of experiments performed in a real Grid testbed. Results show how the proposed framework effectively harnesses the specific capabilities of the underlying resources to provide every user with the desired QoS level while, at the same time, optimizing resource usage.
“…The authors of [5,6,7,8] make assumptions or gather statistics about the failure distribution of individual resources or resource systems and, based on these values and calculations, adjust fault-tolerance mechanisms. For example, Theresa et al propose in their work [8] two dynamic checkpoint strategies: Last Failure time based Checkpoint Adaptation (LFCA) and Mean Failure time based Checkpoint Adaptation (MFCA), which take into account the stability of the system and the probability of failure concerning the individual resources. Young, as early as 1974, defined in [5] his formula for the optimum periodic checkpoint interval, which is based on the checkpointing cost and the mean time between failures (MTBF), under the assumption that failure intervals follow an exponential distribution.…”
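Young's formula mentioned above gives the optimum periodic checkpoint interval as the square root of twice the checkpoint cost times the MTBF. A short sketch of that calculation (the function name and sample values are illustrative, not from the cited papers):

```python
import math

def young_interval(checkpoint_cost, mtbf):
    """Young's (1974) first-order optimum periodic checkpoint interval.

    checkpoint_cost -- time to take one checkpoint (s)
    mtbf            -- mean time between failures (s)

    Assumes exponentially distributed failure intervals and
    checkpoint_cost much smaller than mtbf.
    """
    return math.sqrt(2 * checkpoint_cost * mtbf)

# e.g. a 60 s checkpoint against a 6-hour MTBF:
interval = young_interval(60, 6 * 3600)  # ≈ 1610 s, roughly 27 minutes
```

Note how the interval grows only with the square root of the MTBF: a 4x more reliable resource warrants checkpointing only half as often, which is why adaptive schemes like LFCA/MFCA revisit the interval as failure statistics change.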