Load balancing in the presence of random node failure and recovery

Dhakal; Hayat; Pezoa; Abdallah, Chaouki T.; Birdwell; Chiasson

doi:10.1109/ipdps.2006.1639293

Cited by 9 publications

(22 citation statements)

References 11 publications


“…However, the available literature on distributed computing in such uncertain environments primarily considers reactive techniques, where a node failure is addressed only after its occurrence [7]. One of the few exceptions is the paper of Dhakal et al [12] that presents two preemptive load-balancing policies for a heterogeneous distributed computing system with wireless links between nodes. Preemptiveness in this case implies adjusting actions to compensate for the possibility of node failure/recovery.…”

Section: Related Work (mentioning)

confidence: 99%

Probabilistic resource allocation in heterogeneous distributed systems with random failures

Shestak

Chong

Maciejewski

et al. 2012

Journal of Parallel and Distributed Computing



“…Regarding the transfer times, our assumptions are justified according to our prior work [9], [10], [16] and the empirical data obtained from the experiments conducted over the DC architecture to be discussed in Section 3. In addition, we have assumed that the mean transfer time of the ith group of tasks being transferred to the kth node follows the first-order approximation:…”

Section: Assumption A2 (Independence of the Random Times) (mentioning)

confidence: 99%

“…In general, however, the above partitions p jk may not be effective and must be adjusted in order to compensate for the effects of the random transfer times. The load to be migrated from the jth to the kth must be adjusted according to what is called the load-balancing gain [9], [10], [16], [22], which is denoted as K jk , yielding…”

Section: Distributed Load-balancing Policy (mentioning)

confidence: 99%
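
The excerpt above breaks off before its formula, so the exact expression is not reproduced here. As a rough illustration only, the sketch below assumes a simple multiplicative form in which the load sent from node j to node k is the excess load at node j, scaled by the partition p_jk and by a gain K_jk between 0 and 1. The function name, the argument names, and the rounding rule are hypothetical and are not taken from the cited paper.

```python
import math


def loads_to_transfer(excess, partitions, gains):
    """Return the number of tasks node j would send to each recipient k.

    Assumed (hypothetical) form: floor(K_jk * p_jk * excess).
      excess     -- tasks node j holds above its fair share
      partitions -- dict mapping k to p_jk, fractions of the excess
      gains      -- dict mapping k to K_jk, each in [0, 1]
    """
    return {
        k: math.floor(gains[k] * p_jk * excess)
        for k, p_jk in partitions.items()
    }


# Node j has 100 excess tasks, split evenly between nodes 1 and 2.
# A gain below 1 for node 2 holds back part of its share, for example
# because the link to node 2 has long or highly variable transfer times.
print(loads_to_transfer(100, {1: 0.5, 2: 0.5}, {1: 1.0, 2: 0.5}))
```

A gain of 1 sends the full partitioned share, while a gain of 0 sends nothing; intermediate values trade off balancing against the cost of random transfer delays.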

“…The role of LB in improving the performance of DCSs has been studied vastly considering a number of performance metrics; these include the average response time of an entire workload [1], [9], the probability of successfully serving an entire workload [10]- [16], the probability of serving a workload within a given amount of time [17], the average queue-length of a node [18], [19], and the total sum of communication and service times [5], [20]. In addition, the problem of LB has been studied under both static and dynamic scenarios.…”

Section: Introduction (mentioning)

confidence: 99%

“…In these works, the authors have considered random communication delays as well as random server failure. Additionally, in an earlier work we have studied the effect of node failure and recovery on the average response time of a workload served by a two-node DCS [10].…”

Section: Introduction (mentioning)

confidence: 99%


Maximizing Service Reliability in Distributed Computing Systems with Random Node Failures: Theory and Implementation

Pezoa

Dhakal

Hayat

2010

IEEE Trans. Parallel Distrib. Syst.


Abstract: In distributed computing systems (DCSs) where server nodes can fail permanently with non-zero probability, the system performance can be assessed by means of the service reliability, defined as the probability of serving all the tasks queued in the DCS before all the nodes fail. This paper presents a rigorous probabilistic framework to analytically characterize the service reliability of a DCS in the presence of communication uncertainties and stochastic topological changes due to node deletions. The framework considers a system composed of heterogeneous nodes with stochastic service and failure times and a communication network imposing random tangible delays. The framework also permits arbitrarily specified, distributed load-balancing actions to be taken by the individual nodes in order to improve the service reliability. The presented analysis is based upon a novel use of the concept of stochastic regeneration, which is exploited to derive a system of difference-differential equations characterizing the service reliability. The theory is further utilized to optimize certain load-balancing policies for maximal service reliability; the optimization is carried out by means of an algorithm that scales linearly with the number of nodes in the system. The analytical model is validated using both Monte-Carlo simulations and experimental data collected from a DCS testbed.
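
To make the notion of service reliability concrete, here is a minimal Monte Carlo sketch. It is not the paper's model: it assumes exponentially distributed service and failure times, no load balancing, and no task transfers, so tasks stranded on a failed node count as lost. Under those assumptions the reliability has a simple closed form, which the sketch uses as a sanity check. All function names and parameter values are hypothetical.

```python
import random


def simulate_once(nodes, rng):
    """One trial: True if every node finishes its queue before it fails.

    nodes is a list of (tasks, service_rate, failure_rate) tuples.
    """
    for tasks, service_rate, failure_rate in nodes:
        failure_time = rng.expovariate(failure_rate)
        busy_time = sum(rng.expovariate(service_rate) for _ in range(tasks))
        if busy_time > failure_time:
            return False  # node failed with tasks still queued
    return True


def estimate_reliability(nodes, trials=20000, seed=0):
    """Monte Carlo estimate of the probability that all tasks are served."""
    rng = random.Random(seed)
    successes = sum(simulate_once(nodes, rng) for _ in range(trials))
    return successes / trials


def closed_form(nodes):
    """Exact value for this simplified model.

    Each task completes before the node fails with probability
    mu / (mu + lam), independently, by the memoryless property.
    """
    result = 1.0
    for tasks, mu, lam in nodes:
        result *= (mu / (mu + lam)) ** tasks
    return result


nodes = [(5, 1.0, 0.05), (5, 2.0, 0.05)]
print(estimate_reliability(nodes), closed_form(nodes))
```

The two printed values should agree to within sampling error. Adding load balancing and random transfer delays, as the abstract describes, removes the simple product form, which is why the paper develops an analytical framework instead.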


Dynamic Load Balancing for Robust Distributed Computing in the Presence of Topological Impairments

Hayat

Pezoa

Dietz

et al. 2008

Wiley Handbook of Science and Technology for Homeland Security


The purpose of any distributed computing system (DCS) is to offer a flexible, reliable and powerful computing platform. With the advances in mobile computing, wireless communications and sensor networks, DCSs have emerged in new applications such as wireless sensor networks (WSNs), military battlefield awareness, surveillance and threat detection, to name a few. These new application areas introduce new challenges to DCSs when operated or deployed in harsh or threat-prone environments. For instance, in WSNs deployed in a military battlefield, the computing elements (CEs) of a DCS join and leave the DCS at any time in a stochastic fashion. More generally, factors such as limited or intermittent communication resources, CEs' power constraints or long-term physical damage of the CEs, can result in random topological changes in the DCS, which, in turn, can severely degrade their performance and reliability. Many of these factors can be attributable to physical attacks on our information infrastructure, of which weapons of mass destruction (WMD) are an important example. This observation has triggered government agencies, such as the Defense Threat Reduction Agency, to launch research initiatives in network science to understand the extent of damage that can be inflicted upon networks in the event of attacks and also to develop strategies to increase the robustness of networks when a threat is present. In this article, we review modern dynamic load balancing (DLB) techniques and their mathematical stochastic models that can be exploited by DCS developers to increase the DCS's robustness to random topological changes, and at the same time, to efficiently use the available computing resources of the system in the presence of communication uncertainty and CE dysfunction. Two scenarios are considered: one where CEs can fail and recover at random instants and another where CEs can fail permanently. Under the first scenario, we seek to minimize the average response time of a given application. In the second scenario, the goal is to maximize the probability of successfully running an entire application. DLB policies are tested using a small-scale DCS environment and compared to

