Proceedings International Parallel and Distributed Processing Symposium
DOI: 10.1109/ipdps.2003.1213241

Recovery schemes for high availability and high performance distributed real-time computing

Abstract: Clusters and distributed systems offer fault tolerance and high performance through load sharing, and are thus attractive in real-time applications. When all computers are up and running, we would like the load to be evenly distributed among the computers. When one or more computers fail, the load must be redistributed. The redistribution is determined by the recovery scheme. The recovery scheme should keep the load as evenly distributed as possible even when the most unfavorable combinations of computers break down…
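
The idea in the abstract can be illustrated with a small sketch. It is not the paper's algorithm; it only evaluates a given recovery scheme by brute force. The assumptions here are illustrative: every computer runs one process that counts as one unit of load, a recovery scheme is represented as an ordered list of alternative computers per process, and the names `redistribute`, `worst_case_load`, `ring`, and `paired` are hypothetical.

```python
from itertools import combinations


def redistribute(recovery_lists, failed):
    """Return the load on each surviving computer after `failed` go down.

    Assumption: computer i normally runs one process (one unit of load),
    and recovery_lists[i] is the ordered list of alternative computers
    for that process.
    """
    n = len(recovery_lists)
    load = {c: 0 for c in range(n) if c not in failed}
    for home in range(n):
        if home not in failed:
            load[home] += 1  # process stays on its own computer
            continue
        # Walk the recovery list to the first computer that is still up.
        for alternative in recovery_lists[home]:
            if alternative not in failed:
                load[alternative] += 1
                break
        else:
            raise ValueError(f"no surviving alternative for process {home}")
    return load


def worst_case_load(recovery_lists, num_failures):
    """Largest load on any computer over all failure combinations of a given size."""
    n = len(recovery_lists)
    return max(
        max(redistribute(recovery_lists, set(failed)).values())
        for failed in combinations(range(n), num_failures)
    )


# Two hypothetical recovery schemes for four computers.
ring = [[1, 2, 3], [2, 3, 0], [3, 0, 1], [0, 1, 2]]
paired = [[1, 2, 3], [0, 3, 2], [3, 0, 1], [2, 1, 0]]

print(worst_case_load(ring, 2))    # 3
print(worst_case_load(paired, 2))  # 2
```

Under these assumptions, the `ring` scheme can pile three units of load onto one computer when two neighbours fail, while the `paired` scheme never exceeds two. This is the kind of difference a good recovery scheme is meant to capture. The exhaustive search over failure combinations is only practical for small clusters.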

Cited by 4 publications
(3 citation statements)
References 3 publications
“…In this case the recovery schemes are implemented in the external systems as second, third, fourth,... alternative destinations (alternative cluster nodes) in case the primary, secondary,... destination goes down); • there is one network address for each computer and we use IP takeover (or similar techniques); • all the work performed by a computer is done by one process or a group of related processes that share local resources and thus must be moved as one unit. The results presented here can be easily generalized to the case with a number of independent processes on each computer using the same technique as in some of our previous papers [9,10].…”
Section: Problem Formulation (mentioning)
confidence: 88%
“…Another algorithm, called Greedy, is presented in [9]. This algorithm generates the recovery schemes that give optimality for a larger number of cases than the Log algorithm, (i.e.…”
Section: Previous Research (mentioning)
confidence: 99%
“…In our system, however, we do not have the (non-volatile) memory capacity and the computational resources for a checkpointing approach. In general purpose distributed computing it is common to use redundant hardware and employ load sharing techniques to increase fault-tolerance (see, e.g., [5]). …”
Section: Related Work (mentioning)
confidence: 99%