2000
DOI: 10.1007/3-540-45591-4_168

Fault Tolerant Wide-Area Parallel Computing

Cited by 17 publications (6 citation statements)
References 6 publications
“…Wide-area FL methods can be a possible solution [118] for this kind of scenario. In wide-area FL methods, a replica of each application/algorithm runs at different transmission substations, as shown in Figure 9 [119], to avoid overloading the available computation and communication resources of that particular station. Thus, fault can be located even with less number of devices installed at different end-terminals of transmission links.…”
Section: Wide-area FL Approach (mentioning)
confidence: 99%
“…For more general models, including also data parallel programs, we have shown the notable case of ProActive: from the programming model properties, its designers have developed an optimized checkpointing protocol. Some researchers also focused on the data parallel model, but restricted to programs featuring consistency steps (see Section 1) and with the only goal of defining cost models [58].…”
Section: Discussion (mentioning)
confidence: 99%
“…In this way consistency could be guaranteed at those steps at the cost of periodic global synchronizations (e.g. see [58]). As described in the following section, in this paper we have chosen to avoid such period synchronizations to introduce fully asynchronous checkpointing protocols.…”
Section: Example Of Inconsistent States In Data Parallel Programs (mentioning)
confidence: 99%
“…The most common failure modes include machine faults in which hosts go down and get rebooted, and network faults, where links go down. Finding a single monolithic solution for fault tolerance that is acceptable to all user applications is unlikely [2]. Among the recovery models, there are strategies such as rollback recovery, in which you can go back to a previous correct state that has been previously saved.…”
Section: Introduction (mentioning)
confidence: 99%
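
The rollback recovery strategy mentioned in the last statement (return to a previously saved correct state after a fault) can be illustrated with a minimal sketch. It is not taken from the cited paper or from any of the citing works; the class `CheckpointedTask`, the function `run`, and the parameters `fail_at` and `checkpoint_every` are hypothetical names chosen for illustration, and the "fault" is simulated with an exception in a single process.

```python
import copy


class CheckpointedTask:
    """Toy task whose state can be saved and restored (hypothetical example)."""

    def __init__(self):
        self.state = {"step": 0, "total": 0}
        # The initial state doubles as the first checkpoint.
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save a copy of the current, known-good state.
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Discard the current state and restore the last saved one.
        self.state = copy.deepcopy(self._checkpoint)

    def advance(self, value, fail=False):
        # The step counter is updated first, so a fault leaves the state
        # half-updated, which is why a rollback is needed.
        self.state["step"] += 1
        if fail:
            raise RuntimeError("simulated host fault")
        self.state["total"] += value


def run(values, fail_at=None, checkpoint_every=2):
    """Sum `values`, injecting one simulated fault at index `fail_at`."""
    task = CheckpointedTask()
    failed_once = False
    i = 0
    while i < len(values):
        try:
            inject = (i == fail_at and not failed_once)
            task.advance(values[i], fail=inject)
        except RuntimeError:
            failed_once = True
            task.rollback()
            # Resume from the step recorded in the restored checkpoint.
            i = task.state["step"]
            continue
        i += 1
        if i % checkpoint_every == 0:
            task.checkpoint()
    return task.state


if __name__ == "__main__":
    print(run([1, 2, 3, 4, 5], fail_at=3))
```

The work between the last checkpoint and the fault is redone after recovery, so a shorter `checkpoint_every` interval trades more frequent saves for less recomputation.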