Proceedings of the 22nd European MPI Users' Group Meeting 2015
DOI: 10.1145/2802658.2802670

Sliding Substitution of Failed Nodes

Abstract: This paper considers the questions of how spare nodes should be allocated, how to substitute them for faulty nodes, and how much the communication performance is affected by such a substitution. The third question stems from the modification of the rank mapping by node substitutions, which can incur additional message collisions. In a stencil computation, rank mapping is done in a straightforward way on a Cartesian network without incurring any message collisions. However, once a substitution has occurred, the…

Citations: cited by 6 publications (6 citation statements)
References: 15 publications (17 reference statements)

Citation statements
“…Hori et al. [27] discussed how one can use spare nodes to restart an application which has experienced a node failure. This technique could be applied to our case.…”
Section: Related Work (mentioning)
confidence: 99%
“…The application developer is responsible for implementing recovery using those operations, choosing the type of recovery best suited for its application. A collection of works on ULFM [9,16-18,21,23,26] has investigated the applicability of ULFM and benchmarked individual operations of it. Bosilca et al. [7,8] and Katti et al. [19] propose efficient fault detection algorithms to integrate with ULFM.…”
Section: Related Work (mentioning)
confidence: 99%
“…Although there has been a large bibliography [4,5,9,11,16-18,21-23,26] discussing the programming model and prototypes of those approaches, no study has presented an in-depth performance evaluation of them: most previous works either focus on individual aspects of each approach or perform limited scale experiments. In this paper, we present an extensive evaluation using HPC proxy applications to contrast these two leading global-restart recovery approaches.…”
Section: Introduction (mentioning)
confidence: 99%
“…In theory, five message collisions, for example, means the communication time gets slower five times. On the K computer, only three times slower communication time was observed because simultaneous four message sending takes 1.7 times of the time of sending one message (3 ≈ 5/1.7) (Hori et al., 2015). One possible reason to explain this slowness (1.7 with 5P-stencil and 3.7 with 7P-stencil) is the insufficient bandwidth between the memory and the network controller chip.…”
Section: Evaluations on K, BG/Q, and TSUBAME 2.5 (mentioning)
confidence: 99%
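
The arithmetic in the last statement can be checked with a short sketch. It assumes one reading of the quoted text: with collisions, the messages sharing a link are serialized, so the cost is proportional to the collision count, while the collision-free baseline already costs 1.7 single-message times because four messages are sent at once. The function name and parameters are hypothetical and are not taken from the cited papers.

```python
def relative_slowdown(collisions: int, baseline_factor: float) -> float:
    """Estimate the slowdown relative to a collision-free stencil exchange.

    Assumption (not from the cited papers' code): serialized messages cost
    `collisions` single-message times, and the collision-free baseline costs
    `baseline_factor` single-message times.
    """
    if collisions < 1:
        raise ValueError("collisions must be at least 1")
    if baseline_factor <= 0:
        raise ValueError("baseline_factor must be positive")
    return collisions / baseline_factor


# Values quoted in the statement: five collisions, baseline factor 1.7.
print(round(relative_slowdown(5, 1.7), 2))  # prints 2.94
```

Under this reading, five collisions give a slowdown of about 2.94, which matches the roughly threefold slowdown the statement reports for the K computer.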