2021
DOI: 10.1007/978-3-030-85665-6_28

Towards High Performance Resilience Using Performance Portable Abstractions

Abstract: In the drive towards Exascale, the extreme heterogeneity of supercomputers at all levels places a major development burden on HPC applications. To this end, performance portable abstractions such as those advocated by Kokkos, RAJA and HPX are becoming increasingly popular. At the same time, the unprecedented scalability requirements of such heterogeneous components means higher failure rates, motivating the need for resilience in systems and applications. Unfortunately, state-of-art resilience techniques based…

Citations: cited by 1 publication (2 citation statements)
References: 22 publications (24 reference statements)
“…The control-flow resilience layer is a newer focus of research. Kokkos Resilience [8], [15] and Resilient HCLIB [16] both utilize existing parallel control libraries to manage control-flow during recovery, which benefits from preexisting knowledge of the application's default control-flow. The two implement resilience methods in parallel-region and asynchronous-many-task runtimes, respectively.…”
Section: Related Work (mentioning)
confidence: 99%
“…We present a comprehensive resilience system runtime which integrates process resilience, control-flow resilience, and data resilience. Our implementation uses Fenix for the process resilience component, Kokkos Resilience [8] for the control-flow resilience component, and VeloC [9] for data resilience. The result of this is a highly performant, comprehensive resilience environment consisting of runtimes that offer support for a wide array of state-of-the-art resilience strategies.…”
Section: Introduction (mentioning)
confidence: 99%