2016
DOI: 10.1017/s095679681600006x

Transparent fault tolerance for scalable functional computation

Abstract: Reliability is set to become a major concern on emergent large-scale architectures. While there are many parallel languages, and indeed many parallel functional languages, very few address reliability. The notable exception is the widely emulated Erlang distributed actor model that provides explicit supervision and recovery of actors with isolated state. We investigate scalable transparent fault tolerant functional computation with automatic supervision and recovery of tasks. We do so by developing HdpH-RS, a v…

Citations: cited by 8 publications (5 citation statements)
References: 42 publications
“…Moreover, reliability is increasingly an issue at HPC scale, and here, the statelessness of many algebraic computations means failed computations can be safely recomputed. The HdpH-RS extension tracks the location of computations and reinstates any that may have failed [60,63].…”
Section: Results (mentioning)
confidence: 99%
“…This makes the proposed approach limited in its fault tolerance, and further analysis and a clear fault model are needed. Haskell distributed parallel Haskell (HdpH) (Stewart, 2013; Stewart et al, 2013; Maier et al, 2014; Stewart et al, 2016) is a variant of distributed parallel Haskell for reliable computation. HdpH does provides monitoring and recovering capabilities but, like other proposals for distributed Haskell, its fault tolerance mechanisms are applied during runtime, not during compile time, and it does not provide a mechanism to verify the correct handling of fault classes.…”
Section: Related Work (mentioning)
confidence: 99%
“…Resilient distributed work-stealing runtime systems use fault tolerant protocols for tracking task migration under failure [5,9]. Our work focuses on the APGAS model, in which tasks are explicitly assigned to places, hence they are not migratable.…”
Section: Related Work (mentioning)
confidence: 99%