Transparent fault tolerance for scalable functional computation

Stewart, Robert; Maier, Patrick; Trinder, Phil

doi:10.1017/s095679681600006x

Cited by 8 publications

(5 citation statements)

References 42 publications

“…Moreover, reliability is increasingly an issue at HPC scale, and here, the statelessness of many algebraic computations means failed computations can be safely recomputed. The HdpH-RS extension tracks the location of computations and reinstates any that may have failed [60,63].…”

Section: Results (mentioning)

confidence: 99%

HPC‐GAP: engineering a 21st‐century high‐performance computer algebra system

Behrends

Hammond

Janjić

et al. 2016

Concurrency and Computation

Self Cite


SUMMARY: Symbolic computation has underpinned a number of key advances in Mathematics and Computer Science. Applications are typically large and potentially highly parallel, making them good candidates for parallel execution at a variety of scales from multi-core to high-performance computing systems. However, much existing work on parallel computing is based around numeric rather than symbolic computations. In particular, symbolic computing presents particular problems in terms of varying granularity and irregular task sizes that do not match conventional approaches to parallelisation. It also presents problems in terms of the structure of the algorithms and data. This paper describes a new implementation of the free open-source GAP computational algebra system that places parallelism at the heart of the design, dealing with the key scalability and cross-platform portability problems. We provide three system layers that deal with the three most important classes of hardware: individual shared memory multi-core nodes, mid-scale distributed clusters of (multi-core) nodes and full-blown high-performance computing systems, comprising large-scale tightly connected networks of multi-core nodes. This requires us to develop new cross-layer programming abstractions in the form of new domain-specific skeletons that allow us to seamlessly target different hardware levels. Our results show that, using our approach, we can achieve good scalability and speedups for two realistic exemplars, on high-performance systems comprising up to 32 000 cores, as well as on ubiquitous multi-core systems and distributed clusters. The work reported here paves the way towards full-scale exploitation of symbolic computation by high-performance computing systems, and we demonstrate the potential with two major case studies.

“…This makes the proposed approach limited in its fault tolerance, and further analysis and a clear fault model are needed. Haskell distributed parallel Haskell (HdpH) (Stewart, 2013; Stewart et al., 2013; Maier et al., 2014; Stewart et al., 2016) is a variant of distributed parallel Haskell for reliable computation. HdpH does provide monitoring and recovering capabilities but, like other proposals for distributed Haskell, its fault tolerance mechanisms are applied during runtime, not during compile time, and it does not provide a mechanism to verify the correct handling of fault classes.…”

Section: Related Work (mentioning)

confidence: 99%

Fault-tolerant functional reactive programming (extended version)

Perez

Goodloe

2020

J. Funct. Prog.


Highly critical application domains, like medicine and aerospace, require the use of strict design, implementation, and validation techniques. Functional languages have been used in these domains to develop synchronous dataflow programming languages for reactive systems. Causal stream functions and functional reactive programming (FRP) capture the essence of those languages in a way that is both elegant and robust. To guarantee that critical systems can operate under high stress over long periods of time, these applications require clear specifications of possible faults and hazards, and how they are being handled. Modeling failure is straightforward in functional languages, and many functional reactive abstractions incorporate support for failure or termination. However, handling unknown types of faults, and incorporating fault tolerance into FRP, requires a different construction and remains an open problem. This work demonstrates how to extend an existing functional reactive framework with fault tolerance features. At value level, we tag faulty signals with reliability and probability information and use random testing to inject faults and validate system properties encoded in temporal logic. At type level, we tag components with the kinds of faults they may exhibit and use type-level programming to obtain compile-time guarantees of key aspects of fault tolerance. Our approach is powerful enough to be used in systems with realistic complexity, and flexible enough to be used to guide system analysis and design, validate system properties in the presence of faults, perform runtime monitoring, and study the effects of different fault tolerance mechanisms.


“…Resilient distributed work-stealing runtime systems use fault tolerant protocols for tracking task migration under failure [5,9]. Our work focuses on the APGAS model, in which tasks are explicitly assigned to places, hence they are not migratable.…”

Section: Related Work (mentioning)

confidence: 99%

Resilient Optimistic Termination Detection for the Async-Finish Model

Hamouda

Milthorpe

2019

Lecture Notes in Computer Science


Abstract. Driven by increasing core count and decreasing mean-time-to-failure in supercomputers, HPC runtime systems must improve support for dynamic task-parallel execution and resilience to failures. The async-finish task model, adapted for distributed systems as the asynchronous partitioned global address space programming model, provides a simple way to decompose a computation into nested task groups, each managed by a 'finish' that signals the termination of all tasks within the group. For distributed termination detection, maintaining a consistent view of task state across multiple unreliable processes requires additional book-keeping when creating or completing tasks and finish-scopes. Runtime systems which perform this book-keeping pessimistically, i.e. synchronously with task state changes, add a high communication overhead compared to non-resilient protocols. In this paper, we propose optimistic finish, the first message-optimal resilient termination detection protocol for the async-finish model. By avoiding the communication of certain task and finish events, this protocol allows uncertainty about the global structure of the computation which can be resolved correctly at failure time, thereby reducing the overhead for failure-free execution. Performance results using micro-benchmarks and the LULESH hydrodynamics proxy application show significant reductions in resilience overhead with optimistic finish compared to pessimistic finish. Our optimistic finish protocol is applicable to any task-based runtime system offering automatic termination detection for dynamic graphs of non-migratable tasks.

Recent advances in high-performance computing (HPC) systems have greatly increased parallelism, with both larger numbers of nodes, and larger core counts within each node. With increased system size and complexity comes an increase in the expected rate of failures. Programmers of HPC systems must therefore address the twin challenges of efficiently exploiting available parallelism and ensuring resilience to component failures. As more industrial and scientific communities rely on HPC to drive innovation, there is a need for productive programming models for scalable resilient applications.
