Optimizing a Conjugate Gradient Solver with Non-Blocking Collective Operations

Hoefler, Torsten; Gottschling, Peter; Rehm, Wolfgang; Lumsdaine, Andrew

doi:10.1007/11846802_52

Cited by 12 publications

(10 citation statements)

References 14 publications


“…The plain MPI implementation is extracted from a state of the art work on optimizing CG solvers with non‐blocking collective operations. () This latter implementation contains roughly 900 lines of code. The same output has been reproduced using ExaShark with 150 lines of C++ code.…”

Section: Applications (mentioning)

confidence: 99%

A high‐level library for multidimensional arrays programming in computational science

Chakroun, Aa, Fraine et al. 2018

Concurrency and Computation


Summary: This paper describes ExaShark, a hybrid n‐dimensional array toolkit offered as a high‐level library for scientists to compute large‐scale simulations. It offers a global‐array–like interface while its runtime can be configured to use shared memory threading techniques, inter‐node distribution techniques, or combinations of both. ExaShark takes advantage of the latest HPC technologies, helping to scale to future generation systems. It has been used to develop several scientific applications including stencil codes, solvers, and matrix factorization algorithms. These applications are used to demonstrate that it improves on the state of the art by providing a user‐friendly, generic API without sacrificing performance.



“…The reference implementation is an open-source code 2 that uses MPI blocking and non-blocking all-to-all collective operations to implement the halo exchange [17]. [17] demonstrates the effectiveness of using non-blocking collective operations to overlap the communication and computation in the halo exchange. We decouple the halo exchange operation onto a separate group of processes, denoted as group G_1.…”

Section: Conjugate Gradient Solver (mentioning)

confidence: 99%

Preparing HPC Applications for the Exascale Era: A Decoupling Strategy

Gioiosa, Kestor, Laure 2017

2017 46th International Conference on Parallel Processing (ICPP)


Production-quality parallel applications are often a mixture of diverse operations, such as computation- and communication-intensive, regular and irregular, tightly coupled and loosely linked operations. In conventional construction of parallel applications, each process performs all the operations, which can be inefficient and seriously limit scalability, especially at large scale. We propose a decoupling strategy to improve the scalability of applications running on large-scale systems. Our strategy separates application operations onto groups of processes and enables a dataflow processing paradigm among the groups. This mechanism is effective in reducing the impact of load imbalance and increases the parallel efficiency by pipelining multiple operations. We provide a proof-of-concept implementation using MPI, the de facto programming system on current supercomputers. We demonstrate the effectiveness of this strategy by decoupling the reduce, particle communication, halo exchange and I/O operations in a set of scientific and data-analytics applications. A performance evaluation on 8,192 processes of a Cray XC40 supercomputer shows that the proposed approach can achieve up to 4× performance improvement.


“…However, our previous works involving overlap, such as optimization of a Poisson solver [5] or the optimization of a Fast Fourier Transformation [6] showed that this simple heuristic is not sufficient to achieve good overlap. The two main reasons for this have been found in a theoretical and practical analysis of nonblocking collective operations [7].…”

Section: Manual Transformation Technique (mentioning)

confidence: 99%

“…Code must often be significantly restructured to take full advantage of non-blocking collective operations. We learned in several application studies [5,6] that using non-blocking collectives can lead to performance benefits of up to 35% by overlapping computation and communication. We also showed that their usage is likely to be labor-intensive and error-prone, and may decrease code readability as well.…”

Section: Introduction (mentioning)

confidence: 99%

Leveraging non-blocking collective communication in high-performance applications

Hoefler, Gottschling, Lumsdaine 2008

Proceedings of the Twentieth Annual Symposium on Parallelism in Algorithms and Architectures

Self Cite


Although overlapping communication with computation is an important mechanism for achieving high performance in parallel programs, developing applications that actually achieve good overlap can be difficult. Existing approaches are typically based on manual or compiler-based transformations. This paper presents a pattern and library-based approach to optimizing collective communication in parallel high-performance applications, based on using non-blocking collective operations to enable overlapping of communication and computation. Common communication and computation patterns in iterative SPMD computations are used to motivate the transformations we present. Our approach provides the programmer with the capability to separately optimize communication and computation in an application, while automating the interaction between computation and communication to achieve maximum overlap. Performance results with a model application show more than a 90% decrease in communication overhead, resulting in 21% overall performance improvements.

