2006
DOI: 10.1007/11942634_17
A Case for Non-blocking Collective Operations

Abstract: Non-blocking collective operations for MPI have been in discussion for a long time. We want to contribute to this discussion, give a rationale for the usage of these operations, and assess their possible benefits. A LogGP model for the CPU overhead of collective algorithms and a benchmark to measure it are provided and show a large potential to overlap communication and computation. We show that non-blocking collective operations can provide at least the same benefits as non-blocking point-to-point…

Cited by 27 publications (14 citation statements)
References 24 publications
“…Non-blocking collective operations can move the pseudo-synchronization to the background and allow the user application to tolerate process skew to a certain extent. A detailed discussion of pseudo-synchronization and its effect on parallel program runs is given in [20,21].…”
Section: Non-blocking Collective Operations
confidence: 99%
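The skew tolerance described above can be sketched with the MPI-3 interface, which later standardized the operations this paper argues for. The snippet below starts a non-blocking barrier and keeps computing until all processes have arrived; `do_local_work` is a hypothetical application routine.

```c
/* Sketch: tolerating process skew with a non-blocking barrier.
 * Assumes the MPI-3 interface (MPI_Ibarrier / MPI_Test). */
#include <mpi.h>

void do_local_work(void);  /* hypothetical application routine */

void skew_tolerant_phase(MPI_Comm comm)
{
    MPI_Request req;
    int done = 0;

    MPI_Ibarrier(comm, &req);   /* pseudo-synchronization moves to the background */
    while (!done) {
        do_local_work();        /* useful computation instead of idling */
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }
}
```

A late process simply performs less background work before the barrier completes, so moderate skew no longer translates directly into idle time.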
“…In this work an MPI_Alltoall collective operation is used to gather data from neighbor nodes instead of using the typical MPI_Send/MPI_Recv semantics. This collective operation is partially overlapped with the computation on locally available data by using a particular non-blocking version of the MPI_Alltoall collective operation [14]. In addition, the MPI_Allreduce collective operation in the CG solver could not be overlapped with computation due to data dependencies.…”
Section: Related Work
confidence: 99%
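The overlap scheme in the statement above can be illustrated with MPI-3's `MPI_Ialltoall`: the exchange is started, computation proceeds on data that needs no remote input, and only the boundary part waits for completion. The compute routines and buffer layout here are hypothetical placeholders, not the cited work's actual code.

```c
/* Sketch: overlapping an all-to-all exchange with computation on
 * locally available data. Assumes the MPI-3 interface. */
#include <mpi.h>

void compute_local_part(const double *local, int n);     /* needs no remote data */
void compute_boundary_part(const double *recvd, int n);  /* needs exchanged data */

void exchange_and_compute(const double *sendbuf, double *recvbuf,
                          int count, const double *local, int n,
                          MPI_Comm comm)
{
    MPI_Request req;

    MPI_Ialltoall(sendbuf, count, MPI_DOUBLE,
                  recvbuf, count, MPI_DOUBLE, comm, &req);
    compute_local_part(local, n);       /* overlapped with the exchange */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* remote data now available */
    compute_boundary_part(recvbuf, n);
}
```

The `MPI_Allreduce` in the CG solver admits no such split because every element of its result is needed before the dependent computation can start.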
al. proposed using host-based techniques for designing non-blocking collective operations [8]. However, host-based techniques offer limited performance portability and may not deliver complete overlap.…”
Section: Impact of System Noise on PCG Run-times
confidence: 99%
“…Simplistic designs of non-blocking collectives that require progressing the MPI library explicitly through CPU intervention, e.g. calling MPI_Test [8], offset much of the benefit of non-blocking communication. Similarly, if threads within the library are used for progression, application performance can be hurt by interrupt processing, thread scheduling, and other such factors [9].…”
Section: Introduction
confidence: 99%
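The explicit-progression pattern criticized in the statement above looks roughly like the following: the application must sprinkle `MPI_Test` calls through its compute loop so the library can advance the collective, and those calls (plus the code restructuring they force) are the overhead being objected to. `compute_chunk` and the chunking granularity are illustrative assumptions.

```c
/* Sketch: manual progression of a pending non-blocking collective
 * by periodic MPI_Test calls. Assumes the MPI-3 interface. */
#include <mpi.h>

void compute_chunk(int i);  /* hypothetical slice of the computation */

void progressed_compute(MPI_Request *req, int nchunks)
{
    int done = 0;

    for (int i = 0; i < nchunks; ++i) {
        compute_chunk(i);
        if (!done)  /* manual progression: steals cycles from computation */
            MPI_Test(req, &done, MPI_STATUS_IGNORE);
    }
    if (!done)
        MPI_Wait(req, MPI_STATUS_IGNORE);
}
```

Hardware- or thread-based progression removes these interleaved calls, but, as noted above, progression threads introduce their own costs via interrupt processing and scheduling.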