Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers

Kandalla, Krishna; Yang, Ulrike Meier; Keasler, Jeff; Kolev, Tzanio V.; Moody, Adam; Subramoni, Hari; Tomko, Karen; Vienne, J.; Supinski, Bronis R. de; Panda, Dhabaleswar K.

doi:10.1109/ipdps.2012.106

Cited by 25 publications

(3 citation statements)

References 21 publications

(28 reference statements)


“…The link rate is upgraded to 25 Gbps from 14 Gbps of the TianHe-2A supercomputer system. Collective offload [5] accelerates collective operations, effectively improves the throughput of a single chip. HFI-E provides the software-hardware interface for accessing the high-performance network, implementing the proprietary Mini Packet/Remote Direct Memory Access (MP/RDMA) communication and collective offload mechanism.…”

Section: Proprietary Interconnect (mentioning)

confidence: 99%

Brief introduction of TianHe exascale prototype system

Wang, Lü, Chen, et al. 2021. Tsinghua Sci. Technol.

Facing the challenges of the next generation exascale computing, National University of Defense Technology has developed a prototype system to explore opportunities, solutions, and limits toward the next generation Tianhe system. This paper briefly introduces the prototype system, which is deployed at the National Supercomputer Center in Tianjin and has a theoretical peak performance of 3.15 Pflops. A total of 512 compute nodes are found where each node has three proprietary CPUs called Matrix-2000+. The system memory is 98.3 TB, and the storage is 1.4 PB in total.


“…In addition, all PAMI collective calls are non-blocking. We plan to explore MPI 3.0 non-blocking (Hoefler et al, 2007; Kandalla et al, 2012, 2013) collective implementation that takes advantage of the non-blocking APIs in PAMI.…”

Section: Summary and Future Work (mentioning)

confidence: 99%

Optimization of MPI collective operations on the IBM Blue Gene/Q supercomputer

Kumar, Mamidala, Heidelberger, et al. 2014. The International Journal of High Performance Computing Applica

The Blue Gene/Q (BG/Q) machine is the latest in the line of IBM massively parallel supercomputers, designed to scale to 262,144 nodes and 16 million threads. Each BG/Q node has 68 hardware threads. Hybrid programming paradigms, which use message passing among nodes and multi-threading within nodes, enable applications to achieve high throughput on BG/Q. In this paper, we present scalable algorithms to optimize MPI collective operations by taking advantage of the various features of the BG/Q torus and collective networks. We achieve an 8 byte double-sum MPI_Allreduce latency of 10.25 µs on 1,572,864 MPI ranks. We accelerate summing of network packets with local buffers by the use of the Quad Processing SIMD unit in the BG/Q cores and executing the sums on multiple communication threads supported by the optimized communication libraries. The achieved net gain is a peak throughput of 6.3 GB/s for double-sum allreduce. We also achieve over 90% of network peak for MPI_Alltoall with 65,536 MPI ranks.


“…The mechanisms are generic enough to implement both blocking and nonblocking semantics, root-based (Reduce) and non-root-based reductions (Allreduce), unlike this research [8]. Compared to [5], the concepts and mechanisms are portable across different architectures.…”

Section: Related Work (mentioning)

confidence: 99%

Optimizing blocking and nonblocking reduction operations for multicore systems: Hierarchical design and implementation

Venkata, Shamis, Sampath, et al. 2013. 2013 IEEE International Conference on Cluster Computing (CLUSTER)

Many scientific simulations, using the Message Passing Interface (MPI) programming model, are sensitive to the performance and scalability of reduction collective operations such as MPI_Allreduce and MPI_Reduce. These operations are the most widely used abstractions to perform mathematical operations over all processes that are part of the simulation. In this work, we propose a hierarchical design to implement the reduction operations on multicore systems. This design aims to improve the efficiency of reductions by 1) tailoring the algorithms and customizing the implementations for various communication mechanisms in the system, 2) providing the ability to configure the depth of hierarchy to match the system architecture, and 3) providing the ability to independently progress each level of this hierarchy. Using this design, we implement MPI_Allreduce and MPI_Reduce operations (and their nonblocking variants MPI_Iallreduce and MPI_Ireduce) for all message sizes, and evaluate on multiple architectures including InfiniBand and Cray XT5. We leverage and enhance our existing infrastructure, Cheetah, which is a framework for implementing hierarchical collective operations, to implement these reductions. The experimental results show that the Cheetah reduction operations outperform the production-grade MPI implementations such as Open MPI default, Cray MPI, and MVAPICH2, demonstrating its efficiency, flexibility and portability. On InfiniBand systems, with a microbenchmark, a 512-process Cheetah nonblocking Allreduce and Reduce achieve a speedup of 23x and 10x, respectively, compared to the default Open MPI reductions. The blocking variants of the reduction operations also show similar performance benefits. A 512-process nonblocking Cheetah Allreduce achieves a speedup of 3x, compared to the default MVAPICH2 Allreduce implementation. On a Cray XT5 system, a 6144-process Cheetah Allreduce outperforms the Cray MPI by 145%. The evaluation with an application kernel, Conjugate Gradient solver, shows that the Cheetah reductions speed up total time to solution by 195%, demonstrating the potential benefits for scientific simulations.

Currently used algorithms and implementations for Allreduce and Reduce suffer from several performance drawbacks on multicore systems. These systems typically consist of tens of Central Processing Unit (CPU) cores on a node, a network interface with bandwidth of tens of gigabytes per second and latency of a few microseconds, and have multiple communication mechanisms (multiple cache levels, intra-node communication buses, and network interfaces) with varying performance characteristics. The multicore system architecture is ubiquitous in extreme scale systems. Also, these systems are widely used by the scientific community for executing scientific simulations [1]. Most existing Allreduce and Reduce implementations do not consider these performance variations in communication mechanisms in modern systems, and typically have a single implementation for all these different communication mechanisms resulting in ...
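The hierarchical idea in the abstract above can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not Cheetah's implementation and not an MPI program: the function name `hierarchical_allreduce` and the parameter `ranks_per_node` are hypothetical, the hierarchy is fixed at two levels, and each "rank" is just a list element rather than a process.

```python
from functools import reduce
import operator


def hierarchical_allreduce(values, ranks_per_node, op=operator.add):
    """Simulate a two-level allreduce over per-rank values (illustrative only).

    values[i] is the contribution of rank i; consecutive ranks are assumed
    to share a node. Returns the list of results, one per rank.
    """
    if not values:
        raise ValueError("need at least one rank")
    if ranks_per_node <= 0:
        raise ValueError("ranks_per_node must be positive")

    # Level 1: each node reduces its own ranks (e.g. over shared memory).
    nodes = [values[i:i + ranks_per_node]
             for i in range(0, len(values), ranks_per_node)]
    node_partials = [reduce(op, node) for node in nodes]

    # Level 2: one leader per node combines the partials across the network.
    total = reduce(op, node_partials)

    # Final step: every rank receives the same global result.
    return [total] * len(values)


if __name__ == "__main__":
    # Eight ranks on two hypothetical nodes of four ranks each.
    print(hierarchical_allreduce([1, 2, 3, 4, 5, 6, 7, 8], ranks_per_node=4))
```

Splitting the reduction this way keeps most of the combining local to a node and sends only one partial per node across the network, which is the motivation the abstract gives for tailoring each level to its communication mechanism.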

