Optimization of MPI collective operations on the IBM Blue Gene/Q supercomputer

Kumar, Sameer; Mamidala, Amith R.; Heidelberger, Philip; Chen, Dong; Faraj, Daniel A.

doi:10.1177/1094342014552086

Cited by 24 publications

(12 citation statements)

References 32 publications

“…The latencies are reported for messages up to 140 Kbyte (KB) in size are high -on the order of milliseconds. Kumar et al [11] developed an efficient algorithm for the Blue Gene/Q platform, which leverages the system's 5D torus with the reductions being performed by the host CPU. Adachi [6] implemented the Rabenseifner algorithm for the K-computer taking advantage its 5D network topology, segmenting the vectors into three parts which are reduced in parallel over three disjoint trees, and using the host CPU to perform the data reductions.…”

Section: Previous Work (mentioning)

confidence: 99%

Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ Streaming-Aggregation Hardware Design and Evaluation

Graham, Levi, Burredy, et al. 2020

Lecture Notes in Computer Science


This paper describes the new hardware-based streaming-aggregation capability added to Mellanox's Scalable Hierarchical Aggregation and Reduction Protocol in its HDR InfiniBand switches. For large messages, this capability is designed to achieve reduction bandwidths similar to those of point-to-point messages of the same size, and complements the latency-optimized low-latency aggregation reduction capabilities, aimed at small data reductions. MPI_Allreduce() bandwidth measured on an HDR InfiniBand-based system achieves about 95% of network bandwidth. For medium and large data reductions this also improves the reduction bandwidth by a factor of 2-5 relative to host-based (e.g., software-based) reduction algorithms. Using this capability also increased DL-Poly and PyTorch application performance by as much as 4% and 18%, respectively. This paper describes the SHARP Streaming-Aggregation hardware architecture and a set of synthetic and application benchmarks used to study this new reduction capability, and the range of data sizes for which Streaming-Aggregation performs better than the low-latency aggregation algorithm.

“…MPI collectives optimization algorithms for this generation of Blue Gene were analyzed in [10]. The recent version Blue Gene/Q [14] provides additional performance improvements for MPI collectives [19]. On a 96,304 node system, the latency of a short allreduce is about 6.5 µ-seconds.…”

Section: Previous Work (mentioning)

confidence: 99%

Towards A Data Centric System Architecture: SHARP

Graham, Bloch, Bureddy, et al. 2017

JSFI


Increased system size and a greater reliance on utilizing system parallelism to achieve computational needs require innovative system architectures to meet the simulation challenges. The SHARP technology is a step towards a data-centric architecture, where data is manipulated throughout the system. This paper introduces a new SHARP optimization, and studies aspects that impact application performance in a data-centric environment. The use of UD-Multicast to distribute aggregation results is introduced, reducing the latency of an eight-byte MPI_Allreduce() across 128 nodes by 16%. Use of reduction trees that avoid the inter-socket bus further improves the eight-byte MPI_Allreduce() latency across 128 nodes, with 28 processes per node, by 18%. The distribution of latency across processes in the communicator is studied, as is the capacity of the system to process concurrent aggregation operations.


“…R. Thakur and W. Gropp described the algorithms used by MPICH [31]. Some specific implementations on the IBM BG/Q platform have been discussed in [22]. K. Kandalla et al discussed how to develop the topology-aware algorithms for Infiniband clusters [21].…”

Section: Related Work (mentioning)

confidence: 99%

Parallel implementation and performance optimization of the configuration-interaction method

Shan, Williams, Johnson, et al. 2015

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis


The configuration-interaction (CI) method, long a popular approach to describe quantum many-body systems, is cast as a very large sparse matrix eigenpair problem with matrices whose dimension can exceed one billion. Such formulations place high demands on memory capacity and memory bandwidth, two quantities at a premium today. In this paper, we describe an efficient, scalable implementation, BIGSTICK, which, by factorizing both the basis and the interaction into two levels, can reconstruct the nonzero matrix elements on the fly, reduce the memory requirements by one or two orders of magnitude, and enable researchers to trade reduced resources for increased computational time. We optimize BIGSTICK on two leading HPC platforms, the Cray XC30 and the IBM Blue Gene/Q. Specifically, we not only develop an empirically driven load-balancing strategy that can evenly distribute the matrix-vector multiplication across 256K threads, but also develop techniques that improve the performance of the Lanczos reorthogonalization. Combined, these optimizations improved performance by 1.3-8× depending on platform and configuration.
