2014
DOI: 10.1177/1094342014552086

Optimization of MPI collective operations on the IBM Blue Gene/Q supercomputer

Abstract: The Blue Gene/Q (BG/Q) machine is the latest in the line of IBM massively parallel supercomputers, designed to scale to 262,144 nodes and 16 million threads. Each BG/Q node has 68 hardware threads. Hybrid programming paradigms, which use message passing among nodes and multi-threading within nodes, enable applications to achieve high throughput on BG/Q. In this paper, we present scalable algorithms to optimize MPI collective operations by taking advantage of the various features of the BG/Q torus and collectiv…

Cited by 24 publications (12 citation statements)
References 32 publications

“…The latencies reported for messages up to 140 Kbyte (KB) in size are high, on the order of milliseconds. Kumar et al. [11] developed an efficient algorithm for the Blue Gene/Q platform, which leverages the system's 5D torus with the reductions being performed by the host CPU. Adachi [6] implemented the Rabenseifner algorithm for the K-computer, taking advantage of its 5D network topology, segmenting the vectors into three parts which are reduced in parallel over three disjoint trees, and using the host CPU to perform the data reductions.…”
Section: Previous Work (mentioning)
confidence: 99%
“…MPI collectives optimization algorithms for this generation of Blue Gene were analyzed in [10]. The recent version Blue Gene/Q [14] provides additional performance improvements for MPI collectives [19]. On a 96,304 node system, the latency of a short allreduce is about 6.5 µ-seconds.…”
Section: Previous Work (mentioning)
confidence: 99%
“…R. Thakur and W. Gropp described the algorithms used by MPICH [31]. Some specific implementations on the IBM BG/Q platform have been discussed in [22]. K. Kandalla et al. discussed how to develop topology-aware algorithms for InfiniBand clusters [21].…”
Section: Related Work (mentioning)
confidence: 99%