Proceedings of the 19th Annual International Conference on Supercomputing 2005
DOI: 10.1145/1088149.1088183
Optimization of MPI collective communication on BlueGene/L systems

Abstract: BlueGene/L is currently the world's fastest supercomputer. It consists of a large number of low-power dual-processor compute nodes interconnected by high-speed torus and collective networks. Because compute nodes do not have shared memory, MPI is the natural programming model for this machine. The BlueGene/L MPI library is a port of MPICH2. In this paper we discuss the implementation of MPI collectives on BlueGene/L. The MPICH2 implementation of MPI collectives is based on point-to-point communication primi…
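The abstract notes that the stock MPICH2 collectives are layered over point-to-point messages. As a rough illustration (not the BlueGene/L code, and with ranks modeled as dictionary entries rather than real processes), a broadcast can be built from point-to-point sends along a binomial tree:

```python
# Sketch of a binomial-tree broadcast built from point-to-point "sends".
# This simulates the message pattern only; binomial_bcast is a hypothetical
# helper, not an MPICH2 or BlueGene/L API.

def binomial_bcast(nprocs, root_value):
    """Simulate a binomial-tree broadcast among nprocs ranks (power of two)."""
    buf = {0: root_value}  # rank 0 is the root and starts with the data
    step = nprocs // 2
    while step >= 1:
        for src in list(buf):
            dst = src + step
            # Each round, every rank that already has the data forwards it
            # `step` ranks away -- one point-to-point message per arrow.
            if src % (2 * step) == 0 and dst < nprocs:
                buf[dst] = buf[src]
        step //= 2
    return [buf[r] for r in range(nprocs)]

print(binomial_bcast(8, 42))  # every rank ends up with the root's value
```

With 8 ranks the data reaches everyone in log2(8) = 3 rounds, which is why point-to-point-based collectives scale logarithmically rather than linearly in the rank count.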

Cited by 119 publications (76 citation statements); References 19 publications.
“…The butterfly-like algorithm was developed some time ago [22,27] and has been extended to handle non-power-of-two numbers of processes [23]. Various architecture-specific all-reduce schemes have also been developed [1,4,12,17,26]. An all-reduce algorithm was designed for BlueGene/L systems in [1].…”
Section: Ethernet Switched Cluster Results
confidence: 99%
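The butterfly (recursive-doubling) all-reduce referenced above can be simulated in a few lines. This is an illustrative sketch, not code from any of the cited papers; it assumes a power-of-two rank count and a sum reduction:

```python
# Recursive-doubling ("butterfly") all-reduce simulation.
# At distance d, rank r exchanges its partial result with rank r XOR d and
# both combine; after log2(n) rounds every rank holds the full reduction.

def butterfly_allreduce(values):
    """All-reduce (sum) over len(values) simulated ranks, power of two only."""
    n = len(values)
    assert n & (n - 1) == 0, "this sketch assumes a power-of-two rank count"
    bufs = list(values)
    dist = 1
    while dist < n:
        # Pairwise exchange-and-combine: partner of rank r is r XOR dist.
        bufs = [bufs[r] + bufs[r ^ dist] for r in range(n)]
        dist *= 2
    return bufs

print(butterfly_allreduce([1, 2, 3, 4]))  # each rank holds the sum: [10, 10, 10, 10]
```

The extension to non-power-of-two process counts cited in [23] typically folds the extra ranks into a power-of-two core before the butterfly phase and redistributes results afterward.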
“…Various architecture-specific all-reduce schemes have also been developed [1,4,12,17,26]. An all-reduce algorithm was designed for BlueGene/L systems in [1]. In [12], an all-reduce scheme that takes advantage of remote DMA (RDMA) capability was developed for VIA-based clusters.…”
Section: Ethernet Switched Cluster Results
confidence: 99%
“…The BlueGene/L and BlueGene/Q supercomputers feature specialized collective networks that perform these reductions completely in hardware, using ALUs embedded in network routers [7,14]. In contrast to Coup, their main advantage is minimizing the latency of scalar or short reductions across a very large number of nodes.…”
Section: Additional Related Work
confidence: 99%
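The hardware collective network described in this citation combines partial results in router ALUs as packets travel up a tree, so a scalar reduction finishes in roughly tree-depth combining hops rather than requiring software processing at every rank. A minimal simulation of that idea (purely illustrative; `fan_in` is an assumed parameter, not a documented BlueGene/L value):

```python
# Simulate an in-network tree reduction: at each level, a "router ALU"
# sums the values arriving from its children before forwarding upward.

def tree_reduce(leaves, fan_in=2):
    """Return (total, hops): the reduced value and the number of tree levels."""
    level = list(leaves)
    hops = 0
    while len(level) > 1:
        # Each router combines up to fan_in incoming values into one packet.
        level = [sum(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        hops += 1
    return level[0], hops

print(tree_reduce(list(range(8))))  # (28, 3): sum 28 after 3 combining hops
```

Because the latency is set by tree depth (log of the node count) and the per-hop work is done in hardware, this approach excels exactly where the citation says: short or scalar reductions across very many nodes.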
“…BG/L-MPI [9] has successfully exploited the rich features of BG/L in terms of network topology, special-purpose network hardware, and architectural compromises. Although BG/L-MPI was originally ported from MPICH2 [3], its collective routines have demonstrated superior performance compared to the original implementation and come close to the peak capabilities of the networks and processors.…”
Section: Blue Gene/L: A Parallel I/O Perspective
confidence: 99%
“…all processes reading the same data from a file), the inter-process data-exchange phase may dominate the overall performance. To address the communication phase of MPI I/O collective operations, we rely on the BG/L MPI implementation [9], as it has successfully explored and utilized the rich network features of the BG/L machine. We tuned the communication phase of MPI I/O collective operations to choose the best-performing communication method among the BG/L MPI routines.…”
Section: Communication Phase Optimizations
confidence: 99%