2013 IEEE International Conference on Cluster Computing (CLUSTER)
DOI: 10.1109/cluster.2013.6702676
Optimizing blocking and nonblocking reduction operations for multicore systems: Hierarchical design and implementation

Abstract: Many scientific simulations, using the Message Passing Interface (MPI) programming model, are sensitive to the performance and scalability of reduction collective operations such as MPI Allreduce and MPI Reduce. These operations are the most widely used abstractions to perform mathematical operations over all processes that are part of the simulation. In this work, we propose a hierarchical design to implement the reduction operations on multicore systems. This design aims to improve the efficiency of reductio…
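The hierarchical design described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden simulation, not the paper's implementation: the real design operates over MPI ranks on multicore nodes, whereas here sum is assumed as the reduction operator, "nodes" are modeled as groups of consecutive ranks, and the three phases (intra-node reduce to a leader, inter-node reduce among leaders, broadcast back) run sequentially in one process.

```python
def hierarchical_allreduce(values, ranks_per_node):
    """Simulate a two-level hierarchical allreduce (sum assumed as operator).

    values: one value per simulated MPI rank.
    ranks_per_node: hypothetical number of ranks grouped on each node.
    """
    # Phase 1: intra-node reduction -- each node's ranks combine into a
    # per-node partial result held by that node's leader.
    leader_partials = [
        sum(values[start:start + ranks_per_node])
        for start in range(0, len(values), ranks_per_node)
    ]
    # Phase 2: inter-node reduction among the node leaders only, which
    # keeps the expensive network phase proportional to the node count.
    total = sum(leader_partials)
    # Phase 3: broadcast the final result back to every rank.
    return [total] * len(values)

# Two simulated nodes with three ranks each; every rank ends with the sum.
print(hierarchical_allreduce([1, 2, 3, 4, 5, 6], ranks_per_node=3))
```

The point of the hierarchy is that only one rank per node participates in the inter-node phase, so intra-node traffic stays on fast shared memory.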

Cited by 13 publications (7 citation statements) | References 10 publications (8 reference statements)
“…However, several works have been done on optimizing these trees for MPI, e.g. [15], [4], and this is not the scope of this work. It is, however, worth mentioning that these structures can also be used for GGAS and GPU clusters.…”
Section: Work Sharing: Data Distribution Over Multiple GPUs
confidence: 93%
“…To name a few, in [4], [15] and [3], blocking and nonblocking allreduce and reduce operations are optimized. Also, since GPUs are highly suitable to perform parallel reductions, the in-core reduction on a single GPU has been highly optimized, like described in [16].…”
Section: Introduction
confidence: 99%
“…Algorithmic work performed by Venkata et al [33] developed short-vector blocking and nonblocking reduction and barrier operations using a recursive k-ing type host-based approach, and extended work by Thakur [31]. Vadhiar et al [32] presented implementations of blocking reduction, gather and broadcast operations using sequential, chain, binary, binomial tree and Rabenseifner algorithms.…”
Section: Previous Work
confidence: 99%
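The "recursive k-ing" approach mentioned above generalizes recursive doubling: in each round, every rank exchanges partial results with partners at increasing distances. A hedged sketch of the k=2 special case (recursive doubling, sum assumed as operator, power-of-two process count assumed) shows the pattern; the cited work's actual algorithm handles general k and hardware offload.

```python
def recursive_doubling_allreduce(values):
    """Simulate recursive-doubling allreduce over p ranks (p a power of two).

    In round r, rank i exchanges its partial sum with rank i XOR 2**r,
    so after log2(p) rounds every rank holds the full reduction.
    """
    p = len(values)
    assert p & (p - 1) == 0, "power-of-two process count assumed"
    vals = list(values)
    dist = 1
    while dist < p:
        # All ranks exchange with their partner at this distance in parallel.
        vals = [vals[r] + vals[r ^ dist] for r in range(p)]
        dist <<= 1
    return vals

# Four simulated ranks: two rounds suffice, every rank gets the total.
print(recursive_doubling_allreduce([1, 2, 3, 4]))
```

Each rank sends and receives log2(p) messages in total, which is why these tree-structured schemes scale well for short vectors.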
“…According to research studies over the past two decades [2,3], MPI reduction operations, particularly MPI reduce and allreduce, are the most used collective operations in scientific applications. In the reduce operation, each node i owns a vector x i of n elements.…”
Section: Introduction
confidence: 99%
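The reduce operation described in that statement can be made concrete with a small sketch: each of the p nodes owns a vector x_i of n elements, and reduce combines the vectors elementwise (sum is assumed here as the operator; MPI also supports min, max, product, and user-defined operations).

```python
# Three simulated nodes, each owning a vector of n = 3 elements.
x = [
    [1, 2, 3],  # node 0's vector x_0
    [4, 5, 6],  # node 1's vector x_1
    [7, 8, 9],  # node 2's vector x_2
]

# Elementwise reduction: result[j] = sum over i of x_i[j].
# In MPI Reduce the result lands on one root; in MPI Allreduce, on all nodes.
result = [sum(column) for column in zip(*x)]
print(result)
```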