Overlapping Methods of All-to-All Communication and FFT Algorithms for Torus-Connected Massively Parallel Supercomputers

Doi, Jun; Negishi, Yasushi

doi:10.1109/sc.2010.38

Cited by 29 publications

(12 citation statements)

References 17 publications

“…Message combining merges multiple packets to the same destination at an intermediate node. A similar approach is attempted in the N × 2N tori in the IBM BlueGene/L [14]. In [14], it is reported that software message concatenation improves the performance of the MPI Alltoall function, when it is performed just before packets are turned in a dimension.…”

Section: Efficient Communication Methods (mentioning)

confidence: 99%

The Case for Network Coding for Collective Communication on HPC Interconnection Networks

Shalaby

Fujiwara

Koibuchi

2015

IEICE Trans. Inf. & Syst.

SUMMARY: Network bandwidth has recently become a performance concern, particularly for collective communication, since the bisection bandwidths of supercomputers have become far less than their full bisection bandwidths. In this context, we propose the use of a network coding technique to reduce the number of unicasts and the size of data transferred in latency-sensitive collective communications in supercomputers. Our proposed network coding scheme has a hierarchical multicasting structure with intra-group and inter-group unicasts. Quantitative analysis shows that the aggregate path hop counts under our hierarchical network coding decrease by as much as 94% compared with conventional unicast-based multicasts. We validate these results with cycle-accurate network simulations. In 1,024-switch networks, network coding reduces the execution time of collective communications by as much as 70%. We also show that our hierarchical network coding is beneficial for any packet size.
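The citation statement for this entry describes message combining as merging multiple packets bound for the same destination at an intermediate node. The following is a minimal Python sketch of that idea only. The function name `combine_messages` and the `(destination, payload)` packet format are assumptions made for illustration and do not come from either paper.

```python
from collections import defaultdict


def combine_messages(packets):
    """Merge packets that share a destination into one combined message.

    `packets` is a list of (destination, payload_bytes) tuples queued at a
    hypothetical intermediate node. Returns {destination: combined_payload},
    keeping arrival order within each destination.
    """
    combined = defaultdict(bytearray)
    for dest, payload in packets:
        combined[dest] += payload  # append payload to that destination's buffer
    return {dest: bytes(buf) for dest, buf in combined.items()}


# Five queued packets collapse into three outgoing messages.
queued = [(3, b"ab"), (5, b"cd"), (3, b"ef"), (5, b"gh"), (7, b"ij")]
merged = combine_messages(queued)
```

Here five packets become three messages, one per destination, which is the reduction in message count that combining aims for.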

“…On the other hand, our approach increases both of the performance and scaling through computation-communication overlap and 2-D decomposition. Doi et al [21] overlap computation and communication in a shared-memory parallel environment. However, each core uses blocking communication, and the overlap takes place between cores.…”

Section: Related Work (mentioning)

confidence: 99%

Computation–communication overlap and parameter auto-tuning for scalable parallel 3-D FFT

Song

Hollingsworth

2016

Journal of Computational Science

Abstract: Parallel 3-D FFT is widely used in scientific applications; it is therefore important to achieve high performance on large-scale systems with many thousands of computing cores. This paper describes a new method for scalable high-performance parallel 3-D FFT. We use a 2-D decomposition of 3-D arrays to increase scaling to a large number of cores. In order to achieve high performance, we use non-blocking MPI all-to-all operations and exploit computation-communication overlap. We also auto-tune our 3-D FFT code efficiently in a large parameter space and cope with the complex trade-off in optimizing our code in various system environments. According to experimental results from two systems, our method computes parallel 3-D FFT significantly faster than three existing libraries, and scales well to at least 32,768 compute cores.

“…1, this method fundamentally requires three all-to-all communication steps. This all-to-all communication can account for anywhere from 50% to over 90% of the overall running time (Section 4), and was the focus of many continuing research work [5,10,29,30].…”

Section: SOI FFT (mentioning)

confidence: 99%

Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors

Park

Bikshandi

Vaidyanathan

et al. 2013

Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the teraflop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nodes, which is 1.5× what is achievable on the same number of Intel® Xeon® nodes. It is a challenge to fully utilize the compute capability presented by many-core wide-vector processors for bandwidth-bound FFT computation. We leverage a new algorithm, Segment-of-Interest FFT, with low inter-node communication cost, and aggressively optimize data movements in node-local computations, exploiting caches. Our coordination of low-communication algorithm and massively parallel architecture for scalable performance is not limited to running FFT on Xeon Phi; it can serve as a reference for other bandwidth-bound computations and for emerging HPC systems that are increasingly communication limited.
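The citation statement for this entry notes that the conventional distributed FFT approach requires all-to-all communication steps. In distributed FFTs, an all-to-all exchange commonly serves as a global transpose of the data layout. The sketch below simulates that exchange in plain Python with lists standing in for ranks. It is an illustration under assumed names (`all_to_all`, `split_columns`, `distributed_transpose`), not code from any of the cited papers, and it uses no real MPI calls.

```python
def all_to_all(send):
    """Simulate an all-to-all exchange among P ranks.

    send[src][dst] is the block that rank `src` sends to rank `dst`.
    Returns recv, where recv[dst][src] is the block received from `src`.
    """
    p = len(send)
    return [[send[src][dst] for src in range(p)] for dst in range(p)]


def split_columns(local_rows, p):
    """Cut a rank's local rows into P equal-width column blocks."""
    w = len(local_rows[0]) // p
    return [[row[d * w:(d + 1) * w] for row in local_rows] for d in range(p)]


def distributed_transpose(local, p):
    """Transpose a row-distributed square matrix using one all-to-all.

    local[r] holds the contiguous rows owned by rank r. The result has the
    same layout, but each rank now owns rows of the transposed matrix.
    """
    send = [split_columns(local[r], p) for r in range(p)]
    recv = all_to_all(send)
    result = []
    for dst in range(p):
        blocks = recv[dst]
        w = len(blocks[0][0])
        # Column j of the stacked blocks becomes row j of the local result.
        rows = [[blk_row[j] for blk in blocks for blk_row in blk]
                for j in range(w)]
        result.append(rows)
    return result


# A 4 x 4 matrix distributed over 2 simulated ranks (2 rows each).
matrix = [[4 * i + j for j in range(4)] for i in range(4)]
local = [matrix[0:2], matrix[2:4]]
out = distributed_transpose(local, 2)
flat = [row for rank in out for row in rank]
```

Every rank sends one block to every other rank, so the communication volume grows with the full data size, which is consistent with the statement that this step can dominate running time.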
