2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid)
DOI: 10.1109/ccgrid51090.2021.00021
Adaptive and Hierarchical Large Message All-to-all Communication Algorithms for Large-scale Dense GPU Systems

Cited by 9 publications (4 citation statements)
References 28 publications
“…One-stage routing: As can be seen in Fig. 2(a), one-stage routing conducts an all-to-all broadcast from each node to all the other nodes without any intermediate relay nodes, which requires ⌈16²/8⌉ = 32 wavelengths according to Lemma 1. Since the available number of wavelengths is two, it takes ⌈32/2⌉ = 16 communication steps (time slots) to finish the All-gather operation, with each step sending an amount of data d.…”
Section: Motivation
Mentioning confidence: 99%
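The arithmetic quoted above can be reproduced with a short sketch. Reading Lemma 1 as "required wavelengths = ⌈N²/r⌉" with r = 8 links per node is an assumption made here for illustration; only the concrete numbers 16, 8, and 2 appear in the quoted text.

import math

# Hedged sketch of the quoted one-stage routing cost. The generalization
# of Lemma 1 to ceil(N^2 / r) is an assumption; the quote gives only the
# numbers for N = 16 nodes, r = 8, and 2 available wavelengths.
def one_stage_cost(num_nodes, links_per_node, available_wavelengths):
    required = math.ceil(num_nodes ** 2 / links_per_node)   # wavelengths needed
    steps = math.ceil(required / available_wavelengths)     # communication steps
    return required, steps

print(one_stage_cost(16, 8, 2))  # -> (32, 16), matching the quoted figures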
“…As the number of GPUs or other accelerators integrated into systems increases, efficient communication among these devices becomes crucial for running HPC applications. All-to-all communication methods, such as the message passing interface (MPI) All-gather operation, are widely used in many legacy scientific applications to perform Fast Fourier Transforms (FFTs) when data is distributed among multiple processes [2]. The All-gather operation is also gaining attention for performing model or hybrid parallelism in the training of distributed Deep Neural Networks (DNNs) on GPU clusters [3].…”
Section: Introduction
Mentioning confidence: 99%
“…Some studies [16], [17], [18], [19] have been done to optimize uniform all-to-all algorithms. Recent works have looked into optimizing all-to-all for GPU-based clusters [20], [21]. Most relevant to our work is [13], which presented a high-radix implementation of all-to-all.…”
Section: Related Work
Mentioning confidence: 99%
“…Data exchanges between cores on the same node translate to a direct memory copy. Therefore, they are faster than data exchanges between cores on different nodes, which require data to move over the network [28], [20]. To exploit this extra locality offered by the shared memory, we develop a new algorithm called the Two-layer Tunable-Radix All-to-all algorithm (TRA2), which improves upon TRA.…”
Section: Two-Layer Tunable-Radix All-to-all (TRA2)
Mentioning confidence: 99%
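The two-layer idea described in this passage can be illustrated with communicator splitting in MPI. The sketch below is not the authors' TRA2 code; it is a minimal illustration, under the assumptions that every node hosts the same number of ranks and that global ranks are assigned blockwise per node, of keeping intra-node exchanges on shared memory and funneling inter-node traffic through one leader rank per node.

# A minimal two-layer all-to-all sketch in mpi4py; NOT the authors' TRA2
# implementation, only an illustration of the idea quoted above. Assumes
# an equal number of ranks per node and a blockwise rank-to-node mapping.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Layer 1: ranks sharing a node (exchanges here are direct memory copies).
node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
node_rank, node_size = node_comm.Get_rank(), node_comm.Get_size()

# Layer 2: one leader per node carries all cross-node traffic.
leader_comm = comm.Split(0 if node_rank == 0 else MPI.UNDEFINED, key=rank)
num_nodes = size // node_size

# Each rank owns one toy message per destination rank.
send = [f"{rank}->{dst}" for dst in range(size)]

# Step 1: node-local gather at the leader.
gathered = node_comm.gather(send, root=0)

recv_at_leader = None
if node_rank == 0:
    # Regroup [local sender s][destination rank] into one bucket per
    # destination node n, indexed as buckets[n][s][local destination m].
    buckets = [[[gathered[s][n * node_size + m] for m in range(node_size)]
                for s in range(node_size)] for n in range(num_nodes)]
    # Step 2: a single inter-node all-to-all among the node leaders.
    recv_at_leader = leader_comm.alltoall(buckets)

# Step 3: node-local scatter; local rank m receives its messages from every
# source node a and source local rank s, ordered by global source rank.
per_local = None
if recv_at_leader is not None:
    per_local = [[recv_at_leader[a][s][m] for a in range(num_nodes)
                  for s in range(node_size)] for m in range(node_size)]
recv = node_comm.scatter(per_local, root=0)

The split into a shared-memory communicator plus a leaders-only communicator is the structural point of the quoted passage: only Step 2 crosses the network, while Steps 1 and 3 stay within a node.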