2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid)
DOI: 10.1109/ccgrid51090.2021.00021
Adaptive and Hierarchical Large Message All-to-all Communication Algorithms for Large-scale Dense GPU Systems

Cited by 9 publications (4 citation statements)
References 28 publications
“…One-stage routing: As can be seen in Fig. 2(a), one-stage routing conducts an all-to-all broadcast from each node to all the other nodes without any intermediate relay nodes, which requires ⌈16²/8⌉ = 32 wavelengths according to Lemma 1. Since the available number of wavelengths is two, it takes ⌈32/2⌉ = 16 communication steps (time slots) to finish the All-gather operation, with each step sending an amount of data d.…”
Section: Motivation
Mentioning confidence: 99%
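The arithmetic quoted above can be reproduced with a short sketch. Reading Lemma 1 as "required wavelengths = ⌈N²/r⌉" with r = 8 links per node is an assumption made here for illustration; only the concrete numbers 16, 8, and 2 appear in the quoted text.

import math

# Hedged sketch of the quoted one-stage routing cost. The generalization
# of Lemma 1 to ceil(N^2 / r) is an assumption; the quote gives only the
# numbers for N = 16 nodes, r = 8, and 2 available wavelengths.
def one_stage_cost(num_nodes, links_per_node, available_wavelengths):
    required = math.ceil(num_nodes ** 2 / links_per_node)   # wavelengths needed
    steps = math.ceil(required / available_wavelengths)     # communication steps
    return required, steps

print(one_stage_cost(16, 8, 2))  # -> (32, 16), matching the quoted figures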
“…As the number of GPUs or other accelerators integrated into systems increases, efficient communication among these devices becomes crucial for running HPC applications. All-to-all communication methods, such as the message passing interface (MPI) All-gather operation, are widely used in many legacy scientific applications to perform Fast Fourier Transforms (FFTs) when data is distributed among multiple processes [2]. The All-gather operation is also gaining attention for performing model or hybrid parallelism in the training of distributed Deep Neural Networks (DNNs) on GPU clusters [3].…”
Section: Introduction
Mentioning confidence: 99%
“…Some studies [16], [17], [18], [19] have been done to optimize uniform all-to-all algorithms. Recent works have looked into optimizing all-to-all for GPU-based clusters [20], [21]. Most relevant to our work is [13], which presented a high-radix implementation of all-to-all.…”
Section: Related Work
Mentioning confidence: 99%
“…Data exchanges between cores on the same node translate to a direct memory copy. Therefore, they are faster than data exchanges between cores on different nodes, which require data to move over the network [28], [20]. To exploit this extra locality offered by the shared memory, we develop a new algorithm called the Two-layer Tunable-Radix All-to-all algorithm (TRA2), which improves upon TRA.…”
Section: Two-Layer Tunable-Radix All-to-all (TRA2)
Mentioning confidence: 99%
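The two-layer idea described in this passage can be illustrated with communicator splitting in MPI. The sketch below is not the authors' TRA2 code; it is a minimal illustration, under the assumptions that every node hosts the same number of ranks and that global ranks are assigned blockwise per node, of keeping intra-node exchanges on shared memory and funneling inter-node traffic through one leader rank per node.

# A minimal two-layer all-to-all sketch in mpi4py; NOT the authors' TRA2
# implementation, only an illustration of the idea quoted above. Assumes
# an equal number of ranks per node and a blockwise rank-to-node mapping.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Layer 1: ranks sharing a node (exchanges here are direct memory copies).
node_comm = comm.Split_type(MPI.COMM_TYPE_SHARED)
node_rank, node_size = node_comm.Get_rank(), node_comm.Get_size()

# Layer 2: one leader per node carries all cross-node traffic.
leader_comm = comm.Split(0 if node_rank == 0 else MPI.UNDEFINED, key=rank)
num_nodes = size // node_size

# Each rank owns one toy message per destination rank.
send = [f"{rank}->{dst}" for dst in range(size)]

# Step 1: node-local gather at the leader.
gathered = node_comm.gather(send, root=0)

recv_at_leader = None
if node_rank == 0:
    # Regroup [local sender s][destination rank] into one bucket per
    # destination node n, indexed as buckets[n][s][local destination m].
    buckets = [[[gathered[s][n * node_size + m] for m in range(node_size)]
                for s in range(node_size)] for n in range(num_nodes)]
    # Step 2: a single inter-node all-to-all among the node leaders.
    recv_at_leader = leader_comm.alltoall(buckets)

# Step 3: node-local scatter; local rank m receives its messages from every
# source node a and source local rank s, ordered by global source rank.
per_local = None
if recv_at_leader is not None:
    per_local = [[recv_at_leader[a][s][m] for a in range(num_nodes)
                  for s in range(node_size)] for m in range(node_size)]
recv = node_comm.scatter(per_local, root=0)

The split into a shared-memory communicator plus a leaders-only communicator is the structural point of the quoted passage: only Step 2 crosses the network, while Steps 1 and 3 stay within a node.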