2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS)
DOI: 10.1109/icdcs54860.2022.00087
AIACC-Training: Optimizing Distributed Deep Learning Training through Multi-streamed and Concurrent Gradient Communications

Abstract: There is a growing interest in training deep neural networks (DNNs) in a GPU cloud environment. This is typically achieved by running parallel training workers on multiple GPUs across computing nodes. Under such a setup, the communication overhead is often responsible for long training time and poor scalability. This paper presents AIACC-Training, a unified communication framework designed for the distributed training of DNNs in a GPU cloud environment. AIACC-Training permits a training worker to participate i…
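Only the opening of the abstract is reproduced here, but the title and abstract describe reducing communication overhead by running several gradient collectives concurrently. The sketch below illustrates that general idea in PyTorch; it is an assumption-based illustration, not the AIACC-Training API (the function name and structure are invented for illustration, and a torch.distributed process group is assumed to be initialized already).

```python
# Minimal sketch of concurrent gradient communication (NOT the AIACC-Training
# API): launch one asynchronous all-reduce per gradient tensor so that many
# collectives are in flight at once, then wait for all of them and average.
# Assumes torch.distributed has already been initialized (e.g. with NCCL).
import torch
import torch.distributed as dist

def allreduce_gradients_concurrently(model: torch.nn.Module) -> None:
    params = [p for p in model.parameters() if p.grad is not None]

    # Issue all collectives without blocking; the backend can schedule them
    # on its communication streams and let them overlap with one another.
    handles = [dist.all_reduce(p.grad, async_op=True) for p in params]

    # Block only once, after everything has been launched.
    for h in handles:
        h.wait()

    world_size = dist.get_world_size()
    for p in params:
        p.grad.div_(world_size)  # sum -> average across workers
```

In a full framework the collectives would typically be launched from backward hooks so that communication also overlaps with the remaining backward computation, but the sketch above keeps to the post-backward case for brevity.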

Cited by 4 publications (3 citation statements). References 21 publications.
“…Inferencing frameworks like TensorRT [40] are not suitable for this scenario because they do not produce activations of intermediate layers. As can be seen from Figure 13, [41], [38], [42]. The former partitions the training data across multiple GPUs [5], [43], and the latter splits the model layers vertically and then distributes different layers onto GPUs to reduce the memory pressure of the model states on a single GPU [6].…”
Section: Further Analysis, 1) Training Efficiency (mentioning)
confidence: 99%
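The statement above contrasts data parallelism (partitioning the training batch across GPUs) with model parallelism (placing different layers on different GPUs to relieve per-GPU memory pressure). A minimal PyTorch sketch of the two layouts follows; it assumes a machine with two GPUs ("cuda:0"/"cuda:1") and is not taken from any of the cited papers.

```python
# Conceptual contrast of data vs. model parallelism (illustrative only).
import copy
import torch
import torch.nn as nn

# --- Data parallelism: replicate the model, split each batch across GPUs ---
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
replica0 = model.to("cuda:0")
replica1 = copy.deepcopy(model).to("cuda:1")
batch = torch.randn(64, 1024)
half0, half1 = batch.chunk(2)              # each replica sees half the data
out0 = replica0(half0.to("cuda:0"))
out1 = replica1(half1.to("cuda:1"))
# after backward, the replicas' gradients are averaged (e.g. via all-reduce)

# --- Model parallelism: split the layers themselves across GPUs ------------
stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
stage1 = nn.Linear(4096, 10).to("cuda:1")
x = torch.randn(64, 1024, device="cuda:0")
h = stage0(x)                              # early layers live on GPU 0
y = stage1(h.to("cuda:1"))                 # later layers live on GPU 1
```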
“…The asymmetry multiplier µ = 8. We also apply gradient compression [27] to the per-sample gradient to keep the top 90% values only. Our DPAF is configured to be C2-C1-× for MNIST/FMNIST, C2-C2-C1 for CelebA, and C3-C1-× for FFHQ, where the notation Cx1-Cx2-Cx3 means that the D uses x1 layers as conv1, x2 layers as conv2*, and x3 layers as conv3*.…”
Section: Experiments Setup (mentioning)
confidence: 99%
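The quoted setup keeps only the top 90% of per-sample gradient values. A small sketch of that top-fraction sparsification step is given below; it follows the general idea of gradient compression cited as [27] and is not the DPAF authors' code (the function name and tensor shapes are assumptions).

```python
# Illustrative top-k gradient sparsification: keep only the largest-magnitude
# 90% of gradient entries and zero out the rest.
import torch

def keep_top_fraction(grad: torch.Tensor, fraction: float = 0.9) -> torch.Tensor:
    """Zero out all but the top `fraction` of entries by absolute value."""
    flat = grad.flatten()
    k = max(1, int(fraction * flat.numel()))
    _, idx = torch.topk(flat.abs(), k)          # indices of the largest entries
    mask = torch.zeros_like(flat, dtype=torch.bool)
    mask[idx] = True
    return (flat * mask).view_as(grad)

# Example: a batch of per-sample gradients of shape (batch, params)
g = torch.randn(8, 1000)
g_sparse = keep_top_fraction(g, fraction=0.9)
print((g_sparse != 0).float().mean())           # roughly 0.9 of entries survive
```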
“…The Impact of Gradient Compression. Gradient compression (GC) [27] was originally proposed to reduce the communication cost in federated learning. The rationale behind gradient compression is that most of the values in the gradient contribute nearly no information to the update.…”
Section: Pre-training the Model With Public Data (mentioning)
confidence: 99%
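The rationale quoted above (that most gradient entries contribute almost nothing to the update) can be checked numerically: when gradient magnitudes are heavy-tailed, a small fraction of entries accounts for most of the total magnitude. The snippet below is a purely illustrative measurement on synthetic data, not an experiment or result from [27].

```python
# Illustrative check of the sparsity rationale on a synthetic "gradient":
# with heavy-tailed magnitudes, the largest few entries dominate the total.
import torch

torch.manual_seed(0)
g = torch.randn(100_000) * torch.rand(100_000) ** 4   # heavy-tailed magnitudes
total = g.abs().sum()
for frac in (0.01, 0.10, 0.50):
    k = int(frac * g.numel())
    top_mass = torch.topk(g.abs(), k).values.sum()
    share = (top_mass / total).item()
    print(f"top {frac:.0%} of entries hold {share:.1%} of the total |g| mass")
```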