2023
DOI: 10.1109/tpds.2023.3243261

Expediting Distributed DNN Training With Device Topology-Aware Graph Deployment

Abstract: This paper presents TAG, an automatic system to derive optimized DNN training graph and its deployment onto any device topology, for expedited training in device- and topology-heterogeneous ML clusters. We novelly combine both the DNN computation graph and the device topology graph as input to a graph neural network (GNN), and join the GNN with a search-based method to quickly identify optimized distributed training strategies. To reduce communication in a heterogeneous cluster, we further explore a lossless gr…

Citations: cited by 3 publications (4 citation statements)
References: 41 publications (87 reference statements)
“…(3) DeepSpeed [36] supports ZeRO-based [35] data parallelism and implements intra-op model parallelism for MoE layers. (4) TAG [56] is a heterogeneity-aware DNN training system. TAG supports data parallelism and inter-op model parallelism.…”
Section: Methods | Citation type: mentioning
confidence: 99%
“…The performance of SFB is primarily determined by the batch size and the number of devices involved [7,54]. TAG [56] proposes an integer linear programming-based technique to automatically identify beneficial application of SFB to tensors in a DNN model trained in a homogeneous cluster. However, uneven tensor partitioning across heterogeneous resources introduces additional complication to this problem.…”
Section: Sufficient Factor Broadcasting (SFB) | Citation type: mentioning
confidence: 99%
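
The dependence of SFB on batch size and device count, noted in the statement above, can be illustrated with a rough count of transmitted values. The sketch below is not TAG's integer linear programming formulation. It assumes a single fully connected layer whose gradient is a sum of per-sample outer products, and it compares against one dense gradient transfer, which is a deliberate simplification. All function names are hypothetical.

```python
# Illustrative cost sketch only; function names and the cost model are assumptions.

def dense_gradient_floats(rows: int, cols: int) -> int:
    """Number of values in the full gradient matrix of a rows x cols layer."""
    return rows * cols


def sfb_floats_sent(batch_size: int, rows: int, cols: int, num_devices: int) -> int:
    """Values one device sends when broadcasting its sufficient factors.

    Each sample contributes one vector of length `rows` and one of length
    `cols`; the device sends them to each of the other devices.
    """
    return (num_devices - 1) * batch_size * (rows + cols)


def reconstruct_gradient(us, vs):
    """Rebuild the dense gradient as the sum of outer products u_i * v_i^T."""
    rows, cols = len(us[0]), len(vs[0])
    grad = [[0.0] * cols for _ in range(rows)]
    for u, v in zip(us, vs):
        for r in range(rows):
            for c in range(cols):
                grad[r][c] += u[r] * v[c]
    return grad


if __name__ == "__main__":
    # Small batch, few devices: the factors are far smaller than the matrix.
    print(sfb_floats_sent(32, 4096, 4096, 4), dense_gradient_floats(4096, 4096))
    # Large batch, many devices: broadcasting the factors costs more.
    print(sfb_floats_sent(1024, 4096, 4096, 16), dense_gradient_floats(4096, 4096))
```

Under these assumptions the factor traffic grows linearly with both batch size and device count, while the dense gradient size stays fixed, so whether SFB pays off has to be decided per tensor.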