IEEE INFOCOM 2022 - IEEE Conference on Computer Communications 2022
DOI: 10.1109/infocom48880.2022.9796752

AutoByte: Automatic Configuration for Optimal Communication Scheduling in DNN Training

Abstract: ByteScheduler partitions and rearranges tensor transmissions to improve the communication efficiency of distributed Deep Neural Network (DNN) training. The configuration of hyper-parameters (i.e., the partition size and the credit size) is critical to the effectiveness of partitioning and rearrangement. Currently, ByteScheduler adopts Bayesian Optimization (BO) to find the optimal configuration for the hyper-parameters beforehand. In practice, however, various runtime factors (e.g., worker node status and netw…
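The mechanism the abstract describes can be pictured with a small sketch: gradient tensors are split into fixed-size partitions, higher-priority partitions are sent first, and at most a credit's worth of partitions is kept in flight at once. This is an illustrative sketch only; the function names (`partition`, `schedule`, `send_chunk`) and the priority convention are assumptions for this example, not ByteScheduler's or AutoByte's actual API.

```python
# Illustrative sketch of partition-size / credit-size scheduling (assumed API,
# not ByteScheduler's real interface).
from collections import deque


def partition(tensor_bytes, partition_size):
    """Split a serialized tensor into fixed-size chunks (last chunk may be smaller)."""
    return [tensor_bytes[i:i + partition_size]
            for i in range(0, len(tensor_bytes), partition_size)]


def schedule(tensors, partition_size, credit_size, send_chunk):
    """Send partitions of high-priority tensors first, keeping at most
    `credit_size` partitions in flight at a time (credit-based flow control)."""
    queue = deque()
    # Lower priority value = needed sooner (e.g. front layers of the next iteration).
    for priority, tensor_bytes in sorted(tensors, key=lambda t: t[0]):
        for chunk in partition(tensor_bytes, partition_size):
            queue.append(chunk)

    in_flight = deque()
    while queue or in_flight:
        # Spend available credits on new sends.
        while queue and len(in_flight) < credit_size:
            in_flight.append(send_chunk(queue.popleft()))  # send_chunk returns a handle
        # Block on the oldest outstanding send, which frees one credit.
        in_flight.popleft().wait()


if __name__ == "__main__":
    class _Done:                       # trivial stand-in for an async send handle
        def wait(self):
            pass

    schedule([(0, b"x" * 10_000), (1, b"y" * 10_000)],
             partition_size=4096, credit_size=4,
             send_chunk=lambda chunk: _Done())
```

The point of the sketch is that both knobs trade off against each other: smaller partitions allow finer-grained overlap with computation, while the credit size bounds how much data is outstanding on the network at once; finding good values for them is exactly the tuning problem the paper automates at runtime.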

Cited by 6 publications (2 citation statements)
References 38 publications
“…Communication acceleration: Existing communication acceleration techniques include, but are not limited to: (1) leveraging high-throughput and low-latency communication links, such as RDMA [34], [35], [36], InfiniBand, Intel Omni-Path, and NVIDIA's NVLink; (2) utilizing the message passing interface (MPI) and MPI-like implementations such as OpenMPI and Gloo [37]; (3) using high-performance communication collectives, such as NCCL and BLink [38], which support efficient communication between GPUs and many popular deep learning frameworks; (4) reducing data communication during the synchronization process, such as gradient quantization, compression, and sparsification [39], [40], [41], [42], [43], [44]; (5) using stale parameter updates to reduce the number of synchronized parameters, such as parameter freezing [45], [46], [47], Round-Robin Synchronous Parallel [48], and Bounded Staleness Parallel [49]; (6) tuning deep learning hyper-parameters, such as AutoByte [50]; (7) minimizing user-level overhead by conducting parameter aggregation at the transport layer [13]; (8) improving network-layer performance, such as network-level flow scheduling [51], [52] and congestion control [53]. Communication scheduling: Due to the layer-wise and tensor-wise structure of DNNs, several works explore maximizing the overlap of communication and computation.…”
Section: Related Work
confidence: 99%
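As one concrete instance of technique (4) in the quoted list (gradient compression and sparsification), the sketch below keeps only the top-k gradient entries by magnitude, so only (index, value) pairs need to be communicated instead of the full tensor. It is a generic NumPy illustration; the function names are made up for this example and do not correspond to any specific library's API.

```python
# Minimal top-k gradient sparsification sketch (illustrative names, not a real API).
import numpy as np


def topk_sparsify(grad: np.ndarray, k: int):
    """Keep only the k largest-magnitude gradient entries; return the
    (indices, values) that would actually be communicated."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]


def desparsify(idx, vals, shape):
    """Rebuild a dense gradient from the communicated (index, value) pairs."""
    dense = np.zeros(int(np.prod(shape)), dtype=vals.dtype)
    dense[idx] = vals
    return dense.reshape(shape)


# Usage: communicate roughly 1% of the gradient instead of the full tensor.
grad = np.random.randn(1024, 1024).astype(np.float32)
idx, vals = topk_sparsify(grad, k=grad.size // 100)
restored = desparsify(idx, vals, grad.shape)
```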
“…Egeria aims to reduce the total training workload, and thus should be compatible with them. Additionally, there is a wide range of networking solutions that can help in distributed DNN training [55], [85]–[87], [97]. Model-Keeper [43] accelerates training by repurposing previously trained models in a shared cluster.…”
Section: Related Work
confidence: 99%