IEEE INFOCOM 2022 - IEEE Conference on Computer Communications 2022
DOI: 10.1109/infocom48880.2022.9796752

AutoByte: Automatic Configuration for Optimal Communication Scheduling in DNN Training

Abstract: ByteScheduler partitions and rearranges tensor transmissions to improve the communication efficiency of distributed Deep Neural Network (DNN) training. The configuration of hyper-parameters (i.e., the partition size and the credit size) is critical to the effectiveness of partitioning and rearrangement. Currently, ByteScheduler adopts Bayesian Optimization (BO) to find the optimal configuration for the hyper-parameters beforehand. In practice, however, various runtime factors (e.g., worker node status and netw…
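The mechanism the abstract describes can be pictured with a small sketch: gradient tensors are split into fixed-size partitions, higher-priority partitions are sent first, and at most a credit's worth of partitions is kept in flight at once. This is an illustrative sketch only; the function names (`partition`, `schedule`, `send_chunk`) and the priority convention are assumptions for this example, not ByteScheduler's or AutoByte's actual API.

```python
# Illustrative sketch of partition-size / credit-size scheduling (assumed API,
# not ByteScheduler's real interface).
from collections import deque


def partition(tensor_bytes, partition_size):
    """Split a serialized tensor into fixed-size chunks (last chunk may be smaller)."""
    return [tensor_bytes[i:i + partition_size]
            for i in range(0, len(tensor_bytes), partition_size)]


def schedule(tensors, partition_size, credit_size, send_chunk):
    """Send partitions of high-priority tensors first, keeping at most
    `credit_size` partitions in flight at a time (credit-based flow control)."""
    queue = deque()
    # Lower priority value = needed sooner (e.g. front layers of the next iteration).
    for priority, tensor_bytes in sorted(tensors, key=lambda t: t[0]):
        for chunk in partition(tensor_bytes, partition_size):
            queue.append(chunk)

    in_flight = deque()
    while queue or in_flight:
        # Spend available credits on new sends.
        while queue and len(in_flight) < credit_size:
            in_flight.append(send_chunk(queue.popleft()))  # send_chunk returns a handle
        # Block on the oldest outstanding send, which frees one credit.
        in_flight.popleft().wait()


if __name__ == "__main__":
    class _Done:                       # trivial stand-in for an async send handle
        def wait(self):
            pass

    schedule([(0, b"x" * 10_000), (1, b"y" * 10_000)],
             partition_size=4096, credit_size=4,
             send_chunk=lambda chunk: _Done())
```

The point of the sketch is that both knobs trade off against each other: smaller partitions allow finer-grained overlap with computation, while the credit size bounds how much data is outstanding on the network at once; finding good values for them is exactly the tuning problem the paper automates at runtime.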

Cited by 6 publications (2 citation statements)
References 38 publications
“…Communication acceleration: Existing communication acceleration techniques include, but are not limited to: (1) leveraging high-throughput and low-latency communication links, such as RDMA [34], [35], [36], InfiniBand, Intel Omni-Path, and NVIDIA's NVLink; (2) utilizing the message passing interface (MPI) and MPI-like implementations such as OpenMPI and Gloo [37]; (3) using high-performance communication collectives, such as NCCL and BLink [38], which support efficient communication between GPUs and many popular deep learning frameworks; (4) reducing data communication during the synchronization process, such as gradient quantization, compression, and sparsification [39], [40], [41], [42], [43], [44]; (5) using stale parameter updates to reduce the number of synchronized parameters, such as parameter freezing [45], [46], [47], Round-Robin Synchronous Parallel [48], and Bounded Staleness Parallel [49]; (6) tuning deep learning hyper-parameters, such as AutoByte [50]; (7) minimizing user-level overhead by conducting parameter aggregation at the transport layer [13]; (8) improving network-layer performance, such as network-level flow scheduling [51], [52] and congestion control [53]. Communication scheduling: Due to the layer-wise and tensor-wise structure of DNNs, several works explore maximizing the overlap of communication and computation.…”
Section: Related Work
confidence: 99%
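As one concrete instance of technique (4) in the quoted list (gradient compression and sparsification), the sketch below keeps only the top-k gradient entries by magnitude, so only (index, value) pairs need to be communicated instead of the full tensor. It is a generic NumPy illustration; the function names are made up for this example and do not correspond to any specific library's API.

```python
# Minimal top-k gradient sparsification sketch (illustrative names, not a real API).
import numpy as np


def topk_sparsify(grad: np.ndarray, k: int):
    """Keep only the k largest-magnitude gradient entries; return the
    (indices, values) that would actually be communicated."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]


def desparsify(idx, vals, shape):
    """Rebuild a dense gradient from the communicated (index, value) pairs."""
    dense = np.zeros(int(np.prod(shape)), dtype=vals.dtype)
    dense[idx] = vals
    return dense.reshape(shape)


# Usage: communicate roughly 1% of the gradient instead of the full tensor.
grad = np.random.randn(1024, 1024).astype(np.float32)
idx, vals = topk_sparsify(grad, k=grad.size // 100)
restored = desparsify(idx, vals, grad.shape)
```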
“…Egeria aims to reduce the total training workload, and thus should be compatible with them. Additionally, there is a wide range of networking solutions that can help in distributed DNN training [55], [85]–[87], [97]. Model-Keeper [43] accelerates training by repurposing previously trained models in a shared cluster.…”
Section: Related Work
confidence: 99%