IEEE INFOCOM 2020 - IEEE Conference on Computer Communications
DOI: 10.1109/infocom41043.2020.9155446

Preemptive All-reduce Scheduling for Expediting Distributed DNN Training

Citations: cited by 48 publications (21 citation statements). References: 15 publications.
“…With dynamic graphs, the next iteration might touch a different set of parameters, which would invalidate the schedule derived from the previous iteration. PACE [12] computes the optimal communication schedule and implements preemption by segmenting primitive AllReduce operations into smaller pieces. Although segmenting can indeed mimic preemption, it can, on the other hand, hurt the total communication time, as we have seen in Fig.…”
Section: Related Work
confidence: 99%
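
The segmenting idea quoted above can be illustrated with a short sketch. The snippet below is an illustrative approximation, not PACE's actual implementation: each ready gradient is split into fixed-size chunks, and the priority queue is re-consulted at every chunk boundary, so a newly ready, higher-priority tensor can jump ahead of the remaining chunks of a lower-priority one. The names CHUNK_ELEMS, enqueue, and drain_one, and the priority convention, are assumptions made for this example; it also presumes torch.distributed's default process group has been initialized.

    import heapq
    import itertools
    import torch
    import torch.distributed as dist

    CHUNK_ELEMS = 1 << 20        # assumed segment size: 1M elements per chunk
    _order = itertools.count()   # tie-breaker so equal-priority chunks stay FIFO
    pending = []                 # min-heap of (priority, arrival order, chunk view)

    def enqueue(grad: torch.Tensor, priority: int) -> None:
        # Split a ready (contiguous) gradient into chunk views and queue them.
        # Lower priority = closer to the model input, i.e. needed earlier in
        # the next iteration's forward pass.
        for chunk in grad.view(-1).split(CHUNK_ELEMS):
            heapq.heappush(pending, (priority, next(_order), chunk))

    def drain_one() -> None:
        # Launch only the single highest-priority chunk; "preemption" is
        # emulated because the queue is re-checked at every chunk boundary.
        if pending:
            _, _, chunk = heapq.heappop(pending)
            dist.all_reduce(chunk, op=dist.ReduceOp.SUM)  # in-place sum across workers

The chunk size controls the trade-off the excerpt points out: smaller chunks give finer-grained preemption but add per-call launch overhead, which is why segmenting can lengthen the total communication time.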
“…Compared to PyTorch DDP, ZeRO can scale to much larger models as each process only needs to maintain a small partition of the model. … PipeDream [29] employs a different approach in which the model stack is decomposed into multiple stages: data parallelism is applied within a stage, while pipeline and model parallelism govern the workload across stages. One subtle detail is that, to attain high training speed, PipeDream slightly sacrifices accuracy by using the latest gradients from multiple concurrent passes.…”
Section: Related Work
confidence: 99%
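
As a rough illustration of the partitioning idea attributed to ZeRO in this excerpt (a toy sketch under assumed names, not ZeRO's real code), each rank can own the optimizer state for only a 1/world_size slice of the flattened parameters, so per-process memory shrinks as more workers join:

    import torch

    def shard_for_rank(flat_params: torch.Tensor, rank: int, world_size: int) -> torch.Tensor:
        # Return the contiguous slice of flat_params whose optimizer state
        # this rank maintains; the other ranks hold the remaining slices.
        shard_len = (flat_params.numel() + world_size - 1) // world_size
        start = rank * shard_len
        return flat_params[start:start + shard_len]

    # Example: 10M parameters split across 8 workers -> ~1.25M elements of state per rank.
    flat = torch.zeros(10_000_000)
    local = shard_for_rank(flat, rank=3, world_size=8)
    print(local.numel())  # 1250000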
“…There are also schedulers for MPI-based systems, such as Slurm [16] and Nomad [17]. Machine learning and deep learning clusters are among the clusters for which researchers are developing efficient resource schedulers [25-28]. For example, Peng et al. [26] propose a scheduler that dynamically resizes the allocated resources by applying supervised and reinforcement learning on deep learning clusters.…”
Section: Related Work
confidence: 99%
“…To improve system scalability, pipelining between computing tasks and communication tasks is one of the main methods to hide some of the communication overhead [2], [23]-[25], [29]. Due to the layer-wise structure of deep models, the gradient aggregation of the current layer has no dependency on the previous layer's gradient calculation, as shown in Fig.…”
Section: Background And Related Work, A. Distributed SGD With Data Para…
confidence: 99%
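
The layer-wise independence noted in this excerpt is what makes computation/communication overlap possible. The following is a minimal sketch of a common hook-based overlap pattern, written under assumed names (attach_overlap_hooks, finish_communication); it is an illustration, not the cited systems' code, and it assumes torch.distributed has been initialized and PyTorch 2.1 or later is available. Each parameter's all-reduce is launched asynchronously as soon as its gradient is ready, so communication for later layers overlaps with gradient computation for earlier layers.

    import torch
    import torch.distributed as dist

    _handles = []  # async all-reduce work objects still in flight

    def _overlap_hook(param: torch.Tensor) -> None:
        # Fires once this parameter's gradient has been accumulated into .grad,
        # so its all-reduce runs while back-propagation through earlier layers
        # is still computing their gradients. (A real system would also divide
        # by the world size to average.)
        _handles.append(dist.all_reduce(param.grad, async_op=True))

    def attach_overlap_hooks(model: torch.nn.Module) -> None:
        # register_post_accumulate_grad_hook requires PyTorch >= 2.1.
        for p in model.parameters():
            if p.requires_grad:
                p.register_post_accumulate_grad_hook(_overlap_hook)

    def finish_communication() -> None:
        # Call between loss.backward() and optimizer.step(): wait only for
        # reductions that have not finished yet.
        for work in _handles:
            work.wait()
        _handles.clear()

Because backward produces the last layers' gradients first, their reductions are already in flight while earlier layers are still being differentiated, which is exactly the independence the excerpt describes.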