“…DML algorithms, e.g., stochastic gradient descent (SGD) and Latent Dirichlet Allocation (LDA) [10,32], are iterative in nature and are both computation- and communication-intensive (§2). Over the years, a variety of DML systems [1,6,7,17] have been developed to accelerate training by improving worker computation, e.g., via hardware accelerators [23,38], improving communication efficiency [2,3,42,48], and coordinating computation with communication [24,39,49,55]. These systems generally use one of two architectures: (a) Parameter Server (PS) [12,50], where the model is stored at a separate location (the server); in every iteration, workers pull the latest model and compute an update, which is then shipped to the server and applied to the model.…”
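
To make the pull/compute/push cycle of the PS architecture concrete, the following is a minimal single-process Python sketch. The names (ParameterServer, Worker, pull, push), the least-squares objective, and the learning rate are illustrative assumptions rather than the API of any cited system; the worker-server network round-trips are simulated with direct method calls, and workers run sequentially here purely to keep the example self-contained.

import random

class ParameterServer:
    """Holds the shared model; stands in for the server role (assumed name)."""
    def __init__(self, dim, lr=0.1):
        self.weights = [0.0] * dim  # the model lives at the server
        self.lr = lr

    def pull(self):
        # Workers fetch the latest model at the start of each iteration.
        return list(self.weights)

    def push(self, grad):
        # Apply the shipped update to the model (an SGD step).
        self.weights = [w - self.lr * g for w, g in zip(self.weights, grad)]

class Worker:
    """Computes updates on a local data shard (assumed name)."""
    def __init__(self, shard):
        self.shard = shard  # list of (features, label) pairs

    def compute_update(self, weights):
        # Mean-squared-error gradient over this worker's local shard.
        grad = [0.0] * len(weights)
        for x, y in self.shard:
            err = sum(w * xi for w, xi in zip(weights, x)) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / len(self.shard)
        return grad

# Synthetic task: recover true_w from linear measurements (illustrative).
random.seed(0)
true_w = [3.0, -2.0]
def sample():
    x = [random.uniform(-1.0, 1.0) for _ in true_w]
    return x, sum(w * xi for w, xi in zip(true_w, x))

server = ParameterServer(dim=len(true_w))
workers = [Worker([sample() for _ in range(50)]) for _ in range(4)]

for step in range(100):
    for worker in workers:          # per iteration: pull -> compute -> push
        model = server.pull()
        server.push(worker.compute_update(model))

print(server.weights)  # approaches true_w = [3.0, -2.0]

In an actual PS deployment, pull and push are network RPCs issued by many workers in parallel, which is what makes the architecture communication-intensive and motivates the consistency and scheduling optimizations cited above.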