2017
DOI: 10.1007/978-3-319-69179-4_2

Distributed Training Large-Scale Deep Architectures

Abstract: Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this paper, we focus on employing the system approach to speed up large-scale training. Via lessons learned from our routine benchmarking effort, we first identify bottlenecks and overheads that hinder data parallelism. We then devise guidelines that help practitioners to configure…
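The paper's focus is the system side of data-parallel training. As a rough, self-contained illustration of the synchronous data-parallel pattern being benchmarked (a toy NumPy simulation, not code from the paper; the dataset, worker count, and learning rate are invented), each simulated worker computes a gradient on its shard of the mini-batch and the averaged gradient drives a single shared update:

```python
# Conceptual sketch of synchronous data-parallel SGD (not the paper's exact setup):
# each simulated worker computes a gradient on its own shard of the mini-batch,
# the gradients are averaged (standing in for an all-reduce / parameter-server sync),
# and one shared model replica is updated. Hypothetical linear-regression objective.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 16))            # toy dataset
true_w = rng.normal(size=16)
y = X @ true_w + 0.01 * rng.normal(size=1024)

num_workers = 4                            # assumed number of data-parallel workers
w = np.zeros(16)                           # shared model parameters
lr = 0.1

for step in range(200):
    batch_idx = rng.choice(len(X), size=256, replace=False)
    shards = np.array_split(batch_idx, num_workers)   # each worker gets a shard
    grads = []
    for shard in shards:                   # in a real system these run in parallel
        Xs, ys = X[shard], y[shard]
        grad = 2.0 * Xs.T @ (Xs @ w - ys) / len(shard)   # gradient of mean squared error
        grads.append(grad)
    w -= lr * np.mean(grads, axis=0)       # averaged gradient = one synchronous update

print("parameter error:", np.linalg.norm(w - true_w))
```

In a real deployment the per-shard gradient computations run on separate devices and the averaging step is an all-reduce or a push/pull through parameter servers, which is where the communication overheads that hinder data parallelism typically arise.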

Cited by 17 publications (10 citation statements)
References 38 publications (47 reference statements)
“…In addition, the authors conducted a comparative measurement study of the resource consumption patterns of the four frameworks and their performance and accuracy implications, including CPU and memory consumption and their correlations with varying hyper-parameter settings under different combinations of hardware and parallel computing libraries. Zou et al. [55] evaluated the performance of four deep learning frameworks, Caffe, MXNet, TensorFlow and Torch, on the ILSVRC-2012 dataset, a subset of the ImageNet dataset; however, this study lacks an empirical evaluation of the frameworks used. Liu et al. [27] evaluated five deep learning frameworks, Caffe2, Chainer, Microsoft Cognitive Toolkit (CNTK), MXNet, and TensorFlow, across multiple GPUs and multiple nodes on two datasets, CIFAR-10 and ImageNet.…”
Section: Related Work (citation type: mentioning; confidence: 99%)
“…Amiri et al. [5] propose a centralized scheduling strategy that assigns tasks to workers to minimize the average completion time with the help of one master. Zou et al. [37] develop a procedure to help users better choose the mini-batch size and the number of parameter servers (PSs). Similarly, Yan et al. [30] develop performance models that quantify the impact of data partitioning and system provisioning on system performance and scalability.…”
Section: Related Work (citation type: mentioning; confidence: 99%)
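The excerpt above credits Zou et al. [37] with a procedure for choosing the mini-batch size and the number of parameter servers. As a loose illustration of that kind of trade-off only (the cost model and every constant below are hypothetical, not Zou et al.'s procedure), one can sweep a simple throughput model in which larger batches amortize per-step overhead and additional PSs split the push/pull traffic:

```python
# Illustrative (hypothetical) cost model for choosing a mini-batch size and a number
# of parameter servers (PSs); all coefficients are made up and are NOT from Zou et al.
# The point is only the shape of the trade-off.
def step_time(batch_size, num_ps,
              compute_per_sample=1e-4,     # assumed seconds of compute per sample
              model_bytes=100e6,           # assumed model size (bytes)
              ps_bandwidth=1e9,            # assumed per-PS bandwidth (bytes/s)
              fixed_overhead=5e-3):        # assumed per-step launch/sync overhead (s)
    compute = compute_per_sample * batch_size
    # gradients are pushed to and parameters pulled from the PSs; sharding the model
    # over num_ps servers divides the traffic handled by each one
    communication = 2 * model_bytes / (num_ps * ps_bandwidth)
    return fixed_overhead + max(compute, communication)   # assumes compute/comm overlap

best = max(((b, p) for b in (64, 128, 256, 512, 1024) for p in (1, 2, 4, 8)),
           key=lambda cfg: cfg[0] / step_time(*cfg))       # maximize samples/second
print("best (batch_size, num_ps) under this toy model:", best)
```

A throughput-only model like this ignores statistical efficiency: in practice the mini-batch size also affects convergence, so the highest-throughput configuration is not automatically the fastest route to a target accuracy.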
“…Popular deep learning frameworks [6,[27][28][29] provide distributed SGD-based algorithms [30,31] to train increasingly large models on larger datasets. Existing work on understanding distributed training workloads can be broadly categorized into performance modeling [1,9,10] and empirical studies [1,7,22,32,33]. In contrast to prior model-driven performance modeling studies [10,12,13], where a static end-to-end training-time prediction is the main focus, our work leverages data-driven modeling powered by a large-scale empirical measurement on a popular cloud platform.…”
Section: Related Work (citation type: mentioning; confidence: 99%)
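The last excerpt contrasts model-driven performance modeling (an analytic end-to-end prediction) with data-driven modeling fitted to measurements. The sketch below is a minimal, hypothetical version of that contrast on synthetic data (the analytic formula, its constants, and the noise model are assumptions, not taken from the cited studies): an assumed analytic step-time formula generates noisy "measurements", and a least-squares fit recovers coefficients that can then predict unseen configurations.

```python
# Hypothetical sketch of the two modeling styles contrasted above, on fake data.
# "Model-driven": an analytic step-time formula with assumed constants.
# "Data-driven": fit coefficients to measured steps instead (np.linalg.lstsq).
import numpy as np

def analytic_step_time(batch, workers, a=2e-4, c=0.03):
    return a * batch / workers + c          # assumed compute term + fixed comm term

rng = np.random.default_rng(1)
configs = np.array([(b, w) for b in (128, 256, 512) for w in (1, 2, 4, 8)])
measured = np.array([analytic_step_time(b, w) * rng.uniform(0.9, 1.3)   # noisy "runs"
                     for b, w in configs])

# data-driven fit: step_time ~= alpha * (batch / workers) + beta
A = np.column_stack([configs[:, 0] / configs[:, 1], np.ones(len(configs))])
(alpha, beta), *_ = np.linalg.lstsq(A, measured, rcond=None)
print(f"fitted alpha={alpha:.2e}, beta={beta:.3f}")
print("predicted step time at batch=1024, workers=8:", alpha * 1024 / 8 + beta)
```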