2017
DOI: 10.1007/978-3-319-69179-4_2

Distributed Training Large-Scale Deep Architectures

Abstract: Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this paper, we focus on employing the system approach to speed up large-scale training. Via lessons learned from our routine benchmarking effort, we first identify bottlenecks and overheads that hinder data parallelism. We then devise guidelines that help practitioners to configure…
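The paper's focus is the system side of data-parallel training. As a rough, self-contained illustration of the synchronous data-parallel pattern being benchmarked (a toy NumPy simulation, not code from the paper; the dataset, worker count, and learning rate are invented), each simulated worker computes a gradient on its shard of the mini-batch and the averaged gradient drives a single shared update:

```python
# Conceptual sketch of synchronous data-parallel SGD (not the paper's exact setup):
# each simulated worker computes a gradient on its own shard of the mini-batch,
# the gradients are averaged (standing in for an all-reduce / parameter-server sync),
# and one shared model replica is updated. Hypothetical linear-regression objective.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 16))            # toy dataset
true_w = rng.normal(size=16)
y = X @ true_w + 0.01 * rng.normal(size=1024)

num_workers = 4                            # assumed number of data-parallel workers
w = np.zeros(16)                           # shared model parameters
lr = 0.1

for step in range(200):
    batch_idx = rng.choice(len(X), size=256, replace=False)
    shards = np.array_split(batch_idx, num_workers)   # each worker gets a shard
    grads = []
    for shard in shards:                   # in a real system these run in parallel
        Xs, ys = X[shard], y[shard]
        grad = 2.0 * Xs.T @ (Xs @ w - ys) / len(shard)   # gradient of mean squared error
        grads.append(grad)
    w -= lr * np.mean(grads, axis=0)       # averaged gradient = one synchronous update

print("parameter error:", np.linalg.norm(w - true_w))
```

In a real deployment the per-shard gradient computations run on separate devices and the averaging step is an all-reduce or a push/pull through parameter servers, which is where the communication overheads that hinder data parallelism typically arise.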

Cited by 17 publications (10 citation statements)
References 38 publications (47 reference statements)
“…In addition, the authors conducted a comparative measurement study of the resource consumption patterns of the four frameworks and their performance and accuracy implications, including CPU and memory consumption and their correlations with varying hyper-parameter settings under different combinations of hardware and parallel computing libraries. Zou et al. [55] evaluated the performance of four deep learning frameworks, Caffe, MXNet, TensorFlow and Torch, on the ILSVRC-2012 dataset, a subset of the ImageNet dataset; however, this study lacks an empirical evaluation of the frameworks used. Liu et al. [27] evaluated five deep learning frameworks, Caffe2, Chainer, Microsoft Cognitive Toolkit (CNTK), MXNet, and TensorFlow, across multiple GPUs and multiple nodes on two datasets, CIFAR-10 and ImageNet.…”
Section: Related Work (citation type: mentioning; confidence: 99%)
“…Amiri et al. [5] propose a centralized scheduling strategy that assigns tasks to workers to minimize the average completion time with the help of one master. Zou et al. [37] develop a procedure to help users better choose the mini-batch size and the number of parameter servers (PSs). Similarly, Yan et al. [30] develop performance models that quantify the impact of data partitioning and system provisioning on system performance and scalability.…”
Section: Related Work (citation type: mentioning; confidence: 99%)
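The excerpt above credits Zou et al. [37] with a procedure for choosing the mini-batch size and the number of parameter servers. As a loose illustration of that kind of trade-off only (the cost model and every constant below are hypothetical, not Zou et al.'s procedure), one can sweep a simple throughput model in which larger batches amortize per-step overhead and additional PSs split the push/pull traffic:

```python
# Illustrative (hypothetical) cost model for choosing a mini-batch size and a number
# of parameter servers (PSs); all coefficients are made up and are NOT from Zou et al.
# The point is only the shape of the trade-off.
def step_time(batch_size, num_ps,
              compute_per_sample=1e-4,     # assumed seconds of compute per sample
              model_bytes=100e6,           # assumed model size (bytes)
              ps_bandwidth=1e9,            # assumed per-PS bandwidth (bytes/s)
              fixed_overhead=5e-3):        # assumed per-step launch/sync overhead (s)
    compute = compute_per_sample * batch_size
    # gradients are pushed to and parameters pulled from the PSs; sharding the model
    # over num_ps servers divides the traffic handled by each one
    communication = 2 * model_bytes / (num_ps * ps_bandwidth)
    return fixed_overhead + max(compute, communication)   # assumes compute/comm overlap

best = max(((b, p) for b in (64, 128, 256, 512, 1024) for p in (1, 2, 4, 8)),
           key=lambda cfg: cfg[0] / step_time(*cfg))       # maximize samples/second
print("best (batch_size, num_ps) under this toy model:", best)
```

A throughput-only model like this ignores statistical efficiency: in practice the mini-batch size also affects convergence, so the highest-throughput configuration is not automatically the fastest route to a target accuracy.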
“…Popular deep learning frameworks [6,[27][28][29] provide distributed SGD-based algorithms [30,31] to train increasingly large models on larger datasets. Existing work on understanding distributed training workloads can be broadly categorized into performance modeling [1,9,10] and empirical studies [1,7,22,32,33]. In contrast to prior model-driven performance modeling studies [10,12,13], where a static end-to-end training-time prediction is the main focus, our work leverages data-driven modeling powered by a large-scale empirical measurement on a popular cloud platform.…”
Section: Related Work (citation type: mentioning; confidence: 99%)
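The last excerpt contrasts model-driven performance modeling (an analytic end-to-end prediction) with data-driven modeling fitted to measurements. The sketch below is a minimal, hypothetical version of that contrast on synthetic data (the analytic formula, its constants, and the noise model are assumptions, not taken from the cited studies): an assumed analytic step-time formula generates noisy "measurements", and a least-squares fit recovers coefficients that can then predict unseen configurations.

```python
# Hypothetical sketch of the two modeling styles contrasted above, on fake data.
# "Model-driven": an analytic step-time formula with assumed constants.
# "Data-driven": fit coefficients to measured steps instead (np.linalg.lstsq).
import numpy as np

def analytic_step_time(batch, workers, a=2e-4, c=0.03):
    return a * batch / workers + c          # assumed compute term + fixed comm term

rng = np.random.default_rng(1)
configs = np.array([(b, w) for b in (128, 256, 512) for w in (1, 2, 4, 8)])
measured = np.array([analytic_step_time(b, w) * rng.uniform(0.9, 1.3)   # noisy "runs"
                     for b, w in configs])

# data-driven fit: step_time ~= alpha * (batch / workers) + beta
A = np.column_stack([configs[:, 0] / configs[:, 1], np.ones(len(configs))])
(alpha, beta), *_ = np.linalg.lstsq(A, measured, rcond=None)
print(f"fitted alpha={alpha:.2e}, beta={beta:.3f}")
print("predicted step time at batch=1024, workers=8:", alpha * 1024 / 8 + beta)
```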