2019 IEEE International Conference on Autonomic Computing (ICAC)
DOI: 10.1109/icac.2019.00024

Speeding up Deep Learning with Transient Servers

Abstract: Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable, e.g., for rapidly evaluating new model designs, they often come with significantly higher monetary costs due to sublinear scalability. In this paper, we investigate the feasibility of using training clusters composed of cheaper transient GPU servers to get the benefits of distributed training without the high…
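As an illustration of the kind of multi-worker training the abstract refers to, the sketch below configures TensorFlow's `MultiWorkerMirroredStrategy` for synchronous data-parallel training across two hypothetical GPU servers. The cluster addresses, model, and dataset are placeholders chosen for the example and are not taken from the paper.

```python
# Minimal sketch of multi-worker distributed training in TensorFlow.
# Each worker runs the same script with its own "index" in TF_CONFIG.
import json
import os

import tensorflow as tf

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.0.0.1:2222", "10.0.0.2:2222"]},  # hypothetical hosts
    "task": {"type": "worker", "index": 0},
})

# Synchronous data-parallel SGD: gradients are all-reduced across workers.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=2, batch_size=64)
```

Adding workers to such a job shortens each step's compute time but adds gradient-synchronization overhead, which is one source of the sublinear scaling the abstract mentions.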

Cited by 14 publications (6 citation statements)
References 28 publications
“…As a third group of related work, there are a number of systems, most prominently in the domain of the Internet of Things, that are specifically focusing on runtime support for AI components in networked systems, e.g. Amazon (2021), Kim and Kim (2020), Li et al (2019) and Microsoft (2021). However, these focus purely on running AI models in the Cloud or Edge and do not support the level of dynamism necessary for pervasive application cases.…”
Section: Related Work (mentioning)
confidence: 99%
“…One of the key challenges of using transient servers for distributed training is that they can be revoked at any time. Even the revocation of a single worker can lead to significant performance degradation [7]. In this section, we characterize the revocation patterns of Google Cloud's transient servers.…”
Section: Characterizing Revocation Overhead (mentioning)
confidence: 99%
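For context on how a training worker might react to such a revocation, the following is a minimal sketch, assuming Google Cloud preemptible VMs, which expose an `instance/preempted` flag on the metadata server. The polling loop, checkpoint path, and training loop are illustrative and are not the cited paper's implementation.

```python
# Hedged sketch: detect an impending revocation of a Google Cloud preemptible
# (transient) VM and write a last-chance checkpoint before the worker is lost.
import urllib.request

import tensorflow as tf

PREEMPTED_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                 "instance/preempted")

def is_preempted() -> bool:
    """Ask the GCE metadata server whether this VM has been preempted."""
    req = urllib.request.Request(PREEMPTED_URL,
                                 headers={"Metadata-Flavor": "Google"})
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return resp.read().decode().strip() == "TRUE"
    except OSError:
        return False  # metadata server unreachable: assume not preempted

def train_with_revocation_checkpoints(model, dataset, ckpt_dir="/tmp/ckpt"):
    checkpoint = tf.train.Checkpoint(model=model)
    manager = tf.train.CheckpointManager(checkpoint, ckpt_dir, max_to_keep=3)
    manager.restore_or_initialize()      # resume if a replacement worker
    for step, (x, y) in enumerate(dataset):
        model.train_on_batch(x, y)
        if step % 100 == 0:
            manager.save()               # periodic safety checkpoint
        if is_preempted():
            manager.save()               # last-chance checkpoint
            break
    return manager.latest_checkpoint
```

Even with checkpointing, the excerpt's point stands: losing a worker forces the remaining cluster to resynchronize or retrain from the last checkpoint, which degrades end-to-end training time.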
“…Popular deep learning frameworks [6, 27-29] provide distributed SGD-based algorithms [30, 31] to train increasingly bigger models on larger datasets. Existing works towards understanding distributed training workloads can be broadly categorized into performance modeling [1, 9, 10] and empirical studies [1, 7, 22, 32, 33]. In contrast to prior model-driven performance modeling studies [10, 12, 13], where a static end-to-end training time prediction is the main focus, our work leverages data-driven modeling that is powered by a large-scale empirical measurement in a popular cloud platform.…”
Section: Related Work (mentioning)
confidence: 99%
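To make the contrast between model-driven and data-driven performance modeling concrete, the sketch below fits a simple scaling model, per-step time t(n) = a/n + b (parallelizable compute plus a fixed communication term), to measured cluster sizes and extrapolates the sublinear speedup. The measurements are invented for the example and do not come from the cited studies.

```python
# Illustrative sketch: fit t(n) = a/n + b to hypothetical per-step timings
# and extrapolate speedup for larger clusters.
import numpy as np

workers = np.array([1, 2, 4, 8], dtype=float)
step_time_s = np.array([0.80, 0.46, 0.30, 0.22])   # hypothetical measurements

# Linear least squares in the basis [1/n, 1]: t = a * (1/n) + b.
A = np.column_stack([1.0 / workers, np.ones_like(workers)])
(a, b), *_ = np.linalg.lstsq(A, step_time_s, rcond=None)

for n in (16, 32):
    pred = a / n + b
    speedup = (a + b) / pred            # relative to a single worker
    print(f"{n:2d} workers: predicted step time {pred:.3f}s, "
          f"speedup {speedup:.1f}x (sublinear if < {n})")
```

A model-driven study would derive a and b from hardware and network specifications, whereas the data-driven approach described in the excerpt estimates them from large-scale measurements on the cloud platform itself.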