2019 IEEE International Conference on Autonomic Computing (ICAC)
DOI: 10.1109/icac.2019.00024

Speeding up Deep Learning with Transient Servers

Abstract: Distributed training frameworks, like TensorFlow, have been proposed as a means to reduce the training time of deep learning models by using a cluster of GPU servers. While such speedups are often desirable, e.g., for rapidly evaluating new model designs, they often come with significantly higher monetary costs due to sublinear scalability. In this paper, we investigate the feasibility of using training clusters composed of cheaper transient GPU servers to get the benefits of distributed training without the high…
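As an illustration of the kind of multi-worker training the abstract refers to, the sketch below configures TensorFlow's `MultiWorkerMirroredStrategy` for synchronous data-parallel training across two hypothetical GPU servers. The cluster addresses, model, and dataset are placeholders chosen for the example and are not taken from the paper.

```python
# Minimal sketch of multi-worker distributed training in TensorFlow.
# Each worker runs the same script with its own "index" in TF_CONFIG.
import json
import os

import tensorflow as tf

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.0.0.1:2222", "10.0.0.2:2222"]},  # hypothetical hosts
    "task": {"type": "worker", "index": 0},
})

# Synchronous data-parallel SGD: gradients are all-reduced across workers.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="sgd",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
model.fit(x_train / 255.0, y_train, epochs=2, batch_size=64)
```

Adding workers to such a job shortens each step's compute time but adds gradient-synchronization overhead, which is one source of the sublinear scaling the abstract mentions.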

Cited by 14 publications (6 citation statements)
References 28 publications
“…As a third group of related work, there are a number of systems, most prominently in the domain of the Internet of Things, that are specifically focusing on runtime support for AI components in networked systems, e.g. Amazon (2021), Kim and Kim (2020), Li et al (2019) and Microsoft (2021). However, these focus purely on running AI models in the Cloud or Edge and do not support the level of dynamism necessary for pervasive application cases.…”
Section: Related Work (mentioning)
confidence: 99%
“…One of the key challenges of using transient servers for distributed training is that they can be revoked at any time. Even the revocation of a single worker can lead to significant performance degradation [7]. In this section, we characterize the revocation patterns of Google Cloud's transient servers.…”
Section: Characterizing Revocation Overhead (mentioning)
confidence: 99%
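For context on how a training worker might react to such a revocation, the following is a minimal sketch, assuming Google Cloud preemptible VMs, which expose an `instance/preempted` flag on the metadata server. The polling loop, checkpoint path, and training loop are illustrative and are not the cited paper's implementation.

```python
# Hedged sketch: detect an impending revocation of a Google Cloud preemptible
# (transient) VM and write a last-chance checkpoint before the worker is lost.
import urllib.request

import tensorflow as tf

PREEMPTED_URL = ("http://metadata.google.internal/computeMetadata/v1/"
                 "instance/preempted")

def is_preempted() -> bool:
    """Ask the GCE metadata server whether this VM has been preempted."""
    req = urllib.request.Request(PREEMPTED_URL,
                                 headers={"Metadata-Flavor": "Google"})
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return resp.read().decode().strip() == "TRUE"
    except OSError:
        return False  # metadata server unreachable: assume not preempted

def train_with_revocation_checkpoints(model, dataset, ckpt_dir="/tmp/ckpt"):
    checkpoint = tf.train.Checkpoint(model=model)
    manager = tf.train.CheckpointManager(checkpoint, ckpt_dir, max_to_keep=3)
    manager.restore_or_initialize()      # resume if a replacement worker
    for step, (x, y) in enumerate(dataset):
        model.train_on_batch(x, y)
        if step % 100 == 0:
            manager.save()               # periodic safety checkpoint
        if is_preempted():
            manager.save()               # last-chance checkpoint
            break
    return manager.latest_checkpoint
```

Even with checkpointing, the excerpt's point stands: losing a worker forces the remaining cluster to resynchronize or retrain from the last checkpoint, which degrades end-to-end training time.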
“…Popular deep learning frameworks [6, 27-29] provide distributed SGD-based algorithms [30, 31] to train increasingly bigger models on larger datasets. Existing works towards understanding distributed training workloads can be broadly categorized into performance modeling [1, 9, 10] and empirical studies [1, 7, 22, 32, 33]. In contrast to prior model-driven performance modeling studies [10, 12, 13], where a static end-to-end training time prediction is the main focus, our work leverages data-driven modeling that is powered by a large-scale empirical measurement in a popular cloud platform.…”
Section: Related Work (mentioning)
confidence: 99%
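To make the contrast between model-driven and data-driven performance modeling concrete, the sketch below fits a simple scaling model, per-step time t(n) = a/n + b (parallelizable compute plus a fixed communication term), to measured cluster sizes and extrapolates the sublinear speedup. The measurements are invented for the example and do not come from the cited studies.

```python
# Illustrative sketch: fit t(n) = a/n + b to hypothetical per-step timings
# and extrapolate speedup for larger clusters.
import numpy as np

workers = np.array([1, 2, 4, 8], dtype=float)
step_time_s = np.array([0.80, 0.46, 0.30, 0.22])   # hypothetical measurements

# Linear least squares in the basis [1/n, 1]: t = a * (1/n) + b.
A = np.column_stack([1.0 / workers, np.ones_like(workers)])
(a, b), *_ = np.linalg.lstsq(A, step_time_s, rcond=None)

for n in (16, 32):
    pred = a / n + b
    speedup = (a + b) / pred            # relative to a single worker
    print(f"{n:2d} workers: predicted step time {pred:.3f}s, "
          f"speedup {speedup:.1f}x (sublinear if < {n})")
```

A model-driven study would derive a and b from hardware and network specifications, whereas the data-driven approach described in the excerpt estimates them from large-scale measurements on the cloud platform itself.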