Proceedings of the 24th European MPI Users' Group Meeting 2017
DOI: 10.1145/3127024.3127037
What does fault tolerant deep learning need from MPI?

Abstract: Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) approach for large-scale data analysis. DL algorithms are computationally expensive: even distributed DL implementations that use MPI require days of training (model learning) time on commonly studied datasets. Long-running DL applications therefore become susceptible to faults, requiring the development of fault tolerant system infrastructure in addition to fault tolerant DL algorithms. This raises an important question: what is needed from MPI…
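To make the data-parallel setting the abstract refers to concrete, the sketch below shows the core MPI primitive behind synchronous distributed SGD: averaging per-rank gradients with an allreduce. It is a minimal illustration using mpi4py with a placeholder gradient vector, not code from the paper.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Placeholder for the gradient computed on this rank's shard of the data.
local_grad = np.full(1024, float(rank), dtype=np.float64)

# Sum gradients from all ranks, then divide to obtain the global average
# that every rank applies to its copy of the model.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size

A script like this would be launched across ranks with, for example, mpirun -np 4 python step.py; a failed rank stalls the collective, which is exactly the fault tolerance concern the paper examines.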

Cited by 15 publications (12 citation statements); references 44 publications.
“…Reproducibility is also essential in the context of fault tolerance, iterative refinement, debugging, and optimization of adaptable models, especially for large-scale and distributed workflow applications such as cloud computing platforms and Industry 4.0 [32]. The need for fault tolerance features specific to the properties of DL algorithms and their implementations has already been discussed elsewhere [2,36,31]. Reproducibility is also the basis for developing comparison criteria and metrics for the objective evaluation of model properties, like robustness and trustworthiness [33,19].…”
Section: Discussion (mentioning; confidence: 99%)
“…Although there are efforts to counteract some of this, production-ready solutions are lacking. Some of the described implementations allow for checkpointing to counteract this, but significant effort is necessary to enable true fault tolerance, as described in Amatya et al. [6]. It is also possible to reduce the probability of failure for each individual node, but this requires very specific hardware that is expensive and not generally available in commodity scale-out data centers or in the cloud.…”
Section: Fault Tolerance (mentioning; confidence: 99%)
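As an illustration of the checkpointing this excerpt refers to, here is a minimal sketch of periodic checkpoint/restart for a long-running training loop; the state dictionary, file names, and intervals are hypothetical and not taken from the implementations discussed.

import os
import pickle

CKPT_PATH = "checkpoint.pkl"

def save_checkpoint(state, path=CKPT_PATH):
    # Write to a temporary file and rename atomically so a crash during
    # the write never leaves a corrupt checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path=CKPT_PATH):
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0, "model_state": None}

state = load_checkpoint()               # resume from the last checkpoint, if any
for epoch in range(state["epoch"], 100):
    # ... one epoch of training would run here ...
    state = {"epoch": epoch + 1, "model_state": state["model_state"]}
    if (epoch + 1) % 10 == 0:           # checkpoint every 10 epochs
        save_checkpoint(state)

In a distributed setting the "significant effort" the excerpt mentions comes from coordinating such checkpoints across ranks and recovering the MPI job itself, not from the file I/O shown here.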
“…However, recent works show growing interest. Early work by Vinay et al. [4] addressed the requirements of MPI for designing fault tolerant DL applications, more specifically checkpoint-restart. Reagen et al. [29] proposed a framework for quantifying the resilience of DNNs.…”
Section: Related Work (mentioning; confidence: 99%)
“…While extensive research has been done to address most of these issues in many HPC workloads, not much has been done for DL workloads. For instance, not many frameworks are well suited to distributed training on supercomputers; hence, scaling DL training in distributed clusters often requires significant engineering effort [4]. To solve this problem, the DL community has created tools such as Horovod [5] that support distributed training on top of existing frameworks.…”
Section: Introduction (mentioning; confidence: 99%)
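For context on the Horovod approach mentioned in the last excerpt, the sketch below follows the general Horovod-on-PyTorch setup pattern; the model, learning rate, and launch command are illustrative assumptions, not details from the cited papers.

import torch
import horovod.torch as hvd

hvd.init()
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

# Illustrative model and optimizer; real workloads would use full DL models.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Horovod wraps the optimizer so gradients are averaged across workers
# with an allreduce before each update.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Broadcast initial parameters and optimizer state so every rank starts
# from the same model.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

A script like this would typically be started with horovodrun -np 4 python train.py; Horovod layers the collective communication on top of the existing framework, which is the "on top of existing frameworks" point the excerpt makes.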