Proceedings of the 24th European MPI Users' Group Meeting 2017
DOI: 10.1145/3127024.3127037
What does fault tolerant deep learning need from MPI?

Abstract: Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) approach for large-scale data analysis. DL algorithms are computationally expensive: even distributed DL implementations that use MPI require days of training (model learning) time on commonly studied datasets. Long-running DL applications therefore become susceptible to faults, requiring the development of fault tolerant system infrastructure in addition to fault tolerant DL algorithms. This raises an important question: what is needed from MPI…
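To make the data-parallel setting the abstract refers to concrete, the sketch below shows the core MPI primitive behind synchronous distributed SGD: averaging per-rank gradients with an allreduce. It is a minimal illustration using mpi4py with a placeholder gradient vector, not code from the paper.

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Placeholder for the gradient computed on this rank's shard of the data.
local_grad = np.full(1024, float(rank), dtype=np.float64)

# Sum gradients from all ranks, then divide to obtain the global average
# that every rank applies to its copy of the model.
global_grad = np.empty_like(local_grad)
comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
global_grad /= size

A script like this would be launched across ranks with, for example, mpirun -np 4 python step.py; a failed rank stalls the collective, which is exactly the fault tolerance concern the paper examines.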

Cited by 15 publications (12 citation statements); references 44 publications.
“…Reproducibility is also essential in the context of fault tolerance, iterative refinement, debugging, and optimization of adaptable models, especially for large-scale and distributed workflow applications such as cloud computing platforms and Industry 4.0 [32]. The need for fault tolerance features specific to the properties of DL algorithms and their implementations has already been discussed elsewhere [2,36,31]. Reproducibility is also the basis for developing comparison criteria and metrics for the objective evaluation of model properties, like robustness and trustworthiness [33,19].…”
Section: Discussion (mentioning; confidence: 99%)
“…Although there are efforts to counteract some of this, production-ready solutions are lacking. Some of the described implementations allow for checkpointing to counteract this, but significant effort is necessary to enable true fault tolerance, as described in Amatya et al. [6]. It is also possible to reduce the probability of failure for each individual node, but this requires very specific hardware that is expensive and not generally available in commodity scale-out data centers or in the cloud.…”
Section: Fault Tolerance (mentioning; confidence: 99%)
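As an illustration of the checkpointing this excerpt refers to, here is a minimal sketch of periodic checkpoint/restart for a long-running training loop; the state dictionary, file names, and intervals are hypothetical and not taken from the implementations discussed.

import os
import pickle

CKPT_PATH = "checkpoint.pkl"

def save_checkpoint(state, path=CKPT_PATH):
    # Write to a temporary file and rename atomically so a crash during
    # the write never leaves a corrupt checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path=CKPT_PATH):
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0, "model_state": None}

state = load_checkpoint()               # resume from the last checkpoint, if any
for epoch in range(state["epoch"], 100):
    # ... one epoch of training would run here ...
    state = {"epoch": epoch + 1, "model_state": state["model_state"]}
    if (epoch + 1) % 10 == 0:           # checkpoint every 10 epochs
        save_checkpoint(state)

In a distributed setting the "significant effort" the excerpt mentions comes from coordinating such checkpoints across ranks and recovering the MPI job itself, not from the file I/O shown here.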
“…However, recent works show growing interest. Early work by Vinay et al. [4] addressed the requirements of MPI for designing fault tolerant DL applications, more specifically checkpoint-restart. Reagen et al. [29] proposed a framework for quantifying the resilience of DNNs.…”
Section: Related Work (mentioning; confidence: 99%)
“…While extensive research has been done to address most of these issues in many HPC workloads, not much has been done for DL workloads. For instance, not many frameworks are well suited to distributed training on supercomputers; hence, scaling DL training in distributed clusters often requires significant engineering effort [4]. To solve this problem, the DL community has created tools such as Horovod [5] that support distributed training on top of existing frameworks.…”
Section: Introduction (mentioning; confidence: 99%)
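For context on the Horovod approach mentioned in the last excerpt, the sketch below follows the general Horovod-on-PyTorch setup pattern; the model, learning rate, and launch command are illustrative assumptions, not details from the cited papers.

import torch
import horovod.torch as hvd

hvd.init()
if torch.cuda.is_available():
    torch.cuda.set_device(hvd.local_rank())

# Illustrative model and optimizer; real workloads would use full DL models.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Horovod wraps the optimizer so gradients are averaged across workers
# with an allreduce before each update.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters())

# Broadcast initial parameters and optimizer state so every rank starts
# from the same model.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

A script like this would typically be started with horovodrun -np 4 python train.py; Horovod layers the collective communication on top of the existing framework, which is the "on top of existing frameworks" point the excerpt makes.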