2018
DOI: 10.1002/cpe.4989
TensorFlow at Scale: Performance and productivity analysis of distributed training with Horovod, MLSL, and Cray PE ML

Abstract: Deep learning has proven to be a successful tool for solving a large variety of problems in various scientific fields and beyond. In recent years, the models as well as the available datasets have grown bigger and more complicated, and thus, an increasing amount of computing resources is required in order to train these models in a reasonable amount of time. Besides being able to use HPC resources, deep learning model developers want flexible frameworks which allow for rapid prototyping. One of the mos…

Cited by 17 publications (5 citation statements). References 5 publications.
“…Weak scaling of a Cosmology DCGAN network using the Horovod [22] and CrayPE [23] MPI libraries with TensorFlow at NERSC. Most modern deep learning applications require large compute resources because of the large datasets and complex models needed to solve tasks. HPC facilities are particularly well suited to address this demand, and work has already been done at NERSC to study GANs on large-scale HPC systems [21]. In Figure 7 we show that we are able to scale GAN architectures up to thousands of compute nodes with reasonable efficiency using modern MPI libraries.…”
Section: Discussion (mentioning)
confidence: 90%
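The scaling result quoted above relies on Horovod's data-parallel pattern: each MPI rank trains a replica of the model and gradients are averaged with allreduce. The sketch below is a minimal, hypothetical illustration of that pattern using the Horovod Keras API; the model, data, and hyperparameters are placeholders, not the Cosmology DCGAN from the cited work.

import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod (one process per device, launched via mpirun/horovodrun).
hvd.init()

# Pin each local rank to one GPU, if GPUs are present.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder data and model; the cited work trains a DCGAN instead.
x = np.random.rand(1024, 64).astype("float32")
y = np.random.randint(0, 2, size=(1024, 1)).astype("float32")
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Scale the learning rate with the number of workers (a common weak-scaling
# heuristic) and wrap the optimizer so gradients are averaged across ranks.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size()))
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])

callbacks = [
    # Broadcast initial weights from rank 0 so every replica starts identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Only rank 0 prints progress; in a real job each rank reads its own data shard.
model.fit(x, y, batch_size=64, epochs=2, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)

Launched with, for example, "horovodrun -np 8 python train.py", the same script runs unchanged on one node or many, which is the productivity argument the quoted passage makes.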
“…Furthermore, to ascertain whether a better prediction model for presenteeism and absenteeism exists, generalized logistic model (GLM), Naive Bayes (NB), recursive partitioning and regression trees (RPART), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and generalized boosted models (GBMs) were used as machine learning algorithms. Modern neural network layers, activation functions, optimizers, and tools for evaluating, measuring, and debugging deep neural networks are all supported by TensorFlow [21]. The area under the curve (AUC) and balanced accuracy of each machine learning model were calculated in this study.…”
Section: Methods (mentioning)
confidence: 99%
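The quoted methods passage compares several classical models by AUC and balanced accuracy. As a hedged illustration only (the study's data and fitted models are not available here), the short sketch below shows how those two metrics are commonly computed in Python with scikit-learn on hypothetical predictions.

import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

# Hypothetical ground-truth labels and model outputs (placeholders, not study data).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.65, 0.9, 0.3, 0.55, 0.2])  # predicted probabilities
y_pred = (y_score >= 0.5).astype(int)                           # hard class predictions

# AUC is computed from the continuous scores, balanced accuracy from the hard labels.
print("AUC:", roc_auc_score(y_true, y_score))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))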
“…For example, when the model is relatively small and the inference cost is low, one can choose a distributed framework like in FTW. Nowadays, TensorFlow, PyTorch, and several tools such as Ray [51] and Horovod [52] can easily achieve distributed learning across multiple machines with minimal code changes compared to a single machine [53].…”
Section: How To Become General Technology? (mentioning)
confidence: 99%
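The quoted passage notes that tools such as Ray and Horovod let the same training code run on one machine or many with minimal changes. A Horovod sketch appears earlier in this section; the fragment below is a minimal, hypothetical Ray example of the same idea: the task function itself is unchanged, and Ray decides where the remote calls execute.

import ray

# Start Ray locally; pointing ray.init() at a cluster address distributes the
# same tasks across multiple machines without changing the code below.
ray.init()

@ray.remote
def train_shard(shard_id):
    # Placeholder for per-shard work (hypothetical); a real job would load a
    # data shard and run a training or evaluation step here.
    return shard_id * shard_id

# Launch eight tasks; Ray schedules them onto whatever workers are available.
futures = [train_shard.remote(i) for i in range(8)]
print(ray.get(futures))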