2018
DOI: 10.1002/cpe.4989
TensorFlow at Scale: Performance and productivity analysis of distributed training with Horovod, MLSL, and Cray PE ML

Abstract: Deep learning has proven to be a successful tool for solving a large variety of problems in various scientific fields and beyond. In recent years, the models as well as the available datasets have grown bigger and more complicated, and thus, an increasing amount of computing resources is required in order to train these models in a reasonable amount of time. Besides being able to use HPC resources, deep learning model developers want flexible frameworks which allow for rapid prototyping. One of the mos…

Cited by 17 publications (5 citation statements). References 5 publications.
“…Weak scaling of a Cosmology DCGAN network using the Horovod [22] and CrayPE [23] MPI libraries with TensorFlow at NERSC. Most modern deep learning applications require large compute resources because of the large datasets and complex models needed to solve tasks. HPC facilities are particularly well suited to address this demand, and work has already been done at NERSC to study GANs on large-scale HPC systems [21]. In Figure 7 we show that we are able to scale GAN architectures up to thousands of compute nodes with reasonable efficiency using modern MPI libraries.…”
Section: Discussion (mentioning)
confidence: 90%
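The scaling result quoted above relies on Horovod's data-parallel pattern: each MPI rank trains a replica of the model and gradients are averaged with allreduce. The sketch below is a minimal, hypothetical illustration of that pattern using the Horovod Keras API; the model, data, and hyperparameters are placeholders, not the Cosmology DCGAN from the cited work.

import numpy as np
import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod (one process per device, launched via mpirun/horovodrun).
hvd.init()

# Pin each local rank to one GPU, if GPUs are present.
gpus = tf.config.experimental.list_physical_devices("GPU")
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# Placeholder data and model; the cited work trains a DCGAN instead.
x = np.random.rand(1024, 64).astype("float32")
y = np.random.randint(0, 2, size=(1024, 1)).astype("float32")
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Scale the learning rate with the number of workers (a common weak-scaling
# heuristic) and wrap the optimizer so gradients are averaged across ranks.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(learning_rate=0.01 * hvd.size()))
model.compile(loss="binary_crossentropy", optimizer=opt, metrics=["accuracy"])

callbacks = [
    # Broadcast initial weights from rank 0 so every replica starts identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# Only rank 0 prints progress; in a real job each rank reads its own data shard.
model.fit(x, y, batch_size=64, epochs=2, callbacks=callbacks,
          verbose=1 if hvd.rank() == 0 else 0)

Launched with, for example, "horovodrun -np 8 python train.py", the same script runs unchanged on one node or many, which is the productivity argument the quoted passage makes.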
“…Furthermore, to ascertain whether a better prediction model for presenteeism and absenteeism exists, generalized logistic model (GLM), Naive Bayes (NB), recursive partitioning and regression trees (RPART), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and generalized boosted models (GBMs) were used as machine learning algorithms. Modern neural network layers, activation functions, optimizers, and tools for evaluating, measuring, and debugging deep neural networks are all supported by TensorFlow [21]. The area under the curve (AUC) and balanced accuracy of each machine learning model were calculated in this study.…”
Section: Methods (mentioning)
confidence: 99%
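The quoted methods passage compares several classical models by AUC and balanced accuracy. As a hedged illustration only (the study's data and fitted models are not available here), the short sketch below shows how those two metrics are commonly computed in Python with scikit-learn on hypothetical predictions.

import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

# Hypothetical ground-truth labels and model outputs (placeholders, not study data).
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.8, 0.65, 0.9, 0.3, 0.55, 0.2])  # predicted probabilities
y_pred = (y_score >= 0.5).astype(int)                           # hard class predictions

# AUC is computed from the continuous scores, balanced accuracy from the hard labels.
print("AUC:", roc_auc_score(y_true, y_score))
print("Balanced accuracy:", balanced_accuracy_score(y_true, y_pred))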
“…For example, when the model is relatively small and the inference cost is low, one can choose a distributed framework like in FTW. Nowadays, TensorFlow, PyTorch, and several tools such as Ray [51] and Horovod [52] can easily achieve distributed learning across multiple machines with minimal code changes compared to a single machine [53].…”
Section: How To Become General Technology? (mentioning)
confidence: 99%
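The quoted passage notes that tools such as Ray and Horovod let the same training code run on one machine or many with minimal changes. A Horovod sketch appears earlier in this section; the fragment below is a minimal, hypothetical Ray example of the same idea: the task function itself is unchanged, and Ray decides where the remote calls execute.

import ray

# Start Ray locally; pointing ray.init() at a cluster address distributes the
# same tasks across multiple machines without changing the code below.
ray.init()

@ray.remote
def train_shard(shard_id):
    # Placeholder for per-shard work (hypothetical); a real job would load a
    # data shard and run a training or evaluation step here.
    return shard_id * shard_id

# Launch eight tasks; Ray schedules them onto whatever workers are available.
futures = [train_shard.remote(i) for i in range(8)]
print(ray.get(futures))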