2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC)
DOI: 10.1109/mlhpc.2016.006

Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability

Abstract: This paper presents a theoretical analysis and practical evaluation of the main bottlenecks towards a scalable distributed solution for the training of Deep Neural Networks (DNNs). The presented results show that the current state-of-the-art approach, using data-parallelized Stochastic Gradient Descent (SGD), is quickly turning into a vastly communication-bound problem. In addition, we present simple but fixed theoretic constraints, preventing effective scaling of DNN training beyond only a few dozen…
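To make the communication-bound behavior described in the abstract concrete, here is a minimal sketch of data-parallel synchronous SGD with gradient averaging over MPI. This is our illustration, not the paper's implementation; mpi4py, NumPy, the toy quadratic loss, and all sizes are assumptions.

```python
# Minimal data-parallel synchronous SGD sketch (illustrative only, not the
# paper's code). Each worker computes a gradient on its local shard and the
# full gradient vector is averaged across workers every step.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, world = comm.Get_rank(), comm.Get_size()

n_params = 1_000_000                      # assumed model size
w = np.zeros(n_params, dtype=np.float32)  # parameters replicated on every worker
rng = np.random.default_rng(seed=rank)    # different local data per worker

def local_gradient(w, batch):
    # Placeholder for backpropagation: gradient of a toy quadratic loss.
    return w - batch.mean(axis=0)

lr = 0.01
for step in range(10):
    batch = rng.standard_normal((32, n_params), dtype=np.float32)
    g_local = local_gradient(w, batch)

    # Communication bottleneck: every step moves the full gradient
    # (n_params * 4 bytes) through an allreduce, regardless of batch size.
    g_global = np.empty_like(g_local)
    comm.Allreduce(g_local, g_global, op=MPI.SUM)
    g_global /= world

    w -= lr * g_global                    # identical update on every worker
```

The allreduce volume is fixed by the model size, while the compute per step shrinks as workers are added, which is why the step time becomes dominated by communication at scale.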

Cited by 74 publications (70 citation statements)
References 21 publications
“…Redundancy reduction can also be seen in the context of distributed systems. Training DNNs on such systems is an active field of research [11,12,13]. A problem is the transfer of the weight updates in the form of gradients between the different nodes.…”
Section: Related Work
confidence: 99%
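The gradient-transfer cost this excerpt refers to can be estimated with a back-of-envelope calculation; the parameter count and link speed below are our assumptions (an AlexNet-scale model over a 10 Gbit/s link), not figures from the cited papers.

```python
# Rough per-step gradient traffic for synchronous data-parallel training
# (illustrative numbers only).
def allreduce_bytes_per_worker(n_params: int, bytes_per_value: int = 4) -> int:
    # A ring allreduce sends and receives roughly 2x the buffer size per worker;
    # we use that as a simple upper bound.
    return 2 * n_params * bytes_per_value

n_params = 60_000_000                 # assumed AlexNet-scale parameter count
traffic = allreduce_bytes_per_worker(n_params)
bandwidth = 10e9 / 8                  # assumed 10 Gbit/s link, in bytes/s
print(f"{traffic / 1e6:.0f} MB per step, "
      f"{traffic / bandwidth * 1e3:.0f} ms at 10 Gbit/s")   # 480 MB, 384 ms
```

At these assumed numbers the gradient exchange alone costs hundreds of milliseconds per step, which is why gradient compression and other redundancy-reduction schemes are attractive.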
“…The compute complexity is high; medium-sized experiments and popular benchmarks can take days to run [8], severely compromising the productivity of the data scientist. Distributed scaling stalls after only a dozen nodes due to locking, messaging, synchronization and data locality issues.…”
Section: Challenges With Machine Learning
confidence: 99%
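The stalling behavior this excerpt describes can be illustrated with a toy scaling model in which compute divides across workers but the per-step communication and synchronization cost does not; the cost ratio below is an arbitrary assumption, not a measurement from the cited work.

```python
# Toy scaling model (our illustration): per-step time = compute/N plus a fixed
# communication/synchronization overhead, so speedup flattens quickly.
def step_time(n_workers: int, t_compute: float = 1.0, t_comm: float = 0.1) -> float:
    return t_compute / n_workers + (t_comm if n_workers > 1 else 0.0)

for n in (1, 2, 4, 8, 16, 32, 64):
    print(f"{n:3d} workers -> speedup {step_time(1) / step_time(n):4.1f}x")
```

With a communication cost of 10% of single-node compute, speedup saturates below 10x no matter how many workers are added, consistent with scaling stalling after roughly a dozen nodes.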
“…Gradient descent optimization is an indispensable element of solving many real-world problems, including but not limited to training deep neural networks [14,19]. Because of its inherent sequentiality, it is also particularly difficult to parallelize [17]. Recently a number of advances in developing distributed versions of gradient descent algorithms have been made [15,11,38,39,36].…”
Section: Introduction
confidence: 99%