2020
DOI: 10.1137/19m1247620

Layer-Parallel Training of Deep Residual Neural Networks

Abstract: Residual neural networks (ResNets) are a promising class of deep neural networks that have shown excellent performance for a number of learning tasks, e.g., image classification and recognition. Mathematically, ResNet architectures can be interpreted as forward Euler discretizations of a nonlinear initial value problem whose time-dependent control variables represent the weights of the neural network. Hence, training a ResNet can be cast as an optimal control problem of the associated dynamical system. For sim…
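The forward Euler interpretation in the abstract is easy to make concrete. The sketch below is an illustration rather than code from the paper; the tanh layer dynamics, the step size h, and the state dimension are assumptions chosen for brevity. It shows that stacking residual blocks u_{k+1} = u_k + h f(u_k, θ_k) is exactly an explicit Euler discretization of du/dt = f(u(t), θ(t)), so the layer index plays the role of time and the weights act as time-dependent controls.

```python
import numpy as np

def f(u, W, b):
    # Layer dynamics f(u, theta); a tanh perceptron is an assumed, illustrative choice.
    return np.tanh(W @ u + b)

def resnet_forward(u0, weights, biases, h=0.1):
    # Forward pass through a ResNet = explicit (forward) Euler for du/dt = f(u, theta(t)):
    # each residual block computes u_{k+1} = u_k + h * f(u_k, W_k, b_k).
    u = u0
    for W, b in zip(weights, biases):
        u = u + h * f(u, W, b)   # skip connection + scaled residual = one Euler step
    return u

# Toy setup: 8 layers ("time steps") acting on a 4-dimensional state.
rng = np.random.default_rng(0)
layers, dim = 8, 4
weights = [0.1 * rng.standard_normal((dim, dim)) for _ in range(layers)]
biases = [np.zeros(dim) for _ in range(layers)]
u_final = resnet_forward(rng.standard_normal(dim), weights, biases)
```

Training then amounts to choosing the controls (W_k, b_k) that minimize a loss on the final state, which is the optimal control viewpoint the abstract describes.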

Cited by 78 publications (66 citation statements)
References 42 publications
“…The connection of neural network architectures to optimal control problems as introduced in [120,125] makes heavy use of PDE-constrained optimization techniques including efficient matrix vector products. This topic has also recently received more attention from within the machine learning community [52] and promises to be a very interesting field for combining traditional methods from numerical analysis with deep learning.…”
Section: Numerical Linear Algebra in Deep Learning (mentioning)
confidence: 99%
“…Very recently, Yalla and Engquist [40] showed the promise of using a machine-learned model as the coarse propagator for test problems. Going the other way, Schroder [32] and Günther et al. [14] recently showed that parallel-in-time integration can be used to speed up the process of training neural networks.…”
Section: Related Work (mentioning)
confidence: 99%
“…In the most recent period, researchers have explored distributed training methods along various other dimensions to accelerate the training process. Günther et al. [35] provide a proof-of-concept for layer-parallel training of ResNets and demonstrate two options to benefit from the layer-parallel approach. Recently, independent work on PipeDream [11] proposed a distributed pipeline system for DNN training.…”
Section: Related Work (mentioning)
confidence: 99%
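To make the parallel-in-time idea behind these citations tangible, the sketch below runs a simplified, serial parareal iteration over the layer ("time") axis. It is only a two-level analogue of the multigrid-in-time approach used for layer-parallel training, and the toy dynamics f, the interval partition, and the iteration count are assumptions chosen for illustration. The expensive fine propagations inside each sweep are independent across intervals and would run concurrently, one chunk of layers per processor, in an actual layer-parallel setting.

```python
import numpy as np

def f(u):
    # Autonomous toy layer dynamics, an assumed stand-in for f(u, theta).
    return np.tanh(u)

def fine(u, t0, t1, n_steps=16):
    # Fine propagator F: many small forward Euler steps (many ResNet layers).
    h = (t1 - t0) / n_steps
    for _ in range(n_steps):
        u = u + h * f(u)
    return u

def coarse(u, t0, t1):
    # Coarse propagator G: a single large Euler step over the whole interval.
    return u + (t1 - t0) * f(u)

def parareal(u0, t_grid, n_iters=3):
    # Parareal update: u_{n+1} <- G(u_n^new) + F(u_n^old) - G(u_n^old).
    # The F evaluations over the intervals are independent and would run in
    # parallel across processors; here they are executed serially for clarity.
    N = len(t_grid) - 1
    u = [u0]
    for n in range(N):                       # initial coarse sweep
        u.append(coarse(u[n], t_grid[n], t_grid[n + 1]))
    for _ in range(n_iters):
        F_old = [fine(u[n], t_grid[n], t_grid[n + 1]) for n in range(N)]    # parallelizable
        G_old = [coarse(u[n], t_grid[n], t_grid[n + 1]) for n in range(N)]
        u_new = [u0]
        for n in range(N):                   # sequential, cheap coarse correction
            u_new.append(coarse(u_new[n], t_grid[n], t_grid[n + 1]) + F_old[n] - G_old[n])
        u = u_new
    return u

t_grid = np.linspace(0.0, 1.0, 5)            # 4 coarse intervals ("layer chunks")
states = parareal(np.array([0.5, -0.2]), t_grid)
```

The coarse sweep stays sequential but cheap, and after a few iterations the parareal states approach the serial fine propagation, which is what allows layer-parallel schemes to trade a modest amount of extra work for concurrency across the depth of the network.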