2020 IEEE High Performance Extreme Computing Conference (HPEC)
DOI: 10.1109/HPEC43674.2020.9286180

Layer-Parallel Training with GPU Concurrency of Deep Residual Neural Networks via Nonlinear Multigrid

Cited by 11 publications (12 citation statements). References 18 publications.

“…In the pursuit of increased efficiency, several works have proposed approaches to parallelization across time or depth in neural networks. (Günther et al., 2020; Kirby et al., 2020; Sun et al., 2020) use multigrid and penalty methods to achieve speedups in ResNets. Meng et al. (2020) proposed a parareal variant of physics-informed neural networks (PINNs) for PDEs.…”
Section: Time-Parallelization in Neural Models
Confidence: 99%
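
The parallelization across depth that these works exploit rests on reading a residual block as one explicit-Euler step of an ODE, so that depth plays the role of time. Below is a minimal Python sketch of that correspondence; the tanh layer function, step size, and shapes are illustrative assumptions, not details from the cited papers.

    import numpy as np

    def residual_block(x, theta, h=0.1):
        # One residual layer read as a forward-Euler step of x' = f(x, theta);
        # tanh(theta @ x) is a stand-in for the layer function f.
        return x + h * np.tanh(theta @ x)

    def serial_forward(x0, thetas, h=0.1):
        # Conventional layer-serial propagation: this sequential dependence in
        # "time" (depth) is what multigrid/parareal methods relax.
        x = x0
        for theta in thetas:
            x = residual_block(x, theta, h)
        return x
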
“…However, as shown in (Gotmare et al., 2018), the performance of most of these methods is much worse than that of BP for deep convolutional neural networks. On the other hand, based on the similarity of ResNet training to the optimal control of nonlinear systems (E, 2017), the parareal method for solving differential equations has been employed to replace the conventional forward-backward propagation with iterative multigrid schemes (Günther et al., 2020; Parpas and Muir, 2019; Kirby et al., 2020). Although the locking issues can be resolved, the implementation is complicated and difficult to integrate with existing library technologies such as BP and automatic differentiation.…”
Section: Related Work
Confidence: 99%
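
To make the scheme this excerpt describes concrete, here is a hedged parareal-style sketch over depth: a cheap coarse propagator sweeps the layer chunks serially, while the accurate fine propagation of each chunk is independent and could run concurrently (e.g., one GPU per chunk). The coarse surrogate, the chunking, and the iteration count are illustrative assumptions, not the implementation of the cited works.

    import numpy as np

    def fine(x, thetas, h):
        # Accurate propagation through one chunk of residual layers; each
        # chunk's fine sweep is independent and can run on its own device.
        for theta in thetas:
            x = x + h * np.tanh(theta @ x)
        return x

    def coarse(x, thetas, h):
        # Cheap surrogate: one large Euler step using only the chunk's first
        # layer (a hypothetical coarse propagator, chosen for illustration).
        return x + h * len(thetas) * np.tanh(thetas[0] @ x)

    def parareal_forward(x0, chunks, h=0.1, iters=3):
        # chunks: list of per-chunk weight lists; U[k] approximates the state
        # entering chunk k. Initialize U with a serial coarse sweep.
        U = [x0]
        for c in chunks:
            U.append(coarse(U[-1], c, h))
        for _ in range(iters):
            F = [fine(U[k], chunks[k], h) for k in range(len(chunks))]    # parallel
            G = [coarse(U[k], chunks[k], h) for k in range(len(chunks))]  # parallel
            for k in range(len(chunks)):  # serial coarse correction
                U[k + 1] = coarse(U[k], chunks[k], h) + F[k] - G[k]
        return U[-1]

With enough iterations the corrected states reproduce the layer-serial forward pass, which is the sense in which such schemes trade redundant coarse work for concurrency across depth.
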
“…Although the locking issues can be resolved, the implementation is complicated and difficult to integrate with existing library technologies such as BP and automatic differentiation. Therefore, experiments were conducted on simple ResNets across small datasets (Kirby et al., 2020), rather than on state-of-the-art ResNets across larger datasets.…”
Section: Related Work
Confidence: 99%

“…Furthermore, Wu et al. [71] proposed a multilevel training scheme for video sequences. The multilevel methods were also explored in the context of layer-parallel training in References [34, 47]. Let us note, finally, that a variant of the multilevel line-search method was presented in Reference [23].…”
Confidence: 99%