2020
DOI: 10.1098/rsta.2019.0049
Optimal memory-aware backpropagation of deep join networks

Abstract: Deep learning training memory needs can prevent the user from considering large models and large batch sizes. In this work, we propose to use techniques from memory-aware scheduling and automatic differentiation (AD) to execute a backpropagation graph with a bounded memory requirement at the cost of extra recomputations. The case of a single homogeneous chain, i.e. the case of a network whose stages are all identical and form a chain, is well understood and optimal solutions have been proposed in the A…
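The abstract describes bounding activation memory by recomputing parts of the forward pass during backpropagation. As a rough, hedged sketch of that general idea (not the paper's optimal schedule for join networks), the example below applies PyTorch's generic checkpoint_sequential to a homogeneous chain; the depth, layer width, batch size, and segment count are illustrative assumptions, and the use_reentrant keyword assumes a reasonably recent PyTorch.

```python
# Minimal sketch of trading recomputation for bounded activation memory on a
# homogeneous chain, using PyTorch's built-in sequential checkpointing.
# All sizes below are arbitrary; this is not the paper's algorithm.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A homogeneous chain: every stage is identical and the stages form a chain.
chain = nn.Sequential(
    *[nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(32)]
)
x = torch.randn(64, 512, requires_grad=True)

# Split the chain into 4 segments: only segment inputs are kept during the
# forward pass; activations inside each segment are recomputed in backward.
out = checkpoint_sequential(chain, 4, x, use_reentrant=False)
out.sum().backward()  # extra forward recomputation, lower peak memory
```

Raising the number of segments stores fewer intermediate activations at the price of more recomputation, which is the compute/memory trade-off the paper schedules optimally for join-shaped graphs.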

Cited by 11 publications (6 citation statements)
References 12 publications
“…Techniques which use speculative approaches [49], spiking neural network concepts [50,51] and memory use optimization [52] have also been proposed. Kim and Ko [53] and (separately) Ma, Lewis and Kleijn [54] have proposed techniques which train neural networks while specifically avoiding backpropagation altogether.…”
Section: Gradient Descent and Machine Learning (mentioning)
confidence: 99%
“…Other techniques have incorporated federated learning and momentum [81] and used evolutionary algorithms [82], speculative approaches [83] and spiking neural network concepts [84,85]. Yet other techniques have focused on supporting deep networks [86], memory use optimization [87], bias factors [88,89] and initial condition sensitivity [90]. A recent technique, proposed by Zhang, et al [91], utilizes a combination of expert strategies and gradient descent for optimization.…”
Section: Gradient Descent (mentioning)
confidence: 99%
“…This approach involves re-computing results during the backward pass to avoid saving results in the forward pass, trading more compute for less memory but guaranteeing identical results. All work in this domain has focused on ways to balance this trade-off for different types of acyclic network graphs (Chen et al. 2016; Gruslys et al. 2016; Kumar et al. 2019; Kusumoto et al. 2019; Beaumont et al. 2020). Our work instead performs recomputation in the forward pass, so that the backward pass produces an equivalent result, while using less compute time and less memory.…”
Section: Related Work (mentioning)
confidence: 99%
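The quoted passage notes that backward-pass recomputation guarantees results identical to plain backpropagation. As a hedged, minimal illustration of that property (using PyTorch's generic torch.utils.checkpoint, not any specific method from the cited works), the sketch below compares gradients with and without activation checkpointing; the block shape and random seed are arbitrary assumptions.

```python
# Illustrative check: recomputing activations during the backward pass yields
# the same gradients as storing them, at the cost of an extra forward pass.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
block = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
x = torch.randn(16, 128, requires_grad=True)

# Plain forward/backward: intermediate activations inside `block` are stored.
block(x).sum().backward()
grad_stored = x.grad.clone()

# Checkpointed forward/backward: activations are recomputed during backward.
x.grad = None
checkpoint(block, x, use_reentrant=False).sum().backward()
grad_recomputed = x.grad.clone()

print(torch.allclose(grad_stored, grad_recomputed))  # True
```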