2020
DOI: 10.48550/arxiv.2003.01291
Preprint

Overall error analysis for the training of deep neural networks via stochastic gradient descent with random initialisation

Abstract: In spite of the accomplishments of deep learning based algorithms in numerous applications and very broad corresponding research interest, at the moment there is still no rigorous understanding of the reasons why such algorithms produce useful results in certain situations. A thorough mathematical analysis of deep learning based algorithms seems to be crucial in order to improve our understanding and to make their implementation more effective and efficient. In this article we provide a mathematically rigorous…
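The article concerns the overall error of training deep neural networks with plain SGD started from random initialisations. As a rough, non-authoritative illustration of the kind of training procedure being analysed, the sketch below runs SGD on a one-hidden-layer ReLU network for a least-squares regression problem from several independent random initialisations and keeps the realisation with the smallest empirical risk; the target function, network width, learning rate, and every other choice here are assumptions made only for this example and are not taken from the paper.

```python
# Illustrative sketch (not the paper's algorithm verbatim): train a shallow
# ReLU network on a least-squares regression task with plain SGD, restarted
# from several independent random initialisations, and keep the realisation
# with the smallest empirical risk.  All hyper-parameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    # Hypothetical target function to be learned (example only).
    return np.sin(2.0 * np.pi * x)

def init_params(width):
    # Random (Gaussian) initialisation of a one-hidden-layer ReLU network.
    return {
        "W1": rng.normal(0.0, 1.0, size=(width, 1)),
        "b1": rng.normal(0.0, 1.0, size=(width,)),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(width), size=(1, width)),
        "b2": np.zeros(1),
    }

def forward(p, x):
    # x has shape (n, 1); returns predictions (n, 1) plus the hidden
    # pre-activations and activations needed for the gradient.
    z = x @ p["W1"].T + p["b1"]           # (n, width)
    h = np.maximum(z, 0.0)                # ReLU
    y = h @ p["W2"].T + p["b2"]           # (n, 1)
    return y, z, h

def empirical_risk(p, x, y_true):
    y, _, _ = forward(p, x)
    return float(np.mean((y - y_true) ** 2))

def sgd_run(width=32, steps=5000, batch=16, lr=1e-2, n_train=256):
    # One SGD run from a fresh random initialisation.
    x = rng.uniform(0.0, 1.0, size=(n_train, 1))
    y_true = target(x)
    p = init_params(width)
    for _ in range(steps):
        idx = rng.integers(0, n_train, size=batch)
        xb, yb = x[idx], y_true[idx]
        y, z, h = forward(p, xb)
        err = 2.0 * (y - yb) / batch      # d(mean squared error)/dy
        # Backpropagation by hand for the two-layer network.
        gW2 = err.T @ h
        gb2 = err.sum(axis=0)
        dh = err @ p["W2"]
        dz = dh * (z > 0.0)
        gW1 = dz.T @ xb
        gb1 = dz.sum(axis=0)
        for name, g in (("W1", gW1), ("b1", gb1), ("W2", gW2), ("b2", gb2)):
            p[name] -= lr * g
    return p, empirical_risk(p, x, y_true)

# Several independent random initialisations; keep the best realisation.
runs = [sgd_run() for _ in range(5)]
best_params, best_risk = min(runs, key=lambda r: r[1])
print(f"smallest empirical risk over 5 initialisations: {best_risk:.4f}")
```

Selecting the best of several independent runs is only meant to mirror, at a toy level, the role that random initialisation plays in the analysis; a faithful reproduction would follow the article's precise assumptions on the architecture, the data distribution, and the step sizes.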

Cited by 7 publications (22 citation statements)
References 26 publications
“…This, (24), (25), and the fact that for all x ∈ [a, b]^d it holds that 0 (x) = 0 prove that for all φ, ψ ∈ K we have that…”
Section: Local Lipschitz Continuity Properties of the True Risk Funct… (mentioning)
confidence: 79%
“…For more detailed overviews and further references on SGD optimization schemes we refer, e.g., to [8], [18, Section 1.1], [23, Section 1], and [39]. The effect of random initializations in the training of ANNs was studied, e.g., in [6, 20, 21, 25, 32, 42] and the references mentioned therein. Another promising branch of research has investigated the convergence of SGD for the training of ANNs in the so-called overparametrized regime, where the number of ANN parameters has to be sufficiently large.…”
Section: Introduction (mentioning)
confidence: 99%
“…the number of non-zero weights and biases). Moreover, many bounds on the generalization error require an estimate of the network width [3, 26].…”
Section: Approximation of Analytic Functions (mentioning)
confidence: 99%
“…Hence, in that case our results show that a SGD scheme associated with the training of the network converges almost surely on the event of staying local. Concerning the training of neural networks via SGD we refer the reader to [BM11] and [JW20]. Related target functions (loss landscapes) are analysed in [Coo18], [Ngu19], [Coo20], [PRV20] and [QZX20].…”
Section: Introduction (mentioning)
confidence: 99%