2021
DOI: 10.48550/arxiv.2102.09385
Preprint

Convergence of stochastic gradient descent schemes for Lojasiewicz-landscapes

Steffen Dereich,
Sebastian Kassing

Abstract: In this article, we consider convergence of stochastic gradient descent schemes (SGD) under weak assumptions on the underlying landscape. More explicitly, we show that, on the event that the SGD stays local, we have convergence of the SGD if there is only a countable number of critical points or if the target function/landscape satisfies Lojasiewicz-inequalities around all critical levels, as all analytic functions do. In particular, we show that for neural networks with analytic activation function such as softp…
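As a point of reference for the abstract, the two objects it refers to can be written out explicitly; the notation below (step sizes \gamma_n, zero-mean noise D_{n+1}, exponent \beta, constant C) is generic and not necessarily the one used in the paper. A standard SGD recursion reads

\[
  X_{n+1} = X_n - \gamma_n \bigl( \nabla f(X_n) + D_{n+1} \bigr), \qquad n \ge 0,
\]

and a Lojasiewicz-type inequality around a critical point x^* of the landscape f asks for constants C > 0 and \beta \in (0,1) such that

\[
  \lvert f(x) - f(x^*) \rvert^{\beta} \le C \, \lVert \nabla f(x) \rVert
\]

for all x in a neighbourhood of x^*; real analytic functions satisfy such an inequality around every critical point.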

Cited by 11 publications (13 citation statements)
References 13 publications (13 reference statements)
“…There is empirical evidence that these deep learning-based methods work well at least for medium prescribed accuracies; see, e.g., the simulations in [13,18,32,4,3]. However, stochastic optimization methods may get trapped in local minima and there exists no theoretical convergence result; cf., e.g., [17].…”
Section: Introduction (mentioning)
confidence: 99%
“…In this section we establish in Proposition 9.4 below an abstract local convergence result for GD under a Kurdyka-Lojasiewicz assumption. In the scientific literature similar convergence results for GD type processes under a Lojasiewicz assumption can be found, e.g., in Absil et al [1], Attouch & Bolte [3], Dereich & Kassing [21], and Xu & Yin [63]. To prove Proposition 9.4 we transfer the ideas from the continuous-time setting in Section 8 to the discrete-time setting.…”
Section: Convergence Analysis for GD Processes (mentioning)
confidence: 63%
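For orientation on the continuous-time to discrete-time transfer mentioned in this statement, the two generic schemes can be written as the gradient flow and its Euler-type discretization; the notation is illustrative and not taken from the cited work:

\[
  \frac{\mathrm{d}}{\mathrm{d}t}\,\Theta_t = -\nabla f(\Theta_t)
  \qquad \text{(gradient flow, continuous time)},
\]

\[
  \theta_{n+1} = \theta_n - \gamma_n \nabla f(\theta_n)
  \qquad \text{(gradient descent, discrete time)}.
\]

Roughly speaking, Kurdyka-Lojasiewicz-type inequalities control the length of the trajectory in both settings and thereby force convergence to a single critical point rather than mere convergence of the function values.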
“…Regarding abstract results on the convergence of GF and GD processes we refer, for example, to [5,34,49,50,56] for the case of convex objective functions, we refer, for instance, to [1,3,4,10,21,39,42,43,46,47,51] for convergence results for GF and GD processes under Lojasiewicz type conditions, and we refer, for instance, to [7,27,44,54] and the references mentioned therein for further results without convexity conditions. In general, without global assumptions on the objective function such as convexity, gradient-based methods may converge to non-global local minima or saddle points.…”
Section: Introduction and Main Results (mentioning)
confidence: 99%
“…There are several promising attempts in the scientific literature which intend to mathematically analyze GD optimization algorithms in the training of ANNs. In particular, there are various convergence results for GD optimization algorithms in the training of ANNs that assume convexity of the considered objective functions (cf., e.g., [4,5,24] and the references mentioned therein), there are general abstract convergence results for GD optimization algorithms that do not assume convexity of the considered objective functions (cf., e.g., [1,6,10,14,19,20,22] and the references mentioned therein), there are divergence results and lower bounds for GD optimization algorithms in the training of ANNs (cf., e.g., [8,18,23] and the references mentioned therein), there are mathematical analyses regarding the initialization in the training of ANNs with GD optimization algorithms (cf., e.g., [15,16,23,27] and the references mentioned therein), and there are convergence results for GD optimization algorithms in the training of ANNs in the case of constant target functions (cf. [7]).…”
Section: Introduction (mentioning)
confidence: 99%