Abstract: This work considers the question: what convergence guarantees does the stochastic subgradient method have in the absence of smoothness and convexity? We prove that the stochastic subgradient method, on any semialgebraic locally Lipschitz function, produces limit points that are all first-order stationary. More generally, our result applies to any function with a Whitney stratifiable graph. In particular, this work endows the stochastic subgradient method, and its proximal extension, with rigorous convergence guarantees for a class of problems arising in data science, including all popular deep learning architectures.
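For readers who want the iteration in concrete form, here is a minimal sketch of the stochastic subgradient method on a nonsmooth objective. The ℓ1-regularized least-squares loss, the step-size schedule, and the subgradient selection at kinks are illustrative assumptions, not details drawn from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))
x_true = rng.standard_normal(10)
b = A @ x_true + 0.1 * rng.standard_normal(200)

def stoch_subgrad(x, batch=20, lam=0.1):
    """One stochastic subgradient of f(x) = (1/2n)||Ax - b||^2 + lam*||x||_1."""
    idx = rng.choice(len(b), size=batch, replace=False)
    g_smooth = A[idx].T @ (A[idx] @ x - b[idx]) / batch
    # np.sign(0.0) == 0 silently picks one element of the subdifferential
    # at the kinks of the l1 term; any selection in [-1, 1] would do.
    return g_smooth + lam * np.sign(x)

x = np.zeros(10)
for k in range(5000):
    alpha = 1.0 / (k + 100)  # not summable, square summable: the classic schedule
    x -= alpha * stoch_subgrad(x)
```

The objective is nonsmooth (the ℓ1 term) but semialgebraic, so it falls within the class covered by the convergence guarantee above.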
“…Note also that virtually all deep network architectures used in applications are actually definable, see e.g. [24]. (b) Despite our efforts we do not see any means to obtain Corollary 6 easily.…”
Section: Deep Neural Network and Nonsmooth Backpropagation (mentioning; confidence: 98%)
“…Proof: Using the chain rule characterization, all proofs boil down to providing a chain rule with the Clarke subdifferential for each of the above-mentioned situations. We refer to [48] for convex, Clarke regular, and prox-regular functions, and to [24] for tame functions.…”
Section: Corollary 3 (Integrability and Clarke Subdifferential) (mentioning; confidence: 99%)
“…On a more applied side, conservative fields make it possible to analyze fundamental modern numerical algorithms in machine learning and numerical analysis that are based on automatic differentiation [49,28] and decomposition [17,24] in a nonsmooth context. Automatic differentiation is indeed proved to yield conservative fields, which in turn allows one to study the discrete stochastic algorithms that are massively used to train AI systems.…”
Modern problems in AI and in numerical analysis require nonsmooth approaches with a flexible calculus. We introduce generalized derivatives called conservative fields, for which we develop a calculus and provide representation formulas. Functions having a conservative field are called path differentiable: convex, concave, Clarke regular, and all semialgebraic Lipschitz continuous functions are path differentiable. Using Whitney stratification techniques for semialgebraic and definable sets, our model provides variational formulas for nonsmooth automatic differentiation oracles, such as the well-known backpropagation algorithm in deep learning. Our differential model is applied to establish the convergence in values of nonsmooth stochastic gradient methods as they are implemented in practice.
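The point about automatic differentiation can be made concrete with a toy example: applied to a nonsmooth program such as a ReLU network, reverse-mode differentiation returns an element of a conservative field, and the value it returns at a kink depends on a hard-coded convention. The sketch below assumes a scalar one-neuron loss and a hypothetical relu_prime helper whose behavior at zero is a free parameter:

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def relu_prime(t, at_zero=0.0):
    # Any fixed value in [0, 1] at t == 0 gives a valid selection of a
    # conservative field for relu; frameworks hard-code one convention.
    t = np.asarray(t, dtype=float)
    return np.where(t > 0, 1.0, np.where(t == 0, at_zero, 0.0))

def backprop_loss_grad(w, x, y, at_zero=0.0):
    """Chain-rule 'gradient' of (relu(w*x) - y)**2 in the scalar case."""
    pre = w * x
    return 2.0 * (relu(pre) - y) * relu_prime(pre, at_zero) * x

# At the kink (w*x == 0) different conventions give different outputs,
# yet each is an element of a conservative field for the loss, so the
# convergence theory covers whichever value the oracle returns.
print(backprop_loss_grad(w=0.0, x=1.0, y=1.0, at_zero=0.0))   # 0.0
print(backprop_loss_grad(w=0.0, x=1.0, y=1.0, at_zero=1.0))   # -2.0
```

At the kink the two conventions disagree, and neither output need be a Clarke subgradient of the composite loss; the conservative-field formalism is what certifies that training with either output still converges.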
“…Our current work sits within the broader scope of analyzing subgradient and proximal methods for weakly convex problems [9, 11, 13-16, 25, 26]; see also the recent survey [12]. In particular, the paper [9] proves a global sublinear rate of convergence, in terms of a natural stationarity measure, of a (stochastic) subgradient method on any weakly convex function. In contrast, here we are interested in subgradient methods that are locally linearly convergent under the additional sharpness assumption.…”
Subgradient methods converge linearly on a convex function that grows sharply away from its solution set. In this work, we show that the same is true for sharp functions that are only weakly convex, provided that the subgradient methods are initialized within a fixed tube around the solution set. A variety of statistical and signal processing tasks come equipped with good initialization, and provably lead to formulations that are both weakly convex and sharp. Therefore, in such settings, subgradient methods can serve as inexpensive local search procedures. We illustrate the proposed techniques on phase retrieval and covariance estimation problems.
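One step rule studied in this line of work is the Polyak subgradient step, which uses the known minimal value (zero for noiseless phase retrieval). The sketch below runs it on a synthetic phase retrieval instance; the problem sizes, the initialization radius, and all variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 300, 10
A = rng.standard_normal((m, d))
x_star = rng.standard_normal(d)
b = (A @ x_star) ** 2                      # noiseless phase retrieval data

def f_and_subgrad(x):
    """f(x) = mean |(a_i^T x)^2 - b_i|: weakly convex and sharp."""
    r = (A @ x) ** 2 - b
    val = np.mean(np.abs(r))
    g = A.T @ (2.0 * np.sign(r) * (A @ x)) / m
    return val, g

x = x_star + 0.1 * rng.standard_normal(d)  # good initialization: inside the tube
for _ in range(200):
    val, g = f_and_subgrad(x)
    if val == 0.0:
        break
    x -= (val / (g @ g)) * g               # Polyak step; the minimal value is 0

print(min(np.linalg.norm(x - x_star), np.linalg.norm(x + x_star)))
```

The printed distance accounts for the sign ambiguity of phase retrieval; with the good initialization the iterates contract toward ±x_star at a linear rate, as the theory above predicts.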
“…Thus, one might hope to prove convergence results for a GS algorithm (with predetermined stepsizes rather than line searches) that parallel convergence theory for stochastic gradient methods. Recent work by Davis, Drusvyatskiy, Kakade and Lee [DDKL18] gives convergence results for stochastic subgradient methods on a broad class of problems.…”
This paper reviews the gradient sampling methodology for solving nonsmooth, nonconvex optimization problems. An intuitively straightforward gradient sampling algorithm is stated and its convergence properties are summarized. Throughout this discussion, we emphasize the simplicity of gradient sampling as an extension of the steepest descent method for minimizing smooth objectives. We then provide overviews of various enhancements that have been proposed to improve practical performance, as well as of several extensions that have been made in the literature, such as to solve constrained problems. The paper also includes clarification of certain technical aspects of the analysis of gradient sampling algorithms, most notably related to the assumptions one needs to make about the set of points at which the objective is continuously differentiable. Finally, we discuss possible future research directions.
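At its core, a gradient sampling step evaluates gradients at randomized points near the current iterate (where the objective is differentiable with probability one) and takes the minimal-norm element of the convex hull of those gradients as a search direction. The following is a minimal sketch of that single step on f(x) = ||x||_1, with the small quadratic program solved via scipy; the sampling radius, sample count, and helper names are illustrative, and the line search and radius-reduction logic of the full algorithm are omitted:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

def grad_f(x):
    # Gradient of f(x) = ||x||_1 wherever it is differentiable;
    # randomly sampled points avoid the kinks with probability one.
    return np.sign(x)

def gs_direction(x, eps=0.1, n_samples=20):
    """Minimal-norm element of the convex hull of sampled gradients."""
    pts = x + eps * rng.uniform(-1, 1, size=(n_samples, x.size))
    G = np.array([grad_f(p) for p in pts] + [grad_f(x)])
    obj = lambda lam: np.sum((lam @ G) ** 2)
    cons = {"type": "eq", "fun": lambda lam: lam.sum() - 1.0}
    lam0 = np.full(len(G), 1.0 / len(G))
    res = minimize(obj, lam0, bounds=[(0, 1)] * len(G), constraints=cons)
    return -(res.x @ G)

# Near the minimizer the sampled gradients straddle the kinks, their
# convex hull contains (approximately) zero, and the direction norm
# shrinks: a certificate of approximate Clarke stationarity.
x = np.array([0.02, -0.01])
print(np.linalg.norm(gs_direction(x)))
```

The appeal of the method, as the survey emphasizes, is exactly this simplicity: only gradient evaluations at nearby points are needed, mirroring steepest descent in the smooth case.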