Abstract: We consider a family of algorithms that successively sample and minimize simple stochastic models of the objective function. We show that under reasonable conditions on approximation quality and regularity of the models, any such algorithm drives a natural stationarity measure to zero at the rate O(k^{-1/4}). As a consequence, we obtain the first complexity guarantees for the stochastic proximal point, proximal subgradient, and regularized Gauss-Newton methods for minimizing compositions of convex functions wit…
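The model-based template this abstract describes can be illustrated with a minimal sketch of one of the named members, the stochastic proximal point method, on a least-squares objective. The data, stepsize schedule, and function names here are hypothetical; the closed-form prox step is specific to the scalar quadratic model f(x; a, b) = ½(aᵀx − b)².

```python
import numpy as np

def stochastic_prox_point(A, b, x0, alpha0=1.0, iters=2000, seed=0):
    """Stochastic proximal point for f(x) = (1/2m) * sum_i (a_i^T x - b_i)^2.

    Each iteration exactly minimizes the sampled model
        (1/2)(a_i^T y - b_i)^2 + (1/(2*alpha_k)) * ||y - x_k||^2,
    which has a closed form for a single scalar least-squares term.
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    m = len(b)
    for k in range(iters):
        i = rng.integers(m)
        a, bi = A[i], b[i]
        alpha = alpha0 / np.sqrt(k + 1)   # diminishing stepsize schedule
        # Exact minimizer of the sampled model plus proximal term:
        x = x - alpha * (a @ x - bi) / (1.0 + alpha * (a @ a)) * a
    return x

# Toy usage on a consistent linear system (hypothetical data):
rng = np.random.default_rng(1)
A = rng.normal(size=(200, 5))
x_star = rng.normal(size=5)
b = A @ x_star
x_hat = stochastic_prox_point(A, b, np.zeros(5))
print(np.linalg.norm(x_hat - x_star))  # small: the data are noiseless
```

Because each step solves the sampled model exactly rather than linearizing it, the iteration never overshoots on a single sample, which is the stability property the surrounding papers emphasize.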
“…An essential step in the analysis of stochastic recursive algorithms by the differential inclusion method is the chain rule on a path (see [9] and the references therein). For an absolutely continuous function p : [0, ∞) → ℝⁿ we denote by ṗ(·) its weak derivative: a measurable function such that…”
Section: Generalized Subdifferentials Of Composite Functions
We propose a single time-scale stochastic subgradient method for constrained optimization of a composition of several nonsmooth and nonconvex functions. The functions are assumed to be locally Lipschitz and differentiable in a generalized sense. Only stochastic estimates of the values and generalized derivatives of the functions are used. The method is parameter-free. We prove convergence with probability one of the method, by associating with it a system of differential inclusions and devising a nondifferentiable Lyapunov function for this system. For problems with functions having Lipschitz continuous derivatives, the method finds a point satisfying an optimality measure with error of order 1/√N, after executing N iterations with constant stepsize.
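The key computational ingredient in such composition methods is the chain-rule estimate of a generalized gradient: a subgradient of the outer function multiplied by the Jacobian of the inner smooth map. The following hedged sketch (not the paper's exact algorithm) applies that estimate on the phase-retrieval objective f(x) = E|⟨a, x⟩² − b|; the data, warm start, and stepsize schedule are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 100
A = rng.normal(size=(m, n))
x_star = rng.normal(size=n)
x_star /= np.linalg.norm(x_star)
b = (A @ x_star) ** 2                    # noiseless phase-retrieval data

def full_obj(x):
    return np.mean(np.abs((A @ x) ** 2 - b))

# Chain-rule subgradient of f(x; a, b) = |(a^T x)^2 - b|:
# the outer |.| contributes sign(r); the inner smooth map contributes 2(a^T x)a.
x = x_star + 0.1 * rng.normal(size=n)    # warm start near the target
f0 = full_obj(x)
for k in range(5000):
    i = rng.integers(m)
    a = A[i]
    r = (a @ x) ** 2 - b[i]
    g = np.sign(r) * 2.0 * (a @ x) * a   # stochastic generalized gradient
    x -= 0.01 / np.sqrt(k + 1) * g       # diminishing stepsize
f_final = full_obj(x)
print(f0, "->", f_final)
```

Only function values and generalized-derivative estimates of the sampled term are used, mirroring the information model of the abstract above.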
“…By providing a "relative" noise condition on f , Assumption A4 allows for a broader class of functions without global Lipschitz properties (as are typically assumed [8]), such as the phase retrieval and matrix completion objectives (Examples 1 and 2). It can allow exponential growth, addressing the challenges in Ex.…”
Section: Stability and Its Consequences For Weakly Convex Functions
“…To describe convergence and stability guarantees in non-convex (even non-smooth) settings, we require appropriate definitions. Finding global minima of non-convex functions is computationally infeasible [26], so we follow established practice and consider convergence to stationary points, specifically using the convergence of the Moreau envelope [8,13]. To formalize, for x ∈ ℝⁿ and λ ≥ 0, the Moreau envelope and associated proximal map are…”
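The snippet breaks off before the display; the standard definitions, consistent with the notation F_λ and prox_{F/λ} used in the surrounding excerpts, are:

```latex
F_{\lambda}(x) \;:=\; \inf_{y \in \mathbb{R}^n} \Big\{ F(y) + \tfrac{\lambda}{2}\,\|y - x\|_2^2 \Big\},
\qquad
\mathrm{prox}_{F/\lambda}(x) \;:=\; \operatorname*{argmin}_{y \in \mathbb{R}^n} \Big\{ F(y) + \tfrac{\lambda}{2}\,\|y - x\|_2^2 \Big\}.
```

For a ρ-weakly convex F the inner problem is strongly convex whenever λ > ρ, which is why the minimizer is unique "for large enough λ" in the excerpt that follows.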
Section: Stability and Its Consequences For Weakly Convex Functions
“…For large enough λ, the minimizer x_λ := prox_{F/λ}(x) is unique whenever F is weakly convex. Adopting the techniques pioneered by Davis and Drusvyatskiy [8] for convergence of stochastic methods on weakly convex problems, our convergence machinery relies on the Moreau envelope's connections to (near) stationarity:…”
Section: Stability and Its Consequences For Weakly Convex Functions
“…The three properties (8) imply that any nearly stationary point x of F_λ (when ‖∇F_λ(x)‖₂ is small) is close to a nearly stationary point x_λ of the original function F(·). To prove convergence for weakly convex methods, then, it is sufficient to show that ∇F_λ(x_k) → 0.…”
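The identity underlying this argument is ∇F_λ(x) = λ(x − x_λ), with x_λ = prox_{F/λ}(x). A quick numerical check on F(y) = |y|, whose proximal map is soft-thresholding and whose envelope is the Huber function, confirms the identity against a finite difference; the parameter value is arbitrary.

```python
import numpy as np

lam = 2.0  # envelope parameter; any lam > 0 works since F is convex here

def prox(x, lam):
    """prox_{F/lam}(x) for F(y) = |y|: soft-thresholding at 1/lam."""
    return np.sign(x) * np.maximum(np.abs(x) - 1.0 / lam, 0.0)

def envelope(x, lam):
    """Moreau envelope F_lam(x) = min_y |y| + (lam/2)(y - x)^2 (Huber)."""
    y = prox(x, lam)
    return np.abs(y) + 0.5 * lam * (y - x) ** 2

xs = np.linspace(-3, 3, 7)
# Identity: grad F_lam(x) = lam * (x - prox_{F/lam}(x))
grad_identity = lam * (xs - prox(xs, lam))
# Compare with a centered finite difference of the envelope
h = 1e-6
grad_fd = (envelope(xs + h, lam) - envelope(xs - h, lam)) / (2 * h)
print(np.max(np.abs(grad_identity - grad_fd)))  # roundoff-level, well below 1e-6
```

The identity also explains the quoted proof strategy: driving ∇F_λ(x_k) to zero is the same as driving x_k toward its own proximal point x_λ.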
Section: Stability and Its Consequences For Weakly Convex Functions
Standard stochastic optimization methods are brittle, sensitive to stepsize choices and other algorithmic parameters, and they exhibit instability outside of well-behaved families of objectives. To address these challenges, we investigate models for stochastic optimization and learning problems that exhibit better robustness to problem families and algorithmic parameters. With appropriately accurate models (which we call the aProx family [2]), stochastic methods can be made stable, provably convergent, and asymptotically optimal; even modeling that the objective is nonnegative is sufficient for this stability. We extend these results beyond convexity to weakly convex objectives, which include compositions of convex losses with smooth functions common in modern machine learning applications. We highlight the importance of robustness and accurate modeling with a careful experimental evaluation of convergence time and algorithm sensitivity.
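The "modeling that the objective is nonnegative" idea admits a short sketch. Assuming the truncated-model update x⁺ = x − min(α, f(x;ξ)/‖g‖²)·g, which results from minimizing the model max(f(x;ξ) + ⟨g, y − x⟩, 0) plus a proximal term, the following hypothetical example runs it on absolute-loss regression with consistent data; the stepsize cap α and data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 5))
x_star = rng.normal(size=5)
b = A @ x_star                       # consistent data, so min_x f(x) = 0
x = np.zeros(5)
alpha = 1.0                          # stepsize cap; the method is insensitive to it
for k in range(3000):
    i = rng.integers(200)
    a, r = A[i], A[i] @ x - b[i]
    if r == 0.0:
        continue                     # sampled model already at its lower bound
    g = np.sign(r) * a               # subgradient of f_i(x) = |a_i^T x - b_i|
    eta = min(alpha, abs(r) / (g @ g))   # truncated-model stepsize
    x -= eta * g
print(np.linalg.norm(x - x_star))
```

The truncation caps the step so a single sample can never drive the model below its known lower bound of zero, which is precisely the stability mechanism the abstract credits for robustness to the stepsize choice.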
This paper reviews the gradient sampling methodology for solving nonsmooth, nonconvex optimization problems. An intuitively straightforward gradient sampling algorithm is stated and its convergence properties are summarized. Throughout this discussion, we emphasize the simplicity of gradient sampling as an extension of the steepest descent method for minimizing smooth objectives. We then provide overviews of various enhancements that have been proposed to improve practical performance, as well as of several extensions that have been made in the literature, such as to solve constrained problems. The paper also includes clarification of certain technical aspects of the analysis of gradient sampling algorithms, most notably related to the assumptions one needs to make about the set of points at which the objective is continuously differentiable. Finally, we discuss possible future research directions.
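A bare-bones sketch of the gradient sampling iteration summarized above: sample gradients in a small neighborhood, step along the negative minimum-norm element of their convex hull, and shrink the sampling radius when that element certifies approximate stationarity. The minimum-norm subproblem is approximated here with a simple Frank-Wolfe routine; the objective, sampling radius, and line-search constants are illustrative, not taken from the reviewed paper.

```python
import numpy as np

def min_norm_in_hull(G, steps=200):
    """Frank-Wolfe approximation of the min-norm element of conv{rows of G}."""
    d = G[0].copy()
    for _ in range(steps):
        g = G[np.argmin(G @ d)]          # vertex most opposed to d
        diff = d - g
        denom = diff @ diff
        if denom < 1e-16:
            break
        gamma = np.clip((diff @ d) / denom, 0.0, 1.0)  # exact line search
        d = (1 - gamma) * d + gamma * g
    return d

def gradient_sampling(f, grad, x0, eps=0.5, nu=0.1, iters=60, m=15, seed=0):
    """Sample gradients near x (in a box of half-width eps for simplicity),
    descend along the negative min-norm hull element, and shrink eps when
    the hull certifies approximate stationarity."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        pts = x + eps * rng.uniform(-1, 1, size=(m, x.size))
        G = np.array([grad(p) for p in pts])
        d = min_norm_in_hull(G)
        if np.linalg.norm(d) <= nu:
            eps *= 0.5                   # approximately stationary: refine
            continue
        t = 1.0                          # backtracking (Armijo) line search
        while t > 1e-10 and f(x - t * d) > f(x) - 1e-4 * t * (d @ d):
            t *= 0.5
        x = x - t * d
    return x

# Toy nonsmooth objective, differentiable almost everywhere: f(x) = ||x||_1
f = lambda x: np.abs(x).sum()
grad = lambda x: np.sign(x)
x_final = gradient_sampling(f, grad, x0=[2.0, 1.3])
print(x_final)  # approaches the minimizer at the origin
```

The sketch mirrors the paper's framing of gradient sampling as a robustified steepest descent: with a single sample it reduces to a gradient step, and the hull computation only matters near points of nondifferentiability.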