Nesterov's momentum trick is famous for accelerating gradient descent, and has proved useful in building fast iterative algorithms. However, in the stochastic setting, counterexamples exist that prevent Nesterov's momentum from providing similar acceleration, even if the underlying problem is convex. We introduce Katyusha, a direct, primal-only stochastic gradient method that fixes this issue. It has a provably accelerated convergence rate in convex (off-line) stochastic optimization. The main ingredient is Katyusha momentum, a novel "negative momentum" on top of Nesterov's momentum. It can be incorporated into a variance-reduction-based algorithm and speed it up, both in sequential and in parallel performance. Since variance reduction has been successfully applied to a growing list of practical problems, our paper suggests that in each such case, one could potentially try to give Katyusha a hug. * We would like to especially thank Shai Shalev-Shwartz for useful feedback and suggestions on this paper, thank Blake Woodworth and Nati Srebro for the pointer to their paper [49], thank Guanghui Lan for correcting our citation of [16], thank Weston Jackson, Xu Chen, and Zhe Li for verifying the proofs and correcting typos, and thank anonymous reviewers for a number of writing suggestions.
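To make the Katyusha coupling concrete, here is a minimal numpy sketch of a Katyusha-style loop, assuming a strongly convex finite-sum objective; the helper names (grad_i, full_grad), the epoch length, and the step-size constants are illustrative placeholders rather than the paper's exact algorithm or tuned parameters. The τ2·x̃ term in the coupling line is the "negative momentum": it retracts the extrapolated iterate back toward the epoch snapshot x̃.

```python
import numpy as np

def katyusha_sketch(grad_i, full_grad, x0, n, L, sigma, epochs=20, m=None, seed=0):
    """Hedged sketch of a Katyusha-style accelerated SVRG loop (strongly convex case).

    grad_i(x, i) : stochastic gradient of the i-th component at x
    full_grad(x) : exact gradient of the average objective at x
    n, L, sigma  : number of components, smoothness, strong-convexity estimates (sigma > 0)
    """
    rng = np.random.default_rng(seed)
    m = m or 2 * n                                  # inner-epoch length (placeholder choice)
    tau2 = 0.5                                      # weight of the Katyusha (negative) momentum
    tau1 = min(np.sqrt(m * sigma / (3 * L)), 0.5)   # Nesterov-style momentum weight
    alpha = 1.0 / (3 * tau1 * L)                    # mirror-descent step size

    x_tilde = x0.copy()                             # epoch snapshot
    y, z = x0.copy(), x0.copy()                     # gradient-step and mirror-step iterates
    for _ in range(epochs):
        mu = full_grad(x_tilde)                     # full gradient at the snapshot
        y_sum = np.zeros_like(y)
        for _ in range(m):
            # coupling: extrapolate with z, but also pull back toward the snapshot x_tilde
            x = tau1 * z + tau2 * x_tilde + (1 - tau1 - tau2) * y
            i = rng.integers(n)
            g = mu + grad_i(x, i) - grad_i(x_tilde, i)   # variance-reduced gradient estimate
            z = z - alpha * g                       # mirror (momentum) step
            y = x - g / (3 * L)                     # short gradient step
            y_sum += y
        x_tilde = y_sum / m                         # next snapshot (plain average, for simplicity)
    return x_tilde
```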
We design a non-convex second-order optimization algorithm that is guaranteed to return an approximate local minimum in time that scales linearly in the underlying dimension and the number of training examples. Our algorithm finds an approximate local minimum even faster than gradient descent finds a critical point. It applies to a general class of optimization problems, including training a neural network and other non-convex objectives arising in machine learning.
The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why don't trained neural networks overfit when they are overparameterized (namely, when they have more parameters than are statistically needed to fit the training data)? In this work, we prove that overparameterized neural networks can learn some notable concept classes, including functions computed by two- and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be done simply by SGD (stochastic gradient descent) or its variants, in polynomial time and using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the overparameterized network. * V1 appears on this date; V2/V3/V4/V5 polish the writing and parameters, and V5 adds experiments. Authors are sorted in alphabetical order. We would like to thank Greg Yang and Sebastien Bubeck for many enlightening conversations.
We study the design of nearly-linear-time algorithms for approximately solving positive linear programs. Both the parallel and the sequential deterministic versions of these algorithms require O(ε^{-4}) iterations, a dependence that has not been improved since the introduction of these methods in 1993 by Luby and Nisan. Moreover, previous algorithms and their analyses rely on update steps and convergence arguments that are combinatorial in nature, and do not seem to arise naturally from an optimization viewpoint. In this paper, we leverage insights from optimization theory to construct a novel algorithm that breaks the longstanding O(ε^{-4}) barrier. Our algorithm has a simple analysis and a clear motivation. Our work introduces a number of novel techniques, such as the combined application of gradient descent and mirror descent, and a truncated, smoothed version of the standard multiplicative weight update, which may be of independent interest.
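As a generic illustration of a truncated multiplicative weight update (only a sketch; the paper's smoothed and truncated update, its thresholds, and its potential-function analysis are different), the step below clips the per-coordinate feedback before the exponential update, which bounds how much any single weight can change in one step:

```python
import numpy as np

def truncated_mwu_step(w, feedback, eta=0.1, clip=1.0):
    """One multiplicative-weight step with the feedback truncated to [-clip, clip].

    w        : current nonnegative weights (e.g., one per LP constraint)
    feedback : per-coordinate loss signal; weights shrink where the loss is high
    The clipping keeps every multiplicative factor within [exp(-eta*clip), exp(eta*clip)].
    """
    v = np.clip(feedback, -clip, clip)   # truncation of the feedback
    w_new = w * np.exp(-eta * v)         # standard multiplicative update on the truncated signal
    return w_new / w_new.sum()           # renormalize to a probability distribution
```

In a positive-LP setting, the feedback would be derived from the constraint slacks, with the sign chosen so that the weights concentrate on the tightest constraints; that choice, the clip level, and the step size are all placeholders here.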
How does a 110-layer ResNet learn a high-complexity classifier using relatively few training examples and a short training time? We present a theory towards explaining this in terms of hierarchical learning. By hierarchical learning we mean that the learner learns to represent a complicated target function by decomposing it into a sequence of simpler functions, thereby reducing sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning efficiently and automatically, simply by applying stochastic gradient descent (SGD) to the training objective. On the conceptual side, we present, to the best of our knowledge, the first theoretical result indicating how very deep neural networks can still be sample- and time-efficient on certain hierarchical learning tasks, when no known non-hierarchical algorithm (such as kernel methods, linear regression over feature mappings, tensor decomposition, sparse coding, or their simple combinations) is efficient. We establish a new principle called "backward feature correction", which we believe is the key to understanding hierarchical learning in multi-layer neural networks. On the technical side, we show, for regression and even for binary classification, that for every input dimension d > 0 there is a concept class consisting of degree-ω(1) multivariate polynomials such that, using ω(1)-layer neural networks as learners, SGD can learn any target function from this class in poly(d) time using poly(d) samples, to any 1/poly(d) regression or classification error, by learning to represent it as a composition of ω(1) layers of quadratic functions. In contrast, we present lower bounds stating that several non-hierarchical learners, including any kernel method and neural tangent kernels, must suffer super-polynomial d^{ω(1)} sample or time complexity to learn functions in this concept class even to d^{-0.01} error.
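To make the concept class in the last sentence concrete, here is a small numpy sketch that builds an illustrative hierarchical target as a composition of quadratic layers; the layer widths, random distribution, and normalization are placeholder choices rather than the paper's construction, and no training code is included:

```python
import numpy as np

def random_quadratic_layer(d_in, d_out, rng):
    """A layer computing coordinate-wise squares of random linear projections."""
    W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
    return lambda x: (W @ x) ** 2          # each output is a degree-2 polynomial of x

def hierarchical_target(d, depth, width, seed=0):
    """Illustrative target: a composition of `depth` quadratic layers,
    hence a multivariate polynomial of degree 2**depth."""
    rng = np.random.default_rng(seed)
    dims = [d] + [width] * (depth - 1) + [1]
    layers = [random_quadratic_layer(dims[i], dims[i + 1], rng) for i in range(depth)]
    def f(x):
        for layer in layers:
            x = layer(x)
        return x
    return f

# example: a depth-4 composition on 32-dimensional inputs => a degree-16 polynomial
f = hierarchical_target(d=32, depth=4, width=8)
print(f(np.random.default_rng(1).standard_normal(32)))
```

A depth-ℓ composition of this kind has degree 2^ℓ, which is what makes it expensive for "flat" (non-hierarchical) learners even though each individual layer is only a simple quadratic.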
We develop several efficient algorithms for the classical Matrix Scaling problem, which is used in many diverse areas, from preconditioning linear systems to approximation of the permanent. On an input n × n matrix A, this problem asks to find diagonal (scaling) matrices X and Y (if they exist), so that XAY ε-approximates a doubly stochastic matrix, or more generally a matrix with prescribed row and column sums. We address the general scaling problem as well as some important special cases. In particular, if A has m nonzero entries, and if there exist X and Y with polynomially large entries such that XAY is doubly stochastic, then we can solve the problem in total complexity O(m + n^{4/3}). This greatly improves on the best known previous results, which were either O(n^4) or O(m n^{1/2}/ε). Our algorithms are based on tailor-made first- and second-order techniques, combined with other recent advances in continuous optimization, which may be of independent interest for solving similar problems.

We say that A is asymptotically (r, c)-scalable if the row and column sums can reach r and c asymptotically: that is, for every ε > 0, there exist positive diagonal matrices X, Y such that, letting B = XAY, we have ‖B1 − r‖ ≤ ε and ‖1ᵀB − c‖ ≤ ε. The combinatorial essence of asymptotic scaling follows from a well-known characterization (see Proposition 2.2). A matrix A is asymptotically (1, 1)-scalable if and only if the permanent of A is positive, namely if the bipartite graph defined by the positive entries in A has a perfect matching. A matrix A is asymptotically (r, c)-scalable if and only if a natural flow problem on the same bipartite graph has a solution. Duality (Hall's theorem and the max-flow min-cut theorem) gives simple certificates of non-scalability in terms of the patterns of 0's in the matrix A.

The main computational problem we study is: given a matrix A, vectors r, c, and ε > 0, determine whether A is ε-approximately (r, c)-scalable, and if so, find the scaling matrices X, Y. Before diving into the history of matrix scaling, we explain one of its most basic applications, which also demonstrates its algorithmic importance.

Preconditioning Linear Systems. When solving a linear system Az = b, it is often desirable, for numerical stability and efficiency purposes, to have the matrix A be well-conditioned. When this is not the case, one tries to transform A into a "better conditioned" matrix A′. Matrix scaling provides a natural and efficient reduction to do so. For instance, one would hope that a scaled matrix A′, in which e.g. all row and column p-norms are (say) 1, is better conditioned. For this reason, we can use a matrix scaling algorithm to obtain diagonal matrices X, Y, and define A′ = XAY. Now, the solution to Az = b can be obtained by solving the (hopefully more numerically stable) linear system A′z′ = Xb and setting z = Y z′. We stress here that A and A′ have the same sparsity.

History and Prior Work. The matrix (r, c)-scaling problem is so natural and important that it was discove...
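For concreteness, the classical Sinkhorn (RAS) iteration below illustrates the (r, c)-scaling problem itself; it is the textbook baseline, not the faster first- and second-order algorithms of the paper, and the stopping rule, error measure, and input assumptions (a nonnegative matrix with no all-zero rows or columns) are placeholders:

```python
import numpy as np

def sinkhorn_scaling(A, r, c, eps=1e-6, max_iter=10_000):
    """Classical Sinkhorn (RAS) baseline for (r, c)-scaling.

    Alternately rescale the rows and columns of the nonnegative matrix A until
    diag(x) @ A @ diag(y) has row sums ~ r and column sums ~ c.
    Returns the diagonal vectors x, y.
    """
    x = np.ones(A.shape[0])
    y = np.ones(A.shape[1])
    for _ in range(max_iter):
        B = x[:, None] * A * y[None, :]          # current scaled matrix diag(x) A diag(y)
        row_err = np.abs(B.sum(axis=1) - r).sum()
        col_err = np.abs(B.sum(axis=0) - c).sum()
        if max(row_err, col_err) <= eps:
            break
        x = r / (A @ y)                          # fix the row sums
        y = c / (A.T @ x)                        # then fix the column sums
    return x, y
```

For the preconditioning application above, one would compute such a scaling (e.g. of the matrix of entrywise absolute values), set X = diag(x) and Y = diag(y), solve A′z′ = Xb with A′ = XAY, and recover z = Y z′.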
Positive linear programs (LPs), also known as packing and covering linear programs, are an important class of problems that bridges computer science, operations research, and optimization. Efficient algorithms for solving such LPs have received significant attention in the past 20 years [2,3,4,6,7,9,11,15,16,18,19,21,24,25,26,29,30]. Unfortunately, all known nearly-linear-time algorithms for producing (1+ε)-approximate solutions to positive LPs have a running-time dependence that is at least proportional to ε^{-2}. This is also known as an O(1/√T) convergence rate and is particularly poor in many applications. In this paper, we leverage insights from optimization theory to break this longstanding barrier. Our algorithms solve the packing LP in time O(N ε^{-1}) and the covering LP in time O(N ε^{-1.5}). At a high level, they can be described as linear couplings of several first-order descent steps. This is the first application of our linear coupling technique (see [1]) to problems that are not amenable to black-box applications of known iterative algorithms in convex optimization. Our work also introduces a sequence of new techniques, including the stochastic and the non-symmetric execution of gradient truncation operations, which may be of independent interest.
We study streaming principal component analysis (PCA), that is, finding, in O(dk) space, the top k eigenvectors of a d × d hidden matrix Σ from online vectors drawn from the covariance matrix Σ. We provide global convergence for Oja's algorithm, which is popular in practice but lacked theoretical understanding for k > 1. We also provide a modified variant, Oja++, that runs even faster than Oja's. Our results match the information-theoretic lower bound in terms of the dependency on the error, the eigengap, the rank k, and the dimension d, up to poly-log factors. In addition, our convergence rate can be made gap-free, that is, proportional to the approximation error and independent of the eigengap. In contrast, for general rank k, before our work (1) it was open to design any algorithm with an efficient global convergence rate; and (2) it was open to design any algorithm with an (even local) gap-free convergence rate in O(dk) space.
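The numpy sketch below shows the textbook form of Oja's algorithm for general rank k as studied here; the constant learning rate is a placeholder (the guarantees in the paper rely on carefully scheduled step sizes, and Oja++ additionally uses a staged warm start), and the per-step QR re-orthonormalization is one common way to keep the state at O(dk) numbers:

```python
import numpy as np

def oja_streaming_pca(stream, d, k, eta=0.01):
    """Hedged sketch of Oja's algorithm for streaming k-PCA in O(dk) space.

    stream : iterable of d-dimensional sample vectors x_t drawn from the hidden covariance
    d, k   : ambient dimension and target rank
    """
    rng = np.random.default_rng(0)
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))   # random orthonormal start
    for x in stream:
        Q = Q + eta * np.outer(x, x @ Q)               # rank-1 stochastic update x x^T Q
        Q, _ = np.linalg.qr(Q)                         # re-orthonormalize the k columns
    return Q                                            # columns approximate the top-k eigenvectors
```

Each QR step costs O(dk^2) time even though the storage stays at O(dk); in practice the re-orthonormalization can be applied only every few iterations.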