Alternating minimization is a widely applicable and empirically successful approach for finding low-rank matrices that best fit the given data. For example, for the problem of low-rank matrix completion, this method is believed to be one of the most accurate and efficient, and it formed a major component of the winning entry in the Netflix Challenge [17].

In the alternating minimization approach, the low-rank target matrix is written in a bilinear form, i.e. X = UV^†; the algorithm then alternates between finding the best U and the best V. Typically, each alternating step in isolation is convex and tractable. However, the overall problem becomes non-convex and is prone to local minima. In fact, there has been almost no theoretical understanding of when this approach yields a good result.

In this paper we present one of the first theoretical analyses of the performance of alternating minimization for matrix completion and the related problem of matrix sensing. For both of these problems, celebrated recent results have shown that they become well-posed and tractable once certain (now standard) conditions are imposed on the problem. We show that alternating minimization also succeeds under similar conditions. Moreover, compared to existing results, our paper shows that alternating minimization guarantees faster (in particular, geometric) convergence to the true matrix, while allowing a significantly simpler analysis.
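To make the bilinear alternation above concrete, here is a minimal numpy sketch of alternating least squares for matrix completion. The observation mask `mask`, the target `rank`, and the iteration count are illustrative placeholders; details the analysis relies on (such as initialization and how observations are split across iterations) are not shown.

```python
import numpy as np

def altmin_complete(M, mask, rank, n_iters=50, seed=0):
    """Minimal alternating least-squares sketch for matrix completion.

    M    : matrix with observed entries (values outside `mask` are ignored)
    mask : boolean array, True where an entry is observed
    rank : target rank r of the factorization X = U @ V.T
    """
    rng = np.random.default_rng(seed)
    m, n = M.shape
    U = rng.standard_normal((m, rank))
    V = rng.standard_normal((n, rank))
    for _ in range(n_iters):
        # Fix V: each row of U solves a small convex least-squares problem
        # over that row's observed entries.
        for i in range(m):
            obs = mask[i]
            if obs.any():
                U[i], *_ = np.linalg.lstsq(V[obs], M[i, obs], rcond=None)
        # Fix U: update each row of V symmetrically from its column's entries.
        for j in range(n):
            obs = mask[:, j]
            if obs.any():
                V[j], *_ = np.linalg.lstsq(U[obs], M[obs, j], rcond=None)
    return U, V
```

Each inner update is an ordinary least-squares problem, which is why every alternating step is convex and tractable even though the joint problem in (U, V) is non-convex.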
Phase retrieval problems involve solving linear equations, but with missing sign (or phase, for complex numbers) information. More than four decades after it was first proposed, the seminal error reduction algorithm of Gerchberg and Saxton [21] and Fienup [19] is still the popular choice for solving many variants of this problem. The algorithm is based on alternating minimization: it alternates between estimating the missing phase information and estimating the candidate solution. Despite its wide usage in practice, no global convergence guarantees for this algorithm are known. In this paper, we show that a (resampling) variant of this approach converges geometrically to the solution of one such problem: finding a vector x from y and A, where y = |A^T x| and |z| denotes the vector of element-wise magnitudes of z, under the assumption that A is Gaussian.

Empirically, we demonstrate that alternating minimization performs similarly to recently proposed convex techniques for this problem (which are based on "lifting" to a convex matrix problem) in sample complexity and robustness to noise. However, it is much more efficient and can scale to large problems. Analytically, for a resampling version of alternating minimization, we show geometric convergence to the solution, and a sample complexity that is off only by log factors from obvious lower bounds. We also establish close-to-optimal scaling for the case when the unknown vector is sparse. Our work represents the first theoretical guarantee for alternating minimization (albeit with resampling) for any variant of phase retrieval problems in the non-convex setting.
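The alternation between the missing phases and the candidate solution can be sketched as follows (real-valued case). The random initialization and the fixed sensing matrix below are illustrative simplifications: the initialization the analysis relies on and the per-iteration resampling of measurements mentioned in the abstract are not shown.

```python
import numpy as np

def altmin_phase_retrieval(A, y, n_iters=100, seed=0):
    """Sketch of alternating minimization for phase retrieval, y = |A.T @ x|.

    A : n x m matrix whose columns are the (Gaussian) measurement vectors
    y : m-vector of observed magnitudes
    """
    rng = np.random.default_rng(seed)
    n, _ = A.shape
    x = rng.standard_normal(n)                # illustrative random start
    for _ in range(n_iters):
        # Step 1: estimate the missing signs (phases) from the current iterate.
        phase = np.sign(A.T @ x)
        phase[phase == 0] = 1.0
        # Step 2: with the signs fixed, the problem becomes linear; solve an
        # ordinary least-squares problem for the next iterate.
        x, *_ = np.linalg.lstsq(A.T, phase * y, rcond=None)
    return x
```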
We consider the problem of sparse coding, where each sample consists of a sparse linear combination of a set of dictionary atoms, and the task is to learn both the dictionary elements and the mixing coefficients. Alternating minimization is a popular heuristic for sparse coding, where the dictionary and the coefficients are estimated in alternating steps, keeping the other fixed. Typically, the coefficients are estimated via ℓ1 minimization, keeping the dictionary fixed, and the dictionary is estimated through least squares, keeping the coefficients fixed. In this paper, we establish local linear convergence for this variant of alternating minimization and show that the basin of attraction for the global optimum (corresponding to the true dictionary and coefficients) is O(1/s^2), where s is the sparsity level in each sample and the dictionary satisfies RIP. Combined with recent results on approximate dictionary estimation, this yields provable guarantees for exact recovery of both the dictionary elements and the coefficients, when the dictionary elements are incoherent.

The problem of sparse coding consists of unsupervised learning of the dictionary and the coefficient matrices. Thus, given only unlabeled data, we aim to learn the set of dictionary atoms or basis functions that provide a good fit to the observed data. Sparse coding is applied in a variety of domains. Sparse coding of natural images has yielded dictionary atoms which resemble the receptive fields of neurons in the visual cortex [26,27], and has also yielded localized dictionary elements on speech and video data [19,25].

An important strength of sparse coding is that it can incorporate overcomplete dictionaries, where the number of dictionary atoms r can exceed the observed dimensionality d. It has been argued that an overcomplete representation provides greater flexibility in modeling and more robustness to noise [19], which is crucial for encoding complex signals present in images, speech and video. It has been shown that the performance of most machine learning methods employed downstream is critically dependent on the choice of data representation, and overcomplete representations are key to obtaining state-of-the-art prediction results [6].

On the downside, the problem of learning sparse codes is computationally challenging and is, in general, NP-hard [9]. In practice, heuristics based on alternating minimization are employed. At a high level, this consists of alternating steps, where the dictionary is kept fixed and the coefficients are updated, and vice versa. Such alternating minimization methods have enjoyed empirical success in a number of settings [18,10,2,20,35]. In this paper, we carry out a theoretical analysis of the alternating minimization procedure for sparse coding.
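The alternation described above (ℓ1-regularized regression for the coefficients with the dictionary fixed, then least squares for the dictionary with the coefficients fixed) can be sketched as follows. The penalty `lam`, the unit-norm renormalization of atoms, and the iteration count are illustrative choices, not the paper's exact parameters.

```python
import numpy as np
from sklearn.linear_model import Lasso

def altmin_sparse_coding(Y, r, lam=0.1, n_iters=20, seed=0):
    """Sketch of alternating minimization for sparse coding, Y ≈ D @ X.

    Y : d x n data matrix (one sample per column)
    r : number of dictionary atoms (possibly overcomplete, r > d)
    """
    rng = np.random.default_rng(seed)
    d, n = Y.shape
    D = rng.standard_normal((d, r))
    D /= np.linalg.norm(D, axis=0)                 # unit-norm atoms
    X = np.zeros((r, n))
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    for _ in range(n_iters):
        # Coefficient step: dictionary fixed, ℓ1-regularized (lasso) fit per sample.
        for i in range(n):
            X[:, i] = lasso.fit(D, Y[:, i]).coef_
        # Dictionary step: coefficients fixed, least-squares update of D.
        D = Y @ np.linalg.pinv(X)
        D /= np.linalg.norm(D, axis=0) + 1e-12
    return D, X
```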
We consider the problem of learning overcomplete dictionaries in the context of sparse coding, where each sample selects a sparse subset of dictionary elements. Our main result is a strategy to approximately recover the unknown dictionary using an efficient algorithm. Our algorithm is a clustering-style procedure, where each cluster is used to estimate a dictionary element. The resulting solution can often be further cleaned up to obtain a high-accuracy estimate, and we provide one simple scenario where ℓ1-regularized regression can be used for such a second stage.
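The abstract does not spell out the clustering rule, so the sketch below is only one plausible instantiation, not the paper's algorithm: it assumes that samples sharing a dominant dictionary element are strongly correlated, groups samples by a correlation threshold, and takes each group's top singular vector as an approximate atom. All names and thresholds are hypothetical.

```python
import numpy as np

def cluster_style_dictionary(Y, threshold=0.5):
    """Illustrative clustering-style estimate of dictionary atoms from samples Y (d x n)."""
    d, n = Y.shape
    Z = Y / (np.linalg.norm(Y, axis=0, keepdims=True) + 1e-12)
    C = np.abs(Z.T @ Z)                        # pairwise sample correlations
    assigned = np.zeros(n, dtype=bool)
    atoms = []
    for i in range(n):
        if assigned[i]:
            continue
        cluster = np.where((C[i] > threshold) & ~assigned)[0]
        if len(cluster) < 2:
            continue
        assigned[cluster] = True
        # Top left singular vector of the cluster serves as the atom estimate.
        u, _, _ = np.linalg.svd(Y[:, cluster], full_matrices=False)
        atoms.append(u[:, 0])
    return np.stack(atoms, axis=1) if atoms else np.zeros((d, 0))
```

Running an ℓ1-regularized regression of the samples onto such an approximate dictionary, as in the sparse-coding sketch above, is one way the second-stage cleanup mentioned in the abstract could be implemented.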
Momentum-based stochastic gradient methods such as heavy ball (HB) and Nesterov's accelerated gradient descent (NAG) are widely used in practice for training deep networks and other supervised learning models, as they often provide significant improvements over stochastic gradient descent (SGD). Rigorously speaking, "fast gradient" methods have provable improvements over gradient descent only in the deterministic case, where the gradients are exact. In the stochastic case, the popular explanation for their wide applicability is that these fast gradient methods partially mimic their exact-gradient counterparts, resulting in some practical gain. This work provides a counterpoint to this belief by proving that there exist simple problem instances where these methods cannot outperform SGD despite the best setting of their parameters. These negative problem instances are, in an informal sense, generic; they do not look like carefully constructed pathological instances. These results suggest (along with empirical evidence) that HB's or NAG's practical performance gains are a by-product of mini-batching.

Furthermore, this work provides a viable (and provable) alternative which, on the same set of problem instances, significantly improves over the performance of HB, NAG, and SGD. This algorithm, referred to as Accelerated Stochastic Gradient Descent (ASGD), is a simple-to-implement stochastic algorithm based on a relatively less popular variant of Nesterov's acceleration. Extensive empirical results in this paper show that ASGD has performance gains over HB, NAG, and SGD. The code implementing the ASGD algorithm can be found here¹.
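For reference, the stochastic forms of the three baseline updates compared above are standard and sketched below; `lr` and `beta` are generic step-size and momentum parameters, and `stoch_grad` stands for any stochastic gradient oracle. The ASGD update itself is not specified in the abstract, so it is not reproduced here.

```python
def sgd_step(w, stoch_grad, lr):
    # Plain SGD: step along a stochastic gradient.
    return w - lr * stoch_grad(w)

def heavy_ball_step(w, v, stoch_grad, lr, beta):
    # Heavy ball (HB): reuse the previous displacement as momentum.
    v_new = beta * v - lr * stoch_grad(w)
    return w + v_new, v_new

def nesterov_step(w, v, stoch_grad, lr, beta):
    # NAG: evaluate the stochastic gradient at the look-ahead point.
    v_new = beta * v - lr * stoch_grad(w + beta * v)
    return w + v_new, v_new
```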
Markov Random Fields (MRFs), a.k.a. Graphical Models, serve as popular models for networks in the social and biological sciences, as well as in communications and signal processing. A central problem is that of structure learning or model selection: given samples from the MRF, determine the graph structure of the underlying distribution. When the MRF is not Gaussian (e.g. the Ising model) and contains cycles, structure learning is known to be NP-hard even with infinitely many samples. Existing approaches typically focus either on specific parametric classes of models or on the sub-class of graphs with bounded degree; the complexity of many of these methods grows quickly in the degree bound. We develop a simple new 'greedy' algorithm for learning the structure of graphical models of discrete random variables. It learns the Markov neighborhood of a node by sequentially adding to it the node that produces the highest reduction in conditional entropy. We provide a general sufficient condition for exact structure recovery (under conditions on the degree, girth, and correlation decay), and study its sample and computational complexity. We then consider its implications for the Ising model, for which we establish a self-contained condition for exact structure recovery.
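The greedy neighborhood-selection step can be sketched directly from its description: repeatedly add the variable that most reduces the empirical conditional entropy of the target node. The stopping threshold `eps` below is an illustrative device; the paper's actual stopping rule and guarantees are not reproduced.

```python
import numpy as np
from collections import Counter

def cond_entropy(target, given):
    """Empirical H(target | given); `target` is an n-vector, `given` an n x k array."""
    n = len(target)
    joint = Counter(zip(map(tuple, given), target))
    marginal = Counter(map(tuple, given))
    h = 0.0
    for (g, _), c in joint.items():
        h -= (c / n) * np.log(c / marginal[g])
    return h

def greedy_neighborhood(samples, i, eps=0.05):
    """Greedy Markov-neighborhood estimate for node i from discrete samples (n x p)."""
    n, p = samples.shape
    target = samples[:, i]
    nbhd = []
    current = cond_entropy(target, samples[:, nbhd])
    while True:
        best_j, best_h = None, current
        for j in range(p):
            if j == i or j in nbhd:
                continue
            h = cond_entropy(target, samples[:, nbhd + [j]])
            if h < best_h:
                best_j, best_h = j, h
        # Stop once no candidate reduces the conditional entropy enough.
        if best_j is None or current - best_h <= eps:
            return nbhd
        nbhd.append(best_j)
        current = best_h
```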
Nesterov's accelerated gradient descent (AGD), an instance of the general family of "momentum methods," provably achieves a faster convergence rate than gradient descent (GD) in the convex setting. However, whether these methods are superior to GD in the nonconvex setting remains open. This paper studies a simple variant of AGD, and shows that it escapes saddle points and finds a second-order stationary point in Õ(1/ε^(7/4)) iterations, faster than the Õ(1/ε^2) iterations required by GD. To the best of our knowledge, this is the first Hessian-free algorithm to find a second-order stationary point faster than GD, and also the first single-loop algorithm with a faster rate than GD even in the setting of finding a first-order stationary point. Our analysis is based on two key ideas: (1) the use of a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD monotonically decreases per step even for nonconvex functions, and (2) a novel framework called improve or localize, which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.

Table 1: Complexity of finding stationary points. O(·) ignores polylog factors in d and ε.

First-order stationary point, gradient oracle:
- GD [Nesterov, 1998]: O(1/ε^2), single-loop
- AGD [Ghadimi and Lan, 2016]: O(1/ε^2), single-loop
- Carmon et al. [2017]: O(1/ε^(7/4)), nested-loop

Second-order stationary point, Hessian-vector oracle:
- Carmon et al. [2016]: O(1/ε^(7/4)), nested-loop
- Agarwal et al. [2017]: O(1/ε^(7/4)), nested-loop

Second-order stationary point, gradient oracle:
- Noisy GD [Ge et al., 2015]: O(poly(d/ε)), single-loop
- Perturbed GD [Jin et al., 2017]: O(1/ε^2), single-loop
- Perturbed AGD [This Work]: O(1/ε^(7/4)), single-loop

Convergence to first-order stationary points: Traditional analyses in this case assume only Lipschitz gradients (see Definition 1). Nesterov [1998] shows that GD finds an ε-first-order stationary point in O(1/ε^2) steps. Ghadimi and Lan [2016] guarantee that AGD also converges in O(1/ε^2) steps. Under the additional assumption of Lipschitz Hessians (see Definition 4), Carmon et al. [2017] develop a new algorithm that converges in O(1/ε^(7/4)) steps. Their algorithm is a nested-loop algorithm, where the outer loop adds a proximal term to reduce the nonconvex problem to a convex subproblem. A key novelty in their algorithm is the idea of "negative curvature exploitation," which inspired a similar step in our algorithm. In addition to the qualitative and quantitative differences between Carmon et al. [2017] and the current work, as summarized in Table 1, we note that while Carmon et al. [2017] analyze AGD applied to convex subproblems, we analyze AGD applied directly to nonconvex functions through a novel Hamiltonian framework.

Convergence to second-order stationary points: All results in this setting assume Lipschitz conditions for both the gradient and Hessian. Classical approaches, such as cubic regularization [N...
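A minimal sketch of the momentum iteration and the Hamiltonian bookkeeping referred to in the abstract above is given below. The specific Hamiltonian form E = f(x) + ||v||^2/(2η) and all parameter names are assumptions for illustration; the perturbation and negative-curvature-exploitation steps of the full algorithm are not shown.

```python
import numpy as np

def hamiltonian(f, x, v, eta):
    # The potential-plus-kinetic quantity tracked by the analysis
    # (assumed form): E = f(x) + ||v||^2 / (2 * eta).
    return f(x) + np.dot(v, v) / (2 * eta)

def agd_step(grad, x, v, eta=0.01, theta=0.1):
    """One accelerated-gradient step: x is the iterate, v the velocity."""
    y = x + (1 - theta) * v          # momentum look-ahead point
    x_next = y - eta * grad(y)       # gradient step taken at the look-ahead
    v_next = x_next - x              # new velocity
    return x_next, v_next
```

Monitoring `hamiltonian` across iterations is the kind of bookkeeping the analysis uses; the full algorithm's additional steps, which handle the cases where plain AGD stalls, are omitted here.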