We study the problem of decentralized optimization over time-varying networks with strongly convex smooth cost functions. In our approach, nodes run a multi-step gossip procedure after each gradient update, thus ensuring approximate consensus at each iteration, while the outer loop is based on Nesterov's accelerated scheme. The algorithm achieves precision $\varepsilon > 0$ in $O(\sqrt{\kappa_g}\,\chi \log^2(1/\varepsilon))$ communication steps and $O(\sqrt{\kappa_g} \log(1/\varepsilon))$ gradient computations at each node, where $\kappa_g$ is the condition number of the global function and $\chi$ characterizes the connectivity of the communication network. In the case of a static network, $\chi = 1/\gamma$, where $\gamma$ denotes the normalized spectral gap of the communication matrix $W$. The complexity bound involves $\kappa_g$, which can be significantly better than the worst-case condition number among the nodes.
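The multi-step gossip idea in the abstract above can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes a fixed ring network rather than a time-varying one, scalar node values, and a hypothetical mixing matrix that puts weight 1/3 on each node and its two neighbours. The names `ring_weights`, `gossip_round`, `multi_step_gossip`, and `disagreement` are illustrative only.

```python
def ring_weights(n):
    """Symmetric, doubly stochastic mixing matrix for a ring of n >= 3 nodes.

    Assumption for illustration: each node averages itself and its two
    neighbours with equal weight 1/3.
    """
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            w[i][j % n] += 1.0 / 3.0
    return w


def gossip_round(values, weights):
    """One communication step: every node takes a weighted average of neighbours."""
    n = len(values)
    return [sum(weights[i][j] * values[j] for j in range(n)) for i in range(n)]


def multi_step_gossip(values, weights, steps):
    """Run several gossip rounds in a row to reach approximate consensus."""
    for _ in range(steps):
        values = gossip_round(values, weights)
    return values


def disagreement(values):
    """Largest deviation of any node from the network average."""
    mean = sum(values) / len(values)
    return max(abs(v - mean) for v in values)


if __name__ == "__main__":
    start = [0.0, 1.0, 2.0, 3.0, 4.0]
    mixed = multi_step_gossip(start, ring_weights(len(start)), steps=20)
    print(disagreement(start), disagreement(mixed))
```

Because the mixing matrix is doubly stochastic, the network average is preserved while the disagreement between nodes shrinks geometrically with the number of gossip rounds. In a scheme like the one described, a block of such rounds would follow each gradient update.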
We propose a general yet simple theorem describing the convergence of SGD under the arbitrary sampling paradigm. Our theorem describes the convergence of an infinite array of variants of SGD, each of which is associated with a specific probability law governing the data selection rule used to form mini-batches. This is the first time such an analysis has been performed, and most of our variants of SGD have never been explicitly considered in the literature. Our analysis relies on the recently introduced notion of expected smoothness and does not rely on a uniform bound on the variance of the stochastic gradients. By specializing our theorem to different mini-batching strategies, such as sampling with replacement and independent sampling, we derive exact expressions for the stepsize as a function of the mini-batch size. With this we can also determine the mini-batch size that optimizes the total complexity, and show explicitly that as the variance of the stochastic gradient evaluated at the minimum grows, so does the optimal mini-batch size. For zero variance, the optimal mini-batch size is one. Moreover, we prove insightful stepsize-switching rules which describe when one should switch from a constant to a decreasing stepsize regime.
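As a rough illustration of independent sampling, the sketch below forms a mini-batch by including each data point i independently with probability p_i and reweights the selected gradients by 1/p_i. This reweighting is a standard way to keep the estimate unbiased and is assumed here, not taken from the abstract. Gradients are scalars for simplicity, and the names `independent_sample` and `sampled_gradient` are hypothetical.

```python
import random


def independent_sample(probs, rng):
    """Include index i in the mini-batch independently with probability probs[i]."""
    return [i for i, p in enumerate(probs) if rng.random() < p]


def sampled_gradient(grads, probs, subset):
    """Estimate the average gradient from a random subset.

    Computes (1/n) * sum over i in subset of grads[i] / probs[i], which is
    unbiased when probs[i] is the probability that i lands in the subset.
    """
    n = len(grads)
    return sum(grads[i] / probs[i] for i in subset) / n


if __name__ == "__main__":
    rng = random.Random(0)
    grads = [1.0, 2.0, 3.0]      # per-example gradients at some point (made up)
    probs = [0.5, 0.25, 0.75]    # inclusion probabilities (made up)
    trials = 20000
    avg = sum(
        sampled_gradient(grads, probs, independent_sample(probs, rng))
        for _ in range(trials)
    ) / trials
    print(avg)  # should be close to the full average gradient
```

The mini-batch size is itself random under this scheme, with expected size equal to the sum of the p_i. The stepsize expressions mentioned in the abstract are not reproduced here.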
In order to mitigate the high communication cost in distributed and federated learning, various vector compression schemes, such as quantization, sparsification and dithering, have become very popular. In designing a compression method, one aims to communicate as few bits as possible, which minimizes the cost per communication round, while at the same time attempting to impart as little distortion (variance) to the communicated messages as possible, which minimizes the adverse effect of the compression on the overall number of communication rounds. However, intuitively, these two goals are fundamentally in conflict: the more compression we allow, the more distorted the messages become. We formalize this intuition and prove an uncertainty principle for randomized compression operators, thus quantifying this limitation mathematically, and effectively providing asymptotically tight lower bounds on what might be achievable with communication compression. Motivated by these developments, we call for the search for the optimal compression operator. In an attempt to take a first step in this direction, we consider an unbiased compression method inspired by the Kashin representation of vectors, which we call Kashin compression (KC). In contrast to all previously proposed compression mechanisms, KC enjoys a dimension-independent variance bound, for which we derive an explicit formula even in the regime when only a few bits need to be communicated per vector entry.
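The tension between compression and distortion can be seen in a much simpler operator than Kashin compression. The sketch below uses random-k sparsification, which is swapped in purely for illustration: it keeps k of the d coordinates chosen uniformly at random and rescales them by d/k so that the result is unbiased. For this operator the expected squared error equals (d/k - 1) times the squared norm of the input, so sending fewer coordinates increases the variance. The names `rand_k` and `variance_factor` are hypothetical.

```python
import random


def rand_k(x, k, rng):
    """Unbiased random-k sparsification (not Kashin compression).

    Keeps k coordinates chosen uniformly without replacement and scales
    them by d/k; all other coordinates are set to zero.
    """
    d = len(x)
    kept = set(rng.sample(range(d), k))
    scale = d / k
    return [scale * x[i] if i in kept else 0.0 for i in range(d)]


def variance_factor(d, k):
    """Factor omega in E||C(x) - x||^2 = omega * ||x||^2 for random-k."""
    return d / k - 1.0


if __name__ == "__main__":
    rng = random.Random(0)
    x = [1.0, 2.0, 3.0, 4.0]
    print(rand_k(x, 2, rng))
    # Fewer kept coordinates means a larger variance factor.
    for k in (1, 2, 4):
        print(k, variance_factor(len(x), k))
```

Note that the variance factor of this operator grows with the dimension d, which is the kind of dependence the abstract says KC avoids.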
In 2015, the universal Catalyst framework appeared (Lin et al., 2015); it makes it possible to accelerate almost arbitrary non-accelerated deterministic and randomized algorithms for smooth convex optimization problems. This technique has found many applications in machine learning because it can handle sum-type target functions. The core of the Catalyst approach is an accelerated proximal outer gradient method, which serves as an envelope for a non-accelerated inner algorithm applied to a regularized auxiliary problem. One of the main practical difficulties of this approach is the selection of the regularization parameter. A well-developed theory exists for this choice (Lin et al., 2018), but it requires prior knowledge of the smoothness constant of the target function. In this paper, we propose an adaptive variant of Catalyst that does not require such information. Combining it with an adaptive non-accelerated inner algorithm, we obtain accelerated variants of well-known methods: steepest descent, adaptive coordinate descent, and alternating minimization.