Removing Data Heterogeneity Influence Enhances Network Topology Dependence of Decentralized SGD

Yuan, Kun; Alghunaim, Sulaiman A.

doi:10.48550/arxiv.2105.08023

Cited by 7 publications

(15 citation statements)

References 37 publications


“…By comparing this result with the previously best known upper bound, $\tilde{O}\left(\frac{\sigma^2}{n\varepsilon} + \frac{1}{p}\cdot\left(\frac{\sigma}{\sqrt{p\varepsilon}} + \frac{1}{p}\log\frac{1}{\varepsilon}\right)\right)$, by Pu and Nedić [39], we see that our upper bound improves the last two terms by a factor of $c/p \ge 1$ and that the first term matches with known lower bounds [37]. The D² algorithm [46] only converges under the assumption that c is a constant and the recent upper bound from [55] coincides with our worst case complexity for GT on all topologies where D² can be applied. We provide additional comparison of GT convergence rates in the Tables 1 and 2.…”

Section: Introduction (supporting)

confidence: 80%

“…In this paper, we develop a new, and improved, analysis of the gradient tracking algorithm with a novel proof technique. Along with the parallel contribution [55] that developed a tighter analysis of the D² algorithm, we now have a more accurate understanding of in which setting GT works well and in which ones it does not, and our results allow for a more detailed comparison between the D-SGD, GT and D² methods (see Section 5 below).…”

Section: Introduction (mentioning)

confidence: 95%


An Improved Analysis of Gradient Tracking for Decentralized Machine Learning

Koloskova, Lin, Stich. 2022. Preprint.

We consider decentralized machine learning over a network where the training data is distributed across n agents, each of which can compute stochastic model updates on their local data. The agents' common goal is to find a model that minimizes the average of all local loss functions. While gradient tracking (GT) algorithms can overcome a key challenge, namely accounting for differences between workers' local data distributions, the known convergence rates for GT algorithms are not optimal with respect to their dependence on the mixing parameter p (related to the spectral gap of the connectivity matrix). We provide a tighter analysis of the GT method in the stochastic strongly convex, convex and non-convex settings. We improve the dependency on p from $O(p^{-2})$ to $O(p^{-1}c^{-1})$ in the noiseless case and from $O(p^{-3/2})$ to $O(p^{-1/2}c^{-1})$ in the general stochastic case, where $c \ge p$ is related to the negative eigenvalues of the connectivity matrix (and is a constant in most practical applications). This improvement was possible due to a new proof technique which could be of independent interest.
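To make the role of gradient tracking concrete, the following is a minimal sketch rather than the cited paper's implementation. It assumes scalar quadratic local losses f_i(x) = 0.5 (x - b_i)^2 with different minimizers b_i (a toy stand-in for heterogeneous data), noiseless gradients, a ring topology with uniform 1/3 weights, and a constant step size of 0.1. All function names are hypothetical. It contrasts a commonly used form of the GT update with plain decentralized gradient descent on the same problem.

```python
def ring_mixing_matrix(n):
    """Doubly stochastic ring matrix: weight 1/3 on self and on both neighbours."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] = 1.0 / 3.0
    return W


def mix(W, v):
    """One round of neighbour averaging: returns W @ v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def local_grads(x, b):
    """Gradient of the assumed local loss f_i(x) = 0.5 * (x - b_i)^2 at each agent."""
    return [xi - bi for xi, bi in zip(x, b)]


def gradient_tracking(W, b, step, iters):
    """GT: x <- W(x - step * y),  y <- W y + grad(x_new) - grad(x_old)."""
    x = [0.0] * len(b)
    g = local_grads(x, b)
    y = list(g)  # the tracker starts at the local gradients
    for _ in range(iters):
        x = mix(W, [xi - step * yi for xi, yi in zip(x, y)])
        g_new = local_grads(x, b)
        y = [wy + gn - go for wy, gn, go in zip(mix(W, y), g_new, g)]
        g = g_new
    return x


def decentralized_gd(W, b, step, iters):
    """Plain decentralized gradient descent: x <- W(x - step * grad(x))."""
    x = [0.0] * len(b)
    for _ in range(iters):
        g = local_grads(x, b)
        x = mix(W, [xi - step * gi for xi, gi in zip(x, g)])
    return x


def max_error(x, target):
    """Largest distance of any agent's iterate from the global minimizer."""
    return max(abs(xi - target) for xi in x)


n = 8
b = [float(i) for i in range(n)]   # heterogeneous local minimizers (assumed data)
target = sum(b) / n                # minimizer of the average loss
W = ring_mixing_matrix(n)

gt_err = max_error(gradient_tracking(W, b, 0.1, 500), target)
dgd_err = max_error(decentralized_gd(W, b, 0.1, 500), target)
print("gradient tracking error:", gt_err)
print("plain decentralized GD error:", dgd_err)
```

In this toy setting the tracked variable y lets every agent follow the average gradient, so the GT iterates agree on the global minimizer, while the constant-step baseline settles at a point biased by the differences between the b_i.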


“…For this smooth formulation, variants of decentralized stochastic gradient descent (DSGD), e.g., [4,26,52,70], admit simple implementations yet provide competitive practical performance against centralized methods in homogeneous environments like data centers. When the data distributions across the network become heterogeneous, the performance of DSGD in both practice and theory degrades significantly [15,39,57,59,68]. To address this issue, stochastic methods that are robust to heterogeneous data have been proposed, e.g., D2 [51] that is derived from primal-dual formulations [22,25,47,69] and GT-DSGD [29,63] that is based on gradient tracking [10,33,38,41,67].…”

Section: Literature Review (mentioning)

confidence: 99%

A Stochastic Proximal Gradient Framework for Decentralized Non-Convex Composite Optimization: Topology-Independent Sample Complexity and Communication Efficiency

Xin, Das, Khan et al. 2021. Preprint.

Decentralized optimization is a promising parallel computation paradigm for large-scale data analytics and machine learning problems defined over a network of nodes. This paper is concerned with decentralized non-convex composite problems with population or empirical risk. In particular, the networked nodes are tasked to find an approximate stationary point of the average of local, smooth, possibly non-convex risk functions plus a possibly non-differentiable extended valued convex regularizer. Under this general formulation, we propose the first provably efficient, stochastic proximal gradient framework, called ProxGT. Specifically, we construct and analyze several instances of ProxGT that are tailored respectively for different problem classes of interest. Remarkably, we show that the sample complexities of these instances are network topology-independent and achieve linear speedups compared to that of the corresponding centralized optimal methods implemented on a single node.


“…We refer those iterations before decentralized SGD reaches its linear speedup stage as transient iterations (see the definition in Sec. 2), which is an important metric to measure the influence of partial-averaging [48,65] on convergence rate of decentralized SGD. The less effective the partial averaging is, the more transient iterations decentralized SGD needs to take.…”

Section: Topology (mentioning)

confidence: 99%

“…One line of research proposes new algorithms that are less sensitive to topologies. For example, [66,23,65,57,1] removed data heterogeneity with bias-correction techniques in [68,29,62,40,69], and [14,61,7,27] utilized periodic global averaging or multiple partial averaging steps. All these methods have improved topology dependence.…”

Section: Related Work (mentioning)

confidence: 99%

Exponential Graph is Provably Efficient for Decentralized Deep Training

Ying, Yuan, Chen et al. 2021. Preprint. Self-citation.

Decentralized SGD is an emerging training method for deep learning known for its lighter (and thus faster) communication per iteration, which relaxes the averaging step in parallel SGD to inexact averaging. The less exact the averaging is, however, the more total iterations the training needs to take. Therefore, the key to making decentralized SGD efficient is to realize nearly-exact averaging using little communication. This requires a skillful choice of communication topology, which is an under-studied topic in decentralized optimization. In this paper, we study so-called exponential graphs where every node is connected to O(log(n)) neighbors and n is the total number of nodes. This work proves such graphs can lead to both fast communication and effective averaging simultaneously. We also discover that a sequence of log(n) one-peer exponential graphs, in which each node communicates to one single neighbor per iteration, can together achieve exact averaging. This favorable property enables the one-peer exponential graph to average as effectively as its static counterpart while communicating more efficiently. We apply these exponential graphs in decentralized (momentum) SGD to obtain the state-of-the-art balance between per-iteration communication and iteration complexity among all commonly-used topologies. Experimental results on a variety of tasks and models demonstrate that decentralized (momentum) SGD over exponential graphs promises both fast and high-quality training. Our code is implemented through BlueFog and available at https://github.com/Bluefog-Lib/NeurIPS2021-Exponential-Graph.
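The exact-averaging claim for one-peer exponential graphs can be checked numerically on a small case. The sketch below is an illustration under stated assumptions, not the BlueFog implementation: it assumes n is a power of two, that in round t node i averages equally with the single node 2^t hops ahead (the indexing direction and the 1/2 weights are choices made here), and it uses hypothetical helper names. Exact rational arithmetic avoids rounding questions.

```python
from fractions import Fraction


def one_peer_exponential_matrix(n, t):
    """Round-t mixing matrix: node i keeps 1/2 of its value and takes 1/2 from node (i + 2**t) % n."""
    hop = 2 ** t
    W = [[Fraction(0)] * n for _ in range(n)]
    for i in range(n):
        W[i][i] += Fraction(1, 2)
        W[i][(i + hop) % n] += Fraction(1, 2)
    return W


def matmul(A, B):
    """Plain square matrix product, exact because the entries are Fractions."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]


n = 8                          # assumed to be a power of two
rounds = n.bit_length() - 1    # log2(n) for a power of two

# Apply the one-peer matrices for t = 0, 1, ..., log2(n) - 1 in sequence.
product = identity(n)
for t in range(rounds):
    product = matmul(one_peer_exponential_matrix(n, t), product)

# Exact averaging means every entry of the product equals 1/n.
is_exact_average = all(entry == Fraction(1, n) for row in product for entry in row)

# Stopping one round early should leave the averaging incomplete.
partial = identity(n)
for t in range(rounds - 1):
    partial = matmul(one_peer_exponential_matrix(n, t), partial)
is_partial_exact = all(entry == Fraction(1, n) for row in partial for entry in row)

print("exact average after", rounds, "rounds:", is_exact_average)
print("exact average after", rounds - 1, "rounds:", is_partial_exact)
```

Each round costs one message per node, so in this construction the full average is reached with log2(n) single-peer exchanges instead of a dense all-to-all step.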
