2021
DOI: 10.48550/arxiv.2105.08023
Preprint
Removing Data Heterogeneity Influence Enhances Network Topology Dependence of Decentralized SGD

Abstract: We consider decentralized stochastic optimization problems in which a network of agents, each owning a local cost function, cooperates to find a minimizer of the globally averaged cost. A widely studied decentralized algorithm for this problem is D-SGD, in which each node applies a stochastic gradient descent step and then averages its estimate with its neighbors. D-SGD is attractive due to its efficient single-iteration communication, and it can achieve linear speedup in convergence (in terms of the network size). However, D-S…
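The D-SGD update described in the abstract (a local SGD step followed by neighbor averaging) can be illustrated with a minimal sketch. The quadratic local costs, ring topology, step size, and noise level below are our own illustrative assumptions, not taken from the paper:

```python
# Minimal sketch of decentralized SGD (D-SGD) on a ring of n agents.
# Costs, topology, and hyperparameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, T, lr = 8, 5, 200, 0.05

# Agent i holds the local quadratic cost f_i(x) = 0.5 * ||x - b_i||^2,
# so the minimizer of the averaged cost is the average of the b_i.
b = rng.normal(size=(n, d))
x = np.zeros((n, d))                 # one estimate per agent

# Ring topology: doubly stochastic mixing matrix, weight 1/3 for
# each node itself and its two neighbors.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3

for _ in range(T):
    noise = 0.01 * rng.normal(size=(n, d))   # stochastic gradient noise
    grad = (x - b) + noise                   # local stochastic gradients
    x = W @ (x - lr * grad)                  # SGD step, then neighbor averaging

x_bar = x.mean(axis=0)
print(np.linalg.norm(x_bar - b.mean(axis=0)))   # small: near the global minimizer
```

Because `W` is doubly stochastic, the network-average iterate follows an ordinary SGD recursion on the averaged cost, which is why the averaged estimate approaches the global minimizer.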


Cited by 7 publications (15 citation statements). References 37 publications.
“…By comparing this result with the previously best known upper bound, $\tilde{O}\!\left(\frac{\sigma^2}{n\varepsilon} + \frac{1}{p}\cdot\frac{\sigma}{\sqrt{p\varepsilon}} + \frac{1}{p}\log\frac{1}{\varepsilon}\right)$, by Pu and Nedić [39], we see that our upper bound improves the last two terms by a factor of $c/p \ge 1$ and that the first term matches known lower bounds [37]. The D² algorithm [46] only converges under the assumption that $c$ is a constant, and the recent upper bound from [55] coincides with our worst-case complexity for GT on all topologies where D² can be applied. We provide additional comparisons of GT convergence rates in Tables 1 and 2.…”
Section: Introduction (supporting, confidence: 80%)
“…In this paper, we develop a new and improved analysis of the gradient tracking algorithm with a novel proof technique. Together with the parallel contribution [55], which developed a tighter analysis of the D² algorithm, we now have a more accurate understanding of the settings in which GT works well and those in which it does not, and our results allow for a more detailed comparison between the D-SGD, GT, and D² methods (see Section 5 below).…”
Section: Introduction (mentioning, confidence: 95%)
“…For this smooth formulation, variants of decentralized stochastic gradient descent (DSGD), e.g., [4,26,52,70], admit simple implementations yet provide competitive practical performance against centralized methods in homogeneous environments like data centers. When the data distributions across the network become heterogeneous, the performance of DSGD degrades significantly in both practice and theory [15,39,57,59,68]. To address this issue, stochastic methods that are robust to heterogeneous data have been proposed, e.g., D² [51], which is derived from primal-dual formulations [22,25,47,69], and GT-DSGD [29,63], which is based on gradient tracking [10,33,38,41,67].…”
Section: Literature Review (mentioning, confidence: 99%)
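The gradient-tracking correction mentioned in the excerpt above maintains, at each agent, an auxiliary variable that tracks the network-average gradient, which removes the bias caused by heterogeneous local data. A minimal sketch follows; the quadratic costs, ring topology, step size, and use of exact (noiseless) gradients are our own illustrative assumptions:

```python
# Minimal sketch of gradient tracking (GT) on a ring of n agents with
# strongly heterogeneous local costs. All problem details are assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, d, T, lr = 8, 5, 300, 0.05

# Heterogeneous local costs f_i(x) = 0.5 * ||x - b_i||^2 with widely
# spread b_i; the global minimizer is the average of the b_i.
b = 5.0 * rng.normal(size=(n, d))
x = np.zeros((n, d))

# Ring topology: doubly stochastic mixing matrix.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1 / 3

def grads(x):
    return x - b            # exact local gradients (noise omitted for clarity)

# y_i tracks the global average gradient; initialized to the local gradient.
y = grads(x)
g_prev = grads(x)
for _ in range(T):
    x = W @ x - lr * y                  # descend along the tracked direction
    g = grads(x)
    y = W @ y + (g - g_prev)            # update the average-gradient tracker
    g_prev = g

print(np.linalg.norm(x.mean(axis=0) - b.mean(axis=0)))   # near zero
```

Since `W` is doubly stochastic, the average of the `y_i` always equals the average of the current local gradients, so every agent converges to the minimizer of the averaged cost despite the heterogeneous `b_i` — the bias that degrades plain DSGD in this regime is removed.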
“…We refer to the iterations before decentralized SGD reaches its linear-speedup stage as transient iterations (see the definition in Sec. 2); this is an important metric for measuring the influence of partial averaging [48,65] on the convergence rate of decentralized SGD. The less effective the partial averaging is, the more transient iterations decentralized SGD needs.…”
Section: Topology (mentioning, confidence: 99%)
“…One line of research proposes new algorithms that are less sensitive to topologies. For example, [66,23,65,57,1] removed data heterogeneity with the bias-correction techniques of [68,29,62,40,69], and [14,61,7,27] utilized periodic global averaging or multiple partial-averaging steps. All of these methods have improved topology dependence.…”
Section: Related Work (mentioning, confidence: 99%)