2021
DOI: 10.48550/arxiv.2110.13363
Preprint

Exponential Graph is Provably Efficient for Decentralized Deep Training

Abstract: Decentralized SGD is an emerging training method for deep learning, known for requiring much less communication per iteration (and therefore being faster); it relaxes the exact averaging step of parallel SGD to inexact averaging. The less exact the averaging is, however, the more total iterations the training needs. The key to making decentralized SGD efficient is therefore to realize nearly exact averaging using little communication. This requires a skillful choice of communication topology, which is an under-studied topic…
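The trade-off the abstract describes, exact averaging in parallel SGD versus inexact topology-dependent averaging in decentralized SGD, can be illustrated in a few lines of NumPy. This is a minimal sketch, assuming scalar per-node models and a ring mixing matrix as a stand-in for a sparse topology; it is not the paper's implementation.

```python
# Minimal sketch: exact averaging (parallel SGD) vs. one inexact gossip step
# (decentralized SGD) over an assumed ring topology.
import numpy as np

n = 8                                    # number of nodes
x = np.random.randn(n, 1)                # one scalar "model" per node (assumption)

# Parallel SGD: every node ends up with the exact global mean.
exact = np.full_like(x, x.mean())

# Decentralized SGD: one multiplication by a sparse mixing matrix W,
# i.e. one round of communication with direct neighbors only.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 1 / 3                      # self weight
    W[i, (i - 1) % n] = 1 / 3            # left neighbor
    W[i, (i + 1) % n] = 1 / 3            # right neighbor
approx = W @ x

# The gap below is the "inexactness" that extra training iterations must absorb.
print(np.linalg.norm(approx - exact))
```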

Cited by 1 publication (7 citation statements) | References 50 publications
“…Push matrix W is also used with directed graphs. 3) A standard weight matrix W satisfies both W1 = 1 and 1ᵀW = 1ᵀ and is used for undirected graphs, as well as special directed graphs such as the exponential graph [33]. See Fig.…”
Section: A. Concepts and Theoretical Foundations (mentioning)
confidence: 99%
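The quoted property of the standard weight matrix, W1 = 1 and 1ᵀW = 1ᵀ, can be verified numerically for a static exponential graph. The construction below is a sketch under common assumptions (n is a power of two, uniform weights 1/(log2(n) + 1) on the self-loop and the 2^j-hop in-neighbors); it is not necessarily the exact matrix used in [33].

```python
# Sketch: static exponential graph weight matrix and its row/column sums.
import numpy as np

def exponential_graph_matrix(n: int) -> np.ndarray:
    """Static exponential graph: node i receives from (i - 2**j) % n, j = 0..log2(n)-1."""
    tau = int(np.log2(n))                 # number of in-neighbors per node
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1.0 / (tau + 1)         # self weight
        for j in range(tau):
            W[i, (i - 2 ** j) % n] = 1.0 / (tau + 1)
    return W

W = exponential_graph_matrix(8)
print(np.allclose(W @ np.ones(8), 1.0))   # W 1 = 1    (row sums are 1)
print(np.allclose(np.ones(8) @ W, 1.0))   # 1^T W = 1^T (column sums are 1)
```

Even though the graph is directed, both checks pass when n is a power of two, which is what allows it to be handled like a standard (doubly stochastic) weight matrix.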
“…When the network topology is sparse (e.g., a ring or a one-peer exponential graph [3], [33]), each partial averaging step (5) incurs O(1) latency and O(1) transmission time (the inverse of bandwidth), which are independent of n. Since each node only synchronizes with its direct neighbors, there is low synchronization overhead. Although partial averaging is less effective in aggregating information than global averaging, some decentralized algorithms can match or exceed the performance of global-averaging-based distributed algorithms: [1], [29] established that decentralized SGD can achieve the same asymptotic linear speedup in convergence rate as (parameter-server-based) distributed SGD; [3], [33] used exponential graph topologies to realize both efficient communication and effective aggregation through partial averaging; [37], [38], [31], [39] improved the convergence rate of decentralized SGD by removing data heterogeneity between nodes; [40], [4], [30], [41] enhanced the effectiveness of partial averaging by periodically calling global averaging. BlueFog can implement all of these algorithms, including those that use global averaging.…”
Section: A. Concepts and Theoretical Foundations (mentioning)
confidence: 99%
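The O(1) per-iteration communication of the one-peer exponential graph mentioned above can be simulated in a few lines. This is a single-process sketch under the assumption that n is a power of two; it uses plain NumPy rather than BlueFog's actual communication API. Each node exchanges with exactly one peer per iteration, and, as the cited paper shows for power-of-two n, log2(n) consecutive one-peer steps already reproduce the exact global average.

```python
# Sketch: one-peer exponential graph schedule simulated on a single process.
import numpy as np

def one_peer_averaging(x: np.ndarray, t: int) -> np.ndarray:
    """At iteration t, node i averages with node (i + 2**(t % tau)) % n,
    so each node sends and receives exactly one message: O(1) communication."""
    n = len(x)
    tau = int(np.log2(n))                 # period of the neighbor schedule
    offset = 2 ** (t % tau)
    x_new = np.empty_like(x)
    for i in range(n):
        x_new[i] = 0.5 * (x[i] + x[(i + offset) % n])
    return x_new

n = 8
x = np.random.randn(n)                    # local models, one scalar per node (assumption)

# A decentralized SGD loop would interleave local gradient steps with this
# partial averaging; here we only run the averaging to show consensus.
for t in range(int(np.log2(n))):
    x = one_peer_averaging(x, t)
print(np.ptp(x))                          # ~0: log2(n) one-peer steps give the exact average
```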