2019
DOI: 10.48550/arxiv.1905.03817
Preprint

On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization

Abstract: Recent developments on large-scale distributed machine learning applications, e.g., deep neural networks, benefit enormously from the advances in distributed non-convex optimization techniques, e.g., distributed Stochastic Gradient Descent (SGD). A series of recent works study the linear speedup property of distributed SGD variants with reduced communication. The linear speedup property enables us to scale out the computing capability by adding more computing nodes into our system. The reduced communication com…
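To make the linear speedup notion in the abstract concrete, here is a short illustrative calculation (my notation, not quoted from the paper), writing n for the number of computing nodes and T for the number of iterations: a linear-speedup rate for non-convex SGD has the form

\[
\min_{t \le T} \mathbb{E}\,\big\|\nabla f(x_t)\big\|^2 \;=\; \mathcal{O}\!\left(\frac{1}{\sqrt{nT}}\right),
\]

so driving the bound below a target $\delta$ needs only $T = \mathcal{O}(1/(n\delta^2))$ iterations; the time to reach a fixed accuracy shrinks in proportion to the number of nodes added.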

Cited by 21 publications (39 citation statements)
References 22 publications
“…From Theorem 4.2, we immediately have the following convergence rate with a proper choice of learning rates: $\mathcal{O}(1/\sqrt{mKT})$. This matches the convergence rate in distributed learning and FL without compression [10,11,35,37], indicating CFedAvg achieves high communication efficiency while not sacrificing learning accuracy in FL. When degenerating to i.i.d.…”
Section: R 3 (C C (supporting)
confidence: 68%
“…• In order to achieve a linear speedup, i.e., a convergence rate $\mathcal{O}(1/\sqrt{mKT})$, we show that the number of local updates $K$ can be as large as $T/m$, which improves the $T^{1/3}/m$ result previously shown in Yu et al. (2019a) and . As shown later in the communication complexity comparison in Table 1, a larger number of local steps implies relatively fewer communication rounds, thus less communication overhead.…”
Section: Introduction (mentioning)
confidence: 50%
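A back-of-the-envelope check (my own calculation, under the common convention in this literature that T counts communication rounds, K the local updates per round, and m the workers) of why the larger admissible K cuts communication: the bound $\mathcal{O}(1/\sqrt{mKT})$ falls below a target $\epsilon$ once $mKT \gtrsim \epsilon^{-2}$, and therefore

\[
K = \frac{T}{m} \;\Rightarrow\; mKT = T^{2} \;\Rightarrow\; T \gtrsim \epsilon^{-1},
\qquad
K = \frac{T^{1/3}}{m} \;\Rightarrow\; mKT = T^{4/3} \;\Rightarrow\; T \gtrsim \epsilon^{-3/2},
\]

i.e., the $T/m$ choice needs on the order of $\epsilon^{-1}$ communication rounds versus $\epsilon^{-3/2}$ for the earlier $T^{1/3}/m$ bound.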
“…For large scale problems such as training deep convolutional neural networks over the ImageNet dataset (Russakovsky et al., 2015), it is hard to optimize problem (1) on a single machine. In this section, we extend the mini-batch Adam to a distributed Adam, in the same manner as the distributed SGD method (Yu et al., 2019).…”
Section: Convergence Analysis for Distributed Adam (mentioning)
confidence: 99%
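To make the distributed-Adam construction mentioned above concrete, the following is a minimal runnable sketch (my own illustration, not the cited authors' code): each of s workers supplies a stochastic gradient, the gradients are averaged as an all-reduce would do, and a single Adam step is applied to the shared model. The quadratic objective, worker count, and hyperparameters are arbitrary assumptions for the demo.

import numpy as np

rng = np.random.default_rng(0)
dim, s, T = 10, 4, 500            # parameters, workers, iterations (all assumed)
lr, b1, b2, eps = 1e-2, 0.9, 0.999, 1e-8

x_star = rng.normal(size=dim)     # synthetic target for a least-squares objective
x = np.zeros(dim)                 # shared model, identical on every worker
m = np.zeros(dim)                 # Adam first-moment estimate
v = np.zeros(dim)                 # Adam second-moment estimate

def stochastic_grad(x):
    """Noisy gradient of 0.5 * ||x - x_star||^2, standing in for one worker's minibatch."""
    return (x - x_star) + 0.1 * rng.normal(size=dim)

for t in range(1, T + 1):
    # Each worker computes a local stochastic gradient; the server averages them
    # (an all-reduce in a real system), cutting the gradient variance by 1/s.
    g = np.mean([stochastic_grad(x) for _ in range(s)], axis=0)

    # Standard bias-corrected Adam update applied to the averaged gradient.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    x -= lr * m_hat / (np.sqrt(v_hat) + eps)

print("final squared error:", float(np.sum((x - x_star) ** 2)))

Averaging s independent stochastic gradients reduces the gradient-noise variance by a factor of s, which is the mechanism behind the linear speedup discussed in these citation statements.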
“…Remark 17. Below, we give two remarks on the above mini-batch Adam algorithm: (i) For distributed Adam, to achieve an ε-stationary point, $\mathcal{O}(\epsilon^{-4} s^{-1})$ iterations are needed, which is a linear speed-up with respect to the number of workers in the network, and is of the same order as in distributed SGD (Yu et al., 2019).…”
Section: Convergence Analysis for Distributed Adam (mentioning)
confidence: 99%
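A one-line consequence of the stated $\mathcal{O}(\epsilon^{-4}s^{-1})$ bound (my arithmetic, with $T(s)$ denoting the iteration count when $s$ workers are used):

\[
T(s) = \mathcal{O}\!\left(\frac{\epsilon^{-4}}{s}\right), \qquad \frac{T(2s)}{T(s)} = \frac{1}{2}, \qquad s \cdot T(s) = \mathcal{O}\!\left(\epsilon^{-4}\right),
\]

so doubling the workers halves the iterations while the total number of stochastic gradient evaluations stays at $\mathcal{O}(\epsilon^{-4})$, which is exactly the linear speedup being claimed.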