2019
DOI: 10.48550/arxiv.1905.03817
Preprint

On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization

Abstract: Recent developments on large-scale distributed machine learning applications, e.g., deep neural networks, benefit enormously from the advances in distributed non-convex optimization techniques, e.g., distributed Stochastic Gradient Descent (SGD). A series of recent works study the linear speedup property of distributed SGD variants with reduced communication. The linear speedup property enables us to scale out the computing capability by adding more computing nodes into our system. The reduced communication com…
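To make the linear speedup notion in the abstract concrete, here is a short illustrative calculation (my notation, not quoted from the paper), writing n for the number of computing nodes and T for the number of iterations: a linear-speedup rate for non-convex SGD has the form

\[
\min_{t \le T} \mathbb{E}\,\big\|\nabla f(x_t)\big\|^2 \;=\; \mathcal{O}\!\left(\frac{1}{\sqrt{nT}}\right),
\]

so driving the bound below a target $\delta$ needs only $T = \mathcal{O}(1/(n\delta^2))$ iterations; the time to reach a fixed accuracy shrinks in proportion to the number of nodes added.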

Cited by 21 publications (39 citation statements)
References 22 publications
“…From Theorem 4.2, we immediately have the following convergence rate with a proper choice of learning rates: $\mathcal{O}(1/\sqrt{mKT})$. This matches the convergence rate in distributed learning and FL without compression [10,11,35,37], indicating CFedAvg achieves high communication efficiency while not sacrificing learning accuracy in FL. When degenerating to i.i.d.…”
Section: R 3 (C C (supporting)
confidence: 68%
“…• In order to achieve a linear speedup, i.e., a convergence rate $\mathcal{O}(1/\sqrt{mKT})$, we show that the number of local updates $K$ can be as large as $T/m$, which improves the $T^{1/3}/m$ result previously shown in Yu et al. (2019a) and . As shown later in the communication complexity comparison in Table 1, a larger number of local steps implies relatively fewer communication rounds, thus less communication overhead.…”
Section: Introduction (mentioning)
confidence: 50%
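A back-of-the-envelope check (my own calculation, under the common convention in this literature that T counts communication rounds, K the local updates per round, and m the workers) of why the larger admissible K cuts communication: the bound $\mathcal{O}(1/\sqrt{mKT})$ falls below a target $\epsilon$ once $mKT \gtrsim \epsilon^{-2}$, and therefore

\[
K = \frac{T}{m} \;\Rightarrow\; mKT = T^{2} \;\Rightarrow\; T \gtrsim \epsilon^{-1},
\qquad
K = \frac{T^{1/3}}{m} \;\Rightarrow\; mKT = T^{4/3} \;\Rightarrow\; T \gtrsim \epsilon^{-3/2},
\]

i.e., the $T/m$ choice needs on the order of $\epsilon^{-1}$ communication rounds versus $\epsilon^{-3/2}$ for the earlier $T^{1/3}/m$ bound.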
“…For large scale problems such as training deep convolutional neural networks over the ImageNet dataset (Russakovsky et al., 2015), it is hard to optimize problem (1) on a single machine. In this section, we extend the mini-batch Adam to a distributed Adam, in the same manner as the distributed SGD method (Yu et al., 2019).…”
Section: Convergence Analysis for Distributed Adam (mentioning)
confidence: 99%
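To make the distributed-Adam construction mentioned above concrete, the following is a minimal runnable sketch (my own illustration, not the cited authors' code): each of s workers supplies a stochastic gradient, the gradients are averaged as an all-reduce would do, and a single Adam step is applied to the shared model. The quadratic objective, worker count, and hyperparameters are arbitrary assumptions for the demo.

import numpy as np

rng = np.random.default_rng(0)
dim, s, T = 10, 4, 500            # parameters, workers, iterations (all assumed)
lr, b1, b2, eps = 1e-2, 0.9, 0.999, 1e-8

x_star = rng.normal(size=dim)     # synthetic target for a least-squares objective
x = np.zeros(dim)                 # shared model, identical on every worker
m = np.zeros(dim)                 # Adam first-moment estimate
v = np.zeros(dim)                 # Adam second-moment estimate

def stochastic_grad(x):
    """Noisy gradient of 0.5 * ||x - x_star||^2, standing in for one worker's minibatch."""
    return (x - x_star) + 0.1 * rng.normal(size=dim)

for t in range(1, T + 1):
    # Each worker computes a local stochastic gradient; the server averages them
    # (an all-reduce in a real system), cutting the gradient variance by 1/s.
    g = np.mean([stochastic_grad(x) for _ in range(s)], axis=0)

    # Standard bias-corrected Adam update applied to the averaged gradient.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    x -= lr * m_hat / (np.sqrt(v_hat) + eps)

print("final squared error:", float(np.sum((x - x_star) ** 2)))

Averaging s independent stochastic gradients reduces the gradient-noise variance by a factor of s, which is the mechanism behind the linear speedup discussed in these citation statements.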
“…Remark 17. Below, we give two remarks on the above mini-batch Adam algorithm: (i) For distributed Adam, to achieve an ε-stationary point, $\mathcal{O}(\epsilon^{-4} s^{-1})$ iterations are needed, which is a linear speed-up with respect to the number of workers in the network, and is of the same order as in distributed SGD (Yu et al., 2019).…”
Section: Convergence Analysis for Distributed Adam (mentioning)
confidence: 99%
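A one-line consequence of the stated $\mathcal{O}(\epsilon^{-4}s^{-1})$ bound (my arithmetic, with $T(s)$ denoting the iteration count when $s$ workers are used):

\[
T(s) = \mathcal{O}\!\left(\frac{\epsilon^{-4}}{s}\right), \qquad \frac{T(2s)}{T(s)} = \frac{1}{2}, \qquad s \cdot T(s) = \mathcal{O}\!\left(\epsilon^{-4}\right),
\]

so doubling the workers halves the iterations while the total number of stochastic gradient evaluations stays at $\mathcal{O}(\epsilon^{-4})$, which is exactly the linear speedup being claimed.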