2021
DOI: 10.48550/arxiv.2102.07845
Preprint

MARINA: Faster Non-Convex Distributed Learning with Compression

Abstract: We develop and analyze MARINA: a new communication-efficient method for non-convex distributed learning over heterogeneous datasets. MARINA employs a novel communication compression strategy based on the compression of gradient differences, which is reminiscent of but different from the strategy employed in the DIANA method of Mishchenko et al. (2019). Unlike virtually all competing distributed first-order methods, including DIANA, ours is based on a carefully designed biased gradient estimator, which is the key to its superior theore…
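The mechanism the abstract describes can be summarized in a few lines: rather than compressing local gradients directly, each node compresses the difference between its current and previous local gradients, and with a small probability the nodes synchronize on the exact gradient. Below is a minimal single-process sketch of such an update rule, assuming an unbiased Rand-K sparsifier; the names (`rand_k`, `marina`), the step size `gamma`, the probability `p`, and the toy least-squares objectives are illustrative placeholders, not the paper's reference implementation or prescribed parameter choices.

```python
import numpy as np

def rand_k(v, k, rng):
    """Unbiased Rand-K sparsifier: keep k random coordinates and rescale
    by d/k so that E[Q(v)] = v."""
    d = v.size
    out = np.zeros_like(v)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = v[idx] * (d / k)
    return out

def marina(grads, x0, gamma, p, k, iters, seed=0):
    """Single-process sketch of a MARINA-style update.

    grads[i](x) returns the local gradient of f_i at x. With probability p all
    nodes communicate their exact local gradients; otherwise each node sends a
    compressed gradient *difference*, and the (biased) estimator g is updated
    additively.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    g = np.mean([gi(x) for gi in grads], axis=0)  # g^0: exact gradient
    for _ in range(iters):
        x_new = x - gamma * g
        if rng.random() < p:
            # synchronization step: exact gradient at the new point
            g = np.mean([gi(x_new) for gi in grads], axis=0)
        else:
            # compressed gradient differences, averaged over the n nodes
            g = g + np.mean(
                [rand_k(gi(x_new) - gi(x), k, rng) for gi in grads], axis=0
            )
        x = x_new
    return x

# Toy usage on n local least-squares objectives f_i(x) = 0.5 * ||A_i x - b_i||^2.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, n = 20, 4
    A = [rng.standard_normal((30, d)) for _ in range(n)]
    b = [rng.standard_normal(30) for _ in range(n)]
    grads = [lambda x, A=A[i], b=b[i]: A.T @ (A @ x - b) for i in range(n)]
    x_out = marina(grads, np.zeros(d), gamma=1e-3, p=0.2, k=4, iters=5000)
    print(np.linalg.norm(np.mean([gi(x_out) for gi in grads], axis=0)))
```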

Cited by 7 publications (11 citation statements)
References 17 publications
“…We leave further improvements to future work. For example, one may ask whether our approach can be combined with the benefits provided by multiple local update steps (McMahan et al, 2017; Stich, 2019; Khaled et al, 2020a; Karimireddy et al, 2020), with additional variance reduction techniques (Horváth et al, 2019b), and to what extent one can extend our results to structured nonconvex problems (Li, 2021b; Gorbunov et al, 2021).…”
Section: Discussion (mentioning)
confidence: 99%
“…Several recent theoretical results suggest that by combining an appropriate (randomized) compression operator with a suitably designed gradient-type method, one can obtain improvement in the total communication complexity over comparable baselines not performing any compression. For instance, this is the case for distributed compressed gradient descent (CGD) (Alistarh et al, 2017; Khirirat et al, 2018; Horváth et al, 2019a), and distributed CGD methods which employ variance reduction to tame the variance introduced by compression (Hanzely et al, 2018; Mishchenko et al, 2019; Horváth et al, 2019b; Gorbunov et al, 2021).…”
Section: Methods With Compressed Communication (mentioning)
confidence: 99%
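For contrast with the gradient-difference strategy sketched earlier, here is a hedged sketch of the plain distributed compressed gradient descent (CGD) baseline this quote refers to: each node compresses its local gradient directly and the server averages the compressed messages. The name `distributed_cgd`, the Rand-K compressor, and the default parameters are illustrative assumptions; without variance reduction the compression noise does not vanish near a stationary point, which is what DIANA/MARINA-style difference compression addresses.

```python
import numpy as np

def rand_k(v, k, rng):
    # unbiased Rand-K sparsifier (same operator as in the earlier sketch)
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=k, replace=False)
    out[idx] = v[idx] * (v.size / k)
    return out

def distributed_cgd(grads, x0, gamma=1e-3, k=4, iters=5000, seed=0):
    """Each node sends a compressed local gradient every round; the server
    averages the n compressed vectors and takes a gradient step."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        g = np.mean([rand_k(gi(x), k, rng) for gi in grads], axis=0)
        x = x - gamma * g
    return x
```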
“…The prevalent paradigm for training federated learning (FL) models [Konečný et al, 2016b,a, McMahan et al, 2017] (see also the recent surveys by Kairouz et al [2019], Li et al [2020a]) is to use distributed first-order optimization methods employing one or more tools for enhancing communication efficiency, which is a key bottleneck in the federated setting. These tools include communication compression [Konečný et al, 2016b, Alistarh et al, 2017, Khirirat et al, 2018] and techniques for progressively reducing the variance introduced by compression [Mishchenko et al, 2019, Horváth et al, 2019, Gorbunov et al, 2020a, Li et al, 2020b, Gorbunov et al, 2021a], local computation [McMahan et al, 2017, Stich, 2020, Khaled et al, 2020, Mishchenko et al, 2021a] and techniques for reducing the client drift introduced by local computation [Karimireddy et al, 2020, Gorbunov et al, 2021b], and partial participation [McMahan et al, 2017, Gower et al, 2019] and techniques for taming the slow-down introduced by partial participation [Gorbunov et al, 2020a, Chen et al, 2020].…”
Section: First-order Methods For FL (mentioning)
confidence: 99%
“…There are two popular lines of work for tackling this communication-efficient federated learning problem. The first makes use of general and also bespoke lossy compression operators to compress the communicated messages before they are sent over the network (Mishchenko et al, 2019; Gorbunov et al, 2021; Li and Richtárik, 2021a), and the second line bets on increasing the local workload by performing multiple local update steps, e.g., multiple…¹

¹ We point out that S-Local-SVRG (Gorbunov et al, 2020) only considered the case where S = N and K ≤ M, i.e., the sampled clients S always is the whole set of clients N for all communication rounds. As a result, the total communication complexity (i.e., number of rounds × communicated clients S in each round) of S-Local-SVRG is…”
Section: Algorithm (mentioning)
confidence: 99%
“…There are lots of works belonging to these two categories. In particular, for the first category, the current state-of-the-art results in strongly convex, convex, and nonconvex settings are given by ; Li and Richtárik (2021a); Gorbunov et al (2021), respectively. For the second category, local methods such as FedAvg (McMahan et al, 2017) and SCAFFOLD (Karimireddy et al, 2020) perform multiple local update steps in each communication round in the hope that these are useful to decrease the number of communication rounds needed to train the model.…”
Section: Related Work (mentioning)
confidence: 99%