2021
DOI: 10.48550/arxiv.2102.07845
Preprint

MARINA: Faster Non-Convex Distributed Learning with Compression

Abstract: We develop and analyze MARINA: a new communication-efficient method for non-convex distributed learning over heterogeneous datasets. MARINA employs a novel communication compression strategy based on the compression of gradient differences, which is reminiscent of but different from the strategy employed in the DIANA method of Mishchenko et al. (2019). Unlike virtually all competing distributed first-order methods, including DIANA, ours is based on a carefully designed biased gradient estimator, which is the key to its superior theore…
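The mechanism the abstract describes can be summarized in a few lines: rather than compressing local gradients directly, each node compresses the difference between its current and previous local gradients, and with a small probability the nodes synchronize on the exact gradient. Below is a minimal single-process sketch of such an update rule, assuming an unbiased Rand-K sparsifier; the names (`rand_k`, `marina`), the step size `gamma`, the probability `p`, and the toy least-squares objectives are illustrative placeholders, not the paper's reference implementation or prescribed parameter choices.

```python
import numpy as np

def rand_k(v, k, rng):
    """Unbiased Rand-K sparsifier: keep k random coordinates and rescale
    by d/k so that E[Q(v)] = v."""
    d = v.size
    out = np.zeros_like(v)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = v[idx] * (d / k)
    return out

def marina(grads, x0, gamma, p, k, iters, seed=0):
    """Single-process sketch of a MARINA-style update.

    grads[i](x) returns the local gradient of f_i at x. With probability p all
    nodes communicate their exact local gradients; otherwise each node sends a
    compressed gradient *difference*, and the (biased) estimator g is updated
    additively.
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    g = np.mean([gi(x) for gi in grads], axis=0)  # g^0: exact gradient
    for _ in range(iters):
        x_new = x - gamma * g
        if rng.random() < p:
            # synchronization step: exact gradient at the new point
            g = np.mean([gi(x_new) for gi in grads], axis=0)
        else:
            # compressed gradient differences, averaged over the n nodes
            g = g + np.mean(
                [rand_k(gi(x_new) - gi(x), k, rng) for gi in grads], axis=0
            )
        x = x_new
    return x

# Toy usage on n local least-squares objectives f_i(x) = 0.5 * ||A_i x - b_i||^2.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    d, n = 20, 4
    A = [rng.standard_normal((30, d)) for _ in range(n)]
    b = [rng.standard_normal(30) for _ in range(n)]
    grads = [lambda x, A=A[i], b=b[i]: A.T @ (A @ x - b) for i in range(n)]
    x_out = marina(grads, np.zeros(d), gamma=1e-3, p=0.2, k=4, iters=5000)
    print(np.linalg.norm(np.mean([gi(x_out) for gi in grads], axis=0)))
```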

Cited by 7 publications (11 citation statements)
References 17 publications
“…We leave further improvements to future work. For example, one may ask whether our approach can be combined with the benefits provided by multiple local update steps (McMahan et al, 2017; Stich, 2019; Khaled et al, 2020a; Karimireddy et al, 2020), with additional variance reduction techniques (Horváth et al, 2019b), and to what extent one can extend our results to structured nonconvex problems (Li, 2021b; Gorbunov et al, 2021).…”
Section: Discussion (mentioning)
confidence: 99%
“…Several recent theoretical results suggest that by combining an appropriate (randomized) compression operator with a suitably designed gradient-type method, one can obtain improvement in the total communication complexity over comparable baselines not performing any compression. For instance, this is the case for distributed compressed gradient descent (CGD) (Alistarh et al, 2017; Khirirat et al, 2018; Horváth et al, 2019a), and distributed CGD methods which employ variance reduction to tame the variance introduced by compression (Hanzely et al, 2018; Mishchenko et al, 2019; Horváth et al, 2019b; Gorbunov et al, 2021).…”
Section: Methods With Compressed Communication (mentioning)
confidence: 99%
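For contrast with the gradient-difference strategy sketched earlier, here is a hedged sketch of the plain distributed compressed gradient descent (CGD) baseline this quote refers to: each node compresses its local gradient directly and the server averages the compressed messages. The name `distributed_cgd`, the Rand-K compressor, and the default parameters are illustrative assumptions; without variance reduction the compression noise does not vanish near a stationary point, which is what DIANA/MARINA-style difference compression addresses.

```python
import numpy as np

def rand_k(v, k, rng):
    # unbiased Rand-K sparsifier (same operator as in the earlier sketch)
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=k, replace=False)
    out[idx] = v[idx] * (v.size / k)
    return out

def distributed_cgd(grads, x0, gamma=1e-3, k=4, iters=5000, seed=0):
    """Each node sends a compressed local gradient every round; the server
    averages the n compressed vectors and takes a gradient step."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        g = np.mean([rand_k(gi(x), k, rng) for gi in grads], axis=0)
        x = x - gamma * g
    return x
```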
“…The prevalent paradigm for training federated learning (FL) models [Konečný et al, 2016b,a, McMahan et al, 2017] (see also the recent surveys by Kairouz et al [2019], Li et al [2020a]) is to use distributed first-order optimization methods employing one or more tools for enhancing communication efficiency, which is a key bottleneck in the federated setting. These tools include communication compression [Konečný et al, 2016b, Alistarh et al, 2017, Khirirat et al, 2018] and techniques for progressively reducing the variance introduced by compression [Mishchenko et al, 2019, Horváth et al, 2019, Gorbunov et al, 2020a, Li et al, 2020b, Gorbunov et al, 2021a], local computation [McMahan et al, 2017, Stich, 2020, Khaled et al, 2020, Mishchenko et al, 2021a] and techniques for reducing the client drift introduced by local computation [Karimireddy et al, 2020, Gorbunov et al, 2021b], and partial participation [McMahan et al, 2017, Gower et al, 2019] and techniques for taming the slow-down introduced by partial participation [Gorbunov et al, 2020a, Chen et al, 2020].…”
Section: First-order Methods For FL (mentioning)
confidence: 99%
“…There are two popular lines of work for tackling this communication-efficient federated learning problem. The first makes use of general and also bespoke lossy compression operators to compress the communicated messages before they are sent over the network (Mishchenko et al, 2019; Gorbunov et al, 2021; Li and Richtárik, 2021a), and the second line bets on increasing the local workload by performing multiple local update steps, e.g., multiple…¹

¹ We point out that S-Local-SVRG (Gorbunov et al, 2020) only considered the case where S = N and K ≤ M, i.e., the sampled clients S always is the whole set of clients N for all communication rounds. As a result, the total communication complexity (i.e., number of rounds × communicated clients S in each round) of S-Local-SVRG is…”
Section: Algorithm (mentioning)
confidence: 99%
“…There are lots of works belonging to these two categories. In particular, for the first category, the current state-of-the-art results in strongly convex, convex, and nonconvex settings are given by ; Li and Richtárik (2021a); Gorbunov et al (2021), respectively. For the second category, local methods such as FedAvg (McMahan et al, 2017) and SCAFFOLD (Karimireddy et al, 2020) perform multiple local update steps in each communication round in the hope that these are useful to decrease the number of communication rounds needed to train the model.…”
Section: Related Work (mentioning)
confidence: 99%