2022
DOI: 10.48550/arxiv.2205.05040
Preprint

A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks

Abstract: In distributed training of deep neural networks or Federated Learning (FL), people usually run Stochastic Gradient Descent (SGD) or its variants on each machine and communicate with other machines periodically. However, SGD might converge slowly in training some deep neural networks (e.g., RNN, LSTM) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single machine setting, but exploring this technique in the FL setting is still in its infancy: it remain…

Cited by 1 publication (20 citation statements)
References 28 publications (53 reference statements)
“…Gradient clipping (Pascanu et al., 2012) is a well-known strategy for improving the training of deep neural networks that suffer from the exploding gradient issue, such as Recurrent Neural Networks (RNN) (Rumelhart et al., 1986; Elman, 1990; Werbos, 1988) and Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997). Although it is a widely used strategy, formal analysis of gradient clipping for deep neural networks under the framework of nonconvex optimization has appeared only recently (Zhang et al., 2019a; Cutkosky & Mehta, 2021; Liu et al., 2022). In particular, Zhang et al. (2019a) showed empirically that the gradient Lipschitz constant scales linearly with the gradient norm when training certain neural networks such as AWD-LSTM (Merity et al., 2018), introduced the relaxed smoothness condition (i.e., (L0, L1)-smoothness), and proved that clipped gradient descent converges faster than gradient descent with any fixed step size.…”
Section: Introduction
confidence: 99%
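The excerpt above refers to clipping by norm and the clipped gradient descent update. The following is a minimal sketch of that update rule, assuming clipping by global norm; the function names, learning rate, and threshold are illustrative and not taken from the cited papers.

```python
# Hypothetical sketch of gradient clipping and a clipped gradient descent step.
import numpy as np

def clip_gradient(grad, threshold):
    """Rescale grad so its norm is at most `threshold` (clipping by norm)."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

def clipped_gd_step(params, grad_fn, lr=0.1, threshold=1.0):
    """One clipped gradient descent update: x <- x - lr * clip(grad f(x))."""
    return params - lr * clip_gradient(grad_fn(params), threshold)

# Example on a simple quadratic f(x) = 0.5 * ||x||^2, so grad f(x) = x.
x = np.array([10.0, -10.0])
for _ in range(5):
    x = clipped_gd_step(x, grad_fn=lambda v: v, lr=0.5, threshold=1.0)
```

When the gradient norm exceeds the threshold, the effective step size shrinks proportionally, which is what allows a larger nominal step size under the (L0, L1)-smoothness condition discussed above.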
“…Although there is a vast literature on FL (see (Kairouz et al., 2019) and the references therein), the theoretical and algorithmic understanding of gradient clipping algorithms for training deep neural networks in the FL setting remains nascent. To the best of our knowledge, Liu et al. (2022) is the only work that has considered a communication-efficient distributed gradient clipping algorithm under nonconvex and relaxed smoothness conditions in the FL setting. In particular, Liu et al. (2022) proved that their algorithm achieves linear speedup in the number of clients with reduced communication rounds.…”
Section: Introduction
confidence: 99%
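The excerpt describes a communication-efficient distributed clipping scheme in which clients run local updates and communicate only periodically. Below is a generic sketch of that pattern, assuming local clipped SGD followed by simple parameter averaging at each communication round; it is not the exact algorithm of Liu et al. (2022), and all names and hyperparameters are illustrative.

```python
# Hypothetical sketch of local clipped SGD with periodic averaging
# (a generic communication-efficient pattern, not Liu et al. (2022)'s exact algorithm).
import numpy as np

def local_clipped_sgd_round(params, client_grad_fns, local_steps=10,
                            lr=0.1, threshold=1.0):
    """One communication round: each client starts from the shared model,
    runs `local_steps` clipped gradient updates locally, and the server
    averages the resulting models (one communication per round)."""
    client_models = []
    for grad_fn in client_grad_fns:
        x = params.copy()
        for _ in range(local_steps):
            g = grad_fn(x)
            norm = np.linalg.norm(g)
            if norm > threshold:
                g = g * (threshold / norm)  # clip by norm before the update
            x = x - lr * g
        client_models.append(x)
    return np.mean(client_models, axis=0)

# Example with two clients whose local objectives are quadratics with different minimizers.
clients = [lambda v: v - 1.0, lambda v: v + 1.0]
x = np.array([5.0, 5.0])
for _ in range(3):
    x = local_clipped_sgd_round(x, clients, local_steps=5, lr=0.1, threshold=1.0)
```

Communicating once per round of several local steps, rather than after every gradient update, is what reduces the number of communication rounds in this style of algorithm.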