2022
DOI: 10.48550/arxiv.2205.05040
Preprint

A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks

Abstract: In distributed training of deep neural networks or Federated Learning (FL), people usually run Stochastic Gradient Descent (SGD) or its variants on each machine and communicate with other machines periodically. However, SGD might converge slowly in training some deep neural networks (e.g., RNN, LSTM) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single machine setting, but exploring this technique in the FL setting is still in its infancy: it remain…

Cited by 1 publication (20 citation statements)
References 28 publications (53 reference statements)
“…Gradient clipping (Pascanu et al., 2012) is a well-known strategy for improving the training of deep neural networks that suffer from the exploding gradient issue, such as Recurrent Neural Networks (RNN) (Rumelhart et al., 1986; Elman, 1990; Werbos, 1988) and Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber, 1997). Although it is a widely used strategy, formal analysis of gradient clipping for deep neural networks under the framework of nonconvex optimization has appeared only recently (Zhang et al., 2019a; Cutkosky & Mehta, 2021; Liu et al., 2022). In particular, Zhang et al. (2019a) showed empirically that the gradient Lipschitz constant scales linearly with the gradient norm when training certain neural networks such as AWD-LSTM (Merity et al., 2018), introduced the relaxed smoothness condition (i.e., (L0, L1)-smoothness), and proved that clipped gradient descent converges faster than gradient descent with any fixed step size.…”
Section: Introduction
confidence: 99%
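The excerpt above refers to clipping by norm and the clipped gradient descent update. The following is a minimal sketch of that update rule, assuming clipping by global norm; the function names, learning rate, and threshold are illustrative and not taken from the cited papers.

```python
# Hypothetical sketch of gradient clipping and a clipped gradient descent step.
import numpy as np

def clip_gradient(grad, threshold):
    """Rescale grad so its norm is at most `threshold` (clipping by norm)."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad

def clipped_gd_step(params, grad_fn, lr=0.1, threshold=1.0):
    """One clipped gradient descent update: x <- x - lr * clip(grad f(x))."""
    return params - lr * clip_gradient(grad_fn(params), threshold)

# Example on a simple quadratic f(x) = 0.5 * ||x||^2, so grad f(x) = x.
x = np.array([10.0, -10.0])
for _ in range(5):
    x = clipped_gd_step(x, grad_fn=lambda v: v, lr=0.5, threshold=1.0)
```

When the gradient norm exceeds the threshold, the effective step size shrinks proportionally, which is what allows a larger nominal step size under the (L0, L1)-smoothness condition discussed above.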
“…Although there is a vast literature on FL (see (Kairouz et al., 2019) and the references therein), the theoretical and algorithmic understanding of gradient clipping algorithms for training deep neural networks in the FL setting remains nascent. To the best of our knowledge, Liu et al. (2022) is the only work that has considered a communication-efficient distributed gradient clipping algorithm under nonconvex and relaxed smoothness conditions in the FL setting. In particular, Liu et al. (2022) proved that their algorithm achieves linear speedup in the number of clients with reduced communication rounds.…”
Section: Introduction
confidence: 99%
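The excerpt describes a communication-efficient distributed clipping scheme in which clients run local updates and communicate only periodically. Below is a generic sketch of that pattern, assuming local clipped SGD followed by simple parameter averaging at each communication round; it is not the exact algorithm of Liu et al. (2022), and all names and hyperparameters are illustrative.

```python
# Hypothetical sketch of local clipped SGD with periodic averaging
# (a generic communication-efficient pattern, not Liu et al. (2022)'s exact algorithm).
import numpy as np

def local_clipped_sgd_round(params, client_grad_fns, local_steps=10,
                            lr=0.1, threshold=1.0):
    """One communication round: each client starts from the shared model,
    runs `local_steps` clipped gradient updates locally, and the server
    averages the resulting models (one communication per round)."""
    client_models = []
    for grad_fn in client_grad_fns:
        x = params.copy()
        for _ in range(local_steps):
            g = grad_fn(x)
            norm = np.linalg.norm(g)
            if norm > threshold:
                g = g * (threshold / norm)  # clip by norm before the update
            x = x - lr * g
        client_models.append(x)
    return np.mean(client_models, axis=0)

# Example with two clients whose local objectives are quadratics with different minimizers.
clients = [lambda v: v - 1.0, lambda v: v + 1.0]
x = np.array([5.0, 5.0])
for _ in range(3):
    x = local_clipped_sgd_round(x, clients, local_steps=5, lr=0.1, threshold=1.0)
```

Communicating once per round of several local steps, rather than after every gradient update, is what reduces the number of communication rounds in this style of algorithm.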