2022
DOI: 10.1145/3544782

Scheduling Hyperparameters to Improve Generalization: From Centralized SGD to Asynchronous SGD

Abstract: This paper studies how to schedule hyperparameters to improve generalization of both centralized single-machine stochastic gradient descent (SGD) and distributed asynchronous SGD (ASGD). SGD augmented with momentum variants (e.g., heavy ball momentum (SHB) and Nesterov's accelerated gradient (NAG)) has been the default optimizer for many tasks, in both centralized and distributed environments. However, many advanced momentum variants, de…
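The abstract refers to two classical momentum variants of SGD, heavy-ball momentum (SHB) and Nesterov's accelerated gradient (NAG). As a minimal sketch of those two update rules (not the paper's proposed hyperparameter schedule), with `lr` and `beta` as illustrative values rather than settings from the paper:

```python
import numpy as np

def shb_step(w, v, grad, lr=0.1, beta=0.9):
    """One heavy-ball (SHB) step: v <- beta*v - lr*grad(w), then w <- w + v."""
    v = beta * v - lr * grad(w)
    return w + v, v

def nag_step(w, v, grad, lr=0.1, beta=0.9):
    """One Nesterov (NAG) step: the gradient is evaluated at the
    look-ahead point w + beta*v instead of at w."""
    v = beta * v - lr * grad(w + beta * v)
    return w + v, v

# Toy check on f(w) = 0.5 * ||w||^2, whose gradient is simply w.
grad = lambda w: w
w, v = np.ones(3), np.zeros(3)
for _ in range(200):
    w, v = nag_step(w, v, grad)
print(w)  # close to the minimizer at the origin
```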

Cited by 7 publications (6 citation statements)
References 35 publications

“…Convolutional neural network architectures interfere with the L2 regularization and make minimization of the loss too difficult for SGDW. However, two techniques can improve minimization of the loss function: projection [38] and hyperparameter methods [39]. Before introducing the projection technique for SGD, it is necessary to recall batch normalization [40].…”
Section: SGD-Type Algorithms
Mentioning, confidence: 99%
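The quoted passage contrasts plain L2 regularization with SGDW, i.e. SGD with decoupled weight decay. A minimal sketch of the difference, assuming a momentum buffer `v` and illustrative hyperparameters `lr`, `beta`, and `wd` (none taken from the cited works), might look like:

```python
def sgd_momentum_l2(w, v, grad, lr=0.1, beta=0.9, wd=1e-4):
    """L2 regularization: the decay term wd*w is added to the gradient,
    so it also accumulates inside the momentum buffer v."""
    g = grad(w) + wd * w
    v = beta * v + g
    return w - lr * v, v

def sgdw_momentum(w, v, grad, lr=0.1, beta=0.9, wd=1e-4):
    """Decoupled weight decay (SGDW-style): the decay acts directly on the
    weights and is kept out of the momentum buffer."""
    v = beta * v + grad(w)
    return w - lr * v - lr * wd * w, v
```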
“…Besides momentum and the Nesterov condition, this algorithm can be equipped with L2 regularization (an extension of weight decay), projection, and hyper-parameters, which significantly increase test accuracy in various types of neural networks. These tools remain relevant in other modifications of SGD, such as SGDW [11], SGDP [12], and QHM [13]. For achieving higher accuracy in every artificial neural network, SGDM with the Nesterov condition is not the most appropriate approach.…”
Section: Preliminaries, A. Gradient Descent With Step-Size Adaptation
Mentioning, confidence: 99%
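Of the variants named here, QHM [13] (quasi-hyperbolic momentum) replaces the update direction with a convex combination of the raw gradient and a momentum buffer. A minimal sketch, with illustrative hyperparameters `lr`, `beta`, and `nu` rather than values from the cited work, might be:

```python
def qhm_step(w, d, grad, lr=0.1, beta=0.999, nu=0.7):
    """Quasi-hyperbolic momentum (QHM-style): mix the raw gradient g and
    the exponential moving average d with interpolation weight nu."""
    g = grad(w)
    d = beta * d + (1.0 - beta) * g            # EMA of past gradients
    w = w - lr * ((1.0 - nu) * g + nu * d)     # nu=0 is plain SGD, nu=1 is EMA-momentum
    return w, d
```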
“…The architecture of these neural networks interferes with the L2 regularization and makes the process of minimization too difficult for SGDW. But there are two techniques that can improve the quality of loss minimization: projection [31] and hyper-parameter methods [32].…”
Section: SGD-Type Algorithms
Mentioning, confidence: 99%