2022
DOI: 10.1016/j.neucom.2022.01.014

Calibrating the adaptive learning rate to improve convergence of ADAM

Cited by 36 publications (16 citation statements)
References 2 publications
“…input layer and hidden layers) based on a given dropout probability in every iteration. The ‘learning rate’ of a DNN represents the amount of change applied at each weight update of the model, based on the estimated error [80]. The convergence of the model to an optimal solution depends on the learning rate.…”
Section: Results
Mentioning, confidence: 99%
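The excerpt above describes the learning rate as the factor that scales each weight update against the estimated error. Below is a minimal sketch of that relationship; the function name and array values are illustrative stand-ins, not taken from the cited work.

```python
import numpy as np

def sgd_step(weights, gradients, learning_rate=0.01):
    """Plain gradient-descent update: the learning rate scales how much
    each weight changes in response to the estimated error gradient."""
    return weights - learning_rate * gradients

# A larger learning rate takes bigger steps, which can either speed up
# convergence toward an optimum or overshoot and destabilize it.
w = np.array([0.5, -1.2, 3.0])
g = np.array([0.1, -0.4, 0.2])   # gradient of the loss w.r.t. the weights
print(sgd_step(w, g, learning_rate=0.01))
print(sgd_step(w, g, learning_rate=0.5))
```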
“…As a hyper-parameter, the learning rate of SGD is often difficult to tune, because the magnitudes of different parameters vary greatly and adjustment is required during the training process. Several adaptive gradient-descent variants have been created to address this problem, including Adaptive Moment Estimation (Adam) [115], RMSprop [116], Ranger [117], Momentum [118], and Nesterov [119]. These algorithms automatically adapt the learning rate of each parameter based on gradient statistics, leading to faster convergence and simpler learning strategies, and they have been used in many neural networks applied to CEA applications, as demonstrated in Figure 11.…”
Section: Discussion
Mentioning, confidence: 99%
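For reference, the sketch below shows the standard Adam update (Kingma and Ba) in NumPy, illustrating how the per-parameter step size adapts to running gradient statistics. It is a generic sketch of the baseline algorithm, not the calibrated variant proposed in the indexed paper, and the toy quadratic loss in the usage example is an assumption.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient and its
    square give each parameter its own effective step size."""
    m = beta1 * m + (1 - beta1) * grad           # first moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive, per-parameter update
    return w, m, v

# Toy usage on the quadratic loss 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 101):
    w, m, v = adam_step(w, w, m, v, t)
print(w)  # drifts toward the minimum at the origin
```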
“…We took the mean over pings for all loss terms. We took the mean over the batch dimension; for outputs conditioned on the orientation of the echosounder, we masked out irrelevant samples. The model was optimized using the RangerVA optimizer (Wright, 2019), which combines RAdam, Lookahead, and gradient centralization (Zhang et al., 2019; Liu et al., 2020; Yong et al., 2020; Tong et al., 2022), with a weight decay of 1 × 10⁻⁵. We used a batch size of 12 samples, and stratified the batches to contain the same ratio of downfacing and upfacing samples as available in the aggregated training set.…”
Section: Model Training
Mentioning, confidence: 99%
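The training setup quoted above (RangerVA, weight decay of 1 × 10⁻⁵, batch size 12) could be reproduced roughly as sketched below. The model, data, and learning rate are placeholders, and stock torch.optim.Adam stands in for RangerVA so the sketch runs with plain PyTorch; in practice the RangerVA class from Wright's Ranger repository would be substituted at the same line.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data, standing in for the echosounder segmentation
# network and dataset described in the citing paper.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 2))
data = TensorDataset(torch.randn(120, 64), torch.randint(0, 2, (120,)))
loader = DataLoader(data, batch_size=12, shuffle=True)  # batch size 12, as reported

# Stand-in optimizer: the cited work used RangerVA (RAdam + Lookahead +
# gradient centralization). Only the weight decay of 1e-5 comes from the
# citation statement; lr=1e-3 is an assumed default.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
criterion = nn.CrossEntropyLoss()  # averages over the batch dimension by default

for epoch in range(2):
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```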