2021
DOI: 10.1609/aaai.v35i8.16837
Understanding Decoupled and Early Weight Decay

Abstract: Weight decay (WD) is a traditional regularization technique in deep learning, but despite its ubiquity, its behavior is still an area of active research. Golatkar et al. have recently shown that, in computer vision, WD only matters at the start of training, upending traditional wisdom. Loshchilov et al. show that for adaptive optimizers, manually decaying weights can outperform adding an l2 penalty to the loss. This technique has become increasingly popular and is referred to as decoupled WD. The goal of this…
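To make the distinction the abstract draws concrete, below is a minimal numpy sketch contrasting an l2 penalty with decoupled weight decay in an Adam-style update. It is an illustration of the general idea, not the paper's code; the variable names, hyper-parameter defaults, and exact update form are assumptions for the example.

import numpy as np

def adam_step_l2(w, g, m, v, t, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    # l2 penalty: wd * w is folded into the gradient, so the regularization
    # term is rescaled by the adaptive denominator along with the loss gradient.
    g = g + wd * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

def adam_step_decoupled(w, g, m, v, t, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    # Decoupled WD: the weights are shrunk directly, independently of the
    # adaptive update computed from the loss gradient alone.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * wd * w
    return w, m, v

With plain SGD the two variants coincide up to a rescaling of wd; the difference matters for adaptive optimizers, where the l2 gradient term is also divided by the adaptive denominator while the decoupled term is not.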

Cited by 36 publications (39 citation statements). References 22 publications.

Citation statements:
“…This idea of allowing a hyper-parameter value to change during training (in place of a learning rate schedule) has been extended to other hyper-parameters, such as weight decay [13][14][15][16] and batch sizes [17]. Previous work [20] has demonstrated empirically a relationship between the optimal hyper-parameters of learning rate (LR), weight decay (WD), batch size (BS), and momentum (m) as…”
Section: Hyper-parameters and Regularization
confidence: 99%
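As a concrete illustration of scheduling weight decay itself during training (and of the "early" weight decay in the paper's title), a minimal sketch follows. The schedule shape, cutoff fraction, and function name are illustrative assumptions, not values taken from the cited works.

def early_weight_decay(step, total_steps, wd_max=1e-2, active_fraction=0.3):
    # Illustrative schedule: apply weight decay only during an early
    # fraction of training, then switch it off for the remaining steps.
    return wd_max if step < active_fraction * total_steps else 0.0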
“…General cyclical training provides an intuitive understanding of the value of a one-cycle training regime. Furthermore, this idea of allowing a hyper-parameter value to change during training has been extended to other hyper-parameters, such as weight decay [13][14][15][16] and batch sizes [17].…”
Section: Introduction
confidence: 99%
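In the same spirit, a cyclical (one-cycle-style) schedule for weight decay could look like the triangular sketch below; the shape and endpoint values are illustrative assumptions only.

def one_cycle_weight_decay(step, total_steps, wd_min=1e-4, wd_max=1e-2):
    # Illustrative triangular schedule: ramp weight decay from wd_min up to
    # wd_max at mid-training, then back down to wd_min by the final step.
    half = total_steps / 2
    phase = step / half if step < half else (total_steps - step) / half
    return wd_min + (wd_max - wd_min) * phase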
“…Additionally, different layers are added to prevent the problem of overfitting, such as dropout layers and a normalization layer that keeps the output mean close to 0 and the standard deviation close to 1. This layer hence accelerates training [13].…”
Section: Fully Connected Layer (FC)
confidence: 99%
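A minimal PyTorch-style sketch of such a fully connected head is shown below; the layer sizes and dropout rate are hypothetical, and this is not the cited model's architecture.

import torch.nn as nn

# Illustrative fully connected head (hypothetical sizes): batch normalization
# keeps each layer's output near zero mean and unit standard deviation per
# mini-batch, and dropout randomly zeroes activations to counteract overfitting.
fc_head = nn.Sequential(
    nn.Linear(512, 128),
    nn.BatchNorm1d(128),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(128, 10),
)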
“…Batch normalization is a technique to standardize activations in intermediate layers of deep neural networks across mini-batches. It has demonstrated improved accuracy and faster convergence due to its stabilization of the learning process [30]. Additionally, introducing batch normalization allows the inputs and outputs of the regression model to remain unscaled, thus retaining the hierarchical structure of the coherency-loss function.…”
Section: Regressor Design
confidence: 99%
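For reference, the per-mini-batch standardization that batch normalization performs on a batch of activations x_1, ..., x_m (with learned scale gamma, learned shift beta, and a small constant epsilon for numerical stability) is:

\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2, \qquad
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad
y_i = \gamma \hat{x}_i + \beta.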