2023
DOI: 10.48550/arxiv.2302.06675
Preprint

Symbolic Discovery of Optimization Algorithms

Abstract: We present a method to formulate algorithm discovery as program search, and apply it to discover optimization algorithms for deep neural network training. We leverage efficient search techniques to explore an infinite and sparse program space. To bridge the large generalization gap between proxy and target tasks, we also introduce program selection and simplification strategies. Our method discovers a simple and effective optimization algorithm, Lion (EvoLved Sign Momentum). It is more memory-efficient than Adam as it only keeps track of the momentum. …
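As context for the citation statements below, here is a minimal sketch of the Lion update rule described in the paper: the update direction is the sign of an interpolation between the momentum and the current gradient, weight decay is decoupled, and only a single momentum buffer is kept as optimizer state, which is where the memory advantage over Adam comes from. The function and argument names are illustrative, not the authors' reference implementation.

import numpy as np

def lion_update(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.0):
    # Update direction: sign of an interpolation between momentum and gradient,
    # so every coordinate moves by the same magnitude (lr), unlike Adam/AdamW.
    update = np.sign(beta1 * m + (1.0 - beta1) * grad)
    # Decoupled weight decay, applied directly to the parameters (AdamW-style).
    new_param = param - lr * (update + weight_decay * param)
    # Only one piece of optimizer state: the momentum buffer.
    new_m = beta2 * m + (1.0 - beta2) * grad
    return new_param, new_m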

Cited by 33 publications (40 citation statements)
References 28 publications
“…• Another question is, why not use an optimizer such as Lion [7] which does not divide updates by any value, and is therefore immune to the stuck-in-the-past scenario. We believe this may be a promising path forward.…”
Section: E StableAdamW Continued, E.1 Q and A (citation type: mentioning)
confidence: 99%
“…We study these two directions in the context of contrastive language-image pre-training (CLIP) [44]. We examine CLIP-style models because of their importance in computer vision: CLIP-style models reach state-ofthe-art performance on a wide range of image classification tasks [44,63,42,7] and underlie image generation methods such as DALL·E 2 [47] and Stable Diffusion [49]. Our contributions towards fast training and stable training are as follows.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…Precisions and optimizers. In Table 18, we show that sufficiently pre-trained EVA-02 representations are robust enough that can be fine-tuned using various numerical precisions (e.g., fp16 and bf16) and optimizers (e.g., Lion [25], AdamW [64,84], and SGD [87]). Remarkably, the fine-tuning can be done using the SGD optimizer with only little performance drop.…”
Section: A.2 Additional Results for Image Classification (citation type: mentioning)
confidence: 99%
“…The final IN-1K fine-tuning for all-sized models (including EVA-02-Ti and -S) can be done without using strong regularization such as cutmix [141], mixup [143] and random erasing [146]. In the Appendix, we show that our pre-trained representations are robust enough that can be fine-tuned using various numerical precisions (e.g., fp16 and bf16) and optimizers (e.g., Lion [25], AdamW [64,84], and SGD [87]). Remarkably, the fine-tuning can be done even using the SGD optimizer with only 0.1-point performance drop.…”
Section: Image Classification (citation type: mentioning)
confidence: 98%
“…By default, AdamW [61], a variant of Adam which decouples the L 2 regularization and the weight decay, is the most widely used optimizer for Transformers. More recently, Google searches optimization algorithms and discovers a simple and effective optimizer called Lion [18]. Lion only keeps track of the momentum with the first-order gradient, and its update only considers the sign direction and has the same magnitude for each parameter, which is very different from the adaptive optimizers like AdamW.…”
Section: Optimization (citation type: mentioning)
confidence: 99%
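To make the contrast drawn in the last quote concrete, the sketch below compares one AdamW step with one Lion step: AdamW scales each coordinate by a second-moment estimate, so per-parameter step magnitudes differ and two state buffers are kept, whereas Lion applies a uniform-magnitude sign update and stores only a momentum buffer. Hyperparameter defaults are illustrative, not prescriptions from the paper.

import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Two state buffers: first moment m and second moment v.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)  # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    # Per-parameter step size: the update is divided by sqrt of the second moment.
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v

def lion_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, weight_decay=0.01):
    # One state buffer (momentum); the sign gives every coordinate the same magnitude.
    param = param - lr * (np.sign(beta1 * m + (1.0 - beta1) * grad)
                          + weight_decay * param)
    m = beta2 * m + (1.0 - beta2) * grad
    return param, m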