2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.00780
Robust fine-tuning of zero-shot models

Abstract: We introduce new methods for 1) accelerating and 2) stabilizing training for large language-vision models. 1) Towards accelerating training, we introduce SwitchBack, a linear layer for int8 quantized training which provides a speed-up of 13-25% while matching the performance of bfloat16 training within 0.1 percentage points for the 1B-parameter CLIP ViT-Huge, the largest int8 training to date. Our main focus is int8, as GPU support for float8 is rare, though we also analyze float8 training through simulation. W…

Cited by 174 publications (211 citation statements)
References 66 publications
“…In contrast, Jia et al. (2022) learn both additional inputs to multiple layers of a pretrained vision transformer and a linear classifier on top. Another finetuning-based approach, proposed in Wortsman et al. (2022b), ensembles the weights of the zero-shot and finetuned models, while Zhang et al. (2020) train additional networks that are fused via summation. Specific to vision-language models, Zhou et al. (2022b) learn an adaptation network between CLIP's vision and text encoders.…”
Section: Adaptation Methods
confidence: 99%
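The weight-space ensembling attributed to Wortsman et al. (2022b) in the excerpt above amounts to a linear interpolation between the zero-shot and fine-tuned parameters. A minimal sketch follows, using plain Python dicts in place of real model state dicts; the function name and the default mixing coefficient are illustrative, not taken from the paper.

```python
def interpolate_weights(zero_shot, fine_tuned, alpha=0.5):
    """Linearly mix two parameter dicts: (1 - alpha) * zero_shot + alpha * fine_tuned.

    alpha = 0 recovers the zero-shot model; alpha = 1 the fine-tuned one.
    """
    assert zero_shot.keys() == fine_tuned.keys(), "models must share parameter names"
    return {name: (1 - alpha) * zero_shot[name] + alpha * fine_tuned[name]
            for name in zero_shot}

# Toy example with scalar stand-ins for weight tensors:
zs = {"w": 0.0, "b": 2.0}   # zero-shot parameters
ft = {"w": 1.0, "b": 4.0}   # fine-tuned parameters
mixed = interpolate_weights(zs, ft, alpha=0.25)
# mixed == {"w": 0.25, "b": 2.5}
```

With real models the same per-parameter interpolation would be applied over each tensor in the two state dicts; no extra training is needed, which is what makes the ensemble cheap.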
“…• Robustness. Wortsman et al. (2022) study robust fine-tuning of zero-shot models. Fang et al. (2022a) report that data determines distributional robustness in CLIP.…”
Section: Advanced Topics
confidence: 99%
“…However, retaining a large collection of fine-tuned task-expert models in the CL setting is memory-intensive, impractical, and undesirable. Instead, we show that we can simulate the empirical benefits highlighted in [41] through repeated momentum interpolation between our foundation model and a continuously fine-tuned variant. This allows us to avoid the drawbacks of pure fine-tuning, while both specializing on the new stream of tasks and retaining the generalizability of our foundation model.…”
Section: Introduction
confidence: 98%
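The "repeated momentum interpolation" mentioned above can be read as an exponential-moving-average-style update that repeatedly pulls the continuously fine-tuned weights back toward the frozen foundation model. The sketch below is under that assumption only; the function name and momentum value are illustrative and not taken from the cited work.

```python
def momentum_interpolate(foundation, current, momentum=0.99):
    """One EMA-style step: keep `momentum` of the current fine-tuned weights
    and blend in (1 - momentum) of the frozen foundation model."""
    return {name: momentum * current[name] + (1 - momentum) * foundation[name]
            for name in current}

# Applied repeatedly during continual fine-tuning, the weights stay anchored
# near the foundation model while still adapting to the new task stream.
anchor = {"w": 0.0}     # frozen foundation-model parameter
weights = {"w": 1.0}    # continuously fine-tuned parameter
for _ in range(3):
    weights = momentum_interpolate(anchor, weights, momentum=0.5)
# after 3 steps with momentum 0.5: w = 1.0 * 0.5**3 = 0.125
```

The design intuition matches the excerpt: interpolation toward the anchor limits drift (and hence forgetting), while the fine-tuning steps between interpolations provide the specialization.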
“…Their application to the CL problem set, which tackles a continuous distribution shift, stands to reason, with recent works showing notable benefits from the use of foundation models [40,21,31,42], particularly highlighting a reduction in catastrophic forgetting. Still, as learners are adapted to a continuously shifting training distribution, even foundation models will suffer from forgetting through fine-tuning [41].…”
Section: Introduction
confidence: 99%