2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022
DOI: 10.1109/cvpr52688.2022.01759
LiT: Zero-Shot Transfer with Locked-image text Tuning

Cited by 171 publications (135 citation statements)
References 20 publications
“…The high-level idea is to learn a shared embedding space for both image and text, such that paired images and texts stay close to each other, while unpaired ones are distant from each other. Follow-up work (Pham et al., 2021; Zhai et al., 2022b) studies the impact of training data and batch size in contrastive learning. They observed that additional high-quality data (Pham et al., 2021) or a pretrained vision model (Zhai et al., 2022b) can lead to better vision-language models, and that a large batch size is generally beneficial to contrastive learning.…”
Section: Related Work (mentioning)
confidence: 99%
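The shared-embedding idea the statement describes can be sketched as a symmetric image–text contrastive (InfoNCE) loss: paired rows sit on the diagonal of a similarity matrix and act as positives, while every other pairing in the batch is a negative. This is a minimal numpy sketch under that assumption; the function names and the temperature value are illustrative, not taken from any specific paper's code.

```python
import numpy as np

def l2_normalize(x):
    """Project each row embedding onto the unit sphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive (InfoNCE) loss.

    The i-th image and i-th text form a positive pair; all other
    pairings in the batch serve as negatives.
    """
    logits = l2_normalize(img_emb) @ l2_normalize(txt_emb).T / temperature

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)          # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()                # positives on the diagonal

    # average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

A quick sanity check: embeddings that are identical for each pair score a much lower loss than embeddings matched at random, which is exactly the "paired close, unpaired distant" behavior the statement describes. A large batch helps because each example then contrasts against more negatives in the same matrix.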
“…Follow-up work (Pham et al., 2021; Zhai et al., 2022b) studies the impact of training data and batch size in contrastive learning. They observed that additional high-quality data (Pham et al., 2021) or a pretrained vision model (Zhai et al., 2022b) can lead to better vision-language models, and that a large batch size is generally beneficial to contrastive learning. Furthermore, Zhai et al. (2022b) show that with a pretrained and locked vision model, one needs to train only a paired text encoder to obtain good language embeddings.…”
Section: Related Work (mentioning)
confidence: 99%
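The locked-image setup described above can be sketched as follows: features from a frozen, pretrained image tower are computed once, and only the text-side parameters receive gradient updates. Everything here is a toy illustration, not the paper's implementation: the linear "towers," the tiny dimensions, the temperature, and the finite-difference gradients (used so the sketch needs only numpy) are all assumptions.

```python
import numpy as np

def clip_loss(img_feats, txt_feats, temperature=1.0):
    """Symmetric contrastive loss over L2-normalized features."""
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
images = rng.normal(size=(8, 16))       # toy raw inputs
captions = rng.normal(size=(8, 16))
W_img = rng.normal(size=(16, 4))        # stands in for a frozen, pretrained image tower
W_txt = 0.1 * rng.normal(size=(16, 4))  # the only trainable parameters

# The image tower is locked: its features are computed once and never updated.
img_feats = images @ W_img

loss_before = clip_loss(img_feats, captions @ W_txt)
for _ in range(100):
    # finite-difference gradient w.r.t. W_txt only; the locked image tower
    # receives no updates at any point
    eps, grad = 1e-4, np.zeros_like(W_txt)
    base = clip_loss(img_feats, captions @ W_txt)
    for i in range(W_txt.shape[0]):
        for j in range(W_txt.shape[1]):
            W_txt[i, j] += eps
            grad[i, j] = (clip_loss(img_feats, captions @ W_txt) - base) / eps
            W_txt[i, j] -= eps
    W_txt -= 0.3 * grad
loss_after = clip_loss(img_feats, captions @ W_txt)
```

The design point the citation highlights is visible even in this toy: because `img_feats` is precomputed from a frozen tower, the contrastive objective only has to teach the text side to "read out" the existing image representation, which is cheaper than training both towers from scratch.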