2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.00692
Multimodal Contrastive Training for Visual Representation Learning

Cited by 98 publications (44 citation statements). References 29 publications.
“…ICMLM [48] and VirTex [15] showed that language supervision on COCO Captions produced useful visual representations. Prior to CLIP, Multimodal Contrastive Training [62] added contrastive image-image and language-image losses to VirTex, which further improved performance. CLIP [45] quickly garnered significant attention for its simplicity, scale, and strong results.…”
Section: Related Work
confidence: 99%
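
The combination described in that statement can be sketched as two InfoNCE terms over a shared embedding space: one contrasting two augmented views of the same image, one contrasting images with their paired captions. The snippet below is a minimal illustration, not the authors' code; the function names, the `alpha` weighting, and the 0.07 temperature are assumptions for the example.

```python
# Minimal sketch: image-image + image-text contrastive losses over
# precomputed embeddings. All names and hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def info_nce(queries: torch.Tensor, keys: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE: the i-th query should match the i-th key; all other
    keys in the batch act as negatives."""
    q = F.normalize(queries, dim=-1)
    k = F.normalize(keys, dim=-1)
    logits = q @ k.t() / temperature               # (B, B) similarities
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)

def multimodal_contrastive_loss(img_a, img_b, txt, alpha: float = 0.5):
    """img_a / img_b: embeddings of two augmented views of the same images;
    txt: embeddings of the paired captions. `alpha` balances the two
    terms (an assumed hyperparameter, not from the paper)."""
    loss_ii = info_nce(img_a, img_b)                               # image-image
    loss_it = 0.5 * (info_nce(img_a, txt) + info_nce(txt, img_a))  # image-text
    return alpha * loss_ii + (1 - alpha) * loss_it

# Toy usage with random embeddings:
B, D = 8, 256
loss = multimodal_contrastive_loss(
    torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
```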
“…Co-teaching+ (Yu et al., 2019) combines the features of both, in which each model teaches the other only on disagreement data. Yang et al. (2021) propose mutual contrastive learning, which enables two networks to learn extra contrastive information from each other. As text data is discrete and compositional, the quality of multiple augmentations can be uneven, which may harm the generalization of sentence embeddings.…”
Section: Related Work
confidence: 99%
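
As a rough illustration of that mutual setup, the sketch below has two networks exchange contrastive signal by treating the other network's embedding of the same sample as the positive. It is a minimal sketch under assumed names (`net_a`, `net_b`, `cross_network_nce`) and an assumed 0.07 temperature, not the exact method of Yang et al. (2021).

```python
# Minimal sketch: mutual contrastive learning between two networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

def cross_network_nce(z1, z2, temperature: float = 0.07):
    """Contrast network 1's embeddings against network 2's: the matching
    row is the positive, other samples in the batch are negatives.
    Symmetrized so each network learns from the other."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: two small MLP "networks" embed the same input batch.
net_a = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
net_b = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))
x = torch.randn(8, 32)
loss = cross_network_nce(net_a(x), net_b(x))
loss.backward()  # gradients flow into both networks
```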
“…Specifically, [68][69][70][71][72] employ pretrained language models and object detectors to learn visual features that are well aligned with the embeddings of caption words. Recent works [73,74] improve training efficiency by removing the need for object detectors and scale to hundreds of millions of samples for substantial performance gains [75]. Moreover, [28] proposes a novel open-vocabulary learning task and shows that pretrained visual features improve detection performance not only on base classes but also on novel classes.…”
Section: Related Work
confidence: 99%