2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01553
Learning the Best Pooling Strategy for Visual Semantic Embedding

Cited by 115 publications (57 citation statements)
References 24 publications
“…This focus makes our work closely related to the recent work on image-text models such as CLIP [64] and ALIGN [37], both of which have shown zero-shot transfer ability for image classification. Similar to CLIP and ALIGN, our work also learns the mapping between images and texts, which is related to many previous works, such as [2,3,8,13,22,24,32,33,36,42,51,52,54,56,57,60,[71][72][73]91].…”
Section: Related Workmentioning
confidence: 83%
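The statement above describes learning a shared mapping between images and texts, as in CLIP and ALIGN. A minimal NumPy sketch of the symmetric contrastive (InfoNCE-style) objective those models use is below; the batch shapes and temperature value are illustrative assumptions, not details from the cited works.

```python
import numpy as np

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    Matched image-text pairs (row i with row i) are pulled together;
    mismatched pairs within the batch are pushed apart.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    logits = image_emb @ text_emb.T / temperature  # (B, B) similarity matrix

    def cross_entropy(l):
        # Diagonal entries are the positive (matched) pairs.
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image->text and text->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Because both directions are averaged, the loss is low only when each image retrieves its own caption and vice versa.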
“…[image-to-text/text-to-image R@1/R@5/R@10 retrieval table omitted] …effectiveness of the proposed architecture. Tab.…”
Section: Ctc-1k Image-to-textmentioning
confidence: 99%
“…[Flickr30K (1K)/MS-COCO (5K) image-to-text and text-to-image R@1/R@5/R@10 retrieval table, including SCAN [22], omitted] …art methods at a low speed. Our model is not affected in those datasets when the modality of scene text is missing and still performs well on downstream tasks due to the fusion token based vision and scene text aggregation.…”
Section: Flickr30k (1k) Ms-coco (5k) Image-to-textmentioning
confidence: 99%
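The R@1/R@5/R@10 figures quoted in the statements above are standard cross-modal retrieval metrics: the fraction of queries whose ground-truth match appears in the top-k ranked candidates. A minimal sketch of computing Recall@K from a query-candidate similarity matrix (the one-match-per-query layout is an assumption for illustration):

```python
import numpy as np

def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth match ranks in the top-k.

    `similarity[i, j]` scores query i against candidate j; the ground
    truth for query i is assumed to be candidate i.
    """
    # Rank candidates for each query by descending similarity.
    ranks = np.argsort(-similarity, axis=1)
    # Position of the ground-truth candidate in each query's ranking.
    gt_positions = np.argmax(ranks == np.arange(len(similarity))[:, None], axis=1)
    return float(np.mean(gt_positions < k))
```

In benchmark tables, this is reported separately for image-to-text (rows are images) and text-to-image (rows are captions) retrieval.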
“…Early works demonstrated how to train zero-shot classifiers based on attributes [35] or numerical descriptors [36]. Another approach, which we adopt in this work, is to learn an alignment between image and text embedding spaces [6,15,21,22,31,69]. This approach has demonstrated that with modern architectures, contrastive learning, and large data sources it is possible to obtain performance that is competitive with the classical two-step approach that involves fine-tuning on the downstream data [30,45].…”
Section: Related Workmentioning
confidence: 99%
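The last statement describes zero-shot transfer via an alignment between image and text embedding spaces. A minimal sketch of zero-shot classification under that setup: both inputs are assumed to come from pre-trained, aligned encoders (as in CLIP/ALIGN), and the function names are hypothetical placeholders.

```python
import numpy as np

def zero_shot_classify(image_embs, class_text_embs):
    """Assign each image to the class whose text embedding is most similar.

    `image_embs` has one row per image, `class_text_embs` one row per
    class description; cosine similarity selects the class.
    """
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    class_text_embs = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    return np.argmax(image_embs @ class_text_embs.T, axis=1)
```

No fine-tuning on the downstream data is needed: new classes are handled by embedding their textual descriptions, which is the property the statement contrasts with the classical two-step fine-tuning approach.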