Proceedings of the 30th ACM International Conference on Multimedia 2022
DOI: 10.1145/3503161.3548320

CAliC: Accurate and Efficient Image-Text Retrieval via Contrastive Alignment and Visual Contexts Modeling

Cited by 4 publications (1 citation statement)
References 6 publications
“…Methods [4], [5], [7], [10], [13], [14], [15], [16], [17], [18] employ two separate encoders to independently extract features for visual and textual data. CLIP [10] effectively applies contrastive learning to learn image-language alignment from a large volume of noisy image-text pairs, achieving remarkable performance on vision-language tasks, as demonstrated in [19], [20], [21], [22], [23]. In VATT [15], the authors employ contrastive learning to align the videos, audios and texts, and achieve impressive performance on the downstream tasks.…”
Section: Vision-language Pre-training (citation type: mentioning)
confidence: 99%