2019 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv.2019.00586
CAMP: Cross-Modal Adaptive Message Passing for Text-Image Retrieval

Abstract: Text-image cross-modal retrieval is a challenging task in the field of language and vision. Most previous approaches independently embed images and sentences into a joint embedding space and compare their similarities. However, previous approaches rarely explore the interactions between images and sentences before calculating similarities in the joint space. Intuitively, when matching between images and sentences, human beings would alternatively attend to regions in images and words in sentences, and select t…

Cited by 254 publications (149 citation statements). References 36 publications (85 reference statements).
“…According to the granularity of representation, studies on image-text matching can be categorized into two groups: 1) global embedding based methods [5,6,34,44], and 2) local inference based methods [3,16,18,21,26,37]. The former first embed whole images and sentences into a joint embedding space and then calculate the visual-semantic similarity.…”
Section: Related Work 2.1 Image-Text Matching
confidence: 99%
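The global-embedding approach this statement describes can be sketched in a few lines: pool each modality to a single vector in a shared space and rank by cosine similarity. This is a minimal illustrative sketch, not any cited paper's exact model; the function names and the assumption that features are already projected into the joint space are mine.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def global_similarity(image_emb, text_emb):
    """Cosine similarity between globally pooled image and sentence embeddings.

    image_emb: (n_images, d) image features assumed already projected into the joint space.
    text_emb:  (n_texts, d) sentence features in the same space.
    Returns an (n_images, n_texts) matrix of visual-semantic similarities used to rank retrievals.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    return img @ txt.T
```

Retrieval then reduces to sorting each row (image-to-text) or column (text-to-image) of this matrix.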
“…For instance, Lee et al. [16] presented a cross-attention mechanism that aligns image regions and words to infer image-text similarity. Wang et al. [37] proposed a cross-modal adaptive message passing method to perform fine-grained interaction and filter out irrelevant information with a gating strategy. Liu et al. [21] designed a focal attention network comprising pre-assigning and re-assigning attention, which focuses on eliminating irrelevant fragments.…”
Section: Related Work 2.1 Image-Text Matching
confidence: 99%
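The gated message-passing idea attributed to Wang et al. [37] can be illustrated with a minimal NumPy sketch: each image region attends over the words of a sentence, aggregates a message, and a sigmoid gate modulates how much of that message is fused back, suppressing messages from irrelevant fragments. Shapes, the scaled dot-product attention, and the gating form are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cross_modal_message(regions, words):
    """One gated cross-modal message-passing step (illustrative sketch).

    regions: (n_regions, d) image-region features.
    words:   (n_words, d) word features.
    Each region attends over the words, aggregates a message, and a
    per-region sigmoid gate decides how much of the message to fuse in.
    """
    # Scaled dot-product attention from regions to words.
    attn = softmax(regions @ words.T / np.sqrt(regions.shape[1]))  # (n_regions, n_words)
    message = attn @ words                                         # (n_regions, d)
    # Gate in [0, 1] from region-message agreement; near 0 blocks the message.
    gate = sigmoid((regions * message).sum(axis=1, keepdims=True))  # (n_regions, 1)
    return regions + gate * message
```

A full model would learn projection and gating parameters end-to-end; the point here is only the structure: attend, aggregate, gate, fuse.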