2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01267
IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval

Cited by 266 publications (157 citation statements)
References 19 publications
“…For instance, BFAN is proposed to eliminate partially irrelevant words and regions from the shared semantics in image-text pairs, and it achieves state-of-the-art performance on several benchmark datasets. IMRAM (Chen et al., 2020) proposes a recurrent attention memory that incorporates a cross-modal attention unit and a memory distillation unit to refine the correspondence between image regions and text words. However, the attention mechanisms used by these many-to-many matching methods are usually complicated, with high computational complexity.…”
Section: Text-Image Matching
confidence: 99%
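
To make the cited mechanism concrete, below is a minimal PyTorch-style sketch of one iterative matching step: a cross-modal attention unit attends from image regions to text words, and a gated update stands in for the memory distillation unit. The tensor shapes, temperature, and gating form are illustrative assumptions, not IMRAM's exact formulation.

```python
import torch
import torch.nn.functional as F

def cross_modal_attention(query, context, temperature=9.0):
    """Attend from one modality (e.g., image regions, shape (n_q, d))
    to the other (e.g., text words, shape (n_c, d))."""
    q = F.normalize(query, dim=-1)
    c = F.normalize(context, dim=-1)
    attn = F.softmax(temperature * (q @ c.t()), dim=-1)  # (n_q, n_c)
    return attn @ context                                # (n_q, d)

def memory_distillation(query, attended, W):
    """Refine the query with attended features via a gated update
    (a hypothetical gate standing in for IMRAM's distillation unit).
    W: learnable weight of shape (2 * d, d)."""
    fused = torch.cat([query, attended], dim=-1)  # (n_q, 2d)
    gate = torch.sigmoid(fused @ W)               # (n_q, d)
    return gate * query + (1.0 - gate) * attended

def iterative_matching(regions, words, W, k_steps=3):
    """Repeat attention + distillation for k steps and sum the
    per-step similarities, mirroring the iterative refinement idea."""
    sims, q = [], regions
    for _ in range(k_steps):
        attended = cross_modal_attention(q, words)
        sims.append(F.cosine_similarity(q, attended, dim=-1).mean())
        q = memory_distillation(q, attended, W)
    return torch.stack(sims).sum()
```

Summing per-step similarities reflects the intuition that alignment between regions and words is progressively refined across iterations rather than computed once.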
“…Due to this advantage, a number of cross-modal hashing methods have been proposed (Su et al., 2019; Lin et al., 2020; Tu et al., 2020; Shi et al., 2019). For example, SDCH (Lin et al., 2020) utilizes a semantic label branch to preserve the semantic information of the learned features by integrating an inter-modal pairwise loss, a cross-entropy loss, and a quantization loss.…”
Section: Cross-Modal Hashing
confidence: 99%
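
To illustrate how such a multi-loss objective can be composed, the sketch below combines an inter-modal pairwise likelihood loss, a cross-entropy term on an assumed label-classification head (standing in for the semantic label branch), and a quantization penalty. The loss forms and weights are assumptions for illustration, not SDCH's published formulation.

```python
import torch
import torch.nn.functional as F

def hashing_objective(h_img, h_txt, labels, sim, classifier,
                      alpha=1.0, beta=1.0, gamma=0.1):
    """h_img, h_txt: real-valued codes, shape (batch, n_bits).
    labels: class indices, shape (batch,).
    sim: (batch, batch) matrix, 1 for matching image-text pairs, else 0.
    classifier: assumed head mapping n_bits -> n_classes."""
    # Inter-modal pairwise loss: negative log-likelihood pulling
    # matching image-text codes together, pushing mismatches apart.
    theta = 0.5 * (h_img @ h_txt.t())
    pairwise = (torch.log1p(torch.exp(theta)) - sim * theta).mean()

    # Cross-entropy loss on the semantic label branch.
    ce = F.cross_entropy(classifier(h_img), labels) + \
         F.cross_entropy(classifier(h_txt), labels)

    # Quantization loss: push real-valued codes toward binary {-1, +1}.
    quant = ((h_img.abs() - 1) ** 2).mean() + ((h_txt.abs() - 1) ** 2).mean()

    return alpha * pairwise + beta * ce + gamma * quant
```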
“…[3,7,10] targeted fine-grained alignment between images and texts via attention mechanisms or gated-fusion strategies. By stacking recurrent memory cells, [2,9] enhance visual representations and conduct semantic reasoning to preserve salient information. Wang et al. [8] concentrated on object similarity and relation similarity by virtue of a scene graph structure.…”
Section: Related Work
confidence: 99%
“…Image-text matching [1,2,3] is a representative task in cross-modal learning, aiming at retrieving the most relevant instances of one modality (e.g., image) given a query from another modality (e.g., text). The essence of image-text retrieval lies in evaluating the semantic similarity between the image and text modalities, which remains a challenge due to the heterogeneous gap between cross-modal data distributions.…”
Section: Introduction
confidence: 99%
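
A generic baseline for this retrieval setting simply ranks candidates of the other modality by cosine similarity in a shared embedding space. The sketch below shows such a baseline (embedding dimension and tensor names are illustrative); it is not any specific method's scoring function.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, gallery_embs, top_k=5):
    """Rank gallery items of the other modality by cosine similarity
    to the query embedding and return the top-k indices."""
    q = F.normalize(query_emb, dim=-1)      # (d,)
    g = F.normalize(gallery_embs, dim=-1)   # (n, d)
    scores = g @ q                          # (n,) cosine similarities
    return torch.topk(scores, k=top_k).indices

# Example: retrieve the 5 images most relevant to a text query.
text_emb = torch.randn(512)
image_embs = torch.randn(1000, 512)
top_images = retrieve(text_emb, image_embs, top_k=5)
```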