2021 IEEE International Conference on Image Processing (ICIP)
DOI: 10.1109/icip42928.2021.9506438
Attend, Correct And Focus: A Bidirectional Correct Attention Network For Image-Text Matching

Abstract: The image-text matching task aims to learn fine-grained correspondences between images and sentences. Existing methods use attention mechanisms to learn these correspondences by attending to all fragments without considering the relationship between fragments and global semantics, which inevitably leads to semantic misalignment among irrelevant fragments. To this end, we propose a Bidirectional Correct Attention Network (BCAN), which leverages global similarities and local similarities to reassign the attention we…
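The abstract describes attending between image fragments (regions) and text fragments (words) and aggregating local similarities into an image-sentence score. The following is a minimal illustrative sketch of SCAN-style cross attention in that spirit — not the authors' BCAN implementation; the temperature value and mean aggregation are assumptions for illustration.

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    """L2-normalize vectors along an axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_attention_similarity(regions, words, temperature=9.0):
    """Illustrative cross-attention image-sentence similarity.

    regions: (R, d) image region features
    words:   (W, d) word features
    Returns a scalar similarity in [-1, 1].
    """
    r = l2norm(regions)
    w = l2norm(words)
    sim = w @ r.T                                 # (W, R) word-region cosines
    attn = np.exp(temperature * sim)
    attn /= attn.sum(axis=1, keepdims=True)       # each word attends over regions
    attended = attn @ regions                     # (W, d) region context per word
    local = np.sum(l2norm(attended) * w, axis=1)  # per-word local similarity
    return float(local.mean())                    # aggregate to a global score
```

BCAN's contribution, per the abstract, is to *correct* such attention weights using both global and local similarities so that irrelevant fragments are down-weighted; the sketch above shows only the uncorrected baseline attention step.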

Cited by 28 publications (57 citation statements) · References 42 publications
“…SCAN has been used as a baseline for many methods and has led to technological developments since its proposal. Examples include the bidirectional focal attention network (BFAN) [15] and the position focused attention network (PFAN) [16]. In BFAN, irrelevant image regions and words cause deterioration in the correspondence between images and text; thus, they are removed.…”

Section: B. Methods For Local Image-Text Matching
confidence: 99%
“…Most text-based image retrieval approaches are based on deep neural networks [38,16,18,10,34,5]. The main objective of the retrieval system is to accurately measure the similarity between the inputs from two different modalities.…”

Section: Text-Based Image Retrieval
confidence: 99%
“…Cross-Modal Projection Learning (CMPL) [38] is proposed to pull image and text embeddings into an aligned space. To further enhance the retrieval performance in a fine-grained way, [16,18,10,34] proposed different attention-based approaches, applying visual attention between every image region and word.…”

Section: Text-Based Image Retrieval
confidence: 99%
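The CMPL snippet above describes pulling image and text embeddings into an aligned space and scoring them there. A hedged, minimal sketch of that shared-embedding retrieval score follows; the projection matrices here are hypothetical stand-ins for learned parameters, not CMPL's actual model.

```python
import numpy as np

def joint_embedding_score(img_feat, txt_feat, W_img, W_txt, eps=1e-8):
    """Project each modality into a common space (hypothetical learned
    projections W_img, W_txt) and compare with cosine similarity."""
    v = W_img @ img_feat          # image embedding in the shared space
    t = W_txt @ txt_feat          # text embedding in the shared space
    v = v / (np.linalg.norm(v) + eps)
    t = t / (np.linalg.norm(t) + eps)
    return float(v @ t)           # cosine similarity in [-1, 1]
```

The attention-based approaches cited afterwards refine this global score by additionally matching every image region against every word, rather than comparing a single pooled embedding per modality.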