2021
DOI: 10.48550/arxiv.2101.01368
Preprint

Similarity Reasoning and Filtration for Image-Text Matching

Abstract: Image-text matching plays a critical role in bridging vision and language, and great progress has been made by exploiting the global alignment between image and sentence, or local alignments between regions and words. However, how to make the most of these alignments to infer more accurate matching scores is still underexplored. In this paper, we propose a novel Similarity Graph Reasoning and Attention Filtration (SGRAF) network for image-text matching. Specifically, the vector-based similarity representat…

Cited by 5 publications
(7 citation statements)
“…Unsupervised Learning for Dense Prediction. Learning fine-grained semantic correspondence is essential for dense prediction tasks, e.g., object detection [15,38], semantic segmentation [6], etc. Recent studies have proposed various paradigms to tackle this problem [33,28,36,11,59]. However, most of these methods require a pre-trained object detector to generate proposals of objects of interest.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…Following up, [20] proposed a stacked cross attention network to model the latent alignments between image regions and words. Additional models have explored the role of attention mechanisms [24,35,44,45,47] and Graph Convolutional Neural Networks (GCN) [7,19,21,25]. External modules have also been explored to improve retrieval results, such as an iterative recurrent attention module [3] and an external consensus knowledge base [41].…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…In this section, we compare the behaviour of existing systems by evaluating them on the newly proposed metrics. We evaluate the following state-of-the-art models: VSE++ [9], SCAN [20], VSRN [21], CVSE [41], SGR and SAF [7]. The experiment depicting the top-5 text-to-image retrieval scores for non-ground truth relevant items is shown in … It is worth noting in Figure 3 that, according to the recall (R@5), the models show a steady rise in recall scores as the number of relevant images m increases.…”
Section: Insights On State-of-the-art Retrieval (citation type: mentioning)
confidence: 99%
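The R@5 behaviour described in the statement above can be illustrated with a minimal sketch. It assumes the common definition of Recall@K, in which a query counts as a hit when at least one relevant item appears in its top K results; the citing paper may define its metrics differently. The function name `recall_at_k` and the toy rankings are hypothetical, not taken from any of the cited works.

```python
def recall_at_k(ranked_lists, relevant_sets, k=5):
    """Fraction of queries with at least one relevant item in the top k.

    Assumed definition for illustration only.
    """
    hits = sum(
        1
        for ranked, relevant in zip(ranked_lists, relevant_sets)
        if any(item in relevant for item in ranked[:k])
    )
    return hits / len(ranked_lists)


# Hypothetical ranked image ids returned for three text queries.
ranked = [
    [3, 1, 4, 0, 2, 5],
    [5, 4, 3, 2, 1, 0],
    [0, 1, 2, 3, 4, 5],
]

# One relevant image per query (m = 1).
relevant_single = [{1}, {0}, {5}]
score_single = recall_at_k(ranked, relevant_single, k=5)

# Two relevant images per query (m = 2): each query has more chances to hit.
relevant_double = [{1, 3}, {0, 4}, {5, 2}]
score_double = recall_at_k(ranked, relevant_double, k=5)
```

Under this definition, adding relevant items can only keep a query's hit status or turn a miss into a hit, which is consistent with recall rising as m grows.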
“…Most existing instance retrieval solutions use Deep Metric Learning methodology [1,3,6,7,13,16], in which a deep learning model is trained to transform images to a vector representation, so that samples from the same class are close to each other. At the retrieval stage, the query embedding is scored against all gallery embeddings and the most similar ones are returned.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
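The retrieval stage described in the last statement (score the query embedding against all gallery embeddings, return the most similar) can be sketched as follows. This is an illustration only: it assumes cosine similarity as the scoring function, which the quoted text does not specify, and the function name `top_k_gallery` and the toy vectors are hypothetical.

```python
import numpy as np


def top_k_gallery(query, gallery, k=5):
    """Return indices and scores of the k gallery rows most similar to query.

    Cosine similarity is an assumption made for this sketch.
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q                      # one cosine score per gallery item
    order = np.argsort(-scores)[:k]     # highest score first
    return order, scores[order]


# Toy 2-D embeddings standing in for the output of a trained model.
gallery = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
    [-1.0, 0.0],
])
query = np.array([1.0, 0.1])

indices, scores = top_k_gallery(query, gallery, k=2)
```

For large galleries, exhaustive scoring like this is usually replaced by an approximate nearest-neighbour index, but the ranking principle is the same.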