2021
DOI: 10.1609/aaai.v35i2.16209
Similarity Reasoning and Filtration for Image-Text Matching

Abstract: Image-text matching plays a critical role in bridging vision and language, and great progress has been made by exploiting the global alignment between image and sentence, or local alignments between regions and words. However, how to make the most of these alignments to infer more accurate matching scores is still underexplored. In this paper, we propose a novel Similarity Graph Reasoning and Attention Filtration (SGRAF) network for image-text matching. Specifically, the vector-based similarity representations…
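The abstract's two-step recipe, learning vector-based similarity representations and then filtering them with attention, can be pictured with a minimal sketch. The module below is an illustration under assumed dimensions and layers (the sigmoid gate, the 256-d similarity vectors, and the scalar scoring head are all placeholders), not the authors' SGRAF implementation:

```python
import torch
import torch.nn as nn

class AttentionFiltrationSketch(nn.Module):
    """Minimal sketch of attention-based filtration over alignment
    similarity vectors; dimensions and layers are illustrative."""
    def __init__(self, sim_dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(sim_dim, 1), nn.Sigmoid())
        self.score = nn.Linear(sim_dim, 1)

    def forward(self, sim_vecs):
        # sim_vecs: (n_alignments, sim_dim) similarity representations.
        w = self.gate(sim_vecs)                 # gate each alignment into [0, 1]
        w = w / (w.sum(dim=0) + 1e-8)           # normalize weights over alignments
        pooled = (w * sim_vecs).sum(dim=0)      # aggregate the filtered alignments
        return torch.sigmoid(self.score(pooled))  # scalar matching score
```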

Cited by 200 publications (89 citation statements). References 40 publications.
“…Text-and-Image Matching: The cosine-similarity-based attention alignment proposed by SCAN (Lee et al., 2018) is the most widely used in text-and-image matching (Chen and Luo, 2020; Diao et al., 2021; Dong et al., 2021). They applied text-to-image (t2i) and image-to-text (i2t) attention in two separate variants to filter the cross-modal relevant representations for later image-sentence matching.…”
Section: Related Work
Mentioning confidence: 99%
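The quoted passage refers to SCAN's cosine-similarity attention alignment. A minimal sketch of the text-to-image (t2i) variant it describes is below, assuming region and word features of a shared dimension; the temperature value and the mean pooling over words are illustrative choices rather than SCAN's exact configuration:

```python
import torch
import torch.nn.functional as F

def t2i_attention_score(regions, words, temperature=9.0):
    # regions: (n_regions, d) image region features
    # words:   (n_words, d) word features
    # temperature plays the role of SCAN's inverse-temperature lambda;
    # the value 9.0 is an illustrative placeholder.
    sim = F.normalize(words, dim=-1) @ F.normalize(regions, dim=-1).t()
    sim = sim.clamp(min=0)                      # keep only positive alignments
    attn = F.softmax(temperature * sim, dim=1)  # attend over regions per word
    context = attn @ regions                    # (n_words, d) attended visual context
    word_scores = F.cosine_similarity(words, context, dim=-1)
    return word_scores.mean()                   # pooled image-sentence similarity
```

The i2t variant mentioned in the quote is the mirror image: each region attends over words, and scores are pooled over regions instead.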
“…In order to justify the superiority of our unified loss over state-of-the-art image-text retrieval models, we conduct experiments on VSE++, BFAN [27] and SGRAF [11] by replacing only the loss functions.…”
Section: Image-Text Retrieval Without Pre-training
Mentioning confidence: 99%
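For context on what gets swapped out in such an experiment, the max-hinge (hardest-negative) triplet loss popularized by VSE++ is the usual baseline objective in these models. A hedged sketch, assuming a (B, B) similarity matrix with matched pairs on the diagonal and an illustrative margin:

```python
import torch

def max_hinge_triplet_loss(scores, margin=0.2):
    # scores: (B, B) image-sentence similarity matrix; scores[i, j] is
    # the similarity of image i and sentence j, with matched pairs on
    # the diagonal. margin=0.2 is an illustrative value.
    B = scores.size(0)
    pos = scores.diag().view(B, 1)
    cost_s = (margin + scores - pos).clamp(min=0)       # negatives per image
    cost_im = (margin + scores - pos.t()).clamp(min=0)  # negatives per sentence
    mask = torch.eye(B, dtype=torch.bool, device=scores.device)
    cost_s = cost_s.masked_fill(mask, 0.0)
    cost_im = cost_im.masked_fill(mask, 0.0)
    # VSE++ keeps only the hardest negative in each direction.
    return cost_s.max(dim=1).values.sum() + cost_im.max(dim=0).values.sum()
```

A study like the one quoted keeps each model's architecture fixed and replaces only a function like this with the proposed unified loss.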
“…Despite the difference in image types, these comparisons can be achieved by analyzing the human-detectable details in the clothes, such as clothing category, color, pattern, prints on the clothes, and so on. Most current retrieval solutions [5], [6], [7], [8], [9], [10] incorporate deep learning models that convert images into vector representations, so that the query image's embedding can be compared against the embeddings of all images in the list and the closest one can be returned. For that, triplet loss is the most widely used comparative loss technique.…”
Section: Introduction
Mentioning confidence: 99%
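The embedding-and-compare pipeline this quote describes reduces to a cosine-similarity ranking over precomputed vectors. A minimal sketch, with hypothetical names and assuming the embeddings already exist:

```python
import torch
import torch.nn.functional as F

def retrieve_nearest(query_emb, gallery_embs, k=5):
    # query_emb: (d,) embedding of the query image;
    # gallery_embs: (N, d) precomputed embeddings of the image list.
    # Function and variable names are hypothetical.
    q = F.normalize(query_emb, dim=0)     # unit-length query
    g = F.normalize(gallery_embs, dim=1)  # unit-length gallery rows
    sims = g @ q                          # (N,) cosine similarity to the query
    return sims.topk(k).indices           # indices of the k closest images
```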