2018
DOI: 10.1007/978-3-030-01225-0_13
Stacked Cross Attention for Image-Text Matching

Abstract: In this paper, we study the problem of image-text matching. Inferring the latent semantic alignment between objects or other salient stuff (e.g. snow, sky, lawn) and the corresponding words in sentences makes it possible to capture the fine-grained interplay between vision and language, and makes image-text matching more interpretable. Prior work either simply aggregates the similarity of all possible pairs of regions and words without attending differentially to more and less important words or regions, or uses a multi-step…
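The matching scheme sketched in the abstract — attend over image regions for each word, then pool the per-word relevance scores into one image-sentence similarity — can be illustrated with a minimal numpy sketch. Shapes, the `smooth` temperature, and the LogSumExp pooling are assumptions for illustration, not the authors' exact implementation:

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    # Normalize feature vectors so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def stacked_cross_attention(regions, words, smooth=9.0):
    """Image-sentence similarity via word-to-region cross attention (sketch).

    regions: (k, d) image region features
    words:   (n, d) word features
    Returns a scalar similarity score.
    """
    v = l2norm(regions)
    w = l2norm(words)
    sim = w @ v.T                              # (n, k) word-region cosine sims
    sim = np.clip(sim, 0.0, None)              # keep only positive evidence
    # For each word, attend over image regions (softmax with temperature).
    attn = np.exp(smooth * sim)
    attn /= attn.sum(axis=1, keepdims=True)    # (n, k)
    attended = attn @ v                        # (n, d) attended image vector per word
    # Relevance of each word to its attended image vector.
    r = np.sum(l2norm(attended) * w, axis=1)   # (n,)
    # Pool per-word relevances into one score (LogSumExp, one pooling choice).
    return float(np.log(np.exp(r).sum()))
```

With orthogonal region features and a single word equal to one region, nearly all attention mass lands on the matching region, so the score approaches the maximum per-word relevance of 1.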

Cited by 861 publications (927 citation statements)
References 37 publications
“…We find that previous methods based on global representations [43] have low generalization performance with top-10 recall as low as 12 on COCO. Fine-grained representations based on attention [26] generalize better compared to [43]. Following Tables 5 and 6, the jWAE-MH framework significantly improves the generalization across datasets further, owing to the semantic continuity from the Gaussian regularization.…”
Section: Image-to-text
confidence: 80%
“…Wehrmann et al [45] improve sentence representations with a character level inception module and [20,26] improve image representations for image-text matching models. Huang et al [20] use multi-label classification to extract various concepts in images, requiring additional image annotations.…”
Section: Related Work
confidence: 99%
“…Modeling interactions among splice sites is essential for circular RNA prediction because backsplices occur when the donors prefer the upstream acceptors over the downstream ones. Inspired by recent successes in natural language processing [22] and computer vision [32], we propose the cross-attention layer to learn deep interaction between acceptors and donors.…”
Section: Cross-Attention for Modeling Deep Interaction
confidence: 99%
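The cross-attention layer described in that citation — letting one sequence (e.g. acceptors) query another (e.g. donors) to model their interactions — amounts to standard scaled dot-product attention between two sequences. A minimal numpy sketch, with projection matrices and shapes assumed for illustration rather than taken from the cited model:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, wq, wk, wv):
    """Scaled dot-product cross attention between two sequences (sketch).

    queries:     (m, d) features of one side, e.g. acceptor sites
    keys_values: (n, d) features of the other side, e.g. donor sites
    wq, wk:      (d, dk) query/key projections
    wv:          (d, dv) value projection
    Returns (m, dv): each query summarizes the other sequence.
    """
    q = queries @ wq
    k = keys_values @ wk
    v = keys_values @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (m, n) pairwise interaction scores
    return softmax(scores, axis=-1) @ v        # attention-weighted mix of values
```

Each row of the output is a convex combination of the value vectors, so every query position sees a learned summary of the whole opposing sequence.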