2019
DOI: 10.1609/aaai.v33i01.33018489

Few-Shot Image and Sentence Matching via Gated Visual-Semantic Embedding

Abstract: Although image and sentence matching has been widely studied, its intrinsic few-shot problem is commonly ignored, which has become a bottleneck for further performance improvement. In this work, we focus on this challenging problem of few-shot image and sentence matching, and propose a Gated Visual-Semantic Embedding (GVSE) model to deal with it. The model consists of three cooperative modules: uncommon VSE, common VSE, and gated metric fusion. The uncommon VSE exploits external auxiliary resources …
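The abstract describes two embedding branches whose per-pair similarity scores are combined by a gate. As a rough illustration only, the following is a minimal PyTorch sketch of such a gated metric fusion; all module names, dimensions, and the gating formulation are assumptions based on the truncated abstract, not the authors' implementation.

```python
# Minimal sketch (PyTorch) of gated metric fusion over two VSE branches.
# All names, dimensions, and the gating formulation are assumptions drawn
# from the truncated abstract, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMetricFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # The gate decides, per pair, how much to trust each embedding space.
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, img_common, txt_common, img_uncommon, txt_uncommon):
        # Similarity in the common embedding space.
        s_common = F.cosine_similarity(img_common, txt_common, dim=-1)
        # Similarity in the uncommon (auxiliary-resource) space.
        s_uncommon = F.cosine_similarity(img_uncommon, txt_uncommon, dim=-1)
        # Gate conditioned on both image representations (an assumption).
        g = torch.sigmoid(
            self.gate(torch.cat([img_common, img_uncommon], dim=-1))
        ).squeeze(-1)
        # Convex combination of the two metrics.
        return g * s_common + (1.0 - g) * s_uncommon

# Usage: scores for a batch of 8 image-sentence pairs with 512-d embeddings.
fusion = GatedMetricFusion(dim=512)
img_c, txt_c, img_u, txt_u = (torch.randn(8, 512) for _ in range(4))
scores = fusion(img_c, txt_c, img_u, txt_u)  # shape: (8,)
```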

Cited by 23 publications (12 citation statements); references 16 publications.

Citation statements, ordered by relevance:
“…Recent studies on image-caption retrieval have employed cross-modal attentions to pay attention to concepts shared by a query and a target [11], [12], [16]-[19]. Cross-modal attentions are performed at an early phase over multiple cropped regions of an image and words of a caption.…”
Section: Contribution of Proposed Methods
Citation type: mentioning; confidence: 99%
“…Cutting-edge methods for image-caption retrieval sometimes employ an object detector and a cross-modal attention [11], [12], [16]-[19]. A pretrained object detector crops multiple subregions in an image, and then a cross-modal attention is performed over the cropped regions and words in a caption.…”
Section: Cross-modal Attention
Citation type: mentioning; confidence: 99%
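The statement above outlines the common pipeline: detector-cropped region features, attention from caption words to regions, then a pooled similarity. Here is a minimal PyTorch sketch of that idea; the shapes, scoring function, and pooling step are illustrative assumptions, not the exact formulation of any of the cited papers.

```python
# Minimal sketch (PyTorch) of cross-modal attention: each caption word
# attends over region features from a pretrained object detector.
# Shapes, scoring, and pooling are illustrative assumptions only.
import torch
import torch.nn.functional as F

def cross_modal_attention(regions, words):
    # regions: (n_regions, d) detector features for cropped subregions
    # words:   (n_words, d)   word embeddings of the caption
    attn = F.softmax(words @ regions.t(), dim=-1)  # (n_words, n_regions)
    attended = attn @ regions                      # region summary per word
    # Word-level similarities, averaged into one image-caption score.
    sims = F.cosine_similarity(words, attended, dim=-1)
    return sims.mean()

regions = torch.randn(36, 256)  # e.g., 36 detected regions
words = torch.randn(12, 256)    # e.g., a 12-word caption
score = cross_modal_attention(regions, words)
```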
“…Existing work has focused on the matching between text words and image entities, and retrieves them by capturing the co-occurrence relationship between certain words in a text and certain entities in an image [7]. However, when the text is long with multiple sentences or even paragraphs (e.g.…”
Section: Introduction
Citation type: mentioning; confidence: 99%
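The co-occurrence matching the statement refers to can be made concrete with a toy example: score a caption against detected entity labels by lexical overlap. This is purely illustrative and not the method of the cited work.

```python
# Toy sketch of co-occurrence-based word-entity matching: score a caption
# against detected entity labels by lexical overlap. Purely illustrative;
# not the method of any cited paper.
def cooccurrence_score(caption: str, entity_labels: list[str]) -> float:
    words = set(caption.lower().split())
    entities = set(label.lower() for label in entity_labels)
    if not entities:
        return 0.0
    return len(words & entities) / len(entities)

# Usage: detector labels vs. a candidate caption.
score = cooccurrence_score("a dog chases a ball in the park",
                           ["dog", "ball", "tree"])
print(score)  # 0.666... (two of three detected entities are mentioned)
```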