2016
DOI: 10.1007/978-3-319-46448-0_49

Grounding of Textual Phrases in Images by Reconstruction

Abstract: Grounding (i.e. localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text reference resolution. Few datasets provide the ground truth spatial localization of phrases, thus it is desirable to learn from data with no or little grounding supervision. We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized …

Citations: cited by 422 publications (558 citation statements)
References: 47 publications (95 reference statements)

“…[72] combine CNNs with LSTMs for visual grounding. The model first encodes a phrase which describes part of an image using an LSTM, then learns to attend to the appropriate location in the image to accurately reconstruct the phrase.…”
Section: Contemporaneous and Subsequent Work (mentioning)
confidence: 99%
“…
Method                    R@1    R@5    R@10
MCB [11]                  48.7   -      -
GroundeR [35]             47.8   -      -
Embedding Network [43]    51.0   70.4   75.5
Similarity Network [43]   51.0   70.3   75.0
SPC [33]                  55.4   -      -
IGOP [47]                 53.9   -      -
CITE [32]                 59.
Table 7.…”
Section: Methods (mentioning)
confidence: 99%
“…We next analyze the benefit of our jWAE framework for phrase localization on the Flickr30k Entities dataset [34]. Phrase localization associates (grounds) a phrase to a region in the image using bounding boxes [5,35,43,47]. Following [43], we formulate phrase localization as a retrieval problem where given an image and a phrase from its associated sentence, the phrase is mapped to the regions in the image.…”
Section: Phrase Localization (mentioning)
confidence: 99%
“…Motivated from co-reference resolution tasks in NLP, a number of studies have investigated matching free-form phrases with images where the task is to locate each visual entity mentioned in a caption by predicting a bounding box in the corresponding image (Hodosh et al., 2010; Kong et al., 2014; Plummer et al., 2015; Rohrbach et al., 2015).…”
Section: Text-to-image Co-referencing (mentioning)
confidence: 99%