2021
DOI: 10.1109/tpami.2021.3058684
Discriminative Triad Matching and Reconstruction for Weakly Referring Expression Grounding

Abstract: In this paper, we are tackling the weakly-supervised referring expression grounding task, for the localization of a referent object in an image according to a query sentence, where the mapping between image regions and queries is not available during the training stage. In traditional methods, an object region that best matches the referring expression is picked out, and then the query sentence is reconstructed from the selected region, where the reconstruction difference serves as the loss for back-propagati…
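The abstract describes a match-then-reconstruct loop: score each region against the query, select the best match, rebuild the query from it, and use the reconstruction error as the weak-supervision loss. A minimal toy sketch of that loop, with hypothetical stand-ins (`dot`, `ground_and_reconstruct`, an identity "reconstructor") rather than the paper's actual model:

```python
# Toy sketch of the match-then-reconstruct training signal described in
# the abstract. All names and the feature/reconstructor choices here are
# illustrative assumptions, not the paper's implementation.

def dot(u, v):
    # Similarity between a region feature and the query feature.
    return sum(a * b for a, b in zip(u, v))

def ground_and_reconstruct(region_feats, query_feat, reconstruct):
    """Pick the region best matching the query, then measure how well
    the query can be rebuilt from it (the weak-supervision loss)."""
    # Matching step: score every candidate region against the query.
    scores = [dot(r, query_feat) for r in region_feats]
    best = max(range(len(scores)), key=lambda i: scores[i])
    # Reconstruction step: rebuild the query from the chosen region;
    # the squared difference serves as the loss for back-propagation.
    rebuilt = reconstruct(region_feats[best])
    loss = sum((a - b) ** 2 for a, b in zip(rebuilt, query_feat))
    return best, loss

# Usage with three 2-D region features and an identity "reconstructor".
regions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [0.0, 1.0]
idx, loss = ground_and_reconstruct(regions, query, lambda r: r)
```

In the real setting the reconstructor is a learned module and the loss is back-propagated to train both matching and reconstruction jointly; the sketch only shows the control flow.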

Cited by 41 publications (13 citation statements)
References 28 publications
“…We report comparison results with existing unsupervised [54,62,63] and weakly-supervised [38,49,55] methods. Note that the weakly-supervised methods are trained with expensive annotated queries.…”
Section: Comparison With State-of-the-art Methods (mentioning, confidence: 99%)
“…Visual grounding is a crucial component in vision and language, and it serves as the fundamental of other tasks, such as VQA. Recent visual grounding methods can be summarized into three categories: fully-supervised [8,13,22,23,33,35], weakly-supervised [6,10,19,36,38,49,55,58], and unsupervised [54,63]. Fully-supervised methods rely heavily on the manual labeled patch-query pairs.…”
Section: Natural Language Visual Grounding (mentioning, confidence: 99%)
“…To extract the linguistic feature f_q^l, q is first parsed into multiple discriminative triads {t_k}_{k=1}^M [44]. Each triad represents a piece of discriminative information to distinguish the target from the distracting or reference objects.…”
Section: Linguistic Component (mentioning, confidence: 99%)
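The last quoted passage describes parsing a query q into discriminative triads {t_k}_{k=1}^M, each separating a target from a reference object via some relation. A hedged toy illustration of that decomposition (real systems use a language parser; `RELATIONS` and `parse_triads` are hypothetical names, and the relation list is an assumption):

```python
# Illustrative toy parser: split a query into (target, relation, reference)
# triads by scanning for a small fixed set of relation phrases. This is a
# sketch of the idea in the quoted passage, not the cited method [44].

RELATIONS = ("left of", "right of", "on top of", "under", "near")

def parse_triads(query):
    """Return (target, relation, reference) triads found in the query."""
    triads = []
    for rel in RELATIONS:
        marker = f" {rel} "
        if marker in query:
            target, reference = query.split(marker, 1)
            triads.append((target.strip(), rel, reference.strip()))
    return triads

triads = parse_triads("the red cup left of the laptop")
```

Each resulting triad carries one piece of discriminative information: the target phrase, the spatial or semantic relation, and the reference object used to disambiguate the target.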