2021
DOI: 10.1109/tpami.2021.3058684
Discriminative Triad Matching and Reconstruction for Weakly Referring Expression Grounding

Abstract: In this paper, we are tackling the weakly-supervised referring expression grounding task, for the localization of a referent object in an image according to a query sentence, where the mapping between image regions and queries is not available during the training stage. In traditional methods, an object region that best matches the referring expression is picked out, and then the query sentence is reconstructed from the selected region, where the reconstruction difference serves as the loss for back-propagati…
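The abstract describes a match-then-reconstruct loop: score each region against the query, select the best match, rebuild the query from it, and use the reconstruction error as the weak-supervision loss. A minimal toy sketch of that loop, with hypothetical stand-ins (`dot`, `ground_and_reconstruct`, an identity "reconstructor") rather than the paper's actual model:

```python
# Toy sketch of the match-then-reconstruct training signal described in
# the abstract. All names and the feature/reconstructor choices here are
# illustrative assumptions, not the paper's implementation.

def dot(u, v):
    # Similarity between a region feature and the query feature.
    return sum(a * b for a, b in zip(u, v))

def ground_and_reconstruct(region_feats, query_feat, reconstruct):
    """Pick the region best matching the query, then measure how well
    the query can be rebuilt from it (the weak-supervision loss)."""
    # Matching step: score every candidate region against the query.
    scores = [dot(r, query_feat) for r in region_feats]
    best = max(range(len(scores)), key=lambda i: scores[i])
    # Reconstruction step: rebuild the query from the chosen region;
    # the squared difference serves as the loss for back-propagation.
    rebuilt = reconstruct(region_feats[best])
    loss = sum((a - b) ** 2 for a, b in zip(rebuilt, query_feat))
    return best, loss

# Usage with three 2-D region features and an identity "reconstructor".
regions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [0.0, 1.0]
idx, loss = ground_and_reconstruct(regions, query, lambda r: r)
```

In the real setting the reconstructor is a learned module and the loss is back-propagated to train both matching and reconstruction jointly; the sketch only shows the control flow.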

Cited by 41 publications (13 citation statements)
References 28 publications
“…We report comparison results with existing unsupervised [54,62,63] and weakly-supervised [38,49,55] methods. Note that the weakly-supervised methods are trained with expensive annotated queries.…”
Section: Comparison With State-of-the-art Methods (mentioning, confidence: 99%)
“…Visual grounding is a crucial component in vision and language, and it serves as the fundamental of other tasks, such as VQA. Recent visual grounding methods can be summarized into three categories: fully-supervised [8,13,22,23,33,35], weakly-supervised [6,10,19,36,38,49,55,58], and unsupervised [54,63]. Fully-supervised methods rely heavily on the manual labeled patch-query pairs.…”
Section: Natural Language Visual Grounding (mentioning, confidence: 99%)
“…To extract the linguistic feature f_q^l, q is first parsed into multiple discriminative triads {t_k}_{k=1}^M [44]. Each triad represents a piece of discriminative information to distinguish the target from the distracting or reference objects.…”
Section: Linguistic Component (mentioning, confidence: 99%)
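The last quoted passage describes parsing a query q into discriminative triads {t_k}_{k=1}^M, each separating a target from a reference object via some relation. A hedged toy illustration of that decomposition (real systems use a language parser; `RELATIONS` and `parse_triads` are hypothetical names, and the relation list is an assumption):

```python
# Illustrative toy parser: split a query into (target, relation, reference)
# triads by scanning for a small fixed set of relation phrases. This is a
# sketch of the idea in the quoted passage, not the cited method [44].

RELATIONS = ("left of", "right of", "on top of", "under", "near")

def parse_triads(query):
    """Return (target, relation, reference) triads found in the query."""
    triads = []
    for rel in RELATIONS:
        marker = f" {rel} "
        if marker in query:
            target, reference = query.split(marker, 1)
            triads.append((target.strip(), rel, reference.strip()))
    return triads

triads = parse_triads("the red cup left of the laptop")
```

Each resulting triad carries one piece of discriminative information: the target phrase, the spatial or semantic relation, and the reference object used to disambiguate the target.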