A Fast and Accurate One-Stage Approach to Visual Grounding

Yang, Zhengyuan; Gong, Boqing; Wang, Liwei; Huang, Wenbing; Yu, Dong; Luo, Jiebo

doi:10.1109/iccv.2019.00478

Cited by 287 publications

(309 citation statements)

References 47 publications

Supporting

Mentioning

286

Contrasting


“…The task of referring expression comprehension has attracted increasing attention in recent years, which expects to locate corresponding objects within an image based on input expressions. Previous referring expression comprehension methods [2, 6, 11, 16, 22, 25, 27, 28, 35, 41-43, 45, 47-51] can be mainly divided into two types, including proposal-region-based [2, 6, 11, 22, 25, 27, 28, 41-43, 47-51] and grid-region-based methods [16, 35, 45].…”

Section: Related Work (mentioning)

confidence: 99%

“…Grid-region-based methods [16,35,45] usually fuse the language features with grid region features, and then leverage one-stage object detectors (e.g. YOLOv3 [33]) to directly localize the object corresponding to the input expression.…”

Section: Related Work (mentioning)

confidence: 99%
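
The grid-region fusion described in the statement above can be illustrated with a minimal sketch. This is not the cited paper's implementation: the feature sizes, the random stand-in features and weights, the six-value coordinate encoding, and all function names (`spatial_coords`, `fuse`, `score_cells`, `best_cell`) are assumptions made for illustration. A real one-stage model would take the grid features from a detector backbone such as YOLOv3, use a learned text embedding, and regress box offsets in addition to a per-cell score.

```python
import math
import random


def spatial_coords(h, w):
    """Normalized coordinates per grid cell:
    [x_min, y_min, x_max, y_max, x_center, y_center] (hypothetical encoding)."""
    return [
        [[j / w, i / h, (j + 1) / w, (i + 1) / h, (j + 0.5) / w, (i + 0.5) / h]
         for j in range(w)]
        for i in range(h)
    ]


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def fuse(visual, text):
    """Concatenate, for every cell, its visual feature, the broadcast
    text embedding, and the cell's spatial coordinates."""
    h, w = len(visual), len(visual[0])
    coords = spatial_coords(h, w)
    text_n = l2_normalize(text)
    return [
        [l2_normalize(visual[i][j]) + text_n + coords[i][j] for j in range(w)]
        for i in range(h)
    ]


def score_cells(fused, weights, bias=0.0):
    """Shared linear layer over cells (equivalent to a 1x1 convolution),
    producing one confidence score per grid cell."""
    return [
        [sum(f * wt for f, wt in zip(cell, weights)) + bias for cell in row]
        for row in fused
    ]


def best_cell(scores):
    """Return (row, col) of the highest-scoring cell."""
    _, i, j = max((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row))
    return i, j


# Stand-in data: a 4x4 grid of 8-d visual features and a 5-d text embedding.
random.seed(0)
H, W, C, D = 4, 4, 8, 5
visual = [[[random.gauss(0, 1) for _ in range(C)] for _ in range(W)] for _ in range(H)]
text = [random.gauss(0, 1) for _ in range(D)]

fused = fuse(visual, text)
weights = [random.gauss(0, 1) for _ in range(C + D + 6)]  # untrained, random
print(best_cell(score_cells(fused, weights)))
```

Because the text embedding is broadcast to every cell before scoring, the localization decision is made in a single pass over the grid, with no separate region-proposal stage.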

“…To model the relationships between language and vision, existing methods [2, 16, 22, 25, 27, 28, 35, 41-43, 45, 47-50] usually combine the language features and the regular image region features, such as object proposal regions [2, 22, 25, 27, 28, 41-43, 47-50] and grid regions [16, 35, 45], as shown in Figure 1 (a) and (b), respectively. However, these methods ignore some fine-grained object information related to the natural language, such as object shapes and poses, which are often described in language expressions and important in referring expression comprehension to localize and distinguish the target objects.…”

Section: Introduction (mentioning)

confidence: 99%


Language-Aware Fine-Grained Object Representation for Referring Expression Comprehension

Qiu et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia


Referring expression comprehension expects to accurately locate an object described by a language expression, which requires precise language-aware visual object representations. However, existing methods usually use rectangular object representations, such as object proposal regions and grid regions. They ignore some fine-grained object information like shapes and poses, which are often described in language expressions and important to localize objects. Additionally, rectangular object regions usually contain background contents and irrelevant foreground features, which also decrease the localization performance. To address these problems, we propose a language-aware deformable convolution model (LDC) to learn language-aware fine-grained object representations. Rather than extracting rectangular object representations, LDC adaptively samples a set of key points based on the image and language to represent objects. This type of object representations can capture more fine-grained object information (e.g., shapes and poses) and suppress noises in accordance with language and thus, boosts the object localization performance. Based on the language-aware fine-grained object representation, we next design a bidirectional interaction model (BIM) that leverages a modified co-attention mechanism to build cross-modal bidirectional interactions to further improve the language and object representations. Furthermore, we propose a hierarchical fine-grained representation network (HFRN) to learn language-aware fine-grained object representations and cross-modal bidirectional interactions at local word level and global sentence level, respectively. Our proposed method outperforms the state-of-the-art methods on the RefCOCO, RefCOCO+ and RefCOCOg datasets.


“…Bajaj et al [3] achieved significant improvement by using Gated Graph Neural Networks to formulate the dependency among phrases and image regions. Yang et al [51] proposed a one-stage approach, fusing text query embeddings into the YOLOv3 object detector while augmenting by using spatial features. Lai et al [29] proposed to use transformers to capture contextual representations for text tokens and image regions.…”

Section: Related Work (mentioning)

confidence: 99%

Cross-Modal Omni Interaction Modeling for Phrase Grounding

Hui et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia


Phrase grounding aims to localize the objects described by phrases in a natural language specification. Previous works model the interaction of inputs from text modality and visual modality only in the intra-modal global level and consequently lack the ability to capture the precise and complete context information. In this paper, we propose a novel Cross-Modal Omni Interaction network (COI Net) composed of a neighboring interaction module, a global interaction module, a cross-modal interaction module and a multilevel alignment module. Our approach formulates the complex spatial and semantic relationship among image regions and phrases through multi-level multi-modal interaction. We capture the local relationship using the interaction among neighboring regions and then collect the global context through the interaction among all regions using a transformer encoder. We further use a co-attention module to apply the interaction between two modalities to gather the cross-modal context for all image regions and phrases. In addition to the omni interaction modeling, we also leverage a straightforward yet effective multilevel alignment regularization to formulate the dependencies among all grounding decisions. We extensively validate the effectiveness of our model. Experiments show that our approach outperforms existing state-of-the-art methods by large margins on two popular datasets in terms of accuracy: 6.15% on Flickr30K Entities (71.36% increased to 77.51%) and 21.25% on ReferItGame (44.91% …


“…description sentence, for example "break the eggs", visual grounding aims at localizing the query objects described in the sentence on the given image or video. Recently, great progress has been made on image grounding [4,15,30,31]. On the basis of this, researchers started to explore grounding in the video domain [5,7,14,25,35].…”

mentioning

confidence: 99%

Activity-driven Weakly-Supervised Spatio-Temporal Grounding from Untrimmed Videos

Chen, Bao, Kong 2020

Proceedings of the 28th ACM International Conference on Multimedia


In this paper, we study the problem of weakly-supervised spatio-temporal grounding from raw untrimmed video streams. Given a video and its descriptive sentence, spatio-temporal grounding aims at predicting the temporal occurrence and spatial locations of each query object across frames. Our goal is to learn a grounding model in a weakly-supervised fashion, without the supervision of both spatial bounding boxes and temporal occurrences during training. Existing methods have been addressed in trimmed videos, but their reliance on object tracking will easily fail due to frequent camera shot cut in untrimmed videos. To this end, we propose a novel spatio-temporal multiple instance learning framework for untrimmed video grounding. Spatial MIL and temporal MIL are mutually guided to ground each query to specific spatial regions and the occurring frames of a video. Furthermore, an activity described in the sentence is captured to use the informative contextual cues for region proposals refinement and text representation. We conduct extensive evaluation on YouCookII and RoboWatch datasets, and demonstrate our method outperforms state-of-the-art methods.
