Localizing objects described by referring expressions in visual signals, also known as visual grounding, has long been a major motivation for robotics and embodied vision. So far, growing efforts have been devoted to visual grounding in images [17,36,13,40,24,29,33,5,41,11,42,10,9,12,19,47,18,35,38,39,20] and videos [46,45,43,37,30,31,44]. Suppose that a robot is asked to fetch 'the spoon on the table in the kitchen' following your command [14,23]; this would require a …

[Figure 1: We present a novel task of 3D visual grounding in single-view RGBD images given a referring expression, and propose a bottom-up neural approach to address it.]