2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr42600.2020.01068
Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences

Cited by 71 publications (49 citation statements) | References 39 publications
“…one-stage methods are also proposed [18,35,38,39,20] which produce both object proposals and matching scores. In addition, similar approaches can be applied to object visual grounding in streaming video frames [43,37,30,31,44,46,45] to ground objects or referring expressions in videos.…”
Section: Related Work
Mentioning confidence: 99%
“…Localizing objects described by referring expressions in vision signals, also known as visual grounding, has long been a major motive for robotics and embodied vision. So far, we have seen growing efforts devoted to visual grounding in images [17,36,13,40,24,29,33,5,41,11,42,10,9,12,19,47,18,35,38,39,20] and videos [46,45,43,37,30,31,44]. Suppose that a robot is going to take 'the spoon on the table in the kitchen' following your command [14,23]; this would require a … [Figure 1 caption: We present a novel task of 3D visual grounding in single-view RGBD images given a referring expression, and propose a bottom-up neural approach to address it.]…”
Section: Introduction
Mentioning confidence: 99%
“…Video clips in this dataset are all trimmed, hence it is not suitable for temporal localization. Among all datasets, the most relevant is the VidSTG dataset [33]. It is extended from the VidOR dataset [35], which was originally collected for detecting relations in videos.…”
Section: Comparison With the Existing Datasetsmentioning
confidence: 99%
“…Krishna et al [15] explored referring relationships in images by iterative message passing between subjects and objects. While these works focus on image grounding, more recent efforts [1,3,10,32,39,46,47] also attempted to ground objects in videos. Zhou et al [47] explored weakly-supervised grounding of descriptive nouns in separate frames in a frame-weighted retrieval fashion.…”
Section: Related Work
Mentioning confidence: 99%
“…It was originally tackled in language-based visual fragment-retrieval [9,12,13], and has recently attracted widespread attention as a task onto itself. While lots of the existing efforts are made on referring expression grounding in static images [8,19,22,23,28,41,42,44], recent research attempts to study visual grounding in videos by finding the objects either in individual frames [10,32,47] or in video clips spatio-temporally [1,3,46]. Nonetheless, all these works focus on grounding in videos the objects depicted by natural language sentences.…”
Section: Introduction
Mentioning confidence: 99%