2021
DOI: 10.48550/arxiv.2103.10191
Preprint

Decoupled Spatial Temporal Graphs for Generic Visual Grounding

Abstract: Visual grounding is a long-lasting problem in vision-language understanding due to its diversity and complexity. Current practices concentrate mostly on performing visual grounding in still images or well-trimmed video clips. This work, on the other hand, investigates a more general setting, generic visual grounding, aiming to mine all the objects satisfying the given expression, which is more challenging yet practical in real-world scenarios. Importantly, grounding results are expected to accurately local…

Cited by 1 publication (2 citation statements)
References 60 publications
“…The objective of video REC is to localize the spatial-temporal tube according to the natural language query. Most of the previous works [19,22,43,45,62] can be divided into two categories, i.e., two-stage methods and one-stage methods. However, both kinds of methods require time-consuming post-processing steps, which hinders their practical applications.…”
Section: Related Work (mentioning)
confidence: 99%
“…Current video REC methods can be classified into two major categories: two-stage, proposal-driven methods and one-stage, proposal-free methods. For the two-stage methods [19,20,26,62], they extract potential spatio-temporal tubes and then align these candidates to the sentence to find the best matching one. The other stream of one-stage methods [7,13,43,45,59] fuses visual-text features and directly predicts bounding boxes densely at all spatial locations.…”
Section: Introduction (mentioning)
confidence: 99%