2020
DOI: 10.48550/arxiv.2011.05049
Preprint

Human-centric Spatio-Temporal Video Grounding With Visual Transformers

Abstract: In this work, we introduce a novel task: Human-centric Spatio-Temporal Video Grounding (HC-STVG). Unlike the existing referring expression tasks in images or videos, by focusing on humans, HC-STVG aims to localize a spatio-temporal tube of the target person from an untrimmed video based on a given textual description. This task is useful, especially for healthcare and security related applications, where the surveillance videos can be extremely long but only a specific person during a specific period of time is…

Cited by 3 publications (2 citation statements)
References 44 publications
“…The Human-centric Spatio-Temporal Video Grounding (HC-STVG) [1] dataset provides 16k annotation-video pairs with different movie scenes. The duration of each video is 20 seconds.…”
Section: Dataset (mentioning)
confidence: 99%
“…Tube trimming is to localize the temporal boundary of the target person since tube proposals may contain redundant transition frames. We follow [1] to apply tube trimming. During the experiment, we find that the performance with tube trimming is even worse.…”
Section: Tube Trimming (mentioning)
confidence: 99%