2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01108

Context-aware Biaffine Localizing Network for Temporal Sentence Grounding

Abstract: This paper addresses the problem of temporal sentence grounding (TSG), which aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. Previous works either compare pre-defined candidate segments with the query and select the best one by ranking, or directly regress the boundary timestamps of the target segment. In this paper, we propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine m…
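The abstract describes scoring every (start, end) index pair with a biaffine mechanism. The paper's actual architecture is not reproduced here; the following is a minimal NumPy sketch of a generic biaffine scorer over boundary features, where the shapes, weight names (`W`, `b`), and the upper-triangular validity mask are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def biaffine_scores(starts, ends, W, b):
    # starts: (T, d) start-boundary features; ends: (T, d) end-boundary features.
    # W: (d, d) bilinear weight; b: (2d,) linear weight over [start; end].
    # Returns a (T, T) map where entry (i, j) scores the segment from i to j.
    d = starts.shape[1]
    bilinear = starts @ W @ ends.T                       # (T, T) bilinear term
    linear = starts @ b[:d][:, None] + (ends @ b[d:][:, None]).T
    return bilinear + linear

rng = np.random.default_rng(0)
T, d = 8, 16
S, E = rng.normal(size=(T, d)), rng.normal(size=(T, d))
scores = biaffine_scores(S, E, rng.normal(size=(d, d)), rng.normal(size=2 * d))

# Only pairs with j >= i are valid segments; take the argmax over that region.
mask = np.triu(np.ones((T, T), dtype=bool))
i, j = np.unravel_index(np.where(mask, scores, -np.inf).argmax(), scores.shape)
```

Scoring all pairs jointly is what distinguishes this formulation from ranking pre-defined proposals or regressing two boundary timestamps independently.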

Cited by 94 publications (41 citation statements) · References 46 publications
“…DORi [119] incorporates appearance features and captures the relations between objects and actions guided by the query. CBLN [120] addresses TSGV from a new perspective: it reformulates TSGV by scoring all pairs of start and end indices simultaneously and predicts the moment with a biaffine structure.…”
Section: Span-based Methods
Mentioning confidence: 99%
“…2) Query encoder: Following previous works [16], [20], [78], we first employ the GloVe model [79] to embed each word of the given sentence query into a dense vector. Then, we use multi-head self-attention [80] and Bi-GRU [81] modules to encode its sequential information.…”
Section: Video and Query Encoders, 1) Video Encoder
Mentioning confidence: 99%
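The excerpt above describes a query encoder built from GloVe embeddings, self-attention, and a Bi-GRU. As a hedged illustration of the attention step only, here is a single-head scaled dot-product self-attention over word features in NumPy; the weight matrices `Wq`, `Wk`, `Wv` and all shapes are assumptions for the sketch, not the cited paper's configuration (which uses multi-head attention).

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention over word features X: (T, d).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    logits = Q @ K.T / np.sqrt(K.shape[1])
    # Numerically stable row-wise softmax over attention logits.
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V  # each word becomes a weighted mix of all words

rng = np.random.default_rng(1)
T, d = 6, 8
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
```

A Bi-GRU would then run over `out` in both directions to capture sequential order, which pure attention does not encode by itself.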
“…At last, we apply grounding heads on the feature H to predict the target segment semantically corresponding to the query. Many grounding heads have been proposed in recent years: proposal-ranking grounding heads [18], [20], [71], [78] and boundary-regression grounding heads [25]–[27]. In this paper, we follow the former [18], [20], [74] to determine the target video segment with pre-defined segment proposals.…”
Section: F. Grounding Head
Mentioning confidence: 99%
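The excerpt contrasts proposal-ranking heads with boundary-regression heads. To make the former concrete, here is a minimal sketch of ranking pre-defined segment proposals against an encoded query by cosine similarity; the function name, the similarity choice, and all shapes are illustrative assumptions, not the cited paper's actual head.

```python
import numpy as np

def rank_proposals(proposal_feats, query_feat):
    # proposal_feats: (N, d) features of N pre-defined segment proposals.
    # query_feat: (d,) encoded sentence query.
    # Cosine-similarity ranking; the top-scoring proposal is the prediction.
    p = proposal_feats / np.linalg.norm(proposal_feats, axis=1, keepdims=True)
    q = query_feat / np.linalg.norm(query_feat)
    scores = p @ q
    return scores.argmax(), scores

rng = np.random.default_rng(2)
feats = rng.normal(size=(10, 16))   # e.g. 10 candidate segments
q = rng.normal(size=16)
best, scores = rank_proposals(feats, q)
```

A boundary-regression head would instead output two continuous timestamps directly, trading the coverage guarantees of a proposal set for the ability to predict arbitrary boundaries.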
“…State-of-the-art video grounding methods [11,12,14,25,32,34,35] have relied on existing benchmarks to design novel modules (e.g. proposal generation, context modeling, and multi-modality fusion).…”
Section: Related Work
Mentioning confidence: 99%