Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1518
DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization

Abstract: In this paper, we focus on natural language video localization: localizing (i.e., grounding) a natural language description in a long, untrimmed video sequence. All currently published models for this problem fall into two types: (i) the top-down approach, which performs classification and regression over a set of pre-cut video segment candidates; (ii) the bottom-up approach, which directly predicts, for each video frame, the probability of being a temporal boundary (i.e., a start or end time point). However…
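The bottom-up formulation in the abstract can be sketched with a minimal inference step: given per-frame start and end probabilities (as a dense bottom-up model would emit), select the segment with the highest joint boundary score. The function name and the product scoring rule here are illustrative assumptions, not the paper's exact formulation.

```python
def localize(start_probs, end_probs):
    """Toy bottom-up localization: return the (start, end) frame pair
    maximizing start_probs[s] * end_probs[e], subject to s <= e.
    Assumes the two probability lists have the same length."""
    best, best_score = (0, 0), -1.0
    for s, p_start in enumerate(start_probs):
        for e in range(s, len(end_probs)):
            score = p_start * end_probs[e]  # joint boundary score (assumed form)
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

For example, with start probabilities peaking at frame 1 and end probabilities peaking at frame 2, `localize([0.1, 0.7, 0.2], [0.2, 0.1, 0.9])` returns `(1, 2)`. Unlike the top-down approach, no pre-cut candidate segments are needed.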

Cited by 118 publications (104 citation statements)
References 34 publications
“…anchor-based methods: TGN [16], CMIN [17], CBP [38] and SCDM [18]; anchor-free methods: ACRN [12], ROLE [23], SLTA [28], DEBUG [27], VSLNet [19], GDP [26], LGI [24], ABLR [20], TMLGA [25], ExCL [21] and DRN [22]; reinforcement-learning-based methods: RWM-RL [29], SM-RL [30], TripNet [31] and TSP-RPL [32]…”
Section: Comparison With State-of-the-art Methods
confidence: 99%
“…In terms of their context modeling, most approaches [12], [16], [17], [20]–[22], [25] gradually aggregate the context information through a recurrent structure. Some approaches [18], [23], [29] model surrounding clips as the local context using 1D convolution layers, while other approaches model the entire clip as the global context through self-attention modules [19], [24], [26], [27]. Since clips are the shortest moments, the clip-level context is a subset of the moment-level context.…”
Section: Related Work
confidence: 99%
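The two context-modeling styles contrasted in the statement above can be illustrated on toy one-dimensional clip features (one scalar per clip for simplicity). Both helpers are hypothetical sketches, not code from any cited model: the first captures what a 1D convolution sees (a local window of neighboring clips), the second what a self-attention module sees (every clip, here with uniform attention weights).

```python
def local_context(feats, k=1):
    """Local context: average over a window of +/- k neighboring clips,
    clipped at sequence boundaries (what a 1D conv layer aggregates)."""
    out = []
    for i in range(len(feats)):
        window = feats[max(0, i - k): i + k + 1]
        out.append(sum(window) / len(window))
    return out

def global_context(feats):
    """Global context: each clip attends to all clips; uniform weights
    stand in for learned self-attention scores."""
    mean = sum(feats) / len(feats)
    return [mean] * len(feats)
```

For instance, `local_context([1, 2, 3, 4], k=1)` yields `[1.5, 2.0, 3.0, 3.5]` (each clip mixed with its neighbors), while `global_context([1, 2, 3, 4])` yields `[2.5, 2.5, 2.5, 2.5]` (every clip sees the whole sequence).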
“…Though some of them use an additional regression layer to predict the offsets, their candidate-level features are not suitable for boundary-level regression, resulting in inferior performance. On the other hand, by comparing our method with frame-based bottom-up approaches (DEBUG [27], TGN [4], CBP [36], GDP [6]), we can observe that our method works better. Since these approaches only use frame-level representations for moment localization, the boundary features are unaware of the moment content they constitute and lack consistency, which results in poor performance.…”
Section: Performance Comparison
confidence: 84%
“…Following [4], the authors of [5] utilized a cross-gated attended recurrent network, with a cross-modal interactor and a self-interactor, to capture the interactions between the sentence and the video. The authors of [27] make full use of positive samples to alleviate the severe imbalance problem. The authors of [6] use a Graph-FPN layer to encode scene relationships and semantics.…”
Section: Moment Localization By Language
confidence: 99%