Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1518
DEBUG: A Dense Bottom-Up Grounding Approach for Natural Language Video Localization

Abstract: In this paper, we focus on natural language video localization: localizing (i.e., grounding) a natural language description in a long, untrimmed video sequence. All currently published models for this problem fall into two types: (i) the top-down approach, which performs classification and regression over a set of pre-cut video segment candidates; (ii) the bottom-up approach, which directly predicts, for each video frame, the probability of being a temporal boundary (i.e., a start or end time point). However…
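The bottom-up formulation in the abstract can be sketched with a minimal inference step: given per-frame start and end probabilities (as a dense bottom-up model would emit), select the segment with the highest joint boundary score. The function name and the product scoring rule here are illustrative assumptions, not the paper's exact formulation.

```python
def localize(start_probs, end_probs):
    """Toy bottom-up localization: return the (start, end) frame pair
    maximizing start_probs[s] * end_probs[e], subject to s <= e.
    Assumes the two probability lists have the same length."""
    best, best_score = (0, 0), -1.0
    for s, p_start in enumerate(start_probs):
        for e in range(s, len(end_probs)):
            score = p_start * end_probs[e]  # joint boundary score (assumed form)
            if score > best_score:
                best, best_score = (s, e), score
    return best
```

For example, with start probabilities peaking at frame 1 and end probabilities peaking at frame 2, `localize([0.1, 0.7, 0.2], [0.2, 0.1, 0.9])` returns `(1, 2)`. Unlike the top-down approach, no pre-cut candidate segments are needed.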

Cited by 118 publications (104 citation statements)
References 34 publications
“…anchor-based methods: TGN [16], CMIN [17], CBP [38] and SCDM [18]; anchor-free methods: ACRN [12], ROLE [23], SLTA [28], DEBUG [27], VSLNet [19], GDP [26], LGI [24], ABLR [20], TMLGA [25], ExCL [21] and DRN [22]; reinforcement-learning-based methods: RWM-RL [29], SM-RL [30], TripNet [31] and TSP-RPL [32]…”
Section: Comparison With State-of-the-art Methods
confidence: 99%
“…In terms of their context modeling, most approaches [12], [16], [17], [20]–[22], [25] gradually aggregate the context information through a recurrent structure. Some approaches [18], [23], [29] model surrounding clips as the local context using 1D convolution layers, while other approaches model the entire clip as the global context through self-attention modules [19], [24], [26], [27]. Since clips are the shortest moments, the clip-level context is a subset of the moment-level context.…”
Section: Related Work
confidence: 99%
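The two context-modeling styles contrasted in the statement above can be illustrated on toy one-dimensional clip features (one scalar per clip for simplicity). Both helpers are hypothetical sketches, not code from any cited model: the first captures what a 1D convolution sees (a local window of neighboring clips), the second what a self-attention module sees (every clip, here with uniform attention weights).

```python
def local_context(feats, k=1):
    """Local context: average over a window of +/- k neighboring clips,
    clipped at sequence boundaries (what a 1D conv layer aggregates)."""
    out = []
    for i in range(len(feats)):
        window = feats[max(0, i - k): i + k + 1]
        out.append(sum(window) / len(window))
    return out

def global_context(feats):
    """Global context: each clip attends to all clips; uniform weights
    stand in for learned self-attention scores."""
    mean = sum(feats) / len(feats)
    return [mean] * len(feats)
```

For instance, `local_context([1, 2, 3, 4], k=1)` yields `[1.5, 2.0, 3.0, 3.5]` (each clip mixed with its neighbors), while `global_context([1, 2, 3, 4])` yields `[2.5, 2.5, 2.5, 2.5]` (every clip sees the whole sequence).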
“…Though some of them use an additional regression layer to predict the offsets, their candidate-level features are not suitable for boundary-level regression, resulting in inferior performance. On the other hand, by comparing our method with frame-based bottom-up approaches (DEBUG [27], TGN [4], CBP [36], GDP [6]), we can observe that our method works better. Since these approaches only use frame-level representations for moment localization, the boundary features are unaware of the moment content they constitute and lack consistency, which results in poor performance.…”
Section: Performance Comparison
confidence: 84%
“…Following [4], the authors of [5] utilized a cross-gated attended recurrent network, with a cross-modal interactor and a self-interactor, to capture the interactions between the sentence and the video. The authors of [27] make full use of positive samples to alleviate the severe imbalance problem. The authors of [6] use a Graph-FPN layer to encode scene relationships and semantics.…”
Section: Moment Localization By Language
confidence: 99%