Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.585

Span-based Localizing Network for Natural Language Video Localization

Abstract: Given an untrimmed video and a text query, natural language video localization (NLVL) is to locate a matching span from the video that semantically corresponds to the query. Existing solutions formulate NLVL either as a ranking task, applying a multimodal matching architecture, or as a regression task that directly regresses the target video span. In this work, we address the NLVL task with a span-based QA approach by treating the input video as a text passage. We propose a video span localizing network (VSLNet) on top of …
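To make the span-based formulation concrete, below is a minimal sketch of how a video can be treated like a QA passage, with two heads scoring each clip as the start or end of the answer span. This is an illustration only, not the authors' released VSLNet implementation; the dimensions, the LSTM query encoder, and the concatenation fusion are all assumptions.

# Minimal sketch of span-based NLVL: video clips play the role of passage
# tokens, and two heads predict start/end distributions over clips.
# Dimensions and the fusion step are illustrative, not the paper's design.
import torch
import torch.nn as nn

class SpanLocalizer(nn.Module):
    def __init__(self, video_dim=1024, query_dim=300, hidden=128):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, hidden)
        self.query_encoder = nn.LSTM(query_dim, hidden, batch_first=True)
        self.start_head = nn.Linear(2 * hidden, 1)
        self.end_head = nn.Linear(2 * hidden, 1)

    def forward(self, video_feats, query_embs):
        # video_feats: (B, T, video_dim) clip-level features
        # query_embs:  (B, L, query_dim) word embeddings of the query
        v = self.video_proj(video_feats)                   # (B, T, hidden)
        _, (h_n, _) = self.query_encoder(query_embs)       # h_n: (1, B, hidden)
        q = h_n[-1].unsqueeze(1).expand(-1, v.size(1), -1) # broadcast query over clips
        fused = torch.cat([v, q], dim=-1)                  # simple concat fusion
        start_logits = self.start_head(fused).squeeze(-1)  # (B, T)
        end_logits = self.end_head(fused).squeeze(-1)      # (B, T)
        return start_logits, end_logits

model = SpanLocalizer()
video = torch.randn(2, 64, 1024)   # 2 videos, 64 clips each
query = torch.randn(2, 12, 300)    # 2 queries, 12 words each
s, e = model(video, query)
start_idx, end_idx = s.argmax(-1), e.argmax(-1)  # predicted span per video

In practice the end index is constrained to be no earlier than the start index, and real systems use richer cross-modal interaction than the plain concatenation shown here.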

Cited by 152 publications (161 citation statements); references 40 publications.
“…It is worth noting that on the TACoS dataset (see Table 4), our MS-2D-TAN surpasses the previous best approach, CBP [38], by approximately 18 points and 25 points in terms of Rank1@0.3 and Rank5@0.3, respectively. Moreover, on the large-scale ActivityNet Captions dataset, MS-2D-TAN also outperforms the top-ranked methods DRN [22] and VSLNet [19] with respect to IoU@0.5 and 0.7. This validates that MS-2D-TAN is able to localize moment boundaries more precisely.…”
Section: Comparison With State-of-the-art Methods
confidence: 89%
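For context on the metrics quoted above: "Rank n @ m" (e.g., Rank1@0.3) is the fraction of test queries for which at least one of the top-n predicted segments has a temporal IoU of at least m with the ground-truth moment. A small worked example with made-up segment boundaries:

# Temporal IoU between a predicted segment and the ground-truth segment,
# plus the "Rank n @ IoU >= m" check used in the quoted results.
# The segment boundaries below are made-up illustrative numbers.
def temporal_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def rank_n_at_iou(preds, gt, n, threshold):
    # True if any of the top-n ranked predictions overlaps the ground truth enough.
    return any(temporal_iou(p, gt) >= threshold for p in preds[:n])

gt = (12.0, 30.0)                         # ground-truth moment, in seconds
preds = [(10.0, 28.0), (40.0, 55.0)]      # predictions, best-ranked first
print(temporal_iou(preds[0], gt))         # 0.8
print(rank_n_at_iou(preds, gt, n=1, threshold=0.3))  # True: counted as correct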
“…Anchor-based methods: TGN [16], CMIN [17], CBP [38], and SCDM [18]; anchor-free methods: ACRN [12], ROLE [23], SLTA [28], DEBUG [27], VSLNet [19], GDP [26], LGI [24], ABLR [20], TMLGA [25], ExCL [21], and DRN [22]; reinforcement-learning-based methods: RWM-RL [29], SM-RL [30], TripNet [31], and TSP-RPL [32].…”
Section: Comparison With State-of-the-art Methods
confidence: 99%
“…It has various applications such as robotic navigation, video entertainment, and autonomous driving, to name a few [1,2,3,4,5]. Although much progress has been achieved in recent years [6,7,8,9,10,11,12,13], VMR remains difficult due to the harsh nature of videos and texts, including complex temporal relations, fine-grained semantic structures, and the huge cross-modal gap between visual and textual features [11,14,15,16].…”
Section: Introduction
confidence: 99%
“…The current dominant approach for video moment retrieval is to learn the semantic correlation between the query and the video. To this end, numerous cross-modality alignment strategies have been designed, such as cross-attention [1,2], recurrent neural networks [17,18], semantic conditioned dynamic modulation [11], and the 2D temporal adjacent network [14]. Although they achieve favorable performance, most current methods do not take full advantage of the fine-grained and comprehensive relation information in both semantic and visual structures: (1) many existing VMR approaches encode the semantic information of the query only in a global manner [9,19,10,20,14,12,13], i.e., embedding the text into a global vector representation using an LSTM or other sequential models, while ignoring the intrinsic and fine-grained structure of the sentence.…”
Section: Introduction
confidence: 99%
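As an illustration of the "global vector" query encoding criticized in the statement above (a sketch with assumed vocabulary size and dimensions, not any specific paper's code): the whole sentence collapses into the final LSTM hidden state, so per-word structure is lost.

# Sketch of global query encoding: the query is reduced to a single vector
# (the final LSTM hidden state), discarding per-word / per-phrase structure.
# Vocabulary size and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

embed = nn.Embedding(num_embeddings=10000, embedding_dim=300)
lstm = nn.LSTM(input_size=300, hidden_size=256, batch_first=True)

token_ids = torch.randint(0, 10000, (1, 9))   # one 9-word query
per_token, (h_n, _) = lstm(embed(token_ids))
global_query = h_n[-1]                        # (1, 256): one vector for the query
# Finer-grained methods would instead keep per_token, shape (1, 9, 256),
# so that each word can interact with individual video clips.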