Span-based Localizing Network for Natural Language Video Localization

Zhang, Hao; Sun, Aixin; Wei, Jing; Zhou, Joey Tianyi

doi:10.18653/v1/2020.acl-main.585

Cited by 152 publications

(161 citation statements)

References 40 publications


“…It is worth noting that on TACoS dataset (see Table 4), our MS-2D-TAN surpasses the previous best approach CBP [38], by approximate 18 points and 25 points in term of Rank1@0.3 and Rank5@0.3, respectively. Moreover, on the large-scale ActivityNet Captions dataset, MS-2D-TAN also outperforms the top ranked method DRN [22] and VSLNet [19] with respect to IoU@0.5 and 0.7. It validates that MS-2D-TAN is able to localize the moment boundary more precisely.…”

Section: Comparison With State-of-the-art Methods (mentioning)

confidence: 89%

“…- anchor based methods: TGN [16], CMIN [17] and CBP [38], SCDM [18], - anchor free methods: ACRN [12], ROLE [23], SLTA [28], DEBUG [27], VSLNet [19], GDP [26], LGI [24], ABLR [20], TMLGA [25], ExCL [21] and DRN [22], - reinforcement learning based methods: RWM-RL [29], SM-RL [30], TripNet [31] and TSP-RPL [32],…”

Section: Comparison With State-of-the-art Methods (mentioning)

confidence: 99%

“…In general, there are three common ways to map a clip to a moment score: anchor-based methods, anchor-free methods, and RL-based methods. Anchor-based methods define a set of anchors with a fixed length for each clip [16]-[18], while anchor-free methods directly predict the start and the end time through classification [19], [20] or regression [12], [21]-[28]. RL-based methods model the task as a sequential decision-making problem and solve it by reinforcement learning [29]-[32].…”

Section: Related Work (mentioning)

confidence: 99%
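
The Related Work excerpt above describes anchor-free methods that predict the start and the end time through classification. A minimal sketch of that idea follows, assuming per-clip start and end probabilities are already available. The function name `best_span` and the product-of-probabilities scoring rule are illustrative assumptions, not the implementation of any cited method.

```python
import numpy as np

def best_span(start_probs, end_probs):
    """Pick the (start, end) clip indices with the highest joint score.

    Assumption: the joint score of a span is start_probs[s] * end_probs[e],
    and only spans with s <= e are valid.
    """
    start_probs = np.asarray(start_probs, dtype=float)
    end_probs = np.asarray(end_probs, dtype=float)
    # Joint score for every (start, end) pair via an outer product.
    joint = np.outer(start_probs, end_probs)
    # Zero out invalid spans (start > end): keep the upper triangle only.
    joint = np.triu(joint)
    s, e = np.unravel_index(np.argmax(joint), joint.shape)
    return int(s), int(e)
```

Without the `s <= e` constraint, the two independent argmaxes could produce an end that precedes the start; masking the lower triangle rules that out.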


Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language

Zhang, Peng, et al. 2020

AAAI



We address the problem of retrieving a specific moment from an untrimmed video by a query sentence. This is a challenging problem because a target moment may take place in relations to other temporal moments in the untrimmed video. Existing methods cannot tackle this challenge well since they consider temporal moments individually and neglect the temporal dependencies. In this paper, we model the temporal relations between video moments by a two-dimensional map, where one dimension indicates the starting time of a moment and the other indicates the end time. This 2D temporal map can cover diverse video moments with different lengths, while representing their adjacent relations. Based on the 2D map, we propose a Temporal Adjacent Network (2D-TAN), a single-shot framework for moment localization. It is capable of encoding the adjacent temporal relation, while learning discriminative features for matching video moments with referring expressions. We evaluate the proposed 2D-TAN on three challenging benchmarks, i.e., Charades-STA, ActivityNet Captions, and TACoS, where our 2D-TAN outperforms the state-of-the-art.
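
The two-dimensional map described in the abstract above, with one axis for a moment's start and the other for its end, can be sketched as follows. This is only an illustration of the indexing scheme under stated assumptions: the function name `top_moment`, the uniform `clip_len`, and the dense score matrix are hypothetical, and the sketch does not reproduce how 2D-TAN itself computes or samples the scores.

```python
import numpy as np

def top_moment(scores, clip_len):
    """Return (start_time, end_time) of the best-scoring valid moment.

    scores[i, j] is the assumed matching score of the moment that starts
    at clip i and ends at clip j; clip_len is the clip duration in seconds.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    # A moment is valid only if it does not end before it starts (i <= j).
    valid = np.triu(np.ones((n, n), dtype=bool))
    masked = np.where(valid, scores, -np.inf)
    i, j = np.unravel_index(np.argmax(masked), masked.shape)
    # Clip i begins at i * clip_len; clip j finishes at (j + 1) * clip_len.
    return float(i * clip_len), float((j + 1) * clip_len)
```

Cells next to each other on the map correspond to moments whose boundaries differ by one clip, which is the adjacency relation the abstract refers to.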


“…It has various applications such as robotic navigation, video entertainment, and autonomous driving, to name a few [1,2,3,4,5]. Despite much progress has been achieved in recent years [6,7,8,9,10,11,12,13], VMR remains difficult due to the harsh nature of videos and texts, including complex temporal relations, fine-grained semantic structures, and huge cross-modal gap between visual and textual features [11,14,15,16].…”

Section: Introduction (mentioning)

confidence: 99%

“…The current dominant approaches for video moment retrieval is to learn the semantic correlation between the query and the video. To this end, numerous cross-modality alignment strategies are designed such as cross-attention [1,2], recurrent neural networks [17,18], semantic conditioned dynamic modulation [11], and 2D temporal adjacent network [14]. Although achieving favorable performance, most current methods do not take full advantage of the fine-grained and comprehensive relation information in both semantic and visual structures: (1) Many existing VMR approaches only encode the semantic information of the query in a global manner [9,19,10,20,14,12,13], i.e., embedding the texts into a global vector representation by using LSTM or other sequential models, but ignore the intrinsic and fine-grained structure of the sentence.…”

Section: Introduction (mentioning)

confidence: 99%

Diving Into The Relations: Leveraging Semantic and Visual Structures For Video Moment Retrieval

Gao, Huang, et al. 2021

2021 IEEE International Conference on Multimedia and Expo (ICME)


Existing dominant approaches for video moment retrieval task are to learn semantic correlation between a given query and the video. However, these methods rarely explore the fine-grained semantic structure and comprehensive visual structure, leading to insufficient utilization of textual and visual relations. In this paper, we propose a unified framework for video moment retrieval, which considers to simultaneously encode semantic and visual structures. Specifically, a semantic role tree is built to reveal the fine-grained semantic information by generating hierarchical textual embeddings. Then the semantic structure is adopted to facilitate the visual structure learning with a contextual attention-based proposal interaction module. Finally, we adaptively aggregate and obtain the visual-semantic matching information through a multi-level fusion strategy to select the best matching moment proposal. Extensive experiments on two popular benchmarks (Charades-STA and ActivityNet Captions) show that our proposed method achieves state-of-the-art performance. Codes are available in the Supplementary Material.


Synthesizing Counterfactual Samples for Overcoming Moment Biases in Temporal Video Grounding

Zhai, Jing, et al. 2022

Pattern Recognition and Computer Vision
