2018
DOI: 10.1007/978-3-030-00767-6_32
VAL: Visual-Attention Action Localizer

Cited by 26 publications (21 citation statements)
References 14 publications
“…Another way is to use the surrounding clips as the local context for a moment. Gao et al., Liu et al., Song et al., and Ge et al. concatenate the moment feature with the clip features before and after the current clip as its representation [5], [11]-[13]. Since these methods only consider one or two specific moments, the rich context information from other possible moments is ignored.…”
Section: Related Work
confidence: 99%
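The surrounding-clip strategy quoted above amounts to concatenating a clip's feature with the features of its neighbours. The sketch below is a minimal illustration of that idea, not the exact formulation of [5], [11]-[13]; the precomputed clip features, the single-neighbour window, and the boundary handling are all illustrative assumptions.

```python
import torch

def local_context_feature(clip_feats: torch.Tensor, idx: int) -> torch.Tensor:
    """Represent clip `idx` as [previous clip | current clip | next clip].

    clip_feats: (num_clips, feat_dim) precomputed clip features.
    Boundary clips reuse themselves as the missing neighbour
    (a simplifying assumption, not taken from the cited papers).
    """
    num_clips = clip_feats.size(0)
    prev_feat = clip_feats[idx - 1] if idx > 0 else clip_feats[idx]
    next_feat = clip_feats[idx + 1] if idx < num_clips - 1 else clip_feats[idx]
    return torch.cat([prev_feat, clip_feats[idx], next_feat], dim=0)

# Example: 10 clips with 512-d features; context-augmented feature of clip 4.
feats = torch.randn(10, 512)
moment_repr = local_context_feature(feats, 4)  # shape: (1536,)
```

As the statement notes, such a representation only sees one or two neighbouring moments, so context from other candidate moments in the video is not captured.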
“…The key idea of cross-modal attention is to attend to relevant video clips/moments or query words from the other modality. Some methods attend to relevant video features through words [12], [28], while most others attend to both the relevant video features and the words via a co-attention module [13], [16], [17], [19], [20], [23], [25], [27], [28], [35], [37], [38]. For sentence syntactic modeling, Zhang et al. [17] enhance the sentence modeling with the query's syntactic graph.…”
Section: Related Work
confidence: 99%
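As a rough illustration of the co-attention idea in the statement above, the sketch below computes a clip-word affinity matrix and attends in both directions. It is a generic scaled dot-product co-attention under assumed dimensions, not the specific module of any of the cited papers.

```python
import torch
import torch.nn.functional as F

def co_attention(clip_feats: torch.Tensor, word_feats: torch.Tensor):
    """Generic co-attention between video clips and query words.

    clip_feats: (num_clips, d)   word_feats: (num_words, d)
    Returns each modality re-expressed as an attention-weighted
    sum over the other modality.
    """
    d = clip_feats.size(-1)
    # Affinity between every clip and every word: (num_clips, num_words).
    affinity = clip_feats @ word_feats.t() / d ** 0.5
    attended_words = F.softmax(affinity, dim=1) @ word_feats      # per clip
    attended_clips = F.softmax(affinity.t(), dim=1) @ clip_feats  # per word
    return attended_clips, attended_words

# Example: 20 clips and a 7-word query, both projected to a shared 256-d space.
clips, words = torch.randn(20, 256), torch.randn(7, 256)
attended_clips, attended_words = co_attention(clips, words)
```

Attending in only one direction (words over video features) corresponds to the first group of methods in the statement; using both directions corresponds to the co-attention group.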
“…One way is to use the whole video as the global context. Specifically, several methods follow this design (Gao et al. 2017; Liu et al. 2018b; Song and Han 2018; Ge et al. 2019). Since these methods model the context with a one-dimensional sliding window, moments longer than the window would be ignored.…”
Section: Related Work
confidence: 99%
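The fixed-window limitation mentioned above follows directly from how one-dimensional sliding windows enumerate candidate moments: no candidate can be longer than the largest window. The window lengths and stride below are illustrative assumptions.

```python
def sliding_window_candidates(num_clips, window_lengths=(2, 4, 8), stride=1):
    """Enumerate candidate moments as (start, end) clip spans, end exclusive."""
    candidates = []
    for length in window_lengths:
        for start in range(0, num_clips - length + 1, stride):
            candidates.append((start, start + length))
    return candidates

cands = sliding_window_candidates(num_clips=32)
print(max(end - start for start, end in cands))  # -> 8
# A ground-truth moment spanning, say, 12 clips is longer than the largest
# window, so no single candidate can cover it.
```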
“…Video moment localization with natural language has a wide range of applications, such as video question answering (Lei et al. 2018), video content retrieval (Shao et al. 2018), and video storytelling (Gella, Lewis, and Rohrbach 2018). Most current language-queried moment localization models follow a two-step pipeline (Gao et al. 2017; Hendricks et al. 2017; Ge et al. 2019; Liu et al. 2018b; Song and Han 2018). Moment candidates are first selected from the input video with sliding windows.…”
Section: Introduction
confidence: 99%
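The two-step pipeline described above (candidate selection, then matching against the query) can be sketched as follows. Mean-pooling a candidate's clip features and scoring with cosine similarity are simplifying assumptions; the cited models use learned cross-modal matching networks rather than a fixed similarity.

```python
import torch
import torch.nn.functional as F

def localize(clip_feats, query_feat, candidates):
    """Rank pre-selected candidate moments against a sentence embedding.

    clip_feats: (num_clips, d) video clip features.
    query_feat: (d,) embedding of the natural-language query.
    candidates: list of (start, end) clip spans from step 1 (e.g. sliding windows).
    Returns the highest-scoring (start, end) span.
    """
    scores = []
    for start, end in candidates:
        moment_feat = clip_feats[start:end].mean(dim=0)  # pool the span
        scores.append(F.cosine_similarity(moment_feat, query_feat, dim=0))
    best = int(torch.stack(scores).argmax())
    return candidates[best]

# Example: step 1 produced four sliding-window candidates for a 32-clip video.
clips, query = torch.randn(32, 256), torch.randn(256)
best_span = localize(clips, query, [(0, 8), (4, 12), (10, 18), (20, 28)])
```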