2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01241

Multi-shot Temporal Event Localization: a Benchmark

Cited by 68 publications (36 citation statements)
References 64 publications
“…This model achieves an average mAP of 58.7% (Table 3, row 5), a major boost of 15.9%. We note that this model already outperforms the best reported results (56.7% mAP at tIoU=0.5 from [45]). This result shows that our Transformer model is very powerful for TAL, and serves as the main source of performance gain.…”
Section: Baseline
Mentioning confidence: 53%
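The mAP figures quoted above are scored against a temporal IoU (tIoU) threshold between predicted and ground-truth segments. As a minimal sketch, not taken from any of the cited papers and assuming segment boundaries given in seconds, the overlap measure can be computed as follows:

# Sketch of temporal IoU between two segments, each given as (start, end) in seconds.
def temporal_iou(pred, gt):
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0

# A prediction counts as a true positive at tIoU=0.5 only if it overlaps an
# unmatched ground-truth segment by at least that much.
print(temporal_iou((10.0, 20.0), (12.0, 22.0)))  # 8 / 12 ≈ 0.667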
“…recognizing places or actions in those scenes. In [35], multi-shot clips of movies and TV episodes were categorized into 25 event classes for their temporal localization. The results in [35] show that state-of-the-art event localization models [52,53] do not perform as well on long-form movies and TV episodes as they do on short-form video datasets like THUMOS14 [28].…”
Section: Related Work
Mentioning confidence: 99%
“…In [35], multi-shot clips of movies and TV episodes were categorized into 25 event classes for their temporal localization. The results in [35] show that state-of-the-art event localization models [52,53] do not perform as well on long-form movies and TV episodes as they do on short-form video datasets like THUMOS14 [28]. A long-form video understanding (LVU) dataset was recently proposed in [50] with nine different tasks related to the semantic understanding of video clips cut out from full-length movies.…”
Section: Related Work
Mentioning confidence: 99%
“…Many recent approaches employ this proposal-based formulation [15,16,17]. Specifically, this is the case for the state-of-the-art approaches we consider in this paper: G-TAD [1], PGCN [2] and the MUSES baseline [5]. Both G-TAD [1] and PGCN [2] use graph convolutional networks and the concept of edges to share context and background information between proposals.…”
Section: Related Work
Mentioning confidence: 99%
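To make the idea of sharing context between proposals through graph edges concrete, here is a minimal sketch under assumptions of our own: proposal features plus an overlap-based adjacency matrix. This is not the actual G-TAD or PGCN code, only an illustration of one graph-convolution step over proposals.

import torch
import torch.nn as nn

class ProposalGraphConv(nn.Module):
    # One graph-convolution step: each proposal aggregates features from its
    # neighbours (e.g. temporally overlapping proposals) before a linear map.
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, feats, adj):
        # feats: (num_proposals, dim); adj: row-normalised (num_proposals, num_proposals)
        return torch.relu(self.linear(adj @ feats))

# Hypothetical usage: 4 proposals with 256-d features, edges between overlapping proposals.
feats = torch.randn(4, 256)
adj = torch.tensor([[1., 1., 0., 0.],
                    [1., 1., 1., 0.],
                    [0., 1., 1., 1.],
                    [0., 0., 1., 1.]])
adj = adj / adj.sum(dim=1, keepdim=True)  # row-normalise so neighbour features are averaged
context_feats = ProposalGraphConv(256)(feats, adj)  # (4, 256)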
“…TAL is an active area of research and several approaches have been proposed to tackle the problem [1,2,3,4,5,6]. For the most part, existing approaches depend solely on the visual modality (RGB, optical flow).…”
Section: Introduction
Mentioning confidence: 99%