2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr46437.2021.01241

Multi-shot Temporal Event Localization: a Benchmark

Cited by 68 publications (36 citation statements)
References 64 publications
“…This model achieves an average mAP of 58.7% (Table 3, row 5), a major boost of 15.9%. We note that this model already outperforms the best reported results (56.7% mAP at tIoU=0.5 from [45]). This result shows that our Transformer model is very powerful for TAL, and serves as the main source of performance gain.…”
Section: Baseline
Mentioning confidence: 53%
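The mAP figures quoted above are scored against a temporal IoU (tIoU) threshold between predicted and ground-truth segments. As a minimal sketch, not taken from any of the cited papers and assuming segment boundaries given in seconds, the overlap measure can be computed as follows:

# Sketch of temporal IoU between two segments, each given as (start, end) in seconds.
def temporal_iou(pred, gt):
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0

# A prediction counts as a true positive at tIoU=0.5 only if it overlaps an
# unmatched ground-truth segment by at least that much.
print(temporal_iou((10.0, 20.0), (12.0, 22.0)))  # 8 / 12 ≈ 0.667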
“…recognizing places or actions in those scenes. In [35], multi-shot clips of movies and TV episodes were categorized into 25 event classes for their temporal localization. The results in [35] show that state-of-the-art event localization models [52,53] do not perform as well on long-form movies and TV episodes as they do on short-form video datasets like THUMOS14 [28].…”
Section: Related Work
Mentioning confidence: 99%
“…In [35], multi-shot clips of movies and TV episodes were categorized into 25 event classes for their temporal localization. The results in [35] show that state-of-the-art event localization models [52,53] do not perform as well on long-form movies and TV episodes as they do on short-form video datasets like THUMOS14 [28]. A long-form video understanding (LVU) dataset was recently proposed in [50] with nine different tasks related to the semantic understanding of video clips cut out from full-length movies.…”
Section: Related Work
Mentioning confidence: 99%
“…Many recent approaches employ this proposal-based formulation [15,16,17]. Specifically, this is the case for the state-of-the-art approaches we consider in this paper: G-TAD [1], PGCN [2] and the MUSES baseline [5]. Both G-TAD [1] and PGCN [2] use graph convolutional networks and the concept of edges to share context and background information between proposals.…”
Section: Related Work
Mentioning confidence: 99%
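To make the idea of sharing context between proposals through graph edges concrete, here is a minimal sketch under assumptions of our own: proposal features plus an overlap-based adjacency matrix. This is not the actual G-TAD or PGCN code, only an illustration of one graph-convolution step over proposals.

import torch
import torch.nn as nn

class ProposalGraphConv(nn.Module):
    # One graph-convolution step: each proposal aggregates features from its
    # neighbours (e.g. temporally overlapping proposals) before a linear map.
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, feats, adj):
        # feats: (num_proposals, dim); adj: row-normalised (num_proposals, num_proposals)
        return torch.relu(self.linear(adj @ feats))

# Hypothetical usage: 4 proposals with 256-d features, edges between overlapping proposals.
feats = torch.randn(4, 256)
adj = torch.tensor([[1., 1., 0., 0.],
                    [1., 1., 1., 0.],
                    [0., 1., 1., 1.],
                    [0., 0., 1., 1.]])
adj = adj / adj.sum(dim=1, keepdim=True)  # row-normalise so neighbour features are averaged
context_feats = ProposalGraphConv(256)(feats, adj)  # (4, 256)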
“…TAL is an active area of research and several approaches have been proposed to tackle the problem [1,2,3,4,5,6]. For the most part, existing approaches depend solely on the visual modality (RGB, optical flow).…”
Section: Introduction
Mentioning confidence: 99%