2021
DOI: 10.48550/arxiv.2106.14118
Preprint

Hear Me Out: Fusional Approaches for Audio Augmented Temporal Action Localization

Abstract: State-of-the-art architectures for untrimmed video Temporal Action Localization (TAL) have only considered RGB and Flow modalities, leaving the information-rich audio modality totally unexploited. Audio fusion has been explored for the related but arguably easier problem of trimmed (clip-level) action recognition. However, TAL poses a unique set of challenges. In this paper, we propose simple but effective fusion-based approaches for TAL. To the best of our knowledge, our work is the first to jointly consider …

Cited by 4 publications (8 citation statements)
References 40 publications (70 reference statements)
“…Bagchi et al. [2] divided the methods for segment proposal estimation in temporal action localization into two main categories: methods based on anchors and methods based on predicting the boundary probabilities. The anchor-based methods mainly use sliding windows over the video, such as S-CNN [59], CDC [58], TURN-TAP [18] and CTAP [17].…”
Section: Related Work (mentioning)
confidence: 99%
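
For context on the anchor-based category mentioned in the statement above, the following is a minimal sketch of multi-scale sliding-window proposal generation. The window lengths and stride ratio are illustrative assumptions, not values taken from S-CNN, TURN-TAP, or any other cited method.

# A minimal sketch (illustrative assumptions, not from any cited paper)
# of anchor-based proposal generation via multi-scale sliding windows,
# in the spirit of pipelines such as S-CNN and TURN-TAP.

def sliding_window_proposals(num_frames, window_lengths=(16, 32, 64, 128),
                             stride_ratio=0.5):
    """Enumerate candidate (start, end) frame segments over a video."""
    proposals = []
    for length in window_lengths:
        stride = max(1, int(length * stride_ratio))  # 50% overlap between windows
        for start in range(0, num_frames - length + 1, stride):
            proposals.append((start, start + length))
    return proposals

# A 300-frame video yields multi-scale candidate segments that a separate
# classifier would then score as action vs. background and refine.
print(len(sliding_window_proposals(300)))
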
“…It is worth noting that all these methods are unimodal, which is not optimal for the task of temporal forgery detection. The importance of multimodality was demonstrated recently by AVFusion [2].…”
Section: Related Work (mentioning)
confidence: 99%
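
To make the multimodality point concrete, below is a minimal sketch of one simple audio-visual fusion scheme (feature concatenation followed by a linear projection). It is a generic illustration, not the AVFusion architecture itself; the feature dimensions (1024-d visual, 128-d audio snippet features) are assumptions.

# A generic concatenation-fusion sketch (assumed dimensions), not the
# AVFusion architecture from the cited paper.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    def __init__(self, video_dim=1024, audio_dim=128, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(video_dim + audio_dim, out_dim)

    def forward(self, video_feats, audio_feats):
        # video_feats: (batch, time, video_dim); audio_feats: (batch, time, audio_dim)
        fused = torch.cat([video_feats, audio_feats], dim=-1)
        return torch.relu(self.proj(fused))

fusion = ConcatFusion()
v = torch.randn(2, 100, 1024)  # e.g. snippet-level visual features
a = torch.randn(2, 100, 128)   # e.g. snippet-level audio features
print(fusion(v, a).shape)      # torch.Size([2, 100, 512])
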
“…[39] proposes a new task of audiovisual event localization that aims at predicting the event class from a 10-second clip. [4] studies multi-modal fusion approaches for audiovisual localization but ablates it on THUMOS14 and ActivityNet. Compared to them, we design our method for long, diverse egocentric videos.…”
Section: Related Work (mentioning)
confidence: 99%
“…However, their performance on Charades [79] or AVA [32], shown in Figure 7 (b), is not satisfactory to conduct step-level detection before verification. Although [3,5,11,28,53,54,75,76,92,96-98,100,102,103] perform well on ActivityNet [8] or THUMOS14 [38], this is less persuasive since the two datasets either contain only a few action classes or action instances per video. Thus, we introduce a simple but effective baseline, CosAlignment Transformer (abbreviated as CAT), which leverages 2D convolution to extract discriminative features from sampled frames and utilizes a transformer to model inter-step temporal correlation in a video clip.…”
Section: Introduction (mentioning)
confidence: 99%
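
Reading the quoted description literally, a CAT-style pipeline (2D convolutional features per sampled frame, then a transformer over time) might be skeletonized as follows. The tiny backbone and layer sizes are stand-in assumptions, not CAT's actual configuration.

# A hypothetical skeleton of the described pipeline: a 2D CNN extracts
# per-frame features from sampled frames, and a transformer encoder
# models temporal correlation across them. All sizes are assumptions.
import torch
import torch.nn as nn

class FramesThenTransformer(nn.Module):
    def __init__(self, feat_dim=256, num_heads=4, num_layers=2):
        super().__init__()
        self.backbone = nn.Sequential(        # stand-in 2D conv backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)
        layer = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W), sampled from a video clip
        b, t = frames.shape[:2]
        x = self.backbone(frames.flatten(0, 1)).flatten(1)  # (b*t, 64)
        x = self.proj(x).view(b, t, -1)                     # (b, t, feat_dim)
        return self.temporal(x)                             # temporal modeling

model = FramesThenTransformer()
clip = torch.randn(2, 8, 3, 64, 64)  # 8 sampled frames per clip
print(model(clip).shape)             # torch.Size([2, 8, 256])
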