2021
DOI: 10.48550/arxiv.2106.14118
Preprint

Hear Me Out: Fusional Approaches for Audio Augmented Temporal Action Localization

Abstract: State-of-the-art architectures for untrimmed video Temporal Action Localization (TAL) have only considered RGB and Flow modalities, leaving the information-rich audio modality totally unexploited. Audio fusion has been explored for the related but arguably easier problem of trimmed (clip-level) action recognition. However, TAL poses a unique set of challenges. In this paper, we propose simple but effective fusion-based approaches for TAL. To the best of our knowledge, our work is the first to jointly consider …

Cited by 4 publications (8 citation statements)
References 40 publications (70 reference statements)
“…Bagchi et al. [2] divided the methods for segment proposal estimation in temporal action localization into two main categories: methods based on anchors and methods based on predicting the boundary probabilities. The anchor-based methods mainly use sliding windows over the video, such as S-CNN [59], CDC [58], TURN-TAP [18] and CTAP [17].…”
Section: Related Work (mentioning)
confidence: 99%
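
For context on the anchor-based category mentioned in the statement above, the following is a minimal sketch of multi-scale sliding-window proposal generation. The window lengths and stride ratio are illustrative assumptions, not values taken from S-CNN, TURN-TAP, or any other cited method.

# A minimal sketch (illustrative assumptions, not from any cited paper)
# of anchor-based proposal generation via multi-scale sliding windows,
# in the spirit of pipelines such as S-CNN and TURN-TAP.

def sliding_window_proposals(num_frames, window_lengths=(16, 32, 64, 128),
                             stride_ratio=0.5):
    """Enumerate candidate (start, end) frame segments over a video."""
    proposals = []
    for length in window_lengths:
        stride = max(1, int(length * stride_ratio))  # 50% overlap between windows
        for start in range(0, num_frames - length + 1, stride):
            proposals.append((start, start + length))
    return proposals

# A 300-frame video yields multi-scale candidate segments that a separate
# classifier would then score as action vs. background and refine.
print(len(sliding_window_proposals(300)))
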
“…It is worth noting that all these methods are unimodal, which is not optimal for the task of temporal forgery detection. The importance of multimodality was demonstrated recently by AVFusion [2].…”
Section: Related Work (mentioning)
confidence: 99%
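
To make the multimodality point concrete, below is a minimal sketch of one simple audio-visual fusion scheme (feature concatenation followed by a linear projection). It is a generic illustration, not the AVFusion architecture itself; the feature dimensions (1024-d visual, 128-d audio snippet features) are assumptions.

# A generic concatenation-fusion sketch (assumed dimensions), not the
# AVFusion architecture from the cited paper.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    def __init__(self, video_dim=1024, audio_dim=128, out_dim=512):
        super().__init__()
        self.proj = nn.Linear(video_dim + audio_dim, out_dim)

    def forward(self, video_feats, audio_feats):
        # video_feats: (batch, time, video_dim); audio_feats: (batch, time, audio_dim)
        fused = torch.cat([video_feats, audio_feats], dim=-1)
        return torch.relu(self.proj(fused))

fusion = ConcatFusion()
v = torch.randn(2, 100, 1024)  # e.g. snippet-level visual features
a = torch.randn(2, 100, 128)   # e.g. snippet-level audio features
print(fusion(v, a).shape)      # torch.Size([2, 100, 512])
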
“…[39] proposes a new task of audiovisual event localization that aims at predicting the event class from a 10-second clip. [4] studies multi-modal fusion approaches for audiovisual localization but ablates it on THUMOS14 and ActivityNet. Compared to them, we design our method for long, diverse egocentric videos.…”
Section: Related Work (mentioning)
confidence: 99%
“…However, their performance on Charades [79] or AVA [32], shown in Figure 7 (b), is not satisfactory to conduct step-level detection before verification. Although [3,5,11,28,53,54,75,76,92,96-98,100,102,103] perform well on ActivityNet [8] or THUMOS14 [38], this is less persuasive since the two datasets either contain only a few action classes or action instances per video. Thus, we introduce a simple but effective baseline, CosAlignment Transformer (abbreviated as CAT), which leverages 2D convolution to extract discriminative features from sampled frames and utilizes a transformer to model inter-step temporal correlation in a video clip.…”
Section: Introduction (mentioning)
confidence: 99%
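
Reading the quoted description literally, a CAT-style pipeline (2D convolutional features per sampled frame, then a transformer over time) might be skeletonized as follows. The tiny backbone and layer sizes are stand-in assumptions, not CAT's actual configuration.

# A hypothetical skeleton of the described pipeline: a 2D CNN extracts
# per-frame features from sampled frames, and a transformer encoder
# models temporal correlation across them. All sizes are assumptions.
import torch
import torch.nn as nn

class FramesThenTransformer(nn.Module):
    def __init__(self, feat_dim=256, num_heads=4, num_layers=2):
        super().__init__()
        self.backbone = nn.Sequential(        # stand-in 2D conv backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(64, feat_dim)
        layer = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W), sampled from a video clip
        b, t = frames.shape[:2]
        x = self.backbone(frames.flatten(0, 1)).flatten(1)  # (b*t, 64)
        x = self.proj(x).view(b, t, -1)                     # (b, t, feat_dim)
        return self.temporal(x)                             # temporal modeling

model = FramesThenTransformer()
clip = torch.randn(2, 8, 3, 64, 64)  # 8 sampled frames per clip
print(model(clip).shape)             # torch.Size([2, 8, 256])
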