ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413376

Slow-Fast Auditory Streams for Audio Recognition

Abstract: We propose a two-stream convolutional network for audio recognition, that operates on time-frequency spectrogram inputs. Following similar success in visual recognition, we learn Slow-Fast auditory streams with separable convolutions and multi-level lateral connections. The Slow pathway has high channel capacity while the Fast pathway operates at a fine-grained temporal resolution. We showcase the importance of our two-stream proposal on two diverse datasets: VGG-Sound and EPIC-KITCHENS-100, and achieve state-of-the-art results on both.
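The abstract describes the architecture only at a high level; a minimal PyTorch sketch of the idea is given below, assuming a single-channel log-mel spectrogram input. The channel widths, the temporal-resolution ratio alpha, the number of stages, and the class names (`SeparableConv`, `SlowFastAudio`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SeparableConv(nn.Module):
    """Factorised 2-D convolution: a temporal 1-D conv followed by a
    frequency 1-D conv, standing in for the paper's separable convolutions."""

    def __init__(self, in_ch, out_ch, t_kernel=3, f_kernel=3):
        super().__init__()
        self.temporal = nn.Conv2d(in_ch, out_ch, (t_kernel, 1),
                                  padding=(t_kernel // 2, 0))
        self.freq = nn.Conv2d(out_ch, out_ch, (1, f_kernel),
                              padding=(0, f_kernel // 2))
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.freq(self.temporal(x))))


class SlowFastAudio(nn.Module):
    """Toy two-stream network over a (batch, 1, time, freq) spectrogram.
    Widths, strides and the fusion scheme are illustrative only."""

    def __init__(self, slow_ch=64, fast_ch=8, alpha=4, num_classes=10):
        super().__init__()
        self.alpha = alpha  # temporal-resolution ratio between the streams
        # Slow stream: high channel capacity, coarse temporal resolution.
        self.slow1 = SeparableConv(1, slow_ch)
        self.slow2 = SeparableConv(slow_ch + 2 * fast_ch, 2 * slow_ch)
        # Fast stream: few channels, full temporal resolution.
        self.fast1 = SeparableConv(1, fast_ch)
        self.fast2 = SeparableConv(fast_ch, 2 * fast_ch)
        # Lateral connection: a time-strided conv maps Fast features onto the
        # Slow stream's temporal grid before channel-wise concatenation.
        self.lateral = nn.Conv2d(fast_ch, 2 * fast_ch, (5, 1),
                                 stride=(alpha, 1), padding=(2, 0))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2 * slow_ch + 2 * fast_ch, num_classes)

    def forward(self, spec):
        # The Slow stream sees a temporally subsampled spectrogram.
        slow = self.slow1(spec[:, :, ::self.alpha, :])
        fast = self.fast1(spec)
        slow = self.slow2(torch.cat([slow, self.lateral(fast)], dim=1))
        fast = self.fast2(fast)
        feats = torch.cat([self.pool(slow), self.pool(fast)], dim=1).flatten(1)
        return self.fc(feats)


if __name__ == "__main__":
    logmel = torch.randn(2, 1, 400, 128)  # 2 clips, 400 frames, 128 mel bins
    print(SlowFastAudio()(logmel).shape)  # -> torch.Size([2, 10])
```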

Cited by 37 publications (20 citation statements)
References 19 publications
“…Features. We experiment with TBN [25], SlowFast visual [19], and SlowFast auditory [26] features. We observe that using SlowFast features shows superior performance than TBN.…”
Section: Implementation Details (mentioning)
confidence: 99%
“…For reporting results on the test set, we do not use validation set for training, compared to [15]. Second column indicates feature backbones used for the ablation: TSN [41], I3D [11], SF(A) [26], SF(V) [19].…”
Section: Limitations (mentioning)
confidence: 99%
“…For EGTEA, see appendix F. Auditory features. We use Auditory SlowFast [33] for audio feature extraction when present. Similarly to the visual features, we extract 10 clips of 1s each uniformly spaced for each action segment, with average pooling and concatenation of the features from the Slow and Fast streams, and the resulting features have the same dimensionality, d a = 2304.…”
Section: Implementation Details (mentioning)
confidence: 99%
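The pooling and concatenation described in the quote above (10 one-second clips per action segment, averaged, with Slow and Fast features concatenated into a 2304-dimensional vector) could be sketched roughly as follows. The function name, the backbone interface, and the 2048/256 channel split are assumptions chosen so the dimensions add up to d_a = 2304; this is not the citing authors' code.

```python
import torch


def extract_segment_features(backbone, spectrogram_clips):
    """Pool per-clip Slow/Fast features into one segment-level vector.

    `backbone` is assumed to be a callable returning a (slow, fast) pair of
    feature maps of shape (num_clips, C, T, F) for a batch of 1-second
    spectrogram clips; this interface is illustrative, not the authors' API.
    """
    slow, fast = backbone(spectrogram_clips)       # e.g. C=2048 and C=256
    # Global average pooling over time and frequency, then channel concat:
    slow = slow.mean(dim=(2, 3))                   # (num_clips, 2048)
    fast = fast.mean(dim=(2, 3))                   # (num_clips, 256)
    per_clip = torch.cat([slow, fast], dim=1)      # (num_clips, 2304)
    # Average over the 10 uniformly spaced clips of the action segment.
    return per_clip.mean(dim=0)                    # (2304,)
```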
“…For the video encoder, we follow the design of SlowFast network with the modifications proposed in CVRL (Feichtenhofer et al., 2019; Qian et al., 2021). For the audio encoder, we followed the design of (Al-Tahan & Mohsenzadeh, 2021; Kazakos et al., 2021), however due to memory restrains we apply max-pooling to the temporal dimension, contrary to the implementation proposed by Kazakos et al. (2021). All models were trained from random initialization with 4 and 8 NVIDIA v100 Tesla GPUs.…”
Section: Audiovisual Encoder (mentioning)
confidence: 99%
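The modification mentioned in the last quote, max-pooling the encoder features over the temporal dimension rather than averaging them, might look like the toy snippet below. The tensor layout (batch, channels, time, frequency) and the specific sizes are assumptions for illustration, not the cited implementation.

```python
import torch

# Hypothetical audio-encoder output: (batch, channels, time, freq) feature map.
feats = torch.randn(8, 256, 50, 16)

# Average-pooling over time and frequency (as in the reference implementation).
avg_pooled = feats.mean(dim=(2, 3))            # (8, 256)

# Modification described in the quote: max-pool the temporal dimension
# (e.g. to cut activation memory), then average the remaining frequency axis.
max_pooled = feats.amax(dim=2).mean(dim=2)     # (8, 256)
```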