ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413376

Slow-Fast Auditory Streams for Audio Recognition

Abstract: We propose a two-stream convolutional network for audio recognition, that operates on time-frequency spectrogram inputs. Following similar success in visual recognition, we learn Slow-Fast auditory streams with separable convolutions and multi-level lateral connections. The Slow pathway has high channel capacity while the Fast pathway operates at a fine-grained temporal resolution. We showcase the importance of our two-stream proposal on two diverse datasets: VGG-Sound and EPIC-KITCHENS-100, and achieve state-of-the-art results on both.
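The abstract describes the architecture only at a high level; a minimal PyTorch sketch of the idea is given below, assuming a single-channel log-mel spectrogram input. The channel widths, the temporal-resolution ratio alpha, the number of stages, and the class names (`SeparableConv`, `SlowFastAudio`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SeparableConv(nn.Module):
    """Factorised 2-D convolution: a temporal 1-D conv followed by a
    frequency 1-D conv, standing in for the paper's separable convolutions."""

    def __init__(self, in_ch, out_ch, t_kernel=3, f_kernel=3):
        super().__init__()
        self.temporal = nn.Conv2d(in_ch, out_ch, (t_kernel, 1),
                                  padding=(t_kernel // 2, 0))
        self.freq = nn.Conv2d(out_ch, out_ch, (1, f_kernel),
                              padding=(0, f_kernel // 2))
        self.bn = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.freq(self.temporal(x))))


class SlowFastAudio(nn.Module):
    """Toy two-stream network over a (batch, 1, time, freq) spectrogram.
    Widths, strides and the fusion scheme are illustrative only."""

    def __init__(self, slow_ch=64, fast_ch=8, alpha=4, num_classes=10):
        super().__init__()
        self.alpha = alpha  # temporal-resolution ratio between the streams
        # Slow stream: high channel capacity, coarse temporal resolution.
        self.slow1 = SeparableConv(1, slow_ch)
        self.slow2 = SeparableConv(slow_ch + 2 * fast_ch, 2 * slow_ch)
        # Fast stream: few channels, full temporal resolution.
        self.fast1 = SeparableConv(1, fast_ch)
        self.fast2 = SeparableConv(fast_ch, 2 * fast_ch)
        # Lateral connection: a time-strided conv maps Fast features onto the
        # Slow stream's temporal grid before channel-wise concatenation.
        self.lateral = nn.Conv2d(fast_ch, 2 * fast_ch, (5, 1),
                                 stride=(alpha, 1), padding=(2, 0))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2 * slow_ch + 2 * fast_ch, num_classes)

    def forward(self, spec):
        # The Slow stream sees a temporally subsampled spectrogram.
        slow = self.slow1(spec[:, :, ::self.alpha, :])
        fast = self.fast1(spec)
        slow = self.slow2(torch.cat([slow, self.lateral(fast)], dim=1))
        fast = self.fast2(fast)
        feats = torch.cat([self.pool(slow), self.pool(fast)], dim=1).flatten(1)
        return self.fc(feats)


if __name__ == "__main__":
    logmel = torch.randn(2, 1, 400, 128)  # 2 clips, 400 frames, 128 mel bins
    print(SlowFastAudio()(logmel).shape)  # -> torch.Size([2, 10])
```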

Cited by 37 publications (20 citation statements)
References 19 publications
“…Features. We experiment with TBN [25], SlowFast visual [19], and SlowFast auditory [26] features. We observe that using SlowFast features shows superior performance than TBN.…”
Section: Implementation Details (mentioning)
confidence: 99%
“…For reporting results on the test set, we do not use validation set for training, compared to [15]. Second column indicates feature backbones used for the ablation: TSN [41], I3D [11], SF(A) [26], SF(V) [19].…”
Section: Limitations (mentioning)
confidence: 99%
“…For EGTEA, see appendix F. Auditory features. We use Auditory SlowFast [33] for audio feature extraction when present. Similarly to the visual features, we extract 10 clips of 1s each uniformly spaced for each action segment, with average pooling and concatenation of the features from the Slow and Fast streams, and the resulting features have the same dimensionality, d a = 2304.…”
Section: Implementation Details (mentioning)
confidence: 99%
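The pooling and concatenation described in the quote above (10 one-second clips per action segment, averaged, with Slow and Fast features concatenated into a 2304-dimensional vector) could be sketched roughly as follows. The function name, the backbone interface, and the 2048/256 channel split are assumptions chosen so the dimensions add up to d_a = 2304; this is not the citing authors' code.

```python
import torch


def extract_segment_features(backbone, spectrogram_clips):
    """Pool per-clip Slow/Fast features into one segment-level vector.

    `backbone` is assumed to be a callable returning a (slow, fast) pair of
    feature maps of shape (num_clips, C, T, F) for a batch of 1-second
    spectrogram clips; this interface is illustrative, not the authors' API.
    """
    slow, fast = backbone(spectrogram_clips)       # e.g. C=2048 and C=256
    # Global average pooling over time and frequency, then channel concat:
    slow = slow.mean(dim=(2, 3))                   # (num_clips, 2048)
    fast = fast.mean(dim=(2, 3))                   # (num_clips, 256)
    per_clip = torch.cat([slow, fast], dim=1)      # (num_clips, 2304)
    # Average over the 10 uniformly spaced clips of the action segment.
    return per_clip.mean(dim=0)                    # (2304,)
```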
“…For the video encoder, we follow the design of SlowFast network with the modifications proposed in CVRL (Feichtenhofer et al., 2019; Qian et al., 2021). For the audio encoder, we followed the design of (Al-Tahan & Mohsenzadeh, 2021; Kazakos et al., 2021), however due to memory restrains we apply max-pooling to the temporal dimension, contrary to the implementation proposed by Kazakos et al. (2021). All models were trained from random initialization with 4 and 8 NVIDIA v100 Tesla GPUs.…”
Section: Audiovisual Encoder (mentioning)
confidence: 99%
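The modification mentioned in the last quote, max-pooling the encoder features over the temporal dimension rather than averaging them, might look like the toy snippet below. The tensor layout (batch, channels, time, frequency) and the specific sizes are assumptions for illustration, not the cited implementation.

```python
import torch

# Hypothetical audio-encoder output: (batch, channels, time, freq) feature map.
feats = torch.randn(8, 256, 50, 16)

# Average-pooling over time and frequency (as in the reference implementation).
avg_pooled = feats.mean(dim=(2, 3))            # (8, 256)

# Modification described in the quote: max-pool the temporal dimension
# (e.g. to cut activation memory), then average the remaining frequency axis.
max_pooled = feats.amax(dim=2).mean(dim=2)     # (8, 256)
```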