Interspeech 2019
DOI: 10.21437/interspeech.2019-3040

Spatio-Temporal Attention Pooling for Audio Scene Classification

Abstract: Acoustic scenes are rich and redundant in their content. In this work, we present a spatio-temporal attention pooling layer coupled with a convolutional recurrent neural network to learn from patterns that are discriminative while suppressing those that are irrelevant for acoustic scene classification. The convolutional layers in this network learn invariant features from time-frequency input. The bidirectional recurrent layers are then able to encode the temporal dynamics of the resulting convolutional features…
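
The abstract outlines a CRNN whose recurrent outputs are reduced to a single embedding by attention pooling. Below is a minimal PyTorch sketch of that pipeline; for brevity it collapses the paper's spatio-temporal attention to a temporal-only variant, and all layer sizes (`n_mels=128`, `hidden=128`, two conv blocks) are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class CRNNWithAttentionPooling(nn.Module):
    """Minimal CRNN sketch: conv front-end, bidirectional GRU,
    and attention pooling over the recurrent outputs."""

    def __init__(self, n_mels=128, n_classes=10, hidden=128):
        super().__init__()
        # Convolutional layers learn shift-invariant time-frequency features.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 2)),
        )
        feat_dim = 64 * (n_mels // 4)  # channels x remaining frequency bins
        # Bidirectional recurrent layer encodes temporal dynamics.
        self.rnn = nn.GRU(feat_dim, hidden, bidirectional=True, batch_first=True)
        # One scalar attention score per time step.
        self.att = nn.Linear(2 * hidden, 1)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):              # x: (batch, 1, n_mels, time)
        h = self.conv(x)               # (batch, 64, n_mels/4, time/4)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)  # time-major sequence
        h, _ = self.rnn(h)             # (batch, time/4, 2*hidden)
        w = torch.softmax(self.att(h), dim=1)           # attention over time
        z = (w * h).sum(dim=1)         # attention-weighted pooling -> embedding
        return self.fc(z)              # class logits
```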

Cited by 25 publications (25 citation statements) · References 21 publications (42 reference statements)

“…Since the frequency-domain characteristics of spectrogram features remain static over different time frames, we choose to process the temporal and frequency domains separately, as proposed in [36], instead of jointly processing them together as an image. Through this approach, we extract global temporal and frequency attention vectors and generate the final attention map, according to …”
Section: Proposed Temporal-Frequency Attention-Based Classification Framework (mentioning; confidence: 99%)
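
The statement above describes computing separate global attention vectors along the time and frequency axes and combining them into a single attention map. The sketch below shows one plausible PyTorch realization; the mean pooling along each axis, the outer-product combination, and all names (`TemporalFrequencyAttention`, `a_t`, `a_f`) are assumptions, since the quoted excerpt omits its exact equations.

```python
import torch
import torch.nn as nn

class TemporalFrequencyAttention(nn.Module):
    """Sketch of separate temporal/frequency attention: global attention
    vectors are computed along each axis and combined into a 2-D map."""

    def __init__(self, channels):
        super().__init__()
        self.freq_fc = nn.Conv1d(channels, 1, kernel_size=1)
        self.time_fc = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):                                 # x: (B, C, F, T)
        # Global frequency attention: average over time, score each band.
        a_f = torch.softmax(self.freq_fc(x.mean(dim=3)), dim=2)  # (B, 1, F)
        # Global temporal attention: average over frequency, score each frame.
        a_t = torch.softmax(self.time_fc(x.mean(dim=2)), dim=2)  # (B, 1, T)
        # Outer product yields the final (F, T) attention map (assumption).
        att_map = a_f.unsqueeze(3) * a_t.unsqueeze(2)     # (B, 1, F, T)
        return x * att_map                                # reweighted features
```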
“…The sequence of recurrent outputs is then reduced to a feature vector (i.e. a view-specific embedding) via the spatio-temporal attention pooling suggested in [8]. For classification purposes, the 2D CRNNs make use of two fully-connected layers with ReLU activation, followed by a final output layer with softmax.…”
Section: Network Architecture (mentioning; confidence: 99%)
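
As a companion to the excerpt above, here is a hedged PyTorch sketch of such a classifier head: attention pooling that reduces a sequence of recurrent outputs to a view-specific embedding, followed by two fully-connected ReLU layers and a softmax output. The layer sizes are illustrative assumptions, not values from [8] or the citing paper.

```python
import torch
import torch.nn as nn

class AttentionPoolHead(nn.Module):
    """Attention pooling over recurrent outputs, then two FC+ReLU
    layers and a softmax output layer. Sizes are assumptions."""

    def __init__(self, rnn_dim=256, fc_dim=512, n_classes=10):
        super().__init__()
        self.score = nn.Linear(rnn_dim, 1)       # one scalar score per step
        self.head = nn.Sequential(
            nn.Linear(rnn_dim, fc_dim), nn.ReLU(),
            nn.Linear(fc_dim, fc_dim), nn.ReLU(),
            nn.Linear(fc_dim, n_classes),
        )

    def forward(self, h):                        # h: (batch, time, rnn_dim)
        w = torch.softmax(self.score(h), dim=1)  # attention weights over time
        emb = (w * h).sum(dim=1)                 # view-specific embedding
        return torch.softmax(self.head(emb), dim=-1)
```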
“…Good embeddings are crucial for machine learning tasks [1,2,3]. For audio and music classification in particular, such an embedding can be learned from a variety of low-level features developed alongside the research field, such as the Mel-scaled spectrogram [4,5,6,7], Gammatone spectrogram [8,9,2], Constant-Q transform (CQT) spectrogram [10,11,12], and even the raw waveform [13,14]. Oftentimes, recognition results obtained from embeddings learned from different low-level inputs vary, in the sense that one embedding is good for some target classes while another is better for others.…”
Section: Introduction (mentioning; confidence: 99%)
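
For concreteness, the snippet below derives two of the low-level inputs listed above (Mel-scaled and CQT spectrograms) with librosa. Gammatone spectrograms are omitted since librosa does not provide them out of the box, and `example.wav` is a placeholder path.

```python
import librosa
import numpy as np

# Load a clip (placeholder path) at a common analysis sample rate.
y, sr = librosa.load("example.wav", sr=22050)

# Mel-scaled spectrogram, log-compressed as commonly used for training.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel)

# Constant-Q transform (CQT) spectrogram.
cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=84))
log_cqt = librosa.amplitude_to_db(cqt)

print(log_mel.shape, log_cqt.shape)  # (n_mels, frames), (n_bins, frames)
```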
“…Although the translation of local patterns in the time domain has little effect on the classification of sound events, differences across frequency bands have a significant impact on the performance of sound classification [9]. To capture information about which parts of the features are more relevant to the sound events, attention mechanisms have been proposed [10][11][12][13][14][15][16][17], especially for weakly labelled data, where timing information about the sound events is not available in the training data. In these methods, temporal attention [11,14] is applied to obtain weights for combining feature vectors at different time steps; however, the importance of different frequency bands is not considered.…”
Section: Introduction (mentioning; confidence: 99%)
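
The excerpt contrasts temporal attention, which weights feature vectors across time steps, with approaches that also weight frequency bands. Below is a sketch of the plain temporal-attention pattern for weakly labelled data in the spirit of [11,14]: per-frame class probabilities are combined through learned per-frame attention weights into a clip-level prediction. Dimensions and names are assumptions.

```python
import torch
import torch.nn as nn

class WeaklyLabelledTemporalAttention(nn.Module):
    """Temporal attention for weakly labelled data: frame-wise class
    probabilities pooled by learned attention weights over time."""

    def __init__(self, feat_dim=256, n_classes=10):
        super().__init__()
        self.cla = nn.Linear(feat_dim, n_classes)   # per-frame classifier
        self.att = nn.Linear(feat_dim, n_classes)   # per-frame attention

    def forward(self, x):                           # x: (batch, time, feat)
        p = torch.sigmoid(self.cla(x))              # frame-wise class probs
        w = torch.softmax(self.att(x), dim=1)       # attention over time
        # Clip-level probability: attention-weighted average of frame probs.
        return (w * p).sum(dim=1)                   # (batch, n_classes)
```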