Interspeech 2019
DOI: 10.21437/interspeech.2019-1587
Multi-Scale Time-Frequency Attention for Acoustic Event Detection

Abstract: Most attention-based methods only concentrate along the time axis, which is insufficient for Acoustic Event Detection (AED). Meanwhile, previous methods for AED rarely considered that target events possess distinct temporal and frequential scales. In this work, we propose a Multi-Scale Time-Frequency Attention (MTFA) module for AED. MTFA gathers information at multiple resolutions to generate a time-frequency attention mask that tells the model where to focus along both the time and frequency axes. With MTFA, the…
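The core idea in the abstract — pool the spectrogram at several resolutions, combine the results into a single mask, and gate the input along both time and frequency — can be sketched in a few lines of numpy. This is a minimal illustration under assumed design choices (the scale set, average pooling, nearest-neighbour upsampling, and mean combination), not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mtfa_attention(spec, scales=(1, 2, 4)):
    """Apply a multi-scale time-frequency attention mask to a spectrogram.

    spec: (T, F) array (time x frequency). For simplicity, T and F are
    assumed divisible by every scale in `scales`. The scale set and the
    pooling/combination scheme are illustrative assumptions.
    """
    T, F = spec.shape
    maps = []
    for s in scales:
        # Average-pool with an s x s window to gather context at this scale.
        pooled = spec.reshape(T // s, s, F // s, s).mean(axis=(1, 3))
        # Nearest-neighbour upsample back to the input resolution.
        maps.append(np.repeat(np.repeat(pooled, s, axis=0), s, axis=1))
    # Combine scales and squash to (0, 1): the time-frequency mask.
    mask = sigmoid(np.mean(maps, axis=0))
    return spec * mask, mask
```

Because the mask is two-dimensional, it can suppress an irrelevant frequency band at one time step while passing it at another, which a purely temporal attention weight cannot do.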

Cited by 6 publications (8 citation statements)
References 17 publications (38 reference statements)
“…Compared with other state-of-the-art methods, the performance of our model is competitive. Table 8 and Table 9 show the F1 and ER of the 1D-CRNN [15], TFA [21], MTFA [22], and MTF-CRNN on the T2-dev and T2-eval datasets for three target sound events, respectively. The 1D-CRNN [15] applies a 1-dimensional convolution layer, batch normalization (BN), and a pooling layer, followed by RNN layers and a fully connected layer.…”
Section: Test Results (mentioning)
Confidence: 99%
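The 1D-CRNN pipeline described in the citation above (1-D convolution, batch normalization, pooling, then recurrent layers and a fully connected output) can be sketched with plain numpy building blocks. All layer sizes, the tanh RNN cell, and the sigmoid output are illustrative assumptions, not the configuration used in [15]:

```python
import numpy as np

def conv1d(x, w, b):
    """Valid 1-D convolution (cross-correlation). x: (C_in, T), w: (C_out, C_in, K)."""
    C_out, C_in, K = w.shape
    T_out = x.shape[1] - K + 1
    out = np.empty((C_out, T_out))
    for t in range(T_out):
        out[:, t] = np.tensordot(w, x[:, t:t + K], axes=([1, 2], [0, 1])) + b
    return out

def batch_norm(x, eps=1e-5):
    """Normalize each channel over time (inference-style, no learned scale/shift)."""
    return (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + eps)

def max_pool(x, k=2):
    """Non-overlapping temporal max pooling with window k."""
    T = (x.shape[1] // k) * k
    return x[:, :T].reshape(x.shape[0], T // k, k).max(axis=2)

def simple_rnn(seq, Wx, Wh, bh):
    """Plain tanh RNN over seq (T, D); returns the final hidden state."""
    h = np.zeros(Wh.shape[0])
    for x_t in seq:
        h = np.tanh(x_t @ Wx + h @ Wh + bh)
    return h
```

Chaining these — `conv1d` → `batch_norm` → `max_pool` → `simple_rnn` → a dense sigmoid readout — reproduces the layer ordering quoted above: the convolutional front end extracts local spectro-temporal features, and the recurrent layer models longer-range event structure.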
“…TFA [21] presented a temporal-frequential attention model for sound event detection, a pretrained method with about 260K parameters that uses 3×3, 5×5, and 7×7 convolution kernels. MTFA [22] used a multi-scale hourglass structure that outperformed previous single-model methods for rare sound event detection, with performance that depends on the quantity of annotated data. Our model performs worse than other state-of-the-art methods but has the fewest parameters.…”
Section: Test Results (mentioning)
Confidence: 99%