Interspeech 2019
DOI: 10.21437/interspeech.2019-2049

Hierarchical Pooling Structure for Weakly Labeled Sound Event Detection

Abstract: Sound event detection with weakly labeled data is treated as a multi-instance learning problem, and the choice of pooling function is the key to solving it. In this paper, we propose a hierarchical pooling structure to improve the performance of weakly labeled sound event detection systems. The proposed pooling structure yields remarkable improvements on three types of pooling function without adding any parameters. Moreover, our system has achieved competitive performance on Task 4 of Detection …

Cited by 8 publications (8 citation statements)
References 7 publications
“…Model is trained with ADAM optimizer with an initial learning rate of 0.001 for 30 epochs, and the learning rate is reduced by half every 10 epochs. We chose the best weights out of 30 epochs based on classification F1 on the dev set. The batch size is set to 48 due to GPU memory constraints.…”

Table 3: Results on DCASE 2017 task 4B: weakly supervised AED for smart cars (Classification F1)

Method              Dev (%)   Eval (%)
Xu et al [5] [14]   46.8      N/A
Yan et al [6]       51.3      55.1
He et al [12]       46.5      53.4
Ours                49.9      49.4
Section: DCASE 2018 Task
mentioning
confidence: 99%
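
The learning-rate schedule described in the statement above (initial rate 0.001, halved every 10 epochs, 30 epochs in total) can be sketched as a step decay. This is a minimal illustration, not the citing authors' code: the function name `step_decay_lr` is hypothetical, and it assumes 0-indexed epochs with the halving applied at epochs 10 and 20, which the statement does not specify.

```python
def step_decay_lr(epoch, initial_lr=0.001, factor=0.5, step=10):
    """Learning rate for a 0-indexed epoch under step decay.

    Assumption: the rate is multiplied by `factor` once every `step` epochs,
    starting from `initial_lr` (values taken from the quoted statement).
    """
    return initial_lr * factor ** (epoch // step)


# Learning rate for each of the 30 training epochs.
schedule = [step_decay_lr(epoch) for epoch in range(30)]
```

Under these assumptions, epochs 0 to 9 use the initial rate, epochs 10 to 19 use half of it, and epochs 20 to 29 use a quarter of it.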
“…The winner of DCASE 2017 challenge used an ensemble of CNNs with various lengths of analysis windows for multiple input scaling [11]. He et al [12] proposed a hierarchical pooling structure to improve the performance of CRNN. The effects of different pooling/attention methods on AED and audio tagging have also been analyzed in previous works [13,14,15].…”
Section: Introduction
mentioning
confidence: 99%
“…For example, Kong et al [12] proposed an attention model as a pooling function, which is achieved by a weighted sum of the results over frames. He et al [13] proposed a hierarchical attention pooling structure. By assigning larger weights to the instance corresponding to the sound events, these methods could dynamically determine the contribution of each instance.…”
Section: Introduction
mentioning
confidence: 99%
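
The idea in the statement above, an attention model used as a pooling function via a weighted sum of frame-level results, can be sketched as follows. This is a hedged illustration only: the function name `attention_pool` is hypothetical, and normalizing the attention scores with a softmax is an assumption, since the statement does not say how the weights are computed.

```python
import math


def attention_pool(frame_probs, frame_scores):
    """Clip-level probability as a weighted sum of frame-level probabilities.

    Assumption: weights are a softmax over per-frame attention scores, so
    frames with larger scores contribute more to the clip-level result.
    """
    peak = max(frame_scores)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in frame_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return sum(w * p for w, p in zip(weights, frame_probs))


# Equal scores give equal weights, so the result is the plain average.
uniform = attention_pool([0.1, 0.9, 0.2], [0.0, 0.0, 0.0])
# One dominant score makes the result follow that frame's probability.
peaked = attention_pool([0.1, 0.9, 0.2], [0.0, 50.0, 0.0])
```

With equal scores the weighted sum reduces to average pooling, while a single dominant score approaches max-like behavior on that frame, which is how such weights can determine the contribution of each instance.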
“…Furthermore, the RNN, which can remember previous inputs through time, also performs well due to the time-series characteristics of the audio signal. Recently, the CNN and CRNN models with additional techniques, including the modified CNN [15] and pooling methods [16], were proposed. Furthermore, further studies combined with other tasks such as sound event detection and segmentation using the weakly labeled data [17] and joint sound event detection and localization [18] were proposed. In contrast, in the speech recognition domain, the use of the LMFB as a feature vector and the CNN, RNN, and CRNN as the classifiers for the acoustic model are similar to the SED, but studies of integrating a preprocessor, such as acoustic beamforming or dereverberation with the acoustic model using multi-channel audio signals, have been actively conducted to improve recognition accuracy [19][20][21].…”
mentioning
confidence: 99%