2020
DOI: 10.1109/taslp.2020.3014737
Sound Event Detection of Weakly Labelled Data With CNN-Transformer and Automatic Threshold Optimization

Abstract: Sound event detection (SED) is the task of detecting sound events in an audio recording. One challenge of the SED task is that many datasets, such as the Detection and Classification of Acoustic Scenes and Events (DCASE) datasets, are weakly labelled; that is, there are only audio tags for each audio clip, without the onset and offset times of sound events. We compare segment-wise and clip-wise training for SED, a comparison lacking in previous works. We propose a convolutional neural network Transformer (CNN-Transformer) fo…
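The automatic threshold optimization named in the title refers to choosing decision thresholds from data rather than fixing them by hand. The paper's specific optimization algorithm is not reproduced here; the sketch below only illustrates the general idea of selecting per-class thresholds on held-out (validation) predictions to maximise a metric, clip-level F1 in this toy example. All function names and values are illustrative assumptions.

```python
# Sketch: pick one decision threshold per class on validation data so that
# clip-level F1 is maximised, instead of using a fixed 0.5 for every class.
import numpy as np

def f1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    return 2 * tp / (2 * tp + fp + fn + 1e-12)

def optimize_thresholds(val_probs: np.ndarray, val_labels: np.ndarray,
                        grid=np.linspace(0.05, 0.95, 19)) -> np.ndarray:
    """val_probs, val_labels: (n_clips, n_classes) arrays of clip-level outputs."""
    n_classes = val_probs.shape[1]
    thresholds = np.full(n_classes, 0.5)
    for k in range(n_classes):
        scores = [f1_score(val_labels[:, k], (val_probs[:, k] > t).astype(int))
                  for t in grid]
        thresholds[k] = grid[int(np.argmax(scores))]
    return thresholds

# toy usage: 100 validation clips, 10 classes
probs = np.random.rand(100, 10)
labels = (np.random.rand(100, 10) > 0.8).astype(int)
print(optimize_thresholds(probs, labels))
```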

Cited by 101 publications (75 citation statements)
References 30 publications
“…We have compared the SED performance of our methods with those of other conventional methods, such as SED using an α min-max sub-sampling method within CRNN [18], the batch dice loss-based SED [23,27], multitask learning of SED and sound activity detection [28], and Transformer-based SED [12,13]. As the Transformer-based SED, we used three CNN layers with the same structure as the CNN-BiGRU, followed by two Transformer encoder layers and two dense layers.…”
Section: Comparison With Conventional Methods
confidence: 99%
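The comparison model described in the statement above stacks three CNN layers, two Transformer encoder layers, and two dense layers. Below is a minimal PyTorch sketch of that kind of architecture; channel counts, attention heads, pooling, and feature sizes are my own assumptions, not settings taken from the cited papers.

```python
import torch
import torch.nn as nn

class TransformerSEDBaseline(nn.Module):
    """Three CNN blocks -> two Transformer encoder layers -> two dense layers."""

    def __init__(self, n_mels: int = 64, n_classes: int = 10, d_model: int = 128):
        super().__init__()

        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=(1, 2)),  # pool frequency only, keep time resolution
            )

        # Three CNN layers (channel sizes are assumed).
        self.cnn = nn.Sequential(block(1, 64), block(64, 128), block(128, d_model))

        # Two Transformer encoder layers operating along the time axis.
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=256, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=2)

        # Two dense layers producing frame-wise class probabilities.
        self.fc1 = nn.Linear(d_model, d_model)
        self.fc2 = nn.Linear(d_model, n_classes)

    def forward(self, x):
        # x: (batch, time, n_mels) log-mel spectrogram
        h = self.cnn(x.unsqueeze(1))       # (batch, d_model, time, n_mels/8)
        h = h.mean(dim=3).transpose(1, 2)  # average over frequency -> (batch, time, d_model)
        h = self.transformer(h)            # (batch, time, d_model)
        h = torch.relu(self.fc1(h))
        return torch.sigmoid(self.fc2(h))  # frame-wise event probabilities

probs = TransformerSEDBaseline()(torch.randn(2, 500, 64))
print(probs.shape)  # torch.Size([2, 500, 10])
```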
“…For SED, many methods using neural networks, such as a convolutional neural network (CNN) [9], recurrent neural network (RNN) [10], convolutional recurrent neural network (CRNN) [11], and Transformer-based neural network [12,13], have been proposed. In these methods, an audio clip is segmented into short time frames (e.g., 40 ms frames), and each frame is regarded as one data sample for model training and evaluation.…”
Section: Introduction
confidence: 99%
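The frame-wise formulation described above treats each short frame (e.g., 40 ms) of an audio clip as one data sample for training and evaluation. A small sketch of that framing step follows; the 16 kHz sample rate and 20 ms hop are illustrative assumptions.

```python
import numpy as np

def frame_audio(waveform: np.ndarray, sample_rate: int = 16000,
                frame_ms: float = 40.0, hop_ms: float = 20.0) -> np.ndarray:
    """Split a 1-D waveform into overlapping frames of `frame_ms` milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)   # 640 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)       # 320 samples at 16 kHz
    n_frames = 1 + max(0, (len(waveform) - frame_len) // hop_len)
    frames = np.stack([waveform[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    return frames  # shape: (n_frames, frame_len); each row is one data sample

clip = np.random.randn(16000 * 10)  # a 10-second clip
print(frame_audio(clip).shape)      # (499, 640): roughly 500 frame-level samples
```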
“…However, these models discard the temporal order of frame-level features in their construction, leading to unsatisfactory final results. Different scholars have adopted different solutions to this type of time-series information loss; for example, Pablo used end-to-end neural networks to solve this type of problem [20], and Kong proposed an attention model and explained it from a novel probabilistic perspective [21]. In addition, he proposed a convolutional neural network Transformer (CNN-Transformer) for audio tagging and SED and showed that the performance of the CNN-Transformer is similar to that of the CRNN [22].…”
Section: Related Work In Sound Event Detection Systems
confidence: 99%
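The attention model attributed to Kong [21] aggregates frame-level predictions into a clip-level prediction with learned attention weights, which is what allows training from weak (clip-level) labels only. A minimal sketch of such decision-level attention pooling follows; the layer sizes and the softmax-over-time normalisation are assumptions for illustration, not the exact formulation of [21].

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, d_model: int, n_classes: int):
        super().__init__()
        self.cla = nn.Linear(d_model, n_classes)  # frame-wise class probabilities
        self.att = nn.Linear(d_model, n_classes)  # frame-wise attention logits

    def forward(self, h):
        # h: (batch, time, d_model) frame-level embeddings
        frame_prob = torch.sigmoid(self.cla(h))    # (batch, time, classes)
        att = torch.softmax(self.att(h), dim=1)    # normalise attention over time
        clip_prob = (att * frame_prob).sum(dim=1)  # weighted average -> (batch, classes)
        return clip_prob, frame_prob

pool = AttentionPooling(d_model=128, n_classes=10)
clip_prob, frame_prob = pool(torch.randn(4, 500, 128))
print(clip_prob.shape, frame_prob.shape)  # torch.Size([4, 10]) torch.Size([4, 500, 10])
```

Only clip_prob is compared against the weak labels during training, while frame_prob provides the onset/offset estimates at inference time.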
“…All the experiments are repeated 3 times, and we report the best result of each model. In the post-processing phase, we use the same post-processing method as in [15] to obtain frame-level predictions and clip-level predictions.…”
Section: Model and Experimental Setup
confidence: 99%
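The specific post-processing of [15] is not spelled out in the statement above; the sketch below shows a generic version of the usual pipeline: smooth frame-level probabilities with a median filter, threshold them into binary frame-level predictions, and take the maximum over frames as the clip-level prediction. The filter size and threshold are assumptions, not the cited settings.

```python
import numpy as np
from scipy.ndimage import median_filter

def post_process(frame_prob: np.ndarray, threshold: float = 0.5,
                 filter_size: int = 7):
    """frame_prob: (time, classes) array of frame-level probabilities."""
    smoothed = median_filter(frame_prob, size=(filter_size, 1))  # smooth along time
    frame_pred = (smoothed > threshold).astype(int)  # frame-level predictions
    clip_pred = frame_pred.max(axis=0)               # clip-level predictions
    return frame_pred, clip_pred

frame_pred, clip_pred = post_process(np.random.rand(500, 10))
print(frame_pred.shape, clip_pred.shape)  # (500, 10) (10,)
```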