Polyphonic Sound Event Detection and Localization using a Two-Stage Strategy

Cao, Yin; Kong, Qiuqiang; Iqbal, Turab; An, Fengyan; Wang, Wenwu; Plumbley, Mark D.

doi:10.33682/4jhy-bj81

Cited by 90 publications

(121 citation statements)

References 22 publications

Supporting

Mentioning: 111

Contrasting


“…Input features are logscale Mel-magnitude spectrogram (logmels) and Generalized Cross-Correlation Phase Transform (GCC-PHAT) of the mutual channels, as in [12,26]. All wav-files were downsampled at a sampling rate of 32 kHz.…”

[Footnote 2 of the quoted passage: “Since distance from the listener is not relevant for the task, when converting to and from cartesian coordinates, we always assume the norm r = 1, that is we consider direction of arrivals as points on the unit sphere.”]

Section: Methods (mentioning)

confidence: 99%
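The unit-sphere convention in the footnote quoted above can be illustrated with a minimal sketch. It is not taken from the citing paper: the function names are hypothetical, and the angle convention (azimuth measured in the horizontal plane from the x-axis, elevation measured from that plane, both in degrees) is an assumption.

```python
import math


def sph_to_cart(azimuth_deg, elevation_deg):
    """Map a direction of arrival to a point on the unit sphere (r = 1).

    Assumed convention: azimuth from the x-axis in the horizontal plane,
    elevation from the horizontal plane, both in degrees.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    x = math.cos(el) * math.cos(az)
    y = math.cos(el) * math.sin(az)
    z = math.sin(el)
    return x, y, z


def cart_to_sph(x, y, z):
    """Recover (azimuth, elevation) in degrees; the norm is discarded."""
    norm = math.sqrt(x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("the zero vector has no direction")
    # Clamp to guard against rounding slightly outside [-1, 1].
    sin_el = max(-1.0, min(1.0, z / norm))
    azimuth = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.asin(sin_el))
    return azimuth, elevation
```

Because the norm is discarded on the way back, any network output vector, normalized or not, maps to a valid direction, which is one reason a fixed r = 1 is convenient when distance is not part of the task.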

“…The approaches that have been adopted to solve this problem can be classified in two main categories: parametric-based methods, like multiple signal classification (MUSIC) [1] and others [2][3][4], and deep neural network (DNN)-based methods [5][6][7][8][9][10][11][12][13][14][15][16][17]. DNN-based models often combine DOA estimation with other tasks such as sound activity detection (SAD), estimation of number of active sources and sound event detection (SED) [11][12][13]. In particular, Sound Event Localization and Detection was the task 3 of Detection and Classification of Acoustic Scenes and Events 2019 Challenge (DCASE2019 Challenge) [18].…”

Section: Introduction (mentioning)

confidence: 99%


First Order Ambisonics Domain Spatial Augmentation for DNN-based Direction of Arrival Estimation

Mazzon, Yasuda, Harada

2019

Proceedings of the Detection and Classification of Acoustic Scenes And Events 2019 Workshop (DCASE2019)


In this paper, we propose a novel data augmentation method for training neural networks for Direction of Arrival (DOA) estimation. This method focuses on expanding the representation of the DOA subspace of a dataset. Given some input data, it applies a transformation that changes the DOA information and simulates new, potentially unseen directions. Such a transformation is, in general, a combination of a rotation and a reflection. It can be applied thanks to a well-known property of First Order Ambisonics (FOA). The same transformation is also applied to the labels, in order to maintain consistency between input data and target labels. Three methods with different levels of generality are proposed for applying this augmentation principle. Experiments are conducted on two different DOA networks. Results of both experiments demonstrate the effectiveness of the novel augmentation strategy by improving the DOA error by around 40%.
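The rotation-plus-reflection idea in this abstract can be sketched for one special case: a rotation about the vertical axis combined with an optional up/down reflection. This is not the authors' implementation. The function names are hypothetical, the plane-wave encoding is idealized (channel normalization and ordering conventions are ignored), and the angle convention is an assumption.

```python
import math


def encode_foa(signal, azimuth_deg, elevation_deg):
    """Idealized plane-wave FOA encoding (assumed; no normalization gains)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = [s for s in signal]
    x = [s * math.cos(el) * math.cos(az) for s in signal]
    y = [s * math.cos(el) * math.sin(az) for s in signal]
    z = [s * math.sin(el) for s in signal]
    return w, x, y, z


def rotate_foa_z(w, x, y, z, theta_deg, flip_elevation=False):
    """Rotate the sound field about the vertical axis by theta_deg.

    With flip_elevation=True the field is also mirrored up/down.
    The omnidirectional channel W is unchanged by either operation.
    """
    c = math.cos(math.radians(theta_deg))
    s = math.sin(math.radians(theta_deg))
    x_new = [c * xi - s * yi for xi, yi in zip(x, y)]
    y_new = [s * xi + c * yi for xi, yi in zip(x, y)]
    z_new = [-zi for zi in z] if flip_elevation else list(z)
    return list(w), x_new, y_new, z_new


def transform_label(azimuth_deg, elevation_deg, theta_deg, flip_elevation=False):
    """Apply the same transformation to the DOA label.

    Azimuth is wrapped to the half-open interval [-180, 180).
    """
    az = (azimuth_deg + theta_deg + 180.0) % 360.0 - 180.0
    el = -elevation_deg if flip_elevation else elevation_deg
    return az, el
```

Under these assumptions, transforming the channels of a source encoded at one direction gives the same result as encoding that source directly at the transformed label, which is the consistency between input data and target labels that the abstract describes.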


“…Cao et al (Cao Surrey) [19], had the second best performing system, following the first one closely. However, the authors kept the general SELDnet architecture and advanced it with a number of informed domain-specific choices.…”

Section: B Analysis Of Individual Systems (mentioning)

confidence: 99%

“…Additionally, they used both FOA and MIC input and ensemble averaging. According to ablation studies in [19], the better input features and the two-stage training architecture have a drastic effect in performance.…”

Section: B Analysis Of Individual Systems (mentioning)

confidence: 99%

Overview and Evaluation of Sound Event Localization and Detection in DCASE 2019

Politis, Mesaros, Adavanne, et al.

2021

IEEE/ACM Trans. Audio Speech Lang. Process.


Sound event localization and detection is a novel area of research that emerged from the combined interest of analyzing the acoustic scene in terms of the spatial and temporal activity of sounds of interest. This paper presents an overview of the first international evaluation on sound event localization and detection, organized as a task of the DCASE 2019 Challenge. A large-scale realistic dataset of spatialized sound events was generated for the challenge, to be used for training of learning-based approaches, and for evaluation of the submissions in an unlabeled subset. The overview presents in detail how the systems were evaluated and ranked and the characteristics of the best-performing systems. Common strategies in terms of input features, model architectures, training approaches, exploitation of prior knowledge, and data augmentation are discussed. Since ranking in the challenge was based on individually evaluating localization and event classification performance, part of the overview focuses on presenting metrics for the joint measurement of the two, together with a reevaluation of submissions using these new metrics. The new analysis reveals submissions that performed better on the joint task of detecting the correct type of event close to its original location than some of the submissions that were ranked higher in the challenge. Consequently, ranking of submissions which performed strongly when evaluated separately on detection or localization, but not jointly on both, was affected negatively.


“…Although Sound Event Detection and Localization (SEDL), as well as anomalous SED, appear to fit the criteria mentioned above, it is essential to point out that these two areas belong to different domains. SEDL is the combined task of identifying temporal activities of each sound event and the estimation of their respective spatial location trajectories when active [33][34][35].…”

Section: Introduction (mentioning)

confidence: 99%

A Comprehensive Review of Polyphonic Sound Event Detection

Chan, Chin

2020

IEEE Access


One of the most amazing functions of the human auditory system is the ability to detect all kinds of sound events in the environment. With advances in technology and hardware, polyphonic Sound Event Detection (SED) can be developed to mimic this ability. However, developing an SED system is no trivial task, and several different factors often hinder accuracy. Although there are several overview papers available, most of them only provide a theoretical overview of the algorithms used, with little discussion. Thus, to the best of the authors' knowledge, there is no comprehensive review that covers this particular domain. Therefore, this paper aims to provide an in-depth discussion of the different methodologies proposed by various authors, including the features used, the detection algorithms, and their corresponding accuracy and limitations. Possible trends that may be useful for future development work are also discussed.
