This paper presents and discusses various metrics proposed for the evaluation of polyphonic sound event detection systems used in realistic situations, where multiple sound sources are typically active simultaneously. The system output in this case contains overlapping events, i.e., multiple sounds detected as active at the same time. Evaluating such polyphonic output against a reference requires a suitable procedure. Metrics from neighboring fields such as speech recognition and speaker diarization can be used, but they must be partially redefined to handle overlapping events. We present a review of the most common metrics in the field and the way they are adapted and interpreted in the polyphonic case. We discuss segment-based and event-based definitions of each metric and explain the consequences of instance-based and class-based averaging using a case study. In parallel, we provide a toolbox containing implementations of the presented metrics.
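The segment-based evaluation discussed above can be illustrated with a minimal sketch: the timeline is divided into fixed-length segments, the set of active classes in the reference and in the system output is compared per segment, and the accumulated counts yield a single F-score. The event tuple format, the one-second segment length, and the function names below are illustrative assumptions, not the actual API of the accompanying toolbox.

```python
import math

def active_classes(events, seg_start, seg_end):
    # A class counts as active in a segment if any of its events overlaps it.
    return {label for onset, offset, label in events
            if onset < seg_end and offset > seg_start}

def segment_based_f1(reference, estimated, duration, seg_len=1.0):
    # Accumulate segment-wise true positives, false positives and false
    # negatives over the whole recording, then form a single F-score.
    tp = fp = fn = 0
    for i in range(int(math.ceil(duration / seg_len))):
        seg_start, seg_end = i * seg_len, (i + 1) * seg_len
        ref = active_classes(reference, seg_start, seg_end)
        est = active_classes(estimated, seg_start, seg_end)
        tp += len(ref & est)
        fp += len(est - ref)
        fn += len(ref - est)
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 1.0

# Events as (onset, offset, label); the reference contains two overlapping events.
reference = [(0.0, 2.0, "speech"), (1.0, 3.0, "car")]
estimated = [(0.0, 3.0, "speech")]
print(segment_based_f1(reference, estimated, duration=3.0))
```

With three one-second segments the example yields TP=2, FP=1, FN=2, i.e. F1 = 4/7 ≈ 0.571; note how the overlapping reference events make per-segment comparison a set operation rather than a single-label match.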
Public evaluation campaigns and datasets promote active development in target research areas, allowing direct comparison of algorithms. The second edition of the challenge on Detection and Classification of Acoustic Scenes and Events (DCASE 2016) offered such an opportunity for development of state-of-the-art methods, and succeeded in drawing together a large number of participants from academic and industrial backgrounds. In this paper, we report on the tasks and outcomes of the DCASE 2016 challenge. The challenge comprised four tasks: acoustic scene classification, sound event detection in synthetic audio, sound event detection in real-life audio, and domestic audio tagging. We present each task in detail and analyse the submitted systems in terms of design and performance. We observe the emergence of deep learning as the most popular classification method, replacing the traditional approaches based on Gaussian mixture models and support vector machines. By contrast, feature representations have not changed substantially throughout the years, as mel frequency-based representations predominate in all tasks. The datasets created for and used in DCASE 2016 are publicly available and are a valuable resource for further research.
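The mel frequency-based representations mentioned above are built from a bank of triangular filters whose centers are equally spaced on the mel scale. A plain-Python sketch is given below; the filter count, FFT size, and sample rate are arbitrary illustrative values, not DCASE baseline settings.

```python
import math

def hz_to_mel(f):
    # HTK-style mel scale.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sample_rate=16000):
    # Filter center frequencies are equally spaced on the mel scale
    # between 0 Hz and the Nyquist frequency.
    high = hz_to_mel(sample_rate / 2.0)
    mel_points = [i * high / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate))
            for m in mel_points]
    fbank = [[0.0] * (n_fft // 2 + 1) for _ in range(n_filters)]
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        for k in range(left, center):          # rising slope
            if center > left:
                fbank[j - 1][k] = (k - left) / (center - left)
        for k in range(center, right):         # falling slope
            if right > center:
                fbank[j - 1][k] = (right - k) / (right - center)
    return fbank
```

Multiplying a power spectrum frame by each row of the filterbank and taking logarithms gives the log-mel energies that most submitted systems (and MFCC front ends) start from.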
The work presented in this article studies how context information can be used in the automatic sound event detection process, and how the detection system can benefit from such information. Humans use context information to make more accurate predictions about sound events and to rule out events that are unlikely in a given context. We propose a similar utilization of context information in the automatic sound event detection process. The proposed approach is composed of two stages: an automatic context recognition stage and a sound event detection stage. Contexts are modeled using Gaussian mixture models and sound events are modeled using three-state left-to-right hidden Markov models. In the first stage, the audio context of the tested signal is recognized. Based on the recognized context, a context-specific set of sound event classes is selected for the sound event detection stage. The event detection stage also uses context-dependent acoustic models and count-based event priors. Two alternative event detection approaches are studied. In the first one, a monophonic event sequence is produced by detecting the most prominent sound event at each time instance using Viterbi decoding. The second approach introduces a new method for producing a polyphonic event sequence by detecting multiple overlapping sound events using multiple restricted Viterbi passes. A new metric is introduced to evaluate the sound event detection performance at various levels of polyphony. It combines the detection accuracy and a coarse time-resolution error into one number, making the comparison of detection algorithms simpler. The two-stage approach was found to improve the results substantially compared to the context-independent baseline system. At the block level, the detection accuracy can be almost doubled by using the proposed context-dependent event detection.
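The monophonic decoding step described above can be illustrated with a generic Viterbi pass over per-frame event log-likelihoods: transition penalties smooth the frame-wise decisions into a coherent event sequence. The toy models and numbers below are invented for illustration and stand in for the paper's context-dependent HMMs and count-based event priors.

```python
def viterbi(obs_loglik, log_trans, log_init):
    """Most likely event sequence given per-frame log-likelihoods.

    obs_loglik: T x N table, log-likelihood of each of N event models per frame.
    log_trans:  N x N log transition probabilities between event models.
    log_init:   length-N log prior over event models (e.g. count-based priors).
    """
    n = len(log_init)
    delta = [log_init[j] + obs_loglik[0][j] for j in range(n)]
    backptr = []
    for frame in obs_loglik[1:]:
        # For each state, remember which predecessor maximizes the path score.
        ptr = [max(range(n), key=lambda i: delta[i] + log_trans[i][j])
               for j in range(n)]
        delta = [delta[ptr[j]] + log_trans[ptr[j]][j] + frame[j]
                 for j in range(n)]
        backptr.append(ptr)
    # Backtrack from the best final state.
    path = [max(range(n), key=delta.__getitem__)]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1]

# Two hypothetical event models ("speech", "car"), three frames; transitions
# favor staying in the same event, so a weak single-frame change is smoothed.
obs = [[0.0, -2.0], [0.0, -2.0], [-5.0, 0.0]]
trans = [[-0.1, -2.3], [-2.3, -0.1]]
init = [-0.7, -0.7]
labels = ["speech", "car"]
print([labels[j] for j in viterbi(obs, trans, init)])
```

Running several such passes while removing (restricting) already-detected events is the intuition behind the polyphonic variant described in the abstract; the sketch above covers only a single monophonic pass.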