Despite the success of the automatic speech recognition framework in its own application field, its adaptation to the problem of acoustic event detection has met with limited success. In this article, instead of treating the problem similarly to the segmentation and classification tasks in speech recognition, we pose it as a regression task and propose an approach based on random forest regression. Furthermore, event localization in time can be handled efficiently as a joint problem. We first decompose the training audio signals into multiple interleaved superframes, which are annotated with the corresponding event class labels and their displacements to the temporal onsets and offsets of the events. For a specific event category, a random-forest regression model is learned from the displacement information. Given an unseen superframe, the learned regressor outputs continuous estimates of the onset and offset locations of the events. To deal with multiple event categories, prior to the category-specific regression phase, a superframe-wise recognition phase is performed to reject background superframes and to classify event superframes into the different event categories. Beyond the novelty of jointly posing event detection and localization as a regression problem, the superior performance on the two databases ITC-Irst and UPC-TALP demonstrates the efficiency and potential of the proposed approach.
Index Terms: acoustic event detection, regression forest, random forest, superframe.
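The two-phase pipeline described in the abstract can be sketched in a few lines. This is a minimal illustration on synthetic data with a single event category: the feature dimensionality, displacement ranges, and forest sizes are all hypothetical stand-ins, and scikit-learn's random forests substitute for the paper's trained models.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)

# Toy training set: 200 superframes with 20-dimensional acoustic features.
X = rng.normal(size=(200, 20))
is_event = rng.integers(0, 2, size=200)        # 0 = background, 1 = event
# Per-superframe displacements (seconds) to the event onset and offset.
d = rng.uniform(0.0, 1.5, size=(200, 2))

# Phase 1: superframe-wise recognition rejects background superframes.
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, is_event)
# Phase 2: category-specific regression learned from displacement targets.
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(
    X[is_event == 1], d[is_event == 1])

# Inference on an unseen superframe.
x_new = rng.normal(size=(1, 20))
label = clf.predict(x_new)[0]
onset_disp, offset_disp = reg.predict(x_new)[0]  # continuous estimates
```

In the multi-category setting one such regressor would be trained per event class, with the phase-1 classifier routing each event superframe to its class-specific model.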
We introduce in this work an efficient approach for audio scene classification using deep recurrent neural networks. An audio scene is first transformed into a sequence of high-level label tree embedding feature vectors. The vector sequence is then divided into multiple subsequences, on which a deep GRU-based recurrent neural network is trained for sequence-to-label classification. The global predicted label for the entire sequence is finally obtained by aggregating the subsequence classification outputs. We show that our approach obtains an F1-score of 97.7% on the LITIS Rouen dataset, the largest dataset publicly available for the task. Compared to the best previously reported result on the dataset, our approach reduces the relative classification error by 35.3%.
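The subsequence-and-aggregation scheme can be sketched independently of the network itself. In the minimal NumPy sketch below, the trained GRU classifier is replaced by a hypothetical stand-in scoring function, and the subsequence length, hop size, and number of classes are illustrative assumptions, not values from the paper.

```python
import numpy as np

def subsequences(seq, length, hop):
    """Slice a (T, D) feature sequence into overlapping subsequences."""
    return [seq[t:t + length] for t in range(0, len(seq) - length + 1, hop)]

def classify_subsequence(sub, n_classes=3):
    """Placeholder for the trained GRU classifier: returns a posterior."""
    logits = sub.mean(axis=0)[:n_classes]      # hypothetical scoring rule
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
scene = rng.normal(size=(100, 16))             # toy label-tree-embedding sequence
probs = [classify_subsequence(s) for s in subsequences(scene, 20, 10)]
scene_label = int(np.argmax(np.mean(probs, axis=0)))  # aggregate posteriors
```

Averaging the subsequence posteriors before taking the argmax is one common aggregation choice; majority voting over subsequence labels is an alternative.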
Virtual 3-D sound can easily be delivered to a listener by binaural audio signals reproduced via headphones, which guarantees that only the correct signals reach the corresponding ears. Reproducing the binaural audio signal with two or more loudspeakers introduces the problems of crosstalk on the one hand and reverberation on the other. In crosstalk cancellation, the audio signals are fed through a network of prefilters prior to loudspeaker reproduction to ensure that only the designated signal reaches the corresponding ear of the listener. Since room impulse responses are very sensitive to spatial mismatch, and since listeners might move slightly while listening, robust designs are needed. In this paper, we present a method that jointly handles the three problems of crosstalk, reverberation reduction, and spatial robustness with respect to varying listening positions for one or more binaural source signals and multiple listeners. The proposed method is based on a multichannel room impulse response reshaping approach that optimizes a p-norm based criterion. Replacing the well-known least-squares technique by a p-norm based method employing a large value for p allows us to explicitly control the amount of crosstalk and to shape the remaining reverberation effects according to a desired decay.
The purpose of room impulse response reshaping is to reduce reverberation and thus to improve the perceived quality of the received signal by prefiltering the source signal before it is played through a loudspeaker. The filter design is usually carried out by solving an optimization problem. There are, in general, two possibilities for improving the robustness of the equalizers against small movements of the listener and/or receiver: multi-position approaches or the use of a regularization term. Multi-position approaches suffer from the extensive effort of measuring multiple room impulse responses. Stochastic models may describe the average system error due to spatial mismatch, but only quadratic penalty terms have been considered so far. In this contribution we propose a third method to improve robustness against spatial misalignment. We combine the two approaches by generating multiple realizations of distorted room impulse responses and feeding them into the multi-position algorithm. Based on our previous work, we propose a model to capture the perturbations with respect to the assumed displacement.
Index Terms: room impulse response, RIR reshaping, p-norm, spatial robustness.
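The p-norm reshaping criterion underlying both abstracts can be illustrated numerically. The sketch below uses a toy decaying impulse response and hypothetical window lengths: the global response g = h * c is split by windows into a desired (early) part and an unwanted (late) part, and the reshaping design minimizes the log-ratio of their p-norms, where a large p emphasizes the largest unwanted taps.

```python
import numpy as np

def pnorm(x, p):
    """Standard p-norm of a vector."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

rng = np.random.default_rng(0)
h = rng.normal(size=64) * np.exp(-0.1 * np.arange(64))  # toy decaying RIR
c = np.zeros(16)
c[0] = 1.0                                              # trivial prefilter

g = np.convolve(h, c)                  # global impulse response
w_d = np.zeros_like(g)
w_d[:8] = 1.0                          # desired (early) window, hypothetical length
w_u = 1.0 - w_d                        # unwanted (late / reverberant) window

p = 10                                 # large p approximates the max-norm
criterion = np.log(pnorm(w_u * g, p)) - np.log(pnorm(w_d * g, p))
```

An actual design would minimize this criterion over the prefilter taps c (here fixed to a unit impulse for illustration), and the multi-position variant would sum the criterion over several measured or perturbed impulse responses.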