This work investigates how to detect emergency vehicles such as ambulances, fire engines, and police cars based on their siren sounds. Recognizing that car drivers may sometimes be unaware of siren warnings from emergency vehicles, especially when in-vehicle audio systems are in use, we propose an automatic detection system that determines whether siren sounds from emergency vehicles are present nearby, so that other drivers can be alerted to pay attention. A convolutional neural network (CNN)-based ensemble model (SirenNet) with two network streams is designed to classify traffic soundscape sounds into siren sounds, vehicle horns, and noise: the first stream (WaveNet) directly processes the raw waveform, while the second (MLNet) works with a combined feature formed from MFCC (Mel-frequency cepstral coefficients) and the log-mel spectrogram. Our experiments on a diverse dataset show that the raw data can complement the MFCC and log-mel features, achieving a promising accuracy of 98.24% in siren sound detection. In addition, the proposed system works well with variable-length input: even for short samples of 0.25 seconds, it still achieves a high accuracy of 96.89%. The proposed system could be helpful not only for drivers but also for autopilot systems.
INDEX TERMS Audio recognition, convolutional neural networks, emergency vehicle detection, siren sounds.
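The two-stream ensemble described above can be illustrated as late fusion of per-stream class probabilities over the three target classes. The sketch below is a minimal illustration, not the paper's trained networks: the stream outputs and fusion weights are hypothetical placeholders standing in for the WaveNet and MLNet predictions.

```python
import numpy as np

def late_fusion(probs_wave, probs_mel, w_wave=0.5, w_mel=0.5):
    """Combine class probabilities from the two streams.

    The equal weights here are an assumption for illustration; an actual
    ensemble could learn or tune them on validation data.
    """
    fused = w_wave * np.asarray(probs_wave) + w_mel * np.asarray(probs_mel)
    return fused / fused.sum(axis=-1, keepdims=True)

# Three classes, per the abstract: siren, vehicle horn, noise.
CLASSES = ["siren", "horn", "noise"]

# Placeholder per-clip outputs standing in for the WaveNet (raw waveform)
# stream and the MLNet (MFCC + log-mel) stream.
p_wave = np.array([0.70, 0.20, 0.10])
p_mel = np.array([0.55, 0.15, 0.30])

fused = late_fusion(p_wave, p_mel)
print(CLASSES[int(np.argmax(fused))])  # prints "siren"
```

Late fusion of this kind lets one stream compensate for the other on clips where its features are less informative, which is consistent with the abstract's observation that raw waveform data complements the MFCC and log-mel features.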
Although numerous works have studied the problem of automatic speaker identification (SID), there are only a few works on SID for overlapping speech, and none of them consider the case of more than two simultaneous speakers. Recognizing that overlapping speech occurs frequently in real-life scenarios, such as meetings or debates, this work investigates methods for overlapping SID (OSID) that can determine the identities in overlapping speech from up to five simultaneous speakers. We propose two deep-learning OSID systems, one two-stage and one single-stage. The two-stage system first determines the number of simultaneous speakers and then identifies the speaker(s). The single-stage system uses a single classifier to perform OSID directly, which is slightly more computationally efficient than the two-stage system. Our experiments show that the two-stage OSID system achieves better identification accuracy than the single-stage system. In addition, both OSID systems based on one-dimensional convolutional neural networks (1DCNN) outperform systems based on multilayer perceptrons (MLP) and Gaussian mixture models (GMMs). The proposed 1DCNN-based two-stage OSID system achieves 98.55% OSID accuracy on clean audio data containing up to five simultaneous speakers. In more challenging experimental conditions involving both background noise and high overlapping energy ratios, the system still attains accuracies above 90%.
INDEX TERMS Overlapping speech, speaker identification, simultaneous speakers, neural networks, Gaussian mixture models.
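The two-stage pipeline described above can be sketched as a speaker-count estimator followed by a per-count identification step. The sketch below is a minimal illustration under stated assumptions: the model outputs are hypothetical placeholders, not the paper's trained 1DCNN classifiers, and the enrolled-speaker count of six is invented for the example.

```python
import numpy as np

def count_speakers(count_probs):
    """Stage 1: pick the most likely number of simultaneous speakers (1..5)."""
    return int(np.argmax(count_probs)) + 1

def identify_speakers(speaker_scores, k):
    """Stage 2: return the k highest-scoring enrolled speaker identities."""
    order = np.argsort(speaker_scores)[::-1]  # indices sorted by descending score
    return sorted(int(i) for i in order[:k])

# Placeholder model outputs for one overlapping-speech segment:
count_probs = np.array([0.05, 0.10, 0.70, 0.10, 0.05])     # P(1..5 speakers)
speaker_scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3])  # six enrolled speakers

k = count_speakers(count_probs)              # estimated speaker count: 3
ids = identify_speakers(speaker_scores, k)   # identities [0, 2, 4]
print(k, ids)  # prints "3 [0, 2, 4]"
```

A single-stage system would instead produce the identity set in one pass (e.g. by thresholding per-speaker scores), trading the explicit count estimate for a slightly cheaper single classifier, which matches the efficiency comparison in the abstract.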