Sound-event recognition often uses time-frequency analysis to produce an image-like spectrogram, which provides a rich visual representation of the original signal in time and frequency. Convolutional Neural Networks (CNNs), which can learn discriminative spectrogram patterns, are well suited to sound-event recognition. However, relatively little work has enabled CNNs to make full use of the important temporal information. In this paper, we propose MCRNN, a Convolutional Recurrent Neural Network (CRNN) architecture for sound-event recognition. The letter "M" in the name "MCRNN" denotes the multi-sized convolution filters: richer features are extracted by using several different convolution filter sizes at the last convolution layer. In addition, cochleagram images are used as the network input instead of the traditional spectrogram image of a sound signal. Experiments on the RWCP dataset show that the proposed method achieves a recognition rate of 98.4% in clean conditions and robustly outperforms existing methods in noise: the recognition rate increases by 0.9%, 1.9%, and 10.3% at signal-to-noise ratios (SNR) of 20 dB, 10 dB, and 0 dB, respectively.
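The noisy test conditions are specified by SNR, so it may help to see how a noisy sample at a given SNR can be constructed. The following is a minimal sketch, not the authors' evaluation code: it assumes noise is added to a clean signal after scaling it so that 10·log10(P_clean / P_noise) equals the target SNR. The helper names (`mean_power`, `mix_at_snr`, `measured_snr_db`) are hypothetical, and the sine tone and Gaussian noise are stand-ins for real recordings, not RWCP data.

```python
import math
import random


def mean_power(samples):
    """Average power (mean of squared samples) of a signal."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that adding it to `clean` yields the requested SNR (dB).

    Returns the mixed signal and the scaled noise that was added.
    """
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    p_clean = mean_power(clean)
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise must not be silent")
    # SNR_dB = 10 * log10(P_clean / (gain^2 * P_noise)); solve for gain.
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    scaled = [gain * n for n in noise]
    mixed = [c + n for c, n in zip(clean, scaled)]
    return mixed, scaled


def measured_snr_db(clean, scaled_noise):
    """SNR in dB computed from the clean signal and the noise actually added."""
    return 10 * math.log10(mean_power(clean) / mean_power(scaled_noise))


# Stand-in signals (assumption): a 440 Hz tone sampled at 16 kHz and white noise.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

# The three noisy conditions named in the abstract.
for snr in (20, 10, 0):
    mixed, scaled = mix_at_snr(clean, noise, snr)
    print(snr, round(measured_snr_db(clean, scaled), 6))
```

At 0 dB the added noise carries as much power as the signal itself, which is why that condition is the hardest of the three.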