There are two major questions regarding Environmental Sound Classification (ESC): what is the best audio recognition framework, and which audio feature is most robust? To investigate these questions, this paper uses a Gated Recurrent Unit (GRU) network to analyze the effect of single features, namely the Mel-scale spectrogram (Mel), the log-Mel-scale spectrogram (LM), and Mel-frequency cepstral coefficients (MFCC), as well as the multi-features Mel-MFCC, LM-MFCC, and Mel-LM-MFCC (T-M). The experimental results show that in ESC tasks, multi-features outperform single features of the same dimensions, and LM-MFCC is the most robust. In addition, reverse-sequence MFCC (R-MFCC) and mixed forward-and-reverse-sequence MFCC (FR-MFCC) are proposed to study the effect of sequence changes on audio; the results show that such sequence transformations have little influence on the recognition task. Furthermore, to investigate the ESC task more deeply, an attention-weight-similar (AWS) model is introduced into the multi-feature setting. The AWS model allows the attention weights of different audio features of the same sound to learn from each other, so that the GRU-AWS model focuses on frame-level features more effectively. The experimental results show that GRU-AWS achieves excellent performance with a recognition rate of 94.3%, outperforming other state-of-the-art methods.
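To make the feature families concrete, the following is a minimal NumPy-only sketch of how Mel, LM, and MFCC features can be computed from a waveform and then combined into a multi-feature (here LM-MFCC, concatenated along the feature axis) or reversed in time (R-MFCC). All parameter values (sample rate, FFT size, hop length, number of mel bands and coefficients) are illustrative assumptions, not the paper's settings; a practical pipeline would typically use an audio library rather than this hand-rolled filterbank and DCT.

```python
import numpy as np

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular filters spaced evenly on the mel scale (illustrative)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * inv_mel(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)   # rising slope
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)   # falling slope
    return fb

def features(y, sr=16000, n_fft=512, hop=256, n_mels=40, n_mfcc=13):
    """Return (Mel, LM, MFCC) matrices of shape (frames, bands/coeffs)."""
    n_frames = 1 + (len(y) - n_fft) // hop
    window = np.hanning(n_fft)
    frames = np.stack([y[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2        # power spectrum
    mel_spec = power @ mel_filterbank(sr, n_fft, n_mels).T  # Mel
    lm = np.log(mel_spec + 1e-10)                           # Log-Mel
    # DCT-II over the mel bands yields the cepstral coefficients (MFCC)
    n = np.arange(n_mels)
    dct = np.cos(np.pi / n_mels * (n[None, :] + 0.5)
                 * np.arange(n_mfcc)[:, None])
    mfcc = lm @ dct.T
    return mel_spec, lm, mfcc

# 1-second 440 Hz test tone in place of a real environmental recording
y = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
mel_spec, lm, mfcc = features(y)
lm_mfcc = np.concatenate([lm, mfcc], axis=1)  # multi-feature LM-MFCC
r_mfcc = mfcc[::-1]                           # reverse-sequence R-MFCC
```

The multi-feature here is a simple frame-wise concatenation, and R-MFCC simply flips the frame order; both are assumptions about the construction, since the abstract does not spell out the exact combination scheme.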