Automatic speech emotion recognition is a challenging task in machine learning, mainly because expressions of the same emotion cue vary significantly across individuals. The success of emotion recognition with machine learning techniques depends primarily on the feature set chosen for learning, yet formulating a feature set that captures all the variations in emotion cues is not a trivial task. Recent work on emotion recognition with deep learning therefore focuses on end-to-end learning schemes, which identify features directly from the raw speech signal instead of relying on hand-designed feature sets. Existing methods in this scheme, however, do not account for the fact that speech signals often exhibit more significant features at different time scales and frequencies than in their raw form. To address this issue, this work proposes an end-to-end neural network model, the Multi-scale Convolution Neural Network (MCNN), that automatically identifies features at different time scales and frequencies of the raw speech signal. The proposed model further leverages a multi-branch input layer and tunable convolution layers to learn the identified features and then recognizes the emotion cues accordingly. The MCNN method is evaluated on the SAVEE emotion database, and the results show that it improves emotion recognition accuracy significantly compared to existing methods.
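The multi-branch idea described above (parallel convolutions over the raw waveform, one branch per time scale) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the kernel sizes, random stand-in weights, ReLU activation, and global average pooling are all assumptions made for the example.

```python
import numpy as np

def conv1d(signal, kernel):
    # Valid-mode 1D convolution; stands in for a single learned filter.
    return np.convolve(signal, kernel, mode="valid")

def multi_scale_features(signal, kernel_sizes=(8, 32, 128)):
    """Sketch of a multi-branch input layer: one branch per time scale.

    Each branch convolves the raw waveform with a filter of a different
    length (short kernels capture fine/high-frequency structure, long
    kernels capture coarse/low-frequency structure), applies a ReLU,
    and global-average-pools to one feature per branch.
    """
    rng = np.random.default_rng(0)
    feats = []
    for k in kernel_sizes:
        # Random weights stand in for filters a real model would learn.
        kernel = rng.standard_normal(k) / np.sqrt(k)
        out = np.maximum(conv1d(signal, kernel), 0.0)  # ReLU
        feats.append(out.mean())                       # global average pool
    return np.array(feats)

# Toy 0.1-second waveform at a 16 kHz sampling rate.
signal = np.sin(np.linspace(0.0, 50.0, 1600))
features = multi_scale_features(signal)
print(features.shape)  # one pooled feature per branch/scale
```

In a trained model these pooled branch outputs would be concatenated and passed to fully connected layers that map them to emotion classes; here the point is only how parallel branches expose structure at several time scales of the same raw signal.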