Speech emotion recognition is a challenging task and an important step towards more natural human-machine interaction. We show that pre-trained language models can be fine-tuned for text emotion recognition, achieving an accuracy of 69.5 % on Task 4A of SemEval 2017 and improving upon the previous state of the art by over 3 % absolute. We combine these language models with a speech emotion recognition model, achieving an accuracy of 73.5 % on a four-class subset of the IEMOCAP dataset when using the provided transcriptions together with the speech data. Using noise-induced transcriptions and speech data instead results in an accuracy of 71.4 %. For our experiments, we created IEmoNet, a modular and adaptable bimodal framework for speech emotion recognition based on pre-trained language models. Lastly, we discuss the idea of using an emotional classifier as a reward signal for reinforcement learning as a step towards more successful and convenient human-machine interaction.
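The abstract describes combining a text branch (a fine-tuned language model) with a speech branch into one bimodal classifier. The sketch below illustrates one common way such a combination can be realized, late fusion of per-modality class probabilities; the abstract does not specify IEmoNet's actual fusion mechanism, so the weighting scheme, the `fuse` helper, and the example logits are illustrative assumptions, not the paper's method.

```python
import numpy as np

# Four-class IEMOCAP subset commonly used in the literature (assumed labels).
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def softmax(logits: np.ndarray) -> np.ndarray:
    # Numerically stable softmax over class logits.
    z = np.exp(logits - logits.max())
    return z / z.sum()

def fuse(text_logits, speech_logits, alpha: float = 0.5):
    """Late fusion: weighted average of per-modality class probabilities.

    alpha weights the text branch; (1 - alpha) weights the speech branch.
    """
    p = alpha * softmax(np.asarray(text_logits, dtype=float)) \
        + (1.0 - alpha) * softmax(np.asarray(speech_logits, dtype=float))
    return EMOTIONS[int(np.argmax(p))], p

# Hypothetical logits from a text branch (fine-tuned LM) and a speech branch.
label, probs = fuse([2.0, 0.1, -0.5, 0.3], [1.2, 0.4, 0.0, -0.1])
print(label)  # → angry (both branches favour the first class here)
```

A weighted average keeps the two branches independent, so either one can be swapped out, which matches the modular framing of IEmoNet in the abstract.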