Speech emotion recognition (SER) aims to automatically recognize emotional category for a given speech utterance. The performance of a SER system heavily relies on the effectiveness of global representation expressed at utterance level. To effectively extract such a global feature, the mainstream of recent SER architectures adopts a pipeline with two key modules, feature extraction and aggregation. Although variant module designs have brought impressive progresses, SER is still a challenging task. In contrast with those previous works, herein we propose a novel strategy for global SER feature extraction by applying an additional enhancement module on top of the current SER pipeline. To verify its effect, an endto-end SER architecture is proposed where stacked multiple transformer layers are explored to enhance the aggregated global feature. Such an architecture is evaluated on IEMO-CAP and results strongly substantiate the effectiveness of our proposal. In terms of weighted accuracy on four emotion categories, our proposed SER system outperforms the prior arts by a large margin of relatively 20% improvement. Our codes and the pre-trained SER models are made publicly available.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.