Zheng-Wei Huang scite author profile

As an essential means of understanding human emotional behavior, speech emotion recognition (SER) has attracted a great deal of attention in human-centered signal processing. Accuracy in SER depends heavily on finding good affect-related, discriminative features. In this paper, we propose to learn affect-salient features for SER using convolutional neural networks (CNN). The training of the CNN involves two stages. In the first stage, unlabeled samples are used to learn local invariant features (LIF) using a variant of the sparse auto-encoder (SAE) with reconstruction penalization. In the second stage, LIF is used as the input to a feature extractor, salient discriminative feature analysis (SDFA), to learn affect-salient, discriminative features using a novel objective function that encourages feature saliency, orthogonality, and discrimination for SER. Our experimental results on benchmark datasets show that our approach leads to stable and robust recognition performance in complex scenes (e.g., with speaker and language variation, and environment distortion) and outperforms several well-established SER features.

Index Terms: Affective-salient discriminative feature analysis, convolutional neural networks, feature learning, speech emotion recognition.
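The first training stage mentioned in the abstract relies on a sparse auto-encoder variant. The abstract does not give the exact formulation, so the following is only a minimal sketch of a generic sparse auto-encoder objective (squared reconstruction error plus a KL-divergence sparsity term), not the authors' variant. The function name, the `rho` and `beta` parameters, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_loss(X, W_enc, b_enc, W_dec, b_dec, rho=0.05, beta=3.0):
    """Generic sparse auto-encoder loss (illustrative, not the paper's variant).

    rho  : assumed target mean activation for each hidden unit
    beta : assumed weight of the sparsity penalty
    """
    H = sigmoid(X @ W_enc + b_enc)          # hidden activations, shape (n, k)
    X_hat = H @ W_dec + b_dec               # linear reconstruction, shape (n, d)
    recon = 0.5 * np.mean(np.sum((X - X_hat) ** 2, axis=1))
    # Mean activation per hidden unit, clipped to keep the logs finite.
    rho_hat = np.clip(H.mean(axis=0), 1e-8, 1 - 1e-8)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return recon + beta * kl

# Toy unlabeled data standing in for speech feature patches (hypothetical).
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 20))
W_enc = rng.normal(scale=0.1, size=(20, 8))
W_dec = rng.normal(scale=0.1, size=(8, 20))
loss = sparse_autoencoder_loss(X, W_enc, np.zeros(8), W_dec, np.zeros(20))
loss_no_sparsity = sparse_autoencoder_loss(
    X, W_enc, np.zeros(8), W_dec, np.zeros(20), beta=0.0
)
```

Because the KL term is non-negative, the loss with `beta > 0` is never smaller than the pure reconstruction loss; minimizing it over unlabeled samples is what pushes hidden units toward sparse activations.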


Speech Emotion Recognition Using CNN

Huang et al. (2014)


Deep learning systems, such as Convolutional Neural Networks (CNNs), can infer a hierarchical representation of input data that facilitates categorization. In this paper, we propose to learn affect-salient features for Speech Emotion Recognition (SER) using a semi-CNN. The training of the semi-CNN has two stages. In the first stage, unlabeled samples are used to learn candidate features with a contractive convolutional neural network with reconstruction penalization. In the second stage, the candidate features are used as the input to the semi-CNN to learn affect-salient, discriminative features using a novel objective function that encourages feature saliency, orthogonality, and discrimination. Our experimental results on benchmark datasets show that our approach leads to stable and robust recognition performance in complex scenes (e.g., with speaker and environment distortion) and outperforms several well-established SER features.
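The abstract says the second-stage objective encourages orthogonality among the learned features but does not state how. One common way to express such a term, shown here purely as an assumed illustration and not as the paper's objective, is the squared Frobenius norm of `W @ W.T - I` for a weight matrix `W` whose rows are the feature directions. The function name and test matrices are hypothetical.

```python
import numpy as np

def orthogonality_penalty(W):
    """Squared Frobenius norm of W W^T - I (assumed form, for illustration).

    The value is zero exactly when the rows of W are orthonormal.
    """
    k = W.shape[0]
    gram = W @ W.T                      # pairwise inner products of the rows
    return float(np.sum((gram - np.eye(k)) ** 2))

# Rows of W_orth are orthonormal: take Q from a QR factorization and transpose.
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(6, 3)))
W_orth = Q.T                            # shape (3, 6)

# Duplicating a row makes two features redundant, which the penalty detects.
W_dup = np.vstack([W_orth[0], W_orth[0], W_orth[2]])

penalty_orth = orthogonality_penalty(W_orth)
penalty_dup = orthogonality_penalty(W_dup)
```

In this toy case the redundant pair of rows contributes two off-diagonal ones to the Gram matrix, so the penalty grows as features overlap, which is the behavior an orthogonality term is meant to discourage.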


Unsupervised domain adaptation for speech emotion recognition using PCANet

Huang, Xue, Mao et al. (2016). Multimed Tools Appl


Speech emotion recognition with unsupervised feature learning

Huang, Xue, Mao (2015). Frontiers Inf Technol Electronic Eng


Learning speech emotion features by joint disentangling-discrimination

Xue, Huang, Luo et al. (2015)



Zheng-Wei Huang

Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks
