2016
DOI: 10.1016/j.engappai.2016.02.018

Multistage data selection-based unsupervised speaker adaptation for personalized speech emotion recognition

Cited by 21 publications (14 citation statements)
References 28 publications
“…First, the Naxi speech signal to be identified is pre-processed and its feature parameters are extracted; these are then matched against the templates stored in the GMM model library, and the template with the highest matching probability is taken as the recognition result [11]. The GMM speaker recognition system process is shown in Fig.…”
Section: Speech Recognition Methods
confidence: 99%
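The GMM matching step quoted above — score a test utterance against every speaker model and pick the highest likelihood — can be sketched in a few lines. This is a minimal illustration, not the cited system: the features are random stand-ins for real acoustic parameters (e.g. MFCCs), and the speaker names, component count, and data shapes are all assumptions.

```python
# Minimal sketch of GMM-based speaker recognition: one GMM per enrolled
# speaker, and a test utterance is assigned to the speaker whose model
# gives the highest average log-likelihood. Feature extraction is assumed
# to have happened already; the arrays here are random stand-ins.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical feature matrices (frames x dims) for two enrolled speakers.
train = {
    "speaker_a": rng.normal(0.0, 1.0, size=(200, 13)),
    "speaker_b": rng.normal(3.0, 1.0, size=(200, 13)),
}

# One GMM per speaker, playing the role of the "GMM model library".
models = {
    name: GaussianMixture(n_components=4, random_state=0).fit(feats)
    for name, feats in train.items()
}

def recognize(features):
    """Return the speaker whose GMM yields the highest average log-likelihood."""
    scores = {name: gmm.score(features) for name, gmm in models.items()}
    return max(scores, key=scores.get)

test_utt = rng.normal(3.0, 1.0, size=(50, 13))  # statistically resembles speaker_b
print(recognize(test_utt))
```

`GaussianMixture.score` returns the per-frame average log-likelihood, so the comparison is independent of utterance length — a common choice in this kind of template matching.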
“…Their system was evaluated on the Arabic Emirati-Accented and SUSAS datasets, obtaining average recognition rates of 83.97% and 86.67%, respectively. Kim and Park [17] proposed a multistage data selection method for speech emotion recognition from previous voice data accumulated on personal devices. Multistage data selection is conducted using a log-likelihood-distance-based measure and a universal background model [17], obtaining an average recognition rate of 83.9%.…”
Section: Literature Review
confidence: 99%
“…The literature shows that many studies obtained higher accuracy by using hybrid classifier models.…”
Section: Literature Review
confidence: 99%
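The selection criterion described above — comparing an utterance's likelihood under a target model against a universal background model (UBM) — can be sketched as a log-likelihood-ratio filter. This is a hedged illustration of the general UBM idea, not Kim and Park's exact multistage procedure: the models, threshold, and data here are all illustrative.

```python
# Sketch of UBM-based data selection: keep a past utterance for adaptation
# only if its average log-likelihood under the target model exceeds its
# likelihood under a universal background model (UBM). All names, shapes,
# and the threshold are assumptions for illustration.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# UBM trained on broad "background" features; target model on matched data.
ubm = GaussianMixture(n_components=4, random_state=0).fit(
    rng.normal(0.0, 2.0, size=(500, 13)))
target = GaussianMixture(n_components=4, random_state=0).fit(
    rng.normal(1.0, 0.5, size=(300, 13)))

def select(utterances, threshold=0.0):
    """Keep utterances whose average log-likelihood ratio vs the UBM is high."""
    kept = []
    for utt in utterances:
        llr = target.score(utt) - ubm.score(utt)  # log-likelihood "distance"
        if llr > threshold:
            kept.append(utt)
    return kept

matched = rng.normal(1.0, 0.5, size=(50, 13))     # resembles the target data
mismatched = rng.normal(-4.0, 2.0, size=(50, 13))  # far from the target model
print(len(select([matched, mismatched])))  # 1: only the matched utterance passes
```

A multistage variant would simply repeat this filter with progressively adapted target models, tightening the selected set at each pass.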
“…To ignite interactions between smart devices and their owners, automatic speaker recognition (ASR) plays an important role in determining speaker identity from a short piece of audio. Moreover, the capability of ASR supports a wide range of applications, such as biometric authentication [23], forensics [10], and personalized services in electronics [13]. In particular, text-independent ASR using only acoustic information is the most general and non-trivial task, and can be used in everyday situations.…”
Section: Introduction
confidence: 99%