Modification of energy spectra, epoch parameters and prosody for emotion conversion in speech

Haque, Abul L.; Rao, K. Sreenivasa

doi:10.1007/s10772-016-9386-9

Cited by 14 publications

(7 citation statements)

References 24 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Accordingly, a subjective evaluation was performed using the comparison mean opinion score (CMOS) of the evaluation measures, and speaker similarity. CMOS/MOS tests have been used for drawing similarities between the synthesized and target emotions [13], [15], [17], [33], [57], [84]- [86], while speaker-similarity scores provide the extent to which the identity of a speaker is preserved after conversion. The ranking scales used for estimating CMOS and speaker similarity are explained in Tables.…”

Section: B Subjective Measuresmentioning

confidence: 99%

Emotional Voice Conversion Using a Hybrid Framework With Speaker-Adaptive DNN and Particle-Swarm-Optimized Neural Network

et al. 2020

View full text Add to dashboard Cite

We propose a hybrid network-based learning framework for speaker-adaptive vocal emotion conversion, tested on three different datasets (languages), namely, EmoDB (German), IITKGP (Telugu), and SAVEE (English). The optimized learning model introduced is unique because of its ability to synthesize emotional speech with an acceptable perceptive quality while preserving speaker characteristics. The multilingual model is extremely beneficial in scenarios wherein emotional training data from a specific target speaker are sparsely available. The proposed model uses speaker-normalized mel-generalized cepstral coefficients for spectral training with data adaptation using the seed data from the target speaker. The fundamental frequency (F0) is transformed using a wavelet synchrosqueezed transform prior to mapping to obtain a sharpened time-frequency representation. Moreover, a feedforward artificial neural network, together with particle swarm optimization, was used for F0 training. Additionally, static-intensity modification was also performed for each test utterance. Using the framework, we were able to capture the spectral and pitch contour variabilities of emotional expression better than with other state-of-the-art methods used in this study. Considering the overall performance scores across datasets, an average melcepstral distortion (MCD) of 4.98 and root mean square error (RMSE-F0) of 10.67 were obtained in objective evaluations, and an average comparative mean opinion score (CMOS) of 3.57 and speaker similarity score of 3.70 were obtained for the proposed framework. Particularly, the best MCD of 4.09 (EmoDB-happiness) and RMSE-F0 of 9.00 (EmoDB-anger) were obtained, along with the maximum CMOS of 3.7 and speaker similarity of 4.6, thereby highlighting the effectiveness of the hybrid network model.

show abstract

Section: B Subjective Measuresmentioning

confidence: 99%

Emotional Voice Conversion Using a Hybrid Framework With Speaker-Adaptive DNN and Particle-Swarm-Optimized Neural Network

et al. 2020

View full text Add to dashboard Cite

show abstract

“…Subjective evaluations have been conducted using the comparative mean opinion score (CMOS) in this work. A CMOS test is conducted for evaluating the similarity of the synthesized speech in relation to the target emotion [22], [24], [26].…”

Section: B Subjective Measuresmentioning

confidence: 99%

Hybrid Framework for Speaker-Independent Emotion Conversion Using i-Vector PLDA and Neural Network

et al. 2019

View full text Add to dashboard Cite

Expressive speech can be synthesized using acoustic feature modeling by mapping the spectral and fundamental frequency parameters between neutral speech and target emotions based on context. Speaker and text-independent emotion conversion are challenging modeling problems in this paradigm. In this paper, spectral mapping using an i-vector-based framework of fixed dimensions is proposed for the speaker-independent emotion conversion, considering the entire problem in the utterance domain, rather than the existing approaches using frame-level processing. The high dimensionality of i-vectors and reduced utterances for i-vector training necessitate the use of Probabilistic Linear Discriminant Analysis (PLDA) to derive the emotion dependent latent vector. The i-vector setup does not require parallel data or alignment procedures at any stage of training. F 0 training is conducted on a multilayer feed-forward neural network using limited aligned seed parallel data. The framework is tested on three different languages (datasets) viz. German (EmoDB), Telugu (IITKGP), and English (SAVEE). The proposed approach delivered superior performance compared to the baseline under both clean and noisy data conditions considered for analysis. Under clean data conditions, the proposed model was found to perform better than the baseline with a Mel Cepstral Distortion as low as 3.8 (fear), an F 0-RMSE of 26.31 (happiness), and a Perceptual Evaluation of Speech Quality (PESQ) of 3.64 (anger) across datasets. Subjective testing provided a maximum CMOS of 4.10 (anger), 4.44 (fear), and 3.43 (happiness). INDEX TERMS CV-GMM, speech emotion, feed-forward ANN, i-vector, MFCC, PLDA. I. INTRODUCTION Emotions form a prominent para-linguistic element of human communication, consisting of speech, facial expressions, gestures, body language, etc. Among these, speech is the most readily accessible information source containing datapoints such as the message conveyed, speaker identity, gender, emotion, and speaker state of health. Emotions animate our speech and is essential for effective dialogue delivery in human-machine interaction and socio-cultural relationships. Furthermore, expressive speech synthesis finds applications The associate editor coordinating the review of this manuscript and approving it for publication was Kathiravan Srinivasan. in story-telling, speaking aids for the disabled [1]-[6], video games and speech-to-speech translators [7] to name a few. An expression synthesis system is normally added as a post-processing stage in text-to-speech synthesis (TTS) systems. There is often a need for a TTS synthesizer tested across multiple languages. This case is particularly relevant in multilingual countries such as India, where 22 official languages [8] exist along-with several other unofficial languages. Effective training of prosodic and spectral parameters from multiple languages is particularly useful in designing affective speech-to-speech translators in lowresource languages. Human-like dialogue delivery often encounters spontaneou...

show abstract

“…RB approach is relatively simple and direct when compared to other methods, the rules employed decides the naturalness and quality of the emotional speech. The rule-based approaches have been used in English [10], Dutch [11], Spanish [12], Catalan [13], German [14], Korean [15] and some Indian languages [16,17,18].…”

Section: Modification Of Prosody For Emotion Conversion Using Gaussian Regression Modelmentioning

confidence: 99%

“…Pathak [34] has used Discrete Wavelet Transform (DWT) for modeling emotional speech of Source and Target speakers. Haque [17] has used spectral energy, epoch strength and epoch sharpness with Pitch and intensity. Filter bank approach was used to modify energy spectra and the pitch contour of target emotion was predicted using Gaussian Normalization and polynomial regression method.…”

Section: Modification Of Prosody For Emotion Conversion Using Gaussian Regression Modelmentioning

confidence: 99%

“…The evaluation of GRM witnessed the important role of pitch contour in the perception of emotions. Haque [17], has also used Gaussian Normalization method to modify the pitch contour of a neutral speech signal to target (Sadness and anger) emotional speech signal. With the pitch contour, the epoch strength, epoch sharpness and spectral energy are modified.…”

Section: B Subjective Evaluationmentioning

confidence: 99%

See 1 more Smart Citation

Modification of Prosody for Emotion Conversion using Gaussian Regression Model

Geethashree¹,

Ravi²

2019

IJRTE

View full text Add to dashboard Cite

Emotion conversion is one of the most inspiring forefronts of research in the arena of emotional speech synthesis. The main focus of the work is to convert a neutral speech sentence to the target emotional speech sentence using signal processing techniques. The parameters used for emotion conversion are pitch contour and intensity along with the duration of the sentence. Kannada Emotional Speech (KES) Database is created and used for analysis. The database consists of 4 (sadness, happy, anger, and fear) emotions with neutral. The pitch contour of different emotional sentences are analyzed and Gaussian Regression Model (GRM) is proposed for predicting the target pitch contour. The evaluation of the proposed method is done using Objective test & Subjective test. For objective test, mean pitch, the standard deviation of pitch, mean intensity and duration of the sentences are used. Evaluation using a subjective test is performed by calculating Emotion Recognition Rate (ERR) with the help of confusion matrix and also by taking the Mean Opinion Score (MOS) rating of the conversion system on the scale of 1-5. The result of Subjective test indicates that the effectiveness and discernment of emotion are improved when GRM is used for pitch contour modification with intensity and duration. The most recognized emotion was sadness with MOS of 3.52 and ERR of 83% and the least recognized emotion was anger with MOS of 1.74 and ERR of 66%. The results of the subjective and objective test show that the converted sadness, happy and fear speech is seeming very close to usual sadness, anger and fear emotion.

show abstract

Modification of energy spectra, epoch parameters and prosody for emotion conversion in speech

Cited by 14 publications

References 24 publications

Emotional Voice Conversion Using a Hybrid Framework With Speaker-Adaptive DNN and Particle-Swarm-Optimized Neural Network

Emotional Voice Conversion Using a Hybrid Framework With Speaker-Adaptive DNN and Particle-Swarm-Optimized Neural Network

Hybrid Framework for Speaker-Independent Emotion Conversion Using i-Vector PLDA and Neural Network

Modification of Prosody for Emotion Conversion using Gaussian Regression Model

Contact Info

Product

Resources

About