2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP)
DOI: 10.1109/iscslp49672.2021.9362098
Estimating Mutual Information in Prosody Representation for Emotional Prosody Transfer in Speech Synthesis

Cited by 8 publications (6 citation statements)
References 22 publications
“…are pivotal in speech recognition, differentiating speech sounds based on their positions and transitions. Although they are not typically regarded as prosodic features, formants are instrumental in recognizing vowels and consonants, providing essential phonetic information in speech analysis [31].…”
Section: Prosodic and Phonetic Regulariser Features's Description
Citation type: mentioning (confidence: 99%)
“…Each prosodic feature was then verified for model fit considering these factors. The goodness of fit of prosodic features for fixed-effect and random-effect variables is given in equation (1).…”
Section: LMM Analysis
Citation type: mentioning (confidence: 99%)
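Equation (1) of the citing paper is not reproduced in this report. For context only, a linear mixed model of the kind this excerpt describes typically takes the following standard form (a generic sketch with assumed notation, not the citing paper's actual equation):

```latex
% Generic linear mixed model (LMM); notation is assumed, not the citing paper's.
% y: observed prosodic feature values; X, Z: fixed- and random-effect design
% matrices; \beta: fixed effects; u: random effects; \varepsilon: residual error.
y = X\beta + Zu + \varepsilon, \qquad
u \sim \mathcal{N}(0, G), \quad
\varepsilon \sim \mathcal{N}(0, R)
```

Goodness of fit is then compared across candidate fixed- and random-effect structures, e.g. by likelihood-ratio tests or information criteria.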
“…The prosodic features employed for emotion recognition play an essential role in the quality of human-computer interaction that replicates human speech emotions. Supra-segmental features, or prosodic features such as intensity, pitch, and duration, contribute additional information to speech known as paralinguistic information [1][2][3][4] and characterize emotional speech. Developing a prosodic model for emotional utterances in less-studied languages is very challenging.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
“…The prosodic representation is obtained as one of the learned factors, parallel with non-prosodic factors that correspond to content, speaker, channel, etc. In [12, 15-17], adversarial learning was applied to address the problem that the learned prosodic representation might contain substantial information related to non-prosodic factors. The use of an adversarial classifier requires the availability of the labels for one of the disentangled non-prosodic factors.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)
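The adversarial-classifier idea in this excerpt can be illustrated with a gradient-reversal layer: the classifier descends its loss on a labeled non-prosodic factor (here, speaker), while the encoder receives the negated gradient, so the learned code carries little speaker information. The following is a toy numpy sketch of that mechanism only, not any cited paper's implementation; all dimensions, names, and hyperparameters are made up, and the main prosody-reconstruction branch is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 8-dim acoustic features -> 4-dim "prosody" code;
# the adversarial classifier predicts one of 3 speakers from the code.
D_IN, D_CODE, N_SPK = 8, 4, 3
W_enc = rng.normal(0, 0.1, (D_IN, D_CODE))   # linear "encoder"
W_cls = rng.normal(0, 0.1, (D_CODE, N_SPK))  # adversarial classifier

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_step(x, spk, lr=0.1, lam=1.0):
    """One adversarial step with a gradient-reversal layer (GRL).

    The classifier descends its cross-entropy loss; the encoder receives
    the negated gradient (scaled by lam), so it learns a code from which
    speaker identity is hard to predict.
    """
    global W_enc, W_cls
    code = x @ W_enc                       # prosody code
    probs = softmax(code @ W_cls)          # speaker posterior
    onehot = np.eye(N_SPK)[spk]
    d_logits = (probs - onehot) / len(x)   # dL/dlogits for cross-entropy
    g_cls = code.T @ d_logits              # gradient w.r.t. classifier weights
    g_code = d_logits @ W_cls.T            # gradient w.r.t. the code
    W_cls -= lr * g_cls                    # classifier: minimize its loss
    W_enc -= lr * (-lam) * (x.T @ g_code)  # encoder: reversed gradient (GRL)
    return -np.log(probs[np.arange(len(x)), spk]).mean()

x = rng.normal(size=(32, D_IN))
spk = rng.integers(0, N_SPK, 32)
losses = [train_step(x, spk) for _ in range(50)]
```

As the excerpt notes, this setup requires speaker labels, and each such classifier targets only the one non-prosodic factor it was trained on.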
“…The design of the adversarial classifier is specific to only one non-prosodic factor and cannot be applied to other non-prosodic factors. Furthermore, the non-prosodic factors (e.g., speaker) might be related to prosody [15], so disentangling with an adversarial classifier might also leave little prosody information in the prosodic representation.…”
Section: Introduction
Citation type: mentioning (confidence: 99%)