2021
DOI: 10.1016/j.neunet.2021.04.021

Learning emotions latent representation with CVAE for text-driven expressive audiovisual speech synthesis

Abstract: Great improvements have been made in the field of expressive audiovisual Text-to-Speech synthesis (EAVTTS) thanks to deep learning techniques. However, generating realistic speech is still an open issue, and researchers in this area have lately been focusing on controlling speech variability. In this paper, we use different neural architectures to synthesize emotional speech. We study the application of unsupervised learning techniques for emotional speech modeling, as well as methods for restructuring emotions…
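The abstract's central technique is a conditional variational auto-encoder (CVAE) that learns a latent representation of emotion. As a rough illustration of that idea only (not the authors' implementation; every layer size, dimension, and name below is a hypothetical placeholder), a minimal PyTorch CVAE that encodes acoustic features into an emotion latent conditioned on text features could look like this:

```python
# Minimal CVAE sketch (hypothetical dimensions; not the paper's implementation).
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, acoustic_dim=80, text_dim=256, latent_dim=16):
        super().__init__()
        # Encoder: acoustic features + text condition -> posterior over emotion latent z.
        self.encoder = nn.Sequential(
            nn.Linear(acoustic_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),  # mean and log-variance
        )
        # Decoder: emotion latent z + text condition -> reconstructed acoustic features.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, acoustic_dim),
        )

    def forward(self, acoustic, text):
        stats = self.encoder(torch.cat([acoustic, text], dim=-1))
        mu, logvar = stats.chunk(2, dim=-1)
        # Reparameterization trick: sample z while keeping gradients.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        recon = self.decoder(torch.cat([z, text], dim=-1))
        # KL divergence of the posterior from the standard normal prior.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon, kl

model = CVAE()
acoustic = torch.randn(8, 80)   # e.g. mel-spectrogram frames (illustrative)
text = torch.randn(8, 256)      # e.g. text/linguistic embeddings (illustrative)
recon, kl = model(acoustic, text)
loss = nn.functional.mse_loss(recon, acoustic) + kl
```

At synthesis time the encoder is dropped: sampling or selecting z directly gives a handle on emotional variability while the text condition drives the content, which is the appeal of the CVAE formulation for expressive TTS.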

Cited by 6 publications (1 citation statement)
References 50 publications

“…The core idea is to use a non-autoregressive context decoder to generate acoustic features efficiently, and then add a shallow autoregressive acoustic decoder on top of it to recover the temporal structure of the acoustic signal. Dahmani et al. [20] first presented an expressive audiovisual corpus, then proposed learning an emotional latent representation with a conditional variational auto-encoder for text-driven expressive audiovisual speech synthesis. Nallanthighal et al. [21] emphasized the importance of respiratory voice in TTS by exploring techniques for sensing the breathing signal and breathing parameters from speech using deep learning architectures.…”
Classification: mentioning (confidence: 99%)
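The first sentence of the statement above summarizes the citing work's decoding scheme rather than this paper's: a non-autoregressive context decoder produces coarse acoustic features in one parallel pass, and a shallow autoregressive decoder then refines them frame by frame to recover temporal detail. A minimal sketch of that control flow, with every module name and dimension a hypothetical placeholder:

```python
# Hypothetical sketch of a two-stage (non-AR + shallow AR) decoder.
import torch
import torch.nn as nn

class TwoStageDecoder(nn.Module):
    def __init__(self, ctx_dim=256, acoustic_dim=80, hidden=128):
        super().__init__()
        # Non-autoregressive context decoder: maps the whole context sequence
        # to coarse acoustic features in a single parallel pass.
        self.context_decoder = nn.Sequential(
            nn.Linear(ctx_dim, 256), nn.ReLU(), nn.Linear(256, acoustic_dim),
        )
        # Shallow autoregressive refinement: one GRU cell that consumes the
        # coarse frame plus the previously generated frame.
        self.ar_cell = nn.GRUCell(acoustic_dim * 2, hidden)
        self.proj = nn.Linear(hidden, acoustic_dim)
        self.hidden = hidden

    def forward(self, context):                        # context: (B, T, ctx_dim)
        coarse = self.context_decoder(context)         # (B, T, acoustic_dim), parallel
        B, T, D = coarse.shape
        h = coarse.new_zeros(B, self.hidden)
        prev = coarse.new_zeros(B, D)
        frames = []
        for t in range(T):                             # sequential, but shallow
            h = self.ar_cell(torch.cat([coarse[:, t], prev], dim=-1), h)
            prev = self.proj(h)
            frames.append(prev)
        return torch.stack(frames, dim=1)              # refined features (B, T, D)

decoder = TwoStageDecoder()
out = decoder(torch.randn(2, 100, 256))                # 2 utterances, 100 frames
```

The point of the split is that the expensive sequence-level decoding happens in parallel, while only a single lightweight recurrent layer runs sequentially to restore temporal coherence.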