Interspeech 2021
DOI: 10.21437/interspeech.2021-189

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Abstract: Emotional Text-To-Speech (TTS) is an important task in the development of systems (e.g., human-like dialogue agents) that require natural and emotional speech. Existing approaches, however, only aim to produce emotional TTS for seen speakers during training, without consideration of the generalization to unseen speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive emotion-controllable TTS model that allows users to synthesize any speaker's emotional speech using only a short, neutral speech segm…

Cited by 3 publications (3 citation statements). References 39 publications (91 reference statements).

“…In order for robots to sound emotionally expressive, as noted by the older adults in our study, "emotional voice conversion" (i.e., changing the emotion of the utterance) can be applied in text-to-speech (TTS) synthesis that allows variability in vocal intonation (see (Zhou et al, 2022) for a recent review). Recent methods have also incorporated LLMs into speech synthesis with emotional adaptation (Kang et al, 2023; Leng et al, 2023). Furthermore, Voicebox (Le et al, 2023) […] Mimicking user expressions and behaviors, such as smiling and laughing with the user, can improve interpersonal coordination, boost interaction smoothness, and increase the likeability of the robot (Vicaria and Dickens, 2016).…”
Section: Reflection of Congruent Emotions (mentioning)
confidence: 99%
“…In this study, we used speech emotion recognition in two steps [25]. First, emotion embeddings were utilized to generate emotion information for typical utterances.…”
Section: Speech Emotion Recognition (SER) (mentioning)
confidence: 99%
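
The two-step use of speech emotion recognition described in that snippet can be illustrated roughly as follows. This is a minimal sketch, not the cited paper's implementation: the SEREncoder class, the tensor shapes, and the conditioning-by-concatenation step are all assumptions made for illustration.

```python
# Minimal sketch (hypothetical): two-step use of speech emotion recognition --
# first derive an utterance-level emotion embedding with an SER encoder, then
# hand that embedding to a TTS acoustic model as conditioning information.
import torch
import torch.nn as nn

class SEREncoder(nn.Module):
    """Stand-in for a pretrained speech-emotion-recognition encoder."""
    def __init__(self, n_mels: int = 80, emb_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, emb_dim, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) -> utterance-level emotion embedding
        _, h = self.rnn(mel)
        return h[-1]                          # (batch, emb_dim)

# Step 1: extract an emotion embedding from a reference utterance.
encoder = SEREncoder()
reference_mel = torch.randn(1, 200, 80)       # placeholder mel-spectrogram
emotion_emb = encoder(reference_mel)

# Step 2: condition a (hypothetical) TTS acoustic model on the embedding,
# e.g. by concatenating it to every frame of the text-encoder output.
text_hidden = torch.randn(1, 50, 256)         # placeholder text-encoder output
conditioned = torch.cat(
    [text_hidden, emotion_emb.unsqueeze(1).expand(-1, text_hidden.size(1), -1)],
    dim=-1,
)
print(conditioned.shape)                      # torch.Size([1, 50, 512])
```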
“…The diffusion model-based emotion synthesis model proposed in this paper is divided into two styles [25].…”
Section: Diffusion Models with Mel-Spectrograms (mentioning)
confidence: 99%
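
For the diffusion-model-based emotion synthesis mentioned in the last snippet, the sketch below shows one conditional denoising-diffusion training step over mel-spectrograms. It illustrates the general technique only; the DenoiserCond module, the noise schedule, and the emotion-conditioning pathway are assumptions, and the cited paper's two "styles" are not reconstructed here.

```python
# Minimal sketch (hypothetical): one training step of a denoising diffusion
# model over mel-spectrograms, conditioned on an emotion embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoiserCond(nn.Module):
    """Tiny conditional denoiser that predicts the injected noise."""
    def __init__(self, n_mels: int = 80, emo_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(n_mels + emo_dim + 1, n_mels)

    def forward(self, noisy_mel, t, emo):
        # noisy_mel: (B, T, n_mels); t: (B, 1); emo: (B, emo_dim)
        B, T, _ = noisy_mel.shape
        cond = torch.cat([emo, t], dim=-1).unsqueeze(1).expand(B, T, -1)
        return self.proj(torch.cat([noisy_mel, cond], dim=-1))

denoiser = DenoiserCond()
mel = torch.randn(4, 120, 80)        # clean mel-spectrograms (placeholder)
emo = torch.randn(4, 256)            # emotion embeddings (placeholder)
t = torch.rand(4, 1)                 # diffusion time in [0, 1]

# Forward process: mix the clean mel with Gaussian noise according to t.
noise = torch.randn_like(mel)
alpha = (1.0 - t).view(-1, 1, 1)
noisy = alpha.sqrt() * mel + (1.0 - alpha).sqrt() * noise

# Denoising objective: predict the injected noise given the conditioning.
loss = F.mse_loss(denoiser(noisy, t, emo), noise)
loss.backward()
```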