Speech Prosody 2020
DOI: 10.21437/speechprosody.2020-191
Introducing Prosodic Speaker Identity for a Better Expressive Speech Synthesis Control

Abstract: To gain more control over Text-to-Speech (TTS) synthesis and to improve expressivity, it is necessary to disentangle the prosodic information carried by the speaker's voice identity from that belonging to linguistic properties. In this paper, we propose to analyze how information related to speaker voice identity affects a Deep Neural Network (DNN) based multi-speaker speech synthesis model. To do so, we feed the network with a vector encoding speaker information in addition to a set of basic linguistic feature…
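The abstract describes conditioning a multi-speaker DNN on a speaker-identity vector concatenated with linguistic features. A minimal sketch of that conditioning pattern is below; the dimensions, the embedding table, and the toy one-hidden-layer model are all illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper): 8 speakers,
# a 16-dim speaker embedding, 40 linguistic features per frame.
n_speakers, emb_dim, ling_dim, hidden_dim, out_dim = 8, 16, 40, 64, 80

# Speaker table: one vector encoding each speaker's voice identity.
speaker_table = rng.normal(size=(n_speakers, emb_dim))

# Toy one-hidden-layer network standing in for the acoustic DNN.
W1 = rng.normal(size=(emb_dim + ling_dim, hidden_dim)) * 0.01
W2 = rng.normal(size=(hidden_dim, out_dim)) * 0.01

def synthesize_frame(speaker_id, linguistic_features):
    """Concatenate the speaker vector with the linguistic features,
    then run one forward pass of the toy acoustic model."""
    x = np.concatenate([speaker_table[speaker_id], linguistic_features])
    h = np.tanh(x @ W1)
    return h @ W2  # e.g. acoustic/prosodic parameters for one frame

frame = synthesize_frame(3, rng.normal(size=ling_dim))
print(frame.shape)  # (80,)
```

Swapping the `speaker_id` while holding the linguistic input fixed is what lets such a model expose speaker-dependent prosody as a separate control axis.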

Cited by 5 publications (3 citation statements)
References 11 publications (14 reference statements)
“…To the data generated from the various models listed, representing the fake audio datasets, we added audio snippets from various recordings and datasets such as SynPaFlex, a corpus of French audiobooks comprising 87 hours of good-quality speech [14], together with other recorded audio messages, mainly in French, representing the authentic audio snippets dataset. Then, from this dataset, we segmented the audio into 10-second and 2-second snippets.…”
Section: A. Data
confidence: 99%
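The excerpt above describes cutting audio into fixed-length 10-second and 2-second snippets. One plausible reading of that step is non-overlapping chunking of the waveform, sketched below; the sample rate and the drop-last-partial-chunk policy are assumptions, since the citing paper does not specify them here.

```python
import numpy as np

def segment(waveform, sr, snippet_sec):
    """Split a mono waveform into non-overlapping fixed-length snippets,
    dropping the final partial chunk (an assumed policy)."""
    n = sr * snippet_sec
    n_snippets = len(waveform) // n
    return waveform[: n_snippets * n].reshape(n_snippets, n)

sr = 16_000                         # assumed sample rate
audio = np.zeros(sr * 25)           # 25 s of silence as a stand-in
print(segment(audio, sr, 10).shape)  # (2, 160000)
print(segment(audio, sr, 2).shape)   # (12, 32000)
```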
“…NEB has read numerous books whose recordings are available on LibriVox. In the SynPaFlex project, more than 87 hours of this voice were extracted and annotated according to various expressive aspects in order to build a corpus dedicated to French expressive TTS [9]. Indeed, the speaker is able to change her prosody and modify her voice in order to personify some characters with a style distinct from the indirect speech [10].…”
Section: Data
confidence: 99%
“…LibriSpeech has also been used in TTS-related tasks to control the emotion of generated speech [212,305]. LibriVox is a collection of public audiobooks that can be used in controllable deep audio synthesis [213,306]. The Emotional Speech Database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise) [307].…”
Section: Audio
confidence: 99%