Paragraph Prosodic Patterns to Enhance Text-to-Speech Naturalness

Peiró-Lilja, Alex; Farrús, Mireia

doi:10.21437/speechprosody.2018-124

Cited by 7 publications

(8 citation statements)

References 14 publications

(27 reference statements)


“…More experiments need to be carried out to figure out the optimal criteria for the potential detection of paratone boundaries, whether based on raw Momel pitch extractions or symbolical INTSINT labels. We have not taken into account "intraparagraph features" as reported in [28] but we spotted potential candidates. Explicit enumeration discourse markers ("first", "second", "third") were not necessarily realised as autonomous initial paratone boundaries.…”

Section: Discussion (mentioning)

confidence: 99%

An Experiment in Paratone Detection in a Prosodically Annotated EAP Spoken Corpus

Méli, Ballier, Falaise et al. 2021

Interspeech 2021


This article describes an experiment in paratone detection based on a spoken corpus of English for Academic Purposes (EAP) recently automatically re-annotated with prosodic information. The Momel and INTSINT annotations were carried out using SPPAS. The EIIDA corpus was chosen as it offered long uninterrupted stretches of speech of academic presentations. We describe the clustering method adopted for automatic detection, contrasting a supervised and an unsupervised method of paratone boundary detection. We showcase the relevance of the annotation scheme followed for this corpus and contribute to the investigation of the phonostyle of lecture delivery. We discuss the relevance of clustering methods applied to the labels of the pitch targets for the analysis of paratones.


“…Encoding discourse structure in TTS systems is still a relatively unexplored field. Recent work has focused on generic paragraph-based features [16,17]. In this work we propose an approach to encode DR information in neural statistical parametric speech synthesis (SPSS).…”

Section: Related Work (mentioning)

confidence: 99%

Improving Speech Synthesis with Discourse Relations

Aubin, Cervone, Watts et al. 2019

Interspeech 2019


This paper explores whether adding Discourse Relation (DR) features improves the naturalness of neural statistical parametric speech synthesis (SPSS) in English. We hypothesize first, in the light of several previous studies, that DRs have a dedicated prosodic encoding. Secondly, we hypothesize that encoding DRs in a speech synthesizer's input will improve the naturalness of its output. In order to test our hypotheses, we prepare a dataset of DR-annotated transcriptions of audiobooks in English. We then perform an acoustic analysis of the corpus which supports our first hypothesis that DRs are acoustically encoded in speech prosody. The analysis reveals significant correlation between specific DR categories and acoustic features, such as F0 and intensity. Then, we use the corpus to train a neural SPSS system in two configurations: a baseline configuration making use only of conventional linguistic features, and an experimental one where these are supplemented with DRs. Augmenting the inputs with DR features improves objective acoustic scores on a test set and leads to significant preference by listeners in a forced choice AB test for naturalness.


“…More recently, efforts have been made to use additional text embeddings derived from pre-trained LMs, such as the BERT model [25], to improve the modelling of prosody [26][27][28]. Moreover, prosody modelling with texts consist of multiple sentences have also been studied for SPSS [29][30][31][32][33][34]. Apart from sentence positions [31][32][33], discourse relations (DRs), which describe the logical relationship between two discourse units like sentences, are also used to improve the prosody generation [29,30,34].…”

Section: Introduction (mentioning)

confidence: 99%

“…Moreover, prosody modelling with texts consist of multiple sentences have also been studied for SPSS [29][30][31][32][33][34]. Apart from sentence positions [31][32][33], discourse relations (DRs), which describe the logical relationship between two discourse units like sentences, are also used to improve the prosody generation [29,30,34].…”

Section: Introduction (mentioning)

confidence: 99%

Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-End Speech Synthesis

Song, Zhang et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Although speech prosody is related to the linguistic information up to the discourse structure, most text-to-speech (TTS) systems only take into account the information within each sentence. This makes it challenging when converting a paragraph of text into natural and expressive speech. In this paper, we propose to use the text embeddings of the neighboring sentences to improve the prosody generation for each utterance of a paragraph in an end-to-end fashion without using any explicit prosody features. More specifically, cross-utterance (CU) context vectors, which are produced by an additional CU encoder based on the sentence embeddings extracted by a pre-trained BERT model, are used to augment the input of the Tacotron2 decoder. Two types of BERT embeddings are investigated, which leads to the use of different CU encoder structures. Experimental results on a Mandarin audiobook dataset and the LJ-Speech English audiobook dataset demonstrate the use of CU information can improve the naturalness and expressiveness of the synthesized speech. Subjective listening testing shows most of the participants prefer the voice generated using the CU encoder over that generated using standard Tacotron2. It is also found that the prosody can be controlled indirectly by changing the neighbouring sentences.

