10th ISCA Workshop on Speech Synthesis (SSW 10) 2019
DOI: 10.21437/ssw.2019-18

Evaluating Long-form Text-to-Speech: Comparing the Ratings of Sentences and Paragraphs

Abstract: Text-to-speech systems are typically evaluated on single sentences. When long-form content, such as data consisting of full paragraphs or dialogues, is considered, evaluating sentences in isolation is not always appropriate, as the context in which the sentences are synthesized is missing. In this paper, we investigate three different ways of evaluating the naturalness of long-form text-to-speech synthesis. We compare the results obtained from evaluating sentences in isolation, evaluating whole paragraphs of spee…

Cited by 19 publications (17 citation statements)
References 7 publications
“…Since we provide isolated sentences to listeners, there is no concept of correct prosody [5], as such we do not provide a visible reference and do not require listeners to rate two systems as 0 and 100. 25 listeners completed this test.…”
Section: Prosody Prediction
Mentioning confidence: 99%
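The reference-free listening test described in the statement above still yields per-system ratings on a 0–100 scale, which are then aggregated across listeners. A minimal sketch of that aggregation step, using entirely hypothetical ratings (the function name, data, and normal-approximation confidence interval are illustrative assumptions, not from the cited paper):

```python
import math
import statistics

def summarize_ratings(ratings):
    """Mean and 95% confidence interval (normal approximation)
    for a list of 0-100 listener ratings of one system."""
    mean = statistics.mean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, (mean - 1.96 * sem, mean + 1.96 * sem)

# Hypothetical ratings from 25 listeners for two TTS systems.
system_a = [72, 68, 75, 80, 64, 70, 77, 69, 73, 71,
            66, 74, 79, 70, 68, 72, 76, 65, 71, 74,
            69, 73, 70, 75, 67]
system_b = [58, 62, 55, 60, 64, 57, 61, 59, 63, 56,
            60, 58, 62, 57, 61, 59, 63, 55, 60, 58,
            62, 57, 61, 59, 60]

for name, scores in [("A", system_a), ("B", system_b)]:
    mean, (lo_ci, hi_ci) = summarize_ratings(scores)
    print(f"System {name}: mean {mean:.1f}, "
          f"95% CI [{lo_ci:.1f}, {hi_ci:.1f}]")
```

Because no hidden reference anchors the scale, only relative differences between systems (e.g., non-overlapping intervals) are meaningful, not the absolute values.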
“…Unrealistic prosody modelling is often linked to a lack of contextual information. Indeed, without sufficient context, predicting prosody is an ill-posed problem [5], as any number of prosodies could be deemed appropriate for a given text. We believe that this is the biggest limitation of current state-of-the-art TTS models.…”
Section: Introduction
Mentioning confidence: 99%
“…Despite the interest in TTS that speaks longer chunks of text, e.g., [16], no major engine currently assumes a connection between consecutive audio files in the corpus. Instead, utterances are traditionally treated in isolation.…”
Section: Bigram Corpus Methods
Mentioning confidence: 99%
“…The main reasons for retaining this custom is that 1) most conventional TTS training material is comprised of sentences read in isolation, and 2) loading longer speech segments (e.g., audiobook paragraphs instead of sentences) into a conventional TTS engine would quickly result in the system running out of memory. This state of affairs is unfortunate in the case of corpora created from longer continuous recordings of speech, such as audiobooks [16,17]. It is even more problematic for spontaneous speech, where syntactic forms do not follow the conventions of written language and cannot be easily separated into standalone, coherent semantic units.…”
Section: Bigram Corpus Methods
Mentioning confidence: 99%
“…Another, somewhat complementary reason arises from the lack of explicit control inherent to the "black-box" machine learning architectures, such as s2s systems. On the one hand, the existing systems are not designed to capture the long-range semantic dependencies [7], on the other hand, they do not facilitate explicit control of prosody akin to older parametric synthesis approaches, where linguistic and prosodic labels were utilized and prosodic parameters were modelled separately [8].…”
Section: Introduction
Mentioning confidence: 99%